AI for Anomaly Detection in Networks

Marzena Ołubek
Data Scientist
Automatic Summary

Introduction

Welcome to the enlightening sphere of Orange Innovation Data and AI. I am Marzena Ołubek, and I am excited to walk you through the intriguing world of anomaly detection in networks based on GCP Vertex AI. We all know the frustration when a call drops or the internet breaks down, so today let's see how Orange is tackling these network malfunctions.

About Orange Innovation Data and AI

Our mission is to make people's everyday lives better through AI, helping both our customers and our colleagues, such as network experts, by preventing network malfunctions. We provide machine learning and artificial intelligence products and services that make this prevention faster and more efficient. Ultimately, our primary objective is for Orange to become AI-driven, with AI at scale: solutions that can be replicated across countries.

Our Innovation Model

We follow a model with four strategic pillars:

  1. Smarter Networks: We build adaptive, efficient, sustainable, and secure networks with AI.
  2. Reinvented Customer Experience and Management: We use AI to improve customer journeys, for example with churn-reduction and propensity-to-buy models.
  3. Operational Efficiency: With the help of AI and data, we strive to improve profitability.
  4. Responsible and Sustainable AI: All three pillars above must align with sustainability, trustworthiness, and human insight.

Focusing on Smarter Networks

Smarter Networks revolve around network data: system logs, errors, and alarms coming from network equipment such as routers. The challenges we regularly encounter include the sheer volume of data (a few terabytes per country per month), false alarms raised by existing tools, and anomalies that occur rarely, creating a strong class imbalance. Because the data carry no labels, we rely on unsupervised machine learning methods.

How we process Network Data

We use the network data for anomaly detection. The data are gathered from various equipment and managed in a data management layer; after AI processing, the model output is pushed into an incident management system, which triggers the ticketing system automatically. The aim is to enrich incident management with AI and data solutions, as sketched below.
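The final push into the ticketing system can be pictured as a small integration step. Below is a minimal sketch, assuming a REST-style incident management API; the endpoint URL and payload schema are hypothetical, as the talk does not describe the real interface.

    # Hypothetical sketch: push one detected anomaly into an incident
    # management system over REST. Endpoint and payload are assumptions.
    import requests

    INCIDENT_API = "https://incident-mgmt.example.com/api/tickets"  # hypothetical

    def raise_incident(anomaly: dict) -> None:
        payload = {
            "source": "aiml-anomaly-detection",
            "alarm_type": anomaly["alarm_type"],        # e.g. "power"
            "network_level": anomaly["network_level"],  # e.g. site, region
            "window_start": anomaly["window_start"],    # ISO timestamp
            "alarm_count": anomaly["alarm_count"],
        }
        requests.post(INCIDENT_API, json=payload, timeout=10).raise_for_status()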

To elaborate on the process, the input comes from a fault management umbrella, a data source that aggregates alarms from different types of network equipment; in this use case the data come from Orange Romania. The alarms are divided into types, such as power alarms, and the goal is to detect anomalies in the volume of alarms, for example sudden peaks. The actual process begins with feature engineering.

Feature Engineering

The process involves four steps, sketched in code after the list:

  1. Filtering data per type of alarm (e.g. power alarms).
  2. Partitioning data by time, for example into three-minute windows.
  3. Aggregating data per network level (site, region, OSS region).
  4. Counting alarms within each time window at each network level.
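A minimal pandas sketch of these four steps follows; the column names (alarm_type, timestamp) and the default parameters are illustrative assumptions, not the actual schema.

    # Sketch of the feature engineering: filter, window, aggregate, count.
    import pandas as pd

    def build_alarm_counts(df: pd.DataFrame,
                           alarm_type: str = "power",
                           window: str = "3min",
                           network_level: str = "site") -> pd.DataFrame:
        # 1. Filter to a single alarm type.
        df = df[df["alarm_type"] == alarm_type]
        # 2. Partition by time into fixed windows (three minutes here).
        df = df.set_index(pd.to_datetime(df["timestamp"]))
        # 3 + 4. Aggregate per network level and count alarms per window.
        return (df.groupby(network_level)
                  .resample(window)
                  .size()
                  .rename("alarm_count")
                  .reset_index())

The resulting table of alarm counters per network level and time window is the input to the models.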

Data Modeling

We define an anomaly as an observation, or a subset of observations, that appears inconsistent with the rest of the data. To detect these anomalies, we fit machine learning models to the data, predict the expected volume of alarms, and then calculate the difference between the model output and the real data. If this difference exceeds a certain threshold, the point is flagged as an outlier.
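The thresholding step reduces to a few lines. Below is a minimal sketch; the specific rule (mean plus three standard deviations of the residuals) is an illustrative choice, as the talk does not state how the threshold is set.

    # Sketch: flag windows whose prediction error exceeds a threshold.
    import numpy as np

    def flag_outliers(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
        residuals = np.abs(y_true - y_pred)
        threshold = residuals.mean() + 3 * residuals.std()  # assumed rule
        return residuals > threshold  # boolean mask of anomalous windows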

Methods for Detecting Anomalies

  • ARIMA (autoregressive integrated moving average): a time-series forecasting method, well suited to continuous data.
  • Isolation forest: based on decision trees; outliers are isolated with short paths.
  • Local outlier factor: a density-based model.
  • Poisson distribution: a statistical approach for estimating rare events, suited to sparse data.
  • Autoencoder neural networks: flag data points with a high reconstruction error.
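As a hedged illustration of two of these detectors on a toy alarm-count series, the sketch below uses scikit-learn's IsolationForest and a Poisson tail test; the hyperparameters (contamination, the 0.001 tail probability) are illustrative assumptions.

    # Two of the listed detectors applied to a toy series of alarm counts.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from scipy.stats import poisson

    counts = np.array([3, 2, 4, 3, 50, 2, 3, 0, 4, 3])  # toy data

    # Isolation forest: outliers get short isolation paths (-1 = outlier).
    iso = IsolationForest(contamination=0.1, random_state=0)
    iso_flags = iso.fit_predict(counts.reshape(-1, 1)) == -1

    # Poisson test: flag counts that are improbably high given the
    # historical mean rate (well suited to sparse, rare-event data).
    lam = counts.mean()
    poisson_flags = poisson.sf(counts, mu=lam) < 0.001  # P(X > k) tiny

    print(np.where(iso_flags)[0], np.where(poisson_flags)[0])  # both flag the peak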

Deployment of our Use Case on Google Cloud Platform

We deploy our use case on Google Cloud Platform, leveraging Vertex AI, BigQuery, Cloud Storage, and Looker. Input data is stored in Cloud Storage; a training pipeline in Vertex AI preprocesses the data and trains the model, which is saved and registered in Artifact Registry; a separate inference pipeline computes predictions with the pre-trained model and stores the results in BigQuery, where a Looker dashboard visualizes the detections. All pipelines are managed in Vertex AI Workbench.
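A minimal sketch of such a training pipeline, written with the Kubeflow Pipelines (KFP v2) SDK that Vertex AI Pipelines executes, is shown below; the component bodies, pipeline name, and bucket path are placeholders, not the actual Orange implementation.

    # Minimal KFP v2 sketch of a training pipeline deployable to Vertex AI.
    from kfp import dsl, compiler

    @dsl.component
    def preprocess(raw_path: str) -> str:
        # feature engineering: filter, window, aggregate, count alarms
        return raw_path + "/features"

    @dsl.component
    def train(features_path: str) -> str:
        # fit an anomaly detection model; return its artifact location
        return features_path + "/model"

    @dsl.pipeline(name="alarm-anomaly-training")
    def training_pipeline(raw_path: str = "gs://my-bucket/alarms"):  # placeholder path
        features = preprocess(raw_path=raw_path)
        train(features_path=features.output)

    compiler.Compiler().compile(training_pipeline, "pipeline.json")
    # The compiled spec can then be submitted as a Vertex AI PipelineJob.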

Conclusion

Orange is transforming into an AI-driven company with a focus on smarter networks through its partnership with Google. Despite the challenge of unlabelled data, we strive to deliver a reliable customer experience while addressing network malfunctions.

Questions?

If you have any queries or would like more insight into our work, feel free to connect with me on LinkedIn.


Video Transcription

Good morning, everybody. My name is Marzena Ołubek, I'm from Orange Innovation Data and AI, and today I will tell you about anomaly detection in networks based on GCP Vertex AI. First, think about when somebody calls you and your call drops, or when your internet is breaking down. I think it's very frustrating, even for me, so I think for you also. So today I'm going to tell you how Orange is dealing with these network malfunctions. First of all, a few words about the agenda. I will say a few words about Orange Innovation Data and AI, the department where I'm working; then I will show you my use case, anomaly detection in the network; and then I will show you how this use case was developed on Google Cloud Platform. So, starting from the first point: why do we need data and AI in Orange? First of all, our mission is making people's everyday lives better through AI. It means that we are helping our customers and our colleagues, for example network experts, in their everyday life, because we are preventing network malfunctions.

And we are providing products and services within machine learning and artificial intelligence in order to prevent those network malfunctions faster. Our goal is for Orange to become AI-driven, with AI at scale. It means that we are focusing on solutions that can be replicated in other countries. Our ambition is to place data and AI at the heart of our innovation model.

So what is our innovation model? It has four strategic pillars for data and AI. First of all, smarter networks: we are building networks that should be adaptive, efficient, sustainable, and very secure with AI. So they should be smart. Another strategic pillar for us is reinvented customer experience and management, where for example we are building models for propensity to buy, churn reduction, and other parts of the customer journey with AI. We are also using data and AI for greater operational efficiency, to be more profitable with AI. And all those three pillars must sit within responsible and sustainable AI: trustworthy, green (because Orange goes green), and aligned with our human insight mission.

Now we will focus on smarter networks. In that pillar we are focusing on network data. What are network data? Network data are the data that come from network equipment: system logs, errors, and other data coming from routers and other network equipment. So what are the biggest challenges with network data? First of all, this is very big data, because we are dealing with a few terabytes of data per country per month, so it's a huge amount of data. The data come from different countries, so we have different methodologies for aggregating them. Those pieces of network equipment produce alarms, and the existing tools also produce false alarms, so we have to deal with that problem. And the most important challenge is to detect anomalies, and those anomalies occur relatively rarely. It means that we have a real class imbalance, so it's very hard to define those anomalies, because they are very, very weak. Moreover, when they do occur in our data, we don't have labels, so we are focusing on unsupervised machine learning methods because of the lack of labels. So this is the world of network data. What can we do with these data? I will present the use case I'm working on in Orange: anomaly detection based on alarm volumetrics. What does it mean? The orange boxes on the slide are systems that are on premise in Orange, and the blue one is a Google one, because we have a partnership with Google.

So first of all, we have network equipment that produces the network data, which are gathered and managed in the data management layer, and those data are our source for the AI/ML-based anomaly detection system. Those solutions are built in Google Cloud Platform, in the blue box. The output of the machine learning models that are detecting anomalies is pushed into the incident management system, which triggers the ticketing system. And this whole process is done automatically.

So our goal is to enrich this incident management system with AI/ML and data solutions that will automatically detect anomalies. What are we starting with? We are starting with the data. Where do the data come from? From a kind of umbrella: the data warehouse, the data source that manages the network data, which is the fault management umbrella. The data in this use case come from Orange Romania. We are focusing on alarms, and we divide those alarms into different types; for example, power alarms are one of the types. So this is our data input, and our goal is to detect anomalies in the volume of alarms, because not every alarm is an anomaly; we are focusing on anomalies in volume, for example detecting the peaks of the alarms. How are we doing it? First of all, we start with feature engineering. What does it mean? First, we filter the data per type of alarm. Let's focus on one type of alarm, power alarms; they will be the orange dots. So we filter those power alarms. In the second step, we partition the data, splitting those alarms by time, for example into three-minute time windows, as in this picture.

In the third step, we aggregate the data per network level, because we have a few levels of network equipment. Each site is the lowest type of network equipment, the BTS, the base transceiver station; above the site is the region, the geographical region in Romania, and there are 44 regions; and above that are the OSS regions, of which there are three or four in Orange Romania. We then count those alarms within each three-minute time window, per each site, each region, and each OSS, so we have a kind of alarm counter, and this table is the input to our model. So how do we do the data modeling? First of all, our goal is to detect anomalies. What is an anomaly? An anomaly is a kind of outlier in the data. It is defined as an observation, or a subset of observations, which appears to be inconsistent with the rest of the data. So we are looking for unnatural behavior of the data, behavior which is not the same as it was before. And how do we detect those anomalies?

We fit the data to machine learning models (which models, I will tell you in a minute) and we predict the volume of alarms, that is, the most probable volume of alarms based on the data from the past. Then we calculate, for example, the difference between the model output and the real data. When the model is well fitted and the predicted value is the same as the real one, there is no anomaly. But when we predict a value that should not be anomalous and the real data is far, far higher, it is supposed to be an anomaly, because we are calculating the difference between the model output and the real data. After setting a threshold, we detect the points that exceed that threshold as outliers. And here we have a few methods that we are using in our use case to detect anomalies. First of all is the time-series forecasting method, ARIMA, the autoregressive integrated moving average, and it was very good for continuous data. But we also tried other models, for example one based on decision trees, which is the isolation forest.

It builds decision trees based on one feature and separates out the points that are far, far away; the first points split off in a tree seem to be the most anomalous. We also tried density-based models, like the local outlier factor, and a statistical approach based on the Poisson distribution, because Poisson's law seems to estimate rare events well, and our alarms are very, very rare events.

That's why we tried Poisson, calculating the probability of those rare events. And at the end, we also tried a neural network, the autoencoder. This is a model whose architecture encodes and then decodes the data, and the data points that have the higher reconstruction errors seem to be the most anomalous.

So these are the results of our research, as you can see on the graphs. All of the models detect the highest peaks, whether ARIMA, isolation forest, or local outlier factor. But, for example, only ARIMA detects the lowest anomalies. That's also important, because for example when network equipment is not producing any alarms, that can also be an anomaly. So we are mostly looking for the highest peaks, but for some kinds of network equipment it's very important to have a look also at the lowest dips; that's why ARIMA is good for that. The Poisson model was the best for very sparse data. For example, we have some traffic alarms that were sparser than the power alarms, and in that situation the data fitted the Poisson distribution better. The isolation forest was detecting so many anomalies that the model may need hyperparameter tuning to adjust its threshold, and the local outlier factor had nice results even for both the more continuous and the sparser data.

So, to sum up: different machine learning models output different results. All of those models detect the highest peaks, but to really validate those models we can only do so after a feedback loop with annotations. Annotations are the information from, for example, the network expert, confirming whether or not the anomaly detected by the model was true. So this is an important step, to check and to tune those models. And how have we developed and deployed our use case on Google Cloud Platform? First of all, we have the input data, stored in Cloud Storage, Google Cloud Storage. Then we preprocess the data and train the model; this is the training pipeline in Vertex AI. When the model is trained, it is saved and registered in Artifact Registry. So the first pipeline ends here, but we also have an inference pipeline: we calculate, predict the new values based on the pre-trained model, and then we store our results in BigQuery. And we also have a dashboard in Looker to visualize the detections. All the pipelines are managed in Vertex AI Workbench in Google Cloud Platform.

And a few words about those pipelines: pipelines help us to automatically train and predict with our machine learning models. It's an end-to-end process, starting from loading the data, through feature engineering, training, predicting, and saving the results. All the steps are done step by step, and it's an easy tool for us data scientists. We also have a dashboard, built in Looker. We have a separate dashboard per alarm type; per model, because we have a few models; per dimension, where dimension means the network dimension, like the site name or the highest level of network equipment; and per time window. For example, we can have the three-minute time window in which we count those alarms, or we can experiment with any other time window.

And the most important facts I want you to remember from my speech are that Orange is a data- and AI-driven company, because data and AI are at the heart of our innovation, and that with our partnership with Google we are building smarter networks using cloud-based solutions. We are using BigQuery as a big data tool, and we are building pipelines, I mean machine learning pipelines, in Vertex AI. And we are focusing on unsupervised machine learning models for anomaly detection, as we don't have labels, or not so many labels, in our data.

So thank you for listening to me. If you have any questions, do not hesitate to contact me on LinkedIn, or here now on the chat.