Designing a Scalable Data Pipeline for AI in Network Operations

Many network engineers pride themselves on being able to look at a config or a packet capture and intuitively understand what’s going on. But turning packets into real-time, AI-powered insight is an entirely different challenge. Modern NetOps demands more than dashboards full of metrics and counters. It needs data pipelines that can ingest billions of flow records, normalize them in flight, and feed machine-learning models that predict trouble before users ever notice.

To get there, AI needs a reliable, stable, and well-designed data pipeline: one that feeds models the data they need to make useful predictions and gives the latest LLM-based query tools accurate, timely information to answer prompts. In this post, we’ll step through a high-level architecture for a data pipeline supporting AI in network operations.

A High-Level Flow

Sources

For a production-ready, enterprise-scale network data pipeline, we want to ingest from a variety of sources to get a complete picture. This includes familiar telemetry like:

  • NetFlow/IPFIX

  • Cloud flow logs (VPC Flow Logs, NSG flow logs, and VNet flow logs)

  • gNMI streaming telemetry

  • SNMP traps

  • Ticket-system webhooks

  • Device configurations

And we can certainly add to that routing tables, metadata like application and security tags, DNS records, source-of-truth (SoT) platforms like NetBox, and so on. Basically, choose sources that are relevant to your AI initiative.

Ingestion

There are a variety of methods and platforms for ingesting network telemetry data, but to get started, remember that we’re dealing with multiple data types and sources, often in real time, so we need a platform built for that, such as Apache Kafka on-prem or Amazon Kinesis for a cloud-native design.

Kafka is especially popular, and a typical workflow uses lightweight collectors like Telegraf to push raw events into a Kafka topic per telemetry type. We can also use OpenTelemetry to establish a single standard across all of our telemetry types, with OTel Collectors feeding a downstream consumer.
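To make that concrete, here’s a minimal sketch of a collector-side producer pushing a flow record into a per-telemetry-type Kafka topic using the kafka-python client. The broker address, topic name, and record fields are placeholders for illustration, not a prescribed schema.

```python
# Minimal sketch: publish a raw flow record to a Kafka topic.
# Assumes a broker at localhost:9092 and a topic named "netflow.raw" (placeholders).
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A hypothetical flow record as it might arrive from a collector.
flow_record = {
    "ts": time.time(),
    "src_ip": "10.0.0.12",
    "dst_ip": "10.0.0.34",
    "src_port": 51544,
    "dst_port": 443,
    "protocol": "tcp",
    "bytes": 18342,
    "packets": 21,
}

# One topic per telemetry type keeps downstream consumers simple.
producer.send("netflow.raw", value=flow_record)
producer.flush()
```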

Stream Processing and Feature Engineering

Once the data enters our low-latency Kafka message bus, it’s queued and forwarded to our stream-processing layer. A popular solution is Apache Flink, which normalizes the data in flight: for example, converting all values to bits per second, deduplicating flows reported by multiple collectors, or calculating rolling p95 latency measurements.
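To illustrate the kind of per-record logic a Flink job would apply, here’s a plain-Python sketch (not actual Flink code) of normalization to bits per second and deduplication across collectors. The field names, the hypothetical duration_sec field, and the dedup key are assumptions; in a real deployment this logic would live in a PyFlink or Java Flink operator with keyed, time-bounded state and proper windowing.

```python
# Conceptual sketch only (plain Python, not Flink code) of in-flight
# normalization and deduplication, assuming flow records shaped roughly
# like the producer example above plus a hypothetical "duration_sec" field.

seen_flow_keys = set()  # in Flink this would be keyed, time-bounded state

def normalize(record: dict) -> dict:
    """Convert raw byte counts into bits per second over the flow duration."""
    duration = max(record.get("duration_sec", 1.0), 1e-6)
    record["bps"] = (record["bytes"] * 8) / duration
    return record

def is_duplicate(record: dict) -> bool:
    """Drop flows reported by multiple collectors for the same 5-tuple and second."""
    key = (record["src_ip"], record["dst_ip"], record["src_port"],
           record["dst_port"], record["protocol"], int(record["ts"]))
    if key in seen_flow_keys:
        return True
    seen_flow_keys.add(key)
    return False

def process(record: dict):
    """Per-record pipeline step: filter duplicates, then normalize."""
    if not is_duplicate(record):
        yield normalize(record)
```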

Though I’m not getting into it much in this post, we also have to consider batch processing for data like rollups, tickets, and device configurations. A popular solution for batch and micro-batch processing of network telemetry data is Spark Structured Streaming, which accomplishes much the same thing when low-latency stream processing isn’t as critical.
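As a rough sketch of that micro-batch path, a Spark Structured Streaming job might read the same flow topic, roll traffic up into five-minute windows, and write Parquet to the data lake. The broker, topic, schema, and storage paths below are placeholders, and the Spark Kafka connector package is assumed to be available on the classpath.

```python
# Rough sketch: roll up flow bytes per destination port in 5-minute windows
# and write Parquet to a data lake. Broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder.appName("flow-rollups").getOrCreate()

flow_schema = StructType([
    StructField("ts", DoubleType()),
    StructField("dst_port", LongType()),
    StructField("protocol", StringType()),
    StructField("bytes", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
       .option("subscribe", "netflow.raw")                # placeholder topic
       .load())

flows = (raw
         .select(F.from_json(F.col("value").cast("string"), flow_schema).alias("f"))
         .select("f.*")
         .withColumn("event_time", F.col("ts").cast("timestamp")))

rollup = (flows
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "dst_port")
          .agg(F.sum("bytes").alias("total_bytes")))

query = (rollup.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://my-data-lake/rollups/")                        # placeholder
         .option("checkpointLocation", "s3a://my-data-lake/checkpoints/rollups/")
         .start())
```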

Feature engineering is basically identifying or creating important characteristics from the raw data that we send to an ML model. In this case, calculating a p95 latency value is an example of feature engineering, as is calculating standard deviations, rolling deltas, and so on. These are statistics derived from raw data, and it doesn’t matter whether you do that in Flink, Spark Streaming, pandas, or SQL. It’s the transformation into something useful, not the tool, that makes it feature engineering.
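Since pandas is one of the tools just mentioned, here’s a minimal pandas sketch of deriving rolling p95, standard deviation, and delta features from raw latency samples. The column names, sample values, and window size are illustrative.

```python
# Minimal pandas sketch: derive rolling p95, standard deviation, and delta
# features from raw per-minute latency samples. Column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=8, freq="min"),
    "latency_ms": [12.1, 11.8, 13.0, 45.2, 12.4, 12.9, 80.5, 12.2],
}).set_index("timestamp")

# Rolling 5-minute window features derived from the raw latency series.
df["latency_p95_5m"] = df["latency_ms"].rolling("5min").quantile(0.95)
df["latency_std_5m"] = df["latency_ms"].rolling("5min").std()
df["latency_delta"] = df["latency_ms"].diff()

print(df)
```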

Storage

Our stream-processing layer next sends the pre-processed data to storage, which, depending on the type of data, could be a data warehouse for structured data, a data lake for predominantly raw, unstructured data, and so on.

Examples of object storage for your data lake are Amazon S3, Azure Blob Storage, and Google Cloud Storage. Examples of enterprise-scale data warehouses for structured data, often columnar stores queried with SQL, are Amazon Redshift, ClickHouse, Snowflake, and even time-series databases like InfluxDB.
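As one small example, engineered features might land in the data lake as partitioned Parquet via pandas and pyarrow. The bucket name below is a placeholder, and the s3fs and pyarrow packages (plus cloud credentials in the environment) are assumed.

```python
# Sketch: persist engineered features to an S3-backed data lake as Parquet.
# Assumes s3fs and pyarrow are installed and AWS credentials are available;
# the bucket name and prefix are placeholders.
import pandas as pd

features = pd.DataFrame({
    "device": ["core-sw-01", "core-sw-01", "edge-rtr-02"],
    "window_start": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05",
                                    "2024-01-01 00:00"]),
    "latency_p95_ms": [13.2, 41.7, 9.8],
    "bps_mean": [1.2e9, 1.4e9, 3.1e8],
})

features.to_parquet(
    "s3://my-netops-data-lake/features/latency/",  # placeholder bucket/prefix
    partition_cols=["device"],
    engine="pyarrow",
)
```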

Model Training

Now that our data is in appropriate persistent storage, we can start training our ML models with it. Today we can orchestrate this more easily with off-the-shelf solutions like Apache Airflow, Google’s Vertex AI and AutoML offerings, and Databricks.

You can absolutely train models locally with classic ML frameworks like scikit-learn and XGBoost, or deep-learning frameworks like PyTorch and TensorFlow. Most organizations will eventually hit resource and scaling limits, though, so managed ML services like those mentioned above are very popular.
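To show what local training can look like, here’s a minimal scikit-learn sketch that fits an IsolationForest anomaly detector on the kind of features engineered earlier. The feature columns, Parquet path, and model filename are placeholders.

```python
# Minimal local-training sketch: fit an anomaly detector on engineered
# flow/latency features. Feature names and the Parquet path are placeholders;
# reading from s3:// assumes s3fs, or swap in a local path.
import pandas as pd
import joblib
from sklearn.ensemble import IsolationForest

features = pd.read_parquet("s3://my-netops-data-lake/features/latency/")
X = features[["latency_p95_ms", "bps_mean"]]

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(X)

# Persist the trained model so the serving layer can load it later.
joblib.dump(model, "netops_anomaly_model.joblib")
```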

Online Inference 

With trained models, we can now perform inference on new data, which in NetOps often means in real time. For example, real-time features generated in Flink can be sent to a feature store such as Feast, which is integrated into the workflow and serves those features to models exposed via FastAPI, BentoML, or some other serving layer.

This way, the features we generated in the stream-processing stage are stored and served quickly, so network operators can get insight into traffic data in (near) real time.
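A minimal serving sketch with FastAPI might look like the following. The model file, feature fields, and endpoint path are assumptions, and in a fuller design the features would be fetched from the feature store (e.g., Feast) by device or flow key rather than passed in the request body.

```python
# Minimal FastAPI serving sketch. The model file and feature fields are
# placeholders; in a fuller design, features would come from a feature store
# (e.g., Feast) keyed by device or flow identifiers.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("netops_anomaly_model.joblib")  # model trained earlier

class FlowFeatures(BaseModel):
    latency_p95_ms: float
    bps_mean: float

@app.post("/score")
def score(features: FlowFeatures):
    # IsolationForest convention: -1 means anomalous, 1 means normal.
    pred = model.predict([[features.latency_p95_ms, features.bps_mean]])[0]
    return {"anomaly": bool(pred == -1)}
```

Run it with an ASGI server like uvicorn, and operators (or an LLM-powered front end) can request anomaly scores over HTTP in near real time.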

Observability and Governance

This last stage isn’t part of the workflow per se, but it’s nevertheless critical for managing the data pipeline and keeping data secure. Using traces from tools like OpenTelemetry, we can see how each component in the pipeline is performing. We can also apply data-quality rules with tools like Deequ, Soda, and Great Expectations, all of which offer open-source cores.
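As a small example of the tracing side, the OpenTelemetry Python SDK can wrap a pipeline stage in a span so its latency and errors show up in your traces. The span names, attributes, and console exporter below are just for illustration; a real deployment would export to an OTel Collector or tracing backend.

```python
# Sketch: trace a pipeline stage with the OpenTelemetry Python SDK.
# The console exporter is for illustration only; production deployments
# would export to an OTel Collector or tracing backend instead.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("netops.pipeline")

def enrich_flows(batch):
    # Wrap a pipeline stage in a span so its latency and errors are visible.
    with tracer.start_as_current_span("enrich_flows") as span:
        span.set_attribute("batch.size", len(batch))
        return [dict(record, site="nyc-dc1") for record in batch]  # placeholder enrichment

enrich_flows([{"src_ip": "10.0.0.12"}, {"src_ip": "10.0.0.34"}])
```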

Observability and governance are sometimes an afterthought, but it’s important to consider them when you architect your data pipeline so that you have the tools in place when things break (which is almost inevitable) and so that your data stays secure and in compliance.

---

In AI for NetOps, remember that something like 80% of the work is data engineering: the process of architecting, building, and operating the data pipeline that ingests, processes, stores, and secures the data our AI solution depends on to be useful.

In this post we looked at some high-level big data concepts and stepped through the structure of a typical pipeline. In reality, however, it’s more complicated than that. We also have to consider budget, staffing, and business objectives, and how to architect around them to achieve the best results in an acceptable timeframe.

This is why it’s rare for a single person to build an entire AI solution end-to-end: configuring sources, architecting and building the data pipeline, training and applying models, and working with the business to make sure the initiative aligns with business goals, budget, and security requirements.

I’ve heard it said that for every data scientist applying models, there are two or three data engineers building and running the pipeline behind those models. I think that in NetOps, with the many services offered today to orchestrate much of this activity, we can get away with building simpler AI solutions (relatively speaking) without necessarily needing teams of data engineers, data scientists, and dedicated ML practitioners. But the core knowledge needs to be there, as does an appreciation for the complexity of the end-to-end system.
