The Importance of Data Pipelines for AI in Network Operations
I love a good packet capture. In fact, there's little I love more on a Friday night than watching packets stream across my screen in Wireshark. Sometimes I feel a little like Cypher from The Matrix, watching the bits fly by.
Unfortunately, machine learning doesn't quite work that way. An ML model doesn't stop to stare at SYN flags or marvel at a weird DF bit. It just ingests numbers and builds statistical intuition. If the numbers are wrong (skewed timestamps, missing values, flow records duplicated across collectors), the model learns the wrong story. Clean, correct, diverse, and sufficient data is the foundation of any serious AI initiative, and that's just as true when we apply AI to network operations.
If you want AI for your network operations that actually works, you have to treat network telemetry the way big-data engineers treat clickstreams.
Introducing the Data Pipeline
Until recently, most network telemetry landed in time-series dashboards or CSV exports and went no further. Today, however, the same streams can feed machine-learning workflows that forecast congestion, detect anomalies, and steer traffic automatically. For many network engineers, the stumbling block isn't the algorithms but the data pipeline.
A data pipeline is the set of tools that moves raw telemetry from source to ingestion to processing to storage to serving for ML and dashboards.
Source → Ingest → Process → Store → Serve
Especially when dealing with real-time data, as we do in networking, we need to understand how to ingest, clean, store, and serve billions of records per day so models can learn and infer in (near) real time. In AI for NetOps, and really in any AI initiative, the data pipeline is the bulk of the work needed to apply models and make inferences.
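To make those five stages concrete, here's a minimal sketch in Python. Everything in it is illustrative: the function names, the FlowFeature shape, and the hardcoded record are assumptions standing in for a real collector, stream processor, and database.

```python
# Illustrative sketch of the five pipeline stages; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class FlowFeature:
    """The 'serve' shape: what a model or dashboard actually reads."""
    src: str
    dst: str
    bytes_per_sec: float

def ingest(raw_export: bytes) -> dict:
    """Ingest: accept raw telemetry exactly as the device sent it."""
    # In practice this is a NetFlow/IPFIX collector, a Kafka producer, etc.
    return {"src": "10.0.0.1", "dst": "10.0.0.2", "bytes": 1_200_000, "duration_s": 60}

def process(record: dict) -> FlowFeature:
    """Process: clean, normalize, and derive the features the model needs."""
    return FlowFeature(
        src=record["src"],
        dst=record["dst"],
        bytes_per_sec=record["bytes"] / max(record["duration_s"], 1),
    )

def store(feature: FlowFeature) -> None:
    """Store: persist for training and historical analysis (e.g., a column store)."""
    print(f"writing {feature} to long-term storage")

def serve(feature: FlowFeature) -> None:
    """Serve: hand the freshest features to inference or a dashboard."""
    print(f"serving {feature} to the anomaly-detection model")

# Source -> Ingest -> Process -> Store -> Serve
record = ingest(b"\x00\x09")  # stand-in for a raw NetFlow v9 payload
feature = process(record)
store(feature)
serve(feature)
```

In a real deployment each of these functions is its own system (a collector, a stream processor, a database, a feature store), but the hand-off between stages looks much the same.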
The Four ‘V’s of Data
You may have heard about the "four Vs" of data: variety, volume, velocity, and veracity (some people say there are 5, 6, 7, or more). In NetOps, most of us haven't had to think much about them unless we worked for a visibility vendor specializing in network telemetry. I believe that's changing, though. Traditional network engineers accustomed to configuring routers and tweaking QoS policies are becoming more interested in using AI in their operational workflows, but they've never had to think about (let alone build) complex data pipelines.
Unfortunately, building that pipeline is the first step of any AI solution for NetOps. Fortunately, it's well understood in the broader data engineering and data science world, which means we, as network engineers relatively new to AI workflows, don't have to start from scratch.
Variety of Network Telemetry
So the first problem we need to solve is handling the variety of data we deal with in NetOps. Think of variety in terms of format, type, and source. Every day we ingest flow records from our routers and switches, device metrics from APs and firewalls, cloud flow logs from CSPs, SNMP traps, and a flurry of new tickets from end-users. These sources also vary widely in format: a proprietary record from AWS, JSON from a router, the structured varbinds of an SNMP trap, or the free text of a trouble ticket written in English.
A typical network or application problem will manifest itself in several of these telemetry types, so analyzing only one, such as only flow or only SNMP, gives an incomplete picture of what's going on. Sure, it makes our lives easier to deal with only one type of data, but if we're willing to engineer a solution that handles multiple types of network telemetry, we'll increase the accuracy, effectiveness, and usefulness of our AI initiative.
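As a rough illustration of what handling variety looks like in code, here's a sketch that maps three very different inputs, a router's JSON metric, an SNMP trap, and the free text of a ticket, onto one common event shape so downstream stages only ever see one schema. The field names and the schema itself are invented for this example.

```python
# Sketch: normalizing heterogeneous telemetry into one common event shape.
# Field names and the common schema are illustrative, not a standard.
import json
from datetime import datetime, timezone

def to_common_event(source: str, payload) -> dict:
    """Map any telemetry type onto the same minimal schema."""
    event = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "kind": None,
        "body": {},
    }
    if source == "router_json":
        metric = json.loads(payload)
        event["kind"] = "interface_metric"
        event["body"] = {"device": metric["device"], "ifname": metric["ifname"],
                         "octets_in": metric["octets_in"]}
    elif source == "snmp_trap":
        # payload is already a dict of varbinds from the trap receiver
        event["kind"] = "trap"
        event["body"] = {"oid": payload["oid"], "value": payload["value"]}
    elif source == "ticket":
        # free-text ticket; keep the raw text for NLP features later
        event["kind"] = "ticket"
        event["body"] = {"text": payload}
    return event

print(to_common_event("router_json",
      '{"device": "core-sw1", "ifname": "Ethernet1/1", "octets_in": 98231}'))
print(to_common_event("snmp_trap", {"oid": "1.3.6.1.6.3.1.1.5.3", "value": "linkDown"}))
print(to_common_event("ticket", "Users in building B report slow file transfers since 9am."))
```

The point isn't this particular schema; it's that the normalization happens once, early in the pipeline, so everything downstream can stay simple.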
Volume of Network Telemetry
Next, we have to deal with an enormous volume of data, which isn't hard to imagine if you've ever run a packet capture for even a few minutes. A decent-sized enterprise network can generate millions upon millions of flow records per day, and on top of that there's a constant stream of device metrics, new tickets, log events, and so on. More concretely, a single busy 10 Gbps link can generate millions of NetFlow v9 records per hour.
We also have to consider the volume of data stored for historical purposes. If we're using AI to make predictions, for example, we need to train our ML models on enough historical data that they can make accurate predictions on new, real-time data they haven't seen before.
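To put rough numbers on it, here's a back-of-envelope sketch. The record rate, record size, link count, and retention window are assumptions for illustration, not measurements, so swap in your own.

```python
# Back-of-envelope sizing for flow-record volume. All inputs are assumptions.
records_per_hour = 5_000_000   # assumed flow records/hour from one busy 10 Gbps link
bytes_per_record = 150         # assumed size of one stored, enriched flow record
links = 20                     # assumed number of monitored links
retention_days = 90            # how much history the models train on

records_per_day = records_per_hour * 24 * links
raw_per_day_gb = records_per_day * bytes_per_record / 1e9
retained_tb = raw_per_day_gb * retention_days / 1000

print(f"{records_per_day:,} records/day")
print(f"~{raw_per_day_gb:,.0f} GB/day before compression")
print(f"~{retained_tb:,.1f} TB retained for {retention_days} days of training history")
```

With these made-up inputs you land in the billions of records and tens of terabytes range, which is exactly why volume has to be an engineering decision, not an afterthought.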
Velocity of Network Telemetry
Then there's data velocity: the speed at which data is generated, transmitted, and made available for processing and analysis. Velocity is all about how quickly events arrive and how fast your entire AI workflow can ingest, store, and act on them. In NetOps, those events are all the telemetry we've been talking about so far: flow records, streaming telemetry, syslogs, trouble tickets, and so on.
Some of that data can be batch processed. Data that doesn't decay within seconds, such as a device inventory in a CSV or configuration snapshots, could be processed once a week. Trouble tickets might land in the ticketing system in real time, but the tickets themselves might enter the data pipeline for our AI workflow only once a day.
Flow records, live packet-loss KPIs, and BGP state changes can’t wait for a nightly batch. They have to hit the pipeline, be normalized, and feed inference within seconds, or else the alert arrives after the outage.
If reacting after a coffee break won’t hurt the business, batch it. If seconds matter, stream it.
This means we need to engineer a data pipeline that ingests, cleans, processes, stores, and analyzes new data fast enough that it's useful to an engineer troubleshooting a problem right now, not just as historical analysis.
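As a minimal sketch of the streaming side, assuming flow records land on a Kafka topic as JSON and using the kafka-python client, this is roughly what "ingest and act within seconds" looks like. The topic name, broker address, field names, and the packet-loss threshold are all placeholders, and the threshold check stands in for a real model inference call.

```python
# Minimal streaming-ingest sketch using kafka-python (pip install kafka-python).
# Topic name, broker address, fields, and threshold are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "flow-records",                              # assumed topic name
    bootstrap_servers="localhost:9092",          # assumed broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",                  # we only care about fresh data
)

for message in consumer:
    record = message.value
    # In a real pipeline this is normalization plus a call to the model;
    # a simple threshold stands in for inference here.
    if record.get("packet_loss_pct", 0) > 2.0:
        print(f"ALERT: {record['device']} {record['interface']} "
              f"loss={record['packet_loss_pct']}% within seconds of export")
```

The batch path, by contrast, can be a nightly job that reads yesterday's files from object storage; the same processing logic applies, just on a slower clock.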
Veracity of Network Telemetry
Lastly, we have data veracity: the accuracy and reliability of the data, or the degree to which the telemetry feeding your network-AI pipeline is accurate, complete, consistent, and trustworthy. In the same way routing fails without reliable reachability information, an ML model collapses without reliable data. If the inputs misrepresent reality, the model will learn, predict, and automate the wrong things.
For example, one vendor may report interface counters in bytes while another reports them in bits. Similarly, one vendor exports temperature in °C, another in °F. If you feed both to the same model without normalizing, feature scales diverge and the optimizer "learns" noise.
So in practice, ensuring the veracity of our network telemetry means deduplication, normalization, scaling, handling missing rows or individual values in the incoming stream, and more.
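Here's a small sketch of what that cleaning step can look like, using pandas and invented column names: convert one vendor's bits to the other's bytes, convert °F to °C, drop duplicate flow records, and flag missing values rather than silently filling them.

```python
# Veracity sketch with pandas: unit normalization, dedup, missing-value handling.
# Column names and the vendor split are invented for illustration.
import pandas as pd

df = pd.DataFrame([
    {"device": "vendorA-1", "if_counter": 8_000,  "counter_unit": "bytes", "temp": 40.0,  "temp_unit": "C"},
    {"device": "vendorB-1", "if_counter": 64_000, "counter_unit": "bits",  "temp": 104.0, "temp_unit": "F"},
    {"device": "vendorB-1", "if_counter": 64_000, "counter_unit": "bits",  "temp": 104.0, "temp_unit": "F"},  # duplicate
    {"device": "vendorA-2", "if_counter": None,   "counter_unit": "bytes", "temp": 38.5,  "temp_unit": "C"},
])

# Normalize units so every row means the same thing.
df.loc[df["counter_unit"] == "bits", "if_counter"] /= 8
df.loc[df["temp_unit"] == "F", "temp"] = (df.loc[df["temp_unit"] == "F", "temp"] - 32) * 5 / 9
df["counter_unit"], df["temp_unit"] = "bytes", "C"

# Drop exact duplicates (e.g., the same flow exported by two collectors).
df = df.drop_duplicates()

# Don't silently invent data: mark incomplete rows so training can exclude them.
df["complete"] = df["if_counter"].notna()

print(df)
```

Whether this logic lives in a pandas batch job or a stream processor, the principle is the same: fix the data before the model ever sees it.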
–
Your Wireshark skills still matter, but in the age of AI they're only the beginning. Even traditional network engineers will need at least a basic understanding of what a data pipeline for AI looks like. The good news is we don't need a massive budget to get started: a lean stack like Telegraf → Kafka → Flink → ClickHouse → Feast can be stood up in a lab this week and scaled when the business case proves itself.
If we can get the plumbing right, AI will stop being a buzzword and actually become a relevant and useful tool in our NetOps arsenal.