Summary:
1. The blog discusses the importance of collecting high-quality data for AI to effectively alert us in advance, rather than relying on traditional methods like Ping and SNMP.
2. The research focused on collecting detailed logs from various devices to monitor SLA violations, CPU spikes, bandwidth thresholds, and more.
3. Despite the challenges of managing a large volume of data and ensuring proper labeling, the efforts resulted in a wealth of valuable information for AI analysis.
Article:
The effectiveness of AI hinges on the quality and reliability of the data it processes. Traditional methods such as Ping and SNMP may provide some insight, but they often fall short of delivering real-time, detailed information. To address this limitation, a research initiative was undertaken to determine how much data collection is actually required for AI to proactively alert us to potential issues.
The focus of the research was on gathering comprehensive logs from a vast array of global devices, totaling around 2,500 units. This extensive data collection effort aimed to capture a wide range of information, including SLA violations, hardware performance metrics, network configuration changes, and even NetFlow data. By delving into the details of network operations, the team was able to paint a clearer picture of the underlying trends and patterns affecting their systems.
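To make the collection step concrete, here is a minimal sketch of a central syslog intake that stamps each incoming message with its source device and time of receipt before it lands in storage; the port, file path, and record fields are illustrative assumptions rather than the team's actual pipeline.

```python
# Minimal syslog intake sketch (assumptions: devices forward UDP syslog to this
# collector; port 5514 and the output file name are illustrative, not the real setup).
import json
import socket
from datetime import datetime, timezone

LISTEN_ADDR = ("0.0.0.0", 5514)          # unprivileged port chosen for illustration
OUTPUT_FILE = "raw_device_logs.jsonl"    # hypothetical landing file for the data lake

def collect() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    with open(OUTPUT_FILE, "a") as out:
        while True:
            data, (src_ip, _src_port) = sock.recvfrom(8192)
            record = {
                "received_at": datetime.now(timezone.utc).isoformat(),
                "source_ip": src_ip,     # identifies which of the ~2,500 devices sent it
                "raw": data.decode("utf-8", errors="replace"),
            }
            out.write(json.dumps(record) + "\n")
            out.flush()

if __name__ == "__main__":
    collect()
```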
One of the key strategies employed was the integration of SLA monitors on SD-WAN routers to track DNS, HTTPS, and SaaS application performance. These monitors acted as synthetic probes, generating logs whenever a Layer 7 service failed to meet its SLA or when website performance deteriorated. By monitoring Layer 7 protocols at the router level, the team gained valuable insight into potential bottlenecks and performance issues.
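As an illustration of how such router-generated events might be turned into structured records, the sketch below parses a hypothetical SLA-violation syslog line; the message layout and field names are assumptions for the example, since real SD-WAN platforms emit their own vendor-specific strings.

```python
# Sketch of parsing an assumed SLA-violation syslog format from an SD-WAN router.
# The "SLA_VIOLATION ..." layout below is hypothetical, not a vendor's real format.
import re
from dataclasses import dataclass
from typing import Optional

SLA_PATTERN = re.compile(
    r"SLA_VIOLATION probe=(?P<probe>\w+) target=(?P<target>\S+) "
    r"latency_ms=(?P<latency>\d+) threshold_ms=(?P<threshold>\d+)"
)

@dataclass
class SlaEvent:
    probe: str        # e.g. DNS, HTTPS, or a SaaS application probe
    target: str
    latency_ms: int
    threshold_ms: int

def parse_sla_event(line: str) -> Optional[SlaEvent]:
    """Return an SlaEvent if the line reports an SLA violation, else None."""
    match = SLA_PATTERN.search(line)
    if match is None:
        return None
    return SlaEvent(
        probe=match["probe"],
        target=match["target"],
        latency_ms=int(match["latency"]),
        threshold_ms=int(match["threshold"]),
    )

# Example (hypothetical log line):
# parse_sla_event("<134>Jan 1 00:00:00 edge-rtr1 SLA_VIOLATION probe=HTTPS "
#                 "target=portal.example.com latency_ms=840 threshold_ms=500")
```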
Additionally, logs from RADIUS/TACACS+ servers provided visibility into Layer 2 port security violations and occasional MAC flooding incidents. Detailed data on the wireless infrastructure, including signal strength, SSID information, and client counts, was obtained through a vendor API, enabling comprehensive monitoring of access points. Similarly, data from switches covered a wide range of metrics, from VLAN changes to OSPF convergence, ensuring a holistic view of network operations.
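For the wireless side, a polling loop against the controller's REST API might look like the sketch below; the base URL, authentication scheme, and response fields are placeholders, since the article does not name the vendor or its API.

```python
# Sketch of polling a wireless controller's REST API for access-point health.
# URL, token handling, and JSON field names are assumptions for illustration only.
import requests

CONTROLLER_URL = "https://wlc.example.local/api/v1/access_points"  # hypothetical endpoint
API_TOKEN = "REPLACE_ME"                                           # hypothetical credential

def poll_access_points():
    """Yield one summary dict per access point reported by the controller."""
    resp = requests.get(
        CONTROLLER_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    for ap in resp.json():
        # Keep the fields the article mentions: signal strength, SSIDs,
        # and connected client counts for each access point.
        yield {
            "ap_name": ap.get("name"),
            "rssi_dbm": ap.get("signal_strength"),
            "ssids": ap.get("ssids", []),
            "client_count": ap.get("client_count"),
        }
```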
Despite the challenges of managing a large volume of data, the team successfully aggregated all the information into a centralized data lake. However, the data presented a new hurdle: it lacked proper labels and carried multiple, inconsistent timestamps, resembling a data swamp more than a structured repository. Addressing this labeling issue was crucial, as AI algorithms rely heavily on accurately labeled data to derive meaningful insights.
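One way to start draining that swamp is a normalization pass that coerces the assorted timestamp fields into a single UTC column and tags every record with the feed it came from; the column names below are assumptions based on the kinds of feeds described above, not the team's actual schema.

```python
# Sketch of one normalization pass over the aggregated data: unify timestamps
# and attach a source label. Column names are assumed for illustration.
import pandas as pd

TIMESTAMP_CANDIDATES = ["received_at", "device_time", "event_time"]  # assumed columns

def normalize(df: pd.DataFrame, source_label: str) -> pd.DataFrame:
    """Return a copy with a single UTC timestamp column and a source label."""
    out = df.copy()
    out["timestamp_utc"] = pd.NaT
    for col in TIMESTAMP_CANDIDATES:
        if col in out.columns:
            # Parse the first timestamp column found; unparseable values become NaT.
            out["timestamp_utc"] = pd.to_datetime(out[col], utc=True, errors="coerce")
            break
    out["source"] = source_label  # e.g. "sdwan_sla", "tacacs", "wireless_api"
    return out.dropna(subset=["timestamp_utc"])
```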
In conclusion, the journey of collecting and organizing vast amounts of network data was not without its challenges. However, the meticulous efforts paid off, paving the way for AI-driven analysis and proactive monitoring of network operations. By ensuring that data is not just abundant but also properly labeled, organizations can harness the power of AI to anticipate issues before they escalate, ultimately enhancing network efficiency and reliability.