AIOps Series II: The Exploration of Intelligent Monitoring System
Abstract: According to the latest definition from Gartner, Artificial Intelligence for IT Operations (AIOps) integrates big data and machine learning to extract and analyze the growing volume, variety, and velocity of data in a scalable, loosely coupled manner, supporting IT Service Management platforms. To share our daily practice and front-line experience with peers, the WeBank AIOps team has written a series of articles on the subject of AIOps.
Previously on WeBank AIOps Series: AIOps Series Ⅰ: The Emergence and Practice of AIOps
Origins
In recent years, AIOps has gained popularity with the rise of big data and AI technologies. However, most enterprises face significant challenges in applying AI. WeBank initiated the AIOps project in the hope of solving the problems we encountered as we grew. We rolled out a few pilots once sufficient data and algorithms had been accumulated. Along the way, we experienced various difficulties and setbacks. By early 2017, after three years of effort, WeBank had built its second generation of Ops tools, including CMDB (Configuration Management Database), ITSM (IT Service Management), PACE+ (automated publishing tools), IMS (Intelligent Monitor System), and ICP (Intelligent Capacity Planner). It was not until early 2020 that the AIOps project finally achieved significant progress.
Before reaching this year’s AIOps milestone, the Ops team was under heavy pressure to identify and locate anomalies swiftly so that WeBank’s IT products and services could keep their leading quality. On the first challenge, root cause analysis (RCA), the Ops team encountered the following pain points.
- When a system produced many alerts, Ops engineers could not promptly identify the overall impact. Escalating an anomaly took too long because multiple product owners had to go through multiple Ops managers to obtain a specific impact assessment.
- Ops engineers could not deal with individual problems effectively because there was no systematic process for locating causes.
As a result, the Ops team was constantly challenged during anomaly reviews because the monitoring system did not present the required information clearly.
WeBank’s renowned distributed core banking architecture has received numerous awards since the bank’s debut. Maintaining such industry-leading infrastructure requires Ops management systems that are in top shape and held to high standards, handling daily routine with ease while continually minimizing labor costs. In 2015, WeBank began developing an AlphaGo-like system that could learn by itself and automatically locate the root cause of anomalies.
To achieve this goal, the Ops team initially designed a simple system that showed each KPI’s base value and calculated the deviation between that base value and the current indicator. It then raised an alert if the deviation fell outside the standard range. This was just a simple alert system rather than an AIOps system. WeBank has therefore spent the past few years continuously accumulating data and optimizing its systems in order to build a true AIOps system. Let’s take a closer look at how we worked on this.
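The first-generation alert logic described above can be sketched roughly as follows. This is only an illustration under assumptions: the `KpiBaseline` type, the field names, and the 20% tolerance are hypothetical and do not come from WeBank's system.

```python
from dataclasses import dataclass


@dataclass
class KpiBaseline:
    """Hypothetical record of a KPI's base value and its allowed range."""
    name: str
    base_value: float   # expected value of the KPI (must be non-zero)
    tolerance: float    # allowed relative deviation, e.g. 0.2 means 20%


def deviation(baseline: KpiBaseline, current: float) -> float:
    """Relative deviation of the current reading from the base value."""
    return (current - baseline.base_value) / baseline.base_value


def should_alert(baseline: KpiBaseline, current: float) -> bool:
    """Alert when the deviation falls outside the configured range."""
    return abs(deviation(baseline, current)) > baseline.tolerance


# Illustrative values only: a base volume of 1000 transactions per minute.
volume = KpiBaseline("transaction_volume", base_value=1000.0, tolerance=0.2)
within_range = should_alert(volume, 900.0)    # 10% below the base value
out_of_range = should_alert(volume, 700.0)    # 30% below the base value
```

A fixed base value and tolerance like this have to be maintained by hand for every KPI, which is the limitation the later sections set out to remove.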
Accumulation of Ops Data: Standardization
From 2015 to 2017, WeBank made significant progress on standardization and automation in its Ops systems, laying a solid foundation for the accumulation of Ops data.
- The CMDB system provides accurate information to Ops engineers and a considerable number of APIs for other systems.
- Accumulation of monitoring data: The IMS records all monitoring data from the infrastructure layer to the application layer.
- Error log information: The IMS collects and stores all error log information.
- Accumulation of publishing information: The AOMP system records real-time publishing operations such as application publishing, database operations, etc.
- Accumulation of infrastructure data: The WeCloud platform records the change data and operational data from the infrastructure layer, such as network, database, middleware, host, etc.
- The Knowing system collects information about business promotions.
Algorithms Application in AIOps
In early 2018, WeBank started to explore anomaly location and RCA along the following two lines.
1. ‘Knowing’ is a system for anomaly detection (‘Shitu’) and RCA. The ‘Shitu’ module automatically detects anomalies in product KPIs without manually set thresholds.
The launch of this system marked WeBank’s switch from analyzing alerts to monitoring product KPIs. Formerly, an Ops engineer set a threshold for each product KPI based on his or her prior experience, and monitoring relied on year-over-year or month-over-month comparison. With the ‘Shitu’ functions, the Knowing system reduces the workload of monitoring configuration and improves the accuracy of root cause location. Nevertheless, front-line personnel still track alerts in real time according to the daily alert process, and the Ops engineers of business departments can obtain data from the monitoring system’s APIs to analyze and compress alerts.
2. Based on the business serial numbers from service governance, the Knowing system generates a transaction trace and detects whether it is in a normal condition. If there is an exception in a transaction trace, the Knowing system sends alerts to the relevant systems.
The Knowing system generates a unique transaction trace from the transaction serial number, and this trace plays a primary role in daily monitoring and operation management. The Knowing system inspects every transaction flow in real time using LSTM or deep neural network models to identify anomalies in the production environment.
To implement the real-time detection algorithm, WeBank cooperated closely with Professor Pei Dan of Tsinghua University, a leading figure in AIOps in China. This cooperation has helped WeBank make significant progress in AIOps. Formerly, the transaction trace contained only data from the RMB (Reliable Message Bus) protocol. To bring non-RMB protocols into the transaction trace and to improve its comprehensiveness and accuracy, we initiated a new project in 2020.
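The trace-building step described above, grouping records by business serial number, can be sketched as follows. The `Hop` record and its fields are hypothetical, and the "any failed hop" rule is a deliberately simple stand-in for the LSTM-based detection mentioned in the text, not a description of it.

```python
from collections import defaultdict
from typing import NamedTuple


class Hop(NamedTuple):
    """Hypothetical record of one call within a transaction."""
    serial_no: str      # business serial number shared by all hops of one transaction
    timestamp_ms: int   # when the hop was observed
    subsystem: str      # subsystem that handled the hop
    ok: bool            # whether the hop completed successfully


def build_traces(hops):
    """Group hops by serial number and order each trace by time."""
    traces = defaultdict(list)
    for hop in hops:
        traces[hop.serial_no].append(hop)
    for trace in traces.values():
        trace.sort(key=lambda h: h.timestamp_ms)
    return dict(traces)


def abnormal_traces(traces):
    """Simplified stand-in rule: a trace is abnormal if any hop failed."""
    return sorted(s for s, t in traces.items() if any(not h.ok for h in t))


# Illustrative data: hops arrive out of order and are interleaved.
hops = [
    Hop("T001", 20, "account", True),
    Hop("T002", 15, "gateway", True),
    Hop("T001", 10, "gateway", True),
    Hop("T002", 30, "account", False),
]
traces = build_traces(hops)
t001_path = [h.subsystem for h in traces["T001"]]
flagged = abnormal_traces(traces)
```

Keying everything on the serial number is what lets hops from different subsystems be stitched into one trace, which is also why records that do not carry that number cannot be included without extra work.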
RCA (Root Cause Analysis)
The RCA project, launched in 2018, aims to identify and locate anomalies quickly and to show the process of root cause analysis. The following diagram illustrates the workflow of this system.
1. ‘Knowing’ detects anomalies minute by minute.
‘Knowing’ checks product KPIs such as transaction volume, latency, system success rate, and business success rate every minute.
2. ‘Knowing’ automatically calculates the deviation between KPIs and current indicators to check whether a system is healthy. However, we also encountered obstacles in the ‘Knowing’ project.
- The algorithms applied in ‘Knowing’ rely on historical data, so if the pattern of the collected data differs from that of the historical data, ‘Knowing’ produces wrong results.
- When the trading volume is low and fluctuates significantly, it is easy to trigger false alerts.
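The low-volume obstacle above can be made concrete with a small sketch. It assumes a plain relative deviation from the historical mean and a hypothetical 30% threshold; the actual detection algorithm in ‘Knowing’ is not described in this article.

```python
from statistics import mean

# Hypothetical alert threshold on relative deviation (30%).
THRESHOLD = 0.3


def relative_deviation(history, current):
    """Deviation of the current minute from the historical mean."""
    baseline = mean(history)
    return abs(current - baseline) / baseline


# Busy product: a drop of about 12 transactions is barely visible.
busy = relative_deviation([1000, 1020, 980, 1010], 990)

# Quiet product: a drop of only 3 transactions looks like a 60% deviation.
quiet = relative_deviation([4, 6, 5, 5], 2)

busy_alert = busy > THRESHOLD
quiet_alert = quiet > THRESHOLD
```

With the same threshold, the quiet product raises an alert for a change of three transactions, which is the kind of false alert the second obstacle refers to.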
3.1 Anomaly alerts
‘Knowing’ divides product transaction indicators into ‘high,’ ‘medium,’ and ‘low’ levels, and each indicator has its own weight. When ‘Knowing’ detects an anomaly in a system, it scores the anomaly by these indicators and sends alerts to the relevant systems. Moreover, ‘Knowing’ decides which anomaly alerts should be sent rather than sending all of them.
In addition, ‘Knowing’ divides alerts into two categories that follow different escalation processes. The first category, indicator jitter, is alerted on PC, the robot notification system, and ITSM. The second category, anomaly, is alerted on PC, the robot notification system, iPad, and further relevant systems. These anomalies are given their final grade in event reviews.
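The scoring and routing described above might look roughly like this. The weights, both score thresholds, and the rule that separates indicator jitter from an anomaly are all assumptions made for illustration; only the level names and the channel names come from the text.

```python
# Hypothetical weights for the three indicator levels.
LEVEL_WEIGHTS = {"high": 3.0, "medium": 2.0, "low": 1.0}

# Channels named in the text for each alert category.
JITTER_CHANNELS = ["PC", "robot notification", "ITSM"]
ANOMALY_CHANNELS = ["PC", "robot notification", "iPad"]


def score(abnormal_levels):
    """Sum the weights of the indicators that are currently abnormal."""
    return sum(LEVEL_WEIGHTS[level] for level in abnormal_levels)


def route(abnormal_levels, send_threshold=1.5, anomaly_threshold=3.0):
    """Decide whether to send an alert and where (thresholds are assumed)."""
    total = score(abnormal_levels)
    if total < send_threshold:
        return ("suppressed", [])          # not every detection is sent
    if total < anomaly_threshold:
        return ("indicator jitter", JITTER_CHANNELS)
    return ("anomaly", ANOMALY_CHANNELS)


low_only = route(["low"])                  # score 1.0
medium_only = route(["medium"])            # score 2.0
high_and_low = route(["high", "low"])      # score 4.0
```

Separating the score from the routing decision keeps the escalation paths configurable without touching how indicators are weighted.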
3.2 RCA trigger
‘Knowing’ sends alerts to the relevant systems and kicks off root cause analysis at the same time. Finding the actual root cause is the primary challenge for the RCA module.
- First of all, there should be sufficient data stored in our databases, such as log information, alert information, API calls, transaction links, CMDB relation tree, change information, promotion information and business behavior information.
- Secondly, there should be sufficient historical data of anomalies accumulated in databases.
- Last but not least, ‘Knowing’ deduces the root cause through a detailed analysis from multiple dimensions.
The RCA module applies cutting-edge technologies such as graph databases and knowledge graphs to support its analysis.
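To give a feel for graph-based root cause search, here is a minimal sketch that walks a dependency graph outward from the alerting subsystem and lists nearby components with a suspicious recent event. It uses a plain dictionary rather than a graph database, and the node names, events, and ranking by distance are hypothetical; it is not the method used by the RCA module.

```python
from collections import deque


def candidate_root_causes(depends_on, start, suspicious):
    """Breadth-first search over dependencies of `start`.

    Returns (node, distance, reason) for every reachable node that has a
    suspicious event, nearest first.
    """
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, dist = queue.popleft()
        if node in suspicious:
            found.append((node, dist, suspicious[node]))
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, dist + 1))
    return found


# Hypothetical relation tree, in the spirit of CMDB relations.
depends_on = {
    "payment-app": ["account-svc", "mq"],
    "account-svc": ["db-01"],
    "mq": [],
}
# Hypothetical evidence drawn from change records and error logs.
suspicious = {
    "db-01": "database change shortly before the alert",
    "mq": "error log spike",
}
candidates = candidate_root_causes(depends_on, "payment-app", suspicious)
candidate_names = [name for name, _, _ in candidates]
```

A real system would weigh many more dimensions (history of past anomalies, transaction links, promotion information), which is where the accumulated data listed above comes in.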
4. The RCA module demonstrates the root cause of an anomaly.
‘Knowing’ can deliver the root cause conclusion within 2 minutes, measured from the moment an alert is sent to the completion of root cause analysis.
5. Recovery notification:
‘Knowing’ sends a recovery notification when the system recovers from an anomaly, and it calculates the impact on business volume during the anomaly period.
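One simple way to estimate the business-volume impact of an anomaly is to sum the shortfall between expected and actual volume over the affected minutes. The article does not say how ‘Knowing’ defines impact, so the function below and its inputs are assumptions for illustration.

```python
def business_impact(expected_per_minute, actual_per_minute):
    """Total shortfall in transactions over the anomaly window.

    Minutes where actual volume meets or exceeds the expectation
    contribute nothing to the impact.
    """
    return sum(
        max(expected - actual, 0)
        for expected, actual in zip(expected_per_minute, actual_per_minute)
    )


# Illustrative three-minute anomaly window.
expected = [100, 100, 100]
actual = [40, 10, 90]
lost_transactions = business_impact(expected, actual)
```

In this illustrative window the shortfall is 60 + 90 + 10 transactions.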
6. Event management
- When an anomaly is alerted, ITSM automatically generates a tracking form so that follow-up improvements can be traced.
- Front-line personnel assist in tracking the form and provide feedback on the root cause recorded in it.
- The Ops team supplements the knowledge database by reviewing each ITSM tracking form.
Benefits
During the rollout of ‘Knowing’, the Ops team has experienced the joy of accurate root cause analysis as well as the frustration of seeing the RCA accuracy rate decline for 2 consecutive months. Overall, however, the ‘Knowing’ project has been a success and has brought significant changes to anomaly management at WeBank.
On the one hand, when an anomaly occurs, the Ops team can identify the impact promptly so that product owners and management can focus on anomaly recovery. On the other hand, ‘Knowing’ has improved a range of indicators and system performance for anomaly management.
- ‘Knowing’ significantly reduces the time spent on anomaly escalation. The escalation time in 2019 improved by 60% compared with 2018.
- The accuracy rate of anomaly detection is above 90%.
- The accuracy rate of root cause analysis reached 81% by the end of December 2019.
What Is Next?
The exploration of anomaly monitoring and detection in 2018 and 2019 significantly advanced the Ops team’s progress in AIOps application. However, these achievements are just a beginning, and there are still many challenges to be solved in the future. The Ops team will continue exploring and improving because we are determined to boldly go where no one has gone before.
Chinese author: Julia Zhu
Translator: Danny Chen
Editors: WeBank AIOps Team