AIOps Series III: Shitu, the Core Algorithm of Intelligent Anomaly Detection

OPEN ZONE
6 min read · Sep 16, 2020



With the initial goal of achieving automatic anomaly discovery and alerting in intelligent IT operations, WeBank has adopted machine learning (ML) to detect anomalies in KPI curves without thresholds manually defined from human experience.

To achieve that goal, WeBank developed the Shitu module of its intelligent monitoring system. Shitu is an abnormal-curve detection system built on four key performance indicators (KPIs): real-time transaction volume, business success rate, system success rate, and average transaction latency. The gap between the business success rate and the system success rate reflects the system anomalies that are captured.

All system failures that affect the business are reflected in these four minute-level KPIs, so we can analyze all business-impacting anomalies by studying abnormal fluctuations of the indicators. KPIs with different statistical perspectives and fluctuation patterns require different algorithms.

The three methods of pattern detection

Based on Long Short-Term Memory (LSTM) networks and the Gaussian distribution, the first detection method mainly checks transaction volume and latency. It detects most curve mutations accurately, but its drawback is that subtle, slow changes are easily missed.

The second method, feature detection based on K-means, compensates for the omissions of the first and is very effective in cases where transaction volume changes slowly.

Based on probability density, the third method detects the business success rate and system success rate curves. As success rate curves have endless possible shapes, a more principled approach is required to measure an anomaly's amplitude.

All the above methods share a common principle: infrequency implies anomaly. In the context of unsupervised learning, anomaly detection becomes the question of how to measure infrequency.

The detection method based on LSTM and the Gaussian distribution

The method includes two steps: the curve forecast and the anomaly estimation.

During research, WeBank tried ARIMA, Holt-Winters, and LSTM to forecast the curve. Perhaps because LSTM learns long-range dependencies, it performed exceptionally well on our business-indicator data sets: the loss on most normalized data is around 0.0001. The following figure shows that the predicted curve is very close to the actual curve under normal business circumstances, so we use LSTM as our prediction model.

Fig 1: LSTM curve fitting

For a stable business, the predicted curve usually coincides with the actual curve because the model has learned the historical data patterns. When the actual curve deviates from the prediction that represents those patterns, an anomaly has very likely emerged, so we need a way to measure the deviation. During research, we found that the differences between the predicted and actual curves follow a Gaussian distribution, and the rare data mentioned above lie at the edges of that distribution.

If the probability of an observed deviation is extremely small and close to 0, it is considered an improbable event, and when an improbable event emerges from what we have considered normal, it is probably an anomaly. Detection based on LSTM and the Gaussian distribution works well for identifying short-duration mutations.
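As a rough sketch of this anomaly-estimation step (the function name and the 1% tail threshold are illustrative choices, not WeBank's actual values), the residuals between the actual and predicted curves can be fitted to a Gaussian, and any point whose two-sided tail probability is tiny flagged as an improbable event:

```python
import statistics

def gaussian_anomaly_flags(actual, predicted, threshold=0.01):
    """Flag points whose prediction residual is improbable under a
    Gaussian fitted to the residuals (threshold is illustrative)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = statistics.mean(residuals)
    sigma = statistics.stdev(residuals)
    flags = []
    for r in residuals:
        z = abs(r - mu) / sigma
        # two-sided tail probability of this residual under N(mu, sigma^2)
        tail = 2 * (1 - statistics.NormalDist().cdf(z))
        flags.append(tail < threshold)
    return flags
```

In practice the Gaussian parameters would be fitted on residuals from a known-normal period rather than on the window being tested, but the principle is the same: the smaller the tail probability, the rarer, and hence the more anomalous, the deviation.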

Fig 2: Detection based on LSTM and Gaussian distribution

However, an anomaly with a small amplitude and a long duration can hardly be identified this way. The main reason is that the LSTM prediction closely follows the actual curve: if an anomaly spans a longer time, the predicted curve drifts along with it, and the LSTM-plus-Gaussian detection becomes ineffective, as the yellow curve shows in the following figure.

Fig 3: Anomaly detection in slow decline

To make up for this defect, we introduced the second method: feature detection based on K-means with an adjustable time window.

The feature detection based on the K-means algorithm

For curve anomaly detection, a popular approach is to extract curve features as input and use annotations made by O&M staff as labels to train and evaluate models. This works well on well-annotated data sets, but very few companies have put it into practice, as it requires a massive amount of annotation; complex transfer learning is then necessary to reuse the models and reduce the number of annotations.

We therefore simplified the method massively with a slight change, covering the slow-change blind spot. There are four critical features: the mean, the slope, the zero rate, and the standard deviation of the first-order difference. When the indicator appears as a medium- or high-frequency signal, the mean and slope make slow-changing anomalies evident. When the indicator has a low or medium frequency, the zero rate and the standard deviation of the first-order difference are more prominent. An anomaly has probably occurred in a window whose features deviate significantly from the normal pattern.

K-means detects this deviation given a preset number of clusters K, aggregating the data into K clusters through iteration. To check the current window for anomalies, it is clustered together with adjacent and historical windows. With K = 2, we get the following results:

Fig 4: K-means clustering when K = 2
Fig 5: Anomaly recognition after slow decline
Fig 6: The anomaly detection of business success rate of low transaction volume

The current window's features fall into one cluster, while the features of the adjacent and historical windows fall into the other. By the principle that infrequency implies anomaly, the current window contains an anomaly.
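The windowed K-means check described above can be sketched as follows. The feature set matches the four features named earlier, while the window contents, farthest-pair seeding, and function names are illustrative assumptions for the sketch:

```python
import itertools
import statistics

def window_features(w):
    """The four features: mean, slope, zero rate,
    and standard deviation of the first-order difference."""
    diffs = [b - a for a, b in zip(w, w[1:])]
    return [
        statistics.mean(w),
        (w[-1] - w[0]) / (len(w) - 1),          # crude slope estimate
        sum(1 for v in w if v == 0) / len(w),   # zero rate
        statistics.pstdev(diffs),
    ]

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=10):
    """Minimal 2-means, seeded with the most distant pair of points
    so the sketch is deterministic."""
    c0, c1 = max(itertools.combinations(points, 2),
                 key=lambda ab: dist2(ab[0], ab[1]))
    labels = []
    for _ in range(iters):
        labels = [0 if dist2(p, c0) <= dist2(p, c1) else 1 for p in points]
        g0 = [p for p, l in zip(points, labels) if l == 0]
        g1 = [p for p, l in zip(points, labels) if l == 1]
        if g0:
            c0 = [sum(col) / len(g0) for col in zip(*g0)]
        if g1:
            c1 = [sum(col) / len(g1) for col in zip(*g1)]
    return labels
```

If the current window ends up alone in its cluster while the adjacent and historical windows share the other, the infrequency principle marks the current window as anomalous, even for a slow decline that the LSTM-based method would track and miss.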

While the detections above are effective for most business curves, they fail on the business success rate curve, so we designed a detection method specifically for it.

The detection based on probability density

The success rate curve alone cannot reflect all the possibilities behind it, which may be why the two algorithms above failed.

Fig 7: Anomaly detection of success rate of low transaction volume

Suppose the daily success rate is 95%. When a single transaction fails, the current success rate becomes 0%; given the past pattern this is entirely possible, and it tells us little. But 15 failures out of 30 transactions put the current success rate at 50%, which is infrequent and most likely an anomaly. If transaction volume is not considered, 0% looks more anomalous than 50%, but that is not the case.

The following algorithm integrates transaction volume to solve the problem: it calculates the cumulative probability that the number of successes is no more than the observed value.

For example, suppose the success rate is 95% and the transaction volume is 30, with X successful transactions. Figure 8 shows the probability distribution of X.

Fig 8: Probability distribution when R = 95%

In the figure, the most likely success volume fluctuates around 29. When the success volume drops to 15, the probability is close to zero. An event is most likely an anomaly when it cannot be accounted for by the regular pattern.
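This cumulative probability is the binomial CDF: with a historical per-transaction success rate p and a volume of n transactions, the probability of observing at most k successes. A minimal sketch (the function name is illustrative):

```python
from math import comb

def cumulative_success_prob(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p): the probability of observing
    at most k successful transactions out of n at success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
```

With p = 95% and n = 30, observing at most 15 successes has probability below 10⁻⁹, clearly anomalous, while a single failed transaction (n = 1, k = 0) has probability 0.05, which is unremarkable. This matches the intuition above: 50% of 30 is far rarer than 0% of 1.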

Algorithms WeBank used

Chinese author: Benji Li

Translator: Jess Zhu, Sookie Tao

Editors: WeBank AIOps Team



Written by OPEN ZONE

Open Zone is a virtual innovation lab designed for technology and business innovators. We provide the testing ground for you to build and demonstrate POCs and MVPs.
