Large Scale Time Series Analysis for Infrastructure Reliability

The performance, reliability, and efficiency of infrastructure systems are instrumental to a high-quality user experience at Facebook. Extensive time series logging has been implemented to monitor the health of these systems: CPU usage of internal services (e.g., the large number of machine learning (ML) models built by different teams), network traffic between internet service providers and Facebook data centers, and video streaming KPIs of live broadcasts generated by daily active users, to name a few. Many methods and tools are available to analyze these logs at the level of a single time series, for instance time series forecasting and (contextual) anomaly detection. Relatively little has been done to address an emerging use case: identifying similar or outlier instances within a large collection of time series. In this poster, we demonstrate the design and implementation of a generic framework that meets this need. We discuss two real-world applications of the approach: (i) auto-scaling the inference capacity for ML models, and (ii) detecting video ingestion quality outliers across Facebook live broadcasts. Specifically, we fill the gaps in existing tools (and/or external, off-the-shelf alternatives) by proposing a method that is interpretable, generalizable, and scalable.