Real-Time Analytics

Outlyer provides a powerful real-time analytics engine that collects millions of metrics and runs ad-hoc queries against them for use in dashboards and alerts. These metrics are collected by the agent via plugin checks, by scraping Prometheus endpoints, or as metrics pushed from Dropwizard and, in future, StatsD.

While most users will be happy using our standard query builder for dashboards and alerts, this section gives advanced users who want to collect and analyse their time series metrics an introduction to the capabilities and query language of Outlyer’s analytics engine.

Key Concepts

Time Series

A time series is a sequence of data points reported at a consistent interval over time. The time interval between successive data points is called the step size. In Outlyer each time series is paired with metadata called labels that allow you to query and group the data.

Labels

Labels are a set of key-value pairs associated with a time series. Each time series must have at least one label with a key of name. Labels are sometimes called dimensions, and they let you select and group your metrics so you can easily pick out the time series you want for your queries.

Metric

A metric is a specific quantity being measured, e.g., the number of requests received by a server. In casual conversation about Outlyer, "metric" is often used interchangeably with "time series". A time series is one way to track a metric and is the method supported by Outlyer. In most cases there will be many time series for a given metric name if labels are used.

Each unique combination of metric name and labels will create a time series in Outlyer, and is considered a unique metric by our analytics engine.
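
As a rough illustration of how label combinations multiply into separate time series, the sketch below derives an order-independent identity from a label set. The series_key helper, the metric name, and the host and status labels are hypothetical and are not part of Outlyer's API; only the required name label comes from the text above.

```python
from typing import Dict, FrozenSet, Tuple


def series_key(labels: Dict[str, str]) -> FrozenSet[Tuple[str, str]]:
    """Return an order-independent identity for a time series (hypothetical helper)."""
    if "name" not in labels:
        raise ValueError("every time series needs a label with the key 'name'")
    return frozenset(labels.items())


# Hypothetical samples: one metric name, several label combinations.
samples = [
    {"name": "requests.count", "host": "web-1", "status": "200"},
    {"name": "requests.count", "host": "web-1", "status": "500"},
    {"name": "requests.count", "host": "web-2", "status": "200"},
    # Same labels as the first sample, listed in a different order.
    {"status": "200", "host": "web-1", "name": "requests.count"},
]

# Each distinct combination of name and labels counts as one time series.
unique_series = {series_key(s) for s in samples}
print(len(unique_series))
```

Here four samples collapse into three time series, because the first and last samples carry the same label set.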

Data Point

A data point is a triple consisting of labels, a timestamp, and a value. It is important to understand at a high level how data points correlate with the measurement. Consider requests hitting a server, which would typically be measured using a counter. Each time a request is received the counter is incremented. There is not one data point per increment; instead, a data point represents the behavior over a span of time called the step size.

Step Size

The amount of time between two successive data points in a time series. In Outlyer the data points will always be on even boundaries of the step size. If data is not reported on step boundaries, it will get normalized to the boundary.
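
A minimal sketch of what aligning a timestamp to a step boundary can look like. It assumes timestamps are whole seconds and that a sample is assigned to the boundary at or before it (rounding down); the text above does not say which direction Outlyer rounds, and align_to_step is a hypothetical name.

```python
def align_to_step(timestamp_s: int, step_s: int = 30) -> int:
    """Round a timestamp down to the nearest step boundary (assumed direction)."""
    return timestamp_s - (timestamp_s % step_s)


# A sample reported 35 seconds into a 60 second step lands on the boundary at 60.
print(align_to_step(95, step_s=60))
# A sample already on a boundary is left unchanged.
print(align_to_step(120, step_s=60))
```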

Metric Type

Metric Types tell Outlyer how to treat the data points collected for a metric and normalize the values to the step boundaries. Outlyer currently supports counters and gauges:

  • Counters: Counters capture a metric that monotonically increases over time rather than going up and down. For example, the total number of requests received by a server since it was started. Counters are converted into rates when normalized, so when graphing counters in Outlyer you will see the rate of change per second for the metric. You cannot get the original raw value of the counter, as this can lead to confusing results such as drops when a server is restarted.
  • Gauges: Gauges capture a metric that varies over time (goes up and down). For example, CPU utilization can vary between 0% and 100% depending on the server load. Use gauges where you want to graph the specific value of a time series in Outlyer.

Normalization

In Outlyer this usually refers to normalizing data points to step boundaries. Suppose that values are actually getting reported at 30 seconds after the minute instead of exactly on the minute. The values will get normalized to the minute boundary so that all time series in the system are consistent. The following provides important information so you can understand how your metrics are being stored in Outlyer when writing advanced queries and plotting graphs of your queries in Outlyer.

How a normalized value is computed depends on the metric type. Outlyer supports two types: Counters and Gauges.

Gauge Normalization

A gauge value is sampled from some source and used as is. The last value received will be the value used for the interval.
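
The "last value wins" rule can be sketched as follows. This is an illustration rather than Outlyer's implementation: normalize_gauge is a hypothetical function, timestamps are assumed to be whole seconds, and each sample is assumed to belong to the step boundary at or before it.

```python
from typing import Dict, Iterable, Tuple


def normalize_gauge(
    samples: Iterable[Tuple[int, float]], step_s: int = 60
) -> Dict[int, float]:
    """Keep the last value received in each step interval (hypothetical helper)."""
    normalized: Dict[int, float] = {}
    # Process samples in time order so later samples overwrite earlier ones.
    for timestamp_s, value in sorted(samples, key=lambda s: s[0]):
        boundary = timestamp_s - (timestamp_s % step_s)  # assumed: round down
        normalized[boundary] = value
    return normalized


# Two samples fall in the first minute; only the later one (12.0) is kept.
print(normalize_gauge([(5, 10.0), (40, 12.0), (65, 9.0)]))
```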

Counter Normalization

A counter will be normalized to a rate per second. The conversion is done by computing the delta between the current sample and the previous sample and dividing by the time between the samples. Since the starting value is unknown, at least two samples must be received before the first delta can be computed. This means that new time series of the counter type will be delayed by one interval.
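
The delta-over-time calculation described above can be sketched like this. counter_to_rates is a hypothetical helper; it assumes samples arrive in strictly increasing time order and does not attempt to handle counter resets such as a server restart.

```python
from typing import List, Tuple


def counter_to_rates(samples: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Convert cumulative counter samples into per-second rates.

    samples: (timestamp_s, cumulative_value) pairs in strictly increasing time order.
    """
    rates = []
    # Pair each sample with its predecessor; the first sample has none,
    # which is why the first rate appears one interval late.
    for (t_prev, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        rates.append((t_cur, (v_cur - v_prev) / (t_cur - t_prev)))
    return rates


# Three counter samples, 30 seconds apart, produce two rate data points.
print(counter_to_rates([(0, 100), (30, 160), (60, 250)]))
```

With these inputs the counter grows by 60 and then by 90 over 30 second intervals, giving rates of 2.0 and 3.0 per second. A single sample on its own produces no output.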

Consolidation Function

Outlyer will not send back more data points via the API than the pixel width of the graph, as it's impossible to plot more data points than the number of pixels available for the graph.

What this means is if you plot a query for 30 days on a 200px width graph, for a metric collected every 30 seconds, the step size will be increased from 30 seconds to 12,960 seconds (3.6hrs). The 432 data points collected during the 3.6hrs step period will be consolidated into a single data point for that period.

By default Outlyer will automatically consolidate based on the query. For example a max query will consolidate the step period to the maximum data point out of the 432 data points collected. Other consolidation functions are avg, min and sum and you can override the default consolidation function for a query if required.
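
To make the arithmetic concrete, the sketch below recomputes the 30 day, 200px example and applies a consolidation function to fixed-size buckets. The function names are hypothetical, and rounding the step up to a multiple of the collection interval is an assumption rather than documented behavior.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# The four consolidation functions named in the text.
CONSOLIDATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "avg": mean,
    "max": max,
    "min": min,
    "sum": sum,
}


def consolidated_step(range_s: int, width_px: int, base_step_s: int) -> int:
    """Smallest multiple of base_step_s that yields at most width_px points."""
    raw_step = -(-range_s // width_px)            # integer ceiling division
    multiples = max(1, -(-raw_step // base_step_s))
    return multiples * base_step_s


def consolidate(values: List[float], per_bucket: int, how: str = "avg") -> List[float]:
    """Collapse each run of per_bucket values into one data point."""
    fn = CONSOLIDATORS[how]
    return [fn(values[i:i + per_bucket]) for i in range(0, len(values), per_bucket)]


step_s = consolidated_step(range_s=30 * 24 * 3600, width_px=200, base_step_s=30)
print(step_s, step_s // 30)  # step in seconds, raw points per consolidated point

print(consolidate([1, 2, 3, 4, 5, 6], per_bucket=3, how="max"))
```

For the example in the text this gives a step of 12,960 seconds with 432 raw data points per consolidated point.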

Retention Periods

Time series for monitoring generally lose value quickly. Most users need the current or most recent data points to help troubleshoot an incident, and use older data points for analysing trends and capacity planning. Hence, like all monitoring tools, Outlyer stores data points for a certain period of time (the retention period), at resolutions that decrease as the data points age. At the end of the retention period, the data points expire and are deleted from Outlyer, so they will no longer be available.

By default Outlyer will keep all your metrics for 13 months, allowing users to do capacity planning for the past year and compare month-on-month trends. However, to keep our users' bills reasonable, we roll up the data points to the following resolutions over time:

  • Last 6hrs: Original 30 second metric resolution
  • Last 2 weeks: After 6hrs, the metric resolution is reduced to a 5 minute step size
  • Last 13 months: After 2 weeks, the metric resolution is reduced to a 1 hour step size
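
The tiers above can be read as a lookup from data point age to step size. The sketch below is an illustration only: step_for_age and the constant names are hypothetical, boundary ages are assumed to belong to the finer tier, and expiry at the end of the 13 month retention period is not modeled.

```python
from datetime import timedelta

# (maximum age, step size in seconds) for the two finer tiers listed above.
RETENTION_TIERS = [
    (timedelta(hours=6), 30),    # original 30 second resolution
    (timedelta(weeks=2), 300),   # 5 minute step size
]
OLDEST_TIER_STEP_S = 3600        # 1 hour step size for anything older


def step_for_age(age: timedelta) -> int:
    """Return the step size, in seconds, expected for data of the given age."""
    for max_age, step_s in RETENTION_TIERS:
        if age <= max_age:
            return step_s
    return OLDEST_TIER_STEP_S


print(step_for_age(timedelta(hours=1)))
print(step_for_age(timedelta(days=3)))
print(step_for_age(timedelta(days=90)))
```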

Rolled up metrics are stored with various statistics so that running analytics queries against older data points will still return valid results even though the number of data points has been reduced.
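
One way to see why stored statistics keep older queries valid is the sketch below. It assumes a rollup keeps the count, sum, minimum, and maximum of the raw points it replaces; the text does not list which statistics Outlyer actually stores, and the Rollup class and roll_up function are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollup:
    """Summary of the raw data points in one rolled up step (assumed fields)."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def avg(self) -> float:
        return self.total / self.count


def roll_up(values: Sequence[float]) -> Rollup:
    """Summarize a non-empty run of raw values into a single rollup."""
    return Rollup(len(values), sum(values), min(values), max(values))


# Ten hypothetical 30 second gauge values become one 5 minute rollup.
raw = [12.0, 15.0, 11.0, 40.0, 13.0, 12.0, 14.0, 16.0, 12.0, 15.0]
summary = roll_up(raw)
print(summary.maximum, summary.minimum, summary.avg)
```

Under this assumption a max query over old data can still report the spike to 40.0 even though the individual 30 second points are no longer kept at that resolution.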

By default Outlyer stores all the original raw data points for every metric in cheaper, higher-latency storage, which allows Outlyer to restore your data from the raw input at any point. In future, Outlyer may provide higher resolution for older time periods, as well as additional reporting capabilities against the original raw data points it has collected.