2017 August 13


I have been writing an Ansible playbook (https://github.com/tiagoprn/devops/tree/master/ansible_playbooks/provision-centos7) to take care of provisioning CentOS 7 servers, which is something I have always done manually. This new task has been teaching me a lot, and one thing that kept bothering me was that I had to solve the problem of how to collect metrics from the server.

By the time I finished the playbook, I had used sysstat to help me with that. It is quite useful since it can be used from the CLI and can also consolidate a daily report using sar, which I did by writing a cron task that runs at the end of the day and consolidates the info into a JSON file.

This seemed nice at first, but still required me to log into the server to get those metrics. A nice web dashboard would be a lot more practical.

Time-series database to the rescue

Then I remembered an article I had read on opensource.com about time-series databases (https://opensource.com/article/17/8/influxdb-time-series-database-stack), and how I could use the TICK stack to help me with that. Time-series databases are useful exactly in this scenario, where you have a bunch of metrics that must be saved at short intervals (every few seconds) and be fast to query. In other words, they are useful for storing a huge amount of data that changes over time (metrics, for example).
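To make the idea concrete, here is a toy sketch of what "timestamped values that are fast to query by time range" means. It is only an illustration in plain Python, not how InfluxDB stores data; the class name TinySeries and the sample values are made up for this example.

```python
from bisect import bisect_left, bisect_right
from statistics import mean


class TinySeries:
    """A toy in-memory time series (hypothetical, for illustration only)."""

    def __init__(self):
        self._times = []   # Unix timestamps in seconds, kept in ascending order
        self._values = []  # value recorded at the matching timestamp

    def append(self, timestamp, value):
        # Metrics arrive in time order, so appending keeps the lists sorted.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("points must be appended in time order")
        self._times.append(timestamp)
        self._values.append(value)

    def between(self, start, end):
        # Binary search finds the window without scanning every point.
        lo = bisect_left(self._times, start)
        hi = bisect_right(self._times, end)
        return self._values[lo:hi]


# Made-up CPU idle percentages, one sample every 10 seconds.
cpu_idle = TinySeries()
for step, value in enumerate([91.0, 88.5, 40.0, 42.5, 90.0]):
    cpu_idle.append(1502600000 + 10 * step, value)

# All samples recorded between two timestamps (both ends inclusive).
window = cpu_idle.between(1502600020, 1502600030)
average_idle = mean(window)
```

A real time-series database adds persistence, compression, retention policies and tags on top of this basic "sorted by time, query by range" idea.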

The TICK stack is composed of the following projects:

  • [T]elegraf: agent for collecting metrics and events
  • [I]nfluxDB: Time-Series Database
  • [C]hronograf: Interface
  • [K]apacitor: Real-time streaming data processing engine

Of those projects, two seemed a nice fit for my purpose: Telegraf and InfluxDB. As a web UI, I am thinking of using Grafana.

InfluxDB seems nice because it uses a query language that is close to SQL. So, it fits nicely into my relational brain. :)
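As a rough illustration of how SQL-like that language (InfluxQL) looks, here is a small Python helper that builds a query string. The helper name mean_over_window is hypothetical, and the measurement "cpu" with the field "usage_idle" is an assumption based on what Telegraf's CPU input usually reports. It does no escaping, so treat it as a sketch and not something to feed untrusted input into.

```python
def mean_over_window(measurement, field, window, bucket):
    """Build an InfluxQL query averaging `field` per `bucket` over the last `window`.

    Hypothetical helper for illustration; it only formats a string.
    """
    return (
        f'SELECT mean("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {window} GROUP BY time({bucket})"
    )


# Average idle CPU in 5-minute buckets over the last hour.
query = mean_over_window("cpu", "usage_idle", "1h", "5m")
print(query)
```

Apart from the time-specific parts (now(), durations like 1h, and GROUP BY time(...)), it reads like a regular SELECT.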

Telegraf is an agent that can be used to collect metrics from Linux internals, and from many other programs, like Docker, PostgreSQL, Redis, RabbitMQ and others according to its site. But for now I am interested in its capabilities to give me system input data on CPU, memory, vmstat, kernel and others. Here is a glance at what it can extract with the system input plugin: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/system

Another nice bonus is that Telegraf, InfluxDB and Grafana are written in Go. So, they are just binaries and do not rely on installing any additional software on my servers, like a JVM. And all of them can be nicely dockerized if needed.

The short term plan

For now, I will start with Telegraf. It will replace the sysstat reports I have configured for that machine. In this first version I will make it send its logs to systemd and write the Linux system input data to a simple JSON file. This is because this playbook is meant to set up a standalone machine. In production, I will probably just send Telegraf metrics directly to InfluxDB.
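Once the metrics land in a JSON file, reading them back on the standalone machine could look like the sketch below. It assumes Telegraf's JSON output writes one object per line with "name", "tags", "fields" and "timestamp" keys; the sample lines, the host name and the function name latest_by_measurement are all made up for illustration.

```python
import json

# Hypothetical sample lines, shaped like the assumed Telegraf JSON output.
SAMPLE_LINES = [
    '{"fields": {"used_percent": 41.5}, "name": "mem", "tags": {"host": "centos7"}, "timestamp": 1502600000}',
    '{"fields": {"usage_idle": 90.0}, "name": "cpu", "tags": {"host": "centos7"}, "timestamp": 1502600000}',
    '{"fields": {"used_percent": 43.0}, "name": "mem", "tags": {"host": "centos7"}, "timestamp": 1502600010}',
]


def latest_by_measurement(lines):
    """Return the most recent point for each measurement name."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        point = json.loads(line)
        name = point["name"]
        # Keep the point with the newest timestamp for each measurement.
        if name not in latest or point["timestamp"] >= latest[name]["timestamp"]:
            latest[name] = point
    return latest


latest = latest_by_measurement(SAMPLE_LINES)
for name, point in sorted(latest.items()):
    print(name, point["timestamp"], point["fields"])
```

In practice the lines would come from the file Telegraf writes to, for example by passing an open file object instead of SAMPLE_LINES.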