For those serious about their applications, monitoring performance and health is crucial. So what monitoring options do we have? Well, there are many commercial Application Performance Management (APM) tools out there: New Relic, Azure Application Insights, Datadog and AppDynamics, to name a few. These are all awesome products with some great features. But there are also many great, free, open source solutions available to us (Prometheus being a well-known one) which are just as feature rich.
Why would we bother setting up a custom monitoring solution? Well, in some cases it may not be worth the short-term effort. But an open source solution can really shine when cost, flexibility or vendor lock-in holds you back; when longer-term data retention is required; when you want more control over how data is sampled; when you're not sure up front which option will suit your system(s) best; or when you don't want to make an up-front paid commitment.
So what open source options are available? Let's start with time series databases (TSDBs). You need a data storage backend tailored towards persisting time series; Graphite, InfluxDB, OpenTSDB and Elasticsearch are a few. Which is best? Well, that depends on your needs and your team's skill set. Maybe your team is already very experienced with Elasticsearch, or has given it a go and now understands why it wasn't the best fit; maybe there's a requirement for predictive analysis; maybe self-hosting isn't an option and a relatively cheap managed service is what you need to get started.
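Whichever TSDB you pick, writing a point to it usually comes down to a simple wire format. As a concrete taste, here's a minimal sketch of building a point in InfluxDB's line protocol (`measurement,tags fields timestamp`); the measurement, tag and field names are made up for illustration:

```python
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Format a single point in InfluxDB's line protocol:
    measurement,tag1=v1 field1=v1 timestamp(ns).
    Integer field values get InfluxDB's trailing 'i' suffix."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_str} {field_str} {ts}"

# e.g. to_line_protocol("cpu", {"host": "web01"}, {"usage": 42}, ts_ns=1)
#      -> "cpu,host=web01 usage=42i 1"
```

A string like this is what you'd POST to InfluxDB's write endpoint; other TSDBs have similarly simple formats (Graphite's plaintext protocol, OpenTSDB's HTTP API).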
Many of the open source TSDBs come with their own visualisation dashboards, some better than others: InfluxDB has Chronograf, Elasticsearch has Kibana, and Graphite has many. Grafana is another great option in that it supports alerting and is a generic dashboard supporting many storage backends, including all the popular open source TSDBs; for those using AWS it even supports CloudWatch as a data source. Otherwise, if you need a tailored interface, there's always roll-your-own with something like D3.js.
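If you do roll your own interface, the dashboards above are mostly just querying an HTTP API under the hood. As a sketch, this builds the kind of request a dashboard would make against Graphite's render API (the `localhost:8080` base URL is an assumption for a local graphite-web instance):

```python
import json
import urllib.parse
import urllib.request

def render_url(target, hours=1, base="http://localhost:8080"):
    """Build a Graphite render API URL asking for JSON datapoints --
    the same data a dashboard like Grafana would plot."""
    query = urllib.parse.urlencode(
        {"target": target, "from": f"-{hours}h", "format": "json"}
    )
    return f"{base}/render?{query}"

def fetch_series(target, hours=1):
    """Fetch the datapoints (assumes graphite-web is actually running)."""
    with urllib.request.urlopen(render_url(target, hours)) as resp:
        return json.load(resp)
```

Feed the resulting JSON straight into D3.js, or anything else that can draw a line chart.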
OK, so we have options for persistence and visualisation; how do we go about collecting our application's metrics? Again, there is no shortage of open source solutions in this space that work with different storage backends, e.g. Telegraf, Metricbeat and collectd. Some implement a push-based approach, while others use a pull (or poll) based approach.
There are some things to keep in mind when deciding between pull and push. Let's start with pull: this is the more traditional approach, where a central agent spends a great deal of time and resources polling target systems at regular intervals; Nagios is one such example. There are some downsides to this approach:
- You need to design around unavailability, e.g. timeouts and disconnects.
- The agent's infrastructure will need to scale as your architecture scales horizontally (think monitoring in the world of microservices).
- The agent needs to be aware of the hosts running each application as they scale, perhaps through something like service discovery (think monitoring in the world of containers).
- You lose transport flexibility: some systems may be suited to TCP, others UDP or AMQP.
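The first downside above is worth making concrete: a polling agent must assume any target can be slow or down at any time. A minimal sketch of a defensive poller, assuming hypothetical targets that expose a JSON `/metrics` endpoint:

```python
import json
import urllib.error
import urllib.request

def poll_targets(targets, timeout=2.0):
    """Poll each target's (hypothetical) /metrics endpoint, tolerating
    timeouts, disconnects and bad payloads rather than crashing the agent.

    targets: dict of name -> URL. Returns dict of name -> parsed JSON,
    with None for any target that was unreachable."""
    results = {}
    for name, url in targets.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[name] = json.load(resp)
        except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
            results[name] = None  # mark unavailable, keep polling the rest
    return results
```

Even this toy version shows the cost: every failure mode of every target becomes the agent's problem, and the target list itself has to be kept up to date as hosts come and go.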
Fortunately, a push-based approach alleviates these downsides. Having each application push instead of being pulled lets each application determine the appropriate precision and transport; collection is decentralised, so it scales automatically as your architecture grows; and telemetry is sent only when it's available.
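To see how little is involved on the application side, here's a minimal sketch of a fire-and-forget push in Graphite's plaintext protocol over UDP (the `localhost:2003` carbon endpoint and the metric path are assumptions for illustration):

```python
import socket
import time

def graphite_line(path, value, ts=None):
    """One metric in Graphite's plaintext protocol: "path value timestamp"."""
    return f"{path} {value} {int(ts if ts is not None else time.time())}\n"

def push_metric(path, value, host="localhost", port=2003):
    """Fire-and-forget push over UDP: no connection to manage, and the
    application -- not a central agent -- decides when and how often to emit."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(graphite_line(path, value).encode(), (host, port))

# e.g. push_metric("myapp.web01.requests", 1)
```

Swapping UDP for TCP (or AMQP) here is a one-line change, which is exactly the transport flexibility the pull model gives up.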
There's a mass of open source monitoring options out there, each with its own pros and cons, but that's the beauty of it: we can trial them, determine which option (or combination) best suits the applications we're building, and move with the times. Like everything else in software, there really is no such thing as a "one size fits all" solution.