Chasing your Tail -f!
- Sarah Mcaleavey
- Sep 15, 2021
- 3 min read
Updated: Sep 16, 2021
We've all been there, and hopefully we hated it so much we changed our ways and sorted out a decent log parsing and analysis system! I know I did.

So you've done it! You've built the perfect logging and monitoring system. You've done the data analysis and have alerting on your baselines, you've got your max/min thresholds set, your CPU averages covered, and CloudWatch hooked in. You are set!
Well done, you! Except you weren't expecting this 3-hour-long incident call, live, terminal on, screen share on, all eyes on you. Tailing and trawling through the /var/log directory for errors that might lead you to the root cause. You also weren't expecting this much noise - pages of non-perilous errors, left unnoticed and un-actioned for too long, leading you down rabbit holes, only to realise too late that these errors had been firing long before this outage. Make a note to resolve them later, after the incident, to reduce the noise.
Now you're regretting building this amazing monitoring and alerting system that has successfully alerted every sleeping senior manager to the incident. Between the uncomfortable silences of your 'tail -f | grep error' there are periodic interruptions to introduce yet another senior manager joining the incident call. How did it come to this?
Logs and errors are great for triggering alerts, but operations people also need data analysis in play: parsing at the very minimum, machine learning at the higher end of the scale to detect anomalies in errors. Here's one approach I took to set up a parsing system that would alert when Kafka events failed at some point in a data flow sequence. Detection and resolution were difficult as topics were buried in Flink batches and we didn't have a way to sequence events alongside log timestamps.
In the initial version I provisioned a Logstash server configured to pull Kafka topics and logs into an ELK stack; storage requirements were large, at around 10 GB every 5 days. Kibana easily parsed and analysed errors alongside the Kafka topic content, feeding this back to an existing Grafana front-end. The major benefit was being able to sequence the Kafka topics alongside any errors and identify bottlenecks and failure points in the Kafka data flow. For those familiar with Flink, this sequenced view of topics is difficult to extract when using batches, and parsing topic content for identifiers allows quick detection of lost events. Admittedly it sounds a little over-engineered, but there are barriers in large-scale enterprises and existing technologies need to be utilised - after all, contracts have been purchased and licenses must be used! There is a reality and pragmatism required when architecting in different environments.
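For anyone wondering what that plumbing roughly looks like, here is a minimal sketch of the idea: a Logstash pipeline pulling a couple of Kafka topics and the application logs into Elasticsearch, rendered from Terraform so it ships with the rest of the provisioning. The broker address, topic names, log path and index name are all made-up placeholders rather than the real values from the project.

```hcl
# Hypothetical sketch only - a Logstash pipeline that reads Kafka topics and
# application logs and indexes both into Elasticsearch, written out by
# Terraform so it lives alongside the rest of the infrastructure code.
resource "local_file" "logstash_pipeline" {
  filename = "pipeline/dataflow.conf"
  content  = <<-EOT
    input {
      kafka {
        bootstrap_servers => "broker-1.internal:9092"   # placeholder broker
        topics            => ["orders", "payments"]     # placeholder topics
        codec             => "json"
      }
      file {
        path => "/var/log/app/*.log"                    # the logs to correlate against
      }
    }
    output {
      elasticsearch {
        hosts => ["https://elasticsearch.internal:9200"]
        index => "dataflow-%%{+YYYY.MM.dd}"             # daily indices for Kibana to query
      }
    }
  EOT
}
```

With both sources landing in the same indices, Kibana can line up a topic's identifiers against error timestamps, which is exactly the sequencing that was missing from the Flink batches.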
Securing the flow across numerous integration points was relatively easy with a combination of the Terraform resource aws_kms_key and IAM roles and policies to lock down access between services. The biggest challenge was locking AWS Console users out of the Kibana UI, as sensitive data needed protecting at every touch-point, and access to the indices may have compromised this. With some testing and new policies in place we were good to go.
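As a rough illustration of that locking-down pattern (not the exact policies we ran), the sketch below creates a KMS key and attaches a domain policy that denies Elasticsearch/Kibana HTTP access to everything except a dedicated ops role. The account ID, region, domain name and role name are placeholders.

```hcl
# Hypothetical sketch - encrypt the stack's data at rest and deny Kibana /
# Elasticsearch HTTP access to anyone who isn't the named operations role.
resource "aws_kms_key" "log_stack" {
  description         = "Encrypts log and Kafka topic data at rest"
  enable_key_rotation = true
}

data "aws_iam_policy_document" "kibana_lockdown" {
  statement {
    sid       = "DenyKibanaOutsideOpsRole"
    effect    = "Deny"
    actions   = ["es:ESHttp*"]   # covers the Kibana UI as well as the API
    resources = ["arn:aws:es:eu-west-1:111122223333:domain/log-analysis/*"]  # placeholder ARN

    not_principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111122223333:role/log-ops"]  # placeholder ops role
    }
  }
}

resource "aws_elasticsearch_domain_policy" "kibana" {
  domain_name     = "log-analysis"   # placeholder domain
  access_policies = data.aws_iam_policy_document.kibana_lockdown.json
}
```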
This project stopped short of where I would have liked to develop it. We could resize instances easily with Terraform updates, but I would have liked to add Auto Scaling and Spot Instances to reduce costs under changing traffic loads. As a further development, with a newer version of Grafana in place (version updates in enterprises can sometimes lag), we could possibly have utilised Grafana Loki instead of Logstash; although I haven't tested its compatibility, it integrates with Kafka and may have reduced the need to duplicate storage in Elasticsearch.
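To make the auto scaling idea a little more concrete, here's the kind of Terraform I had in mind: a launch template plus an Auto Scaling group with a mixed instances policy, keeping one On-Demand node as a baseline and letting everything above it run on Spot. The AMI, subnets and sizes are illustrative placeholders, not anything I actually ran.

```hcl
# Hypothetical sketch - scale the Logstash workers with the traffic and push
# the burst capacity onto Spot instances to keep the cost down.
resource "aws_launch_template" "logstash" {
  name_prefix   = "logstash-"
  image_id      = "ami-0123456789abcdef0"   # placeholder AMI
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "logstash" {
  name                = "logstash-workers"
  min_size            = 1
  max_size            = 4
  vpc_zone_identifier = ["subnet-aaaa1111", "subnet-bbbb2222"]  # placeholder subnets

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1     # one stable On-Demand node
      on_demand_percentage_above_base_capacity = 0     # everything above it on Spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.logstash.id
        version            = "$Latest"
      }
    }
  }
}
```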
If you are starting out with logs, the essential take-away is to make good use of the information inside them...this is made all the easier with Terraform Enterprise's recent "Log Forwarding" update, which automates sending logs out to analysis destinations - Datadog, GCP, Azure and Splunk's Event Collector, to name a few. Monitoring and alerting are no longer an after-thought, but built into the provisioning! I can only imagine how many '/var/log disk space full' issues HashiCorp have inadvertently solved with this stroke of genius!
Perhaps one to play around with in the future...there are many ways to skin a cat!


