The 11th factor in a 12-factor app is that we should treat logs as event streams. This enables us to capture and process those logs centrally using an observability stack, which means we can set up monitoring and alerting based on log messages.
This is the 11th video in our 12-factor Application Modernisation series. In this video, Marc Firth (Managing Director at Firney) explains software engineering best practices for building scalable, reliable web services that are efficient to work with.
A transcript of the video above is included below.
Why we should treat logs as event streams
Marc: Today we’re talking about the wonderful topic of capturing log output.
It’s a necessary step, and you only need to do it once. You’ll be really grateful when you can go in and find your errors across your scaled environment.
It’s not glamorous. Well, I’ve never heard of the “logging awards”. Although… maybe we should create that!
This is our application modernisation series on the “12-Factor App” methodology, where we talk about how to make applications more reliable, scalable and efficient to work with.
So what do we mean when we say that we should treat logs as event streams?
Logs such as error logs or access logs provide insight into the behaviour of a running app in its environment. It’s usually the first place a developer goes to diagnose an issue.
They’re normally written to files on disk, somewhere like /var/log, usually with one log entry per line and each entry starting with a timestamp.
Log output: NGINX example
With NGINX for example, we have an access log and an error log. The access log provides details about who’s accessed the various resources on our web server, and the error log provides details of any errors that happened as they were doing so.
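To make the "one entry per line" idea concrete, here is a minimal Python sketch that turns a single access log line into a structured event. It assumes NGINX's default "combined" log format; the sample line and the `parse_access_line` helper are made up for illustration, and a customised `log_format` would need a different pattern.

```python
import re

# Pattern for NGINX's default "combined" access log format (an assumption:
# a custom log_format directive would need a different pattern).
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_access_line(line):
    """Return the fields of one access log line as a dict, or None if it doesn't match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    return fields


# Hypothetical sample line, not taken from a real server.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 404 153 "-" "curl/8.0.1"'
)
event = parse_access_line(sample)
print(event["status"], event["request"])
```

Each line becomes one event with a timestamp, a status code and a request, which is the shape a log aggregator works with.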
With those logs usually written to files on disk, a scaled environment with many servers means the logs are spread across all of those servers, and it’d be really difficult to go in and find the error.
So that would be painful to search.
It also means our app instances would no longer be immutable. They’d be different because we’d have different logs stored on each instance.
Our containers would no longer be disposable because if we destroyed one of those instances, we would lose all those logs.
If we recreated it, we wouldn’t have those logs anymore. So the ‘state’ would be different. So we need to stream those logs somewhere outside the application instance by default.
How to capture log output the right way
We do that by sending all of those logs to STDOUT, which is usually a setting somewhere within the config of your application.
When a developer is working on this locally, they’ll be able to see those logs being streamed out to their terminal.
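As a rough illustration of what "sending logs to STDOUT" can look like in application code, here is a minimal sketch using Python's standard `logging` module. The logger name `app`, the `build_logger` helper and the messages are assumptions for the example; your framework may expose this as a config setting instead.

```python
import logging
import sys


def build_logger(name="app"):
    """Create a logger that writes one timestamped line per event to STDOUT."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep events out of any root-logger handlers
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        # StreamHandler defaults to STDERR, so STDOUT is passed explicitly.
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("order placed")
logger.error("payment provider timed out")
```

Run locally, the two lines appear in the terminal. In a container, the same stream is what the platform or a log aggregator's agent picks up, so the application never needs to know where its logs end up.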
In our Test, Stage and Production environments, we can capture all that log output using a log aggregator such as New Relic, Datadog, Splunk or Cloud Logging. We can then go into those tools and search back through the logs held in that aggregator.
How do we capture log output at Firney?
Typically, at Firney, we use a combination of Cloud Logging and New Relic. But, full disclosure, we are Google Cloud and New Relic Partners. We chose those partners because we love their tools.
The benefit of using these tools is that they aggregate the logs from all of your different application instances. You can go back and search through events that happened in the past. They usually offer rich features for interacting with those logs.
One of the reasons we like New Relic is that we can graph trends such as the number of times a 500 or 404 error occurred. We can use those metrics as Service Level Indicators (SLIs).
We do that by setting up alerts on those metrics to give us our Service Level Indicator, which tells us exactly how many errors of a certain type are happening, or whether there’s been a significant change in the number of events in the last, say, 15 minutes or so.
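The arithmetic behind an alert like that can be sketched in a few lines of Python. This is not how New Relic computes it internally; it is a simplified illustration that assumes log events have already been reduced to (timestamp, status code) pairs, and the 5% threshold is a hypothetical value.

```python
from datetime import datetime, timedelta, timezone


def error_rate(events, now, window=timedelta(minutes=15)):
    """Fraction of requests inside the window that returned a 5xx status."""
    recent = [status for ts, status in events if now - window <= ts <= now]
    if not recent:
        return 0.0
    errors = sum(1 for status in recent if 500 <= status <= 599)
    return errors / len(recent)


# Hypothetical events; in practice these would come from the aggregated logs.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=20), 500),  # outside the window, ignored
    (now - timedelta(minutes=10), 200),
    (now - timedelta(minutes=5), 500),
    (now - timedelta(minutes=2), 200),
    (now - timedelta(minutes=1), 404),
]

rate = error_rate(events, now)
THRESHOLD = 0.05  # hypothetical alert threshold, not a recommended value
if rate > THRESHOLD:
    print(f"ALERT: 5xx rate {rate:.0%} over the last 15 minutes")
```

Here one of the four requests in the window failed with a 500, so the indicator is 25% and the alert fires. A real aggregator does the same kind of windowed counting across every instance at once.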
We can also dashboard those and we can see how close we’re getting to exceeding our Service Level Objectives. This is what enables us to really get to grips with SRE (Site Reliability Engineering) and the automation of our reliability activities.
Tips in Summary
So to summarise:
- Make sure that all your containers or application instances are outputting their logs to STDOUT.
- Ensure that you’re capturing all those logs in an aggregator such as New Relic.
- Set up alerts for all the key metrics that you’re interested in from your SLA.
Closing thoughts on capturing log output
I hope that helps you get better visibility into your scaled environments.
If you’re setting this up for your business, feel free to drop me a message on LinkedIn. Let me know what you’re doing. We’d love to hear more about your success stories or challenges as we’re thinking about setting up a webinar or group to talk more about application modernisation challenges.
Otherwise, have an awesome day, don’t forget to like and share the video and head on over to our company page if you’d like to see more videos like this. I’ll see you in the next one.