status | implementation | status_last_reviewed |
---|---|---|
accepted | done | 2024-03-04 |
This RFC proposes retiring the GOV.UK CDN Logs Monitor application for the following reasons:
- We do not make use of the data it generates;
- It is a complicated tool that few GOV.UK developers understand the purpose of;
- Similarly, due to this lack of knowledge, we don't know whether the data is accurate when problems occur;
- Since the application is needed very infrequently, there isn't much justification for investing in developers learning it;
- It uses a significant amount of disk space that requires maintenance.
The purpose of this RFC is to make the case that we should retire it based on the knowledge we have. We are hoping that by circulating this suggestion across the wider GOV.UK technical community we can surface any issues that we have not yet considered.
We understand that GOV.UK CDN Logs Monitor was originally created in response to an incident, and that it was intended as a means to monitor when URLs on GOV.UK change from responding with a success (2xx) status code to a different one.
A summary of the responsibilities of the application is as follows:
- Monitors the log files that Fastly sends directly to the box via syslog.
- Increments statsd counters for the number of responses that Fastly is serving with a particular status code, e.g. `stats.govuk.app.govuk-cdn-logs-monitor.logs-cdn-1.status.200`.
- Increments statsd counters for which backend (e.g. origin or mirror) is serving a request.
- Outputs data to stdout, which subsequently goes to Logit, for any requests that are not served by origin or the CDN itself.
- Assembles various log files on a nightly basis, which:
  - Count how many times a path was accessed via a particular method and backend in an hour.
  - Store a list of all paths that were accessed successfully that day.
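As a rough illustration of the responsibilities listed above, the following Python sketch counts status codes and backends per log line, builds the hourly per-path counts, and collects the day's successful paths. It is not the application's actual code. The log line layout, the `.backend.` metric suffix, and the sample paths are assumptions made for the example; only the `.status.200` counter name comes from this RFC.

```python
from collections import Counter

# Prefix taken from the example counter name in this RFC.
PREFIX = "stats.govuk.app.govuk-cdn-logs-monitor.logs-cdn-1"


def parse_line(line):
    """Split one log line into named fields.

    The layout used here (space-separated timestamp, method, path, status,
    backend) is hypothetical; the real Fastly log format is not given here.
    """
    timestamp, method, path, status, backend = line.split()
    return {
        "hour": timestamp[:13],  # date plus hour, e.g. "2018-11-05T14"
        "method": method,
        "path": path,
        "status": int(status),
        "backend": backend,
    }


def summarise(lines):
    """Return (metric counters, hourly path counts, successful paths)."""
    metrics = Counter()
    hourly = Counter()
    successful_paths = set()
    for line in lines:
        entry = parse_line(line)
        # One counter per status code and one per backend.
        # The ".backend." suffix is an assumed name, not a documented one.
        metrics[f"{PREFIX}.status.{entry['status']}"] += 1
        metrics[f"{PREFIX}.backend.{entry['backend']}"] += 1
        # Hourly count keyed by path, method and backend.
        key = (entry["hour"], entry["path"], entry["method"], entry["backend"])
        hourly[key] += 1
        # Paths that returned any 2xx response that day.
        if 200 <= entry["status"] < 300:
            successful_paths.add(entry["path"])
    return metrics, hourly, successful_paths


sample = [
    "2018-11-05T14:01:02Z GET /vat-rates 200 origin",
    "2018-11-05T14:03:10Z GET /vat-rates 200 origin",
    "2018-11-05T14:05:44Z GET /missing 404 mirror",
]
metrics, hourly, successful_paths = summarise(sample)
```

In the real application the counters would be sent to statsd rather than held in memory; the in-memory `Counter` here only keeps the sketch self-contained.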
There is more in-depth documentation in the repo.
There do not appear to be any tools monitoring the Graphite data that statsd populates. This was checked by searching govuk_puppet for any references to the govuk-cdn-logs-monitor namespaces.
We do monitor similar Graphite databases for CDN health by utilising `monitoring-1_management.cdn_fastly-govuk.requests-status_*`, which are fed by collectd usage of the Fastly API.
We don't appear to have an equivalent of the statistic that tracks requests per backend, although presumably we could collate this if needed by comparing other Graphite sources.
If we were to turn off this application, CDN requests would no longer be sent to Logit. However, we would suggest that the ones we have now are a source of confusion, as it is not clear why only some reach Logit.
The most likely sources we have for similar data relating to when paths changed from a successful status code to an unsuccessful one are Google Analytics, Logit, and access to the raw CDN logs. We are not aware of anyone making use of the files produced by this application with this data.
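To make the question concrete, the Python sketch below shows the kind of comparison any of these sources would need to support: given the paths that previously returned a 2xx and the status codes seen today, report the paths that now only return non-2xx responses. The function name, input shapes, and sample data are hypothetical and do not describe how the application or any of the listed sources works.

```python
def newly_failing(previously_successful, todays_statuses):
    """Return paths that succeeded before but only returned non-2xx today.

    previously_successful: set of paths that returned a 2xx on an earlier day.
    todays_statuses: dict mapping a path to the status codes seen today.
    Paths with no requests today are ignored, since there is nothing to compare.
    """
    failing = {}
    for path in sorted(previously_successful):
        statuses = todays_statuses.get(path, [])
        if statuses and not any(200 <= code < 300 for code in statuses):
            failing[path] = sorted(set(statuses))
    return failing


yesterday = {"/a", "/b", "/c"}
today = {"/a": [200, 200], "/b": [404, 410], "/d": [500]}
print(newly_failing(yesterday, today))  # {'/b': [404, 410]}
```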
The Future steps section of this document explores using AWS Athena as a means to query for the data that would be lost.
This application currently uses 413GB on `logs-cdn-1` and stores > 100GB of Graphite databases on `graphite-1`. A significant portion of the Graphite storage is due to unnoticed misconfiguration.
If this RFC reaches consensus that retiring this application is beneficial, we intend to remove it from the GOV.UK architecture and archive the application data associated with it.
An earlier draft of this RFC suggested that we could retire the machine that hosts this application. However, we have since learnt that the applications transition-stats and pre-transition-stats are also hosted on this machine, so retiring the machine is now considered outside the scope of this RFC.
The earlier draft, based on the expectation of removing the machine, suggested moving the logs fully to S3. This has now been revised to storing them on both S3 and the Logs CDN machine. The view is that S3 should become the definitive source, while avoiding the disruption of removing a service that may still be needed. In light of this, the suggestion is to reduce the storage on the Logs CDN machine from 30 days to 7 days.
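The effect of moving from 30 days to 7 days of retention can be sketched in Python as below. This only illustrates the selection rule; the file names, dates, and the choice to keep a log dated exactly at the boundary are assumptions, and the actual rotation mechanism on the Logs CDN machine is not described in this RFC.

```python
from datetime import date, timedelta


def expired_logs(log_dates, today, keep_days=7):
    """Return names of logs older than the retention window, oldest first.

    log_dates: dict mapping a log file name to the date it covers.
    A log dated exactly keep_days ago is still kept.
    """
    cutoff = today - timedelta(days=keep_days)
    old = [(day, name) for name, day in log_dates.items() if day < cutoff]
    return [name for day, name in sorted(old)]


# Hypothetical file names and dates, for illustration only.
today = date(2018, 11, 30)
logs = {
    "cdn-a.log": date(2018, 11, 29),  # 1 day old: kept
    "cdn-b.log": date(2018, 11, 23),  # exactly 7 days old: kept
    "cdn-c.log": date(2018, 11, 22),  # 8 days old: removed under 7 days
    "cdn-d.log": date(2018, 11, 5),   # 25 days old: kept under 30, removed under 7
}
print(expired_logs(logs, today))                # ['cdn-d.log', 'cdn-c.log']
print(expired_logs(logs, today, keep_days=30))  # []
```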
The revised steps of this proposal are now:
- Stop running the application through a configuration change in govuk_puppet, then allow time to see if we are alerted to any services or monitoring systems that break due to the lack of data
- Create an S3 bucket which can be used to store the data that will be removed from logs-cdn-1
- Prune the data from Graphite
- Configure Fastly to send the CDN logs to an S3 bucket in addition to logs-cdn-1
- Remove the application and associated services from govuk_puppet
- Archive the GOV.UK CDN Logs Monitor repository
By hooking the eventual S3 bucket into AWS Athena, we can set up a query interface to search the logs, which should answer a number of the questions that we hoped GOV.UK CDN Logs Monitor would answer.
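As an indication of what such a query might look like, the Python sketch below builds an Athena-style SQL statement that counts responses per status code for a single path. The table name `cdn_logs` and the columns `path` and `status` are hypothetical, since the eventual schema is not defined in this RFC, and submitting the query to Athena is deliberately left out.

```python
def status_counts_query(path, table="cdn_logs"):
    """Build SQL counting responses per status code for one path.

    The table and column names are assumptions for illustration.
    Single quotes in the path are doubled so the string literal stays valid.
    """
    escaped = path.replace("'", "''")
    return (
        "SELECT status, COUNT(*) AS requests\n"
        f"FROM {table}\n"
        f"WHERE path = '{escaped}'\n"
        "GROUP BY status\n"
        "ORDER BY requests DESC"
    )


query = status_counts_query("/vat-rates")
print(query)
```

A path that returned 200 last month and returns only 404 or 410 in recent results would be a candidate for the kind of regression the application was originally meant to catch.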