
The Guardian's improvements to its breaking news notification service

by Abdur-Rahmaan Janhangeer



I am not interested in reading the Guardian, except for its engineering blog, which is published as regular news articles. I was looking at how they improved the breaking news notification service for their mobile app to hit the target of reaching 90% of the audience within 2 minutes (90in2).

They had a token-to-topic database, which maps device tokens to the topics users subscribe to. The database is sharded to optimize the queries that look up tokens, which is especially useful for heavily subscribed topics. When an editor decides to publish a breaking news alert, the notification is triggered from an interface and sent to a Scala Play app. A routine job counts subscribers per topic to optimize the sharding. For each shard to query, a message is sent to an Amazon SQS queue, which triggers an AWS Lambda function (the Harvester) to query the database and receive a stream of tokens for the topic. These tokens are inserted into further SQS queues, grouped by Android or iOS, and other functions are then triggered to send notification delivery requests to readers' devices.
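To make the fan-out concrete, here is a minimal Scala sketch of what the per-shard step could look like: one SQS message per shard range, so each Harvester invocation only has to scan a slice of the token table. The queue URL, ShardRange type and JSON message format are illustrative assumptions on my part, not the Guardian's actual code.

```scala
// Hypothetical fan-out: publish one SQS message per shard so each Harvester
// Lambda invocation only queries its own slice of the token-to-topic table.
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.SendMessageRequest

object ShardFanOut {
  private val sqs = SqsClient.create()
  // Placeholder queue URL; the real Harvester queue name is not public.
  private val harvesterQueueUrl =
    "https://sqs.eu-west-1.amazonaws.com/123456789012/harvester-queue"

  final case class ShardRange(start: Int, end: Int)

  def publish(notificationId: String, topic: String, shards: Seq[ShardRange]): Unit =
    shards.foreach { shard =>
      // One small JSON message per shard; the schema here is assumed.
      val body =
        s"""{"notificationId":"$notificationId","topic":"$topic","shardStart":${shard.start},"shardEnd":${shard.end}}"""
      sqs.sendMessage(
        SendMessageRequest.builder()
          .queueUrl(harvesterQueueUrl)
          .messageBody(body)
          .build())
    }
}
```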

As the current engineering team had never touched the system before, they decided to enhance observability by structuring logs in the ELK stack. Through Kibana, they saw that the Harvester was performing poorly due to database errors and lots of retries. Each function invocation required its own database connection, and the database could not handle that many concurrent requests. They set up connection pooling through AWS RDS Proxy to control the number of connections.
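In practice, introducing RDS Proxy mostly changes where each Lambda connects: the JDBC URL points at the proxy endpoint, and the proxy multiplexes the many short-lived Lambda connections onto a bounded pool against the database. A rough sketch, in which the endpoint, table and column names are placeholders:

```scala
// Assumed Harvester-style lookup going through an RDS Proxy endpoint rather
// than the RDS instance directly, so concurrent Lambdas share a bounded pool.
import java.sql.DriverManager

object HarvesterDb {
  // RDS Proxy endpoint (placeholder), not the underlying database endpoint.
  private val proxyUrl =
    "jdbc:postgresql://notifications-proxy.proxy-abc123.eu-west-1.rds.amazonaws.com:5432/registrations"

  def tokensForTopic(topic: String, shardStart: Int, shardEnd: Int): Vector[String] = {
    val conn = DriverManager.getConnection(proxyUrl, sys.env("DB_USER"), sys.env("DB_PASSWORD"))
    try {
      val stmt = conn.prepareStatement(
        "SELECT token FROM registrations WHERE topic = ? AND shard BETWEEN ? AND ?")
      stmt.setString(1, topic)
      stmt.setInt(2, shardStart)
      stmt.setInt(3, shardEnd)
      val rs = stmt.executeQuery()
      val tokens = Vector.newBuilder[String]
      while (rs.next()) tokens += rs.getString("token")
      tokens.result()
    } finally conn.close()
  }
}
```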

To reduce SQL query time, they replicated the amount of data in production (a few GB) and used pg_stat_statements to track query performance. The AWS Read IOPS metric showed that the queries led to a lot of reads. They also noticed that, due to dead rows left over from earlier row versions, the tables were taking up much more space than the live data itself. Even though they had autovacuum on, they applied a full vacuum (this one was easy to catch, right?). This reduction in storage also meant that reads were faster, which let the database accept more Harvester connections. They also upgraded the Postgres version for performance and price reasons, as AWS Graviton 2 offers 40% better price performance for Postgres 12 and above. To ensure a smooth migration, they set up another empty database with the same schema and logically replicated the data into it. When the second database was fully populated, they switched over.
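For readers who have not used these tools, the sketch below shows the kind of diagnostics described above, run over JDBC: pg_stat_statements to find the most expensive queries, pg_stat_user_tables to spot dead rows, and a one-off VACUUM FULL. The table name and the JDBC wrapper are assumptions of mine.

```scala
// Illustrative Postgres diagnostics; requires the pg_stat_statements extension.
import java.sql.DriverManager

object DbDiagnostics {
  def run(jdbcUrl: String, user: String, password: String): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val stmt = conn.createStatement()

      // Top queries by total execution time
      // (the column is called total_time before Postgres 13).
      val slow = stmt.executeQuery(
        "SELECT query, calls, total_exec_time FROM pg_stat_statements " +
          "ORDER BY total_exec_time DESC LIMIT 5")
      while (slow.next())
        println(s"${slow.getDouble("total_exec_time")} ms, ${slow.getLong("calls")} calls: ${slow.getString("query").take(80)}")

      // Dead rows per table, i.e. space still held by old row versions.
      val dead = stmt.executeQuery(
        "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC")
      while (dead.next())
        println(s"${dead.getString("relname")}: live=${dead.getLong("n_live_tup")}, dead=${dead.getLong("n_dead_tup")}")

      // One-off full vacuum to reclaim space autovacuum does not return to the OS.
      // VACUUM FULL takes an exclusive lock, so it is a deliberate, scheduled step.
      stmt.execute("VACUUM FULL registrations") // table name assumed
    } finally conn.close()
  }
}
```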

Next, they reduced the time taken by the worker Lambdas to deliver tokens to Apple/Google for delivery by increasing the Scala thread pool size and increasing the memory dedicated to each Lambda execution environment.
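As a rough illustration of the first knob: a larger fixed thread pool lets a single worker invocation keep more delivery requests to Apple/Google in flight at once, while the memory increase is a configuration change on the Lambda itself (which also buys more CPU share). The pool size and helper below are assumed values, not the Guardian's code.

```scala
// Sketch of a delivery worker backed by a larger fixed thread pool.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object DeliveryWorker {
  // Pool size is an assumed value; in practice it is tuned against the push
  // providers' rate limits and the Lambda's memory/CPU allocation.
  private val deliveryPool: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(64))

  // Fires one delivery request per token, running many in parallel.
  def deliverAll(tokens: Seq[String])(send: String => Unit): Future[Unit] = {
    implicit val ec: ExecutionContext = deliveryPool
    Future.traverse(tokens)(token => Future(send(token))).map(_ => ())
  }
}
```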

They used a data-led approach, reverting changes when necessary. They developed theories, carried out experiments and retained the successful ones. They also learnt how small, isolated changes can add up to reach the goal.