TL;DR notes from articles I read today.
Our adventures in scaling
- Handling sudden activity spikes poses different challenges than scaling a rapidly growing user base.
- Check whether the databases are resource-constrained and slowing down as a result. Look at hardware metrics (CPU, disk I/O and memory) during the spikes.
- If there are no spikes in those metrics, look higher up the infrastructure stack at service resources for increased resource acquisition times. Also, check the garbage collection activity, which indicates whether JVM heap and threads are the bottlenecks.
- Check network metrics next to look for a constraint in the network between services and databases - for example, if the services’ database connection pools are consistently reaching size limits.
- To collect more metrics, log the latency of every transaction and flag those that take longer than a defined threshold. Analyse the flagged transactions across daily usage to determine whether removing the identified bottleneck would make a significant difference.
- Some of the bottlenecks may be code-related, for example inefficient queries, a resource-starved service or inconsistencies in the database’s responses, so look for metrics on higher-level functioning and not just on low-level system components.
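As a rough illustration of the latency-logging step, here is a minimal Python sketch. It is not taken from the original post: the `LatencyRecorder` class, its method names and the 0.5-second threshold are all assumptions made for the example. It records every transaction's latency, keeps the ones above the threshold, and reports what share of total recorded time each slow transaction type accounts for.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical threshold; pick one that matches your own latency targets.
SLOW_THRESHOLD_SECONDS = 0.5


class LatencyRecorder:
    """Records all transaction latencies and keeps the slow ones separately."""

    def __init__(self, threshold_seconds=SLOW_THRESHOLD_SECONDS,
                 clock=time.perf_counter):
        self.threshold_seconds = threshold_seconds
        self._clock = clock
        self.all_latencies = defaultdict(list)  # name -> [seconds, ...]
        self.slow = []                          # [(name, seconds), ...]

    @contextmanager
    def track(self, name):
        """Time the wrapped block and record it under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            self.record(name, self._clock() - start)

    def record(self, name, elapsed_seconds):
        self.all_latencies[name].append(elapsed_seconds)
        if elapsed_seconds > self.threshold_seconds:
            self.slow.append((name, elapsed_seconds))

    def slow_share(self):
        """Share of total recorded time spent in slow transactions, per name."""
        total = sum(sum(values) for values in self.all_latencies.values())
        if total == 0:
            return {}
        shares = defaultdict(float)
        for name, elapsed in self.slow:
            shares[name] += elapsed / total
        return dict(shares)
```

In use, you would wrap each transaction in `with recorder.track("checkout"): ...` and review `recorder.slow_share()` over a day's traffic. A transaction type with a small share is probably not worth optimising first.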
Full post here, 6 mins read
Migrating functionality between large-scale production systems seamlessly
Lessons from Uber’s migration of its large and complex systems to a new production environment:
- Incorporate shadowing to forward production traffic to the new system for observation, so you can confirm there are no regressions. This lets you gather performance stats as well.
- Use this opportunity to settle any technical debt incurred over the years, so the team can move faster and be more productive in the future.
- Carry out validation on a trial-and-error basis. Don’t assume it will be a one-time effort; plan for multiple iterations before you get it right.
- Have a data analyst in your team to find issues early, especially if your system involves payments.
- Once confident in your validation metrics, you can roll out to production. Uber chose to start with a test plan with a couple of employees dedicated to testing various success and failure cases, followed by a rollout to all Uber employees, and finally incremental rollout to cohorts of external users.
- Push for a quick final migration, as options for a rollback are often misused, preventing complete migration.
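The shadowing idea from the first lesson can be sketched in a few lines of Python. This is an illustrative assumption, not Uber's implementation: `handle_with_shadow`, `legacy_handler` and `new_handler` are hypothetical names, and a real setup would typically send the shadow call asynchronously and make sure it has no side effects (no real payments, emails and so on).

```python
def handle_with_shadow(request, legacy_handler, new_handler, mismatches):
    """Serve the request from the legacy system and shadow it to the new one.

    The caller always receives the legacy response. The new system's
    response is only compared and logged, so a regression there cannot
    affect users.
    """
    legacy_response = legacy_handler(request)
    try:
        shadow_response = new_handler(request)
        if shadow_response != legacy_response:
            mismatches.append({
                "request": request,
                "legacy": legacy_response,
                "shadow": shadow_response,
            })
    except Exception as exc:  # a failure in the new system must not reach users
        mismatches.append({
            "request": request,
            "legacy": legacy_response,
            "error": repr(exc),
        })
    return legacy_response
```

The `mismatches` list is what a validation pass would review on each iteration. Once it stays empty (or only contains understood differences) across representative traffic, the staged rollout can begin.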
Full post here, 6 mins read
Get these notes directly in your inbox every weekday by signing up for my newsletter, in.snippets().