📈 Scaling Web Applications to a Billion Users. It is Complicated... 😵‍💫

Lukas Mauser - Oct 30 '23 - Dev Community

Did you ever think:

"Instagram is just a photo sharing website, I could build something like that!"

You can't.

You can probably build a photo sharing website, but the tricky part is scaling it to a billion users.

But why is that so complicated?

Let me explain...

Increasing Demands on Performance and Reliability

Running a small project is easy. Your code doesn't have to be perfect and it doesn't really matter to you if a request takes 200 ms or 500 ms. You probably would not even notice a difference.

But what if 10 people want to access your service at the same time? Let's assume all requests get handled one after the other. Waiting 10 x 200ms = 2 seconds vs 10 x 500ms = 5 seconds does make a noticeable difference. And that's only 10 people.
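
To make that concrete, here is a minimal TypeScript sketch of strictly sequential handling; the per-request duration is a made-up constant standing in for your real handler:

```typescript
// Minimal sketch: requests handled strictly one after the other.
// The handler duration is a made-up constant; real handlers vary per request.
async function handleRequest(durationMs: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, durationMs));
}

async function serveSequentially(requests: number, durationMs: number): Promise<void> {
  const start = Date.now();
  for (let i = 0; i < requests; i++) {
    await handleRequest(durationMs); // each request waits for all the ones before it
  }
  console.log(`${requests} x ${durationMs}ms took ~${Date.now() - start}ms`);
}

// 10 requests at 200ms each vs. 500ms each:
serveSequentially(10, 200).then(() => serveSequentially(10, 500));
// -> roughly 2,000ms vs. 5,000ms in total; the last user in line feels the full sum
```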

Now think of 100 people, 1,000 or 100,000 people that are constantly bombarding your servers with requests. Performance improvements of just a few milliseconds make a huge difference at scale.

And the same goes for security and reliability. How many people will notice if your tiny site is down for an hour? On a large site, hundreds of thousands of people could get upset: people who cannot finish their checkout, or who rely on the service so much that an outage freezes their entire business.

That's why large enterprises set uptime goals like 99.995%. That's a maximum downtime of roughly 26 minutes across the whole year!
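
The arithmetic behind such an uptime budget is simple enough to sketch:

```typescript
// Sketch: how many minutes of downtime per year a given availability target allows.
function downtimeMinutesPerYear(availability: number): number {
  const minutesPerYear = 365 * 24 * 60; // 525,600 minutes
  return minutesPerYear * (1 - availability);
}

console.log(downtimeMinutesPerYear(0.99995).toFixed(1)); // ~26.3 minutes per year
console.log(downtimeMinutesPerYear(0.999).toFixed(1));   // ~525.6 minutes ("three nines")
```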

And that's where it gets complicated...

Scaling Infrastructure

When your hobby project reaches unbearable response times, the easiest thing to do is usually to migrate everything to a bigger server. This is called scaling vertically, or in other words: "throwing money at the problem".

But there is only so much traffic a single machine can handle. So at some point you'll have to add more servers that run your application in parallel. This is called scaling horizontally.

And now it already gets complicated. Do you run many small machines, or a few big ones? Or a mix of small and big machines (diagonal scaling)? How do you distribute incoming traffic across the cluster? How do you deal with sudden traffic spikes?
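
There is no single right answer, but the simplest form of the traffic-distribution part can be sketched as a round-robin rotation over a pool of backends. The addresses below are made up, and in practice this job is done by a load balancer such as NGINX, HAProxy or a cloud load balancer rather than hand-rolled code:

```typescript
// Minimal round-robin sketch: spread incoming requests across several backends.
const backends = [
  "http://10.0.0.1:3000",
  "http://10.0.0.2:3000",
  "http://10.0.0.3:3000",
];

let next = 0;

function pickBackend(): string {
  const backend = backends[next];
  next = (next + 1) % backends.length; // rotate through the pool
  return backend;
}

// Each incoming request gets forwarded to the next backend in the rotation:
console.log(pickBackend()); // http://10.0.0.1:3000
console.log(pickBackend()); // http://10.0.0.2:3000
```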

If you cannot confidently answer these questions, it's time to bring in a DevOps specialist. But it doesn't end here. Remember the downtime budget of roughly 26 minutes per year? You won't hit that if you don't plan for failures.

You need redundancy in your system, meaning that if one node fails, another one is ready to take over. Or go even further: distribute your computing resources across multiple regions and service providers to minimize platform risk as well.
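
Redundancy only helps if traffic actually stops flowing to dead nodes, which is usually done with health checks. A minimal, self-contained sketch of the idea (the /health endpoint and the addresses are assumptions):

```typescript
// Sketch: periodically probe each node and only route traffic to healthy ones.
// Assumes every node exposes a /health endpoint that returns HTTP 200 when OK.
const pool = ["http://10.0.0.1:3000", "http://10.0.0.2:3000", "http://10.0.0.3:3000"];
const healthy = new Set(pool);

async function probe(node: string): Promise<void> {
  try {
    const res = await fetch(`${node}/health`, { signal: AbortSignal.timeout(1000) });
    if (res.ok) healthy.add(node);
    else healthy.delete(node);
  } catch {
    healthy.delete(node); // unreachable -> take it out of rotation
  }
}

// Re-probe every few seconds; the load balancer then only picks from `healthy`.
setInterval(() => pool.forEach(probe), 5000);
```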

And it gets even more complex when you think about scaling your infrastructure globally. How do you ensure low latency for users in Brazil? What about Australia, Europe, Asia, ...? You get the point: the infrastructure of big global applications is complicated.

But infrastructure is not the only bottleneck when scaling your app.

Scaling the Codebase

In the last section I talked about running your app on multiple machines to handle large amounts of traffic. But can your code even run in parallel, or does it depend on state, such as a database, that needs to stay consistent across all machines? How do you split your application logic? What part of your code runs on which machine?
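
A classic obstacle here is state that lives inside a single process, such as in-memory user sessions: once requests can land on any machine, that state has to move into a shared store. Here is a minimal sketch of that pattern, with a made-up KeyValueStore interface standing in for Redis, Memcached or similar:

```typescript
// Sketch: keep per-user session data in a shared store instead of process memory,
// so any instance in the cluster can serve any request.
// `KeyValueStore` is a made-up interface standing in for Redis, Memcached etc.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface Session {
  userId: string;
  lastSeen: number;
}

async function loadSession(store: KeyValueStore, sessionId: string): Promise<Session | null> {
  const raw = await store.get(`session:${sessionId}`);
  return raw ? (JSON.parse(raw) as Session) : null;
}

async function saveSession(store: KeyValueStore, sessionId: string, session: Session): Promise<void> {
  // 30-minute TTL; every instance reads and writes the same shared store.
  await store.set(`session:${sessionId}`, JSON.stringify(session), 60 * 30);
}
```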

Scaling your application also means scaling your codebase.
And this includes:

  • distributing application logic,
  • introducing advanced monitoring tools,
  • optimizing code for security, performance and reliability,
  • improving performance through additional layers like CDNs or caching (see the sketch after this list),
  • introducing quality control processes,
  • ...
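
To make the caching item from the list above concrete, here is a minimal read-through cache with a time-to-live. The key name and the 30-second TTL are made up, and a real setup would usually put a CDN or a shared cache in front of the application rather than rely on per-instance memory:

```typescript
// Minimal sketch of a read-through cache with a TTL.
type CacheEntry<T> = { value: T; expiresAt: number };

const cache = new Map<string, CacheEntry<unknown>>();

async function cached<T>(key: string, ttlMs: number, load: () => Promise<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value as T; // serve from cache, skip the expensive work
  }
  const value = await load(); // cache miss -> do the real work once
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: rebuild a user's feed at most every 30 seconds instead of on every request.
// `loadPhotoFeed` is a hypothetical expensive call (database, another service, ...):
// const feed = await cached(`feed:${userId}`, 30_000, () => loadPhotoFeed(userId));
```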

And all of that usually means: every tiny little thing that was so easy to do in your hobby project is now exponentially more complex.

Take logging for example:

In your photo sharing hobby project, you just look at the log file on the server. How do you do that in a cluster of hundreds of servers? And how do you keep an overview of millions of log lines a day?
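
The usual answer is to stop treating logs as text files on a single box and instead emit structured log lines that get shipped to a central system (Elasticsearch, Loki, a cloud logging service, ...). A minimal sketch of what such a log line can look like; the field names are just an example:

```typescript
// Sketch: structured, machine-readable log lines instead of free-form text.
// Each line carries enough context (instance, request id) to be searched and
// correlated once it has been shipped to a central log store.
import { hostname } from "node:os";
import { randomUUID } from "node:crypto";

function logEvent(level: "info" | "warn" | "error", message: string, requestId: string): void {
  console.log(
    JSON.stringify({
      time: new Date().toISOString(),
      level,
      message,
      requestId,            // lets you follow one request across services
      instance: hostname(), // which of the hundreds of servers wrote this line
    })
  );
}

logEvent("info", "photo uploaded", randomUUID());
```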

And again, remember that downtime budget of roughly 26 minutes? How often do you accidentally push broken code that crashes your whole application? You do not want that to happen in a serious production environment. That's why scaling your codebase also means setting up processes that ensure no one accidentally breaks something.

The same bugfix that is done within 15 minutes in a hobby project can take several days, if not weeks, in a large-scale application.

From reproducing the bug, prioritizing it, discussing solutions, coding the fix, writing tests, writing documentation, reviewing the code, reviewing security implications, verifying it works for the customer, load testing and iterating back and forth through the test pipeline, to finally releasing it.

But wait, there is more...

Scaling an application doesn't end with scaling the core product. In a bigger context it also means scaling a company: scaling the team, or multiple teams, multiple divisions, complying with legal requirements in different countries and so on.

You get the point. So next time you think about Instagram as an easy weekend project, also think about the part of the iceberg below the waterline.

But anyway, none of that should scare you away from starting something. You won't reach that scale overnight. Don't lose yourself in hypothetical scaling scenarios; instead, scale step by step as you need to.

Interesting read:
Instagram grew to 14M users with only 3 engineers in its early days; one of those engineers describes their early architecture: https://instagram-engineering.com/what-powers-instagram-hundreds-of-instances-dozens-of-technologies-adf2e22da2ad

Side note: Need a helping hand with developing your scalable Vue or Nuxt application? Contact me on https://nuxt.wimadev.de
