Design Spotify

Prashant Mishra · Oct 2 · Dev Community

Use cases:
find song
listen to songs/podcasts
follow an artist
create playlist
music recommendation
view lyrics
Let's focus on only a few use cases:
Functional requirements:
find songs
Listen to songs
Non-functional requirements: Availability > consistency
If a newly uploaded single is not visible to some users for a while, that is fine, but songs should not disappear.
The service should be available at all times; it should never be down.

Design considerations:
Traffic considerations:
1B users = 1×10^9 users = 100 crore people
Storage considerations:
100M songs = 10 crore songs
1 song ≈ 1MB → 100M × 1MB = 10^8 MB = 100TB → 1PB (if 10x replication of the data is made for availability)
100 bytes of metadata per song → 100 bytes × 10^8 = 10GB → (100GB if 10x replication of the metadata database is made)
Bandwidth considerations:
Out of 1B users, let's say 150M users are streaming songs on the app at a time.
150M × (1MB + 100 bytes) ≈ 1.5×10^8 MB = 150TB/day ≈ 1.74GB/second (assuming this includes both reads and writes of data)
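These estimates are easy to sanity-check with a short script. All inputs below are the assumptions stated above (1MB per song, 100 bytes of metadata, 10x replication, 150M concurrent listeners); the numbers are illustrative, not measured.

```python
# Back-of-the-envelope check of the capacity estimates above.
SONGS = 100_000_000          # 100M songs
SONG_SIZE = 1_000_000        # ~1 MB per song, in bytes
META_SIZE = 100              # ~100 bytes of metadata per song
REPLICATION = 10             # 10x replication for availability
CONCURRENT = 150_000_000     # assumed concurrent listeners

song_storage = SONGS * SONG_SIZE              # raw song storage
total_storage = song_storage * REPLICATION    # after replication
meta_storage = SONGS * META_SIZE              # raw metadata storage

daily_bytes = CONCURRENT * (SONG_SIZE + META_SIZE)  # per-day volume
bandwidth = daily_bytes / 86_400                    # bytes per second

print(f"songs: {song_storage / 1e12:.0f} TB raw, "
      f"{total_storage / 1e15:.0f} PB replicated")
print(f"metadata: {meta_storage / 1e9:.0f} GB raw")
print(f"bandwidth: ~{bandwidth / 1e9:.2f} GB/s")
```

Running it confirms 100TB of raw song data (1PB replicated), 10GB of raw metadata, and roughly 1.74GB/s of sustained bandwidth.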

Api design:
getSong(userId, token)
playSong(songId, songUri)
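A minimal sketch of what these two APIs could look like, assuming an in-memory "database" and a search-query parameter for getSong; the field names, the fake signed URL, and the auth check are all illustrative assumptions, not a real Spotify interface.

```python
# Hypothetical handlers for the two APIs above (sketch only).
from dataclasses import dataclass

@dataclass
class Song:
    song_id: str
    title: str
    song_uri: str  # points at the MP3 object in blob storage

# Stand-in for the song metadata database.
SONG_DB = {
    "s1": Song("s1", "First Light", "s3://songs/s1.mp3"),
}

def get_song(user_id: str, token: str, query: str) -> list[Song]:
    """Find songs: return songs whose title matches the query."""
    if not token:                      # stand-in for real auth
        raise PermissionError("login required")
    q = query.lower()
    return [s for s in SONG_DB.values() if q in s.title.lower()]

def play_song(song_id: str, song_uri: str) -> str:
    """Listen to songs: return a (fake) signed streaming URL."""
    return f"{song_uri}?session={song_id}-signed"

results = get_song("u1", "some-token", "light")
print([s.title for s in results])   # -> ['First Light']
```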

Database design:
(MP3 songs): Amazon S3; (metadata of users, metadata of songs): NoSQL/SQL database
Metadata of user: userId, userName, isArtist, songUrl[], followers
Metadata of song: songId, userId, likes, songUrl
Amazon S3 for MP3 object storage (since MP3s are immutable data, object storage is the best fit)
For metadata: a NoSQL database, as it will be denormalised (different collections can contain duplicate data).
Since the app is read-heavy, NoSQL is well suited for searching.
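To make the denormalisation concrete, here are example documents for the two metadata collections. The field names come from the schemas above; the values are made up. Note how the artist's songUrl list duplicates data that also lives in the song document.

```python
# Example documents for the two NoSQL metadata collections (values are
# illustrative; field names follow the schemas in the text).
user_doc = {
    "userId": "u42",
    "userName": "some_artist",
    "isArtist": True,
    "songUrl": ["s3://songs/s1.mp3"],  # duplicated in the song docs
    "followers": 10_500,
}

song_doc = {
    "songId": "s1",
    "userId": "u42",  # the artist who uploaded it
    "likes": 999,
    "songUrl": "s3://songs/s1.mp3",
}

# Denormalised: the same URL appears in both collections.
print(song_doc["songUrl"] in user_doc["songUrl"])   # -> True
```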

High level design:

(diagram: high-level design of Spotify)

Duplication/replication:
We can replicate the data across S3 buckets and distribute it for availability.
We can also create shards of the data, using techniques like sharding based on userId; to ensure uniform data
distribution we can use consistent hashing.

Load balancers: we will need load balancers between the app and the app servers, between the app servers and the cache, between the app servers and the S3 buckets, etc.
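As a toy illustration of what a load balancer does at each of those hops, here is a round-robin picker; the server names are made up, and real load balancers also handle health checks, weights, and connection draining.

```python
# Toy round-robin load balancer (sketch; server names are illustrative).
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)  # endless rotation over servers

    def pick(self) -> str:
        return next(self._servers)

lb = RoundRobinBalancer(["app-server-1", "app-server-2", "app-server-3"])
print([lb.pick() for _ in range(4)])
# -> ['app-server-1', 'app-server-2', 'app-server-3', 'app-server-1']
```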

For security: we can use two-factor authentication, and only logged-in users will be allowed to access the service.
For storage optimization, we can consider data compression and archiving music that is no longer popular.

Caching: we can use a cache for song metadata and user details. On a cache miss, the server reads from the metadata database and populates the cache with the result, avoiding a database round trip on subsequent reads. This is the cache-aside pattern (a write-back cache is different: there, writes land in the cache first and are flushed to the database later).
Not all songs are accessed frequently; let's say only 10 percent of the songs generate 90% of the traffic.
So, 10% of 100M songs = 10M songs (1 crore), and the cache size will be 10^7 × 1MB = 10TB.
We can either provision 10TB of cache or store parts of the songs across different cache instances. These caches will live on edge servers managed by a CDN.

Similarly, we can cache the metadata of songs and users to speed up queries.
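The metadata caching flow described above can be sketched as a small cache-aside layer. The LRU eviction policy and the dict standing in for the metadata database are assumptions made for the example.

```python
# Cache-aside sketch for song metadata: on a miss, read from the
# (simulated) metadata DB and populate the cache.
from collections import OrderedDict

# Stand-in for the metadata database.
METADATA_DB = {"s1": {"likes": 10}, "s2": {"likes": 3}}

class MetadataCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order tracks recency

    def get(self, song_id: str):
        if song_id in self.data:                # cache hit
            self.data.move_to_end(song_id)      # mark as recently used
            return self.data[song_id]
        value = METADATA_DB.get(song_id)        # miss: DB round trip
        if value is not None:
            self.data[song_id] = value          # populate the cache
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)   # evict least recently used
        return value

cache = MetadataCache(capacity=1)
cache.get("s1")   # miss -> read from DB, cached
cache.get("s1")   # hit -> served from cache
cache.get("s2")   # miss -> cached, evicts s1
```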

Read more about CDN.
