Introduction

In this article, I'll try to design Twitter.

STEP1: Constraints and use cases

How many users do we expect this system to handle?
- Aim for 10 million users generating around 100 million requests per day
- Expect that each user will be following 200 other users on average, but expect some extraordinary users with tens of thousands of followers
How many requests will that generate?
- Expect that there will be a maximum of 10 million tweets per day and that each tweet will be favorited twice on average, but again, expect some big outliers
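
The daily figures above can be turned into per-second averages with a few lines of arithmetic. This is only a rough sketch: it assumes traffic is spread evenly over the day, which real traffic is not, and the helper name per_second is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(events_per_day: float) -> float:
    """Average number of events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Daily volumes taken from the constraints above.
requests_per_day = 100_000_000
tweets_per_day = 10_000_000
favorites_per_day = tweets_per_day * 2  # each tweet is favorited twice on average

for label, daily in [
    ("requests", requests_per_day),
    ("tweets", tweets_per_day),
    ("favorites", favorites_per_day),
]:
    print(f"{label}: about {per_second(daily):,.0f} per second on average")
```

Peak load will be higher than these averages, so the numbers are a lower bound for capacity planning rather than a target.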

Application Layer
- posting new tweets (write)
- following a user (write)
- favoriting a tweet (write)
- displaying data about users and tweets (read)
Data storage layer
- the data storage that we will use to store all the data that needs to be persisted

About 1,150 HTTP requests per second on average (100 million requests per day / 86,400 seconds per day)
Set up a load balancer with a set of application servers running behind it
The load balancer routes requests to the servers using some predefined logic
Disadvantage
- A single load balancer is a single point of failure, and configuring multiple load balancers adds further complexity.
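
The notes do not say which routing logic the load balancer uses. As one possible illustration, here is a minimal round-robin sketch. The class name RoundRobinBalancer and the server names are hypothetical, and a real load balancer would also need health checks and failover.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out application servers in a fixed rotating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def route(self, request_id):
        """Return the server that should handle this request.

        request_id is unused by round-robin; other strategies
        (for example hashing) would look at it.
        """
        return next(self._rotation)


# Hypothetical server names, for illustration only.
balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [balancer.route(request_id) for request_id in range(5)]
print(assignments)
```

Round-robin is the simplest choice because it needs no knowledge of server load. Strategies such as least-connections trade that simplicity for a more even distribution.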

The data
- User profiles
- A set of tweets
- Follow relationships
- Favorites
Tweets will be generated at an average speed of 10 million per day
- for a single year there will be 3.65 billion incoming tweets
- Aim for a solution that can efficiently store at least 10 billion tweets for now
- 1 tweet: 140 characters
- 140 * 10 billion = 1.4 trillion characters => 2.8 trillion bytes, about 2.55 TB (1 character: 2 bytes, 1 TB = 2^40 bytes)
About 2 billion follow connections (10 million users * 200 followed users each)
- follower and followed => 2 user IDs
- 8 bytes (two 4-byte integers) * 2 billion = 16 billion bytes, about 14.9 GB (1 GB = 2^30 bytes)
The favorites are expected to grow at a rate of 20 million per day
- Let's say we want to be able to store at least 20 billion such objects
- They can probably just point to one user and one tweet through their IDs
- 4 bytes (user ID) + 8 bytes (tweet ID) = 12 bytes; the tweet ID needs 8 bytes because 10 billion tweets exceed the range of a 4-byte integer
- 12 * 20 billion = 240 billion bytes, about 223.5 GB (1 GB = 2^30 bytes)
Expected data: about 3.06 trillion bytes in total, roughly 2.8 TB (1 TB = 2^40 bytes)
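
The storage estimate can be re-derived in a few lines, which makes it easy to re-run with different assumptions. This sketch uses only the figures from the notes above (2 bytes per character, 8 bytes per follow connection, 12 bytes per favorite). The variable names are made up for illustration, and it counts raw payload only, with no index or row overhead.

```python
TB = 2 ** 40  # bytes per terabyte, binary definition as used in the notes
GB = 2 ** 30  # bytes per gigabyte, binary definition as used in the notes

# Tweets: 10 billion tweets * 140 characters * 2 bytes per character.
tweet_bytes = 10_000_000_000 * 140 * 2

# Follow connections: 2 billion connections * two 4-byte user IDs.
follow_bytes = 2_000_000_000 * (4 + 4)

# Favorites: 20 billion records * 12 bytes (one user ID plus one tweet ID).
favorite_bytes = 20_000_000_000 * 12

total_bytes = tweet_bytes + follow_bytes + favorite_bytes

print(f"tweets:    {tweet_bytes / TB:.2f} TB")
print(f"follows:   {follow_bytes / GB:.1f} GB")
print(f"favorites: {favorite_bytes / GB:.1f} GB")
print(f"total:     {total_bytes / TB:.2f} TB")
```

Tweet text dominates the total, so the assumptions about tweet count and bytes per character matter far more than the size of the ID fields.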

To handle the incoming read requests, we may need a caching solution such as memcached
- A database stores data on disk, and reading from disk is much slower than reading from memory
- Databases usually have their own caching mechanisms, but with memcached we have better control over what gets cached and how
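
One common way to put a cache in front of a database is the cache-aside pattern: look in the cache first, fall back to the database on a miss, then store the result for the next reader. The sketch below is only an illustration of that idea. It uses a plain dict in place of memcached, and the TweetStore and CachedTweetReader classes are hypothetical stand-ins, not part of any real client library.

```python
class TweetStore:
    """Hypothetical stand-in for the database; counts how often it is read."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, tweet_id):
        self.reads += 1
        return self._rows.get(tweet_id)


class CachedTweetReader:
    """Cache-aside reader: check the cache first, then fall back to the store."""

    def __init__(self, store, cache=None):
        self._store = store
        # A dict stands in for memcached here (assumption for illustration).
        self._cache = {} if cache is None else cache

    def get(self, tweet_id):
        key = f"tweet:{tweet_id}"
        if key in self._cache:
            return self._cache[key]  # cache hit: no database read
        value = self._store.get(tweet_id)  # cache miss: read from the database
        if value is not None:
            self._cache[key] = value  # remember it for later reads
        return value


store = TweetStore({1: "hello world"})
reader = CachedTweetReader(store)
first = reader.get(1)   # miss: goes to the store
second = reader.get(1)  # hit: served from the cache
print(first, second, store.reads)
```

A real deployment would also need an expiry or invalidation policy so that cached entries do not go stale after a write. The sketch leaves that out.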
Add the appropriate indexes
- Indexes are also vital for running fast queries that join tables
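
To make the indexing point concrete, here is a small sketch using an in-memory SQLite database. The notes do not define a schema, so the table, column, and index names are assumptions made up for illustration. The indexes cover the columns that the timeline query filters and joins on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: users, tweets, and follow relationships.
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        body TEXT NOT NULL
    );
    CREATE TABLE follows (
        follower_id INTEGER NOT NULL,
        followed_id INTEGER NOT NULL
    );
    -- Indexes on the columns used for filtering and joining.
    CREATE INDEX idx_tweets_user_id ON tweets (user_id);
    CREATE INDEX idx_follows_follower ON follows (follower_id, followed_id);
    """
)

conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [(10, 2, "hi from bob"), (11, 3, "hi from carol"), (12, 1, "hi from alice")],
)
conn.execute("INSERT INTO follows VALUES (?, ?)", (1, 2))  # alice follows bob

# Timeline for user 1: tweets written by the users that user 1 follows.
timeline = conn.execute(
    """
    SELECT t.id, u.name, t.body
    FROM follows AS f
    JOIN tweets AS t ON t.user_id = f.followed_id
    JOIN users AS u ON u.id = t.user_id
    WHERE f.follower_id = ?
    ORDER BY t.id DESC
    """,
    (1,),
).fetchall()
print(timeline)
```

Without the index on tweets (user_id), the join would have to scan every tweet for each followed user, which is not workable at billions of rows.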
Partitioning the data