Database Replication_System Design Notes 9

Preface

Theoretically, with a reliable load balancing mechanism, we can easily scale 1 server to n servers, however, if these n machines still use the same database, soon the database will become the system's performance bottleneck and reliability bottleneck

So, how to improve database's processing capability?

From a resource perspective, there are only two approaches:

Vertical scaling: Improve single machine configuration (hard disk, memory, CPU, etc.), but will similarly encounter single machine performance bottleneck
Horizontal scaling: Add machines, expand from single database instance to multiple instances in quantity

Looking at it this way, it seems we just need to add a few databases, sharing traffic from application layer together, and the expansion from single database to multiple databases is complete:

Is it really that simple?

1. Consistency Problem

If the same data exists in multiple copies, then we need to consider how to ensure its consistency

(From Consistency Patterns)

The biggest difference between database and application service is that application service can be stateless (or can extract shared state out, such as putting it in database), while database operations are definitely stateful, when scaling database we must consider data consistency

Specifically, consistency is divided into 3 types, in decreasing order of strictness:

Strong consistency: After writing, can read immediately
Eventual consistency: After writing, guarantee can eventually read
Weak consistency: After writing, may not be able to read

2. Replication

So, expanding from single database to multiple databases, at least requires one data update synchronization mechanism, called Replication:

Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.

(From Replication (computing)）

That is, ensuring information consistency between multiple data copies through replication (write operations). For example, when writing data to database instance A, also need to write the same data to instances B, C, D, etc.

3. Replication Methods

Asynchronous Replication

Specifically, can inform other instances to update data after writing is complete, that is asynchronous replication:

In this mode, client doesn't need to wait for replication operation to complete, no additional performance impact. But the problem is:

Risk of data loss
Cannot guarantee strong consistency, because there's replication lag

If instance A, after writing, hasn't had time to inform other instances, but goes down itself, data loss will occur:

On the other hand, since replication operation is completed asynchronously, data update is actually lagging:

The time difference from when a write operation completes on current instance to when that operation is applied to other instances is called replication lag. During this period, clients reading from other instances still get old data, obviously does not meet strong consistency requirements (can only guarantee eventual consistency)

Synchronous Replication

Wanting to achieve strict consistency requirements, have to consider synchronous replication:

When write operation occurs, immediately synchronize operation to all other instances, only consider write complete after replication is done, to ensure strict consistency

But synchronous replication affects performance and availability, cost is quite high:

Performance impact: Need to wait for entire replication process to complete
Availability impact: As long as one instance has failure (network, etc.), entire write operation will fail

And the more database instances, the greater these two aspects of impact

Semi-synchronous Replication

Specially, can combine two methods together, called semi-synchronous replication:

Some databases and replication tools allow us to define a number of followers to replicate synchronously, and the others just use the asynchronous approach. This is sometimes called semi-synchronous replication.

That is requiring some database instances to replicate synchronously, rest asynchronously

P.S. PostgreSQL supports this mode

4. Topology Structure

In terms of topology, replication can be divided into 3 categories:

Single leader replication
Multi leader replication
Leaderless replication

Single Leader Replication

That is the most common one-master-multiple-slaves structure:

In this structure, write operations (add/delete/modify) only allowed on master database, master database replicates write operations to all other slave databases, slave databases only support read operations (query)

Since all clients write to same database, successfully avoids big trouble of write operation conflicts. But need to note:

Write operation pressure is still borne by single database: not suitable for write-intensive applications, but fortunately most applications are read-intensive
Latency problem accessing master database: Only one master database, can only be placed in certain geographical location, means initiating write operations (accessing master database) in some regions may have to bear higher latency

Worse situation, if master database goes down, need to immediately select a successor from slave databases, take on master database's responsibilities, ensure this mechanism operates normally

However, this failover strategy is not that easy to implement, difficulties lie in:

How to determine master database really went down?
How to select new master database?
How to route write operations to new master database?

Actually, we cannot distinguish high latency from unavailability, usually consider timeout as unavailable (whether really down or not), then start failover plan, begin selecting new master database

Selecting one is not difficult, key is selected new master database must be recognized by all other slave databases to count (that is Consensus problem), such as pre-determining succession order

After new master database is selected, need to forward all write operations to it, such as adding a distribution layer, to allow routing control

Additionally, if using asynchronous replication, data not yet replicated to other slave databases after old master database recovers may conflict with data written by new master database during downtime, at this point usually adopt LWW (last-write-win) strategy, directly discard old data, but also has risks

Specially, an interesting situation is old master database recovers thinking it's still master database, appears split (Split-brain):

P.S. Network failure can also cause this situation, for example network failure between two clusters, cannot access each other, both think other team is down, so each starts election

Simple handling method is STONITH (Shoot The Other Node In The Head), once discovering multiple master databases exist, directly stop one

Multi Leader Replication

Now have multiple writable master databases, can share write operations, can also deploy in multiple locations, single leader structure's 2 problems are easily solved. However, big trouble appears

Since write operations can simultaneously occur on (asynchronously replicated) multiple databases, we must consider how to resolve write conflicts. Generally there are 3 approaches:

Avoid conflicts: Such as storing by content characteristics in separate databases, mutually independent, for example for domestic and foreign two master databases, if can guarantee all write operations to domestic data can land on domestic master database, all write operations to foreign data can land on foreign master database, then no conflicts exist
LWW (last-write-win) strategy: Attach timestamp to each write operation, only keep latest version
Let user resolve: Record conflicts, application prompts to user, user decides which one to keep

P.S. Some databases (such as CouchDB) support writing all conflict values, and return a series of values when reading

Additionally, another difficult problem under multi leader structure is replicating DDL (Data Definition Language), that is write operations targeting Schema, specifically see DDL replication

Leaderless Replication

Of course, there's also a structure that doesn't distinguish master database, all databases are readable and writable

Looks like "all-master structure", then predictably, write conflicts will become very common, so we need to adjust strategy, avoid making it "all-master structure":

Write: Client writes to multiple databases simultaneously, as long as some succeed consider write complete
Read: Client reads from multiple databases simultaneously, each database returns data and its corresponding version number, client decides which one to adopt based on version number

No master database, means don't need to consider failover, single database failure doesn't affect overall, all kinds of troublesome problems of selecting new master database no longer exist

Meanwhile, no master database also means no data synchronization mechanism, old values read cannot be automatically corrected:

So need additional error correction mechanism, client writes new value back when reading old value (called Read repair), or independent process specifically responsible for finding old values and correcting them back

Another key factor is number of target databases for read/write operations, at least how many databases need to write successfully, at least how many databases need to successfully read to guarantee can definitely read new value?

If w databases write successfully, then successfully read r databases' data, then must satisfy w + r > total number of databases

5. Specific Implementation

Specifically, there are 3 ways to copy some data from one database to another:

Statement-based replication: Send write operation statements as-is to other databases to execute
Write-ahead log shipping replication: Also called physical replication, pass database logs to other databases, recover completely consistent data from logs. For example PostgreSQL's Streaming Replication
Row-based replication: Also called logical replication, pass logs specifically used for replication, replicate by row. For example MySQL's Mixed Binary Logging Format

Problem with replicating by statement is, not all statement execution results are deterministic, for example CURRENT_TIME(), RANDOM(), although some databases will replace these values during replication, still cannot guarantee triggers, and user-defined functions have deterministic execution results. On the other hand, also need to ensure transaction operations' atomicity on all databases, either all completed, or all not done at all

Write-ahead log shipping replication can guarantee data completely consistent, but (storage engine-oriented) logs usually cannot be used across database versions, because under different database versions, data's physical storage method may change. And, log shipping is not suitable for multi leader structure, because cannot merge multiple logs into one

And row-based replication is combination of previous two methods, adopts logs specifically used for replication, no longer coupled with storage engine, thus can be used across database versions. Compared to replicating by statement, row-based replication needs to record more information (for example one statement affected 100 rows, need to record all by row)

Reference Materials

A Primer on Database Replication