[Cockroach] Architecture - Read/Write

Reference: www.cockroachlabs.com/docs/v20.2/architecture/reads-and-writes-overview.html

 


 

 


Important concepts

Cluster: Your CockroachDB deployment, which acts as a single logical application.

Node: An individual machine running CockroachDB. Many nodes join together to create your cluster.

Range: CockroachDB stores all user data (tables, indexes, etc.) and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.

From a SQL perspective, a table and its secondary indexes initially map to a single range, where each key-value pair in the range represents a single row in the table (also called the primary index because the table is sorted by the primary key) or a single row in a secondary index. As soon as that range reaches 512 MiB in size, it splits into two ranges. This process continues for these new ranges as the table and its indexes continue growing.

Replica: CockroachDB replicates each range (3 times by default) and stores each replica on a different node.

Leaseholder: For each range, one of the replicas holds the "range lease". This replica, referred to as the "leaseholder", is the one that receives and coordinates all read and write requests for the range.
(It receives and handles all reads and writes for the range.)

Unlike writes, read requests access the leaseholder and send the results to the client without needing to coordinate with any of the other range replicas. This reduces the network round trips involved and is possible because the leaseholder is guaranteed to be up-to-date due to the fact that all write requests also go to the leaseholder.
(A read is handled entirely by the leaseholder, with no other range replicas involved, which works because the leaseholder is usually also the leader for writes.)

Raft Leader: For each range, one of the replicas is the "leader" for write requests. Via the Raft consensus protocol, this replica ensures that a majority of replicas (the leader and enough followers) agree, based on their Raft logs, before committing the write. The Raft leader is almost always the same replica as the leaseholder.
(The leader for writes; a write commits once a majority of replicas agree based on their Raft logs.)

Raft Log: For each range, a time-ordered log of writes to the range that its replicas have agreed on. This log exists on-disk with each replica and is the range's source of truth for consistent replication.
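The Range definition above can be illustrated with a toy model. A minimal sketch, assuming hypothetical split keys in CockroachDB's `/Table/<id>` key style (the split keys and function are illustrative only, not the real storage layer):

```python
# Toy model of a sorted keyspace divided into contiguous ranges.
# The split keys below are hypothetical; real CockroachDB splits are
# managed automatically (e.g. when a range reaches 512 MiB).
import bisect

RANGE_STARTS = ["/Table/1", "/Table/2", "/Table/3"]  # each range = [start, next start)

def range_for_key(key: str) -> int:
    """Locate the single range that contains `key` in the sorted keyspace."""
    return max(bisect.bisect_right(RANGE_STARTS, key) - 1, 0)

print(range_for_key("/Table/1/5"))   # -> 0: a row of table 1 lives in range 0
print(range_for_key("/Table/3/42"))  # -> 2: a row of table 3 lives in range 2
```

Because the keyspace is sorted and ranges are contiguous, a binary search like this is all it takes to find the one range responsible for any key.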

 


Read scenario

- Reads are executed on the node where the leaseholder is located, not on the gateway node the user connected to.

- The gateway node receives the result and relays it back to the user.

 

First, imagine a simple read scenario where:

  • There are 3 nodes in the cluster.
  • There are 3 small tables, each fitting in a single range.
    (For simplicity, assume small tables where one table = one range.)
  • Ranges are replicated 3 times (the default).
  • A query is executed against node 2 to read from table 3.
    (= Node 2 is the gateway node)

In this case:

  1. Node 2 (the gateway node) receives the request to read from table 3.
  2. The leaseholder for table 3 is on node 3, so the request is routed there.
    (= the query is actually executed on Node 3)
  3. Node 3 returns the data to node 2.
  4. Node 2 responds to the client.

If the query is received by the node that has the leaseholder for the relevant range, there are fewer network hops.
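The read path above can be sketched with a toy routing function (the `LEASEHOLDERS` map, node names, and rows are made up for illustration; this is not a CockroachDB API):

```python
# Toy model: the gateway forwards a read to the range's leaseholder and
# relays the rows back; no other replica is contacted.
LEASEHOLDERS = {"table3": "node3"}                 # range -> leaseholder node
STORE = {"node3": {"table3": ["row1", "row2"]}}    # per-node replica data

def read(gateway: str, rng: str):
    holder = LEASEHOLDERS[rng]
    # Two extra hops (request + response) only when the gateway
    # is not the leaseholder itself.
    extra_hops = 0 if holder == gateway else 2
    return STORE[holder][rng], extra_hops

print(read("node2", "table3"))  # routed node2 -> node3 and back: 2 extra hops
print(read("node3", "table3"))  # gateway is the leaseholder: 0 extra hops
```

The second call models the "fewer network hops" case: when the client happens to connect to the leaseholder's node, the gateway-to-leaseholder round trip disappears.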

 

Verifying the read operation

- Gateway node: nsql1
- Leaseholder node: nsql3
- Execute "select count(*) from order_line where ol_w_id=3"

- CPU usage on the nsql3 server rises, confirming the read is executed on the leaseholder's node.

 

 

 


Write scenario

Now imagine a simple write scenario where a query is executed against node 3 to write to table 1:

In this case:

  1. Node 3 (the gateway node) receives the request to write to table 1.
    (Node 3 receives the write request for table 1.)
  2. The leaseholder for table 1 is on node 1, so the request is routed there.
    (All processing moves to the leaseholder and happens there.)
  3. The leaseholder is the same replica as the Raft leader (as is typical), so it simultaneously appends the write to its own Raft log and notifies its follower replicas on nodes 2 and 3.
    (Since the leaseholder is typically also the Raft leader, it appends the write to its Raft log and has the replicas on the other nodes do the same.)
  4. As soon as one follower has appended the write to its Raft log (and thus a majority of replicas agree based on identical Raft logs), it notifies the leader and the write is committed to the key-values on the agreeing replicas.
    (Once a follower appends the write to its Raft log and notifies the leader, the write is committed.)
    In this diagram, the follower on node 2 acknowledged the write, but it could just as well have been the follower on node 3. Also note that the follower not involved in the consensus agreement usually commits the write very soon after the others.
    (The write commits as soon as the ack arrives from Node 2; Node 3 applies it afterwards.)
  5. Node 1 returns acknowledgement of the commit to node 3.
  6. Node 3 responds to the client.

Just as in the read scenario, if the write request is received by the node that has the leaseholder and Raft leader for the relevant range, there are fewer network hops.
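The commit rule in step 4 is Raft's majority quorum. A minimal sketch of that rule alone (not a real Raft implementation):

```python
# With 3 replicas, the leader's own Raft log append plus one follower
# acknowledgement already forms a majority (2 of 3), so the write commits
# without waiting for the slowest replica.
def is_committed(acks: int, replicas: int = 3) -> bool:
    # `acks` counts replicas, leader included, that have appended the
    # entry to their on-disk Raft log.
    return acks > replicas // 2

print(is_committed(1))              # leader alone: False
print(is_committed(2))              # leader + one follower: True
print(is_committed(2, replicas=5))  # 5-way replication needs 3 acks: False
```

This is why, in the diagram described above, the acknowledging follower could equally have been the one on node 2 or node 3: any single follower plus the leader reaches the 2-of-3 majority.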


Network and I/O bottlenecks

With the above examples in mind, it's always important to consider network latency and disk I/O as potential performance bottlenecks. In summary:

  • For reads, hops between the gateway node and the leaseholder add latency.
    (Reads add latency between the gateway and the leaseholder.)
  • For writes, hops between the gateway node and the leaseholder/Raft leader, and hops between the leaseholder/Raft leader and Raft followers, add latency. In addition, since Raft log entries are persisted to disk before a write is committed, disk I/O is important.
    (Writes add latency between the gateway and the leaseholder/Raft leader and between the leader and the Raft followers, plus disk I/O to persist the Raft log.)
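A back-of-envelope model of the summary above, with made-up latency numbers (the constants are assumptions for illustration, not measurements):

```python
RTT_MS = 1.0    # assumed node-to-node network round trip
FSYNC_MS = 2.0  # assumed disk sync to persist a Raft log entry

def write_latency_ms(gateway_is_leaseholder: bool) -> float:
    # Every write pays one leader <-> follower round trip for quorum
    # plus the Raft log fsync on disk.
    latency = RTT_MS + FSYNC_MS
    if not gateway_is_leaseholder:
        latency += RTT_MS  # extra gateway <-> leaseholder round trip
    return latency

print(write_latency_ms(False))  # 4.0 ms
print(write_latency_ms(True))   # 3.0 ms
```

Even in this crude model, the disk fsync and the quorum round trip are unavoidable, while the gateway-to-leaseholder hop is the one cost that locality (connecting to the leaseholder's node) can eliminate.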