Chapter 3: Storage and Retrieval

Log-Structured Storage Engines

Simplest DB: append-only data file

  • Many databases use a log, which is an append-only data file.
  • Real databases have more issues to deal with
    1. concurrency control
    2. reclaiming disk space so the log doesn’t grow forever
    3. handling errors and partially written records
#!/bin/bash
# db_set appends a key,value record to the end of the data file.
db_set () {
        echo "$1,$2" >> database
}
# db_get scans for every record with the key and keeps only the most recent one.
db_get () {
        grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
jjlin@linjiajiadeMacBook-Air ~ % db_set jimmy lin
jjlin@linjiajiadeMacBook-Air ~ % db_set henhen stupid
jjlin@linjiajiadeMacBook-Air ~ % db_get jimmy
lin
jjlin@linjiajiadeMacBook-Air ~ % db_get j
jjlin@linjiajiadeMacBook-Air ~ % db_get henhen
stupid
jjlin@linjiajiadeMacBook-Air ~ % db_set henhen stupid
jjlin@linjiajiadeMacBook-Air ~ % cat database
jimmy,lin
henhen,stupid
henhen,stupid
  • Pros: writes are as fast as possible, since appending sequentially to the end of the file is the cheapest write pattern
  • Cons: reads are slow once the number of records grows, since every lookup has to scan the file from the beginning (O(n))

Hash Indexes

  • Speed up read operations by keeping an in-memory hash table that maps each key to a byte offset in the data file (a minimal sketch follows the figure below)

  • Keep in mind that the hash table must fit within RAM; otherwise locating the corresponding byte offset requires extra disk I/O, which slows reads down

    [figure: in-memory hash map of key → byte offset into the log file]
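
A minimal sketch of this idea in Python, assuming a single data file and string keys/values (class and method names are illustrative):

import os

class HashIndexedLog:
    def __init__(self, path="database"):
        self.index = {}                        # key -> byte offset of its latest record
        self.f = open(path, "ab+")             # append-only data file, also readable

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()                 # the new record starts here
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset               # hash table now points at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                        # key was never written
        self.f.seek(offset)                    # one seek + one read instead of a full scan
        line = self.f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]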

Compaction

  1. When writing data to disk, we break the log into segments, and perform compaction within each segment to save disk space.

  2. Compaction within a single segment is intuitive: as the following diagram illustrates, we simply merge records with the same key and keep only the newest one:

    [figure: compacting a segment by keeping only the most recent value for each key]

  3. After segments have been compacted, we can follow the same process and merge several segments into a new one (a sketch follows after this list):

    [figure: merging two compacted segments into a new segment]
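
A minimal sketch of compacting a single segment, reusing the key,value record format of the bash example above (the function name is illustrative):

def compact_segment(records):
    latest = {}                                # key -> newest value seen so far
    for record in records:
        key, value = record.split(",", 1)
        latest[key] = value                    # later records overwrite earlier ones
    return [f"{key},{value}" for key, value in latest.items()]

segment = ["jimmy,lin", "henhen,stupid", "henhen,stupid"]
print(compact_segment(segment))                # ['jimmy,lin', 'henhen,stupid']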

Implementation Details

  • Delete Operation
    • Use a tombstone (a special deletion record) to mark records for deletion, and remove them during the compaction process
  • Crash Recovery
    • Store a snapshot of the hash table on disk and reload it upon machine restart
  • Preventing Partial Writes
    • Include checksums in records to detect and discard partially written records (a sketch follows below)
  • Concurrency Control
    • Allow only a single writer thread, while permitting multiple concurrent reader threads
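
A sketch of one way to frame records with a CRC-32 checksum, so a reader can detect and discard a partially written or corrupted tail record (this layout is an assumption for illustration, not any particular database’s format):

import struct, zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    # Layout: crc32 | key length | value length | key | value
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body

def decode_record(buf: bytes):
    # Returns (key, value), or None if the record is truncated or corrupted.
    if len(buf) < 12:
        return None                            # not even a complete header
    (crc,) = struct.unpack(">I", buf[:4])
    klen, vlen = struct.unpack(">II", buf[4:12])
    body = buf[4:12 + klen + vlen]
    if len(body) < 8 + klen + vlen or zlib.crc32(body) != crc:
        return None                            # partial write or corruption: drop it
    return body[8:8 + klen], body[8 + klen:]

print(decode_record(encode_record(b"jimmy", b"lin")))       # (b'jimmy', b'lin')
print(decode_record(encode_record(b"jimmy", b"lin")[:7]))   # None (torn write)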

The append-only file approach is fast because sequential writes are much quicker on disk than random writes. Additionally, since the log is immutable and records appear strictly in write order, concurrency control and crash recovery are much simpler.

However, this approach has some drawbacks:

  • The hash table must fit entirely in memory
  • Range queries are slow, as they require scanning all data (e.g., summing values for keys within a range like 10 to 1000)

Sorted String Tables (SSTables) and Log-Structured Merge-Trees (LSM-Trees)

  • Range queries over an append-only data file are slow; what can we do to speed them up?

    → Keep the keys within each segment sorted so that lookups can use binary search. The result is an SSTable.

  • In an SSTable, the keys within each segment are sorted, and after merging, each key appears only once per segment.

  • The merge process is similar to classic merge sort; when the same key appears in several segments, keep the value from the most recent segment (assume we always merge adjacent segments). A sketch follows the diagram below:

    [figure: merge-sort-style merging of SSTable segments, where the newest value wins]
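
A sketch of that merge using Python’s heapq.merge, tagging each entry with its segment’s age so that on duplicate keys the value from the newest segment wins (the segment contents are made up for illustration):

import heapq

def merge_segments(segments):
    # segments[0] is the newest; each segment is a sorted list of (key, value).
    tagged = ([(key, age, value) for key, value in seg]
              for age, seg in enumerate(segments))
    merged, last_key = [], None
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:                    # first occurrence has the smallest age,
            merged.append((key, value))        # i.e. it came from the newest segment
            last_key = key
    return merged

new = [("handbag", 8786), ("handprinted", 459)]
old = [("handcuffs", 2729), ("handprinted", 303)]
print(merge_segments([new, old]))
# [('handbag', 8786), ('handcuffs', 2729), ('handprinted', 459)]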

  • There’s no longer any need to keep an index entry for every key in memory: because the keys are sorted, a sparse index with one entry every few kilobytes is enough, and a lookup scans forward from the nearest indexed key.

  • Because the index is sparse, a read query has to scan over several keys in a given range anyway; we can therefore group records into blocks and compress each block before writing it to disk, which not only saves disk space but also reduces the I/O bandwidth used (a lookup sketch follows the figure below).

    [figure: sparse in-memory index pointing into compressed blocks of an SSTable]
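
A sketch of a lookup through a sparse index with binary search (the keys and offsets are illustrative):

import bisect

# One (key, byte offset) entry every few kilobytes of the segment file.
sparse_index = [("handbag", 0), ("handsome", 4096), ("hangout", 8192)]
index_keys = [key for key, _ in sparse_index]

def scan_start(key):
    # Find the last indexed key <= the lookup key; scanning starts at its offset.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None                            # key sorts before the entire segment
    return sparse_index[i][1]

print(scan_start("handiwork"))                 # 0: decompress/scan the block at offset 0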

  • How do we keep key-value pairs sorted in memory? We can use any self-balancing binary search tree (AVL, Red-Black tree, etc.):

    1. Write the key-value pair into the in-memory balanced tree (the memtable)
    2. When the tree size exceeds a pre-defined threshold, write it out to disk as a sorted segment; meanwhile, new writes go into a fresh in-memory tree.
    3. To read some data, first search its key in the memtable, then in the most recent segment on disk, then the next-older segment, and so on.
    4. From time to time, merge and compact segments in the background
  • SSTables have a drawback: when the database crashes, the tree in RAM (the memtable) is lost.

    → To remedy this, we can keep a write-ahead log on disk and use it to restore the memtable after a restart. A minimal sketch of the whole write/read path follows below.
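
Putting the pieces together, a minimal sketch of this write/read path, using a plain dict plus sorting as a stand-in for the balanced tree (all names and the CSV-style WAL format are illustrative):

import bisect

class MiniLSM:
    def __init__(self, threshold=4):
        self.memtable = {}                     # stand-in for the balanced BST
        self.segments = []                     # sorted segments on "disk", newest first
        self.threshold = threshold
        self.wal = open("wal.log", "a")        # write-ahead log for crash recovery

    def set(self, key, value):
        self.wal.write(f"{key},{value}\n")     # log first, so a crash loses nothing
        self.wal.flush()
        self.memtable[key] = value             # 1. write into the in-memory tree
        if len(self.memtable) >= self.threshold:
            self.flush()                       # 2. tree too big: spill to disk

    def flush(self):
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}                     # start a fresh memtable
        self.wal.truncate(0)                   # flushed data no longer needs the WAL

    def get(self, key):
        if key in self.memtable:               # 3. memtable first...
            return self.memtable[key]
        for segment in self.segments:          # ...then segments, newest to oldest
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None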

  • Databases built on the concept of SSTables are typically called Log-Structured Merge-Trees (LSM-Trees); examples include LevelDB and RocksDB.

    https://leveldb-handbook.readthedocs.io/zh/latest/sstable.html

Compaction Strategy

Different timing and strategies for SSTable compaction affect read and write performance (a rough sketch of the size-tiered trigger follows the list below)

  1. size-tiered compaction → HBase, Cassandra
    1. triggered when the system has enough (four by default) similarly sized SSTables, which are then merged into one larger SSTable

      [figure: size-tiered compaction merging similarly sized SSTables into a larger one]

    • Good Write Amplification, Bad Space Amplification, Bad Read Amplification (a key may have copies in several tiers, and merges temporarily need room for both input and output)

  2. leveled compaction → LevelDB, RocksDB, Cassandra
    1. the system uses small, fixed-size (by default 160 MB) SSTables distributed across different levels.

      [figure: leveled compaction with fixed-size SSTables distributed across levels]

    • Bad Write Amplification, Good Space Amplification, Good Read Amplification (each key appears in at most one SSTable per level, at the cost of more rewriting during compaction)
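
A rough sketch of the size-tiered trigger: bucket SSTables by power-of-two size and compact a bucket once it holds enough tables (this bucketing rule is a simplification of what Cassandra actually does):

import math
from collections import defaultdict

def pick_compaction_candidates(sstable_sizes_mb, min_threshold=4):
    buckets = defaultdict(list)
    for size in sstable_sizes_mb:
        tier = int(math.log2(max(size, 1)))    # similarly sized tables share a tier
        buckets[tier].append(size)
    for tier in sorted(buckets):
        if len(buckets[tier]) >= min_threshold:
            return buckets[tier]               # merge these into one bigger SSTable
    return None                                # nothing to compact yet

print(pick_compaction_candidates([10, 11, 9, 12, 300]))   # [10, 11, 9, 12]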

https://opensource.docs.scylladb.com/stable/architecture/compaction/compaction-strategies.html

https://github.com/facebook/rocksdb/wiki/universal-compaction

Bloom Filters

The LSM-Tree algorithm is slow when looking up a key that does not exist, since we have to check the memtable and then every segment, which may mean reading them all back from disk. This is where Bloom filters come into play: a Bloom filter can tell us definitively that a key does not exist (it can return false positives, but never false negatives), letting us skip the disk reads entirely.
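
A minimal Bloom filter sketch; deriving two hash functions from Python’s built-in hash is a toy choice (real implementations use proper hash families):

class BloomFilter:
    def __init__(self, nbits=1024):
        self.nbits = nbits
        self.bits = 0                          # a big int used as a bit array

    def _positions(self, key):
        # Two toy hash functions; the "salt" trick is purely illustrative.
        return (hash(key) % self.nbits, hash(key + "salt") % self.nbits)

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means only *maybe* present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("jimmy")
print(bf.might_contain("jimmy"))               # True
print(bf.might_contain("missing"))             # almost certainly False: skip the disk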

B-Trees

The most commonly used index is not the SSTable but the B-Tree. B-Trees update keys in place, while LSM-Trees append new entries and compact them later.

While SSTables store data in variable-size segments (typically several megabytes), B-Trees store data in fixed-size blocks or pages, traditionally 4 KB in size.

B-Trees are designed for fast comparisons (branching) in the CPU and for low disk I/O (hence the many entries packed into a single node). A lookup sketch follows the figure below.

[figure: a B-tree, with interior pages branching down to leaf pages]
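
A sketch of a B-tree lookup over in-memory stand-ins for disk pages (the node layout and names are illustrative):

import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys                       # sorted keys within this page
        self.children = children               # child pages; None for a leaf
        self.values = values                   # values are stored only in leaves

def btree_get(page, key):
    while page.children is not None:           # interior page: pick a branch
        i = bisect.bisect_right(page.keys, key)    # fast in-CPU comparison
        page = page.children[i]                # in a real B-tree: one disk read per level
    i = bisect.bisect_left(page.keys, key)     # leaf page: look for an exact match
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page(["ann", "bob"], values=[1, 2])
leaf2 = Page(["cory", "dawn"], values=[3, 4])
root = Page(["cory"], children=[leaf1, leaf2])
print(btree_get(root, "bob"))                  # 2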

B-tree implementations include an additional data structure on disk: the write-ahead log (WAL), an append-only file to which every B-tree modification must be written before it is applied to the pages of the tree itself.

B-trees handle concurrency by protecting the tree with latches (lightweight locks).

https://catkang.github.io/2022/01/27/btree-lock.html?spm=a2c6h.13046898.publish-article.32.65f86ffaNeaCYy

For B-trees, it is hard to keep adjacent leaf pages in sequential order on disk as pages split over time, while this comes naturally to LSM-trees, which rewrite whole segments during compaction.

B-trees vs. LSM-trees

LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.

Since LSM-trees write compact SSTable files sequentially while B-trees overwrite several pages in place (random writes), LSM-trees typically have higher write throughput, especially on magnetic hard drives, where sequential writes are much faster than random ones.

LSM-trees have lower storage overhead, because periodic rewriting of SSTables removes duplicates and there is no page fragmentation.

The downside of LSM-trees is the background compaction job, which occupies disk bandwidth and may interfere with incoming read and write operations.

Full-text Search and Fuzzy Index

Full-text search engines use an SSTable-like structure for the term dictionary. The in-memory index is a finite state automaton over the characters in the keys, similar to a trie, which can be transformed into a Levenshtein automaton to find words within a given edit distance.
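
Engines such as Lucene compile the query into a Levenshtein automaton and intersect it with the term dictionary; as a brute-force stand-in for that automaton, this sketch simply filters sorted terms by a dynamic-programming edit distance:

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

terms = ["handbag", "handcuffs", "handful", "handprinted"]   # sorted term dictionary
query, max_edits = "handbags", 1
print([t for t in terms if edit_distance(t, query) <= max_edits])   # ['handbag']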

OLTP & OLAP

OLTP (Online Transaction Processing) and OLAP (Online Analytic Processing) systems have very different data access patterns; the indexing algorithms covered so far work well for OLTP but not for OLAP.

[table: comparing the characteristics of OLTP and OLAP systems]

Extract-Transform-Load (ETL): the pipeline that moves data from OLTP systems into an OLAP system (data warehouse).

[figure: ETL pipeline from OLTP databases into a data warehouse]

Star Schema

To process data efficiently in an OLAP system, the star schema (or snowflake schema) is applied: a central fact table holds the events, with foreign keys into the surrounding dimension tables.

[figure: star schema with a central fact table referencing dimension tables]

Parquet: column-oriented storage