Chapter 3: Storage and Retrieval
Log-Structured Storage Engines
Simplest DB: append-only data file
- Many databases use a log, i.e., an append-only data file.
- Real databases have more issues to deal with:
- concurrency control
- maintaining disk space so the log doesn’t grow forever
- handling errors and partially written records
- Pros: writes are very fast (sequential appends at the end of the file)
- Cons: reads are slow (with many records, every lookup has to scan the file from the beginning)
Hash Indexes
- Speed up read operations by using a hash table that maps each key to a byte offset in the data file (see the sketch below)
- Keep in mind that the hash table must fit within RAM; otherwise, disk I/O will be needed just to locate the byte offset, which slows down lookups
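A minimal sketch of the idea in Python (the class name and the comma-separated record format are made up for illustration; real engines such as Bitcask use a binary format):

```python
import os

class HashIndexLog:
    """Append-only log plus an in-memory hash index that maps each key
    to the byte offset of its most recent record."""

    def __init__(self, path):
        self.index = {}                  # key -> byte offset in the file
        self.f = open(path, "a+b")

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())  # toy record format
        self.f.flush()
        self.index[key] = offset         # O(1) index update

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)              # one seek instead of a full scan
        _, value = self.f.readline().decode().rstrip("\n").split(",", 1)
        return value
```

Writes stay append-only; reads cost a single seek, but the dict holding every key must fit in RAM.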
Compaction
- When writing data to disk, we break the log into segments (closing a segment file once it reaches a certain size), and compact within a segment to save disk space.
- Compacting a single segment is intuitive: simply merge the records that share a key and keep only the newest one.
- After a segment is compacted, we can follow the same process and merge two segments into a new one (a minimal sketch follows).
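A toy sketch of both steps in Python, with segments modeled as in-memory lists of (key, value) records written oldest-first (a real engine would stream records from segment files on disk):

```python
def compact(segment):
    """Compact one segment: keep only the most recent value per key.
    Records appear in write order, so later writes overwrite earlier ones."""
    latest = {}
    for key, value in segment:
        latest[key] = value
    return list(latest.items())

def merge(older, newer):
    """Merge two compacted segments; on a key collision, the value
    from the newer segment wins."""
    merged = dict(older)
    merged.update(newer)
    return list(merged.items())

# toy usage
assert dict(compact([("cat", "1"), ("dog", "2"), ("cat", "3")])) == {"cat": "3", "dog": "2"}
```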
Implementation Details
- Delete Operation
- Use a tombstone (a special deletion record) to mark records for deletion, and remove them during the compaction process
- Crash Recovery
- Store a snapshot of the hash table on disk and reload it upon machine restart
- Detecting Partial Writes
    - Use checksums to detect (and discard) partially written records; see the record-format sketch after this list
- Concurrency Control
- Allow only one write thread, while permitting multiple read threads for improved concurrency
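One way the checksum idea can be realized, sketched in Python; the [length | CRC32 | payload] framing here is an illustrative assumption, not a format from the book:

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    # frame: [4-byte length][4-byte CRC32][payload]
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def decode_record(buf: bytes):
    """Return (payload, bytes_consumed), or None if the record is
    truncated or corrupted (e.g. a partial write before a crash)."""
    if len(buf) < 8:
        return None                      # header itself is incomplete
    length, crc = struct.unpack(">II", buf[:8])
    payload = buf[8:8 + length]
    if len(payload) < length or zlib.crc32(payload) != crc:
        return None                      # torn/corrupted record: discard it
    return payload, 8 + length
```

On recovery, the log is scanned record by record and the scan stops at the first record that fails to decode.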
The append-only file approach is fast because sequential writes are much quicker than random writes on disk. Additionally, since the append-only log records writes in chronological order and existing records are never modified, concurrency control and crash recovery are much simpler.
However, this approach has some drawbacks:
- The hash table must fit entirely in memory
- Range queries are slow, as they require scanning all data (e.g., summing values for keys within a range like 10 to 1000)
Sorted String Tables (SSTables) and Log-Structured Merge-Trees (LSM-Trees)
- Range queries on an append-only data file are slow; what can we do to speed them up?
    → Keep each segment sorted by key so lookups can use binary search; what you get is an SSTable.
- In SSTables, keys are sorted within each segment, and in each merged segment each key appears only once.
- The merge process is similar to a classic merge sort: keep the value from the most recent segment (assume we always merge adjacent segments); see the sketch at the end of this section.
- There's no need to keep the whole hash table in memory anymore: since the keys are sorted, we only need to keep a small subset of them, which means the index can be sparse.
- Because the index is sparse, a read query has to scan over several keys in a given range anyway; thus we can compress each such chunk before writing it to disk, which not only saves disk space but also reduces the I/O bandwidth used.
- How do we maintain sorted key-value pairs in memory? We can use any self-balancing binary search tree (AVL, red-black tree, etc.):
    - Write key-value pairs into the in-memory balanced tree
    - When the tree size exceeds a pre-defined threshold, write it out to disk as a segment; meanwhile, new writes continue into a fresh in-memory tree
    - To read some data, first search for its key in RAM, then in the most recent segment on disk, then the next-older one, and so on
    - Perform merging in the background from time to time
- SSTables have a drawback: when the database crashes, the tree in RAM (the memtable) is lost.
    → To remedy this, keep a write-ahead log on disk and use it to restore the memtable on restart.
- A DB built on the concept of SSTables is typically called a Log-Structured Merge-Tree (LSM-Tree); such DBs include LevelDB and RocksDB.
https://leveldb-handbook.readthedocs.io/zh/latest/sstable.html
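Pulling these pieces together, a toy sketch (the class name `TinyLSM` is made up for illustration; a dict stands in for the balanced tree, and "segments" are sorted in-memory lists rather than files on disk):

```python
import bisect
import heapq

class TinyLSM:
    """Toy LSM-tree: memtable in RAM, flushed to sorted segments."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}    # stand-in for an AVL / red-black tree
        self.segments = []    # flushed segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: write the keys out in sorted order as a new segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1) check RAM first
            return self.memtable[key]
        for seg in reversed(self.segments):      # 2) newest segment first
            i = bisect.bisect_left(seg, (key,))  # keys are sorted: binary search
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None

    def compact(self):
        """Merge all segments, merge-sort style; the newest value wins.
        heapq.merge preserves input order on equal keys, so entries from
        newer segments overwrite older ones in the dict."""
        merged = {}
        for k, v in heapq.merge(*self.segments, key=lambda kv: kv[0]):
            merged[k] = v
        self.segments = [list(merged.items())]

# toy usage
db = TinyLSM(memtable_limit=2)
db.put("a", "1"); db.put("b", "2")   # second put triggers a flush
db.put("a", "9")                     # newer value lives in the memtable
assert db.get("a") == "9" and db.get("b") == "2"
```

A real implementation would also append each put to the write-ahead log before touching the memtable, so the memtable can be rebuilt after a crash.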
Compaction Strategy
The timing and strategy of SSTable compaction affect read and write performance:
- size-tiered compaction → HBase, Cassandra
    - triggered when the system has enough (four by default) similarly sized SSTables (see the sketch at the end of this section)
    - Good Write Amplification, Bad Space Amplification, Bad Read Amplification
- leveled compaction → LevelDB, RocksDB, Cassandra
- the system uses small, fixed-size (by default 160 MB) SSTables distributed across different levels.
    - Bad Write Amplification, Good Space Amplification, Good Read Amplification
https://opensource.docs.scylladb.com/stable/architecture/compaction/compaction-strategies.html
https://github.com/facebook/rocksdb/wiki/universal-compaction
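A toy sketch of the size-tiered trigger described above; the bucketing rule and the 0.5/1.5 factors loosely mirror Cassandra's STCS options, but the function itself is an assumption for illustration:

```python
def pick_size_tiered_bucket(sizes, threshold=4, low=0.5, high=1.5):
    """Group SSTables of similar size into buckets; return the first
    bucket holding at least `threshold` tables (those get compacted)."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:   # "similarly sized"
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= threshold:
            return bucket
    return None

# four ~10 MB tables form a bucket and trigger compaction; the 500 MB one waits
assert pick_size_tiered_bucket([10, 11, 9, 10, 500]) == [9, 10, 10, 11]
```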
Bloom Filters
The LSM-tree algorithm is slow when looking up a key that does not exist, since we have to check all segments, which may mean reading them all back from disk. This is where Bloom filters come into play: a Bloom filter can tell us for certain that a key does not exist (it never gives false negatives, only occasional false positives), so we can skip the disk reads entirely.
- What is a Bloom filter? (a minimal sketch follows)
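The sketch below is illustrative only; the bit-array size and hashing scheme are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per key. A zero bit on lookup proves
    the key was never added (no false negatives); all bits set may still
    be a false positive."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("cat")
assert bf.might_contain("cat")   # no false negatives, ever
# might_contain("dog") is almost certainly False -> skip the disk lookup
```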
B-Trees
The most commonly used index is not the SSTable but the B-Tree. B-Trees update keys in place, while LSM-Trees append new values for keys and compact them later.
While SSTables store data in variable-size segments (typically several megabytes), B-Trees store data in fixed-size blocks or pages, traditionally 4 KB in size.
B-Trees are designed for fast comparisons (branching) on the CPU and low disk I/O (hence the many entries in a single node).
- What is a B-Tree? (a minimal lookup sketch follows)
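A sketch of the lookup path, with a `Node` object standing in for a fixed-size on-disk page (a real engine would follow page references and read 4 KB pages from disk at each level):

```python
import bisect

class Node:
    """One B-tree node, standing in for a fixed-size page. `keys` are
    sorted; in internal nodes, children[i] covers keys below keys[i]
    and the last child covers the rest."""
    def __init__(self, keys, values=None, children=None):
        self.keys = keys
        self.values = values      # leaf pages hold the values
        self.children = children  # internal pages hold child references

def btree_get(node, key):
    while node.children is not None:           # descend internal pages
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]                # one page read per level
    i = bisect.bisect_left(node.keys, key)     # binary search in the leaf
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# toy tree: a root with one separator key over two leaf pages
leaf1 = Node(keys=["a", "c"], values=[1, 3])
leaf2 = Node(keys=["m", "x"], values=[13, 24])
root = Node(keys=["m"], children=[leaf1, leaf2])
assert btree_get(root, "x") == 24 and btree_get(root, "b") is None
```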
B-tree implementations include an additional data structure on disk: the write-ahead log (WAL), an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.
B-trees handle concurrency by protecting the tree with latches (lightweight locks).
For B-trees it is hard to keep adjacent leaf pages in sequential order on disk, whereas this is easy for LSM-trees, since segments are rewritten in one go during merging.
B-Trees vs. LSM-Trees
LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.
Since LSM-trees sequentially write compact SSTable files while B-trees overwrite several pages (random writes), LSM-trees typically have higher write throughput, especially on magnetic hard drives, where sequential writes are much faster than random writes.
LSM-trees have lower storage overhead because they periodically rewrite SSTables and suffer no page fragmentation.
The downside of LSM-trees is the background compaction job, which consumes disk bandwidth and may interfere with incoming write operations.
Full-text Search and Fuzzy Index
An SSTable-like structure is used for the term dictionary (e.g., in Lucene). The in-memory index is a finite state automaton over the characters in the keys, similar to a trie, which can be transformed into a Levenshtein automaton to support fuzzy search within a given edit distance.
OLTP & OLAP
OLTP and OLAP have very different data access patterns; the indexing algorithms covered so far work well for OLTP but not for OLAP.
Extract-Transform-Load (ETL): the process of moving data from OLTP (Online Transaction Processing) systems into an OLAP (Online Analytic Processing) system.
Star Schema
To process data efficiently in an OLAP system, a star schema (or snowflake schema) centered on a fact table is used.
Parquet: column-oriented storage