Chapter 4: Encoding and Evolution
Introduction
First of all, what are encoding (serialization) and decoding (deserialization)?
According to Wikipedia:
In short:
- Before serialization
- Data exists as complex in-memory structures specific to a particular programming language or system.
- After serialization
- Data is converted into a standardized format (like JSON, XML, or binary formats) that can be easily stored, transmitted, or understood by different systems.
Note that all data is represented as binary 0s and 1s before being sent across the network; before transmission, those bits are modulated into waveforms, a process that is also called encoding.
Language built-in encoding
- Java → java.io.Serializable, Python → pickle, Ruby → Marshal
- less efficient
- language lock-in
- lack of forward and backward compatibility
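As a quick sketch of the language lock-in problem, the bytes that `pickle` produces are Python-specific, so a Java or Go service cannot easily decode them (the record below is illustrative):

```python
import pickle

# Language built-in encoding: convenient, but the resulting byte stream
# is Python-specific and tied to pickle's own protocol.
record = {"name": "henhen", "active": True}
blob = pickle.dumps(record)        # serialize to Python-specific bytes
restored = pickle.loads(blob)      # only Python can readily deserialize this
print(restored == record)          # round-trip works within Python
```

The round trip works, but only within Python, which is exactly the lock-in the bullet points describe.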
JSON & XML & CSV
- textual format, human-readable, but not space-efficient
- https://github.com/nlohmann/json
- https://pkg.go.dev/encoding/json
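For contrast, a textual format like JSON is human-readable and language-independent, at the cost of extra bytes (the record here is illustrative):

```python
import json

# Textual encoding: readable and portable, but verbose — every key name
# and every quote/brace is spelled out in the byte stream.
record = {"name": "henhen", "active": True}
text = json.dumps(record)
print(text)                          # a human-readable string
print(len(text.encode("utf-8")))     # the byte count of the textual form
```

Any language with a JSON library can parse `text`, which is the main reason these formats stay popular despite the space overhead.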
Binary Encoding
Binary encodings consume less space than textual formats. Here are some examples:
MessagePack
This JSON object is encoded into the following byte stream:
Let’s break it down:
- `82` means an object (map) with 2 fields
- `a6` means a 6-byte long string follows
- `68 65 6e 68 65 6e` means `henhen`
- `a6` means a 6-byte long string follows
- `70 65 72 73 6f 6e` means `person`
- `a6` means a 6-byte long string follows (the 6 bytes of the second key are not shown here)
- `c3` means `true`
This byte stream only takes up 23 bytes, compared to the original 33-byte JSON object, roughly a 30% reduction in space.
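The breakdown above can be checked with a tiny decoder for just the MessagePack types it uses (fixmap, fixstr, booleans). The notes omit the bytes of the second key, so this sketch assumes a hypothetical 6-byte key `online`:

```python
def decode(data: bytes):
    """Decode a tiny MessagePack subset: fixmap, fixstr, true/false."""
    pos = 0

    def read():
        nonlocal pos
        b = data[pos]
        pos += 1
        if 0x80 <= b <= 0x8F:          # fixmap: low 4 bits = entry count
            out = {}
            for _ in range(b & 0x0F):
                key = read()           # key first, then value
                out[key] = read()
            return out
        if 0xA0 <= b <= 0xBF:          # fixstr: low 5 bits = byte length
            n = b & 0x1F
            s = data[pos:pos + n].decode("utf-8")
            pos += n
            return s
        if b == 0xC3:                  # boolean true
            return True
        if b == 0xC2:                  # boolean false
            return False
        raise ValueError(f"unsupported type byte: {b:#x}")

    return read()

# 82 = map with 2 fields; a6 = 6-byte string; c3 = true.
# "online" is an assumed second key — the original notes elide those bytes.
stream = bytes.fromhex("82" "a6" "68656e68656e" "a6" "706572736f6e"
                       "a6" "6f6e6c696e65" "c3")
print(decode(stream))  # {'henhen': 'person', 'online': True}
print(len(stream))     # 23
```

Counting confirms the 23 bytes: 1 (map header) + 7 + 7 for the first key/value pair, plus 7 + 1 for the second.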
Apache Thrift (Facebook)
Thrift requires schema for any data encoded, which can be defined using Thrift interface definition language (IDL).
Thrift has two different binary encoding formats: Binary Protocol & Compact Protocol.
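As a sketch, a schema in Thrift IDL might look like this (field names are illustrative; the numbers are the field tags):

```thrift
struct Person {
  1: required string userName,
  2: optional i64    favoriteNumber,
  3: optional list<string> interests
}
```

Encoded records carry only these numeric tags, not the field names, which is a large part of the space savings over JSON.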
Notice the field tag, which is defined in the schema
In the Compact Protocol, the field tag and type are packed into a single byte, which further decreases the total bytes used. Also notice that instead of using 8 bytes to represent 1137, the Compact Protocol encodes 1137 in only 2 bytes.
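The 2-byte result comes from variable-length integers combined with zigzag encoding, which is the general technique the Compact Protocol uses for integers; a minimal sketch:

```python
def zigzag(n: int) -> int:
    # Map signed integers to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 payload bits per byte; the high bit is set while more bytes follow.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

encoded = varint(zigzag(1137))
print(encoded.hex())   # e211
print(len(encoded))    # 2 — versus 8 bytes for a fixed-width i64
```

Any value that fits in 14 payload bits lands in 2 bytes this way, so small numbers stay cheap regardless of the declared integer width.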
Protocol Buffers (protobuf) (Google)
Protocol Buffers encode data similarly to Thrift's Compact Protocol.
In Protocol Buffers, whether a field is marked required or optional does not affect the encoding; it simply acts as a runtime check for catching bugs.
Protocol Buffers vs. Thrift
- Protocol Buffers do not have a list or array datatype; instead, there is a "repeated" marker for fields (a third option alongside "required" and "optional")
- Thrift has a dedicated list datatype
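As a sketch, the difference shows up in the schema: in Protocol Buffers (proto2 syntax, where required/optional still exist) a list is expressed with the repeated marker rather than a list type (field names are illustrative):

```protobuf
message Person {
  required string user_name       = 1;
  optional int64  favorite_number = 2;
  repeated string interests       = 3;  // no dedicated list type
}
```

A Thrift schema would instead declare the same field as `list<string> interests`.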
Schema Evolution
Schemas will inevitably change over time, which raises two problems:
- Forward Compatibility
- How does old code read new data generated by new schema?
- Backward Compatibility
- How does new code read stored data generated by old schema?
Forward compatibility is simpler: when old code reads a field it does not recognize, it simply ignores it.
Backward compatibility requires us programmers to be careful: when adding a new field, the new field must be marked as optional (or have a default value), not required, because if it were required, new code reading an old record would fail to find the corresponding new field, and fail.
What about removing a field? Removing a field is just like adding a field, with backward and forward compatibility concerns reversed. That means you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again (because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code).
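The backward-compatibility rule can be sketched in plain Python; here a hypothetical `nickname` field is added in the new schema with a default, so records written by old code still parse:

```python
import json
from dataclasses import dataclass

# Backward compatibility: new code reading old data. The new field must be
# optional (have a default) — otherwise old records would fail to load.
@dataclass
class PersonV2:
    user_name: str
    nickname: str = ""   # new field, with a default value

old_record = '{"user_name": "henhen"}'   # written by old code, no nickname
person = PersonV2(**json.loads(old_record))
print(person.nickname)  # the default kicks in instead of a crash
```

Had `nickname` been declared without a default (the dataclass analogue of "required"), constructing `PersonV2` from the old record would raise a `TypeError`, which is exactly the failure mode described above.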
Apache Avro
a good fit for Hadoop
Modes of Dataflow
Here we're gonna investigate three different modes of sending and receiving data, simply put: who sends encoded data (e.g. Protocol Buffers messages) to whom.
databases
a process writes data to a database, essentially sending a message to its future self (or to other processes)
this is summarized as data outlives code, so be careful
service calls(REST and RPC)
client & server
server can be a client to other services → service-oriented architecture(SOA) → microservices architecture
REST: a design philosophy that builds upon the principles of HTTP
SOAP: XML-based protocol
not human-readable, relies on automated tooling; less common, and less favored by small companies
RPCs(Remote Procedure Calls)
https://github.com/twitter/finagle
https://github.com/linkedin/rest.li
we can assume that all servers will be updated first, and all clients second; thus, you only need backward compatibility on requests, and forward compatibility on responses
it's hard to force clients to upgrade (you have no control over them), so servers tend to maintain multiple versions of the service API
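Backward compatibility on requests can be sketched as a server-side default: a newly deployed handler must still accept requests from old clients that omit new fields (`page_size` here is a hypothetical field added in a newer API version):

```python
def handle_request(payload: dict) -> dict:
    # `page_size` was added in a newer API version; old clients won't
    # send it, so the server falls back to a default instead of erroring.
    page_size = payload.get("page_size", 20)
    return {"items": [], "page_size": page_size}

print(handle_request({"query": "books"}))                  # old client
print(handle_request({"query": "books", "page_size": 5}))  # new client
```

The mirror-image rule applies to responses: old clients must tolerate (ignore) new fields the upgraded server adds.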
asynchronous message passing
the client's request (message) is sent to a message broker (also called a message queue or message-oriented middleware), and the client does not expect a response to this message (hence asynchronous)
a client (a process) sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of, or subscribers to, that queue or topic
the consumer can itself send messages to another queue
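The broker pattern can be sketched with an in-process stand-in, using a plain `queue.Queue` as a hypothetical broker with one named topic; the producer fires and forgets, and a consumer picks the message up later:

```python
import queue
import threading

# A hypothetical in-process "broker" with a single named topic.
broker = {"orders": queue.Queue()}

def producer():
    # Fire-and-forget: send to the topic and do NOT wait for a response.
    broker["orders"].put({"event": "order_created", "id": 42})

def consumer(results):
    # A subscriber to the "orders" topic picks the message up later.
    msg = broker["orders"].get()
    results.append(msg)

results = []
producer()
t = threading.Thread(target=consumer, args=(results,))
t.start()
t.join()
print(results)  # [{'event': 'order_created', 'id': 42}]
```

A real broker (RabbitMQ, Kafka, etc.) adds durability, routing, and delivery guarantees, but the decoupling shown here — sender and receiver never talk directly — is the core idea.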
Actor Model: a programming model for concurrency in a single process
one actor communicates with other actors by sending and receiving asynchronous messages
distributed actor framework: Actor Model + message broker
https://learn.microsoft.com/en-us/dotnet/orleans/resources/best-practices
When to use RPC and when to use REST.