I recently tried to resolve a JRuby issue involving Marshal. I’ve used Marshal before, but never needed to pay attention to the actual bytes written to disk. I decided to write up what I learned in the process.
Version number
0408
I collected this data using Ruby 1.9.3p327, which has Marshal version 4.8. The version number is encoded with two bytes, one each for the major and minor version. This version number precedes all dumps and I’ll ignore it for the rest of this post.
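To check which version your own Ruby writes, you can compare the first two bytes of any dump against the constants Ruby exposes. This is only a quick sketch; the variable names are mine.

```ruby
# Every dump starts with the major and minor version, one byte each.
header  = Marshal.dump(nil).bytes.first(2)
version = [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]

puts header.inspect   # the two leading bytes of the dump
puts version.inspect  # the version this Ruby reports
```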
Nil, true, false
nil
0408 30
The typecode 30 is ASCII "0".
true
0408 54
The typecode 54 is ASCII "T".
false
0408 46
The typecode 46 is ASCII "F".
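A small sketch to confirm the three typecodes: it takes the byte right after the two-byte version header and turns it back into a character. The variable name `typecodes` is just for illustration.

```ruby
# The typecode is the third byte of the dump (index 2), after the version.
typecodes = [nil, true, false].map do |value|
  Marshal.dump(value).bytes.to_a[2].chr
end

puts typecodes.inspect
```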
Integers (easy)
0
0408 6900
The typecode 69 is ASCII "i". The typecode is followed by the value of the integer. Zero is represented as 00.
1
0408 6906
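Here we see that the encoded value for one is 06, not 01 as we might expect at first. Small nonzero integers are stored with an offset of 5 (added for positive values, subtracted for negative ones), which leaves the bytes 01 through 04 and their negative counterparts free to signal a longer, multi-byte integer. As a result, any integer in the range -123 <= x <= 122 can be encoded in just one byte.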
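The one-byte rule can be sketched in a few lines. This is my own reconstruction for illustration, not code from Ruby itself, and `small_int_byte` is a made-up name; it only handles the one-byte range.

```ruby
# Returns the single byte (0..255) that follows the "i" typecode
# for integers in the one-byte range -123..122.
def small_int_byte(n)
  raise ArgumentError, "needs more than one byte" unless (-123..122).cover?(n)
  return 0 if n.zero?

  shifted = n > 0 ? n + 5 : n - 5  # offset of 5, away from zero
  shifted & 0xff                   # two's-complement byte for negatives
end

puts format("%02x", small_int_byte(1))     # matches the 06 seen above
puts format("%02x", small_int_byte(-123))  # the most negative one-byte value
```

Comparing the result against the fourth byte of `Marshal.dump(n)` for every `n` in the range is an easy way to check the sketch.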
Arrays
[]
0408 5b00
The typecode 5b is ASCII "[". The typecode is followed by the number of elements in the array.
[1]
0408 5b06 6906
The number of items in the array is encoded in the same form as an integer value, but without the "i" typecode, so 06 here means one element. Each value in the array is encoded sequentially after the size of the array.
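One way to convince yourself of the layout is to assemble the bytes by hand and let Ruby load them. The payload below is my own example for the array [1, 2, 3], under the small-integer encoding described earlier.

```ruby
# Hand-assembled dump of [1, 2, 3].
bytes = [
  0x04, 0x08,  # version 4.8
  0x5b, 0x08,  # "[" followed by the length 3, stored as 3 + 5
  0x69, 0x06,  # i 1
  0x69, 0x07,  # i 2
  0x69, 0x08   # i 3
]

payload = bytes.pack("C*")
loaded  = Marshal.load(payload)
puts loaded.inspect
```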
Hashes
{}
0408 7b00
The typecode 7b is ASCII "{". The typecode is followed by the number of (key, value) pairs in the hash.
{1 => 2}
0408 7b06 6906 6907
Like arrays, the number of items in the hash is encoded in the same form as integers. Each pair of (key, value) is encoded sequentially after the size of the hash.
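To see the pairs laid out in order, here is a rough sketch that walks the bytes of a dumped hash. It assumes every key and value is a small positive integer (so each one takes exactly two bytes); it is not a general parser.

```ruby
dump = Marshal.dump({ 1 => 2, 3 => 4 })
body = dump.bytes.to_a.drop(2)  # skip the two version bytes

# First byte is the typecode, second is the pair count, the rest are pairs.
typecode, count, *pairs = body

# Each pair is four bytes here: "i", key, "i", value (small positive ints only).
decoded = pairs.each_slice(4).map { |_, key, _, value| [key - 5, value - 5] }

puts typecode.chr      # the hash typecode
puts count - 5         # number of pairs
puts decoded.inspect   # the pairs, in insertion order
```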
Symbols
:hello
0408 3a0a 6865 6c6c 6f
The typecode 3a is ASCII ":". The typecode is followed by the length of the symbol name and then the symbol name itself, encoded as UTF-8.
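As a sanity check, a short symbol can be encoded by hand and compared with what Marshal produces. This is a sketch with a made-up helper name, `dump_short_symbol`, and it deliberately handles only ASCII names of 1 to 122 bytes, where the length fits in a single byte.

```ruby
# Builds the dump for a short ASCII symbol: version, ":", length, name.
def dump_short_symbol(sym)
  name = sym.to_s
  unless name.ascii_only? && (1..122).cover?(name.bytesize)
    raise ArgumentError, "sketch handles 1..122 ASCII bytes only"
  end

  # The length uses the same small-integer form as above (value + 5).
  [4, 8, ":".ord, name.bytesize + 5].pack("C*") + name.b
end

manual = dump_short_symbol(:hello)
puts manual == Marshal.dump(:hello)
puts Marshal.load(manual).inspect
```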
Symlinks
When a symbol is repeated multiple times, the Marshal encoding allows subsequent instances to reference the first instance to save space in the stream.
[:hello, :hello]
0408 5b07 3a0a 6865 6c6c 6f3b 00
The typecode 3b is ASCII ";". The typecode is followed by the position of the symbol in the cache table. This table is indexed from zero, in the order in which each symbol first appeared.
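With two distinct symbols the table indexes become visible. In this sketch (my own example), the second :b refers to index 1 and the second :a to index 0; note that the index is written in the same small-integer form as everything else, so index 1 shows up as 06.

```ruby
dump = Marshal.dump([:a, :b, :b, :a])

expected = [
  0x04, 0x08,        # version 4.8
  0x5b, 0x09,        # "[" with length 4 (4 + 5)
  0x3a, 0x06, 0x61,  # :a, first appearance, table index 0
  0x3a, 0x06, 0x62,  # :b, first appearance, table index 1
  0x3b, 0x06,        # ";" pointing at index 1 (:b)
  0x3b, 0x00         # ";" pointing at index 0 (:a)
]

puts dump.bytes.to_a == expected
```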
The rest
There’s a lot more to the Marshal format; I haven’t even covered strings yet! You can find more at the next post in this series, or jump right to the last post.
How to explore on your own
To generate the examples for this post, I hacked up a quick helper in irb:
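A minimal sketch of a helper along those lines follows. The name `marshal_hex` and the four-digit grouping are my assumptions, chosen to match the dumps shown in this post.

```ruby
# Dumps an object and prints the bytes as hex, grouped four digits at a time.
def marshal_hex(object)
  hex = Marshal.dump(object).unpack("H*").first
  hex.scan(/.{1,4}/).join(" ")
end

puts marshal_hex(nil)
puts marshal_hex([:hello, :hello])
```

Paste it into irb and call `marshal_hex` with whatever object you are curious about.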