Jake Goulding

A little dip into Ruby's Marshal format

I recently tried to resolve a JRuby issue involving Marshal. I’ve used Marshal before, but never needed to pay attention to the actual bytes written to disk. I decided to write up what I learned in the process.

Version number

0408

I collected this data using Ruby 1.9.3p327, which has Marshal version 4.8. The version number is encoded with two bytes, one each for the major and minor version. This version number precedes all dumps and I’ll ignore it for the rest of this post.
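As a quick check, the version bytes can be read straight off any dump and compared with the constants Ruby exposes. This is a minimal sketch; the variable name `header` is just mine.

```ruby
# Every dump starts with the major and minor version bytes.
header = Marshal.dump(nil).bytes.first(2)

p header                                             # [4, 8]
p [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]   # the same numbers, as constants
```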

Nil, true, false

nil

0408 30

The typecode 30 is ASCII 0.

true

0408 54

The typecode 54 is ASCII T.

false

0408 46

The typecode 46 is ASCII F.
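To see the typecodes without a hex dump, here is a small sketch that picks out the byte after the version header. The `typecode` helper is a hypothetical name of my own, not part of Marshal.

```ruby
# Hypothetical helper: the byte after the two-byte version header is the typecode.
def typecode(obj)
  Marshal.dump(obj)[2]
end

p typecode(nil)    # "0"
p typecode(true)   # "T"
p typecode(false)  # "F"
```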

Integers (easy)

0

0408 6900

The typecode 69 is ASCII i. The typecode is followed by the value of the integer. Zero is represented as 00.

1

0408 6906

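Here we see that the encoded value for one is 06, not 01 as we might expect at first. Nonzero small integers are stored with an offset of 5 (positive values as x + 5, negative values as x - 5), because the byte values closest to zero are reserved to mark longer, multi-byte integers. This lets any integer in the range -123 <= x <= 122 fit in a single byte.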
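The one-byte rule can be sketched in a few lines of Ruby. `encode_small_int` is a hypothetical helper of mine that only handles the one-byte range; it is checked against the real encoder rather than meant as a replacement for it.

```ruby
# Hypothetical helper (not part of Ruby): encodes only the one-byte range.
def encode_small_int(n)
  raise ArgumentError, "#{n} needs more than one byte" unless (-123..122).cover?(n)

  byte = if n.zero?
           0
         elsif n > 0
           n + 5
         else
           (n - 5) & 0xff   # two's complement, e.g. -1 becomes 0xfa
         end
  [0x69, byte].pack("C*")   # typecode "i" followed by the value byte
end

# Compare against the real encoder, skipping the two version bytes.
[0, 1, -1, 122, -123].each do |n|
  ours   = encode_small_int(n).unpack("H*").first
  theirs = Marshal.dump(n)[2..-1].unpack("H*").first
  puts "#{n}: #{ours} #{ours == theirs ? 'matches' : 'differs from'} Marshal.dump"
end
```

Anything outside that range takes more bytes, which this sketch deliberately refuses to encode.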

Arrays

[]

0408 5b00

The typecode 5b is ASCII [. The typecode is followed by the number of elements in the array.

[1]

0408 5b06 6906

The number of items in the array is encoded in the same form as integers. Each value in the array is encoded sequentially after the size of the array.
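A few more arrays make the pattern easier to spot. This sketch uses a hypothetical `marshal_hex` helper of mine that returns the dump as one hex string, version bytes included.

```ruby
# Hypothetical helper: the whole dump as a lowercase hex string.
def marshal_hex(obj)
  Marshal.dump(obj).unpack("H*").first
end

puts marshal_hex([])         # 04085b00
puts marshal_hex([1])        # 04085b066906
puts marshal_hex([1, 2])     # 04085b0769066907
puts marshal_hex([[1], []])  # 04085b075b0669065b00
```

The last line shows that nested arrays are simply encoded in place, each with its own typecode and size.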

Hashes

{}

0408 7b00

The typecode 7b is ASCII {. The typecode is followed by the number of (key, value) pairs in the hash.

{1 => 2}

0408 7b06 6906 6907

Like arrays, the number of items in the hash is encoded in the same form as integers. Each pair of (key, value) is encoded sequentially after the size of the hash.
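Adding a second pair shows that keys and values simply alternate after the size. As before, `marshal_hex` is a hypothetical helper of mine, not something Marshal provides.

```ruby
# Hypothetical helper: the whole dump as a lowercase hex string.
def marshal_hex(obj)
  Marshal.dump(obj).unpack("H*").first
end

puts marshal_hex({})                # 04087b00
puts marshal_hex({1 => 2})          # 04087b0669066907
puts marshal_hex({1 => 2, 3 => 4})  # 04087b076906690769086909
```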

Symbols

:hello

0408 3a0a 6865 6c6c 6f

The typecode 3a is ASCII :. The typecode is followed by the length of the symbol name (0a here, which is 5 in the same form as integers) and then the bytes of the symbol name itself (plain ASCII in this case).
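Pulling the dump of :hello apart by hand shows the three pieces. This is only a sketch: it assumes the length fits in one byte, and the variable names are mine.

```ruby
dump = Marshal.dump(:hello)
body = dump[2..-1]    # drop the two version bytes

p body[0]             # ":"      the typecode
p body[1].ord - 5     # 5        the length, stored with the small-integer offset
p body[2..-1]         # "hello"  the symbol name
```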

Symlinks

When a symbol is repeated multiple times, the Marshal encoding allows subsequent instances to reference the first instance to save space in the stream.

[:hello, :hello]

0408 5b07 3a0a 6865 6c6c 6f3b 00

The typecode 3b is ASCII ;. The typecode is followed by the position of the symbol in the cache table, encoded in the same form as integers (00 here, for the first entry). This table is indexed by the order in which each symbol first appeared.
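To make the cache table concrete, here is a sketch of a reader for just the types covered so far. `TinyUnmarshal` is a hypothetical class of my own, it only understands one-byte integers, and it is nowhere near a substitute for Marshal.load.

```ruby
require "stringio"

# Hypothetical mini-reader for the subset of the format covered in this post.
class TinyUnmarshal
  def initialize(bytes)
    @io = StringIO.new(bytes)
    @symbols = []   # symbol table, in order of first appearance
  end

  def load
    major = @io.readbyte
    minor = @io.readbyte
    raise "unexpected version #{major}.#{minor}" unless major == 4
    read_value
  end

  private

  def read_int
    b = @io.readbyte
    b -= 256 if b > 127   # reinterpret as a signed byte
    return 0 if b.zero?
    raise "multi-byte integers are not handled here" if b.abs < 5
    b > 0 ? b - 5 : b + 5
  end

  def read_value
    case @io.readbyte.chr
    when "0" then nil
    when "T" then true
    when "F" then false
    when "i" then read_int
    when "[" then Array.new(read_int) { read_value }
    when "{"
      hash = {}
      read_int.times do
        key = read_value
        hash[key] = read_value
      end
      hash
    when ":"
      sym = @io.read(read_int).to_sym
      @symbols << sym          # remember it for later symlinks
      sym
    when ";" then @symbols.fetch(read_int)
    else raise "typecode not covered in this post"
    end
  end
end

p TinyUnmarshal.new(Marshal.dump([:hello, :hello])).load   # [:hello, :hello]
```

The only state the reader needs for symlinks is the `@symbols` array: a symbol is appended the first time it is read, and a symlink is just an index into that array.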

The rest

There’s a lot more to the Marshal format; I haven’t even covered strings yet! You can find more at the next post in this series, or jump right to the last post.

How to explore on your own

To generate the examples for this post, I hacked up a quick helper in irb:

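def dump(x)
  # Write in binary mode so the marshaled bytes reach the file unmodified
  File.open('/tmp/out', 'wb') {|f| Marshal.dump(x, f)}
  # Return the hex dump of the file (requires the xxd command)
  `xxd /tmp/out`
end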
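If xxd isn't installed, or you'd rather skip the temporary file, something like the following gets a similar view. This is an alternative sketch rather than the helper I actually used; `dump_hex` is a hypothetical name, and it groups the hex digits in fours to mimic the dumps in this post.

```ruby
# Hypothetical alternative: hex-dump the marshaled bytes entirely in Ruby.
def dump_hex(x)
  hex = Marshal.dump(x).unpack("H*").first   # one long hex string
  hex.scan(/.{1,4}/).join(" ")               # group as two bytes per chunk
end

puts dump_hex([:hello, :hello])   # 0408 5b07 3a0a 6865 6c6c 6f3b 00
```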