Jake Goulding

Another dip into Ruby's Marshal format

In a previous post I started to describe some details of Ruby’s Marshal format. This post goes further: a larger set of integers, IVARs, strings, and object links.

Larger integers

What happens once we go beyond integer values that can be represented in one byte? Marshal writes the number of bytes needed to represent the value, followed by the value itself, least significant byte first. High-order zero bytes are not encoded.

123

0408 6901 7b

01 indicates that the value takes up one byte, followed by the value itself.

256

0408 6902 0001

256 requires two bytes.

2**30 - 1

0408 6904 ffff ff3f

This is the largest value you can serialize as an integer. Above this, Marshal starts serializing integers as a “bignum”.
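To reproduce the dumps above, a small helper along these lines can print Marshal output in the same grouped hex layout. The name marshal_hex is made up for this sketch and is not part of Ruby.

```ruby
# Hypothetical helper: dump a value and format the bytes as
# space-separated groups of two bytes, like the listings in this post.
def marshal_hex(value)
  Marshal.dump(value).unpack1('H*').scan(/.{1,4}/).join(' ')
end

puts marshal_hex(123)        # one value byte follows the length byte
puts marshal_hex(256)        # two value bytes, least significant first
puts marshal_hex(2**30 - 1)  # four value bytes, the largest plain integer
puts marshal_hex(2**30)      # typecode switches from 69 ("i") to 6c ("l")
```

The last line shows the switch to the bignum typecode once the value no longer fits.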

Negative integers

-1

0408 69fa

fa is -6 in two’s complement, which mirrors how 1 is encoded as 6.

-124

0408 69ff 84

Here the first byte is -1 in two’s complement. This indicates that one byte of value follows. High-order FF bytes have been dropped from the value, just as high-order zero bytes are dropped from large positive integers.

-257

0408 69fe fffe

-257 requires two bytes.

-(2**30)

0408 69fc 0000 00c0

This is the most negative value you can serialize as an integer. Below this, Marshal serializes the value as a bignum.
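Putting the positive and negative rules together, a decoder for the integer payload might look like the sketch below. The function name decode_marshal_int is hypothetical, and the input is assumed to be the bytes that follow the i typecode, given as an array of integers.

```ruby
# Sketch of a decoder for the bytes that follow the "i" typecode.
# Assumes `bytes` is an Array of Integers in the range 0..255.
def decode_marshal_int(bytes)
  first = bytes[0]
  first -= 256 if first > 127        # reinterpret the first byte as signed

  return 0 if first.zero?
  return first - 5 if first > 4      # small positive values are stored as value + 5
  return first + 5 if first < -4     # small negative values are stored as value - 5

  count = first.abs                  # 1 to 4 value bytes follow
  # Assemble the value, least significant byte first.
  value = bytes[1, count].each_with_index.inject(0) do |acc, (byte, index)|
    acc | (byte << (8 * index))
  end

  # For negatives the dropped high-order bytes were all ff, so sign-extend.
  first.negative? ? value - (1 << (8 * count)) : value
end

p decode_marshal_int([0xfa])                          # the -1 example
p decode_marshal_int([0xff, 0x84])                    # the -124 example
p decode_marshal_int([0xfc, 0x00, 0x00, 0x00, 0xc0])  # the -(2**30) example
```

Subtracting 2 to the power of the bit width is one way to undo two’s complement; Ruby’s own reader works differently internally, but the results should agree.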

IVARs

Hang on to your seats, we’re going to jump into strings. First, however, we need to talk about IVARs. The crucial thing that IVARs bring to the table is the handling of string encodings.

'hello'

0408 4922 0a68 656c 6c6f 063a 0645 54

The typecode 49 is ASCII I and denotes that this object contains instance variables. After all the object data, the number of instance variables is provided. The first instance variable is a special one – it’s the string encoding of the object. In this example the string encoding is UTF-8, denoted by the symbol :E followed by a true.

'hello'.force_encoding('US-ASCII')

0408 4922 0a68 656c 6c6f 063a 0645 46

To represent US-ASCII, :E false is used instead. Both US-ASCII and UTF-8 are common enough string encodings that special indicators were created for them.

'hello'.force_encoding('SHIFT_JIS')

0408 4922 0a68 656c 6c6f 063a 0d65 6e63 6f64 696e 6722 0e53 6869 6674 5f4a 4953

For any other string encoding, the symbol :encoding is used and the name of the encoding is written out as a raw string. Here the bytes spell "Shift_JIS", Ruby’s canonical name for the encoding.

'hello'.tap {|s| s.instance_variable_set(:@test, nil)}

0408 4922 0a68 656c 6c6f 073a 0645 543a 0a40 7465 7374 30

Additional instance variables follow the string encoding. There are now 2 instance variables. The symbol for the instance variable name :@test comes before the value, nil.
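As a quick check of the IVAR behaviour described above, the following sketch dumps the same string in three encodings and loads each one back. Nothing here goes beyond Marshal.dump and Marshal.load; the variable names are only illustrative.

```ruby
# Build the same five-character string in three encodings.
# .dup avoids mutating a string literal in place.
utf8  = 'hello'.dup.force_encoding('UTF-8')
ascii = 'hello'.dup.force_encoding('US-ASCII')
sjis  = 'hello'.dup.force_encoding('Shift_JIS')

[utf8, ascii, sjis].each do |s|
  dumped   = Marshal.dump(s)
  restored = Marshal.load(dumped)
  # The Shift_JIS dump is longer because the encoding name is spelled out.
  puts "#{s.encoding.name}: #{dumped.bytesize} bytes, restored as #{restored.encoding.name}"
end

# A user-defined instance variable travels in the same IVAR wrapper.
tagged = utf8.dup
tagged.instance_variable_set(:@test, nil)
restored_tagged = Marshal.load(Marshal.dump(tagged))
p restored_tagged.instance_variables
```

The encoding itself does not show up in instance_variables after loading; it is applied to the string rather than stored as a regular instance variable.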

Raw strings

'hello'

0408 4922 0a68 656c 6c6f 063a 0645 54

Raw strings are safely nestled inside an IVAR, and are simple by comparison. The typecode 22 is ASCII " and denotes that this object is a raw string. The length of the string data comes next, encoded the same way as an integer. The string data follows as that many bytes. These bytes must be interpreted using the encoding from the surrounding IVAR.
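To make the layout concrete, here is a rough walk through the bytes of the 'hello' dump by hand. It is a sketch under two assumptions that are called out in the comments: the length fits the single-byte integer form, and the encoding is one of the two :E shortcuts.

```ruby
bytes = Marshal.dump('hello'.dup.force_encoding('UTF-8')).bytes

version_major   = bytes[0]   # 4
version_minor   = bytes[1]   # 8
ivar_typecode   = bytes[2]   # 0x49, ASCII I
string_typecode = bytes[3]   # 0x22, ASCII "

# Assumption: the length uses the single-byte form, stored as value + 5.
length = bytes[4] - 5
raw    = bytes[5, length].pack('C*')   # binary string, no encoding applied yet

# Whatever is left is the IVAR trailer: count, symbol :E, then T or F.
trailer = bytes.drop(5 + length)

# Assumption: only the :E shortcuts are handled (T is UTF-8, F is US-ASCII).
encoding = trailer.last == 'T'.ord ? Encoding::UTF_8 : Encoding::US_ASCII
text     = raw.force_encoding(encoding)

puts text
puts text.encoding
```

A real reader would decode the length with the full integer rules and handle the :encoding symbol as well.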

Object links

When the same object instance is repeated multiple times, the Marshal encoding allows subsequent instances to reference the first instance to save space in the stream.

a = 'hello'; [a, a]

0408 5b07 4922 0a68 656c 6c6f 063a 0645 5440 06

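The typecode 40 is ASCII @. The typecode is followed by the position of the object in the cache table, encoded as an integer. Here 06 decodes to 1: the array itself occupies position 0 and the string position 1. This cache table is distinct from the symbol cache.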
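The effect of object links can be observed without reading any hex. This sketch compares an array that holds the same string twice with one that holds two equal but distinct strings; the variable names are illustrative only.

```ruby
a = 'hello'.dup

shared   = Marshal.dump([a, a])       # second element is written as a link
separate = Marshal.dump([a, a.dup])   # second element is written out in full

# The link makes the first dump shorter.
puts shared.bytesize < separate.bytesize

# Loading preserves identity: a link comes back as the very same object.
restored_shared   = Marshal.load(shared)
restored_separate = Marshal.load(separate)
puts restored_shared[0].equal?(restored_shared[1])
puts restored_separate[0].equal?(restored_separate[1])
```

Besides saving space, the link is what lets a loaded structure keep its aliasing: mutating one element of the first restored array changes the other as well.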

The rest

There are more types that Marshal can handle, but not all of them are interesting. The next post covers regexes, classes, modules, and instances of objects.