In a previous post I started to describe some details of Ruby’s Marshal format. This post goes further: a larger set of integers, IVARs, strings, and object links.
Larger integers
What happens once we go beyond integer values that can be represented in one byte? Marshal simply writes the number of bytes needed to represent the value, followed by the value, least significant byte first. Leading zeroes are not encoded.
123
0408 6901 7b
01 indicates that the value takes up one byte, followed by the value itself.
256
0408 6902 0001
256 requires two bytes.
2**30 - 1
0408 6904 ffff ff3f
This is the largest value you can serialize as an integer. Above this, Marshal starts serializing integers as a “bignum”.
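To reproduce these dumps yourself, a small helper is enough. This is only a sketch: marshal_hex is a hypothetical name, and it simply formats the output of Marshal.dump in the same four-digit groups used in this post.

```ruby
# Hypothetical helper: dump an object and format the bytes like the
# listings in this post (hex digits in groups of four).
def marshal_hex(obj)
  Marshal.dump(obj).unpack1('H*').scan(/.{1,4}/).join(' ')
end

[123, 256, 2**30 - 1].each do |n|
  puts "#{n}: #{marshal_hex(n)}"
end
# 123: 0408 6901 7b
# 256: 0408 6902 0001
# 1073741823: 0408 6904 ffff ff3f
```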
Negative integers
-1
0408 69fa
fa is -6 in two’s complement, which mirrors how 1 is encoded as 6.
-124
0408 69ff 84
Here the first byte is -1 in two’s complement. This indicates that one byte of value follows. The value has had leading FF bytes removed, similar to large positive integers.
-257
0408 69fe fffe
-257 requires two bytes.
-(2**30)
0408 69fc 0000 00c0
This is the most negative value you can serialize as an integer; anything below it becomes a bignum.
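The positive and negative rules fit in a few lines of Ruby. The following is a rough sketch of the encoding as described above, not the code Ruby itself uses; encode_fixnum is a hypothetical name, and it returns only the bytes that follow the 69 typecode.

```ruby
# Sketch of the integer payload rules; encode_fixnum is a hypothetical
# helper, not part of Ruby's API.
def encode_fixnum(n)
  unless (-(2**30)...(2**30)).cover?(n)
    raise ArgumentError, 'outside the range handled here'
  end

  # Small values fit in a single byte, offset by 5 away from zero.
  return [0].pack('C') if n.zero?
  return [n + 5].pack('C') if n.between?(1, 122)
  return [(n - 5) & 0xff].pack('C') if n.between?(-123, -1)

  # Otherwise emit bytes least significant first, stopping once only
  # leading 00 bytes (positive) or FF bytes (negative) remain.
  bytes = []
  loop do
    bytes << (n & 0xff)
    n >>= 8
    break if n == 0 || n == -1
  end

  # The count byte is positive for positive values, negative otherwise.
  count = n.zero? ? bytes.size : -bytes.size
  ([count & 0xff] + bytes).pack('C*')
end

puts encode_fixnum(-257).unpack1('H*')   # fefffe
puts encode_fixnum(123).unpack1('H*')    # 017b
```

Comparing the result with Marshal.dump(n).byteslice(3..-1), which skips the two version bytes and the typecode, is a quick way to check the sketch against the real thing.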
IVARs
Hang on to your seats, we’re going to jump into strings. First, however, we need to talk about IVARs. The crucial thing that IVARs bring to the table is the handling of string encodings.
'hello'
0408 4922 0a68 656c 6c6f 063a 0645 54
The typecode 49 is ASCII I and denotes that this object contains instance variables. After all the object data, the number of instance variables is provided. The first instance variable is a special one – it’s the string encoding of the object. In this example the string encoding is UTF-8, denoted by the symbol :E followed by a true.
'hello'.force_encoding('US-ASCII')
0408 4922 0a68 656c 6c6f 063a 0645 46
To represent US-ASCII, :E false is used instead. Both US-ASCII and UTF-8 are common enough string encodings that special indicators were created for them.
'hello'.force_encoding('SHIFT_JIS')
0408 4922 0a68 656c 6c6f 063a 0d65 6e63 6f64 696e 6722 0e53 6869 6674 5f4a 4953
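For any other string encoding, the symbol :encoding is used and the full name of the string encoding is written out as a raw string. Note that the bytes spell the canonical name, "Shift_JIS", rather than the capitalization passed to force_encoding.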
'hello'.tap {|s| s.instance_variable_set(:@test, nil)}
0408 4922 0a68 656c 6c6f 073a 0645 543a 0a40 7465 7374 30
Additional instance variables follow the string encoding. There are now 2 instance variables. The symbol for the instance variable name :@test comes before the value, nil.
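To look at just the instance variable part of a dumped string, you can slice off everything before it. This is a minimal sketch: ivar_section is a hypothetical helper, and it assumes a string shorter than 123 bytes so that the length fits in a single byte.

```ruby
# Hypothetical helper: return the bytes that follow the raw string data
# inside the IVAR wrapper (the instance variable count and the pairs).
def ivar_section(str)
  dumped = Marshal.dump(str)
  # Skip 2 version bytes, the I and " typecodes, 1 length byte
  # (assumes fewer than 123 bytes of data) and the string data itself.
  dumped.byteslice((5 + str.bytesize)..-1)
end

%w[UTF-8 US-ASCII Shift_JIS].each do |name|
  s = 'hello'.dup.force_encoding(name)
  puts "#{name}: #{ivar_section(s).unpack1('H*')}"
end
# UTF-8: 063a064554
# US-ASCII: 063a064546
# Shift_JIS: 063a0d656e636f64696e67220e53686966745f4a4953
```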
Raw strings
'hello'
0408 4922 0a68 656c 6c6f 063a 0645 54
Raw strings are safely nestled inside an IVAR, and are comparatively very simple. The typecode 22 is ASCII " and denotes that this object is a raw string. The length of the string data is next, encoded in the same form as integers. The string data follows as a set of bytes. These bytes must be interpreted using the encoding from the surrounding IVAR.
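Reading a string back by hand shows how the pieces fit together. The sketch below is deliberately limited: read_long and read_raw_string are hypothetical names, and the reader only understands a top-level string wrapped in an IVAR whose first instance variable is the :E shorthand.

```ruby
require 'stringio'

# Hypothetical reader for the integer form used by fixnums and lengths.
def read_long(io)
  c = io.readbyte
  c -= 256 if c > 127                 # treat the byte as signed
  return 0 if c.zero?
  return c - 5 if c > 4               # small positive value
  return c + 5 if c < -4              # small negative value

  size = c.abs                        # 1..4 value bytes follow
  value = 0
  size.times { |i| value |= io.readbyte << (8 * i) }
  c.positive? ? value : value - (1 << (8 * size))
end

# Hypothetical reader for an IVAR-wrapped raw string using :E.
def read_raw_string(dump)
  io = StringIO.new(dump.b)
  io.read(2)                                          # version header
  raise 'expected IVAR' unless io.readbyte == 0x49    # I
  raise 'expected raw string' unless io.readbyte == 0x22  # "
  bytes = io.read(read_long(io))                      # length, then data
  raise 'expected instance variables' if read_long(io) < 1
  raise 'expected a symbol' unless io.readbyte == 0x3a    # :
  name = io.read(read_long(io))
  raise 'only :E is handled in this sketch' unless name == 'E'
  # T (0x54) means UTF-8; otherwise this sketch assumes F (US-ASCII).
  encoding = io.readbyte == 0x54 ? Encoding::UTF_8 : Encoding::US_ASCII
  bytes.force_encoding(encoding)
end

s = read_raw_string(Marshal.dump('hello'.dup.force_encoding('UTF-8')))
puts "#{s} (#{s.encoding})"   # hello (UTF-8)
```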
Object links
When the same object instance is repeated multiple times, the Marshal encoding allows subsequent instances to reference the first instance to save space in the stream.
a = 'hello'; [a, a]
0408 5b07 4922 0a68 656c 6c6f 063a 0645 5440 06
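The typecode 40 is ASCII @. The typecode is followed by the position of the object in the cache table, encoded in the same form as integers. Here 06 is index 1, because the enclosing array occupies index 0. This cache table is distinct from the symbol cache.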
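The effect of an object link is easy to observe after loading. In this short sketch, round_trip is a hypothetical helper that dumps and reloads a value; the array with a repeated string comes back with both slots pointing at one object, while an array holding a copy does not.

```ruby
# Hypothetical helper: serialize and immediately deserialize a value.
def round_trip(obj)
  Marshal.load(Marshal.dump(obj))
end

a = 'hello'
shared = round_trip([a, a])       # second element is written as a link
copies = round_trip([a, a.dup])   # second element is written in full

puts shared[0].equal?(shared[1])  # true
puts copies[0].equal?(copies[1])  # false
puts Marshal.dump([a, a]).unpack1('H*')
# 04085b0749220a68656c6c6f063a0645544006
```

The linked dump is also shorter than the one containing the copy, which is the space saving the format is after.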
The rest
There are more types that Marshal can handle, but not all of them are interesting. The next post covers regexes, classes, modules, and instances of objects.