How Ruby Reads Your Code

Você pode ler uma versão deste artigo em português aqui.

Introduction

Ruby is a programming language loved by many programmers. In addition to having many methods that make our work a lot easier, Ruby is known for this set of characteristics that is commonly called “Syntactic Sugar”, which makes writing code more pleasant. You can read more about Syntactic Sugar in this article.

The language construct for, for instance, is present in many programming languages, including Ruby. Even though it’s use is not recommended by the community, having this construct available can ease the transition from a different programming language to Ruby.

Usually you should use methods like each or map instead of for when iterating collections. In addition to usually making more readable code, internally the for construct is implemented in terms of each. So that we can better understand what that means, we need to analyze how Ruby reads and interprets the code you write.

Just a “heads up”, this will not be a deep dive into how programming languages work internally. So this article should be better suited for people who have a basic to intermediate understanding of programming, but might not be familiar with some Computer Science topics.

The three phases of Ruby code transformation

Before executing the code you wrote, Ruby needs to read it and transform these lines of code 3 times. Which means that every time a Ruby script is ran, it goes through a 3 step process, that will transform your code into the instructions that will be executed: tokenization, parse and, finally, compilation.

 ___________      ______________      _______      _____________      ______________
| Your code | -> | Tokenization | -> | Parse | -> | Compilation | -> | Instructions |
 -----------      --------------      -------      -------------      --------------

In this article we will briefly discuss this phases, but if you are interested in knowing more you should read the book Ruby Under a Microscope, which was used as the main reference for this article. You can follow along the instructions in your own computer if you like, the only requirement being a Ruby version 1.9 or higher installed. For the examples in this article, version 3.0.0 was used.

Let’s begin with a simple Ruby script:

[1, 3].each do |e|
  p e
end

When the script above is ran, we get:

1
3
# => [1, 3]

The numbers 1 and 3 are printed in screen and the array is returned. The same result can be obtained with this implementation using for:

for e in [1, 3] do
  p e
end

Before executing the code, the Ruby interpreter needs to translate it and, to do that, it reads each character and “tokenizes” it. In other words, the interpreter converts this sequences into tokens or words. Consider the line bellow:

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
[	1	,		3	]	.	e	a	c	h		d	o		\|	e	\|

First, the interpreter finds the opening square bracket. Then, an integer (1), followed by a comma, followed by white space, which means that another element is expected, and that the first element of the array is an integer number of one digit. And so on… step by step, the interpreter transforms these characters into tokens, and identifies the method each, the keyword do etc.

If you would like to see for yourself how Ruby interprets different code, you can use the class Ripper with the method lex, which allows you to visualize the “tokenized code”:

require 'ripper'
require 'pp'

code = <<STR
[1, 3].each do |e|
  p e
end
STR

puts code
pp Ripper.lex(code)

Running the script above we get:

[1, 3].each do |e|
  p e
end
[[[1, 0], :on_lbracket, "[", BEG|LABEL],
 [[1, 1], :on_int, "1", END],
 [[1, 2], :on_comma, ",", BEG|LABEL],
 [[1, 3], :on_sp, " ", BEG|LABEL],
 [[1, 4], :on_int, "3", END],
 [[1, 5], :on_comma, ",", BEG|LABEL],
 [[1, 6], :on_sp, " ", BEG|LABEL],
 [[1, 7], :on_int, "5", END],
 [[1, 8], :on_rbracket, "]", END],
 [[1, 9], :on_period, ".", DOT],
 [[1, 10], :on_ident, "each", ARG],
 [[1, 14], :on_sp, " ", ARG],
 [[1, 15], :on_kw, "do", BEG],
 [[1, 17], :on_sp, " ", BEG],
 [[1, 18], :on_op, "|", BEG|LABEL],
 [[1, 19], :on_ident, "e", ARG],
 [[1, 20], :on_op, "|", BEG|LABEL],
 [[1, 21], :on_ignored_nl, "\n", BEG|LABEL],
 [[2, 0], :on_sp, "  ", BEG|LABEL],
 [[2, 2], :on_ident, "p", CMDARG],
 [[2, 3], :on_sp, " ", CMDARG],
 [[2, 4], :on_ident, "e", END|LABEL],
 [[2, 5], :on_nl, "\n", BEG],
 [[3, 0], :on_kw, "end", END],
 [[3, 3], :on_nl, "\n", BEG]]

Each line is read and translated. The characters are identified with it’s position and symbol. For instance, [1, 10] indicates line 1, position 10, and :on_ident indicates the each block. In this step, even if the code contains syntax mistakes, it can still be tokenized.

When we use Ripper to read the for implementation, we get:

for e in [1, 3] do
  p e
end
[[[1, 0], :on_kw, "for", BEG],
 [[1, 3], :on_sp, " ", BEG],
 [[1, 4], :on_ident, "e", CMDARG],
 [[1, 5], :on_sp, " ", CMDARG],
 [[1, 6], :on_kw, "in", BEG],
 [[1, 8], :on_sp, " ", BEG],
 [[1, 9], :on_lbracket, "[", BEG|LABEL],
 [[1, 10], :on_int, "1", END],
 [[1, 11], :on_comma, ",", BEG|LABEL],
 [[1, 12], :on_sp, " ", BEG|LABEL],
 [[1, 13], :on_int, "3", END],
 [[1, 14], :on_rbracket, "]", END],
 [[1, 15], :on_sp, " ", END],
 [[1, 16], :on_kw, "do", BEG],
 [[1, 18], :on_ignored_nl, "\n", BEG],
 [[2, 0], :on_sp, "  ", BEG],
 [[2, 2], :on_ident, "p", CMDARG],
 [[2, 3], :on_sp, " ", CMDARG],
 [[2, 4], :on_ident, "e", END|LABEL],
 [[2, 5], :on_nl, "\n", BEG],
 [[3, 0], :on_kw, "end", END],
 [[3, 3], :on_nl, "\n", BEG]]

As expected, the results are different from the each implementation, since in this stage, the only thing Ruby does is read characters and transforms them into tokens.

After “tokenizing” the full script, Ruby still needs to understand exactly what it has to be executed and, to do that, reading the tokens is not enough. Like many programming languages, the Ruby “parser” contains a series of grammatical rules that will be used to analyze and interpret your code. Just like the tokenization process, we can use the Ripper class to see how the “parser” does that. We can use the same script as before, only this time we use the method sexp instead of lex:

require 'ripper'
require 'pp'

code = <<STR
[1, 3].each do |e|
  p e
end
STR

puts code
pp Ripper.sexp(code)

Running the code above we get the following:

[1, 3].each do |e|
  p e
end
[:program,
 [[:method_add_block,
   [:call,
    [:array, [[:@int, "1", [1, 1]], [:@int, "3", [1, 4]]]],
    [:@period, ".", [1, 6]],
    [:@ident, "each", [1, 7]]],
   [:do_block,
    [:block_var,
     [:params, [[:@ident, "e", [1, 16]]], nil, nil, nil, nil, nil, nil],
     false],
    [:bodystmt,
     [[:command,
       [:@ident, "p", [2, 2]],
       [:args_add_block, [[:var_ref, [:@ident, "e", [2, 4]]]], false]]],
     nil,
     nil,
     nil]]]]]

We can start to notice a few elements that represent components of a script. The line bellow, for instance, defines an array with 2 integer numbers (1 and 3), being 1 on line 1, in position 1, and 3 on line 1, position 4:

[:array, [[:@int, "1", [1, 1]], [:@int, "3", [1, 4]]]]

Note that the symbol :method_add_block indicates that a block of code is called right at the beginning. Just for comparison, let’s see how to the code bellow is read:

code = <<STR
arr = []
arr << "Pedro"
STR

puts code
pp Ripper.sexp(code)

[:program,
 [[:assign, [:var_field, [:@ident, "arr", [1, 0]]], [:array, nil]],
  [:binary,
   [:var_ref, [:@ident, "arr", [2, 0]]],
   :<<,
   [:string_literal,
    [:string_content, [:@tstring_content, "Pedro", [2, 8]]]]]]]

There we can find the value attribution (:assign) to the variable (:var_field) identified by [:@ident, "arr", [1, 0]]], and then further down :<< is called, adding an element to the array.

Now going back to the code transformation phases, the main difference between this stage and the previous, is that the interpreter needs to understand the prioritization of operations that will be executed. And at this stage of the process, syntax mistakes will raise errors.

After the first 2 phases, the code is converted into an abstract syntax tree data structure.

Finally, the tokenized and parsed code will go through the compiler, which means it will be translated into another language that will be easier for computers to understand. Unlike programming languages like Java or C, Ruby compiles code automatically without us even noticing. The compiler will generate bytecode instructions, called YARV, which will be interpreted in the Ruby Virtual Machine (YARV - Yet Another Ruby Virtual machine).

Using this data structure, the compiler is able to understand the node prioritization and will add objects and method calls to the execution stack.

In the same manner as the previous stages, we can visualize the compiled code using the class Ripper. The lines shown bellow are different from the actual compiled results in the C compiler, but this already allows us to have a way of seeing how Ruby works.

Let’s start with a very simple code:

code = <<STR
2 + 3
STR

puts code
puts RubyVM::InstructionSequence.compile(code).disasm

The result of running this script is shown bellow:

+ 3
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,5)> (catch: FALSE)
putobject                   2                         (   1)[Li]
putobject                   3
opt_plus                    <calldata!mid:+, argc:1, ARGS_SIMPLE>
leave

The first instruction sends an integer (2) to the execution stack, the second, passes the argument (integer 3). Finally, the third instruction sends the method + (mid:+, which means something like method_id = +). Just for comparison, if we switch the operation to multiplication, the line would look like this:

opt_mult                       <calldata!mid:*, argc:1, ARGS_SIMPLE>

Now let’s compile the initial script, the each block:

[1, 3].each do |e|
  p e
end
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(3,3)> (catch: FALSE)
== catch table
| catch type: break  st: 0000 ed: 0005 sp: 0000 cont: 0005
| == disasm: #<ISeq:block in <compiled>@<compiled>:1 (1,12)-(3,3)> (catch: FALSE)
| == catch table
| | catch type: redo   st: 0001 ed: 0006 sp: 0000 cont: 0001
| | catch type: next   st: 0001 ed: 0006 sp: 0000 cont: 0006
| |------------------------------------------------------------------------
| local table (size: 1, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
| [ 1] e@0<Arg>
| 0000 nop                                                       (   1)[Bc]
| 0001 putself                                                   (   2)[Li]
| 0002 getlocal_WC_0                          e@0
| 0004 opt_send_without_block   <calldata!mid:p, argc:1, FCALL|ARGS_SIMPLE>
| 0006 nop
| 0007 leave                                                     (   3)[Br]
|------------------------------------------------------------------------
0000 duparray                               [1, 3]               (   1)[Li]
0002 send                  <calldata!mid:each, argc:0>, block in <compiled>
0005 nop
0006 leave                                                       (   1)

Each line that starts with == disasm: indicates a scope, and each scope can have it’s own local table (local table (...)). To facilitate visualization, the lines bellow will focus only on the scope of the each block.

 == disasm: #<ISeq:block in <compiled>@<compiled>:1 (1,12)-(3,3)> (catch: FALSE)
 == catch table
 | catch type: redo   st: 0001 ed: 0006 sp: 0000 cont: 0001
 | catch type: next   st: 0001 ed: 0006 sp: 0000 cont: 0006
 |------------------------------------------------------------------------
 local table (size: 1, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
 [ 1] e@0<Arg>
 0000 nop                                                       (   1)[Bc]
 0001 putself                                                   (   2)[Li]
 0002 getlocal_WC_0                          e@0
 0004 opt_send_without_block   <calldata!mid:p, argc:1, FCALL|ARGS_SIMPLE>
 0006 nop
 0007 leave                                                     (   3)[Br]

You can see the e variable with getlocal_WC_0, and next, the method call p (calldata!mid:p). In the end, leave indicates the end of the block. The vertical line of pipes (|) will give you a visual indication of the scopes.

Going back the rest of the compiled code we can see the method call each (with send) to the array [1, 3] (duparray), which opens the previous code block.

Now that we can better understand how Ruby reads our code, we’ll see how our initial example scripts, with for and each are compiled. We will analyze and compare them.

for e in [1, 3] do
  p e
end
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(3,3)> (catch: FALSE)
== catch table
| catch type: break  st: 0000 ed: 0005 sp: 0000 cont: 0005
| == disasm: #<ISeq:block in <compiled>@<compiled>:1 (1,0)-(3,3)> (catch: FALSE)
| == catch table
| | catch type: redo   st: 0005 ed: 0010 sp: 0000 cont: 0005
| | catch type: next   st: 0005 ed: 0010 sp: 0000 cont: 0010
| |------------------------------------------------------------------------
| local table (size: 1, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
| [ 1] ?@0<Arg>
| 0000 getlocal_WC_0                          ?@0                (   1)
| 0002 setlocal_WC_1                          e@0
| 0004 nop                                    [Bc]
| 0005 putself                                                   (   2)[Li]
| 0006 getlocal_WC_1                          e@0
| 0008 opt_send_without_block   <calldata!mid:p, argc:1, FCALL|ARGS_SIMPLE>
| 0010 nop
| 0011 leave                                                     (   3)[Br]
|------------------------------------------------------------------------
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] e@0
0000 duparray                               [1, 3]               (   1)[Li]
0002 send                  <calldata!mid:each, argc:0>, block in <compiled>
0005 nop
0006 leave                                                       (   1)

Let’s focus on the lines bellow to make things easier:

|------------------------------------------------------------------------
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] e@0
0000 duparray                               [1, 3]               (   1)[Li]
0002 send                  <calldata!mid:each, argc:0>, block in <compiled>
0005 nop
0006 leave                                                       (   1)

Note that the array definition and the method call are exactly the same as the result from the each script compilation. This means that, after being “tokenized” and interpreted, the for script is translated just like the each script. However, not everything is exactly the same. Bellow we’ll present the compiled each script for comparison:

|------------------------------------------------------------------------
duparray                               [1, 3]               (   1)[Li]
send                  <calldata!mid:each, argc:0>, block in <compiled>
nop
leave

Notice that there is no local table or e variable definition. This means that, when we use for, a variable is created outside of the scope, and it will be used inside the block. That structure has implications when the code ran. Let’s take a look at another example:

arr = [1, 2, 3]
for elem in arr do
  puts elem
end

# elem is accessible outside of the for scope
elem # => 3

arr = [1, 2, 3]
arr.each { |elem| puts elem }

# elem is not accessible outside the each scope
elem # => NameError: undefined local variable or method `elem'

This means that for behaves differently from each and that can affect variables that were already defined outside of the scope with the same name. For exemple:

i = "a"

for i in [2,3] do
  p i
end
2
3
# => [2, 3]

i # => 3

Variable i changes after the for block.

i = "a"

[2, 3].each { |i| p i }
2
3
# => [2, 3]

i # => "a"

Variable i stays the same after the each block.

If you would like to see a more detailed example of the side effects of the for construct in Ruby, you can checkout the article “The Evils of the For Loop”.

Conclusion

In this article we summarize the three steps of the process Ruby uses to read your code up to the instructions to be executed. Knowing this allows us to better understand how the programming language works and, potentially, helps us evolve as programmers.

In addition to that, we analyzed how Ruby processes for and each calls and the differences between them, specially how the use of the first can lead to unwanted side effects.

References:

André Kanamura

DEV Ruby on Rails and photographer

How Ruby Reads Your Code

Introduction

The three phases of Ruby code transformation

Conclusion