Flex Your Lexer

August 16, 2013

A lexer is a Ruby class with a single method: tokenize(). Its purpose is to label each chunk of code with a particular token. Strings are labeled as strings, numbers as numbers, and class names as constants. For example:

Lexer.new.tokenize("string")
# => [[:STRING, "string"]]
Lexer.new.tokenize("True")
# => [[:CONSTANT, "True"]]
Lexer.new.tokenize("a Car")
# => [[:A, "a"], [:CONSTANT, "Car"]]
Lexer.new.tokenize("+")
# => [["+", "+"]]

“Lexing” is accomplished using regular expressions. The lexer treats the text string of your code like a shish kabob: it slides recognizable chunks off the skewer one at a time and categorizes them (onion, pepper, method definition, etc.).
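
The slicing itself relies on Ruby's String#[] with a regular expression and a capture group: it returns the captured text when the pattern matches at the front of the chunk, and nil when it doesn't. For example (the chunk value here is made up just for illustration):

chunk = "can curl eh?"
chunk[/\A([a-z]\w*)/, 1]  # => "can" (the leading lowercase word)
chunk[/\A(eh\?)/, 1]      # => nil (no 'eh?' at the front of this chunk)

Here is the start of the lexer itself: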

class Lexer
  KEYWORDS = ['a', 'can', 'if', 'else', 'while', 'true', 'false', 'nil']

  def tokenize(code)
    code.chomp!
    i = 0
    tokens = []

    while i < code.size
      chunk = code[i..-1]

      # 'eh?' closes a block, so emit a closing brace token
      if identifier = chunk[/\A(eh\?)/, 1]
        tokens << ['}', '}']
        i += identifier.size

      # lowercase words are either keywords or plain identifiers
      elsif identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size
...

In this example lexer I have defined my list of keywords, removed the trailing newline, and begun peeling identifiable chunks off the code example.

  • If the code chunk matches ‘eh?’, it will add a closing brace to the token array.
  • If the code chunk matches a lowercase word that is one of the KEYWORDS, the keyword will be upcased, converted to a symbol, and added to the token array.
  • If the code chunk matches a lowercase word that is not a keyword, it will be labeled as an IDENTIFIER and added to the token array.

I continued defining regex rules until I was able to completely tokenize a class definition:

code = <<-EOS
a Canadian
  can curl
    if skip:
      say 'Hurry!'
    eh?
  eh?
eh?
EOS
...
Lexer.new.tokenize(code)
# => [[:A, "a"], [:CONSTANT, "Canadian"], [:CAN, "can"], [:IDENTIFIER, "curl"], [:IF, "if"], [:IDENTIFIER, "skip"], ["{", "{"], [:IDENTIFIER, "say"], [:STRING, "Hurry!"], ["}", "}"], ["}", "}"], ["}", "}"]]

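The remaining rules implied by that output (constants, single-quoted strings, a colon that opens a block, and whitespace to skip) follow the same pattern as the two branches shown earlier. Purely as a sketch, and assuming the string and colon rules work the way the Canadian example suggests, the whole loop might look something like this:

# A rough sketch of a complete tokenize loop, not the exact lexer from this
# post; the constant, string, colon, and whitespace rules are guesses based
# on the example output above.
class Lexer
  KEYWORDS = ['a', 'can', 'if', 'else', 'while', 'true', 'false', 'nil']

  def tokenize(code)
    code.chomp!
    i = 0
    tokens = []

    while i < code.size
      chunk = code[i..-1]

      # 'eh?' closes a block
      if identifier = chunk[/\A(eh\?)/, 1]
        tokens << ['}', '}']
        i += identifier.size

      # lowercase words are keywords or identifiers
      elsif identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size

      # capitalized words are constants
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        i += constant.size

      # single-quoted strings, stored without their quotes
      elsif string = chunk[/\A'([^']*)'/, 1]
        tokens << [:STRING, string]
        i += string.size + 2

      # a colon opens a block, mirroring the 'eh?' rule
      elsif chunk[0] == ':'
        tokens << ['{', '{']
        i += 1

      # whitespace and newlines are skipped without emitting a token
      elsif chunk[0] =~ /\s/
        i += 1

      # anything else, like '+', passes through as itself
      else
        value = chunk[0]
        tokens << [value, value]
        i += 1
      end
    end

    tokens
  end
end

Running that sketch over the Canadian snippet reproduces the token array above; to get the very first example back, where "string" comes out as a :STRING token, it would also need a rule for double-quoted strings.
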
Now that I have an array of tokens, it is time to write my grammar laws that will tell the parser what to do with them.

Awesome

