Flex Your Lexer

August 16, 2013

A lexer is a Ruby class with a single method: tokenize(). Its purpose is to label each chunk of code with a particular token. Strings are labeled as strings, numbers as numbers, and class names as constants. For example:

Lexer.new.tokenize("string")
# => [[:STRING, "string"]]
Lexer.new.tokenize("True")
# => [[:CONSTANT, "True"]]
Lexer.new.tokenize("a Car")
# => [[:A, "a"], [:CONSTANT, "Car"]]
Lexer.new.tokenize("+")
# => [["+", "+"]]

“Lexing” is accomplished using regular expressions. The lexer treats the text string of your code like a shish kabob: it slides recognizable chunks off the skewer one at a time and categorizes them (onion, pepper, method definition, etc.).
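
The slicing itself relies on Ruby's String#[] with a regular expression and a capture group: it returns the captured text when the pattern matches at the front of the chunk, and nil when it doesn't. For example (the chunk value here is made up just for illustration):

chunk = "can curl eh?"
chunk[/\A([a-z]\w*)/, 1]  # => "can" (the leading lowercase word)
chunk[/\A(eh\?)/, 1]      # => nil (no 'eh?' at the front of this chunk)

Here is the start of the lexer itself: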

class Lexer
  KEYWORDS = ['a', 'can', 'if', 'else', 'while', 'true', 'false', 'nil']

  def tokenize(code)
    code.chomp!
    i = 0
    tokens = []

    while i < code.size
      chunk = code[i..-1]

      # 'eh?' closes a block, so emit a closing brace token
      if identifier = chunk[/\A(eh\?)/, 1]
        tokens << ['}', '}']
        i += identifier.size

      # lowercase words are either keywords or plain identifiers
      elsif identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size
...

In this example lexer I have defined my list of keywords, removed the trailing newline, and begun peeling identifiable chunks off the code example.

  • If the code chunk matches ‘eh?’, it will add a closing brace to the token array.
  • If the code chunk matches a lowercase word that is one of the KEYWORDS, the keyword will be upcased, converted to a symbol, and added to the token array.
  • If the code chunk matches a lowercase word that is not a keyword, it will be labeled as an IDENTIFIER and added to the token array.

I continued defining regex rules until I was able to completely tokenize a class definition:

code = <<-EOS
a Canadian
  can curl
    if skip:
      say 'Hurry!'
    eh?
  eh?
eh?
EOS
...
Lexer.new.tokenize(code)
# => [[:A, "a"], [:CONSTANT, "Canadian"], [:CAN, "can"], [:IDENTIFIER, "curl"], [:IF, "if"], [:IDENTIFIER, "skip"], ["{", "{"], [:IDENTIFIER, "say"], [:STRING, "Hurry!"], ["}", "}"], ["}", "}"], ["}", "}"]]

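The remaining rules implied by that output (constants, single-quoted strings, a colon that opens a block, and whitespace to skip) follow the same pattern as the two branches shown earlier. Purely as a sketch, and assuming the string and colon rules work the way the Canadian example suggests, the whole loop might look something like this:

# A rough sketch of a complete tokenize loop, not the exact lexer from this
# post; the constant, string, colon, and whitespace rules are guesses based
# on the example output above.
class Lexer
  KEYWORDS = ['a', 'can', 'if', 'else', 'while', 'true', 'false', 'nil']

  def tokenize(code)
    code.chomp!
    i = 0
    tokens = []

    while i < code.size
      chunk = code[i..-1]

      # 'eh?' closes a block
      if identifier = chunk[/\A(eh\?)/, 1]
        tokens << ['}', '}']
        i += identifier.size

      # lowercase words are keywords or identifiers
      elsif identifier = chunk[/\A([a-z]\w*)/, 1]
        if KEYWORDS.include?(identifier)
          tokens << [identifier.upcase.to_sym, identifier]
        else
          tokens << [:IDENTIFIER, identifier]
        end
        i += identifier.size

      # capitalized words are constants
      elsif constant = chunk[/\A([A-Z]\w*)/, 1]
        tokens << [:CONSTANT, constant]
        i += constant.size

      # single-quoted strings, stored without their quotes
      elsif string = chunk[/\A'([^']*)'/, 1]
        tokens << [:STRING, string]
        i += string.size + 2

      # a colon opens a block, mirroring the 'eh?' rule
      elsif chunk[0] == ':'
        tokens << ['{', '{']
        i += 1

      # whitespace and newlines are skipped without emitting a token
      elsif chunk[0] =~ /\s/
        i += 1

      # anything else, like '+', passes through as itself
      else
        value = chunk[0]
        tokens << [value, value]
        i += 1
      end
    end

    tokens
  end
end

Running that sketch over the Canadian snippet reproduces the token array above; to get the very first example back, where "string" comes out as a :STRING token, it would also need a rule for double-quoted strings.
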
Now that I have an array of tokens, it is time to write my grammar laws that will tell the parser what to do with them.

Awesome

