Building a regex engine from first principles in Golang

Introduction
Formal languages
Finite automata
- Thompson's construction (NFA)
- Subset construction (DFA)
Operator precedence parsing
Tokenising
Regular expression engine

Figure 1: Thompson's construction

Introduction

Regular grammars define regular languages. We can define recognisers for regular languages using simple finite state machines. Regular grammars are equivalent to regular expressions which are used throughout computer science to define basic string search patterns.

A nondeterministic finite state automata can be constructed from any regular expression using the Thompson’s construction algorithm. The automata can be walked using an input string to check if the string belongs to the language defined by the regular expression (grammar).

Formal languages

A formal language is formal because it is unambiguous and constructed with a specific purpose in mind. Formal languages are defined using a schema known as a formal grammar, which consists of an alphabet and set of production rules.

This is different from a natural language - which evolves naturally over time amongst a group of people. Whilst natural languages have alphabets, they often rely heavily on context, have regional differences (letters and words can be rearranged arbitrarily), and cannot be traced back to a single correct axiomatic definition.

Alphabets

The first component of a formal grammar is an alphabet: the alphabet may contain an infinite number of symbols, but in this context it is always finite. A formal language is a finite or infinite set of strings (also known as words) over an alphabet.

A string over an alphabet \(\Sigma\) is any finite sequence of symbols that belong to \(\Sigma\). The set of all strings over an alphabet is denoted using Kleene star notation \(\Sigma^*\). The string with zero symbols (usually called the empty string) belongs to every language and is denoted by epsilon \(\epsilon\) ¹.

Production rules

The second component of a formal grammar is a finite set of production rules - which are rewrite rules that define how sequences of symbols in an alphabet should be arranged to construct valid words in a formal language.

A set of production rules \(P\) consists of a set \(N\) of non-terminal symbols, a set \(\Sigma\) of terminal symbols (the alphabet), and a distinguished start symbol \(S \in \Sigma\). Production rules are written like so:

\begin{equation} \text{left-hand side} \rightarrow \text{right-hand side} \end{equation}

Both sides are finite sequence of terminals and non-terminals from \(\Sigma\) and \(N\). It is convention to represent non-terminals using uppercase letters e.g. \(N = \{S, A, B\}\) and terminals using lowercase letters \(\Sigma = \{a, b, c\}\).

A language \(L\) over an alphabet \(\Sigma\) with production rules \(P\) is a subset of all possible sequences of symbols in the Kleene closure of the language's alphabet: \(L \subset \Sigma^*\).

Chomsky hierarchy

Figure 2: Chomsky hierarchy

In ² Noam Chomsky describes a formal grammar hierarchy which defines four tiers of grammars. In this article, and in computer languages in general, we are only interested in regular and context-free grammars.

Each tier imposes different constraints on the a grammar’s production rules. The first tier (regular) has strict constraints, whereas the last tier (recursively enumerable) has no constraints at all. This means different algorithms (which, generally speaking, increase in complexity - time, space, conceptually) are used to define recognisers for languages in each tier.

A recogniser for a regular language can only recognise regular languages; a recogniser for a context-free language can recognise context-free languages and regular languages; and so on. Using the definitions from ³, we can outline each tier in the hierarchy more precisely:

\(a\) is a terminal;
\(A, B\) are non-terminals;
\(\alpha, \beta, \gamma\) are finite sequence of terminals and non-terminals;
\(\alpha, \beta\) can be empty strings, but \(\gamma\) cannot be empty.

Regular languages

Constrains: \(A \rightarrow a\) and \(A \rightarrow aB\)
Recogniser algorithm: finite state machine

Regular expressions

… regular expressions are used by scanners in the lexical analysis stage to transform character streams into token streams. The token stream is then read by the parser during syntax analysis, which checks if the sequence of tokens belong to a context-free language defined by a context-free grammar.

Finite automata

…

Thompson's construction (NFA)

A nondeterministic finite state automata can be constructed from any regular expression using the Thompson’s construction algorithm ⁴. The automaton for any regular grammar can be constructed using only three operations:

Concatenation: given two symbols or sub-expressions x and y, their concatenation \(xy\) means "\(x\) must be followed by \(y\)".
Union: given two symbols or sub-expressions \(x\) and \(y\), their union \(x|y\) is means "either \(x\) or \(y\) but not both".
Closure: any symbol or sub-expression repeated zero or more times.

The automata can be walked using an input string to check if the string belongs to the language defined by the regular expression (grammar).

Figure 3: Example NFA for the regular expression \((a|b)cd*\)

…

// a Thompsoner is anything that can return a Grah representation
// of itself using Thompson's construction.
type Thompsoner interface {
          Thompson(counter int) (Alphabet, *Graph)
}

…

// a Vertex has a unique id and a list of out-going edges.
type Vertex struct {
          Id       int
          Outgoing []Edge
}

// an edge has a pointer to a destination vertex and a label:
// which is the symbol that the edge "accepts". if the symbol
// can either be a symbol in the language's alphabet or epsilon
// (null transition).
type Edge struct {
          Dest    *Vertex
          Accepts Symbol
}

// a Graph has pointers to start and end vertices.
type Graph struct {
          Start, End *Vertex
}

Accept symbol

Figure 4: Thompson's construction: accept symbol

// returns a graph representation of an accept-symbol using
// Thompson's construction. Symbol wraps a rune and has no
// sub-expressions. Symbol essentially acts as a base case
// in the recursive Thompson() tree.
func (s Symbol) Thompson(counter int) (Alphabet, *Graph) {
          start := new(Vertex)
          end := new(Vertex)

          start.Id = counter
          end.Id = counter + 1

          start.Outgoing = []Edge{
                  Edge{end, s}}

          return newAlphabet(s), &Graph{start, end}
}

Kleene closure

Figure 5: Thompson's construction: Kleene closure

// returns a graph representation of a Kleene closure using
// Thompson's construction. Closure is a Thompsoner and has
// one sub-expression.
func (c Closure) Thompson(counter int) (Alphabet, *Graph) {
          start := new(Vertex)
          end := new(Vertex)
          alpha, exp := c.Exp.Thompson(counter + 1)

          start.Id = counter
          end.Id = exp.End.Id + 1

          start.Outgoing = []Edge{
                  Edge{exp.Start, Epsilon},
                  Edge{end, Epsilon}}

          exp.End.Outgoing = []Edge{
                  Edge{exp.Start, Epsilon},
                  Edge{end, Epsilon}}

          return alpha, &Graph{start, end}
}

Union

Figure 6: Thompson's construction: union

// returns a graph representation of a union using
// Thompson's construction. Union is a Thompsoner and has
// two sub-expressions.
func (u Union) Thompson(counter int) (Alphabet, *Graph) {
          start := new(Vertex)
          end := new(Vertex)
          lAlpha, lExp := u.LeftExp.Thompson(counter + 1)
          rAlpha, rExp := u.RightExp.Thompson(lExp.End.Id + 1)
          alpha := lAlpha.Union(rAlpha)

          start.Id = counter
          end.Id = rExp.End.Id + 1

          lExp.End.Outgoing = []Edge{
                  Edge{end, Epsilon}}

          rExp.End.Outgoing = []Edge{
                  Edge{end, Epsilon}}

          start.Outgoing = []Edge{
                  Edge{lExp.Start, Epsilon},
                  Edge{rExp.Start, Epsilon}}

          return alpha, &Graph{start, end}
}

Concatenate

Figure 7: Thompson's construction: concatenate

// returns a graph representation of a concatenation using
// Thompson's construction. Concat is a Thompsoner and has
// two sub-expressions.
func (c Concat) Thompson(counter int) (Alphabet, *Graph) {
          lAlpha, lExp := c.LeftExp.Thompson(counter)
          rAlpha, rExp := c.RightExp.Thompson(lExp.End.Id + 1)
          alpha := lAlpha.Union(rAlpha)

          lExp.End.Outgoing = []Edge{
                  Edge{rExp.Start, Epsilon}}

          return alpha, &Graph{lExp.Start, rExp.End}
}

Traversal

…

// implements a stack-based depth-first search over a NFA
// graph. returns true if it can reach the target from the
// given start and consumed all of the input symbols in the
// process. otherwise returns false (=the input string is not
// accepted by the NFA).
func dfs(input []Symbol, start, target *Vertex) bool {
          stack := vertexStack{}

          stack.Push(0, start)

          for !stack.Empty() {
                  idx, vertex := stack.Pop()

                  if vertex == target && idx == len(input) {
                          return true
                  }

                  for _, edge := range vertex.Outgoing {
                          if edge.Accepts == Epsilon {
                                  stack.Push(idx, edge.Dest)
                          } else if idx < len(input) {
                                  if edge.Accepts == input[idx] {
                                          stack.Push(idx+1, edge.Dest)
                                  }
                          }
                  }
          }

          return false
}

Subset construction (DFA)

…

Epsilon closure

…

// finds the closure of a set given a particular input symbol.
// that is, the set of nodes reachable from each node in the
// set if only edges labelled with the input symbol are walked.
func closure(set nfaVertexSet, symbol nfa.Symbol) nfaVertexSet {
          vertices := set.Vertices()
          closureSet := newVertexSet()
          stack := newNFAVertexStack(vertices...)

          // perform a stack-based depth-first search
          for !stack.Empty() {
                  vertex := stack.Pop()

                  for _, edge := range vertex.Outgoing {
                          if edge.Accepts != symbol {
                                  continue
                          }
                          if closureSet.Contains(edge.Dest) {
                                  continue
                          }

                          closureSet.Add(edge.Dest)
                          stack.Push(edge.Dest)
                  }
          }

          return closureSet
}

Trie data structure

…

Traversal

…

// walks a DFA graph given a slice of input symbols. returns
// true if all symbols are consumed and the walk terminates on
// an accepting state. else returns false. unlike the NFA walk
// this algorithm does not use back tracking.
func walkDFA(input []nfa.Symbol, start *Vertex) bool {
          currentVertex := start
          symbolIndex := 0
          stuck := false

          // keep walking until run out of symbols or stuck at
          // a vertex (cannot transition on current symbol).
          for symbolIndex < len(input) && !stuck {
                  symbol := input[symbolIndex]

                  if nextVertex, ok :=
                          currentVertex.Outgoing[symbol]; ok {

                          currentVertex = nextVertex
                          symbolIndex += 1
                  } else {
                          stuck = true
                  }
          }

          // the string was accepted if the automaton reached
          // an accepting state without getting stuck.
          if currentVertex.Accepting && !stuck {
                  return true
          } else {
                  return false
          }
}

Operator precedence parsing

…

Dijkstra’s Shunting-yard algorithm

…

Infix to postfix

…

Postfix to prefix

…

Prefix to abstract syntax tree

…

Tokenising

…

Regular expression engine

…

Footnotes:

https://en.wikipedia.org/wiki/Regular_grammar

https://ieeexplore.ieee.org/document/1056813

https://en.wikipedia.org/wiki/Ambiguous_grammar

⁴

https://en.wikipedia.org/wiki/Thompson%27s_construction

Building a regex engine from first principles in Golang

Table of Contents

Introduction

Formal languages

Alphabets

Production rules

Chomsky hierarchy

Regular languages

Regular expressions

Finite automata

Thompson's construction (NFA)

Accept symbol

Kleene closure

Union

Concatenate

Traversal

Subset construction (DFA)

Epsilon closure

Trie data structure

Traversal

Operator precedence parsing

Dijkstra’s Shunting-yard algorithm

Infix to postfix

Postfix to prefix

Prefix to abstract syntax tree

Tokenising

Regular expression engine

Footnotes: