Simulator Internals
======================

Grackle consists of the following components:

 - **MATLAB Parser**
     MATLAB is a surprisingly complex language to parse for a
     variety of reasons.

   -  Many parts of the language are white-space sensitive, and the
      parser must account for this. For example, both the strings
      ``[x+y]`` and ``[x_+_y]`` where ``_`` is a space character denote
      arrays with a single element. However, ``[x_+y]`` denotes an array
      with two elements.
   -  The same character may have different meanings in different
      contexts. For example, the single quote ``'`` is used for both
      transpose and the beginning of string literals.
   -  There are typically a variety of ways of writing the same
      expression. For example, matrix literals may use new-lines or
      semicolons to separate rows. Individual values in rows may use
      spaces or commas.

   Our approach to parsing MATLAB has been to use a three phase process
   consisting of relatively simple independent translation steps. The
   components for the respective phases are the **lexer**, the **syntax
   parser**, and the **precedence parser**.

   -  The **lexer** recognizes characters in the MATLAB program, and
      groups them into tokens. The lexer is defined using regular
      expressions, and generates tokens for whitespace and newlines in
      addition to the other characters.
   -  The **syntax parser** uses the tokens generated by the lexer, and
      builds the abstract syntax tree representing the MATLAB program.
      The parser is defined as a context free grammar.
   -  The **precedence parser** reorganizes the terms generated by the
      syntax parser so that rules of precedence are respected.

   Our approach differs from existing parsers from Matlab-like languages
   that we have found in several ways. Octave has a two stage parser,
   and incorporates operator precedence into the syntax parser.
   Additionally, the lexer is context-sensitive and the parser will
   selectively indicate when it should be whitespace sensitive. This
   simplifies the rules in the grammar, but couples the lexing and
   parser, so that they cannot be analyzed independently.

-  **Control Flow Graph Translator** Once parsed, the code is simplied
   and translated into a simplified control flow graph representation
   for symbolic simulation. In this representation, a control flow graph
   is constructed for each function in the source file, including
   anonymous functions and local functions.

   The control flow graph for a function includes:

   -  A unique handle for referencing the function even if two functions
      share the same name.
   -  The list of arguments and return values for the function.
   -  The list of registers used to store temporary values within the
      function. Each register has an associated type used to indicate
      the type of value that may be stored in the register.

   -  A set of basic blocks, where each basic block has a unique
      non-negative integer label, and a sequence of instructions. The
      entry point to the function is a basic block with label ``0``.

   The types of values stored in register are:

   1. ``RealArray``: An array of real or complex values.
   2. ``IntArray``: An array of fixed width signed integers. The width is
      not part of the type, and the width of values stored in the
      register may vary during execution.
   3. ``UIntArray``: An array of fixed width signed integers. The width
      is not part of the type, and the width of values stored in the
      register may vary during execution.
   4. ``CharArray``: An array of character values. These are wide
      characters whose encoded value ranges from ``0`` to ``2^16-1``.
   5. ``LogicArray``: An array of Boolean values that are either true or
      false.
   6. ``CellArray``: An array of arbitrary Matlab values.
   7. ``StructArray``: An array of Matlab structs.
   8. ``FunctionHandle``: A single handle to a Matlab function, or a
      closure containing a reference to the function and the extra
      context values to call the function with.
   9. ``AnyMatlab``: A register that may store any of the above Matlab
      values.

   The instructions are as follows:

   -  Assignment statements ``lhs := rhs``. These evaluate the
      right-hand side in the current frame, and assign the value to the
      left-hand side.
   -  Function calls ``v := f(args,n)`` evaluates ``f`` and calls the
      resulting function with the given arguments. The last argument,
      denoted ``n``, returns the number of values that the calling
      function expects. It is the called function's responsibility to
      ensure it returns sufficient arguments, or throw a exception when
      too many arguments are requested.
   -  Print statements ``print name expr`` that evaluate the expression,
      and print the result with the given name.
   -  Branch statements ``br expr x y (postdom z)`` which evaluate a
      conditional expression, and jump to label ``x`` if true and label
      ``y`` if false. The postdominator for the current branch is the
      optional argument ``z``, and is used for determining where to
      merge.
   -  Jump statements ``jump x`` which jump to ``x``.
   -  Switch statements ``switch expr cases`` which take a Matlab
      expression as a value and jump to the label appropriate to the
      type of the branch. The cases are for the first 8 types of values
      that may be stored in a Matlab register, and each case contains a
      typed register to store the matched value and a label to jump to.
   -  Return statements ``return`` which return from the current
      procedure.

   The stages of the translation process are as follows:

   1. Operators are mapped to their associated functions to simplify the
      expression language. For example ``x + y`` maps to ``plus(x,y)``.
   2. Statements involving complex expression evaluation that may have
      side effects are converted into a simpler register transfer
      language where each statement may be executed atomically.
   3. Control structures such as if statements, for loops, and while
      loops are eliminated and replaced with basic blocks with a single
      entry point and jumps between blocks.
   4. Anonymous functions such as ``@(x) x+y`` are lifted up into
      standalone functions. When the function refers to extra variables
      defined in the parent context (such as the variable ``y`` above),
      the function is extended to take an extra argument containing the
      values of those captured variables. Function handles for anonymous
      functions will then contain both a reference to the outer
      function, and the values to pass in through the first argument.
   5. Finally, A post-dominator analysis is performed to identify merge
      points for all conditional branch points in the program.
      Conditional branches are then annotated with the post-dominator
      nodes.

-  The **Symbolic Simulator** processes the simplified as input along
   with an initial starting program execution state, and symbolically
   executes the program until all execution paths in the simulator state
   have completed.

   The symbolic simulator maintains a tree called the *execution tree*.
   The execution tree generalizes a *call stack* and stores the state of
   the simulator along different execution paths. The execution tree
   contains two types of internal nodes and three types of leaves.

   The internal nodes in the execution tree are either:

   -  *Branches*. Branches in the program whose condition is symbolic.
      These also store information about when the child execution tree
      computations should stop so that they can merge into a single
      execution.

   -  *Call frames*. State used to return from a procedure call, and
      update the caller's local state. Call frames can currently either
      be user-provided *Matlab call frames*, which correspond to
      symbolic execution of Matlab code provided by the user, or
      *primitive call frames*, which correspond to executions currently
      executing a primitive operation such as ``plus``.

   The leaves in an execution tree are either:

   -  *Active paths*. Execution paths that have not yet terminated or
      reached their merge point.
   -  *Return values*. Results from execution paths that have reached
      the merge point for the parent branch or the final result of the
      overall simulation.
   -  *Failed paths*. Execution paths that have violated an assertion
      and have effectively terminated.

   The simulator infrastructure is designed with extension in mind.
   Eventually, we plan to port the LLVM symbolic simulator to use the
   new simulator infrastructure. This will allow us to reduce the
   overhead of maintaining two separate simulators, and provide a
   mechanism for calling C code from Matlab.

-  The **Symbolic Backend** provides an API to the symbolic simulator so
   that it can generate and manipulate expressions. The expressions can
   be sent to automated theorem provers such as Yices or written out to
   standardized file formats such as SMTLIB.

-  Finally, the simulator contains **Builtin Definitions** that define
   the semantics for builtin functions that are not defined in the
   program. The builtin functions include arithmetic and logical
   operations, primitives for creating symbolic variables, and support
   for CFDF primitives.