Grackle consists of the following components:
- MATLAB Parser
MATLAB is a surprisingly complex language to parse for a variety of reasons.
- Many parts of the language are white-space sensitive, and the parser must account for this. For example, both the strings
_is a space character denote arrays with a single element. However,
[x_+y]denotes an array with two elements.
- The same character may have different meanings in different contexts. For example, the single quote
'is used for both transpose and the beginning of string literals.
- There are typically a variety of ways of writing the same expression. For example, matrix literals may use new-lines or semicolons to separate rows. Individual values in rows may use spaces or commas.
Our approach to parsing MATLAB has been to use a three phase process consisting of relatively simple independent translation steps. The components for the respective phases are the lexer, the syntax parser, and the precedence parser.
- The lexer recognizes characters in the MATLAB program, and groups them into tokens. The lexer is defined using regular expressions, and generates tokens for whitespace and newlines in addition to the other characters.
- The syntax parser uses the tokens generated by the lexer, and builds the abstract syntax tree representing the MATLAB program. The parser is defined as a context free grammar.
- The precedence parser reorganizes the terms generated by the syntax parser so that rules of precedence are respected.
Our approach differs from existing parsers from Matlab-like languages that we have found in several ways. Octave has a two stage parser, and incorporates operator precedence into the syntax parser. Additionally, the lexer is context-sensitive and the parser will selectively indicate when it should be whitespace sensitive. This simplifies the rules in the grammar, but couples the lexing and parser, so that they cannot be analyzed independently.
Control Flow Graph Translator Once parsed, the code is simplied and translated into a simplified control flow graph representation for symbolic simulation. In this representation, a control flow graph is constructed for each function in the source file, including anonymous functions and local functions.
The control flow graph for a function includes:
- A unique handle for referencing the function even if two functions share the same name.
- The list of arguments and return values for the function.
- The list of registers used to store temporary values within the function. Each register has an associated type used to indicate the type of value that may be stored in the register.
- A set of basic blocks, where each basic block has a unique
non-negative integer label, and a sequence of instructions. The
entry point to the function is a basic block with label
The types of values stored in register are:
RealArray: An array of real or complex values.
IntArray: An array of fixed width signed integers. The width is not part of the type, and the width of values stored in the register may vary during execution.
UIntArray: An array of fixed width signed integers. The width is not part of the type, and the width of values stored in the register may vary during execution.
CharArray: An array of character values. These are wide characters whose encoded value ranges from
LogicArray: An array of Boolean values that are either true or false.
CellArray: An array of arbitrary Matlab values.
StructArray: An array of Matlab structs.
FunctionHandle: A single handle to a Matlab function, or a closure containing a reference to the function and the extra context values to call the function with.
AnyMatlab: A register that may store any of the above Matlab values.
The instructions are as follows:
- Assignment statements
lhs := rhs. These evaluate the right-hand side in the current frame, and assign the value to the left-hand side.
- Function calls
v := f(args,n)evaluates
fand calls the resulting function with the given arguments. The last argument, denoted
n, returns the number of values that the calling function expects. It is the called function’s responsibility to ensure it returns sufficient arguments, or throw a exception when too many arguments are requested.
- Print statements
print name exprthat evaluate the expression, and print the result with the given name.
- Branch statements
br expr x y (postdom z)which evaluate a conditional expression, and jump to label
xif true and label
yif false. The postdominator for the current branch is the optional argument
z, and is used for determining where to merge.
- Jump statements
jump xwhich jump to
- Switch statements
switch expr caseswhich take a Matlab expression as a value and jump to the label appropriate to the type of the branch. The cases are for the first 8 types of values that may be stored in a Matlab register, and each case contains a typed register to store the matched value and a label to jump to.
- Return statements
returnwhich return from the current procedure.
The stages of the translation process are as follows:
- Operators are mapped to their associated functions to simplify the
expression language. For example
x + ymaps to
- Statements involving complex expression evaluation that may have side effects are converted into a simpler register transfer language where each statement may be executed atomically.
- Control structures such as if statements, for loops, and while loops are eliminated and replaced with basic blocks with a single entry point and jumps between blocks.
- Anonymous functions such as
@(x) x+yare lifted up into standalone functions. When the function refers to extra variables defined in the parent context (such as the variable
yabove), the function is extended to take an extra argument containing the values of those captured variables. Function handles for anonymous functions will then contain both a reference to the outer function, and the values to pass in through the first argument.
- Finally, A post-dominator analysis is performed to identify merge points for all conditional branch points in the program. Conditional branches are then annotated with the post-dominator nodes.
The Symbolic Simulator processes the simplified as input along with an initial starting program execution state, and symbolically executes the program until all execution paths in the simulator state have completed.
The symbolic simulator maintains a tree called the execution tree. The execution tree generalizes a call stack and stores the state of the simulator along different execution paths. The execution tree contains two types of internal nodes and three types of leaves.
The internal nodes in the execution tree are either:
- Branches. Branches in the program whose condition is symbolic. These also store information about when the child execution tree computations should stop so that they can merge into a single execution.
- Call frames. State used to return from a procedure call, and
update the caller’s local state. Call frames can currently either
be user-provided Matlab call frames, which correspond to
symbolic execution of Matlab code provided by the user, or
primitive call frames, which correspond to executions currently
executing a primitive operation such as
The leaves in an execution tree are either:
- Active paths. Execution paths that have not yet terminated or reached their merge point.
- Return values. Results from execution paths that have reached the merge point for the parent branch or the final result of the overall simulation.
- Failed paths. Execution paths that have violated an assertion and have effectively terminated.
The simulator infrastructure is designed with extension in mind. Eventually, we plan to port the LLVM symbolic simulator to use the new simulator infrastructure. This will allow us to reduce the overhead of maintaining two separate simulators, and provide a mechanism for calling C code from Matlab.
The Symbolic Backend provides an API to the symbolic simulator so that it can generate and manipulate expressions. The expressions can be sent to automated theorem provers such as Yices or written out to standardized file formats such as SMTLIB.
Finally, the simulator contains Builtin Definitions that define the semantics for builtin functions that are not defined in the program. The builtin functions include arithmetic and logical operations, primitives for creating symbolic variables, and support for CFDF primitives.