16.06.2008 Sinfex – Simple Infix Expression Evaluator

See: http://blogs.testbit.eu/timj/2008/06/16/16062008-sinfex-simple-infix-expression-evaluator/ (page moved)

The XML GUI definition files used in Rapicorn and also in Beast (described briefly in an earlier blog post) supported a simple $(function,arguments...) evaluation syntax, similar to GNU Make. I’ve never been very happy with this syntax, but it was fairly easy to implement at the start and followed naturally from early $VARIABLE expansion features. At some point last year I considered various alternative syntax variants and discussed the ideas with Stefan Westerfeld. Over the course of the last two months, I finally got around to implement them.

I’ve never grown familiar with reverse polish notation, and although Guile is the canonical scripting language for Beast, I’ve always found myself very inefficient with expressing my thoughts in lisp expressions. So the new syntax definitely had to support infix expressions – despite the more complex parsing logic required to parse them. Bison already ships with an example calculator that parses infix expressions, so that’s a quick start as far as the syntax rules go. Integration is quite a different story though, more on that later.

Since in Rapicorn the expressions are used to compute widget property values, they are likely to be executed very often, i.e. each time a widget is created from an XML file description. Consequently, I wanted to use a pre-parsed AST to carry out the evaluation and avoid mixing evaluation logic with parser logic, which would have forced reparsing expressions upon each evaluation. At first I quickly threw together some C++ classes for the arithmetic operators and used those as nodes for the AST, similar to:

  class ASTNodeNot : virtual public ASTNode {
    ASTNode &m_operand;
  public:
    explicit ASTNodeNot (ASTNode &operand) :
      m_operand (a)
    {}
    virtual Value
    eval (Env *env) const
    {
      Value a = m_operand.eval();
      return Value (!a.asbool());
    }
  };

The supported syntax is quickly summarized:

  Operators: ( + - * / ** < > <= >= == != or and not )
  Functions: name ( args... )
  Inputs:    FloatingPoint 'SingleQuotedString' "DoubleQuotedString"

In this notation, FloatingPoint includes hexadecimal numbers and of course integers and the quoted strings support C style escape sequences like octal numbers, ‘\n‘, ‘\t‘ and so on. The functions can be implemented by the parser API user, but a good set of standard arithmetic functions is already supported like ceil(), floor(), min(), max(), log(), etc. So it’s a very basic parser, but covers the vast majority of expressions needed to position widgets or configure packing properties.

One thing that turned out to be tricky is the binary operator semantics for strings. At the very least, I wanted support for "string" + "string" and "string" == "string". Since both operators are supported for numbers as well, the exact behavior of "string" + FloatingPoint and "string" == FloatingPoint had to be defined. I managed to find a few programming language precedents here in Perl, Python, and ECMAScript (Javascript). They of course all handle the cases differently. In the end I settled with ECMAScript semantics:

  Value1 == Value2  # does string comparisons if both values are strings
  Value1 + Value2   # does string conversion if either value is a string

Unit testing for the parser turned out to be particularly easy to implement. All that’s needed is a small utility that reads expressions and prints/verifies the evaluation results. Throwing in some additional shell code allowed a normal text file to drive unit testing. It simply contains expressions and expected results on alternating lines. Btw, libreadline can be really handy in cases like this. Using it takes a mere 5-10 lines of additional code to support a nice interactive command line interface including history for the evaluator test shell.

After some initial testing, the C++ AST node classes seemed like an awful lot of pointers, fragmentation and VTable calls for a supposedly straight forward expression evaluation. Also, adding the missing destruction semantics would have significantly increased the existing class logic. So I tried to come up with a leaner and foremost flat memory presentation. In the end, I settled with a single array that grows in 4 byte (integer) steps, embeds strings literally (padded to 4 byte alignment) and uses array offsets instead of pointers for references. A single multiplication operator is encoded with 3 integers that way: MUL_opcode factor1_index factor2_index. That’s essentially 12 bytes per binary operator, still significantly more than the parser input, but also significantly smaller than allocating an equivalent C++ class. Evaluation of the expression stored in the array doesn’t need any VTable calls, and a single straight forward evaluation function can be used, that implements the different operators as switch statement cases. Also release semantics are persuasively trivial for a single consecutive array.

What’s left was to figure a way to embed expression evaluation in XML values or text blocks. Previously, $(expression) was substituted everywhere, but with the new syntax, variables (or rather constants defined within the Rapicorn core or via the ‘<arg:ArgName default="5"/>‘ syntax supported by Rapicorn XML files) didn’t use a $-prefix anymore. So sticking with $() seemed to make little sense. As it’s implemented now, backticks are used to cause expression evaluation, with the empty expression evaluating to a single backtick:

  XML Value/Text         Parser Result
    Foo  5 + 5      ->     Foo  5 + 5
    Foo `5 + 5`     ->     Foo 10
    ``Foo``         ->     `Foo`

We will see how useful the current expression style turns out to be. I don’t consider every implementation bit outlined here solidly engraved into stone yet. So as always, I’m open to constructive feedback.

As forewarned, I have a few more words on integrating Flex and Bison with each other and into one’s own library. First, Flex and Bison turned out not to be exactly simple to configure, especially if none of the generated symbols should be exported from a library or clash with a possible second parser linked into the same library or program. Also some fiddling is required to pass on proper line count information from the lexer to the parser, getting character counts as well is non-trivial but lucky wasn’t strictly needed for Rapicorn expressions. The simplest setup I managed to come up with works as follows:

  sinfex.hh     # public API
  sinfeximpl.hh # internal structure definitions
  sinfex.cc     # evaluator implementation
  sinfex.l      # scanner rules for Flex
  sinfex.y      # parser rules for Bison
  sinfex.lgen   # generated by Flex
  sinfex.ygen   # generated by Bison

The only compiled file in this list is sinfex.cc which includes sinfex.lgen and sinfex.ygen as part of an anonymous C++ namespace. A linker script ldscript.map used when finally linking the library prevents anonymous symbols from being exported. The anonymous namespacing of everything declared in sinfex.lgen and sinfex.ygen is what prevents clashes with a possible second parser linked into the library. This isn’t as elegant as i was hoping for, but at least effective in a practical sense. There unfortunately is no way to configure Flex or Bison to generate only static functions and variables. And yes, I have also looked into variants like flex++, bison++, bisonc++, byacc, etc, but they usually show much of the same problems and also tend to make matters worse by generating more files or more complex code.