Skip to content

DSL ~ LARK Documentation

andi willow edited this page May 2, 2022 · 23 revisions

LARK is the parsing tool used by the domain specific language team to enable algorithmic, grammar based conversion of dsl files, which are largely written in plaintext, to a WDL++.

There are 5 Parts to creating a lark file:

  1. Writing the grammar
  2. Creating the parser
  3. Shaping the tree
  4. Evaluating the tree
  5. Optimizing

Writing the Grammar

Lark accepts its grammars in a format called EBNF:
Example of EBNF code can be found here: https://tomassetti.me/ebnf/

item: _NL "ITEM" id "IN" location property* action*
    | _NL "ITEM" id property* action*

EBNF syntax is a method to rigorously specify formal language (that is, a language with precise structure and little ambiguity). We use EBNF to specify the exact rules and terminals that define our language.

Terminals can be thought of as the alphabet of a language. Rules can be thought of as defining how we combine and structure the terminals.

Creating the Parser

To create a parser, all we need to do is to tell Lark to take a value

_NL: ( /(\r?\n[\t ]*)/ | COMMENT) +

%import .util.COMMENT
%ignore COMMENT

Shaping the Tree

Shaping the tree allows you to structure grammar inputs that would normally be filtered out such as DEC_NUMBER, "true", "false", etc... Examples and a more complex description of functionalities are included below.

    ?value: dict
          | array
          | noun verb "like" noun -> comparative
          | DEC_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

  • The little arrows represent aliases, which is a name for a specific rule. Creating of aliases will allow it to be easier for later processing. (In this case, true, false, and null are the same)
  • ?[value] tells the code to arrange the tree so that it only has one branch since it only has one member (value).
  • By turning the DEC_NUMBER terminal into a rule, we can have it appear as a branch of the tree so that it can be used in other places that are part of grammar (mainly in pair rule).

Evaluating the Tree

Trees are built automatically based on the structure (rules) of the grammar. Each rule that is matched becomes a node on the tree, with children nodes that are generated from all of the parent rule's matches.

As an example (from the Lark documentation):

Let us define the grammar as follows:

start:  PNAME pname

PNAME:  "(" NAME ")"
pname:  "(" NAME ")"

NAME:   /\w+/
%ignore /\s+/

Then Lark parses "(Hello)(World)" as

start
    (Hello)
    pname World

Optimizing

By default, Lark uses the Earley algorithm to parse language and grammar. While versatile and powerful, with the ability to parse context-free languages (more on the Earley parser here), it runs in cubic time in the general case. If we write grammar that is LR-compatible we can parse using an LR parser in linear time. Specifically, we would need to use a deterministic context-free language with unambiguous grammar more here. This would allow us to use the LALR(1) parser which runs in linear time, toggled via a command line option.

json_parser = Lark(json_grammar, start='value', parser='lalr')

A full understanding of various optimizations required an elementary knowledge in formal language. Some more information can be found here: formal language formal grammar context-free languages

Example calculator built using LARK:

from lark import Lark, Transformer, v_args
 
try:
    input = raw_input
except NameError:
    pass
 
my_grammar_calc = """
    %import common.CNAME -> NAME
    %import common.NUMBER
    %import common.WS_INLINE
    %ignore WS_INLINE
 
    ?start: value -> number
    ?value: add
        | SIGNED_NUMBER -> assign_var
    ?add: sub
        | add "+" sub -> add
    ?sub: mult
        | sub "-" mult -> sub
    ?mult: div
        | mult "*" div -> mul
    ?div: neg
        | div "/" neg -> div
    ?neg: parenth
        | "-" parenth -> neg
    ?parenth: value
        |"(" value ")"
"""
 
@v_args(inline = True)
class my_calc_tree(Transformer):
    from operator import add, sub, mul, truediv as div, neg
    number = float
    def __init__(self):
        self.vars = {}
    def assign_var(self, nm, val):
        self.vars[name] = val
 
parse_my_calc = Lark(my_grammar_calc, parser = 'lalr', transformer=my_calc_tree())
calculator = parse_my_calc.parse
 
def test():
    print(calculator("a = 3+4"));
    print(calculator("a = 3/4"));
    print(calculator("a = 3*4"));
    print(calculator("a = 3-4"));
 
test()

Clone this wiki locally