Anglr Language Tutorial
Introduction
This paragraph will present the process of creating anglr file, in which the syntax rules for the language of the simple arithmetic expressions will be stored. Arithmetic expressions will be used for adding, subtracting, multiplying, and division of integers.
Problem Description
We must identify and discuss the following things:
- operands used in arithmetic expressions
- arithmetic operators
- arithmetic operators priority
- arithmetic operators associativity laws
- ability to use comments
After that we will be able to define:
- terminal symbols and regular expressions associated with them
- scanners: for the elimination of syntax-insignificant parts of text and to extract terminal symbol values
- lexical analyzer for our compiler
- syntax rules of arithmetic expressions where things identified above will be taken into account
Problem Analysis
Operands
Basic operands will be:
- natural numbers, those whole numbers that are greater or equal to zero. for example 1, -3, +7 and even --4 or +-+-7 and so on In fact, all integer numbers unsigned and signed, no matter how many times we specify the sign + or -.
- nested arithmetic expressions, those appearing between rounded braces, characters ( and ), will also be treated as basic operands. The reason is in the fact that no matter of its internal structure according to the arithmetic expression in which it is nested, it acts as a unit.
Arithmetic operators
It is not hard to immagine which arithmetic operators will be used in our expressions. They are:
- unary operators represened with characters '+' and '-' used for signing basic operands
- operator for addition. It will be represented with character string '+'
- operator for subtraction. It will be represented with character string '-'
- operator for multiplication. It will be represented with character string '*'
- operator for division. It will be represented with character string '/'
Priority of arithmetic operators
We will apply already well known rules for arithmetic operators priorities. However, let us list them again:
- The hieghest priority have unary operators acting on basic operands.
- Multiplication and division have equal priorities but lower than
- The lowes priority have addition and subtraction
- natural numbers and nested expressions have no priority
Let's take a look at the operators' priorities a little bit closer. Suppose that signing basic operands has lower priority than addition and observe the following expression:
-1 + 2
Take another example:
1 + 2 * 3
But remember, however, that these things are not strange in themselves, but because of the subconscious perception of the arrangements that we are no longer aware of.
Operator priorities highly influence the format of syntax rules that define arithmetic expressions. Let's take a look at this example:
1 * 2 + 3 * 4 + 5 * 6
1 * 2 + 3 * 4 + 5 * 6 1 * 2 + 3 * 2 * 2 + 5 * 2 * 3 2 + 12 + 30
The same arguments should be used when we observe the relationship between the multiplicative and unary expressions: multiplicative expressions are lists of unary expressions separated with multiplicative operators for multiplication and division ( * and / ).
Associativity laws
There is another aspect concerning the syntax rules and in particular the arithemtic expressions. It's about associativity laws. For some arithmetical operations, it is important in what order they are computed. Let's take a look at this example:
1 - 2 - 3
If we compute it from the left to the right (1 - 2) - 3, we get -4. If we compute it from the right to the left 1 - (2 - 3), we get 0. This means that subtraction is not associative operation. On the other side we allways get the same result with addition operator. Therefore, we distinguish computational operations into those that are associative from the left or right side, and those that are associative from both sides. By agreement, the subtraction is defined as an operation that is associative from the left side. Since it doesn't matter what order of the calculation is performed for addition, we'll use left associativity rules for addition too.
The same observation is true for multiplication and divison. Multiplication is associative operation while division is not. Like subtraction, the division is also defined as an operation associative from the left side. Also for multiplicative expression we will use left associativity rules.
Also associativity laws influence the format of syntax rules that define arithmetic expressions. Let's take a look at this example again:
1 * 2 + 3 * 4 + 5 * 6
(1 * 2 + 3 * 4) + 5 * 6
This can be read as follows: to the intermediate result located on the left side of operator + add the value of expression located on the right side of it. In fact, it is a description of the syntax rule for addition that respects the law of left associativity: to the left side is intermediate result, to the right is the reminder which will give us the final result. In this way also the syntax rules for subtraction, multiplication and division can be described.
Comments
Comments can be found anywhere in the arithmetic expression: before it, behind it, between the operator and the operand, etc. Comments represent any text located between /* and */, like:
/* before */ 1 * 2 /* between */ + 3 * 4 + 5 /* between */ * 6 /* behind */
Anglr File Construction
Now we have enough information to begin with construction of anglr rules for compiler of our simple arithmetic expressions.
Declaration part
As we know from the description of Declaration Part of Anglr File in this part we define:
- Terminal symbols used by syntax rules defined in parser part of Anglr file. Terminal symbols can be defined together with their character string representations where it makes sense.
- Regular expressions. Regular expressions are most often associated with terminal symbols, but they can also use be connected to other content in the analyzed text. It is a good practice to define regular expression for every terminal symbol used by syntax rules. The reason is that things are defined in one place and can be reused.
From the problem analyzes we know that text containing simple arithmetic expressions contains the following things:
- natural numbers. They are associated with the terminal symbol NUMBER. They will not be associated with their character string representations since it does not make sense because of the simple reason: cardinality of natural numbers set is equal to countable infinity. However, terminal symbols with multiple textual representations can also be associated with unique textual representation. But doing this is highly discouraged since it can cause confusion.
- operators for addition, subtraction, multiplication, division. They are associated with terminal symbols add, sub, mul and div respectivelly and will be also given their character string representations. Operators for signing basic operands will not be associated with additional terminal symbols, since they are equal to arithmetic operators for addition nd subtraction.
- rounded braces for nested expressions. They will be associated with terminal symbols lb and rb.
- there are also other parts of text but they are not important for the syntax analyzer like space characters, new line characters, comments, namely things that make text more readable and organised. These parts of text are not associated with terminal symbols. They are simply skipped by lexical analyzer.
Following discussion above we can define the following terminal symbols:
%terminal { NUMBER add '+' sub '-' mul '*' div '/' lb '(' rb ')' }
We then also need to define regular expressions by which scanners will extract pieces of text and associate them with terminal symbols. With terminal symbol NUMBER we will associate two regular expressions. Here is the list of regular expressions:
%regex { decimal-digit [0-9] number {decimal-digit}+ add \+ sub \- mul \* div \/ lb \( rb \) }
Note that we used the same names for the terminal symbols and regular expression names. This is perfectly legal since Anglr compiler knows from the usage context which name belongs to terminal symbol and which belongs to regular expression. We should package these definitions into the declaration part of angler file with surrounding them with part parentheses &{ and &} and equipped with appropriate attributes to get the following piece of text:
[ CompilationInfo ClassName='MathDecls' NameSpace='Math.Declarations' Access='public'] %declarations mathDecls %{ %terminal { NUMBER add '+' sub '-' mul '*' div '/' lb '(' rb ')' } %regex { decimal-digit [0-9] number {decimal-digit}+ add \+ sub \- mul \* div \/ lb \( rb \) } %}
We created declaration part of anglr file with the following properties:
- its name is mathDecls
- generated class will be named ClassName='MathDecls' and will be part of namespace NameSpace='Math.Declarations'.
Scanner part
Let's start by saying we're going to need two scanners to analyze the text. The reason is simply that comments can also appear in the text. Suppose that in text we encounter a sequence of characters that represents a natural number. If we had a single scanner, we'd have to ask ourselves where we find that number: inside or outside of comment. If we find it inside comment we should skip it otherwise we should return terminal code NUMBER to the calling environment (usually a syntax analyzer). This should be done for every terminal symbol every time we came across with it. But if we have two scanners, things get a lot simpler. The scanner that analyzes comments ignores the content of the text at all, except for the one that represents the beginning or end of the comment. On the other side, the scanner, which analyses the arithmetic expression, unconditionally recognizes the terminal symbol it encountered.
Comment scanner
Let's start with the definition of the comment scanner. This scanner does not start automatically. It is run by a scanner that reads the contents of an arithmetic expression when it encounters characters that indicate the beginning of a comment. The beginning of comment is thus not read by comment scanner. The main work of the scanner is to keep an eye on when the sequence of characters that indicate the end of the comment will appear, when it pops itself from the scanner stack uncovering the scanner that reads the contents of arithmetic expression. Text appearing before the end of comment is simply skipped by comment scanner.
Let's describe the algorithem for comment scanner:
- if scanner encounters character string indicating end of comment it must pop itself from the scanner stack
- if it encounters everything else it should skip it
- regular expresion \*\/ represents character sequence indicating end of comment
- regular expression . indicates everything else
- pop, when we encounter end of comment
- skip, for every other character
-
using CompilationInfo attribute we should specify the name of class generated by Anglr compiler for comment scanner:
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public']
-
we give the name to the comment scanner:
%scanner commentScanner
-
and combine regular expressions with actions:
\*\/ pop . skip
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public'] %scanner commentScanner %{ \*\/ pop . skip %}
Scanner for arithmetic expression
Scanner for arithmetic expressions is very simple. Its algorithm should be defined on that way:
- if it encounters character sequence for the beginning of the comment it pushes comment scanner into the scanner stack
- when it matches sequence of digits it should return terminal code NUMBER
- when it matches character sequence of arithmetic operator it should return terminal code for that operator
- when it matches character sequence of any rounded brace it should return its code
- it should skip all spaces and new line character sequences
- other character sequences should be skipped as well
[ Declarations Id='mathDecls' ] [ CompilationInfo ClassName='MathRegex' NameSpace='Math.ScannerLib' Access='public'] %scanner mathScanner %{ \/\* push commentScanner {number} terminal NUMBER {add} terminal add {sub} terminal sub {mul} terminal mul {div} terminal div {lb} terminal lb {rb} terminal rb [ \t]+ skip [\n\r] skip . skip %}
In the scanner part above:
- Attribute Declarations specifies id of declarations part (namely 'mathDecls') which contains definitions of terminal symbols and associated regular expressions used by the scanner for arithmetic expressions.
- Attribute CompilationInfo specifies class name (namely MathRegex) which will be generated by Anglr compiler for this scanner.
- name of this scanner part is mathScanner
Note the notation of regular expressions associated with terminal symbols: they are specified with their variables not by their values. This is the preffered way of regular expressions usage. Also, if we look closely at the definition of the scanner, we notice that it is essentially a function that maps text (or pieces of text) into terminal symbols.
Lexer part
Lexer part is one of the easiest parts of Anglr file. All we need to do is to provide two lexer attributes:
- UseScanner, where we list all scanners and initial scanner used by lexical analyzer
- CompilationInfo, where we provide the name of class, namespace name and access specifier which will be used by Anglr compiler when generating soure code for lexical analyzer
[ UseScanner ScannerId='commentScanner' InitialScanner='mathScanner' ] [ CompilationInfo ClassName='MathLexer' NameSpace='Math.Lexer' Access='public' Hover='true' CodeDir='.' ] %lexer mathLexer %{ %}
- both scanners defined in this tutorial are used by this lexical analyzer, mathScanner is initial scanner, the one which will be used by lexical analyzer at start up. The other scanner, commentScanner, will be activated by mathScanner when needed: at the beginning of each comment.
- name of class implementing lexical analyzer is MathLexer
- name of lexer part is mathLexer
Parser part
We've come to the last, most important part. In the analysis of the problem, too, we have mostly dealt with the things we need in this part. In parser part we need to:
- select a declaration part in which terminal symbols are defined. There is only one declaration section defined in this tutorial, named mathDecls, and is quite suitable for this parser part.
- determine the lexer that we will use to extract terminal symbols. This parser part will use mathLexer
- the main thing is the definition of syntax rules for simple arithmetic expressions.
Definition of syntax rules
Let us briefly repeat the findings from the analyses of the problem, which we must take into account when building syntax rules:
- basic operands are natural numbers and nested expressions
- the highest priority operators are signing operators, which change the sign of basic operands. They are right associative operators
- followed by multiplication and division operators. They are left associative operators
- the lowest priority have the addition and subtraction operators. They are left associative operators, too.
In the construction of syntax rules we will follow these steps:
- every syntax rule will be given unique name, with which will be introduced (and defined) new non-terminal symbol
- short description of how the syntax rule will be built
- definition of syntax rule in canonical form
syntax rule for basic operand:
name | basic-operand |
discussion | The basic operand defines two options: the use of natural numbers and nested expressions. In the definition of basic operand we use the name of syntax rule expression which will be defined later |
syntax rule |
basic-operand : NUMBER | '(' expression ')' ; |
syntax rule for signing operators:
name | signing-expression |
discussion |
The signing operators with character string representations '+' and '-',
change the sign of basic operand. It can be applied as many time as needed, even zero times is allowed, which is indicated by the
first production of the syntax rule signing-expression. As we can see this syntax rule defines right
associativity law for the signing operator, since expression:
-+-+3 -(+(-(+3))) |
syntax rule |
signing-expression : basic-operand | '+' signing-expression | '-' signing-expression ; |
syntax rule for multiplication and division:
name | multiplicative-expression |
discussion | From the discussion of the problem we know that the multiplicative expressions are actually strings of signed operands (trivialy or not trivialy) separated with operators for multiplication and division, including trivial case which is composed of single signed expression. Since multiplicative expressions are left associative, we express the intermediate result (multiplicative-expression) at the left side of operators '*' and '/' and the reminder of expression (signing-expression) to the right side. |
syntax rule |
multiplicative-expression : signing-expression | multiplicative-expression '*' signing-expression | multiplicative-expression '/' signing-expression ; |
syntax rule for addition and subtraction:
name | additive-expression |
discussion | The reasons we have given for constructing multiplicative expressions also apply to additive expressions. |
syntax rule |
additive-expression : multiplicative-expression | additive-expression '+' multiplicative-expression | additive-expression '-' multiplicative-expression ; |
The syntax rule expression, used in the second production of basic-operand has not yet been defined. Since we've defined all the syntax rules we need, expression has to be one of them. The most appropriate candidate is additive-expression since it is the "outermost" syntax rule, the one which generates all others with using trivial syntax rules in succession in the derivation of first production (the trivial one) of additive-expression. For example, we can derive all syntax rules defined above:
additive-expression --> multiplicative-expression --> signing-expression --> basic-operand
additive-expression : multiplicative-expression ; multiplicative-expression : signing-expression ; signing-expression : basic-operand ;
-
we can use this identity:
expression : additive-expression ;
-
or we can redefine the second production of basic-operand:
basic-operand : NUMBER | '(' additive-expression ')' ;
The parser section is almost complete. We have to do the following:
- define attributes Declarations, Lexer and CompilationInfo, which will help define declaration part containing definitions of terminal symbols, lexical analyzer and class name which will be used by Anglr compiler to store implementation of parser based on syntax rules specified in this parser part.
- define the name of parser part
- Collect all syntax rules and publish them in one place.
- By using the Start attribute, we declare syntax rule expression as the starting rule of language of simple arithmetic expressions.
Let's do it:
[ Declarations Id='mathDecls' ] [ Lexer Id='mathLexer' ] [ CompilationInfo ClassName='MathParser' NameSpace='Math.Parser' Access='public' CodeDir='mathParser' ] %parser mathParser %{ [ Start ] expression : additive-expression ; additive-expression : multiplicative-expression | additive-expression '+' multiplicative-expression | additive-expression '-' multiplicative-expression ; multiplicative-expression : signing-expression | multiplicative-expression '*' signing-expression | multiplicative-expression '/' signing-expression ; signing-expression : basic-operand | '+' signing-expression | '-' signing-expression ; basic-operand : NUMBER | '(' expression ')' ; %}
Anglr File Contents
This is the final content of Anglr file, which defines language for simple arithmetic expressions:
[ CompilationInfo ClassName='MathDecls' NameSpace='Math.Declarations' Access='public'] %declarations mathDecls %{ %terminal { NUMBER add '+' sub '-' mul '*' div '/' lb '(' rb ')' } %regex { decimal-digit [0-9] number {decimal-digit}+ add \+ sub \- mul \* div \/ lb \( rb \) } %} [ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public'] %scanner commentScanner %{ \*\/ pop . skip %} [ Declarations Id='mathDecls' ] [ CompilationInfo ClassName='MathRegex' NameSpace='Math.ScannerLib' Access='public'] %scanner mathScanner %{ \/\* push commentScanner {number} terminal NUMBER {add} terminal add {sub} terminal sub {mul} terminal mul {div} terminal div {lb} terminal lb {rb} terminal rb [ \t]+ skip [\n\r] skip . skip %} [ UseScanner ScannerId='commentScanner' InitialScanner='mathScanner' ] [ CompilationInfo ClassName='MathLexer' NameSpace='Math.Lexer' Access='public' Hover='true' CodeDir='.' ] %lexer mathLexer %{ %} [ Declarations Id='mathDecls' ] [ Lexer Id='mathLexer' ] [ CompilationInfo ClassName='MathParser' NameSpace='Math.Parser' Access='public' CodeDir='mathParser' ] %parser mathParser %{ [ Start ] expression : additive-expression ; additive-expression : multiplicative-expression | additive-expression '+' multiplicative-expression | additive-expression '-' multiplicative-expression ; multiplicative-expression : signing-expression | multiplicative-expression '*' signing-expression | multiplicative-expression '/' signing-expression ; signing-expression : basic-operand | '+' signing-expression | '-' signing-expression ; basic-operand : NUMBER | '(' expression ')' ; %}
Advanced Features
As we know from the discussion of the parser part of Anglr file and particulary from the discussion of cardinality operators, the syntax rules can be expressed in more compact form by using cardinality operators. Cardinality operators are used to represent recursive syntax rules in the form of the sequences. This fact can alo be used in this tutorial.
When looking at the syntax rules constructed above we can see some interesting things:
- syntax rules additive-expression, multiplicative-expression and signing-expression are actually lists of expressions with higher priority. For example: additive-expression is a list of multiplicative-expressions, multiplicative-expression is a list of signing-expressions and so on.
- this lists are delimited with arithmetical operators like '+' and '-' in the case of additive-expression, '*' and '/' in the case of multiplicative-expression and so on.
- Some of these syntax rules are expressed using left recursions and the others are expressed using right recursions.
[ Start ] expression : additive-expression ; additive-expression : multiplicative-expression + [ '+' ] | multiplicative-expression + [ '-' ] ; multiplicative-expression : signing-expression + [ '*' ] | signing-expression + [ '/' ] ; signing-expression : basic-operand | '+' signing-expression | '-' signing-expression ; basic-operand : NUMBER | '(' expression ')' ;
This can be further reduced in the following form:
[ Start ] expression : additive-expression ; additive-expression : multiplicative-expression + [ '+' | '-' ] ; multiplicative-expression : signing-expression + [ '*' | '/' ] ; signing-expression : basic-operand | '+' signing-expression | '-' signing-expression ; basic-operand : NUMBER | '(' expression ')' ;
Using nested rules we can achieve greater compactness of the specifications of syntax rules. If we replace standalone rule for basic-operand with its nested rule equivalent, for example, we get:
[ Start ] expression : additive-expression ; additive-expression : multiplicative-expression + [ '+' | '-' ] ; multiplicative-expression : signing-expression + [ '*' | '/' ] ; signing-expression : ( : basic-operand : NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression ;
Following this way we can replace all standalone rules, except starting rule, with their nested rule equivalents, geting in succession:
[ Start ] expression : additive-expression ; additive-expression : multiplicative-expression + [ '+' | '-' ] ; multiplicative-expression : ( : signing-expression : ( : basic-operand : NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression) + [ '*' | '/' ] ;
[ Start ] expression : additive-expression ; additive-expression : ( : multiplicative-expression : ( : signing-expression : ( : basic-operand : NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression) + [ '*' | '/' ]) + [ '+' | '-' ] ;
[ Start ] expression : ( : additive-expression : ( : multiplicative-expression : ( : signing-expression : ( : basic-operand : NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression) + [ '*' | '/' ]) + [ '+' | '-' ]) ;
Using anonymous syntax rules for all nested syntax rules which are not referenced by other syntax rules, we get:
[ Start ] expression : ( ( ( : signing-expression : ( NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression) + [ '*' | '/' ]) + [ '+' | '-' ]) ;
Eliminating unnecessary rounded brackets and puting everything in the single line, we get the final and most compact specification of syntax rules for simple arithmetical expressions:
[ Start ] expression : ( : signing-expression : ( NUMBER | '(' expression ')' ) | '+' signing-expression | '-' signing-expression) + [ '*' | '/' ] + [ '+' | '-' ] ;
If we carefully examine the last syntax rule we can see that is composed of:
- some ugly stuff which is difficult to understand. In fact it is nothing but a signing-expression expressed in the form of nested syntax rule
-
the other part following the ugly stuff is more interesting: + [ '*' | '/' ] + [ '+' | '-' ]
It is the sequence of cardinality operators whose delimiters are arithmetical operators with descending priority. Let us imagine that in our simple arithmetic expressions
we use arithmetic operators, which are known in modern programming languages. The sequence of cardinality operators whose delimiters are arithmetic operators used in most
modern programming languages counted with their descending priorities could then look something like this:
+ [ '*' | '/' ] + [ '+' | '-' ] + [ '<<' | '>>' ] + [ '<' | '<' | '<=' | '>=' ] + [ '==' | '!=' ] + [ '&' ] + [ '^' ] + [ '|' ] + [ '&&' ] + [ '||' ]
That's no longer a joke, because we get a recipe of how to replace a set of possibly few dozens of syntax rules with only one in concise and easy to understand form, namely:[ Start ] expression : signing-expression + [ '*' | '/' ] + [ '+' | '-' ] + [ '<<' | '>>' ] + [ '<' | '>' | '<=' | '>=' ] + [ '==' | '!=' ] + [ '&' ] + [ '^' ] + [ '|' ] + [ '&&' ] + [ '||' ] ;
We can read above syntax rule in the following way:
- basic arithmetic expressions are signing-expression: signed or unsigned numeres and nested expresions as well, those enclosed with rounded braces.
- basic expressions can be multiplyed and divided
- results of multiplication and division and basic expressions themself can be furder added and subtructed
- the results of all previous expressions: basic expressions, multiplyed, divided, added and subtructed can be shifted left or right
- following this way we can continue till the end syntax rule where we can say that the results of all previous arithmetic operations can be logically "or"-ed
- we can also observe the ordering of priorities: multiplication and divison have higher priority, they are followed by addition and subtraction and so on. The lowes priority has operator logical or with character representation '||'.
- we can also observe the un-natural ordering of priorities for arithmetical-and '&', arithmetical-xor '^' and arithmetical-or '|' operators, whose priorities are lower than that of relational '<', '>', '<=', '>=' and equality '==', '!=' operators.
The following should be mentioned here:
- it is completely unnecessary to strive to achieve maximum compactness.
- be careful when applying anonymous rules.