Scanner Part of Anglr File
Introduction
The scanner is the part of the lexical analyzer that extracts terminal symbols from the source text. It uses a set of regular expressions and extracts from the source file the character sequences that best match one of them. Each regular expression can be associated with a terminal symbol, which creates a mapping between character sequences in the source file and terminal symbols. The names of the terminal symbols and the values of the regular expressions required by the scanner are found in the declaration parts of the anglr file. The link between a declaration part and the scanner is described by the attributes that precede the part in which the scanner is defined.
Typically, multiple scanners are defined in one anglr file, and they share the work of lexical analysis. For example, one scanner may specialize in detecting content, while another handles things that do not matter to the content itself, such as comments.
Syntax
The following is a list of syntax rules that define the contents of the scanner part.
RULE S-1
<scanner part> : <attribute list> ? '%scanner' <identifier> '%{' <regular expression list> ? '%}' ;
RULE S-2
<regular expression list> : <regular expression usage> + ;
RULE S-3
<regular expression usage> : <regular expression> <actions> ? ;
RULE S-4
<actions> : <action> + ;
RULE S-5
<action> : <skip action> | <terminal action> | <event action> | <push action> | <pop action> ;
RULE S-6
<skip action> : 'skip' ;
RULE S-7
<terminal action> : 'terminal' <identifier> ;
RULE S-8
<event action> : 'event' <identifier> ;
RULE S-9
<push action> : 'push' <identifier> ;
RULE S-10
<pop action> : 'pop' ;
Here is an example of two scanners:
[ Description Text='definition of scanner, which extracts comments from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public']
%scanner commentScanner
%{
[\*]+\/
    pop
[\n\r]
    skip
[^\*]+
    skip
[\*]+
    skip
%}

[ Description Text='definition of scanner, which extracts terminal symbols from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='MathRegex' NameSpace='Math.ScannerLib' Access='public']
%scanner mathScanner
%{
\/\*
    push commentScanner
{number}
    terminal NUMBER
\+
    terminal add
\-
    terminal sub
\*
    terminal mul
\/
    terminal div
\(
    terminal lb
\)
    terminal rb
[ \t]+
    skip
[\n\r]
    skip
.
    skip
%}
In the example above, two scanners are defined. The first scanner removes comments from the source text. The second scanner extracts the "useful" content, which is sent to the syntax analyzer. Looking closely at the examples, we can see that they are basically a kind of case statement. In fact they are case statements, since the Anglr compiler translates them into case statements of the selected programming language.
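The case-statement reading can be illustrated with a small Python sketch. This is not the code the Anglr compiler generates. The rule table, the `next_terminal` helper, the assumption that {number} means `[0-9]+`, and the interpretation of "best match" as the longest match (with the earliest rule winning ties) are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for part of mathScanner; {number} is assumed to be [0-9]+.
RULES = [
    (re.compile(r"[0-9]+"), "NUMBER"),
    (re.compile(r"\+"), "add"),
    (re.compile(r"\*"), "mul"),
    (re.compile(r"[ \t]+"), None),   # None stands for the skip action
    (re.compile(r"[\n\r]"), None),
    (re.compile(r"."), None),        # together with the rule above, always matches
]

def next_terminal(text, pos):
    """Return (terminal, matched text, new position), or None at end of input."""
    while pos < len(text):
        best, best_terminal = None, None
        for regex, terminal in RULES:
            m = regex.match(text, pos)
            # Assumed meaning of "best match": longest match, earliest rule on ties.
            if m and (best is None or m.end() > best.end()):
                best, best_terminal = m, terminal
        pos = best.end()
        if best_terminal is None:
            continue                              # skip: discard the text, scan again
        return best_terminal, best.group(), pos   # terminal: hand the symbol back
    return None
```

Each entry of the table plays the role of one case branch: a skip branch loops, a terminal branch returns to the caller.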
Discussion
RULE S-1 - Structure of Scanner Part
RULE S-1 defines the top-level structure of the scanner part:
- The scanner part is preceded by a possibly empty attribute list.
- The attribute list is followed by the reserved word %scanner and an identifier representing the name of the scanner.
- Then comes a list of regular expressions and associated actions between the part delimiters %{ and %}.
In the example above, the first scanner part is preceded by this attribute list:
[ Description Text='definition of scanner, which extracts comments from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public']
Its name is:
%scanner commentScanner
Then a list of regular expressions and associated actions follows:
%{
[\*]+\/
    pop
[\n\r]
    skip
[^\*]+
    skip
[\*]+
    skip
%}
We can read the list above as a case statement:
- If you encounter a non-empty sequence of asterisks that ends with the '/' sign, execute the pop action. This action deactivates the current scanner by removing it from the top of the scanner stack and activates the scanner revealed beneath it.
- If you encounter any other character string, just skip it and try again.
This type of scanner is used to detect and remove multi-line comments from the source text.
RULE S-2 - List of Regular Expressions
RULE S-2 defines the list of regular expressions and associated actions. Here is an example of such a list:
\/\*
    push commentScanner
{number}
    terminal NUMBER
\+
    terminal add
\-
    terminal sub
\*
    terminal mul
\/
    terminal div
\(
    terminal lb
\)
    terminal rb
[ \t]+
    skip
[\n\r]
    skip
.
    skip
RULE S-3 - Regular Expression Usage
RULE S-3 defines a single regular expression together with the actions that are executed when a piece of source text matches it:
- The regular expression is given first.
- It is followed by a list of actions, which may be empty.
Example: return the terminal symbol NUMBER when the text matches the regular expression {number}, which is defined in the referenced declaration part.
{number}
    terminal NUMBER
Regular expressions must always start at the beginning of the line; otherwise the anglr compiler does not recognize them. Conversely, actions must be indented by at least one space character from the beginning of the line; otherwise they are treated as regular expressions.
Examples of ill-formed usages:
    {number}
    terminal NUMBER
In the example above the regular expression is indented, so it is treated as an unknown action statement.
{number}
terminal NUMBER
In the example above all text matching the first regular expression {number} is always skipped, since the expression has no explicitly stated actions. The terminal action terminal NUMBER, in turn, is not indented, so it is treated as a regular expression and its matches are skipped as well.
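The layout rule can be sketched in Python as follows. The `parse_rules` function and its error messages are hypothetical and do not describe the anglr compiler's actual implementation; the sketch only assumes what the text states, namely that a line starting in the first column is a regular expression and an indented line is an action.

```python
ACTIONS = {"skip", "terminal", "event", "push", "pop"}

def parse_rules(body):
    """Group the lines of a scanner body into (regular expression, [actions]) pairs."""
    rules = []
    for line in body.splitlines():
        if not line.strip():
            continue  # blank lines carry no information in this sketch
        if line.startswith(" "):
            # Indented line: must be an action of the preceding regular expression.
            words = line.split()
            if words[0] not in ACTIONS:
                raise ValueError(f"unknown action: {line.strip()}")
            if not rules:
                raise ValueError(f"action without regular expression: {line.strip()}")
            rules[-1][1].append(line.strip())
        else:
            # Line starting in the first column: a new regular expression.
            rules.append((line.rstrip(), []))
    return rules
```

With this sketch, a well-formed usage yields one rule with one action, an unindented action line becomes a second regular expression with an empty action list, and an indented regular expression is rejected as an unknown action.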
RULE S-4 - List of Actions
RULE S-4 defines a list of actions. All examples of action lists in this part contain only one action, but the definition of the Anglr language itself contains action lists with more than one action. For example:
{identifier}
    push scanner_part_ctx
    terminal <identifier>
RULE S-5 - Kind of Action
RULE S-5 defines the action types:
- skip action
- terminal action
- event action
- push action
- pop action
Skip, terminal and event actions terminate the execution of an action list. Any actions listed after them are ignored. Every action list should end with one of these actions. If it does not, a skip action is implied. In the example above where commentScanner is defined, the first action list ends with a pop action. In this case a skip action is silently executed immediately after the pop action. The usage
[\*]+\/
    pop
is terminated by an implicit skip action and should be written this way:
[\*]+\/
    pop
    skip
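The termination rule can be sketched in Python. The `run_actions` function is hypothetical, represents the scanner stack as a plain list of scanner names, and leaves out the event action for brevity; it only illustrates that terminating actions end the list and that a missing terminating action behaves like skip.

```python
def run_actions(actions, scanner_stack):
    """Execute an action list; return a terminal name, or None when the text is skipped."""
    for action in actions:
        name, _, arg = action.partition(" ")
        if name == "push":
            scanner_stack.append(arg)
        elif name == "pop":
            scanner_stack.pop()
        elif name == "terminal":
            return arg       # terminating action: later actions are ignored
        elif name == "skip":
            return None      # terminating action
        else:
            raise ValueError("unsupported action in this sketch: " + action)
    return None              # no terminating action given: skip is implied
```

In this sketch `["pop"]` and `["pop", "skip"]` have exactly the same effect, which is the equivalence described above.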
RULE S-6 - Skip Action
RULE S-6 defines the skip action. This action discards the character sequence that matched the regular expression associated with the action list containing the skip action. There are, however, mechanisms that make it possible to detect which part of the text has been discarded. After the skip action is executed, the scanner does not return to the calling environment (typically the syntax analyzer) but repeats the text scan.
The skip action plays a role similar to the continue statement in modern programming languages.
Example of a skip action:
[\n\r]
    skip
The action above discards all newline characters, probably because they are not important for the syntax analysis of the given text.
RULE S-7 - Terminal Action
RULE S-7 defines the terminal action. This action establishes a mapping between the terminal symbol returned by the action and the text that matches the associated regular expression. After this action has finished executing, we know the following:
- the content of the terminal: the character string that matches the regular expression
- the meaning of that text: the terminal symbol.
The action is composed as follows:
- First comes the word terminal.
- It is followed by the name of the terminal. This name must be defined in one of the declaration parts. The terminal action is a final action, so no other actions should be listed after it.
The terminal action plays a role similar to the return statement in modern programming languages. For example:
{number}
    terminal NUMBER
will be compiled to:
return NUMBER;
RULE S-8 - Event Action
RULE S-8 defines the event action. This action is used in more complex cases, when the other actions cannot handle the text that matches the regular expression. It defines an event that helps us deal with that text. The function that handles the event must be implemented in an application library. The action is composed as follows:
- First comes the word event.
- It is followed by the name of the event. Mentioning the name here defines a new event, so it does not have to be defined anywhere else.
The event action can be compared to a computed case statement. If the computation returns a value less than or equal to zero, a continue statement is executed; otherwise the computed code is returned. Example of a function that handles an event action:
private int AsnRegex__identifier__Event (AsnLexer_AsnRegex regex, AsnLexer scanner)
{
    int t;
    // Look the matched text up among the keywords; a negative result means it is not a keyword.
    if ((t = AsnKeywordDB.FindKeyword (regex.text)) < 0)
        return AsnDeclarations.tokens.identifier;
    return t;
}
In the routine above, if the text matched by the regular expression turns out to be a keyword, the specific keyword code is returned; otherwise the identifier code is returned. This code (identifier or specific keyword) is then returned to the syntax analyzer. The event action is similar to returning the result of a function call. For the event above:
return AsnRegex__identifier__Event (regex, scanner);
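The "computed case" behaviour can be sketched in Python. The token codes, the keyword table and both function names are hypothetical; real codes come from the declaration part, and the real handler signature is the one generated by the anglr compiler. The sketch only assumes the rule stated in the text: a result less than or equal to zero means "continue scanning", any other result is returned as the terminal code.

```python
# Hypothetical token codes and keyword table, for illustration only.
IDENTIFIER = 1
KEYWORDS = {"BEGIN": 2, "END": 3}

def identifier_event(text):
    """Analogue of the handler above: keyword code if known, else identifier code."""
    return KEYWORDS.get(text, IDENTIFIER)

def dispatch_event(handler, text):
    """Return the handler's code, or None (keep scanning) when the code is <= 0."""
    code = handler(text)
    return code if code > 0 else None
```

A handler can therefore both classify text (keyword or identifier) and suppress it entirely by returning zero or a negative value.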
RULE S-9 - Push Action
RULE S-9 defines the push action. This action activates a new scanner. Each lexical analyzer contains a stack of scanners, and the active scanner sits at the top of the stack. The push action puts a new scanner on top of the stack. The push action is composed as follows:
- First comes the word push.
- It is followed by the name of a scanner part defined elsewhere in the anglr file.
Example: one of the action lists of mathScanner from the example above executes this push action:
\/\*
    push commentScanner
It discards the first two characters /* at the beginning of a multi-line comment and activates the scanner commentScanner, which reads the remaining characters of the comment.
RULE S-10 - Pop Action
RULE S-10 defines the pop action. This action removes the current scanner from the top of the scanner stack. The scanner located just below the top becomes the active scanner.
Example: one of the action lists of commentScanner from the example above executes this pop action:
[\*]+\/
    pop
The scanner commentScanner executes this action every time it encounters the end of a multi-line comment, removing itself from the scanner stack and revealing the scanner immediately below it, mathScanner in this case.
Push and pop actions must be balanced. At no time should more pop actions than push actions have been executed.
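How the two example scanners cooperate through the scanner stack can be simulated with a short Python sketch. It is an illustration, not the generated lexer: the `SCANNERS` table and `tokenize` function are hypothetical, {number} is assumed to be `[0-9]+`, only some mathScanner rules are included, and "best match" is assumed to mean the longest match with the earliest rule winning ties.

```python
import re

# Hypothetical rule tables modelled on the two example scanners.
SCANNERS = {
    "mathScanner": [
        (r"/\*", ["push commentScanner"]),
        (r"[0-9]+", ["terminal NUMBER"]),
        (r"\+", ["terminal add"]),
        (r"\*", ["terminal mul"]),
        (r"/", ["terminal div"]),
        (r"[ \t]+", ["skip"]),
        (r"[\n\r]", ["skip"]),
        (r".", ["skip"]),
    ],
    "commentScanner": [
        (r"[*]+/", ["pop"]),
        (r"[\n\r]", ["skip"]),
        (r"[^*]+", ["skip"]),
        (r"[*]+", ["skip"]),
    ],
}

def tokenize(text, start="mathScanner"):
    """Return (list of terminal names, final scanner stack)."""
    stack, tokens, pos = [start], [], 0
    while pos < len(text):
        best_end, best_actions = pos, None
        for pattern, actions in SCANNERS[stack[-1]]:   # only the top scanner is active
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:               # assumed longest-match policy
                best_end, best_actions = m.end(), actions
        if best_actions is None:
            raise ValueError(f"no rule matches at position {pos}")
        for action in best_actions:
            name, _, arg = action.partition(" ")
            if name == "push":
                stack.append(arg)
            elif name == "pop":
                stack.pop()
            elif name == "terminal":
                tokens.append(arg)
                break
            elif name == "skip":
                break
        pos = best_end
    return tokens, stack
```

For a well-formed input the stack ends with only the starting scanner on it. An unterminated comment leaves commentScanner on the stack, which is one way unbalanced push and pop actions become visible.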
Attributes
These attributes are mandatory for the scanner part:
Attribute | Value Name | Value | Description |
---|---|---|---|
Declarations | Id | part name | Name of a declaration part within the anglr file. Regular expressions and terminal symbols referenced in the scanner part should be defined in this declaration part. |
CompilationInfo | ClassName | class name | Name of the class generated by the anglr compiler that contains the implementation of the scanner. |
CompilationInfo | NameSpace | namespace name | Namespace used by the anglr compiler when generating the scanner class. |
CompilationInfo | Access | class access | One of the keywords public, private or internal. |
CompilationInfo | CodeDir | directory path | Directory in which the generated scanner class is saved. |