Scanner Part of Anglr File

Introduction

The scanner is the part of the lexical analyzer that extracts terminal symbols from the source text. It uses a set of regular expressions and extracts from the source file the sequences of characters that best match one of them. Each regular expression can be associated with a terminal symbol, which creates a mapping between sequences of characters in the source file and terminal symbols. The names of the terminal symbols and the values of the regular expressions required by the scanner are found in the declaration parts of the anglr file. The link between a declaration part and the scanner is described by the attributes that precede the part in which the scanner is defined.
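As an illustration of how regular expressions map matched text to terminal symbols, here is a minimal Python sketch. It assumes that the best match is the longest one and that the earlier rule wins a tie; the text above does not specify either point. The rule table and the function name best_match are hypothetical and are not part of Anglr.

```python
import re

# Hypothetical rule table: (regular expression, terminal symbol name).
RULES = [
    (r"[0-9]+", "NUMBER"),
    (r"\+", "add"),
    (r"\*", "mul"),
]

def best_match(text, pos):
    """Return (terminal, matched text) for the best match at pos, or None.

    Assumption: "best" means longest, and the earlier rule wins a tie.
    """
    best = None
    for pattern, terminal in RULES:
        m = re.compile(pattern).match(text, pos)
        # Ignore empty matches; replace the candidate only if strictly longer.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (terminal, m.group())
    return best
```

Calling best_match("12+3", 0) pairs the text 12 with the terminal NUMBER, which is the mapping between character sequences and terminal symbols described above.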

Typically, several scanners are defined in one anglr file, and they share the work of lexical analysis. For example, one scanner specializes in detecting content, and another specializes in text that does not matter to the content itself, such as comments.

Syntax

The following is a list of syntax rules that define the contents of the scanner part.

RULE S-1

<scanner part>
    : <attribute list> ? '%scanner' <identifier> '%{' <regular expression list> ? '%}'
    ;

RULE S-2

<regular expression list>
    : <regular expression usage> +
    ;

RULE S-3

<regular expression usage>
    : <regular expression> <actions> ?
    ;

RULE S-4

<actions>
    : <action> +
    ;

RULE S-5

<action>
    : <skip action>
    | <terminal action>
    | <event action>
    | <push action>
    | <pop action>
    ;

RULE S-6

<skip action>
    : 'skip'
    ;

RULE S-7

<terminal action>
    : 'terminal' <identifier>
    ;

RULE S-8

<event action>
    : 'event' <identifier>
    ;

RULE S-9

<push action>
    : 'push' <identifier>
    ;

RULE S-10

<pop action>
    : 'pop'
    ;

Here is an example of two scanners:

[ Description Text='definition of scanner, which extracts comments from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public']
%scanner commentScanner
%{
[\*]+\/
    pop
[\n\r]
    skip
[^\*]+
    skip
[\*]+
    skip
%}

[ Description Text='definition of scanner, which extracts terminal symbols from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='MathRegex' NameSpace='Math.ScannerLib' Access='public']
%scanner mathScanner
%{
\/\*
    push commentScanner
{number}
    terminal NUMBER
\+
    terminal add
\-
    terminal sub
\*
    terminal mul
\/
    terminal div
\(
    terminal lb
\)
    terminal rb
[ \t]+
    skip
[\n\r]
    skip
.
    skip
%}

In the example above, two scanners are defined. The first scanner removes comments from the source text. The second scanner extracts the "useful" content, which is sent to the syntax analyzer. Looking closely at the examples, we can see that they are essentially case statements. In fact, they become case statements, because the Anglr compiler translates them into case statements of the selected programming language.

Discussion

RULE S-1 - Structure of Scanner Part

RULE S-1 defines the top-level structure of the scanner part:

The scanner part is preceded by a possibly empty attribute list.
The attribute list is followed by the reserved word %scanner and an identifier that names the scanner.
Then a list of regular expressions and associated actions follows, enclosed between the part delimiters %{ and %}.

In the example above, the first scanner part is preceded by this attribute list:

[ Description Text='definition of scanner, which extracts comments from input string']
[ Declarations Id='mathDecls' ]
[ CompilationInfo ClassName='CommentRegex' NameSpace='Math.ScannerLib' Access='public']

Its name is:

%scanner commentScanner

Then a list of regular expressions and associated actions follows:

%{
[\*]+\/
    pop
[\n\r]
    skip
[^\*]+
    skip
[\*]+
    skip
%}

We can read the list above as a case statement:

If you encounter a non-empty sequence of asterisks that ends with the '/' character, execute the pop action. This action deactivates the current scanner by removing it from the top of the scanner stack, and activates the scanner that is revealed beneath it.
If you encounter any other character string, skip it and try again.

This type of scanner is used to detect and extract multi-line comments from the source text.

RULE S-2 - List of Regular Expressions

RULE S-2 defines the list of regular expressions and associated actions. Here is an example of such a list:

\/\*
    push commentScanner
{number}
    terminal NUMBER
\+
    terminal add
\-
    terminal sub
\*
    terminal mul
\/
    terminal div
\(
    terminal lb
\)
    terminal rb
[ \t]+
    skip
[\n\r]
    skip
.
    skip

RULE S-3 - Regular Expression Usage

RULE S-3 defines a single regular expression together with the actions that are executed when a piece of source text matches that regular expression:

The regular expression is given first.
It is followed by a list of actions, which may be empty.

Example: return the terminal symbol NUMBER when the text matches the regular expression {number}, which is defined in the declaration part.

{number}
    terminal NUMBER

Regular expressions must always start at the beginning of the line; otherwise the anglr compiler does not recognize them. Conversely, actions must be indented by at least one space character from the beginning of the line; otherwise they are treated as regular expressions.

Examples of ill-formed usages:

    {number}
    terminal NUMBER

In the example above, the indented regular expression is treated as an unknown action statement.

{number}
terminal NUMBER

In the example above, all text matching the first regular expression {number} is always skipped, because it has no explicitly stated actions. In addition, the terminal action terminal NUMBER is treated as a regular expression, and text matching it is skipped as well.
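The column rule can be illustrated with a small Python sketch that sorts the lines of a scanner body into regular expressions and actions. It is not the anglr compiler's parser. The function name parse_scanner_body is hypothetical, and three details are assumptions: a leading tab counts as indentation, blank lines are ignored, and an action that precedes every regular expression is rejected.

```python
def parse_scanner_body(lines):
    """Group lines into (regular expression, [actions]) pairs.

    A line that starts in the first column is a regular expression;
    an indented line is an action of the most recent regular expression.
    """
    rules = []
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line[0] in " \t":
            if not rules:
                # assumption: an action needs a preceding regular expression
                raise ValueError("action without regular expression: " + line.strip())
            rules[-1][1].append(line.strip())
        else:
            rules.append((line.rstrip(), []))
    return rules
```

With the second ill-formed example as input, the sketch yields two regular expressions with empty action lists, which corresponds to the behavior described above.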

RULE S-4 - List of Actions

RULE S-4 defines a list of actions. All examples of action lists in this part have only one action, but the definition of the Anglr language itself contains action lists with several actions. For example:

{identifier}
    push scanner_part_ctx
    terminal <identifier>

RULE S-5 - Kind of Action

RULE S-5 defines the kinds of actions:

skip action
terminal action
event action
push action
pop action

Skip, terminal and event actions terminate the execution of an action list. Any actions listed after them are ignored. Every action list should end with one of these actions. If it does not, a skip action is implied. In the example above where commentScanner is defined, the first action list ends with a pop action. In this case, a skip action is silently executed immediately after the pop action:

[\*]+\/
    pop

This action list is terminated by an implicit skip action and should be written this way:

[\*]+\/
    pop
    skip
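The termination rule for action lists can be sketched in Python as follows. The function name effective_actions is hypothetical, and actions are represented as plain strings purely for illustration.

```python
# Actions that end an action list, as stated above.
TERMINATING = {"skip", "terminal", "event"}

def effective_actions(actions):
    """Return the actions that actually run for one regular expression."""
    result = []
    for action in actions:
        result.append(action)
        if action.split()[0] in TERMINATING:
            return result  # anything listed after this action is ignored
    result.append("skip")  # no terminating action was found: skip is implied
    return result
```

For the list containing only pop, the sketch returns pop followed by skip, which is the explicit form shown above.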

RULE S-6 - Skip Action

RULE S-6 defines the skip action. This action discards the character sequence that matched the regular expression whose action list contains the skip action. There are, however, mechanisms that make it possible to detect which part of the text has been discarded. After the skip action is executed, the scanner does not return to the calling environment (typically the syntax analyzer) but repeats the text scan.

The skip action plays a role similar to the continue statement in modern programming languages.

Example of skip action:

[\n\r]
    skip

The action above discards all newline characters, probably because they are not important for the syntax analysis of the given text.

RULE S-7 - Terminal Action

RULE S-7 defines the terminal action. This action establishes a mapping between the terminal symbol returned by the action and the text that matches the associated regular expression. After this action has executed, we know the following:

The content of the terminal: the character string that matches the regular expression.
The meaning of that text: the terminal symbol.

The action is composed as follows:

First comes the word terminal.
It is followed by the name of the terminal. This name must be defined in one of the declaration parts. This action is a final action, so no other actions should be listed after it.

The terminal action has a role similar to the return statement in modern programming languages. For example:

{number}
    terminal NUMBER

will be compiled to:

    return NUMBER;

RULE S-8 - Event Action

RULE S-8 defines the event action. This action is used in more complex cases, when other actions cannot handle the text that matches the regular expression. It defines an event that helps us deal with the matched text. The function that handles this event must be implemented in an application library. The action is composed as follows:

First comes the word event.
It is followed by the name of the event. Mentioning the name here defines a new event, so it does not have to be defined anywhere else.

The event action can be compared to a computed case statement. If the computation returns a value less than or equal to zero, a continue statement is executed; otherwise the computed code is returned. Example of a function that handles an event action:

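    private int AsnRegex__identifier__Event (AsnLexer_AsnRegex regex, AsnLexer scanner)
    {
        // Look up the matched text in the keyword table.
        int t = AsnKeywordDB.FindKeyword (regex.text);
        // A negative result means the text is not a keyword: report an identifier.
        if (t < 0)
            return AsnDeclarations.tokens.identifier;
        // Otherwise return the code of the keyword that was found.
        return t;
    }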

In the routine above, if the text matched by the regular expression is a keyword, the specific keyword code is returned; otherwise the identifier code is returned. This code (identifier or specific keyword) is then returned to the syntax analyzer. The event action is similar to returning the result of a function call. For the event above:

    return AsnRegex__identifier__Event (regex, scanner);
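
The dispatch logic around an event handler might look like the following Python sketch. The terminal codes, the keyword table, and the names identifier_event and run_event are hypothetical; only the rule that a result less than or equal to zero continues scanning comes from the description above.

```python
# Hypothetical terminal codes and keyword table (not taken from Anglr).
IDENTIFIER = 1
KEYWORDS = {"BEGIN": 2, "END": 3}

def identifier_event(text):
    """Event handler: return a keyword code, or IDENTIFIER for other text."""
    return KEYWORDS.get(text, IDENTIFIER)

def run_event(handler, text):
    """Return the terminal code, or None when the scanner should keep scanning."""
    code = handler(text)
    if code <= 0:
        return None  # acts like the continue statement: the text is discarded
    return code      # acts like a return statement: the code goes to the parser
```

A handler that returns zero or a negative value therefore makes the event behave like a skip action, while a positive value makes it behave like a terminal action.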

RULE S-9 - Push Action

RULE S-9 defines the push action. This action activates a new scanner. Each lexical analyzer contains a stack of scanners, and the active scanner sits at the top of the stack. The push action puts a new scanner on top of the stack. The push action is composed as follows:

First comes the word push.
It is followed by the name of a scanner part defined elsewhere in the anglr file.

Example: one of the actions of mathScanner in the example above executes this push action:

\/\*
    push commentScanner

It discards the first two characters /* at the beginning of a multi-line comment and activates the scanner commentScanner, which reads the remaining characters of the comment.

RULE S-10 - Pop Action

RULE S-10 defines the pop action. This action removes the current scanner from the top of the scanner stack. The scanner located just below the top becomes the active scanner.

Example: one of the actions of commentScanner from example above executes this pop action:

[\*]+\/
    pop

The scanner commentScanner executes this action every time it encounters the end of a multi-line comment, removing itself from the scanner stack and revealing the scanner immediately below it, mathScanner in this case.

Push and pop actions must be balanced. At no time should there be more pop actions than push actions.
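
To show how push and pop cooperate, here is a minimal Python sketch that simulates the two example scanners with a scanner stack. It is an illustration under assumptions, not the code that the anglr compiler generates: {number} is assumed to mean [0-9]+, the longest match is assumed to win with ties going to the earlier rule, and the function name tokenize is hypothetical.

```python
import re

# Ordered (pattern, actions) rules that mirror the two example scanners.
# Assumption: {number} stands for [0-9]+.
SCANNERS = {
    "mathScanner": [
        (r"/\*", ["push commentScanner"]),
        (r"[0-9]+", ["terminal NUMBER"]),
        (r"\+", ["terminal add"]),
        (r"-", ["terminal sub"]),
        (r"\*", ["terminal mul"]),
        (r"/", ["terminal div"]),
        (r"\(", ["terminal lb"]),
        (r"\)", ["terminal rb"]),
        (r"[ \t]+", ["skip"]),
        (r"[\n\r]", ["skip"]),
        (r".", ["skip"]),
    ],
    "commentScanner": [
        (r"\*+/", ["pop"]),
        (r"[\n\r]", ["skip"]),
        (r"[^*]+", ["skip"]),
        (r"\*+", ["skip"]),
    ],
}

def tokenize(text, start="mathScanner"):
    """Return the (terminal, matched text) pairs found in text."""
    stack = [start]  # the active scanner is the last element
    tokens = []
    pos = 0
    while pos < len(text):
        # Assumption: longest match wins, the earlier rule wins a tie.
        best = None
        for pattern, actions in SCANNERS[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, actions)
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        m, actions = best
        pos = m.end()
        for action in actions:
            word, _, arg = action.partition(" ")
            if word == "push":
                stack.append(arg)  # arg becomes the active scanner
            elif word == "pop":
                if len(stack) == 1:
                    raise ValueError("pop without a matching push")
                stack.pop()  # the scanner below becomes active again
            elif word == "terminal":
                tokens.append((arg, m.group()))
                break
            elif word == "skip":
                break
        # A list that ends without terminal or skip discards the text,
        # which corresponds to the implicit skip.
    return tokens
```

For the input 1 + /* note */ 2, the sketch returns the terminals NUMBER, add and NUMBER; the comment is consumed by commentScanner and produces no terminals. Starting directly in commentScanner and reading */ raises an error, because the pop has no matching push.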

Attributes

These attributes are mandatory for the scanner part:

Attribute	Value Name	Value	Description
Declarations	Id	part name	The Declarations Id is the name of a declaration part within the anglr file. Regular expressions and terminal symbols referenced in the scanner part should be defined in this declaration part.
CompilationInfo	ClassName	class name	Name of the class generated by the anglr compiler that contains the implementation of the scanner
	NameSpace	namespace name	Namespace used by the anglr compiler when generating the scanner class
	Access	class access	Class access; should be one of these keywords: public, private or internal
	CodeDir	directory path	Directory path where the generated scanner class is saved