Lexical Scanner Source Code


Introduction

Typically, multiple scanners are defined in an Anglr file, each specializing in reading a specific part of the source file. The most common example is a scanner that extracts comments: it is activated at the beginning of each comment and deactivated automatically once the comment has been read. Scanners cannot be used on their own; they are always driven by a lexical analyzer. Each lexical analyzer uses at least one scanner, and usually more.

Because of implementation specifics, each scanner is implemented together with the lexical analyzer that uses it. When generating source code for a scanner, the Anglr compiler uses the information in the Declarations and CompilationInfo attributes. Since a scanner always depends on a lexical analyzer, the name of its implementing class depends on the name of the class that implements the lexical analyzer.

Declarations Attribute

This attribute identifies the declaration part of the Anglr file that defines the terminal symbols and regular expressions used by the scanner. A scanner can use only one declaration part. The attribute is not mandatory, because some scanners do not need to know any terminal symbols or regular expressions. A typical example is a comment scanner, which only needs to know which combination of characters signals the end of a comment.

The Declarations attribute accepts the following setting:

Name | Value | Description
Id | name of declaration part | Must be one of the names appearing in the %declaration statements of the Anglr file. All terminal symbol names and all regular expression values used by the scanner are taken from that declaration part.

CompilationInfo Attribute

The CompilationInfo attribute is mandatory for the scanner part of an Anglr file. It specifies the class name of the scanner implementation, its namespace and its access mode.

The CompilationInfo attribute preceding the scanner part of an Anglr file accepts the following settings:

Name | Value | Description
ClassName | name of generated class | The name of the class which implements a specific scanner is composed of two parts separated by an underscore character: the first is the name of the class implementing the lexical analyzer which uses this scanner, and the second is the value of the ClassName setting. If this setting is not found in the CompilationInfo attribute, the scanner part name is used instead.
NameSpace | namespace of generated class | The Anglr compiler uses this name as the namespace containing the generated class. If this setting is not found in the CompilationInfo attribute, the NameSpace setting of the CompilationInfo attribute of the general part of the Anglr file is used, if there is one. Otherwise the name of the Anglr file is taken as the namespace name.
Access | access of generated class | One of three values: internal, public or private. internal and public are preferred, since private access makes the generated class inaccessible.
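
The fallback rules for ClassName and NameSpace can be restated as a small sketch. ScannerNaming and its parameter names are hypothetical and are not part of Anglr; the code only mirrors the rules described in the table, with null standing for a setting that is absent.

```csharp
using System;

// Hypothetical helper, not part of Anglr: it only restates the naming rules from the table.
public static class ScannerNaming
{
    // lexerClass: name of the class implementing the lexical analyzer that uses the scanner.
    // className: value of the ClassName setting, or null when the setting is absent.
    // scannerPartName: name of the scanner part, used as the fallback.
    public static string ClassName (string lexerClass, string className, string scannerPartName)
    {
        return lexerClass + "_" + (className ?? scannerPartName);
    }

    // Fallback chain: NameSpace of the scanner part, then NameSpace of the general part,
    // then the name of the Anglr file.
    public static string NameSpace (string scannerNs, string generalNs, string anglrFileName)
    {
        return scannerNs ?? generalNs ?? anglrFileName;
    }
}
```

With the settings used in the example later in this section, ClassName ("CsharpLexer", "CommentRegex", "commentScanner") yields CsharpLexer_CommentRegex.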

Generated Code

Structure of the generated class which implements specific scanner is relatively simple and has the following characteristics:

  • it implements the interface RegexInterface, which requires the implementation of the method match()
  • it is a subclass of System.Text.RegularExpressions.Regex. This means that a scanner is in fact a kind of regular expression object
  • it has a public constructor, which receives a reference to the lexical analyzer which constructed it and initializes the underlying regular expression
  • it has a Scanner property, which references the lexical analyzer passed to the constructor of the scanner object
  • it has a text property, containing the last piece of text matched by the method match()
  • it defines a delegate (callback function prototype) for the events triggered by event actions (if there are any) introduced in the scanner part of the Anglr file
  • and finally, for every event action introduced in the scanner part of the Anglr file, it defines one event using the delegate mentioned above

Let's take a look at an example:

[ CompilationInfo ClassName='CommentRegex' NameSpace='Csharp.RegexLib' Access='public' Hover='true' ]
%scanner commentScanner
%{
\*+/
    pop
{delimited-comment-section}
    skip
{new-line}
    skip
%}
            
The example above is an excerpt from an Anglr file. It defines a scanner that extracts multiline comments from C# source files. The Anglr compiler generated the following source code for it:
internal class CsharpLexer_CommentRegex : Regex, RegexInterface
{
    public CsharpLexer_CommentRegex (CsharpLexer scanner) : base (@"(?<g1>^\*+/)|(?<g2>^\/|(\*+)?([^\/\*]))|(?<g3>^\u000D|\u000A|\u000D\u000A|\u0085|\u2028|\u2029)", RegexOptions.ExplicitCapture)
    {
        Scanner = scanner;
    }

    public CsharpLexer Scanner { get; private set; }

    public delegate int scannerCallback (CsharpLexer_CommentRegex regex, CsharpLexer scanner);

    public string text { get; private set; }

    public (int, int) match (string currentLine)
    {
        int matchIndex = 0;
        int matchLength = 0;
        try
        {
            text = "";
            Match match = Match (currentLine);
            if (!match.Success)
                return (-1, 0);
            int index = 0;
            foreach (Group group in match.Groups)
            {
                if (index++ == 0)
                    continue;
                if (!group.Success)
                    continue;
                try
                {
                    matchLength = match.Value.Length;
                    matchIndex = index - 1;
                    text = currentLine.Substring (0, matchLength);
                }
                catch (Exception)
                {
                    continue;
                }
                break;
            }
        }
        catch (Exception e)
        {
            return (-2, 0);
        }

        int? result = null;
        switch (matchIndex)
        {
        case 1:
            Scanner.popScanner ();
            result = 0;
            break;
        case 2:
            result = 0;
            break;
        case 3:
            result = 0;
            break;
        }
        return (result != null) ? (result.Value, matchLength) : (0, matchLength);
    }
}
            
The generated code is straightforward:
  • The class implementing the scanner is named CsharpLexer_CommentRegex. The name is composed of two parts separated by an underscore character. The first part, CsharpLexer, is the name of the class implementing the lexical analyzer which created this scanner. The second part, CommentRegex, comes from the CompilationInfo attribute preceding the definition of the scanner part associated with this scanner implementation:
    internal class CsharpLexer_CommentRegex : Regex, RegexInterface
                        
  • the constructor of the scanner does these things:
    • receives the reference to the lexical analyzer which created the scanner
    • initializes the property Scanner with that reference
    • passes the combination of all regular expressions introduced in the scanner part to the constructor of the Regex base class
        public CsharpLexer_CommentRegex (CsharpLexer scanner) : base (@"(?<g1>^\*+/)|(?<g2>^\/|(\*+)?([^\/\*]))|(?<g3>^\u000D|\u000A|\u000D\u000A|\u0085|\u2028|\u2029)", RegexOptions.ExplicitCapture)
        {
            Scanner = scanner;
        }
                        
  • next comes the definition of lexical analyzer reference Scanner:
        public CsharpLexer Scanner { get; private set; }
                        
  • the lexical analyzer reference is followed by the definition of the delegate scannerCallback (callback function prototype) used in the definition of the events fired by this class. Since this scanner has no events, the delegate is unused here.
        public delegate int scannerCallback (CsharpLexer_CommentRegex regex, CsharpLexer scanner);
                        
  • next comes the definition of property text, which contains the last piece of text matched by function match()
        public string text { get; private set; }
                        
  • finally comes the method match(), required by the interface RegexInterface that the scanner implements. Its goal is to extract the piece of text that matches the regular expression passed to the constructor, and to decide what happens next: which terminal code, if any, is associated with the matched text, and which action the calling environment should perform. The body of this method is the same for all scanners except for the switch statement at its end, which reflects the contents of the scanner part. Every action list in the scanner part corresponds to one case of this switch statement. The number of cases equals the number of action lists, and the cases appear in the same order as the action lists: the first case corresponds to the first action list, the second case to the second action list, and so on. Seeing the action lists and the switch statement side by side makes this clearer. The action lists of the scanner:
    \*+/
        pop
    {delimited-comment-section}
        skip
    {new-line}
        skip
                        
    are mirrored by this switch statement in the generated code:
            switch (matchIndex)
            {
            case 1:
                Scanner.popScanner ();
                result = 0;
                break;
            case 2:
                result = 0;
                break;
            case 3:
                result = 0;
                break;
            }
                        
In general, cases are numbered from one to the number of action lists in the scanner part; here from one to three, since the scanner part contains three action lists. We can observe that skip actions are translated into the statement:
            result = 0;
            
as in cases 2 and 3, which correspond to the second and third action lists in the scanner part; each of them consists of a single action, namely the skip action. From case 1 we can observe that the pop action translates to
            Scanner.popScanner ();
            
and also that the pop action is silently followed by a skip action (the statement result = 0;). Remember that each action list must be terminated by a skip, terminal or event action; otherwise a skip action is silently inserted.
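
How the case number is derived from the combined regular expression can be tried in isolation. The sketch below is not generated by Anglr: GroupDispatchDemo and its two simplified alternatives are made up for illustration. It only reproduces the technique visible in the generated code, where each alternative is anchored, wrapped in its own named group, and the first successful group determines the index.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative stand-in, not Anglr output.
public static class GroupDispatchDemo
{
    // Two anchored alternatives, each in its own named group,
    // in the same style as the generated constructor argument.
    private static readonly Regex Combined = new Regex (
        @"(?<g1>^\*+/)|(?<g2>^[^*]+)", RegexOptions.ExplicitCapture);

    // Returns (index of the alternative that matched, length of the match),
    // or (0, 0) when nothing matches at the start of the input.
    public static (int, int) Dispatch (string input)
    {
        Match m = Combined.Match (input);
        if (!m.Success)
            return (0, 0);
        // Group 0 is the whole match; named groups follow in order of appearance.
        for (int i = 1; i < m.Groups.Count; i++)
        {
            if (m.Groups[i].Success)
                return (i, m.Value.Length);
        }
        return (0, 0);
    }
}
```

Because every alternative starts with ^, a match can only begin at the first character of the input, which is why the generated code can take the matched text with Substring (0, matchLength).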

Before we conclude the explanation, one detail that is easily overlooked deserves a mention. The variable result is defined just before the switch statement in the following way:

        int? result = null;
            
Before the switch statement executes, the variable result is null, and it stays null if the value of matchIndex does not match any case of the switch statement (one to three in the example above). In this case the value (0, matchLength) is returned to the calling environment, signaling that no text was matched in the current call of the method match(). This usually points to a design error in the scanner part: we probably did not foresee all the patterns that may appear in the text being scanned. Simply put, a regular expression and its associated action list are missing for the unhandled case.
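
The effect of the nullable variable can be checked with a reduced version of the tail of match(). ResultFallbackDemo is a hypothetical stand-in and the terminal code 42 is a placeholder chosen for illustration; only the switch and the final return statement follow the generated code.

```csharp
using System;

// Illustrative stand-in, not Anglr output.
public static class ResultFallbackDemo
{
    public static (int, int) Finish (int matchIndex, int matchLength)
    {
        int? result = null;
        switch (matchIndex)
        {
        case 1:
            result = 0;    // skip action
            break;
        case 2:
            result = 42;   // placeholder terminal code, for illustration only
            break;
        }
        // Same final statement as in the generated code: null falls back to 0.
        return (result != null) ? (result.Value, matchLength) : (0, matchLength);
    }
}
```

In this sketch an unhandled index produces the same first tuple element as a skip action, so the caller cannot tell the two situations apart from the return value alone.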

Scanner Events

Scanners can implement one more feature: events associated with the event actions that may appear in the action lists of the scanner parts of an Anglr file. Let's look at another scanner from the same Anglr file as in the previous example:

[ Declarations Id='csharpDecls' Hover='true' ]
[ CompilationInfo ClassName='ScannerRegex' NameSpace='Csharp.RegexLib' Access='public' Hover='true' ]
%scanner csharpScanner
%{
\/\/.*
    skip
\/\*
    push commentScanner
{integer-literal}
    terminal integer-literal
{real-literal}
    terminal real-literal
{character-literal}
    terminal character-literal
{string-literal}
    terminal string-literal
{identifier}
    event identifier
{cs-ops}
    event cs-ops
[ \t]
    skip
[\n\f\r\v]
    skip
.
    skip
%}
            
As we can see, the regular expressions {identifier} and {cs-ops} are handled by event actions named identifier and cs-ops, respectively. Do not confuse these event action names with the terminal symbol named identifier and the regular expression named cs-ops, which appear elsewhere in the Anglr file in a different context.

If we look at the source code generated by the Anglr compiler for the above scanner:

internal class CsharpLexer_ScannerRegex : Regex, RegexInterface
{
    public CsharpLexer_ScannerRegex (CsharpLexer scanner) : base (@"(?<g1>^\/\/.*)|(?<g2>^\/\*)|(?<g3>^((([0-9])+)(U|u|L|l|UL|Ul|uL|ul|LU|Lu|lU|lu)?)|(0(x|X)(([0-9A-Fa-f])+)(U|u|L|l|UL|Ul|uL|ul|LU|Lu|lU|lu)?))|(?<g4>^(([0-9])+)\.(([0-9])+)((e|E)(\+|\-)?(([0-9])+))?(F|f|D|d|M|m)?|\.(([0-9])+)((e|E)(\+|\-)?(([0-9])+))?(F|f|D|d|M|m)?|(([0-9])+)((e|E)(\+|\-)?(([0-9])+))(F|f|D|d|M|m)?|(([0-9])+)(F|f|D|d|M|m))|(?<g5>^'(([^'\\\u000D\u000A\u0085\u2028\u2029])|(\'|\""|\\|\0|\a|\b|\f|\n|\r|\t|\v)|(\\x([0-9A-Fa-f]){1,4})|(\\u([0-9A-Fa-f]){4}|\\u([0-9A-Fa-f]){8}))')|(?<g6>^(""((([^""\\\u000D\u000A\u0085\u2028\u2029])|(\'|\""|\\|\0|\a|\b|\f|\n|\r|\t|\v)|(\\x([0-9A-Fa-f]){1,4})|(\\u([0-9A-Fa-f]){4}|\\u([0-9A-Fa-f]){8}))+)?"")|(@""((([^""])|(""""))+)?""))|(?<g7>^(((\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl})|(_|\u005F))(((\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl})|(\p{Nd})|(\p{Pc})|(\p{Mn}|\p{Mc})|(\p{Cf}))+)?)|\@(((\p{ Lu}|\p{ Ll}|\p{ Lt}|\p{ Lm}|\p{ Lo}|\p{ Nl})| (_ | \u005F))(((\p{ Lu}|\p{ Ll}|\p{ Lt}|\p{ Lm}|\p{ Lo}|\p{ Nl})| (\p{ Nd})| (\p{ Pc})| (\p{ Mn}|\p{ Mc})| (\p{ Cf}))+)?))|(?<g8>^(\<\<\ =|\?\?|\:\:|\+\+|\-\-|\&\&|\|\||\-\>|\=\=|\!\=|\<\ =|\>\=|\+\=|\-\=|\*\=|\/\=|\%\=|\&\=|\|\=|\^\=|\<\<|\ =\>|\{|\}|\[|\]|\(|\)|\.|\,|\:|\;|\+|\-|\*|\/|\%|\&|\||\^|\!|\~|\=|\<|\>|\?))|(?<g9>^[ \t])|(?<g10>^[\n\f\r\v])|(?<g11>^.)", RegexOptions.ExplicitCapture)
    {
        Scanner = scanner;
    }

    public CsharpLexer Scanner { get; private set; }

    public delegate int scannerCallback (CsharpLexer_ScannerRegex regex, CsharpLexer scanner);

    public event scannerCallback identifier_Event;
    public event scannerCallback cs_ops_Event;

    public string text { get; private set; }

    public (int, int) match (string currentLine)
    {
        int matchIndex = 0;
        int matchLength = 0;
        try
        {
            text = "";
            Match match = Match (currentLine);
            if (!match.Success)
                return (-1, 0);
            int index = 0;
            foreach (Group group in match.Groups)
            {
                if (index++ == 0)
                    continue;
                if (!group.Success)
                    continue;
                try
                {
                    matchLength = match.Value.Length;
                    matchIndex = index - 1;
                    text = currentLine.Substring (0, matchLength);
                }
                catch (Exception)
                {
                    continue;
                }
                break;
            }
        }
        catch (Exception e)
        {
            return (-2, 0);
        }

        int? result = null;

        switch (matchIndex)
        {
        case 1:
            result = 0;
            break;
        case 2:
            Scanner.pushScanner (CsharpLexer.commentScanner);
            result = 0;
            break;
        case 3:
            result = CsharpDeclarations.tokens.token_integer_literal;
            break;
        case 4:
            result = CsharpDeclarations.tokens.token_real_literal;
            break;
        case 5:
            result = CsharpDeclarations.tokens.token_character_literal;
            break;
        case 6:
            result = CsharpDeclarations.tokens.token_string_literal;
            break;
        case 7:
            result = identifier_Event?.Invoke (this, Scanner);
            break;
        case 8:
            result = cs_ops_Event?.Invoke (this, Scanner);
            break;
        case 9:
            result = 0;
            break;
        case 10:
            result = 0;
            break;
        case 11:
            result = 0;
            break;

        }
        return (result != null) ? (result.Value, matchLength) : (0, matchLength);
    }
}
            
we can clearly see the definitions of two events immediately following the delegate used to define them:
    public delegate int scannerCallback (CsharpLexer_ScannerRegex regex, CsharpLexer scanner);

    public event scannerCallback identifier_Event;
    public event scannerCallback cs_ops_Event;
            
These events are invoked in cases 7 and 8 of the switch statement at the end of the method match(). For the method to work as expected, we must implement event handlers for these events; otherwise the events yield no result and the variable result stays null. Note that even then the text is still matched and skipped. This scanner is part of a larger project in which the event handlers for these events are defined in the following way:

        private int CsharpScanner_identifier_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
        {
            int t = CsharpScannerInternals.findReservedWord (regex.text);
            return (t < 0) ? CsharpDeclarations.tokens.token_identifier : t;
        }

        private int CsharpScanner_cs_ops_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
        {
            int t = CsharpScannerInternals.findOperator (regex.text);
            return (t < 0) ? 0 : t;
        }
            
Now, let's take a closer look at these event handlers:
  • both event handlers work on the text matched by the regular expression, available as the text property of the regex object, namely regex.text. The first one checks whether this text is a C# reserved word such as if, else or while:
                int t = CsharpScannerInternals.findReservedWord (regex.text);
                
    The second one checks whether this text is a C# operator such as + or -:
                int t = CsharpScannerInternals.findOperator (regex.text);
                
  • the two event handlers treat the result of the check differently
  • when the check succeeds, both return the code produced by the checking function, findReservedWord() or findOperator(): the terminal symbol code associated with the reserved word or the C# operator, respectively.
  • when the check fails, the handlers differ. The first one, CsharpScanner_identifier_Event, returns CsharpDeclarations.tokens.token_identifier, the code of the terminal symbol identifier, as if to say: this is not a reserved word, so it is an identifier.
  • the other one, CsharpScanner_cs_ops_Event, returns 0 when the check fails, as if to say: this is probably not an operator, so just skip it. This is probably an error, since the skipped text is not there for no reason. Besides returning 0, this event handler should also report an error in some way, but that is beside the point of this discussion.
These things are important to remember from the discussion of scanner events:
  • scanner events should be used to handle more advanced scenarios, as in the example above, where we cannot decide immediately what kind of text the scanner has matched. Nevertheless, even the simplest scenarios, such as skip actions or single terminal actions, can be handled by scanner events, for example:
            private int CsharpScanner_cs_ops_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
            {
                return 0;   // skip everything
            }
                
  • returning a value less than or equal to zero from the event handler associated with an event action mimics a skip action
  • returning a value greater than zero from the event handler associated with an event action mimics a terminal action. Care should be taken that the returned value really is an existing terminal symbol code. Since scanners are invoked by syntax analyzers, albeit indirectly through lexical analyzers, invalid terminal codes will most likely cause syntax errors.
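
The points above can be tried with a reduced stand-in for a generated scanner class. DemoScanner, EventDemo and the terminal code 7 are hypothetical and chosen for illustration; only the delegate-plus-event pattern and the null-conditional invocation follow the generated code shown earlier. How handlers are attached in a real Anglr project is not shown in this section, so the += subscription mentioned after the code is an assumption based on ordinary C# event usage.

```csharp
using System;

// Stand-in for a generated scanner class; only the event plumbing is reproduced.
public class DemoScanner
{
    public delegate int scannerCallback (DemoScanner regex);

    public event scannerCallback identifier_Event;

    public string text { get; set; } = "";

    // Mirrors "result = identifier_Event?.Invoke (...)" followed by the null fallback to 0.
    public int Fire ()
    {
        int? result = identifier_Event?.Invoke (this);
        return result ?? 0;
    }
}

public static class EventDemo
{
    public const int TokenIdentifier = 7;   // placeholder terminal code, for illustration only

    public static int OnIdentifier (DemoScanner regex)
    {
        // A positive value behaves like a terminal action, zero or less like a skip action.
        return regex.text.Length > 0 ? TokenIdentifier : 0;
    }
}
```

Without a subscriber, Fire () returns 0, which corresponds to the matched text being skipped. After scanner.identifier_Event += EventDemo.OnIdentifier; the handler decides between a terminal code and a skip.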