Lexical Analyzer Source Code


Introduction

Lexical parts of an Anglr file are used to generate the source code of lexical analyzers. The generated code depends on the settings found in the attributes UseScanner and CompilationInfo associated with the lexer part of the Anglr file.

UseScanner Attribute

This attribute is mandatory for the lexical part of an Anglr file. It references all scanners used to implement the lexical analyzer associated with this lexer part. It contains the following settings used by the Anglr compiler:

Name | Value | Description
ScannerId | id of scanner part | References the scanner part from which a scanner class is created; that class is used exclusively by the lexical analyzer referencing it. A single UseScanner attribute may contain many ScannerId settings, one for each scanner used by the lexical analyzer.
InitialScanner | id of scanner part | References the scanner part whose scanner is active when the lexical analyzer starts. Every scanner can then activate other scanners when needed, but only those referenced by this attribute. If this setting is missing from the UseScanner attribute, the first scanner referenced by a ScannerId setting is the one active at lexical analyzer startup.

CompilationInfo Attribute

The CompilationInfo attribute is mandatory for the lexical part of an Anglr file. It supplies the following information to the Anglr compiler:

Name | Value | Description
ClassName | name of generated class | The name the Anglr compiler uses for the class which contains the implementation of the lexical analyzer.
NameSpace | namespace of generated class | The name the Anglr compiler uses for the namespace containing the generated class. If this setting is missing from the CompilationInfo attribute, the NameSpace setting of the CompilationInfo attribute of the general part of the Anglr file is used instead, if there is one. Otherwise the name of the Anglr file is used as the namespace name.
Access | access of generated class | One of three values: internal, public or private. Of these, internal and public are preferred, since private access makes the generated class inaccessible.

Generated Code

The class implementing a lexical analyzer is essentially a list of objects implementing the scanners associated with the scanner parts listed in the UseScanner attribute of the corresponding lexer part. The primary goal of this class is to create these scanners. The actual lexical analysis of the text is performed by the class LexerBase, which the class implementing the lexical analyzer subclasses. Let's take a look at an example of a lexer part and the C# source code generated for it by the Anglr compiler. This time we take the lexical analyzer defined in the Anglr file which defines the Anglr language itself:

[ Description Text='Lexer for anglr file' Hover='true' ]
[
    UseScanner
        ScannerId='comment_ctx'
        ScannerId='attribute_ctx'
        ScannerId='scanner_id_ctx'
        ScannerId='scanner_part_ctx'
        ScannerId='regex_id_ctx'
        ScannerId='regex_part_ctx'
        ScannerId='regex_block_ctx'
        ScannerId='regex_block_part_ctx'
        InitialScanner='anglrScanner'
        Hover='true'
]
[ CompilationInfo ClassName='AnglrLexer' NameSpace='Anglr.Lexer' Access='public' Hover='true' ]
%lexer anglrLexer
%{
%}
            
From the attributes which belong to the lexer part we can see the following:
  • The lexical analyzer is composed of nine scanners
  • The name of the class implementing the lexical analyzer is AnglrLexer
  • The class implementing the lexical analyzer is a member of the namespace Anglr.Lexer
The generated code reflects the above observations:
namespace Anglr.Lexer
{
    public class AnglrLexer : LexerBase
    {
        internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
        internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
        internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
        internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
        internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
        internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
        internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
        internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
        internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }
        public AnglrLexer (TextReader textReader)
        {
            Init ();
            pushInput (textReader);
            pushScanner (anglrScanner);
        }

        public AnglrLexer (string [] lines)
        {
            Init ();
            pushInput (lines);
            pushScanner (anglrScanner);
        }

        public AnglrLexer (string line)
        {
            Init ();
            pushInput (line);
            pushScanner (anglrScanner);
        }

        public void Init ()
        {
            regarray = new RegexInterface []
            {
                new AnglrLexer_CommentRegex (this),
                new AnglrLexer_AttributeRegex (this),
                new AnglrLexer_ScannerIdRegex (this),
                new AnglrLexer_ScannerPartRegex (this),
                new AnglrLexer_RegexIdRegex (this),
                new AnglrLexer_RegexPartRegex (this),
                new AnglrLexer_RegexBlockRegex (this),
                new AnglrLexer_RegexBlockPartRegex (this),
                new AnglrLexer_AnglrRegex (this),
            };
        }

        // scanner codes
        public const int comment_ctx = 0;
        public const int attribute_ctx = 1;
        public const int scanner_id_ctx = 2;
        public const int scanner_part_ctx = 3;
        public const int regex_id_ctx = 4;
        public const int regex_part_ctx = 5;
        public const int regex_block_ctx = 6;
        public const int regex_block_part_ctx = 7;
        public const int anglrScanner = 8;
    }
}
            
The class implementing the lexical analyzer is structured as follows:
  • At the beginning of the class is a list of properties referencing the objects that represent the scanners composing the lexical analyzer, as in the generated code above:
            internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
            internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
            internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
            internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
            internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
            internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
            internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
            internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
            internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }
                        
    The names of the properties are the same as the class names given in the ClassName setting of the CompilationInfo attribute of the scanner part associated with the scanner object referenced by the property.
  • The list of scanner references is followed by three constructors of the class implementing the lexical analyzer, provided for different kinds of input:
    • the first one can read text file input
              public AnglrLexer (TextReader textReader)
              {
                  Init ();
                  pushInput (textReader);
                  pushScanner (anglrScanner);
              }
                                  
    • the second one can handle input from an array of strings
              public AnglrLexer (string [] lines)
              {
                  Init ();
                  pushInput (lines);
                  pushScanner (anglrScanner);
              }
                                  
    • the last one takes the single string for its input.
              public AnglrLexer (string line)
              {
                  Init ();
                  pushInput (line);
                  pushScanner (anglrScanner);
              }
                                  
    All of them have a similar structure. Each initializes the object by calling the method Init(), then registers the text source passed as its input parameter, and finally pushes onto the scanner stack the reference of the initial scanner, the one associated with the scanner part named in the InitialScanner setting of the lexer part associated with the lexical analyzer. The scanner stack now contains one scanner reference.
  • The Init() method creates all scanners used by the lexical analyzer. It does so by initializing the array regarray with references to the newly created scanner objects which constitute the lexical analyzer.
            public void Init ()
            {
                regarray = new RegexInterface []
                {
                    new AnglrLexer_CommentRegex (this),
                    new AnglrLexer_AttributeRegex (this),
                    new AnglrLexer_ScannerIdRegex (this),
                    new AnglrLexer_ScannerPartRegex (this),
                    new AnglrLexer_RegexIdRegex (this),
                    new AnglrLexer_RegexPartRegex (this),
                    new AnglrLexer_RegexBlockRegex (this),
                    new AnglrLexer_RegexBlockPartRegex (this),
                    new AnglrLexer_AnglrRegex (this),
                };
            }
                        
  • At the end of the class is a list of integer constants representing the index values which can be used to access scanner references in the array regarray created by the Init() method.
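The layout described above can be illustrated with a much reduced sketch. It is not the generated code: SketchScanner, SketchLexerBase and SketchLexer are hypothetical stand-ins, and the way the real LexerBase stores its scanner stack is an assumption made only for illustration. The sketch shows how the index constants, the array regarray and the scanner stack fit together.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a scanner object; the real scanner classes
// (for example AnglrLexer_CommentRegex) are generated by the Anglr compiler.
public sealed class SketchScanner
{
    public string Name { get; }
    public SketchScanner (string name) { Name = name; }
}

// Hypothetical stand-in for LexerBase, reduced to the scanner table and
// the scanner stack. The real class is assumed, not known, to work like this.
public class SketchLexerBase
{
    protected SketchScanner [] regarray = new SketchScanner [0];
    private readonly Stack<int> scannerStack = new Stack<int> ();

    public void pushScanner (int index) { scannerStack.Push (index); }
    public int StackDepth { get { return scannerStack.Count; } }
    public SketchScanner ActiveScanner { get { return regarray [scannerStack.Peek ()]; } }
}

// Mirrors the layout of a generated class: properties, constructor,
// Init() and index constants.
public class SketchLexer : SketchLexerBase
{
    public SketchScanner CommentScanner { get { return regarray [comment_ctx]; } }
    public SketchScanner MainScanner { get { return regarray [main_ctx]; } }

    public SketchLexer ()
    {
        Init ();
        pushScanner (main_ctx);   // the initial scanner becomes active
    }

    public void Init ()
    {
        // The order of the elements must match the index constants below.
        regarray = new SketchScanner []
        {
            new SketchScanner ("comment"),
            new SketchScanner ("main"),
        };
    }

    // scanner codes
    public const int comment_ctx = 0;
    public const int main_ctx = 1;
}
```

After construction, the scanner stack of a SketchLexer holds exactly one entry, and ActiveScanner resolves to the scanner stored at index main_ctx.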

Lexer superclass

Introduction

Class LexerBase is the superclass of all classes implementing lexical analyzers. It is the real workhorse of any lexical analyzer. It is repeatedly invoked by the syntax analyzer until it reaches the end of the source text, whether that is a text file, a string array or a single string. Basically, this class is the driver for the scanners that make up the lexical analyzer:

  • it uses scanners to gather flow of terminal symbols from input text
  • it contains scanner manipulation routines
  • it contains mechanisms with which an application can fine tune the scanning process
  • it also contains methods with which an application can manipulate the set of input source texts

This part of the page is worth reading to understand the advanced features of the lexical analyzer. These features can be used to fine-tune the lexical analysis process.

Properties and fields

Delegates

Class LexerBase defines several delegates which serve as prototypes for the events fired at different points during the lexical analysis of the input text. These delegates are defined as follows:

        public delegate int scannerEnterCallback ();
        public delegate void scannerLeaveCallback (int token);
        public delegate void scannerPushCallback (int oldCtx, int newCtx);
        public delegate void scannerPopCallback (int oldCtx, int newCtx);
        public delegate void scannerTokenCallback (int ctx, int token, string text);
            
Their meaning is described in the discussion of the events fired by the lexical analyzer.
scannerEnterCallback
Parameters: this delegate has no parameters
Return Value:
  • an integer value representing a valid terminal symbol code, one defined in the generated class associated with the declaration part of the Anglr file. This code is returned to the calling environment and the input text is not scanned. It is a way to "insert" a terminal symbol into the input text.
  • a negative or zero value, indicating that the result of the delegate function should be ignored.
Description: this callback is intended for scenarios where we want to insert a terminal symbol into the input. Only the terminal symbol code is inserted; no text is actually added to the input text.
scannerLeaveCallback
Parameters:
  • int token: an integer representing the terminal symbol code returned to the syntax analyzer
Return Value: this delegate has no return value
Description: this delegate is intended for scenarios where we want to detect the terminal symbol code retrieved by the lexical analyzer and sent to the syntax analyzer.
scannerPushCallback
Parameters:
  • int oldCtx: an integer representing the currently active scanner. Its reference is at the top of the scanner stack and will be covered by the reference of the scanner being pushed onto the stack. The value of oldCtx is the index of a scanner reference in the table of scanner references. Scanner index values are defined at the end of the class which implements the lexical analyzer.
  • int newCtx: an integer representing the scanner which will become active. It will be pushed onto the top of the scanner stack. It is an index into the table of scanner references.
Return Value: this delegate has no return value
Description: this delegate is intended for scenarios where we want to detect which scanner covers the previous one.
scannerPopCallback
Parameters:
  • int oldCtx: an integer representing the currently active scanner. Its reference will be removed from the top of the scanner stack, uncovering the reference of the scanner lying immediately below it. The value of oldCtx is the index of a scanner reference in the table of scanner references. Scanner index values are defined at the end of the class which implements the lexical analyzer.
  • int newCtx: an integer representing the scanner which will become active. It will be uncovered by the pop action. The value of newCtx is an index into the table of scanner references.
Return Value: this delegate has no return value
Description: this delegate is intended for scenarios where we want to detect which scanner becomes uncovered when the current one is popped from the scanner stack.
scannerTokenCallback
Parameters:
  • int ctx: an integer representing the index of the scanner reference within the table of scanner references
  • int token: the terminal symbol code of the terminal retrieved by the lexical analyzer
  • string text: the text of the terminal retrieved by the lexical analyzer.
Return Value: this delegate has no return value
Description: this delegate is intended for scenarios where we want to detect every token retrieved by the lexical analyzer. The terminal need not be returned to the syntax analyzer; it may be skipped. Thus, the delegate can be used to detect every terminal symbol, whether it is skipped or not.

Events

Immediately after the definitions of the delegates come the definitions of the events fired by the lexical analyzer. Every event is associated with one of the delegates discussed above. They are defined as follows:

        public event scannerEnterCallback scannerEnterEvent;
        public event scannerLeaveCallback scannerLeaveEvent;
        public event scannerPushCallback scannerPushEvent;
        public event scannerPopCallback scannerPopEvent;
        public event scannerTokenCallback scannerTokenEvent;
            
scannerEnterEvent
Event Use: this event is intended for more advanced scenarios where the syntax of some structured text contains terminal symbols which have no textual representation and must be generated by the application. Using these symbols, we artificially direct the course of syntax analysis of the structured text.
Event Source: the event is fired at the beginning of the scanning phase of the lexical analyzer. By invoking this event, the lexical analyzer gives a subscriber the ability to influence the scanning of the input text. The flow of the scanning process is influenced by the return value of the event handler:
  • positive values terminate the scanning process. These values are supposed to be valid terminal symbol codes which are synchronized with the syntax analyzer: they must be valid codes in the current state of the syntax analyzer's stack automaton. Invalid terminal symbol codes, and those not synchronized with the syntax analyzer, cause syntax errors.
  • other values, less than or equal to zero, do not affect the flow of the scanning process.
Event Subscriber: the subscriber to this event is usually an application that analyzes some structured text. For the analysis it uses a parser generated by the Anglr compiler. A typical example is the Anglr compiler itself. Every text analyzed by the Anglr compiler is in fact a fragment of an Anglr file. The parser of the Anglr compiler works like this:
  • first it reads the artificial terminal symbol which represents the type of fragment to be analyzed. This terminal symbol has no textual representation and is inserted by an event handler of scannerEnterEvent
  • after that the input text is analyzed, and it must represent the contents of the fragment. For example, if the artificial terminal symbol inserted by the event handler of scannerEnterEvent represents a single production of a syntax rule, then the input text must contain a single production of an arbitrary syntax rule. If the inserted terminal symbol represents a cardinality operator, the input text must contain an arbitrary cardinality operator. The whole Anglr file is also a fragment and has an artificial symbol representing it, namely the <anglr file terminal> terminal symbol.
Any application which analyzes structured text can act in this manner. This approach suits more advanced scenarios where, for example, we insert parts of structured text, or translate source code that first needs to be generated.
Event Parameters: this event has no parameters
Event Return Value: an integer representing the terminal symbol code being inserted.
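The insertion of an artificial terminal symbol can be sketched as follows. EnterSketchLexer is a hypothetical stand-in for LexerBase: its scan() method, the queue of prepared token codes, the use of 0 as the end-of-input code and the constant FRAGMENT_PRODUCTION are all assumptions made for illustration. Only the delegate and event declarations follow the definitions given on this page.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for LexerBase, reduced to the enter event.
public class EnterSketchLexer
{
    public delegate int scannerEnterCallback ();
    public event scannerEnterCallback scannerEnterEvent;

    // Prepared token codes stand in for real scanning of input text.
    private readonly Queue<int> pending;

    public EnterSketchLexer (IEnumerable<int> tokens)
    {
        pending = new Queue<int> (tokens);
    }

    public int scan ()
    {
        // Fire the event first; a positive result replaces scanning.
        int inserted = scannerEnterEvent?.Invoke () ?? 0;
        if (inserted > 0)
            return inserted;

        // In this sketch 0 means end of input (an assumption).
        return pending.Count > 0 ? pending.Dequeue () : 0;
    }
}

public static class EnterSketchDemo
{
    // Hypothetical code of an artificial terminal symbol without text.
    public const int FRAGMENT_PRODUCTION = 100;

    public static List<int> Run ()
    {
        var lexer = new EnterSketchLexer (new [] { 7, 8, 9 });
        bool fragmentSent = false;

        // Insert the artificial symbol exactly once, before any real token.
        lexer.scannerEnterEvent += () =>
        {
            if (fragmentSent)
                return 0;          // ignored by the lexer
            fragmentSent = true;
            return FRAGMENT_PRODUCTION;
        };

        var seen = new List<int> ();
        for (int t = lexer.scan (); t != 0; t = lexer.scan ())
            seen.Add (t);
        return seen;
    }
}
```

In this sketch the caller of scan() first receives FRAGMENT_PRODUCTION and then the codes 7, 8 and 9, because the handler returns zero on every later invocation.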
scannerLeaveEvent
Event Use: this event is intended for advanced scenarios where we must track the flow of terminal symbols returned by the lexical analyzer to the syntax analyzer. The event handler of this event is in fact the "man-in-the-middle". It can see all terminal symbols, together with their textual representations, returned by the lexical analyzer. But it cannot see the symbols that are skipped or those inserted by scannerEnterEvent event handlers. There is no general guidance on how to use this event. A good example would be the detection of a specific sequence of terminal symbols.
Event Source: the event is fired immediately before the completion of the method scan()
Event Subscriber: the subscriber to this event is usually an application that analyzes some structured text and is interested in the flow of terminal symbols. An example is a compiler of SNMP macros. This compiler is not interested in the definitions of SNMP macros, since they are all built into it. That's why it wants to skip them. The definition of every SNMP macro begins with the text sequence MACRO ::= BEGIN and ends with END. The idea of how to skip the definition of an SNMP macro is pretty simple:
  • find the sequence of consecutive terminal symbols representing the pieces of text 'MACRO', '::=' and 'BEGIN'. It is important that these pieces of text follow one another in the order given, without intermediate terminal symbols between them, except those which do not alter the syntax of the text, like comments, space characters and the like.
  • after we find the sequence of terminal symbols mentioned above, we can skip all terminal symbols until we find the one associated with the text 'END'
The first part of the algorithm given above can be achieved with an event handler subscribed to the event scannerLeaveEvent
Event Parameters: the event has the following parameter:
  • int token - an integer representing the terminal symbol code returned from the lexical analyzer to the syntax analyzer.
Event Return Value: the event has no return value
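The detection of the MACRO ::= BEGIN sequence can be sketched as a small state machine whose OnLeave method has the signature of scannerLeaveCallback. The class MacroHeaderDetector and the terminal symbol codes MACRO, ASSIGN and BEGIN are hypothetical; real codes would come from the class generated for the declaration part of the Anglr file. With a real lexer the handler would presumably be attached with something like lexer.scannerLeaveEvent += detector.OnLeave, while the sketch feeds token codes to it directly.

```csharp
// Detects the token sequence MACRO ::= BEGIN in the stream of terminal
// symbols reported to a scannerLeaveEvent handler.
public class MacroHeaderDetector
{
    // Hypothetical terminal symbol codes, chosen only for this sketch.
    public const int MACRO = 1, ASSIGN = 2, BEGIN = 3;

    private static readonly int [] header = { MACRO, ASSIGN, BEGIN };
    private int matched;   // number of header tokens matched so far

    public int HeadersFound { get; private set; }

    // Same signature as the delegate scannerLeaveCallback.
    public void OnLeave (int token)
    {
        if (token == header [matched])
            matched++;
        else
            matched = (token == header [0]) ? 1 : 0;   // restart the match

        if (matched == header.Length)
        {
            HeadersFound++;
            matched = 0;
        }
    }
}

public static class MacroHeaderDemo
{
    public static int Run ()
    {
        var detector = new MacroHeaderDetector ();
        int [] tokens =
        {
            9,                                  // some other terminal
            MacroHeaderDetector.MACRO,
            MacroHeaderDetector.MACRO,          // forces a restart
            MacroHeaderDetector.ASSIGN,
            MacroHeaderDetector.BEGIN,          // first complete header
            9,
            MacroHeaderDetector.MACRO,
            9,                                  // interrupts the sequence
            MacroHeaderDetector.BEGIN,
        };
        foreach (int token in tokens)
            detector.OnLeave (token);
        return detector.HeadersFound;
    }
}
```

Comments and white space do not disturb the match in this scheme, since skipped symbols are never reported through scannerLeaveEvent.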
scannerPushEvent
Event Use: this event is intended for scenarios where we want to detect the activation of a specific scanner so that we can, for example, initialize some objects.
Event Source: this event is fired by the method pushScanner(). This method can be invoked from elsewhere, typically by a scanner performing a push action, but it can also be invoked by the application itself if it has a reference to the object implementing the lexical analyzer.
Event Subscriber: the event can be subscribed to by any application that has a reference to the object implementing the syntax analyzer. It is typically used for scanner-level initialization, for example at the beginning of a comment. The event is fired before the new scanner reference is actually pushed onto the stack of scanner references.
Event Parameters: the event has the following parameters:
  • int oldIndex - an integer representing the index of the current scanner reference at the top of the scanner reference stack.
  • int newIndex - an integer representing the index of the scanner reference that will be pushed onto the top of the scanner reference stack.
Event Return Value: the event has no return value
scannerPopEvent
Event Use: this event is intended for scenarios where we want to detect the deactivation of a specific scanner and act accordingly. It is the opposite of scannerPushEvent and is intended to finalize something that was initiated when scannerPushEvent was fired.
Event Source: the event is fired by the method popScanner(). Like pushScanner(), this method can also be called from elsewhere in an application which has a reference to the object implementing the lexical analyzer. The event is fired after the current scanner reference is popped from the stack of scanner references.
Event Subscriber: the event can be subscribed to by any application that holds a reference to the object implementing the syntax analyzer. It is typically used for scanner-level finalization, for example at the end of a comment. The event is fired after the old scanner reference is actually popped from the stack of scanner references.
Event Parameters: the event has the following parameters:
  • int oldIndex - an integer representing the index of the scanner reference being popped from the top of the stack of scanner references.
  • int newIndex - an integer representing the index of the scanner reference at the top of the scanner reference stack, uncovered by the popScanner() method.
Event Return Value: the event has no return value
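The pairing of scannerPushEvent and scannerPopEvent can be sketched as follows. StackSketchLexer is a hypothetical stand-in for LexerBase: the internal stack, the value -1 reported when the stack is empty, and the constants main_ctx and comment_ctx are assumptions. The delegate signatures and the firing order (push event before the push, pop event after the pop) follow the descriptions on this page.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for LexerBase, reduced to the scanner stack
// and the two events that report changes of the active scanner.
public class StackSketchLexer
{
    public delegate void scannerPushCallback (int oldCtx, int newCtx);
    public delegate void scannerPopCallback (int oldCtx, int newCtx);

    public event scannerPushCallback scannerPushEvent;
    public event scannerPopCallback scannerPopEvent;

    private readonly Stack<int> scanners = new Stack<int> ();

    // Index at the top of the stack, or -1 if the stack is empty
    // (the -1 is an assumption of this sketch).
    private int Top { get { return scanners.Count > 0 ? scanners.Peek () : -1; } }

    public void pushScanner (int newCtx)
    {
        scannerPushEvent?.Invoke (Top, newCtx);   // fired before the push
        scanners.Push (newCtx);
    }

    public void popScanner ()
    {
        int oldCtx = scanners.Pop ();
        scannerPopEvent?.Invoke (oldCtx, Top);    // fired after the pop
    }
}

public static class StackSketchDemo
{
    // Hypothetical scanner index constants.
    public const int main_ctx = 0, comment_ctx = 1;

    public static List<string> Run ()
    {
        var log = new List<string> ();
        var lexer = new StackSketchLexer ();

        // Initialize something when the comment scanner is activated ...
        lexer.scannerPushEvent += (oldCtx, newCtx) =>
        {
            if (newCtx == comment_ctx) log.Add ("comment begins");
        };
        // ... and finalize it when the comment scanner is deactivated.
        lexer.scannerPopEvent += (oldCtx, newCtx) =>
        {
            if (oldCtx == comment_ctx) log.Add ("comment ends");
        };

        lexer.pushScanner (main_ctx);
        lexer.pushScanner (comment_ctx);
        lexer.popScanner ();
        return log;
    }
}
```

Filtering on newCtx in the push handler and on oldCtx in the pop handler keeps the two handlers symmetric: both react only to the comment scanner.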
scannerTokenEvent
Event Use: the event is intended for advanced scenarios where we want to detect every piece of text matched by any scanner of the lexical analyzer. These include the terminal symbols returned to the syntax analyzer. The event is not fired for terminal symbols inserted with scannerEnterEvent. This event is particularly suitable for restoring those parts of the text that are not relevant to the syntax of the text.
Event Source: the event is fired by the method scan() immediately after text has been matched by the currently active scanner. The terminal symbol, if any, associated with this piece of text need not be returned from the method scan(). This is always the case if the regular expression associated with the matched text invokes the skip action.
Event Subscriber: the subscriber to this event is usually an application interested in collecting text which is not visible to the syntax analyzer.
Event Parameters: the event has the following parameters:
  • int ctx - an integer representing the index of the active scanner object reference
  • int token - one of:
    • an integer representing the terminal symbol code, if the matched text will be returned
    • -1, if the matched text is skipped
  • string text - the matched piece of text
Event Return Value: the event has no return value
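Collecting text that is invisible to the syntax analyzer can be sketched with a handler whose OnToken method has the signature of scannerTokenCallback. The class SkippedTextCollector, the token codes and the sample text pieces are hypothetical, and the sketch assumes that the event reports every matched piece of text in input order. With a real lexer the handler would presumably be attached with something like lexer.scannerTokenEvent += collector.OnToken.

```csharp
using System.Text;

// Gathers matched text reported through a scannerTokenEvent handler.
public class SkippedTextCollector
{
    private readonly StringBuilder skipped = new StringBuilder ();
    private readonly StringBuilder all = new StringBuilder ();

    // Same signature as the delegate scannerTokenCallback.
    public void OnToken (int ctx, int token, string text)
    {
        all.Append (text);
        if (token == -1)          // -1 marks text that is skipped
            skipped.Append (text);
    }

    public string SkippedText { get { return skipped.ToString (); } }
    public string AllText { get { return all.ToString (); } }
}

public static class SkippedTextDemo
{
    public static SkippedTextCollector Run ()
    {
        var collector = new SkippedTextCollector ();

        // Hypothetical matches for the input "x = /* c */1":
        // scanner index, token code (or -1), matched text.
        collector.OnToken (0, 5, "x");
        collector.OnToken (0, -1, " ");
        collector.OnToken (0, 6, "=");
        collector.OnToken (0, -1, " ");
        collector.OnToken (1, -1, "/* c */");
        collector.OnToken (0, 7, "1");
        return collector;
    }
}
```

Under the stated assumption, AllText reproduces the scanned input, while SkippedText contains only the white space and the comment.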

Fields

Fields contain internal state of the LexerBase object. It has these fields:

Properties

Methods

scan
pushScanner
popScanner
pushInput
popInput