Lexical Analyzer Source Code
Introduction
The lexical parts of an Anglr file are used to generate the source code of lexical analyzers. The generated code depends on the settings found in the attributes UseScanner and CompilationInfo associated with the lexer part of the Anglr file.
UseScanner Attribute
This attribute is mandatory for the lexical part of an Anglr file. It references all scanners used to implement the lexical analyzer associated with this lexer part of the Anglr file. It contains these settings used by the Anglr compiler:
| Name | Value | Description |
|---|---|---|
| ScannerId | id of scanner part | This id references the scanner part used to create a scanner class, which is used exclusively by the lexical analyzer referencing it. A single UseScanner attribute may contain many ScannerId settings, as many as there are scanners used by the lexical analyzer. |
| InitialScanner | id of scanner part | This id references the scanner part whose scanner is used when the lexical analyzer is started. Every scanner can then activate other scanners when needed, but only those referenced by this attribute. If this setting is not found in the UseScanner attribute, the first scanner referenced by a ScannerId setting is taken as the scanner that is active at lexical analyzer startup. |
CompilationInfo Attribute
The CompilationInfo attribute is mandatory for the lexical part of an Anglr file. It supplies the following information to the Anglr compiler:
| Name | Value | Description |
|---|---|---|
| ClassName | name of generated class | This name is used by the Anglr compiler to create the class that contains the implementation of the lexical analyzer. |
| NameSpace | namespace of generated class | This name is used by the Anglr compiler as the name of the namespace containing the generated class. If this setting is not found in the CompilationInfo attribute, the namespace name found in the NameSpace setting of the CompilationInfo attribute of the general part of the Anglr file is used instead, if there is one. Otherwise the name of the Anglr file is taken as the name of the namespace. |
| Access | access of generated class | Access can have one of the following three values: internal, public and private. Of these, internal and public are preferred, since private access makes the generated class inaccessible. |
Generated Code
The class implementing a lexical analyzer is actually a list of objects implementing the scanners associated with the scanner parts listed in the UseScanner attribute of the lexical part associated with this lexical analyzer. The primary goal of the class implementing the lexical analyzer is to create these scanners. The actual lexical analysis of the text is performed by the class LexerBase, which is subclassed by the class implementing the lexical analyzer. Let's take a look at an example of a lexer part and the C# source code generated for it by the Anglr compiler. This time we take the lexical analyzer defined in the Anglr file which defines the Anglr language itself:
[ Description Text='Lexer for anglr file' Hover='true' ]
[ UseScanner
    ScannerId='comment_ctx'
    ScannerId='attribute_ctx'
    ScannerId='scanner_id_ctx'
    ScannerId='scanner_part_ctx'
    ScannerId='regex_id_ctx'
    ScannerId='regex_part_ctx'
    ScannerId='regex_block_ctx'
    ScannerId='regex_block_part_ctx'
    InitialScanner='anglrScanner'
    Hover='true'
]
[ CompilationInfo ClassName='AnglrLexer' NameSpace='Anglr.Lexer' Access='public' Hover='true' ]
%lexer anglrLexer
%{
%}
From the attributes which belong to the lexer part we can see the following:
- The lexical analyzer is composed of nine scanners
- The name of the class implementing the lexical analyzer is AnglrLexer
- The class implementing the lexical analyzer is a member of the namespace Anglr.Lexer
The generated code reflects the above observations:
namespace Anglr.Lexer
{
    public class AnglrLexer : LexerBase
    {
        internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
        internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
        internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
        internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
        internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
        internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
        internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
        internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
        internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }

        public AnglrLexer (TextReader textReader)
        {
            Init ();
            pushInput (textReader);
            pushScanner (anglrScanner);
        }

        public AnglrLexer (string [] lines)
        {
            Init ();
            pushInput (lines);
            pushScanner (anglrScanner);
        }

        public AnglrLexer (string line)
        {
            Init ();
            pushInput (line);
            pushScanner (anglrScanner);
        }

        public void Init ()
        {
            regarray = new RegexInterface []
            {
                new AnglrLexer_CommentRegex (this),
                new AnglrLexer_AttributeRegex (this),
                new AnglrLexer_ScannerIdRegex (this),
                new AnglrLexer_ScannerPartRegex (this),
                new AnglrLexer_RegexIdRegex (this),
                new AnglrLexer_RegexPartRegex (this),
                new AnglrLexer_RegexBlockRegex (this),
                new AnglrLexer_RegexBlockPartRegex (this),
                new AnglrLexer_AnglrRegex (this),
            };
        }

        // scanner codes
        public const int comment_ctx = 0;
        public const int attribute_ctx = 1;
        public const int scanner_id_ctx = 2;
        public const int scanner_part_ctx = 3;
        public const int regex_id_ctx = 4;
        public const int regex_part_ctx = 5;
        public const int regex_block_ctx = 6;
        public const int regex_block_part_ctx = 7;
        public const int anglrScanner = 8;
    }
}
The class implementing the lexical analyzer is composed in the following way:
- At the beginning of the class is the list of properties referencing the objects that represent the scanners composing the lexical analyzer, as in the generated code above. The names of the properties are the same as the class names given in the ClassName setting of the CompilationInfo attribute of the scanner part associated with the scanner object referenced by the property.
      internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
      internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
      internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
      internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
      internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
      internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
      internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
      internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
      internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }
- The list of scanner references is followed by three constructors of the class implementing the lexical analyzer, provided for different kinds of input:
  - the first one can read text file input
        public AnglrLexer (TextReader textReader)
        {
            Init ();
            pushInput (textReader);
            pushScanner (anglrScanner);
        }
  - the second one can handle input from an array of strings
        public AnglrLexer (string [] lines)
        {
            Init ();
            pushInput (lines);
            pushScanner (anglrScanner);
        }
  - the last one takes a single string as its input
        public AnglrLexer (string line)
        {
            Init ();
            pushInput (line);
            pushScanner (anglrScanner);
        }
- The Init() method creates all scanners used by the lexical analyzer. It creates them by initializing the array regarray with references to the newly created scanner objects that constitute the lexical analyzer.
      public void Init ()
      {
          regarray = new RegexInterface []
          {
              new AnglrLexer_CommentRegex (this),
              new AnglrLexer_AttributeRegex (this),
              new AnglrLexer_ScannerIdRegex (this),
              new AnglrLexer_ScannerPartRegex (this),
              new AnglrLexer_RegexIdRegex (this),
              new AnglrLexer_RegexPartRegex (this),
              new AnglrLexer_RegexBlockRegex (this),
              new AnglrLexer_RegexBlockPartRegex (this),
              new AnglrLexer_AnglrRegex (this),
          };
      }
- At the end of the class is the list of integer constants representing the index values that can be used to access the scanner references in the array regarray created by the Init() method.
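The layout described above (an array of scanner objects, integer constants that index it, and typed properties that cast the array elements) can be reduced to a small self-contained sketch. The types ScannerStub, CommentScanner, MainScanner and MiniLexer are hypothetical stand-ins invented for this illustration; the real RegexInterface and LexerBase belong to the Anglr runtime and are not reproduced here.

```csharp
using System;

// Hypothetical stand-in for RegexInterface; not the Anglr runtime type.
public abstract class ScannerStub
{
    protected ScannerStub (MiniLexer owner) { Owner = owner; }

    // Every scanner keeps a reference to the lexer that created it.
    public MiniLexer Owner { get; }
}

public sealed class CommentScanner : ScannerStub
{
    public CommentScanner (MiniLexer owner) : base (owner) { }
}

public sealed class MainScanner : ScannerStub
{
    public MainScanner (MiniLexer owner) : base (owner) { }
}

// Mirrors the shape of the generated class: properties, Init(), index constants.
public class MiniLexer
{
    private ScannerStub [] regarray = new ScannerStub [0];

    // Typed accessors cast the array element selected by the matching constant.
    public CommentScanner CommentRegex { get { return (CommentScanner) regarray [comment_ctx]; } }
    public MainScanner MainRegex { get { return (MainScanner) regarray [mainScanner]; } }

    public MiniLexer () { Init (); }

    // The order of the elements must match the values of the constants below.
    public void Init ()
    {
        regarray = new ScannerStub []
        {
            new CommentScanner (this),
            new MainScanner (this),
        };
    }

    // scanner codes
    public const int comment_ctx = 0;
    public const int mainScanner = 1;
}
```

The point of the sketch is the coupling between the order of the array initializer and the constant values: if the two get out of step, the casts in the properties fail at run time, which is one reason such code is better generated than written by hand.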
Lexer superclass
Introduction
Class LexerBase is the superclass of all classes implementing lexical analyzers. It is the real workhorse of any lexical analyzer. It is invoked repeatedly by the syntax analyzer until it reaches the end of the source text, be it a text file, a string array or a single string. Basically, this class is the driver for the scanners that make up the lexical analyzer:
- it uses scanners to gather the flow of terminal symbols from the input text
- it contains scanner manipulation routines
- it contains mechanisms with which an application can fine-tune the scanning process
- it also contains methods with which an application can manipulate the set of input source texts
This part of the page is worth reading to understand the advanced features of the lexical analyzer. These features can be used to fine-tune the lexical analysis process.
Properties and fields
Delegates
Class LexerBase defines several delegates which are the prototypes for the events fired on different occasions during the lexical analysis of the input text. These delegates are defined in the following way:
public delegate int scannerEnterCallback ();
public delegate void scannerLeaveCallback (int token);
public delegate void scannerPushCallback (int oldCtx, int newCtx);
public delegate void scannerPopCallback (int oldCtx, int newCtx);
public delegate void scannerTokenCallback (int ctx, int token, string text);
scannerEnterCallback
| Parameters | this delegate has no parameters |
| Return Value | an integer representing the terminal symbol code being inserted |
| Description | This callback is intended for scenarios where we want to insert some terminal into the input. The insertion concerns only the terminal symbol code; no text is actually inserted into the input text. |
scannerLeaveCallback
| Parameters | token (int), as declared in the delegate definition above |
| Return Value | this delegate has no return value |
| Description | This delegate is intended for scenarios where we want to detect the terminal symbol code being retrieved by the lexical analyzer and sent to the syntax analyzer. |
scannerPushCallback
| Parameters | oldCtx (int) and newCtx (int), as declared in the delegate definition above |
| Return Value | this delegate has no return value |
| Description | This delegate is intended for scenarios where we want to detect which scanner covers the previous one. |
scannerPopCallback
| Parameters | oldCtx (int) and newCtx (int), as declared in the delegate definition above |
| Return Value | this delegate has no return value |
| Description | This delegate is intended for scenarios where we want to detect which scanner becomes uncovered when the current one is popped from the scanner stack. |
scannerTokenCallback
| Parameters | ctx (int), token (int) and text (string), as declared in the delegate definition above |
| Return Value | this delegate has no return value |
| Description | This delegate is intended for scenarios where we want to detect which token has been retrieved by the lexical analyzer. The terminal need not be returned to the syntax analyzer; it may be skipped. The delegate can therefore be used to detect every terminal symbol, whether it is skipped or not. |
Events
Immediately after the definitions of the delegates come the definitions of the events fired by the lexical analyzer. Every event is associated with one of the delegates discussed above. They are defined in the following way:
public event scannerEnterCallback scannerEnterEvent;
public event scannerLeaveCallback scannerLeaveEvent;
public event scannerPushCallback scannerPushEvent;
public event scannerPopCallback scannerPopEvent;
public event scannerTokenCallback scannerTokenEvent;
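How an application attaches handlers to such events can be shown with a minimal sketch. LexerEventsStub is a hypothetical stand-in that only repeats three of the delegate and event declarations quoted above; it is not LexerBase and does no scanning. The Raise... helpers and the scanner codes 8 and 0 are likewise invented for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in: same delegate and event names as quoted above, nothing else.
public class LexerEventsStub
{
    public delegate void scannerPushCallback (int oldCtx, int newCtx);
    public delegate void scannerPopCallback (int oldCtx, int newCtx);
    public delegate void scannerTokenCallback (int ctx, int token, string text);

    public event scannerPushCallback scannerPushEvent;
    public event scannerPopCallback scannerPopEvent;
    public event scannerTokenCallback scannerTokenEvent;

    // Helpers invented for the sketch so that the events can be raised by hand.
    public void RaisePush (int oldCtx, int newCtx) { scannerPushEvent?.Invoke (oldCtx, newCtx); }
    public void RaisePop (int oldCtx, int newCtx) { scannerPopEvent?.Invoke (oldCtx, newCtx); }
    public void RaiseToken (int ctx, int token, string text) { scannerTokenEvent?.Invoke (ctx, token, text); }
}

public static class EventWiringDemo
{
    // Subscribes one handler per event and records what each handler receives.
    public static List<string> Run ()
    {
        var log = new List<string> ();
        var lexer = new LexerEventsStub ();

        lexer.scannerPushEvent += (oldCtx, newCtx) => log.Add ("push " + oldCtx + "->" + newCtx);
        lexer.scannerPopEvent += (oldCtx, newCtx) => log.Add ("pop " + oldCtx + "->" + newCtx);
        lexer.scannerTokenEvent += (ctx, token, text) => log.Add ("token " + token + " '" + text + "'");

        // Simulate: a comment scanner (code 0) is pushed over scanner 8, matches text, is popped.
        lexer.RaisePush (8, 0);
        lexer.RaiseToken (0, 5, "/*");
        lexer.RaisePop (0, 8);
        return log;
    }
}
```

With a generated lexer the subscription lines would look the same, only the object on the left of `+=` would be the lexer instance created by the application.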
scannerEnterEvent
| Event Use | This event is intended for more advanced scenarios where the syntax of some structured text contains terminal symbols which have no textual representation and must be generated by the application. Using these symbols, we artificially direct the course of the syntax analysis of the structured text. |
| Event Source | The event is fired at the beginning of the scanning phase of the lexical analyzer. By invoking this event, the lexical analyzer gives a potential subscriber the ability to influence the scanning process of the input text. The flow of the scanning process can be influenced by the return value of the event handler: |
| Event Subscriber | The subscriber to this event is usually an application that analyzes some structured text. For the analysis it uses a parser generated by the Anglr compiler. A typical example is the Anglr compiler itself. Every text analyzed by the Anglr compiler is in fact a fragment of an Anglr file. The parser of the Anglr compiler works like this: |
| Event Parameters | this event has no parameters |
| Event Return Value | an integer representing the terminal symbol code being inserted |
scannerLeaveEvent
| Event Use | This event is intended for advanced scenarios where we must track the flow of terminal symbols returned by the lexical analyzer to the syntax analyzer. The handler of this event is in fact the "man in the middle". It can see all terminal symbols, together with their textual representations, returned by the lexical analyzer. It cannot see the symbols that are skipped, nor those inserted by scannerEnterEvent handlers. There is no general guidance on how to use this event. A good example is the detection of a specific sequence of terminal symbols. |
| Event Source | The event is fired immediately before the completion of the method scan(). |
| Event Subscriber | The subscriber to this event is usually an application that analyzes some structured text and is interested in the flow of terminal symbols. An example is a compiler of SNMP macros. This compiler is not interested in the definitions of SNMP macros, since they are all built into it, so it wants to skip them. The definition of every SNMP macro begins with the text sequence MACRO ::= BEGIN and ends with END. The idea of how to skip the definition of an SNMP macro is pretty simple: |
| Event Parameters | the event has the following parameter: token (int) |
| Event Return Value | the event has no return value |
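The detection of the sequence MACRO ::= BEGIN ... END mentioned for the SNMP macro compiler can be sketched as a small state machine driven by a handler with the scannerLeaveCallback signature. The terminal symbol codes in Tok and the class MacroBodyDetector are hypothetical; real codes come from the generated parser. The source does not show how the compiler actually skips the macro body, so the sketch only detects where a body begins and ends.

```csharp
using System;

// Hypothetical terminal symbol codes, invented for this sketch.
public static class Tok
{
    public const int MACRO = 1;
    public const int ASSIGN = 2;   // the text "::="
    public const int BEGIN = 3;
    public const int END = 4;
    public const int OTHER = 99;
}

// Tracks the sequence MACRO ::= BEGIN ... END in the stream of returned terminals.
public class MacroBodyDetector
{
    private int matched;   // number of symbols of the opening sequence seen so far

    public bool InsideMacroBody { get; private set; }
    public int MacrosSeen { get; private set; }

    // Same shape as: public delegate void scannerLeaveCallback (int token);
    public void OnScannerLeave (int token)
    {
        if (InsideMacroBody)
        {
            // Everything up to END belongs to the macro body.
            if (token == Tok.END) { InsideMacroBody = false; MacrosSeen++; }
            return;
        }

        switch (matched)
        {
            case 0:
                matched = RestartOn (token);
                break;
            case 1:   // MACRO seen
                matched = (token == Tok.ASSIGN) ? 2 : RestartOn (token);
                break;
            default:  // MACRO ::= seen
                if (token == Tok.BEGIN) { InsideMacroBody = true; matched = 0; }
                else matched = RestartOn (token);
                break;
        }
    }

    // A broken sequence may itself start a new one if the offending token is MACRO.
    private static int RestartOn (int token) { return token == Tok.MACRO ? 1 : 0; }
}
```

With a real lexer the handler would be attached with `lexer.scannerLeaveEvent += detector.OnScannerLeave;`, and the application would consult InsideMacroBody to decide what to do with the terminals it sees.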
scannerPushEvent
| Event Use | This event is intended for scenarios where we want to detect the activation of a specific scanner so that we can, for example, initialize some objects. |
| Event Source | This event is fired by the method pushScanner(). This method can be invoked from various places, typically by a scanner when it invokes the push action, but it can also be invoked by the application itself if it has a reference to the object implementing the lexical analyzer. |
| Event Subscriber | The event should be subscribed to by any application that has a reference to the object implementing the syntax analyzer. It is typically used to perform some scanner-level initialization, for example at the beginning of a comment. The event is fired before the new scanner reference is actually pushed onto the stack of scanner references. |
| Event Parameters | the event has the following parameters: oldCtx (int), newCtx (int) |
| Event Return Value | the event has no return value |
scannerPopEvent
| Event Use | This event is intended for scenarios where we want to detect the deactivation of a specific scanner and act accordingly. It is the opposite of scannerPushEvent and is intended to finalize something that was initiated when scannerPushEvent was fired. |
| Event Source | The event is fired by the method popScanner(). Like pushScanner(), this method can also be called from elsewhere in an application which has a reference to the object implementing the lexical analyzer. The event is fired after the current scanner reference is popped from the stack of scanner references. |
| Event Subscriber | The event should be subscribed to by any application that holds a reference to the object implementing the syntax analyzer. It is typically used to perform some scanner-level finalization, for example at the end of a comment. The event is fired after the old scanner reference is actually popped from the stack of scanner references. |
| Event Parameters | the event has the following parameters: oldCtx (int), newCtx (int) |
| Event Return Value | the event has no return value |
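The pairing of push and pop notifications (initialize on activation, finalize on deactivation) can be sketched as follows. ScannerActivationTracker is a hypothetical helper, not part of Anglr. The sketch assumes that newCtx in the push handler is the code of the scanner being pushed and that oldCtx in the pop handler is the code of the scanner just popped; the parameter names suggest this, but the text above does not state it explicitly.

```csharp
using System;

// Pairs push and pop notifications for one scanner code.
public class ScannerActivationTracker
{
    private readonly int watchedCtx;
    private int depth;   // how many times the watched scanner is currently on the stack

    public int Activations { get; private set; }
    public int Deactivations { get; private set; }
    public bool IsActive { get { return depth > 0; } }

    public ScannerActivationTracker (int watchedCtx) { this.watchedCtx = watchedCtx; }

    // Same shape as scannerPushCallback (int oldCtx, int newCtx).
    // Assumption: newCtx is the code of the scanner being pushed.
    public void OnPush (int oldCtx, int newCtx)
    {
        if (newCtx != watchedCtx) return;
        depth++;
        Activations++;     // scanner-level initialization would go here
    }

    // Same shape as scannerPopCallback (int oldCtx, int newCtx).
    // Assumption: oldCtx is the code of the scanner that was popped.
    public void OnPop (int oldCtx, int newCtx)
    {
        if (oldCtx != watchedCtx || depth == 0) return;
        depth--;
        Deactivations++;   // scanner-level finalization would go here
    }
}
```

A tracker created for the comment scanner, for example `new ScannerActivationTracker (AnglrLexer.comment_ctx)`, would then be attached to scannerPushEvent and scannerPopEvent of the lexer.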
scannerTokenEvent
| Event Use | The event is intended for advanced scenarios where we want to detect every piece of text matched by any scanner of the lexical analyzer. Among them are also the terminal symbols returned to the syntax analyzer. This event is not fired for terminal symbols inserted with scannerEnterEvent. The event is particularly suitable for restoring those parts of the text that are not relevant to the syntax of the text. |
| Event Source | The event is fired by the method scan() immediately after the text has been matched by the currently active scanner. The terminal symbol, if any, associated with this piece of text need not be returned from the method scan(). This is always the case if the regular expression associated with the matched text invokes the skip action. |
| Event Subscriber | The subscriber to this event is usually an application which is interested in collecting text that is not visible to the syntax analyzer. |
| Event Parameters | the event has the following parameters: ctx (int), token (int), text (string) |
| Event Return Value | the event has no return value |
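Collecting matched text, including text the syntax analyzer never sees, can be sketched with a handler of the scannerTokenCallback shape. MatchedTextCollector is a hypothetical helper. The sketch assumes that the handler is called once for each matched piece of text, in input order; whether the concatenation reproduces the whole input depends on every character being matched by some scanner rule, which the text above does not guarantee.

```csharp
using System;
using System.Text;

// Accumulates every piece of text reported through the token event.
public class MatchedTextCollector
{
    private readonly StringBuilder all = new StringBuilder ();

    public int Matches { get; private set; }

    // Same shape as scannerTokenCallback (int ctx, int token, string text).
    public void OnToken (int ctx, int token, string text)
    {
        all.Append (text);
        Matches++;
    }

    // The matched pieces joined in the order in which they were reported.
    public string Text { get { return all.ToString (); } }
}
```

The handler would be attached with `lexer.scannerTokenEvent += collector.OnToken;`. Filtering on ctx or token inside OnToken narrows the collection to, say, comment text only.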
Fields
Fields contain internal state of the LexerBase object. It has these fields: