Lexical Analyzer Source Code
Introduction
The lexical parts of an Anglr file are used to generate the source code of lexical analyzers. The generated code depends on the settings
found in the UseScanner and CompilationInfo attributes associated with the lexer part of the Anglr file.
UseScanner Attribute
This attribute is mandatory for the lexical part of an Anglr file. It references all scanners used to implement the lexical analyzer associated
with this lexer part. It contains these settings used by the Anglr compiler:
Name | Value | Description
ScannerId | id of scanner part | References the scanner part from which a scanner class is created; that class is used exclusively by the lexical analyzer referencing it. A single UseScanner attribute may contain many ScannerId settings, one for each scanner used by the lexical analyzer.
InitialScanner | id of scanner part | References the scanner part whose scanner is active when the lexical analyzer starts. Every scanner can then activate other scanners when needed, but only those referenced by this attribute. If this setting is missing from the UseScanner attribute, the first scanner referenced by a ScannerId setting is the one active at lexical analyzer startup.
CompilationInfo Attribute
The CompilationInfo attribute is mandatory for the lexical part of an Anglr file. It supplies the following information to the
Anglr compiler:
Name | Value | Description
ClassName | name of generated class | The Anglr compiler uses this name for the class that contains the implementation of the lexical analyzer.
NameSpace | namespace of generated class | The Anglr compiler uses this name for the namespace containing the generated class. If this setting is missing from the CompilationInfo attribute, the NameSpace setting of the CompilationInfo attribute of the general part of the Anglr file is used instead, if there is one. Otherwise the name of the Anglr file is used as the namespace name.
Access | access of generated class | One of three values: internal, public, or private. The values internal and public are preferred, since private access makes the generated class inaccessible.
Generated Code
The class implementing a lexical analyzer is essentially a list of objects implementing the scanners associated with the scanner parts listed in the UseScanner attribute
of the corresponding lexical part. The primary goal of this class is to create these scanners. The actual
lexical analysis of the text is performed by the class LexerBase, which the class implementing the lexical analyzer subclasses. Let's take a look at an example
of a lexer part and the C# source code generated from it by the Anglr compiler. This time we take the lexical analyzer defined in the Anglr file that defines the Anglr language
itself:
[ Description Text='Lexer for anglr file' Hover='true' ]
[
UseScanner
ScannerId='comment_ctx'
ScannerId='attribute_ctx'
ScannerId='scanner_id_ctx'
ScannerId='scanner_part_ctx'
ScannerId='regex_id_ctx'
ScannerId='regex_part_ctx'
ScannerId='regex_block_ctx'
ScannerId='regex_block_part_ctx'
InitialScanner='anglrScanner'
Hover='true'
]
[ CompilationInfo ClassName='AnglrLexer' NameSpace='Anglr.Lexer' Access='public' Hover='true' ]
%lexer anglrLexer
%{
%}
From the attributes of the lexer part we can see the following:
-
The lexical analyzer is composed of nine scanners: eight referenced by ScannerId settings plus the initial scanner
-
The name of the class implementing the lexical analyzer is AnglrLexer
-
The class implementing the lexical analyzer is a member of the namespace Anglr.Lexer
The generated code reflects these observations:
namespace Anglr.Lexer
{
public class AnglrLexer : LexerBase
{
internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }
public AnglrLexer (TextReader textReader)
{
Init ();
pushInput (textReader);
pushScanner (anglrScanner);
}
public AnglrLexer (string [] lines)
{
Init ();
pushInput (lines);
pushScanner (anglrScanner);
}
public AnglrLexer (string line)
{
Init ();
pushInput (line);
pushScanner (anglrScanner);
}
public void Init ()
{
regarray = new RegexInterface []
{
new AnglrLexer_CommentRegex (this),
new AnglrLexer_AttributeRegex (this),
new AnglrLexer_ScannerIdRegex (this),
new AnglrLexer_ScannerPartRegex (this),
new AnglrLexer_RegexIdRegex (this),
new AnglrLexer_RegexPartRegex (this),
new AnglrLexer_RegexBlockRegex (this),
new AnglrLexer_RegexBlockPartRegex (this),
new AnglrLexer_AnglrRegex (this),
};
}
// scanner codes
public const int comment_ctx = 0;
public const int attribute_ctx = 1;
public const int scanner_id_ctx = 2;
public const int scanner_part_ctx = 3;
public const int regex_id_ctx = 4;
public const int regex_part_ctx = 5;
public const int regex_block_ctx = 6;
public const int regex_block_part_ctx = 7;
public const int anglrScanner = 8;
}
}
The class implementing the lexical analyzer is structured as follows:
-
At the beginning of the class is a list of properties referencing the objects that represent the scanners composing the lexical analyzer, as in the
generated code above:
internal AnglrLexer_CommentRegex CommentRegex { get { return (AnglrLexer_CommentRegex) regarray [comment_ctx]; } }
internal AnglrLexer_AttributeRegex AttributeRegex { get { return (AnglrLexer_AttributeRegex) regarray [attribute_ctx]; } }
internal AnglrLexer_ScannerIdRegex ScannerIdRegex { get { return (AnglrLexer_ScannerIdRegex) regarray [scanner_id_ctx]; } }
internal AnglrLexer_ScannerPartRegex ScannerPartRegex { get { return (AnglrLexer_ScannerPartRegex) regarray [scanner_part_ctx]; } }
internal AnglrLexer_RegexIdRegex RegexIdRegex { get { return (AnglrLexer_RegexIdRegex) regarray [regex_id_ctx]; } }
internal AnglrLexer_RegexPartRegex RegexPartRegex { get { return (AnglrLexer_RegexPartRegex) regarray [regex_part_ctx]; } }
internal AnglrLexer_RegexBlockRegex RegexBlockRegex { get { return (AnglrLexer_RegexBlockRegex) regarray [regex_block_ctx]; } }
internal AnglrLexer_RegexBlockPartRegex RegexBlockPartRegex { get { return (AnglrLexer_RegexBlockPartRegex) regarray [regex_block_part_ctx]; } }
internal AnglrLexer_AnglrRegex AnglrRegex { get { return (AnglrLexer_AnglrRegex) regarray [anglrScanner]; } }
The property names match the class names given in the ClassName setting of the CompilationInfo attribute of the
scanner part associated with the scanner object referenced by the property.
-
The list of scanner references is followed by three constructors of the class implementing the lexical analyzer, provided for different
kinds of input:
-
the first one reads input from a TextReader, such as a text file
public AnglrLexer (TextReader textReader)
{
Init ();
pushInput (textReader);
pushScanner (anglrScanner);
}
-
the second one can handle input from an array of strings
public AnglrLexer (string [] lines)
{
Init ();
pushInput (lines);
pushScanner (anglrScanner);
}
-
the last one takes a single string as its input.
public AnglrLexer (string line)
{
Init ();
pushInput (line);
pushScanner (anglrScanner);
}
All of them have a similar structure. Each one initializes itself by calling the method Init(), then registers the text source passed as
its input parameter, and finally pushes onto the scanner stack the reference of the initial scanner, the one associated with the scanner part named in the
InitialScanner setting of the UseScanner attribute of the lexer part associated with the lexical analyzer. The scanner stack now contains one
scanner reference.
-
The Init() method creates all scanners used by the lexical analyzer. It creates them by initializing the array regarray with references to the newly
created scanner objects which constitute the lexical analyzer.
public void Init ()
{
regarray = new RegexInterface []
{
new AnglrLexer_CommentRegex (this),
new AnglrLexer_AttributeRegex (this),
new AnglrLexer_ScannerIdRegex (this),
new AnglrLexer_ScannerPartRegex (this),
new AnglrLexer_RegexIdRegex (this),
new AnglrLexer_RegexPartRegex (this),
new AnglrLexer_RegexBlockRegex (this),
new AnglrLexer_RegexBlockPartRegex (this),
new AnglrLexer_AnglrRegex (this),
};
}
-
At the end of the class is a list of integer constants representing the index values used to access the scanner references in the
array regarray created by the Init() method.
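The relationship between the scanner codes, the array regarray and the scanner stack can be illustrated with a small self-contained sketch. It is not the generated code: StubScanner and MiniLexer are hypothetical stand-ins, only two scanners are modelled, and the scanner stack is assumed to be a plain stack of integer codes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a generated scanner class.
sealed class StubScanner
{
    public string Name { get; }
    public StubScanner(string name) { Name = name; }
}

// Hypothetical miniature of a generated lexer; the real base class is LexerBase.
sealed class MiniLexer
{
    // Scanner codes: indexes into regarray, in the order Init() creates the scanners.
    public const int comment_ctx = 0;
    public const int anglrScanner = 1;

    private StubScanner[] regarray = Array.Empty<StubScanner>();
    private readonly Stack<int> scannerStack = new Stack<int>();

    public MiniLexer()
    {
        Init();
        pushScanner(anglrScanner); // the initial scanner becomes active
    }

    public void Init()
    {
        regarray = new[]
        {
            new StubScanner("comment"), // index 0 == comment_ctx
            new StubScanner("anglr"),   // index 1 == anglrScanner
        };
    }

    public void pushScanner(int ctx) => scannerStack.Push(ctx);
    public void popScanner() => scannerStack.Pop();

    public int Depth => scannerStack.Count;
    public StubScanner Active => regarray[scannerStack.Peek()];
}

static class Demo
{
    static void Main()
    {
        var lexer = new MiniLexer();
        Console.WriteLine($"{lexer.Depth} {lexer.Active.Name}"); // prints "1 anglr"

        lexer.pushScanner(MiniLexer.comment_ctx);
        Console.WriteLine($"{lexer.Depth} {lexer.Active.Name}"); // prints "2 comment"

        lexer.popScanner();
        Console.WriteLine($"{lexer.Depth} {lexer.Active.Name}"); // prints "1 anglr"
    }
}
```

Because the constants are assigned in the same order in which Init() fills the array, a scanner code can be used directly as an array index, which is what the generated properties do.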
Lexer superclass
Introduction
Class LexerBase is the superclass of all classes implementing lexical analyzers. It is the real workhorse of any lexical analyzer. The syntax analyzer
invokes it repeatedly until it reaches the end of the source text, whether that is a text file, a string array or a single string. Basically, this class is the
driver for the scanners that make up the lexical analyzer:
-
it uses scanners to extract a stream of terminal symbols from the input text
-
it contains scanner manipulation routines
-
it contains mechanisms with which an application can fine-tune the scanning process
-
it also contains methods with which an application can manipulate the set of input source texts
This part of the page is worth
reading to understand the advanced features of the lexical analyzer. These features can be used to fine-tune the lexical analysis process.
Properties and fields
Delegates
Class LexerBase defines several delegates that serve as prototypes for the events fired at different points during lexical analysis
of the input text. These delegates are defined as follows:
public delegate int scannerEnterCallback ();
public delegate void scannerLeaveCallback (int token);
public delegate void scannerPushCallback (int oldCtx, int newCtx);
public delegate void scannerPopCallback (int oldCtx, int newCtx);
public delegate void scannerTokenCallback (int ctx, int token, string text);
Their meaning is described in the discussion of the events fired by the lexical analyzer.
scannerEnterCallback
Parameters |
this delegate has no parameters |
Return Value |
-
a positive integer value representing a valid terminal symbol code, one defined in the generated class associated with the declaration part of
the Anglr file. This code is expected to be returned to the calling environment without scanning the input text. It is a way
to "insert" a terminal symbol into the input text.
-
a negative or zero value, indicating that the result of the delegate function should be ignored.
|
Description |
This callback is intended for scenarios where we want to insert a terminal symbol into the input. Only the terminal symbol code is inserted;
no text is actually added to the input text.
|
scannerLeaveCallback
Parameters |
-
int token: an integer representing the terminal symbol code returned to the syntax analyzer
|
Return Value |
this delegate has no return value |
Description |
This delegate is intended for scenarios where we want to detect the terminal symbol code retrieved by the lexical
analyzer and sent to the syntax analyzer.
|
scannerPushCallback
Parameters |
-
int oldCtx: an integer representing the currently active scanner. Its reference is on the top of the scanner stack and will
be covered by the reference of the scanner being pushed onto the stack. The value of oldCtx is the index of a scanner reference in
the table of scanner references. Scanner index values are defined at the end of the class which implements the lexical analyzer.
-
int newCtx: an integer representing the scanner which will become active. It will be pushed onto the top of the scanner stack. It is an
index into the table of scanner references.
|
Return Value |
this delegate has no return value |
Description |
This delegate is intended for scenarios where we want to detect which scanner covers the previous one.
|
scannerPopCallback
Parameters |
-
int oldCtx: an integer representing the currently active scanner. Its reference will be removed from the top of the scanner stack, uncovering
the reference of the scanner lying immediately below it. The value of oldCtx is the index of a scanner reference in the table of
scanner references. Scanner index values are defined at the end of the class which implements the lexical analyzer.
-
int newCtx: an integer representing the scanner which will become active. It will be uncovered by the pop action. The value of newCtx is
an index into the table of scanner references.
|
Return Value |
this delegate has no return value |
Description |
This delegate is intended for scenarios where we want to detect which scanner is uncovered when the current one is popped from
the scanner stack.
|
scannerTokenCallback
Parameters |
-
int ctx: an integer representing the index of the scanner reference within the table of scanner references
-
int token: the terminal symbol code of the terminal retrieved by the lexical analyzer
-
string text: the text of the terminal retrieved by the lexical analyzer.
|
Return Value |
this delegate has no return value |
Description |
This delegate is intended for scenarios where we want to detect every token retrieved by the lexical analyzer. The terminal
need not be returned to the syntax analyzer; it may be skipped. The delegate can therefore be used to observe every terminal symbol, whether it is skipped
or not.
|
Events
Immediately after the delegate definitions come the definitions of the events fired by the lexical analyzer. Every event is associated with one of the
delegates discussed above. They are defined as follows:
public event scannerEnterCallback scannerEnterEvent;
public event scannerLeaveCallback scannerLeaveEvent;
public event scannerPushCallback scannerPushEvent;
public event scannerPopCallback scannerPopEvent;
public event scannerTokenCallback scannerTokenEvent;
scannerEnterEvent
Event Use |
This event is intended for more advanced scenarios where the syntax of some structured text contains terminal symbols which have no
textual representation and must be generated by the application. Using these symbols, we artificially direct the course of syntax analysis
of the structured text.
|
Event Source |
The event is fired at the beginning of the scanning phase of the lexical analyzer. By firing this event, the lexical analyzer gives a
subscriber the ability to influence the scanning of the input text. The flow of the scanning process is influenced by the
return value of the event handler:
-
positive values terminate the scanning process. These values are supposed to be valid terminal symbol codes which are synchronized with
the syntax analyzer: they must be valid codes in the current state of the syntax analyzer's stack automaton. Invalid terminal symbol codes, and those
not synchronized with the syntax analyzer, cause syntax errors.
-
other values, less than or equal to zero, do not affect the flow of the scanning process.
|
Event Subscriber |
The subscriber to this event is usually an application that analyzes some structured text. For the analysis it uses a parser generated
by the Anglr compiler. A typical example is the Anglr compiler itself. Every text analyzed by the Anglr compiler is in fact a fragment of an Anglr file. The parser
of the Anglr compiler works like this:
-
First it reads the artificial terminal symbol which represents the type of fragment to be analyzed. This terminal symbol has no textual
representation and is inserted by an event handler of scannerEnterEvent
-
After that the input text, which must represent the contents of the fragment, is analyzed. For example, if the artificial terminal symbol
inserted by the event handler of scannerEnterEvent represents a single production of a syntax rule, then the input text must contain a single
production of an arbitrary syntax rule. If the inserted terminal symbol represents a cardinality operator, the input text must
contain an arbitrary cardinality operator. The whole Anglr file is also a fragment and has an artificial symbol representing it, namely the
<anglr file terminal> terminal symbol.
Any application which analyzes structured text can act in the same manner. This approach suits more advanced scenarios where,
for example, we insert parts of structured text, or translate source code that needs to be generated first.
|
Event Parameters |
this event has no parameters |
Event Return Value |
an integer representing the terminal symbol code being inserted.
|
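The one-time insertion of an artificial terminal symbol described above can be sketched as follows. This is only an illustration under stated assumptions: StubLexer is a hypothetical stand-in that models just the enter hook of scan(), FragmentMarker and the code 500 are invented for the example, and returning 0 at the end of input is an assumption of the sketch, not documented behavior of LexerBase.

```csharp
using System;

// Hypothetical stand-in for LexerBase that models only the enter hook.
sealed class StubLexer
{
    public delegate int scannerEnterCallback();
    public event scannerEnterCallback scannerEnterEvent;

    private readonly int[] tokens;   // pretend results of scanning the input text
    private int next;

    public StubLexer(params int[] tokens) { this.tokens = tokens; }

    // Simplified scan(): a positive handler result is returned instead of scanning.
    // Returning 0 at end of input is an assumption of this sketch.
    public int scan()
    {
        int inserted = scannerEnterEvent?.Invoke() ?? 0;
        if (inserted > 0) return inserted;
        return next < tokens.Length ? tokens[next++] : 0;
    }
}

// Inserts one artificial terminal symbol before the first scanned token.
sealed class FragmentMarker
{
    private readonly int markerToken;
    private bool pending = true;

    public FragmentMarker(int markerToken) { this.markerToken = markerToken; }

    public int OnEnter()
    {
        if (!pending) return 0;   // zero or negative: the result is ignored
        pending = false;
        return markerToken;       // positive: returned instead of scanning
    }
}

static class Demo
{
    static void Main()
    {
        const int ProductionFragment = 500;   // hypothetical terminal symbol code
        var lexer = new StubLexer(10, 11);
        lexer.scannerEnterEvent += new FragmentMarker(ProductionFragment).OnEnter;

        Console.WriteLine(lexer.scan()); // prints 500
        Console.WriteLine(lexer.scan()); // prints 10
        Console.WriteLine(lexer.scan()); // prints 11
    }
}
```

The handler keeps a flag so that the marker is produced exactly once; every later call returns zero and lets scanning proceed normally.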
scannerLeaveEvent
Event Use |
This event is intended for advanced scenarios where we must track the flow of terminal symbols returned by the lexical analyzer to the
syntax analyzer. The handler of this event is in effect a "man in the middle". It can see all terminal symbols, together with their
textual representations, returned by the lexical analyzer. It cannot see the symbols that are skipped or those inserted by scannerEnterEvent
event handlers. There is no general guidance on how to use this event. A good example would be the detection of a specific
sequence of terminal symbols.
|
Event Source |
The event is fired immediately before the method scan() completes.
|
Event Subscriber |
The subscriber to this event is usually an application that analyzes some structured text and is interested in the flow of terminal symbols.
An example is a compiler of SNMP macros. This compiler is not interested in the definitions of SNMP macros, since they are all built into it,
so it wants to skip them. The definition of every SNMP macro begins with the text sequence MACRO ::= BEGIN and ends with END.
The idea of how to skip the definition of an SNMP macro is simple:
-
find the sequence of consecutive terminal symbols representing the pieces of text 'MACRO', '::=' and 'BEGIN'. It is important that these
pieces of text follow one another in this order without intermediate terminal symbols between them, except those
which do not alter the syntax of the text, such as comments and space characters.
-
after we find this sequence of terminal symbols, we can skip all terminal symbols until we find the one
associated with the text 'END'
The first part of this algorithm can be implemented with an event handler subscribed to scannerLeaveEvent.
|
Event Parameters |
The event has the following parameter:
-
int token - an integer representing the terminal symbol code returned from the lexical analyzer to the syntax analyzer.
|
Event Return Value |
The event has no return value
|
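The first step of the SNMP macro example, detecting the sequence MACRO ::= BEGIN, could be written as a small state machine driven by a scannerLeaveEvent handler. The sketch below is self-contained and makes several assumptions: the delegate is redeclared at top level only so that the example compiles on its own (in LexerBase it is a member of the class), the terminal symbol codes are invented, and the handler is called directly instead of through a real lexer.

```csharp
using System;

// Same signature as the LexerBase delegate; redeclared here so the sketch is self-contained.
delegate void scannerLeaveCallback(int token);

// Detects the terminal sequence MACRO ::= BEGIN in the stream of returned tokens.
sealed class MacroHeaderDetector
{
    // Hypothetical terminal symbol codes; real ones come from the generated declaration class.
    public const int MACRO = 1, ASSIGN = 2, BEGIN = 3;

    private int matched;   // how many symbols of the header have been seen in a row
    public int HeadersSeen { get; private set; }

    public void OnLeave(int token)
    {
        int expected = matched == 0 ? MACRO : matched == 1 ? ASSIGN : BEGIN;
        if (token == expected)
        {
            matched++;
            if (matched == 3) { HeadersSeen++; matched = 0; }
        }
        else
        {
            // Restart; a MACRO token may itself begin a new header.
            matched = token == MACRO ? 1 : 0;
        }
    }
}

static class Demo
{
    static void Main()
    {
        var detector = new MacroHeaderDetector();
        scannerLeaveCallback handler = detector.OnLeave;

        // Skipped symbols such as comments and spaces never reach this handler.
        foreach (int token in new[] { 9, 1, 2, 3, 9, 1, 9, 1, 2, 3 })
            handler(token);

        Console.WriteLine(detector.HeadersSeen); // prints 2
    }
}
```

Since skipped symbols are not reported through scannerLeaveEvent, the detector does not need special handling for comments or white space between the three symbols.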
scannerPushEvent
Event Use |
This event is intended for scenarios where we want to detect the activation of a specific scanner so that we can, for example, initialize
some objects.
|
Event Source |
This event is fired by the method pushScanner(). This method can be invoked from anywhere, typically by a scanner performing a push action, but
it can also be invoked by the application itself if it has a reference to the object implementing the lexical analyzer.
|
Event Subscriber |
Any application that has a reference to the object implementing the syntax analyzer can subscribe to this event. It is typically used to perform some
scanner-level initialization, for example at the beginning of a comment. The event is fired before the new scanner reference is actually pushed
onto the stack of scanner references.
|
Event Parameters |
The event has the following parameters:
-
int oldCtx - an integer representing the index of the current scanner reference at the top of the scanner reference stack.
-
int newCtx - an integer representing the index of the scanner reference that will be pushed onto the top of the scanner reference stack.
|
Event Return Value |
event has no return value
|
scannerPopEvent
Event Use |
This event is intended for scenarios where we want to detect the deactivation of a specific scanner and act accordingly.
It is the opposite of scannerPushEvent and is intended to finalize whatever was initiated when scannerPushEvent was fired.
|
Event Source |
The event is fired by the method popScanner(). Like pushScanner(), this method can also be called from anywhere in an application which has a
reference to the object implementing the lexical analyzer. The event is fired after the current scanner reference is popped from the stack of scanner
references.
|
Event Subscriber |
Any application that holds a reference to the object implementing the syntax analyzer can subscribe to this event. It is typically used to perform some
scanner-level finalization, for example at the end of a comment. The event is fired after the old scanner reference is actually popped
from the stack of scanner references.
|
Event Parameters |
The event has the following parameters:
-
int oldCtx - an integer representing the index of the scanner reference being popped from the top of the stack of scanner references.
-
int newCtx - an integer representing the index of the scanner reference at the top of the scanner reference stack, uncovered by the
popScanner() method.
|
Event Return Value |
event has no return value
|
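A paired use of scannerPushEvent and scannerPopEvent, for example to bracket the processing of a comment, might look like the sketch below. StubLexer is a hypothetical stand-in that models only the scanner stack and the two events with the ordering described above (push event before the push, pop event after the pop); seeding its stack in the constructor is a simplification of the sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LexerBase that models only the scanner stack and its two events.
sealed class StubLexer
{
    public delegate void scannerPushCallback(int oldCtx, int newCtx);
    public delegate void scannerPopCallback(int oldCtx, int newCtx);
    public event scannerPushCallback scannerPushEvent;
    public event scannerPopCallback scannerPopEvent;

    private readonly Stack<int> stack = new Stack<int>();

    // Seeding the stack directly is a simplification of this sketch.
    public StubLexer(int initialCtx) { stack.Push(initialCtx); }

    public void pushScanner(int newCtx)
    {
        scannerPushEvent?.Invoke(stack.Peek(), newCtx); // fired before the push
        stack.Push(newCtx);
    }

    public void popScanner()
    {
        int oldCtx = stack.Pop();
        scannerPopEvent?.Invoke(oldCtx, stack.Peek());  // fired after the pop
    }
}

static class Demo
{
    // Scanner codes as defined at the end of the generated AnglrLexer class.
    const int comment_ctx = 0, anglrScanner = 8;

    static void Main()
    {
        var log = new List<string>();
        var lexer = new StubLexer(anglrScanner);

        lexer.scannerPushEvent += (oldCtx, newCtx) =>
        {
            if (newCtx == comment_ctx) log.Add("comment started");
        };
        lexer.scannerPopEvent += (oldCtx, newCtx) =>
        {
            if (oldCtx == comment_ctx) log.Add("comment finished");
        };

        lexer.pushScanner(comment_ctx);
        lexer.popScanner();

        Console.WriteLine(string.Join(", ", log)); // prints "comment started, comment finished"
    }
}
```

The push handler inspects newCtx and the pop handler inspects oldCtx, so both react to the same scanner: the one being activated and later deactivated.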
scannerTokenEvent
Event Use |
The event is intended for advanced scenarios where we want to detect every piece of text matched by any scanner of the lexical analyzer.
Among them are also the terminal symbols returned to the syntax analyzer. This event is not fired for terminal symbols inserted with scannerEnterEvent.
This event is particularly suitable for recovering those parts of the text that are not relevant to the syntax of the text.
|
Event Source |
The event is fired by the method scan() immediately after text has been matched by the currently active scanner. The terminal symbol, if any, associated
with this piece of text need not be returned from the method scan(). This is always the case if the regular expression associated with the matched
text invokes the skip action.
|
Event Subscriber |
The subscriber to this event is usually an application which is interested in collecting text that is not visible to the syntax analyzer.
|
Event Parameters |
The event has the following parameters:
-
int ctx - an integer representing the index of the active scanner object reference
-
int token - one of the following:
-
an integer representing the terminal symbol code, if the matched text will be returned
-
-1, if the matched text is skipped
-
string text - the matched piece of text
|
Event Return Value |
event has no return value
|
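Collecting the text that is skipped by the scanners can be sketched with a scannerTokenEvent handler that keeps only matches whose token value is -1. The example is self-contained and rests on assumptions: the delegate is redeclared at top level so that the sketch compiles on its own, SkippedTextCollector is an invented helper, and the sample matches are made up.

```csharp
using System;
using System.Collections.Generic;

// Same signature as the LexerBase delegate; redeclared here so the sketch is self-contained.
delegate void scannerTokenCallback(int ctx, int token, string text);

// Collects text that the scanners matched but did not return to the syntax analyzer.
sealed class SkippedTextCollector
{
    public List<string> Skipped { get; } = new List<string>();

    public void OnToken(int ctx, int token, string text)
    {
        if (token == -1) Skipped.Add(text);   // -1 marks skipped text
    }
}

static class Demo
{
    static void Main()
    {
        var collector = new SkippedTextCollector();
        scannerTokenCallback handler = collector.OnToken;

        // Hypothetical matches: scanner index, terminal symbol code, matched text.
        handler(8, 42, "%lexer");
        handler(8, -1, " ");
        handler(0, -1, "/* note */");

        Console.WriteLine(collector.Skipped.Count); // prints 2
    }
}
```

With a real lexer the handler would be attached with the += operator to scannerTokenEvent instead of being called directly.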
Fields
Fields contain internal state of the LexerBase object. It has these fields:
Properties
Methods
scan
pushScanner
popScanner
pushInput
popInput