Lexical Scanner Source Code
Introduction
Typically, multiple scanners are defined in an Anglr file. Each scanner specializes in reading a specific part of the source file. The most common example is a scanner that extracts comments from the source file: it is activated at the beginning of each comment and automatically deactivated once the comment has been read. Scanners cannot be used directly; they can only be used through lexical analyzers. Each lexical analyzer uses at least one scanner, but usually more. Due to implementation specifics, each scanner is implemented together with the lexical analyzer that uses it. When generating source code for a scanner, the Anglr compiler uses the information in the Declarations and CompilationInfo attributes. Because a scanner always depends on a lexical analyzer, the name of its implementing class depends on the name of the class that implements the lexical analyzer.
Declarations Attribute
This attribute identifies the declaration part of the Anglr file that defines the terminal symbols and regular expressions used by the scanner. A scanner can use only one declaration part. The attribute is not mandatory, because some scanners do not need to know any terminal symbols or regular expressions. A typical example is a comment scanner, which only needs to know which combination of characters signals the end of a comment.
The following setting can be specified in the Declarations attribute:
Name | Value | Description |
---|---|---|
Id | name of declaration part | The name must be one of the names appearing in the %declaration statements of the Anglr file. All terminal symbol names and regular expression values used by the scanner are taken from that declaration part. |
CompilationInfo Attribute
The CompilationInfo attribute is mandatory for the scanner part of an Anglr file. It specifies the name of the scanner implementation class, its namespace and its access mode.
The following settings can be specified in the CompilationInfo attribute preceding the scanner part of an Anglr file:
Name | Value | Description |
---|---|---|
ClassName | name of generated class | The name of the class that implements a specific scanner is composed of two parts separated by an underscore character: the first is the name of the class that implements the lexical analyzer using this scanner, and the second is the value of the ClassName setting. If this setting is not found in the CompilationInfo attribute, the name of the scanner part is used instead. |
NameSpace | namespace of generated class | The Anglr compiler uses this name as the namespace containing the generated class. If this setting is not found in the CompilationInfo attribute, the NameSpace setting from the CompilationInfo attribute of the general part of the Anglr file is used instead, if there is one. Otherwise the name of the Anglr file is used as the namespace name. |
Access | access of generated class | Access can have one of three values: internal, public or private. The values internal and public are preferred, since private access makes the generated class inaccessible. |
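The fallback rules for ClassName and NameSpace described in the table can be restated as a short C# sketch. The class ScannerNaming and its method names are hypothetical and are not part of Anglr; the sketch only mirrors the rules as described above, and it assumes that an absent setting is passed as an empty string and that the Anglr file name is used exactly as given.

```csharp
using System;

// Hypothetical helper, not part of Anglr: it only restates the naming
// rules from the CompilationInfo table above.
public static class ScannerNaming
{
    // lexerClassName:  class implementing the lexical analyzer that uses the scanner
    // className:       value of the ClassName setting, empty when the setting is absent
    // scannerPartName: name of the scanner part (the name following %scanner)
    public static string ClassName(string lexerClassName, string className, string scannerPartName)
    {
        string suffix = string.IsNullOrEmpty(className) ? scannerPartName : className;
        return lexerClassName + "_" + suffix;
    }

    // scannerNameSpace: NameSpace setting of the scanner part, empty when absent
    // generalNameSpace: NameSpace setting of the general part, empty when absent
    // anglrFileName:    name of the Anglr file (assumption: used as given)
    public static string NameSpace(string scannerNameSpace, string generalNameSpace, string anglrFileName)
    {
        if (!string.IsNullOrEmpty(scannerNameSpace)) return scannerNameSpace;
        if (!string.IsNullOrEmpty(generalNameSpace)) return generalNameSpace;
        return anglrFileName;
    }
}
```

With the values used in the example later in this document, ScannerNaming.ClassName ("CsharpLexer", "CommentRegex", "commentScanner") yields CsharpLexer_CommentRegex.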
Generated Code
The structure of the generated class that implements a specific scanner is relatively simple and has the following characteristics:
- it implements the interface RegexInterface, which requires the implementation of the method match()
- it is a subclass of System.Text.RegularExpressions.Regex, which means that the scanner is in fact a kind of regular expression object
- it has a public constructor, which receives a reference to the lexical analyzer that constructed it and initializes the base regular expression
- it has a Scanner property, which references the lexical analyzer passed to the constructor of the scanner object
- it has a text property, which contains the last piece of text matched by the method match()
- it defines a delegate (callback function prototype) for the events triggered by event actions (if there are any) introduced in the scanner part of the Anglr file
- finally, it defines one event, typed with the delegate mentioned above, for every event action introduced in the scanner part of the Anglr file
Let's take a look at an example:
[ CompilationInfo ClassName='CommentRegex' NameSpace='Csharp.RegexLib' Access='public' Hover='true' ]
%scanner commentScanner
%{
\*+/ pop
{delimited-comment-section} skip
{new-line} skip
%}
internal class CsharpLexer_CommentRegex : Regex, RegexInterface
{
    public CsharpLexer_CommentRegex (CsharpLexer scanner) :
        base (@"(?<g1>^\*+/)|(?<g2>^\/|(\*+)?([^\/\*]))|(?<g3>^\u000D|\u000A|\u000D\u000A|\u0085|\u2028|\u2029)", RegexOptions.ExplicitCapture)
    {
        Scanner = scanner;
    }
    public CsharpLexer Scanner { get; private set; }
    public delegate int scannerCallback (CsharpLexer_CommentRegex regex, CsharpLexer scanner);
    public string text { get; private set; }
    public (int, int) match (string currentLine)
    {
        int matchIndex = 0;
        int matchLength = 0;
        try
        {
            text = "";
            Match match = Match (currentLine);
            if (!match.Success)
                return (-1, 0);
            int index = 0;
            foreach (Group group in match.Groups)
            {
                if (index++ == 0)
                    continue;
                if (!group.Success)
                    continue;
                try
                {
                    matchLength = match.Value.Length;
                    matchIndex = index - 1;
                    text = currentLine.Substring (0, matchLength);
                }
                catch (Exception)
                {
                    continue;
                }
                break;
            }
        }
        catch (Exception e)
        {
            return (-2, 0);
        }
        int? result = null;
        switch (matchIndex)
        {
        case 1:
            Scanner.popScanner ();
            result = 0;
            break;
        case 2:
            result = 0;
            break;
        case 3:
            result = 0;
            break;
        }
        return (result != null) ? (result.Value, matchLength) : (0, matchLength);
    }
}
- The class implementing the scanner is named CsharpLexer_CommentRegex. The name consists of two parts separated by an underscore character. The first part, CsharpLexer, comes from the class implementing the lexical analyzer that created this scanner. The second part, CommentRegex, comes from the CompilationInfo attribute preceding the scanner part associated with this scanner implementation:
internal class CsharpLexer_CommentRegex : Regex, RegexInterface
- The constructor of the scanner does the following:
  - it receives a reference to the lexical analyzer that created the scanner
  - it initializes the Scanner property with that reference
  - it passes the combination of all regular expressions introduced in the scanner part to the constructor of the base Regex class
public CsharpLexer_CommentRegex (CsharpLexer scanner) :
    base (@"(?<g1>^\*+/)|(?<g2>^\/|(\*+)?([^\/\*]))|(?<g3>^\u000D|\u000A|\u000D\u000A|\u0085|\u2028|\u2029)", RegexOptions.ExplicitCapture)
{
    Scanner = scanner;
}
- Next comes the definition of the lexical analyzer reference, Scanner:
public CsharpLexer Scanner { get; private set; }
- The lexical analyzer reference is followed by the definition of the delegate scannerCallback (a callback function prototype), which is used in the definitions of the events fired by this class. Since this scanner has no events, the delegate is unused here:
public delegate int scannerCallback (CsharpLexer_CommentRegex regex, CsharpLexer scanner);
- Next comes the definition of the property text, which contains the last piece of text matched by the method match():
public string text { get; private set; }
- Finally, the class defines the method match(), which is required by the interface RegexInterface implemented by the scanner. The method extracts the piece of text that matches the regular expression passed to the constructor of the scanner. It also decides which terminal code, if any, is associated with the matched text and which action the calling environment should perform. The body of this method is the same for all scanners except for the switch statement at its end, which reflects the contents of the scanner part. Every action list in the scanner part is associated with one case of this switch statement. The number of cases equals the number of action lists in the scanner part, and the cases appear in the same order as the action lists: the first case is associated with the first action list, the second case with the second action list, and so on. Seeing the action lists and the switch statement side by side makes this easier to follow. The action lists of the scanner and the corresponding switch statement:
\*+/ pop
{delimited-comment-section} skip
{new-line} skip
switch (matchIndex)
{
case 1:
    Scanner.popScanner ();
    result = 0;
    break;
case 2:
    result = 0;
    break;
case 3:
    result = 0;
    break;
}
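In the switch statement above, every skip action is translated to the statement:
result = 0;
The pop action in the first case additionally produces a call to the popScanner() method of the lexical analyzer before that statement:
Scanner.popScanner ();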
Before we conclude the explanation, it is worth noting a detail that is likely to be overlooked. The variable result is defined just before the switch statement in the following way:
int? result = null;
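One consequence of the nullable declaration can be read from the generated code shown in this document: an event invoked with ?.Invoke yields null when no handler is subscribed, and the final return statement then maps null to 0. The following self-contained sketch illustrates this together with the way the first successful named group selects the switch case. The class MiniScannerRegex, its three rules and the event word_Event are hypothetical simplifications and are not code generated by Anglr.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical, simplified stand-in for a generated scanner class.
public class MiniScannerRegex : Regex
{
    public delegate int scannerCallback(MiniScannerRegex regex);
    public event scannerCallback word_Event;    // stands in for an event action

    public string text { get; private set; } = "";

    // Three illustrative rules; every rule is anchored at the start of the line.
    public MiniScannerRegex()
        : base(@"(?<g1>^\*+/)|(?<g2>^[A-Za-z]+)|(?<g3>^[^*A-Za-z]+)", RegexOptions.ExplicitCapture)
    {
    }

    public (int, int) match(string currentLine)
    {
        Match m = Match(currentLine);
        if (!m.Success)
            return (-1, 0);

        // Group 0 is the whole match; the first successful named group
        // tells which rule matched.
        int matchIndex = 0;
        for (int i = 1; i < m.Groups.Count; ++i)
        {
            if (m.Groups[i].Success) { matchIndex = i; break; }
        }
        int matchLength = m.Value.Length;
        text = currentLine.Substring(0, matchLength);

        int? result = null;
        switch (matchIndex)
        {
            case 1: result = 0; break;                          // nothing to report
            case 2: result = word_Event?.Invoke(this); break;   // null when no handler is subscribed
            case 3: result = 0; break;                          // skip
        }
        // A null result is reported as 0, which the caller treats like skip.
        return (result != null) ? (result.Value, matchLength) : (0, matchLength);
    }
}
```

In this sketch, matching a word without a subscribed handler returns code 0 together with the length of the word. Once a handler is subscribed, the value returned by the handler is passed through instead.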
Scanner Events
Scanners can implement one more piece of functionality: events associated with the event actions that may appear in the action lists of the scanner parts of an Anglr file. Let's take a look at another scanner, defined in the same Anglr file as the previous example:
[ Declarations Id='csharpDecls' Hover='true' ]
[ CompilationInfo ClassName='ScannerRegex' NameSpace='Csharp.RegexLib' Access='public' Hover='true' ]
%scanner csharpScanner
%{
\/\/.* skip
\/\* push commentScanner
{integer-literal} terminal integer-literal
{real-literal} terminal real-literal
{character-literal} terminal character-literal
{string-literal} terminal string-literal
{identifier} event identifier
{cs-ops} event cs-ops
[ \t] skip
[\n\f\r\v] skip
. skip
%}
Let's look at the source code generated by the Anglr compiler for the above scanner:
internal class CsharpLexer_ScannerRegex : Regex, RegexInterface { public CsharpLexer_ScannerRegex (CsharpLexer scanner) : base (@"(?<g1>^\/\/.*)|(?<g2>^\/\*)|(?<g3>^((([0-9])+)(U|u|L|l|UL|Ul|uL|ul|LU|Lu|lU|lu)?)|(0(x|X)(([0-9A-Fa-f])+)(U|u|L|l|UL|Ul|uL|ul|LU|Lu|lU|lu)?))|(?<g4>^(([0-9])+)\.(([0-9])+)((e|E)(\+|\-)?(([0-9])+))?(F|f|D|d|M|m)?|\.(([0-9])+)((e|E)(\+|\-)?(([0-9])+))?(F|f|D|d|M|m)?|(([0-9])+)((e|E)(\+|\-)?(([0-9])+))(F|f|D|d|M|m)?|(([0-9])+)(F|f|D|d|M|m))|(?<g5>^'(([^'\\\u000D\u000A\u0085\u2028\u2029])|(\'|\""|\\|\0|\a|\b|\f|\n|\r|\t|\v)|(\\x([0-9A-Fa-f]){1,4})|(\\u([0-9A-Fa-f]){4}|\\u([0-9A-Fa-f]){8}))')|(?<g6>^(""((([^""\\\u000D\u000A\u0085\u2028\u2029])|(\'|\""|\\|\0|\a|\b|\f|\n|\r|\t|\v)|(\\x([0-9A-Fa-f]){1,4})|(\\u([0-9A-Fa-f]){4}|\\u([0-9A-Fa-f]){8}))+)?"")|(@""((([^""])|(""""))+)?""))|(?<g7>^(((\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl})|(_|\u005F))(((\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl})|(\p{Nd})|(\p{Pc})|(\p{Mn}|\p{Mc})|(\p{Cf}))+)?)|\@(((\p{ Lu}|\p{ Ll}|\p{ Lt}|\p{ Lm}|\p{ Lo}|\p{ Nl})| (_ | \u005F))(((\p{ Lu}|\p{ Ll}|\p{ Lt}|\p{ Lm}|\p{ Lo}|\p{ Nl})| (\p{ Nd})| (\p{ Pc})| (\p{ Mn}|\p{ Mc})| (\p{ Cf}))+)?))|(?<g8>^(\<\<\ =|\?\?|\:\:|\+\+|\-\-|\&\&|\|\||\-\>|\=\=|\!\=|\<\ =|\>\=|\+\=|\-\=|\*\=|\/\=|\%\=|\&\=|\|\=|\^\=|\<\<|\ =\>|\{|\}|\[|\]|\(|\)|\.|\,|\:|\;|\+|\-|\*|\/|\%|\&|\||\^|\!|\~|\=|\<|\>|\?))|(?<g9>^[ \t])|(?<g10>^[\n\f\r\v])|(?<g11>^.)", RegexOptions.ExplicitCapture) { Scanner = scanner; } public CsharpLexer Scanner { get; private set; } public delegate int scannerCallback (CsharpLexer_ScannerRegex regex, CsharpLexer scanner); public event scannerCallback identifier_Event; public event scannerCallback cs_ops_Event; public string text { get; private set; } public (int, int) match (string currentLine) { int matchIndex = 0; int matchLength = 0; try { text = ""; Match match = Match (currentLine); if (!match.Success) return (-1, 0); int index = 0; foreach (Group group in match.Groups) { if (index++ == 0) continue; if (!group.Success) 
continue; try { matchLength = match.Value.Length; matchIndex = index - 1; text = currentLine.Substring (0, matchLength); } catch (Exception) { continue; } break; } } catch (Exception e) { return (-2, 0); } int? result = null; switch (matchIndex) { case 1: result = 0; break; case 2: Scanner.pushScanner (CsharpLexer.commentScanner); result = 0; break; case 3: result = CsharpDeclarations.tokens.token_integer_literal; break; case 4: result = CsharpDeclarations.tokens.token_real_literal; break; case 5: result = CsharpDeclarations.tokens.token_character_literal; break; case 6: result = CsharpDeclarations.tokens.token_string_literal; break; case 7: result = identifier_Event?.Invoke (this, Scanner); break; case 8: result = cs_ops_Event?.Invoke (this, Scanner); break; case 9: result = 0; break; case 10: result = 0; break; case 11: result = 0; break; } return (result != null) ? (result.Value, matchLength) : (0, matchLength); } }
In addition to the delegate, the generated class defines one event for each event action of the scanner part:
public delegate int scannerCallback (CsharpLexer_ScannerRegex regex, CsharpLexer scanner);
public event scannerCallback identifier_Event;
public event scannerCallback cs_ops_Event;
Event handlers associated with these events might look like this:
private int CsharpScanner_identifier_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
{
    int t = CsharpScannerInternals.findReservedWord (regex.text);
    return (t < 0) ? CsharpDeclarations.tokens.token_identifier : t;
}
private int CsharpScanner_cs_ops_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
{
    int t = CsharpScannerInternals.findOperator (regex.text);
    return (t < 0) ? 0 : t;
}
- Both event handlers work with the text matched by the regular expression, which is available as the text property of the regex object (regex.text). The first one checks whether this text represents a C# reserved word such as if, else or while:
int t = CsharpScannerInternals.findReservedWord (regex.text);
The second one checks whether the text represents a C# operator:
int t = CsharpScannerInternals.findOperator (regex.text);
- The two event handlers treat the result of the check differently.
- If the check succeeds, both return the code returned by the checking function (findReservedWord() or findOperator()), that is, the terminal symbol code associated with the reserved word or the C# operator, respectively.
- If the check fails, the handlers behave differently. The first event handler, CsharpScanner_identifier_Event, returns CsharpDeclarations.tokens.token_identifier, the code of the terminal symbol identifier, as if to say: this is not a reserved word, so it is an identifier.
- The other event handler, CsharpScanner_cs_ops_Event, returns 0 if the check fails, as if to say: this is probably not an operator, so just skip it. This is probably an error, since the skipped text is not there without reason. In addition to returning 0, the event handler should also report the error in some way, but that is beyond the scope of this discussion.
- Scanner events are intended for more advanced scenarios, like the one in the example above, where the kind of text matched by the scanner cannot be decided immediately. Nevertheless, even the simplest scenarios, such as skip actions or single terminal actions, can be handled by scanner events, for example:
private int CsharpScanner_cs_ops_Event (CsharpLexer_ScannerRegex regex, CsharpLexer scanner)
{
    return 0; // skip everything
}
- Returning a value less than or equal to zero from the event handler associated with an event action mimics a skip action.
- Returning a value greater than zero from the event handler associated with an event action mimics a terminal action. Care should be taken that the returned value really represents an existing terminal symbol code. Since scanners are invoked by syntax analyzers, although only indirectly through lexical analyzers, invalid terminal codes should cause syntax errors.
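The remark above that the operator handler should also report an error can be sketched as follows. This is only an illustration under stated assumptions: the class OperatorHandlerSketch, its operator table, the terminal codes 101 and 102, and the Errors list are all hypothetical. To stay self-contained, the handler takes the matched text directly instead of the regex and scanner arguments used by real handlers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of an operator handler that reports unknown text
// instead of silently skipping it.
public static class OperatorHandlerSketch
{
    // Hypothetical operator table; in real code the terminal codes would come
    // from the declarations class generated by the Anglr compiler.
    private static readonly Dictionary<string, int> Operators = new Dictionary<string, int>
    {
        { "+", 101 },
        { "==", 102 }
    };

    // Collected error messages (assumption: any error reporting channel would do).
    public static readonly List<string> Errors = new List<string>();

    // Returns a terminal code greater than zero for a known operator, which
    // mimics a terminal action. For unknown text it records an error and
    // returns 0, which mimics a skip action.
    public static int HandleOperator(string text)
    {
        if (Operators.TryGetValue(text, out int code))
            return code;
        Errors.Add("unrecognized operator: " + text);
        return 0;
    }
}
```

The handler still returns 0 for unknown text, so the scanner continues as with a skip action, but the skipped text is no longer lost without a trace.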