FeaturesPluginsDocs & SupportCommunityPartners

Batch Lexer Requirements

Version: 1.0
Author: Miloslav Metelka, Sun Microsystems/NetBeans
Document History: available in CVS

To use a given batch lexer in the lexer framework the lexer must satisfy certain criteria described here.

Input Recognition

One of the most important uses of the lexer module is syntax coloring.
In order to function reasonably in the syntax coloring setup the batch lexer must be able to work with virtually any combination of Unicode characters present on the input including constructions that do not conform to the the given language's grammar.
When such lexical error is found the batch lexer must be able to recover automatically and return token(s) spanning the errorneous construction (and continue lexing normally).
Lexers for certain languages can work with just one errorneous token identification e.g. ERROR but some languages can benefit from having many error-token types for distinct errorneous constructions.
There can be special token types for so-called incomplete tokens for e.g. unclosed string literals, incomplete block comments etc.
In general more error token types is better because it gives the outside world more information about the errorneous construction.
For example distinguishing incomplete string literal from unclosed block-comment allows each of the two to have a different syntax coloring.
Certain tools e.g. a spellchecker finding incorrectly spelled words in comment sections can work with both complete and incomplete comments in the same way.
It's a good idea to group all errorneous and incomplete tokens into token categories to be able to quickly resolve whether the given token is lexically correct or not.

Once the batch lexer is able to recover from possible lexical errors present on the input it's much easier to work with it. It's no longer necessary to catch any LexicalError-like exceptions and have an extra error-recovery code present on every place at which the batch lexer is being used.
The lexer will always partition any given input into tokens without any exceptions.

The first requirement for batch lexer in order to be used with the lexer module successfully:

 Batch lexer must recognize (i.e. partition into tokens) any combination of input characters.
 No LexicalError-like exceptions can be thrown from the batch lexer.

Input Characters Coverage

Some batch lexers tend to eliminate whitespace and comments (i.e. they do not create any tokens for such character sequences) as these tokens are generally unnecessary for further processing such as parsing.
Such lexers cannot be used in the lexer module framework.

Second requirement:

 Every character must be tokenized.
 It means that every character instance fed by the batch lexer from the input must belong to a token instance produced by the batch lexer.
 There cannot be any character instances skipped by the batch lexer (i.e. not be part of any token produced by the batch lexer).

This requirement is mainly related to the fact that the lexer module helps to create token elements coverage for documents and that the element coverage is expected to be complete i.e. there cannot be any areas of the document that would not be covered by a token element.
Therefore the lexer framework requires the batch lexer to assign each and every character instance from the input to exactly one token instance.

Restartable Batch Lexer

When Incremental algorithm is fixing token list after document modification it needs to restart the batch lexer before the modified area and replace the old tokens by the new ones (based on the new input) up to the point from which the tokens in the original token list continue to be correct again.
The batch lexer must be able to express its internal state (java.lang.Object instance in general) after token recognition (i.e. at token boundary). That state is later used to restart lexing from the given point in the character input stream.
This allows the incremental algorithm (fixing the token list) to restart lexing just before the modified area instead of starting batch lexer from scratch at the begining of the whole document.

For example let's assume a following html text (tokens colored):

B < B >

The first "B" is plain-text token while second "B" is a html-tag-B token.

Let's assume that user puts cursor after second "B" and types "R":

B < BR >

Incremental algorithm finds out that the text after "<" must be re-lexed.
If there would be no notion about batch lexer internal state (i.e. the batch lexer would be started from scratch) the lexer would now recognize "BR" as plain text (as it does at the begining of the text). This is of course incorrect and the lexer must go into some sort of internal state e.g. called "INSIDE_TAG" (can be an integer constant) once it recognizes "<".
That way when token "<" is initially created by batch lexer (at the time when document gets loaded) the lexer module asks the batch lexer to return its internal state after recognition of "<". The lexer responds "INSIDE_TAG" (instead of some sort of default state that leads to recognition of plain text).
Lexer module remembers the state and associates it with "<" token in the token list.
Now when the "R" gets inserted the incremental algorithm restarts the batch lexer right after "<" with the "INSIDE_TAG" state.
The batch lexer is then aware that it is inside tag and can recognize the "BR" as html-tag-BR instead of plain text.

Restartability is very important feature of the batch lexer. Not having it would require the incremental algorithm to start batch lexer to lex from the begining of the document after each document modification (so that the batch lexer goes into "INSIDE_TAG" state after "<" naturally).
This is of course unacceptable from the performance point of view.

Third requirement:

 Lexer must be able to express its internal state necessary to restart lexing from the same point later.

When the batch lexer is returning its internal state after token recognition it must be aware of the fact that the batch lexer instance that will possibly be later asked to restart lexing using the returned state information can be a different instance of the same lexer class.
It's not guaranteed that the same instance of the batch lexer that produced the state will be asked to restart from it.
Therefore the state information must not contain any information related to a particular batch lexer instance.
In general there are three typical kinds of state values:

  • null - it's a valid state value meaning that there is no additional information necessary in order to continue lexing from the given point.
  • An instance of java.lang.Integer. Some batch lexers e.g. those generated by JavaCC hold an internal state which is usually int e.g. XXXTokenManager.curLexState.
    The IntegerCache can be used to cheaply translate small non-negative int values to corresponding java.lang.Integer instances so that the requirement of state being an Object is satisfied.
  • Everything else. This is less usual case although the lexer module allows this too.

Lookahead

Lookahead of a token are extra characters past the end of the token that the lexer needed to read before it was able to recognize the token.
For example a java identifier token has lookahead one because lexer recognizes the identifier once it has read a character that does not belong to the identifier. Without reading that extra character the lexer would not be able to tell that there was really an end of the identifier found.
A picture showing the process of recognition of identifier using LexerInput can be found here.
On the other hand a semicolon has lookahead zero because once the semicolon character is read the lexer immediately knows that it has found a semicolon separator token without reading any further.
LexerInput interface that feeds the lexer with input characters allows to find out current lookahead value.
In an incremental setting the lookahead of each recognized token is determined from LexerInput and associated with the recognized token.
If there is any modification of the document performed later the incremental algorithm uses the lookahead value to find out the first token below the modification point affected by the modification.
It's important to understand that not only the token itself must be below the modification point in order to be sure that the token was not affected by the modification. It's also necessary that all the lookahead characters of the token will be below modification point too.
Modification of lookahead characters of the token could influence batch lexer to recognize a different token or make the token longer or shorter.
For example typing an extra character right at the end of the identifier token could make the preceding identifier one character longer in case the typed character is part of an identifier token according to identifier's lexical description. Although the typed character was not positioned inside the token it was positioned in the token's lookahead and thus the token was affected and had to be relexed.

Unnecessary Lookahead

The incremental algorithm works efficiently if the lookahead characters consumed by the lexer really correspond to what the lexer needs in order to recognize the token. Typically the lookahead is just 0 or 1.
Some lexers however tend to preload characters in batches into their internal cache and then recognize the tokens by working with the cache only. Once the cache becomes empty they preload another batch.
Such lexers will not work efficiently in an incremental setting because by preloading a whole batch of characters they make the incremental algorithm to believe that they needed all those preloaded characters in order to recognize all subsequent tokens in the batch. Once there would be a modification performed somewhere at the end of the batch almost all the tokens in that batch would be relexed.

Fourth requirement:

 To keep the incremental algorithm most efficient the batch lexer should work with minimum necessary lookahed.
 No pre-loading of character batches into internal lexer's cache should be done.

Companion
Projects:
MySQL Database Server   Open JDK: an Open SourceJDK   GlassFish Community: an Open Source Application Server    Mobile & Embedded Community    Open Solaris   java.net - The Source for Java Technology Collaboration   Virtual Box - full virtualizer  Open ESB - The Open Enterprise Service Bus Powered by