Batch Lexer Requirements
Version: 1.0
Author: Miloslav Metelka, Sun Microsystems/NetBeans
Document History:
available in CVS
To use a given batch lexer in the lexer framework
the lexer must satisfy certain criteria described here.
Input Recognition
One of the most important uses of the lexer module is syntax coloring.
In order to function reasonably in the syntax coloring setup
the batch lexer must be able to work with virtually any combination
of Unicode characters present on the input
including constructions that do not conform to the the given language's grammar.
When such lexical error is found the batch lexer must be able to recover automatically
and return token(s) spanning the errorneous construction (and continue lexing normally).
Lexers for certain languages can work with just one errorneous token identification e.g. ERROR
but some languages can benefit from having many error-token types for distinct errorneous constructions.
There can be special token types for so-called incomplete tokens
for e.g. unclosed string literals, incomplete block comments etc.
In general more error token types is better because it gives the outside
world more information about the errorneous construction.
For example distinguishing incomplete string literal from unclosed block-comment
allows each of the two to have a different syntax coloring.
Certain tools e.g. a spellchecker
finding incorrectly spelled words in comment sections can work with both complete
and incomplete comments in the same way.
It's a good idea to group all errorneous and incomplete tokens
into token categories to be able to quickly resolve whether the given token
is lexically correct or not.
Once the batch lexer is able to recover from possible lexical errors present on the input
it's much easier to work with it. It's no longer necessary to catch any LexicalError-like
exceptions and have an extra error-recovery code present on every place
at which the batch lexer is being used.
The lexer will always partition any given input
into tokens without any exceptions.
The first requirement for batch lexer in order to be used
with the lexer module successfully:
Batch lexer must recognize (i.e. partition into tokens)
any combination of input characters.
No LexicalError-like exceptions can be thrown from the batch lexer.
Input Characters Coverage
Some batch lexers tend to eliminate whitespace and comments (i.e. they do not create any
tokens for such character sequences) as these tokens are generally unnecessary
for further processing such as parsing.
Such lexers cannot be used in the lexer module framework.
Second requirement:
Every character must be tokenized.
It means that every character instance fed by the batch lexer from the input
must belong to a token instance produced by the batch lexer.
There cannot be any character instances
skipped by the batch lexer (i.e. not be part of any token
produced by the batch lexer).
This requirement is mainly related to the fact that
the lexer module helps to create token elements coverage for documents and that the element
coverage is expected to be complete i.e. there cannot be any areas
of the document that would not be covered by a token element.
Therefore the lexer framework requires the batch lexer
to assign each and every character instance from the input to exactly one token instance.
Restartable Batch Lexer
When
Incremental algorithm is fixing token list after document modification
it needs to restart the batch lexer before the modified area and replace the old
tokens by the new ones (based on the new input) up to the point from which the tokens
in the original token list continue to be correct again.
The batch lexer must be able to express its internal state
(java.lang.Object instance in general) after token recognition (i.e. at token boundary).
That state is later used to restart lexing from the given point in the character input stream.
This allows the incremental algorithm (fixing the token list) to restart lexing
just before the modified area instead of starting batch lexer
from scratch at the begining of the whole document.
For example let's assume a following html text (tokens colored):
B
<
B
>
The first "B" is plain-text token while second "B" is a html-tag-B token.
Let's assume that user puts cursor after second "B" and types "R":
B
<
BR
>
Incremental algorithm finds out that the text after "<" must be re-lexed.
If there would be no notion about batch lexer internal state
(i.e. the batch lexer would be started from scratch) the lexer would now recognize
"BR" as plain text (as it does at the begining of the text). This is of course
incorrect and the lexer must go into some sort of internal state e.g. called "INSIDE_TAG"
(can be an integer constant) once it recognizes "<".
That way when token "<" is initially created by batch lexer (at the time when document
gets loaded) the lexer module
asks the batch lexer to return its internal state after recognition of "<".
The lexer responds "INSIDE_TAG"
(instead of some sort of default state that leads to recognition of plain text).
Lexer module remembers the state and associates it with "<" token in the token list.
Now when the "R" gets inserted the incremental algorithm restarts
the batch lexer right after "<" with the "INSIDE_TAG" state.
The batch lexer is then aware that it is inside tag and can recognize
the "BR" as html-tag-BR instead of plain text.
Restartability is very important feature of the batch lexer. Not having it
would require the incremental algorithm to start batch lexer to lex from the begining of the document after
each document modification (so that the batch lexer goes into "INSIDE_TAG" state
after "<" naturally).
This is of course unacceptable from the performance point of view.
Third requirement:
Lexer must be able to express its internal state
necessary to restart lexing from the same point later.
When the batch lexer is returning its internal state after token recognition
it must be aware of the fact that the batch lexer instance that will possibly be later asked to restart
lexing using the returned state information can be a different instance of the same
lexer class.
It's not guaranteed that the same instance of the batch lexer
that produced the state will be asked to restart from it.
Therefore the state information must not contain any information
related to a particular batch lexer instance.
In general there are three typical kinds of state values:
null - it's a valid state value meaning that there is
no additional information necessary in order to continue lexing
from the given point.
- An instance of
java.lang.Integer. Some batch lexers e.g. those generated by JavaCC
hold an internal state which is usually int
e.g. XXXTokenManager.curLexState.
The
IntegerCache can be used to cheaply translate small non-negative int
values to corresponding java.lang.Integer instances so that the requirement
of state being an Object is satisfied.
- Everything else. This is less usual case although the lexer module
allows this too.
Lookahead
Lookahead of a token are extra characters past the end of the token that the lexer
needed to read before it was able to recognize the token.
For example a java identifier token has lookahead one because lexer recognizes
the identifier once it has read a character that does not belong to the identifier.
Without reading that extra character the lexer would not be able to tell that
there was really an end of the identifier found.
A picture showing the process of recognition of identifier using LexerInput
can be found here.
On the other hand a semicolon has lookahead zero because once the semicolon
character is read the lexer immediately knows that it has found a semicolon separator token
without reading any further.
LexerInput interface that feeds the lexer with input characters
allows to find out current lookahead value.
In an incremental setting the lookahead of each recognized token
is determined from LexerInput and associated with the recognized token.
If there is any modification of the document performed later
the incremental algorithm uses the lookahead value to find out the first
token below the modification point affected by the modification.
It's important to understand that not only the token itself must be
below the modification point in order to be sure that the token
was not affected by the modification. It's also necessary that all the lookahead
characters of the token will be below modification point too.
Modification of lookahead characters of the token could influence batch lexer
to recognize a different token or make the token longer or shorter.
For example typing an extra character right at the end of the identifier
token could make the preceding identifier one character longer in case the typed
character is part of an identifier token according to identifier's lexical description.
Although the typed character was not positioned inside the token
it was positioned in the token's lookahead and thus the token was affected
and had to be relexed.
Unnecessary Lookahead
The incremental algorithm works efficiently if the lookahead characters consumed
by the lexer really correspond to what the lexer needs in order to recognize
the token. Typically the lookahead is just 0 or 1.
Some lexers however tend to preload characters in batches into their internal cache
and then recognize the tokens by working with the cache only. Once the cache becomes
empty they preload another batch.
Such lexers will not work efficiently in an incremental setting because by preloading
a whole batch of characters they make the incremental algorithm to believe
that they needed all those preloaded characters in order to recognize all subsequent
tokens in the batch. Once there would be a modification performed
somewhere at the end of the batch almost all the tokens in that batch would be relexed.
Fourth requirement:
To keep the incremental algorithm
most efficient the batch lexer should work with minimum necessary lookahed.
No pre-loading of character batches into internal lexer's cache should be done.