Java Lexical Structure

Lexical analysis

Lexical analysis is the process of translating a raw Unicode character stream into a sequence of tokens. The tokens are the terminal symbols of the syntactic grammar. A program that performs lexical analysis may be called a lexer, tokenizer, or scanner, though "scanner" is also used for the first stage of a lexer. In detail, the process has three steps, applied in order:

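  1. Translate every Unicode escape (\uXXXX, where XXXX is four hexadecimal digits) into the corresponding Unicode character; for example, \u000A becomes the line feed character (0x0A). Note that \n is not a Unicode escape: it is an escape sequence that is only meaningful inside character and string literals.
  2. Recognize line terminators, separating the stream from step 1 into input characters and line terminators. This step lets the compiler keep track of source line numbers, so that an error message can point to the corresponding line when you debug your program.
  3. Split the result of step 2 into white space (including line terminators), comments, and tokens. White space and comments are discarded; only the tokens are kept.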
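
Step 1 can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions, not how javac implements it: the class name UnicodeEscapes and the method translate are made up for this example, and a malformed escape is reported with an exception rather than a compiler diagnostic. A backslash starts an escape only when it is preceded by an even number of backslashes, and it may be followed by more than one 'u'.

```java
public final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /**
     * Replaces each Unicode escape (a backslash, one or more 'u' characters,
     * and exactly four hex digits) with the character it denotes.
     * Hypothetical helper written for this example only.
     */
    public static String translate(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int backslashes = 0; // contiguous raw backslashes seen just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean eligible = c == '\\' && backslashes % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // any number of 'u' characters is allowed
                }
                if (j + 4 > raw.length() || !isFourHexDigits(raw, j)) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                out.append((char) Integer.parseInt(raw.substring(j, j + 4), 16));
                i = j + 4;
                backslashes = 0; // the produced character takes no part in further escapes
                continue;
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean isFourHexDigits(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, calling translate on the raw text `int \u0061 = 1;` yields `int a = 1;`, while a doubled backslash before the 'u' leaves the text unchanged.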


Tokens are a central concept in a compiler. Java has five kinds of tokens:

  • Identifier
  • Keyword
  • Literal
  • Separator
  • Operator
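
To make the five kinds concrete, here is a toy tokenizer that performs step 3 on a tiny subset of Java. It is a rough sketch, not a real Java lexer: ToyLexer, Kind, and Token are names invented for this example, the keyword, separator, and operator sets are deliberately incomplete, only decimal integer literals and the literal words true, false, and null are recognized, and only end-of-line comments are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public final class ToyLexer {
    public enum Kind { IDENTIFIER, KEYWORD, LITERAL, SEPARATOR, OPERATOR }

    public static final class Token {
        public final Kind kind;
        public final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    // Deliberately tiny subsets (assumption for this sketch); the real lists are longer.
    private static final Set<String> KEYWORDS = Set.of("int", "if", "else", "return", "while", "class");
    private static final String SEPARATORS = "(){}[];,.";
    private static final String OPERATORS = "=+-*/<>!";

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { // white space is discarded
                i++;
                continue;
            }
            if (c == '/' && i + 1 < input.length() && input.charAt(i + 1) == '/') {
                // end-of-line comment: discard everything up to the line terminator
                while (i < input.length() && input.charAt(i) != '\n' && input.charAt(i) != '\r') i++;
                continue;
            }
            int start = i;
            if (Character.isJavaIdentifierStart(c)) {
                while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) i++;
                String word = input.substring(start, i);
                boolean literalWord = word.equals("true") || word.equals("false") || word.equals("null");
                Kind kind = literalWord ? Kind.LITERAL
                        : KEYWORDS.contains(word) ? Kind.KEYWORD : Kind.IDENTIFIER;
                tokens.add(new Token(kind, word));
            } else if (c >= '0' && c <= '9') {
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') i++;
                tokens.add(new Token(Kind.LITERAL, input.substring(start, i)));
            } else if (SEPARATORS.indexOf(c) >= 0) {
                i++;
                tokens.add(new Token(Kind.SEPARATOR, input.substring(start, i)));
            } else if (OPERATORS.indexOf(c) >= 0) {
                i++;
                // take the longer match for two-character operators such as == or <=
                if (i < input.length() && input.charAt(i) == '=') i++;
                tokens.add(new Token(Kind.OPERATOR, input.substring(start, i)));
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }
}
```

With this sketch, tokenizing `int x = 42; // answer` gives a keyword (int), an identifier (x), an operator (=), a literal (42), and a separator (;); the spaces and the comment leave no trace in the result.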

Tokens are nonterminal symbols of the lexical grammar, whose terminal symbols are characters, but they are the terminal symbols of the syntactic grammar. A parser, which analyzes the syntax of the programming language, takes the token stream as input and produces an abstract syntax tree (AST) as output.
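
The token-stream-in, AST-out relationship can be illustrated with a very small recursive-descent parser. This is a sketch for a made-up arithmetic grammar, not the Java syntactic grammar: ToyParser and Node are hypothetical names, and the token stream is assumed to be a list of strings that some lexer has already produced.

```java
import java.util.List;

public final class ToyParser {
    /** AST node: a number leaf (left and right are null) or a binary operation. */
    public static final class Node {
        final String value;
        final Node left;
        final Node right;
        Node(String value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return left == null ? value : "(" + left + " " + value + " " + right + ")";
        }
    }

    private final List<String> tokens; // tokens are the terminal symbols the parser sees
    private int pos = 0;

    private ToyParser(List<String> tokens) { this.tokens = tokens; }

    /** Parses the whole token stream and returns the root of the AST. */
    public static Node parse(List<String> tokens) {
        ToyParser p = new ToyParser(tokens);
        Node root = p.expr();
        if (p.pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + tokens.get(p.pos));
        }
        return root;
    }

    // expr := term (('+' | '-') term)*
    private Node expr() {
        Node left = term();
        while (peek("+") || peek("-")) {
            String op = tokens.get(pos++);
            left = new Node(op, left, term());
        }
        return left;
    }

    // term := number (('*' | '/') number)*
    private Node term() {
        Node left = number();
        while (peek("*") || peek("/")) {
            String op = tokens.get(pos++);
            left = new Node(op, left, number());
        }
        return left;
    }

    // number := one or more decimal digits
    private Node number() {
        if (pos >= tokens.size() || !tokens.get(pos).matches("[0-9]+")) {
            throw new IllegalArgumentException("expected a number at token index " + pos);
        }
        return new Node(tokens.get(pos++), null, null);
    }

    private boolean peek(String s) {
        return pos < tokens.size() && tokens.get(pos).equals(s);
    }
}
```

Given the tokens 1, +, 2, *, 3, the resulting tree prints as (1 + (2 * 3)): the parser never looks at individual characters, only at whole tokens, which is what it means for tokens to be the terminal symbols of the syntactic grammar.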




