Compilers
A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program written in another language (the target language).
That is, it takes a source program as input and outputs a semantically equivalent target program:
Compiler
Source Program -------> Target Program
The source program is generally a human-readable string, while the target program can take several forms:
- Machine code, such as an executable binary obtained by compiling C.
- Intermediate bytecode, such as a .class file for the JVM obtained by compiling Java.
- Strings, such as JavaScript code transformed by Babel.
It is essentially translation. For instance, compiling from a string to machine code involves translating code language that humans can understand into machine language that machines can "understand" (recognize and execute). Users can then interact with the machine via the target program:
Target Program
User Input ---------> Output
P.S. An interpreter is very similar to a compiler, with the difference being that there is no phase for generating a target program:
Interpreter
Source Program & User Input -------> Output
Since it is interpreted and executed at runtime, the execution efficiency of interpreted languages is generally lower than that of compiled languages.
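The contrast can be made concrete with a toy sketch (not from the original text; the names and the one-expression "language" are illustrative): a compiler translates the source once into a reusable target program, while an interpreter re-processes the source on every run.

```python
# Toy contrast between interpreting and compiling a tiny expression
# language of the form "<name> * <integer>". Illustrative only.

source = "rate * 60"

def interpret(source, env):
    # An interpreter parses and evaluates the source on every call.
    name, _, literal = source.split()
    return env[name] * int(literal)

def compile_(source):
    # A compiler translates once, producing a reusable target program
    # (here modeled as a Python function).
    name, _, literal = source.split()
    return lambda env: env[name] * int(literal)

target = compile_(source)              # translation happens once
print(interpret(source, {"rate": 2}))  # 120
print(target({"rate": 2}))             # 120
```

Repeated runs of `target` skip the parsing work entirely, which is one reason compiled execution is generally faster.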
The Compilation Process
It is divided into two parts:
- Analysis: breaks the source program into constituent parts, imposes a grammatical structure on them, and generates an intermediate representation and a symbol table.
- Synthesis: constructs the target program from the intermediate representation and the symbol table.
The processing steps of a typical compiler are as follows:
(Input) Character Stream
|
|- Lexical Analyzer (lexer)
| (Generate) Token Stream
| (Generate) Symbol Table
|- Syntax Analyzer (parser)
| (Generate) Syntax Tree
|- Semantic Analyzer
| (Generate) Annotated Syntax Tree
|- Intermediate Code Generator
| (Generate) Intermediate Representation
|- Machine-Independent Code Optimizer
| (Generate) Intermediate Representation
|- Code Generator
| (Generate) Target Machine Language
|- Machine-Dependent Code Optimizer
v
(Output) Target Machine Language
The primary steps are lexical analysis, syntax analysis, semantic analysis, and code generation, while intermediate code generation and the two optimization steps are optional.
Lexical Analysis
The first step of compilation, aiming to convert a character stream into a sequence of tokens.
Takes raw input, which is a stream of characters, and converts it into a stream of tokens, which are logical units, each representing one or more characters that "belong together."
For example:
// Input
position = initial + rate * 60
// Output
[
(id, 1),
(=),
(id, 2),
(+),
(id, 3),
(*),
(number, 4)
]
The specific process involves scanning character by character, removing comments and whitespace, and splitting the character stream into lexemes:
position, =, initial, +, rate, *, 60
Then, the lexemes are converted into token format:
(token-name, attribute-value)
A token is the first layer of abstraction built on characters. It takes this key-value pair form because most subsequent stages only need to focus on the type (token-name), while detailed value information (attribute-value) is only needed during final code generation. Here, token-name is the abstract symbol required by the syntax analysis stage (e.g., id representing an identifier), and attribute-value is the index of that lexeme in the symbol table:
[
(1, position),
(2, initial),
(3, rate),
(4, 60)
]
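The lexing process described above can be sketched in a few lines of Python (this is an illustrative sketch, not part of the original text; the token spec and table layout are assumptions). Identifiers and numbers are entered into a symbol table and referenced by index, while operators carry no attribute:

```python
import re

# Minimal lexer sketch: scan the input, skip whitespace, and emit
# (token-name, attribute-value) pairs. Identifiers and numbers are
# appended to a symbol table; the attribute is their 1-based index.

TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"[=+\-*/]"),
    ("skip",   r"\s+"),
]

def tokenize(source):
    symbol_table = []   # lexeme list; index serves as attribute-value
    tokens = []
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for match in re.finditer(pattern, source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue                       # drop whitespace
        if kind == "op":
            tokens.append((lexeme,))       # operators have no attribute
        else:
            symbol_table.append(lexeme)
            tokens.append((kind, len(symbol_table)))
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(tokens)  # [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('number', 4)]
print(table)   # ['position', 'initial', 'rate', '60']
```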
Symbol Table
A symbol table is a data structure used by a compiler to store various pieces of information regarding the source program's constructs.
It is generated during the analysis phase and used during the synthesis phase.
In effect, the role of the symbol table is to pass information from the place of declaration to the place of actual use.
Since the lexical analyzer has limited information (lexeme literals), the generated symbol table only contains basic information (the mapping between lexeme literals and identifiers). Later, the syntax analyzer will decide whether to use an existing symbol table entry or create a new one based on semantic information.
Furthermore, there isn't just one global symbol table; rather, there is an independent symbol table for each scope. The purpose is to allow the same identifier to appear repeatedly in different declaration blocks of the program—that is, to prevent variable name conflicts in different scopes. At the same time, to support the most-closely nested rule for statement blocks, these symbol tables need to be linked according to their nesting hierarchy to form a tree structure, supporting language features such as inheritance and closures.
P.S. Keywords are also stored in the symbol table like identifiers, distinguished by return codes during table lookups.
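The per-scope tables chained by nesting hierarchy can be sketched as follows (an illustrative sketch, not the original text; the class and entry format are assumptions). Lookup walks outward from the innermost scope, which implements the most-closely-nested rule:

```python
# Sketch of nested symbol tables: one table per scope, each linked to
# its enclosing scope. Lookup searches the current scope first, then
# walks up the chain, so inner declarations shadow outer ones.

class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent   # enclosing scope, or None for the outermost
        self.entries = {}

    def declare(self, name, info):
        self.entries[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None            # undeclared identifier

globals_ = SymbolTable()
globals_.declare("x", {"type": "int"})

inner = SymbolTable(parent=globals_)
inner.declare("x", {"type": "float"})   # shadows the outer x

print(inner.lookup("x"))     # {'type': 'float'} — the innermost wins
print(globals_.lookup("x"))  # {'type': 'int'}
```

Because several inner scopes can share one parent, the chained tables naturally form the tree structure mentioned above.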
Syntax Analysis
In the syntax analysis phase, token-names are used to create a tree-like intermediate representation to describe the grammatical structure while checking for syntax errors.
Checks that the input is well-formed and builds a parse tree or similar representation of the input.
For example, an Abstract Syntax Tree (AST) is a common intermediate representation that describes the hierarchical grammatical structure of the source program. Each node in the tree represents an operation, and its child nodes represent the operands of that operation:
=
├── (id, 1)
└── +
    ├── (id, 2)
    └── *
        ├── (id, 3)
        └── (number, 4)
The syntax tree is the second layer of abstraction above characters. Concepts like priority and associativity come into play here, where operations must match their operands according to these rules to generate a unique tree structure (requiring the grammar to be unambiguous). This stage confirms whether the input source program format is correct, such as whether parentheses match.
Context-Free Grammar
The grammatical structure of a language is typically defined using a Context-Free Grammar (CFG), for example:
if (expression) statement else statement
The corresponding production is:
stmt -> if (expr) stmt else stmt
Variables like expr and stmt are nonterminals, each representing a set of sequences of terminals (basic lexical elements such as if, (, and )).
A context-free grammar consists of four elements:
- Set of terminals: the basic symbols of the language.
- Set of nonterminals: also called syntactic variables; each represents sequences of terminals.
- Set of productions: each specifies one way a construct can be written.
- Start symbol: a designated nonterminal from which derivations begin.
Starting from the start symbol, the process of repeatedly replacing nonterminals with the production bodies on the right is called derivation. The set of all terminal sequences that can be derived from the start symbol is the language defined by the grammar. Conversely, a language is a series of terminal strings that conform to the production rules.
For example, the grammar corresponding to property declarations in CSS:
// Terminal
ident [-]?{nmstart}{nmchar}*
// Nonterminal
IDENT {ident}
// Production
property : IDENT;
P.S. The start symbol is stylesheet, and the corresponding production is stylesheet : [ CDO | CDC | S | statement ]*;. See CSS Core Syntax for details.
Priority and Associativity
The priority and associativity of operations are also defined by production rules, for example:
expr -> expr + term | expr - term | term
term -> term * factor | term / factor | factor
factor -> digit | (expr)
The meanings are as follows:
-
factor: An expression that cannot be separated by any operator; it is the smallest unit of an operand, either a numeric value or an expression protected by parentheses. -
term: An expression that can be separated by high-priority operators (*and/) but not by low-priority operators (+and-). -
expr: A general expression that can be separated by any of the above operators.
Thus, the strategy for controlling priority is that each priority level has a dedicated nonterminal representing expressions that can be separated by operators of that level or higher.
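This grammar translates almost mechanically into a recursive-descent parser (a sketch, not the original text; for convenience, factor here accepts identifiers as well as digits, and AST nodes are plain tuples). Each nonterminal becomes one function, so higher-priority operators are parsed deeper in the recursion and end up lower in the tree, binding tighter:

```python
# Recursive-descent parser for the grammar:
#   expr   -> expr + term | expr - term | term
#   term   -> term * factor | term / factor | factor
#   factor -> digit | (expr)
# Left recursion is replaced by while-loops, which also makes + - * /
# left-associative. Input is a pre-split token list of strings.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok}"
        pos += 1

    def factor():
        nonlocal pos
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")           # this is where mismatched parentheses fail
            return node
        node = peek()          # digit or identifier leaf
        pos += 1
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = peek(); eat(op)
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = peek(); eat(op)
            node = (op, node, term())
        return node

    return expr()

print(parse(["initial", "+", "rate", "*", "60"]))
# ('+', 'initial', ('*', 'rate', '60'))
```

Note how `*` ends up nested inside `+` without any explicit priority table: the structure of the grammar alone decides the shape of the tree.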
Semantic Analysis
In this stage, static checks are performed to see if the input grammatical structure meets semantic requirements:
- Type checking: for example, checking whether each operator has matching operands and whether array indices have the correct type.
- Coercion: for example, performing implicit type conversions on operands.
For example:
=
├── (id, 1)
└── +
    ├── (id, 2)
    └── *
        ├── (id, 3)
        └── inttofloat
            └── (number, 4)
P.S. Here, inttofloat represents a unary operation converting an integer to a float, which is an operation inserted during type conversion.
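The coercion step can be sketched as a tree walk (illustrative, not the original text; the tuple node shape and the toy typing rule "digits are int, identifiers are float" are assumptions). When an operator mixes int and float operands, the int side is wrapped in an inttofloat node:

```python
# Coercion sketch: recursively type the AST and insert inttofloat
# nodes wherever an int operand meets a float operand.

def typeof(node):
    if isinstance(node, tuple):       # (op, left, right) or ('inttofloat', x)
        return "float" if node[0] == "inttofloat" else typeof(node[1])
    return "int" if node.isdigit() else "float"   # toy rule: ids are floats

def coerce(node):
    if not isinstance(node, tuple) or node[0] == "inttofloat":
        return node
    op, left, right = node
    left, right = coerce(left), coerce(right)
    if typeof(left) == "int" and typeof(right) == "float":
        left = ("inttofloat", left)
    elif typeof(right) == "int" and typeof(left) == "float":
        right = ("inttofloat", right)
    return (op, left, right)

ast = ("+", "initial", ("*", "rate", "60"))
print(coerce(ast))
# ('+', 'initial', ('*', 'rate', ('inttofloat', '60')))
```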
Intermediate Code Generation
After syntax and semantic analysis are complete, an intermediate representation in a machine-like language might be generated. This representation has two characteristics:
- Easy to generate (it is only an intermediate representation, so generating it must not cost too much).
- Easy to translate into machine language (moving toward the target language step by step).
For example, Three-Address Code (TAC) is a common intermediate representation, similar to assembly instructions:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
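Generating this form from the annotated AST is a post-order traversal that allocates a fresh temporary for every interior node (a sketch, not the original text; the tuple node shape is an assumption carried over from the earlier examples):

```python
# Three-address code generation sketch: emit operands before the
# operation that uses them, naming each intermediate result t1, t2, ...

def gen_tac(node, code, counter):
    if not isinstance(node, tuple):
        return node                          # leaf: identifier or constant
    if node[0] == "inttofloat":
        arg = gen_tac(node[1], code, counter)
        counter[0] += 1
        t = f"t{counter[0]}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen_tac(left, code, counter)
    r = gen_tac(right, code, counter)
    counter[0] += 1
    t = f"t{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

code, counter = [], [0]
result = gen_tac(("+", "id2", ("*", "id3", ("inttofloat", "60"))), code, counter)
code.append(f"id1 = {result}")
print("\n".join(code))
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3
```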
Code Optimization
Divided into two categories:
- Machine-independent optimization: applied to the intermediate representation to produce a better target program. Examples include the optimizations the Closure Compiler performs on JS code, such as loop optimization, dead-code elimination, and redundant-code simplification.
- Machine-dependent optimization: applied to the target program, involving CPU registers and memory references, aiming to exploit architectural features such as parallelism and the memory hierarchy.
The former occurs after the intermediate code generation phase, while the latter occurs after the target program generation, serving as the final step before output.
P.S. For instance, the inttofloat conversion in the previous step can be performed at compile time (converting 60 to 60.0), which counts as machine-independent optimization.
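That particular optimization is constant folding, sketched below (illustrative, not the original text; the tuple node shape and the "digit strings are constants" rule are assumptions). Any subtree whose operands are all literals is evaluated at compile time:

```python
import operator

# Constant-folding sketch over the tuple AST used earlier:
# inttofloat of a literal becomes a float literal, and an operator
# applied to two literals is computed immediately.

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def is_const(node):
    return isinstance(node, str) and node.replace(".", "", 1).isdigit()

def fold(node):
    if not isinstance(node, tuple):
        return node
    if node[0] == "inttofloat":
        child = fold(node[1])
        if is_const(child):
            return str(float(child))      # fold 60 -> 60.0 at compile time
        return ("inttofloat", child)
    op, left, right = node[0], fold(node[1]), fold(node[2])
    if is_const(left) and is_const(right):
        return str(OPS[op](float(left), float(right)))
    return (op, left, right)

print(fold(("+", "id2", ("*", "id3", ("inttofloat", "60")))))
# ('+', 'id2', ('*', 'id3', '60.0'))
```

With the conversion folded away, the runtime inttofloat instruction disappears from the three-address code entirely.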
Code Generation
Takes the intermediate representation as input and outputs the target language. For example, to generate machine language, one must assign a register or memory location to each variable and then translate the intermediate representation into an equivalent sequence of machine instructions.
The machine code corresponding to the optimized three-address code, written in an assembly-like notation for a hypothetical target machine, is:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
It uses two registers, R1 and R2. It reads id3 into R2, multiplies it by 60.0, stores the product back in R2... Finally, the value of R1 is placed in the memory address corresponding to id1, completing the assignment.
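A naive version of this last step can be sketched as follows (illustrative, not the original text; the instruction mnemonics follow the listing above, but the register numbering is my own naive allocation, so it differs from the hand-written sequence while remaining equivalent):

```python
# Naive code-generation sketch: translate each three-address line into
# load/operate/store instructions. The left operand is loaded into a
# fresh register, which is reused as the destination; temporaries stay
# in registers, constants become immediates, and a plain copy to a
# named variable becomes a store.

MNEMONIC = {"*": "MULF", "+": "ADDF", "-": "SUBF", "/": "DIVF"}

def gen_asm(tac):
    out, reg_of, n = [], {}, 0
    for line in tac:
        dst, rhs = [s.strip() for s in line.split("=", 1)]
        parts = rhs.split()
        if len(parts) == 1:                  # copy: store register to memory
            out.append(f"STF {dst}, {reg_of[parts[0]]}")
            continue
        a, op, b = parts
        if a in reg_of:                      # left operand already in a register
            r = reg_of[a]
        else:
            n += 1
            r = f"R{n}"
            out.append(f"LDF {r}, {a}")      # load left operand from memory
        rb = reg_of.get(b, f"#{b}")          # register, else immediate constant
        out.append(f"{MNEMONIC[op]} {r}, {r}, {rb}")
        reg_of[dst] = r
    return out

for ins in gen_asm(["t1 = id3 * 60.0", "t2 = id2 + t1", "id1 = t2"]):
    print(ins)
# LDF R1, id3
# MULF R1, R1, #60.0
# LDF R2, id2
# ADDF R2, R2, R1
# STF id1, R2
```

A real code generator would add register spilling and smarter allocation; the point here is only the shape of the translation.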
Excluding machine-dependent optimization, the compilation process ends here. Logically, it goes through four essential stages: lexical analysis, syntax analysis, semantic analysis, and code generation. In practice, these stages do not always have clear boundaries; instead, they often attempt to complete multiple processes in one pass to improve performance.
References
- "Compilers: Principles, Techniques, and Tools" (Dragon Book, Second Edition)