
the-super-tiny-compiler Source Code Analysis

2018-09-03 · #JS #compiler-in-JavaScript #writing-a-compiler-in-JS #compiler-principles #AST

A "sparrow-sized" compiler written in about 200 lines of JS

1. Goal

Implement a compiler that supports simple expression transformation, converting Lisp-style function calls to C-style ones, for example:

              LISP-style              C-style
2 + 2         (add 2 2)               add(2, 2)
4 - 2         (subtract 4 2)          subtract(4, 2)
2 + (4 - 2)   (add 2 (subtract 4 2))  add(2, subtract(4, 2))

That is, given the input (add 2 2), the output should be add(2, 2).

2. Design

The compilation work splits into 3 phases:

  • Parsing: convert the code string into an abstract representation

  • Transformation: modify the abstract representation into the desired form

  • Code Generation: convert the transformed abstract representation back into a new code string

Parsing here includes both lexical analysis and syntax analysis. The lexical analyzer converts the code string into a series of tokens; the syntax analyzer then builds an intermediate representation, the Abstract Syntax Tree (AST), which describes the syntax structure (the syntactic components and the relationships between them).

Structurally, tokens are a flat list of small objects, each describing an independent syntactic component (a number, name, punctuation mark, operator, etc.), while the AST is a deeply nested object that is easy to process and carries the full syntax structure. For example:

// Code string
(add 2 (subtract 4 2))
// Tokens
[
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'add'      },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'subtract' },
  { type: 'number', value: '4'        },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: ')'        },
  { type: 'paren',  value: ')'        },
]
// AST
{
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [{
      type: 'NumberLiteral',
      value: '2',
    }, {
      type: 'CallExpression',
      name: 'subtract',
      params: [{
        type: 'NumberLiteral',
        value: '4',
      }, {
        type: 'NumberLiteral',
        value: '2',
      }]
    }]
  }]
}

The transformation phase modifies the AST: traverse the tree generated above and perform node-level operations (add/delete/modify nodes) and property-level operations (add/delete/modify properties). Alternatively, build an entirely new tree.

The implementation can use a straightforward depth-first traversal; the interesting part is how each node gets processed:

var visitor = {
  NumberLiteral: {
    enter(node, parent) {},
    exit(node, parent) {}
  },
  CallExpression: {
    enter(node, parent) {},
    exit(node, parent) {}
  },
  //...
};

Extracting the node-visiting operations into a visitor layer, and calling the corresponding enter/exit methods by node type during traversal, is a neat little trick.

After the AST has been modified comes the final code-generation phase: traverse the new tree, collect output, and turn the AST back into a code string.

3. Implementation

Lexical Analysis

// Accepts the code string `input`
function tokenizer(input) {
  // Index of the character currently being processed
  let current = 0;
  // Output collection of tokens
  let tokens = [];

  // Walk the string, picking out tokens
  while (current < input.length) {
    let char = input[current];

    // Match left and right parentheses
    if (char === '(') {
      tokens.push({
        type: 'paren',
        value: '(',
      });
      current++;

      continue;
    }
    if (char === ')') {
      tokens.push({
        type: 'paren',
        value: ')',
      });
      current++;

      continue;
    }

    // Skip whitespace
    let WHITESPACE = /\s/;
    if (WHITESPACE.test(char)) {
      current++;
      continue;
    }

    // Match numbers
    let NUMBERS = /[0-9]/;
    if (NUMBERS.test(char)) {
      let value = '';

      // Collect consecutive digits as one number
      while (NUMBERS.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'number', value });

      continue;
    }

    // Match strings of the form "abc"
    if (char === '"') {
      let value = '';

      // Consume the opening double quote
      char = input[++current];
      // Collect everything between the quotes as the string value
      while (char !== '"') {
        value += char;
        char = input[++current];
      }
      // Consume the closing double quote
      char = input[++current];
      tokens.push({ type: 'string', value });

      continue;
    }

    // Match names (function names), letters only
    let LETTERS = /[a-z]/i;
    if (LETTERS.test(char)) {
      let value = '';

      // Collect consecutive letters
      while (LETTERS.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'name', value });

      continue;
    }

    // Unrecognized character: bail out
    throw new TypeError("I don't know what this character is: " + char);
  }

  return tokens;
}

Walk the code string, split out each lexeme, and wrap it into token form.
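Two of the branches above (numbers and names) share a "collect a run of matching characters" pattern. A small helper isolating it (hypothetical; the name `collectRun` is mine, not in the original source, and unlike the inline loops it also bounds-checks the index):

```javascript
// Hypothetical helper isolating the "collect a run" pattern used by the
// number and name branches of tokenizer() above.
function collectRun(input, start, regex) {
  let value = '';
  let current = start;
  // Consume characters while they keep matching.
  while (current < input.length && regex.test(input[current])) {
    value += input[current];
    current++;
  }
  // Return the collected lexeme and the index to resume scanning at.
  return { value, next: current };
}

collectRun('(add 42 7)', 5, /[0-9]/); // { value: '42', next: 7 }
```

The explicit bounds check matters at end of input: `regex.test(undefined)` coerces to the string "undefined", which a letter regex would happily match.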

Syntax Analysis

function parser(tokens) {
  // Index of the token currently being processed
  let current = 0;

  // Recursive walk (function calls can nest), turning tokens into AST nodes
  function walk() {
    let token = tokens[current];

    // Numbers
    if (token.type === 'number') {
      current++;

      // Produce an AST node representing a number literal
      return {
        type: 'NumberLiteral',
        value: token.value,
      };
    }

    // Strings
    if (token.type === 'string') {
      current++;

      return {
        type: 'StringLiteral',
        value: token.value,
      };
    }

    // Function calls
    if (
      token.type === 'paren' &&
      token.value === '('
    ) {
      // Drop the left parenthesis; the next token is the function name
      token = tokens[++current];

      let node = {
        type: 'CallExpression',
        name: token.value,
        params: [],
      };

      // Peek at the next token
      token = tokens[++current];

      // Everything parsed before the right parenthesis is a parameter
      while (
        (token.type !== 'paren') ||
        (token.type === 'paren' && token.value !== ')')
      ) {
        node.params.push(walk());
        token = tokens[current];
      }
      // Consume the right parenthesis
      current++;

      return node;
    }

    // Unrecognized token: bail out
    throw new TypeError(token.type);
  }

  // Root node of the AST
  let ast = {
    type: 'Program',
    body: [],
  };
  // Fill ast.body; multiple statements are allowed, hence the loop
  while (current < tokens.length) {
    ast.body.push(walk());
  }

  return ast;
}

To handle nested function calls, the parser walks the token sequence recursively.
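As a quick check, a condensed copy of the parser above (string handling elided), run over the token list from section 2, reproduces the AST shown there:

```javascript
// Condensed copy of parser() above, just enough for a runnable check.
function parser(tokens) {
  let current = 0;
  function walk() {
    let token = tokens[current];
    if (token.type === 'number') {
      current++;
      return { type: 'NumberLiteral', value: token.value };
    }
    if (token.type === 'paren' && token.value === '(') {
      token = tokens[++current]; // the function name
      const node = { type: 'CallExpression', name: token.value, params: [] };
      token = tokens[++current];
      // Everything before the matching ')' is a parameter
      while (token.type !== 'paren' || token.value !== ')') {
        node.params.push(walk());
        token = tokens[current];
      }
      current++; // consume ')'
      return node;
    }
    throw new TypeError(token.type);
  }
  const ast = { type: 'Program', body: [] };
  while (current < tokens.length) ast.body.push(walk());
  return ast;
}

// The token list for (add 2 (subtract 4 2)) from section 2:
const tokens = [
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'add'      },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'subtract' },
  { type: 'number', value: '4'        },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: ')'        },
  { type: 'paren',  value: ')'        },
];
parser(tokens); // yields the AST shown in section 2
```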

Traversal

function traverser(ast, visitor) {
  // Traverse an array of AST nodes
  function traverseArray(array, parent) {
    array.forEach(child => {
      traverseNode(child, parent);
    });
  }

  function traverseNode(node, parent) {
    // Look up the matching set of methods on the visitor
    let methods = visitor[node.type];
    // Notify the visitor that we are entering `node`
    if (methods && methods.enter) {
      methods.enter(node, parent);
    }

    switch (node.type) {
      // Root node
      case 'Program':
        traverseArray(node.body, node);
        break;
      // Function calls
      case 'CallExpression':
        traverseArray(node.params, node);
        break;
      // Numbers and strings have no children; nothing to do
      case 'NumberLiteral':
      case 'StringLiteral':
        break;

      // Unrecognized AST node: bail out
      default:
        throw new TypeError(node.type);
    }

    // Notify the visitor that we are about to leave `node`
    if (methods && methods.exit) {
      methods.exit(node, parent);
    }
  }

  // Start traversing
  traverseNode(ast, null);
}

Recursively traverse the AST structure and notify the visitor along the way; it has a bit of an AOP flavor.
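To make the enter/exit ordering concrete, here is a runnable check: a condensed copy of the traverser, run over the sample AST from section 2 with a logging visitor. `enter` fires parent-before-child, `exit` child-before-parent:

```javascript
// Condensed copy of the traverser above, just enough for a runnable demo.
function traverser(ast, visitor) {
  function traverseArray(array, parent) {
    array.forEach(child => traverseNode(child, parent));
  }
  function traverseNode(node, parent) {
    const methods = visitor[node.type];
    if (methods && methods.enter) methods.enter(node, parent);
    switch (node.type) {
      case 'Program': traverseArray(node.body, node); break;
      case 'CallExpression': traverseArray(node.params, node); break;
      case 'NumberLiteral': break;
      default: throw new TypeError(node.type);
    }
    if (methods && methods.exit) methods.exit(node, parent);
  }
  traverseNode(ast, null);
}

// The AST for (add 2 (subtract 4 2)) from section 2.
const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression', name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      { type: 'CallExpression', name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ] },
    ],
  }],
};

// Log the enter/exit order for CallExpression nodes.
const log = [];
traverser(ast, {
  CallExpression: {
    enter(node) { log.push('enter ' + node.name); },
    exit(node)  { log.push('exit '  + node.name); },
  },
});
// log: ['enter add', 'enter subtract', 'exit subtract', 'exit add']
```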

Transformation

// Input: Lisp AST; output: C-style AST
function transformer(ast) {
  // Root node of the new AST
  let newAst = {
    type: 'Program',
    body: [],
  };

  // A lazy, crude way to link the old and new ASTs, making it easy to
  // operate on the new AST while traversing the old one
  ast._context = newAst.body;

  // Create the visitor and start traversing
  traverser(ast, {
    // Numbers and strings: insert into the new AST unchanged
    NumberLiteral: {
      enter(node, parent) {
        parent._context.push({
          type: 'NumberLiteral',
          value: node.value,
        });
      },
    },
    StringLiteral: {
      enter(node, parent) {
        parent._context.push({
          type: 'StringLiteral',
          value: node.value,
        });
      },
    },

    // Function calls
    CallExpression: {
      enter(node, parent) {
        // Create a different kind of AST node
        let expression = {
          type: 'CallExpression',
          callee: {
            type: 'Identifier',
            name: node.name,
          },
          arguments: [],
        };

        // Function calls can have children; link the old node to the
        // new arguments array for its children to use
        node._context = expression.arguments;

        // A top-level function call counts as a statement; wrap it
        // in a special AST node
        if (parent.type !== 'CallExpression') {
          expression = {
            type: 'ExpressionStatement',
            expression: expression,
          };
        }

        parent._context.push(expression);
      },
    }
  });

  return newAst;
}

Since traversal (traverser) and modification (visitor) are separated into two layers, it is hard to know, while visiting an old AST node, which new AST parent it corresponds to. The code above takes a crude shortcut: it adds a _context property so that each old AST parent node holds a reference to the new AST container to be operated on. It works, but it pollutes the old AST.

Code Generation

// Recursively walk the new AST and emit a code string
function codeGenerator(node) {
  switch (node.type) {
    // Root node: generate everything in body, one statement per line
    case 'Program':
      return node.body.map(codeGenerator).join('\n');

    // Expression statement: generate the inner expression, append a semicolon
    case 'ExpressionStatement':
      return (
        codeGenerator(node.expression) + ';'
      );

    // Function call: add parentheses, separate arguments with commas
    case 'CallExpression':
      return (
        codeGenerator(node.callee) +
        '(' +
        node.arguments.map(codeGenerator).join(', ') +
        ')'
      );

    // Identifiers and numbers: emit as-is
    case 'Identifier':
      return node.name;
    case 'NumberLiteral':
      return node.value;

    // Strings: wrap in double quotes
    case 'StringLiteral':
      return '"' + node.value + '"';

    // Unrecognized new-AST node: bail out
    default:
      throw new TypeError(node.type);
  }
}

Convert the AST back into a code string, adding semicolons and parentheses where needed...
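The generator depends only on the shape of the new AST, so it can be exercised on a hand-built tree. A condensed copy of codeGenerator above, applied to the C-style AST for add(2, subtract(4, 2)):

```javascript
// Condensed copy of the code generator above, for a runnable check.
function codeGenerator(node) {
  switch (node.type) {
    case 'Program':
      return node.body.map(codeGenerator).join('\n');
    case 'ExpressionStatement':
      return codeGenerator(node.expression) + ';';
    case 'CallExpression':
      return codeGenerator(node.callee) + '(' +
        node.arguments.map(codeGenerator).join(', ') + ')';
    case 'Identifier': return node.name;
    case 'NumberLiteral': return node.value;
    case 'StringLiteral': return '"' + node.value + '"';
    default: throw new TypeError(node.type);
  }
}

// A hand-built C-style AST for add(2, subtract(4, 2));
const newAst = {
  type: 'Program',
  body: [{
    type: 'ExpressionStatement',
    expression: {
      type: 'CallExpression',
      callee: { type: 'Identifier', name: 'add' },
      arguments: [
        { type: 'NumberLiteral', value: '2' },
        { type: 'CallExpression',
          callee: { type: 'Identifier', name: 'subtract' },
          arguments: [
            { type: 'NumberLiteral', value: '4' },
            { type: 'NumberLiteral', value: '2' },
          ] },
      ],
    },
  }],
};

codeGenerator(newAst); // 'add(2, subtract(4, 2));'
```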

Process Chaining

function compiler(input) {
  let tokens = tokenizer(input);
  let ast    = parser(tokens);
  let newAst = transformer(ast);
  let output = codeGenerator(newAst);

  return output;
}

Chain all the phases above together and we have a simple compiler, indeed only about 200 lines. Try it out:

const input  = '(add 2 (subtract 4 2))';
let output = compiler(input);
// add(2, subtract(4, 2));
console.log(output);

4. Optimization

2 points worth optimizing:

  • Lexical analysis: character-by-character matching is verbose; if the efficiency cost is acceptable, regular expressions are clearer

  • Transformation: the way the new AST is generated is dirty (it pollutes the old AST); an extra data structure can record the link between the old and new ASTs without touching the old one

Clearer Lexical Analysis

For example, a regular expression for each lexeme:

const REGEX = {
  PAREN: /^\(|^\)/,
  WHITESPACE: /^\s+/,
  NUMBERS: /^\d+/,
  STRING: /^"([^"]+)?"/,
  NAME: /^[a-z]+/i
};

while (rest.length > 0) {
  let type, value;
  // The match result, and the length of input consumed by this match
  let matched, span;

  // Match left and right parentheses
  if (matched = rest.match(REGEX.PAREN)) {
    type = 'paren';
  }
  //...match the other lexemes

  value = value || matched[0];
  tokens.push({type, value});
  rest = rest.slice(span || matched[0].length);
}

Much clearer. To improve extensibility further, the lexeme-specific type, value, and span can be turned into configuration:

// Lexeme configuration
const lexemes = [
  {
    regex: /^\(|^\)/,
    type: 'paren',
    value: 0,
    span: 0
  }, {
    regex: /^\d+/,
    type: 'number',
    value: 0,
    span: 0
  }, {
    regex: /^"([^"]+)?"/,
    type: 'string',
    value: 1,
    span: 0
  }, {
    regex: /^[a-z]+/i,
    type: 'name',
    value: 0,
    span: 0
  }
];

Then all lexemes can be processed uniformly:

while (rest.length > 0) {
  // Skip whitespace
  if (matched = rest.match(/^\s+/)) {
    rest = rest.slice(matched[0].length);
    continue;
  }

  lexemes.forEach(({regex, type : t, value: v, span: s}) => {
    if (matched) return;
    if (matched = rest.match(regex)) {
      type = t;
      value = matched[v];
      span = matched[s].length;
    }
  });

  if (matched) {
    tokens.push({type, value});
    rest = rest.slice(span);
  }
  else {
    // Unrecognized character: bail out
    throw new TypeError('Unexpected character: ' + rest);
  }
}
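The two fragments above are not runnable on their own. Assembled into one self-contained function (the wrapper and its name `regexTokenizer` are mine, not from the original post, and the string regex is relaxed to also accept empty strings), they become a complete regex-based tokenizer:

```javascript
// The fragments above, assembled into one runnable regex-based tokenizer.
// (The wrapper function and its name are mine, not from the original post.)
function regexTokenizer(input) {
  // Lexeme table: `value` says which capture group supplies the token value,
  // `span` which group's length counts as consumed input.
  const lexemes = [
    { regex: /^\(|^\)/,    type: 'paren',  value: 0, span: 0 },
    { regex: /^\d+/,       type: 'number', value: 0, span: 0 },
    { regex: /^"([^"]*)"/, type: 'string', value: 1, span: 0 },
    { regex: /^[a-z]+/i,   type: 'name',   value: 0, span: 0 },
  ];

  let rest = input;
  const tokens = [];

  while (rest.length > 0) {
    // Skip whitespace
    let matched = rest.match(/^\s+/);
    if (matched) {
      rest = rest.slice(matched[0].length);
      continue;
    }

    // Try each lexeme in turn
    let type, value, span;
    for (const { regex, type: t, value: v, span: s } of lexemes) {
      matched = rest.match(regex);
      if (matched) {
        type = t;
        value = matched[v];
        span = matched[s].length;
        break;
      }
    }

    if (!matched) {
      // Unrecognized character: bail out
      throw new TypeError('Unexpected character: ' + rest);
    }
    tokens.push({ type, value });
    rest = rest.slice(span);
  }

  return tokens;
}

regexTokenizer('(add 2 (subtract 4 2))');
// Same token list as the character-by-character tokenizer above.
```

A `for...of` loop replaces the `forEach` here so the search can `break` as soon as a lexeme matches.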

Cleaner Transformation

Generating the new tree involves two things:

  1. Mapping nodes

  2. Building the tree structure

Mapping nodes is easy; the problem is knowing where to hang each new node. The visitor and transformer are implemented as two independent layers, so the link between the old and new trees has to be recorded manually, as in the transformation source above:

// A lazy, crude way to link the old and new ASTs
ast._context = newAst.body;
// Function calls can have children; link the nodes for their children to use
node._context = expression.arguments;

This way, we know where in the new tree to hang the new node corresponding to the currently visited old node, for example:

// The old parent node carries the children array of its new-tree
// counterpart; push the new node into it
parent._context.push({
  type: 'NumberLiteral',
  value: node.value,
});

But this implementation is quite dirty: it pollutes the old tree (the ast passed in). A more reasonable approach is to record the currently active node container of the new tree in a non-intrusive way. Since function calls can nest, a stack is needed:

// An extra data structure links the old and new ASTs
let stack = [newAst.body];
function peak() {
  return stack[stack.length - 1];
}

The link is then maintained during the recursive traversal of the old tree:

// Function calls
CallExpression: {
  enter(node, parent) {
    // Grab the active container in the new tree
    let newASTHost = peak();

    // Create the new-tree node
    let expression = {
      type: 'CallExpression',
      callee: {
        type: 'Identifier',
        name: node.name,
      },
      arguments: [],
    };

    // Function calls can have children; push their container so
    // child nodes know where to go
    stack.push(expression.arguments);

    // Fill in the new tree structure
    newASTHost.push(expression);
  },
  // Note: this hook must be named `exit`, since that is what traverser invokes
  exit(node, parent) {
    // Done collecting arguments; pop back to the previous level
    stack.pop();
  }
}

This no longer pollutes the old tree, and the cost is low (one extra data structure, the stack).
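Putting the pieces together, here is a runnable sketch of the non-intrusive transformer (the condensed traverser is repeated from section 3; the NumberLiteral visitor, which the excerpt above omits, is filled in in the same style):

```javascript
// Condensed copy of traverser() from section 3.
function traverser(ast, visitor) {
  function traverseArray(array, parent) {
    array.forEach(child => traverseNode(child, parent));
  }
  function traverseNode(node, parent) {
    const methods = visitor[node.type];
    if (methods && methods.enter) methods.enter(node, parent);
    switch (node.type) {
      case 'Program': traverseArray(node.body, node); break;
      case 'CallExpression': traverseArray(node.params, node); break;
      case 'NumberLiteral':
      case 'StringLiteral': break;
      default: throw new TypeError(node.type);
    }
    if (methods && methods.exit) methods.exit(node, parent);
  }
  traverseNode(ast, null);
}

// Non-intrusive transformer: a stack of "active container" arrays
// replaces the _context property entirely.
function transformer(ast) {
  const newAst = { type: 'Program', body: [] };
  const stack = [newAst.body];
  const peak = () => stack[stack.length - 1];

  traverser(ast, {
    NumberLiteral: {
      enter(node) {
        peak().push({ type: 'NumberLiteral', value: node.value });
      },
    },
    CallExpression: {
      enter(node, parent) {
        // Grab the container BEFORE pushing this call's own arguments array
        const host = peak();
        let expression = {
          type: 'CallExpression',
          callee: { type: 'Identifier', name: node.name },
          arguments: [],
        };
        // Children of this call land in its arguments array
        stack.push(expression.arguments);
        // Top-level calls become statements, as in section 3
        if (parent.type !== 'CallExpression') {
          expression = { type: 'ExpressionStatement', expression };
        }
        host.push(expression);
      },
      exit() {
        // Done with this call's arguments; pop back to the previous level
        stack.pop();
      },
    },
  });

  return newAst;
}

// The Lisp AST for (add 2 (subtract 4 2)) from section 2
const lispAst = {
  type: 'Program',
  body: [{
    type: 'CallExpression', name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      { type: 'CallExpression', name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ] },
    ],
  }],
};

const cAst = transformer(lispAst);
// cAst is the C-style AST from section 2; lispAst gains no _context property
```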

5. Source Code

Github address: https://github.com/ayqy/the-super-tiny-compiler

P.S. The repo includes the original, heavily commented version as well as the optimized implementation.

