A Toy C Compiler

A Toy C Compiler

Dogan Kurt
dogan.kurt@dodobyte.com
October 2016
  1. Introduction
  2. The Compiler
  3. The Assembler and Linker
  4. Conclusion

Introduction

This was my attempt to write a small C compiler around 2014. I had it in my mind for some time and i was experimenting with Reverse Polish Notation, Shunting-yard Algorithm, Recursive Descent Parsers and so on.

It supports only a subset of C, i never wanted it to be complete, it was purely experimental. I just wanted to compile some code and generate an executable from scratch. I've managed to create a version that generates a working executable from C source in only five days. At this point, i've felt a great joy, but also i lost my motivation and stopped working on it.

I have never published it until now. But i think it can be beneficial as it shows how Lexical Analyzers, Recursive Descent Parsers and Operator-precedence Parsers work in a fairly simple way. The code generation part however is really poor, i had no prior experience with it and sole x86 Assembly knowledge was surely not sufficient.

The project has two parts, the main part is the compiler (dcc.c) which takes a C file as input and generates NASM compatible x86 assembly file. The second part is assembler and linker (dal.c), it assembles the output of compiler and generates the final executable. They were two different executables in the beginning but i modified them slightly to make a single executable. To get assembly output of the compiler, you should run it like dcc file.c -asm.

Compiler contains ~1200 lines of code and Assembler-Linker contains ~900 lines of code. You can download the package here with some test programs.


The Compiler

Implemented Subset

Before getting into details, let me specify which subset of C is implemented. Arbitrarily nested if-else statements, for, while loops and recursive calls are supported. Pointers are supported only at a basic level, & and * operators work but pointer arithmetic is not implemented. Supported variable types are int and char. Preprocessor, which is not a part of compiler, is not implemented. Blocks should start with a { even if the following statement was a single line.

Global variables stored in data section. Local variables and arguments stored in stack. Stack frame is not different than other compilers, arguments are found in ebp + n and local variables found in ebp - n. String literals also stored in data section, therefore they are not immutable.

Basic Routines

get_token: This function is the tokenizer, in other words, lexical analyzer. It reads the source file and returns the next token. A token is a basic unit like an integer value, an identifier or a string literal.

shunting_yard: This function reads the expression one token at a time by calling get_token, and turns it into Reverse Polish Notation. Shunting-yard Algorithm was founded by great computer scientist Edsger W. Dijkstra.

eval_expr: This function evaluates an RPN expression. This is the function that generates x86 assembly code from an expression.

expression: This is a higher level function, whenever we expect an expression, we call this function and it produces code by calling shunting_yard and eval_expr respectively.

Parsing Process

The expressions are parsed with shunting yard algorithm—an operator-precedence parser. Everything else is parsed with a recursive descent parser.

We start compilation by calling the external routine. This routine writes the start and finish codes to output. It also initiates the parser by calling the declare routine for the first time. It's called with GLOBAL parameter indicating that we are not in any block yet. The declare routine handles global variable definitions and if it encounters a function, it calls the function routine.

The function routine indirectly calls declare again for handling function arguments, it's called with PARAM parameter this time. Finally it calls the block routine and compilation of the function body starts.

The block routine is not specific to functions. It's also called to compile other code blocks like if and while. The block routine calls declare for the local variables, with LOCAL parameter of course. Once the local variable definitions handled, block routine calls the statement routine in a loop until the block ends.

The statement routine first recognizes the type of statement. If it's a regular statement, it calls the expression, otherwise, it calls the appropriate routine like do_if for if statement, do_while for while statement and so on.

As we mentioned earlier, the expression routine is the core function for generating code. Other functions like do_while call the expression routine to generate code for their conditions, and then call the block routine to compile their bodies.

Once the block is called, the process starts over again, this is how a source file is parsed with a recursive descent parser.


Code Generation

As i mentioned earlier, the code generation part is very poor. Let's see an example C code with the compiled assembly output first.

-- The C code --

void main()
{
	printf("hello world!\n");
}

-- Assembly Output --

[bits 32]

global _start

section .text

_main:
	push	ebp
	mov	ebp, esp
	sub	esp, 4096
	pusha
	mov	eax, str_0
	push	eax
	call	_printf
	add	esp, 4
	mov	[ebp -4000], eax
	popa
	mov	eax, [ebp -4000]
	leave
	ret

_start:
	call _main
	push 0
	call _exit
	ret


section .data

str_0	db	0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x0A, 0x0


extern	_printf
extern	_exit

We pre-allocate 4096 bytes for local variables, hence the maximum number of local variables are 1024. As you may know, some registers are caller save and some registers are callee save. If we don't preserve values of caller save registers, functions like printf may change them.

Instead of managing that, i just push everything to the stack with pusha instruction before making any call. Even worse, i push eax register, which is the return value of the function, to some far away stack location, run popa instruction to restore registers, and finally load return value from stack to eax register again. Registers are used to emulate a stack machine, they hold the intermediate results. Real stack is used if we are out of registers.

You see the _start stub which calls the main function and calls exit after the program finishes. As you may notice, we don't give any parameters to main, and we ignore it's return code and exit always with 0.


The Assembler and Linker

Although they are not a part of the compiler, i wanted to implement assembler and linker to make the compiler independent of any third party program.

They are far from being generic, assembler only assembles the instruction my compiler emits, and linker only generates import table for msvcrt.dll, which means only standard C functions are supported. Most of the limitations come from assembler-linker, if you chose to use NASM and any other linker, you can call Windows APIs and other functions.

Linker generates three sections, .text contains the code, .data contains global variables as well as the string literals, and .idata contains import table.


Conclusion

This is not an educational material for compiler design. It's a pragmatic piece of code to compile some C code and create an executable from scratch. I started learning Compiler Construction formally well after i've finished this one. If you are interested in the subject however, i'd love to give some references.

Niklaus Wirth's Compiler Construction book is the best starting point. Dragon Book is the reference book for compiler design and implementation. Ken Thompson wrote a great paper named A New C Compiler.