This was my attempt to write a small C compiler around 2014. I had it in my mind for some time and i was experimenting with Reverse Polish Notation, Shunting-yard Algorithm, Recursive Descent Parsers and so on.
It supports only a subset of C, i never wanted it to be complete, it was purely experimental. I just wanted to compile some code and generate an executable from scratch. I've managed to create a version that generates a working executable from C source in only five days. At this point, i've felt a great joy, but also i lost my motivation and stopped working on it.
I have never published it until now. But i think it can be beneficial as it shows how Lexical Analyzers, Recursive Descent Parsers and Operator-precedence Parsers work in a fairly simple way. The code generation part however is really poor, i had no prior experience with it and sole x86 Assembly knowledge was surely not sufficient.
The project has two parts, the main part is the compiler (dcc.c) which takes a C file as input and generates NASM compatible x86 assembly file. The second part is assembler and linker (dal.c), it assembles the output of compiler and generates the final executable. They were two different executables in the beginning but i modified them slightly to make a single executable. To get assembly output of the compiler, you should run it like dcc file.c -asm
.
Compiler contains ~1200 lines of code and Assembler-Linker contains ~900 lines of code. You can download the package here with some test programs.
Before getting into details, let me specify which subset of C is implemented. Arbitrarily nested if-else
statements, for, while
loops and recursive calls are supported. Pointers are supported only at a basic level, &
and *
operators work but pointer arithmetic is not implemented. Supported variable types are int
and char
. Preprocessor, which is not a part of compiler, is not implemented. Blocks should start with a {
even if the following statement was a single line.
Global variables stored in data section. Local variables and arguments stored in stack. Stack frame is not different than other compilers, arguments are found in ebp + n
and local variables found in ebp - n
. String literals also stored in data section, therefore they are not immutable.
get_token:
This function is the tokenizer, in other words, lexical analyzer. It reads the source file and returns the next token. A token is a basic unit like an integer value, an identifier or a string literal.
shunting_yard:
This function reads the expression one token at a time by calling get_token
, and turns it into Reverse Polish Notation. Shunting-yard Algorithm was founded by great computer scientist Edsger W. Dijkstra.
eval_expr:
This function evaluates an RPN expression. This is the function that generates x86 assembly code from an expression.
expression:
This is a higher level function, whenever we expect an expression, we call this function and it produces code by calling shunting_yard
and eval_expr
respectively.
The expressions are parsed with shunting yard algorithm—an operator-precedence parser. Everything else is parsed with a recursive descent parser.
We start compilation by calling the external
routine. This routine writes the start and finish codes to output. It also initiates the parser by calling the declare
routine for the first time. It's called with GLOBAL parameter indicating that we are not in any block yet. The declare
routine handles global variable definitions and if it encounters a function, it calls the function
routine.
The function
routine indirectly calls declare
again for handling function arguments, it's called with PARAM parameter this time. Finally it calls the block
routine and compilation of the function body starts.
The block
routine is not specific to functions. It's also called to compile other code blocks like if
and while
. The block
routine calls declare
for the local variables, with LOCAL parameter of course. Once the local variable definitions handled, block
routine calls the statement
routine in a loop until the block ends.
The statement
routine first recognizes the type of statement. If it's a regular statement, it calls the expression
, otherwise, it calls the appropriate routine like do_if
for if statement, do_while
for while statement and so on.
As we mentioned earlier, the expression
routine is the core function for generating code. Other functions like do_while
call the expression
routine to generate code for their conditions, and then call the block
routine to compile their bodies.
Once the block
is called, the process starts over again, this is how a source file is parsed with a recursive descent parser.
As i mentioned earlier, the code generation part is very poor. Let's see an example C code with the compiled assembly output first.
-- The C code --
void main() { printf("hello world!\n"); }
-- Assembly Output --
[bits 32] global _start section .text _main: push ebp mov ebp, esp sub esp, 4096 pusha mov eax, str_0 push eax call _printf add esp, 4 mov [ebp -4000], eax popa mov eax, [ebp -4000] leave ret _start: call _main push 0 call _exit ret section .data str_0 db 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x20, 0x77, 0x6F, 0x72, 0x6C, 0x64, 0x21, 0x0A, 0x0 extern _printf extern _exit
We pre-allocate 4096 bytes for local variables, hence the maximum number of local variables are 1024. As you may know, some registers are caller save and some registers are callee save. If we don't preserve values of caller save registers, functions like printf
may change them.
Instead of managing that, i just push everything to the stack with pusha
instruction before making any call. Even worse, i push eax
register, which is the return value of the function, to some far away stack location, run popa
instruction to restore registers, and finally load return value from stack to eax
register again. Registers are used to emulate a stack machine, they hold the intermediate results. Real stack is used if we are out of registers.
You see the _start
stub which calls the main
function and calls exit
after the program finishes. As you may notice, we don't give any parameters to main
, and we ignore it's return code and exit always with 0.
Although they are not a part of the compiler, i wanted to implement assembler and linker to make the compiler independent of any third party program.
They are far from being generic, assembler only assembles the instruction my compiler emits, and linker only generates import table for msvcrt.dll
, which means only standard C functions are supported. Most of the limitations come from assembler-linker, if you chose to use NASM and any other linker, you can call Windows APIs and other functions.
Linker generates three sections, .text
contains the code, .data
contains global variables as well as the string literals, and .idata
contains import table.
This is not an educational material for compiler design. It's a pragmatic piece of code to compile some C code and create an executable from scratch. I started learning Compiler Construction formally well after i've finished this one. If you are interested in the subject however, i'd love to give some references.
Niklaus Wirth's Compiler Construction book is the best starting point. Dragon Book is the reference book for compiler design and implementation. Ken Thompson wrote a great paper named A New C Compiler.