As in Project 1, the output of your compiler must be an assembly language program that can be assembled with NASM to produce machine code that corresponds to the input source code.
As will be the case for the remainder of the semester, you will have much flexibility in the design of the syntax and semantics of your programming language. For example, this project description does not specify the rules for variable names. You should pick something reasonable. The same holds for the syntax of variable declarations, print statements and assignments.
Declarations. Variables must be declared before their first use. The declaration must specify the type of the variable. Otherwise, the compiler cannot infer the type of an expression like x + y. Have an extensible scheme for declarations. You will at least have to allow for Boolean types when we add if statements to the language.
Assignments. As with expressions, you do not have to worry about type promotion. Assigning a floating point expression to an integer variable and vice versa are errors (for now).
Printing. In Project 1, the program automatically printed the value of expression. We don't want that behavior here because an expression might be part of an assignment statement. Instead, we require that the programmer use a print statement explicitly. For flexibility in formatting the output, have two versions of the print statement: one that prints a newline and one that does not.
CVS. You should first use CVS to checkout Project 2.
Yacc. You should review the lex and yacc programs that allow assignment to single character variables and note how %union is used in conjunction with %token, %left, %right and %type to declare the type of tokens and nonterminals. Also, note the use of union type field names in the lex program: ch3-03.l, ch3-03.y and ch3-03.txt.
For the project, you would not want the expression nonterminal to have double type since you are generating assembly language code rather than evaluating the expression. Instead you want the value associated with expression to tell you the type of the expression (floating-point or integer). For example, when the rule
Assembly language. For this project, you will still keep the results of sub-expressions on the top of the stack pointed by the ESP register, even for 8-byte floating-point values. Although the x87 FPU has its own stack, that stack is limited to 8 items and it is difficult in yacc to determine whether the depth of an expression exceeds 8.
Memory allocation (assembly). The variables used in this phase of the compiler will correspond to global variables in the final compiler. Memory for the variables should be allocated in a .DATA section of the assembly language program generated by your compiler. (Note: you can have more than one .DATA section in your assembly code.) There are two general approaches for memory allocation. You can either allocate memory all at once or use the variable name as a label.
In the first approach, you keep track of the total amount of memory you need for all of the variables and reserve that amount of memory in one big chunk. For example, if you need 32 bytes of memory, then the following assembler directive will do the trick:
The advantage of the first approach is that global variables are accessed in the same way that local variables and actual parameters will be accessed in later projects. (Those are offsets from the [EBP].) The disadvantage is that these global variables will not be accessible from external code.
In the second approach, you simply use the variable's name as a label in the assembly code. For example, a floating point variable x will be allocated by:
C Programming Practices. When you use lex and yacc, you pretty much have to lift the ban on using global variables. (Alternatives are possible. See Section 3.7.10 A Pure (Reentrant) Parser of the Bison Manual.)
However, this does not mean that we want to throw away every rule of good programming practice. One rule that you should still very much stick to is the edict to avoid ``magic numbers''. For example, if you use a constant 4 in the middle of your yacc source file, it is not clear whether the 4 is there because integer values take 4 bytes or if addresses take 4 bytes. (I.e., if you wanted to modify the code to work for 2-byte integers, should you change this 4?) It is much better to use a constant, say INT_SIZE, defined using the #define preprocessor directive:
Furthermore, in cases where you don't really need the value of a constant, but just want some way to label things, it is better to use an enumeration type. For example, you can define etype as follows:
Symbol table. The symbol table is a critical part of your compiler. For now, you just need the symbol table to remember the type of each variable and possibly its offset in memory (depending on the memory allocation scheme you choose to follow). Later, you will also use the symbol table to remember whether an identifier is a variable or a function. In case of variables, the symbol table must also keep track of whether the variable is a local variable, a global variable or a parameter. In case of functions, the types of its return value and of its parameters are also stored in the symbol table. At this stage, you should keep the design of your symbol table flexible. You will almost certainly need to add fields to its entries.
The two basic functions of the symbol table are 1) inserting a new item and 2) finding an existing item. Since a symbol table find operation is needed every time the compiler encounters an identifier, we need the data structure to support very fast find operations. This can be accomplished using a hash table which supports O(1) average running time for insertion and find. You can choose any hashing scheme as long as:
Alternatively, you can use the hash_map class from the C++ Standard Template Library. (See documentation.) Actually, hash_map is not quite part of the STL, so you need to add "ext" to the include path, like so:
Memory allocation (in C). You have to be very careful with dynamically allocated strings. Between lex, yacc and the symbol table, you need to have a clear protocol for whether a function has its own copy of the string and when the memory for that string is released.
Recall that the characters matched by lex are stored in yytext. In the case when you've matched a variable name, you will want to give the yacc parser access to that name. If you store a pointer in yylval, is that a separate copy of the string? or a pointer into yytext? Is that safe? If you store a name in a hash table, does the hash table make a copy of that string? or does it only store a pointer to that string? If it makes a copy, when is memory for the string released? If the hash table doesn't make a copy, you have to be sure that memory for the string is not freed until the hash table is done with it. Here are some issues: