In this homework, you will write an assembler in C or C++ for a simplified subset of the MIPS instruction set. Your program should take two arguments, the name of an assembler input file and the name of the hex output file.
Instructions are written with an operation followed by a list of operands separated by white space or commas. To simplify parsing, you can treat commas and parentheses as white space characters, so any of these are equivalent:
    lw $2,:label($3)
    lw $2 :label($3)
    lw $2,:label,$3
    lw $2 :label $3
Register operands are written $0 to $31. Immediate constants are given as a '#' followed by a c-style number, for example '#100' is decimal 100, while '#0x100' is hex 100 or decimal 256. Labels can be defined at the beginning of a line with a leading colon (:label), and can also appear as an immediate constant. Relative data addressing is written with a label followed by a register in parentheses: ':label($1)'.
Anything from a ';' character to the end of the line should be discarded as a comment. Any blank or comment only lines should be ignored, and you should allow any amount of white space between elements.
We will also use one pseudo-instruction, int, whose "operand" is one word-sized integer to store as a data element.
For example, here is a section of legal assembly code
    addi $1, $0, #12        ; x = 12 = offset to end of data table
    add $2, $0, $0          ; y = 0
    :loop blez $1, :end     ; jump to end when x<=0
    addi $1, $1, #-4        ; x = x - 4 = update offset by one 32-bit word
    lw $3, :data($1)        ; z = data[x]
    add $2, $2, $3          ; y = y + z
    j :loop                 ; start next loop
    :end j :end             ; loop forever to end
    :data                   ; label can be on its own line
    int #0                  ; data pseudo-instructions
    int #1
    int #2
All instructions are encoded in 32 bits in one of three formats. Note that using a label might produce different immediate values in the encoded instruction depending on the type of instruction. Data accesses (lw, sw) use the full immediate address. Jump instructions (j, jal) use a word address for the immediate value (i.e. the address divided by 4): 32-bit aligned instructions can only occur at addresses that are divisible by 4, so there is no need to waste space on two bits that would always be zero. The immediate value for branch instructions (beq, bne, blez, and bgtz) is the relative word offset to the label from the next instruction.
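The three label rules above can be sketched as a single helper. This is only an illustration: the `label_use` enum and the function name `label_immediate` are hypothetical, and both addresses are assumed to be byte addresses.

```c
#include <stdint.h>

/* Hypothetical categories for how an instruction uses a label. */
enum label_use { USE_DATA, USE_JUMP, USE_BRANCH };

/*
 * Convert a label's byte address into the immediate value to encode.
 * label_addr: byte address of the label definition
 * inst_addr:  byte address of the instruction that references the label
 */
int32_t label_immediate(enum label_use use, int32_t label_addr, int32_t inst_addr)
{
    switch (use) {
    case USE_DATA:      /* lw, sw: full byte address */
        return label_addr;
    case USE_JUMP:      /* j, jal: word address */
        return label_addr / 4;
    case USE_BRANCH:    /* beq, bne, blez, bgtz: words from the next instruction */
        return (label_addr - (inst_addr + 4)) / 4;
    }
    return 0;
}
```

For the sample code, the blez at address 8 that targets :end at address 28 gets (28 - 12) / 4 = 4, and the j at address 24 that targets :loop at address 8 gets 8 / 4 = 2.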
The three instruction formats are:
format | fields (most significant bits first) | | | | | |
---|---|---|---|---|---|---|
R-type | opcode (6b) | rs (5b) | rt (5b) | rd (5b) | sh (5b) | func (6b) |
I-type | opcode (6b) | rs (5b) | rd (5b) | immediate (16b) | | |
J-type | opcode (6b) | address (26b) | | | | |
You should support the following opcodes (this page is a good reference for the instruction meaning, format, and operand order)
high bits \ low bits | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
000 | * | | j | jal | beq | bne | blez | bgtz |
001 | addi | addiu | | | andi | ori | xori | lui |
010 | | | | | | | | |
011 | | | | | | | | |
100 | | | | lw | | | | |
101 | | | | sw | | | | |
110 | | | | | | | | |
111 | | | | | | | | |
Opcode 000000 is used for all R-type instructions, which use the func field to determine the actual operation. The func operations you should support are:
high bits \ low bits | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
---|---|---|---|---|---|---|---|---|
000 | | | | | | | | |
001 | jr | jalr | | | | | | |
010 | | | | | | | | |
011 | | | | | | | | |
100 | add | addu | sub | subu | and | or | xor | nor |
101 | | | | | | | | |
110 | | | | | | | | |
111 | | | | | | | | |
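One possible way to hold the two tables above in a program is a single lookup array. The sketch below is an assumption about program structure, not a required design: the names `op_info`, `OPS`, and `find_op` are hypothetical, and the numeric codes are the standard MIPS values for the supported instructions.

```c
#include <stddef.h>
#include <string.h>

enum format { FMT_R, FMT_I, FMT_J };

struct op_info {
    const char *name;
    enum format fmt;
    unsigned code;      /* opcode for I- and J-type, func for R-type */
};

static const struct op_info OPS[] = {
    /* opcode table */
    {"j",    FMT_J, 0x02}, {"jal",   FMT_J, 0x03},
    {"beq",  FMT_I, 0x04}, {"bne",   FMT_I, 0x05},
    {"blez", FMT_I, 0x06}, {"bgtz",  FMT_I, 0x07},
    {"addi", FMT_I, 0x08}, {"addiu", FMT_I, 0x09},
    {"andi", FMT_I, 0x0c}, {"ori",   FMT_I, 0x0d},
    {"xori", FMT_I, 0x0e}, {"lui",   FMT_I, 0x0f},
    {"lw",   FMT_I, 0x23}, {"sw",    FMT_I, 0x2b},
    /* func table (opcode is 000000 for all of these) */
    {"jr",   FMT_R, 0x08}, {"jalr",  FMT_R, 0x09},
    {"add",  FMT_R, 0x20}, {"addu",  FMT_R, 0x21},
    {"sub",  FMT_R, 0x22}, {"subu",  FMT_R, 0x23},
    {"and",  FMT_R, 0x24}, {"or",    FMT_R, 0x25},
    {"xor",  FMT_R, 0x26}, {"nor",   FMT_R, 0x27},
};

/* Linear search by mnemonic; returns NULL for an unknown operation. */
const struct op_info *find_op(const char *name)
{
    for (size_t i = 0; i < sizeof OPS / sizeof OPS[0]; i++) {
        if (strcmp(OPS[i].name, name) == 0)
            return &OPS[i];
    }
    return NULL;
}
```

With only 24 entries a linear search is cheap, and a NULL result gives a natural place to report an unknown operation.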
Here is the same sample code as above with instruction addresses shown, labels filled in, and the encoding type and relevant components shown
    inst#  address  code              comment
    0      0        addi $1 $0 #12    ; i-type op=0x08, rd=1, rs=0, imm=12
    1      4        add $2 $0 $0      ; r-type fn=0x20, rd=2, rs=0, rt=0
    2      8        blez $1 #4        ; i-type op=0x06, rs=1, imm=4
    3      12       addi $1 $1 #-4    ; i-type op=0x08, rd=1, rs=1, imm=-4
    4      16       lw $3 :data $1    ; i-type op=0x23, rd=3, rs=1, imm=32
    5      20       add $2 $2 $3      ; r-type fn=0x20, rd=2, rs=2, rt=3
    6      24       j #2              ; j-type op=0x02, address=2
    7      28       j #7              ; j-type op=0x02, address=7
    8      32       int #0            ; data[0] = 0
    9      36       int #1            ; data[1] = 1
    10     40       int #2            ; data[2] = 2
Your output should be a list of c-format 32-bit hex integers. So, for the sample code, your output would be:
    0x2001000c,
    0x00001020,
    0x18200004,
    0x2021fffc,
    0x8c230020,
    0x00431020,
    0x08000002,
    0x08000007,
    0x00000000,
    0x00000001,
    0x00000002,
There are a few functions that may prove useful in implementing this project.
First is fgets(line, size, file), which reads a line from a stdio FILE* into a string buffer. This function requires a maximum line size (= size of the buffer). 256 should work fine. It will return NULL when you reach the end of the file. Using this rather than parsing directly from the file will make handling comments (which must skip the rest of a line) easier.
Second is strtok(line, " \t\v\r\n\f,()"). The second argument lists all of the standard "white space" characters, as well as the comma character and both parentheses. This call returns the first "token" in the line made entirely of characters other than white space, comma or parentheses. strtok(NULL, " \t\v\r\n\f,()") returns the next token in the same line. When there are no tokens left in the line, strtok returns NULL.
This particular assembly language is defined so you can tell what tokens are by the first character. Feel free to take advantage of that! Registers always start with $, labels always start with :, numbers always start with #, and comments always start with ;.
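Putting the comment rule, strtok, and the first-character rule together might look like the sketch below. The names `split_line`, `classify_token`, and `token_kind` are hypothetical. Cutting the line at the first ';' before tokenizing is one possible way to discard comments, not the only one.

```c
#include <string.h>

#define DELIMS " \t\v\r\n\f,()"

enum token_kind { TOK_REGISTER, TOK_LABEL, TOK_NUMBER, TOK_OPERATION };

/* Decide what a token is from its first character alone. */
enum token_kind classify_token(const char *tok)
{
    switch (tok[0]) {
    case '$': return TOK_REGISTER;
    case ':': return TOK_LABEL;
    case '#': return TOK_NUMBER;
    default:  return TOK_OPERATION;   /* mnemonic such as addi, lw, int */
    }
}

/*
 * Split one input line into tokens, in place.
 * Anything from ';' onward is dropped first. Pointers to at most
 * max_tokens tokens are stored in tokens[]; the count is returned
 * (0 for a blank or comment-only line).
 */
int split_line(char *line, char *tokens[], int max_tokens)
{
    char *comment = strchr(line, ';');
    if (comment != NULL)
        *comment = '\0';

    int n = 0;
    for (char *tok = strtok(line, DELIMS);
         tok != NULL && n < max_tokens;
         tok = strtok(NULL, DELIMS)) {
        tokens[n++] = tok;
    }
    return n;
}
```

Because strtok writes terminators into the buffer, the line must be a writable array (such as the one filled by fgets), not a string literal.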
sscanf(token, "#%i", &val) will skip the leading # and convert the remainder of the token into an integer, including doing any C-style integer conversions (handling signed, decimal, hex with leading 0x, etc.)
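As a small sketch, the sscanf call can be wrapped so that a malformed token is detected through sscanf's return value (the number of successful conversions). The wrapper names `parse_immediate` and `parse_register` are hypothetical, and reading a register with "$%d" is an assumption that mirrors the "#%i" pattern above.

```c
#include <stdio.h>

/* Parse "#<number>"; returns 1 on success, 0 if the token does not match. */
int parse_immediate(const char *token, int *val)
{
    return sscanf(token, "#%i", val) == 1;
}

/* Parse "$0" through "$31"; returns 1 on success, 0 otherwise. */
int parse_register(const char *token, int *reg)
{
    return sscanf(token, "$%d", reg) == 1 && *reg >= 0 && *reg <= 31;
}
```

One detail worth knowing: %i also treats a leading 0 as octal, so "#010" would parse as 8 rather than 10.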
fprintf(oFile, "0x%08x,\n", val) will print one integer in the expected output format (hex integer, padded with 0's to eight characters wide)
You'll want to use bitwise logical operators to manipulate the bits of the encoding: a&b is a bitwise and. It may be exceptionally useful for masking (for example, val & 0xFFFF will keep just 16 bits of a signed integer value). a|b is a bitwise or. It may be useful for composing parts of the encoded instruction together. a<<b (or a>>b) shift a left (or right) by b bits. This may be useful for positioning the parts of the encoded instruction to combine.
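As an illustration of masking, shifting, and or-ing, the three instruction formats could be packed as follows. The function names `encode_r`, `encode_i`, and `encode_j` are hypothetical. The field positions follow the format table, with the opcode in the top six bits, and the I-type second register is called rd to match this assignment's naming.

```c
#include <stdint.h>

/* R-type: opcode is always 000000, so only the other fields are placed. */
uint32_t encode_r(unsigned rs, unsigned rt, unsigned rd, unsigned sh, unsigned func)
{
    return ((uint32_t)(rs & 0x1F) << 21)
         | ((uint32_t)(rt & 0x1F) << 16)
         | ((uint32_t)(rd & 0x1F) << 11)
         | ((uint32_t)(sh & 0x1F) << 6)
         |  (uint32_t)(func & 0x3F);
}

/* I-type: the mask keeps the low 16 bits of a possibly negative immediate. */
uint32_t encode_i(unsigned opcode, unsigned rs, unsigned rd, int32_t imm)
{
    return ((uint32_t)(opcode & 0x3F) << 26)
         | ((uint32_t)(rs & 0x1F) << 21)
         | ((uint32_t)(rd & 0x1F) << 16)
         | ((uint32_t)imm & 0xFFFF);
}

/* J-type: 6-bit opcode followed by a 26-bit word address. */
uint32_t encode_j(unsigned opcode, uint32_t address)
{
    return ((uint32_t)(opcode & 0x3F) << 26) | (address & 0x03FFFFFF);
}
```

Checking against the sample output: encode_i(0x08, 1, 1, -4) gives 0x2021fffc for "addi $1, $1, #-4", and encode_r(2, 3, 2, 0, 0x20) gives 0x00431020 for "add $2, $2, $3".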
You may see references to some labels before their location is defined. There are two approaches you can use to deal with that. The first is to take a first pass through the file counting words and collecting label addresses, then use rewind(file) to start over. The second is to store the addresses of unresolved labels in a data structure, then fill them in when you find the definition of the label or at the end of the parsing.
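A minimal sketch of the first approach follows. It assumes every instruction and every int occupies exactly one 32-bit word, and it uses a fixed-size table. All names here (`label_table`, `first_pass_line`, `first_pass`, `find_label`) and the size limits are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LABELS 256
#define MAX_LINE   256
#define DELIMS " \t\v\r\n\f,()"

struct label { char name[64]; int address; };
struct label_table { struct label items[MAX_LABELS]; int count; };

/* Returns the byte address of a label, or -1 if it was never defined. */
int find_label(const struct label_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->items[i].name, name) == 0)
            return t->items[i].address;
    return -1;
}

/* Handle one line: record a leading label, then count one word if
   an instruction or int follows. The line is modified in place. */
void first_pass_line(char *line, struct label_table *t, int *address)
{
    char *comment = strchr(line, ';');
    if (comment != NULL)
        *comment = '\0';

    char *tok = strtok(line, DELIMS);
    if (tok == NULL)
        return;                         /* blank or comment-only line */

    if (tok[0] == ':') {                /* label definition */
        if (t->count < MAX_LABELS) {
            struct label *l = &t->items[t->count++];
            strncpy(l->name, tok, sizeof l->name - 1);
            l->name[sizeof l->name - 1] = '\0';
            l->address = *address;
        }
        tok = strtok(NULL, DELIMS);
        if (tok == NULL)
            return;                     /* label on its own line */
    }
    *address += 4;                      /* one word per instruction or int */
}

/* Read the whole file once, then rewind so the second pass can encode. */
void first_pass(FILE *in, struct label_table *t)
{
    char line[MAX_LINE];
    int address = 0;
    t->count = 0;
    while (fgets(line, sizeof line, in) != NULL)
        first_pass_line(line, t, &address);
    rewind(in);
}
```

The second pass can then call find_label whenever an operand starts with ':' and treat a result of -1 as an undefined-label error.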
All electronic submissions in this class will be done using the git version control system. You should look at the class git instructions before you start work. To get full credit for your submission, you should (1) check out a copy of the empty hw1 directory before you start, (2) do your work in that checked out copy, (3) submit several intermediate checkins with short but useful messages (e.g. "parsing works!"), and (4) check in your final submission before class starts on the day of the deadline.
Include a short file named "readme.txt" that describes how to build and run your program, as well as a description of any known problems or bugs. If your program is contained in a single C or C++ file, say named "program.c", just running "make program" should suffice to build it, but we need to know the name of your program. Bugs you identify in your readme will lose only half points. Bugs you don't identify that are found during grading will lose full points.