12 September 2013
5 September 2013
In this homework, you will write an assembler in C or C++ for a simplified subset of the MIPS instruction set. Your program should take two arguments, the name of an assembler input file and the name of the hex output file.
Instructions are written with an operation followed by a list of operands separated by white space (for ease of parsing, we will NOT have commas between operands):
operation operand1 [operand2 ...]
Register operands are written $0 to $31, though $0 is hard-wired to 0. Immediate constants are given as a c-style number, for example '100' is decimal 100, while '0x100' is hex 100 or decimal 256, or with a label. In addition, labels can be defined at the beginning of a line with a leading colon (:label), and can appear as an immediate constant without the colon (label). Note that using a label might produce different immediate values in the encoded instruction depending on the type of instruction. Data accesses use the real address byte, while a branch would need to compute the relative word offset to the label from the next instruction.
For ease of parsing, we will write data addresses as "register offset", rather than the more common "offset(register)", for example "$1 label" not "label($1)"
Anything from a '#' character to the end of the line should be discarded as a comment. Any blank or comment only lines should be ignored, and you should allow any amount of white space between elements.
We will also use one pseudo-instruction, int
, whose "operands" are a list of word-sized integers to store as sequential data elements
All instructions are encoded in 32 bits in one of three formats:
R-type | opcode (6b) | rs (5b) | rt (5b) | rd (5b) | sh (5b) | func (6b) |
---|---|---|---|---|---|---|
I-type | opcode (6b) | rs (5b) | rd (5b) | immediate (16b) | ||
J-type | opcode (6b) | address (26b) |
For R-type instructions the registers are written in the order "rd rs rt
", where rd
is the destination register. For I-type instructions they are written in the order "rd rs
", where rd
is the destination register.
You should support the following opcodes (see the textbook for descriptions)
low bits | |||||||||
000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | ||
high bits | 000 | * | j | jal | beq | bne | blez | bgtz | |
001 | addi | addiu | andi | ori | xori | lui | |||
010 | |||||||||
011 | |||||||||
100 | lw | ||||||||
101 | sw | ||||||||
110 | |||||||||
111 |
Opcode 000000 is used for all R-type instructions, which use the func field to determine the actual operation. The func operations you should support are:
low bits | |||||||||
000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | ||
high bits | 000 | ||||||||
001 | jr | jalr | |||||||
010 | |||||||||
011 | mult | multu | div | divu | |||||
100 | add | addu | sub | subu | and | or | xor | nor | |
101 | |||||||||
110 | |||||||||
111 |
Your output should be a list of c-format 32-bit hex integers:
0x12345678, 0x87654321, ...
For example, here is a section of legal assembly code
addi $1 $0 10 # x = 10 add $2 $0 $0 # y = 0 :loop addi $1 $1 -4 # x = x - 4 (1 word) lw $3 $1 data # z = data[x] add $2 $2 $3 # y = y + z bgtz $1 loop # loop if x>0 :end j end # loop forever to end :data # label can be on its own line int 0 1 2 3 4 # data pseudo-instruction int 5 6 7 8 9
Here is the same code with instruction addresses shown, labels filled in, and the encoding type and relevant components shown
0 addi $1 $0 10 # i-type op=0x08, rd=1, rs=0, imm=10
4 add $2 $0 $0 # r-type fn=0x20, rd=2, rs=0, rt=0
8 :loop addi $1 $1 -4 # i-type op=0x08, rd=1, rs=1, imm=-4
12 lw $3 $1 28 # i-type op=0x23, rd=3, rs=1, imm=28
16 add $2 $2 $3 # r-type fn=0x20, rd=2, rs=2, rt=3
20 bgtz $1 -4 # i-type op=0x07, rd=0, rs=1, imm=-4
24 :end j 6 # j-type op=0x02, address=6
28 :data int 0 1 2 3 4 # data
48 int 5 6 7 8 9
and here is what you'd print as output
0x2001000a,
0x00001020,
0x2021fffc,
0x8c23001c,
0x00431020,
0x1c20fffc,
0x08000006,
0x00000000,
0x00000001,
0x00000002,
0x00000003,
0x00000004,
0x00000005,
0x00000006,
0x00000007,
0x00000008,
0x00000009,
There are a few functions that may prove useful in implementing this project.
First is fgets(line, size, file)
, which reads a line from a file into a string buffer. This function requires a maximum line size (= size of the buffer). 256 should work fine. It will return NULL when you reach the end of the file.
Second is strtok(line, " \t\v\r\n\f")
. The second string there contains a space, tab, vertical tab, carriage return, line feed, and form feed (the standard "white space" characters). This strtok returns the first "token" in the line made entirely of characters other than white space. You will also want to use strtok(NULL, " \t\v\r\n\f")
, which returns the next token in the same line. When there are no tokens left in the line, strtok returns NULL
.
sscanf(token, "%i", &val)
will convert a string to an integer, including doing any C-style integer conversions (handling signed, decimal, hex with leading 0x, etc.)
fprintf(oFile, "0x%08x,\n", val)
will print one integer in the expected output format (hex integer, padded with 0's to eight characters wide)
You'll probably want to use these bitwise logical operators to manipulate the bits of the encoding: a & b
is a bitwise and. It may be exceptionally useful for masking (for example, val & 0xFFFF
will keep just 16 bits of a signed integer value). a | b
is a bitwise or. It may be useful for composing parts of the encoded instruction together. a<<b
(or a>>b
) shift a
left (or right) by b
bits. This may be useful for positioning the parts of the encoded instruction to combine.
You may see references to some labels before their location is defined. There are two approaches you can use to deal with that. The first is to take a first pass through the file counting words and collecting label addresses, then use rewind(file)
to start over. The second is to store the addresses of unresolved labels in a data structure, then fill them in when you find the definition of the label.
All electronic submissions in this class will be done using the CVS version control system. You should look at the class CVS instructions before you start work. To get full credit for your submission, you should (1) check out a copy of the empty hw1 directory before you start, (2) do your work in that checked out copy, (3) submit several intermediate checkins with short but useful messages (e.g. "parsing works!"), and (4) check in your final submission before class starts on the day of the deadline.
Include a short file named "readme.txt" that describes how to build and run your program, as well as a description of any known problems or bugs. If your program is contained in a single C or C++ file, say named "program.c", just running "make program" should suffice to build it, but we need to know the name of your program. Bugs you identify in your readme will lose only half points. Bugs you don't identify that are found during grading will lose full points.