The goal of the semester project is be sure you can work with formal grammars. Each grammar can generate a set of strings that is a formal language.
For project grammars we will use the following constraints: All variables in the grammar will be a single upper case letter followed by digits. (95% credit if you only handle single letters) (You would not be able to easily handle Greibach Normal Form reductions without the digits.) 'S' will always be the starting variable but may not be the first rule. ('S' could get changed internally to A1 for some algorithms. e.g. Greibach) All terminal symbols in the grammar will be a single lower case letter, and all lower case letters will be terminal symbols. The null string, epsilon, will be represented by the "at sign" character @ . Productions, rules, will be of the form: S -> a S -> SS A -> BacD A1 -> bA27c (only in last 5% grammar g3a.g) X -> @ The first item, not necessarily in column 1, is a variable. The second item, always with at least one space before and after, is the arrow represented as two characters "->" minus and grater characters. The third item is one or more terminal symbols and variables in any order. (The standard data for the course will not use the '|' or symbol in the productions but you can implement it if you wish. Rules in the data need to be in the simplified form without the '|'.) A given grammar, G = (V, T, P, S) can be determined from the productions: V is the set of upper case letters (and digits) found in the productions. T is the set of lower case letters found in the productions. P is the productions of the form V -> alpha S is always S for the input grammar Comments are lines with a '#' in column 1.
A sample input file, g1.g is # g1.g a test grammar for the project from Book P83 # # G = (V, T, P, S) V={S,A} T={a,b} S=S # S -> aAS S -> a A -> SbA A -> SS A -> ba
A string of terminal symbols might be abaa which is accepted A string of terminal symbols might be abbab which is not accepted
The project is to implement the CYK algorithm on p 139-141 in text book. Your computer program, in the language of your choice, will 1) read a grammar as defined above from a file. (fixed name or command line) 2a) print the grammar from your internal data structure. 2b) print the grammar after converted to Chomsky Normal Form. 3) read a string of terminal symbols from the keyboard, 4) print the string that is read, 5) print "accepted" or "rejected" depending on the string being in the grammar or not being in the grammar. Your program should loop back and read another string until end of file. i.e. keep repeating steps 3) 4) and 5) until end of file. Below are three grammars (pick 3 or 3a) and four strings for each grammar. Run them and turn in: a) source code listing of your program b) output from three runs (four if you did variable with digits)
The CYK algorithm expects the grammar in Chomsky Normal Form, that is: productions can be one of two formats A -> a or A -> BC The grammars will have the "simplification" steps 1), 2) and 3) out of the way, that is 1) No useless variables, 2) No nullable variables and 3) no unit productions. But, the grammars are not in Chomsky Normal Form, so, either convert the grammar by hand, YUK!, or program in the conversion of the raw grammar to Chomsky Normal Form. Step 4) of "simplification" is the following algorithm: 'length' refers to the number of variables plus terminal symbols on the right side of a production. Loop through the productions For each production with length greater than 1 do Replace each terminal symbol with a new variable and add a production new variable -> terminal symbol. (An optimization is to reuse a variable name that has an existing production variable name -> terminal symbol and not add another production) Loop through the productions For each production with length grater than 2 do Replace two rightmost variables with a new variable and add a production new variable -> two rightmost variables. (Repeat - either on a production or loop until no replacements.) Now the grammar, as represented by the productions, is in Chomsky Normal Form. proceed with CYK. An optimization is possible but not required, for any two productions with the same right side, delete the second production and replace all occurrences of the second productions left variable with the left variable of the first production in all productions. (This could be needed if you run out of upper case letters for variables.)
Programming hints (that you can ignore) The table, matrix, V is built for each input string. You may want to create it dynamically based on the length of the input string, called x in Fig. 6.8. If you choose to make it a fixed size matrix, it must be at least 64 by 64. It is easiest to make V a square matrix with number of rows equal to number of columns equal to the length of the input string. A matrix element, V(i,j) or V[i,j] only holds a pointer to a set. A set can be made by a simple linked list. It would horrify your algorithms instructor, but for this algorithm, set union can be just hook the new element onto the linked list. (Set union should really check that no duplicate element ever gets into a set.) Linked lists, by their nature, are usually created dynamically, but if you want a fixed number of links, allow for at least 50,000. You will need a set 'find' function that takes a variable and searches the linked list and returns 'found' or 'not found' to implement parts of CYK such as Step 6) e.g. B is in V(i,k) and C is in V(i+k,j-k). Here "B" is the first of two variables in the production A -> BC and C is the second variable. Be careful implementing this step, BOTH 'find' must return 'found'. If you are using C you can run all the loops from 0 to n-1 and fix up all the related values or just waste a row and a column with a loop from 1 to n. A short cut to end the algorithm: For an input string of length n, whenever a variable is added to V(1,n), check to see if it is the starting variable, 'S' and print "accepted" and go read the next input string if it is 'S'. Then if you reach the end of the algorithm, print "rejected". Unique variables, upper case letters, can start with Z and work backwards (but skip S). Check g2.g to be sure you do not create what you think is a new variable, but duplicates an existing variable. You may be better off ignoring the "short cuts" and doing a good quality program because often good quality takes less time.
DO NOT LEAVE THIS TO THE LAST DAY! Expect to spend 8 hours of very careful programming on this project. (It will take 40 hours if you are sloppy or careless.) Comment your code as much for yourself as for grading. Indicate what "simplification" or "step" is being executed. Use indentation to match the structure of your code. You can leave debug print in your code, under control of something to turn it off when generating the output that is turned in.
Grading: 50% for first grammar correct 75% for first and second grammar correct 100% for first, second and third grammar correct 105% for also handling variables with numbers. g3a.* Points can be lost by: Partially incorrect results, Lack of comments to follow what code is doing at a coarse level. Really bad programming practice (not specifically allowed above.) Partial credit will be given for partially working code. Get some amount of output.
Last updated 11/26/98