CMSC 451 Automata Theory and Formal Languages Project

 
  The goal of the semester project is be sure you can work with
  formal grammars. Each grammar can generate a set of strings
  that is a formal language.

Grammar Definition

For project grammars we will use the following constraints:

All variables in the grammar will be a single upper case letter followed
by digits. (95% credit if you only handle single letters)
(You would not be able to easily handle Greibach Normal Form
 reductions without the digits.)

'S' will always be the starting variable but may not be the first rule.
('S' could get changed internally to A1 for some algorithms. e.g. Greibach)

All terminal symbols in the grammar will be a single lower case letter,
and all lower case letters will be terminal symbols.

The null string, epsilon,  will be represented by the "at sign" character @ .

Productions, rules, will be of the form:

S -> a
S -> SS
A -> BacD
A1 -> bA27c   (only in last 5% grammar g3a.g)
X -> @

The first item, not necessarily in column 1, is a variable.
The second item, always with at least one space before and after, is
the arrow represented as two characters "->" minus and grater characters.
The third item is one or more terminal symbols and variables in any order.
(The standard data for the course will not use the '|' or symbol
in the productions but you can implement it if you wish. Rules in
the data need to be in the simplified form without the '|'.)

A given grammar, G = (V, T, P, S) can be determined from the productions:
V is the set of upper case letters (and digits) found in the productions.
T is the set of lower case letters found in the productions.
P is the productions of the form  V -> alpha
S is always S for the input grammar

Comments are lines with a '#' in column 1.

Sample input grammar

A sample input file, g1.g is

# g1.g  a test grammar for the project from  Book P83
#
# G = (V, T, P, S)  V={S,A} T={a,b} S=S
# 
S -> aAS
S -> a
A -> SbA
A -> SS
A -> ba 

Sample input strings

A string of terminal symbols might be  abaa  which is accepted
A string of terminal symbols might be  abbab which is not accepted

Project Implementation

The project is to implement the CYK algorithm on p 139-141 in text book.
Your computer program, in the language of your choice, will
1)  read a grammar as defined above from a file. (fixed name or command line)
2a) print the grammar from your internal data structure.
2b) print the grammar after converted to Chomsky Normal Form.
3)  read a string of terminal symbols from the keyboard,
4)  print the string that is read,
5)  print "accepted" or "rejected" depending on the string being 
    in the grammar or not being in the grammar.

Your program should loop back and read another string until end of file.
i.e. keep repeating steps 3) 4) and 5) until end of file.
   
Below are three grammars (pick 3 or 3a) and four strings for each grammar.
Run them and turn in:
a) source code listing of your program
b) output from three runs (four if you did variable with digits)

But, you need Chomsky Normal Form

The CYK algorithm expects the grammar in  Chomsky Normal Form, 
that is: productions can be one of two formats A -> a or A -> BC

The grammars will have the "simplification" steps 1), 2) and 3) out
of the way, that is 1) No useless variables, 2) No nullable variables and
3) no unit productions. But, the grammars are not in Chomsky Normal Form,
so, either convert the grammar by hand, YUK!, or program in the
conversion of the raw grammar to Chomsky Normal Form.

Step 4) of "simplification" is the following algorithm:

   'length' refers to the number of variables plus terminal
   symbols on the right side of a production.

   Loop through the productions
     For each production with length greater than 1 do
       Replace each terminal symbol with a new variable and
       add a production   new variable -> terminal symbol.
       (An optimization is to reuse a variable name that has an
        existing production    variable name -> terminal symbol
        and not add another production)

   Loop through the productions
     For each production with length grater than 2 do
       Replace two rightmost variables with a new variable and
       add a production    new variable -> two rightmost variables.
       (Repeat - either on a production or loop until no replacements.)

   Now the grammar, as represented by the productions, is in Chomsky
   Normal Form. proceed with CYK. 

   An optimization is possible but not required, for any two productions
   with the same right side, delete the second production and
   replace all occurrences of the second productions left variable
   with the left variable of the first production in all productions.
   (This could be needed if you run out of upper case letters
    for variables.)

Hints that may or may not help

Programming hints (that you can ignore)

The table, matrix, V is built for each input string.
You may want to create it dynamically based on the length of the
input string, called x in Fig. 6.8. If you choose to make it a
fixed size matrix, it must be at least 64 by 64. It is easiest
to make V a square matrix with number of rows equal to number of
columns equal to the length of the input string. A matrix element,
V(i,j) or V[i,j] only holds a pointer to a set.

A set can be made by a simple linked list. It would horrify your
algorithms instructor, but for this algorithm, set union can be just
hook the new element onto the linked list. (Set union should really
check that no duplicate element ever gets into a set.)

Linked lists, by their nature, are usually created dynamically, but
if you want a fixed number of links, allow for at least 50,000.

You will need a set 'find' function that takes a variable and searches
the linked list and returns 'found' or 'not found' to implement
parts of CYK such as Step 6)  e.g.  B is in V(i,k) and
C is in V(i+k,j-k). Here "B" is the first of two variables in the
production  A -> BC  and C is the second variable.
Be careful implementing this step, BOTH 'find' must return 'found'.

If you are using C you can run all the loops from 0 to n-1 and
fix up all the related values or just waste a row and
a column with a loop from 1 to n.

A short cut to end the algorithm: For an input string of length n,
whenever a variable is added to V(1,n), check to see if it is the
starting variable, 'S' and print "accepted" and go read the next
input string if it is 'S'. Then if you reach the end of the algorithm,
print "rejected".

Unique variables, upper case letters, can start with Z and work
backwards (but skip S). Check g2.g to be sure you do not create what
you think is a new variable, but duplicates an existing variable.

You may be better off ignoring the "short cuts" and doing a good
quality program because often good quality takes less time.

Amount of effort

DO NOT LEAVE THIS TO THE LAST DAY!
Expect to spend 8 hours of very careful programming on this project.
(It will take 40 hours if you are sloppy or careless.)
Comment your code as much for yourself as for grading.
Indicate what "simplification" or "step" is being executed.
Use indentation to match the structure of your code.
You can leave debug print in your code, under control of something
to turn it off when generating the output that is turned in.

Grading policy

 Grading: 50% for first grammar correct
          75% for first and second grammar correct
         100% for first, second and third grammar correct
         105% for also handling variables with numbers. g3a.*

 Points can be lost by: 
   Partially incorrect results,
   Lack of comments to follow what code is doing at a coarse level.
   Really bad programming practice (not specifically allowed above.)


 Partial credit will be given for partially working code.
 Get some amount of output.

Last updated 11/26/98