Notes on Formal Languages

Basics

Formal languages are a tool for describing, modelling or recognizing practical languages.  The primary focus is on the 'syntax' of the language.

To define a particular formal language, we first pick a set of symbols--an alphabet.  (Depending on the langugage we hope to define, the "symbols" might be simple things "a" or "b" or they could be more complex structures--like "if", "then" or "+=".)  Call that alphabet "S".  

We call an arbitrary combination of symbols (allowing both repeats and unused symbols) a word.  A formal language is a set of words over some alphabet.  Two obvious examples of such a language are 1) the empty set (that is no words at all, which works regardless of the alphabet) and 2) S* (kleene star operator) which is the set of _all_ possible words over the alphabet.  The interesting languages are the ones which are non-trivial subsets of S*.

A language can be defined either by generator (a group of rules which allow you to "build" words in the language) or a recognizer (a machine which checks whether a suggested word is in the language), but we will typically describe the language with the generator, and then try to build recognizers to more practically manage.  

Regular Languages

Regular languages are typically generated by regular expressions.  We can define regular expressions like this:

1) The empty string is a valid regular expression.
2) Any symbol in the language is a valid regular expression.
3) If A is a valid regex, then A+ and A* are valid regexes.
4) If A and B are valid regexes, then A|B is a valid regex.
5) If A is a valid regex, then (A) is a valid regex.

The meaning of these rules is as follows:

1) This means that the empty string is a valid word in the language (assuming we use the rule in the definition of our language).
2) This means that the symbol (in offered rule) is a valid word in the language.
3) A+ means that we accept words which are an arbitrary (but not zero) number of copies of words from A.
A* means that we also accept the zero-length words from A.
4) A|B means that any word in either A or B is a valid word in the langauge being defined.
5) (A) means the same thing as the language A, its just a notional convenience.

Thought Problem:  Describe how to define A* just using the other bits of the definition.

Finite Automata

Let's be super lazy and ignore the empty string.  

A finite automata is a machine designed to recognize regular languages.  It consists of:

a) a finite set of symbols called the alphabet (S)
b) a finite set of states (Q)
c) an initial state (where the machine starts, called q0)
d) a set of final states (F, which is a subset of Q)
e) a transition function (D: Q x S -> S)

The machine is used to determine if a candidate word is a member of the language.  It works like this:

The FA starts in state q0.


We read the next character in the candidate word and apply the transition function to determine the next state.
If there is no valid transition (given the current state and the symbol), then the candidate word is not accepted (the machine halts).


When the candidate word (input string) is exhausted, the machine checks whether the current state is in F.  
If it is, the word is accepted.
If it is not, the word is not accepted.

Thought problem: Devise metamachines which--given basic machines which recognizes the regex A and B--can recognize languages like A* or B+ or A|B.


Context-Free Grammar


CFGs can be used to define more complex languages than those definable by regular expressions (N.B. except for those languages which are finite, mostly languages which are defined by a Regex or a CFG are countably infinite, so the languages are the same size).

A CFG looks like this:

a) a set of terminal symbols (S, the alphabet)
b) a set of non-terminal symbols (V, it is important that V and S are disjoint).
c) a distinguished start nont-terminal S0.
d) a set of production rules.

Productions rules are of the form V -> W   (where W is in (S U V)*, that is arbitrary strings that are made of members of either S or V).

A "derivation" is a series of applications of production rules (starting from just S0), which use the production rule to replace a non-terminal symbol with the RHS of the rule, which ends with a string consisting of only non-terminals.

e.g. Consider the following set of production rules:

S0 -> AA
A -> x
A -> BB
B -> y

then an example derivation would be  S0 ->  AA  ->  xA -> xBB -> xyB -> xyy   (which means xyy is a 'word' in the language.)