context-free grammar resource
In formal language theory, a context-free grammar (CFG) is a grammar in which
every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or
nonterminals (possibly empty). The term "context-free" expresses the fact that
nonterminals can be rewritten without regard to the context in which they occur.
A formal language is context-free if some context-free grammar generates it.
Context-free grammars play a central role in the description and design of
programming languages and compilers. They are also used for analysing the syntax
of natural languages.
Background
Since the time of Pā?ini, at least, linguists have described the grammars of
languages in terms of their block structure, and described how sentences are
recursively built up from smaller phrases, and eventually individual words or
word elements.
The context-free grammar (or "phrase structure grammar," as Chomsky called it)
formalism developed by Noam Chomsky took the way in which linguistics had
described this grammatical structure, and turned it into rigorous mathematics. A
context-free grammar provides a simple and precise mechanism for describing the
methods by which phrases in some natural language are built from smaller blocks,
capturing the "block structure" of sentences in a natural way. Its simplicity
makes the formalism amenable to rigorous mathematical study, but it comes at a
price: important features of natural language syntax such as agreement and
reference cannot be expressed in a natural way, or not at all.
Block structure was introduced into computer programming languages by the Algol
project, which, as a consequence, also featured a context-free grammar to
describe the resulting Algol syntax. This became a standard feature of computer
languages, and the notation for grammars used in concrete descriptions of
computer languages came to be known as Backus-Naur Form, after two members of
the Algol language design committee.
The "block structure" aspect that context-free grammars capture is so
fundamental to grammar that the terms syntax and grammar are often identified
with context-free grammar rules, especially in computer science. Formal
constraints not captured by the grammar are then considered to be part of the
"semantics" of the language.
Context-free grammars are simple enough to allow the construction of efficient
parsing algorithms which, for a given string, determine whether and how it can
be generated from the grammar. An Earley parser is an example of such an
algorithm, while the widely used LR and LL parsers are more efficient algorithms
that only deal with more restrictive subsets of context-free grammars.
Formal Definitions
A context-free grammar G is a 4-tuple:
where
1. is a finite set of non-terminal characters or variables. They represent
different types of phrase or clause in the sentence.
2. is a finite set of terminals, disjoint with , which make up the actual
content of the sentence.
3. is the start variable, used to represent the whole sentence (or program). It
must be an element of .
4. is a relation from to such that .
The members of are called the rules or productions of the grammar.
Additional Definition 1
For any strings , we say yields , written as , if such that and . Thus, v is the
result of applying the rule (α,β) to u.
Additional Definition 2
For any (or in some textbooks) if such that
Additional Definition 3
The language of a grammar is the set
Additional Definition 4
A language is said to be a context-free language (CFL) if there exists a CFG,
such that .
Examples
Example 1
A simple context-free grammar is given as:
S → aSb | ab;
where | is used to separate multiple options for the same non-terminal; so this
is the same as
S → aSb
S → ab
The terminals here are a and b, while the only non-terminal is S. This grammar
generates the language which is not regular.
The special character ε stands for the empty string. By changing the above
grammar to :S → aSb | ε we obtain a grammar generating the language instead.
This differs only in that it contains the empty string while the original
grammar did not.
Example 2
Here is a context-free grammar for syntactically correct infix algebraic
expressions in the variables x, y and z:
S → x | y | z | S + S | S - S | S * S | S/S | (S)
This grammar can, for example, generate the string "( x + y ) * x - z * y / ( x
+ x )" as follows: "S" is the initial string. "S - S" is the result of applying
the fifth transformation [S → S - S] to the nonterminal S. "S * S - S / S" is
the result of applying the sixth transform to the first S and the seventh one to
the second S. "( S ) * S - S / ( S )" is the result of applying the final
transform to certain of the nonterminals. "( S + S ) * S - S * S / ( S + S )" is
the result of the fourth and fifth transforms to certain nonterminals. "( x + y
) * x - z * y / ( x + x )" is the final result, obtained by using the first
three transformations to turn the S nonterminals into the terminals x, y, and z.
This grammar is ambiguous, meaning that one can generate the same string with
more than one parse tree. For example, "x + y * z" might have either the + or
the * parsed first; presumably these will produce different results.
Example 3
A context-free grammar for the language consisting of all strings over {a,b} for
which the number of a's and b's are different is
S → U | V
U → TaU | TaT
V → TbV | TbT
T → aTbT | bTaT | ε
Here, the nonterminal T can generate all strings with the same number of a's as
b's, the nonterminal U generates all strings with more a's than b's and the
nonterminal V generates all strings with fewer a's than b's.
Example 4
Another example of a context-free language is . This is not a regular language,
but it is context free as it can be generated by the following context-free
grammar:
S → bSbb | A
A → aA | ε
Other examples
Context-free grammars are not limited in application to mathematical ("formal")
languages. For example, it has been suggested that a class of Tamil poetry
called Venpa is governed by a context-free grammar.
Derivations and syntax trees
There are two common ways to describe how a given string can be derived from the
start symbol of a given grammar. The simplest way is to list the consecutive
strings of symbols, beginning with the start symbol and ending with the string,
and the rules that have been applied. If we introduce a strategy such as "always
replace the left-most nonterminal first" then for context-free grammars the list
of applied grammar rules is by itself sufficient. This is called the leftmost
derivation of a string. For example, if we take the following grammar:
(1) S → S + S
(2) S → 1
(3) S → a
and the string "1 + 1 + a" then a left derivation of this string is the list [
(1), (1), (2), (2), (3) ]. Analogously the rightmost derivation is defined as
the list that we get if we always replace the rightmost nonterminal first. In
this case this could be the list [ (1), (3), (1), (2), (2)].
The distinction between leftmost derivation and rightmost derivation is
important because in most parsers the transformation of the input is defined by
giving a piece of code for every grammar rule that is executed whenever the rule
is applied. Therefore it is important to know whether the parser determines a
leftmost or a rightmost derivation because this determines the order in which
the pieces of code will be executed. See for an example LL parsers and LR
parsers.
A derivation also imposes in some sense a hierarchical structure on the string
that is derived. For example, if the string "1 + 1 + a" is derived according to
the leftmost derivation:
S→S+S (1)
→S+S+S (1)
→1+S+S (2)
→1+1+S (2)
→1+1+a (3)
the structure of the string would be:
{ { { 1 }S + { 1 }S }S + { a }S }S
where { ... }S indicates a substring recognized as belonging to S. This
hierarchy can also be seen as a tree:
S
/|\
/ | \
/ | \
S '+' S
/|\ |
/ | \ |
S '+' S 'a'
| |
'1' '1'
This tree is called a concrete syntax tree (see also abstract syntax tree) of
the string. In this case the presented leftmost and the rightmost derivations
define the same syntax tree; however, there is another (leftmost) derivation of
the same string
S→ S + S (1)
→ 1 + S (2)
→ 1 + S + S (1)
→ 1 + 1 + S (2)
→ 1 + 1 + a (3)
and this defines the following syntax tree:
S
/|\
/ | \
/ | \
S '+' S
| /|\
| / | \
'1' S '+' S
| |
'1' 'a'
If, for certain strings in the language of the grammar, there is more than one
parsing tree, then the grammar is said to be an ambiguous grammar. Such grammars
are usually hard to parse because the parser cannot always decide which grammar
rule it has to apply.
Normal forms
Every context-free grammar that does not generate the empty string can be
transformed into one in which no rule has the empty string as a product [a rule
with ε as a product is called an ε-production]. If it does generate the empty
string, it will be necessary to include the rule , but there need be no other
ε-rule. Every context-free grammar with no ε-production has an equivalent
grammar in Chomsky normal form or Greibach normal form. "Equivalent" here means
that the two grammars generate the same language.
Because of the especially simple form of production rules in Chomsky Normal Form
grammars, this normal form has both theoretical and practical implications. For
instance, given a context-free grammar, one can use the Chomsky Normal Form to
construct a polynomial-time algorithm which decides whether a given string is in
the language represented by that grammar or not (the CYK algorithm).
Undecidable problems
Although some operations on context-free grammars are decidable due to their
limited power, CFGs do have interesting undecidable problems. One of the
simplest and most cited is the problem of deciding whether a CFG accepts the
language of all strings. A reduction can be demonstrated to this problem from
the well-known undecidable problem of determining whether a Turing machine
accepts a particular input. The reduction uses the concept of a computation
history, a string describing an entire computation of a Turing machine. We can
construct a CFG that generates all strings that are not accepting computation
histories for a particular Turing machine on a particular input, and thus it
will accept all strings only if the machine does not accept that input.
As a consequence of this, it is also undecidable whether two CFGs describe the
same language, since we can't even decide whether a CFG is equivalent to the
trivial CFG deciding the language of all strings.
Another point worth mentioning is that the problem of determining if a
context-sensitive grammar describes a context-free language is undecidable.
Extensions
An obvious way to extend the context-free grammar formalism is to allow
nonterminals to have arguments, the values of which are passed along within the
rules. This allows natural language features such as agreement and reference,
and programming language analogons such as the correct use and definition of
identifiers, to be expressed in a natural way. E.g. we can now easily express
that in English sentences, the subject and verb must agree in number.
In computer science, examples of this approach include affix grammars, attribute
grammars, indexed grammars, and Van Wijngaarden two-level grammars.
Similar extensions exist in linguistics.
Another extension is to allow additional symbols to appear at the left hand side
of rules, constraining their application. This produces the formalism of
context-sensitive grammars.
Linguistic applications
Chomsky himself initially hoped to overcome the limitations of context-free
grammars by adding transformation rules.
Such rules are another standard device in traditional linguistics; e.g.
passivization in English. However, arbitrary transformations must be disallowed,
since they are much too powerful (Turing complete). Much of generative grammar
has been devoted to finding ways of refining the descriptive mechanisms of
phrase-structure grammar and transformation rules such that exactly the kinds of
things can be expressed that natural language actually allows.
His general position regarding the non-context-freeness of natural language has
held up since then , although his specific examples regarding the inadequacy of
CFGs in terms of their weak generative capacity were later disproven. Gerald
Gazdar and Geoffrey Pullum have argued that despite a few non-context-free
constructions in natural language (such as cross-serial dependencies in Swiss
German and reduplication in Bambara ), the vast majority of forms in natural
language are indeed context-free.[