
BNF (Backus Normal Form or Backus–Naur Form)

BNF is an acronym for "Backus–Naur Form". John Backus and Peter Naur introduced this formal
notation to describe the syntax of a given language. BNF is a notation technique for
context-free grammars, often used to describe the syntax of languages used in
computing, such as computer programming languages, document formats,
instruction sets, and communication protocols. It is applied wherever exact
descriptions of languages are needed, for instance, in official language
specifications, in manuals, and in textbooks on programming language theory.

A BNF specification is a set of derivation rules, written as

<symbol> ::= __expression__

where <symbol> is a nonterminal, and the __expression__ consists of one or more sequences of
symbols; alternative sequences are separated by the vertical bar, '|', indicating a choice, the whole
being a possible substitution for the symbol on the left. Symbols that never appear on a left-hand
side are terminals. On the other hand, symbols that appear on a left-hand side are nonterminals
and are always enclosed in angle brackets, < >.

The meta-symbols of BNF are:


::=   meaning "is defined as"
|     meaning "or"
< >   angle brackets used to surround category names

The angle brackets distinguish syntax-rule names (also called non-terminal symbols) from
terminal symbols, which are written exactly as they are to be represented. A BNF rule defining a
nonterminal has the form:

nonterminal ::= sequence_of_alternatives

where the alternatives consist of strings of terminals or nonterminals separated by the
meta-symbol |.
For example, the BNF production for a mini-language is:

<program> ::= program
                  <declaration_sequence>
              begin
                  <statements_sequence>
              end ;
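The terminal/nonterminal distinction described above can be computed mechanically once a grammar is encoded as data. A minimal sketch in Python (the dictionary encoding and the helper name are our own; <declaration_sequence> and <statements_sequence> are given empty bodies purely for illustration, since the text leaves them undefined):

```python
# The mini-language grammar above, encoded as a dict mapping each
# nonterminal to a list of alternatives (sequences of symbols).
grammar = {
    "<program>": [["program", "<declaration_sequence>",
                   "begin", "<statements_sequence>", "end", ";"]],
    "<declaration_sequence>": [[]],   # placeholder: not defined in the text
    "<statements_sequence>": [[]],    # placeholder: not defined in the text
}

def terminals(g):
    """Symbols that never appear on a left-hand side are terminals."""
    on_left = set(g)
    every = {sym for alts in g.values() for alt in alts for sym in alt}
    return every - on_left

print(sorted(terminals(grammar)))  # [';', 'begin', 'end', 'program']
```

Running `terminals` confirms that `program`, `begin`, `end`, and `;` are the terminals of this fragment, exactly as the definition predicts.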
Types of Grammar:
Recall that a formal grammar $G=(\Sigma,N,P,\sigma)$ consists of an alphabet
$\Sigma$ , an alphabet $N$ of non-terminal symbols properly included in $\Sigma$ , a
non-empty finite set $P$ of productions, and a symbol $\sigma\in N$ called the start
symbol. The non-empty alphabet $T:=\Sigma-N$ is the set of terminal symbols. Then
$G$ is called a

Type-0 grammar

if there are no restrictions on the productions. A type-0 grammar is also known as an


unrestricted grammar, or a phrase-structure grammar.

Type-1 grammar

if the productions are of the form $uAv \to uWv$ , where $u,v,W\in \Sigma^*$ with $W\ne
\lambda$ , and $A\in N$ , or $\sigma\to \lambda$ , provided that $\sigma$ does not
occur on the right-hand side of any production in $P$ . As $A$ is surrounded by the words
$u,v$ , a type-1 grammar is also known as a context-sensitive grammar.

Type-2 grammar

if the productions are of the form $A\to W$ , where $A\in N$ and $W\in \Sigma^*$ .
Type-2 grammars are also called context-free grammars, because the left-hand side of
any production is ``free'' of context.

Type-3 grammar

if the productions are of the form $A\to u$ or $A\to uB$ , where $A,B\in N$ and $u\in
T^*$ . Owing to the fact that languages generated by type-3 grammars can be
represented by regular expressions, type-3 grammars are also known as regular
grammars.
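The equivalence between type-3 grammars and regular expressions can be illustrated directly. A sketch in Python, using a hypothetical grammar not taken from the text, S -> "ab" S | "ab", which generates the language of the regular expression (ab)+; productions of the form A -> uB are simulated by recursive right-linear matching:

```python
import re

# Hypothetical type-3 grammar: each rule is (u, B) for A -> uB,
# or (u, None) for A -> u, with u a terminal string.
rules = {"S": [("ab", "S"), ("ab", None)]}

def accepts(nonterminal, s):
    """True iff `nonterminal` derives the terminal string `s`."""
    for u, nxt in rules[nonterminal]:
        if nxt is None:
            if s == u:
                return True
        elif s.startswith(u) and accepts(nxt, s[len(u):]):
            return True
    return False

# The grammar and the regular expression agree on every test string.
for w in ["ab", "abab", "ababab", "aab", "a", ""]:
    assert accepts("S", w) == bool(re.fullmatch("(ab)+", w))
```

The recursion terminates because every `u` in the rules is nonempty, so each step consumes input; this mirrors why type-3 derivations correspond to runs of a finite automaton.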

It is clear that every type-$i$ grammar is type-0, and every type-3 grammar is type-2. A type-2
grammar is not necessarily type-1, because it may contain both $\sigma\to \lambda$ and $A\to
W$ , where $\sigma$ occurs in $W$ . Nevertheless, the relevance of the hierarchy has more
to do with the languages generated by the grammars. Call a formal language a type-$i$
language if it is generated by a type-$i$ grammar, and denote by $\mathcal{L}_i$ the family of
type-$i$ languages. Then it can be shown that

grammar    language family           abbreviation               automaton
type-0     recursively enumerable    $\mathcal{L}_0$ or RE      Turing machine
type-1     context-sensitive         $\mathcal{L}_1$ or CSL     linear bounded automaton
type-2     context-free              $\mathcal{L}_2$ or CFL     pushdown automaton
type-3     regular                   $\mathcal{L}_3$ or REG     finite automaton

Classification of Grammars
Due to Noam Chomsky (1956)
Grammars are sets of productions of the form α = β.

class 0: Unrestricted grammars (α and β arbitrary)
    e.g.: X = a X b | Y c Y.
          aYc = d.
          dY = bb.
    X ⇒ aXb ⇒ aYcYb ⇒ dYb ⇒ bbb
    Recognized by Turing machines

class 1: Context-sensitive grammars (|α| ≤ |β|)
    e.g.: a X = a b c.
    Recognized by linear bounded automata

class 2: Context-free grammars (α = a single nonterminal NT, β ≠ ε)
    e.g.: X = a b c.
    Recognized by push-down automata

class 3: Regular grammars (α = a single nonterminal NT, β = T or T NT)
    e.g.: X = b | b Y.
    Recognized by finite automata

Only classes 2 and 3 are relevant in compiler construction.
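The class-0 derivation X ⇒ aXb ⇒ aYcYb ⇒ dYb ⇒ bbb above can be replayed mechanically as string rewriting; the multi-symbol left-hand sides (aYc, dY) are exactly what make the grammar unrestricted. A minimal sketch:

```python
# Each step applies one production, written (lhs, rhs), to the
# leftmost occurrence of lhs in the current string.
steps = [("X", "aXb"),    # X = a X b
         ("X", "YcY"),    # X = Y c Y
         ("aYc", "d"),    # aYc = d   (multi-symbol left-hand side)
         ("dY", "bb")]    # dY = bb   (multi-symbol left-hand side)

form = "X"
for lhs, rhs in steps:
    form = form.replace(lhs, rhs, 1)  # one leftmost rewrite per step
    print(form)
# prints: aXb, aYcYb, dYb, bbb
```

Each printed line matches one step of the derivation in the text, ending in the terminal string bbb.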
Introduction to Compilers


• What is a Compiler?

1. A compiler is software (a program) that


translates a high-level programming language to
machine language. So, a simple representation would
be:

Source Code ----> Compiler -----> Machine Language


(Object File)
2. But, a compiler has to translate high-level code
to machine language, so it's not as simple as an
assembler translator. A compiler has to perform
several steps to produce an object file in machine
code form.

• Analysis of the source code:


o Lexical Analysis: scan the input
source file to identify tokens of the
programming language. Tokens are basic
units (keywords, identifier names, etc.)
that can be identified using rules. This
step is performed by a lexical recognizer,
or scanner.
o Syntax Analysis: group the tokens
identified by the scanner into grammatical
phrases that will be used by the compiler to
generate the output. This process is called
parsing and is performed by a parser based
on the formal grammar of the programming
language. The parser is created from the
grammar using a parser generator or
compiler-compiler.
o Semantic Analysis: check the source
program for semantic (meaning) errors and
gather type information for the subsequent
code generation phase. It uses the
hierarchical structure determined by the
parser to identify the operators and
operands of expressions and statements.
• Synthesis of the target program:
o Generate an intermediate representation
of the source program. This is performed by
some compilers, but not necessarily all.
An intermediate representation can be
thought of as a program for an abstract
machine --- it should be easy to produce and
easy to translate into the target program.
o Code Optimization: improve the
intermediate code to produce a faster
running machine code in the final
translation. Not all compilers include the
code optimization step, which can require a
lot of time.
o Code Generation: generate the target
code, normally either relocatable machine
code or assembly code. If the compiler
produces assembly code, the compiler output
has to subsequently be translated to machine
code by an assembler translator as an extra
step.
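The analysis and synthesis phases above can be sketched end to end for a toy language of arithmetic expressions. This is a hypothetical mini-compiler of our own, not one from the text: lexical analysis with Python's re module, syntax analysis by recursive descent, and code generation targeting an assumed stack machine with PUSH/ADD/MUL instructions.

```python
import re

def scan(src):
    """Lexical analysis: split the source into tokens."""
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    """Syntax analysis: build an AST of nested ('+'|'*', lhs, rhs) tuples."""
    def expr(i):                     # expr := term ('+' term)*
        node, i = term(i)
        while i < len(tokens) and tokens[i] == "+":
            rhs, i = term(i + 1)
            node = ("+", node, rhs)
        return node, i
    def term(i):                     # term := factor ('*' factor)*
        node, i = factor(i)
        while i < len(tokens) and tokens[i] == "*":
            rhs, i = factor(i + 1)
            node = ("*", node, rhs)
        return node, i
    def factor(i):                   # factor := number | '(' expr ')'
        if tokens[i] == "(":
            node, i = expr(i + 1)
            return node, i + 1       # skip the closing ')'
        return int(tokens[i]), i + 1
    node, _ = expr(0)
    return node

def codegen(node, out):
    """Code generation: post-order walk emitting stack-machine code."""
    if isinstance(node, int):
        out.append(("PUSH", node))
    else:
        op, lhs, rhs = node
        codegen(lhs, out)
        codegen(rhs, out)
        out.append(("ADD",) if op == "+" else ("MUL",))
    return out

code = codegen(parse(scan("1 + 2 * 3")), [])
print(code)  # [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('MUL',), ('ADD',)]
```

Note how the parser's hierarchical structure (term nested inside expr) gives * its higher precedence, so the generated code multiplies before it adds, just as the semantic-analysis discussion requires.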

3. After the compiler (or assembler translator) has


produced the object file, two additional steps are
needed to produce and run the program.

• Linking the program: the linker (link-


editor) links the object files from the program
modules, and any additional library files to
create the executable program. Usually this
includes the use of relocatable addresses within
the program to allow the program to run in
different memory locations.
• Loading the program: the loader identifies
a memory location in which the program can be
loaded and alters the relocatable machine code
addresses to run in the designated memory
location. A program is loaded into memory each
time it is run (unless it's a TSR that remains in
memory, even when not active). In some
situations, the loader performs both steps of
linking and loading a program.

• Objectives:

The specific objectives for our discussion of compilers


will be lexical analysis and syntax analysis. We will
create lexical analyzers (scanners) using a scanner
generator and syntax analyzers (parsers) using a parser
generator. We will leave the actual generation of machine
code to the C compiler.

Grammars and Languages

Definitions

Alphabet. A finite set of symbols.

Token. A terminal symbol in the grammar for the source language.

• Typical tokens in a programming language include:


keywords, operators, identifiers, constants, literal
strings, and punctuation symbols such as parentheses,
commas, and semicolons.

String. A finite sequence of symbols drawn from an alphabet.

• Greek letters are used to denote strings. Roman


letters are used to denote symbols.
• The length of the string α, denoted |α|, is
the number of occurrences of symbols in the string α.
• The empty string, denoted ε, is a string of
length 0.
• If α is a string, then by α^i we mean
αα...α, i times.
• A terminal string is one composed only of
terminal symbols (tokens). ε is considered a terminal
string also.

Language. Any set of strings over some fixed alphabet.


This general definition includes the strings:

• The empty set, ∅


• The set containing the empty string, {ε}

Grammar. Rules that specify the syntactic structure of well-


formed programs (sentences) in a language.

• A grammar is a 4-tuple (VN, VT, G0, P) where VN is


a set of non-terminal symbols, VT is a set of terminal
symbols (tokens), G0 is the goal symbol, and P is a set
of productions (rules) of the grammar.
• V, which denotes VN ∪ VT, is called the alphabet
or vocabulary of the grammar.
• Terminal symbols are the basic symbols from which
strings are formed. The word "token" is a synonym for
terminal.
• Nonterminal symbols are syntactic variables that
denote sets of strings. They impose a hierarchical
structure on the language that is useful for syntax
analysis and translation.
• In a grammar, one nonterminal symbol is defined
as the start (goal) symbol. The set of strings it
generates constitutes the language defined by the
grammar.
• The productions (rules) of a grammar specify the
manner in which the terminals and nonterminals can be
combined to form strings. Each production consists
of a nonterminal, followed by an arrow (sometimes the
symbol ::= is used instead), followed by a string of
nonterminals and terminals.

E.g., A -> B d

Classification of Grammars According to Types of Productions


Allowed

Type 0: Phrase Structure Grammars (most general classification)

Productions allowed: α -> ω where ω may equal ε

E.g., a b c -> d e f g

Type 1: Context-Sensitive Grammars

Productions allowed:

α X β -> α ω β where X ∈ VN and ω ≠ ε (α and β are


called the context of X)
G0 -> ε

Type 2: Context-Free Grammars

Productions allowed: X -> ω where X ∈ VN and ω may equal


ε .

Type 3: Regular Grammars (most restrictive classification)


Productions allowed:

X -> a and

X -> Y a where X, Y ∈ VN, a ∈ VT

Note: Type 3 ⊂ LR(0) ⊂ LR(1) ⊂ LR(k) ⊂ Type 2

Definition:

If α -> β is a production and µαη is a string of


the grammar (i.e., a string of symbols from the
vocabulary), then we write µαη ==> µβη
(the notation for "immediately derives")

Definition:

If in some grammar σ1, σ2, ..., σt are strings such


that, for 1 ≤ i ≤ t - 1, σi ==> σi+1, or t = 1,
then we write σ1 =*=> σt (the notation for "derives")

Note: α ==> α might not be true, but α


=*=> α is always true.

Definition:

The language defined by a grammar G, denoted as L(G), is given by

L(G) = {α | G0 =*=> α and α is a terminal string}, i.e.,


α is composed entirely of terminals.

If G0 =*=> α , then α is called a sentential form. A terminal


sentential form is called a sentence.
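The definition of L(G) can be made concrete by exhaustively applying productions to sentential forms, collecting those that become terminal strings. A sketch in Python, using a hypothetical context-free grammar not from the text, S -> a S b | a b, which generates a^n b^n for n ≥ 1; since its productions never shorten a form, we can safely prune forms longer than the length bound:

```python
from collections import deque

# Hypothetical grammar: S -> a S b | a b.
rules = {"S": [["a", "S", "b"], ["a", "b"]]}

def language(start, max_len):
    """Enumerate the sentences of L(G) up to length max_len by BFS
    over sentential forms, expanding the leftmost nonterminal."""
    seen, words = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if form in seen or len(form) > max_len:
            continue                     # prune: productions never shrink forms
        seen.add(form)
        nts = [i for i, s in enumerate(form) if s in rules]
        if not nts:                      # terminal sentential form: a sentence
            words.add("".join(form))
            continue
        i = nts[0]                       # leftmost nonterminal
        for rhs in rules[form[i]]:
            queue.append(form[:i] + tuple(rhs) + form[i + 1:])
    return words

print(language("S", 6))  # {'ab', 'aabb', 'aaabbb'}
```

Each enqueued tuple is a sentential form (S derives it), and exactly the terminal forms end up in the language, matching the definitions above.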

Definition:

A language (on some alphabet) is a set of strings on that


alphabet. A language is called a {phrase-structure | context-
sensitive | context-free | regular} language if it has a grammar
of the corresponding type.

Definition:

If α =*=> β, then β is called a descendant of α, and α is


called an ancestor of β.
If α ==> β, then β is called an immediate descendant of α,
and α is called an immediate ancestor of β.

Definition:

Backus Naur Form is a method for representing context-free


grammars.

Examples of Grammars and Derivations:

Grammar 1:

1. SENTENCE -> NOUNPHRASE VERB NOUNPHRASE

2. NOUNPHRASE -> the ADJECTIVE NOUN

3. NOUNPHRASE -> the NOUN

4. VERB -> pushed

5. VERB -> helped

6. ADJECTIVE -> pretty

7. ADJECTIVE -> poor

8. NOUN -> man

9. NOUN -> boy

10. NOUN -> cat

Derivation of the sentence: "the man helped the poor boy"

1. SENTENCE (goal symbol)

2. ==> NOUNPHRASE VERB NOUNPHRASE (by Rule 1)

3. ==> the NOUN VERB NOUNPHRASE (Rule 3)

4. ==> the man VERB NOUNPHRASE (Rule 8)

5. ==> the man helped NOUNPHRASE (Rule 5)

6. ==> the man helped the ADJECTIVE NOUN (Rule 2)


7. ==> the man helped the poor NOUN (Rule 7)

8. ==> the man helped the poor boy (Rule 9)

(this derivation shows that "the man helped the poor boy" is a
sentence in the language defined by the grammar.)
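The eight-step derivation above can be replayed mechanically: encode the ten productions as data and apply each cited rule to the leftmost occurrence of its nonterminal. A sketch (the encoding and helper are our own):

```python
# Grammar 1, one (lhs, rhs) pair per numbered production.
rules = [
    ("SENTENCE",   ["NOUNPHRASE", "VERB", "NOUNPHRASE"]),  # 1
    ("NOUNPHRASE", ["the", "ADJECTIVE", "NOUN"]),          # 2
    ("NOUNPHRASE", ["the", "NOUN"]),                       # 3
    ("VERB",       ["pushed"]),                            # 4
    ("VERB",       ["helped"]),                            # 5
    ("ADJECTIVE",  ["pretty"]),                            # 6
    ("ADJECTIVE",  ["poor"]),                              # 7
    ("NOUN",       ["man"]),                               # 8
    ("NOUN",       ["boy"]),                               # 9
    ("NOUN",       ["cat"]),                               # 10
]

def apply(form, rule_no):
    """Rewrite the leftmost occurrence of the rule's nonterminal."""
    lhs, rhs = rules[rule_no - 1]
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]

form = ["SENTENCE"]                    # the goal symbol
for r in [1, 3, 8, 5, 2, 7, 9]:        # the rules cited in the derivation
    form = apply(form, r)
print(" ".join(form))  # the man helped the poor boy
```

The final form contains only terminals, confirming that "the man helped the poor boy" is a sentence of the language.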

This derivation may also be represented diagrammatically by a


syntax tree. [syntax tree figure omitted]

Typical format of a grammar for a programming language:

PROGRAM -> PROGRAM STATEMENT

PROGRAM -> STATEMENT

STATEMENT -> ASSIGNMENT-STATEMENT

STATEMENT -> IF-STATEMENT

STATEMENT -> DO-STATEMENT

...

ASSIGNMENT-STATEMENT -> ...

...

IF-STATEMENT -> ...

...

DO-STATEMENT -> ...


...

Grammar 2 (a simple grammar for arithmetic statements)


1. E -> E + T
2. E -> T
3. T -> T * a
4. T -> a
Derivation of: a + a * a
1. E Goal Symbol
2. ==> E + T Rule 1
3. ==> E + T * a Rule 3
4. ==> E + a * a Rule 4
5. ==> T + a * a Rule 2
6. ==> a + a * a Rule 4
Derivation of: a + a * a written in reverse:
1. a + a * a Given sentential form
2. T + a * a Rule 4 in reverse
3. E + a * a Rule 2 in reverse
4. E + T * a Rule 4 in reverse
5. E + T Rule 3 in reverse
6. E Rule 1 in reverse

Note: a derivation in which the terminal symbols are introduced


(or resolved) from right to left is called a rightmost
derivation. (It is also possible to do derivations using
leftmost derivations.)
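The reverse derivation above is exactly what a bottom-up parser does: it reduces the sentential form step by step until only the goal symbol remains. A sketch replaying those reductions for Grammar 2 (the encoding and position bookkeeping are our own):

```python
# Grammar 2, one (lhs, rhs) pair per numbered production.
rules = [("E", ["E", "+", "T"]),  # 1
         ("E", ["T"]),            # 2
         ("T", ["T", "*", "a"]),  # 3
         ("T", ["a"])]            # 4

def reduce_once(form, rule_no, at):
    """One reverse step: replace the rhs found at index `at` with the lhs."""
    lhs, rhs = rules[rule_no - 1]
    assert form[at:at + len(rhs)] == rhs, "rhs not present at that position"
    return form[:at] + [lhs] + form[at + len(rhs):]

form = ["a", "+", "a", "*", "a"]        # the given sentential form
# (rule, position) pairs matching reverse-derivation steps 2 through 6:
for rule_no, at in [(4, 0), (2, 0), (4, 2), (3, 2), (1, 0)]:
    form = reduce_once(form, rule_no, at)
print(form)  # ['E']
```

Reducing the rightmost-introduced terminals last mirrors the rightmost derivation run in reverse, which is the order an LR-style parser discovers the productions.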
