
Syntax

The study of programming languages can be divided into the examination of syntax and semantics.
Syntax is the form of expressions, statements, and program units.
Semantics is the meaning of those expressions, statements, and program units.

That is, syntax is the form (structure, grammar) of a language, and semantics is the meaning of a language.
In a well-designed programming language, semantics should follow directly from syntax.
Describing syntax is easier than describing semantics.

Example:

if (a > b) a = a + 1;
else b = b + 1;

syntax: if-else is an operator that takes three operands - a condition and two statements
semantics: if the value of a is greater than the value of b, then increment a. Otherwise, increment b.

Syntax is what the grammar allows; semantics is what it means.

int x = "five"; // syntax is okay (type identifier = value), semantics is wrong ("five" is not an int)

Both the syntax and semantics of a programming language must be carefully defined so that:
language implementors can implement the language correctly, so that programs developed with one implementation run correctly under another (portability)
programmers can use the language correctly

Describing Syntax
A language is a set of strings of
characters from some alphabet.
Examples:
English (using the standard alphabet)
binary numbers (using the alphabet {0, 1})

The syntax rules of a language determine whether or not arbitrary strings belong to the language. The first step in specifying syntax is describing the basic units, or words, of the language, called lexemes.
For example, some typical Java lexemes include:
if
++
+

THE GENERAL PROBLEM OF DESCRIBING SYNTAX
Lexemes are the lowest level of syntactic unit.
The lexemes of a programming language include its identifiers, literals, operators, and special words.
A token of a language is a category of its lexemes.

Lexemes are grouped into categories called tokens. Each token has one or more lexemes.
Tokens are specified using regular expressions or finite automata.
The scanner (lexical analyzer) of a compiler processes the character strings in the source program and determines the tokens that they represent.
Once the tokens of a language are defined, the next step is to determine which sequences of tokens are in the language.
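To make the scanner's job concrete, here is a minimal sketch in Java (the class name, token names, and tiny token set are illustrative assumptions, not part of these notes): each token category is given as a regular expression, and the scanner repeatedly matches the next lexeme at the current position and records its (token, lexeme) pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    // One regular expression per token category; order matters, so the
    // keyword "if" is tried before identifiers and "++" before "+".
    private static final Pattern TOKENS = Pattern.compile(
        "(?<WS>\\s+)"
      + "|(?<IF>\\bif\\b)"
      + "|(?<ID>[A-Za-z_][A-Za-z0-9_]*)"
      + "|(?<NUM>[0-9]+)"
      + "|(?<INC>\\+\\+)"
      + "|(?<PLUS>\\+)");

    // Returns the (token, lexeme) pairs as "TOKEN:lexeme" strings.
    public static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(src);
        int pos = 0;
        // Repeatedly match the next lexeme exactly at the current position;
        // stop if a character belongs to no token category.
        while (pos < src.length() && m.find(pos) && m.start() == pos) {
            if (m.group("WS") == null) {          // whitespace separates lexemes
                String tok = m.group("IF")  != null ? "IF"
                           : m.group("ID")  != null ? "ID"
                           : m.group("NUM") != null ? "NUM"
                           : m.group("INC") != null ? "INC"
                           :                          "PLUS";
                out.add(tok + ":" + m.group());
            }
            pos = m.end();
        }
        return out;
    }
}
```

For example, Lexer.tokenize("if x++ + 12") groups the characters into the lexemes if, x, ++, +, 12 and labels each with its token category.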
LANGUAGE RECOGNIZERS
Languages can be defined in two ways: by recognition and by generation.
A language recognizer is a device that reads an input string and determines whether it belongs to the language.
A language generator is a device that can be used to generate the sentences of a language.

FORMAL METHODS OF DESCRIBING SYNTAX
John Backus and Noam Chomsky independently invented the notation that is most widely used for describing programming language syntax.

CONTEXT-FREE GRAMMARS
Chomsky described four classes of grammars that define four classes of languages. Two of these grammar classes, context-free and regular, turned out to be useful for describing the syntax of programming languages.
The tokens of programming languages can be described by regular grammars.

ORIGINS OF BACKUS-NAUR FORM (BNF)
BNF is a very natural notation for describing syntax.
Chomsky's context-free grammars are almost the same as BNF.
A metalanguage is a language that is used to describe another language.
BNF is a metalanguage for programming languages.
The abstractions in a BNF grammar are called nonterminals.
The lexemes and tokens of the rules are called terminals.
A BNF description, or grammar, is simply a collection of rules.

BNF (Backus-Naur Form)

BNF is:
a metalanguage - a language used to describe other languages
the standard way to describe programming language syntax
often used in language reference manuals
The class (set) of languages that can be described using BNF is called the context-free languages, and BNF descriptions are also called context-free grammars, or just grammars.

BNF Notation
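As an illustration (a small grammar in the style of common textbook examples, not taken from these notes), each BNF rule has a single nonterminal on its left-hand side and one or more alternatives of terminals and nonterminals on its right-hand side:

```
<assign> → <id> = <expr>
<id>     → a | b | c
<expr>   → <id> + <expr>
         | <id> * <expr>
         | ( <expr> )
         | <id>
```

Here <assign>, <id>, and <expr> are nonterminals, while =, +, *, the parentheses, and the names a, b, c are terminals.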

Parse Tree
A parse tree is a graphical way of
representing a derivation.
the root of the parse tree is always the
start symbol
each interior node is a nonterminal
each leaf node is a token
the children of a nonterminal (interior
node) are the RHS of some rule whose
LHS is the nonterminal
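For instance (a deliberately tiny grammar, invented here for illustration), with the rule <expr> → <expr> + <expr> | id, the string id + id has the parse tree:

```
        <expr>
       /  |   \
  <expr>  +   <expr>
    |           |
    id          id
```

The root is the start symbol <expr>, each interior node is a nonterminal, and reading the leaves left to right spells out the derived string id + id.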

PARSE TREES
The most attractive feature of grammars is that they describe the hierarchical syntactic structure of the sentences of the languages they define. These hierarchical structures are called parse trees.
A grammar that generates a sentence for which
there are two or more distinct parse trees is said to
be ambiguous
Syntactic ambiguity of language structures is a
problem because compilers often base the
semantics of those structures on their syntactic
form

For example, a parse tree for:

if (id > num) id = num; else { id = id + num; id = id; }

using the previous grammar is:

A grammar is ambiguous if there are two or more distinct parse trees (or, equivalently, leftmost derivations) for the same string.
Consider the grammar:
<expr> → id | num | (<expr>) | <expr> + <expr> | <expr> * <expr>
and the string:
id + num * id
The following parse trees show that this grammar is ambiguous:
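The two distinct parse trees can be sketched as follows. In the first, * is at the root, so the string groups as (id + num) * id; in the second, + is at the root, grouping as id + (num * id):

```
           <expr>
         /   |    \
    <expr>   *    <expr>
   /  |   \         |
<expr> + <expr>     id
  |        |
  id      num

           <expr>
         /   |    \
    <expr>   +    <expr>
      |          /  |   \
      id    <expr>  *  <expr>
              |          |
             num         id
```

Both trees have the same leaves in the same order, so the grammar alone cannot decide which grouping (and hence which meaning) is intended.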

Which parse tree would we prefer?
Grammars can often be modified to remove ambiguity - in this case, by enforcing associativity and precedence:
<expr> → <expr> + <term> | <term>
<term> → <term> * <factor> | <factor>
<factor> → id | num | (<expr>)
The intuition behind this approach is to try to force the + to occur higher in the parse tree.
The parse tree for:
id + num * id
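Under the rewritten grammar, id + num * id has only one parse tree: num * id must be derived through <term>, so * appears lower in the tree than +:

```
            <expr>
          /   |    \
     <expr>   +    <term>
       |         /   |    \
     <term>  <term>  *  <factor>
       |        |          |
  <factor>  <factor>       id
       |        |
      id       num
```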

This grammar modification gives rise to three proof obligations:
the two grammars define the same language
the second grammar always gives correct associativity and precedence
the second grammar is not ambiguous
These proofs are omitted.

SYNTAX GRAPHS
A graph is a collection of nodes, some of which
are connected by lines, called edges
A directed graph is one in which the edges are
directional; they have arrowheads on one end
to indicate a direction
The information in BNF rules can be
represented in a directed graph, such graphs
are called syntax graphs. These graphs use
rectangles for non-terminals and circles for
terminals

GRAMMARS AND
RECOGNIZERS
One of the most widely used syntax analyzer generators is named yacc (yet another compiler compiler).
Syntax analyzers for programming languages, which are often called parsers, construct parse trees for given programs.
The two broad classes of parsers are top-down, in which the tree is built from the root downward to the leaves, and bottom-up, in which the parse tree is built from the leaves upward to the root.

RECURSIVE DESCENT PARSING
A context-free grammar can serve as the basis for the syntax analyzer, or parser, of a compiler.
A simple kind of grammar-based top-down parser is named recursive descent.
Parsing is the process of tracing a parse tree for a given input string.
The basic idea of a recursive descent parser is that there is a subprogram for each nonterminal in the grammar.
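As a concrete sketch (class and method names invented for illustration), here is a recursive descent parser in Java for the unambiguous expression grammar shown earlier. The left-recursive rules are rewritten as iteration, since a recursive descent parser cannot handle left recursion directly; each nonterminal becomes one method, and instead of building an explicit tree the parser returns a fully parenthesized string that makes the tree structure visible.

```java
import java.util.List;

public class RDParser {
    private final List<String> toks;   // token stream, e.g. ["id", "+", "num"]
    private int pos = 0;

    public RDParser(List<String> toks) { this.toks = toks; }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "$"; }
    private String next() { return toks.get(pos++); }

    // <expr> -> <term> { + <term> }   (left recursion rewritten as a loop)
    public String expr() {
        String left = term();
        while (peek().equals("+")) {
            next();                                    // consume "+"
            left = "(" + left + " + " + term() + ")";  // left-associative
        }
        return left;
    }

    // <term> -> <factor> { * <factor> }
    private String term() {
        String left = factor();
        while (peek().equals("*")) {
            next();                                    // consume "*"
            left = "(" + left + " * " + factor() + ")";
        }
        return left;
    }

    // <factor> -> id | num | ( <expr> )
    private String factor() {
        String t = next();
        if (t.equals("(")) {
            String inner = expr();
            next();                                    // consume ")"
            return inner;
        }
        return t;                                      // id or num
    }
}
```

Parsing id + num * id this way yields "(id + (num * id))", showing that * binds tighter than + exactly as the grammar intends.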

ATTRIBUTE GRAMMARS
An attribute grammar is a device used to describe more of the structure of a programming language than is possible with a context-free grammar.
An attribute grammar is an extension to a context-free grammar.
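For example (a sketch in the usual attribute grammar notation, not taken from these notes), a synthesized attribute val can be attached to the expression grammar so that each syntax rule carries a semantic rule computing the value of the phrase from the values of its parts:

```
Syntax rule:   <expr> → <expr>1 + <term>
Semantic rule: <expr>.val ← <expr>1.val + <term>.val

Syntax rule:   <term> → num
Semantic rule: <term>.val ← the numeric value of the num lexeme
```

The semantic rules are evaluated over the parse tree, which is how an attribute grammar ties meaning to syntactic form.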
