
What is the most compressed file ever?

In percent
decrease. Post the original size and the output size. Also
include the algorithm/process used.
25 Answers

Mike MacHenry, software engineer, improv comedian, maker


Updated 117w ago · Author has 296 answers and 396.6k answer views
My hat's off to Wenceslao Terra for the cool historical account of zip bombing.

It is, however, worth pointing out that there is no theoretical upper bound on how much a particular, well-chosen file can be compressed if you are also free to choose the algorithm. A pathological example can easily achieve an arbitrary amount of compression.

Pi compression is a commonly used joke compression algorithm, but it does work. Basically, you take the data you want to compress in binary and search for it in the binary expansion of the digits of pi. Pi is widely believed (though not yet proven) to be a normal number, in which case every finite sequence of digits eventually appears somewhere in its expansion. But if we only need to find *a* file that compresses a lot, rather than *any* file, this mechanism provably works just fine.

To get the compressed version of a file, you store the index of the first digit in pi and the size of your data as the compressed value. This can compress your data to an insanely small percentage of its original size. It also takes an unbelievable amount of time just to find the match, unless you get lucky.

The compressed result can also, potentially, be larger than the original, since you might find the subsequence you're looking for only after a very large number of digits. In fact it's quite likely that you'll compress to an even larger value. This is something that the joke pages rarely point out. In fact the pifs @philipl/pifs (Pi Filesystem) project claims 100% compression, which is just mathematically incorrect.

However, pi compression definitely compresses some values to much smaller sizes. For any given percentage of compression you'd like, pi compression can beat it for some example file.

So how much can I compress a well-chosen file? Just take whatever the best answer has
been so far and add 0.1%.
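For illustration only, here is a minimal sketch of the idea in Python. It uses the mpmath library to generate decimal digits of pi and a 3-digits-per-byte encoding, both of which are my own assumptions, and it is only practical for very small payloads, which is exactly the point about search time made above.

from mpmath import mp

def pi_digits(n):
    """First n decimal digits of pi after the leading 3."""
    mp.dps = n + 10                       # a few guard digits
    return str(mp.pi)[2:2 + n]            # drop the "3."

def pi_compress(payload, search_digits=200_000):
    target = "".join(f"{b:03d}" for b in payload)     # 3 decimal digits per byte
    offset = pi_digits(search_digits).find(target)
    if offset < 0:
        raise ValueError("not found in the searched prefix of pi")
    return offset, len(payload)                       # the "compressed" file

def pi_decompress(offset, length):
    chunk = pi_digits(offset + 3 * length)[offset:offset + 3 * length]
    return bytes(int(chunk[i:i + 3]) for i in range(0, len(chunk), 3))

offset, length = pi_compress(b"\x2a")                 # a single byte is usually findable
assert pi_decompress(offset, length) == b"\x2a"

For anything bigger than a couple of bytes the search time explodes, and the stored offset is typically larger than the payload itself, which matches the caveat above.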



David Rutter, M.S. C.S. Georgia Tech


Updated 81w ago · Author has 3.3k answers and 2m answer views
That's easy. The best compression ratio ever achieved is that demonstrated by the LenPEG
3 compression algorithm.

It can compress a 786kb bmp down to (arbitrary small size dependent on system
specifications).

Namely, the disk usage of a machine which has compressed Lena

using the LenPEG 3 algorithm will be exactly 0%. Since this will be less than the disk usage
of a machine that doesn't contain any copy of Lena at all—even one compressed using any
other compression algorithm—it is literally impossible to top its compression ratio. In fact,
this means that the compression ratio is an arbitrarily large negative number! Furthermore,
since one need not demonstrate an image compression algorithm on any image except Lena,
this is all the information you need to realize that it is truly the perfect compression
algorithm.

I mean, how many other answers to this question have demonstrated a negative
compression ratio? I'll tell you how many. None. LenPEG 3 will not be beat.
(P.S. If you want to see what a LenPEG 3 compressed Lena looks like, it's very easy to find
out. All new hard disks, thumb drives, and SD cards are sold with a copy these days. That's
how impressed the industry is with this amazing feat!)

Edit: Everyone go upvote Ian Kelly's comment proposing a new algorithm for LenPEG 4. It's
a great idea.


Dave Lindbergh, Engineer


Answered 50w ago · Upvoted by Justin Rising, MSE in CS and Gideon Shavit, Master's Computer
Science, University of Washington · Author has 550 answers and 886.3k answer views
Here is an algorithm that will compress any file, of any size, into 1 single bit.

Suppose you want to compress the BluRay of the movie “The Wizard of Oz” - about 4 GBytes
- into one bit. Here’s the pseudocode:

if (file == “The Wizard of Oz”)
  { output 1; }
else
  { output 0; output file; }

There you go - if the file is The Wizard of Oz, it outputs one bit (a 1).

If it’s anything else, it outputs a zero followed by whatever it is.

This works for any file, of any size.

Here’s the decompression pseudocode:

if (first_bit == 1)

{ output the contents of The Wizard of Oz; }

else

{ discard first_bit, output the rest of the input }


If you understand why this works, and why it is not useful, you understand why there can be
no such thing as a “universal” compression algorithm, and why your question isn’t really
meaningful.
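For what it's worth, here is a literal Python rendering of the same idea (the names are mine, and a whole byte stands in for the single bit, purely for simplicity):

# A literal, equally useless rendering of the joke scheme above.
# "oz_reference" stands in for the 4 GB reference copy; the name is an assumption.
def compress(data: bytes, oz_reference: bytes) -> bytes:
    if data == oz_reference:
        return b"\x01"                 # one "bit" (a byte here, for simplicity)
    return b"\x00" + data              # everything else grows by one byte

def decompress(blob: bytes, oz_reference: bytes) -> bytes:
    if blob[:1] == b"\x01":
        return oz_reference
    return blob[1:]

Every input other than the reference file gets one byte longer, which is the whole point: the savings on one file are paid for by every other file.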


Alex Coninx, P.h.D. Computer Science, University of Grenoble (2012)


Answered 205w ago · Author has 2.3k answers and 3.7m answer views
for s in 1 10 100 1000 10000; do dd if=/dev/zero bs=1M count=$s | bzip2 >
testcomp${s}M.bz2; done

That creates bzip2 archives of files consisting only of zeros, for sizes of 1, 10, 100, 1000 and 10000 MB. It ran in about 3 minutes on my box.

Now the result:

45 testcomp1M.bz2
49 testcomp10M.bz2
113 testcomp100M.bz2
753 testcomp1000M.bz2
7346 testcomp10000M.bz2

The first figure is the size in bytes. In percent, it gives us:


99.9957% for the 1MB file
99.99953% for the 10MB file
99.99989% for the 100MB file
99.999928% for the 1000MB file
99.999930% for the 10000MB file

The compressed size, as a fraction of the original, keeps shrinking as the file grows. You could probably achieve an arbitrarily high compression rate with really huge files full of zeros.

Kelly Kinkade, CS major for a few years way back. Programmer for far longer.
Answered 203w ago · Upvoted by Tamer Aly, M.S. Computer Science & Cognitive Neuroscience,
Fordham University (2015) and Nupul Kukreja, Ph.D. Computer Science & Software Engineering,
University of Southern California · Author has 7.6k answers and 37.3m answer views
I got in a small bit of trouble once by creating a zip bomb in a public directory on one of the UNIX machines where I went to college:

dd if=/dev/zero bs=65536 count=65536 | compress - > interesting-file.Z

The resulting file was fairly small, but if uncompressed it would expand to 4 gigabytes of zeros. Predictably, some junior system admin, trolling about, saw it, copied it to his home directory, and uncompressed it to look at it... and almost immediately filled the file system. He got in trouble for doing so, and tried to blame me for his foolishness.
Cedric Mamo
Answered 202w ago · Author has 143 answers and 1.2m answer views
All compression is based on statistics in some way or another: more frequent patterns are represented with fewer symbols, less frequent patterns with more symbols.

If I have the string "AAAAAAABBC", there are 7 As, 2 Bs and 1 C. So I can choose to represent A as "0", B as "10" and C as "11". You always use the probability of a symbol (or a sequence) appearing. There are lots of different methods, but almost all compression is based on this concept.
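As a tiny illustration of that example (using exactly the code table from the text, not a general Huffman construction):

# Frequent symbols get short codes; the table below is the one from the text.
code = {"A": "0", "B": "10", "C": "11"}

def encode(s: str) -> str:
    return "".join(code[ch] for ch in s)

bits = encode("AAAAAAABBC")
print(bits, len(bits), "bits vs", 8 * len("AAAAAAABBC"), "bits uncompressed")
# "0000000101011": 13 bits instead of 80 bits of plain 8-bit characters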

So if I had the string "AAAAAAAAAAAAAAAAAA", then at each step I know the next symbol will be an A with 100% certainty (the string has no other letters), so in theory that can be compressed to just an empty string. In practice, you'd just have to store one of the letters.

So basically, asking what the most compressed file ever is, is a bit pointless, because as I said, it depends on the data just as much as it does on the algorithm itself.

So again, depending on the data, you can make compression ratios as large as you want. Just put a long string of repeated characters in there, and any algorithm will produce the absolute best compression ratio it can that way. And if the compression ratio doesn't seem high enough, you just make the string longer and it will still compress to roughly the same size (because to the algorithm, there is still always certainty about what the next character will be).

Andrea Campana, photographer, videomaker, seo specialist


Answered 46w ago
There is no one universally best compression algorithm. Different algorithms have been
invented to handle different data.

JPEG compression allows you to compress images quite a lot because it doesn't matter too
much if the red in your image is 0xFF or 0xFE (usually).

Best Compression Algorithm:

The PAQ compression algorithm is by far the best algorithm for compressing files. However, it is also one of the slowest, taking well over 10 hours to compress 1 GB on even the best CPUs.

All PAQ compressors use a context mixing algorithm described in Wikipedia. A large
number of models independently predict the next bit of input. The predictions are combined
by a neural network and then arithmetic coded. There are specialized models for text, binary
data, x86 executable code, BMP, TIFF, and JPEG images (except in the paq8hp* series,
which are tuned for English text only).
Best Compression Software:

PeaZip (Giorgio Tani) is a GUI front end for Windows and Linux that supports the paq8o,
lpaq1, and many other compression formats.

PeaZip is a free cross-platform file archiver & compressor that provides a unified portable
GUI for many Open Source technologies like 7-Zip, FreeArc, PAQ, UPX...

Create 7Z, ARC, BZ2, GZ, *PAQ, PEA, QUAD/BALZ, TAR, UPX, WIM, XZ, ZIP files

Open and extract over 180 archive types: ACE, ARJ, CAB, DMG, ISO, LHA, RAR, UDF,
ZIPX files and more...

PeaZip's features include extracting, creating and converting multiple archives at once, creating
self-extracting archives, splitting/joining files, strong encryption with two-factor authentication,
an encrypted password manager, secure deletion, finding duplicate files, calculating hashes, and
exporting job definitions as scripts.

File Compression
File compression is used to reduce the file size of one or more files.
When a file or a group of files is compressed, the resulting "archive"
often takes up 50% to 90% less disk space than the original file(s).
Common types of file compression include Zip, Gzip, RAR, StuffIt, and
7z compression. Each one of these compression methods uses a
unique algorithm to compress the data.
So how does a file compression utility actually compress data? While
each compression algorithm is different, they all work in a similar
fashion. The goal is to remove redundant data in each file by replacing
common patterns with smaller variables. For example, words in a plain
text document might get replaced with numbers or another type of
short identifier. These identifiers then reference the original words that
are saved in a key within the compressed file. For instance, the word
"computer" may be replaced with the number 5, which takes up much
less space than the word "computer." The more times the word
"computer" is found in the text document, the more effective the
compression will be.
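As a rough illustration of that principle (this shows the general effect, not the exact scheme any particular format uses), Python's built-in zlib module compresses such repetitive text dramatically:

import zlib

# Highly repetitive text compresses to a tiny fraction of its original size.
text = ("the computer is on the computer desk next to the computer manual " * 50).encode()
packed = zlib.compress(text, level=9)
print(len(text), "->", len(packed), "bytes")   # the packed size is a small fraction of the original
assert zlib.decompress(packed) == text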
While file compression works well with text files, binary files can also
be compressed. By locating repeated binary patterns, a compression
algorithm can significantly reduce the size of binary files, such
as applications and disk images. However, once a file is compressed, it
must be decompressed in order to be used. Therefore, if
you download or receive a compressed file, you will need to use a file
decompression program, such as WinZip or StuffIt Expander, to
decompress the file before you can view the original contents.

Algorithm
An algorithm is a set of instructions designed to perform a specific
task. This can be a simple process, such as multiplying two numbers,
or a complex operation, such as playing a compressed video
file. Search engines use proprietary algorithms to display the most
relevant results from their search index for specific queries.
In computer programming, algorithms are often created as functions.
These functions serve as small programs that can be referenced by a
larger program. For example, an image viewing application may
include a library of functions that each use a custom algorithm to
render different image file formats. An image editing program may
contain algorithms designed to process image data. Examples of
image processing algorithms include cropping, resizing, sharpening,
blurring, red-eye reduction, and color enhancement.
In many cases, there are multiple ways to perform a specific operation
within a software program. Therefore, programmers usually seek to
create the most efficient algorithms possible. By using highly-efficient
algorithms, developers can ensure their programs run as fast as
possible and use minimal system resources. Of course, not all
algorithms are created perfectly the first time. Therefore, developers
often improve existing algorithms and include them in future software
updates. When you see a new version of a software program that has
been "optimized" or has "faster performance," it most means the new
version includes more efficient algorithms.
A New Algorithm for Data Compression Optimization
I Made Agus Dwi Suarjaya, Information Technology Department, Udayana University, Bali, Indonesia

Abstract— People tend to store a lot of files in their storage. When the storage nears its limit, they try to reduce the size of those files to a minimum by using data compression software. In this paper we propose a new algorithm for data compression, called j-bit encoding (JBE). This algorithm manipulates each bit of data inside a file to minimize the size without losing any data after decoding, which classifies it as lossless compression. This basic algorithm is intended to be combined with other data compression algorithms to optimize the compression ratio. The performance of this algorithm is measured by comparing combinations of different data compression algorithms.

Keywords- algorithms; data compression; j-bit encoding; JBE; lossless.

I. INTRODUCTION
Data compression is a way to reduce storage cost by eliminating redundancies that occur in most files. There are two types of compression, lossy and lossless. Lossy compression reduces file size by eliminating some unneeded data that won't be noticed by humans after decoding; this is often used for video and audio compression. Lossless compression, on the other hand, manipulates each bit of data inside a file to minimize the size without losing any data after decoding. This is important because if a file loses even a single bit after decoding, the file is corrupted. Data compression can also be used as an in-network processing technique in order to save energy, because reducing the amount of data reduces the data transmitted and/or decreases transfer time [1]. There are some well-known data compression algorithms. In this paper we will take a look at various data compression algorithms that can be used in combination with our proposed algorithm. Those algorithms can be classified into transformation and compression algorithms. A transformation algorithm does not compress data but rearranges or changes data to optimize the input for the next transformation or compression algorithm in the sequence. Most compression methods are physical and logical. They are physical because they look only at the bits in the input stream and ignore the meaning of the contents of the input; such a method translates one bit stream into another, shorter, one, and the only way to understand and decode the output stream is by knowing how it was encoded. They are logical because they look only at individual contents in the source stream and replace common contents with short codes. A logical compression method is useful and effective (achieves the best compression ratio) on certain types of data [2].

II. RELATED ALGORITHMS
A. Run-length encoding
Run-length encoding (RLE) is one of the basic techniques for data compression. The idea behind this approach is this: if a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd [2]. RLE is mainly used to compress runs of the same byte [3]. This approach is useful when repetition often occurs inside the data. That is why RLE is a good choice for compressing a bitmap image, especially a low-bit-depth one, for example an 8-bit bitmap image.

B. Burrows-Wheeler transform
The Burrows-Wheeler transform (BWT) works in block mode while most others work in streaming mode. This algorithm is classified as a transformation algorithm because the main idea is to rearrange (by adding and sorting) and concentrate symbols. These concentrated symbols can then be used as input for another algorithm to achieve good compression ratios. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time [3]. To speed up the sorting process, it is possible to sort in parallel or to use larger input blocks if more memory is available.

C. Move-to-front transform
The move-to-front transform (MTF) is another basic technique for data compression. MTF is a transformation algorithm which does not compress data but can help to reduce redundancy [5]. The main idea is to move to the front the symbols that occur most often, so those symbols will have smaller output numbers. This technique is intended to be used as an optimization for other algorithms such as the Burrows-Wheeler transform.

D. Arithmetic coding
Arithmetic coding (ARI) uses a statistical method to compress data. The method starts with a certain interval, reads the input file symbol by symbol, and uses the probability of each symbol to narrow the interval. Specifying a narrower interval requires more bits, so the number constructed by the algorithm grows continuously. To achieve compression, the algorithm is designed such that a high-probability symbol narrows the interval less than a low-probability symbol, so that high-probability symbols contribute fewer bits to the output.
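To make the run-length idea in II.A concrete, here is a minimal sketch in Python (the helper names and the 255-byte run cap are mine, not from the paper):

# Each run of a repeated byte is replaced by a (count, byte) pair.
def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)

sample = b"\x00" * 100 + b"\xff" * 3
assert rle_decode(rle_encode(sample)) == sample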

Assignment of compiler design

1. Explain syntax directed translation, with examples.
Compiler Design | Syntax Directed Translation
Background: The parser uses a CFG (context-free grammar) to validate the input string and produce output for the next phase of the compiler. The output could be either a parse tree or an abstract syntax tree. To interleave semantic analysis with the syntax analysis phase of the compiler, we use syntax directed translation.
Definition
Syntax directed translation (SDT) consists of rules augmented onto the grammar that facilitate semantic analysis. SDT involves passing information bottom-up and/or top-down through the parse tree in the form of attributes attached to the nodes. Syntax directed translation rules use 1) lexical values of nodes, 2) constants, and 3) attributes associated with the non-terminals in their definitions.
Example

E -> E+T | T

T -> T*F | F

F -> INTLIT

This is a grammar to syntactically validate an expression containing additions and multiplications. Now, to carry out semantic analysis, we will augment this grammar with SDT rules, in order to pass information up the parse tree and check for semantic errors, if any. In this example we will focus on evaluation of the given expression, as we don't have any semantic assertions to check in this very basic example.

E -> E+T { E.val = E.val + T.val } PR#1

E -> T { E.val = T.val } PR#2

T -> T*F { T.val = T.val * F.val } PR#3

T -> F { T.val = F.val } PR#4

F -> INTLIT { F.val = INTLIT.lexval } PR#5

To understand the translation rules further, take the first SDT rule, augmented to the [ E -> E+T ] production. The translation rule in question has val as an attribute for both non-terminals, E and T. The right-hand side of the translation rule corresponds to attribute values of the right-side nodes of the production rule, and vice versa. Generalizing, SDT rules are rules augmented onto a CFG that associate 1) a set of attributes with every node of the grammar and 2) a set of translation rules with every production rule, using attributes, constants and lexical values.

Let’s take a string to see how semantic analysis happens: S = 2+3*4, whose parse tree follows from the grammar above.
To evaluate the translation rules, we can employ one depth-first-search traversal of the parse tree. This is possible only because SDT rules don’t impose any specific order of evaluation, as long as children's attributes are computed before their parents', for a grammar with all synthesized attributes. Otherwise, we would have to figure out the best-suited plan to traverse the parse tree and evaluate all the attributes in one or more traversals. For better understanding, we will move bottom-up, left to right, to compute the translation rules of our example.
The annotated parse tree shows how semantic analysis happens: the flow of information is bottom-up, and all the children's attributes are computed before their parents', as discussed above. Right-hand-side nodes are sometimes annotated with subscript 1 to distinguish between children and parent.
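As a concrete illustration (not part of the original notes), here is a small Python sketch that mirrors rules PR#1 to PR#5: each parse function returns the synthesized val attribute of its nonterminal, so evaluating "2+3*4" follows the same bottom-up, left-to-right flow; the left-recursive productions are handled with loops.

import re

def tokenize(s):
    return re.findall(r"\d+|[+*]", s)

def parse_E(toks):                 # E -> T { E.val = T.val } ; E -> E+T
    val = parse_T(toks)
    while toks and toks[0] == "+":
        toks.pop(0)
        val = val + parse_T(toks)  # PR#1: E.val = E.val + T.val
    return val

def parse_T(toks):                 # T -> F ; T -> T*F
    val = parse_F(toks)
    while toks and toks[0] == "*":
        toks.pop(0)
        val = val * parse_F(toks)  # PR#3: T.val = T.val * F.val
    return val

def parse_F(toks):                 # F -> INTLIT { F.val = INTLIT.lexval }
    return int(toks.pop(0))

print(parse_E(tokenize("2+3*4")))  # 14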
Additional Information
Synthesized attributes are attributes that depend only on the attribute values of children nodes. Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val corresponding to node E. If all the semantic attributes in an augmented grammar are synthesized, one depth-first-search traversal in any order is sufficient for the semantic analysis phase.
Inherited attributes are attributes that depend on parent and/or sibling attributes. Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E and Ep are the same production symbol annotated to differentiate between parent and child, has an inherited attribute val corresponding to node T.

2. In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract
syntactic structure of source code written in
a programming language. Each node of the tree denotes a
construct occurring in the source code. The syntax is
"abstract" in not representing every detail appearing in the
real syntax. For instance, grouping parentheses are
implicit in the tree structure, and a syntactic construct like
an if-condition-then expression may be denoted by means
of a single node with three branches.
This distinguishes abstract syntax trees from concrete
syntax trees, traditionally designated parse trees, which
are typically built by a parser during the source code
translation and compiling process. Once built, additional
information is added to the AST by means of subsequent
processing, e.g., contextual analysis.
Abstract syntax trees are also used in program
analysis and program transformation systems.
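As a small illustration (my own example, not from the text), here is what such a node-with-three-branches structure might look like in Python:

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Node:
    kind: str
    children: List[Any]

# AST for: if (x > 0) then y = 1 else y = 2
# Parentheses and keywords are not stored; the "if" node simply has three branches.
tree = Node("if", [
    Node(">", [Node("var", ["x"]), Node("num", [0])]),
    Node("assign", [Node("var", ["y"]), Node("num", [1])]),
    Node("assign", [Node("var", ["y"]), Node("num", [2])]),
])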

I am trying out syntax trees and realize that I have a few problems. I have a problem especially with two examples because I am very unsure how to handle the cases. In case 1, I do not know how to deal with fixed terms such as "the church of England". And in case 2, I don't know how to deal with "the girl who left us". These are my solutions. Would someone kindly help me understand this better? I would be very grateful!
syntax syntax-trees

asked Jan 28 '17 at 10:52 by Kendel Ventonda

4 Answers
Although what is "correct" always depends on theory, there are
various things that are definitely not quite right with your trees.
Tree #1
the founder of the church of England
The whole thing taken together is an NP (it starts with a definite
article and can serve as the subject of a sentence, so it is
something nominal, not prepositional), so the root of the tree
should be labelled NP rather than PP.
In general, an XP must always have an X as its head.
Thus, when there is an NP, there must be an N as the head, and
for a PP, there is a P head. This principle is not always followed
in your trees.
The same goes for NPs. Now I don't know what theory you are
using, because there are basically two opposing approaches:
1) Make the whole thing an NP, i.e. a phrase with an N head to
which the determiner is a specifier:

The head of the NP is the N "church". The DP consisting of the


D "the" is a specifier because it is the sister of N' and daughter
of NP.
2) Make the whole thing a DP, i.e. a phrase with a D head to
which the noun phrase is a complement:
The head of the DP is the D "the". The complement of this D
head is an NP which consists of the single N head "church".
I will not go into a discussion of the motivations of each
approach (and neither into a discussion about whether you
should leave redundant bar levels away), but you need to decide
what your phrase and its head is supposed to be. Having an NP
branching into a D and an N violates the X-bar scheme because
a phrase must have an identifiable head and can not branch into
two lexical items (D and N); one of them must be an X' or an
XP. Either you make it an NP with an N head and the DP as a
specifier, or you make it a DP with a D head and the NP as a
complement.
Assuming that you want to have the whole as an NP, I'll
continue with the first approach.
So a first rudimentary picture of your tree looks like this:

You can now argue about whether the PP "of the church of
England" is an adjunct rather than a complement, but in this case
I find the latter approach more plausible. So within N', we have
an N head "founder" and a PP complement "of the church of
England":

Now about the PP. As said above, the head of the PP must be a
P of which the complement is an NP, thus:
The NP "the church of England" again branches into the
determiner and the N' "church of England":

Within this N', "church" is the head and "of England" is a PP


complement to the N head "church":

Again, you could also argue about making the PP "of England"
an adjunction, but here too I find a complement more plausible.
The PP "of England" itself looks similar as the other PPs, with
the difference that the NP "England" doesn't have a DP
specifier:

And now you are done with your tree.


The whole phrase is an NP, of which the head is the noun
"founder" and the PP "of the church of England" is a
complement with a P head "of". The determiner "the" is located
in specifier position to the NP. The PP "of the church of
England" later branches into another PP "of England".

Tree #2
the brother of the girl who left us
I'll keep my explanation a bit briefer here.
Similarly as above, you have an NP in which the N' consists of
the N head "brother" and a PP complement "of the girl who left
us":

Within the PP, the complement NP "the girl" is modified by


adjunction of the relative clause "who left us":

It is also possible to locate the relative clause as an adjunct to


the N' "girl" rather than the whole NP "the girl":
For reasons that are too complicated to discuss here, I will
assume adjunction to the NP rather than N'.
The difficult part now is how to handle the relative clause "who
left us". The assumption is the following:
Within the relative clause CP, the relative pronoun "who" is
assumed to start in the subject position, i.e. in the specifier
position of IP (SpecI), because the NP it refers to ("the girl") is
the subject of the sentence:

This NP pronoun is then moved to the specifier of CP (SpecC)


to get into the position of a relative pronoun:

The moved pronoun leaves a trace (t_i) and is now located in


SpecC position, where it serves as a relative pronoun referring
to "girl".
The tree as a whole thus looks as follows:

To summarize, the whole expression is an NP, where the head N


"brother" has a PP complement "of the girl who ...", and within
that PP complement, the NP "the girl" is modified by adjunction
of a relative clause CP in which the NP "who" was moved from
SpecI to SpecC to serve as a relative pronoun referring to "the
girl".
General remarks
 My proposal is not the one and only gold standard solution;

there can not be one. Details of what a tree looks like always
depends on theory. In particular,
 there are opposing views on how to account for
determiner + noun (making it an NP, as I did, or a DP,
with consequences for their internal structure)
 whether to omit redundant bar levels (as I did),

 how to label the relative clause (CP or S or ...) ,

 where to attach the relative clause (as an adjunct to the NP

"the girl" or as an adjunct to the N' "girl"),


 whether the PPs act as complements or as adjuncts to the

NPs.
Which solution is deemed correct depends on what theory
you are using.
 You really should take a look again into the basics of how
phrase structure trees work. For example, having a VP with a
P head, as you did in your second tree, makes absolutely no
sense. It seems that there are some substantial assumptions
about phrase structure trees that are not quite clear to you yet.
You must always make sure that the labels of your (sub) trees
are in accordance with what is in the tree: A PP consists of a
P and an NP complement, if you have an NP, then this must
have an N as its head, and an expression "the brother of ..." is
certainly not a VP.
Once you have gained a better understanding of how phrase
structure trees work, what a phrase consists of and what
relations hold between constituents, it will be far more
obvious to you how to assign a sentence a tree structure.
answered Jan 28 '17 at 13:06, edited Jan 28 '17 at 17:20, by lemontree♦
The two phrases are NPs.
Check this: it can't be a PP, because it clearly starts with an NP, and "of" cannot be separated from "the church" since together they form a PP (on its own, 'of' would be meaningless, and you would ask: of what?).
3. Symbol Tables

A symbol table is a major data structure used in a compiler: a data structure for storing information about symbols in the program.
• Used in semantic analysis and subsequent compiler stages
• Associates attributes with identifiers used in a program; for instance, a type attribute is usually associated with each identifier
• A symbol table is a necessary component: the definition (declaration) of an identifier appears once in a program, while uses of the identifier may appear in many places of the program text
• Identifiers and attributes are entered by the analysis phases when processing a definition (declaration) of an identifier:
• In simple languages with only global variables and implicit declarations, the scanner can enter an identifier into the symbol table if it is not already there
• In block-structured languages with scopes and explicit declarations, the parser and/or semantic analyzer enter identifiers and corresponding attributes
• Symbol table information is used by the analysis and synthesis phases: to verify that used identifiers have been defined (declared), to verify that expressions and assignments are semantically correct (type checking), and to generate intermediate or target code

Symbol Table Interface
The basic operations defined on a symbol table include:
• allocate – allocate a new empty symbol table
• free – remove all entries and free the storage of a symbol table
• insert – insert a name in a symbol table and return a pointer to its entry
• lookup – search for a name and return a pointer to its entry
• set_attribute – associate an attribute with a given entry
• get_attribute – get an attribute associated with a given entry
Other operations can be added depending on requirements; for example, a delete operation removes a name previously inserted (some identifiers become invisible, i.e. out of scope, after exiting a block). This interface provides an abstract view of a symbol table, supports the simultaneous existence of multiple tables, and lets the implementation vary without modifying the interface.

Basic Implementation Techniques
The first consideration is how to insert and look up names. There is a variety of implementation techniques:
• Unordered list – simplest to implement, as an array or a linked list; a linked list can grow dynamically, which alleviates the problem of a fixed-size array; insertion is fast, O(1), but lookup is slow for large tables, O(n) on average
• Ordered list – if an array is sorted, it can be searched using binary search, O(log2 n); insertion into a sorted array is expensive, O(n) on average; useful when the set of names is known in advance, e.g. a table of reserved words
• Binary search tree – can grow dynamically; insertion and lookup are O(log2 n) on average

Hash Tables and Hash Functions
A hash table is an array with index range 0 to TableSize – 1. It is the most commonly used data structure to implement symbol tables; insertion and lookup can be made very fast, O(1). A hash function maps an identifier name into a table index. A hash function h(name) should depend solely on name, should be computed quickly, and should be uniform and randomizing in distributing names: all table indices should be mapped with equal probability, and similar names should not cluster to the same table index.

Hash Functions
Hash functions can be defined in many ways. A string can be treated as a sequence of integer words; several characters are fit into an integer word, and strings longer than one word are folded using exclusive-or.
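Going back to the interface listed above, here is a minimal sketch in Python; the dict stands in for the hash table, and the implementation details are my own assumptions, not part of the notes:

# Operation names follow the notes: insert, lookup, set_attribute, get_attribute, free.
class SymbolTable:
    def __init__(self):                      # allocate: a new empty table
        self.entries = {}

    def insert(self, name):                  # insert: return the entry for name
        return self.entries.setdefault(name, {})

    def lookup(self, name):                  # lookup: None if the name is absent
        return self.entries.get(name)

    def set_attribute(self, name, key, value):
        self.insert(name)[key] = value

    def get_attribute(self, name, key):
        entry = self.lookup(name)
        return None if entry is None else entry.get(key)

    def free(self):                          # free: drop all entries
        self.entries.clear()

table = SymbolTable()
table.set_attribute("a", "type", "int")
print(table.get_attribute("a", "type"))      # int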
The term "symbol table" is used in three places:
1. In a compiler: a data structure used by the compiler to keep track of identifiers used in the source program. This is a compile-time data structure, not used at run time.
2. In object files: a symbol table (mapping variable name to address) can be built into object programs, to be used during linking of different object programs to resolve references.
3. In executables: a symbol table (again mapping name to address) can be included in executables, to recover variable names during debugging.
• Our focus: the symbol table in a compiler.

Variable Declarations
• In statically typed programming languages, variables need to be declared before they are used. The declaration provides the data type of the variable.
• E.g., int a; float b; string c;
• Most typically, a declaration is valid for the scope in which it occurs:
• Function scoping: a variable is defined anywhere in the function in which it is declared, after the point of definition
• Block scoping: a variable is only valid within the block of code in which it is defined, e.g., prog xxx {int a; float b} { int c; { int b; c = a + b; } return float(c) / b }

Identifiers
• Identifiers are user-supplied names, such as variable names, function names, and labels (e.g., where goto is allowed).
• The symbol table is typically implemented as a hash table, where the KEY is the symbol and the VALUE is information about the symbol.

Terminology
• Symbol: the character string recognised during lexical analysis; e.g., “begin”, “int”, “A”, “;” are symbols in: begin int A; A = 100; A = A+A; output A end
• Token: the syntactic label for the symbol, used during syntactic analysis. For example, given the line of code A = A + A ; during lexical analysis we label each symbol with its token: (id, ‘A’)(eqop, ‘=’)(id, ‘A’)(plus, ‘+’)(id, ‘A’)(semic, ‘;’). During syntactic analysis, we might have rules: Statement := id eqop Expr semic and Expr := id plus id. The tokens are simply the labels replacing symbols for syntactic parsing; tokens are the terminals of the syntactic analysis.
• Types of tokens: tokens can be divided into language-defined tokens (reserved words such as “begin” or “defun”, which the language defines to have special meaning; operators such as =, +, *; dividers such as { } ;) and user-defined tokens (identifiers, i.e. names of variables, procedures, etc., and literals, i.e. numbers, strings, etc. specified in the program). The symbol table is (in most cases) concerned only with storing identifiers and their attributes.

A simple symbol table
• Data structure: a hash table where the key is a symbol and the value is the token for the symbol (id, num, etc.)
• Methods: Insert(symbol, tok) – set the token of symbol to tok; Lookup(symbol) – if symbol is known, return its tok, else return 0
Type Checking

The process of verifying and enforcing the constraints of types—type checking—may occur either at compile time (a static check) or at run time. If a language specification requires its typing rules strongly (i.e., more or less allowing only those automatic type conversions that do not lose information), one can refer to the process as strongly typed; if not, as weakly typed. The terms are not usually used in a strict sense.

Static type checking
Static type checking is the process of verifying the type
safety of a program based on analysis of a program's text
(source code). If a program passes a static type checker,
then the program is guaranteed to satisfy some set of type
safety properties for all possible inputs.
Static type checking can be considered a limited form
of program verification (see type safety), and in a type-
safe language, can be considered also an optimization. If
a compiler can prove that a program is well-typed, then it
does not need to emit dynamic safety checks, allowing the
resulting compiled binary to run faster and to be smaller.
Static type checking for Turing-complete languages is
inherently conservative. That is, if a type system is
both sound (meaning that it rejects all incorrect programs)
and decidable(meaning that it is possible to write an
algorithm that determines whether a program is well-
typed), then it must be incomplete (meaning there are
correct programs, which are also rejected, even though
they do not encounter runtime errors).[6] For example,
consider a program containing the code:
if <complex test> then <do something> else
<generate type error>
Even if the expression <complex test> always
evaluates to true at run-time, most type checkers will
reject the program as ill-typed, because it is difficult (if not
impossible) for a static analyzer to determine that
the else branch will not be taken.[7] Conversely, a static
type checker will quickly detect type errors in rarely used
code paths. Without static type checking, even code
coverage tests with 100% coverage may be unable to find
such type errors. The tests may fail to detect such type
errors, because the combination of all places where
values are created and all places where a certain value is
used must be taken into account.
A number of useful and common programming language
features cannot be checked statically, such
as downcasting. Thus, many languages will have both
static and dynamic type checking; the static type checker
verifies what it can, and dynamic checks verify the rest.
Many languages with static type checking provide a way to
bypass the type checker. Some languages allow
programmers to choose between static and dynamic type
safety. For example, C# distinguishes between statically-
typed and dynamically-typed variables. Uses of the former
are checked statically, whereas uses of the latter are
checked dynamically. Other languages allow writing code
that is not type-safe. For example, in C, programmers can
freely cast a value between any two types that have the
same size.
For a list of languages with static type checking, see the
category for statically typed languages.
Dynamic type checking and runtime type information
Dynamic type checking is the process of verifying the type
safety of a program at runtime. Implementations of
dynamically type-checked languages generally associate
each runtime object with a type tag (i.e., a reference to a
type) containing its type information. This runtime type
information (RTTI) can also be used to implement dynamic
dispatch, late binding, downcasting, reflection, and similar
features.
Most type-safe languages include some form of
dynamic type checking, even if they also have a static
type checker.[citation needed] The reason for this is that
many useful features or properties are difficult or
impossible to verify statically. For example, suppose
that a program defines two types, A and B, where B is
a subtype of A. If the program tries to convert a value
of type A to type B, which is known as downcasting,
then the operation is legal only if the value being
converted is actually a value of type B.

5. Specification of a Simple Type Checker
A type checker for a simple language checks the type of
each identifier. The type checker is a translation scheme
that synthesizes the type of each expression from the
types of its subexpressions. The type checker can handle
arrays, pointers, statements and functions.

A Simple Language
Consider the following grammar:
P→D;E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑

Translation scheme:
P→D;E
D→D;D
D → id : T { addtype (id.entry , T.type) }
T → char { T.type : = char }
T → integer { T.type : = integer }
T → ↑ T1 { T.type : = pointer(T1.type) }
T → array [ num ] of T1 { T.type : = array ( 1… num.val ,
T1.type) }

In the above language:
→ there are two basic types: char and integer;
→ type_error is used to signal errors;
→ the prefix operator ↑ builds a pointer type. For example, ↑ integer leads to the type expression pointer ( integer ).

Type checking of expressions

In the following rules, the attribute type for E gives the type
expression assigned to the expression generated by E.

1. E → literal { E.type : = char }
   E → num { E.type : = integer }


Here, constants represented by the tokens literal and num have
type char and integer.

2. E → id { E.type : = lookup ( id.entry ) }

lookup ( e ) is used to fetch the type saved in the symbol table


entry pointed to by e.

3. E → E1 mod E2 { E.type : = if E1. type = integer and E2.


type = integer then integer
else type_error }

The expression formed by applying the mod operator to two


subexpressions of type integer has type integer; otherwise, its
type is type_error.

4. E → E1 [ E2 ] { E.type : = if E2.type = integer and E1.type =


array(s,t) then t
else type_error }

In an array reference E1 [ E2 ] , the index expression E2 must


have type integer. The result is the element type t obtained from
the type array(s,t) of E1.
5. E → E1 ↑ { E.type : = if E1.type = pointer (t) then t
else type_error }
The postfix operator ↑ yields the object pointed to by its
operand. The type of E ↑ is the type t of the object pointed to by
the pointer E.
Type checking of statements

Statements do not have values; hence the basic type void


can be assigned to them. If an error is detected within a
statement, then type_error is assigned.

Translation scheme for checking the type of statements:

1. Assignment statement:
S → id : = E { S.type : = if id.type = E.type then void else type_error }

2. Conditional statement:
S → if E then S1 { S.type : = if E.type = boolean then S1.type else type_error }

3. While statement:
S → while E do S1 { S.type : = if E.type = boolean then S1.type else type_error }

4. Sequence of statements:
S → S1 ; S2 { S.type : = if S1.type = void and S2.type = void then void else type_error }

Type checking of functions

The rule for checking the type of a function application is:
E → E1 ( E2 ) { E.type : = if E2.type = s and E1.type = s → t then t else type_error }
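As a concrete illustration, here is a small Python sketch of the expression rules above; the tuple encoding of types, the sample symbol table and the helper names are my own assumptions, not part of the notes.

CHAR, INTEGER, TYPE_ERROR = "char", "integer", "type_error"

# A toy symbol table: i is an integer, p a pointer(integer), a an array(1..10, integer).
symtab = {"i": INTEGER, "p": ("pointer", INTEGER), "a": ("array", 10, INTEGER)}

def lookup(name):
    return symtab.get(name, TYPE_ERROR)

def check(e):
    kind = e[0]
    if kind == "literal":  return CHAR                 # E -> literal
    if kind == "num":      return INTEGER              # E -> num
    if kind == "id":       return lookup(e[1])         # E -> id
    if kind == "mod":                                  # E -> E1 mod E2
        return INTEGER if check(e[1]) == INTEGER and check(e[2]) == INTEGER else TYPE_ERROR
    if kind == "index":                                # E -> E1 [ E2 ]
        t1, t2 = check(e[1]), check(e[2])
        return t1[2] if t2 == INTEGER and isinstance(t1, tuple) and t1[0] == "array" else TYPE_ERROR
    if kind == "deref":                                # E -> E1 ↑
        t1 = check(e[1])
        return t1[1] if isinstance(t1, tuple) and t1[0] == "pointer" else TYPE_ERROR
    return TYPE_ERROR

print(check(("mod", ("id", "i"), ("num", 5))))         # integer
print(check(("index", ("id", "a"), ("id", "p"))))      # type_error (index is not an integer)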

Thus, a dynamic check is needed to verify that the


operation is safe. This requirement is one of the criticisms
of downcasting.
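As a hypothetical illustration of such a run-time check, here it is in Python, where isinstance plays the role of inspecting the value's type tag (the class names A and B are just the ones from the discussion above):

class A: pass
class B(A): pass          # B is a subtype of A

def downcast_to_b(value: A) -> B:
    if isinstance(value, B):          # the dynamic check described above
        return value
    raise TypeError("value is not actually a B")

downcast_to_b(B())                    # fine
# downcast_to_b(A())                  # would raise TypeError at run time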
By definition, dynamic type checking may cause a
program to fail at runtime. In some programming
languages, it is possible to anticipate and recover from
these failures. In others, type-checking errors are
considered fatal.
Programming languages that include dynamic type
checking but not static type checking are often called
"dynamically typed programming languages". For a list of
such languages, see the category for dynamically typed
programming languages.
Combining static and dynamic type checking
Some languages allow both static and dynamic typing
(type checking), sometimes called soft typing. For
example, Java and some other ostensibly statically typed
languages support downcasting types to their subtypes,
querying an object to discover its dynamic type, and other
type operations that depend on runtime type information.
More generally, most programming languages include
mechanisms for dispatching over different 'kinds' of data,
such as disjoint unions, subtype polymorphism,
and variant types. Even when not interacting with type
annotations or type checking, such mechanisms are
materially similar to dynamic typing implementations.
See programming language for more discussion of the
interactions between static and dynamic typing.
Objects in object-oriented languages are usually accessed
by a reference whose static target type (or manifest type)
is equal to either the object's run-time type (its latent type)
or a supertype thereof. This is conformant with the Liskov
substitution principle, which states that all operations
performed on an instance of a given type can also be
performed on an instance of a subtype. This concept is
also known as subsumption. In some languages subtypes
may also possess covariant or contravariant return types
and argument types respectively.
Certain languages, for example Clojure, Common Lisp,
or Cython are dynamically type-checked by default, but
allow programs to opt into static type checking by
providing optional annotations. One reason to use such
hints would be to optimize the performance of critical
sections of a program. This is formalized by gradual
typing. The programming environment DrRacket, a
pedagogic environment based on Lisp, and a precursor of
the language Racket was also soft-typed.
Conversely, as of version 4.0, the C# language provides a
way to indicate that a variable should not be statically
type-checked. A variable whose type is dynamic will not
be subject to static type checking. Instead, the program
relies on runtime type information to determine how the
variable may be used.[8]
Static and dynamic type checking in practice
The choice between static and dynamic typing requires
certain trade-offs.
Static typing can find type errors reliably at compile time,
which should increase the reliability of the delivered
program. However, programmers disagree over how
commonly type errors occur, resulting in further
disagreements over the proportion of those bugs that are
coded that would be caught by appropriately representing
the designed types in code.[9][10] Static typing
advocates[who?] believe programs are more reliable when
they have been well type-checked, whereas dynamic-
typing advocates[who?] point to distributed code that has
proven reliable and to small bug databases.[citation
needed]
The value of static typing, then,
presumably[vague] increases as the strength of the type
system is increased. Advocates of dependent
typing,[who?] implemented in languages such as Dependent
ML and Epigram, have suggested that almost all bugs can
be considered type errors, if the types used in a program
are properly declared by the programmer or correctly
inferred by the compiler.[11]
Static typing usually results in compiled code that
executes faster. When the compiler knows the exact data
types that are in use (which is necessary for static
verification, either through declaration or inference) it can
produce optimized machine code. Some dynamically
typed languages such as Common Lisp allow optional
type declarations for optimization for this reason.
By contrast, dynamic typing may allow compilers to run
faster and interpreters to dynamically load new code,
because changes to source code in dynamically typed
languages may result in less checking to perform and less
code to revisit.[clarification needed] This too may reduce the edit-
compile-test-debug cycle.
Statically typed languages that lack type inference (such
as C and Java) require that programmers declare the
types that a method or function must use. This can serve
as added program documentation, that is active and
dynamic, instead of static. This allows a compiler to
prevent it from drifting out of synchrony, and from being
ignored by programmers. However, a language can be
statically typed without requiring type declarations
(examples include Haskell, Scala, OCaml, F#, and to a
lesser extent C# and C++), so explicit type declaration is
not a necessary requirement for static typing in all
languages.
Dynamic typing allows constructs that some static type
checking would reject as illegal. For
example, eval functions, which execute arbitrary data as
code, become possible. An eval function is possible with
static typing, but requires advanced uses of algebraic data
types. Further, dynamic typing better accommodates
transitional code and prototyping, such as allowing a
placeholder data structure (mock object) to be
transparently used in place of a full data structure (usually
for the purposes of experimentation and testing).
Dynamic typing typically allows duck typing (which
enables easier code reuse). Many[specify] languages with
static typing also feature duck typing or other mechanisms
like generic programming that also enable easier code
reuse.
Dynamic typing typically makes metaprogramming easier
to use. For example, C++ templates are typically more
cumbersome to write than the
equivalent Ruby or Python code since C++ has stronger
rules regarding type definitions (for both functions and
variables). This forces a developer to write more
boilerplate code for a template than a Python developer
would need to. More advanced run-time constructs such
as metaclasses and introspection are often harder to use
in statically typed languages. In some languages, such
features may also be used e.g. to generate new types and
behaviors on the fly, based on run-time data. Such
advanced constructs are often provided by dynamic
programming languages; many of these are dynamically
typed, although dynamic typing need not be related
to dynamic programming languages.

Type checking is the validation of the set of type rules.
Examples:
– The type of a variable must match the type from its declaration
– The operands of arithmetic expressions (+, *, -, /) must have integer types; the result has integer type
– The operands of comparison expressions (==, !=) must have integer or string types; the result has Boolean type
More examples:
– For each assignment statement, the type of the updated variable must match the type of the expression being assigned
– For each call statement foo(v1, …, vn), the type of each actual argument vi must match the type of the corresponding formal parameter fi from the declaration of function foo
– The type of the return value must match the return type from the declaration of the function

Introduction

The fundamental purpose of a type system
is to prevent the occurrence of execution errors during the
running of a program. This informal statement motivates the
study of type systems, but requires clarification. Its accuracy
depends, first of all, on the rather subtle issue of what constitutes
an execution error, which we will discuss in detail. Even when
that is settled, the absence of execution errors is a nontrivial
property. When such a property holds for all of the program runs
that can be expressed within a programming language, we say
that the language is type sound. It turns out that a fair amount of
careful analysis is required to avoid false and embarrassing
claims of type soundness for programming languages. As a
consequence, the classification, description, and study of type
systems has emerged as a formal discipline. The formalization
of type systems requires the development of precise notations
and definitions, and the detailed proof of formal properties that
give confidence in the appropriateness of the definitions.
Sometimes the discipline becomes rather abstract. One should
always remember, though, that the basic motivation is
pragmatic: the abstractions have arisen out of necessity and can
usually be related directly to concrete intuitions. Moreover,
formal techniques need not be applied in full in order to be
useful and influential. A knowledge of the main principles of
type systems can help in avoiding obvious and not so obvious
pitfalls, and can inspire regularity and orthogonality in language
design. When properly developed, type systems provide
conceptual tools with which to judge the adequacy of important
aspects of language definitions. Informal language descriptions
often fail to specify the type structure of a language in sufficient
detail to allow unambiguous implementation. It often happens
that different compilers for the same language implement
slightly different type systems. Moreover, many language
definitions have been found to be type unsound, allowing a
program to crash even though it is judged acceptable by a
typechecker. Ideally, formal type systems should be part of the
definition of all typed programming languages. This way,
typechecking algorithms could be measured unambiguously
against precise specifications and, if at all possible and feasible,
whole languages could be shown to be type sound. In this
introductory section we present an informal nomenclature for
typing, execution errors, and related concepts. We discuss the
expected properties and benefits of type systems, and we review
how type systems can be formalized. The terminology used in
the introduction is not completely standard; this is due to the
inherent inconsistency of standard terminology arising from
various sources. In general, we avoid the
words type and typing when referring to run time
concepts; for example we replace dynamic typing with dynamic
checking and avoid common but ambiguous terms such as
strong typing. The terminology is summarized in the Defining
Terms section. In section 2, we explain the notation commonly
used for describing type systems. We review judgments, which
are formal assertions about the typing of programs, type rules,
which are implications between judgments, and derivations,
which are deductions based on type rules. In section 3, we
review a broad spectrum of simple types, the analog of which
can be found in common languages, and we detail their type
rules. In section 4, we present the type rules for a simple but
complete imperative language. In section 5, we discuss the type
rules for some advanced type constructions: polymorphism and
data abstraction. In section 6, we explain how type systems can
be extended with a notion of subtyping. Section 7 is a brief
commentary on some important topics that we have glossed
over. In section 8, we discuss the type inference problem, and
we present type inference algorithms for the main type systems
that we have considered. Finally, section 9 is a summary of
achievements and future directions. Execution errors The most
obvious symptom of an execution error is the occurrence of an
unexpected software fault, such as an illegal instruction fault or
an illegal memory reference fault. There are, however, more
subtle kinds of execution errors that result in data corruption
without any immediate symptoms. Moreover, there are software
faults, such as divide by zero and dereferencing nil, that are not
normally prevented by type systems. Finally, there are languages
lacking type systems where, nonetheless, software faults do not
occur. Therefore we need to define our terminology carefully,
beginning with what is a type.

Typed and untyped languages
A
program variable can assume a range of values during the
execution of a program. An upper bound of such a range is
called a type of the variable. For example, a variable x of type
Boolean is supposed to assume only boolean values during
every run of a program. If x has type Boolean, then the boolean
expression not(x) has a sensible meaning in every run of the
program. Languages where variables can be given (nontrivial)
types are called typed languages. Languages that do not restrict
the range of variables are called untyped languages: they do not
have types or, equivalently, have a single universal type that
contains all values. In these languages, operations may be
applied to inappropriate arguments: the result may be a fixed
arbitrary value, a fault, an exception, or an unspecified effect.
The pure λ-calculus (see Chapter 139) is an extreme case of an
untyped language where no fault ever occurs: the only operation
is function application and, since all values are functions that
operation never fails. A type system is that component of a
typed language that keeps track of the types of variables and, in
general, of the types of all expressions in a program. Type
systems are used to determine whether programs are well
behaved (as discussed subsequently). Only program sources that
comply with a type system should be considered real programs
of a typed language; the other sources should be discarded
before they are run. A language is typed by virtue of the
existence of a type system for it, whether or not types actually
appear in the syntax of programs. Typed languages are explicitly
typed if types are part of the syntax, and implicitly typed
otherwise. No mainstream language is purely implicitly typed,
but languages such as ML and Haskell support writing large
program fragments where type information is omitted; the type
systems of those languages automatically assign types to such
program fragments.

Execution errors and safety
It is useful to
distinguish between two kinds of execution errors: the ones that
cause the computation to stop immediately, and the ones that go
unnoticed (for a while) and later cause arbitrary behavior. The
former are called trapped errors, whereas the latter are untrapped
errors. An example of an untrapped error is improperly
accessing a legal address, for example, accessing data past the
end of an array in absence of run time bounds checks. Another
untrapped error that may go unnoticed for an arbitrary length of
time is jumping to the wrong address: memory there may or may
not represent an instruction stream. Examples of trapped errors
are division by zero and accessing an illegal address: the
computation stops immediately (on many computer
architectures). A program fragment is safe if it does not cause
untrapped errors to occur. Languages where all program
fragments are safe are called safe languages. Therefore, safe
languages rule out the most insidious form of execution errors:
the ones that may go unnoticed. Untyped languages may enforce
safety by performing run time checks. Typed languages may
enforce safety by statically rejecting all programs that are
potentially unsafe. Typed languages may also use a mixture of
run time and static checks. Although safety is a crucial property
of programs, it is rare for a typed language to be concerned
exclusively with the elimination of untrapped errors. Typed
languages usually aim to rule out also large classes of trapped
errors, along with the untrapped ones.

In programming languages, a type system is a set of
rules that assigns a property called type to the various
constructs of a computer program, such
as variables, expressions, functions or modules.[1] These
types formalize and enforce the otherwise implicit
categories the programmer uses for data structures and
components (e.g. "string", "array of float", "function
returning boolean"). The main purpose of a type system is
to reduce possibilities for bugs in computer programs[2] by
defining interfaces between different parts of a computer
program, and then checking that the parts have been
connected in a consistent way. This checking can happen
statically (at compile time), dynamically (at run time), or as
a combination of static and dynamic checking. Type
systems have other purposes as well, such as expressing
business rules, enabling certain compiler optimizations,
allowing for multiple dispatch, providing a form of
documentation, etc.
A type system associates a type with each computed
value and, by examining the flow of these values, attempts
to ensure or prove that no type errors can occur. The
given type system in question determines exactly what
constitutes a type error, but in general the aim is to
prevent operations expecting a certain kind of value from
being used with values for which that operation does not
make sense (logic errors). Type systems are often
specified as part of programming languages, and built into
the interpreters and compilers for them; although the type
system of a language can be extended by optional
tools that perform added kinds of checks using the
language's original type syntax and grammar.
