
ASCII vs. Binary Files

Introduction

Most people classify files in two categories: binary files and ASCII (text) files. You've
actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an
ASCII file.

An ASCII file is defined as a file that consists of ASCII characters. It's usually created by
using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for
writing code, but they may not always save it as ASCII.

As an aside, ASCII text files seem very "American-centric". After all, the 'A' in ASCII stands
for American. However, the US does seem to dominate the software market, and so
effectively, it's an international standard.

Computer science is all about creating good abstractions. Sometimes it succeeds and
sometimes it doesn't. Good abstractions are all about presenting a view of the world that the
user can use. One of the most successful abstractions is the text editor.

When you're writing a program, and typing in comments, it's hard to imagine that this
information is not being stored as characters. Of course, if someone really said "Come on,
you don't really think those characters are saved as characters, do you? Don't you know about
the ASCII code?", then you'd grudgingly agree that ASCII/text files are really stored as 0's
and 1's.

But it's tough to think that way. ASCII files are really stored as 1's and 0's. But what does it
mean to say that it's stored as 1's and 0's? Files are stored on disks, and disks have some way
to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction.
Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think
of them that way.

In effect, ASCII files are basically binary files, because they store binary numbers. That is,
ASCII files store 0's and 1's.

The Difference between ASCII and Binary Files?

An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit
code stored in a byte. To be more specific, there are 128 different ASCII codes, which means
that only 7 bits are needed to represent an ASCII character.

However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any
byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the
bits. In particular, the most significant bit of each byte is not being used.
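
The "most significant bit is 0" rule is easy to check in code. Here is a minimal sketch in Python; the function name is_ascii is my own, not something from a standard tool.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte has its most significant bit cleared."""
    return all((byte & 0x80) == 0 for byte in data)


print(is_ascii(b"cat"))                # every byte is below 0x80
print(is_ascii(bytes([0x63, 0xE9])))   # 0xE9 has its top bit set
```

Any byte of 0x80 or above uses the eighth bit, so a file containing one is not a pure ASCII file.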

Although ASCII files are binary files, some people treat them as different kinds of files. I like
to think of ASCII files as special kinds of binary files. They're binary files where each byte is
written in ASCII code.

A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in
any byte of a binary file.

We work with binary files all the time. Executables, object files, image files, sound files, and
many file formats are binary files. What makes them binary is merely the fact that each byte
of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.

Example of ASCII files

Suppose you're editing a text file with a text editor. Because you're using a text editor, you're
pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters
'c', then 'a', then 't'. Then, you save the file and quit.

What happens? For the time being, we won't worry about the mechanism of what it means to
open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.

If you look up an ASCII table, you will discover that the ASCII codes for 'c', 'a', and 't' are
0x63, 0x61, and 0x74 (the 0x merely indicates the values are in hexadecimal, instead of
decimal/base 10).

Here's how it looks:

ASCII    'c'        'a'        't'
Hex      63         61         74
Binary   0110 0011  0110 0001  0111 0100

Each time you type in an ASCII character and save it, an entire byte is written which
corresponds to that character. This includes punctuation, spaces, and so forth. I recall one
time a student used 100 asterisks in his comments, and these asterisks appeared
everywhere. Each asterisk used up one byte in the file. We saved thousands of bytes from his
files by removing comments, mostly the asterisks, which made the file look nice, but didn't
add to the clarity.

Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
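
You can reproduce the "cat" example yourself. The sketch below uses Python's built-in str.encode; the variable names are only for illustration.

```python
text = "cat"
encoded = text.encode("ascii")   # one byte per character

# Pair each character with its byte, shown in hex and in binary.
rows = [(char, f"{byte:02x}", f"{byte:08b}") for char, byte in zip(text, encoded)]

for char, hex_code, bits in rows:
    print(char, hex_code, bits)
```

Three characters go in, and three bytes come out: 63, 61, and 74 in hex.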

Now sometimes a text editor throws in characters you may not expect. For example, some
editors "insist" that each line end with a newline character.

What does that mean? I was once asked by a student what happens if the end of a line does not
have a newline character. This student thought that files were saved in two dimensions
(whether the student realized it or not). He didn't know that a file is saved as a one-dimensional
array of bytes. He didn't realize that the newline character defines the end of a line. Without
that newline character, you haven't reached the end of the line.

The only place a file can be missing a newline at the end of the line is the very last line. Some
editors allow the very last line to end in something besides a newline character. Some editors
add a newline at the end of every file.

Unfortunately, even the newline character is not universally standard. UNIX files end each
line with a single newline character (\n), but Windows files commonly end each line with two
characters: a carriage return followed by a newline (\r\n). Why two characters when only one
is necessary?

This dates back to printers. In the old days, the time it took for a printer to return to the
beginning of a line was equal to the time it took to type two characters. So, two characters
were placed in the file to give the printer time to move the printer ball back to the beginning
of the line.

This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case
you've wondered why transferring files to UNIX from Windows sometimes generates funny
characters.
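
The funny characters are usually the leftover carriage returns, which some UNIX editors display as ^M. Here is a rough sketch of the difference between the two conventions; the helper name to_unix_newlines is made up for this example.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace each carriage return + newline pair with a single newline."""
    return data.replace(b"\r\n", b"\n")


windows_text = b"cat\r\ndog\r\n"   # two lines, Windows style
unix_text = to_unix_newlines(windows_text)

print(len(windows_text), len(unix_text))
```

The Windows version is two bytes longer, one extra byte for each line.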

Editing Binary Files

Now that you know that each character typed in an ASCII file corresponds to one byte in a
file, you might understand why it's difficult to edit a binary file.

If you want to edit a binary file, you really would like to edit individual bits. For example,
suppose you want to write the binary pattern 1100 0011. How would you do this?

You might be naive, and type in the following in a file:

11000011

But you should know, by now, that this is not editing individual bits of a file. If you type in '1'
and '0', you are really entering 0x31 and 0x30 (49 and 48 in decimal). That is, you're entering
0011 0001 and 0011 0000 into the file. You're actually (indirectly) typing 8 bits at a time.
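
To see the difference, compare what a text editor would save with the single byte you actually wanted. This is only an illustrative sketch, and the variable names are mine.

```python
typed = "11000011".encode("ascii")   # what the text editor saves
intended = bytes([0b11000011])       # the one byte we actually wanted

print(len(typed), typed.hex())
print(len(intended), intended.hex())
```

The typed version is eight bytes, each holding the ASCII code for '1' (0x31) or '0' (0x30). The intended version is the single byte 0xc3.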

"But how am I supposed to edit binary files?", you exclaim! Sometimes I see this dilemma.
Students are told to perform a task. They try to do the task, and even though their solution
makes no sense at all, they still do it. If asked to think about whether this solution really
works, they might eventually reason that it's wrong, but then they'd ask "But how do I edit a
binary file? How do I edit the individual bits?"

The answer is not simple. There are some programs that allow you to type in 49 and translate
this to a single byte, 0100 1001, instead of the ASCII codes for '4' and '9'. These programs
are called hex editors. Unfortunately, they may not be readily available. It's not too hard,
though, to write a program that reads in an ASCII file that looks like hex pairs and converts it
to a true binary file with the corresponding bit patterns.

That is, it takes a file that looks like:

63 a0 de

and converts this ASCII file to a binary file that begins 0110 0011 (which is 63 in binary).
Notice that this file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' '
(space), 'a', '0', and so forth. A program can read this ASCII file then generate the appropriate
binary code and write that to a file.

Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the
output binary file would contain 3 bytes, one byte per hex pair.
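
Such a converter can be very short. Here is a minimal sketch, assuming the input is nothing but hex pairs separated by whitespace; the function name hex_text_to_bytes is hypothetical.

```python
def hex_text_to_bytes(text: str) -> bytes:
    """Convert whitespace-separated hex pairs into the bytes they name."""
    return bytes(int(pair, 16) for pair in text.split())


source = "63 a0 de"                  # the ASCII file: 8 characters
binary = hex_text_to_bytes(source)   # the binary file: 3 bytes

print(len(source.encode("ascii")), len(binary))
```

Python's built-in bytes.fromhex does the same job. The loop is spelled out here only to show that each pair becomes one byte.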

Viewing Binary Files

Most operating systems come with some program that allows you to view a file in "binary"
format. However, reading 0's and 1's can be cumbersome, so they usually translate to
hexadecimal. On Linux, programs such as hexdump and xxd are commonly available for
this.

While most people prefer to view files through a text editor, you can only conveniently view
ASCII files this way. Most text editors will let you look at a binary file (such as an
executable), but they insert things that look like ^@ to indicate control characters.

A good hexdump will attempt to translate the hex pairs to printable ASCII if it can. This is
interesting because you discover that in, say, executables, many parts of the file are still
written in ASCII. So this is a very useful feature to have.
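
To make the idea concrete, here is a small sketch of a hex viewer with an ASCII column. It is not how hexdump or xxd are implemented, and the layout is only loosely modeled on their output.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex pairs, and a printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Printable ASCII runs from 0x20 (space) to 0x7e (~); show the rest as '.'
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  |{text_part}|")
    return "\n".join(lines)


print(hexdump(b"cat\n\x00\xff"))
```

The newline, the zero byte, and 0xff all show up as dots in the text column, while "cat" stays readable.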

Writing Binary Files, Part 2

Why do people use binary files anyway? One reason is compactness. For example, suppose
you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters
(which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using
4 bytes (the size of a 32-bit integer).

ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space.
You can represent information more compactly by using binary files.
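
Here is the 100000 example as a sketch, using Python's struct module. The little-endian 32-bit layout ("<I") is my choice for illustration.

```python
import struct

as_text = b"100000"                     # six ASCII digits
as_binary = struct.pack("<I", 100000)   # 32-bit unsigned, little-endian

print(len(as_text), len(as_binary))
print(as_binary.hex())
```

The text form takes six bytes and the binary form takes four. The saving grows with the number of digits: any value up to 4294967295 still fits in those four bytes.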

For example, one thing you can do is to save an object to a file. This is a kind of serialization.
To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object
and the number of bytes used to represent the object (use the sizeof operator to determine
this) to the write() method. The method then dumps out the bytes as they appear in memory
into a file.

You can then recover the information from the file and place it into the object by using a
corresponding read() method which typically takes a pointer to an object (and it should point
to an object that has memory allocated, whether it be statically or dynamically allocated) and
the number of bytes for the object, and copies the bytes from the file into the object.
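
The text describes this in C terms (a pointer and sizeof). The same round trip can be sketched in Python with the struct module, which plays the role of the fixed memory layout. The record here (an id and two floats) and the helper names are hypothetical.

```python
import io
import struct

# Hypothetical record: a 32-bit id followed by two 32-bit floats.
POINT = struct.Struct("<iff")


def write_point(stream, ident, x, y):
    """Dump the record's bytes to the stream."""
    stream.write(POINT.pack(ident, x, y))


def read_point(stream):
    """Read exactly one record's worth of bytes and unpack it."""
    return POINT.unpack(stream.read(POINT.size))


buffer = io.BytesIO()   # stands in for a file opened in binary mode
write_point(buffer, 7, 1.5, -2.0)
buffer.seek(0)
print(read_point(buffer))
```

POINT.size plays the part of sizeof: it is the number of bytes that one record occupies.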

Of course, you must be careful. If you use two different compilers, or transfer the file from
one kind of machine to another, this process may not work. In particular, the object may be
laid out differently. This can be as simple as endianness, or there may be issues with padding.
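
Both problems are easy to show. In this sketch, the same number is packed with two byte orders, and the same pair of fields is measured with and without native alignment. The native size depends on your platform, so I don't state it here.

```python
import struct

# Endianness: the same value, with the four bytes in opposite orders.
little = struct.pack("<I", 100000)
big = struct.pack(">I", 100000)
print(little.hex(), big.hex())

# Padding: one char followed by one int.
packed_size = struct.calcsize("=bi")   # standard sizes, no padding: 1 + 4
native_size = struct.calcsize("@bi")   # native alignment, platform dependent
print(packed_size, native_size)
```

If the writer and the reader disagree about either one, the bytes are read back as the wrong values.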

This way of saving objects to a file is nice and simple, but it may not be all that portable.
Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will
write out the addresses to the file. Those addresses are likely to be totally meaningless.
Addresses may make sense at the time a program is running, but if you quit and restart, those
addresses may change.

This is why some people invent their own format for storing objects: to increase portability.

But if you know you aren't storing objects that contain pointers, and you are reading the file
in on the same kind of computer system you wrote it on, and you're using the same compiler,
it should work.

This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire
objects. They tend to be somewhat more portable.

Summary

An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit
encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to
0. Think of an ASCII file as a special kind of binary file.

A generic binary file uses all 8 bits. Each byte of a binary file can have any of the full 256
bit patterns (as opposed to an ASCII file, which only uses 128 bit patterns).

There may come a time when Unicode text files become more prevalent. But for now, ASCII
files are the standard format for text files.
