
Homework 3

Overview
In this assignment you will design and implement a tool to suggest potentially useful indexes
for a Postgres database, based on the read and write workload on the database. For example,
if a common query is "SELECT * FROM Emp WHERE Salary>100000" then it might be
useful to create an index on Emp.Salary. On the other hand, if Emp.Salary is frequently
updated, you will have to weigh the tradeoff of extra cost on update compared to the benefit
to select queries.
The goal of this assignment is threefold: 1. to understand and modify the code of a real
DBMS, 2. to think about index selection strategies, and 3. to design an architecture for a
DBMS tool. In particular, you will modify the Postgres source code to gather the appropriate
workload statistics, and write a tool that uses those statistics to suggest indexes to the DBA.
You will also have to think carefully about how to decide whether to suggest a potential
index.
Due
This assignment is due May 8. While we expect that the modifications to the Postgres source
may not be very extensive (potentially just touching a few source files), understanding the
structure of the Postgres source and how to get the information you need will likely be
challenging.
1. Obtain and build Postgres source
We suggest for this assignment that you work on ilinux1.eecs.berkeley.edu or
ilinux2.eecs.berkeley.edu. Once you start using ilinux1 or ilinux2 you should continue to use
the same machine, as binaries compiled on one machine may not work properly on the other.
The full source and build of Postgres is 160 MB or more, which is too large to fit in your
home directory given student quotas. So you will be doing most of your work in /home/tmp,
which is a large filesystem. Because /home/tmp is shared between many classes, and we want
to be good citizens, we strongly suggest that you work in pairs for this assignment, and that
only one of you uncompress and build the postgres source. Thus, you will have to work
together to write code in this shared code tree.
Warning: Make sure that you prefix all of the postgres commands with the appropriate path,
so that you pick up the commands you have built and not the standard class postgres
instance. For example:
./postgres

will run the postgres command in the current directory, while


postgres

will likely run the postgres command from your path, which won't be the one you built.
To obtain and build the postgres source:

1. If you don't already have a directory in /home/tmp, create one by running the
   following command:

   /share/b/bin/mkhometmpdir

   If you have problems running this command on ilinux1 or ilinux2, try logging into
   pulsar.eecs.berkeley.edu and running it there. That is, do the project on one of the
   linux machines, but carry out just this step of making the directory on pulsar.
   After you have done this step, you should have a directory called
   /home/tmp/cs186-xx, where cs186-xx is your login.
2. Change to this tmp directory, and untar the source from the "cs186" account home
   directory:

   cd /home/tmp/cs186-xx
   tar xzvf ~cs186/postgresql-8.3.1.tar.gz

This will create a directory called "postgresql-8.3.1" in your /home/tmp/cs186-xx
directory, containing all the source.
The files in /home/tmp should not get deleted, but just to be safe you should
periodically copy any files you edit to your home directory as a backup.
3. Configure your build. First, choose a directory, preferably on /home/tmp/cs186-xx,
   where your built binaries will be installed. For example, you might choose
   "/home/tmp/cs186-xx/pgbin". This directory must be specified to the configure script
   using the --prefix flag. You should also use the --enable-debug flag, so that postgres
   will be built with debugging symbols and you can use a debugger like gdb during
   your project.

   cd /home/tmp/cs186-xx/postgresql-8.3.1
   ./configure --prefix=/home/tmp/cs186-xx/pgbin --enable-debug

This step will create the necessary configuration scripts and makefiles to build
postgres.
4. Make postgres:

   gmake

If all goes well, at the end you should get a message that says: "All of PostgreSQL
successfully made. Ready to install."
5. Install postgres in your build directory:

   gmake install

If successful, you should see the message: "PostgreSQL installation complete."

6. If necessary, create a data directory for postgres. You might still have a data directory
   from previous projects, but if you have trouble using your existing data directory with
   this version of postgres, you can throw it away and start over.

   mkdir /home/cc/cs186/sp08/class/cs186-xx/pgdata
   cd /home/tmp/cs186-xx/pgbin/bin
   ./initdb -D /home/cc/cs186/sp08/class/cs186-xx/pgdata

Note that this will require about 33 MB in your home directory, so clear out files from
previous assignments if necessary.
7. Run your built postgres:

   cd /home/tmp/cs186-xx/pgbin/bin
   ./postgres -D /home/cc/cs186/sp08/class/cs186-xx/pgdata

   Now, from another shell, you should be able to run standard postgres commands:

   cd /home/tmp/cs186-xx/pgbin/bin
   ./createdb test
   ./psql test

2. Modify Postgres
Next, we suggest you make a small change to Postgres, such as printing "Hello world"
from the server when a query is run. Make sure you can make, install, and run your
modified server. You will need to repeat the "gmake; gmake install" commands, and
shut down and restart your running version of postgres.
For this assignment, when you turn in modified source files, you should clearly mark your
changes by surrounding them with comments:
/* BEGIN HW3 */
your additions, modifications, deletions to code ...
/* END HW3 */

This will help us find the relevant portions of code.
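As a sketch of such a change (assuming Postgres 8.3, where the executor entry points live in src/backend/executor/execMain.c; the exact function and placement are for you to determine), a marked hello-world modification might look like:

```c
/* Somewhere in the executor startup path in src/backend/executor/execMain.c --
 * a sketch only; the precise spot is yours to choose. */
/* BEGIN HW3 */
elog(NOTICE, "HW3: hello world, a plan is about to run");
/* END HW3 */
```

A NOTICE-level elog message is sent to the connected client (and, depending on the logging settings, the server log), which makes it easy to confirm that the server you rebuilt is the one actually running.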


3. Tasks
You should track the number of times Postgres does a sequential scan of a table, as well as
the number of tuples inserted into or deleted from the table, and the number of times a
particular attribute of a tuple in the table is updated. Using these statistics (and any other
information you need to collect), your tool should suggest zero or more indexes to be created.
This suggestion will require a "cost model" that weighs the potential benefits of an index
versus its costs (more on this below). The tool can be run offline; that is, the tool to suggest
indexes itself does not have to be a part of the Postgres server process. However, it must use
real statistics from plans that are actually executed against the postgres backend.
Your cost model is really just a formula that determines whether an index will produce a net
benefit, or net cost, to the performance of the system. Ideally, you would count the number of
I/O's done by plans before and after using the index. However, for the purposes of this

assignment, you can use a simpler cost model based on the number of tuples examined under
alternative plans. Part of this assignment is to design a simple but effective cost model, gather
the necessary statistics from the execution of postgres to feed this cost model, and write a tool
that uses these statistics and the cost model to suggest indexes.
Postgres is implemented in C and you will need to modify it by writing some C. Any external
tools you write to parse runtime statistics, gather additional statistics, and suggest indexes can
be in your language of choice, as long as you provide clear instructions on how to compile
and run your tool.
Also, you should run some test workloads against your modified server that include both
reads and writes. After running this workload, you should run your tool to make sure the
appropriate indexes are being selected.
4. Some hints

You'll need to gather statistics about plans that Postgres executes. This will require
you to understand at least something about the plan data structure. Spend some time
in relevant portions of the backend code to see what data structures are used. You
might start your exploration in execMain.c, which contains the entry point into the
executor.
Most of the time postgres takes a query and compiles it into a plan. However, there
are a few operations where postgres takes a compiled plan and interprets it rather
than executing it. EXPLAIN is one example, and it may be useful to see how the code
implementing EXPLAIN interprets the plan structure.
If you use elog or even printf inside of the postgres engine, the output will be
produced in the console or logfile from the postgres command. In order to understand
what is going on during the execution of a plan, it may be useful to liberally sprinkle
some printfs or elogs around.
It may be important to debug your modified postgres. You can use gdb to attach to the
running postgres process, and debug it. First, determine the process id to attach to.
Postgres will start multiple processes, and when you connect with psql, it will start a
separate process to handle the queries. So, if you type:
ps ux | grep postgres

you'll see something like:

cs186-md 17756 0.4 0.0 45468 3372 pts/8 S  21:38 0:00 ./postgres -D /home/cc/cs186/sp08/class/cs186-md/pgdata/
cs186-md 17758 0.0 0.0 45468  976 ?     Ss 21:38 0:00 postgres: writer process
cs186-md 17759 0.0 0.0 45468  816 ?     Ss 21:38 0:00 postgres: wal writer process
cs186-md 17760 0.0 0.0 45468  964 ?     Ss 21:38 0:00 postgres: autovacuum launcher process
cs186-md 17761 0.0 0.0  8156  868 ?     Ss 21:38 0:00 postgres: stats collector process
cs186-md 17864 0.0 0.0 46276 2704 ?     Ss 21:40 0:00 postgres: cs186-md test [local] idle

Postgres modifies the process name so that you can see more clearly what is going on.
In the example above, process 17864 (labeled "postgres: cs186-md test [local] idle")
is the process spawned to handle your queries. You can attach to this process using:
gdb postgres

and then, in gdb, typing:


(gdb) attach 17864

This will interrupt the process and allow you to use gdb to step through it and conduct
other debugging tasks. There are many good tutorials for gdb on the web if you are
not familiar with that tool.
5. Deliverables
You need to turn in several things:

The files from the postgres source tree that you have modified. You don't need to turn
in the whole source tree; just the files you have modified. Remember to include
"BEGIN HW3" and "END HW3" comments surrounding your modifications.
The source files and build instructions for any code you write outside of the postgres
source. Make sure you are extremely clear on how to build and run these tools.
A text file, called README, which describes several things:
o The general architecture of your solution. What functionality did you add to
postgres, what functionality did you build outside of postgres, and how do they
interact?
o Your cost model. What are the inputs, what is the output, and what is the
justification for the cost model?
o Some sample runs of your system. Describe the workload you ran, and cut and
paste the results of your tool showing what indexes were suggested. Provide
some discussion as to why you think those suggestions are reasonable.

For submission:

Compress all the source files into a modified.tar file.

Compress all the related files into an addition.tar file.
