
[MUSIC].

Okay, so where are we now?


So we've just given an overview of data
science itself.
And one of the things we talked about was
that there's this important aspect of
data munging or manipulation, cleaning,
restructuring and so on.
That is, perhaps, you know, ill-defined,
but is kind of what keeps people up at
night when they're working on data
science problems.
Okay.
Then we also gave an overview of
relational databases, kind of a
history of relational databases, and why
they came into being in the first place.
And, you know, we found that the original
problem being addressed was this one of
physical data independence.
That, you know, when aspects of the data
changed, all the applications broke.
And so you wanted to insulate
applications from certain kinds of
changes.
Okay, and one of the tricks here, the
secret sauce of relational databases, is
the algebra of tables that allows you to
reason about manipulation tasks.
Reason about data manipulation tasks,
independently of the grubby details of
the physical representation okay.
So this idea will come up over and over
and over again, even outside of the
context of say, you know, Oracle and
Microsoft SQL Server and IBM DB2 and
so on.
You know, you don't have to be talking
about a commercial flagship relational
database system to make use of this
relational algebra.
Okay, and we'll see that.
And so I want to spend some time in this
segment, and [INAUDIBLE] probably the
next few segments on understanding the
relational algebra.
And so, you know, at times this may
look like more of a theoretical exercise,
but I promise you it's not.
Alright.
There's an entire database
course offered, say, here at the University
of Washington and everywhere else,
that I think is a great idea to take.
We're taking out segments of that
that are demonstrably practical in a data
science setting.
Okay.
So, I also mentioned that many of the
slides in the next segment or two came
from the Introduction to Data Management
course developed by Dan Suciu and Magda
Balazinska taught here at the University
of Washington.
Okay.
So the relational algebra operators that
we hinted at but maybe haven't listed out
explicitly are these.
They include the set operations that are
lifted to support relations, and we'll
see examples of that.
And then the big three are selection,
projection, and join.
Alright, and we'll talk about the meaning of
those.
And then there are these extended
relational algebra operators that have to
do with manipulating duplicates of
tuples.
And on the next slide, I'm going to
explain where duplicates come up and why
it's important to make a
distinction between working
in the presence of duplicates and
working when there aren't duplicates.
Okay, and these include just an operator
to eliminate duplicates altogether.
There's a group by operation you may be
familiar with if you worked with SQL.
And this operator appears here and then
you can sort and so forth.
These are also extended in the sense that,
and I shouldn't have said just duplicates,
sorting, for example, doesn't
deal with duplicates.
These are extensions of the relational algebra,
you know, away from the
pure set-based, set-theory-based model.
So, for example, a set of objects doesn't
have any kind of order applied to it, and
yet we're allowed to sort things in SQL.
Okay, and it's something that is practical for
applications.
You'll be able to define what kind of
order the tuples come back in.
But it's not part of the formalism.
It was added in afterward.
So that's just the distinction between the pure
relational algebra and the extended
relational algebra.
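To make the extended operators a little more concrete, here is a rough sketch in Python of duplicate elimination, grouping, and sorting over a bag of tuples. The tuple values and variable names are made up for illustration, and this is only one way to imitate these operators outside a database.

```python
from collections import defaultdict

# A bag of (a, b) tuples; the values are hypothetical and include one duplicate.
rows = [("a1", "b1"), ("a2", "b1"), ("a1", "b1"), ("a3", "b4")]

# Duplicate elimination: keep the first occurrence of each tuple.
distinct_rows = list(dict.fromkeys(rows))

# Group by the second attribute and count the tuples in each group.
counts = defaultdict(int)
for a, b in rows:
    counts[b] += 1

# Sorting imposes an order that a pure set does not have.
sorted_rows = sorted(rows)
```

None of these three steps changes which attribute values exist. They only change how many copies come back and in what order, which is why they sit outside the pure set-based formalism.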
This is probably as close to the
theoretical underpinnings as I'm going to care
to get. The difference between these two
comes up a lot when you're trying to
prove properties about the formalism.
Right, because the [UNKNOWN] relational
algebra is much more difficult
to prove things about, if you can at
all.
But as a practical matter, the difference
between these two classes of operators is
not particularly important.
Okay.
So the takeaway here is that there's a
rich set of
operators.
But if someone says the relational algebra
to you, the first thing you should think
of is set operations plus selection,
projection, and join.
Okay.
All right.
So this notion of sets versus bags, the
duplicate question.
Well, so first of all, what is a set?
A set is a collection of objects where
there are no duplicates, and a bag is a
collection of objects where there can be
duplicates.
And so right up here, you know, a
is not repeated at all in a set,
but it may be repeated in a bag.
And whether that's legal or illegal is
what gives you the semantics of a set
versus a bag.
So you can define the relational algebra
in terms of these two different
semantics.
You can define in terms of set or you can
define in terms of bag.
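As a small illustration of the two semantics, Python's built-in set and collections.Counter behave roughly like a set and a bag. The item values here are made up, and Counter is just one convenient stand-in for a bag.

```python
from collections import Counter

# Hypothetical collection in which "a" appears twice.
items = ["a", "b", "a", "c"]

# Set semantics: duplicates are not allowed, so "a" is kept once.
as_set = set(items)

# Bag semantics: duplicates are allowed, so we track how many times each item occurs.
as_bag = Counter(items)
```

Here as_set holds three distinct objects, while as_bag still remembers that "a" occurred twice.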
And this notion of an extended relational
algebra comes from the need to sort of
work with bags, as well as other things
like sorting, as I mentioned.
Okay, so the rule of thumb here, this is
the last time I'm really going to mention
this.
The rule of thumb here is that every
paper that you read, if you end
up reading some of the papers we talk
about in this course or beyond,
will, you know, unless it says otherwise explicitly,
assume set semantics.
Okay, so be prepared for that.
While every implementation, you know, every
commercial database, will assume bag
semantics.
And we'll sort of see where that comes up
in the language.
Okay, so I just want to put that out
there up front that, you know, I may play
fast and loose with the difference
between sets and bags, but it can
be important in practice.
Okay, so when lifting set operations, you
can define the union of two sets in the
standard way, and the union of two relations
is natural given that a relation is a set
of tuples.
And in relational algebra notation, I write
it like this, and I can also write it in
SQL with the UNION keyword.
And here's where sets versus bags will come
up: an unqualified union
does indeed remove duplicates, in which
case the union of
the relation with tuples a1, b1 and
a2, b1 and the relation with tuples a1, b1
and a3, b4 is these
three tuples.
The duplicate of a1, b1 didn't get passed
through.
To express this with bags, to make sure
we do include duplicates, you can say
UNION ALL, and that would include all four
tuples.
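The union example can be tried out with a quick sketch using Python's built-in sqlite3 module. The table names R1 and R2 and the column names a and b are assumptions, as is the way the four tuples are divided between the two relations (a1, b1 and a2, b1 in one, a1, b1 and a3, b4 in the other).

```python
import sqlite3

# Assumed example data: two small relations that share the tuple ('a1', 'b1').
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (a TEXT, b TEXT);
    CREATE TABLE R2 (a TEXT, b TEXT);
    INSERT INTO R1 VALUES ('a1', 'b1'), ('a2', 'b1');
    INSERT INTO R2 VALUES ('a1', 'b1'), ('a3', 'b4');
""")

# Unqualified UNION follows set semantics and removes the duplicate.
union_rows = sorted(
    conn.execute("SELECT a, b FROM R1 UNION SELECT a, b FROM R2").fetchall()
)

# UNION ALL follows bag semantics and keeps every tuple.
union_all_rows = sorted(
    conn.execute("SELECT a, b FROM R1 UNION ALL SELECT a, b FROM R2").fetchall()
)

conn.close()
```

The results are sorted in Python only so they are easy to compare, since SQL itself makes no promise about the order of rows without an ORDER BY.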
Okay.
You can define the difference operation
in the same way, in the
sense that you're lifting it from the
natural definition over
sets: find every tuple in
this set and remove tuples that also
appear in this set.
And we see a1, b1, as we saw
before, also appears in R1.
And so you get rid of it.
And all you're left with is this tuple.
Alright.
So why isn't this one in there?
Well, a3, b4
doesn't appear in R1, so we know it's not in
the result.
All we want is everything that's in R1,
removing things that also appear in R2.
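Here is a minimal sketch of that difference using Python sets of tuples. The contents of r1 and r2 are an assumption based on the tuples mentioned in the example, and the variable names are made up.

```python
# Assumed contents of the two relations, each a set of (a, b) tuples.
r1 = {("a1", "b1"), ("a2", "b1")}
r2 = {("a1", "b1"), ("a3", "b4")}

# Difference: everything in r1, minus the tuples that also appear in r2.
difference = r1 - r2
```

The tuple a3, b4 never shows up in the result because it was never in r1 to begin with, and a1, b1 is dropped because it appears in both.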
Okay.
Alright.
So what about intersection?
That's another set operation that we
could lift up.
You can indeed define intersection, but
you don't necessarily need to have it as
a fundamental operator because you can
express it in terms of difference.
Right, so if I want the intersection of R1
and R2, I can take everything in R1
that is not in R2.
And then I can take everything in R1 that is
not in that result.
So if you think about this for a second:
the inner expression returns everything that
is only in R1.
And then the expression overall removes
everything that is only in R1, leaving
things that are in both R1 and R2, and so
that's what intersection is.
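The same argument can be checked with a short Python sketch. The relation contents are assumed example values, and Python's built-in set intersection is used only to confirm the result.

```python
# Assumed contents of the two relations, each a set of (a, b) tuples.
r1 = {("a1", "b1"), ("a2", "b1")}
r2 = {("a1", "b1"), ("a3", "b4")}

# Inner step: tuples that are only in r1.
only_in_r1 = r1 - r2

# Outer step: remove those from r1, leaving tuples that are in both.
intersection_via_difference = r1 - only_in_r1

# Built-in intersection, for comparison.
builtin_intersection = r1 & r2
```

Both expressions give the single shared tuple, which is why intersection does not need to be a fundamental operator.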
Okay.
And we'll touch on this later, but you can
also express intersection in terms of
join, which is an operator we haven't
defined yet.
Okay, the selection operator is how we
take tuples that satisfy a certain
condition.
And so we write it with sigma
and we put c to express the condition.
This notation, honestly, we won't
necessarily use too much throughout this
course.
But I think it's important to be familiar
with it when it does come up.
I'm more interested in recognizing
the sort of English translation of
select, union, join and so on.
The Greek notation is
perhaps less important.
Okay.
So, if we find where the salary is
greater than 4,000 of an employee or
where the name is equal to Smith,
that's an instance of a selection
operator.
And, let's see, it says that
condition c can involve equals, you know,
less than, greater than or equal to, and so
on.
But it can be more than this, right.
It can be any sort of Boolean expression.
In fact, as we'll see in maybe a segment
or two, it can be sort of any arbitrary
function.
That returns a Boolean value.
So it doesn't necessarily have to just be
a less than b or a equals b, it could be
some complicated function.
Okay.
But, let's say, in between some complicated
function that's user defined
and a simple condition like this, where
you just say salary equals 4000 or,
sorry, salary greater than 4000,
you can have arbitrary Boolean
expressions.
So you could have conjunctions, you could
say where salary's greater than 4000 and
name equals Smith.
And of course you can say OR and you can
say NOT.
All of these are legal.
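A rough sketch of selection in Python might look like the following. The select helper, the employee rows, and the column names id, name, and salary are all hypothetical, made up here for illustration, and the threshold of 40,000 is just an example value.

```python
def select(rows, condition):
    """Return the rows for which the Boolean condition holds."""
    return [row for row in rows if condition(row)]

# Hypothetical employee relation with three columns.
employee = [
    {"id": 1, "name": "John", "salary": 20000},
    {"id": 2, "name": "Smith", "salary": 60000},
    {"id": 3, "name": "Fred", "salary": 50000},
]

# A simple comparison.
high_paid = select(employee, lambda r: r["salary"] > 40000)

# A conjunction (AND) of two conditions.
high_paid_smith = select(
    employee, lambda r: r["salary"] > 40000 and r["name"] == "Smith"
)

# OR and NOT work the same way.
low_or_not_smith = select(
    employee, lambda r: r["salary"] < 30000 or not r["name"] == "Smith"
)

# Any function that returns a Boolean can serve as the condition.
def has_short_name(row):
    return len(row["name"]) <= 4

short_names = select(employee, has_short_name)
```

In every case a table goes in and a table with the same columns comes out. Only the set of rows changes.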
Okay.
So as an example, if we want to have a
selection where salary is greater than
4,000 of employee,
which ones pass
this test?
Well, I've been saying 4,000 this whole
time and that's 40,000.
Excuse me.
These numbers looked [LAUGH] for a
moment.
So John has a salary less than 40,000, so
he can be removed from the set, and so the
result of this expression is this table.
Right, we have tables in and tables out.
The result of this expression is this
table, same three columns and only two
tuples in it.
Okay, I guess sometimes I gesture here
I'm not sure you can see it when I, I
don't know, maybe you can.
