Hello and welcome to the JPL-Caltech Virtual Summer School on Big Data Analytics. My name is George Djorgovski; I'm a Professor of Astronomy and Director of the Center for Data-Driven Discovery at Caltech. Let me introduce the summer school to you.
So first, what is it about? It's about applications of advanced computational and statistical tools for data analysis. Whether the data are big or not doesn't really matter, but it's becoming increasingly important for scientists, engineers, and indeed anybody dealing with vast amounts of data to master such tools.
What we can cover in this school is only a tiny subset of all the skills that a modern data scientist should have. We started with a subset that our lecturers could provide, but in the future we hope to grow it and add additional material. What this means is that these lectures are really just a start of learning on any given subject: they tell you roughly what it is about and show you how to use it. But to get the full benefit you really need to explore further, and we'll provide links and other resources for you to do so.
And I should point out that this is not about computer science per se, even though we use tools derived from computer science, and it's certainly not about high-performance computing as such. It's really about analyzing data and extracting knowledge from data.
So let me put this in a somewhat broader context of what's happening. I think everybody now knows that everything is being completely transformed by information and computation technology, and science is certainly no exception.
Now that brings interesting new possibilities, but also new challenges, and many of those are universal, common to all the different fields: how do we actually deal with vast amounts of data? How do we store it, how do we access it, how do we check it, and so on? But more importantly, how do we explore it, how do we discover knowledge in it? The tools and methods developed for this really form new parts of the scientific methodology, adding to the tools we've been developing over centuries.
It is also really an excellent synergy between the domain sciences, like astronomy, physics, biology, or the geological sciences, and information and computation technology. The domain sciences use tools that come from computer science or statistics, but the challenges they bring help improve those tools further and drive the development of new and better ones, which may then find other applications.
There is a concept of the Fourth Paradigm, introduced by the great computer scientist Jim Gray. If you think about how we learn about the world and how we understand it, science started experimentally, with the likes of Galileo and so on. That was very quickly followed by the analytical approach, say by Newton and others. We still use experimental and analytical methods, and we always will. But then, in the mid 20th century, something new came about: computers. We can call that the third, computational, paradigm.
We use machines to simulate what physical processes do in nature, not because we are too lazy to write formulas, but because there is no other way in which we can do it. Then, over the last 20 years or so, we saw a different kind of computing arise: computing that is not about number crunching in the traditional sense, but really about accessing and understanding data. It's a different kind of computing, with different demands, optimized in a different way, and that's in fact what most scientists do. Some of us still run large numerical simulations, but their output is also a huge data set, and to understand your theoretical output you also need to understand how to do data-driven computing.
So my own field, astronomy, is a good example of this. Astronomy has been completely transformed by modern digital and computational technology. Most of our data come in the form of large digital sky surveys, which are now typically tens or hundreds of terabytes each, if not more; some now reach a petabyte, and we're already talking about exabytes and beyond. We have at least ten, probably a few tens, of petabytes stored in good-quality archives as of circa mid-2014, and we generate tens of terabytes per day, many tens of terabytes per day.
The interesting point is that the data volume doubles every year and a half, following Moore's law, and for the exact same reason: the technology that gives us the data follows Moore's law. That's rather stunning if you think about it, because it means that in the next year and a half, from whenever you start counting, we will generate as much data as in all of past history.
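
As an aside for the reader of this transcript, here is a minimal sketch of that arithmetic in Python. The 1.5-year doubling time is the figure quoted above; the starting volume is an arbitrary placeholder.

    # Cumulative data volume grows as D(t) = D0 * 2**(t / tau), with tau = 1.5 years.
    # The volume generated during the next doubling time then equals everything
    # accumulated so far: D(t + tau) - D(t) = D(t).
    D0, tau = 1.0, 1.5            # arbitrary starting volume, doubling time in years
    def total(t):                 # cumulative volume at time t, in units of D0
        return D0 * 2 ** (t / tau)
    t = 10 * tau                  # any reference point gives the same conclusion
    print(total(t + tau) - total(t))   # data generated in the next 1.5 years
    print(total(t))                    # all data accumulated so far -- the same number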
Now it's not about the data; it's about discovering things in it. To put these large numbers in context: the human genome itself can be coded in less than one gigabyte, and a terabyte is about two million books in pure text. Humans can maybe process information at roughly one terabyte per year, more if it's images, less if it's text. You can see that this implies we're now getting into a regime where humans simply cannot even follow the data, let alone do something with it, in a simple fashion.
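
A quick back-of-the-envelope check of those figures, again just a sketch: the bytes-per-book value is a round number assumed here, and the one-terabyte-per-year processing rate is the one quoted above.

    TB = 10 ** 12                   # one terabyte, in bytes
    bytes_per_book = 500 * 10 ** 3  # assume roughly 500 KB of plain text per book
    print(TB // bytes_per_book)     # about two million books per terabyte, as quoted
    PB = 10 ** 15                   # one petabyte
    print(PB // TB)                 # ~1000 years for one person to process a petabyte at 1 TB/year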
And the progress continues. Astronomers are now building large new survey facilities, like the Large Synoptic Survey Telescope, which will generate about 30 terabytes per night; roughly speaking, it will do one Sloan Digital Sky Survey every week. An even more ambitious machine is being built, the Square Kilometre Array, where the raw data generation of the instrument will be about an exabyte per second, to be reduced to maybe a few petabytes per year.
So there are some general trends that all sciences are following, and they're worth looking at. The first and most obvious is the exponential growth of data volumes, which is why people talk about big data. But much more interesting is the growth of data complexity: the informational content of the data has also been increasing, and that's where things get really interesting. We've moved from data poverty and starvation to exponential over-abundance, and we're also moving from fixed datasets, which you obtain once and that's it, to constantly arriving new data streams from different kinds of sensors, whether telescopes or environmental sensors; it doesn't matter. It means that we have to understand the data in real time, and that poses a whole new set of challenges for data analysis.
The data are also very distributed geographically: even within a given discipline there will be many different data centers and many contributing groups or labs, and intrinsically it has to be that way. So we have to have ways by which they are connected very efficiently, in order to put the data together for further scientific analysis.
Since there is so much data now, the value of simply having data is not very big. In the past, data were the currency of the realm: whoever had access to the data could do science. Now the data are over-abundant and most of it is free, and the real value is in having the expertise to extract knowledge from the data. That is what this school is all about.
So a simple way to represent the modern scientific process is this: you gather the data, whether it's from gene-sequencing machines or satellites or telescopes, it doesn't matter, and then you do what I call data farming: the data have to be organized, indexed, and made easily accessible and findable, with ways in which they can be combined. We know how to do this very well, but it still takes some skills. Then comes the interesting part: discovering regularities in the data themselves. That's what science is really about, finding patterns in nature and trying to understand them. Those could be correlations, or clusters, or outliers and anomalies, and so on, and a lot of interesting tools from statistics and machine learning can be used for that purpose.
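
To make that concrete, here is a minimal sketch of the kind of pattern finding described above. It is not from the lecture: it assumes the scikit-learn library, the clustering and anomaly-detection methods are arbitrary choices, and the synthetic points stand in for whatever measurements a real survey would provide.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    # Two synthetic "populations" plus a handful of scattered outliers.
    data = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
        rng.normal(loc=5.0, scale=1.0, size=(200, 2)),
        rng.uniform(low=-10.0, high=15.0, size=(10, 2)),
    ])

    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)  # find groupings
    anomalies = IsolationForest(random_state=0).fit_predict(data)                 # -1 marks outliers
    print(np.bincount(clusters), int(np.sum(anomalies == -1)))

Whether clustering, correlation searches, or anomaly detection is the right tool depends entirely on the question being asked of the data, which is the point the lecture makes next.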
At the end, of course, comes the real role of the scientist, which is understanding what it all means, and therefore producing the new knowledge. This is grossly oversimplified, of course, and there are feedback loops at every step along the way, but I think you get the idea. Data science is not about the data; it's about what you find in the data. And so it's interesting to think: how is this different from what we had in the past?
The first and obvious difference is that, for the first time in history, we'll never be able to see all of our data. That means we need reliable technology that will look after the data without human intervention and enable us to find the pieces we want. But perhaps even more interesting is that the informational content of the data is so high that there are meaningful constructs in the data that humans cannot easily comprehend unaided, and we are moving ever more towards collaborative human-computer discovery. All of this, I think, adds to our toolkit of scientific methodology in the 21st century.
And it's not all about science, because every field of human endeavour, medicine, security, finance, everything, depends on good handling and understanding of large data sets. So with that I'll end, and my colleagues from JPL will also provide their take on what we are doing here and why.
