on Big Data Analytics. My name is George Djorgovski; I'm a Professor of Astronomy and Director of the Center for Data-Driven Discovery at Caltech, and let me introduce the summer school for you.

So first, what is it about? It's about applications of advanced computational and statistical tools for data analysis. Whether the data are big or not doesn't really matter, but it is becoming increasingly important for scientists, engineers, or indeed anybody dealing with vast amounts of data to master such tools. What we can cover in this school is only a tiny subset of all the skills that a modern data scientist should have. We started with a subset that some of our lecturers could provide, and in the future we hope to grow it and add additional material. What this means is that these lectures are really just a start of learning on any given subject: they tell you roughly what it is about and show you how to use it, but to get the full benefit you really need to explore further, and we'll provide links and other resources for you to do so. I should point out that this is not about computer science per se, even though we use tools derived from computer science, and it's certainly not about high-performance computing as such. It's really about analyzing data and extracting knowledge from data.

So let me put this in a somewhat broader context of what's happening. I think everybody now knows that everything is being completely transformed by information and computation technology, and certainly science is no exception. That brings interesting new possibilities, but also new challenges, and many of those are universal and common to all the different fields: how to actually deal with vast amounts of data, how to store it, how to access it, how to check it, and so on. But more importantly, how to explore it, how to discover knowledge in it? The tools and methods developed for this really form new parts of the scientific methodology, adding to the tools we have been developing over the centuries. It is also an excellent synergy between the domain sciences, like astronomy, physics, biology, or the geological sciences, and information and computation technology: the domain sciences use tools that come from computer science or statistics, and our challenges push them to be improved further and new and better tools to be developed, which may then find other applications.

There is a concept of the Fourth Paradigm, introduced by the great computer scientist Jim Gray. If you think about how we learn about the world, science started experimentally, with the likes of Galileo, and was very quickly followed by the analytical approach of Newton and others. We still use experimental and analytical methods and always will. But then in the mid 20th century something new came about: computers. We can call that the third, computational paradigm, where we use machines to simulate what physical processes do in nature, not because we are too lazy to write formulas, but because there is no other way in which we can do it. And then over the last 20 years or so, we saw a different kind of computing arise: computing that is not about number crunching in the traditional sense, but about accessing and understanding data. It is a different kind of computing, with different demands, optimized in a different way, and that is in fact what most scientists do.
So some of us still run large numerical simulations, but their output is also a huge data set, and to understand your theoretical output you also need to know how to do data-driven computing.

My own field, astronomy, is a good example of this. Astronomy has been completely transformed by modern digital and computational technology. Most of our data come in the form of large digital sky surveys, which are now typically tens or hundreds of terabytes each, if not a petabyte, and we are already talking about exabytes and beyond. We have at least ten, probably a few tens, of petabytes stored in good-quality archives as of circa mid-2014, and we generate many tens of terabytes per day. The interesting point is that the data volume doubles roughly every year and a half, following Moore's law, and for the same reason: the technology that follows Moore's law is what produces the data. That is rather stunning if you think about it; it means that in the next year and a half, from whenever you start counting, we will generate as much data as in all of past history.

Now, it's not really about the data; it's about discovering things. To put these large numbers in context: the human genome itself can be coded in less than one gigabyte, and a terabyte is about two million books in pure text. Humans can maybe process information at roughly one terabyte per year, more if it's images, less if it's text. You can see that this implies we are now getting into a regime where humans simply cannot even follow the data, let alone do something with it, in any simple fashion.

And the progress continues. Astronomers are now building large new survey facilities, like the Large Synoptic Survey Telescope, which will generate about 30 terabytes per night; roughly speaking, it will do one Sloan Digital Sky Survey every week. And then there is an even more ambitious machine, the Square Kilometre Array, where the raw data generation of the instrument will be about an exabyte per second, which will be reduced to maybe a few petabytes per year.

So there are some general trends that all the sciences are following which are worth looking at. The first and most obvious is the exponential growth of data volumes, which is why people talk about big data. But much more interesting is the growth of data complexity: the informational content of the data has also been increasing, and that is where things get really interesting. We have moved from data poverty and starvation to exponential over-abundance, and we are also moving from fixed data sets, which you obtain once and that's it, to constantly arriving data streams from different kinds of sensors, whether telescopes or environmental sensors, it doesn't matter. That means we have to understand the data in real time, and that poses a whole new set of challenges for data analysis. The data are also very geographically distributed; even within a given discipline there will be many different data centers and many contributing groups or labs, and intrinsically it has to be that way. So we have to have ways by which they are connected very efficiently, in order to put the data together for further scientific analysis.

Since there is so much data now, the value of merely having data is not very big. In the past, data were the currency of the realm: whoever had access to the data could do the science. Now the data are over-abundant and most of them are free, and the real value is in having the expertise to extract knowledge from the data. That is what this school is all about.
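To spell out the arithmetic behind that doubling claim (a small worked example, not part of the lecture): if the cumulative data volume V(t) doubles every period \tau of about 1.5 years, then

    V(t) = V_0 \, 2^{t/\tau}
    V(t+\tau) - V(t) = 2V(t) - V(t) = V(t)

so the amount of data generated during the next doubling period equals everything accumulated over all of past history up to that moment.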
A simple way to represent the modern scientific process is this: you gather the data, whether from gene-sequencing machines or satellites or telescopes, it doesn't matter, and then you do what I call data farming: the data have to be organized, indexed, and made easily accessible and findable, with ways in which they can be combined. We know how to do this very well, but it still takes some skill. Then comes the interesting part, discovering regularities in the data themselves. That is what science is really about, finding patterns in nature and trying to understand them. Those could be correlations, clusters, outliers, anomalies, and so on, and a lot of interesting tools from statistics and machine learning can be used for that purpose (a small illustrative sketch appears at the end of this section). At the end, of course, comes the real role of the scientist, which is understanding what it all means, and therefore the new knowledge. This is grossly oversimplified, of course, and there are feedback loops at every step along the way, but I think you get the idea: data science is not about the data, it's about what you find in the data.

So it is interesting to think about how this differs from what we had in the past. The first and obvious difference is that, for the first time in history, we will never be able to see all of our data. That means we need reliable technology that will, in effect, look after the data without human intervention and enable us to find the pieces we want. Perhaps even more interesting is that the informational content of the data is so high that there are meaningful constructs in the data that humans cannot easily comprehend unaided, and we are moving ever more towards collaborative human-computer discovery. All of this, I think, adds to our toolkit of scientific methodology in the 21st century. And it is not only about science: every field of human endeavour, medicine, security, finance, everything, depends on good handling and understanding of large data sets.

So with that I'll end, and my colleagues from JPL will also provide their take on what we are doing here and why.
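As a concrete illustration of the pattern-discovery step mentioned above, finding clusters and flagging outliers, here is a minimal sketch. It is not part of the lecture; it assumes NumPy and scikit-learn are available, and it uses entirely synthetic data as a stand-in for a real catalog.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for a real catalog: two "populations" plus a few anomalies.
data = np.vstack([
    rng.normal([0, 0], 0.5, size=(500, 2)),
    rng.normal([4, 4], 0.5, size=(500, 2)),
    rng.uniform(-2, 8, size=(10, 2)),   # a handful of outliers
])

X = StandardScaler().fit_transform(data)          # put features on a common scale
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
outlier_flags = IsolationForest(random_state=0).fit_predict(X)   # -1 marks outliers

print("cluster sizes:", np.bincount(clusters))
print("flagged outliers:", int((outlier_flags == -1).sum()))

The same pattern carries over to survey catalogs or any tabular scientific data: scale the features, look for structure, and flag the points that do not fit it, then let the scientist interpret what those structures and anomalies mean.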