University of Houston
Department of Mechanical Engineering
Guide to Engineering Data Science
Fall 2017
Session #1
Instructors:
October 1, 2017
Course Description:
1. Overview on Data and Data Sources
o Data types
o Data sources
o Data acquisition
2. Data Preparation
Data Processing: data cleansing and separation
o Filtering
o Imputation
o Dimensionality reduction
o Normalization and Transformation
o Feature extraction
Simulation
o Discrete event simulator
o Markov models
o Agent-based simulation
o Monte Carlo simulation
o Systems dynamics
o Activity-based simulation
o ODEs and PDEs
o Fuzzy logic
Artificial Intelligence
From personal assistants like Siri, Alexa, and Cortana
to movie suggestions on Netflix, artificial
intelligence (AI) is rapidly becoming ubiquitous in
everyday life. As this technology continues to
advance in capability and prevalence, we seek to
explore AI and several closely related subtopics:
machine learning, deep learning, and neural
networks.
The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. This premise is very broad and difficult to fit
into a single framework. Hence, different learning paradigms have arisen to deal
with different situations and different assumptions.
Supervised Learning: It is the most studied and most utilized type of
learning.
The variation in learning within this category depends on the
nature of the data set.
Reinforcement Learning: When the data does not explicitly contain the
correct output for each input, we are no longer in a supervised learning setting.
Ex. Supervised Learning: input, correct output.
Reinforcement Learning: input, some output, grade for this
output.
Unsupervised Learning: The training data does not contain any output
information at all. It is the task of spontaneously finding patterns and
structure in input data.
It means being able to place items with similar properties together into
one category.
Other Views/Types of Learning
Statistical Learning: It has the same premise as learning from data; that
is, the use of a set of observations to uncover an underlying process.
It is a mathematical field where the process is a probability distribution
and the observations are samples from that distribution.
Statistics focuses on somewhat idealized models and analyzes them
in great detail.
Data Mining: A field that is quite practical and that focuses on finding
patterns, correlations, or anomalies in large relational databases.
Other Views/Types of Learning
Statistical Learning
o Statistical learning arose as a subfield of Statistics.
o Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA): Statistical Learning
We can describe the observations in the table by a linear statistical model as
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n$$
$Y_{ij}$ is a random variable denoting the (ij)th observation
$\mu$ is the overall mean common to all treatments
$\tau_i$ is a parameter associated with the ith treatment
$\epsilon_{ij}$ is a random error component
Analysis of Variance (ANOVA)
Statistical Learning
ANOVA F-test
$$F_0 = \frac{SS_{Treatments}/(a-1)}{SS_E/[a(n-1)]} = \frac{MS_{Treatments}}{MS_E}$$
$$H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$$
$$H_1: \tau_i \neq 0 \text{ for at least one } i$$
Significance:
If the null hypothesis is true, each observation consists of the
overall mean $\mu$ plus a realization of the random error component $\epsilon_{ij}$.
This is equivalent to saying that all N observations are taken from a
normal distribution with mean $\mu$ and variance $\sigma^2$.
Net: if the null hypothesis is true, changing the levels of the factor has no
effect on the mean response!
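The F-test can be sketched in a few lines of Python. This is a minimal illustration only, assuming a balanced design (a treatments, n observations each). The function name one_way_anova_f and the sample data are hypothetical and do not come from the course material.

```python
def one_way_anova_f(groups):
    """Return (F0, df_treatments, df_error) for a balanced one-way layout.

    groups: list of a lists, each holding the n observations of one treatment.
    """
    a = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes a balanced design")

    grand_mean = sum(sum(g) for g in groups) / (a * n)
    group_means = [sum(g) / n for g in groups]

    # Between-treatment and within-treatment (error) sums of squares
    ss_treatments = n * sum((m - grand_mean) ** 2 for m in group_means)
    ss_error = sum((y - m) ** 2
                   for g, m in zip(groups, group_means)
                   for y in g)

    df_treatments = a - 1
    df_error = a * (n - 1)

    ms_treatments = ss_treatments / df_treatments
    ms_error = ss_error / df_error
    return ms_treatments / ms_error, df_treatments, df_error


# Hypothetical data: three treatments, three observations each
data = [[1.0, 2.0, 3.0],
        [2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0]]
f0, df_t, df_e = one_way_anova_f(data)
print(f0, df_t, df_e)
```

A large F0 relative to the F distribution with (a - 1, a(n - 1)) degrees of freedom is evidence against the null hypothesis. Looking up the critical value or p-value is left out of this sketch.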
Other Views/Types of Learning
Data Mining
Definition: In simple words, data mining is the process of extracting usable
data from a larger set of raw data. It implies analyzing
data patterns in large batches of data using one or more software tools.
In a sense, we are giving computers the ability to learn from data without
explicitly writing computer code for this task. These algorithms
can therefore turn data into knowledge from which engineers and/or
analysts can make informed decisions.
These are important examples of how Machine Learning and A.I. affect, and
will continue to affect, technologies and applications.
Objectives
o On the basis of the training data we would like to:
Philosophy
Mathematical Expressions
Overview on Data and Data Sources
2.0 Classification for Predicting Class Labels
Class labels are discrete, unordered values that can be understood
as the group memberships of the instances.
-- e.g., e-mail spam detection is a binary classification task
Linear Regression:
Given a predictor variable x and a response variable y, we fit a straight
line to the data that minimizes the distance, called the average squared
distance, between the sample points and the fitted line.
[Figure: scatter plot of sample points with a fitted straight line; axes x and y]
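As a rough sketch of fitting a straight line by minimizing the average squared distance, the closed-form least-squares solution for one predictor can be written in plain Python. The function names fit_line and mean_squared_error and the sample points are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared vertical distance between the points and the line."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Hypothetical sample points scattered around y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

intercept, slope = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, intercept, slope)
print(intercept, slope, mse)
```

Any other line through the same points gives a larger mean squared error, which is what "minimizes the average squared distance" means here.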
Overview on Data and Data Sources
4.0 Reinforcement Learning (R.L.)
[Figure: reinforcement learning loop with labels Agent, Action, Environment, State, Reward]
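The agent, action, environment, state, and reward loop can be illustrated with a toy example. This is a sketch under simplifying assumptions, not a full reinforcement learning algorithm: the classes TwoActionEnvironment and Agent, the reward values, and the "try each action once, then exploit" rule are all hypothetical.

```python
class TwoActionEnvironment:
    """Hypothetical environment: action 1 pays a higher reward than action 0."""

    def __init__(self):
        self.state = 0  # here the state is simply the number of steps taken

    def step(self, action):
        reward = 1.0 if action == 1 else 0.2
        self.state += 1
        return self.state, reward


class Agent:
    """Tries every action once, then repeats the best one seen so far."""

    def __init__(self, n_actions):
        self.totals = [0.0] * n_actions
        self.counts = [0] * n_actions

    def act(self):
        for action, count in enumerate(self.counts):
            if count == 0:
                return action  # explore an untried action
        averages = [t / c for t, c in zip(self.totals, self.counts)]
        return averages.index(max(averages))  # exploit the best average

    def learn(self, action, reward):
        # The reward is a grade for the chosen output, not the correct output
        self.totals[action] += reward
        self.counts[action] += 1


env = TwoActionEnvironment()
agent = Agent(n_actions=2)
history = []
for _ in range(10):
    action = agent.act()
    state, reward = env.step(action)
    agent.learn(action, reward)
    history.append((state, action, reward))

print(agent.counts)
```

The agent is never told which action is correct. It only receives a grade (the reward) for the action it chose, which is the distinction from supervised learning.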
Overview on Data and Data Sources
5.0 Unsupervised Learning (U.L.)
o In this third type of learning, we are dealing with
unlabeled data or data of unknown structure.
o Using unsupervised learning techniques (e.g., clustering), we are able to
explore the structure of the data to extract meaningful information
without the guidance of a known outcome variable or reward
function.
o Goal: discovering hidden structures with unsupervised learning
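As one illustration of clustering, the sketch below runs a basic k-means procedure on one-dimensional data in plain Python. The function name k_means_1d, the data points, and the starting centers are hypothetical. k-means is only one of many clustering techniques and is used here as an example.

```python
def k_means_1d(points, centers, iterations=10):
    """Basic k-means on 1-D data: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabeled measurements forming two visible groups
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(points, centers=[0.0, 6.0])
print(centers)
print(clusters)
```

No labels are supplied at any point. The two groups are recovered only from the distances between the data points.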
[Figure: training examples labeled f = -1 and f = +1]
The answer is not so obvious, since more than one function fits
the six training examples, with some at +1 and others at -1.
When we have a training set such as the prior examples, we know the value
of f for the first two rows, but this doesn't mean we have learned
f.
That is, we still don't know the function.
The quality of the learning is determined by how close our
predictions are to the true values on the training set, which in turn gives
confidence in future prediction accuracy.
1. Overview on Data and Data Sources
o Use of the basic learning setup
$$g_\theta(x_i) \approx f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$
[Figure labels: H; Final Hypothesis]
2. Is Learning Feasible? Error and Noise
2. Is Learning Feasible? Error and Noise
The same learning task in different contexts may warrant the use of
different error measures.
This cost depends on what h is used for, and cannot be dictated just
by learning techniques.
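To make the point about context-dependent error measures concrete, the sketch below scores the same predictions under two different cost settings. The function name weighted_error, the labels, and the cost values are hypothetical and chosen only to show how the measured error changes with the application.

```python
def weighted_error(predictions, targets, cost_false_accept, cost_false_reject):
    """Average cost of the hypothesis h on labeled examples.

    A false accept is h = +1 when f = -1.
    A false reject is h = -1 when f = +1.
    """
    total = 0.0
    for h, f in zip(predictions, targets):
        if h == +1 and f == -1:
            total += cost_false_accept
        elif h == -1 and f == +1:
            total += cost_false_reject
    return total / len(targets)


# Hypothetical true labels f and predictions h
targets = [+1, +1, -1, -1, +1]
predictions = [+1, -1, +1, -1, +1]

# Context A: rejecting a valid case is the expensive mistake
error_a = weighted_error(predictions, targets,
                         cost_false_accept=1.0, cost_false_reject=10.0)

# Context B: accepting an invalid case is the expensive mistake
error_b = weighted_error(predictions, targets,
                         cost_false_accept=1000.0, cost_false_reject=1.0)

print(error_a, error_b)
```

The hypothesis and the data are identical in both cases. Only the costs assigned by the user of h differ, and the measured error changes accordingly.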
2. Is Learning Feasible? Error and Noise
In practice, data are often not generated by a deterministic target function.
Instead, they are generated in a noisy way such that the output is not
uniquely determined by the input.
e.g., two people with identical salaries, outstanding loans, etc.
may end up with different credit behavior. Hence, the credit
function is not really a deterministic function, but a noisy one.
A noisy target can be viewed as a deterministic one plus added noise.
The target function
$$y = f(x)$$
is substituted by the target distribution
$$P(y \mid x).$$
A data point (x, y) is now generated by the joint distribution
$$P(x, y) = P(x)\,P(y \mid x).$$
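A noisy target can be simulated as a deterministic function plus random noise. The sketch below is illustrative only: the linear form of target_function, the Gaussian noise model, and the noise level are assumptions, not part of the course material.

```python
import random


def target_function(x):
    """Hypothetical deterministic part f(x) of the target."""
    return 2.0 * x + 1.0


def sample_noisy_output(x, rng, noise_std=0.5):
    """Draw y from P(y | x), modeled here as f(x) plus Gaussian noise."""
    return target_function(x) + rng.gauss(0.0, noise_std)


rng = random.Random(0)  # fixed seed so the run is repeatable

x = 3.0
y_first = sample_noisy_output(x, rng)
y_second = sample_noisy_output(x, rng)

# The same input produces different outputs
print(y_first, y_second)

# Averaging many draws recovers the deterministic part f(x)
samples = [sample_noisy_output(x, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean, target_function(x))
```

This mirrors the credit example: two identical inputs can lead to different outputs, yet the underlying deterministic component is still visible on average.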
Questions?
Session #1
Instructors:
Drs. Rafik Borji, Egidio (Ed) Marotta
Downloading Programming Languages
Python
RStudio
Downloading Python from Anaconda Site
https://www.continuum.io/downloads
Recommended
Downloading Python from Anaconda Site
[Figure: IDE screenshot with panes labeled Editor, File Explorer, Variable Output, Console, and MatLab View]
Python History
https://docs.python.org/3/tutorial/introduction.html
Introduction to R and R Studio
Lecture 1
Contents
Getting started with R
Installing R
Installing RStudio
Using RStudio
Getting Started with R
History
R is a dialect of the S language
1976: S started as a statistical analysis environment implemented as
FORTRAN libraries (John Chambers)
1988: S rewritten in the C language
1998: Version 4 of the S language released; this is the version used today
1991: R language created in New Zealand
1993: First announcement of R to the public
1995: R released under the GNU General Public License, making R free
software
1996: R-help and R-devel mailing lists created
1997: R Core Group formed to control the source code for R
2000: R 1.0.0 released
2013: R 3.0.2 released
Present: R 3.4.1
Getting Started with R
Features
Syntax similar to S
Very active development
Runs on almost any standard computing platform
Lean (functionality divided into modular packages)
Sophisticated graphics capabilities compared to other statistical packages
Powerful for developing new tools
Very active user community
It is free software (4 freedoms):
1. Run the program
2. Study how the program works
3. Redistribute copies
4. Improve the program and release the improvements to the public
www.fsf.org
Getting Started with R
R system
The R system is divided into two parts:
1. The base R system: downloaded from CRAN
2. Everything else
R functionality is divided into many packages:
1. The base package, contained in the base R system and required to run R
2. The other packages included in the base R system (stats, datasets, graphics,
etc.)
3. Recommended packages (class, cluster, lattice, etc.)
4. Other packages from CRAN (10,000+)
5. Packages available on personal websites
R resources (www.cran.r-project.org)
1. R installation
2. An introduction to R
3. Writing R extensions
4. R data imports/exports
5. R internals
Installing R
Downloading and installing R
https://cran.r-project.org/bin/windows/base/
CRAN: Comprehensive R Archive Network
Download the available version of R
Run the installation file
Installing R
R environment
install.packages()
library()
search()
help.start()
help(NameOfFunction) or ?NameOfFunction
Variable assignment
Calling functions (c() for concatenation)
Comments (#)
print()
ls()
rm()
Installing R Studio
R Studio
Most popular R code editor.
www.rstudio.com
https://www.rstudio.com/products/rstudio/download/
Control Panel\User Accounts\User Accounts to change/add environment
variables
Installing R Studio
R Studio Console
Installing R Studio
Hype Cycle for Data Science
Installing R Studio
Hype Cycle