Guide To Eng. Data Science Lecture (1) Ver2

Course Description Outline
University of Houston
Department of Mechanical Engineering
Guide to Engineering Data Science
Fall 2017
Session #1
Instructors:
Drs. Rafik Borji and Egidio (Ed) Marotta
October 1, 2017
Course Description:
1. Overview on Data and Data Sources
o Data types
o Data sources
o Data acquisition
o Date Handling
o Theoretical
o Empirical
2. Data Preparation
Data Processing: data cleansing and separation
o Filtering
o Imputation
o Dimensionality reduction
o Normalization and Transformation
o Feature extraction
Data aggregation: collect and summarize data
o Basic statistics
o Distribution fitting
o Baseball card aggregation
Data enrichment: add new information to data
o Annotation
o Relational algebra
o Feature addition
3. Advanced Data Analytics
Clustering
o Hierarchical clustering
o X-means, Canopy, a priori clustering
o Topic modeling for text data
o Fractal and DBSCAN
o Gaussian mixture models
o K-means clustering
Regression for Variables Assessment
o Tree-based methods
o Generalized linear models
o Regression with shrinkage
o Stepwise regression
Hypothesis Testing
o T-test for groups comparison
o ANOVA
Classification
o Bayesian Network
o Neural nets
o Random Forests
o Deep Learning
o Decision Trees
o K-nearest Neighbors
o Naïve Bayes
o Hidden Markov Model
o Support Vector Machines
Regression for Value Prediction
o Tree-based methods
o Generalized linear models
o K-nearest neighbors
Recommendation
o Collaborative filtering (SVD, PCA)
o Content-based methods
o Graph-based methods
3. Advanced Data Analytics (continued)
Logical reasoning
o Expert systems and Logical reasoning
Optimization
o Stochastic research
o Genetic algorithm, simulated annealing and gradient search
o Linear, integer and non-linear programming
o Active learning
o Ensemble learning
Simulation
o Discrete event simulator
o Markov models
o Agent-based simulation
o Monte Carlo simulation
o Systems dynamics
o Activity-based simulation
o ODEs and PDEs
o Fuzzy logic
4. Data Models Operationalization
o Dashboard Development
o Introduction to Internet of Things
o Introduction to Digital Twins
o Introduction to Adaptive Reduced Order Modeling
Artificial Intelligence
From personal assistants like Siri, Alexa, and Cortana to movie suggestions on Netflix, artificial
intelligence (AI) is rapidly becoming ubiquitous in everyday life. As this technology continues to
advance in capability and prevalence, we seek to explore AI and several closely related subtopics:
machine learning, deep learning, and neural networks.
What are the Differences between Artificial Intelligence, Machine Learning, and Deep Learning?
Artificial Intelligence
What are the Differences between Artificial
Intelligence, Machine Learning, and Deep Learning?
While artificial intelligence (AI), machine learning (ML), and deep learning (DL) are often used
interchangeably, there are several key differences. One way to visualize the relationship is
through a series of concentric circles.
At a basic level, artificial intelligence is the concept of machines accomplishing tasks which have
historically required human intelligence. [1]
Applied AI: Machines designed to complete very specific tasks like navigating a vehicle, trading
stocks, or playing chess.
General AI: Machines designed to complete any task which would normally require human
intervention. The broad nature of General AI requires machines to learn as they encounter new
tasks or situations.
1. https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#35aa37802742
Artificial Intelligence - Layers
Cutting edge research of AI is focusing on machine learning (ML).
In simple terms, ML is the process of building machines which can access data, apply algorithms
to this data, and then train themselves to deduce valuable insights based on these underlying
datasets.
The key difference between ML and AI is that ML does not rely explicitly on the code of its
creator. Rather, ML systems use computer code as a starting point and then gather data,
information, and inputs which can be studied.
Deep learning takes artificial intelligence a step further, by mimicking how the human brain works
through the use of artificial neural networks. In an artificial neural network, each neuron is
charged with providing a binary (yes/no) response to basic questions about a piece of data.
2. http://www.explainthatstuff.com/introduction-to-neural-networks.html
3. https://www.globalxfunds.com/artificial-intelligence-explained/?utm_source=Dianomi&utm_medium=CPC&utm_campaign=AI+Explained
Types of Learning
The basic premise of learning from data is the use of a set of observations to uncover an
underlying process. It's very broad, and difficult to fit in a single framework. Hence, different
learning paradigms have arisen to deal with different situations and different assumptions.
Supervised Learning: It's the most studied and most utilized type of learning. The variation in
learning within this category depends on the nature of the data set.
Reinforcement Learning: When the data does not explicitly contain the correct output for each
input, we are no longer in a supervised learning setting.
Ex. Supervised Learning: input, correct output.
Reinforcement Learning: input, some output, grade for this output.
Unsupervised Learning: The training data does not contain any output information at all. It's the
task of spontaneously finding patterns and structure in input data. It's being able to place items
with similar properties together into one category.
Other Views/Types of Learning
The main field dedicated to the subject of learning is called Machine Learning. However, there are
two more similar approaches to learning from data that have their own ways.
Statistical Learning: It has the same premise as learning from data; that is, the use of a set of
observations to uncover an underlying process. It's a mathematical field where the process is a
probability distribution and the observations are samples from that distribution. Statistics
focuses on somewhat idealized models and analyzes them in great detail.
Data Mining: A field that is quite practical and that focuses on finding patterns, correlations, or
anomalies in large relational databases.
Statistical Learning
o Statistical learning arose as a subfield of Statistics.
o There is much overlap: both fields focus on supervised and unsupervised problems.
Machine learning has a greater emphasis on large-scale applications and prediction accuracy.
Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
But the distinction has become more and more blurred, and there is a great deal of
cross-fertilization.
Machine learning has the upper hand in Marketing!
o Analysis of Variance (ANOVA)
The Purpose of Analysis of Variance

In general, the purpose of analysis of variance (ANOVA) is to test for
significant differences between means.
If we are only comparing two means, then ANOVA will give the same
results as the t test for independent samples (if we are comparing two
different groups of cases or observations), or the t test for dependent
samples (if we are comparing two variables in one set of cases or
observations).
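The two-group equivalence can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming the pooled-variance (equal variance) form of the t test for independent samples; the function names and the two small samples are hypothetical. For two groups, the ANOVA F statistic should equal the square of the t statistic.

```python
from statistics import mean


def pooled_t_statistic(x, y):
    """t statistic for two independent samples, assuming equal variances."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss_x = sum((v - mx) ** 2 for v in x)
    ss_y = sum((v - my) ** 2 for v in y)
    pooled_variance = (ss_x + ss_y) / (nx + ny - 2)
    return (mx - my) / (pooled_variance * (1 / nx + 1 / ny)) ** 0.5


def anova_f_statistic(groups):
    """One-way ANOVA F statistic (groups may have different sizes)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_treatments = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)
    a = len(groups)
    return (ss_treatments / (a - 1)) / (ss_error / (n_total - a))


# Hypothetical measurements for two groups of cases.
group_1 = [5.1, 4.9, 5.6, 5.8]
group_2 = [6.2, 6.8, 5.9, 6.5, 7.0]

t_value = pooled_t_statistic(group_1, group_2)
f_value = anova_f_statistic([group_1, group_2])
print("t^2 =", t_value ** 2, " F =", f_value)
```

With more than two groups the t test no longer applies directly, while the F statistic generalizes without change.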
Analysis of Variance (ANOVA) Statistical Learning
We can describe the observations in the table by a linear statistical model as
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n$$
$Y_{ij}$ is a random variable denoting the (ij)th observation
$\mu$ is the overall mean common to all treatments
$\tau_i$ is a parameter associated with the ith treatment
$\epsilon_{ij}$ is a random error component
The above expression is the underlying model for a single-factor experiment.
Also, since we require that the observations are taken in a random order and the environment is as
uniform as possible, this experimental design is called a completely randomized design.
Analysis of Variance (ANOVA)
ANOVA F-test
$$F_0 = \frac{SS_{Treatments}/(a-1)}{SS_E/[a(n-1)]} = \frac{MS_{Treatments}}{MS_E}$$
$$H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$$
$$H_1: \tau_i \neq 0 \text{ for at least one } i$$
Significance:
If the Null hypothesis is true, each observation consists of the overall mean $\mu$ plus a
realization of the random error component $\epsilon_{ij}$.
This is equivalent to saying that all N observations are taken from a normal distribution with
mean $\mu$ and variance $\sigma^2$.
Net: if the Null is true, changing levels of the factor has no effect on the mean response!
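As a worked illustration of the F statistic, the sketch below computes the sums of squares and mean squares for a balanced single-factor experiment (a treatments, n observations each). The data values and the function name are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def one_way_anova(groups):
    """ANOVA quantities for a balanced design: a treatments, n observations each."""
    a = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes the same n for every treatment")
    grand_mean = sum(sum(g) for g in groups) / (a * n)
    treatment_means = [sum(g) / n for g in groups]
    # Variation between treatment means.
    ss_treatments = n * sum((m - grand_mean) ** 2 for m in treatment_means)
    # Variation of observations around their own treatment mean.
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, treatment_means) for y in g)
    ms_treatments = ss_treatments / (a - 1)
    ms_error = ss_error / (a * (n - 1))
    return {
        "SS_treatments": ss_treatments,
        "SS_E": ss_error,
        "MS_treatments": ms_treatments,
        "MS_E": ms_error,
        "F0": ms_treatments / ms_error,
    }


# Hypothetical responses at three levels of one factor.
observations = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
table = one_way_anova(observations)
print(table)
```

To reach a decision, F0 would be compared with the critical value of the F distribution with a - 1 and a(n - 1) degrees of freedom at the chosen significance level; a large F0 is evidence against the Null hypothesis.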

Data Mining
Definition: In simple words, data mining is defined as a process used to extract usable data from
a larger set of raw data. It implies analyzing data patterns in large batches of data using one or
more software tools.
Data mining involves effective data collection and warehousing as well as computer processing.
Data mining has applications in multiple fields, like science and research.
As an application of data mining, businesses can learn more about their customers and develop
more effective strategies related to various business functions, and in turn leverage resources in
a more optimal and insightful manner.
Machine Learning, which can be stated as the application and science of algorithms that can help
make sense of data, is the most exciting field of all computer science (Sebastian Raschka, Python
Machine Learning).
In a sense, we're giving computers the ability to learn from data without explicitly writing
computer code for this task. Therefore, these algorithms can turn data into knowledge from which
engineers and/or analysts can make informed decisions.
Learn how to utilize powerful algorithms to spot patterns in data and then make predictions about
future events with quantified accuracy and, in some cases, with an assigned probability.
In the age of modern science and technology, there is a resource that exists in abundance and
whose volume is growing exponentially every day: structured and unstructured data.
Machine Learning is a subfield of Artificial Intelligence that includes the development of
self-learning algorithms to gain knowledge from that data in order to make predictions.
Ultimately, the predictions of machine learning algorithms must provide some value-add ($) to the
decision-maker; otherwise, it's a waste of time and resources.
Examples:
Robust e-mail spam filters
Text, voice, and face recognition software
Web search engines
Challenging games that learn
Robotics
Self-driving cars
Process and Design (device) optimization
Virtual models (Virtual Twins/Reduced Order Models)
These are important examples of how Machine Learning and A.I. affect and will affect future
technologies and applications.
In a nutshell, offering more efficient alternatives for capturing knowledge in data to gradually
improve the performance of predictive models, and make data-driven decisions, is the ultimate
reward from data-science analyses.
Objectives
o On the basis of the training data we would like to:
Accurately predict unseen test cases.
Understand which inputs affect the outcome, and how.
Assess the quality of our predictions and inferences.
Philosophy
It is important to understand the ideas behind the various techniques, in order to know how and
when to use them.
One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
It is important to accurately assess the performance of a method, to know how well or how badly it
is working [simpler methods often perform as well as fancier ones!]
This is an exciting research area, having important applications in science, industry and finance.
Statistical learning is a fundamental ingredient in the training of a modern data scientist.
Overview on Data and Data Sources
Data types that lead to machine learning
1. Supervised Data (Classified, Discrete versus Continuous values): output signals (labels) are
already known
2. Reinforcement Data
3. Unsupervised Data: output signals (labels) are not known
The main use of supervised data occurs in supervised learning, where a model learned from labeled
training data allows for predictions to be made from unseen or future data.
Ex. e-mail spam filtering; e-mail that is correctly marked as spam or not-spam (classification
task). This is used to predict whether or not new e-mail belongs to the two categories.
1.0 Making Predictions with Supervised Learning
Flowchart to Generation of Predictive Models
Mathematical Expressions
Generate Empirical Expressions

$$f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$
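The empirical expression above is a linear combination of the input features with an intercept. As a minimal sketch of how such a model produces a prediction, the function below evaluates it for one sample; the coefficient values (written theta here) and the sample are hypothetical.

```python
def linear_model(theta, x):
    """Evaluate theta[0] + theta[1]*x[0] + theta[2]*x[1] + ... for one sample x."""
    if len(theta) != len(x) + 1:
        raise ValueError("need one coefficient per feature plus an intercept")
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


# Hypothetical coefficients: intercept 1.0, then one weight per feature.
theta = [1.0, 2.0, -0.5]
sample = [3.0, 4.0]
prediction = linear_model(theta, sample)
print(prediction)
```

In supervised learning the coefficients are not chosen by hand as they are here; they are estimated from labeled training data.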
2.0 Classification for Predicting Class Labels
Classification is a subcategory of supervised learning where the goal is to predict the
categorical labels (output signal) of new instances based on past observations.
Class labels are discrete and unordered values that can be understood as the group memberships of
the instances.
-- ex. e-mail spam detection is an example of a binary classification task.
The machine learning algorithm learns a set of rules in order to distinguish between two possible
classes: spam and non-spam e-mail.
Not all sets of class labels need be binary! In multi-class classification the output signal is
not binary, but a predictive model can still be trained to learn from this supervised data set.
Ex. Letter or number recognition.
2.0 Classification for Predicting Class Labels
[Figure: scatter plot with axes x1 and x2 illustrating the concept of a binary classification task
given samples xm. Some of the samples are classified negative (circles) and some are in the
positive class (plus signs).]
The dataset is two-dimensional, which means each sample has two feature values, x1 and x2.
We can use a Machine Learning algorithm to learn a rule.
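One way to learn such a rule for two-dimensional data is the perceptron algorithm. The slides do not name a specific algorithm, so the perceptron is used here only as an illustration, and the eight training samples are hypothetical. The learned rule is a straight line in the (x1, x2) plane that separates the positive from the negative samples.

```python
def train_perceptron(samples, labels, max_epochs=1000):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(samples, labels):
            # A sample is misclassified when the score has the wrong sign (or is zero).
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return w, b


def predict(w, b, x):
    """Apply the learned rule to one sample x = (x1, x2)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1


# Hypothetical, linearly separable training data with labels +1 / -1.
samples = [(2, 3), (3, 3), (3, 4), (4, 3), (0, 0), (1, 0), (0, 1), (1, 1)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_perceptron(samples, labels)
training_predictions = [predict(weights, bias, x) for x in samples]
print(training_predictions)
```

The perceptron is only guaranteed to stop making mistakes when the two classes can be separated by a line; for overlapping classes other classifiers from the course outline would be more appropriate.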

3.0 Regression for Predicting Continuous Outcomes
o A second type of supervised learning is the prediction of continuous outcomes.
o This is called Regression Analysis.
o You are given a number of predictor variables (inputs) and a continuous response variable
(outcome), and you try to find a relationship between those variables that allows you to predict
an outcome.
Linear Regression: Given a predictor variable x and a response variable y, we fit a straight line
to the data that minimizes the distance, called the average squared distance, between the sample
points and the fitted line.
[Figure: scatter plot of sample points (plus signs) with response y against predictor x.]
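For a single predictor, the line that minimizes the average squared vertical distance has a closed-form solution (ordinary least squares). The sketch below is a minimal plain-Python version; the five data points and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def average_squared_distance(xs, ys, slope, intercept):
    """Mean of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy measurements that roughly follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
fit_error = average_squared_distance(xs, ys, slope, intercept)
other_error = average_squared_distance(xs, ys, 2.0, 1.0)
print(slope, intercept, fit_error, other_error)
```

Any other line, including the "true" y = 2x + 1 used to make up the data, gives an average squared distance on these points that is at least as large as that of the fitted line.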
4.0 Reinforcement Learning (R.L.)
o Another type of learning is reinforcement learning, where the goal is to develop a system
(agent) that improves its performance based on interactions with its environment.
o This includes a reward signal.
o We can think of reinforcement learning as a field related to supervised learning.
o However, in R.L. this feedback is not the correct ground truth label or value, but a measure of
how well the action was rated by the reward function.
o Through the interaction with its environment, an agent can then use reinforcement learning to
learn a series of actions that maximizes this reward via an exploratory trial-and-error approach
or deliberative planning.
ex., a chess engine is a reinforcement agent: the agent decides on a series of moves depending on
the state of the board (the environment), and the reward can be defined as win or lose at the end
of the game.
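The agent, environment, state, action, and reward can be made concrete with a very small example. The sketch below uses tabular Q-learning, which the slides do not name and which is chosen here only as one common trial-and-error method. The environment is a hypothetical five-cell corridor in which the agent starts at the left end and is rewarded only on reaching the right end; all constants are illustrative assumptions.

```python
import random

N_STATES = 5            # cells 0..4; cell 4 is the goal (hypothetical toy environment)
ACTIONS = (-1, +1)      # move left or move right
ALPHA = 0.5             # learning rate (assumed value)
GAMMA = 0.9             # discount factor (assumed value)


def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, seed=0):
    """Learn action values Q(state, action) from randomly explored episodes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # pure exploration: try actions at random
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# Greedy policy: in each non-goal cell, pick the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Note that the agent is never told which move is correct; it only receives a reward at the end of each episode, as in the chess example, and the learned values propagate that reward back to earlier states.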
4.0 Reinforcement Learning (R.L.)
o Illustration of R.L. with action, reward, and state
[Figure: reinforcement learning loop between Agent and Environment, with arrows labeled Action,
State, and Reward.]
5.0 Unsupervised Learning (U.L.)
o A third type of learning: in unsupervised learning, we are dealing with unlabeled data or data
of unknown structure.
o Using unsupervised learning techniques (clustering) we are able to explore the structure of the
data to extract meaningful information without the guidance of a known outcome variable or reward
function.
o Goal: discovering hidden structures with unsupervised learning.
Clustering: An exploratory data analysis technique that allows us to organize a pile of
information into meaningful subgroups (clusters) without having any prior knowledge of their group
membership.
Each group of objects shares a certain degree of similarity, while its members are more dissimilar
to objects in other clusters, which is why clustering is sometimes called unsupervised
classification.
Clustering is a great technique for structuring information and deriving meaningful relationships
among the data.
[Figure: scatter plot with axes x1 and x2 illustrating the concept of clustering to find
subgroups.]
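As an illustration of clustering without labels, the sketch below implements a basic k-means loop (K-means is listed in the course outline) in plain Python. The six two-dimensional points and the choice of starting centroids are hypothetical; in practice the starting centroids are usually chosen at random.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def k_means(points, centroids, max_iterations=10):
    """Alternate between assigning points and moving centroids until stable."""
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical unlabeled samples with features (x1, x2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = [points[0], points[3]]

final_centroids, cluster_labels = k_means(points, initial_centroids)
print(final_centroids, cluster_labels)
```

The algorithm is never given group memberships; the two subgroups emerge only from the distances between samples.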
2. Is Learning Feasible
The target function f is the object of the learning. The most important aspect of the target
function is that it is unknown. We really mean unknown!!
What is the question that it raises?
How could a limited data set reveal enough information to pin down the entire target function f?
[Figure: visual learning problem. The first two rows are learning examples, labeled f = -1 and
f = +1; each example is a 9-bit vector visually represented as a 3x3 black and white array. A
final array is labeled f = ?. What is your answer?]
A simple learning task with 6 training examples of a +/- 1 class label.
The answer is not so obvious, since more than one function fits the six training examples, with
some at +1 and others at -1.
There doesn't seem to be enough information to learn from and thus make an accurate prediction.
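The same difficulty can be shown by brute force on a smaller problem. The sketch below assumes 3-bit inputs instead of the 9-bit arrays of the slide, with five hypothetical training examples, and enumerates every possible +/- 1 function on those inputs to count how many agree with the training data.

```python
from itertools import product

# All 8 possible 3-bit inputs (a smaller stand-in for the 9-bit arrays).
inputs = list(product([0, 1], repeat=3))

# Hypothetical training examples: input -> known value of f.
training_set = {
    (0, 0, 0): -1,
    (0, 0, 1): +1,
    (0, 1, 0): +1,
    (0, 1, 1): -1,
    (1, 0, 0): +1,
}

# Enumerate every function from the 8 inputs to {-1, +1} (2**8 = 256 candidates)
# and keep those that reproduce all training examples.
consistent = []
for outputs in product([-1, +1], repeat=len(inputs)):
    candidate = dict(zip(inputs, outputs))
    if all(candidate[x] == y for x, y in training_set.items()):
        consistent.append(candidate)

# For each unseen input, count how many consistent functions predict +1.
unseen = [x for x in inputs if x not in training_set]
votes_for_plus = {x: sum(1 for c in consistent if c[x] == +1) for x in unseen}

print(len(consistent), votes_for_plus)
```

Every function that fits the training data perfectly is free to take either value on the unseen inputs, so the training data alone cannot single out the target function.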
When we have a training set such as the prior examples, for the first two rows we know the value
of f, but this doesn't mean we have learned f.
Meaning, we don't know the function.
So now we have a dilemma, right?
If multiple functions f fit the data, how do we choose?
Well, the quality of the learning will be determined by how close our prediction is to the true
value for the training set, and by the confidence this gives for future prediction accuracy.
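1. Overview on Data and Data Sources
o Use of the basic learning setup
$$f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$
Hypothesis set H, Final Hypothesis g:
$$g_\theta(x_i) \approx f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$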
2. Is Learning Feasible: Error and Noise
Two notions in the feasibility of learning need to be addressed for real-world applications:
What do we mean when we say that our hypothesis approximates the target function well?
What is the nature of the target function?
In many situations, noise exists that makes the output of f not uniquely determined by the input!
What are the ramifications of having a noisy target on the learning problem?
Learning is not expected to replicate the target function perfectly.
The final function g is only an approximation of f.
To quantify how well g approximates f, we need to define an error measure.
An error measure quantifies how well each hypothesis h in the model approximates the target
function f.
$$\text{Error} = E(h, f)$$
In an ideal world, E(h, f) should be a user-specified value.
The same learning task in different contexts may warrant the use of different error measures.
One can view E(h, f) as the cost of using h when you should use f.
This cost depends on what h is used for, and cannot be dictated just by learning techniques.
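To illustrate how the choice of error measure changes the assessment of the same hypothesis, the sketch below averages three different pointwise errors over a small set of inputs. The target f, the hypothesis h, the inputs, and the cost values are all hypothetical; the cost-weighted measure stands for an application where one kind of mistake is more expensive than the other.

```python
def average_error(h, f, xs, pointwise_error):
    """E(h, f): average of a pointwise error e(h(x), f(x)) over the inputs xs."""
    return sum(pointwise_error(h(x), f(x)) for x in xs) / len(xs)


def binary_error(predicted, actual):
    """1 for a wrong label, 0 for a correct one."""
    return 0.0 if predicted == actual else 1.0


def squared_error(predicted, actual):
    return float((predicted - actual) ** 2)


def make_cost_error(false_positive_cost, false_negative_cost):
    """Build an error measure that charges different costs for the two mistakes."""
    def cost_error(predicted, actual):
        if predicted == actual:
            return 0.0
        return false_positive_cost if predicted == 1 else false_negative_cost
    return cost_error


def f(x):
    """Hypothetical target: +1 from x = 5 upward."""
    return 1 if x >= 5 else -1


def h(x):
    """Hypothetical hypothesis: +1 only from x = 7 upward (misses x = 5 and 6)."""
    return 1 if x >= 7 else -1


xs = range(10)
e_binary = average_error(h, f, xs, binary_error)
e_squared = average_error(h, f, xs, squared_error)
e_costly_miss = average_error(h, f, xs, make_cost_error(1.0, 10.0))
e_costly_alarm = average_error(h, f, xs, make_cost_error(10.0, 1.0))
print(e_binary, e_squared, e_costly_miss, e_costly_alarm)
```

The hypothesis makes only false negatives, so it looks ten times worse in an application where missed positives are expensive than in one where false alarms are expensive, even though h and f are unchanged.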
In many practical applications, the data we learn from are not generated by deterministic target
functions.
Instead, they are generated in a noisy way such that the output is not uniquely determined by the
input.
ex., two people with identical salaries, outstanding loans, etc. may end up with different credit
behavior. Hence, the credit function is not really a deterministic function, but a noisy one.
We handle this type by introducing a distribution by which we can take the output y to be a random
variable that is affected by, rather than determined by, the input x.
The target distribution $P(y \mid x)$ is substituted for the target function $y = f(x)$.
A noisy target can be viewed as a deterministic one plus added noise.
A data point (x, y) is now generated by the joint distribution
$$P(x, y) = P(x)\,P(y \mid x)$$
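A noisy target can be simulated directly from this description. The sketch below assumes a hypothetical deterministic part f(x) = 2x + 1, Gaussian noise with a made-up standard deviation, and a uniform input distribution P(x); none of these choices come from the slides.

```python
import random


def target_function(x):
    """Hypothetical deterministic part of the target."""
    return 2.0 * x + 1.0


def sample_y_given_x(x, rng, noise_std=0.5):
    """Draw y from P(y | x): the deterministic target plus Gaussian noise."""
    return target_function(x) + rng.gauss(0.0, noise_std)


def sample_data(n, rng):
    """Draw n points (x, y) from the joint distribution P(x) P(y | x)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)    # P(x): assumed uniform on [0, 10]
        y = sample_y_given_x(x, rng)  # P(y | x)
        data.append((x, y))
    return data


rng = random.Random(0)

# The same input can produce different outputs.
y_first = sample_y_given_x(3.0, rng)
y_second = sample_y_given_x(3.0, rng)

# Averaging many outputs at the same input approaches the deterministic part f(3) = 7.
draws = [sample_y_given_x(3.0, rng) for _ in range(10000)]
conditional_mean = sum(draws) / len(draws)

dataset = sample_data(5, rng)
print(y_first, y_second, conditional_mean)
```

This mirrors the credit example: two identical inputs give different outputs, yet the input still shifts the distribution of the output.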
Questions ?
Session #1
Instructors:
Drs. Rafik Borji, Egidio (Ed) Marotta
Downloading Programming Languages
Python
R-Studio
Downloading Python from Anaconda Site
https://www.continuum.io/downloads
Open Source and Free Language

[Screenshot: Anaconda download page (https://www.continuum.io/downloads) with the recommended
installer marked, followed by the Python icon.]
Python GUI for Spyder from Anaconda
[Screenshot: Spyder window with labeled panes: Editor, File Explorer, Variable Output, and Console
(MatLab view).]
Python History
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.
It was initially developed in the early 1990s by Guido van Rossum and is now controlled by the
not-for-profit Python Software Foundation, sponsored by (among others) Microsoft and Google.
Python was named for the BBC comedy series Monty Python's Flying Circus.
Python 2.0 was released on October 16, 2000.
Python 3.0, a major, backwards-incompatible release, was released on December 3, 2008.
It started as a successor to the ABC programming language.
For aspiring Data Scientists, Python is probably the most important language to learn because of
its rich ecosystem.
https://docs.python.org/3/tutorial/introduction.html
Introduction to R and R Studio
Lecture 1
Contents
Getting started with R
Installing R
Installing RStudio
Using RStudio
Getting Started with R
History
R is a dialect of the S language
1976: S started as a statistical analysis environment implemented as FORTRAN libraries (John
Chambers)
1988: Rewritten in C language
1998: Version 4 of S language released: this is what is used today
1991: R language created in New Zealand
1993: 1st announcement of R to public
1995: R released under the GNU General Public License, making R free software
1996: R-help and R-devel created
1997: R core group formed to control the source code for R
2000: R 1.0.0 released
2013: R 3.0.2 released (December)
Present: R version 3.4.1
Features
Syntax similar to S
Very active development
Runs on almost any standard computing platform
Lean (functionality divided into modular packages)
Sophisticated graphics capabilities compared to statistical packages
Powerful for developing new tools
Very active user community
It is free software (4 freedoms):
1. Run the program
2. Study how the program works
3. Redistribute copies
4. Improve the program and release the improvements to the public
www.fsf.org
R system
R system divided into two parts:
1. The base R system : downloaded from CRAN
2. Everything else
R functionality divided into many packages:
1. The base packages, contained in base R and required to run R
2. The other packages included in base R (stats, datasets, graphics, etc.)
3. Recommended packages (class, cluster, lattice, etc.)
4. Other packages from CRAN (10,000+)
5. Packages available on personal websites
R resources (www.cran.r-project.org)
1. R installation
2. An introduction to R
3. Writing R extensions
4. R data imports/exports
5. R internals
Installing R
Downloading-installing R version
https://cran.r-project.org/bin/windows/base/
CRAN: Comprehensive R Archive Network
Download available version of R
Run installation file
Installing R
R environment
install.packages()
library()
search()
help.start()
help(NameOfFunction) or ?NameOfFunction
Variable assignment
Calling a function (c for concatenation)
Comments (#)
print()
ls()
rm()
Installing R Studio
R Studio
Most popular R code editor.
www.rstudio.com
https://www.rstudio.com/products/rstudio/download/
Control Panel\User Accounts\User Accounts to change/add environment
variables
Installing R Studio
R Studio Console
Installing R Studio
Hype Cycle for Data Science
Installing R Studio
Hype Cycle