
Course Description Outline

University of Houston
Department of Mechanical Engineering
Guide to Engineering Data Science
Fall 2017
Session #1
Instructors: Drs. Rafik Borji, Egidio (Ed) Marotta

@#GECON&*
1
October 1, 2017
Course Description:
1. Overview on Data and Data Sources
o Data types
o Data sources
o Data acquisition
o Date Handling
o Theoretical
o Empirical
2. Data Preparation
Data Processing: data cleansing and separation
o Filtering
o Imputation
o Dimensionality reduction
o Normalization and Transformation
o Feature extraction
Data aggregation: collect and summarize data
o Basic statistics
o Distribution fitting
o Baseball card aggregation
Data enrichment: add new information to data
o Annotation
o Relational algebra
o Feature addition
Course Description:
3. Advanced Data Analytics
Clustering
o Hierarchical clustering
o X-means, Canopy, a priori clustering
o Topic modeling for text data
o Fractal and DBSCAN
o Gaussian mixture models
o K-means clustering
Regression for Variables Assessment
o Tree-based methods
o Generalized linear models
o Regression with shrinkage
o Stepwise regression
Hypothesis Testing
o T-test for groups comparison
o ANOVA
Classification
o Bayesian Network
o Neural nets
o Random Forests
o Deep Learning
o Decision Trees
o K-nearest Neighbors
o Naïve Bayes
o Hidden Markov Model
o Support Vector Machines
Regression for Value Prediction
o Tree-based methods
o Generalized linear models
o K-nearest neighbors
Recommendation
o Collaborative filtering (SVD, PCA)
o Content-based methods
o Graph-based methods
Course Description:
3. Advanced Data Analytics
Logical reasoning
o Expert systems and Logical reasoning
Optimization
o Stochastic research
o Genetic algorithm, simulated annealing and gradient search
o Linear, integer and non-linear programming
o Active learning
o Ensemble learning
Simulation
o Discrete event simulator
o Markov models
o Agent-based simulation
o Monte Carlo simulation
o Systems dynamics
o Activity-based simulation
o ODEs and PDEs
o Fuzzy logic
4. Data Models Operationalization
o Dashboard Development
o Introduction to Internet of Things
o Introduction to Digital Twins
o Introduction to Adaptive Reduced Order Modeling

Artificial Intelligence
From personal assistants like Siri, Alexa, and Cortana to movie
suggestions on Netflix, artificial intelligence (AI) is rapidly
becoming ubiquitous in everyday life. As this technology continues to
advance in capability and prevalence, we seek to explore AI and
several closely related subtopics: machine learning, deep learning,
and neural networks.

What are the Differences between Artificial Intelligence, Machine
Learning, and Deep Learning?
Artificial Intelligence
What are the Differences between Artificial
Intelligence, Machine Learning, and Deep Learning?

While artificial intelligence (AI), machine learning (ML), and
deep learning (DL) are often used interchangeably, there
are several key differences. One way to visualize the
relationship is through a series of concentric circles.

At a basic level, artificial intelligence is the concept of machines
accomplishing tasks which have historically required human
intelligence [1].

Applied AI: Machines designed to complete very specific tasks like
navigating a vehicle, trading stocks, or playing chess.

General AI: Machines designed to complete any task which would normally
require human intervention. The broad nature of General AI requires
machines to learn as they encounter new tasks or situations.
1. https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/#35aa37802742
Artificial Intelligence - Layers
Cutting-edge research in AI is focusing on machine learning (ML).

In simple terms, ML is the process of building machines which can
access data, apply algorithms to this data, and then train themselves
to deduce valuable insights based on these underlying datasets.

The key difference between ML and AI is that ML does not rely
explicitly on the code of its creator. Rather, ML systems use
computer code as a starting point and then gather data, information,
and inputs which can be studied.

Deep learning takes artificial intelligence a step further, by
mimicking how the human brain works through the use of artificial
neural networks. In an artificial neural network, each neuron is
charged with providing a binary (yes/no) response to basic questions
about a piece of data.

2. http://www.explainthatstuff.com/introduction-to-neural-networks.html
3. https://www.globalxfunds.com/artificial-intelligence-explained/?utm_source=Dianomi&utm_medium=CPC&utm_campaign=AI+Explained
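As a minimal sketch of the idea that a neuron gives a binary (yes/no) response, the Python snippet below implements a single threshold neuron. The function name `neuron`, the weights, and the bias are illustrative assumptions and do not come from the slides or from any library.

```python
def neuron(inputs, weights, bias):
    """Return 1 ("yes") if the weighted sum of the inputs exceeds zero, else 0 ("no")."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Hypothetical weights that make the neuron behave like a logical AND
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [neuron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so only that input produces a "yes".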
Types of Learning

The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. It's very broad, and difficult to fit in a single
framework. Hence, different learning paradigms have arisen to deal with
different situations and different assumptions.

Supervised Learning: It's the most studied and most utilized type of
learning.
The variation in learning within this category depends on the
nature of the data set.
Reinforcement Learning: When the data does not explicitly contain the
correct output for each input, we are no longer in a supervised learning setting.
Ex. Supervised Learning: input, correct output.
Reinforcement Learning: input, some output, grade for this
output.
Unsupervised Learning: The training data does not contain any output
information at all. It's the task of spontaneously finding patterns and
structure in input data.
It's being able to place items with similar properties together into
one category.
Other Views/Types of Learning

The main field dedicated to the subject of learning is called Machine
Learning. However, there are two other, similar approaches to learning from
data, each with its own emphasis.

Statistical Learning: It has the same premise as learning from data; that
is, the use of a set of observations to uncover an underlying process.
It's a mathematical field where the process is a probability distribution
and the observations are samples from that distribution.
Statistics focuses on somewhat idealized models and analyzes them
in great detail.
Data Mining: A field that is quite practical and that focuses on finding
patterns, correlations, or anomalies in large relational databases.
Other Views/Types of Learning

Statistical Learning
o Statistical learning arose as a subfield of Statistics.
o There is much overlap: both fields focus on supervised and unsupervised
problems:
Machine learning has a greater emphasis on large-scale applications and
prediction accuracy.
Statistical learning emphasizes models and their interpretability, and
precision and uncertainty.
o But the distinction has become more and more blurred,
and there is a great deal of cross-fertilization.
o Machine learning has the upper hand in Marketing!


Other Views/Types of Learning

Statistical Learning
o Analysis of Variance (ANOVA)

The Purpose of Analysis of Variance
In general, the purpose of analysis of variance (ANOVA) is to test for
significant differences between means.
If we are only comparing two means, then ANOVA will give the same
results as the t-test for independent samples (if we are comparing two
different groups of cases or observations), or the t-test for dependent
samples (if we are comparing two variables in one set of cases or
observations).
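A short derivation can make the two-means claim concrete for the independent-samples case. It is only a sketch, assuming two groups of equal size $n$ and a pooled variance estimate; the symbols $\bar{y}_1$, $\bar{y}_2$, $\bar{y}$ (grand mean), $s_p^2$, $F_0$ and $t_0$ are introduced here for illustration. Under these assumptions the ANOVA F statistic equals the square of the t statistic, so both tests lead to the same conclusion:

```latex
\begin{aligned}
SS_{\text{treatments}} &= n\sum_{i=1}^{2}\left(\bar{y}_i - \bar{y}\right)^2
  = \frac{n}{2}\left(\bar{y}_1 - \bar{y}_2\right)^2,
  \qquad MS_E = s_p^2, \\
F_0 &= \frac{SS_{\text{treatments}}/(2-1)}{MS_E}
  = \frac{\left(\bar{y}_1 - \bar{y}_2\right)^2}{s_p^2\,(2/n)}
  = \left(\frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{2/n}}\right)^2 = t_0^2 .
\end{aligned}
```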

Analysis of Variance (ANOVA)
Statistical Learning
We can describe the observations in the table by a linear statistical model as

$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a; \quad j = 1, 2, \ldots, n$$

$Y_{ij}$ is a random variable denoting the (ij)th observation
$\mu$ is the overall mean common to all treatments
$\tau_i$ is a parameter associated with the ith treatment
$\epsilon_{ij}$ is a random error component

The above expression is the underlying model for a single-factor experiment.

Also, since we require that the observations be taken in random order and
that the environment be as uniform as possible, this experimental design is
called a completely randomized design.

Analysis of Variance (ANOVA)
Statistical Learning
ANOVA F-test

$$F_0 = \frac{SS_{\text{treatments}}/(a-1)}{SS_E/[a(n-1)]} = \frac{MS_{\text{treatments}}}{MS_E}$$

$$H_0: \tau_1 = \tau_2 = \cdots = \tau_a = 0$$
$$H_1: \tau_i \neq 0 \ \text{for at least one } i$$

Significance:
If the null hypothesis is true, each observation consists of the
overall mean $\mu$ plus a realization of the random error component $\epsilon_{ij}$.
This is equivalent to saying that all N observations are taken from a
normal distribution with mean $\mu$ and variance $\sigma^2$.
Net: if the null hypothesis is true, changing levels of the factor has no
effect on the mean response!
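As a rough sketch of how the F statistic is computed for a balanced single-factor experiment (a treatments, n observations each), the Python function below follows the treatment and error sums of squares named on this slide. The function name `one_way_anova_f` and the data are made up for illustration; a real analysis would also compare the result against the F distribution with (a - 1) and a(n - 1) degrees of freedom.

```python
def one_way_anova_f(groups):
    """Compute F0 = MS_treatments / MS_E for a balanced single-factor experiment."""
    a = len(groups)          # number of treatments
    n = len(groups[0])       # observations per treatment (balanced design assumed)
    grand_mean = sum(sum(g) for g in groups) / (a * n)
    group_means = [sum(g) / n for g in groups]
    # Between-treatment sum of squares
    ss_treatments = n * sum((m - grand_mean) ** 2 for m in group_means)
    # Within-treatment (error) sum of squares
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ms_treatments = ss_treatments / (a - 1)
    ms_error = ss_error / (a * (n - 1))
    return ms_treatments / ms_error

# Made-up data: three treatments, three observations each
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f0 = one_way_anova_f(data)
print(round(f0, 2))  # 13.0
```

Here the treatment means (2, 3, 6) differ far more than the scatter within each treatment, so the ratio is large and speaks against the null hypothesis.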
Other Views/Types of Learning

Data Mining
Definition: In simple words, data mining is defined as a process used to
extract usable data from a larger set of raw data. It implies analyzing
data patterns in large batches of data using one or more software tools.

Data mining involves effective data collection and warehousing as well as
computer processing. Data mining has applications in multiple fields, like
science and research.

As an application of data mining, businesses can learn
more about their customers and develop more effective
strategies related to various business functions, and
in turn leverage resources in a more optimal and
insightful manner.
Machine Learning, which can be stated as the application and science of
algorithms that can help make sense of data, is the most exciting field of all
computer science, according to Sebastian Raschka (Python Machine Learning).

In a sense, we're giving computers the ability to learn from data without
explicitly writing computer code for the task. These algorithms can therefore
turn data into knowledge from which engineers and/or analysts can make
informed decisions.

Learn how to utilize powerful algorithms to spot patterns in data and
then make predictions about future events with quantified accuracy
and, in some cases, with an assigned probability.
In the age of modern science and technology, there is a resource that
exists in abundance and whose volume is growing exponentially
every day: structured and unstructured data.
Machine Learning is a subfield of Artificial Intelligence that includes
the development of self-learning algorithms to gain knowledge from
that data in order to make predictions.
Ultimately, machine learning algorithms' predictions must provide
some value-add ($) to the decision-maker; otherwise, it's a waste
of time and resources.
Examples:

Robust e-mail spam filters
Text, voice, and face recognition software
Web search engines
Challenging games that learn
Robotics
Self-driving cars
Process and Design (device) optimization
Virtual models (Virtual Twins/Reduced Order Models)

These are important examples of how Machine Learning and A.I. affect and
will affect future technologies and applications.

In a nutshell, offering more efficient alternatives for capturing knowledge in
data to gradually improve the performance of predictive models, and make
data-driven decisions, is the ultimate reward from data-science analyses.

Objectives
o On the basis of the training data we would like to:
Accurately predict unseen test cases.
Understand which inputs affect the outcome, and how.
Assess the quality of our predictions and inferences.

Philosophy
It is important to understand the ideas behind the various techniques, in
order to know how and when to use them.
One has to understand the simpler methods first, in order to grasp the more
sophisticated ones.
It is important to accurately assess the performance of a method, to know
how well or how badly it is working [simpler methods often perform as well as
fancier ones!]
This is an exciting research area, having important applications in science,
industry and finance.
Statistical learning is a fundamental ingredient in the training of a modern
data scientist.
Overview on Data and Data Sources
Data types that lead to machine learning
1. Supervised Data (Classified, Discrete versus Continuous values): output signals (labels) are already known
2. Reinforcement Data
3. Unsupervised Data: output signals (labels) are not known

The main use of supervised data occurs in supervised learning, where a
model learned from labeled training data allows predictions to be made
about unseen or future data.

Ex. e-mail spam filtering: e-mails that are correctly marked as spam or
not-spam (classification task). These are used to predict which of the two
categories a new e-mail belongs to.
Overview on Data and Data Sources
1.0 Making Predictions with Supervised Learning
Flowchart to Generation of Predictive Models

[Figure: flowchart with steps labeled "Mathematical Expressions" and "Generate Empirical Expressions"]

$$f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$
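To make the linear expression on this slide concrete, here is a minimal Python sketch that evaluates an intercept plus a weighted sum of the features. The function name `f_theta` and all numeric values are hypothetical and chosen only for illustration.

```python
def f_theta(theta, x):
    """Evaluate theta[0] + theta[1]*x[0] + ... for one sample x."""
    if len(theta) != len(x) + 1:
        raise ValueError("theta needs one more entry than x (the intercept)")
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

theta = [1.0, 2.0, -0.5]   # hypothetical parameters: intercept, then one weight per feature
x = [3.0, 4.0]             # one sample with two feature values
prediction = f_theta(theta, x)
print(prediction)  # 5.0
```

Supervised learning then amounts to choosing the parameters so that such predictions agree with the known outputs in the training data.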

Overview on Data and Data Sources
2.0 Classification for Predicting Class Labels

Classification is a subcategory of supervised learning where the goal is
to predict the categorical labels (output signal) of new instances based
on past observations.

Class labels are discrete and unordered values that can be understood
as the group memberships of the instances.
-- ex. e-mail spam detection is an example of a binary classification task

The machine learning algorithm learns a set of rules in order to
distinguish between two possible classes (binary): spam and non-spam
e-mail.

Not all sets of class labels need be binary! In multi-class classification
the output signal is not binary, but a predictive model can still be
trained on such a supervised data set.

Ex. Letter or number recognition.


Overview on Data and Data Sources
2.0 Classification for Predicting Class Labels

[Figure: scatter plot in the (x1, x2) plane illustrating the concept of a
binary classification task given x_m samples. Some of the samples belong to
the negative class (circles) and some to the positive class (plus signs).]

The dataset is two-dimensional, which means each
sample has two feature values, x1 and x2.

We can use a machine learning algorithm to learn a rule.
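One simple way to learn such a rule is the perceptron update, sketched below in plain Python. The slides do not say which algorithm is meant, so the perceptron is an assumption made here, and the sample coordinates, labels, and function names are invented for illustration.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b so that the sign of w.x + b matches the labels (+1/-1)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if predicted != y:            # update only on mistakes
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, x):
    """Apply the learned rule to one sample x = (x1, x2)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Made-up, linearly separable samples: the negative class sits near the origin
samples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions == labels)  # True
```

The learned rule is a straight line in the (x1, x2) plane: samples on one side are labeled +1 and those on the other side -1.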


Overview on Data and Data Sources
3.0 Regression for Predicting Continuous Outcomes

o A second type of supervised learning is the prediction of continuous
outcomes.
o This is called Regression Analysis.
o You are given a number of predictor variables (inputs) and a
continuous response variable (outcome), and you try to find a
relationship between those variables that allows you to predict an outcome.

Linear Regression: Given a predictor variable x and a response variable y,
we fit a straight line to the data that minimizes the average squared
distance between the sample points and the fitted line.

[Figure: scatter plot of sample points (plus signs), y versus x.]
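The line that minimizes the average squared distance has a closed-form solution for a single predictor. The sketch below uses plain Python; the data points and the function names `fit_line` and `mean_squared_error` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the average squared vertical distance."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def mean_squared_error(xs, ys, intercept, slope):
    """Average squared distance between the samples and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]      # made-up predictor values
ys = [2.1, 3.9, 6.2, 7.8]      # made-up responses
intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))  # 0.15 1.94
```

Any other line, for example one with a slightly different slope, gives a larger value of `mean_squared_error` on the same data.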
Overview on Data and Data Sources
4.0 Reinforcement Learning (R.L.)

o A second type of learning, where the goal is to develop a system (agent)
that improves its performance based on interactions with its
environment.
o The feedback from the environment includes a reward signal.
o We can think of reinforcement learning as a field related to
supervised learning.
o However, in R.L. this feedback is not the correct ground-truth
label or value, but a measure of how well the action performed, as
measured by the reward function.
o Through interaction with its environment, an agent can then use
reinforcement learning to learn a series of actions that maximizes
this reward via an exploratory trial-and-error approach or
deliberative planning.
Ex.: a chess engine is a reinforcement agent. The agent
decides on a series of moves depending on the state of the board (the
environment), and the reward can be defined as win or lose at the end of
the game.
Overview on Data and Data Sources
4.0 Reinforcement Learning(R.L.)

o Illustration of R.L. with action, reward, and state

[Figure: Reinforcement Learning (R.L.) diagram showing the Agent and the
Environment linked by Action, State, and Reward.]
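The agent, environment, action, state, and reward loop can be sketched with tabular Q-learning on a tiny corridor. Q-learning is swapped in here as one well-known trial-and-error method; the slides do not name a specific algorithm. The corridor, the reward of 1 at the goal, and all parameter values are assumptions made for illustration.

```python
import random

N_STATES, GOAL = 4, 3            # corridor with states 0..3; state 3 is the goal
ACTIONS = (-1, +1)               # move left or move right

def step(state, action):
    """Environment: return (next_state, reward) for the agent's action."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]     # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(1000):                     # cap on steps per episode
            best = max(q[state])
            greedy = [i for i, v in enumerate(q[state]) if v == best]
            # Explore with probability epsilon, otherwise act greedily (ties broken at random)
            a = rng.randrange(2) if rng.random() < epsilon else rng.choice(greedy)
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted future value
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == GOAL:
                break
    return q

q = train()
policy = ["left" if row[0] > row[1] else "right" for row in q[:GOAL]]
print(policy)
```

The agent is never told the correct move. It only receives the reward, and after enough episodes the action with the highest estimated value in every non-goal state is the move toward the goal.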

Overview on Data and Data Sources
5.0 Unsupervised Learning (U.L.)
o A third type of learning. In this type of learning, we are dealing with
unlabeled data or data of unknown structure.
o Using unsupervised learning techniques (clustering), we are able to
explore the structure of the data to extract meaningful information
without the guidance of a known outcome variable or reward
function.
o Goal: discovering hidden structures with unsupervised learning

Clustering: an exploratory data analysis technique that allows us to
organize a pile of information into meaningful subgroups (clusters)
without having any prior knowledge of their group membership.

Each group of objects shares a certain degree of similarity but is more
dissimilar to objects in other clusters, which is why clustering is
sometimes called unsupervised classification.

Clustering is a great technique for structuring information and deriving
meaningful relationships among the data.

[Figure: scatter plot in the (x1, x2) plane illustrating the concept of
clustering to find subgroups.]
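As a minimal sketch of clustering, the snippet below runs a plain k-means loop on two-dimensional points without using any labels. K-means is only one of the clustering methods listed in the course outline; the points, the starting centers, and the function name `kmeans` are made up for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance)
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Move each center to the mean of its assigned points (keep it if the cluster is empty)
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

# Made-up unlabeled points forming two visible groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

The two subgroups are recovered purely from the similarity (distance) between points, with no outcome variable or reward involved.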
2. Is Learning Feasible
The target function f is the object of the learning. The most
important aspect of the target function is that it is unknown.
We really mean unknown!

What is the question that it raises?
How could a limited data set reveal enough information
to pin down the entire target function f?

[Figure: visual learning problem. The first 2 rows are learning examples,
labeled f = -1 and f = +1; each example is a 9-bit vector visually
represented as a 3x3 black-and-white array. For the last array, f = ?
What is your answer?]
2. Is Learning Feasible
A simple learning task with 6 training examples of a +/- 1 class label.

The answer is not so obvious, since more than one function fits
the six training examples: some of these functions give +1 on the test
array and others give -1.

There doesn't seem to be enough information to learn from and thus
make an accurate prediction.

[Figure: the same visual learning problem. The first 2 rows are learning
examples, labeled f = -1 and f = +1, each a 9-bit vector visually
represented as a 3x3 black-and-white array; for the last array, f = ?]
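The ambiguity can be checked by brute force on a smaller problem. The sketch below uses 3-bit inputs instead of the 9-bit arrays on the slide, which is a simplification made here to keep the enumeration tiny. The training labels are invented. It counts how many candidate target functions agree with the training examples and how they vote on an unseen input.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))          # all 8 possible 3-bit inputs
# Hypothetical training set: five inputs with known labels (+1 / -1)
training = {(0, 0, 0): -1, (0, 0, 1): +1, (0, 1, 0): +1, (0, 1, 1): -1, (1, 0, 0): +1}

# Every candidate target function is one assignment of +1/-1 to the 8 inputs
consistent = []
for outputs in product([-1, +1], repeat=len(inputs)):
    f = dict(zip(inputs, outputs))
    if all(f[x] == y for x, y in training.items()):
        consistent.append(f)

test_point = (1, 1, 1)                             # not in the training set
votes_plus = sum(1 for f in consistent if f[test_point] == +1)
print(len(consistent), votes_plus)  # 8 4
```

Eight functions fit the training data perfectly, and exactly half of them say +1 on the unseen input. The training data alone cannot decide between them.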
2. Is Learning Feasible

When we have a training set such as the prior examples, we know the value
of f for the first two rows, but this doesn't mean we have learned
f.
Meaning, we don't know the function.

So now we have a dilemma, right?

If multiple functions f fit the training examples, how do we choose?

Well, the quality of the learning will be determined by how close our
prediction is to the true value for the training set, which in turn gives
confidence in future prediction accuracy.

1. Overview on Data and Data Sources
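o Use of the basic learning setup

$$f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$

Final hypothesis, selected from the hypothesis set $H$:

$$g_\theta(x_i) \approx f_\theta(x_i) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i$$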

2. Is Learning Feasible: Error and Noise

Two notions in the feasibility-of-learning problem need to
be addressed for real-world applications.

What do we mean when we say that our hypothesis
approximates the target function well?

What is the nature of the function?

In many situations, noise exists that makes the output of f not
uniquely determined by the input!

What are the ramifications of having a noisy target on the learning
problem?

Learning is not expected to replicate the target function perfectly.

The final function g is only an approximation of f.

2. Is Learning Feasible: Error and Noise

To quantify how well g approximates f, we need to define an error
measure.

An error measure quantifies how well each hypothesis h in the
model approximates the target function f.
Error = E(h, f)
In an ideal world, E(h, f) should be user-specified.

The same learning task in different contexts may warrant the use of
different error measures.

One can view E(h, f) as the cost of using h when you should use
f.

This cost depends on what h is used for, and cannot be dictated just
by learning techniques.
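To illustrate how the same hypothesis can score differently under different error measures, the sketch below computes an average cost E(h, f) for binary outputs, with separate, user-chosen costs for the two kinds of mistakes. The cost values, the data, and the function name `error` are hypothetical.

```python
def error(h_outputs, f_outputs, cost_false_accept=1.0, cost_false_reject=1.0):
    """Average cost E(h, f) of using h in place of f on binary (+1/-1) outputs."""
    total = 0.0
    for h, f in zip(h_outputs, f_outputs):
        if h == +1 and f == -1:
            total += cost_false_accept     # h says +1 where f says -1
        elif h == -1 and f == +1:
            total += cost_false_reject     # h says -1 where f says +1
    return total / len(f_outputs)

f_outputs = [+1, +1, -1, -1, +1]           # made-up target values
h_outputs = [+1, -1, +1, -1, +1]           # made-up hypothesis outputs

plain = error(h_outputs, f_outputs)                              # both mistakes cost 1
strict = error(h_outputs, f_outputs, cost_false_accept=10.0)     # false accepts cost 10
print(plain, strict)  # 0.4 2.2
```

The hypothesis makes the same two mistakes in both cases. Only the application-specific costs change, which is why the error measure has to come from the user rather than from the learning technique.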

2. Is Learning Feasible: Error and Noise

In many practical applications, the data we learn from are not
generated by deterministic target functions.

Instead, they are generated in a noisy way such that the output is not
uniquely determined by the input.
Ex.: two people with identical salaries, outstanding loans, etc.
may end up with different credit behavior. Hence, the credit
function is not really a deterministic function, but a noisy one.

We handle this case by introducing a distribution: we take the output y
to be a random variable that is affected by, rather than determined by,
the input x.

The target distribution $P(y \mid x)$ is substituted for the target function $y = f(x)$.
A noisy target can be viewed as a deterministic one plus added noise.
A data point (x, y) is now generated by the joint distribution
$P(x, y) = P(x)\,P(y \mid x)$.
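The two-stage generation of a data point, first x from P(x) and then y from P(y | x), can be sketched as follows. The linear target, the uniform input distribution, and the Gaussian noise are all assumptions chosen for illustration; the slides do not specify them.

```python
import random

def f(x):
    """Hypothetical deterministic target function."""
    return 2.0 * x + 1.0

def sample_point(rng, noise_std=0.5):
    """Draw (x, y) from P(x, y) = P(x) P(y | x)."""
    x = rng.uniform(0.0, 1.0)                  # x ~ P(x), assumed uniform here
    y = f(x) + rng.gauss(0.0, noise_std)       # y ~ P(y | x): f(x) plus Gaussian noise
    return x, y

rng = random.Random(0)
data = [sample_point(rng) for _ in range(5)]
for x, y in data:
    print(f"x={x:.2f}  y={y:.2f}  f(x)={f(x):.2f}")
```

Each printed y differs from f(x) by a random amount, so two identical inputs would generally produce different outputs, as in the credit example.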
Questions?

Session #1
Instructors:
Drs. Rafik Borji, Egidio (Ed) Marotta
Downloading Programming Languages
Python
R-Studio

Downloading Python from Anaconda Site

https://www.continuum.io/downloads

Open Source and Free Language


Downloading Python from Anaconda Site

Recommended

https://www.continuum.io/downloads
Downloading Python from Anaconda Site

[Figure: Python icon]
Python GUI for Spyder from Anaconda

[Figure: Spyder window (MatLab View) with labeled panes: Editor, File Explorer, Variable Output, Console.]
Python History

Python is an interpreted, object-oriented, high-level programming
language with dynamic semantics.
It was initially developed in the early 1990s by Guido van Rossum
and is now controlled by the not-for-profit Python Software
Foundation, sponsored by (among others) Microsoft and Google.
Python was named for the BBC comedy series Monty Python's Flying Circus.
Python 2.0 was released on October 16, 2000.
Python 3.0, a major, backwards-incompatible release, was released
on December 3, 2008.
It started as a successor to the ABC programming language.
For aspiring Data Scientists, Python is probably the most important
language to learn because of its rich ecosystem.

https://docs.python.org/3/tutorial/introduction.html

Introduction to R and R Studio

Lecture 1

Contents
Getting started with R
Installing R
Installing RStudio
Using RStudio
Getting Started with R
History
R is a dialect of the S language
1976: S started as a statistical analysis environment implemented as
FORTRAN libraries (John Chambers)
1988: Rewritten in C language
1998: Version 4 of S language released: this is what is used today
1991: R language created in New Zealand
1993: 1st announcement of R to the public
1995: R released under the GNU General Public License, making it free
software
1996: R-help and R-devel created
1997: R core group formed to control the source code for R
2000: R 1.0.0 released
2013: R 3.0.2 released in December
Present: R 3.4.1 version
Getting Started with R
Features
Syntax similar to S
Very active development
Runs on almost any standard computing platform
Lean (functionality divided into modular packages)
Sophisticated graphics capabilities compared to statistical packages
Powerful for developing new tools
Very active user community
It is free software (4 aspects):
1. Run the program
2. Study how the program works
3. Redistribute copies
4. Improve the program and release to public
www.fsf.org
Getting Started with R
R system
The R system is divided into two parts:
1. The base R system: downloaded from CRAN
2. Everything else
R functionality is divided into many packages:
1. The base R packages contained in base R, required to run R
2. The other packages included in base R (stats, datasets, graphics,
etc.)
3. Recommended packages (class, cluster, lattice, etc.)
4. Other packages from CRAN (10,000+)
5. Packages available on personal websites
R resources (www.cran.r-project.org):
1. R installation
2. An introduction to R
3. Writing R extensions
4. R data imports/exports
5. R internals

Installing R
Downloading-installing R version
https://cran.r-project.org/bin/windows/base/
CRAN: Comprehensive R Archive Network
Download available version of R
Run installation file

Installing R
R environment
install.packages()
library()
search()
help.start()
help(NameofFunction) or ?NameofFunction
Variable assignment
Calling functions (c() for concatenation)
Comments (#)
print()
ls()
rm()

Installing R Studio
R Studio
Most popular R code editor.
www.rstudio.com
https://www.rstudio.com/products/rstudio/download/
Control Panel\User Accounts\User Accounts to change/add environment
variables

Installing R Studio
R Studio Console

Installing R Studio
Hype Cycle for Data Science

Installing R Studio
Hype Cycle
