
Jelena Nadj - Data Scientist

Machine Learning 101: Concepts, Theory and Application

Talking Notes:
○ My aim is to provide a general overview of machine learning
(ML) and to present and explain some of the major concepts
related to ML by emulating the process of building the ML
model.
A short history and current status of ML
Defining terms related to ML
Classification of ML algorithms
Building an ML model

Talking Notes:
○ So, the whole presentation is conceptualized in the following
way
○ I will start by presenting the origins and short history of ML,
the current status, and some examples ranging from everyday use
to high-level applications.
○ Later we define the landscape of different terms and
terminologies used in the context of ML and in relation to ML
and artificial intelligence (AI).
○ This will be followed by a general classification of ML
algorithms.
○ In the final part I will explain the main concepts and terms
related to ML algorithms by emulating the standard path of
building the ML model and also present some of the major ML
algorithms.
○ So, we will start with the first segment of the presentation and
this is the summary of ML history and current status.
Origins of Machine Learning and Artificial Intelligence

“Give machines ability to learn without explicitly programming
them.” - Arthur Samuel (IBM), 1955

“Incomplete History of Machine Learning” by Robert Colner
“A History of Machine Learning” by Google

Talking Notes:
○ Origins of ML and AI go back to the 1950s and coincide with the
invention of the first general-purpose digital computer, called
ENIAC.
○ Many say that real applied AI actually began with Arthur
Samuel’s program that was able to learn the game of checkers,
and this machine/computer improved its play by playing against
humans.
○ Another major event in the late 1950s was the invention of the
perceptron by Frank Rosenblatt, which was basically the first
artificial neural network ever designed.
○ These two events can be considered among the major milestones
in the history of ML, although some argue there are others.
○ Since these two milestones, ML and AI went through several ups
and downs, but I will not go into these.
○ However, for those interested in the history of ML and AI, the
“Incomplete History of Machine Learning” by Robert Colner and
the interactive visualization of ML history developed by Google
are really nice resources for those who would like to dig deeper.
○ In the following slides we will focus on the current status and
recent developments in ML and AI.
○ Just to underline: in the next couple of slides we will use the ML
and AI terms in the same context and almost interchangeably, but we
will make a distinction between the two later on, when we
explain the difference between them and other related terms.
○ Time: 01:25
Rapid development of ML and AI in the last decade

● Large amounts of data
● Improvement of ML algorithms
● Increased computing power

Talking Notes:
○ Today you will hear people talking about ML and AI almost
everywhere. These two are major buzzwords.
○ So, the reasonable question one would ask is: “What are the
reasons and the major breakthroughs in the last 10 to 20 years
that led to this rapid development of ML and consequently AI?”
○ Among many, three factors or reasons can be considered the
main ones:
○ PROMPT QUESTION: Anyone would like to guess :)
○ So, yes the three main reasons would be:
■ Production and access to large amounts of data:
■ Improvements of ML algorithms
■ Increased computing power
○ We will talk about each of these three factors a bit more.
○ Time: 01:05 talk + prompt question: answers
Large amounts of data

● IDC’s Digital Universe Study
● “50x growth from the beginning of 2010 to the end of 2020.”

Talking Notes:
○ Data production exploded over the last ten or twenty years
mainly due to data collected from various sources such as
smartphones, industrial equipment, digital photos, social media
and these are just some of the main sources of unparalleled
data production.
○ And just to illustrate, here is the well-known plot from IDC’s
Digital Universe Study, which forecasted a 50x growth from
the beginning of 2010 to the end of 2020.
○ This trend will continue, if not accelerate, especially
considering emerging technologies such as IoT and autonomous
vehicles that will increase data production even more.
○ We will come back later on and talk shortly about “big data”
and how it is influencing and will influence the field of ML and AI
in the coming years.
○ Time: 00:52
Improvement of ML algorithms
● In 2006, Geoffrey Hinton rebranded neural net research as
“deep learning”.
● In 2009, BellKor's Pragmatic Chaos nets the $1M Netflix prize.
● In 2012, Google Brain detects human faces in images.
● In 2015, a computer wins at the world's hardest board game.

Talking Notes:
○ This is an indirect illustration of what deep learning (read:
neural network) algorithms are able to learn. Here is also a link
to the GIF [link].
○ About the second reason: ML algorithms have actually been
around for quite some time. As we said, neural networks (NN)
have been around since the 1950s, but NN were generally
dismissed and also required large datasets to provide good
and reliable results.
○ However, in 2006 Geoffrey Hinton rebranded neural
networks as deep learning and showed several examples in
which NN beat traditional AI approaches such as speech and image
recognition.
○ This brought huge attention to the field from giants such as
Google, Microsoft and Facebook, which started to invest
heavily in the development of ML, and this gave ML a chance to
shine.
○ This was followed by other breakthroughs, recently
culminating in a computer beating the human world
champion in the game of Go.
○ In parallel, the fast and large-scale production of data has also
been one of the main driving forces for the development of new
algorithms, basically to meet the demand for analysis of all this
data. So, these two were interconnected in a way.
○ Actually, in the background you see the branching of all
possible options over three sequential moves in the game of
Go, showing the huge complexity of possible moves. A computer
was able to learn and perfect the game using deep learning to
such a level that it beat the world champion.
○ Time: 01:04
Increased computing power

Talking Notes:
○ Finally, the last but not least reason was the continuous increase in
computing power. Everybody knows about Moore’s law and the
doubling of circuit capability, which provided an opportunity for
practical use of machine learning techniques.
○ In addition to this, the use of GPUs proved to be very effective when
applied to the types of calculation needed for deep learning
algorithms, with speedups of 10x compared to traditional
central processing units.
○ There is also quantum computing, which is still in its infancy,
but it will be a new disruptive technology in general and
specifically in the fields of ML and AI.
○ Time: 00:40
Where will ML and AI be in the near future?

Well, everywhere!

Talking Notes:
○ OK, I assume most of the people are aware that ML and AI are
game changers. To underline this even more I will show what
we can expect in the coming years.
○ Well, it appears we will see them everywhere. The Gartner Hype
Cycle curve for 2017 suggests that one third of current and
emerging technologies, those labeled in yellow, are ML/AI
based or heavily dependent on ML and AI.
○ Actually Gartner calls this whole new trend “AI Everywhere!”
○ Time: 00:32
AlphaGo vs. Lee Sedol / AlphaGo Zero vs. AlphaGo

The game of Go has 10¹⁷⁰ possible board positions, while there are
only 10⁸⁰ atoms in the universe.

Talking Notes:
○ Now let’s see some of the popular high-level examples of ML and
AI.
○ The one that we already mentioned is the development of AlphaGo
by DeepMind, which beat the world champion Lee Sedol. Look at
him. Totally confused.
○ In addition to this, DeepMind developed AlphaGo Zero, an
updated version of the AlphaGo AI program that learned the game in
three days and without any human help except being told the rules.
○ Just to remind you, Go is an ancient Chinese war strategy game with
10¹⁷⁰ possible board positions, while there are only 10⁸⁰ atoms
in the universe.
○ The future application of AlphaGo Zero is in the research of
protein folding, a huge scientific challenge that could help in
improving drug discovery for various diseases and disorders.
○ Time: 00:50
Google Sunroof

Talking Notes:
○ Another interesting example is Google Sunroof which uses deep
machine learning for estimation of the solar savings potential of
your home.
○ Time: 00:10
Tesla Self-Driving Cars

Talking Notes:
○ Another one, probably already a bit of an old one, is the use of deep
learning for self-driving cars. Obviously self-driving cars would not
be possible without ML.
○ Time: 00:10
Adapting cloud services to meet ML demands

Talking Notes:
○ Also, another interesting new trend which indirectly tells how
important ML and AI are becoming is related to big cloud
providers.
○ All major cloud providers, from Amazon and Google to IBM, are
changing and adapting their cloud services to meet the demands of
AI and ML.
○ Time: 00:20
Everyday use of ML inside enterprises

Talking Notes:
○ In addition to these sorts of high-end applications of ML there are
also everyday applications of ML, such as:
○ Improvement and use of generated content, as Yelp and
Pinterest are doing.
○ Lead prediction and scoring, as in the case of Salesforce’s
Einstein.
○ Detection of money laundering and fraudulent
transactions, as PayPal and many banks are actually doing.
○ Providing suggestions for treatment of certain types of cancers
for which IBM Watson Health is used.
○ These are just some of the major ones and with these examples I
will finish the first part of presentation.
○ Time: 00:45
A short history and current status of ML
Defining terms related to ML
Classification of ML algorithms
Building an ML model

Talking Notes:
○ Now I will try to define some of the main terms related to ML
and AI and explain the differences between those.
○ Time: 00:10
Putting AI, ML and DL in the chronological context

[Timeline figure: early AI stirs excitement (1950s); machine learning
begins to flourish (1980s); deep learning breakthroughs drive the AI
boom (2010s).]
Talking Notes:
○ Let’s first start with the three main terms AI, ML and DL and I
believe the easiest way to explain these three is to put them
in a chronological context.
○ As we mentioned at the beginning of this presentation, the
origins of AI and ML go back to the 1950s, when the first attempts to
create “intelligent machines” were made.
○ But these early attempts were related to traditional
deterministic approaches or rule-based approaches of AI.
○ The rapid development in ML started in the 1980’s which
consequently increased the development of AI.
○ However, the real and extremely fast development of AI
came from breakthroughs in Deep Learning introduced by
Geoffrey Hinton.
○ PROMPT: Before continuing, how would you distinguish
between AI, ML and DL? Anyone?
○ Let’s start with AI.
○ In simple terms it could be explained as “human intelligence
exhibited by machines” [link]
○ On the other hand, ML can be defined as the practice of using
algorithms to parse data, learn from it, and then make a
determination or prediction about something in the world.
○ So, ML can be considered a tool for achieving AI, but just to
underline, I am not saying that AI cannot be achieved
without ML. ML simply facilitates and speeds up the process of
obtaining AI.
○ Finally, DL can be considered as a subdomain of ML which is able
to process natural data in their raw form such as for example
images, video and audio and learn patterns from that type of data.
○ Time: 01:30
○ “AI enables an autonomous agent (such as, for example, a
robot or a car) to execute and recommend actions based on the
knowledge it gained through machine learning.”
○ And here we come to the point where we mention ML.
AI was boosted by development of ML

Talking Notes:
○ Rapid development of AI was boosted by ML more specifically
DL.
○ To illustrate this, let’s remember DeepBlue, IBM’s chess-playing
computer that won against Garry Kasparov.
○ At that time DeepBlue was heavily dependent on deterministic
procedural programming, as most AI was prior to the
increased use of ML and DL.
○ In the last ten years huge improvements have been made in AI,
mainly due to the introduction of ML and, more specifically, DL
methods.
○ Another illustration of how far DL-based AI has come is the OpenAI
bot that won at Dota, a popular video game, against a
well-known professional Dota player [link].
○ Time: 01:10
Artificial Intelligence
Narrow AI General AI

Talking Notes:
○ Still a bit more about AI.
○ Generally AI is divided into two groups: narrow AI and general AI.
○ Narrow AI, or weak AI, uses data and ML algorithms to train,
learn and perform specific tasks as well as or better than humans
can.
○ Examples would be AlphaGo and AlphaGo Zero, or the AI
applied in the case of self-driving cars.
○ On the other side is general AI, also called strong AI, which
corresponds to machines that would have all our senses and
reasoning and perform different tasks just like we humans do, or
even better.
○ Examples of those correspond to science-fiction characters
such as the Terminator or the humanoids from the Westworld TV
show, and it seems that this is something we are not so far away
from.
○ Many predict that we will have general AI within next 40 years,
but this projection has been repeated for quite some time now.
○ Time: 00:50
MACHINE LEARNING

[Diagram: Input (data, information + answers) → ML algorithms +
techniques → Output (optimum model: relationships, patterns,
dependencies, hidden structures)]

Talking Notes:
○ Coming back to the heart of today’s AI which many will
agree is ML.
○ Now it is easier to distinguish the difference between AI
and ML.
○ As I said ML is focused around how to learn and gain
knowledge from data. It basically uses algorithms to
parse data and learn from it and then make predictions
about something in the world.
○ ML is the opposite of conventional deterministic
procedural programming, where a specific set of
instructions is written to achieve a specific task.
○ This deterministic procedural programming was
used for AI until ML came into play.
○ The main elements or ingredients of ML
are input data, ML algorithms and sufficient computing
power to learn from the data.
○ So, the formula is simple, right :) Have enough data, take
an ML algorithm, train and evaluate your model and voila,
that’s it. (A minimal code sketch of this recipe follows these notes.)
○ However, this approach has some limitations, and one of
those is how to extract information from data in a form that is
easy for ML algorithms to digest.
○ And this is where DL comes into the play.
○ Time: 01:20
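To make the “enough data + ML algorithm + train + evaluate” recipe above concrete, here is a minimal Python/scikit-learn sketch. Synthetic data stands in for a real dataset such as the bank telemarketing data used later; it is an illustration of the recipe, not the presenter's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real, already-preprocessed dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)        # take an ML algorithm
model.fit(X_train, y_train)                      # train
print("accuracy:", model.score(X_test, y_test))  # evaluate
```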
WHAT IS DATA SCIENCE
[Figure: a data science landscape / analytics maturity diagram, spanning
descriptive, diagnostic and predictive analytics up to semantic data
science (advanced simulations and optimization, understanding patterns
and social context), big data sources (text, network, geospatial,
social media, sentiment and image analytics), data quality, data
visualization, business intelligence and business value, with roles
ranging from programmer and statistician to business analyst and data
scientist.]

Talking Notes:
○ Another term often used around ML and in combination with ML,
and quite often mixed up with it, is data science.
○ In simple terms, data science is an interdisciplinary field,
and there is a wide range of definitions and colliding
opinions on what exactly data science is.
○ But most of those definitions consider ML a subfield of DS.
○ DS means much more, including domains and activities
such as business intelligence, data exploration and
visualization and domains such as data integration and
engineering.
○ Time: 01:10
BIG DATA
• Processing performance
• Dirty and noisy data
• Data locality
• Real-time processing / streaming
• Non-linearity
• Feature engineering
• Curse of modularity
• Curse of dimensionality
• Bonferroni’s principle
• Data heterogeneity

Talking Notes:
○ Finally, let’s just briefly explain how big data, another
important term or domain, is impacting and will impact ML and AI.
○ Big data is defined as high-volume, high-velocity
and/or high-variety information.
○ Big data brings some major challenges to ML / AI, and
here are some of those. We don’t have enough time to
discuss all these challenges, but I recommend a
summary paper that discusses the challenges that big data
brings in the context of ML and AI.
○ Just to name one, the “curse of modularity”: probably
many of you have heard of the “curse of dimensionality”, but
the “curse of modularity” is related to the fact that many
learning algorithms rely on the assumption that the data
being processed can be held entirely in memory or in a
single file on a disk, which increasingly will not be the
case.
○ Time: 01:30 + Prompt Questions
A short history and current status of ML
Defining terms related to ML
Classification of ML algorithms
Building an ML model

Talking Notes:
○ Now I will shortly present the classification of ML algorithms
which will also be a foundation on how to select the best ML
algorithm for your dataset presented later on.
○ Time:
Brief Introduction to Types of ML

● Talking Notes:
○ The following classification is generally well known and we will
just quickly go through:
○ Many will say that the main groups of ML are:
■ Unsupervised ML
■ Supervised ML
■ Reinforcement Learning
■ Semi-supervised ML
■ Anomaly Detection
○ These main groups can be further divided into additional
sub-types of machine learning.
○ Time: 00:50
A short history and current status of ML
Defining terms related to ML
Classification of ML algorithms
Building an ML model

Talking Notes:
○ Now we come to the part where I will try to emulate the process
of building the ML model and while doing this explain the main
concepts and terms in relation to this.
○ Time: 00:15
Example Dataset

Talking Notes:
○ And for doing this I will use the dataset that will also be used for
the practical part of presentation.
○ By using examples from the data it is much easier to explain the
major concepts and terms related to the process of building ML
model.
○ The data we will use was collected during a bank telemarketing
campaign.
○ The goal is to assess whether the client would subscribe to a bank
term deposit (‘yes’) or not (‘no’).
○ The dataset is composed of three sets of features/variables:
■ Demographic data such as age, education, employment
■ Data related to the last contact of the current
campaign
■ Social and economic data related to the general
economic situation.
○ Time: 00:50
Telling the story of building a ML model

Define Problem → Data Preparation → ML Algorithm Selection →
ML Model Evaluation → Fine-tuning of ML Model → Presenting the Results

Talking Notes:
○ So, this process of building an ML model can be divided into
several steps:
■ Defining the problem that ML model aims to solve and
output results
■ Data preparation which includes data selection,
preprocessing and data transformation or often called
feature engineering.
■ Selection of ML algorithms
■ Evaluation of the ML model
■ Fine-tuning the ML algorithm
■ Presenting the results
■ Time: 00:35
Define the problem you want to solve

Talking Notes:
○ Usually, it is assumed that the problem or question the ML
model tries to solve or answer is well defined and clear.
○ However, often this is not the case and therefore spending the
time to define the problem you want to answer is usually of high
importance.
○ In doing this one generally has to answer the following three
questions:
■ What is the problem or question you want to solve or
answer using your ML model?
■ Why does the problem need to be solved?
■ How would I solve the problem?
■ Time: 00:35
Data Exploration

● Exploring data
○ Getting familiar with your data
○ What types of features are available?
○ Performing simple descriptive analysis
○ Visualizing relationship patterns between features

Talking Notes:
○ Once we have selected our data sources and the data we think
might be of high importance, the next step is to perform exploratory
data analysis and, in parallel, data preprocessing.
○ We summarize our data and check what types of features we
have: categorical features, which refers to so-called bucketable
features such as, in our case, “Does the client have a
housing loan”, which has two classes (yes or no), or “What type of
job”, with multiple classes.
○ Or we have numeric features such as age.
○ Getting familiar with your dataset and the types of features it
contains influences many downstream actions, such as selection of
the most appropriate ML algorithm.
○ Time: 01:00
Data Exploration

Talking Notes:
○ One of the crucial steps in exploratory data analysis is to visualize
your data.
○ These are some of the most common types of plots, among many,
used to visualize data.
○ The goal is to visualize relationship patterns within your data
set.
○ Here I plotted the numeric variables from our dataset and, as
you can see, we have several different variables, taking into
account our label/target variable, which is colored in red or
green.
○ Also, for example, we see that the socio-economic features are
highly correlated and strongly influence the label/target
outcome.
○ Time: 01:25
Data Exploration

● Missing values: Remove or impute?


○ Domain knowledge
○ Trade-off
■ More data, more noise
■ Less data, less noise
○ Imputing missing values
■ Mean or median
■ Building a model to predict missing
values

Talking Notes:
○ Following up on what has been said about noisy data, another
issue related to data preprocessing is how to handle
missing values.
○ And the main question here is whether we should remove or
impute missing values.
○ This is a balance between two sides: imputation means that we
will have more data that is potentially noisy, while removing missing
data leads us to less data and consequently less noise.
○ It depends on the context, but usually the answer is to go with
imputation.
○ Imputing missing values is a whole domain in itself and there
are many different ways of doing it. We will mention some
generally used ones, such as replacing NAs with the mean or median,
or building a model to predict missing values (a small code sketch
follows these notes).
○ Time: 01:30
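Here is a small Python/scikit-learn sketch of the two imputation strategies mentioned above. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with missing values
df = pd.DataFrame({"age": [25, 31, np.nan, 45],
                   "duration": [120, np.nan, 300, 90]})

# 1) Replace NAs with a simple statistic such as the median
median_imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# 2) Build a model that predicts each missing value from the other features
model_imputer = IterativeImputer(random_state=0)
df_model = pd.DataFrame(model_imputer.fit_transform(df), columns=df.columns)
print(df_median, df_model, sep="\n")
```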

COMING UP WITH FEATURES IS
DIFFICULT, TIME-CONSUMING,
REQUIRES EXPERT KNOWLEDGE.
'APPLIED MACHINE LEARNING' IS
BASICALLY FEATURE
ENGINEERING.
— ANDREW NG
MACHINE LEARNING AND AI VIA BRAIN SIMULATIONS
Feature Engineering: feature extraction, feature importance,
feature selection, feature construction

● Talking Notes:
○ As one would expect, feature engineering encompasses many
aspects and activities and can generally be divided into several
steps and activities which are more or less overlapping, and those
would be:
○ Feature extraction, feature importance, feature selection and
feature construction.
○ We will discuss each of these in more detail by using our
dataset.
○ It is important to underline that data exploration and
preprocessing, which we shortly covered, overlap with
some of these steps of feature engineering and in practice
often go in parallel or interchangeably.
○ This furthermore depends on the type of data we have.
○ Time: 00:35
Feature Extraction

● Image > Extracted features: colors, contours, textures
● Signals / Sound > Extracted features: frequency, phase, spectrum,
samples
● Text > Extracted features: words, POS tags, grammatical
dependencies

Talking Notes:
○ Feature extraction is probably the most important part of
feature engineering because all following steps depend on it.
○ Feature extraction corresponds to automatic construction of
new features from raw data.
○ This step is crucial especially with input data such as videos,
images, audios and other different types of signal data.
○ So, in case of images we extract pixels and consequently the
aim is to obtain information about contours, texture and colors.
○ In the case of signals, the frequency patterns, spectrum and
samples are attributes that will most probably contain
information valuable for our ML algorithm.
○ And, for example, in the case of large quantities of text we extract
word frequencies, POS tags and grammatical dependencies (a small
text example follows these notes).
○ It is a very important aspect in the case of conventional ML, and
less so when you apply DL, which can handle complex raw data.
○ Time: 01:15
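As a small illustration of extracting features from raw text, here is a scikit-learn sketch that turns two toy sentences into word-count features; the documents are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the client accepted the offer",
        "the client declined the offer"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the extracted word features
print(X.toarray())                          # word counts per document
```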
Feature Importance

● Talking Notes:
○ Another key aspect in feature engineering is evaluating the
feature importance.
○ Some ML algorithms are performing feature importance
estimation internally such as Random Forest and Gradient
Boosting Machines.
○ There are many different importance measures that can be used,
from permutation importance and Gini importance to partial
correlation.
○ Here is the feature importance plot obtained after we fit a
Random Forest model on our bank marketing data (a small code
sketch follows these notes).
○ I plotted the feature importance cumulatively, starting
with the most important features and ending with the least important.
○ We see that the importance does not reach a plateau very fast,
meaning that there are many features of similar effect.
○ We see that the duration of the call is among the most important
features for the outcome of the bank telemarketing campaign, along
with the euribor 3-month rate and age.
○ There are of course other ways for determining the feature
○ importance like partial derivatives.
○ Time: 01:30
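Here is a small self-contained sketch of reading feature importances from a fitted Random Forest. It uses synthetic data and invented feature names rather than the actual bank marketing dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared bank marketing features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Internal (impurity-based) importances, sorted from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```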
Feature Selection
Subset and stepwise selection
● Best subset selection
  ○ Evaluate all combinations of features
  ○ Computationally demanding
● Stepwise selection
  ○ Backward selection: start with all features, subtractively remove
    some and evaluate
  ○ Forward selection: start with no features, incrementally add new
    ones and evaluate

Shrinkage methods for feature selection
● Least Absolute Shrinkage and Selection Operator (LASSO)
  ○ LASSO is a method that involves penalizing the absolute size of
    the regression coefficients in your model.
  ○ Penalization refers to a constraint on the sum of the absolute
    values of the estimates.
  ○ The final output is that some of the parameter estimates may be
    exactly zero; these features can be removed from the model.

Talking Notes:
○ Feature selection corresponds to the process of selecting the
few most important/informative features from a large set of raw
features.
○ There are two main groups of approaches used for feature
selection.
○ Subset and stepwise selection approaches:
■ For example, the best subset selection approach, where
we fit a model with every possible combination of features and
select the one with the best performance. This is of
course the most thorough approach, but it is demanding from a
computational point of view.
■ Stepwise selection, with two variants: forward selection,
where we incrementally add features to the model and evaluate
performance, and backward selection, where we start with all
features and remove them one by one.
○ The other group is regularization methods like the Least Absolute
Shrinkage and Selection Operator (LASSO):
■ LASSO penalizes coefficients that would potentially lead to
overfitting, and indirectly it can be used for feature
selection (a small code sketch follows these notes).
■ Time: 02:10
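A sketch of using LASSO as an indirect feature selector: coefficients shrunk exactly to zero mark features that can be dropped. The data is synthetic; with the real bank data the features would first have to be encoded and scaled.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # LASSO is sensitive to feature scale

lasso = LassoCV(cv=5, random_state=0).fit(X, y)   # penalty chosen by cross-validation
kept = np.flatnonzero(lasso.coef_ != 0)           # features with non-zero coefficients
print("features kept by LASSO:", kept)
```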
Feature Construction

● Manual construction of new features from raw data
  ○ Combining and/or splitting currently available features
● Domain knowledge
● Time consuming
● Additional approaches
  ○ Feature standardization
  ○ Feature normalization

Talking Notes:
○ The final step we will talk about is feature construction where
we use currently available features to create new ones or we
edit current ones.
○ This step often requires domain knowledge, which can provide
insights on how to construct new features, such as how to combine
and/or split specific features.
○ Feature construction requires a lot of time, as it is an iterative
process.
○ Feature construction also includes standardization, scaling and
normalization.
○ Time: 00:40
Feature Construction

Talking Notes:
○ Going back to our example of the bank marketing data, here is the
raw dataset before any feature engineering and after.
○ As we can see, we scaled the variable age and we also created
dummy variables from marital status, as it has several classes.
We did the same thing for type of job, and in the case of duration we
performed standardization (a small code sketch of these
transformations follows these notes).
○ What else can be done is to combine features based on our
insights from exploratory data analysis, for example combining
all socio-economic features, such as the euribor 3-month rate and
the employment index rate, which were shown to be highly correlated.
○ With this we finish the feature engineering part, and we are ready
and sure that we have a dataset that is ready to be fed into
an ML algorithm.
○ Time: 01:20
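A small sketch of the transformations described above with pandas and scikit-learn. The column names mimic the bank marketing data, but the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":      [30, 45, 52],
    "marital":  ["married", "single", "divorced"],
    "duration": [120, 300, 90],
})

# One dummy (0/1) column per class of the categorical feature
df = pd.get_dummies(df, columns=["marital"])

# Standardize the numeric features to zero mean and unit variance
df[["age", "duration"]] = StandardScaler().fit_transform(df[["age", "duration"]])
print(df)
```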
Choosing the right ML algorithm for your problem

● Accuracy
● Training time
● Data complexity
● Number of parameters
● Number of features
● Interpretation
● Prediction speed
● Need for incremental training

Talking Notes:
○ So, we come to the point to select a ML algorithm to build our
model and the next big problem comes in. How to select the
best ML algorithm. Well, there is no clear answer for this. One
would say try all of them, but this is never feasible in practice.
○ However, there are some major criteria that always can be
considered as important when selecting the ML algorithm.
○ Selecting the most accurate algorithm is not always possible,
but it is of high importance.
○ Considering how much time the ML algorithm needs to train
is also of high importance.
○ How complex is your data? For example, logistic regression cannot
work on highly complex data with a non-linear boundary.
○ The number of parameters of an ML algorithm affects the time needed
for fine-tuning the model and improving performance.
○ On the other side, some algorithms do not perform well with a large
number of features.
○ Recently, interpretation has started to play an important role in
some sectors such as finance and health.
○ Prediction speed is another criterion, as well as the need for
incremental training.
○ So these are some of the major constraints or criteria for
selecting the best machine learning algorithm.
○ Time: 02:30
Selecting the most appropriate ML algorithm

● Talking Notes:
○ To put this into perspective, I will use this cheat sheet from the
official SAS blog to help us go through the process of selecting the
best ML algorithm for our problem.
○ I will use our bank telemarketing data to illustrate this process.
○ We usually start by defining what kind of ML task we have. In our
case it is a supervised machine learning task, as we have a target
variable to predict, which, as you remember, is the outcome of the
telemarketing campaign.
○ So, we go with supervised machine learning. The next step is to
decide what type of supervised ML we have, and in our case it is a
classification problem.
○ Do we care about accuracy or speed? Well, in our case it is accuracy,
as we do not have a live system where it is essential to obtain
predictions fast. According to this we end up with SVM, ensemble
methods and neural networks.
○ You can find many similar cheat sheets and can play around with
some problem you have to select the best ML algorithm for your
problem.
○ Time: 02:00
Unsupervised Learning
K-means Clustering
Advantages
● Fast, simple and quite flexible if data is properly processed
(scaled data)
● Works very well if data has globular clustering pattern

Disadvantages
● Number of clusters needs to be defined which sometimes is
not a straightforward thing to do.
● Scaling and preprocessing data
● Sensitive to irregular clustering patterns and large number
of features.

Talking Notes:
○ We will start with K-means clustering which is considered a
simple general purpose algorithm used for data clustering.
○ It creates clusters based on geometric distances between data
points.
○ K-means is fast, simple and quite flexible if the data is properly
processed (scaled). It works very well if the data has a globular
clustering pattern, as we can see in this illustration.
○ The main disadvantage is that it requires the user to define the
number of clusters (see the sketch after these notes).
○ The data also must be scaled, and it does not work with categorical
data.
○ And finally, it does not work well with data that has non-globular
clustering patterns.
○ Time: 01:20
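A minimal K-means sketch on synthetic globular data; note that the number of clusters has to be given up front and the features are scaled first, as discussed above.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data with three globular clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])       # cluster assignment of the first ten points
print(kmeans.cluster_centers_)   # the three cluster centroids
```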
Hierarchical Clustering
Advantages
● Not necessary for data to have globular clustering patterns
● Works well with large datasets

Disadvantages
● User must select cutting threshold at what dendrogram tree
should be cut deciding the final number of clusters.

Talking Notes:
○ Another unsupervised approach is hierarchical clustering.
○ Hierarchical clustering can be done using two major types of
algorithms: agglomerative, which works in a bottom-up manner, and
divisive, which works in a top-down manner.
○ In the case of agglomerative clustering, each data point starts as
its own cluster. In the next step the two nearest points (by some
distance measure) are merged into a new cluster, and this is repeated
until all points form one cluster. The end product is a dendrogram,
or a division tree.
○ In the case of divisive clustering the procedure is the opposite. In
the initial step all data points are considered one cluster. In the
next step the cluster is divided in two using some splitting
criterion.
○ The main advantage of hierarchical clustering is that it does not
require the data to have globular clustering patterns.
○ And the main disadvantage is that, similarly to K-means
clustering, the cutting threshold must be defined by the user (see
the sketch after these notes).
○ Time: 01:20
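A short sketch of agglomerative clustering with SciPy; the cut of the dendrogram (here into three clusters) still has to be chosen by the user, as noted above. The data is synthetic.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

Z = linkage(X, method="ward")                    # build the merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels[:10])
```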
DBSCAN
Advantages
● No need for data to have globular pattern
● Scalable
● Not all points have to be assigned to a cluster which can
reduce noise

Disadvantages
● Two parameters to tune, which is not a straightforward job
  ○ Epsilon (the distance from a data point within which the
    count of points is done)
  ○ Minimum number of points within range of that distance
● The two parameters define the density of clusters and
DBSCAN is quite sensitive to these hyperparameters.

Talking Notes:
○ The final unsupervised ML algorithm we will present here is
DBSCAN which stands for Density-based spatial clustering of
applications with noise
○ DBSCAN is a density-based algorithm, which means that it takes
into account the number of data points within some predefined
distance, and this is used as the rule for assigning any new
point to a cluster.
○ Unlike K-means, the user does not have to predefine the number of
clusters, but the two density parameters have to be set.
○ The main advantage is that it works very well on data with highly
irregular, non-globular patterns, as our visualizations illustrate.
○ The main disadvantage is having to tune the two main parameters,
the distance and the minimal number of points (see the sketch after
these notes).
○ Time: 01:10
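A minimal DBSCAN sketch on a synthetic non-globular pattern. The eps and min_samples values below are the two density parameters discussed above, and points labelled -1 are treated as noise.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: a non-globular clustering pattern
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks points treated as noise
```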
Supervised Learning

Logistic Regression
Advantages
● Outputs have a nice probabilistic interpretation
● It can be regularized to avoid overfitting
● Logistic regression is fast to train and predict.
● Explainable results

Disadvantages
● It does not work well with multiple and/or non-linear
decision boundaries
● Not flexible enough to capture more complex relationships.

Talking Notes:
○ I will start with logistic regression (LR), which is the
classification counterpart to linear regression. Logistic regression
is used for binomial classification problems, where predicted values
are mapped between 0 and 1.
○ The advantages of LR are that it provides a nice probabilistic
output and it can be regularized to avoid overfitting.
○ Moreover, it is fast to train and predict and it provides
explainable results.
○ The downside of LR is that it does not work well with multiple
and/or non-linear decision boundaries, meaning it cannot capture more
complex relationships.
○ Time: 00:55
Decision Trees
Advantages
● Able to learn non-linear relationships due to the hierarchical
structure.
● Quite robust to outliers
● Scalable to large datasets
● Fast to train
● Interpretable.

Disadvantages
● Individual trees prone to overfitting because they can keep
branching until they memorize the training data

Talking Notes:
○ Another very popular supervised ML is Decision Trees (DT).
○ Decision Trees learn in a hierarchical way by repeatedly splitting
your dataset into separate branches where each split maximizes
the information gain.
○ This branching structure allows the trees to naturally
learn non-linear relationships.
○ Also DT are robust to outliers, scalable to large datasets and
easy to interpret.
○ However, the main disadvantage is that DTs are prone to
overfitting because they keep branching until they memorize the
training data.
○ Time: 01:00.
Ensemble Methods

● Examples of ensemble methods:


○ Random Forests (RF)
○ Gradient Boosted Trees (GBM)
○ Other …
● Ensemble methods are considered as extension of DTs
where predictions from multiple DTs are combined
● Very successful among non-deep machine learning
algorithms

Talking Notes:
○ To solve this main disadvantage of DT we can go for some of
the ensemble methods such as random forest and gradient
boosted trees.
○ Ensemble methods are extensions of DTs where predictions
from multiple DTs are combined.
○ And they have proved to be very successful among non-deep
learning algorithms.
○ The main problem of ensemble methods is that they are
complex to set up and tune.
○ Time: 00:50
Support Vector Machines
Advantages
● Good choice for modeling non-linear decision boundaries
● Many different kernels to select from.
● Fairly robust against overfitting, especially in
high-dimensional space

Disadvantages
● Computationally demanding
● Hard to tune since there are different parameters.
● Hard to scale to large datasets.

Talking Notes:
○ Support vector machines for quite some time have been among
the most popular ML algorithms mainly because of their
flexibility to learn the non-linear patterns from target features.
○ They are also fairly robust against overfitting.
○ However, SVMs are computationally demanding and trickier to tune,
as you have to select the right kernel and tune the SVM parameters.
○ Also, they hardly scale to large datasets. Currently in the
industry, ensemble methods are usually preferred over SVMs.
○ Time: 00:50
Neural Networks
Advantages
● Deep learning is the current state-of-the-art for specific
fields like computer vision and speech recognition.
● Deep neural networks can be easily updated with new data
using batch propagation.
● Their architectures can be adjusted to many different types
of problems
● Hidden layers reduce the need for feature engineering

Disadvantages
● Requires large amounts of data.
● Computationally intensive to train
● Require much more knowledge/expertise and time to tune

Talking Notes:
○ Finally, we come to neural networks. Here we consider the
multi-layer neural networks (NN) which are part of so-called
deep learning domain.
○ Also called deep neural networks, they are able to model
intermediary representations of the data that other algorithms cannot
easily learn.
○ Deep learning is the current state-of-the-art for specific fields
like computer vision and speech recognition.
○ Also, deep neural networks perform very well on image, audio,
and text data, and they can be easily updated with new data
using batch propagation.
○ Their architectures can be adjusted to many different types of
problems, and their hidden layers reduce the need for feature
engineering.
○ The main disadvantage is that deep learning algorithms require
large amounts of data.
○ In fact, they are usually outperformed by tree ensembles on
classical machine learning problems. In addition, they are
computationally intensive to train, and require much more
knowledge/expertise and time to tune (i.e. to set the architecture
and hyperparameters).
○ Time: 01:30
Reinforcement Learning

Reinforcement learning

[Diagram: the reinforcement learning loop, in which an agent takes an
action in an environment and receives back a reward and a new state.]
Splitting data
Cross-validation
Performance metrics
Fine-tuning the model

Talking Notes:
○ The following set of slides will present some of the major
activities related to fitting the ML model.
○ We will shortly talk about splitting data: what is it and why is
splitting data important?
○ What is cross-validation and what cross-validation approaches
are there?
○ We will explain the main performance metrics for evaluating our ML
models and, finally, how to tune our models.
○ Time: 00:30
Splitting data

Talking Notes:
○ Splitting data before fitting your ML model makes your model
generalizable, which means that it can be successfully applied to
new incoming data, basically in a production environment.
○ Therefore, the dataset used for fitting an ML model is generally
divided into three sets: training data, validation data and test data.
○ The next question is what the perfect ratio is. There is no
clear-cut answer, as there are different problems that can occur when
splitting data.
○ If your test set is too small, you will have an unreliable
estimation of model performance (performance statistics will
have high variance). On the other side, if the training set is too
small you cannot train a model that will be generalizable.
○ The general rule of thumb used for this is an 80/20 train/test split
(see the sketch after these notes).
○ In the next step your train set can be split into train and
validation sets, or into partitions for cross-validation.
○ Time: 01:00
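A minimal sketch of the 80/20 rule of thumb with scikit-learn; synthetic data stands in for the real features and target.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in data

# 80% training / 20% test, keeping the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_test))
```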
Cross-validation

K-fold Cross-validation / Leave One Out Cross-validation

[Illustration: 5-fold cross-validation, where in each of five rounds a
different fold of the data is held out for validation and the rest is
used for training.]

Talking Notes:
○ The next concept we mentioned is CV. Cross-validation helps
you measure the performance of one model on different sets of
data.
○ It basically makes multiple random splits of training data and
evaluates the ML model.
○ The two most common CV approaches are K-fold
cross-validation and leave one out cross-validation (LOOCV).
○ K-fold cross-validation splits the data into k subsets, as
illustrated here, and the value of k depends on the size of the
dataset. After the data is split into k subsets, the model is trained
on k-1 folds while the remaining fold is used to evaluate the model.
○ In the case of leave-one-out cross-validation, the model is trained
on the whole dataset except one observation, and this is repeated
for every observation.
○ LOOCV is known to be computationally demanding, so 10-fold
cross-validation is generally used instead (see the sketch after
these notes).
○ Time: 01:10
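A short sketch of k-fold cross-validation with scikit-learn (5 folds, as on the slide); the data and the choice of logistic regression are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# One accuracy score per fold, plus their average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```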
Performance metrics

✓ Classification accuracy
✓ Logarithmic loss
✓ Confusion matrix
✓ Area under curve (AUC)
✓ F1 score
✓ Mean absolute error
✓ Mean squared error
Confusion Matrix / Precision vs. Recall
Precision = Positive Predictive Value = TP / (TP + FP)
Recall = True Positive Rate = TP / (TP + FN)

Talking Notes:
○ Evaluation metrics for classification ML models
■ Confusion matrix is of course not an evaluation metric, but a
cross-table which contains counts for predictions and actual
targets.
■ It provides the number of true positives, true negatives, false
negatives and false positives. These measures are further used
to calculate some more informative evaluation metrics.
■ An additional set of highly used evaluation metrics are precision,
recall and accuracy.
● Precision is basically the accuracy of the positive predictions,
i.e. the percentage of positive predictions that are correct.
More formally it looks like this:
○ Precision = True Positive / (True Positive + False
Positive). If precision is less than 100% it means
that some of our positive predictions were false.
● Recall
○ Recall = True Positive / (True Positive + False
Negative). If recall is less than 100% it means
that we missed some of the actual positive cases.
● Accuracy
○ Accuracy = (True Positive + True Negative) /
(Positive + Negative)
● Illustration of precision and accuracy [link].
■ Taking the above-mentioned metrics, a common visual tool
for evaluating ML model performance is the so-called
precision-recall curve, which, as its name tells, plots
precision vs. recall (a small code sketch follows these notes).
■ Time: 02:00
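A small sketch computing the metrics above from toy predictions with scikit-learn; the y_true / y_pred values are invented.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))               # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("accuracy: ", accuracy_score(y_true, y_pred))
```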
The bias-variance trade-off

Talking Notes:
○ One important thing. In order to correctly evaluate the machine
learning model and read and understand the evaluation metrics
it is crucial to understand the bias-variance trade-off.
○ The bias-variance trade-off is a crucial concept in building an ML
model, and it comes from the mean squared error (MSE), which is the
average squared deviation of predictions from the original values.
○ The MSE is generally composed of three quantities:
■ Variance of the predicted values (variance): variance
describes how much your prediction will change when using a
different training dataset. In practical terms, high
variance means that your model is almost
perfectly fitted to your training dataset, but it does not
generalize to future data. This is overfitting, which means
that your model does not generalize well.
■ Bias tells how well your model represents the
nature of the relationship between the input features and the
target feature you want to predict. For example, high bias means
that your ML model is not representing the true nature of the
relationship between the target feature and the input features.
This happens when you use a linear ML algorithm to
predict a non-linear relationship present in the data. High bias
means that your model is underfitting.
■ And there is the irreducible error (noise): noise is a broad
term with multiple explanations, but it generally refers to
information that is not inherently related to the relationship
between the target feature and the input data. For example,
measurement errors are considered noise.
■ The main goal is to minimize both terms, variance
and bias, in a balanced way (the decomposition below
summarizes this).
■ Time: 02:00
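For reference, the usual decomposition of the expected squared prediction error, written here in standard textbook notation rather than taken from the slides, is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{underfitting}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{overfitting}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```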
Fine-tuning the model

Talking Notes:
○ And for this we have two usually applied approaches and those
are: grid search approach and random search approach.
○ The grid search approach can be illustrated with an algorithm that
takes two parameters.
○ Let’s take an example of support vector machines with RBF
kernel which has C and gamma values as SVM parameters. We
will define range values for both parameters and design the
search grid matrix which will contain all combinations of values
for those two parameters.
○ Finally we will fit our model for all those combinations C and
gamma values, evaluate our model and pick the best one.
○ As you probably already guessed this can be time consuming if
we have many parameters and want to really fine-tune the
model.
○ The other approach is to use a random layout, or randomized
search, where we set up the same matrix as in the previous case,
but pick combinations randomly and assess accuracy for the
selections we made.
○ Randomized search is used in cases where we have many
parameters to set and not much time.
○ Once you have done a first pass of grid or random search, you can
use the best combination of values to zoom in and further improve
performance (see the sketch after these notes).
○ Time: 01:20
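A sketch of both tuning approaches with scikit-learn, using the SVM-with-RBF-kernel example (C and gamma) mentioned above; the data and the parameter ranges are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# Grid search: fit and evaluate every combination of C and gamma
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(grid.best_params_, grid.best_score_)

# Randomized search: sample a fixed number of combinations instead
rand = RandomizedSearchCV(SVC(kernel="rbf"), param_grid,
                          n_iter=8, cv=5, random_state=0).fit(X, y)
print(rand.best_params_, rand.best_score_)
```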
Talking Notes:
○ Before I finish I will wrap up with the following visualization
created by Brendan Tierney.
○ I selected this slide to illustrate how complex and extremely
interdisciplinary the field of DS and ML is.
○ This presentation tried to provide a general overview, scratching
the surface of all major topics related to ML and DS, and more
specifically the process of building an ML model.
○ My goal is to give you a framework that I hope will be useful as
a nice overview of the whole domain.
○ Time:
Questions?
Thank you!
