
Data-driven modelling in water-related problems.

PART 1
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water Education


Hydroinformatics Chair

Outline of the course

Notion of data-driven modelling (DDM)
Sources of technologies for DDM: machine learning, data mining, soft computing
Introduction to methods: decision, regression and model trees, Bayesian approaches, neural networks, fuzzy systems, chaotic systems
Demonstration of applications:
- reservoir optimization
- rainfall-runoff modelling
- real-time control of water levels in polder areas
- surge water level prediction in the North Sea
- interpretation of aerial photos

D.P. Solomatine. Data-driven modelling (part 1).

Why data-driven now?

- Measuring campaigns using automatic computerised equipment: a lot of data became available
- important breakthroughs in computational intelligence and machine learning methods
- penetration of computer sciences into civil engineering (e.g., hydroinformatics, geo-informatics, etc.)


Hydroinformatics system: typical architecture


User interface

Judgement engines
Decision support systems for management

Fact engines
Physically-based models Data-driven models

Data, information, knowledge

Knowledge inference engines


Knowledge-base systems

Communications

Real world

Modelling

A model is:
- a simplified description of reality
- an encapsulation of knowledge about a particular physical or social process in electronic form

Goals of modelling are:
- understand the studied system or domain (understand the past)
- predict the future: predict the future values of some of the system variables, based on the knowledge about other variables
- use the results of modelling for making decisions (change the future)


Classification of models

specific - general
model estimation - first principles
numerical - analytical
stochastic - deterministic
microscopic - macroscopic
discrete - continuous
qualitative - quantitative


Example of a simple data-driven model


Linear regression model: Y = a1 X + a2

- independent variable X (input) and dependent variable Y (output)
- linear regression roughly describes the observed relationship
- parameters a1 and a2 are unknown and are found by feeding the model with data and solving an optimization problem (training)
- the model then predicts the output for a new input without actual knowledge of what drives Y

[Figure: fitted regression line; for a new input value on the X axis, the model predicts a new output value close to the actual output value.]
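A minimal sketch of this training step, assuming hypothetical data points (the values below are illustrative only): a1 and a2 are found by least squares, and the fitted line then predicts the output for a new input.

```python
import numpy as np

# Hypothetical observations (X = input, Y = output), roughly Y = 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# "Training": least-squares fit of Y = a1*X + a2
a1, a2 = np.polyfit(X, Y, deg=1)

# The model now predicts the output for a new, unseen input
x_new = 6.0
y_pred = a1 * x_new + a2
```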


Data: attributes, inputs, outputs


A set of K examples (or instances) is represented by the tuple <xk, yk>, where k = 1, ..., K; vector xk = {x1, ..., xn}k; vector yk = {y1, ..., ym}k; n = number of inputs, m = number of outputs. The process of building a function (a model) y = f(x) is called training. Often only one output is considered, so m = 1.

Measured data:

Instances   | Inputs: x1  x2  ... xn  | Output y | Model output y* = f(x)
Instance 1  |  x11  x12  ...  x1n     | y1       | y1*
Instance 2  |  x21  x22  ...  x2n     | y2       | y2*
...         |  ...                    | ...      | ...
Instance K  |  xK1  xK2  ...  xKn     | yK       | yK*


Data-driven model
[Diagram: the input data are fed both to the modelled (real) system, producing the actual (observed) output Y, and to the machine learning (data-driven) model, producing the predicted output; learning is aimed at minimizing the difference between the two.]

- DDM learns the target function Y = f(X) describing how the real system behaves
- Learning = the process of minimizing the difference between the observed data and the model output; X and Y may be non-numeric
- After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate

Data-driven models vs Knowledge-driven (physically-based) models (1)


"Physically-based", or "knowledge-based", models are based on the understanding of the underlying processes in the system
examples: river models based on main principles of water motion, expressed in differential equations, solved using finite-difference approximations

"Data-driven" model is defined as a model connecting the system state variables (input, internal and output) without much knowledge about the "physical" behaviour of the system
examples: regression model linking input and output

Current trend: combination of both (hybrid models)


Data-driven models vs Physically-based models


Physically-based (conceptual) hydrological rainfall-runoff model:
P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume. Coefficients are to be identified by calibration.

Data-driven rainfall-runoff model (artificial neural network):
inputs: precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t), runoff Q(t); output: runoff flow Q(t+1). Neurons are non-linear functions; connections carry weights to be identified by training.


Using data-driven methods in rainfall-runoff modelling


Available data:
- rainfalls Rt
- runoffs (flows) Qt
- upstream flows Qt_up

Inputs: lagged rainfalls Rt, Rt-1, ..., Rt-L
Output to predict: Qt+T

Model: Qt+T = F (Rt, Rt-1, ..., Rt-L,   (past rainfall)
                 Qt, Qt-1, ..., Qt-A,   (autocorrelation)
                 Qt_up, Qt-1_up, ...)   (routing)

Questions:
- how to find the appropriate lags? (lags embody the physical properties of the catchment)
- how to build the non-linear regression function F?
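Assembling such lagged inputs can be sketched as below; the series names R and Q and the lag values are illustrative, not from a real catchment.

```python
import numpy as np

# Hypothetical hourly series (illustrative values only)
R = np.arange(10, dtype=float)        # rainfall R_t
Q = np.arange(100, 110, dtype=float)  # runoff Q_t

L, A, T = 2, 1, 1  # rainfall lags, runoff lags, lead time

rows_X, rows_y = [], []
for t in range(max(L, A), len(Q) - T):
    # inputs: R_t, R_{t-1}, ..., R_{t-L} and Q_t, Q_{t-1}, ..., Q_{t-A}
    rows_X.append(np.concatenate([R[t - L:t + 1][::-1], Q[t - A:t + 1][::-1]]))
    rows_y.append(Q[t + T])  # output to predict: Q_{t+T}

X = np.array(rows_X)
y = np.array(rows_y)
```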

So, what is then data-driven modelling in engineering?


Data-driven modelling:
- oriented towards building predictive models
- follows general modelling guidelines adopted in engineering
- uses the apparatus of machine learning

Proper data-driven modelling is impossible without a good understanding of the modelled system, its relation to the external environment, and its possible connections to other models and decision-making processes.

It differs from data mining in:
- application area (engineering and natural processes)
- the terminology used and the relation to physically-based models
- usually smaller data sets (and hence a possibly different choice of methods)


Steps in modelling process: details


1. State the problem (why do the modelling?)
2. Evaluate data availability and data requirements
3. Specify the modelling methods and choose the tools
4. Build (identify) the model:
   - choose variables that reflect the physical processes
   - collect, analyse and prepare the data
   - build the model
   - choose the objective function for model performance evaluation
   - calibrate (identify, estimate) the model parameters: if possible, maximize model performance by comparing the model output to past measured data and adjusting parameters
5. Evaluate the model:
   - evaluate the model uncertainty and sensitivity
   - test (validate) the model using unseen measured data
6. Apply the model (and possibly assimilate real-time data)
7. Evaluate results, refine the model

Some golden rules in building a model (1)

1. Select a clearly defined problem that the model will help to resolve.
2. Specify the required solution to the problem.
3. Define how the delivered solution is going to be used in practice.
4. Learn the problem, collect the domain knowledge, understand it.
5. Let the problem drive the modelling, including the tool selection, data preparation, etc. That is, take the best tool for the job, not just a job you can do with the available tool.
6. Clearly define assumptions (do not just assume, but discuss them with the domain knowledge experts).
... ->


Some golden rules in building a model (2)

7. Refine the model iteratively (try different things until the model seems as good as it is going to get).
8. Make the model as simple as possible, but no simpler. This is also formulated as:
   - KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
   - the Minimum Description Length principle: the best model is the one that is the smallest
   - the Occam's Razor principle, formulated by William of Occam in 1320 as: shave off all the unneeded philosophy off the explanation.
9. Identify instability in the model (critical areas where small changes in inputs lead to large changes in output).
10. Identify uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).

Suppliers of methods for data-driven modelling

- Statistics
- Machine learning
- Soft computing (fuzzy systems)
- Computational intelligence
- Artificial neural networks
- Data mining
- Non-linear dynamics (chaos theory)


Suppliers of methods for data-driven modelling (1) Machine learning (ML)


ML = constructing computer programs that automatically improve with experience. It is the most general paradigm for DDM. ML draws on results from:
- statistics
- artificial intelligence
- philosophy, psychology, cognitive science, biology
- information theory, computational complexity
- control theory

For a long time ML concentrated on categorical (non-continuous) variables.


Suppliers of methods for data-driven modelling (2) Soft computing


Soft computing is tolerant of imprecision and uncertainty in data (Zadeh, 1991). It currently includes almost everything:
- fuzzy logic
- neural networks
- evolutionary computing
- probabilistic computing (incl. belief networks)
- chaotic systems
- parts of machine learning theory


Suppliers of methods for data-driven modelling (3) Data mining


Data mining (preparation, reduction, finding new knowledge):
- automatic classification
- identification of trends (e.g. statistical methods like ARIMA)
- data normalization, smoothing, data restoration
- association rules and decision trees, e.g.:
  IF (WL > 1.2 @ 3 h ago, Rainfall > 50 @ 1 h ago) THEN (WL > 1.5 @ now)
- neural networks
- fuzzy systems

Other methods are oriented towards optimization:
- automatic calibration (with a lot of data involved, this makes a physically-based model partly data-driven)


Machine learning: Learning from data


Data-driven modelling: (machine) learning


[Diagram: the input data are fed both to the modelled (real) system, producing the actual (observed) output Y, and to the machine learning (data-driven) model, producing the predicted output; learning is aimed at minimizing the difference between the two.]

- DDM tries to learn the target function Y = f(X) describing how the real system behaves
- Learning = the process of minimizing the difference between the observed data and the model output; X and Y may be non-numeric
- After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate

Training (calibration), cross-validation, testing

Ideally, the observed data has to be split into three data sets:
- training (calibration) set: used to build the model
- cross-validation set: imitates the test set and model operation; used to test model performance during the calibration process
- testing set: imitates model operation; used for the final test of the model after it is built, and should not be seen by the developer

One should distinguish:
- minimizing the error during model calibration and cross-validation
- minimizing the error during model operation (or on the unseen test set)

Ideally, we should aim at minimizing the cross-validation error, since this gives hope that the error on the test set will also be small. In practice, the process of training uses the training set, and the cross-validation set is used to periodically check the model error and stop training.
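A minimal sketch of such a three-way split, assuming illustrative 60/20/20 proportions (the fractions and the random shuffle are choices made here, not prescribed by the slides; for time series, contiguous blocks are usually preferred over shuffling).

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)  # stand-in for 100 observed examples

# Shuffle, then split ~60/20/20 into training / cross-validation / test.
# NOTE: for time series, split into contiguous blocks instead of shuffling.
idx = rng.permutation(len(data))
n_train, n_cv = int(0.6 * len(data)), int(0.2 * len(data))
train = data[idx[:n_train]]
cv = data[idx[n_train:n_train + n_cv]]
test = data[idx[n_train + n_cv:]]
```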

What is a good model?


Consider a model being progressively made more accurate (and more complex) during training. Three candidate models of Y (e.g. flow) as a function of X (e.g. rainfall):
- the Green (linear) model is simple, but not accurate enough
- the Blue model is the most accurate on the training data - but is it the best?
- the Red model is less accurate than the Blue one, but captures the trend in the data; it will generalise well to new inputs

Question: how to determine, during training, when to stop improving the model? Which model is better: green, red or blue?

Necessity of cross-validation during training

[Figure: the training (in-sample) error decreases steadily with training iterations, while the cross-validation (out-of-sample) error first decreases and then starts to rise; the minimum of the cross-validation error marks a good moment to stop training.]

in-sample = training; out-of-sample = cross-validation, or verification (testing)


Training, cross-validation, verification (2)


[Figure: an unknown relationship Y = F(X), the data {xi, yi} collected about it, and a data-driven model fitted using the training and cross-validation sets.]

Training is an iterative refinement of the model:
- with each iteration, model parameters are changed to reduce the error on the training set
- the error on the training set gets lower and lower
- the error on the cross-validation set first gets lower, then starts to increase: continuing leads to overfitting and a high cross-validation error
- the moment the cross-validation error starts to rise is the moment to stop training
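The stopping rule above can be sketched as follows; the error sequences here are synthetic stand-ins for the training and cross-validation error curves, and the patience value is an illustrative choice.

```python
# Early stopping: stop when the cross-validation error starts rising.
# Synthetic error curves, for illustration only.
train_err = [1.0 / (i + 1) for i in range(20)]            # keeps decreasing
cv_err = [(i - 8) ** 2 / 100.0 + 0.2 for i in range(20)]  # minimum at i = 8

best_i, best_cv, patience, worse = 0, float("inf"), 3, 0
for i, e in enumerate(cv_err):
    if e < best_cv:
        best_i, best_cv, worse = i, e, 0   # remember the best iteration
    else:
        worse += 1
        if worse >= patience:              # cv error rose `patience` times in a row
            break                          # stop training; keep model from iteration best_i
```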

Data-Driven Modelling: care is needed

- Difficulties with extrapolation (working outside the variables' range). A solution: exhaustive data collection and optimal construction of the calibration set.
- Care is needed if the time series is not stationary. A solution: build several models responsible for different regimes.
- Need to ensure that the relevant physical variables are included. A solution: use correlation and average mutual information analysis.


Data


Data: attributes, inputs, outputs


A set of K examples (or instances) is represented by the tuple <xk, yk>, where k = 1, ..., K; vector xk = {x1, ..., xn}k; vector yk = {y1, ..., ym}k; n = number of inputs, m = number of outputs. The process of building a function (a model) y = f(x) is called training. Often only one output is considered, so m = 1.

Measured data:

Instances   | Inputs: x1  x2  ... xn  | Output y | Model output y* = f(x)
Instance 1  |  x11  x12  ...  x1n     | y1       | y1*
Instance 2  |  x21  x22  ...  x2n     | y2       | y2*
...         |  ...                    | ...      | ...
Instance K  |  xK1  xK2  ...  xKn     | yK       | yK*


Types of data (roughly)

- Class (category, label)
- Ordinal (order)
- Numeric (real-valued)
- etc. (considered later)

(Time) series data: numerical data whose values have an index variable associated with them.


Four styles of learning

Classification
on the basis of classified examples, a way of classifying unseen examples is to be found

Association
association between features (which combinations of values are most frequent) is to be identified

Clustering
groups of objects (examples) that are "close" are to be identified

Numeric prediction
outcome is not a class, but a numeric (real) value; often called regression


Important notions in machine learning

Hypotheses
there are various possible functions Y = f(X) (hypotheses) relating input and output; machine learning searches through this hypothesis space in order to determine one that fits the observed data and the prior knowledge of the learner

Concepts: the thing to be learned on the basis of available data. For example:
- children learn how to read and write, what is sweet and salty
- conditions that lead to flood
- combinations of particular algae indicating poor water quality

Instances (examples) representing concepts Attributes (describing instances)



Occam's razor: short hypotheses, minimum description length (MDL)


Principle: accept the simplest hypothesis (= model). Known as the Occam's Razor principle: shave off all the unneeded philosophy off the explanation (William of Occam, 1320). In ML this is also known as the Minimum Description Length (MDL) principle.


Concepts

Concept - the thing to be learned on the basis of available data. For example:
- children learn how to read and write, what is sweet and salty
- conditions that lead to flood
- combinations of particular algae indicating poor water quality

Often a concept is a boolean-valued function (Yes/No). Concept learning = inferring (building) a boolean-valued function from training examples of its input and output.


Learning concepts (1)

[Figure: C = the target (real) concept of the "+" class; points marked "+" are examples of class "+", points marked "-" are examples of class "-".]


Learning concepts (2)

[Figure: C' = the concept of class "+" induced (learned) from the data - a hypothesis. It is fully consistent with the data (covers all "+" examples and no "-" examples).]

Concepts as sets: U = the set of all objects; concept C: C is a subset of U. Learning C means: for every X in U, recognize whether X belongs to C or not.

Learning concepts (3)

[Figure: the induced concept C' overlaps the target concept C; the region C' - C contains false positives ("false +"), and C - C' contains false negatives ("false -").]

Errors (incorrect classifications): (C - C') and (C' - C)
Accuracy of the induced concept C' = the proportion of correct classifications:
|U - (C - C') - (C' - C)| / |U|

Instances (examples)

Instances = examples of input data. Instances that can be stored in a simple rectangular table (only these will mainly be considered):
- individual unrelated customers described by a set of attributes
- records of rainfall, runoff, water level taken every hour

Instances that cannot be stored in a table, but require more complex structures:
- instances of pairs that are sisters, taken from a family tree
- related tables in complex databases describing staff, their ownership, involvement in projects, borrowing of computers, etc.


Attributes: a more detailed view of measured data (1)

Nominal (also called class, category, labels). Examples:
- customers that tend to do shopping on Fridays
- combinations of hydrometeorological conditions leading to a high surge
- combinations of conditions leading to a high probability of flood

Ordinal - categories that can be ordered (ranked). Examples:
- temperature expressed as cool, mild, hot
- water level expressed as low, medium, high


Attributes: a more detailed view of measured data (2)

Interval - ordered and expressed in fixed equal units. Examples:
- dates (which cannot, however, be multiplied)
- temperature expressed in degrees Celsius

Ratio (real numbers) - there is a zero point. Examples:
- distance, precipitation, water level, soil moisture, etc. (but not temperature, since using another zero (Celsius/Fahrenheit) changes the ratios)
- the amount of money a customer tends to spend on a typical Friday


Attributes: main types of data in machine learning (simplified)


Nominal (also called class, category, labels, enumerated):
- discrete, but without ordering
- special case: dichotomy (Boolean)

Ordinal (also called numeric, continuous):
- could also be discrete, but must be ordered
- prediction of real-valued output is also called regression

More complex types, requiring additional descriptions (metadata), e.g.:
- dimensions (for building expressions that are dimensionally correct)
- partial ordering (like "same day next week")
- subsets (like "holidays", when water consumption is different), etc.

Data preparation and surveying

- prepare the data: this may include complex procedures of restoring missing data, data transformation, etc.
- survey the data: understand the nature of the data and get insight into the problem this data describes - this includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information of inputs to outputs, etc. (this step is often merged with the previous one)
- build the model


Data preparation results in:

- training data set: raw data presented in the form necessary to train the DDM
- cross-validation data set: needed to detect overtraining
- testing (validation) data set: needed to validate (test) the model's predictive performance
- algorithms and software to perform pre-processing (e.g., normalization)
- algorithms and software to perform post-processing (e.g., denormalization)


Important steps in data preparation


- Replace missing, empty, inaccurate values
- Handle the issue of spatial and temporal resolution
- Linear scaling and normalization
- Non-linear transformations
- Transform the distributions
- Time series: Fourier and wavelet transforms; identification of trend, seasonality, cycles, noise; smoothing data
- Finding relationships between attributes (e.g. correlation, average mutual information - AMI)
- Discretizing numeric attributes into {low, medium, high}
- Data reduction (principal component analysis - PCA)

Example: raw data set


Example: normalization and redistribution


Replacing missing and empty values

What to do with outliers? How to reconstruct missing values?
- An estimator is a device (algorithm) used to make a justifiable guess about the value of some particular variable, that is, to produce an estimate.
- An unbiased estimator is a method of guessing that does not change important characteristics of the data set when the estimates are included with the existing values.

Example: dataset 1, 2, 3, x, 5. Estimates for the missing value x:
- 2.750, if the mean is to be unbiased
- 4.659, if the standard deviation is to be unbiased
- 4.000, if the step-wise change in the variable value (trend) is to be unbiased (that is, linear interpolation is used: xi = (xi+1 + xi-1) / 2)
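Two of these estimates can be checked directly (the standard-deviation-unbiased value requires solving for x and is omitted here):

```python
data = [1.0, 2.0, 3.0, 5.0]  # the dataset 1 2 3 x 5 with x missing

# Mean-unbiased estimate: the mean of the remaining values,
# so that inserting it leaves the mean unchanged
mean_est = sum(data) / len(data)

# Trend-unbiased estimate: linear interpolation between the neighbours of x
trend_est = (3.0 + 5.0) / 2
```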

Issue of spatial and temporal resolution

Examples:
- in a harbour, sedimentation is measured once in two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
- in a catchment, rainfall data were collected manually at three gauging stations once a day for 20 years; 3 years ago, measurements also started at 4 new automatic stations, with hourly frequency

Solutions:
- filling-in missing data
- introducing an artificial resolution equal to the maximum resolution of all variables


Linear scaling and normalization

General form (linear scaling): x'i = a xi + b

To keep data positive: x'i = xi - min(x1...xn) + SmallConst

Squashing data into the range [0, 1]:

x'i = (xi - min(x1...xn)) / (max(x1...xn) - min(x1...xn))
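The min-max squashing can be sketched in one line (illustrative values):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Squash into [0, 1]: subtract the minimum, divide by the range
x_scaled = (x - x.min()) / (x.max() - x.min())
```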


Non-linear transformations

Logarithmic: x'i = log(xi)
Softmax scaling: a combination of linear scaling and the logistic function (next slides)


Logistic function

L(x) = 1 / (1 + e^(-x))

[Figure: the logistic function - an S-shaped curve mapping input values on (-inf, +inf) to output values in (0, 1).]


Softmax function

1. {x} is first transformed linearly to vary around the mean:

   x~i = (xi - E(x)) / (lambda * (sigma_x / 2*pi))

   where E(x) is the mean value of variable x, sigma_x is the standard deviation of x, and lambda is the size of the (roughly) linear response region, measured in standard deviations. For example, lambda = 1 standard deviation (on either side of the central point of the distribution) covers 68% of the total range of x, lambda = 2 covers 95.5%, and lambda = 3 covers 99.7%.

2. Then the logistic function L(x~) is applied, squashing the result into (0, 1).
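The two steps can be combined into one helper; the function name and test values are illustrative, and the formula follows the reconstruction above (linear transform around the mean, then logistic squashing).

```python
import numpy as np

def softmax_scale(x, lam=1.0):
    """Softmax scaling: linear transform around the mean, then logistic squashing.

    lam is the width of the (roughly) linear response region,
    in standard deviations on either side of the mean.
    """
    x = np.asarray(x, dtype=float)
    x_tilde = (x - x.mean()) / (lam * x.std() / (2 * np.pi))
    return 1.0 / (1.0 + np.exp(-x_tilde))

# Even the extreme value 100 ends up inside (0, 1)
scaled = softmax_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```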

Transforming the distributions: Box-Cox transform


The first step uses the power transform to adjust the changing variance:

   x'i = (xi^lambda - 1) / lambda

where xi is the original value, x'i is the transformed value, and lambda is a user-selected value.

The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:

   x''i = (x'i - E(x')) / sigma_x'

where x'i is the value after the first transform, x''i is the (final) standardized value, E(x') is the mean of variable x', and sigma_x' is the standard deviation of x'.
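Both steps in one sketch (lambda and the data values below are illustrative choices; the log branch for lambda = 0 is the usual limiting case of the power transform):

```python
import numpy as np

def box_cox(x, lam):
    x = np.asarray(x, dtype=float)
    # Step 1: power transform to stabilise the variance
    x1 = (x**lam - 1.0) / lam if lam != 0 else np.log(x)
    # Step 2: standardise (subtract the mean, divide by the standard deviation)
    return (x1 - x1.mean()) / x1.std()

z = box_cox([1.0, 10.0, 100.0, 1000.0], lam=0.1)
```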

Box-Cox transform of the rainfall data


[Figure: original hourly discharge data (30 days), 0-500 m3/s over 700 hours, and the Box-Cox transform of the discharge data after step 1 and after the final standardization step.]


Box-Cox transform: original and resulting histograms

[Figures: histogram (frequency per bin) of the original discharge and histogram of the Box-Cox transformed discharge.]


Multistationary data: care needed

Transforming the distributions can be dangerous: such transformations may change the nature of the data and the relationships between variables.

[Figure: (a) original data with two clusters (two samples) visible; (b) after normalization the clusters can no longer be identified.]

Smooth data?
- Simple and weighted moving averages
- Savitzky-Golay filter: builds a local polynomial regression (of degree k) on a series of values
- other filters (Gaussian, Fourier, etc.)
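The simplest of these, the unweighted moving average, can be sketched as a convolution (window size and data are illustrative):

```python
import numpy as np

def moving_average(x, window=3):
    """Simple (unweighted) moving average; 'valid' mode shortens the series."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# The spike at 9.0 is spread out over its neighbours
smoothed = moving_average([1.0, 2.0, 9.0, 2.0, 1.0], window=3)
```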


Fourier transform can be used to smooth data (extract only the low-frequency harmonics).

[Figure: original signal (time series) and its reconstruction from six harmonic components.]
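This low-pass smoothing can be sketched with the FFT; the synthetic signal and the choice of six kept harmonics mirror the figure but are otherwise illustrative.

```python
import numpy as np

t = np.arange(256)
# Synthetic signal: a slow oscillation plus a high-frequency component
signal = np.sin(2 * np.pi * t / 64) + 0.3 * np.sin(2 * np.pi * t / 4)

# Keep only the mean and the six lowest harmonics, zero out the rest
spectrum = np.fft.rfft(signal)
spectrum[7:] = 0.0
smoothed = np.fft.irfft(spectrum, n=len(signal))
```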


Fourier transform: the same signal in frequency domain


Is life linear? Are outliers bad?


Finding relationships between i/o variables


Correlation coefficient R:

   R = sum_i (xi - x_mean)(yi - y_mean) / sqrt( sum_i (xi - x_mean)^2 * sum_i (yi - y_mean)^2 )

Average mutual information (AMI): a measure of the information that can be learned about one set of data from knowledge of another set of data:

   I(X;Y) = sum over x in X, y in Y of P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]

where P(x,y) is the joint probability of realisation x of X and y of Y, and P(x) and P(y) are the individual probabilities of these realisations. If X is completely independent of Y, then the AMI I(X;Y) is zero.

AMI can be used to identify the optimal time lag for a data-driven rainfall-runoff model.
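A histogram-based estimate of this formula can be sketched as follows; the function, the bin count and the test data are illustrative (a rough estimator, not production-grade).

```python
import numpy as np

def average_mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in bits."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                    # joint probabilities P(x, y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)      # marginal P(y)
    nz = pxy > 0                             # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
ami_dep = average_mutual_information(x, x)                       # strongly related
ami_ind = average_mutual_information(x, rng.normal(size=2000))   # independent
```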

Use of AMI: relatedness between flow and past rainfall

[Figure: effective rainfall and discharge data (FLOW1 catchment) over 2500 hours, with a zoomed-in window of Q and R around t = 90-150 hrs.]

Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L for which the relatedness is strongest? AMI can be used to identify the optimal time lag: the AMI between Qt+1 and the past lagged rainfalls Rt-L reaches its maximum at the optimal lag.

Introducing classification:
main ideas


k-Nearest neighbors method: a common sense method of classification


- instances are points in 2-dimensional space; the output is boolean (+ or -)
- a new instance xq is classified according to the proximity of the nearest training instances: to class "+" if 1 neighbour is considered, to class "-" if 4 neighbours are considered (in the illustrated example)
- for discrete-valued outputs, assign the most common value among the k nearest neighbours

[Figure: Voronoi diagram induced by the 1-nearest-neighbour rule.]
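A minimal sketch of the majority-vote rule, with hypothetical 2-D training points (the data and the function name are illustrative):

```python
from collections import Counter

def knn_classify(train, query, k):
    """train: list of ((x1, x2), label); majority vote among the k nearest points."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"),
         ((5, 5), "+"), ((4, 5), "+")]

label1 = knn_classify(train, (4.5, 4.5), k=1)  # single nearest neighbour
label3 = knn_classify(train, (2.5, 2.5), k=3)  # majority of the 3 nearest
```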



Discriminating surfaces: a traditional method of classification from statistics


A surface (line, hyperplane) separates examples of different classes: Y = a1 X + a2.

[Figures: (left) linearly separable examples, split by a straight line; (right) a more difficult example, where a linear function would misclassify several examples, so a non-linear function is needed (or a transformation of the space).]

Decision tree: example with 2 numeric input variables, 2 output classes: 0 and 1
[Figure: the (X1, X2) plane partitioned into rectangular regions labelled class 0 or class 1, and the corresponding decision tree with node tests x2 > 2, x1 > 2.5, x1 < 4, x2 < 3.5 and x2 < 1, whose leaves assign class 0 or class 1.]

Classification rules and decision trees: Play Tennis example


Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no

Objective: to represent this knowledge using ML techniques.



Several ways to represent knowledge about data set (1) Classification rules
Classification rules predict the classification of examples in terms of whether to play or not. E.g.:

if (Outlook = sunny) and (Humidity = high) then Play = No
if (Outlook = rainy) and (Windy = strong) then Play = No
if (Outlook = overcast) then Play = Yes
if (Humidity = normal) then Play = Yes
if (none of the above) then Play = Yes

These rules are to be interpreted in order (such an ordered rule set is also called a decision list).
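The in-order interpretation of such a decision list maps directly onto a chain of if-statements (the function name is illustrative; the rules are those listed on the slide):

```python
def play_tennis(outlook, humidity, wind):
    """Decision list from the slide, interpreted in order: the first match wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and wind == "strong":
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above matched

ans = play_tennis("sunny", "high", "weak")
```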


Several ways to represent knowledge about data set (2) Decision trees
Decision trees - treelike structure representing classification rules


Several ways to represent knowledge about data set (3) Association rules
Association rules associate different attribute values. E.g.:

if (Temperature = cool) then Humidity = normal
if (Humidity = normal) and (Windy = weak) then Play = Yes
if (Outlook = sunny) and (Play = No) then Humidity = high
if (Windy = weak) and (Play = No) then (Outlook = sunny) and (Humidity = high)

In total there are around 60 such rules that are 100% correct. These rules can predict any of the attributes, not just the Play attribute.


Classification:
decision trees and ID3 and C4.5 algorithms


Example with categorical attributes: 14 instances out of 36 possible combinations

Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no

Objective of classification

To build a data-driven model that would:
- be based on the available historical data on the behaviour of a particular person (14 instances)
- classify a new instance (observation) into a proper class - in this case into Yes (play) or No (do not play)


Decision tree for 'PlayTennis'


(the training data are the same 14 examples as in the table above)

outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = strong: no
|  windy = weak: yes

each node = a test for an attribute, leaves = classification of an instance

How to construct such a tree? Algorithm ID3 (Quinlan, 1986), later extended to C4.5 and C5.0

Verification (test) data set


oops! Our model is wrong here

verification (test) set: observations that also happened:


Outlook    Temp   Humid.   Wind     Play?   Model prediction
rainy      cool   high     strong   yes     no (wrong)
sunny      hot    normal   strong   yes     yes (correct)
overcast   cool   high     strong   yes     yes (correct)

classification error for this verification data set of 3 examples is 1/3 = 33%.


Model in operation

new instance that the model never saw:

Outlook   Temp   Humid.   Wind   Play?
rainy     cool   low      weak   ?

Model prediction: yes

Correct? We do not know!


Algorithm for building decision trees

1. A := the 'best' decision (split) attribute for the next node ('best' = giving max information gain)
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

But a question remains: how to identify the best split attribute, i.e. how to compute the information gain? Let us consider the Iterative Dichotomiser 3 (ID3) algorithm.
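The five steps above can be sketched as a recursive procedure on the 'PlayTennis' data. A minimal illustration (helper names and the tuple encoding are mine, not from the slides):

```python
import math
from collections import Counter

# The 14 'PlayTennis' examples: (outlook, temp, humidity, wind, play)
DATA = [
    ("sunny","hot","high","weak","no"),     ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rainy","mild","high","weak","yes"),
    ("rainy","cool","normal","weak","yes"), ("rainy","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rainy","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rainy","mild","high","strong","no"),
]

def entropy(rows):
    n = len(rows)
    return -sum(c/n * math.log2(c/n) for c in Counter(r[-1] for r in rows).values())

def gain(rows, a):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[a], []).append(r)
    return entropy(rows) - sum(len(s)/len(rows) * entropy(s) for s in subsets.values())

def id3(rows, attrs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                # step 5: perfectly classified -> leaf
        return classes.pop()
    if not attrs:                        # fallback: majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    a = max(attrs, key=lambda x: gain(rows, x))   # step 1: 'best' attribute
    return {(a, v): id3(sub, [x for x in attrs if x != a])   # steps 2-4
            for v in {r[a] for r in rows}
            for sub in [[r for r in rows if r[a] == v]]}

tree = id3(DATA, [0, 1, 2, 3])
print(tree[(0, "overcast")])  # yes
```

Running this reproduces the tree from the slides: the root splits on attribute 0 (outlook), the sunny branch on humidity and the rainy branch on wind.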

Entropy: measure of the uncertainty (unpredictability) associated with a random variable


Consider a set S of members that can be + or −. We sample a member randomly and get + with probability P+:
- if P+ = 1 (all members are +), then E = 0 (0 bits are needed to encode a message about a sample, since they are all +)
- if P+ = P− = 0.5, then E = 1 (1 bit is needed, since a sample is + or − with equal probability)
- if P+ = 0.8, then E = 0.72 (so < 1 bit per message about the class)

For the binary case: E(S) = −P+ log2 P+ − P− log2 P−

Entropy(S) = expected information (number of bits) needed to encode the class (+ or −) of a randomly drawn sample of S (under the shortest-length "code")

Entropy: general case

if members in S may belong to c classes:

E(S) = − Σi=1..c  pi · log2(pi)

- the logarithm is still base 2, because entropy is the measure of the expected encoding length measured in bits
- the max value of entropy is log2 c; for example, if c = 8 and p1 = ... = p8 = 0.125, then E(S) = 3 (3 bits are needed to send a message about the class number)
- if all examples belong to one class, E = 0. This is what we are aiming at.
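The formula can be checked in a few lines of code (a sketch; the `entropy` helper is mine):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of class counts."""
    n = sum(counts)
    return -sum(c/n * math.log2(c/n) for c in counts if c > 0)

print(entropy([9, 5]))    # a [9+, 5-] set: ~0.940 bits
print(entropy([1] * 8))   # c = 8 equally likely classes: 3.0 bits
print(entropy([14]))      # all examples in one class: 0.0
```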

Entropy and Information Gain: essence of the Iterative Dichotomiser 3 (ID3) algorithm for building decision trees
S = (14 instances: 9+, 5-)
Entropy E([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

We want to reduce the total entropy by splitting the set S into subsets with lower entropy (i.e. with a higher share of examples of the same class); lower entropy = information gain.

If the split is made on the basis of attribute A:
- let Values(A) be the set of all possible values of attribute A, e.g. Values(Humidity) = {normal, high}
- let Sv be the subset of S for which attribute A has value v, e.g. S(Humidity=normal) = {7 examples}, |S(Humidity=normal)| = 7

Then the information gain (the expected reduction in entropy caused by knowing the value of attribute A) is:

Gain(S, A) = E(S) − Σ (v ∈ Values(A))  (|Sv| / |S|) · E(Sv)


Selecting 'best' attributes for leaves


for a node, the attribute is selected that provides the max Gain:
Gain(S, Outlook) = 0.246  ← highest
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
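These gain values can be reproduced directly from the 14-example table (a sketch; helper names and the encoding are mine):

```python
import math
from collections import Counter

# The 14 training examples: (outlook, temp, humidity, wind, play)
DATA = [
    ("sunny","hot","high","weak","no"),     ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rainy","mild","high","weak","yes"),
    ("rainy","cool","normal","weak","yes"), ("rainy","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rainy","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rainy","mild","high","strong","no"),
]
ATTRS = {"outlook": 0, "temp": 1, "humidity": 2, "wind": 3}

def entropy(rows):
    n = len(rows)
    return -sum(c/n * math.log2(c/n) for c in Counter(r[-1] for r in rows).values())

def gain(rows, attr):
    i = ATTRS[attr]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[i], []).append(r)
    return entropy(rows) - sum(len(s)/len(rows) * entropy(s) for s in subsets.values())

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# outlook gives the largest gain, so it becomes the root split
```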


Selecting the next best attribute


Same example with continuous or mixed data


Outlook    Temp   Humid.   Wind     Play?
sunny      85     85       weak     no
sunny      80     90       strong   no
overcast   83     86       weak     yes
rainy      70     96       weak     yes
rainy      68     80       weak     yes
rainy      65     70       strong   no
overcast   64     65       strong   yes
sunny      72     95       weak     no
sunny      69     70       weak     yes
rainy      75     80       weak     yes
sunny      75     70       strong   yes
overcast   72     90       strong   yes
overcast   81     75       weak     yes
rainy      71     91       strong   no

Temperature:  40   48   60   72   80   90
Play?         No   No   Yes  Yes  Yes  No
Thresholds:         ^              ^    (placed for max information gain)
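Candidate thresholds sit midway between adjacent sorted values where the class changes. A minimal sketch (the function name is mine):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temp = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
print(candidate_thresholds(temp, play))  # [54.0, 85.0]
```

Each candidate threshold is then scored by information gain, exactly as for a categorical attribute with two values (below / above the threshold).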

Unknown (missing) values in decision trees

If some examples do not have value for attribute A:


- assign the most common value among the examples sorted to this node, or
- assign the most common value among the examples with the same target value, or
- assign a probability pi to each possible value vi of A, and then pass a fraction pi of the example to each descendant in the tree
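The first strategy (fill in the most common observed value) can be sketched as follows (an illustrative helper, not from the slides):

```python
from collections import Counter

def impute_most_common(rows, attr_index, missing=None):
    """Replace missing values of one attribute by its most common observed value."""
    observed = [r[attr_index] for r in rows if r[attr_index] is not missing]
    fill = Counter(observed).most_common(1)[0][0]
    return [tuple(fill if (i == attr_index and v is missing) else v
                  for i, v in enumerate(r))
            for r in rows]

rows = [("sunny", "high"), ("rainy", None), ("rainy", "high"), ("overcast", "normal")]
print(impute_most_common(rows, 1))  # the None humidity becomes "high"
```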


Pruning Decision trees to improve interpretability and remove overfitting


The tree may appear to be too complex (hundreds of nodes):
- difficult to interpret
- "too accurate" (overfitting), following all (even noisy) examples

Solution: the tree can be reduced (pruned):
- reduced error pruning (Quinlan, 1987): replace a subtree by a leaf node and assign the most common class of the associated training examples; the new tree must be not worse on the validation set
- algorithms can be forced to keep a certain minimum number of instances in a leaf
- a pruned tree will inevitably misclassify some instances, so accuracy decreases, but this normally removes overfitting
- moreover, smaller trees are easier to understand

Overfitting: can also happen in decision trees (1)

Frequent situation: accuracy during learning increases, but on the test examples drops


Overfitting: can also happen in decision trees (2)

Example of overfitting:
- consider adding a new, 15th, instance [sunny, hot, normal, strong, NO]
- this example is noisy - it could simply be wrong
- the new tree (built by the ID3 algorithm in the Weka software) has to take this (wrong) example into account; note that a check on Temperature is added

New tree, built on 15 examples:
outlook = sunny
|  temperature = hot: no
|  temperature = mild
|  |  humidity = high: no
|  |  humidity = normal: yes
|  temperature = cool: yes
outlook = overcast: yes
outlook = rainy
|  windy = strong: no
|  windy = weak: yes

Original tree, built on 14 examples (it misclassifies the 15th example):
outlook = sunny
|  humidity = high: no (3.0)
|  humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|  windy = strong: no (2.0)
|  windy = weak: yes (3.0)



Problems appropriate for decision tree learning

Instances describable by attribute-value pairs.


- Easiest is when an attribute takes a small number of disjoint possible values (e.g. Temp = {cool, mild, hot}); with proper discretization, real-valued attributes are also possible

Target function is discrete-valued (i.e. it is a class). Noisy and missing training data are allowed


Classification: use of rules


Classification rules

- A popular alternative to decision trees
- General form (if-then or if-then-else): if (Antecedent) then (Consequent); the Antecedent is normally several ANDed preconditions
- Several rules are often connected with the OR operator (ORed)
- Rules can be read off a decision tree, but they are then far more complex than necessary; well-constructed rules are often more compact than trees
- Problems: rules can be interpreted in order (as a decision list) or individually; what to do if different rules lead to different conclusions for the same instance, etc.


Classification rules based on representative instances (prototype-based rules)


if there is a representative example Er that carries a lot of information, it could be used as the basis for rules like:
if NewExample is close to Er, then its class is the same as that of Er


Example classification rule based on a pruned tree


Decision tree is built, then pruned, and rules are built (PART algorithm). Example:
if (outlook = overcast) then Play=YES   (total=4 / errors=0)
if (humidity = high) then Play=NO       (total=5 / errors=1)
otherwise Play=YES                      (total=5 / errors=1)

Accuracy is reduced, but the set of rules is very simple


Example of using trees and rules: influence of environmental factors on diseases


Influence of environmental factors on respiratory diseases (Kontic, Dzeroski, 2000) - decision trees (C4.5), classification rules (CN2)
760 patients: 33 input attributes:
gender, age, residence, practices: cooking, heating, smoking, allergies, education, type of house, number of people in household, exposition, month, etc.

the trained classifier was able to predict the disease (4 classes):


acute nasopharyngitis (common cold), acute infection of upper respiratory organs, influenza, acute bronchitis


Case Study Woudse: using decision trees to replicate pumping strategy for Woudse water system (Delfland, NL)


Woudse: schematisation of the water system

polder


Woudse Case Study: water levels and pumping (full data set)
Water level & pump discharge for Woudse
[figure: water level (M) and pump discharge plotted against observation number, full data set; legend: WL, Pump]

Woudse Case Study: water levels and pumping (fragment)


Water level & pump discharge for Woudse
[figure: water level (M) and pump discharge, fragment of observations 900-1150; legend: WL, Pump]


Woudse Case Study: problem description

Input: water level(t) and pump discharge(t-1); output: pump discharge(t)
- the pumping station has two pumps, each with a capacity of 0.133 m3/s
- possible pump discharge: 0, 0.133 or 0.266 m3/s
- pump discharge is described as a categorical variable: 0, 1 or 2
- if the water level goes up, pump(s) should be switched on to reduce it; the target water level is -4.6 m

At each time step we have to determine the pump discharge (0, 1 or 2) based on two inputs: water level(t) and pump discharge(t-1).

The problem is thus posed as a classification problem.


Woudse case study: resulting decision tree solving the classification problem (trained on 5000 instances)
Pumpt = f(WLt, Pumpt-1),  Pump ∈ {0, 1, 2}

PumpT-1 = 0
|  WLt <= -4.577: 0 (1084.0/1.0)
|  WLt > -4.577
|  |  WLt <= -4.57: 0 (417.0/111.0)
|  |  WLt > -4.57
|  |  |  WLt <= -4.551: 1 (125.0)
|  |  |  WLt > -4.551: 2 (12.0)
PumpT-1 = 1
|  WLt <= -4.595
|  |  WLt <= -4.601: 0 (129.0)
|  |  WLt > -4.601: 1 (189.0/64.0)
|  WLt > -4.595
|  |  WLt <= -4.55: 1 (680.0/3.0)
|  |  WLt > -4.55: 2 (41.0)
PumpT-1 = 2
|  WLt <= -4.593
|  |  WLt <= -4.6: 0 (37.0)
|  |  WLt > -4.6: 2 (39.0/16.0)
|  WLt > -4.593: 2 (2247.0/1.0)
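Read as code, such a tree is a nested-if controller. A sketch transcribing the thresholds (function name is mine; the layout of the splits follows the tree as reconstructed here):

```python
def pump_discharge(wl_t, pump_prev):
    """Pump setting (0, 1 or 2) as a function of WLt and PumpT-1."""
    if pump_prev == 0:
        if wl_t <= -4.577:
            return 0
        if wl_t <= -4.57:
            return 0
        return 1 if wl_t <= -4.551 else 2
    if pump_prev == 1:
        if wl_t <= -4.595:
            return 0 if wl_t <= -4.601 else 1
        return 1 if wl_t <= -4.55 else 2
    # pump_prev == 2
    if wl_t <= -4.593:
        return 0 if wl_t <= -4.6 else 2
    return 2

print(pump_discharge(-4.60, 0))  # low water level: pumps stay off (0)
print(pump_discharge(-4.54, 1))  # rising level: both pumps on (2)
```

This is exactly how a learned decision tree can replicate the operator's pumping strategy in real time.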

Woudse case study: verification result (full data set, 3759 instances)
Water level & pump discharge for Woudse
[figure: water level (M), known pump discharge and predicted pump discharge plotted against observation number; legend: WL, Known Pump, Predicted Pump]


Woudse case study: verification result (fragment with the 100% correct classification)
Water level & pump discharge for Woudse
[figure: water level (M), known and predicted pump discharge, fragment of observations 950-1000; legend: WL, Known Pump, Predicted Pump]


Woudse case study: verification result (fragment with some errors present)
Water level & pump discharge for Woudse
[figure: water level (M), known and predicted pump discharge, fragment of observations 600-650; legend: WL, Known Pump, Predicted Pump]


Clustering and classification: some applications


- SOFM in finding 12 groups of catchments based on their 12 characteristics, and then applying ANN to model the regional flood frequency (Hall et al., 2000)
- Hannah et al. (2000) used clustering for finding groups of hydrographs on the basis of their shape and magnitude; the clusters are then used for classification by experts
- similarly, identifying the classes of river regimes (Harris et al., 2000)
- fuzzy c-means in classifying shallow Dutch groundwater sites into homogeneous groups (Frapporti et al., 1993)
- fuzzy classification in soil classification on the basis of cone penetration tests (CPTs) (Zhang et al., 1999)


Clustering and classification: some applications


- decision trees in classifying surge water levels in the coastal zone depending on the hydrometeorological data, with the Dutch Ministry for public works (Solomatine et al., 2000; Velickov, 2004)
- decision trees in classifying the river flows in the problem of flood control
- self-organizing feature maps (Kohonen neural networks) as clustering methods, and SVM as a classification method in aerial photo interpretation (Velickov et al., 2000)
- decision trees, Bayesian methods and neural networks in soil classification on the basis of cone penetration tests (CPT) (Bhattacharya and Solomatine, 2006)


Classification: conclusions

- there is a wide choice of methods
- classification methods are mainly applied in pattern recognition problems
- engineering numerical problems can sometimes be posed as classification problems; using classification methods (decision trees) often leads to simpler models and requires less accurate data


Numeric prediction (regression):


linear models and their combinations in tree-like structures (M5 model trees)


Models for numeric prediction

Target function is real-valued. There are many methods:
- Linear and non-linear regression
- ARMA (auto-regressive moving average) and ARIMA models
- Artificial Neural Networks (ANN)

We will consider now:
- Linear regression
- Regression trees
- Model trees


Linear regression
Y = a0 + a1 X

[figure: regression line through the training points (x(t), y(t)); for a new input value x(v) the model predicts a new output value y(v)]

Given measured (training) data, T vectors {x(t), y(t)}, t = 1..T, the unknown a0 and a1 are found by solving the optimization problem

E = Σt=1..T ( y(t) − (a0 + a1 x(t)) )²  → min

Then, for the new V vectors {x(v)}, v = 1..V, this equation can approximately reproduce the corresponding function values {y(v)}, v = 1..V.
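For one input variable, the minimization has the familiar closed-form least-squares solution. A minimal sketch (the data values are made up for illustration):

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a0 and slope a1 of Y = a0 + a1*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

# four made-up training pairs (x(t), y(t))
a0, a1 = fit_line([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
print(a0, a1)  # a0 ≈ 1.15, a1 ≈ 1.94
```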


Numeric prediction by averaging in subsets (regression trees in 1D)


- input X1 is split into intervals; averaging is performed in each interval
- the input space can be split according to the standard deviation in the subsets

[figure: piecewise-constant approximation of Y (output) as a function of X1]

Numeric prediction by piece-wise linear models (model trees in 1D)


- input X1 is split into intervals; a separate linear model can be built for each interval
- the question is: how to split the input space in an optimal way?

[figure: piecewise-linear approximation of Y (output) as a function of X1]

Regression and M5 model trees: building them in 2D


- the input space X1 × X2 is split into regions; a separate regression model can be built for each region
- tree structure where:
  - nodes are splitting conditions (here: x2 > 2, x1 > 2.5, x1 < 4, x2 < 3.5, x2 < 1)
  - leaves are: constants (→ regression tree) or linear regression models (→ M5 model tree)

Example of Model 1:
- in a regression tree: Y = 10.5
- in an M5 model tree: Y = 2.1*X1 + 0.3*X2

[figure: partition of the X1-X2 plane into six regions (Models 1-6) and the corresponding tree of splitting conditions]

How to select an attribute for split in regression trees and M5 model trees
regression trees: attribute selection is analogous to decision trees (where information gain was used)
- main idea: choose the attribute that splits the portion T of the training data that reaches a particular node into subsets T1, T2, ...
- use the standard deviation sd(T) of the output values in T as a measure of error at that node (in decision trees, the entropy E(T) was used)
- the split should result in subsets Ti with low standard deviation sd(Ti)
- so the model trees' splitting criterion is the standard deviation reduction (SDR), which has to be maximized:

SDR = sd(T) − Σi (|Ti| / |T|) · sd(Ti)
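SDR can be computed in a few lines (a sketch; `statistics.pstdev` is the population standard deviation, and the data are made up):

```python
import statistics

def sdr(y, subsets):
    """Standard deviation reduction: sd(T) - sum(|Ti|/|T| * sd(Ti))."""
    sd = statistics.pstdev
    return sd(y) - sum(len(s) / len(y) * sd(s) for s in subsets)

y = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
good = sdr(y, [y[:3], y[3:]])      # separates the two clusters: large reduction
bad = sdr(y, [y[::2], y[1::2]])    # mixes the clusters: small reduction
print(good, bad)
```

The split with the largest SDR is chosen, exactly as the split with the largest information gain is chosen in decision trees.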


Measures to improve performance of model trees

smoothing:
- the smoothing process is used to compensate for the sharp discontinuities between adjacent linear models

pruning (size reduction), needed when a large tree overfits the data:
- a subtree is replaced by one linear model


Regression and model trees in numerical prediction: some applications


Trees and rules: influence of soil habitat on Collembola (apterygote insects)

Influence of soil habitat features on the abundance of Collembola (Kampichler, Dzeroski) - regression and model trees
- inputs: field type, microbial respiration, microbial biomass, soil moisture, alkalinity (pH), carbon, nitrogen, median particle size
- outputs: total number of collembolan individuals (abundance), total number of collembolan species (biodiversity), number of individuals of Folsomia quadrioculata (a particular type of Collembola)
- methods compared: linear regression (highest error), regression and model trees, neural networks (least error)


Regression trees: prediction of Collembola population


In accuracy, better than linear regression; very easy to interpret


M5 model trees: prediction of Collembola population


In accuracy, a bit worse than ANN, but very easy to interpret


M5 tree in Rainfall-runoff modelling (Huai river, China)


Model structure: Qt+T = F(Rt, Rt-1, ..., Rt-L, Qt, Qt-1, ..., Qt-A, QtUP, Qt-1UP)

Output: predicted discharge QXt+1
Inputs (with different time lags):
- daily areal rainfall (Pa)
- moving average of daily areal rainfall (PaMov)
- discharges (QX) and upstream discharges (QC)
Smoothed variables have a higher correlation coefficient with the output, e.g. the 2-day moving average of rainfall (PaMov2t)

Final model for the flood season:
- output: discharge the next day, QXt+1
- inputs: Pat, Pat-1, PaMov2t, PaMov2t-1, QCt, QCt-1, QXt

Variables considered (data 1976-1996):
- daily discharges (QX, QC)
- daily rainfall at 17 stations
- daily evaporation for 14 years (1976-1989) at 3 stations
- training data: 1976-89; cross-validation & testing: 1990-96

Techniques used: M5 model trees, ANN

Resulting M5 model tree with 7 models (Huai river)


QXt <= 154:
|  PaMov2t <= 4.5: LM1 (1499/4.86%)
|  PaMov2t > 4.5:
|  |  PaMov2t <= 18.5: LM2 (315/15.9%)
|  |  PaMov2t > 18.5: LM3 (91/86.9%)
QXt > 154:
|  PaMov2t-1 <= 13.5:
|  |  PaMov2t <= 4.5: LM4 (377/15.9%)
|  |  PaMov2t > 4.5: LM5 (109/89.7%)
|  PaMov2t-1 > 13.5:
|  |  PaMov2t <= 26.5: LM6 (135/73.1%)
|  |  PaMov2t > 26.5: LM7 (49/270%)

Models at the leaves:
LM1: QXt+1 = 2.28 + 0.714PaMov2t-1 - 0.21PaMov2t + 1.02Pat-1 + 0.193Pat - 0.0085QCt-1 + 0.336QCt + 0.771QXt
LM2: QXt+1 = -24.4 - 0.0481PaMov2t-1 - 4.96PaMov2t + 3.91Pat-1 + 4.51Pat - 0.363QCt-1 + 0.712QCt + 1.05QXt
LM3: QXt+1 = -183 + 10.3PaMov2t-1 + 8.37PaMov2t - 5.32Pat-1 + 1.49Pat - 0.0193QCt-1 + 0.106QCt + 2.16QXt
LM4: QXt+1 = 47.3 + 1.06PaMov2t-1 - 2.05PaMov2t + 1.91Pat-1 + 4.01Pat - 0.3QCt-1 + 1.11QCt + 0.383QXt
LM5: QXt+1 = -151 - 0.277PaMov2t-1 - 37.8PaMov2t + 31.1Pat-1 + 30.3Pat - 0.672QCt-1 + 0.746QCt + 0.842QXt
LM6: QXt+1 = 138 - 5.95PaMov2t-1 - 39.5PaMov2t + 29.6Pat-1 + 35.4Pat - 0.303QCt-1 + 0.836QCt + 0.461QXt
LM7: QXt+1 = -131 - 27.2PaMov2t-1 + 51.9PaMov2t + 0.125Pat-1 - 5.29Pat - 0.0941QCt-1 + 0.557QCt + 0.754QXt


Performance of M5 and ANN models (Huai river)


D.P. Solomatine and Y. Xue. M5 model trees compared to neural networks: application to flood forecasting in the upper reach of the Huai River in China. ASCE Journal of Hydrologic Engineering, 9(6), 2004, 491-501.

M5 and ANN, flood season data (testing, fragment)
[figure: observed (OBS) vs FS-M5 and FS-ANN discharge (m3/s), 1 June - 30 August 1996]


M5 model trees and ANNs in rainfall-runoff modelling: predicting flow three hours ahead (Sieve catchment)
Inputs: REt, REt-1, REt-2, REt-3, Qt, Qt-1 (rainfall for 3 past hours, runoff for 2)

The model: Qt+3 = f(REt, REt-1, REt-2, REt-3, Qt, Qt-1)

Prediction of Qt+3, verification performance:
- ANN: RMSE = 11.353, NRMSE = 0.234, COE = 0.9452
- MT:  RMSE = 12.548, NRMSE = 0.258, COE = 0.9331

[figure: observed vs modelled (ANN, MT) flow Q (m3/s), t = 0-180 hrs]


Numerical prediction by M5 model trees: conclusions

- transparency of trees: a model tree is easy to understand (even by managers)
- an M5 model tree is a mixture of locally accurate models
- pruning (reducing size) allows:
  - preventing overfitting
  - generating a family of models of various accuracy and complexity


End of part 1
