
Sanitation Coverage Evolution in Rio Grande do Sul on the New Brazilian Sanitation Legal Framework and CORSAN privatization: a predictive inference approach
Objective: To assess whether the new sanitation legal framework and the privatization of CORSAN will contribute to increasing basic sanitation coverage.
• We will use predictive inference methods, in particular machine learning and structural breaks in multivariate series. The data used are municipal tax revenue and municipal sanitation expenditure, both obtained from TCE-RS, and sanitation coverage.

The working hypothesis is that, given the complexity of the basic sanitation issue (embodied in the many concurrent responsibilities of the federative entities), the results will be ambiguous because numerous other factors are relevant, even if the coefficients point towards an increase in the supply of sanitation services at lower cost, along the lines of what happened with the privatization of the telecommunications sector carried out by the federal government.

1. Research problem
What will be the trajectory of basic sanitation coverage in Rio Grande do Sul in light of the New Sanitation Legal Framework and the privatization of CORSAN? Are there impacts on infant mortality and on preventable diseases?

1.1 General objectives

The aim is to assess the impact of the new legal framework and of the privatization of CORSAN on structural breaks in the sanitation coverage series, taking into account tax revenue (a proxy for municipal GDP) and the sanitation spending of the municipalities of Rio Grande do Sul.
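
To make the notion of a structural break concrete, the sketch below computes a Chow-type F statistic for a single, known break date in a linear trend. It is only an illustration under stated assumptions: the series is synthetic, the helper name `chow_f_statistic` and the break position are hypothetical, and the study itself may rely on a different (multivariate) break test.

```python
import numpy as np


def chow_f_statistic(y, break_index):
    """Chow-type F statistic for one known break in a linear time trend.

    Hypothetical helper: compares one pooled trend line against two
    separate trend lines fitted before and after `break_index`.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    k = 2  # parameters per regime: intercept and slope

    def ssr(t_part, y_part):
        # Sum of squared residuals of an OLS line fitted to one segment
        X = np.column_stack([np.ones_like(t_part), t_part])
        beta, *_ = np.linalg.lstsq(X, y_part, rcond=None)
        resid = y_part - X @ beta
        return float(resid @ resid)

    ssr_pooled = ssr(t, y)
    ssr_split = ssr(t[:break_index], y[:break_index]) + ssr(t[break_index:], y[break_index:])
    return ((ssr_pooled - ssr_split) / k) / (ssr_split / (len(y) - 2 * k))


# Synthetic coverage-like series (assumed values, not data from the study)
t = np.arange(20)
noise = 0.1 * (-1.0) ** t                # small deterministic wiggle
no_break = 0.5 * t + noise               # one trend for the whole sample
with_break = no_break + 5.0 * (t >= 10)  # level shift at t = 10

f_none = chow_f_statistic(no_break, 10)
f_break = chow_f_statistic(with_break, 10)
print(f_none, f_break)
```

A large statistic for the shifted series relative to the unshifted one is what a break test would pick up; with real data the break date is usually unknown and would have to be estimated.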

1.2 Specific objectives

Carry out a predictive inference exercise for the evolution of basic sanitation coverage in the municipalities of Rio Grande do Sul following the privatization of CORSAN under the New Sanitation Legal Framework.

2. Importance of the problem
Basic sanitation coverage is one of the main determinants of long-term population health and, in particular, of infant mortality, as shown by the Portal do Saneamento Básico, citing the WHO, and by the World Bank.
A few articles try to assess sanitation coverage in light of the new sanitation legal framework, but they do not use data-driven techniques. Examples:

• The variable income market and sanitation branch companies: the impact of investments associated with the New Sanitation Marl
• Prognóstico do processo de privatização da Companhia Riograndense de Saneamento (CORSAN): Aspectos históricos e uma comparação com o cenário latino e europeu

3. Hypothesis
The central hypothesis of this work is that increasing sanitation coverage depends on several factors and not only on privatization. The idea is therefore to fit a multivariate time series model and a machine learning algorithm, initially treating the following variables as relevant:

• sanitation coverage, measured by the total population of RS with access to the water and sewage network.
• gross domestic product (PIB) of RS
• aggregate effectiveness of municipal administrations, measured by the IEGM weighted by each municipality's share of the population.
• total investment in basic sanitation, measured by municipal and state investment in sanitation, with data provided by TCE-RS under expenditure account code 123210506
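
The population-weighted IEGM mentioned in the list can be sketched as a weighted average. The scores and populations below are hypothetical placeholders, and the function name `population_weighted_index` is an assumption for illustration, not part of the project code.

```python
def population_weighted_index(scores, populations):
    """Average of municipal scores weighted by each municipality's population share."""
    total_population = sum(populations)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    return sum(s * p for s, p in zip(scores, populations)) / total_population


# Hypothetical IEGM scores and populations for three municipalities
iegm_scores = [0.60, 0.80, 0.70]
populations = [100_000, 300_000, 100_000]

state_iegm = population_weighted_index(iegm_scores, populations)
print(round(state_iegm, 3))
```

The larger municipality pulls the aggregate towards its own score, which is the intended effect of weighting by population share.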

4.1. Data
The data will be extracted from the sources listed below.

Municipal tax revenue (as a proxy for municipal GDP) and municipal sanitation expenditure are obtained from the TCE Open Data Portal (Portal de Dados Abertos do TCE).

Water and sewage network coverage comes from the Sistema Nacional de Informações sobre Saneamento.

5. Expected results
The hypothesis underpinning the National Congress's approval of the New Legal Framework and the decision of the State Government of RS to privatize CORSAN is that sanitation coverage will increase, with a view to achieving Goal 6 of the UN 2030 Agenda.

Nevertheless, we assume here that reaching this goal will also depend on other factors, namely:

a. the effectiveness of municipal management, measured by the IEGM;
b. the growth of municipal GDP, proxied by the total tax revenue of the municipalities;
c. among other variables, the privatization model and/or the public-private partnership to be adopted.
Sanitation variable dictionary

• State GDP (PIB)
• Municipality
• Reference year
• Service provider
• Type of service
• Legal nature
• POP_TOT - Total population of the municipality in the reference year (Source: IBGE)
• POP_URB - Urban population of the municipality in the reference year (Source: IBGE)
• AG001 - Total population served with water supply
• ES001 - Total population served with sewage services
• FN001 - Total direct operating revenue
• FN002 - Direct operating revenue from water
• FN003 - Direct operating revenue from sewage
• FN006 - Total revenue collected
• FN023 - Investment in water supply made by the service provider
• FN024 - Investment in sewage services made by the service provider
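
As a minimal sketch of how the dictionary variables could be turned into coverage indicators, the example below divides the served populations by the total population. The definition of coverage as `AG001 / POP_TOT` and `ES001 / POP_TOT` is an assumption made here for illustration; the two rows are the 1995 and 2020 observations of the dataset loaded in section 2.

```python
import pandas as pd

# 1995 and 2020 observations from the dataset loaded in section 2
snis = pd.DataFrame({
    "Ano": [1995, 2020],
    "POP_TOT": [2106612, 4442352],
    "AG001": [2073000, 3686959],
    "ES001": [1152000, 2701255],
})

# Coverage as the share of the total population served (assumed definition)
snis["water_coverage"] = snis["AG001"] / snis["POP_TOT"]
snis["sewage_coverage"] = snis["ES001"] / snis["POP_TOT"]

print(snis[["Ano", "water_coverage", "sewage_coverage"]].round(3))
```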

Table of contents

1. Scope of Analysis
   1.1 Description of analysis
   1.2 Methodology
   1.3 Hypothesis

2. Importing tools and data

3. Checking the data

4. Data manipulation / Preprocessing
   4.1 Changing the categorical columns to numerical
   4.2 Setting column Time to datetime format
   4.3 Dealing with NaN values

5. EDA: Data explained by descriptive statistics
   5.1 Describing the data
   5.2 Analyzing PIB, Pop_Total_Agua, Pop_Total_Esgoto, ROD_Total, ROD_Agua, ROD_Esgoto, Invest_Agua_Prestador, Invest_Esgoto_Prestador and Population over time
   5.3 Plotting and interpreting the results of time series analysis
       • Time series plot
       • X-Y scatter
       • Boxplot
   5.4 Scatterplot Matrix
   5.5 Correlation Matrix

6. Inferential statistics
   6.1 Hypothesis tests

7. Autocorrelation and Unit Root test
   7.1 Time Series Plots
       • 7.1.1 STL Decomposition
   7.2 Correlogram (autocorrelation)
   7.3 Dickey Fuller test
   7.4 Cross-correlation
   7.5 Conclusion of the results so far and the next steps

8. VAR/VECM
   8.1
   8.2

9. Panel Analysis

10. Machine learning analysis
   10.1 Using Train/Test Split on Dataframe
   10.2 Defining models for linear Regression Sklearn
   10.3 Using SVR Model, KNeighbors Regressor, DecisionTree Regressor, Gradient Booster, RandomForest Regressor, MLP Regressor, DNN, LSTM, CNN, ResNet
   10.4 Interpretation of results

11. Conclusion of Analysis

rpy2 is an interface to R running embedded in a Python process

In [1]:
!pip install rpy2
In [2]:
import rpy2
In [3]:
rpy2.__path__
Out[3]:
In [4]:
%load_ext rpy2.ipython

1. Scope of Analysis
1.1 Description of analysis

The influence of XXXX is evaluated statistically from a set of indicators using time series analysis. The type, strength and interdependence of the examined indicators are checked analytically, and the results are then compared visually.

Problem

How can interactions between individual indicators be measured, and how can their dependence be visualized?

This analysis attempts to answer these questions for selected indicators, first through exploratory and inferential data analysis and then through more advanced statistical models based on machine learning algorithms in Python.

1.2 Methodology
Indicators used are: PIB, Pop_Total_Agua, Pop_Total_Esgoto, ROD_Total, ROD_Agua, ROD_Esgoto, Invest_Agua_Prestador, Invest_Esgoto_Prestador and Population over time.

1.3 Hypothesis

Description of the indicators and hypotheses.

Some hypotheses to be investigated in this work are:

• What is the return on investment in water and sewage in terms of the coverage of these services in the population?
• What is the structure of returns to scale in water and sewage services?
• Has investment in sanitation kept pace with the growth of PIB per capita?
• How do returns to scale behave according to
  • regional criteria
  • legal nature of the provider (autarchy, public company, private company, etc.)

Can the indicator be explained by the other indicators used? Further: is there a significant impact of xxx on xxx? How strong is the correlation between the indicators and xxxx?
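
One rough way to approach the question of whether investment has kept pace with PIB is to compare compound annual growth rates. The sketch below is only indicative: it uses the 1995 and 2020 values from the dataset loaded in section 2, works with nominal totals rather than per capita or deflated figures, and the helper name `cagr` is hypothetical.

```python
def cagr(first_value, last_value, years):
    """Compound annual growth rate between two observations `years` apart."""
    return (last_value / first_value) ** (1.0 / years) - 1.0


# 1995 and 2020 values from the dataset (nominal, not deflated)
pib_1995, pib_2020 = 4.987935e10, 4.801733e11
invest_1995 = 15051835 + 9351714   # FN023 + FN024 in 1995
invest_2020 = 83369277 + 73273169  # FN023 + FN024 in 2020

pib_growth = cagr(pib_1995, pib_2020, 2020 - 1995)
invest_growth = cagr(invest_1995, invest_2020, 2020 - 1995)
print(f"PIB: {pib_growth:.1%} per year, investment: {invest_growth:.1%} per year")
```

Comparing only two endpoints ignores the path in between, so this is a starting point for the hypothesis rather than a test of it.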

2. Loading packages and data


In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from datetime import datetime

from tqdm.notebook import tqdm

sns.set_style("whitegrid")

# Show every column and row when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
In [6]:
import os

# Get the current working directory
cwd = os.getcwd()

# Print the current working directory
print("Current working directory: {0}".format(cwd))
# Print the type of the returned object
print("os.getcwd() returns an object of type: {0}".format(type(cwd)))
In [7]:
print("Current working directory: {0}".format(os.getcwd()))

# Change the current working directory
# os.chdir('')

# Print the current working directory
print("Current working directory: {0}".format(os.getcwd()))
Import the panel data, drop the index column obs, and display the dataframe
In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/ecompfin-ufrgs/tce-corsan_privatization/main/Dados_Agua_Esgoto_v2.csv', sep=';')

# Drop the obs column (the keyword form avoids the pandas FutureWarning)
df = df.drop(columns='obs')
#df.head()
display(df)
     Ano  Pop_Total  AG001_Pop_Total_Agua  AG001A_Pop_Total_Agua_Anterior  ES001_Pop_Total_Esgoto  ES001A_Pop_Total_Esgoto_Anterio
 0  1995    2106612               2073000                             NaN                 1152000                              NaN
 1  1996    2102857               2055296                       2073000.0                 1202065                        1152000.0
 2  1997    1878089                550075                       1767795.0                  226199                        1201123.0
 3  1998    2034283                 52301                        550075.0                  290197                         226199.0
 4  1999    2052257                712063                         52301.0                  358113                         290197.0
 5  2000    2356190               2281036                        593064.0                 1359013                         262793.0
 6  2001    2393755               2387120                       2281036.0                 1415927                        1359013.0
 7  2002    2543289               2496671                       2387120.0                 1508108                        1415927.0
 8  2003    2806858               2706409                       2488871.0                 1484746                        1508108.0
 9  2004    2873165               2779849                       2706409.0                 1543203                        1484746.0
10  2005    2914602               2853609                       2776649.0                 1614217                        1543203.0
11  2006    2970783               2893514                       2853609.0                 1912630                        1612821.0
12  2007    2910820               2896707                       2891172.0                 1906791                        1912630.0
13  2008    3001422               2915281                       2893961.0                 1889345                        1906791.0
14  2009    3567442               3119675                       2896223.0                 2243540                        1889297.0
15  2010    3475969               3079150                       3060614.0                 2029108                        2156180.0
16  2011    3716796               3214166                       3018538.0                 2098310                        1985026.0
17  2012    4089840               3267687                       3157817.0                 2298038                        2083681.0
18  2013    4296198               3417130                       3225617.0                 2539464                        2198923.0
19  2014    4191519               3497471                       3400819.0                 2471296                        2364594.0
20  2015    4088608               3500360                       3459152.0                 2513415                        2401248.0
21  2016    4235815               3578332                       3474465.0                 2542603                        2397243.0
22  2017    4303887               3560580                       3525406.0                 2606022                        2467302.0
23  2018    4352348               3596006                       3545034.0                 2676727                        2565528.0
24  2019    4333851               3663407                       3580889.0                 2637345                        2571132.0
25  2020    4442352               3686959                       3656586.0                 2701255                        2620735.0

    FN001_ROD_Total  FN002_ROD_Agua  FN003_ROD_Esgoto  FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador           PIB
 0        126956775       104080377          22876398                     15051835                        9351714  4.987935e+10
 1        161540160       132676821          28863339                     16324754                       12499107  5.880737e+10
 2         27684599        21678023           6006576                      2764096                        1958632  6.499131e+10
 3         27295557        20036996           7258562                        69951                         582364  6.767312e+10
 4         33637355        26591236           7046119                      2361972                         698287  7.401578e+10
 5        206602084       165573180          41028904                     20490926                       13381602  8.181471e+10
 6        201941667       161108358          40833309                     14957343                       19105796  9.231008e+10
 7        288511514       233776225          54886292                     16427114                       11849180  9.884721e+10
 8        349167293       275386220          70672699                     23381283                       13018441  1.193254e+11
 9        388060102       307608100          78428932                     27794623                       10823603  1.311922e+11
10        420578498       331430481          86887693                     23514044                        8216931  1.363628e+11
11        450347360       348475276         100583306                     38962241                       16804484  1.476226e+11
12        468052978       362162225         105186111                     45565369                       33573195  1.680098e+11
13        507283942       391548812         113792947                     36531859                       61894411  1.902298e+11
14        555677725       426614781         123669696                     57429493                       90571927  2.043449e+11
15        615255904       471812595         137932113                     77983050                      153489332  2.412492e+11
16        693277099       534215570         153017888                    153327969                      212820423  2.650564e+11
17        768027467       592336939         169433273                    118669239                      215592422  2.875870e+11
18        872415665       671005524         195211241                     65629234                      118995944  3.322927e+11
19        977229503       750330152         224948182                     49448296                       68565880  3.578164e+11
20       1122916738       839376504         281631534                     72241922                       38329645  3.819926e+11
21       1136872919       867456620         267608021                     69998901                       61591035  4.087895e+11
22       1230492747       927268209         301355993                     77588929                       51996265  4.232700e+11
23       1275035809       950266021         322759865                     58353153                       40869546  4.572940e+11
24       1384854553      1052075750         330583381                     79108133                       52007350  4.824642e+11
25       1375062029      1034975108         337820578                     83369277                       73273169  4.801733e+11

3. Checking the data


In [9]:
df.info()
df.info() shows 26 rows and 12 columns: 9 of dtype int64 and 3 of dtype float64. PIB has 7 NaN values.
In [10]:
df.shape
Out[10]:
Another overview of the dimensions: 26 rows, 12 columns.
In [11]:
df.values
Out[11]:
Outputs a two-dimensional numpy array with the values of the panel data.
In [12]:
df.columns
Out[12]:
Outputs the column names of the dataframe.
In [13]:
df.index
Out[13]:
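
Beyond shape and column names, a few structural checks can be scripted. The sketch below is one possible set of checks for a yearly dataset like this one; the helper `check_annual_panel` and the three-row sample are hypothetical and only mirror the layout of the dataframe (a year column `Ano` plus numeric columns).

```python
import pandas as pd


def check_annual_panel(frame, year_col="Ano"):
    """Return a list of problems found in a yearly dataset (empty if none)."""
    problems = []
    years = frame[year_col]
    if years.duplicated().any():
        problems.append("duplicated years")
    if not (years.diff().dropna() == 1).all():
        problems.append("years are not consecutive")
    numeric = frame.select_dtypes("number")
    if (numeric < 0).any().any():
        problems.append("negative values")
    return problems


# Small hypothetical sample with a gap between 1996 and 1998
sample = pd.DataFrame({"Ano": [1995, 1996, 1998],
                       "Pop_Total": [2106612, 2102857, 2034283]})
print(check_annual_panel(sample))
```

Calling `check_annual_panel(df)` on the loaded dataframe would apply the same checks to the full series.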

4. Data manipulation / Preprocessing


4.1 Changing the categorical columns to numerical with pd.to_numeric
It was not necessary.
4.2 Setting column Time to datetime format
It was not necessary.
4.3 Dealing with the NAN Values
In [14]:
df.info()
Checking for NaN Values
In [15]:
nulls = pd.DataFrame(df.isna().sum()/len(df))
nulls= nulls.reset_index()
nulls.columns = ['column_name', 'Percentage Null Values']
nulls.sort_values(by='Percentage Null Values', ascending = False)
Out[15]:
                        column_name  Percentage Null Values
 3   AG001A_Pop_Total_Agua_Anterior                0.038462
 5  ES001A_Pop_Total_Esgoto_Anterio                0.038462
 0                              Ano                0.000000
 1                        Pop_Total                0.000000
 2             AG001_Pop_Total_Agua                0.000000
 4           ES001_Pop_Total_Esgoto                0.000000
 6                  FN001_ROD_Total                0.000000
 7                   FN002_ROD_Agua                0.000000
 8                 FN003_ROD_Esgoto                0.000000
 9      FN023_Invest_Agua_Prestador                0.000000
10    FN024_Invest_Esgoto_Prestador                0.000000
11                              PIB                0.000000
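
The same table can be produced in a single chained expression, since `isna().mean()` directly gives the share of missing values per column. The sketch below uses a hypothetical 26-row frame with one lagged column whose first value is missing, which reproduces the 0.038462 (1/26) share seen above. Note that the values are fractions, not percentages, despite the column label.

```python
import numpy as np
import pandas as pd

# Hypothetical 26-row frame: a complete column and a lagged column
# whose first observation is missing
demo = pd.DataFrame({
    "Ano": range(1995, 2021),
    "lagged": [np.nan] + list(range(25)),
})

# isna().mean() gives the share of missing values per column in one step
null_share = (
    demo.isna().mean()
    .rename("Percentage Null Values")
    .rename_axis("column_name")
    .sort_values(ascending=False)
    .reset_index()
)
print(null_share)
```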

List the columns with missing values in cols_with_missing and plot histograms of their distributions
In [16]:
cols_with_missing = ['PIB', 'AG001A_Pop_Total_Agua_Anterior',
'ES001A_Pop_Total_Esgoto_Anterio']
df[cols_with_missing].hist()
Out[16]:

I won't impute values for the NaN.
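
The missing entries in the two lagged columns are the first observation of each, which has no previous year. As a hedged consistency check, and assuming the `_Anterior` columns are meant to hold the previous year's value of the corresponding current-year column, the sketch below rebuilds the lag with `shift(1)` and lists the years where the stored value differs. It uses the first five rows of the water columns from the dataframe shown above.

```python
import numpy as np
import pandas as pd

# First five rows of the water columns from the dataframe above
water = pd.DataFrame({
    "Ano": [1995, 1996, 1997, 1998, 1999],
    "AG001_Pop_Total_Agua": [2073000, 2055296, 550075, 52301, 712063],
    "AG001A_Pop_Total_Agua_Anterior": [np.nan, 2073000, 1767795, 550075, 52301],
})

# Rebuild the lag from the current-year column and compare with the stored one
rebuilt = water["AG001_Pop_Total_Agua"].shift(1)
stored = water["AG001A_Pop_Total_Agua_Anterior"]
comparable = rebuilt.notna() & stored.notna()
mismatch_years = water.loc[comparable & (rebuilt != stored), "Ano"].tolist()
print(mismatch_years)
```

Years flagged this way are candidates for a closer look at the source data before any modelling that relies on the lagged columns.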

The state PIB values for 1995-2001 can be found here: https://arquivofee.rs.gov.br/indicadores/pib-rs/estadual/serie-historica/
In [17]:
df = pd.read_csv('https://raw.githubusercontent.com/ecompfin-ufrgs/tce-corsan_privatization/main/Dados_Agua_Esgoto_v2.csv', sep=';')

# Drop the obs column
df = df.drop(columns='obs')

# Add PIB per capita to the dataframe
# df["PIB_cap"] = df["PIB"]/df["Pop_Total"]

display(df)

df.info()

5. EDA: Data explained by descriptive statistics
Exploratory Data Analysis (EDA) is an approach for summarizing and visualizing the important characteristics of a dataset. It makes few prior assumptions and is also referred to as visual analytics or descriptive statistics. It is the practice of observing and exploring the data before formulating hypotheses, fitting predictors, or moving on to inferential statistics. It typically includes computing simple summary statistics that capture some property of interest in the data, together with visualization.

We want to understand which variables we have, how many records the dataset contains, how many values are missing, what the variable structure is, how the variables relate to each other, and more. In this section we perform a basic EDA, guided by three questions:

• How to explore: with summary statistics, or visually?
• How many variables are analyzed simultaneously: univariate, bivariate, or multivariate?
• What type of variable: categorical or continuous?

5.1 Describing the data

In [18]:
df.describe()
Out[18]:
               Ano     Pop_Total  AG001_Pop_Total_Agua  AG001A_Pop_Total_Agua_Anterior  ES001_Pop_Total_Esgoto  ES001A_Pop_Total_Esgoto_Anterio
count    26.000000  2.600000e+01          2.600000e+01                    2.500000e+01            2.600000e+01                     2.500000e+01
mean   2007.500000  3.232293e+06          2.724379e+06                    2.652649e+06            1.816141e+06                     1.743058e+06
std       7.648529  8.899529e+05          9.748729e+05                    9.833820e+05            7.440697e+05                     7.133759e+05
min    1995.000000  1.878089e+06          5.230100e+04                    5.230100e+04            2.261990e+05                     2.261990e+05
25%    2001.250000  2.431138e+06          2.414508e+06                    2.387120e+06            1.433132e+06                     1.415927e+06
50%    2007.500000  2.986102e+06          2.905994e+06                    2.893961e+06            1.909710e+06                     1.906791e+06
75%    2013.750000  4.166099e+06          3.477386e+06                    3.400819e+06            2.502885e+06                     2.364594e+06
max    2020.000000  4.442352e+06          3.686959e+06                    3.656586e+06            2.701255e+06                     2.620735e+06

       FN001_ROD_Total  FN002_ROD_Agua  FN003_ROD_Esgoto  FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador           PIB
count     2.600000e+01    2.600000e+01      2.600000e+01                 2.600000e+01                   2.600000e+01  2.600000e+01
mean      6.024915e+08    4.615333e+08      1.388586e+08                 4.797481e+07                   5.353310e+07  2.232081e+11
std       4.417264e+08    3.309329e+08      1.102429e+08                 3.733934e+07                   6.070364e+07  1.472559e+11
min       2.729556e+07    2.003700e+07      6.006576e+06                 6.995100e+04                   5.823640e+05  4.987935e+10
25%       2.270794e+08    1.826239e+08      4.449325e+07                 1.744307e+07                   1.201166e+07  9.394436e+10
50%       4.876685e+08    3.768555e+08      1.094895e+08                 4.226380e+07                   3.595142e+07  1.791198e+11
75%       9.510260e+08    7.304990e+08      2.175139e+08                 7.168117e+07                   6.689801e+07  3.514355e+11
max       1.384855e+09    1.052076e+09      3.378206e+08                 1.533280e+08                   2.155924e+08  4.824642e+11
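
The 25%, 50% and 75% rows are quantiles computed with linear interpolation between neighbouring sorted values, which is the pandas default. As a small worked example, the sketch below reimplements that rule in plain Python (the function name `linear_quantile` is hypothetical) and applies it to the 26 values of the Ano column.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    data = sorted(values)
    position = (len(data) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(data) - 1)
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


years = list(range(1995, 2021))  # the 26 values of the Ano column
quartiles = [linear_quantile(years, q) for q in (0.25, 0.50, 0.75)]
print(quartiles)  # [2001.25, 2007.5, 2013.75]
```

These are the same quartiles reported for Ano in the summary statistics.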

In [19]:
%%R -i df
# Same summary in R (the %%R cell magic must be the first line of the cell)

library(tidyverse)
quan <- df %>%
  select_if(is.numeric)   # select only numeric columns
names(quan)               # check the names of the quantitative variables
R[write to console]: ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

R[write to console]: ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

R[write to console]: ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

In [20]:
%%R -i df

summary(quan) # basic summary statistics in one function

5.2 EDA: Analyzing all variables (except the lagged ones) over time

Creating san_df dataframe from all columns except the lagged ones
In [21]:
# Drop the lagged columns (the keyword form avoids the pandas FutureWarning)
san_df = df.drop(columns=['AG001A_Pop_Total_Agua_Anterior',
                          'ES001A_Pop_Total_Esgoto_Anterio'])
san_df.info()
In [22]:
# Drop the lagged columns and PIB
san2_df = df.drop(columns=['AG001A_Pop_Total_Agua_Anterior',
                           'ES001A_Pop_Total_Esgoto_Anterio', 'PIB'])
san2_df
Out[22]:
     Ano  Pop_Total  AG001_Pop_Total_Agua  ES001_Pop_Total_Esgoto  FN001_ROD_Total  FN002_ROD_Agua
 0  1995    2106612               2073000                 1152000        126956775       104080377
 1  1996    2102857               2055296                 1202065        161540160       132676821
 2  1997    1878089                550075                  226199         27684599        21678023
 3  1998    2034283                 52301                  290197         27295557        20036996
 4  1999    2052257                712063                  358113         33637355        26591236
 5  2000    2356190               2281036                 1359013        206602084       165573180
 6  2001    2393755               2387120                 1415927        201941667       161108358
 7  2002    2543289               2496671                 1508108        288511514       233776225
 8  2003    2806858               2706409                 1484746        349167293       275386220
 9  2004    2873165               2779849                 1543203        388060102       307608100
10  2005    2914602               2853609                 1614217        420578498       331430481
11  2006    2970783               2893514                 1912630        450347360       348475276
12  2007    2910820               2896707                 1906791        468052978       362162225
13  2008    3001422               2915281                 1889345        507283942       391548812
14  2009    3567442               3119675                 2243540        555677725       426614781
15  2010    3475969               3079150                 2029108        615255904       471812595
16  2011    3716796               3214166                 2098310        693277099       534215570
17  2012    4089840               3267687                 2298038        768027467       592336939
18  2013    4296198               3417130                 2539464        872415665       671005524
19  2014    4191519               3497471                 2471296        977229503       750330152
20  2015    4088608               3500360                 2513415       1122916738       839376504
21  2016    4235815               3578332                 2542603       1136872919       867456620
22  2017    4303887               3560580                 2606022       1230492747       927268209
23  2018    4352348               3596006                 2676727       1275035809       950266021
24  2019    4333851               3663407                 2637345       1384854553      1052075750
25  2020    4442352               3686959                 2701255       1375062029      1034975108

    FN003_ROD_Esgoto  FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador
 0          22876398                     15051835                        9351714
 1          28863339                     16324754                       12499107
 2           6006576                      2764096                        1958632
 3           7258562                        69951                         582364
 4           7046119                      2361972                         698287
 5          41028904                     20490926                       13381602
 6          40833309                     14957343                       19105796
 7          54886292                     16427114                       11849180
 8          70672699                     23381283                       13018441
 9          78428932                     27794623                       10823603
10          86887693                     23514044                        8216931
11         100583306                     38962241                       16804484
12         105186111                     45565369                       33573195
13         113792947                     36531859                       61894411
14         123669696                     57429493                       90571927
15         137932113                     77983050                      153489332
16         153017888                    153327969                      212820423
17         169433273                    118669239                      215592422
18         195211241                     65629234                      118995944
19         224948182                     49448296                       68565880
20         281631534                     72241922                       38329645
21         267608021                     69998901                       61591035
22         301355993                     77588929                       51996265
23         322759865                     58353153                       40869546
24         330583381                     79108133                       52007350
25         337820578                     83369277                       73273169

5.3 Plotting and interpreting the results of time series analysis

Line plot of all sanitation indicators over the examined period
In [23]:
from pylab import rcParams
# Set the figure size before plotting so that it applies to this figure as well
rcParams['figure.figsize'] = 15, 4
san2_df.plot(x='Ano', kind='line', title='Sanitation evolution over time')
plt.show()
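
Because the plotted series differ by orders of magnitude (populations in millions, revenues in hundreds of millions or more), the smaller ones are hard to read on a shared axis. One possible remedy, sketched here as an assumption rather than as the notebook's method, is to rebase every series to 100 in the first year before plotting. The three rows are the 1995, 2005 and 2020 values of two columns from the table above.

```python
import pandas as pd

# 1995, 2005 and 2020 values of two series with very different magnitudes
levels = pd.DataFrame({
    "Ano": [1995, 2005, 2020],
    "Pop_Total": [2106612, 2914602, 4442352],
    "FN001_ROD_Total": [126956775, 420578498, 1375062029],
}).set_index("Ano")

# Rebase every series to 100 in the first year so they share one scale
indexed = levels / levels.iloc[0] * 100
print(indexed.round(1))
```

Calling `indexed.plot(kind='line')` on the rebased frame would then show relative growth instead of absolute levels.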

In [24]:
pop_df = san_df[['Pop_Total', 'AG001_Pop_Total_Agua',
'ES001_Pop_Total_Esgoto', 'Ano']]
pop_df
Out[24]:
Pop_Total AG001_Pop_Total_Agua ES001_Pop_Total_Esgoto Ano

0 2106612 2073000 1152000 1995

1 2102857 2055296 1202065 1996

2 1878089 550075 226199 1997

3 2034283 52301 290197 1998

4 2052257 712063 358113 1999

5 2356190 2281036 1359013 2000

6 2393755 2387120 1415927 2001

7 2543289 2496671 1508108 2002

8 2806858 2706409 1484746 2003

9 2873165 2779849 1543203 2004

10 2914602 2853609 1614217 2005

11 2970783 2893514 1912630 2006

12 2910820 2896707 1906791 2007

13 3001422 2915281 1889345 2008

14 3567442 3119675 2243540 2009

15 3475969 3079150 2029108 2010

16 3716796 3214166 2098310 2011

17 4089840 3267687 2298038 2012

18 4296198 3417130 2539464 2013

19 4191519 3497471 2471296 2014

20 4088608 3500360 2513415 2015

21 4235815 3578332 2542603 2016

22 4303887 3560580 2606022 2017

23 4352348 3596006 2676727 2018

24 4333851 3663407 2637345 2019

25 4442352 3686959 2701255 2020

In [25]:
pop_df.plot(x='Ano', kind='line', figsize=(15, 4), title='Total population evolution over time')
plt.show()

There was a "strange" drop around 1998. Are the values in the database correct?
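
A simple way to locate such suspicious observations is to flag years in which the served population falls sharply relative to the previous year. The sketch below uses the water-served population for 1995-2000 from the table above; the 50% cut-off is a hypothetical threshold chosen only for illustration.

```python
# Water-served population (AG001_Pop_Total_Agua) for 1995-2000, from the table above
years = [1995, 1996, 1997, 1998, 1999, 2000]
served = [2073000, 2055296, 550075, 52301, 712063, 2281036]

DROP_THRESHOLD = -0.5  # hypothetical cut-off: flag falls of more than 50% in one year

suspicious = []
for previous, current, year in zip(served, served[1:], years[1:]):
    change = current / previous - 1.0
    if change < DROP_THRESHOLD:
        suspicious.append((year, round(change, 3)))

print(suspicious)
```

The flagged years are the ones to verify against the original source before interpreting any trend or break in this period.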
In [26]:
rod_df = san_df[['FN001_ROD_Total', 'FN002_ROD_Agua', 'FN003_ROD_Esgoto',
'Ano']]
rod_df
Out[26]:
    FN001_ROD_Total  FN002_ROD_Agua  FN003_ROD_Esgoto   Ano
 0        126956775       104080377          22876398  1995
 1        161540160       132676821          28863339  1996
 2         27684599        21678023           6006576  1997
 3         27295557        20036996           7258562  1998
 4         33637355        26591236           7046119  1999
 5        206602084       165573180          41028904  2000
 6        201941667       161108358          40833309  2001
 7        288511514       233776225          54886292  2002
 8        349167293       275386220          70672699  2003
 9        388060102       307608100          78428932  2004
10        420578498       331430481          86887693  2005
11        450347360       348475276         100583306  2006
12        468052978       362162225         105186111  2007
13        507283942       391548812         113792947  2008
14        555677725       426614781         123669696  2009
15        615255904       471812595         137932113  2010
16        693277099       534215570         153017888  2011
17        768027467       592336939         169433273  2012
18        872415665       671005524         195211241  2013
19        977229503       750330152         224948182  2014
20       1122916738       839376504         281631534  2015
21       1136872919       867456620         267608021  2016
22       1230492747       927268209         301355993  2017
23       1275035809       950266021         322759865  2018
24       1384854553      1052075750         330583381  2019
25       1375062029      1034975108         337820578  2020

In [27]:
rod_df.plot(x='Ano', kind='line', figsize=(15, 4), title='ROD evolution over time')
plt.show()
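
Since FN001 is described as the total direct operating revenue and FN002 and FN003 as its water and sewage parts, a quick accounting check is to compute the residual between the total and the two components. The sketch below does this for four years taken from the table above. Any interpretation of a non-zero residual (for example, other direct revenue items included in the total) is an assumption that should be checked against the SNIS glossary.

```python
# (year, FN001_ROD_Total, FN002_ROD_Agua, FN003_ROD_Esgoto) rows from the table above
rows = [
    (1995, 126956775, 104080377, 22876398),
    (1996, 161540160, 132676821, 28863339),
    (2015, 1122916738, 839376504, 281631534),
    (2020, 1375062029, 1034975108, 337820578),
]

# Residual = total direct operating revenue minus its water and sewage parts
residuals = {year: total - water - sewage for year, total, water, sewage in rows}
print(residuals)  # {1995: 0, 1996: 0, 2015: 1908700, 2020: 2266343}
```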
In [28]:
invest_df = san_df[['FN023_Invest_Agua_Prestador',
'FN024_Invest_Esgoto_Prestador', 'Ano']]
invest_df
Out[28]:
    FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador   Ano
 0                     15051835                        9351714  1995
 1                     16324754                       12499107  1996
 2                      2764096                        1958632  1997
 3                        69951                         582364  1998
 4                      2361972                         698287  1999
 5                     20490926                       13381602  2000
 6                     14957343                       19105796  2001
 7                     16427114                       11849180  2002
 8                     23381283                       13018441  2003
 9                     27794623                       10823603  2004
10                     23514044                        8216931  2005
11                     38962241                       16804484  2006
12                     45565369                       33573195  2007
13                     36531859                       61894411  2008
14                     57429493                       90571927  2009
15                     77983050                      153489332  2010
16                    153327969                      212820423  2011
17                    118669239                      215592422  2012
18                     65629234                      118995944  2013
19                     49448296                       68565880  2014
20                     72241922                       38329645  2015
21                     69998901                       61591035  2016
22                     77588929                       51996265  2017
23                     58353153                       40869546  2018
24                     79108133                       52007350  2019
25                     83369277                       73273169  2020

In [29]:
# figsize is passed directly; setting rcParams after plotting would only affect later figures
invest_df.plot(x='Ano', kind='line', title='Invest evolution over time', figsize=(15, 4))
plt.show()

Investment peaked in 2011 and 2012.


In [30]:
inv_pop_df = san_df[['FN023_Invest_Agua_Prestador',
'FN024_Invest_Esgoto_Prestador', 'AG001_Pop_Total_Agua',
'ES001_Pop_Total_Esgoto']]
inv_pop_df
Out[30]:
    FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador  AG001_Pop_Total_Agua  ES001_Pop_Total_Esgoto
 0                     15051835                        9351714               2073000                 1152000
 1                     16324754                       12499107               2055296                 1202065
 2                      2764096                        1958632                550075                  226199
 3                        69951                         582364                 52301                  290197
 4                      2361972                         698287                712063                  358113
 5                     20490926                       13381602               2281036                 1359013
 6                     14957343                       19105796               2387120                 1415927
 7                     16427114                       11849180               2496671                 1508108
 8                     23381283                       13018441               2706409                 1484746
 9                     27794623                       10823603               2779849                 1543203
10                     23514044                        8216931               2853609                 1614217
11                     38962241                       16804484               2893514                 1912630
12                     45565369                       33573195               2896707                 1906791
13                     36531859                       61894411               2915281                 1889345
14                     57429493                       90571927               3119675                 2243540
15                     77983050                      153489332               3079150                 2029108
16                    153327969                      212820423               3214166                 2098310
17                    118669239                      215592422               3267687                 2298038
18                     65629234                      118995944               3417130                 2539464
19                     49448296                       68565880               3497471                 2471296
20                     72241922                       38329645               3500360                 2513415
21                     69998901                       61591035               3578332                 2542603
22                     77588929                       51996265               3560580                 2606022
23                     58353153                       40869546               3596006                 2676727
24                     79108133                       52007350               3663407                 2637345
25                     83369277                       73273169               3686959                 2701255


In [31]:
l_inv_pop_df = np.log(inv_pop_df)
# Add Ano (not in logs) to the DataFrame
l_inv_pop_df['Ano'] = san_df['Ano']
l_inv_pop_df.info()
In [32]:
# The plot is in logs so that the scale is more suitable for comparison.
l_inv_pop_df.plot(x='Ano', kind='line', figsize=(15, 4),
                  title='Investment and population served over time (logs)')
plt.show()
Interpretation

• The series are clearly correlated.

In [33]:
ag_pop_df = san_df[['FN023_Invest_Agua_Prestador', 'AG001_Pop_Total_Agua']]
esg_pop_df = san_df[['FN024_Invest_Esgoto_Prestador',
'ES001_Pop_Total_Esgoto']]

l_ag_pop_df = np.log(ag_pop_df)
l_esg_pop_df = np.log(esg_pop_df)

# Add Ano (not in logs) to the DataFrames


l_ag_pop_df['Ano'] = san_df['Ano']
l_esg_pop_df['Ano'] = san_df['Ano']
In [34]:
l_ag_pop_df.plot(x='Ano', kind='line', figsize=(15, 4),
                 title='Water investment and population served over time (logs)')
plt.show()

In [35]:
# The plot is in logs so that the scale is more suitable for comparison.
l_esg_pop_df.plot(x='Ano', kind='line', figsize=(15, 4),
                  title='Sewage investment and population served over time (logs)')
plt.show()
In [36]:
# Add Ano to the DataFrames. assign() returns new frames, which avoids the
# SettingWithCopyWarning that pandas raises when writing into a slice of san_df.
ag_pop_df = ag_pop_df.assign(Ano=san_df['Ano'])
esg_pop_df = esg_pop_df.assign(Ano=san_df['Ano'])
In [37]:
fig, axs = plt.subplots(2)
#fig.suptitle('Vertically stacked subplots')
axs[0].plot(ag_pop_df['AG001_Pop_Total_Agua'])
# Turn off xtick labels
axs[0].set_xticklabels([])
axs[1].plot(ag_pop_df['Ano'], ag_pop_df['FN023_Invest_Agua_Prestador'])
axs[0].set_title('Pop_Total_Agua')
axs[1].set_title('Invest_Agua_Prestador')
Out[37]:
In [38]:
fig, axs = plt.subplots(2)
#fig.suptitle('Vertically stacked subplots')
axs[0].plot(esg_pop_df['ES001_Pop_Total_Esgoto'])
# Turn off xtick labels
axs[0].set_xticklabels([])
axs[1].plot(esg_pop_df['Ano'], esg_pop_df['FN024_Invest_Esgoto_Prestador'])
axs[0].set_title('Pop_Total_Esgoto')
axs[1].set_title('Invest_Esgoto_Prestador')
Out[38]:

In [39]:
# Add "per capita" investment to the DataFrame
df["inv_esg_cap"] = df["FN024_Invest_Esgoto_Prestador"] / df["ES001_Pop_Total_Esgoto"]
df["inv_ag_cap"] = df["FN023_Invest_Agua_Prestador"] / df["AG001_Pop_Total_Agua"]
In [40]:
df.plot(x='Ano', y=['inv_esg_cap', 'inv_ag_cap'], kind='line', figsize=(15, 4),
        title='Water and sewage investment "per capita" over time')
plt.show()
In [41]:
pib_df = san_df[['PIB', 'Ano']]

pib_df.plot(x='Ano', kind='line', title='PIB evolution over time', figsize=(15, 4))
plt.show()

Scatterplot of Total population


In [42]:
# create figure and axis objects with subplots()
fig,ax=plt.subplots()
ax.scatter(san_df['Pop_Total'], san_df['AG001_Pop_Total_Agua'], marker="o",
label ='Pop_Total_Agua')
ax.scatter(san_df['Pop_Total'], san_df['ES001_Pop_Total_Esgoto'], marker="o",
label = 'Pop_Total_Esgoto')
plt.legend(loc='upper left');
plt.xlabel("Total Population")
plt.show()
Scatterplot of Invest
In [43]:
fig,ax=plt.subplots()
ax.scatter(san_df['FN023_Invest_Agua_Prestador'],
san_df['FN024_Invest_Esgoto_Prestador'], marker="o", label ='Invest')
plt.legend(loc='upper left');
plt.xlabel("Invest Agua Prestador")
plt.ylabel("Invest Esgoto Prestador")
plt.show()

Scatterplot of ROD
In [44]:
fig,ax=plt.subplots()
ax.scatter(san_df['FN001_ROD_Total'], san_df['FN002_ROD_Agua'], marker="o",
label ='ROD_Agua')
ax.scatter(san_df['FN001_ROD_Total'], san_df['FN003_ROD_Esgoto'], marker="o",
label = 'ROD_Esgoto')
plt.legend(loc='upper left');
plt.xlabel("ROD Total")
plt.show()
Interpretation

• There is a fairly strong positive correlation between the series above.
• The slope (derivative) of ROD Esgoto with respect to ROD Total is smaller than that of ROD Água.

Boxplots
In [45]:
# Import libraries
#import matplotlib.pyplot as plt
#import numpy as np

data_pop = [san_df['Pop_Total'],
san_df['AG001_Pop_Total_Agua'],san_df['ES001_Pop_Total_Esgoto']]

fig = plt.figure(figsize =(10, 7))

# Creating axes instance


ax = fig.add_axes([0, 0, 1, 1])

## add patch_artist=True option to ax.boxplot()


## to get fill color
# Creating plot
bp = ax.boxplot(data_pop, patch_artist=True)

## change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    # change outline color
    box.set(color='orange', linewidth=2)
    # change fill color
    box.set(facecolor='white')

## change color and linewidth of the whiskers
for whisker in bp['whiskers']:
    whisker.set(color='orange', linewidth=2)

## change color and linewidth of the caps
for cap in bp['caps']:
    cap.set(color='orange', linewidth=2)

## change color and linewidth of the medians
for median in bp['medians']:
    median.set(color='black', linewidth=2)

## change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)

## Custom x-axis labels


ax.set_xticklabels(['Pop_Total', 'Pop_Total_Agua', 'Pop_Total_Esgoto'])

# show plot
plt.show()

In [46]:
data_rod = [san_df['FN001_ROD_Total'],
san_df['FN002_ROD_Agua'],san_df['FN003_ROD_Esgoto']]

fig = plt.figure(figsize =(10, 7))

# Creating axes instance


ax = fig.add_axes([0, 0, 1, 1])

## add patch_artist=True option to ax.boxplot()


## to get fill color
# Creating plot
bp = ax.boxplot(data_rod, patch_artist=True)

## change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    # change outline color
    box.set(color='orange', linewidth=2)
    # change fill color
    box.set(facecolor='white')

## change color and linewidth of the whiskers
for whisker in bp['whiskers']:
    whisker.set(color='orange', linewidth=2)

## change color and linewidth of the caps
for cap in bp['caps']:
    cap.set(color='orange', linewidth=2)

## change color and linewidth of the medians
for median in bp['medians']:
    median.set(color='black', linewidth=2)

## change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)

## Custom x-axis labels


ax.set_xticklabels(['ROD_Total', 'ROD_Agua', 'ROD_Esgoto'])

# show plot
plt.show()
• Esgoto (sewage) revenue was much less variable over the period (as measured by the size of the box).

In [47]:
data_invest = [san_df['FN023_Invest_Agua_Prestador'],
san_df['FN024_Invest_Esgoto_Prestador']]

fig = plt.figure(figsize =(10, 7))

# Creating axes instance


ax = fig.add_axes([0, 0, 1, 1])

## add patch_artist=True option to ax.boxplot()


## to get fill color
# Creating plot
bp = ax.boxplot(data_invest, patch_artist=True)

## change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    # change outline color
    box.set(color='orange', linewidth=2)
    # change fill color
    box.set(facecolor='white')

## change color and linewidth of the whiskers
for whisker in bp['whiskers']:
    whisker.set(color='orange', linewidth=2)

## change color and linewidth of the caps
for cap in bp['caps']:
    cap.set(color='orange', linewidth=2)

## change color and linewidth of the medians
for median in bp['medians']:
    median.set(color='black', linewidth=2)

## change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)

## Custom x-axis labels


ax.set_xticklabels(['Invest_Agua_Prestador', 'Invest_Esgoto_Prestador'])

# show plot
plt.show()

5.4 Scatterplot Matrix


In [48]:
sns.pairplot(san_df, diag_kind="kde", corner=True)
Out[48]:

In [49]:
# If you prefer a smaller plot, use fewer variables. For instance, if you only
# want Pop_Total, Pop_Total_Agua and Pop_Total_Esgoto:
g = sns.pairplot(san_df, vars=['Pop_Total', 'AG001_Pop_Total_Agua',
                               'ES001_Pop_Total_Esgoto'], corner=True)
# Alternatively, with different x and y variables (seaborn names these x_vars and y_vars):
# g = sns.pairplot(san_df, x_vars=['Pop_Total', 'AG001_Pop_Total_Agua', 'ES001_Pop_Total_Esgoto'],
#                  y_vars=['Ano', 'PIB'])

In [50]:
%%R -i san_df

pairs(san_df[,2:4], pch = 19, lower.panel = NULL)


In [51]:
%%R -i san_df

pairs(san_df[,5:7], pch = 19, lower.panel = NULL)


In [52]:
%%R -i san_df

pairs(san_df[,8:9], pch = 19, lower.panel = NULL)


• Visually, we can see that the association/correlation in this dataset is strong.

5.5 Correlation Matrix

In [53]:
san_corr = san_df.corr()

display(san_corr)

                                    Ano  Pop_Total  AG001_Pop_Total_Agua  ES001_Pop_Total_Esgoto  FN001_ROD_Total  FN002_ROD_Agua  FN003_ROD_Esgoto  FN023_Invest_Agua_Prestador  FN024_Invest_Esgoto_Prestador       PIB
Ano                            1.000000   0.977511              0.830031                0.917150         0.974560        0.975304          0.967095                     0.753730                       0.505923  0.974485
Pop_Total                      0.977511   1.000000              0.859111                0.941020         0.960891        0.963361          0.946409                     0.789297                       0.580460  0.958637
AG001_Pop_Total_Agua           0.830031   0.859111              1.000000                0.969302         0.823674        0.831160          0.794923                     0.698630                       0.481933  0.767005
ES001_Pop_Total_Esgoto         0.917150   0.941020              0.969302                1.000000         0.909355        0.914245          0.888510                     0.741539                       0.518565  0.872967
FN001_ROD_Total                0.974560   0.960891              0.823674                0.909355         1.000000        0.999800          0.997628                     0.701131                       0.406165  0.992051
FN002_ROD_Agua                 0.975304   0.963361              0.831160                0.914245         0.999800        1.000000          0.996089                     0.705290                       0.412599  0.990895
FN003_ROD_Esgoto               0.967095   0.946409              0.794923                0.888510         0.997628        0.996089          1.000000                     0.677423                       0.372100  0.992249
FN023_Invest_Agua_Prestador    0.753730   0.789297              0.698630                0.741539         0.701131        0.705290          0.677423                     1.000000                       0.880344  0.706282
FN024_Invest_Esgoto_Prestador  0.505923   0.580460              0.481933                0.518565         0.406165        0.412599          0.372100                     0.880344                       1.000000  0.436653
PIB                            0.974485   0.958637              0.767005                0.872967         0.992051        0.990895          0.992249                     0.706282                       0.436653  1.000000

In [54]:
#san_corr.style.background_gradient(cmap='coolwarm')

• Using a Seaborn heatmap

In [55]:
df_corr = matrix = np.triu(san_df.corr())
sns.heatmap(san_df.corr(), annot=True, mask=matrix, vmin=-1, vmax=1, center=
0, cmap= 'coolwarm')
Out[55]:

Note: the correlation measured above is Pearson's. This metric is only suitable for measuring a linear
association between the variables. As an alternative, we could use the (nonparametric) rank correlation
coefficients, Spearman's rho or Kendall's tau.
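To illustrate why the choice of coefficient matters, the sketch below compares the three coefficients on a synthetic, strictly increasing but nonlinear relationship. The frame `demo` and its columns are made up for illustration; only the `method` argument of `DataFrame.corr` is the point.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but nonlinear relationship (illustration only)
x = np.arange(1, 11, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman and Kendall work on ranks
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]
kendall = demo.corr(method="kendall").loc["x", "y"]

print(f"Pearson={pearson:.3f}, Spearman={spearman:.3f}, Kendall={kendall:.3f}")
```

Both rank coefficients equal 1 here because the relationship is perfectly monotonic, while Pearson's coefficient is noticeably lower. For the notebook's data, `san_df.corr(method='spearman')` would give the rank-based counterpart of the matrix above.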

6. Inferential statistics
(Some possible) Nonparametric tests
Three sorts of nonparametric test are available here: for a difference between groups, for randomness, and for
(rank) correlation.

Difference tests

We can carry out a nonparametric test for a difference between two populations or groups.

• Sign test: This test is based on the fact that if two samples, x and y, are drawn randomly from the same
distribution, the probability that xi > yi, for each observation i, should equal 0.5. The test statistic is w, the
number of observations for which xi > yi. Under the null hypothesis this follows the Binomial distribution with
parameters (n, 0.5), where n is the number of observations.
• Wilcoxon rank-sum test. This test proceeds by ranking the observations from both samples jointly, from
smallest to largest, then finding the sum of the ranks of the observations from one of the samples. The two
samples do not have to be of the same size, and if they differ the smaller sample is used in calculating the rank-
sum. Under the null hypothesis that the samples are drawn from populations with the same median, the
probability distribution of the rank-sum can be computed for any given sample sizes; and for reasonably large
samples a close Normal approximation exists.

• Wilcoxon signed-rank test. This is designed for matched data pairs such as, for example, the values of a
variable for a sample of individuals before and after some treatment. The test proceeds by finding the
differences between the paired observations, xi – yi, ranking these differences by absolute value, then
assigning to each pair a signed rank, the sign agreeing with the sign of the difference. One then calculates W+,
the sum of the positive signed ranks. As with the rank-sum test, this statistic has a well-defined distribution
under the null that the median difference is zero, which converges to the Normal for samples of reasonable
size.

Note: these tests were not carried out here.
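Although the difference tests were not run in this notebook, the three of them can be sketched with SciPy on synthetic paired data. The arrays `before` and `after` and the helper `sign_test` are hypothetical; the sign test is written out by hand from the Binomial(n, 0.5) null described above, while `ranksums` and `wilcoxon` are the SciPy functions for the rank-sum and signed-rank tests.

```python
import numpy as np
from scipy.stats import binom, ranksums, wilcoxon

# Synthetic paired data (illustration only): every "after" value exceeds "before"
before = np.array([10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 14.0, 10.5])
after = before + np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

def sign_test(x, y):
    """Two-sided sign test; ties (x == y) are dropped."""
    diff = np.asarray(x) - np.asarray(y)
    diff = diff[diff != 0]
    n = diff.size
    w = int((diff > 0).sum())  # number of pairs with x > y
    # Under H0, w follows Binomial(n, 0.5)
    p = min(1.0, 2 * min(binom.cdf(w, n, 0.5), binom.sf(w - 1, n, 0.5)))
    return w, n, p

w, n, p_sign = sign_test(after, before)
stat_rs, p_rs = ranksums(after, before)  # Wilcoxon rank-sum (treats samples as independent)
stat_sr, p_sr = wilcoxon(after, before)  # Wilcoxon signed-rank (paired samples)

print(f"sign test: w={w}, n={n}, p={p_sign:.4f}")
print(f"rank-sum p={p_rs:.4f}, signed-rank p={p_sr:.4f}")
```

With all eight differences positive, the sign test gives w = 8 and a two-sided p-value of 2/256. The rank-sum test is shown only for completeness, since it ignores the pairing.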

Correlation

Spearman's rank correlation rho and Kendall's rank correlation tau.

• In Python, use:

# calculate Spearman's correlation
coef, p = spearmanr(data1, data2)

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

# calculate Kendall's correlation
coef, p = kendalltau(data1, data2)

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html

where p is the p-value of the test (the null hypothesis is that the correlation equals zero).

As an example, I test ROD_Esgoto against Invest_Esgoto (the lowest linear correlation found, 0.37).
In [57]:
from scipy.stats import spearmanr

corr, p_value = spearmanr(san_df['FN003_ROD_Esgoto'],


san_df['FN024_Invest_Esgoto_Prestador'])

print('corr = %.4f.' % corr)


print('p-value = %.4f.' % p_value)
In [58]:
from scipy.stats import kendalltau

tau, p_tau = kendalltau(san_df['FN003_ROD_Esgoto'],


san_df['FN024_Invest_Esgoto_Prestador'])

print('tau = %.4f.' % tau)


print('p_tau = %.4f.' % p_tau)

• Note that the p-values are close to zero.

Note: we should not (and cannot) run parametric tests with a sample of this size.

7. Autocorrelation
7.1 Time Series Plots

In [59]:
l_san_df = np.log(san_df)
l_san_df.info()
In [60]:
# Reuse the axes returned by plot() instead of calling plt.subplot(111) again,
# which triggered a MatplotlibDeprecationWarning
ax = l_san_df.plot(x='Ano', kind='line', title='Sanitation evolution over time',
                   figsize=(15, 4))

# Triple the height of the axes
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width, box.height*3])

# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()
All the variables above are in logs.

Note that many of them appear to be cointegrated.

7.1.1 STL Decomposition (Seasonal, Trend and Loess)

In [61]:
# import required packages, prepare the graphics environment, and prepare the data.

import matplotlib.pyplot as plt


import pandas as pd
import seaborn as sns
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()
sns.set_style("darkgrid")

plt.rc("figure", figsize=(16, 12))


plt.rc("font", size=13)
In [62]:
!pip install statsmodels==v0.13.2
In [63]:
from statsmodels.tsa.seasonal import STL
stl = STL(l_san_df["Pop_Total"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()

Total population (Pop_Total) has a strong trend.

In [64]:
stl = STL(l_san_df["AG001_Pop_Total_Agua"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [65]:
stl = STL(l_san_df["ES001_Pop_Total_Esgoto"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [66]:
stl = STL(l_san_df["FN001_ROD_Total"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [67]:
stl = STL(l_san_df["FN002_ROD_Agua"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [68]:
stl = STL(l_san_df["FN003_ROD_Esgoto"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [69]:
stl = STL(l_san_df["FN023_Invest_Agua_Prestador"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
In [70]:
stl = STL(l_san_df["FN024_Invest_Esgoto_Prestador"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()
Invest_Esgoto_Prestador is the series in which I can identify the most patterns (it appears to
have some seasonal pattern).

In [71]:
stl = STL(l_san_df["PIB"], seasonal=7, period = 25)
res = stl.fit()
fig = res.plot()

7.1.2 Forecasting with STL

See here.

The differenced series look like white noise (judging by the correlogram). Hence, there is no
need to proceed with univariate forecasting.
In [72]:
# from IPython.display import Image
# Image("d_log_series.png")

7.2 Correlogram

• The first differences of the series are white noise. The only exception that deserves investigation is,
again, the differenced Invest_Esgoto_Prestador series, whose first ACF lag and first two PACF lags are
borderline significant at the 5% level. Following the Box and Jenkins methodology, the best model
according to the traditional information criteria is an MA(2). Note that the sample is not large enough
to find a regularity that would allow the methodology to be used with confidence.
• With the first difference of the logs (the series of returns), all of them look like white noise.

In [73]:
# first difference of the series in levels; diff() subtracts the previous observation
first_diff = san_df["FN024_Invest_Esgoto_Prestador"].diff().dropna()
In [74]:
# https://www.statsmodels.org/devel/examples/notebooks/generated/tsa_arma_0.html
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from scipy import stats
from statsmodels.graphics.api import qqplot
In [75]:
first_diff.plot(figsize=(12,8))
Out[75]:

In [76]:
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(first_diff.values.squeeze(), lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
# method='ywm' keeps the PACF estimates inside [-1, 1] and avoids the FutureWarning
fig = sm.graphics.tsa.plot_pacf(first_diff, lags=10, ax=ax2, method='ywm')

In [77]:
arma_mod21 = ARIMA(first_diff, order=(2, 0, 1)).fit()
print(arma_mod21.params)
# print(arma_mod21.aic, arma_mod21.bic, arma_mod21.hqic)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/base/tsa_model.py:471:
ValueWarning: An unsupported index was provided and will be ignored when e.g.
forecasting.
self._init_dates(dates, freq)
In [78]:
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
ax = arma_mod21.resid.plot(ax=ax)
In [79]:
resid = arma_mod21.resid
In [80]:
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
# method='ywm' keeps the PACF estimates inside [-1, 1] and avoids the FutureWarning
fig = sm.graphics.tsa.plot_pacf(resid, lags=10, ax=ax2, method='ywm')
In [81]:
r, q, p = sm.tsa.acf(resid.values.squeeze(), fft=True, qstat=True)
# Lags run from 1 to len(r) - 1, so the table adapts to the default number of lags
data = np.c_[np.arange(1, len(r)), r[1:], q, p]
In [82]:
table = pd.DataFrame(data, columns=["lag", "AC", "Q", "Prob(>Q)"])
print(table.set_index("lag"))

• This indicates an adequate fit.

7.3 Dickey-Fuller tests

In [83]:
# Dickey-Fuller
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries, window=12, cutoff=0.01):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    # Plot rolling statistics
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Perform the (augmented) Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag=5)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used',
                                             'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)

    print(dfoutput)
In [84]:
test_stationarity(first_diff, window=8)

• In Gretl I ran the test with the no-constant option (judging by the plot of the series, it seems
more appropriate). The ADF test above uses the option with a constant. The cell that follows does the
same with the adfuller function from statsmodels.
  ◦ Without a constant, the asymptotic p-value is 0.0007927 (computed in Gretl), that is, we reject the
  null hypothesis that the series is non-stationary.

• Basically, all of these series are I(1), although confidence in such a statement is limited by the
sample size (some terms in the mathematics of these models hardly vanish with a T this small, and they
need to vanish for stationarity to be guaranteed).

In [93]:
# https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
# regression='n' is the no-constant option (formerly 'nc'). The result is stored in
# adf_res rather than df so that the DataFrame df is not overwritten by a tuple.
adf_res = sm.tsa.stattools.adfuller(first_diff, maxlag=8, regression='n',
                                    autolag='AIC', regresults=True)

# Dickey-Fuller statistic
print('The Dickey-Fuller test statistic is {:.4f}.'.format(adf_res[0]))

# p-value
print('The p-value associated with the Dickey-Fuller test statistic is {:.4f}.'.format(adf_res[1]))
In [94]:
df_kpss = sm.tsa.stattools.kpss(first_diff, regression='c')

# KPSS statistic
print('The KPSS test statistic is {}.'.format(round(df_kpss[0], 4)))
# p-value
print('The p-value associated with the KPSS test statistic is {}.'.format(round(df_kpss[1], 4)))
InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is greater than the p-value returned.

The null hypothesis of the KPSS test is stationarity. For the ADF test, the null is that the series
contains a unit root.
7.4 Cross-Correlogram

• The variables all appear to be coincident, with no clear lead or lag (I made the plots in Gretl).

In [87]:
# calculate the cross-correlation
cc = sm.tsa.stattools.ccf(san_df['FN023_Invest_Agua_Prestador'],
                          san_df['FN024_Invest_Esgoto_Prestador'], adjusted=False)
In [88]:
cc
Out[88]:
In [95]:
# Interpretation
print('The cross-correlation at lag 0 is {}.'.format(round(cc[0], 4)))
print('The cross-correlation at lag 1 is {}.'.format(round(cc[1], 4)))

8. VAR/VECM
Vector Autoregression (VAR) is a multivariate forecasting algorithm that is used when two or more time series
influence each other. That is, the relationship between the time series involved is bi-directional. To train and
forecast VAR models in python we can use the library statsmodels.

The basic requirements in order to use VAR are:

• You need at least two time series.
• The time series should influence each other.

It is called an autoregressive model because each time series is modeled as a function of its own past
values, that is, the predictors are the lags (time-delayed values) of the series. The primary difference
between VAR and models like AR, ARMA and ARIMA is that those models are uni-directional: the predictors
influence Y and not vice versa. Vector autoregression (VAR), in contrast, is bi-directional: the
variables influence each other. The next sections go into more detail. We want to gain an understanding
of:

• The intuition behind the VAR model formula
• How to check the bi-directional relationship using Granger causality
• The procedure for building a VAR model in Python
• How to determine the right order of the VAR model
• Interpreting the results of the VAR model
• How to generate forecasts on the original scale of the time series

8.1 Intuition behind VAR Model Formula


AR models: the time series is modeled as a linear combination of its own lags. That is, the past values of the
series are used to forecast the current and future values. A typical AR(p) model equation looks like this:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \epsilon_t.$$

In the VAR model, each time series is modeled as a linear combination of past values of itself and the past
values of the other time series in the system. Since there are multiple time series that influence each other,
the model is a system of equations with one equation per time series. That is, if you have 5 time series that
influence each other, there will be a system of 5 equations. Suppose you have two time series, Y1 and Y2,
and you need to forecast the values of these variables at time t. To calculate Y1(t), VAR will use the past
values of both Y1 and Y2. Likewise, to compute Y2(t), the past values of both Y1 and Y2 are used. For
example, the system of equations for a VAR(1) model with two time series is as follows:

$$Y_{1,t} = \alpha_1 + \beta_{11,1} Y_{1,t-1} + \beta_{12,1} Y_{2,t-1} + \epsilon_{1,t}$$
$$Y_{2,t} = \alpha_2 + \beta_{21,1} Y_{1,t-1} + \beta_{22,1} Y_{2,t-1} + \epsilon_{2,t},$$

where $Y_{1,t-1}$ and $Y_{2,t-1}$ are the first lags of the time series Y1 and Y2, respectively. The above system
is referred to as a VAR(1) model because each equation is of order 1, that is, it contains up to one lag of each
of the predictors (Y1 and Y2). Since the Y terms in the equations are interrelated, the Y's are considered
endogenous variables rather than exogenous predictors. Likewise, the second-order VAR(2) model for two
variables would include up to two lags of each time series (Y1 and Y2), i.e.,

$$Y_{1,t} = \alpha_1 + \beta_{11,1} Y_{1,t-1} + \beta_{12,1} Y_{2,t-1} + \beta_{11,2} Y_{1,t-2} + \beta_{12,2} Y_{2,t-2} + \epsilon_{1,t}$$
$$Y_{2,t} = \alpha_2 + \beta_{21,1} Y_{1,t-1} + \beta_{22,1} Y_{2,t-1} + \beta_{21,2} Y_{1,t-2} + \beta_{22,2} Y_{2,t-2} + \epsilon_{2,t}.$$

9. Panel analysis
10. Machine Learning Analysis
• The goal is to explain xxxx by the independent variables of the time-panel DataFrame df

10.1 Using Train/Test Split on Dataframe

In [96]:
# df must be the panel DataFrame here. In the original run it had been overwritten by the
# tuple returned by adfuller() in section 7.3, which raised
# AttributeError: 'tuple' object has no attribute 'head'
df.head(1)


In [ ]:
# X_train, X_test, y_train, y_test = train_test_split(df.drop(columns = ""),
#                                                     df.Sentiment_indicators,
#                                                     test_size = 0.3)
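Since the panel is indexed by year, a random split can place later years in the training set and earlier years in the test set. The sketch below shows a chronological alternative using `train_test_split` with `shuffle=False`. The frame `demo` and the column names `feature` and `target` are placeholders, because the source has not yet fixed the dependent variable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Hypothetical yearly frame; "feature" and "target" are placeholder names
years = np.arange(1995, 2021)
demo = pd.DataFrame({"Ano": years, "feature": rng.normal(size=years.size)})
demo["target"] = 2.0 * demo["feature"] + rng.normal(scale=0.1, size=years.size)

X = demo[["feature"]]
y = demo["target"]

# shuffle=False keeps chronological order: the last 30% of the years form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
print(f"train R^2 = {model.score(X_train, y_train):.3f}, "
      f"test R^2 = {model.score(X_test, y_test):.3f}")
```

With 26 yearly observations this leaves 18 years for training and 8 for testing, so any test score should be read with caution.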

10.2 Defining models for linear Regression Sklearn

• Linear Regression
• Lasso Regression
• Ridge Regression
• ElasticNet

In [ ]:
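from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

def models_automation(models, X_train, y_train):
    # Fit each model and report R^2 on the training and test sets.
    # X_test and y_test are read from the enclosing (global) scope.
    for model in models:
        model.fit(X_train, y_train)
        print(f"{model.__class__.__name__}: Train -> {model.score(X_train, y_train)}, "
              f"Test -> {model.score(X_test, y_test)}")

linear_models = [LinearRegression(), Lasso(), Ridge(), ElasticNet()]
models_automation(linear_models, X_train, y_train)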

10.3 Using SVR Model, KNeighbors Regressor, DecisionTree Regressor, Gradient Booster, RandomForest
Regressor, MLP Regressor

In [ ]:
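from sklearn.svm import SVR
svr = [SVR()]
models_automation(svr, X_train, y_train)
In [ ]:
from sklearn.neighbors import KNeighborsRegressor
knn = [KNeighborsRegressor()]
models_automation(knn, X_train, y_train)
In [ ]:
from sklearn.tree import DecisionTreeRegressor
dtr = [DecisionTreeRegressor()]
models_automation(dtr, X_train, y_train)
In [ ]:
from sklearn.neural_network import MLPRegressor
mlpr = [MLPRegressor(max_iter = 1000)]
models_automation(mlpr, X_train, y_train)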

10.4 Interpretation of results

• All regression results show a very high statistical significance of the regression model in explaining the
xxxx rate by the independent variables
• The model is significant at the xxx% level

11. Conclusion of analysis


At a highly aggregated level, this analysis examined the xxxx impacts of xxxx. It used a panel DataFrame
constructed over the period 1995-2020.
The goal of this work was to test xxx different hypotheses in xxxx different ways, e.g. with tools at different
levels to analyse and test the hypotheses:

• EDA (by visualisation only)
• time series linear models
• machine learning algorithms

One major goal of this analysis was to explain how the indicators are related to each other and how they were
affected by xxxx.

• Next time, use different indicators.
• Do a more specific EDA over time at a lower aggregation level to gain more insight into differences
in the development of the indicators over time.

Data sources
E-CompFin/UFRGS.

Agência Nacional de Águas e Saneamento Básico - ANASB.

Sistema Nacional de Informações sobre Saneamento - SNIS. Look under Água e Esgotos - Informações e
indicadores agregados.

State GDP (PIB Estadual).

Portal de Dados Abertos do TCE.

Suggestions for next steps?
