
Graduate Program in Computer Science (Pós-Graduação em Ciência da Computação)

An Approach for Profiling Distributed Applications Through Network Traffic Analysis

By
THIAGO PEREIRA DE BRITO VIEIRA
Master's Dissertation

Universidade Federal de Pernambuco
posgraduacao@cin.ufpe.br
www.cin.ufpe.br/~posgraduacao

RECIFE, MARCH 2013

UNIVERSIDADE FEDERAL DE PERNAMBUCO
CENTRO DE INFORMÁTICA
GRADUATE PROGRAM IN COMPUTER SCIENCE

THIAGO PEREIRA DE BRITO VIEIRA

"AN APPROACH FOR PROFILING DISTRIBUTED APPLICATIONS THROUGH NETWORK TRAFFIC ANALYSIS"

THIS WORK WAS PRESENTED TO THE GRADUATE PROGRAM IN COMPUTER SCIENCE OF THE CENTRO DE INFORMÁTICA OF THE UNIVERSIDADE FEDERAL DE PERNAMBUCO AS A PARTIAL REQUIREMENT FOR OBTAINING THE DEGREE OF MASTER IN COMPUTER SCIENCE.

ADVISOR: Vinicius Cardoso Garcia
CO-ADVISOR: Stenio Flavio de Lacerda Fernandes

RECIFE, MARCH 2013

Cataloging at source
Librarian Jane Souto Maior, CRB4-571

Vieira, Thiago Pereira de Brito
An approach for profiling distributed applications through network traffic analysis / Thiago Pereira de Brito Vieira. - Recife: O Autor, 2013.
xv, 71 pages: fig., tab.
Advisor: Vinicius Cardoso Garcia.
Dissertation (Master's) - Universidade Federal de Pernambuco. CIn, Ciência da Computação, 2013.
Includes bibliography.
1. Ciência da computação. 2. Sistemas distribuídos. I. Garcia, Vinicius Cardoso (advisor). II. Título.
004 CDD (23. ed.) MEI2013 054

Master's Dissertation presented by Thiago Pereira de Brito Vieira to the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco, under the title "An Approach for Profiling Distributed Applications Through Network Traffic Analysis", advised by Prof. Vinicius Cardoso Garcia and approved by the Examination Committee formed by the professors:

______________________________________________
Prof. José Augusto Suruagy Monteiro
Centro de Informática / UFPE

______________________________________________
Prof. Denio Mariz Timoteo de Souza
Instituto Federal da Paraíba

_______________________________________________
Prof. Vinicius Cardoso Garcia
Centro de Informática / UFPE

Approved for printing.

Recife, March 5, 2013

___________________________________________________
Profa. Edna Natividade da Silva Barros
Coordinator of the Graduate Program in Computer Science of the Centro de Informática of the Universidade Federal de Pernambuco.

I dedicate this dissertation to my parents, for teaching me to always study and work in order to grow as a person and as a professional.

Acknowledgements

First of all, I would like to thank God for my life, my health, and all the opportunities created in my life.

I thank my parents, João and Ana, for all their love, care, and encouragement to always pursue personal and professional growth, and for always supporting my decisions and showing themselves concerned and committed to helping me reach my goals.

I thank Alynne, my future wife, for all her love and patience throughout our relationship, especially during these two intense years of the master's program, in which her words of support in difficult moments and her light-heartedness were essential to give me more energy and will to carry on with ever more dedication.

I thank the Agência Nacional de Telecomunicações (Anatel) for allowing and providing one more learning experience in my life. I would especially like to thank Rodrigo Barbosa, Túlio Barbosa, and Jane Teixeira for understanding and supporting me in the challenge of pursuing a master's degree. I thank Marcio Formiga for the support before and during the master's, and for understanding the effort required to overcome this challenge. I thank Wesley Paesano, Marcelo de Oliveira, Regis Novais, and Danilo Balby for the support that allowed me to dedicate myself to the master's during these two years. I also thank my friends at Anatel who, directly or indirectly, helped me face this challenge, especially Ricardo de Holanda, Rodrigo Curi, Esdras Hoche, Francisco Paulo, Cláudio Moonen, Otávio Barbosa, Hélio Silva, Bruno Preto, Luide Liude, and Alexandre Augusto.

I thank all those who advised and taught me during this master's, especially Vinicius Garcia for the welcome, support, guidance, demands, and all the important lessons during these months. I thank Stenio Fernandes for all the lessons and guidance at important moments of my research. I thank Rodrigo Assad for the work done together on usto.re and for the guidance that steered the development of my research. I thank Marcelo d'Amorim for the initial welcome and for the work we carried out together, which was of great value for my introduction to scientific research and for my development as a researcher.

I thank José Augusto Suruagy and Denio Mariz for accepting to take part in the committee of my dissertation defense and for their valuable criticism and contributions to my work.

I thank all the friends I made during this master's period, who helped make these days dedicated to the master's very pleasant. I would like to thank Paulo Fernando, Lenin Abadie, Marco Machado, Dhiego Abrantes, Rodolfo Arruda, Francisco Soares, Sabrina Souto, Adriano Tito, Hélio Rodrigues, Jamilson Batista, Bruno Felipe, and the other people I had the pleasure of meeting during this period.

I also thank all my old friends from João Pessoa, Geisel, UFPB, and CEFET-PB, who gave me so much support and encouragement to develop this work.

Finally, I thank all those who collaborated directly or indirectly in carrying out this work.

Thank you very much!


Wherever you go, go with all your heart.


CONFUCIUS

Resumo

Distributed systems have been used to build modern Internet services and cloud computing infrastructures, in order to obtain services with high performance, scalability, and reliability. The service level agreements adopted in cloud computing require a short time to identify, diagnose, and solve problems in the infrastructure, so as to prevent problems from negatively impacting the quality of the services provided to clients. Hence, detecting the causes of errors, diagnosing them, and reproducing errors in distributed systems are challenges that motivate efforts towards the development of less intrusive and more efficient mechanisms for monitoring and debugging distributed applications at runtime.

Network traffic analysis is one option for measuring distributed systems, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic under varying resource demands.

The goal of this dissertation is to analyse the processing capacity problem of measuring distributed systems through network traffic analysis, in order to evaluate the performance of distributed systems in a data center, using commodity hardware and cloud computing services, in a minimally intrusive way.

We propose a new MapReduce-based approach to deeply inspect the network traffic of distributed applications, in order to evaluate the performance of distributed systems at runtime, using commodity hardware. In this dissertation we evaluated the effectiveness of MapReduce for a deep packet inspection algorithm, its processing capacity, the speed-up in job completion time, the scalability of its processing capacity, and the behaviour followed by the MapReduce phases when applied to deep packet inspection for extracting indicators of distributed applications.

Keywords: Distributed Application Measurement, Profiling, MapReduce, Network Traffic Analysis, Packet Level Analysis, Deep Packet Inspection


Abstract

Distributed systems have been adopted for building modern Internet services and cloud computing infrastructures, in order to obtain services with high performance, scalability, and reliability. Cloud computing SLAs require a short time to identify, diagnose, and solve problems in a cloud computing production infrastructure, in order to avoid negative impacts on the quality of service provided to its clients. Thus, the detection of error causes and the diagnosis and reproduction of errors are challenges that motivate efforts towards the development of less intrusive mechanisms for monitoring and debugging distributed applications at runtime.

Network traffic analysis is one option for distributed systems measurement, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic under varying resource demand.

The goal of this dissertation is to analyse the processing capacity problem of measuring distributed systems through network traffic analysis, in order to evaluate the performance of distributed systems in a data center, using commodity hardware and cloud computing services, in a minimally intrusive way.

We propose a new approach based on MapReduce for deep inspection of distributed application traffic, in order to evaluate the performance of distributed systems at runtime using commodity hardware. In this dissertation we evaluated the effectiveness of MapReduce for a deep packet inspection algorithm, its processing capacity, completion time speed-up, processing capacity scalability, and the behaviour followed by the MapReduce phases when applied to deep packet inspection for extracting indicators of distributed applications.

Keywords: Distributed Application Measurement, Profiling, MapReduce, Network Traffic Analysis, Packet Level Analysis, Deep Packet Inspection

Contents

List of Figures xiii
List of Tables xiv
List of Acronyms xv

1 Introduction 1
  1.1 Motivation 1
  1.2 Problem Statement 4
  1.3 Contributions 5
  1.4 Dissertation Organization 6

2 Background and Related Work 7
  2.1 Background 7
    2.1.1 Network Traffic Analysis 7
    2.1.2 JXTA 9
    2.1.3 MapReduce 10
  2.2 Related Work 13
    2.2.1 Distributed Debugging 13
    2.2.2 MapReduce for Network Traffic Analysis 14
  2.3 Chapter Summary 15

3 Profiling Distributed Applications Through Deep Packet Inspection 17
  3.1 Motivation 18
  3.2 Architecture 20
  3.3 Evaluation 28
    3.3.1 Evaluation Methodology 28
    3.3.2 Experiment Setup 30
  3.4 Results 31
  3.5 Discussion 34
    3.5.1 Results Discussion 34
    3.5.2 Possible Threats to Validity 35
  3.6 Chapter Summary 36

4 Evaluating MapReduce for Network Traffic Analysis 37
  4.1 Motivation 38
  4.2 Evaluation 39
    4.2.1 Evaluation Methodology 39
    4.2.2 Experiment Setup 41
  4.3 Results 42
  4.4 Discussion 53
    4.4.1 Results Discussion 53
    4.4.2 Possible Threats to Validity 56
  4.5 Chapter Summary 56

5 Conclusion and Future Work 58
  5.1 Conclusion 59
  5.2 Contributions 60
    5.2.1 Lessons Learned 61
  5.3 Future Work 62

Bibliography 63

List of Figures

2.1 Differences between packet level analysis and deep packet inspection 8
2.2 MapReduce input dataset splitting into blocks and into records 10

3.1 Architecture of the SnifferServer to capture and store network traffic 21
3.2 Architecture for network traffic analysis using MapReduce 23
3.3 JXTA Socket trace analysis 31
3.4 Completion time scalability of MapReduce for DPI 32
    (a) Scalability to process 16 GB 32
    (b) Scalability to process 34 GB 32

4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application network traffic 43
4.2 DPI Processing Capacity for 90Gb 44
4.3 MapReduce Phases Behaviour for DPI of 90Gb 45
    (a) Phases Time for DPI 45
    (b) Phases Distribution for DPI 45
4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the approach with and without splitting into packets 47
4.5 CountUp completion time and speed-up of 90Gb 48
    (a) P3 evaluation 48
    (b) CountUpDriver evaluation 48
4.6 CountUp processing capacity for 90Gb 49
    (a) P3 processing capacity 49
    (b) CountUpDriver processing capacity 49
4.7 MapReduce Phases time of CountUp for 90Gb 50
    (a) MapReduce Phases Times of P3 50
    (b) MapReduce Phases Times for CountUpDriver 50
4.8 MapReduce Phases Distribution for CountUp of 90Gb 51
    (a) Phases Distribution for P3 51
    (b) Phases Distribution for CountUpDriver 51
4.9 DPI evaluation of MapReduce for 30Gb of a JXTA-application network traffic 52
    (a) DPI Completion Time and Speed-up of MapReduce for 30Gb of a JXTA-application network traffic 52
    (b) DPI Processing Capacity of 30Gb 52

List of Tables

3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI of a JXTA-based network traffic 28
3.2 Factors and levels to evaluate the defined metrics 29
3.3 Hypotheses to evaluate the defined metrics 29
3.4 Hypothesis notation 29
3.5 Completion time to process 16 GB split into 35 files 33
3.6 Completion time to process 34 GB split into 79 files 33

4.1 Metrics for evaluating MapReduce for DPI and packet level analysis 40
4.2 Factors and Levels 40
4.3 Non-Distributed Execution Time in seconds 43

List of Acronyms

DPI Deep Packet Inspection
EC2 Elastic Compute Cloud
GQM Goal Question Metric
HDFS Hadoop Distributed File System
IP Internet Protocol
I/O Input/Output
JVM Java Virtual Machine
MBFS Message Based Per Flow State
MBPS Message Based Per Protocol State
PBFS Packet Based Per Flow State
PBNS Packet Based No State
PCAP Packet Capture
PDU Protocol Data Unit
POSIX Portable Operating System Interface
RTT Round-Trip Time
SLA Service Level Agreement
TCP Transmission Control Protocol
UDP User Datagram Protocol

Introduction

Though nobody can go back and make a new beginning, anyone can
start over and make a new ending.
CHICO XAVIER

1.1 Motivation

Distributed systems have been adopted for building high performance systems, due to the possibility of obtaining high fault tolerance, scalability, availability, and efficient use of resources (Cox et al., 2002; Antoniu et al., 2007). Modern Internet services and cloud computing infrastructures are commonly implemented as distributed systems, to provide services with high performance and reliability (Mi et al., 2012). Cloud computing SLAs require a short time to identify, diagnose, and solve problems in the production infrastructure, in order to avoid negative impacts on the quality of service provided to clients. Thus, monitoring and performance analysis of distributed systems in production environments have become more necessary with the growth of cloud computing and the use of distributed systems to provide services and infrastructure as a service (Fox et al., 2009; Yu et al., 2011).
In the development, maintenance, and administration of distributed systems, the detection of error causes and the diagnosis and reproduction of errors are challenges that motivate efforts towards the development of less intrusive and more effective mechanisms for monitoring and debugging distributed applications at runtime (Armbrust et al., 2010). Distributed measurement systems (Massie et al., 2004) and log analysers (Oliner et al., 2012) provide relevant information regarding some aspects of a distributed system, but this information can be complemented by correlated information from other sources (Zheng et al., 2012), such as network traffic analysis, which can provide valuable information about a distributed application and its environment, and also increase the number of information sources, making the evaluation of complex distributed systems more effective. Simulators (Paul, 2010), emulators, and testbeds (Loiseau et al., 2009; Gupta et al., 2011) are also used to evaluate distributed systems, but these approaches fall short of reproducing the production behavior of a distributed system and its relation to a complex environment, such as a cloud computing environment (Loiseau et al., 2009; Gupta et al., 2011).
Monitoring and diagnosing production failures of distributed systems requires low intrusion, high accuracy, and fast results. These requirements are hard to achieve, because distributed systems usually involve asynchronous communication, unpredictable network message behavior, a high number of resources to be monitored in a short time, and black-box components (Yuan et al., 2011; Nagaraj et al., 2012). To measure distributed systems with less intrusion and less dependency on developers, approaches with low dependency on source code or instrumentation are necessary, such as log analysis or network traffic analysis (Aguilera et al., 2003).
It is possible to measure, evaluate, and diagnose distributed applications through the evaluation of information about communication protocols, flows, throughput, and load distribution (Mi et al., 2012; Nagaraj et al., 2012; Sambasivan et al., 2011; Aguilera et al., 2003; Yu et al., 2011). This information can be collected through network traffic analysis, but to retrieve this kind of information from distributed application traffic it is necessary to recognize application protocols and perform DPI, in order to retrieve details of application behaviors, sessions, and states.
Network traffic analysis is one option for evaluating distributed systems performance (Yu et al., 2011), although there are limitations in the processing capacity to deal with large amounts of network traffic in a short time, in the scalability to process network traffic under varying resource demands, and in the complexity of obtaining information about a distributed application's behavior from network traffic (Loiseau et al., 2009; Callado et al., 2009). To evaluate application information from network traffic it is necessary to use DPI and extract information from application protocols, which requires additional effort in comparison with traditional DPI approaches, which usually do not evaluate the content of application protocols and application states.
In the production environment of a cloud computing provider, DPI can be used to evaluate and diagnose distributed applications through the analysis of application traffic inside a data center. However, this kind of DPI differs from, and requires more effort than, common DPI approaches. DPI is usually applied to all network traffic that arrives at a data center, but this approach would not provide reasonable performance for inspecting application protocols and their states, due to the massive volume of network traffic to be evaluated online and the computational cost of performing this kind of evaluation in a short time (Callado et al., 2009).
Packet level analysis can also be used to evaluate packet flows and the load distribution of network traffic inside a data center (Kandula et al., 2009), providing valuable information about the behavior of a distributed system and about the dimension, capacity, and usage of network resources. However, with packet level analysis it is not possible to evaluate application messages, protocols, and their states.
Although much work has been done to improve DPI performance (Fernandes et al., 2009; Antonello et al., 2012), the evaluation of application states through traffic analysis decreases the processing capacity of DPI over large amounts of network traffic. With the growth of link speeds, Internet traffic exchange, and the use of distributed systems to provide Internet services (Sigelman et al., 2010), approaches are needed that can deal with the analysis of the growing amount of network traffic, to permit the efficient evaluation of distributed systems through network traffic analysis.
MapReduce (Dean and Ghemawat, 2008), which was proposed for the distributed processing of large datasets, can be an option for dealing with large amounts of network traffic. MapReduce is a programming model and an associated implementation for processing and generating large datasets. It has become an important programming model and distribution platform for processing large amounts of data, with diverse use cases in academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce is a restricted programming model that easily and automatically parallelizes the execution of user functions and provides transparent fault tolerance (Dean and Ghemawat, 2008). Based on combinators from functional languages, it provides a simple programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments.
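As a minimal illustration of the programming model (an in-memory sketch, not tied to Hadoop or any other MapReduce implementation; the packet tuples and function names are hypothetical), a job that counts packets per transport protocol can be expressed as a map function emitting (protocol, 1) pairs and a reduce function summing the grouped counts:

```python
from collections import defaultdict

# Hypothetical input records: (timestamp, src, dst, protocol, length)
packets = [
    (0.01, "10.0.0.1", "10.0.0.2", "TCP", 1500),
    (0.02, "10.0.0.2", "10.0.0.1", "TCP", 40),
    (0.03, "10.0.0.3", "10.0.0.9", "UDP", 512),
]

def map_packet(record):
    """Map: emit a (key, value) pair for each input record."""
    _, _, _, protocol, _ = record
    yield (protocol, 1)

def reduce_counts(key, values):
    """Reduce: aggregate all values grouped under one key."""
    return (key, sum(values))

def run_job(records):
    """Tiny sequential runner standing in for the framework's
    shuffle/sort between the map and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_packet(record):
            groups[key].append(value)
    return dict(reduce_counts(k, vs) for k, vs in groups.items())

print(run_job(packets))  # {'TCP': 2, 'UDP': 1}
```

The framework's value lies in running many such map and reduce calls in parallel over partitions of the input, with the grouping step performed by a distributed shuffle rather than an in-memory dictionary.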
MapReduce can be used for network packet level analysis (Lee et al., 2011), which evaluates each packet individually to obtain information from the network and transport layers. Lee et al. (2011) proposed an approach to perform network packet level analysis through MapReduce, using network traces split into packets, to process each packet individually and extract indicators from IP, TCP, and UDP. However, profiling an application through network traffic analysis requires deep packet inspection, in order to evaluate the content of the application layer, evaluate application protocols, and reassemble application messages.


Because the approach proposed by Lee et al. (2011) is not able to evaluate more than one packet per MapReduce iteration or to analyse application messages, a new MapReduce approach is necessary to perform DPI algorithms for profiling applications through network traffic analysis.
The kind of workload submitted for processing impacts the behaviour and performance of MapReduce (Tan et al., 2012; Groot, 2012), requiring specific configuration to obtain optimal performance. Information about the occupation of the MapReduce phases, about the processing characteristics (whether the job is I/O or CPU bound), and about the mean duration of Map and Reduce tasks can be used to tune MapReduce parameter configurations, in order to improve resource allocation and task scheduling.
Although studies have been done to understand, analyse, and improve workload management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes the MapReduce behaviour or identifies its optimal configuration to achieve the best performance for packet level analysis and DPI.

1.2 Problem Statement

MapReduce can express several kinds of problems, but not all. MapReduce does not efficiently express incremental, dependent, or recursive computations (Bhatotia et al., 2011; Lin, 2012), because it adopts batch processing and functions executed independently, without shared state or data. Although MapReduce is restrictive, it provides a good fit for many problems involving the processing of large datasets. MapReduce's expressiveness limitations may be reduced by decomposing a problem into multiple MapReduce iterations, or by combining MapReduce with other programming models for sub-problems (Lämmel, 2007; Lin, 2012), although the decomposition into iterations increases the completion time of MapReduce jobs (Lämmel, 2007).
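The decomposition into multiple iterations mentioned above can be illustrated by chaining two jobs (an in-memory sketch with a toy runner; the record layout and flow identifiers are hypothetical): the output of a first job that counts packets per flow becomes the input of a second job that builds a distribution of flow sizes.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Minimal stand-in for one MapReduce run: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(k, reducer(k, vs)) for k, vs in sorted(groups.items())]

# Job 1: packets per flow. Input records: (flow_id, packet_length).
packets = [("f1", 100), ("f1", 200), ("f2", 60), ("f3", 80), ("f3", 40)]
job1 = run_job(packets,
               mapper=lambda r: [(r[0], 1)],
               reducer=lambda k, vs: sum(vs))
# job1 == [('f1', 2), ('f2', 1), ('f3', 2)]

# Job 2: how many flows have a given packet count (input = job 1 output).
job2 = run_job(job1,
               mapper=lambda r: [(r[1], 1)],
               reducer=lambda k, vs: sum(vs))
print(job2)  # [(1, 1), (2, 2)] -> one flow with 1 packet, two with 2
```

Each chained job adds a full map-shuffle-reduce pass over materialized intermediate data, which is the completion time cost of decomposition that Lämmel (2007) notes.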
DPI algorithms require the evaluation of one or more packets to retrieve information from application layer messages; this represents a data dependency when assembling an application message from network packets, and it is a restriction on using MapReduce for DPI. Because the approach of Lee et al. (2011) for packet level analysis with MapReduce processes each packet individually, it cannot be used to evaluate more than one packet per Map function or to efficiently reassemble an application message from network traces. Thus a new approach is necessary to use MapReduce to perform DPI, evaluating the effectiveness of MapReduce in expressing DPI algorithms.


In elastic environments, like cloud computing providers, where users can request or discard resources dynamically, it is important to know how to perform provisioning and resource allocation in an optimal way. To run MapReduce jobs efficiently, the allocated resources need to be matched to the workload characteristics, and they should be sufficient to meet a requested processing capacity or deadline (Lee, 2012).
The main performance evaluations of MapReduce concern text processing (Zaharia et al., 2008; Chen et al., 2011; Jiang et al., 2010; Wang et al., 2009), where the input data is split into blocks and records, to be processed by parallel and independent Map functions. Although studies have been done to understand, analyse, and improve workload decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes the MapReduce behavior or identifies its optimal configuration to achieve the best performance for packet level analysis and DPI. Thus, it is necessary to characterize MapReduce jobs for packet level analysis and DPI, in order to permit their optimal configuration for the best performance, and to obtain information that can be used to predict or simulate the completion time of a job with given resources, in order to determine whether the job will be finished by the deadline with the allocated resources (Lee, 2012).
The goal of this dissertation is to analyse the processing capacity problem of measuring distributed systems through network traffic analysis, proposing a solution able to perform deep inspection of distributed application traffic, in order to evaluate distributed systems in a data center, using commodity hardware and cloud computing services, in a minimally intrusive way. We developed an approach based on MapReduce to evaluate the behavior of distributed systems through DPI, and we evaluated the effectiveness of MapReduce for a DPI algorithm and its completion time scalability through node addition to the cluster, measuring a JXTA-based application on virtual machines of a cloud computing provider. We also evaluated the MapReduce performance for packet level analysis and DPI, characterizing the behavior followed by the MapReduce phases, the processing capacity scalability, and the speed-up. In this evaluation we assessed the impact caused by variations of input size, block size, and cluster size.

1.3 Contributions

We analyse the processing capacity problem of distributed system measurement through network traffic analysis. The results of the work presented in this dissertation provide the following contributions:


1. We proposed an approach to implement DPI algorithms through MapReduce, using whole blocks as input for Map functions. We showed the effectiveness of MapReduce for a DPI algorithm that extracts indicators from distributed application traffic, and we showed the MapReduce completion time scalability, through node addition to the cluster, for DPI on virtual machines of a cloud computing provider;

2. We characterized the behavior followed by the MapReduce phases for packet level analysis and DPI, showing that this kind of job is Map-phase intensive and highlighting points for improvement;

3. We described the processing capacity scalability of MapReduce for packet level analysis and DPI, evaluating the impact caused by variations in input, cluster, and block size;

4. We showed the speed-up obtained with MapReduce for DPI, with variations of input, cluster, and block size.
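The idea behind contribution 1 can be sketched as follows (an illustrative simplification, not the dissertation's actual implementation: the record layout, flow key, and message marker below are hypothetical). Instead of handing each packet to a separate Map call, each Map call receives a whole capture block, so it can group packets into flows, concatenate payloads in order, and only then look for application message boundaries:

```python
# Hypothetical packet record: (flow_id, seq, payload_bytes)
def map_whole_block(block):
    """Map over a whole capture block (not a single packet):
    reassemble each flow's payload, then look for application
    messages that may span several packets."""
    flows = {}
    for flow_id, seq, payload in block:
        flows.setdefault(flow_id, []).append((seq, payload))

    for flow_id, segments in flows.items():
        # Order segments by sequence number before reassembly.
        data = b"".join(p for _, p in sorted(segments))
        # Hypothetical application-layer marker; a real DPI pass
        # would parse the protocol's actual framing instead.
        count = data.count(b"MSG|")
        yield (flow_id, count)

block = [
    ("flowA", 2, b"llo|"),   # arrives out of order
    ("flowA", 1, b"MSG|he"),
    ("flowB", 1, b"MSG|x|MSG|y|"),
]
print(dict(map_whole_block(block)))  # {'flowA': 1, 'flowB': 2}
```

A per-packet Map, by contrast, would see b"llo|" in isolation and could never detect a message split across packets, which is why the whole block (or file) must reach a single Map task.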

1.4 Dissertation Organization

The remainder of this dissertation is organized as follows.

In Chapter 2, we provide background information on network traffic analysis and MapReduce, and we review previous work related to the measurement of distributed applications at runtime and to the use of MapReduce for network traffic analysis.

In Chapter 3, we look at the problem of distributed application monitoring and the restrictions on using MapReduce for profiling application traffic. There are limitations in the capacity to process large amounts of network packets in a short time, and in the scalability to process network traffic under variations of throughput and resource demand. To address this problem, we present an approach for profiling application traffic using MapReduce. Experiments show the effectiveness of our approach for profiling applications through DPI and MapReduce, and show the completion time scalability achieved in a cloud computing provider.

In Chapter 4, we present a performance evaluation of MapReduce for network traffic analysis. Due to the lack of evaluations of MapReduce for traffic analysis and the peculiarity of this kind of data, this chapter evaluates in depth the performance of MapReduce for packet level analysis and DPI of distributed application traffic, evaluating the MapReduce scalability, the speed-up, and the behavior followed by the MapReduce phases. The experiments show the predominant phases in this kind of MapReduce job, and show the impact of input size, block size, and number of nodes on job completion time and on the scalability achieved through the use of MapReduce.

In Chapter 5, we conclude this work, summarize our contributions, and present future work.

Background and Related Work


No one knows it all. No one is ignorant of everything. We all know
something. We are all ignorant of something.
PAULO FREIRE

In this chapter, we provide background information on network traffic analysis, JXTA
and MapReduce, and we investigate previous studies related to the measurement
of distributed applications and to the use of MapReduce for network traffic analysis.

2.1 Background

2.1.1 Network Traffic Analysis

Network traffic measurement can be divided into active and passive measurement, and
a measurement can be performed at the packet or flow level. In packet-level analysis,
measurements are performed on each packet transmitted across the measurement
point. Common packet inspection only analyses the content up to the transport layer,
including the source address, destination address, source port, destination port and
protocol type, but packet inspection can also analyse the packet payload, performing
deep packet inspection.
Risso et al. (2008) presented a taxonomy of the methods that can be used for network
traffic analysis. According to Risso et al. (2008), Packet Based No State (PBNS) operates
by checking the value of some fields present in each packet, such as the TCP or UDP
ports, so this method is computationally very simple. Packet Based Per Flow State
(PBFS) requires a session table to manage session identification (source/destination
address, transport-layer protocol, source/destination port) and the corresponding application-layer
protocol, in order to be able to scan the payload looking for a specific rule, which
usually is an application-layer signature; this increases the processing complexity of
this method. Message Based Per Flow State (MBFS) operates on messages instead
of packets. This method requires a TCP/IP reassembler to handle IP fragments and TCP
segments. In this case, memory requirements increase because of the additional state
information that must be kept for each session and because of the buffers required by the
TCP/IP reassembler. Message Based Per Protocol State (MBPS) interprets exactly
what each application sends and receives. An MBPS processor understands not only the
semantics of the message, but also the different phases of a message exchange, because it
has a full understanding of the protocol state machine. Memory requirements become
even larger, because this method needs to take into account not only the state of the
transport session, but also the state of each application-layer session. Processing
power requirements are also the highest, because protocol conformance analysis requires processing
the entire application data, while the previous methods are limited to the first packets of each
session.
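The contrast between the stateless, header-only PBNS method and payload inspection can be sketched as follows. This is an illustrative example, not code from this work; the port-to-protocol mapping and the byte signature are assumptions chosen for the sketch.

```python
# Minimal sketch contrasting PBNS-style classification (header fields only)
# with payload signature matching (the basis of the stateful methods).
# The port mapping and signature below are illustrative assumptions.

KNOWN_PORTS = {80: "http", 53: "dns", 22: "ssh"}  # assumed mapping

def classify_pbns(src_port, dst_port):
    """Packet Based No State: decide from transport-layer ports alone."""
    return KNOWN_PORTS.get(dst_port) or KNOWN_PORTS.get(src_port) or "unknown"

def matches_signature(payload, signature=b"GET /"):
    """Payload inspection: scan the packet payload for a byte pattern."""
    return signature in payload

print(classify_pbns(45231, 80))               # port-based guess: 'http'
print(matches_signature(b"GET /index.html"))  # payload-based evidence: True
```

The port-based classifier inspects a fixed number of header bytes per packet, while the signature check must scan the whole payload, which is why the stateful methods cost more per packet.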
Figure 2.1 illustrates the difference between packet-level analysis and DPI over
PCAP files, showing that packet-level analysis evaluates each packet individually, while
DPI requires evaluating more than one packet, in order to reassemble packets and
obtain an application message.

Figure 2.1 Differences between packet level analysis and deep packet inspection

DPI refers to examining both the packet header and the complete payload to look for
predefined patterns or rules. A pattern or rule can be a particular TCP connection, defined
by source and destination IP addresses and port numbers; it can also be the signature string
of a virus, or a segment of malicious code (Piyachon and Luo, 2006). Antonello et al.
(2012) argue that many critical network services rely on the inspection of the packet payload,
instead of only looking at the information in packet headers. Although DPI systems
are essentially more accurate in identifying application protocols and application messages,
they are also resource-intensive and may not scale well with growing link speeds.
MBFS, MBPS and DPI evaluate the content of the application layer, so it is necessary
to recognize the content of the evaluated message, but encrypted messages can make
this kind of evaluation infeasible.

2.1.2 JXTA

JXTA is a language-independent specification for peer-to-peer networking. It attempts to formulate
standard peer-to-peer protocols, in order to provide an infrastructure for building peer-to-peer
applications, through basic functionalities for peer resource discovery, communication
and organization. JXTA introduces an overlay on top of the existing physical
network, with its own addressing and routing (Duigou, 2003; Halepovic and Deters,
2003).
According to the JXTA specification (Duigou, 2003), JXTA peers communicate through
messages transmitted over pipes, which are an abstraction of virtual channels composed of
input and output channels for peer-to-peer communication. Pipes are not bound to a
physical location; each pipe has its own unique ID, so a peer can keep its pipe even
when its physical network location changes. Pipes are asynchronous, unidirectional and
unreliable, but bi-directional and reliable services are provided on top of them. JXTA
uses source-based routing: each message carries its routing information as a sequence
of peers, and peers along the path may update this information. The JXTA socket adds
reliability and bi-directionality to JXTA communications through a layer of abstraction
on top of the pipes (Antoniu et al., 2005), and it provides an interface similar to the
POSIX sockets specification. JXTA messages are XML documents composed of well-defined
and ordered message elements.
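Since JXTA messages are XML documents with ordered elements, extracting their fields amounts to ordinary XML parsing. The sketch below is illustrative only; the element and attribute names are hypothetical and do not reproduce the exact JXTA wire format.

```python
# Illustrative sketch: parsing an XML message with ordered elements, in the
# spirit of JXTA messages. Element names here are hypothetical, not the
# actual JXTA schema.
import xml.etree.ElementTree as ET

raw = """<Message>
  <Element name="SrcPeerID">urn:jxta:peer-1234</Element>
  <Element name="Payload">hello</Element>
</Message>"""

root = ET.fromstring(raw)
# Elements are well defined and ordered, so document order is preserved.
elements = [(e.get("name"), e.text) for e in root.findall("Element")]
print(elements)  # [('SrcPeerID', 'urn:jxta:peer-1234'), ('Payload', 'hello')]
```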
Halepovic and Deters (2005) proposed a performance model, describing important
metrics to evaluate JXTA throughput, scalability and services, and JXTA behavior
across different versions. Halepovic et al. (2005) analysed JXTA performance in
order to show the increasing cost and latency under higher workloads and concurrent
requests, and suggested further evaluations of JXTA scalability with large peer groups in
direct communication. Halepovic (2004) notes that network traffic analysis is a feasible
approach to the performance evaluation of JXTA-based applications, but does not adopt it due
to the lack of JXTA traffic characterization. Although there are performance models and
evaluations of JXTA, there are no evaluations of its current versions, and there are
no mechanisms to evaluate JXTA applications at runtime. Because JXTA is still used for
building peer-to-peer systems, such as U-Store (Fonseca et al., 2012), which motivates
our research, a solution is necessary to measure JXTA-based applications at runtime and
provide information about their behavior and performance.

2.1.3 MapReduce

MapReduce (Dean and Ghemawat, 2008) is a programming model and a framework for
processing large datasets through distributed computing, providing fault tolerance and high
scalability for big data processing. The MapReduce model was designed for unstructured
data processed by clusters of commodity hardware. Its functional style of Map and
Reduce functions automatically parallelizes and executes large jobs over a cluster.
MapReduce also handles failures, application deployment, task duplication, and aggregation
of results, thereby allowing programmers to focus on the core logic of their applications.
An application executed through MapReduce is called a job. The input data of a
job, which is stored in a distributed file system, is split into even-sized blocks and
replicated for fault tolerance. Figure 2.2 shows the input dataset splitting adopted by
MapReduce.

Figure 2.2 MapReduce input dataset splitting into blocks and into records


Initially, the input dataset is split into blocks and stored in the adopted distributed file system.
During the execution of a job, each split is assigned to be processed by a
Mapper, so the number of input splits determines the number of Map tasks of a
MapReduce job. Each Mapper reads its split from the distributed file system and divides
it into records, to be processed by the user-defined Map function. Each Map function
generates intermediate data from the evaluated block, which will be fetched, ordered by
key and processed by the Reducers to generate the output of the MapReduce job.
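The data flow just described (splits parsed into records, Map emitting keyed pairs, grouping by key, Reduce aggregating) can be mimicked with a small in-memory sketch. This mirrors the model only, not the Hadoop API; the word-count functions are the canonical illustration, not part of this work.

```python
# Minimal in-memory sketch of the MapReduce data flow: each "block" is
# parsed into records, Map emits (key, value) pairs, pairs are grouped by
# key (the shuffle/sort step), and Reduce aggregates each key's values.
from collections import defaultdict

def run_mapreduce(blocks, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for block in blocks:                  # one Map task per input split
        for record in block.split("\n"):  # split divided into records
            for key, value in map_fn(record):
                intermediate[key].append(value)
    # shuffle/sort: group values by key, then apply Reduce per key
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

# Word count, the canonical example of the model.
word_map = lambda rec: [(w, 1) for w in rec.split()]
word_reduce = lambda k, vs: sum(vs)

result = run_mapreduce(["a b a", "b c"], word_map, word_reduce)
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster the intermediate dictionary is partitioned across Reducers and the grouping happens during the Shuffle and Sort phases described below.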
A MapReduce job is divided into Map and Reduce tasks, which are composed of
user-defined Map and Reduce functions. The execution of these tasks can be grouped
into phases, representing the Map and Reduce phases; Reduce tasks can be further
divided into other phases, namely the Shuffle and Sort phases. A job is submitted by
a user to the master node, which selects worker nodes with idle slots and assigns them Map or
Reduce tasks.
The execution of a Map task can be divided into two phases. In the first, the Map
phase reads the task's split from the distributed file system, parses it into records, and
applies the user-defined Map function to each record. In the second, after the user-defined
Map function has been applied to each input record, the commit phase registers the final
output with the TaskTracker, which then informs the JobTracker that the task has finished
executing. The output of the Map phase is consumed by the Reduce phase.
The execution of a Reduce task can be divided into three phases. The first phase,
called the Shuffle phase, fetches the Reduce task's input data, where each Reduce task is
assigned a partition of the key space produced by the Map phase. The second phase, called the Sort
phase, groups records with the same key. The third phase, called the Reduce phase, applies
the user-defined Reduce function to each key and its values (Kavulya et al., 2010).
A Reduce task cannot fetch the output of a Map task until the Map task has finished and
committed its output to disk. Only after receiving its partition from all Map outputs does the
Reduce task start the Sort phase; until this happens, the Reduce task executes
the Shuffle phase. After the Sort phase, the Reduce task enters the Reduce phase, in
which it executes the user-defined Reduce function for each key and its values. Finally,
the output of the Reduce function is written to a temporary location on the distributed file
system (Condie et al., 2010).
MapReduce worker nodes can be configured to concurrently execute up to a defined
number of Map and Reduce tasks, according to their number of Map and
Reduce slots. Each worker node of a MapReduce cluster is configured with a fixed number
of Map slots and another fixed number of Reduce slots, which determine the number of Map
and Reduce tasks that can be executed concurrently per node. During job execution, if
all available slots are occupied, pending tasks must wait until some slots are freed. If
the number of tasks in the job is larger than the number of available slots, the Map or
Reduce tasks are first scheduled to execute on all available slots; these tasks compose the
first wave of tasks, which is followed by subsequent waves. For example, if an input is broken into 200
blocks and there are 20 Map slots in a cluster, the job has 200 Map tasks and they
are executed in 10 waves (Lee et al., 2012). The number of
waves, and their sizes, can aid the configuration of tasks for improved cluster
utilization (Kavulya et al., 2010).
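The wave arithmetic above is a simple ceiling division, sketched here with the 200-block, 20-slot example from the text:

```python
# Tasks run in waves of at most `slots` concurrent executions, so the
# number of waves is the ceiling of tasks over slots.
import math

def map_waves(num_blocks, map_slots):
    return math.ceil(num_blocks / map_slots)

print(map_waves(200, 20))  # 10 waves, as in the example above
```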
The Shuffle phase of the first Reduce wave may be significantly different from the
Shuffle phase of the subsequent Reduce waves. This happens because the Shuffle
phase of the first Reduce wave overlaps with the entire Map phase, and hence it depends
on the number of Map waves and their durations (Verma et al., 2012b).
Each Map task is independent of the other Map tasks, meaning that all Mappers can
run in parallel on multiple machines. The number of concurrent Map tasks in a
MapReduce system is limited by the number of slots and by the number of blocks into which
the input data was divided. Reduce tasks can also be performed in parallel during the
Reduce phase, and the number of Reduce tasks in a job is specified by the application and
bounded by the number of Reduce slots per node.
MapReduce tries to achieve data locality in its job executions, which means that a Map
task and the input data block it will process should be located as close to each other as
possible, so that the Map task can read the input data block incurring as little network
traffic as possible.
Hadoop1 is an open source implementation of MapReduce, which relies on HDFS
for distributed data storage and replication. HDFS is an implementation of the Google File
System (Ghemawat et al., 2003), which was designed to store large files, and was adopted
by the MapReduce system as the distributed file system to store its files and intermediate data.
The input data type and workload characteristics impact MapReduce
performance, because each application has a different bottleneck resource and requires
a specific configuration to achieve optimal resource utilization (Kambatla et al., 2009).
Hadoop has a set of configuration parameters whose default values
are based on typical cluster machine configurations and the requirements of a typical
application, which usually processes text-like inputs, although optimal MapReduce
resource utilization depends on the resource consumption profile of each application.
1 http://hadoop.apache.org/


Because the input data type and workload characteristics of MapReduce jobs impact
MapReduce performance, it is necessary to evaluate MapReduce behavior and
performance for different purposes. Although much work has been done to
understand and analyse MapReduce for different input data types and workloads (Lu
et al., 2012; Groot, 2012), there is no evaluation that characterizes MapReduce behavior
and identifies its optimal configuration for packet-level analysis and DPI applications.

2.2 Related Work

2.2.1 Distributed Debugging

Modern Internet services are often implemented as complex, large-scale distributed
systems. Information about the behavior of complex distributed systems is necessary
to evaluate and improve their performance, but understanding distributed system
behavior requires observing related activities across many different components and
machines (Sigelman et al., 2010).
The evaluation of distributed applications is a challenge, due to the cost of monitoring
distributed systems and the lack of performance measurement of large-scale distributed
applications at runtime. To reproduce the behavior of a complex distributed system in
a test environment, it is necessary to reproduce each relevant configuration parameter
of the system (Gupta et al., 2011), which is a difficult effort, and one that is even more
complex in cases where faults only occur when the system is under high load (Loiseau
et al., 2009).
Gupta et al. (2011) presented a methodology and framework for large-scale tests, able
to obtain resource configurations and scale close to those of a large-scale system, through the use of
emulated scalable networks, multiplexed virtual machines and resource dilation. Gupta
et al. (2011) show its accuracy, scalability and realism in network tests. However, it
cannot attain the same accuracy as an evaluation of a real system at runtime, nor can it
diagnose, in a short time, a problem that occurred in a production environment.
According to Sambasivan et al. (2011), debugging tools are needed to help the
identification and understanding of the root causes of the diverse performance problems that
can arise in distributed systems. A request flow can be seen as the path and timing of a
request in a distributed system, representing the flow of individual requests within
and across the components of a distributed system. There are many cases for which
comparing request-flow traces is useful; it can help to diagnose performance changes
resulting from modifications made during software development or from upgrades of a
deployed system. It can also help to diagnose behaviour changes resulting from component
degradations, resource leakage, or workload changes.
Sigelman et al. (2010) reported on Dapper, Google's production distributed system tracing
framework, which states three concrete design goals: low overhead, application-level
transparency and scalability. These goals were achieved by restricting Dapper's
core tracing instrumentation to Google's ubiquitous threading, control flow, and RPC library
code. Dapper provides valuable insights for the evaluation of distributed
systems through flows and procedure calls, but its implementation depends on
instrumenting the component responsible for message communication in the
distributed system, which may not be available in a black-box system.
Some techniques have been developed for the performance evaluation of distributed
systems. Mi et al. (2012) proposed an approach, based on end-to-end request trace logs, to
identify the primary causes of performance problems in cloud computing systems. Nagaraj
et al. (2012) compared logs of distributed systems to diagnose performance problems,
using machine learning techniques to analyse logs and to explore information about states
and event times. Sambasivan et al. (2011) used request flows to find performance
modifications in distributed systems, comparing request flows across periods and ranking
them based on their impact on system performance. Although these approaches evaluate
requests, flows and events of distributed systems, traffic analysis was not used as an
approach to provide the desired information.
Aguilera et al. (2003) proposed an approach to isolate performance bottlenecks in
distributed systems, based on message-level trace activity and algorithms for inferring
the dominant paths of a distributed system. Although network traffic was considered as a
source from which to extract the desired information, a distributed approach was not adopted for data
processing.
Yu et al. (2011) presented SNAP, a scalable network-application profiler to evaluate
the interactions between applications and the network. SNAP passively collects TCP
statistics and socket logs, and correlates them with network resources to indicate problem
locations. However, SNAP adopted neither application traffic evaluation nor
distributed computing to perform network traffic processing.

2.2.2 MapReduce for Network Traffic Analysis

Lee et al. (2010) proposed a network flow analysis method using MapReduce, where the
network traffic was captured, converted to text and used as input to Map tasks. As a result,
improvements in fault tolerance and computation time were shown, when compared with
flow-tools2. The conversion time from binary network traces to text represents a relevant
additional cost, which can be avoided by adopting binary data as the input for MapReduce
jobs.
Lee et al. (2011) presented a Hadoop-based packet trace processing tool to process
large amounts of binary network traffic. A new input type for Hadoop was developed,
the PcapInputFormat, which encapsulates the complexity of processing captured binary
PCAP traces and extracting the packets through the Libpcap (Jacobson et al., 1994)
library. Lee et al. (2011) compared their approach with CoralReef3, a network
traffic analysis tool that also relies on Libpcap; the results of the evaluation showed a
speed-up in completion time for a case that processes packet traces of more than 100GB.
This approach implemented a packet-level evaluation, to extract indicators from IP, TCP
and UDP, evaluating the job completion time achieved with different input sizes and two
cluster configurations. The authors implemented their own component to save network traces
into blocks, and the developed PcapInputFormat relies on a timestamp-based heuristic,
using a sliding window, to find the first packet of each block. These implementations for
iterating over the packets of a network trace can present a limitation in accuracy, compared
with the accuracy obtained by Tcpdump4 and Libpcap for the same functionalities.
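The need for such a heuristic follows from the PCAP format itself: a 24-byte global header is followed by variable-length records, each with a 16-byte record header, and there is no synchronization marker, so a reader dropped into the middle of a block cannot find a packet boundary directly. The sketch below iterates records sequentially using the standard libpcap layout; the two-packet trace is synthetic, built only for demonstration.

```python
# Sequential PCAP record iteration using the standard libpcap layout:
# 24-byte global header, then per-packet records with a 16-byte header
# (ts_sec, ts_usec, incl_len, orig_len). Records are variable-length with
# no sync marker, which is why splitting a trace into blocks requires a
# heuristic to locate the first packet of each block.
import struct

def iter_packets(data):
    magic = struct.unpack_from("<I", data, 0)[0]
    endian = "<" if magic == 0xA1B2C3D4 else ">"
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig = struct.unpack_from(endian + "IIII", data, offset)
        yield ts_sec, data[offset + 16 : offset + 16 + incl_len]
        offset += 16 + incl_len  # next record starts after this payload

# Build a tiny synthetic little-endian trace with two packets.
hdr = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
pkt = lambda ts, p: struct.pack("<IIII", ts, 0, len(p), len(p)) + p
trace = hdr + pkt(100, b"abc") + pkt(101, b"defgh")
print([(ts, len(p)) for ts, p in iter_packets(trace)])  # [(100, 3), (101, 5)]
```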
The approach proposed by Lee et al. (2011) is not able to evaluate more than one
packet per MapReduce iteration, because each block is divided into packets that are
evaluated individually by the user-defined Map function. Therefore, a new MapReduce
approach is necessary to perform DPI algorithms, which require reassembling more
than one packet to mount an application message, in order to evaluate message contents,
application states and application protocols.

2.3 Chapter Summary

In this chapter, we presented background information on network traffic analysis,
JXTA and MapReduce, and we investigated previous studies related to the
measurement of distributed applications and to the use of MapReduce for network
traffic analysis.
According to the background and related work evaluated, the detection of error causes,
and the diagnosis and reproduction of errors in distributed systems, are challenges that motivate
efforts to develop less intrusive mechanisms for monitoring and debugging distributed
applications at runtime. Network traffic analysis is one option for distributed systems
measurement, although there are limitations on the capacity to process large amounts of
network traffic in a short time, and on the scalability to process network traffic under
varying resource demand.
2 www.splintered.net/sw/flow-tools/
3 http://www.caida.org/tools/measurement/coralreef
4 http://www.tcpdump.org/
Although MapReduce can be used for packet-level analysis, an approach is necessary
to use MapReduce for DPI, in order to evaluate distributed systems in a data center through
network traffic analysis, using commodity hardware and cloud computing services, in a
minimally intrusive way. Due to the lack of evaluations of MapReduce for traffic analysis
and the peculiarities of this kind of data, it is necessary to evaluate the performance of
MapReduce for packet-level analysis and DPI, characterizing the behavior of the
MapReduce phases, and its processing capacity scalability and speed-up, over variations of
the most important MapReduce configuration parameters.


Profiling Distributed Applications Through Deep Packet Inspection
Life is really simple, but we insist on making it complicated.
CONFUCIUS

In this chapter, we first look at the problems in distributed application monitoring,
in the processing capacity for network traffic, and in the restrictions on using MapReduce for
profiling the network traffic of distributed applications.
Network traffic analysis can be used to extract performance indicators from
communication protocols, flows, throughput and the load distribution of a distributed system. In
this context, network traffic analysis can enrich diagnoses and provide a mechanism for
measuring distributed systems in a passive way, with low overhead and low dependency
on developers.
However, there are limitations on the capacity to process large amounts of network
traffic in a short time, and on the processing capacity scalability needed to process network
traffic under variations of throughput and resource demands. To address this problem, we
present an approach for profiling application network traffic using MapReduce. Experiments
show the effectiveness of our approach for profiling a JXTA-based distributed
application through DPI, and its completion time scalability through node addition, in a
cloud computing environment.
In Section 3.1 we begin this chapter by motivating the need for an approach using
MapReduce for DPI; then we describe, in Section 3.2, the proposed architecture and the
DPI algorithm to extract indicators from the network traffic of a JXTA-based distributed
application. Section 3.3 presents the adopted evaluation methodology and the experiment
setup used to evaluate our proposed approach. The obtained results are presented in
Section 3.4 and discussed in Section 3.5. Finally, Section 3.6 concludes and summarizes
this chapter.

3.1 Motivation

Modern Internet services and cloud computing infrastructures are commonly implemented
as distributed systems, to provide services with high performance, scalability and reliability.
Cloud computing SLAs require a short time to identify, diagnose and solve problems
in the infrastructure, in order to avoid negative impacts on the provided
quality of service.
Monitoring and performance analysis of distributed systems became more necessary
with the growth of cloud computing and the use of distributed systems to provide services
and infrastructure (Fox et al., 2009). In distributed systems development, maintenance and
administration, the detection of error causes, and the diagnosis and reproduction of errors,
are challenges that motivate efforts to develop less intrusive mechanisms for debugging
and monitoring distributed applications at runtime (Armbrust et al., 2010). Distributed
measurement systems (Massie et al., 2004) and log analyzers (Oliner et al., 2012) provide
relevant information about some aspects of a distributed system. However, this information
can be complemented by correlating it with information from network traffic analysis, making
these tools more effective and broadening the information sources available to ubiquitously evaluate a
distributed system.
Low overhead, transparency and scalability are common requirements for an
efficient solution to the measurement of distributed systems. Many approaches have been
proposed in this direction, using instrumentation or logging, which cause overhead and
a dependency on developers. It is possible to diagnose and evaluate distributed application
performance by evaluating information from communication protocols,
flows, throughput and load distribution (Sambasivan et al., 2011; Mi et al., 2012). This
information can be collected through network traffic analysis, enriching a diagnosis, and
also providing an approach for the measurement of distributed systems in a passive way,
with low overhead and low dependency on developers.
Network traffic analysis is one option to evaluate distributed systems performance
(Yu et al., 2011), although there are limitations on the capacity to process a large number
of network packets in a short time (Loiseau et al., 2009; Callado et al., 2009) and on the
scalability to process network traffic under variations of throughput and resource demands.
To obtain information about the behaviour of distributed systems from network traffic, it
is necessary to use DPI and evaluate information from application states, which requires
an additional effort in comparison with traditional DPI approaches, which usually do
not evaluate application states.
Although much work has been done to improve DPI performance (Fernandes
et al., 2009; Antonello et al., 2012), the evaluation of application states still decreases
the processing capacity of DPI over large amounts of network traffic. With the
growth of link speeds, Internet traffic exchange and the use of distributed systems to
provide Internet services (Sigelman et al., 2010), new approaches need to be developed
to deal with the analysis of the growing amount of network traffic,
and to permit the efficient evaluation of distributed systems through network traffic
analysis.
MapReduce (Dean and Ghemawat, 2008) has become an important programming model
and distribution platform to process large amounts of data, with diverse use cases in
academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce can be used
for packet-level analysis: Lee et al. (2011) proposed an approach that evaluates each
packet individually to obtain information from the network and transport layers, splitting
network traces into packets to process each one individually and extract indicators from
IP, TCP and UDP.
However, for profiling distributed applications through network traffic analysis, it is
necessary to analyse the content of more than one packet, up to the application layer, to
evaluate application messages and their protocols. Due to TCP and message segmentation,
the desired application message may be split across several packets. Therefore, it is
necessary to evaluate more than one packet per MapReduce iteration to perform deep
packet inspection, in order to be able to reassemble packets, mount
application messages, and retrieve information from the application sessions, states and
protocols.
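The reassembly requirement can be illustrated with a minimal sketch: segments of a TCP stream must be ordered by sequence number before the application message they carry can be parsed. The segment tuples and the message below are hypothetical, chosen only to show the data dependence between packets.

```python
# Illustrative sketch of why DPI needs more than one packet: an application
# message split across TCP segments must be reassembled, ordered by
# sequence number, before it can be parsed. Segments are (seq, payload).
def reassemble(segments):
    stream = b""
    for _seq, payload in sorted(segments, key=lambda s: s[0]):
        stream += payload
    return stream

# A hypothetical application message arriving out of order in 3 segments.
segments = [(15, b"sage"), (0, b"applica"), (7, b"tion mes")]
print(reassemble(segments))  # b'application message'
```

A per-packet Map function never sees more than one of these segments at a time, which is exactly the restriction discussed below; a real reassembler would also handle retransmissions and overlapping segments, which this sketch omits.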
DPI refers to examining both the packet header and the complete payload to look for
predefined patterns or rules, which can be a signature string or an application message.
According to the taxonomy presented by Risso et al. (2008), deep packet inspection
can be classified as message based per flow state (MBFS), which analyses application
messages and their flows, or as message based per protocol state (MBPS), which analyses
application messages and their application protocol states; these are the kinds of analysis
needed to evaluate distributed applications through network traffic analysis and to
extract application indicators.


MapReduce is a restricted programming model, designed to parallelize user functions
automatically and to provide transparent fault tolerance (Dean and Ghemawat, 2008), based
on functional combinators from functional languages. MapReduce does not efficiently
express incremental, dependent or recursive data processing (Bhatotia et al., 2011; Lin, 2012),
because its approach adopts batch processing and functions executed independently, without
shared state.
Although restrictive, MapReduce provides a good fit for many problems involving
the processing of large datasets. Also, its expressiveness limitations may be reduced by
decomposing a problem into multiple MapReduce iterations, or by combining MapReduce with other
programming models for subproblems (Lämmel, 2007; Lin, 2012), although this approach may
not be optimal in some cases. DPI algorithms require the evaluation of one or more packets
to retrieve information from application messages; this represents a data dependence
for mounting an application message, and is a restriction on the use of MapReduce for DPI.
Because the approach of Lee et al. (2011) processes each packet individually, it cannot
be efficiently used to evaluate more than one packet and reassemble an application
message from a network trace, which makes necessary a new approach for using
MapReduce to perform DPI and evaluate application messages.
To be able to process large amounts of network traffic using commodity hardware,
in order to evaluate the behaviour of distributed systems at runtime, and also because
there is no evaluation of MapReduce effectiveness and processing capacity for DPI, we
developed a MapReduce-based approach to deeply inspect distributed application
traffic and evaluate the behaviour of distributed systems, using Hadoop, an open
source implementation of MapReduce.
In this chapter we evaluate the effectiveness of MapReduce for a DPI algorithm, and
its completion time scalability through node addition, to measure a JXTA-based
application, using virtual machines of Amazon EC21, a cloud computing provider. The main
contributions of this chapter are:
1. To provide an approach to implement DPI algorithms using MapReduce;
2. To show the effectiveness of MapReduce for DPI;
3. To show the completion time scalability of MapReduce for DPI, using virtual
machines of cloud computing providers.
1 http://aws.amazon.com/ec2/


3.2 Architecture

In this section we present the architecture of the proposed approach for capturing and
processing the network traffic of distributed applications.
To monitor distributed applications through network trafc analysis, specics points
of a data center must be monitored to capture the desired application network trafc.
Also, an approach is needed to process a large amount of network trafc in an acceptable
time. According to (Sigelman et al., 2010), fresh information enables a faster reaction to
production problems, thereby the information must be obtained as soon as possible, although a trace analysis system operating on hours-old data is still valuable for monitoring
distributed applications in a data center (Sigelman et al., 2010).
In this direction, we propose a pipelined process to capture network traffic, store
it locally, transfer it to a distributed file system, and evaluate the network trace to extract
application indicators. We use MapReduce, as implemented by Apache Hadoop, to process application network traffic, extract application indicators, and provide an efficient
and scalable solution for DPI and for profiling application network traffic in a production
environment, using commodity hardware.
The architecture for network traffic capturing and processing is composed of four
main components: the SnifferServer (shown in Figure 3.1), which captures, splits and stores
network packets into the HDFS for batch processing through Hadoop; the Manager, which
orchestrates the collected data and the job executions, and stores the generated results; the
AppParser, which converts network packets into application messages; and the AppAnalyzer,
which implements Map and Reduce functions to extract the desired indicators.

Figure 3.1 Architecture of the SnifferServer to capture and store network traffic

Figure 3.1 shows the architecture of the SnifferServer and its placement at monitoring points of a data center. The SnifferServer captures network traffic from specific points
and stores it into the HDFS for batch processing through Hadoop. The Sniffer executes
user-defined monitoring plans that specify places, time, traffic filters and
the amount of data to be captured. According to a user-defined monitoring plan, the Sniffer
starts the capture of the desired network traffic through Tcpdump, which saves network
traffic in binary files, known as PCAP files. The collected traffic is split into files of a
predefined size, saved in the local SnifferServer file system, and transferred to HDFS
only when each file has been completely written to the local file system of the SnifferServer. The
SnifferServer must be connected to the network where the monitored target nodes are
connected, and must be able to establish communication with the other nodes that
compose the HDFS cluster.
During the execution of a monitoring plan, the network traffic is first captured,
split into even-sized files and stored into HDFS. Through Tcpdump, a
widely used LibPCAP network traffic capture tool, the packets are captured and split into
PCAP files of 64MB, which is the default block size of the HDFS, although this
block size may be configured to different values.
HDFS is optimized to store large files, but internally each file is split into blocks of
a predefined size. Files that are larger than the HDFS block size must be split into blocks
with size equal to or smaller than the adopted block size, and must be spread among
machines in the cluster.
Because LibPCAP, used by Tcpdump, stores the network packets in binary PCAP
files, and due to the complexity of providing HDFS with an algorithm for splitting PCAP
files into packets, PCAP file splitting can be avoided by adopting files smaller
than the HDFS block size; alternatively, Hadoop can be provided with an algorithm to split PCAP
files into packets, in order to store larger PCAP files into the HDFS.
We adopted the approach that saves the network trace into PCAP files with the adopted
HDFS block size, using the split functionality provided by Tcpdump, because splitting
PCAP files into packets demands additional computing time and increases the complexity
of the system. Thus, the network traffic is captured by Tcpdump, split into even-sized
PCAP files, stored into the local file system of the SnifferServer, and periodically
transferred to HDFS, which is responsible for replicating the files across the cluster.
In the MapReduce framework, the input data is split into blocks, which are further split
into small pieces, called records, to be used as input for each Map function. We adopt

the use of entire blocks, with size defined by the HDFS block size, as input for each
Map function, instead of using the block divided into records. With this approach, it is
possible to evaluate more than one packet per MapReduce task and to reassemble an
application message from network traffic. It also gives the Map function more processing
time than the approach where each Map function receives only one
packet as input.
Differently from the approach presented by Lee et al. (2011), which only permits
evaluating packets individually per Map function, our approach makes it possible to
evaluate many packets from a PCAP file per Map function and to reassemble application
messages from network traffic whose content was divided into many
packets to be transferred over TCP.
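The contrast between per-packet and whole-block Map input can be sketched as follows (an illustrative Python simplification, not the dissertation's Java code; packets are modelled as `(flow_id, fragment)` tuples and a "message" spans several packets):

```python
# Per-packet input: each Map call sees one fragment and can never
# reassemble a message that spans several packets.
def map_per_packet(packet):
    flow_id, fragment = packet
    return [(flow_id, fragment)]

# Whole-block input: fragments of the same flow can be concatenated
# into complete application messages before indicators are extracted.
def map_whole_block(packets):
    flows = {}
    for flow_id, fragment in packets:
        flows.setdefault(flow_id, []).append(fragment)
    return {fid: "".join(frags) for fid, frags in flows.items()}

block = [("flowA", "JXTA-ms"), ("flowB", "JXTA"), ("flowA", "g-1")]
assert map_whole_block(block) == {"flowA": "JXTA-msg-1", "flowB": "JXTA"}
```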
Figure 3.2 shows the architecture to process distributed application traffic through
Map and Reduce functions, implemented by the AppAnalyzer, which is deployed at Hadoop
nodes and managed by the Manager, with the generated results stored into a distributed
database.

Figure 3.2 Architecture for network traffic analysis using MapReduce

The communication between components was characterized as blocking and non-blocking; blocking communication was adopted in cases that require high consistency,
and non-blocking communication was adopted in cases where eventual consistency can
be used to obtain better response time and scalability.

The AppAnalyzer is composed of Mappers and Reducers for specific application protocols
and indicators. The AppAnalyzer extends the AppParser, which provides protocol parsers to
transform network traffic into programmable objects, providing a high-level abstraction
to handle application messages from network traffic.
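The layering described above can be sketched roughly as follows (class and method names are hypothetical, chosen only to illustrate the AppParser/AppAnalyzer relationship; the real components are Java classes):

```python
# Illustrative sketch of the parser abstraction: AppParser turns raw
# bytes into application-level message objects, and AppAnalyzer builds
# indicator extraction on top of that abstraction.
class AppMessage:
    def __init__(self, flow_id, payload):
        self.flow_id = flow_id
        self.payload = payload

class AppParser:
    def parse(self, flow_id, raw_bytes):
        # A real parser would decode the protocol framing; here we only
        # wrap the bytes to show the layering.
        return AppMessage(flow_id, raw_bytes.decode("ascii"))

class AppAnalyzer(AppParser):
    def message_sizes(self, packets):
        # Extract a simple indicator (message size) from parsed messages.
        return [len(self.parse(fid, raw).payload) for fid, raw in packets]

analyzer = AppAnalyzer()
assert analyzer.message_sizes([("f1", b"hello"), ("f2", b"hi")]) == [5, 2]
```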
The Manager provides functionalities for users to create monitoring plans specifying
places, time and the amount of data to be captured. The amount of data to be processed
and the number of Hadoop nodes available for processing are important factors in obtaining
an optimal completion time of MapReduce jobs and in generating fresh information for
faster reaction to production problems of the monitored distributed system. Thus, after
the network traffic is captured and the PCAP files are stored into HDFS, the Manager permits
the selection of the number of files to be processed, and then schedules a MapReduce job
for this processing. After each MapReduce job execution, the Manager is also responsible
for storing the generated results into a distributed database.
We adopted a distributed database with eventual consistency and high availability,
based on Amazon's Dynamo (DeCandia et al., 2007) and implemented by Apache
Cassandra², to store the indicator results generated by the AppAnalyzer. With eventual
consistency, we expect gains from fast write and read operations, reducing the
blocking time of these operations.
The AppAnalyzer provides Map and Reduce functions to be used for evaluating specific
protocols and desired indicators. Each Map function receives as input the path of a PCAP
file stored into HDFS; this path is defined by Hadoop's data locality control,
which tries to delegate each task to nodes that have a local replica of the data or that are
near a replica. Then the file is opened and each network packet is processed, to remount
messages and flows and to extract the desired indicators.
During the data processing, the indicators are extracted from application messages
and saved in a SortedMapWritable object, which is ordered by timestamp. SortedMapWritable is a sorted collection of values which will be used by Reduce functions
to summarize each evaluated indicator. In our approach, each evaluated indicator is
extracted and saved into an individual Hadoop result file, which is stored into HDFS.
MapReduce usually splits blocks into records to be used as input for Map functions, but
we adopt whole files as input for Map tasks, to be able to perform DPI and reassemble
application messages whose content was divided into several TCP packets, due to TCP
segmentation or to an implementation decision of the evaluated application. If an
application message is smaller than the maximum segment size (MSS), one TCP packet
2 http://cassandra.apache.org/

can transport one or more application messages, but if an application message is greater
than the MSS, the message is split into several TCP packets, according to the TCP
segmentation. Thus, it is necessary to evaluate the full content of some TCP segments to
recognize application messages and their protocols.
If an application message has its packets spread across two or more blocks, the Map
function can generate intermediate data for these unevaluated messages, grouping
each message by flow and its individual identification, and the Reduce function can then
reassemble the message and evaluate it.
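This block-spanning case can be sketched as follows (an assumed Python simplification; flow and message identifiers are illustrative, not from the dissertation's code):

```python
# Each Map emits the unparsed fragment keyed by (flow id, message id);
# the Reduce gathers fragments from all blocks, sorts them by sequence
# number and reassembles the full message.
def map_block(block):
    # block: list of (flow, msg_id, seq, fragment) tuples
    return [((flow, msg_id), (seq, frag)) for flow, msg_id, seq, frag in block]

def reduce_fragments(key, values):
    # Sort by sequence number and concatenate into the full message.
    return key, "".join(frag for _, frag in sorted(values))

# Fragments of one message spread over two different HDFS blocks:
b1 = [("f1", 7, 0, "JXTA-")]
b2 = [("f1", 7, 1, "message")]
pairs = map_block(b1) + map_block(b2)
key = ("f1", 7)
values = [v for k, v in pairs if k == key]
assert reduce_fragments(key, values) == (("f1", 7), "JXTA-message")
```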
To evaluate the effectiveness of our approach, we developed a pilot project to extract
application indicators from the traffic of a JXTA-based distributed application, which
implements a distributed backup system based on JXTA Socket.
To analyse JXTA-based network traffic, we developed JNetPCAP-JXTA (Vieira, 2012b),
which parses network traffic into Java JXTA messages, and JXTAPerfMapper
and JXTAPerfReducer, which extract application indicators from the JXTA Socket communication layer through Map and Reduce functions.
JNetPCAP-JXTA is written in the Java language and provides methods to convert byte
arrays into Java JXTA messages, using an extension of the default JXTA library for Java,
known as JXSE³. With JNetPCAP-JXTA, we are able to parse all kinds of messages
defined by the JXTA specification. JNetPCAP-JXTA relies on the JNetPCAP library to
support the instantiation and inspection of LibPCAP packets. JNetPCAP was adopted due
to its performance when iterating over packets, the large number of functionalities provided
to handle packet traces, and the recent update activity of this library.
The JXTAPerfMapper implements a Map function that receives as input the path of
a PCAP file stored into the HDFS; the content of the specified file is then processed to
extract the number of JXTA connection requests and the number of JXTA message arrivals at
a server peer, and to evaluate the round-trip time of each piece of content transmitted over
a JXTA Socket. If a JXTA message is greater than the TCP PDU size, the message is split
into several TCP segments, due to TCP segmentation. Additionally, in JXTA network
traffic, one TCP packet can transport one or more JXTA messages, due to the buffer
window size used by the Java JXTA Socket implementation to segment its messages.
Because of the possibility of transporting more than one JXTA message per packet,
and because of TCP segmentation, it is necessary to reassemble more than one packet and the
full content of each TCP segment to recognize all possible JXTA messages, instead of
evaluating only a message header or the signature of individual packets, as is commonly done
3 http://jxse.kenai.com/

in DPI or by widely used traffic analysis tools, such as Wireshark⁴, which is unable to
recognize all JXTA messages in a captured network trace, because its approach does not
identify when two or more JXTA messages are transported in the same TCP
packet.
The JXTAPerfMapper implements a DPI algorithm to recognize, sort and reassemble TCP
segments into JXTA messages, which is shown in Algorithm 1.
Algorithm 1 JXTAPerfMapper
for all tcpPacket do
    if isJXTA or isWaitingForPendings then
        parsePacket(tcpPacket)
    end if
end for

function PARSEPACKET(tcpPacket)
    parseMessage
    if isMessageParsed then
        updateSavedFlows
        if hasRemain then
            parsePacket(remainPacket)
        end if
    else
        savePendingMessage
        lookForMoreMessages
    end if
end function
For each TCP packet of the PCAP file, it is verified whether it is a JXTA message or
part of a JXTA message that was not fully parsed and is waiting for its complement;
if one of these conditions is true, a parse attempt is made, using JNetPCAP-JXTA
functionalities, up to the full verification of the packet content. As a TCP packet may
contain one or more JXTA messages, if a message is fully parsed, another parse
attempt is made with the content not consumed by the previous parse. If the content is a JXTA
message and the parse attempt is not successful, its TCP content is stored with its
TCP flow identification as a key, and all subsequent TCP packets that match the flow
identification will be sorted and used to attempt to mount a new JXTA message, until the
parse is successful.
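The parsing loop of Algorithm 1 can be sketched in runnable form as follows (a Python simplification with an invented length-prefixed framing, `"<len>:<body>"`; the real JXTA framing and the JNetPCAP-JXTA API differ):

```python
def try_parse(buf):
    # Returns (message, remainder), or (None, buf) when bytes are missing.
    if ":" not in buf:
        return None, buf
    head, rest = buf.split(":", 1)
    n = int(head)
    if len(rest) < n:
        return None, buf          # incomplete: wait for the next segment
    return rest[:n], rest[n:]

def parse_stream(packets):
    messages, pending = [], {}    # pending buffers keyed by TCP flow
    for flow, payload in packets:
        # Prepend any fragment of this flow left over from earlier packets.
        buf = pending.pop(flow, "") + payload
        while True:
            msg, buf = try_parse(buf)
            if msg is None:
                break             # store the remainder for this flow
            messages.append(msg)  # one packet may yield several messages
        if buf:
            pending[flow] = buf
    return messages

# A packet with two messages (one truncated), then its continuation,
# then a message split across two packets:
pkts = [("f1", "5:hello2:"), ("f1", "hi"), ("f1", "4:jx"), ("f1", "ta")]
assert parse_stream(pkts) == ["hello", "hi", "jxta"]
```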
With these characteristics, inspecting JXTA messages and extracting application
indicators require more effort than other cases of DPI. For this kind of traffic analysis,
4 http://www.wireshark.org/

memory requirements become even larger, because the analysis needs to take into account
not only the state of the transport session, but also the state of each application layer session.
Processing power demands are also highest, because protocol conformance analysis requires
processing the entire application data (Risso et al., 2008).
As previously shown in Figure 3.2, the AppAnalyzer is composed of Map and Reduce
functions, respectively JXTAPerfMapper and JXTAPerfReducer, to extract performance
indicators from the JXTA Socket communication layer, a JXTA communication
mechanism that implements reliable message exchange and obtains the best throughput
among the communication layers provided by the Java JXTA implementation.
JXTA Socket messages are transported by the TCP protocol, but the JXTA Socket also
implements its own control for data delivery, retransmission and acknowledgements. Each
message of a JXTA Socket is part of a Pipe that represents a connection established
between the sender and the receiver. In a JXTA Socket communication, two Pipes are
established, one from sender to receiver and the other from receiver to sender, in which
content messages and acknowledgement messages are transported, respectively. To evaluate
and extract performance indicators from a JXTA Socket, the messages must be sorted,
grouped and linked with their respective content and acknowledgement Pipes.
The content transmitted over a JXTA Socket is split into byte array blocks, each
stored in a reliability message that is sent to the destination, which is expected to return
an acknowledgement message of its arrival. The time between the message delivery
and when the acknowledgement is sent back is called round-trip time (RTT); it may
vary according to the system load and may indicate a possible overload of a peer. In the
Java JXTA implementation, each block received or to be sent is queued by the JXTA
implementation until the system is ready to process a new block. This waiting time to
handle messages can impact the response time of the system, increasing the message
RTT.
The JXTAPerfMapper and JXTAPerfReducer evaluate the RTT of each content block
transmitted over a JXTA Socket, and also extract information about the number of
connection requests and message arrivals per time. Each Map function evaluates the
packet trace to mount JXTA messages, Pipes and Sockets. The parsed JXTA messages
are sorted by their sequence number and grouped by their Pipe identification, to compose
the Pipes of a JXTA Socket. As soon as the messages are sorted and grouped, the RTT is
obtained, its value is associated with its key and written as an output of the Map function.
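The RTT extraction can be sketched as follows (illustrative Python, not the Java implementation; pipe identifiers and timestamps are invented):

```python
# Content messages and acknowledgements are matched by pipe and sequence
# number; the RTT of a block is the time between the content message and
# its acknowledgement. Each message: (pipe_id, seq_number, timestamp_s).
def round_trip_times(content_msgs, ack_msgs):
    acks = {(pipe, seq): ts for pipe, seq, ts in ack_msgs}
    rtts = {}
    for pipe, seq, ts in sorted(content_msgs, key=lambda m: m[1]):
        if (pipe, seq) in acks:
            rtts[(pipe, seq)] = acks[(pipe, seq)] - ts
    return rtts

content = [("pipeA", 1, 10.0), ("pipeA", 2, 10.5)]
acks    = [("pipeA", 1, 10.2), ("pipeA", 2, 10.9)]
rtts = round_trip_times(content, acks)
assert abs(rtts[("pipeA", 1)] - 0.2) < 1e-9
assert abs(rtts[("pipeA", 2)] - 0.4) < 1e-9
```

A growing RTT over time would then show up directly in the Map output, hinting at queueing on an overloaded peer.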
The Reduce function defined by the JXTAPerfReducer receives as input a key and a collection of values, which are the evaluated indicator and its collected values, respectively,

and then generates individual files with the results of each evaluated indicator.
To extend these Map and Reduce functions to address other application indicators,
such as throughput or number of retransmissions, each indicator must be represented
by an intermediate key, which is used by MapReduce for grouping and sorting, and the
collected values must be associated with their key.

3.3 Evaluation

In this section we describe an experiment to evaluate the effectiveness of MapReduce
in expressing DPI algorithms and its completion time scalability for profiling distributed
applications through DPI; our scope was thus limited to evaluating the AppAnalyzer,
the AppParser and the Hadoop environment from the architecture presented before.

3.3.1 Evaluation Methodology

For this experimental evaluation, we adopted a methodology based on aspects of the GQM
(Goal-Question-Metric) template (Basili et al., 1994) and on the systematic approach to
performance evaluation defined by Jain (1991).
Two questions were defined to achieve our goal:
Q1: Can MapReduce express DPI algorithms and extract application indicators
from the network traffic of distributed applications?
Q2: Is the completion time of MapReduce for DPI proportionally scalable with the
addition of worker nodes?
To answer these questions, the metrics described in Table 3.1 were evaluated: the
number of indicators extracted from distributed application traffic and the behaviour
of the completion time scalability obtained by varying the number of worker nodes in a
MapReduce cluster. The completion time scalability evaluates how the completion time
decreases as nodes are added to a MapReduce cluster, for processing a defined input
dataset.
This experimental evaluation adopts the factors and levels described in Table 3.2,
which represent the number of worker nodes of a MapReduce cluster and the input size
used in MapReduce jobs. These factors make it possible to evaluate the scalability
behaviour of MapReduce under variations in the selected factors.

Table 3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI of a JXTA-based network traffic

Metric                          Description                                       Question
M1: Number of Indicators        Number of application indicators extracted        Q1
                                from a distributed application traffic.
M2: Proportional Scalability    Verify whether the completion time decreases      Q2
                                proportionally to the number of worker nodes.

Table 3.2 Factors and levels to evaluate the defined metrics

Factors                   Levels
Number of worker nodes    3 up to 19
Input Size                16GB and 34GB

Our testing hypotheses are defined in Tables 3.3 and 3.4, which describe the null and
alternative hypotheses for each previously defined question. Table 3.3 describes our
hypotheses and Table 3.4 presents the notation used to evaluate them.
Table 3.3 Hypotheses to evaluate the dened metrics

Alternative Hypothesis
H1num.indct : It is possible to use
MapReduce for extracting application indicators from network trafc.
H1scale.prop : The completion time of
MapReduce for DPI, does not scale
proportionally to node addition.

Null Hypothesis
Question
H0num.indct . It is not possible to use
Q1
MapReduce for extracting applications indicators from network trafc.
H0scale.prop . The completion time of
Q2
MapReduce for DPI, scales proportionally to node addition.

The hypotheses H1num.indct and H0num.indct were defined to evaluate whether MapReduce
can be used to extract application indicators from network traffic; for this evaluation,
we analysed the number of indicators extracted from a JXTA-based network traffic,
represented by num.indct.
It is common to see statements saying that MapReduce scalability is linear, but
achieving linear scalability in distributed systems is a difficult task. Linear scalability
happens when a parallel system does not lose performance while scaling (Gunther,
2006); a node addition then implies a proportional performance gain in completion time or
processing capacity. We defined the hypotheses H1scale.prop and H0scale.prop to evaluate
the completion time scalability behaviour of MapReduce, testing whether it provides proportional
completion time scalability. In these hypotheses, t represents the completion time for
executing a job j, s represents the cluster size and n represents the evaluated multiplication


Table 3.4 Hypothesis notation

Hypothesis        Notation                                        Question
H1num.indct       num.indct > 0                                   Q1
H0num.indct       num.indct <= 0                                  Q1
H1scale.prop      ∃ n ∈ N, n > 0 : s_n = s · n ∧ t_n ≠ t / n      Q2
H0scale.prop      ∀ n ∈ N, n > 0 : s_n = s · n ⇒ t_n = t / n      Q2

factor, which is the increase factor for the evaluated cluster size. H0scale.prop states that,
when evaluating a specific MapReduce job and input data, for every natural n greater
than zero, a new cluster size defined by a previous cluster size multiplied by the factor n
implies a reduction of the previous job time t by the same factor n,
resulting in the time t_n obtained by dividing the previous time t by n.
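A numeric reading of these hypotheses can be sketched as follows (the 5% tolerance is our assumption, not part of the dissertation):

```python
# Under H0, multiplying the cluster size by n (s_n = s * n) should divide
# the completion time by the same factor (t_n = t / n).
def scales_proportionally(t, t_n, n, tol=0.05):
    # True when the measured t_n is within `tol` of the ideal t / n.
    ideal = t / n
    return abs(t_n - ideal) / ideal <= tol

# Ideal linear scaling satisfies H0:
assert scales_proportionally(t=300.0, t_n=150.0, n=2)
# A measured time well above t / n supports H1 (no proportional scaling):
assert not scales_proportionally(t=246.03, t_n=151.73, n=2)
```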

3.3.2 Experiment Setup

To evaluate the effectiveness of MapReduce for application traffic analysis and its completion
time scalability, we performed two sets of experiments, grouped by the input size analysed,
with variation in the number of worker nodes.
As input for the MapReduce jobs, we used network traffic captured from a JXTA-based
distributed backup system, which uses the JXTA Socket communication layer for data
transfer between peers. The network traffic was captured from an environment composed
of six peers, where one server peer receives data from five concurrent client peers, to be
stored and replicated to other peers. During traffic capture, one server peer creates
a JXTA Socket Server to accept JXTA Socket connections and receive data through
established connections.
For each data backup, one client peer establishes a connection with a server peer and
sends messages with the content to be stored; if the content is bigger than the
JXTA message maximum size, it is transferred through two or more JXTA
messages. For our experiment, we adopted the backup of files with randomly defined
content sizes, with values between 64KB and 256KB.
The captured network traffic was saved into datasets of 16GB and 34GB, split into 35
and 79 files of 64MB, respectively, and stored into HDFS, to be processed as described
in Section 3.2, in order to extract the following indicators from the JXTA Socket
communication layer: round-trip time, number of connection requests per time and number
of messages received by one server peer per time.
For each experiment set, Algorithm 1 was executed, implemented by JXTAPerfMapper and JXTAPerfReducer, and the completion time and processing capacity
for profiling a JXTA-based distributed application through DPI were measured over different numbers of
worker nodes. Each experiment was executed 30 times to obtain reliable values (Chen
et al., 2011), within a confidence interval of 95% and a maximum error ratio of 5%. The
experiment was performed using virtual machines of Amazon EC2, with nodes running Linux kernel 3.0.0-16, Hadoop version 0.20.203, a block size of 64MB and with the
data replicated 3 times over the HDFS. All virtual machines were composed of 2
virtual cores, 2.5 EC2 Compute Units and 1.7GB of RAM.
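A mean completion time and its margin of error can be computed from the repeated runs roughly as follows (a normal-approximation sketch with z = 1.96 for a 95% confidence interval; the run values below are hypothetical and the dissertation's exact statistical procedure may differ):

```python
import statistics, math

def mean_with_margin(samples, z=1.96):
    # Margin of error = z * standard error of the mean.
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z * stderr

runs = [322.1, 323.0, 322.4, 322.9, 322.6, 322.2]  # hypothetical times (s)
mean, margin = mean_with_margin(runs)
assert abs(mean - 322.53) < 0.01
assert margin < 1.0
```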

3.4 Results

From the JXTA traffic analysed, we extracted three indicators: the number of JXTA
connection requests per time, the number of JXTA messages received per time, and the
round-trip time of JXTA messages, defined as the time between the arrival of a content
message from a client peer and the JXTA acknowledgement sent back by the server peer.
The extracted indicators are shown in Figure 3.3.

Figure 3.3 JXTA Socket trace analysis

Figure 3.3 shows the extracted indicators, exhibiting the measured indicators from the
JXTA Socket communication layer and their behaviour under concurrent data transfers,
for a server peer receiving JXTA Socket connection requests and messages from concurrent
client peers of a distributed backup system.
The three indicators extracted from the network traffic of a JXTA-based distributed
application, using MapReduce to perform the DPI algorithm, represent important indicators
for evaluating a JXTA-based application (Halepovic and Deters, 2005). With these
indicators it is possible to evaluate a distributed system, providing a better understanding
of the behaviour of a JXTA-based distributed application. Through the extracted
information it is possible to evaluate important metrics, such as the load distribution, the
response time and the negative impact caused by the increasing number of messages
received by a peer.
Using MapReduce to perform a DPI algorithm, it was possible to extract the three
application indicators from the network traffic, so we obtained num.indct = 3, which rejects
the null hypothesis H0num.indct, stating num.indct <= 0, and confirms the alternative
hypothesis H1num.indct, stating num.indct > 0.
Figures 3.4(a) and 3.4(b) illustrate how the addition of worker nodes to a Hadoop
cluster reduces the mean completion time, and the completion time scalability for
profiling 16 GB and 34 GB of network traffic trace.
[Figure: completion time in seconds versus number of worker nodes]
(a): Scalability to process 16 GB    (b): Scalability to process 34 GB

Figure 3.4 Completion time scalability of MapReduce for DPI

In both graphs, the behaviour of the completion time scalability is similar, not
following a linear function, with more significant scalability gains from node
addition in smaller clusters and less significant gains from node addition in bigger
clusters.
This scalability behaviour highlights the importance of evaluating the relation between
costs and benefits of node additions in a MapReduce cluster, due to the non-proportional
gain obtained with node addition.


Tables 3.5 and 3.6 present the results of the experiments to deeply
inspect 16 GB and 34 GB of network traffic trace, respectively, showing the number of Hadoop nodes
used in each experiment, the mean completion time in seconds, its margin of error, the
processing capacity achieved and the relative processing capacity per node in the cluster.
Table 3.5 Completion time to process 16 GB split into 35 files

Nodes    Time      Margin of Error    MB/s      (MB/s)/node
3        322.53    0.54               50.80     16.93
4        246.03    0.67               66.59     16.65
6        173.17    0.56               94.61     15.77
8        151.73    1.55               107.98    13.50
10       127.17    1.11               128.84    12.88

Table 3.6 Completion time to process 34 GB split into 79 files

Nodes    Time      Margin of Error    MB/s      (MB/s)/node
4        464.33    0.32               74.98     18.75
8        260.60    0.76               133.60    16.70
12       189.07    1.18               184.14    15.35
16       167.13    0.81               208.32    13.02
19       134.47    1.53               258.91    13.63
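The MB/s and (MB/s)/node columns follow directly from the input size and the mean completion time; a quick sketch checking the first rows of Tables 3.6 and 3.5 (taking 1 GB as 1024 MB):

```python
def throughput(input_mb, time_s, nodes):
    # Processing capacity and its per-node share.
    mbs = input_mb / time_s
    return mbs, mbs / nodes

mbs, per_node = throughput(34 * 1024, 464.33, 4)   # first row of Table 3.6
assert abs(mbs - 74.98) < 0.01
assert abs(per_node - 18.75) < 0.01
```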

In our experiments, we achieved a maximum mean processing capacity of 258.91 MB
per second, in a cluster with 19 worker nodes processing 34 GB. For a cluster with 4
nodes, we achieved mean processing capacities of 66.59 MB/s and 74.98 MB/s to process
16 GB and 34 GB of network traffic trace, respectively, which indicates that processing
capacity may vary as a function of the amount of data processed and the number of files
used as input, and that the input size is an important factor to be analysed
in MapReduce performance evaluations.
The results show that, for the evaluated scenario and application, the completion time
decreases with the increase in the number of nodes in the cluster, but not proportionally
to the node addition and not as a linear function, as can be observed in Figures 3.4(a) and
3.4(b). Tables 3.5 and 3.6 also show values that confirm the non-proportional
completion time scalability. For example, Table 3.5 shows that when a cluster with
4 nodes processing 16 GB was scaled out to 8 nodes, the number of nodes was doubled,
but the gain in completion time was only a factor of 1.62.
To evaluate our stated hypotheses H1scale.prop and H0scale.prop based on this example,
we have the measured s_2 = 8 and the calculated s · n = 4 · 2 = 8 = s_2, which confirms
s_n = s · n. We also have the measured t_2 = 151.73, while the calculated
t / n = 246.03 / 2 = 123.01 ≠ t_2, which rejects t_n = t / n and confirms t_n ≠ t / n. Therefore,
with the measured results, the null hypothesis H0scale.prop was rejected and the
alternative hypothesis H1scale.prop was confirmed, which states that the completion time of MapReduce
for DPI does not scale proportionally to node addition.
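The counterexample above can be reproduced numerically:

```python
s, n, s2 = 4, 2, 8
t, t2 = 246.03, 151.73           # measured times for 4 and 8 nodes (16 GB)

assert s * n == s2               # s_n = s * n holds
ideal = t / n                    # 123.015, rounded to 123.01 in the text
assert abs(ideal - 123.01) < 0.01
assert abs(t2 - ideal) > 1.0     # t_n != t / n: H0 rejected

speedup = t / t2                 # the factor-1.62 gain cited in the text
assert abs(speedup - 1.62) < 0.01
```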

3.5 Discussion

In this section, we discuss the measured results and evaluate their meaning, restrictions and
opportunities. We also discuss possible threats to the validity of our experimental results.

3.5.1 Results Discussion

Distributed systems analysis, detection of root causes and error reproduction are challenges
that motivate efforts to develop less intrusive mechanisms for profiling and
monitoring distributed applications at runtime. Network traffic analysis is one option for
evaluating distributed systems, although there are limitations on the capacity to process a large
amount of network traffic in a short time, and on the completion time scalability to process
network traffic when resource demand varies.
According to the evaluated results of using MapReduce for profiling network traffic
from a JXTA-based distributed backup system through DPI, it is important to analyse the
possible gains of node addition to a MapReduce cluster, because node addition
provides different gains according to the cluster size and input size. For example, Table
3.6 shows that the addition of 4 nodes to a cluster with 12 nodes produces a reduction
of 11% in completion time and an improvement of 13% in processing capacity, while the
addition of the same number of nodes (4) to a cluster with 4 nodes produces a
reduction of 43% in completion time and an improvement of 78% in processing capacity.
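This cost/benefit comparison can be checked against Table 3.6 (the percentages in the text are rounded):

```python
def gains(t_before, t_after, cap_before, cap_after):
    # Relative completion time reduction and processing capacity gain.
    time_reduction = 1 - t_after / t_before
    capacity_gain = cap_after / cap_before - 1
    return time_reduction, capacity_gain

tr_12_16, cg_12_16 = gains(189.07, 167.13, 184.14, 208.32)   # 12 -> 16 nodes
tr_4_8, cg_4_8 = gains(464.33, 260.60, 74.98, 133.60)        # 4 -> 8 nodes

assert abs(tr_12_16 - 0.116) < 0.005 and abs(cg_12_16 - 0.131) < 0.005
assert abs(tr_4_8 - 0.439) < 0.005 and abs(cg_4_8 - 0.782) < 0.005
```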
The scalability behaviour of MapReduce for DPI highlights the importance of evaluating the relation between costs and benefits of node additions to a MapReduce cluster,
because the gains obtained with node addition are related to the current and future cluster
size and to the input size to be processed.
The growth of the number of nodes in the cluster increases costs due to greater
cluster management, data replication, task allocation to available nodes and failure
management. Also, as the cluster grows, the cost of merging and sorting the data
processed by Map tasks increases (Jiang et al., 2010), as this data can
be spread over a larger number of nodes.

In smaller clusters, the probability of a node having a replica of the input data is
greater than in bigger clusters adopting the same replication factor (Zaharia et al., 2010).
In bigger clusters there are more candidate nodes to which a task can be delegated, but the
replication factor limits the benefits of data locality to the number of nodes that
store a replica of the data. This increases the cost of scheduling and distributing tasks
in the cluster, and also increases the cost of data transfer over the network.
The kind of workload submitted to MapReduce impacts its behaviour and performance
(Tan et al., 2012; Groot, 2012), requiring specific configuration to obtain optimal
performance. Although studies have been conducted to understand, analyse and improve
workload management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no
evaluation that characterizes MapReduce behaviour or identifies its optimal configuration
to achieve the best performance for packet level analysis and DPI. Thus, it is necessary
to deeply understand the behaviour of MapReduce when processing network traces, and
which optimizations can be made to better explore the potential provided by MapReduce
for packet level analysis and DPI.

3.5.2 Possible Threats to Validity

Due to budget and time restrictions, our experiments were performed with small cluster
sizes and small input sizes, compared with benchmarks that evaluate MapReduce performance
and its scalability (Dean and Ghemawat, 2008). However, relevant performance
evaluations and reports of real MapReduce production traces show that the majority of
MapReduce jobs are small and executed on a small number of nodes (Zaharia et al.,
2008; Wang et al., 2009; Lin et al., 2010; Zaharia et al., 2010; Kavulya et al., 2010; Chen
et al., 2011; Guo et al., 2012).
Although MapReduce was designed to handle big data, the use of input data on the order of gigabytes has been reported in realistic production traces (Chen et al., 2011), and this input size has been used in relevant MapReduce performance analyses (Zaharia et al., 2008; Wang et al., 2009; Lin et al., 2010).
Improvements in MapReduce performance and proposed schedulers have focused on problems related to small jobs; for example, Facebook's fairness scheduler aims to provide fast response times for small jobs (Zaharia et al., 2010; Guo et al., 2012). The fair scheduler attempts to guarantee service levels for production jobs by maintaining job pools composed of a smaller number of nodes than the total nodes of a data center, maintaining a minimum share and dividing excess capacity among all jobs or pools (Zaharia et al., 2010).
According to Zaharia et al. (2010), 78% of Facebook's MapReduce jobs have up to 60 Map tasks. Our evaluated datasets were composed of 35 and 79 files, which implies the same respective numbers of Map tasks, because our approach evaluates an entire block per Map task.

3.6 Chapter Summary

In this chapter, we presented an approach for profiling application traffic using MapReduce, and evaluated its effectiveness for profiling applications through DPI, as well as its completion time scalability, in a cloud computing environment.
We proposed a solution based on MapReduce for deep inspection of distributed application traffic, in order to evaluate the behaviour of distributed systems at runtime, using commodity hardware, in a low intrusive way, through a scalable and fault tolerant approach based on Hadoop, an open source implementation of MapReduce.
MapReduce was used to implement a DPI algorithm that extracts application indicators from the JXTA-based traffic of a distributed backup system. We adopted a splitting approach without dividing blocks into records: the network trace was split into files with maximum size smaller than the HDFS block size, to avoid the cost and complexity of providing HDFS an algorithm for splitting the network trace into blocks, and also to use a whole block as input for each Map function, making it possible to reassemble two or more packets, and rebuild JXTA messages from the packets of network traces, per Map function.
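As an illustration, this whole-file splitting strategy can be sketched in Python as follows. The function and sizes are illustrative assumptions, not the dissertation's actual tooling, and a real splitter must cut at packet boundaries (as Tcpdump's file rotation does) rather than at arbitrary byte offsets; the byte-level cut here only illustrates the size constraint.

```python
import io

def split_trace(trace, max_chunk_bytes):
    """Split a binary network trace into chunks no larger than
    max_chunk_bytes, so each chunk fits inside one HDFS block and
    can be handed whole to a single Map function."""
    chunks = []
    stream = io.BytesIO(trace)
    while True:
        chunk = stream.read(max_chunk_bytes)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# Toy demonstration: a 10-byte "trace" split into 4-byte chunks.
parts = split_trace(b"0123456789", 4)
print([len(p) for p in parts])  # [4, 4, 2]
```

Keeping every chunk below the HDFS block size guarantees that each file occupies exactly one block, so each Map function sees a complete, self-contained slice of the trace.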
We evaluated the effectiveness of MapReduce for a DPI algorithm, and its completion time scalability, over different sizes of network traffic used as input and different cluster sizes. We showed that the MapReduce programming model can express algorithms for DPI and extract application indicators from application network traffic, using virtual machines of a cloud computing provider, for DPI of large amounts of network traffic. We also evaluated its completion time scalability, showing the scalability behaviour, the processing capacity achieved, and the influence of the number of nodes and the input data size on the processing capacity for DPI.
It was shown that the MapReduce completion time scalability for DPI does not follow a linear function: the scalability gains from node addition are more significant in small clusters and less significant in bigger clusters.
According to the results, input size and cluster size have a significant impact on the processing capacity and completion time of MapReduce jobs for DPI. This highlights the importance of evaluating the best input size and cluster size to obtain optimal performance in MapReduce jobs, but it also indicates the need for further evaluation of the influence of other important factors on MapReduce performance, in order to support better configuration, input size selection and machine allocation in a cluster, and to provide valuable information for performance tuning and prediction.


4 Evaluating MapReduce for Network Traffic Analysis

All difficult things have their origin in that which is easy, and great things in that which is small.
LAO TZU

The use of MapReduce for distributed data processing has been growing and achieving benefits for different workloads. MapReduce can be used for distributed traffic analysis, although network traffic traces do not resemble the data commonly processed through MapReduce, which in general is divisible and text-like, while network traces are binary and may present restrictions on splitting when processed through distributed approaches.
Due to the lack of evaluation of MapReduce for traffic analysis and the peculiarity of this kind of data, this chapter deeply evaluates the performance of MapReduce for packet level analysis and DPI of distributed application traffic, evaluating its scalability, speed-up and the behaviour followed by the MapReduce phases. The experiments provide evidence of the predominant phases in this kind of MapReduce job, and show the impact of input size, block size and number of nodes on completion time and scalability.
This chapter is organized as follows. We first describe the motivation for a MapReduce performance evaluation for network traffic analysis in Section 4.1. Then we present the evaluation plan and methodology adopted in Section 4.2, and the results are presented in Section 4.3. Section 4.4 discusses the results and Section 4.5 summarizes the chapter.

4.1 Motivation

It is possible to measure, evaluate and diagnose distributed applications through the evaluation of information from communication protocols, flows, throughput, and load distribution (Mi et al., 2012; Nagaraj et al., 2012; Sambasivan et al., 2011; Aguilera et al., 2003; Yu et al., 2011). This information can be collected through network traffic analysis, but to retrieve application information from network traces it is necessary to recognize the application protocol and deeply inspect the traffic to retrieve details about its behaviour, sessions and states.
MapReduce can be used for offline evaluation of distributed applications, analysing application traffic inside a data center through packet level analysis (Lee et al., 2011), where each packet is evaluated individually, and through DPI (Vieira et al., 2012b,a), which adopts a different approach to data splitting: a whole block is processed without division into individual packets, because two or more packets must be reassembled to retrieve information from the application layer, in order to evaluate application messages and protocols.
The kind of workload submitted for processing by MapReduce impacts the behaviour and performance of MapReduce (Tan et al., 2012; Groot, 2012), requiring specific configuration to obtain optimal performance. Information about the occupation of MapReduce phases, the processing characteristics (whether the job is I/O or CPU bound), and the mean duration of Map and Reduce tasks can be used to optimize parameter configurations, and to improve resource allocation and task scheduling.
The main evaluations of MapReduce are in text processing (Zaharia et al., 2008; Chen et al., 2011; Jiang et al., 2010; Wang et al., 2009), where the input data is split into blocks and into records to be processed by parallel and independent Map functions. For distributed processing of network traffic traces, which are usually binary, splitting the data into packets is a concern and, in some cases, the data may need to be processed without splitting, especially when packet reassembly is required to extract application information from the application layer.
Although work has been done to understand, analyse and improve workload management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes MapReduce behaviour or identifies its optimal configuration for the best performance in packet level analysis and DPI.
Due to the lack of evaluation of MapReduce for traffic analysis and the peculiarity of this kind of data, it is necessary to understand the behaviour of MapReduce when processing network traces, and to understand what optimizations can be done to better explore the potential provided by MapReduce for packet level analysis and DPI.
This chapter evaluates MapReduce performance for network packet level analysis and DPI using Hadoop, characterizing the behaviour of the MapReduce phases, scalability and speed-up, over variations of input, block and cluster sizes. The main contributions of this chapter are:

1. Characterization of the behaviour of MapReduce phases for packet level analysis and DPI;

2. Description of the scalability behaviour and its relation to important MapReduce factors;

3. Identification of the performance provided by the block sizes adopted for different cluster sizes;

4. Description of the speed-up obtained for DPI.

4.2 Evaluation

The goal of this evaluation is to characterize the behaviour of the MapReduce phases, its scalability characteristics under node addition, and the speed-up achieved with MapReduce for packet level analysis and DPI. Thus, we performed a performance measurement and evaluation of MapReduce jobs that execute packet level analysis and DPI algorithms.
To evaluate MapReduce for DPI, Algorithm 1, implemented by JXTAPerfMapper and JXTAPerfReducer, was used and applied to new factors and levels. To evaluate MapReduce for packet level analysis, a port counter algorithm developed by Lee et al. (2011) was used, which divides a block into packets and processes each packet individually to count the occurrences of TCP and UDP port numbers. The same algorithm was also evaluated using the splitting approach that processes a whole block per Map function, without dividing a block into records or packets, and a comparison was made between these two approaches for packet level analysis.
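The difference between the two splitting approaches can be illustrated with a minimal Python sketch of a port counter; the tuple-based packet representation and the function names are hypothetical, standing in for real pcap parsing, and are not taken from the actual drivers.

```python
from collections import Counter

# Each packet is modelled as (protocol, src_port, dst_port); real
# implementations would parse these fields from raw pcap bytes.
packets = [("TCP", 443, 51000), ("UDP", 53, 40000), ("TCP", 443, 51001)]

def map_per_packet(packet):
    """Record-splitting approach (P3 style): one Map call per packet."""
    proto, src, dst = packet
    return [((proto, src), 1), ((proto, dst), 1)]

def map_whole_block(block):
    """Whole-block approach (CountUpDriver style): one Map call
    receives every packet of the block and iterates internally."""
    pairs = []
    for packet in block:
        pairs.extend(map_per_packet(packet))
    return pairs

def reduce_counts(pairs):
    """Reduce: sum the occurrences of each (protocol, port) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

a = reduce_counts([p for pkt in packets for p in map_per_packet(pkt)])
b = reduce_counts(map_whole_block(packets))
assert a == b  # both approaches yield the same port counts
print(a[("TCP", 443)])  # 2
```

Both routes produce identical counts; what changes is the unit of work handed to each Map call, and therefore how much record splitting the framework must perform.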

4.2.1 Evaluation Methodology

For this evaluation, we adopted a methodology based on the systematic approach to performance evaluation defined by Jain (1991), which consists of the definition of the goal, metrics, factors and levels of a performance study.


The goal of this evaluation is to characterize the behaviour of the MapReduce phases, its scalability through node addition, and the speed-up achieved with MapReduce for packet level analysis and DPI, in order to understand the impact of each factor on MapReduce performance for this kind of input data, and thereby to configure MapReduce to obtain optimal performance over the evaluated factors.
Table 4.1 Metrics for evaluating MapReduce for DPI and packet level analysis

Metric             Description
Completion Time    Completion time of MapReduce jobs
Phases Time        Time consumed by each MapReduce phase within the total completion time of MapReduce jobs
Phases Occupation  Relative time consumed by each MapReduce phase within the total completion time of MapReduce jobs
Scalability        Processing capacity increase obtained with node addition in a MapReduce cluster
Speed-up           Improvement in completion time against the same algorithm implemented without distributed processing

Table 4.1 describes the evaluated metrics: the completion time of MapReduce jobs, the relative and absolute time of each MapReduce phase within the total job time, the processing capacity scalability, and the speed-up against non-distributed processing.
The experiments adopt the factors and levels described in Table 4.2. The selected factors were chosen due to their importance for MapReduce performance evaluations and their adoption in relevant previous research (Jiang et al., 2010; Chen et al., 2011; Shafer et al., 2010; Wang et al., 2009).
Table 4.2 Factors and Levels

Factor                  Levels
Number of Worker Nodes  2 up to 29
Block Size              32MB, 64MB and 128MB
Input Size              90Gb and 30Gb

Hadoop logs are a valuable source of information about the Hadoop environment and its job executions; important MapReduce indicators and information about jobs, tasks, attempts, failures and topology are logged by Hadoop during execution. The data used in this performance evaluation was extracted from Hadoop logs.
To extract information from Hadoop logs and to evaluate the selected metrics, we developed Hadoop-Analyzer (Vieira, 2013), an open source and publicly available tool that extracts and evaluates MapReduce indicators, such as job completion time and the distribution of MapReduce phases, from the logs generated by Hadoop job executions. With Hadoop-Analyzer it is possible to generate graphs of the extracted indicators and thereby evaluate the desired metrics.
Hadoop-Analyzer relies on Rumen (2012) to extract raw data from Hadoop logs and generate structured information, which is processed and shown in graphs generated through R (Eddelbuettel, 2012) and Gnuplot (Janert, 2010), such as the results presented in Section 4.3.

4.2.2 Experiment Setup

Network traffic traces of distributed applications were captured to be used as input for the MapReduce jobs of our experiments; these traces were divided into files with size defined by the block size adopted in each experiment, and the files were then stored in HDFS, following the process described in the previous chapter. The packets were captured using Tcpdump and were split into files of 32MB, 64MB and 128MB.
For the packet level analysis and DPI evaluations, two dataset sizes were captured from network traffic transferred between nodes of distributed systems. One dataset was 30Gb of network traffic, divided into 30 files of 128MB, 60 files of 64MB and 120 files of 32MB. The other dataset was 90Gb of network traffic, split into 90 files of 128MB, 180 files of 64MB and 360 files of 32MB.
For the DPI experiments with MapReduce, we used network traffic captured from the same JXTA-based application described in Section 3.3.2, but with different trace and file sizes. To evaluate MapReduce for packet level analysis, we processed network traffic captured from data transferred between 5 clients and one server of a data storage service provided through the Internet, known as Dropbox1.
To evaluate MapReduce for packet level analysis and DPI, one driver was developed for each case of network traffic analysis, with one version using MapReduce and another without it.
CountUpDriver implements packet level analysis for a port counter of network traces, which records how many times a port appears in TCP or UDP packets; its implementation is based on processing a whole block as input for each Map function, without splitting, and with block size defined by the HDFS block size. In addition, a port counter implemented with P3 was evaluated; this implementation is a version of the tool presented by Lee et al. (2011), which adopts an approach that divides a block into packets and processes each packet individually, without dependent information between packets.

1 http://www.dropbox.com/
JxtaSocketPerfDriver implements DPI to extract, from JXTA (Duigou, 2003) network traffic, the round-trip time of JXTA messages, the number of connection requests per time and the number of JXTA Socket messages between JXTA clients and a JXTA Socket server. JxtaSocketPerfDriver uses a whole file as input for each Map function, with size defined by the HDFS block size, in order to reassemble JXTA messages whose content is divided across many TCP packets.
One TCP packet can transport one or more JXTA messages at a time, which makes it necessary to evaluate the full content of TCP segments to recognize all possible JXTA messages, instead of evaluating only a message header or signature. The round-trip time of JXTA messages is calculated as the time between a client peer sending a JXTA message and receiving the confirmation of the message's arrival. To evaluate the round-trip time it is necessary to keep track of requests and of which responses correspond to each request; thus, several packets must be analysed to retrieve and evaluate information about the application behaviour and its states.
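The request/response bookkeeping described above can be sketched in Python; the event representation and message identifiers are hypothetical, since real JXTA messages must first be reassembled from TCP segments before any matching can be done.

```python
def round_trip_times(events):
    """Compute per-message round-trip times from a packet-ordered
    event stream. Each event is (timestamp_s, kind, message_id),
    where kind is 'request' or 'ack'; state must be kept across
    packets, which is why whole blocks are inspected per Map call."""
    pending = {}  # message_id -> request timestamp
    rtts = {}
    for ts, kind, msg_id in events:
        if kind == "request":
            pending[msg_id] = ts
        elif kind == "ack" and msg_id in pending:
            rtts[msg_id] = ts - pending.pop(msg_id)
    return rtts

events = [
    (0.000, "request", "m1"),
    (0.010, "request", "m2"),
    (0.025, "ack", "m1"),  # RTT(m1) = 0.025 s
    (0.040, "ack", "m2"),  # RTT(m2) = 0.030 s
]
print({k: round(v, 3) for k, v in round_trip_times(events).items()})
# {'m1': 0.025, 'm2': 0.03}
```

Because a response may arrive many packets after its request, splitting the trace between a request and its acknowledgement would lose the pairing; handing a whole block to one Map function keeps both ends of each exchange visible.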
To analyse the speed-up provided by MapReduce against a single-machine execution, two drivers were developed that use the same dataset and implement the same algorithms as CountUpDriver and JxtaSocketPerfDriver, but without distributed processing. These drivers are, respectively, CountUpMono and JxtaSocketPerfMono. The source code of all implemented drivers, and of other implementations that support the use of MapReduce for network traffic analysis, is open source and publicly available at Vieira (2012a).
The experiments were performed on a 30-node Hadoop-1.0.3 cluster composed of nodes with four 3.2GHz cores, 8GB of RAM and 260GB of available hard disk space, running Linux kernel 3.2.0-29. Hadoop was used as our MapReduce implementation, configured to permit a maximum of 4 Map and 1 Reduce tasks per node; we also defined the value -Xmx1500m as the JVM child option and 400 as the io.sort.mb value.
For the drivers CountUpDriver and JxtaSocketPerfDriver, the number of Reducers was defined as a function of the number of Reducer slots per node, given by numReducers = (0.95)(numNodes)(maxReducersPerNode) (Kavulya et al., 2010). The driver implemented with P3 (Lee et al., 2011) adopts a fixed number of Reducers, defined as 10 by the available version of P3. Each experiment was executed 20 times to obtain reliable values (Chen et al., 2011), within a confidence interval of 95% and a maximum error ratio of 5%.
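The Reducer sizing rule can be made concrete with a small sketch; rounding the result down to an integer is our assumption about how the formula was applied, since a fractional number of Reduce tasks is not meaningful.

```python
import math

def num_reducers(num_nodes, max_reducers_per_node=1):
    """Number of Reduce tasks from the rule of thumb
    numReducers = 0.95 * numNodes * maxReducersPerNode,
    rounded down so all Reducers can run in a single wave."""
    return math.floor(0.95 * num_nodes * max_reducers_per_node)

# With 1 Reduce slot per node, as configured in these experiments:
print(num_reducers(6))   # 5  (the 6-node case discussed in Section 4.3)
print(num_reducers(29))  # 27
```

The 0.95 factor leaves a small margin of Reduce slots free, so a failed or slow Reducer can be rescheduled without forcing a second Reduce wave.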


4.3 Results

Two dataset sizes of network traffic were used during the experiments: 30Gb and 90Gb. Each dataset was processed by MapReduce jobs that implement packet level analysis and DPI, in Hadoop clusters with the number of worker nodes varying between 2 and 29, and block sizes of 32MB, 64MB and 128MB.
Each dataset was also processed by the same algorithms implemented without distributed processing, to evaluate the speed-up achieved. Table 4.3 shows the execution times obtained by the non-distributed processing, implemented and executed through JxtaSocketPerfMono and CountUpMono on a single machine, with the resource configuration described in Subsection 4.2.2.
Table 4.3 Non-Distributed Execution Time in seconds

Block    JxtaSocketPerfMono 90Gb    JxtaSocketPerfMono 30Gb    CountUpMono 90Gb    CountUpMono 30Gb
32MB     1745.35                    584.92                     872.40              86.71
64MB     1755.40                    587.02                     571.33              91.76
128MB    1765.50                    606.50                     745.25              94.82

Figure 4.1 shows the completion time and speed-up of the DPI Algorithm 1 used to extract indicators from a JXTA-based distributed application. The completion time is the job time of JxtaSocketPerfDriver, and the speed-up represents the gain in execution time of JxtaSocketPerfDriver over JxtaSocketPerfMono for processing 90Gb of network traffic.
Figure 4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application network traffic (completion time in seconds and speed-up versus number of nodes, for block sizes of 32MB, 64MB and 128MB)

45

4.3. RESULTS

According to Figure 4.1, JxtaSocketPerfDriver performs better than JxtaSocketPerfMono over all factor variations. With 2 nodes, the best speed-up was 3.70 times, obtained with blocks of 128MB; the maximum overall speed-up was 16.19 times, with 29 nodes and blocks of 64MB. The speed-up achieved with a block size of 32MB was initially the worst case, but it increased with node addition, becoming better than blocks of 128MB and close to the speed-up achieved with blocks of 64MB for a cluster with 29 nodes.
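Speed-up here is simply the ratio of the single-machine time to the MapReduce job completion time. In the sketch below, the 1765.50 s figure comes from Table 4.3, while the roughly 477 s job time is back-calculated from the reported 3.70x speed-up rather than being a reported measurement.

```python
def speedup(mono_seconds, distributed_seconds):
    """Speed-up = non-distributed execution time / MapReduce job time."""
    return mono_seconds / distributed_seconds

# JxtaSocketPerfMono, 90Gb, 128MB files: 1765.50 s (Table 4.3).
# At the reported 3.70x speed-up, the 2-node job time was about
# 1765.50 / 3.70 = 477.2 s (inferred, not measured).
print(round(speedup(1765.50, 477.2), 2))  # 3.7
```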
The completion time for the 32MB block size decreased with every node addition, while the cases with block sizes of 64MB and 128MB presented no significant reduction in completion time in clusters with more than 25 nodes. According to Figure 4.1, the completion time does not decrease linearly with node addition, and the improvement in completion time was less significant when the dataset was processed by more than 14 nodes, especially for the cases that adopted blocks of 64MB and 128MB.
Figure 4.2 shows the processing capacity of MapReduce applied to DPI of 90Gb of JXTA-based application traffic, over variations of cluster size and block size. The processing capacity was evaluated by the throughput of network traffic processed, and by the relative throughput, defined as the processing capacity achieved per number of allocated nodes.
Figure 4.2 DPI Processing Capacity for 90Gb (throughput and throughput per node, in Mbps, versus number of nodes, for block sizes of 32MB, 64MB and 128MB)

The processing capacity achieved for DPI of 90Gb using a block size of 64MB was 159.89 Mbps with 2 worker nodes, increasing up to 869.43 Mbps with 29 worker nodes. For the same case, the relative processing capacity was 79.94 Mbps/node with 2 nodes and 29.98 Mbps/node with 29 nodes, showing a decrease of relative processing capacity with the growth of the MapReduce cluster size.
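The relative throughput follows directly from dividing throughput by cluster size; a quick Python check against the reported values:

```python
def relative_throughput(throughput_mbps, num_nodes):
    """Relative processing capacity: throughput per allocated worker node."""
    return throughput_mbps / num_nodes

# Reported DPI results for 90Gb with 64MB blocks:
assert abs(relative_throughput(159.89, 2) - 79.94) < 0.01   # Mbps/node, 2 nodes
assert abs(relative_throughput(869.43, 29) - 29.98) < 0.01  # Mbps/node, 29 nodes
```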


Although the processing capacity increased, the relative processing capacity, defined as the processing capacity per allocated node, decreased with every node addition. This behaviour indicates that MapReduce becomes less efficient as cluster size increases (Gunther, 2006), which highlights the importance of weighing the cost of node allocation against its benefits for completion time and processing capacity.
Figures 4.1 and 4.2 also show the difference in performance achieved with different block sizes, and its relation to cluster size. Blocks of 128MB achieved a higher throughput in clusters of up to 14 nodes, while blocks of 64MB performed better in clusters with more than 14 worker nodes.
Figures 4.3(a) and 4.3(b) show the behaviour of the MapReduce phases for DPI of 90Gb.
Figure 4.3 MapReduce Phases Behaviour for DPI of 90Gb: (a) Phases Time for DPI; (b) Phases Distribution for DPI (absolute time and percentage of completion time per phase, for each cluster size, with block sizes of 32MB, 64MB and 128MB)


A MapReduce execution can be divided into Map, Shuffle, Sort and Reduce phases, although Shuffle tasks can be executed before the conclusion of all Map tasks, so Map and Shuffle tasks can overlap. According to the Hadoop default configuration, the overlap between Map and Shuffle tasks starts after 5% of the Map tasks are concluded; from that point, Shuffle tasks run until the Map phase ends.
In Figures 4.3(a) and 4.3(b) we show the overlap between Map and Shuffle tasks as a specific MapReduce phase, represented as the "Map and Shuffle" phase. The time consumed by Setup and Cleanup tasks was also considered, for a better visualization of the division of execution time in Hadoop jobs.
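This timeline decomposition can be sketched as follows; the function and the timestamps are illustrative, not taken from the experiment logs.

```python
def phase_breakdown(map_start, map_end, shuffle_start, shuffle_end):
    """Split a job timeline into the phases plotted in Figure 4.3:
    pure Map, overlapped 'Map and Shuffle', and pure Shuffle.
    shuffle_start models Hadoop's slowstart behaviour: Reducers
    begin shuffling once 5% of Map tasks have finished."""
    return {
        "map": shuffle_start - map_start,
        "map_and_shuffle": map_end - shuffle_start,
        "shuffle": shuffle_end - map_end,
    }

# Toy timeline in seconds: Maps run 0..100, shuffling starts at 5
# (5% of Map work done) and drains until 110.
print(phase_breakdown(0, 100, 5, 110))
# {'map': 5, 'map_and_shuffle': 95, 'shuffle': 10}
```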
Figure 4.3(a) shows the cumulative time of each MapReduce phase in the total job time. For DPI, the Map time, comprising the Map and "Map and Shuffle" phases, consumes the major part of a job's execution time and is the phase that varies most with the number of nodes, but no significant time reduction is achieved with more than 21 nodes and block sizes of 64MB or 128MB.
The Shuffle time, counted here after all Map tasks are completed, presented low variation with node addition. The Sort and Reduce phases required relatively low execution times and do not appear in some bars of the graph. Setup and Cleanup tasks consumed an almost constant time, independently of cluster size or block size.
Figure 4.3(b) shows the percentage of each MapReduce phase in the total job completion time. We also considered an additional phase, called "others", which represents the time consumed by cluster management tasks, such as scheduling and task assignment. The behaviour of phase occupation is similar over all block sizes evaluated, with the exception that Map time does not decrease with node addition in clusters using a block size of 128MB and more than 21 nodes.
With cluster size variation, a relative reduction in Map time was observed, along with a relative increase in the time of the Shuffle, Setup and Cleanup phases. As observed in Figure 4.3(a), the Setup and Cleanup phases consume an almost constant absolute time, independently of cluster size and block size. Therefore, as node addition decreases the total completion time while the time consumed by Setup and Cleanup tasks remains almost the same, the percentage of time spent in Setup and Cleanup becomes more significant as the total job completion time is reduced by adding nodes to the MapReduce cluster.
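The growing share of these constant-time tasks is simple arithmetic; the 20 s overhead below is a hypothetical value, chosen only to illustrate the effect.

```python
def setup_cleanup_share(constant_overhead_s, completion_time_s):
    """Fraction of the job spent in Setup and Cleanup, assuming
    their absolute time stays constant as nodes are added."""
    return constant_overhead_s / completion_time_s

# Hypothetical 20 s of Setup+Cleanup: negligible in a long job,
# but a visible slice once node addition shrinks the job.
print(round(setup_cleanup_share(20, 800) * 100, 1))  # 2.5 (% of job)
print(round(setup_cleanup_share(20, 100) * 100, 1))  # 20.0 (% of job)
```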
According to Figures 4.3(a) and 4.3(b), the Map phase is predominant in MapReduce jobs for DPI, and the reduction of the total job completion time with node addition is related to the decrease of the Map phase time. Thus, improvements in the execution of the Map phase for DPI workloads can produce the most significant reductions in the total job completion time for DPI.
Figure 4.4 shows the comparison between the completion times of CountUpDriver and P3 for packet level analysis of 90Gb of network traffic, over variations of cluster size and block size.
Figure 4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the approach with and without splitting into packets (completion time in seconds versus number of nodes, for CountUpDriver and P3 with block sizes of 32MB, 64MB and 128MB)

P3 achieves better completion times than CountUpDriver over all factors, showing that a divisible-file approach performs better for packet level analysis, and that block size is a significant factor for both approaches, given the significant impact on completion time caused by adopting blocks of different sizes.
With variation in the number of nodes, it was observed that a block size of 128MB achieved better completion times up to 10 nodes, but that no further improvement was achieved with node addition in clusters with more than 10 nodes. Blocks of 32MB and 64MB only presented a significant difference in completion time in clusters of up to 14 nodes; in clusters bigger than 14 nodes a similar completion time was achieved for both block sizes, still better than the completion time achieved with blocks of 128MB.
Figures 4.5(a) and 4.5(b) show, respectively, the completion time and speed-up of P3 and CountUpDriver against CountUpMono, for packet level analysis, with variation in the number of nodes and block size. In both cases, the use of a block size of 128MB provides the best completion time in smaller clusters, up to 10 nodes, but a worse completion time in clusters with more than 21 nodes. In both evaluations, the speed-up adopting blocks of 128MB scales up to 10 nodes, but in bigger clusters no speed-up gain was achieved with node addition.
Figure 4.5 CountUp completion time and speed-up for 90Gb: (a) P3 evaluation; (b) CountUpDriver evaluation (completion time in seconds and speed-up versus number of nodes, for block sizes of 32MB, 64MB and 128MB)

Using blocks of 32MB, an improvement in completion time was achieved with every node addition, which produced a speed-up improvement for all cluster sizes, although this block size did not present a better completion time than the other block sizes in any case. The adoption of 32MB blocks provided better speed-up than the other block sizes in clusters with more than 14 nodes, because the time consumed by CountUpMono to process 90Gb divided into 32MB files was bigger than the time consumed in the cases with other block sizes, as shown in Table 4.3.


Figures 4.6(a) and 4.6(b) show the processing capacity of P3 and CountUpDriver for packet level analysis of 90Gb of network traffic, over variations of cluster size and block size.
Figure 4.6 CountUp processing capacity for 90Gb: (a) P3 processing capacity; (b) CountUpDriver processing capacity (throughput and throughput per node, in Mbps, versus number of nodes, for block sizes of 32MB, 64MB and 128MB)

Using a block size of 64MB, P3 achieved a throughput of 413.16 Mbps with 2 nodes and a maximum of 1606.13 Mbps with 28 nodes, while its relative throughput for the same configurations was 206.58 Mbps and 55.38 Mbps. The processing capacity for packet level analysis, evaluated for P3 and CountUpDriver, follows the same behaviour shown in Figure 4.2. Additionally, it is possible to observe a convergent decrease of the relative processing capacity for all block sizes evaluated, starting at a cluster size of 14 nodes, where the relative throughput achieved by all block sizes is quite similar.
Figure 4.6(b) shows an increase in relative processing capacity with the addition of 2 nodes to a cluster with 4 nodes. For packet level analysis of 90Gb, MapReduce achieved its best processing capacity efficiency per node using 6 nodes, which provides 24 Mappers and 5 Reducers per Map and Reduce wave. With the adopted variation in the number of Reducers according to cluster size, using 5 Reducers achieved better processing efficiency and a significant reduction in Reduce time, as shown in Figure 4.7(b).
Figures 4.7(a) and 4.7(b) show the cumulative time per phase during a job execution.
Figure 4.7 MapReduce Phases time of CountUp for 90Gb: (a) MapReduce Phases Times of P3; (b) MapReduce Phases Times for CountUpDriver (cumulative time per phase, in seconds, for each cluster size, with block sizes of 32MB, 64MB and 128MB)

The behaviour of the MapReduce phases for packet level analysis is similar to the behaviour observed for DPI: Map time is predominant, Map and Shuffle time stops decreasing with node addition once the cluster grows beyond a certain size, and the Sort and Reduce phases consume little execution time. The exception is that the Shuffle phase consumes more time in packet level analysis jobs than in DPI, especially in smaller clusters.


For packet level analysis, the amount of intermediate data generated by Map functions
is bigger than the amount generated by the use of MapReduce for DPI: packet level
analysis generates an intermediate record for each packet evaluated, while DPI must
evaluate more than one packet to generate an intermediate record. The Shuffle phase
is responsible for sorting and transferring the Map outputs to the Reducers as inputs;
thus the amount of intermediate data generated by Map tasks, and the network transfer
cost, impact the Shuffle phase time.
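To make this difference concrete, the sketch below mimics, in plain in-memory Java, what a packet-level port counter does: the Map step emits one (port, 1) pair per packet, so the intermediate data volume grows with the packet count. This is an illustration only, not the actual P3 or CountUpDriver code, which run as Hadoop jobs.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of a packet-level port counter: Map emits one
// (destination port, 1) pair per packet; Reduce sums the pairs per port.
public class PortCount {
    public static Map<Integer, Integer> count(int[] dstPorts) {
        Map<Integer, Integer> totals = new HashMap<>();
        for (int port : dstPorts) {
            // Map side: one intermediate pair (port, 1) per packet;
            // Reduce side: sum all values that share the same key.
            totals.merge(port, 1, Integer::sum);
        }
        return totals;
    }
}
```

A DPI job, by contrast, only emits a pair after reassembling a whole message from several packets, so it produces fewer intermediate records for the same input.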
Figures 4.8(a) and 4.8(b) show the percentage of each phase on job completion time
of P3 and CountUpDriver, respectively.
[Figure: stacked bars of the percentage of job completion time per MapReduce phase (Map, Map and Shuffle, Shuffle, Sort, Reduce, Setup, Cleanup, Others), over cluster sizes from 2 to 28 nodes, grouped by block sizes of 32MB, 64MB and 128MB.]
(a): Phases Distribution for P3
(b): Phases Distribution for CountUpDriver

Figure 4.8 MapReduce Phases Distribution for CountUp of 90Gb

Following the behaviour observed in Figure 4.3(b), in these cases the Map and
Shuffle phases consume more relative time than all other phases, over all factors. But for
packet level analysis, the Map phase occupation decreases significantly with node addition
only when block sizes are 32MB or 64MB, following the completion time behaviour
observed in Figures 4.5(a) and 4.5(b).
The same experiments were conducted for the dataset of 30Gb of network traffic, and
the results of the MapReduce phases evaluation presented a behaviour quite similar to
the 90Gb results, for both DPI and packet level analysis.
Relevant differences were identified for the speed-up, completion time and scalability
evaluation, as shown by Figures 4.9(a) and 4.9(b), which exhibit the completion time and
processing capacity scalability of MapReduce for DPI of 30Gb of network traffic, with
variation in cluster size and block size.
[Figure: (a) completion time in seconds and speed-up versus number of nodes (2 to 29), for block sizes of 32MB, 64MB and 128MB; (b) throughput and throughput per node, in Mbps, versus number of nodes, for the same block sizes.]
(a): DPI Completion Time and Speed-up of MapReduce for 30Gb of a JXTA-application network traffic
(b): DPI Processing Capacity of 30Gb

Figure 4.9 DPI Completion Time and Processing capacity for 30Gb


The completion time of DPI of 30Gb scales significantly up to 10 nodes; beyond that,
the experiment shows no further gains from node addition with a block size of 128MB,
and shows a slight increase in completion time in the cases using blocks of 32MB and 64MB.
This behaviour is the same as observed for the job completion time for 90Gb, shown in
Figures 4.5(a) and 4.5(b), but with significant scaling only up to 10 nodes for the dataset
of 30Gb, whereas the 90Gb dataset scaled up to 25 nodes.
Figure 4.9(a) shows that a completion time of 199.12 seconds was obtained with 2
nodes using blocks of 128MB, scaling down to 87.33 seconds with 10 nodes and the same
block size, while the same configuration achieved 474.44 and 147.12 seconds, respectively,
for DPI of 90Gb, as shown in Figure 4.1.
Although the 90Gb case (Figure 4.1) processed a dataset 3 times bigger than the
30Gb case (Figure 4.9(a)), the completion time achieved in all 90Gb cases was smaller
than 3 times the completion time for 30Gb. For the cases with 2 and 10 nodes using
blocks of 128MB, processing the 90Gb dataset consumed, respectively, only 2.38 and 1.68
times more time than the 30Gb dataset, despite being 3 times bigger.
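The quoted factors follow directly from the measured times; a quick check of the arithmetic (values taken from Figures 4.1 and 4.9(a)):

```java
// Ratio between the 90Gb and 30Gb completion times for the same
// configuration: a value below 3 means the 3x bigger dataset costs
// less than 3x the time, i.e. MapReduce is relatively more efficient
// on the bigger input.
public class ScalingRatios {
    public static double ratio(double time90Gb, double time30Gb) {
        return time90Gb / time30Gb;
    }
}
```

With 2 nodes, ratio(474.44, 199.12) is about 2.38; with 10 nodes, ratio(147.12, 87.33) is about 1.68, both well below 3.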
Figure 4.9(b) shows the processing capacity for DPI of 30Gb. The maximum speed-up
achieved for DPI of 30Gb was 7.90 times, using blocks of 32MB and 29 worker nodes,
while the maximum speed-up for DPI of 90Gb was 16.19 times with 28 nodes.
From these results, it is possible to conclude that MapReduce is more efficient for
bigger datasets, and that in some cases it can be more efficient to accumulate input data
and process a bigger amount at once. Therefore it is important to analyse the dataset size
to be processed, and to quantify the ideal number of nodes to allocate for each job, in
order to avoid wasting resources.
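One simple way to quantify such an allocation is to stop adding nodes once the marginal throughput gain per added node falls below a threshold. The helper below is hypothetical (not part of the dissertation's tooling) and assumes throughput measurements per cluster size, such as those in Figure 4.6, are available:

```java
// Hypothetical helper: given measured throughputs for increasing cluster
// sizes, return the smallest size after which each extra node adds less
// than 'minGainPerNode' Mbps of throughput.
public class NodeAllocator {
    public static int idealNodes(int[] nodes, double[] throughputMbps, double minGainPerNode) {
        for (int i = 1; i < nodes.length; i++) {
            double gainPerNode = (throughputMbps[i] - throughputMbps[i - 1])
                    / (nodes[i] - nodes[i - 1]);
            if (gainPerNode < minGainPerNode) {
                return nodes[i - 1]; // the extra nodes are not worth allocating
            }
        }
        return nodes[nodes.length - 1]; // every measured step still pays off
    }
}
```

For example, with measurements of 100, 300, 450 and 460 Mbps at 2, 6, 10 and 14 nodes, a threshold of 5 Mbps per node stops the allocation at 10 nodes, since the last step adds only 2.5 Mbps per node.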

4.4 Discussion

In this section, we discuss the measured results and evaluate their meaning, restrictions
and opportunities. We also discuss possible threats to the validity of our experiment.

4.4.1 Results Discussion

According to the processing capacity presented in our experimental results for packet
level analysis and DPI, the adoption of MapReduce for these workloads provided high
processing capacity and speed-up of completion time when compared with a solution
without distributed processing, making it possible to evaluate a large amount of network
traffic and extract information about the distributed applications of an evaluated data
center.
The block size adopted and the number of nodes allocated to data processing are
important factors for obtaining an efficient job completion time and processing capacity
scalability. Some benchmarks show that MapReduce performance can be improved by
an optimal block size choice (Jiang et al., 2010), showing better performance with the
adoption of bigger block sizes. We evaluated the impact of the block size for packet level
analysis and DPI workloads; blocks with 128MB provided a better completion time for
smaller clusters, but blocks with 64MB performed better in bigger clusters. Thus, in order
to obtain an optimal completion time when adopting bigger block sizes, it is also necessary
to evaluate the node allocation for the MapReduce job, because variations in block size
and cluster size can significantly impact the completion time.
The different processing capacities achieved for the 30Gb and 90Gb datasets highlight
the efficiency of MapReduce for dealing with bigger data processing, and that it can be
more efficient to accumulate input data and process a larger amount at once. Therefore,
it is important to analyse the dataset size to be processed, and to quantify the ideal
number of allocated nodes for each job, in order to avoid wasting resources.
The evaluation of the dataset size and the optimal number of nodes is important to
understand how to schedule MapReduce jobs and resource allocation through specific
Hadoop schedulers, such as the Capacity Scheduler and the Fair Scheduler (Zaharia et al.,
2010), in order to avoid wasting resources on the allocation of nodes that will not produce
significant gains (Verma et al., 2012a). Thus, the variation of processing capacity achieved
in our experiments highlights the importance of evaluating the cost of node allocation and
its benefits, and the need to evaluate the ideal size of pools in the Hadoop cluster, to
balance the cluster size allocated to process a given input size against the resource
sharing of the Hadoop cluster.
The MapReduce processing capacity does not scale proportionally with node addition;
in some cases there is no significant processing capacity increase with node addition,
as shown in Figure 4.1, where jobs using block sizes of 64MB and 128MB in clusters
with more than 14 nodes, for DPI of 90Gb, present no significant completion time gain
with node addition.
The number of execution waves is a factor that must be evaluated (Kavulya et al.,
2010) when MapReduce scalability is analysed, because the decrease in execution time is
related to the number of execution waves necessary to process all input data. The number
of execution waves is defined by the available slots for executing Map and Reduce
tasks; for example, if a MapReduce job is divided into 10 tasks in a cluster with 5 available
slots, then 2 execution waves are necessary for all tasks to be executed.
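The wave count in the example reduces to a ceiling division of tasks by slots; a minimal sketch (the class and method names are ours, for illustration only):

```java
// Number of execution waves a MapReduce job needs: each wave runs at most
// 'slots' tasks in parallel, so the job takes ceil(tasks / slots) waves.
public class WaveEstimator {
    public static int waves(int tasks, int slots) {
        if (tasks <= 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks and slots must be positive");
        }
        return (tasks + slots - 1) / slots; // integer ceiling division
    }
}
```

For the example in the text, waves(10, 5) yields 2; adding nodes only reduces completion time when it actually reduces this wave count.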
Figure 4.9(a) shows a case of DPI of 30Gb, using a block size of 128MB, in which there
was no reduction of completion time with cluster sizes bigger than 10 nodes, because
there was no reduction in the number of execution waves. But in our experiments, some
cases with a reduction of execution waves also presented no significant reduction of
completion time, such as the cases using a block size of 128MB in clusters with 21 nodes
or more, for DPI and packet level analysis, shown in Figure 4.1. Thus, node addition and
task distribution must be evaluated to optimize resource usage and to avoid additional or
unnecessary costs in machines and power consumption.
The comparison of completion time between CountUpDriver and P3 shows that P3,
which splits the data into packets, performs better than CountUpDriver, which processes
a whole block without splitting. When processing a whole block as input, the local node
parallelism is limited to the number of slots per node, while in the divisible approach
each split can be processed by an independent thread, increasing the possible parallelism.
Because some cases require data without splitting, such as DPI and video processing
cases (Pereira et al., 2010), improvements for this issue must be evaluated, considering
better schedulers, data location and task assignment.
The behavioural evaluation of MapReduce phases showed that the Map phase is
predominant in total execution time for packet level analysis and DPI, with Shuffle being
the second most expressive phase. Shuffle can overlap the Map phase, and this condition
must be considered in MapReduce evaluations, especially in our case, because the overlap
of Map and Shuffle represents more than 50% of the total execution time.
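This overlap can be accounted for by interval intersection, which is how a "Map and Shuffle" category can be separated from pure Shuffle time; a sketch under the assumption that per-phase start and end timestamps are available from the job logs (method names are ours):

```java
// Split a Shuffle interval into the part that overlaps the Map interval
// ("Map and Shuffle") and the part that runs after Map has finished.
public class PhaseOverlap {
    // Returns {overlapSeconds, shuffleOnlySeconds}.
    public static double[] split(double mapStart, double mapEnd,
                                 double shuffleStart, double shuffleEnd) {
        double overlap = Math.max(0.0,
                Math.min(mapEnd, shuffleEnd) - Math.max(mapStart, shuffleStart));
        return new double[]{overlap, (shuffleEnd - shuffleStart) - overlap};
    }
}
```

For example, a Map phase spanning 0-100s and a Shuffle phase spanning 40-120s yield 60s of "Map and Shuffle" and 20s of pure Shuffle.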
The long "Map and Shuffle" phase represents a long time of Shuffle tasks executing
in parallel with Map tasks, and a long time of slots allocated to Shuffle tasks that will
only conclude after all Map tasks are finished, even though these Shuffle tasks can take
longer than the time required to read and process the generated intermediate data. If
slots allocated to Shuffle tasks are only waiting for the Map phase to conclude, these
slots could be used to execute other tasks, which could accelerate the job completion
time.
With the increase of cluster size and the reduction of job completion time, it was
observed that the Map phase time showed a proportional decrease, while the Shuffle
phase time increased with the growth in the number of nodes. With more nodes, the
intermediate data generated by Map tasks is placed on more nodes, which are responsible
for shuffling the data and sending it to specific Reducers, increasing the amount of remote
I/O from Mappers to Reducers and the number of data sources for each Reducer. The
Shuffle phase may represent a bottleneck (Zhang et al., 2009) for scalability and could be
optimized, due to I/O restrictions (Lee et al., 2012; Akram et al., 2012) and data locality
issues in the Reduce phase (Hammoud and Sakr, 2011).
Information extracted from the analysed results about the performance obtained
with specific cluster, block and input sizes is important for configuring MapReduce
resource allocation and specialized schedulers, such as the Fair Scheduler (Zaharia et al.,
2008), which defines pool sizes and resource shares for MapReduce jobs. Thus, with
information about the performance achieved with specific resources, it is possible to
configure MapReduce parameters to balance the resource allocation against the expected
completion time or resource sharing (Zaharia et al., 2008, 2010).

4.4.2 Possible threats to validity

In this chapter, for packet level analysis, we evaluated a port counter implemented with
P3. We used a version of this implementation published on the Lee et al. (2011) website2,
obtained in February 2012, when a complete binary version was available; this binary
version is currently not available.
Part of the P3 source code was published later, but not all the code necessary to compile
the binary libraries needed to evaluate the P3 implementation of a port counter. Thereby,
it is important to highlight that the results obtained through our evaluation apply to the
P3 version obtained in February 2012 from the Lee et al. (2011) website.
It is also important to highlight that DPI can face restrictions when evaluating
encrypted messages, and that the obtained results are specific to the input datasets,
factors, levels and experiment setup used in our evaluation.

4.5 Chapter Summary

In this chapter, we evaluated the performance of MapReduce for packet level analysis
and DPI of application traffic. We evaluated how the data input, block and cluster sizes
impact the MapReduce phases, job completion time, processing capacity scalability, and
the speed-up achieved in comparison with the same algorithm executed by a
non-distributed implementation.

2 https://sites.google.com/a/networks.cnu.ac.kr/yhlee/p3
The results show that MapReduce presents high processing capacity for dealing with
massive application traffic analysis. The behaviour of MapReduce phases over variations
of block size and cluster size was evaluated; we verified that packet level analysis and
DPI are Map-intensive jobs, and that the Map phase consumes more than 70% of execution
time, with the Shuffle phase being the second predominant phase.
We showed that input size, block size and cluster size are important factors to be
considered to achieve better job completion time and to explore MapReduce scalability
and efficient resource allocation, due to the variation in completion time caused by the
block size adopted and, in some cases, due to the processing capacity not increasing
with node addition to the cluster.
We also showed that using a whole block as input for Map functions achieved poorer
performance than using divisible data; thereby, more evaluation is necessary to
understand how this case can be handled and improved.


Conclusion and Future Work


The softest things in the world overcome the hardest things in the world.
LAO TZU

Distributed systems have been adopted for building modern Internet services and cloud
computing infrastructure. Detecting error causes, and diagnosing and reproducing errors
in distributed systems, are challenges that motivate efforts to develop less intrusive
mechanisms for monitoring and debugging distributed applications at runtime.
Network traffic analysis is one option for distributed systems measurement, although
there are limitations on the capacity to process large amounts of network traffic in a
short time, and on the scalability to process network traffic when resource demand varies.
In this dissertation, we proposed an approach to perform deep inspection of distributed
applications' network traffic, in order to evaluate distributed systems at a data center
through network traffic analysis, using commodity hardware and cloud computing
services, in a minimally intrusive way. Thus, we developed an approach based on MapReduce
to evaluate the behavior of a JXTA-based distributed system through DPI.
We evaluated the effectiveness of MapReduce to implement a DPI algorithm, and
its completion time scalability to measure a JXTA-based application, using virtual
machines of a cloud computing provider. We also deeply evaluated the performance of
MapReduce for packet-level analysis and DPI, characterizing the behavior followed by
the MapReduce phases, its processing capacity scalability and speed-up, over variations of
input size, block size and cluster size.

5.1 Conclusion

With our proposed approach, it is possible to measure the network traffic behavior of
distributed applications that generate intensive network traffic, through the offline
evaluation of information from the production environment of a distributed system,
making it possible to use the information from the evaluated indicators to diagnose
problems and analyse the performance of distributed systems.
We showed that the MapReduce programming model can express algorithms for DPI,
such as Algorithm 1, implemented to extract application indicators from the network
traffic of a JXTA-based distributed application. We analysed the completion time
scalability achieved for different numbers of nodes in a Hadoop cluster composed of
virtual machines, with different sizes of network traffic used as input. We showed the
processing capacity and the completion time scalability achieved, and also the influence
of the number of nodes and the data input size on the processing capacity for DPI using
virtual machines of Amazon EC2, for a selected scenario.
We evaluated the performance of MapReduce for packet level analysis and DPI of
application traffic, using commodity hardware, and showed how data input size, block
size and cluster size cause relevant impacts on the MapReduce phases, job completion
time, processing capacity scalability, and the speed-up achieved in comparison with the
same execution by a non-distributed implementation.
The results showed that although MapReduce presents good processing capacity
using cloud services or commodity computers for dealing with massive application
traffic analysis, it is necessary to evaluate the behaviour of MapReduce for processing
specific data types, in order to understand its relation with the available resources and
the configuration of MapReduce parameters, and to obtain optimal performance for
specific environments.
We showed that MapReduce processing capacity scalability is not proportional to the
number of allocated nodes, and that the relative processing capacity decreases with node
addition. We showed that input size, block size and cluster size are important factors to
consider to achieve better job completion time and to explore MapReduce scalability,
due to the observed variation in completion time caused by the different block sizes adopted.
Also, in some cases, the processing capacity does not scale with node addition to
the cluster, which highlights the importance of allocating resources according to the
workload and input data, in order to avoid wasting resources.


We verified that packet level analysis and DPI are Map-intensive jobs, since the Map
phase consumes more than 70% of the total job completion time, and the Shuffle phase is
the second predominant phase. We also showed that using a whole block as input for Map
functions achieved a poorer completion time than the approach that splits the
block into records.

5.2 Contributions

We attempted to address the processing capacity problem of measuring distributed
systems through network traffic analysis; the results of the work presented in this
dissertation provide the contributions below:
1. We proposed an approach to implement DPI algorithms through MapReduce,
using whole blocks as input for Map functions. We showed the effectiveness of
MapReduce for a DPI algorithm that extracts indicators from a distributed
application's traffic, and the completion time scalability of MapReduce
for DPI, using virtual machines of a cloud provider;
2. We developed JNetPCAP-JXTA (Vieira, 2012b), an open source parser to extract
JXTA messages from network traffic traces;
3. We developed Hadoop-Analyzer (Vieira, 2013), an open source tool to extract
indicators from Hadoop logs and generate graphs of specified metrics;
4. We characterized the behavior followed by the MapReduce phases for packet
level analysis and DPI, showing that this kind of job is Map-intensive
and highlighting points that can be improved;
5. We described the processing capacity scalability of MapReduce for packet
level analysis and DPI, evaluating the impact caused by variations in input
size, cluster size and block size;
6. We showed the speed-up obtained with MapReduce for DPI, with variations in
input size, cluster size and block size;
7. We published two papers reporting our results, as follows:
(a) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Evaluating
Performance of Distributed Systems with MapReduce and Network Traffic
Analysis. In ICSEA 2012, The Seventh International Conference on Software
Engineering Advances. Xpert Publishing Services.
(b) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Measuring
Distributed Applications Through MapReduce and Traffic Analysis. In Parallel
and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference
on, pages 704-705.

5.2.1 Lessons Learned

The contributions cited are of scientific and academic scope, with implementations and
evaluations little explored in the literature. However, with the development of this work,
some important lessons were learned.
During this research, different approaches for evaluating distributed systems of cloud
computing providers were studied. In this period, we could see the importance of
performance evaluation in a cloud computing environment, and the recent efforts to
diagnose and evaluate systems in the production environment of a data center. Also, the
growth of the Internet and of resource utilization makes it necessary to have solutions
able to evaluate large amounts of data in a short time, with low performance degradation
of the evaluated system.
MapReduce has grown into a general purpose solution for big data processing, but it
is not a solution for all kinds of problems, and its performance depends on several
parameters. Some research has been done to improve MapReduce performance through
analytical modelling, simulation and measurement, but the most relevant contributions
in this direction were guided by realistic workload evaluations from large MapReduce
clusters.
We learned that, despite the facilities provided by MapReduce for distributed
processing, its performance is influenced by the environment, network topology, workload,
data type, and several specific parameter configurations. Therefore, an evaluation of
MapReduce behavior using data from a realistic environment will provide more accurate
and broader results, while in controlled experiments the results are more restricted and
limited to the evaluated metrics and factors.


5.3 Future Work

Because of the time constraints of a master's degree, this dissertation addresses
some problems, but others remain open or are emerging from the current results. Thus,
the following issues should be investigated as future work:
Evaluation of all components of the proposed approach. This dissertation evaluated
JNetPCAP-JXTA, the AppAnalyzer and its implementation to evaluate a JXTA-based
distributed application; it is still necessary to evaluate the SnifferServer, the Manager
and the whole system working together, analysing their impact on the measured
distributed system and the scalability achieved;
Development of a technique for the efficient evaluation of distributed systems
through information extracted from network traffic. This dissertation addressed the
problem of processing capacity for measuring distributed systems through network
traffic analysis, but an efficient approach is still needed to diagnose problems of
distributed systems, using information about flows, connections, throughput and
response time obtained from network traffic analysis;
Development of an analytic model and simulations, using the information about
MapReduce behavior for network traffic analysis measured in this dissertation, to
reproduce its characteristics and enable the evaluation and prediction of some cases of
MapReduce for network traffic analysis.


Bibliography

Aguilera, M. K., Mogul, J. C., Wiener, J. L., Reynolds, P., and Muthitacharoen, A. (2003).
Performance debugging for distributed systems of black boxes. SIGOPS Oper. Syst.
Rev., 37(5).
Akram, S., Marazakis, M., and Bilas, A. (2012). Understanding scalability and performance requirements of I/O-intensive applications on future multicore servers. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Antonello, R., Fernandes, S., Kamienski, C., Sadok, D., Kelner, J., Godor, I., Szabo, G., and Westholm, T. (2012). Deep packet inspection tools and techniques in commodity platforms: Challenges and trends. Journal of Network and Computer Applications.
Antoniu, G., Hatcher, P., Jan, M., and Noblet, D. (2005). Performance evaluation of JXTA communication layers. In Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE International Symposium on, volume 1, pages 251-258.
Antoniu, G., Cudennec, L., Jan, M., and Duigou, M. (2007). Performance scalability of the JXTA P2P framework. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1-10.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. (2010). A view of cloud computing. Commun. ACM, 53, 50-58.
Basili, V. R., Caldiera, G., and Rombach, H. D. (1994). The goal question metric approach. In Encyclopedia of Software Engineering. Wiley.
Bhatotia, P., Wieder, A., Akkus, I. E., Rodrigues, R., and Acar, U. A. (2011). Large-scale incremental data processing with change propagation. In Proceedings of the 3rd USENIX conference on Hot topics in cloud computing, HotCloud'11, pages 18-18, Berkeley, CA, USA. USENIX Association.
Callado, A., Kamienski, C., Szabo, G., Gero, B., Kelner, J., Fernandes, S., and Sadok, D. (2009). A survey on internet traffic identification. Communications Surveys & Tutorials, IEEE, 11(3), 37-52.


Chen, Y., Ganapathi, A., Griffith, R., and Katz, R. (2011). The case for evaluating MapReduce performance using workload suites. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International Symposium on.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. (2010). MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, pages 21-21.
Cox, L. P., Murray, C. D., and Noble, B. D. (2002). Pastiche: making backup cheap and easy. SIGOPS Oper. Syst. Rev., 36, 285-298.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM, 51, 107-113.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41, 205-220.
Duigou, M. (2003). JXTA v2.0 protocols specification. Technical report, IETF Internet Draft.
Eddelbuettel, D. (2012). R in action. Journal of Statistical Software, Book Reviews, 46(2), 1-2.
Fernandes, S., Antonello, R., Lacerda, T., Santos, A., Sadok, D., and Westholm, T. (2009). Slimming down deep packet inspection systems. In INFOCOM Workshops 2009, IEEE.
Fonseca, A., Silva, M., Soares, P., Soares-Neto, F., Garcia, V., and Assad, R. (2012). Uma proposta arquitetural para serviços escaláveis de dados em nuvens. In Proceedings of the VIII Workshop de Redes Dinâmicas e Sistemas P2P.
Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., and Stoica, I. (2009). Above the clouds: A Berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28.
Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. SIGOPS Oper. Syst. Rev.


Groot, S. (2012). Modeling I/O interference in data intensive map-reduce applications. In Applications and the Internet (SAINT), 2012 IEEE/IPSJ 12th International Symposium on.
Gunther, N. (2006). Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Springer.
Guo, Z., Fox, G., and Zhou, M. (2012). Investigation of data locality and fairness in MapReduce. In Proceedings of third international workshop on MapReduce and its Applications Date, MapReduce '12.
Gupta, D., Vishwanath, K. V., McNett, M., Vahdat, A., Yocum, K., Snoeren, A., and Voelker, G. M. (2011). DieCast: Testing distributed systems with an accurate scale model. ACM Trans. Comput. Syst., 29, 4:1-4:48.
Halepovic, E. (2004). Performance evaluation and benchmarking of the JXTA peer-to-peer platform. Ph.D. thesis, University of Saskatchewan.
Halepovic, E. and Deters, R. (2003). The costs of using JXTA. In Peer-to-Peer Computing, 2003. (P2P 2003). Proceedings. Third International Conference on, pages 160-167.
Halepovic, E. and Deters, R. (2005). The JXTA performance model and evaluation. Future Gener. Comput. Syst., 21, 377-390.
Halepovic, E., Deters, R., and Traversat, B. (2005). JXTA messaging: Analysis of feature-performance tradeoffs and implications for system design. In R. Meersman and Z. Tari, editors, On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, volume 3761, pages 1097-1114. Springer Berlin / Heidelberg.
Hammoud, M. and Sakr, M. (2011). Locality-aware reduce task scheduling for MapReduce. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, pages 570-576.
Jacobson, V., Leres, C., and McCanne, S. (1994). libpcap. http://www.tcpdump.org/.
Jain, R. (1991). The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling. Wiley professional computing. Wiley.
Janert, P. K. (2010). Gnuplot in action: understanding data with graphs. Manning, Greenwich, CT.


Jiang, D., Ooi, B. C., Shi, L., and Wu, S. (2010). The performance of MapReduce: an in-depth study. Proc. VLDB Endow.
Kambatla, K., Pathak, A., and Pucha, H. (2009). Towards optimizing Hadoop provisioning in the cloud. In Proc. of the First Workshop on Hot Topics in Cloud Computing.
Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. (2009). The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, IMC '09, pages 202-208, New York, NY, USA. ACM.
Kavulya, S., Tan, J., Gandhi, R., and Narasimhan, P. (2010). An analysis of traces from a production MapReduce cluster. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on.
Lämmel, R. (2007). Google's MapReduce programming model - revisited. Sci. Comput. Program., 68(3), 208-237.
Lee, G. (2012). Resource Allocation and Scheduling in Heterogeneous Cloud Environments. Ph.D. thesis, University of California, Berkeley.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., and Moon, B. (2012). Parallel data processing with MapReduce: a survey. SIGMOD Rec.
Lee, Y., Kang, W., and Son, H. (2010). An internet traffic analysis method with MapReduce. In Network Operations and Management Symposium Workshops (NOMS Wksps), 2010 IEEE/IFIP, pages 357-361.
Lee, Y., Kang, W., and Lee, Y. (2011). A Hadoop-based packet trace processing tool. In Proceedings of the Third international conference on Traffic monitoring and analysis, TMA'11.
Lin, H., Ma, X., Archuleta, J., Feng, W., Gardner, M., and Zhang, Z. (2010). MOON: MapReduce on opportunistic environments. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 95-106. ACM.
Lin, J. (2012). MapReduce is good enough? If all you have is a hammer, throw away everything that's not a nail! Big Data.


Loiseau, P., Goncalves, P., Guillier, R., Imbert, M., Kodama, Y., and Primet, P.-B. (2009).
Metroux: A high performance system for analysing ow at very ne-grain. In
Testbeds and Research Infrastructures for the Development of Networks Communities
and Workshops, 2009. TridentCom 2009. 5th International Conference on, pages 1 9.
Lu, P., Lee, Y. C., Wang, C., Zhou, B. B., Chen, J., and Zomaya, A. Y. (2012). Workload
characteristic oriented scheduler for MapReduce. In Parallel and Distributed Systems
(ICPADS), 2012 IEEE 18th International Conference on, pages 156–163.
Massie, M. L., Chun, B. N., and Culler, D. E. (2004). The Ganglia distributed monitoring
system: design, implementation, and experience. Parallel Computing, 30(7), 817–840.
Mi, H., Wang, H., Yin, G., Cai, H., Zhou, Q., and Sun, T. (2012). Performance problems
diagnosis in cloud computing systems by mining request trace logs. In Network
Operations and Management Symposium (NOMS), 2012 IEEE.
Nagaraj, K., Killian, C., and Neville, J. (2012). Structured comparative analysis of
systems logs to diagnose performance problems. In Proceedings of the 9th USENIX
conference on Networked Systems Design and Implementation, NSDI '12.
Oliner, A., Ganapathi, A., and Xu, W. (2012). Advances and challenges in log analysis.
Commun. ACM, 55(2), 55–61.
Paul, D. (2010). JXTA-Sim2: A Simulator for the core JXTA protocols. Master's thesis,
University of Dublin, Ireland.
Pereira, R., Azambuja, M., Breitman, K., and Endler, M. (2010). An architecture for
distributed high performance video processing in the cloud. In Cloud Computing
(CLOUD), 2010 IEEE 3rd International Conference on.
Piyachon, P. and Luo, Y. (2006). Efficient memory utilization on network processors
for deep packet inspection. In Proceedings of the 2006 ACM/IEEE Symposium on
Architecture for Networking and Communications Systems, pages 71–80. ACM.
Risso, F., Baldi, M., Morandi, O., Baldini, A., and Monclus, P. (2008). Lightweight,
payload-based traffic classification: An experimental evaluation. In Communications,
2008. ICC '08. IEEE International Conference on, pages 5869–5875. IEEE.


Rumen (2012). Rumen, a tool to extract job characterization data from job tracker
logs. http://hadoop.apache.org/docs/MapReduce/r0.22.0/rumen.html. [Accessed
December 2012].
Sambasivan, R. R., Zheng, A. X., De Rosa, M., Krevat, E., Whitman, S., Stroucken, M.,
Wang, W., Xu, L., and Ganger, G. R. (2011). Diagnosing performance changes by
comparing request flows. In Proceedings of the 8th USENIX conference on Networked
systems design and implementation, NSDI '11.
Shafer, J., Rixner, S., and Cox, A. (2010). The Hadoop distributed filesystem: Balancing
portability and performance. In Performance Analysis of Systems & Software (ISPASS),
2010 IEEE International Symposium on.
Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D.,
Jaspan, S., and Shanbhag, C. (2010). Dapper, a large-scale distributed systems tracing
infrastructure. Technical report, Google, Inc.
Tan, J., Meng, X., and Zhang, L. (2012). Coupling scheduler for MapReduce/hadoop. In
Proceedings of the 21st international symposium on High-Performance Parallel and
Distributed Computing, HPDC '12.
Verma, A., Cherkasova, L., Kumar, V., and Campbell, R. (2012a). Deadline-based
workload management for MapReduce environments: Pieces of the performance
puzzle. In Network Operations and Management Symposium (NOMS), 2012 IEEE.
Verma, A., Cherkasova, L., and Campbell, R. (2012b). Two sides of a coin: Optimizing
the schedule of MapReduce jobs to minimize their makespan and improve cluster
performance. In Modeling, Analysis & Simulation of Computer and Telecommunication
Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Vieira, T. (2012a). hadoop-dpi. http://github.com/tpbvieira/hadoop-dpi.
Vieira, T. (2012b). jnetpcap-jxta. http://github.com/tpbvieira/jnetpcap-jxta.
Vieira, T. (2013). hadoop-analyzer. http://github.com/tpbvieira/hadoop-analyzer.
Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012a). Evaluating
performance of distributed systems with MapReduce and network trafc analysis. In
ICSEA 2012, The Seventh International Conference on Software Engineering Advances.
Xpert Publishing Services.


Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012b). Measuring
distributed applications through MapReduce and trafc analysis. In Parallel and
Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pages
704–705.
Wang, G., Butt, A., Pandey, P., and Gupta, K. (2009). A simulation approach to evaluating
design decisions in MapReduce setups. In Modeling, Analysis & Simulation of Computer
and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium
on.
Yu, M., Greenberg, A., Maltz, D., Rexford, J., Yuan, L., Kandula, S., and Kim, C. (2011).
Profiling network performance for multi-tier data center applications. In Proceedings
of the 8th USENIX conference on Networked systems design and implementation,
NSDI '11.
Yuan, D., Zheng, J., Park, S., Zhou, Y., and Savage, S. (2011). Improving software
diagnosability via log enhancement. In ACM SIGARCH Computer Architecture News,
volume 39, pages 3–14. ACM.
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. (2008). Improving
MapReduce performance in heterogeneous environments. In Proceedings of the 8th
USENIX conference on Operating systems design and implementation, OSDI '08.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I.
(2010). Delay scheduling: a simple technique for achieving locality and fairness
in cluster scheduling. In Proceedings of the 5th European conference on Computer
systems, EuroSys '10.
Zhang, S., Han, J., Liu, Z., Wang, K., and Feng, S. (2009). Accelerating MapReduce
with distributed memory cache. In Parallel and Distributed Systems (ICPADS), 2009
15th International Conference on.
Zheng, Z., Yu, L., Lan, Z., and Jones, T. (2012). 3-dimensional root cause diagnosis
via co-analysis. In Proceedings of the 9th international conference on Autonomic
computing, pages 181–190. ACM.
