Graduate Program in Computer Science

"An Approach for Profiling Distributed
Applications Through Network Traffic Analysis"

By
THIAGO PEREIRA DE BRITO VIEIRA

Master's Dissertation

Universidade Federal de Pernambuco
posgraduacao@cin.ufpe.br
www.cin.ufpe.br/~posgraduacao

RECIFE, MARCH 2013

UNIVERSIDADE FEDERAL DE PERNAMBUCO
CENTRO DE INFORMÁTICA
GRADUATE PROGRAM IN COMPUTER SCIENCE

THIAGO PEREIRA DE BRITO VIEIRA

"AN APPROACH FOR PROFILING DISTRIBUTED
APPLICATIONS THROUGH NETWORK TRAFFIC
ANALYSIS"

THIS WORK WAS PRESENTED TO THE GRADUATE PROGRAM IN
COMPUTER SCIENCE OF THE CENTRO DE INFORMÁTICA OF THE
UNIVERSIDADE FEDERAL DE PERNAMBUCO AS A PARTIAL
REQUIREMENT FOR OBTAINING THE DEGREE OF MASTER IN
COMPUTER SCIENCE.

ADVISOR: Vinicius Cardoso Garcia
CO-ADVISOR: Stenio Flavio de Lacerda Fernandes

RECIFE, MARCH 2013

Cataloging-in-publication data
Librarian Jane Souto Maior, CRB4-571

Vieira, Thiago Pereira de Brito
      An approach for profiling distributed applications through network
traffic analysis / Thiago Pereira de Brito Vieira. - Recife: The Author, 2013.
      xv, 71 pages: fig., tab.
      Advisor: Vinicius Cardoso Garcia.
      Dissertation (Master's) - Universidade Federal de Pernambuco. CIn,
Ciência da Computação, 2013.
      Includes bibliography.
      1. Computer science. 2. Distributed systems. I. Garcia, Vinicius
Cardoso (advisor). II. Title.

      004          CDD (23. ed.)          MEI2013 – 054

Master's dissertation presented by Thiago Pereira de Brito Vieira to the Graduate Program
in Computer Science of the Centro de Informática of the Universidade Federal de
Pernambuco, under the title "An Approach for Profiling Distributed Applications
Through Network Traffic Analysis", advised by Prof. Vinicius Cardoso Garcia and
approved by the Examining Board formed by the following professors:

______________________________________________
Prof. José Augusto Suruagy Monteiro
Centro de Informática / UFPE
______________________________________________
Prof. Denio Mariz Timoteo de Souza
Instituto Federal da Paraíba
_______________________________________________
Prof. Vinicius Cardoso Garcia
Centro de Informática / UFPE

Reviewed and approved for printing.
Recife, March 5, 2013
___________________________________________________
Prof. Edna Natividade da Silva Barros
Coordinator of the Graduate Program in Computer Science of the
Centro de Informática of the Universidade Federal de Pernambuco.

I dedicate this dissertation to my parents, for teaching me
to always study and work in order to grow as a person and
as a professional.

Acknowledgments

First, I would like to thank God for life, health, and all the opportunities created in my life.

I thank my parents, João and Ana, for all the love, affection, and encouragement to always pursue personal and professional growth, and for always supporting my decisions and showing themselves concerned and committed to helping me reach my goals.

I thank Alynne, my future wife, for all the love and patience throughout our relationship, especially during these two intense years of the master's program, in which her words of support in difficult moments and her lightheartedness were essential to give me more energy and the will to keep going with ever more dedication.

I thank the Agência Nacional de Telecomunicações (Anatel) for allowing and providing one more learning experience in my life. I would especially like to thank Rodrigo Barbosa, Túlio Barbosa, and Jane Teixeira for understanding and supporting me in this challenge of pursuing a master's degree. I thank Marcio Formiga for the support before and during the master's program, and for understanding the effort needed to overcome this challenge. I thank Wesley Paesano, Marcelo de Oliveira, Regis Novais, and Danilo Balby for the help and support that allowed me to dedicate myself to the master's program during these two years. I also thank my friends at Anatel who, directly or indirectly, helped me face this challenge, especially Ricardo de Holanda, Rodrigo Curi, Esdras Hoche, Francisco Paulo, Cláudio Moonen, Otávio Barbosa, Hélio Silva, Bruno Preto, Luide Liude, and Alexandre Augusto.

I thank all those who advised me and taught me something during this master's program, especially Vinicius Garcia for the welcome, support, guidance, demands, and all the important lessons during these months. I thank Stenio Fernandes for all the lessons and guidance at important moments of my research. I thank Rodrigo Assad for the work done together with usto.re and for the guidance that steered the development of my research. I thank Marcelo D'Amorim for the initial welcome and for the work we did together, which was of great value for my entry into scientific research and for my development as a researcher.

I thank José Augusto Suruagy and Denio Mariz for accepting to be part of the examining board of my dissertation defense and for their valuable criticism and contributions to my work.

I thank all the friends I made during this master's period, who helped make these days dedicated to the master's program very pleasant. I would like to thank Paulo Fernando, Lenin Abadie, Marco Machado, Dhiego Abrantes, Rodolfo Arruda, Francisco Soares, Sabrina Souto, Adriano Tito, Hélio Rodrigues, Jamilson Batista, Bruno Felipe, and the other people I had the pleasure of meeting during this period.

I also thank all my old friends from João Pessoa, Geisel, UFPB, and CEFET-PB, who gave me so much support and encouragement to develop this work.

Finally, I thank all those who collaborated directly or indirectly in the completion of this work.

Thank you very much!!!


Wherever you go, go with all your heart.
—CONFUCIUS

Resumo

Distributed systems have been used to build modern Internet services and cloud computing
infrastructures, in order to obtain services with high performance, scalability, and
reliability. The service level agreements adopted in cloud computing require a short time
to identify, diagnose, and solve problems in the infrastructure, so as to prevent problems
from having negative impacts on the quality of the services provided to clients. Thus, the
detection of error causes and the diagnosis and reproduction of errors of distributed
systems are challenges that motivate efforts towards the development of less intrusive and
more efficient mechanisms for monitoring and debugging distributed applications at
runtime.

Network traffic analysis is one option for measuring distributed systems, although there
are limitations on the capacity to process large amounts of network traffic in a short time,
and on the scalability to process network traffic under varying resource demands.

The goal of this dissertation is to analyse the processing capacity problem of measuring
distributed systems through network traffic analysis, in order to evaluate the performance
of the distributed systems of a data center, using commodity hardware and cloud
computing services, in a minimally intrusive way.

We propose a new approach based on MapReduce to deeply inspect the network traffic of
distributed applications, with the goal of evaluating the performance of distributed
systems at runtime, using commodity hardware. In this dissertation we evaluated the
effectiveness of MapReduce for a deep packet inspection algorithm, its processing
capacity, the gain in job completion time, the scalability of its processing capacity, and
the behavior followed by the MapReduce phases when applied to deep packet inspection
to extract indicators of distributed applications.

Keywords: Distributed Application Measurement, Debugging, MapReduce, Network
Traffic Analysis, Packet Level Analysis, Deep Packet Inspection


Abstract

Distributed systems have been adopted for building modern Internet services and cloud
computing infrastructures, in order to obtain services with high performance, scalability,
and reliability. Cloud computing SLAs require a short time to identify, diagnose, and solve
problems in a cloud computing production infrastructure, in order to avoid negative
impacts on the quality of service provided to its clients. Thus, the detection of error
causes, and the diagnosis and reproduction of errors, are challenges that motivate efforts
towards the development of less intrusive mechanisms for monitoring and debugging
distributed applications at runtime.

Network traffic analysis is one option for distributed systems measurement, although
there are limitations on the capacity to process large amounts of network traffic in a short
time, and on the scalability to process network traffic where there is variation of resource
demand.

The goal of this dissertation is to analyse the processing capacity problem of measuring
distributed systems through network traffic analysis, in order to evaluate the performance
of distributed systems at a data center, using commodity hardware and cloud computing
services, in a minimally intrusive way.

We propose a new approach based on MapReduce for deep inspection of distributed
application traffic, in order to evaluate the performance of distributed systems at runtime,
using commodity hardware. In this dissertation we evaluated the effectiveness of
MapReduce for a deep packet inspection algorithm, its processing capacity, completion
time speed-up, processing capacity scalability, and the behavior followed by the
MapReduce phases when applied to deep packet inspection for extracting indicators of
distributed applications.

Keywords: Distributed Application Measurement, Profiling, MapReduce, Network
Traffic Analysis, Packet Level Analysis, Deep Packet Inspection


Contents

List of Figures    xiii
List of Tables    xiv
List of Acronyms    xv

1 Introduction    1
    1.1 Motivation    1
    1.2 Problem Statement    4
    1.3 Contributions    5
    1.4 Dissertation Organization    6

2 Background and Related Work    7
    2.1 Background    7
        2.1.1 Network Traffic Analysis    7
        2.1.2 JXTA    9
        2.1.3 MapReduce    10
    2.2 Related Work    13
        2.2.1 Distributed Debugging    13
        2.2.2 MapReduce for Network Traffic Analysis    14
    2.3 Chapter Summary    15

3 Profiling Distributed Applications Through Deep Packet Inspection    17
    3.1 Motivation    18
    3.2 Architecture    20
    3.3 Evaluation    28
        3.3.1 Evaluation Methodology    28
        3.3.2 Experiment Setup    30
    3.4 Results    31
    3.5 Discussion    34
        3.5.1 Results Discussion    34
        3.5.2 Possible Threats to Validity    35
    3.6 Chapter Summary    36

4 Evaluating MapReduce for Network Traffic Analysis    37
    4.1 Motivation    38
    4.2 Evaluation    39
        4.2.1 Evaluation Methodology    39
        4.2.2 Experiment Setup    41
    4.3 Results    42
    4.4 Discussion    53
        4.4.1 Results Discussion    53
        4.4.2 Possible threats to validity    56
    4.5 Chapter Summary    56

5 Conclusion and Future Work    58
    5.1 Conclusion    59
    5.2 Contributions    60
        5.2.1 Lessons Learned    61
    5.3 Future Work    62

Bibliography    63

List of Figures

2.1 Differences between packet level analysis and deep packet inspection    8
2.2 MapReduce input dataset splitting into blocks and into records    10

3.1 Architecture of the SnifferServer to capture and store network traffic    21
3.2 Architecture for network traffic analysis using MapReduce    23
3.3 JXTA Socket trace analysis    31
3.4 Completion time scalability of MapReduce for DPI    32
    (a) Scalability to process 16 GB    32
    (b) Scalability to process 34 GB    32

4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application network traffic    43
4.2 DPI Processing Capacity for 90Gb    44
4.3 MapReduce Phases Behaviour for DPI of 90Gb    45
    (a) Phases Time for DPI    45
    (b) Phases Distribution for DPI    45
4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the approach with and without splitting into packets    47
4.5 CountUp completion time and speed-up of 90Gb    48
    (a) P3 evaluation    48
    (b) CountUpDriver evaluation    48
4.6 CountUp processing capacity for 90Gb    49
    (a) P3 processing capacity    49
    (b) CountUpDriver processing capacity    49
4.7 MapReduce Phases time of CountUp for 90Gb    50
    (a) MapReduce Phases Times of P3    50
    (b) MapReduce Phases Times for CountUpDriver    50
4.8 MapReduce Phases Distribution for CountUp of 90Gb    51
    (a) Phases Distribution for P3    51
    (b) Phases Distribution for CountUpDriver    51
4.9 MapReduce Phases Distribution for CountUp of 90Gb    52
    (a) DPI Completion Time and Speed-up of MapReduce for 30Gb of a JXTA-application network traffic    52
    (b) DPI Processing Capacity of 30Gb    52

List of Tables

3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI of a JXTA-based network traffic    28
3.2 Factors and levels to evaluate the defined metrics    29
3.3 Hypotheses to evaluate the defined metrics    29
3.4 Hypothesis notation    29
3.5 Completion time to process 16 GB split into 35 files    33
3.6 Completion time to process 34 GB split into 79 files    33

4.1 Metrics for evaluating MapReduce for DPI and packet level analysis    40
4.2 Factors and Levels    40
4.3 Non-Distributed Execution Time in seconds    43

List of Acronyms

DPI Deep Packet Inspection
EC2 Elastic Compute Cloud
GQM Goal Question Metric
HDFS Hadoop Distributed File System
IP Internet Protocol
I/O Input/Output
JVM Java Virtual Machine
MBFS Message Based Per Flow State
MBPS Message Based Per Protocol State
PBFS Packet Based Per Flow State
PBNS Packet Based No State
PCAP Packet Capture
PDU Protocol Data Unit
POSIX Portable Operating System Interface
RTT Round-Trip Time
SLA Service Level Agreement
TCP Transmission Control Protocol
UDP User Datagram Protocol


1 Introduction

Though nobody can go back and make a new beginning, anyone can
start over and make a new ending.
—CHICO XAVIER

1.1 Motivation

Distributed systems have been adopted for building high performance systems, due to the
possibility of obtaining high fault tolerance, scalability, availability, and efficient use of
resources (Cox et al., 2002; Antoniu et al., 2007). Modern Internet services and cloud
computing infrastructures are commonly implemented as distributed systems, to provide
services with high performance and reliability (Mi et al., 2012). Cloud computing SLAs
require a short time to identify, diagnose, and solve problems in the production infrastructure,
in order to avoid negative impacts on the quality of service provided to clients. Thus,
monitoring and performance analysis of distributed systems in production environments
have become more necessary with the growth of cloud computing and the use of
distributed systems to provide services and infrastructure as a service (Fox et al., 2009;
Yu et al., 2011).
In distributed systems development, maintenance, and administration, the detection of
error causes and the diagnosis and reproduction of errors are challenges that motivate efforts
towards the development of less intrusive and more effective mechanisms for monitoring
and debugging distributed applications at runtime (Armbrust et al., 2010). Distributed
measurement systems (Massie et al., 2004) and log analysers (Oliner et al., 2012) provide
relevant information regarding some aspects of a distributed system, but this information
can be complemented by correlated information from other sources (Zheng et al., 2012),
such as network traffic analysis, which can provide valuable information about a distributed
application and its environment, and also increases the number of information sources
available for evaluating complex distributed systems. Simulators (Paul, 2010), emulators,
and testbeds (Loiseau et al., 2009; Gupta et al., 2011) are also used to evaluate distributed
systems, but these approaches fall short of reproducing the production behavior of a
distributed system and its relation to a complex environment, such as a cloud computing
environment (Loiseau et al., 2009; Gupta et al., 2011).
Monitoring and diagnosing production failures of distributed systems require low
intrusion, high accuracy, and fast results. Achieving these requirements is complex,
because distributed systems usually involve asynchronous communication, unpredictable
network message behavior, a large number of resources to be monitored in a short time,
and black-box components (Yuan et al., 2011; Nagaraj et al., 2012). To measure
distributed systems with less intrusion and less dependency on developers, approaches
with low dependency on source code or instrumentation are necessary, such as log
analysis or network traffic analysis (Aguilera et al., 2003).
It is possible to measure, evaluate, and diagnose distributed applications through the
evaluation of information from communication protocols, flows, throughput, and load
distribution (Mi et al., 2012; Nagaraj et al., 2012; Sambasivan et al., 2011; Aguilera et al.,
2003; Yu et al., 2011). This information can be collected through network traffic analysis,
but to retrieve this kind of information from distributed application traffic it is necessary
to recognize application protocols and perform DPI, in order to retrieve details of
application behaviors, sessions, and states.
Network traffic analysis is one option to evaluate the performance of distributed systems
(Yu et al., 2011), although there are limitations on the processing capacity to deal with
large amounts of network traffic in a short time, on the scalability to process network
traffic under varying resource demands, and on the complexity of obtaining information
about a distributed application's behavior from network traffic (Loiseau et al., 2009;
Callado et al., 2009). To evaluate application information from network traffic it is
necessary to use DPI and extract information from application protocols, which requires
an additional effort in comparison with traditional DPI approaches, which usually do not
evaluate the content of application protocols or application states.
In the production environment of a cloud computing provider, DPI can be used to
evaluate and diagnose distributed applications through the analysis of application traffic
inside a data center. However, this kind of DPI presents differences and requires more
effort than common DPI approaches. DPI is usually used to inspect all network traffic that
arrives at a data center, but this approach would not provide reasonable performance for
inspecting application protocols and their states, due to the massive volumes of network
traffic to be evaluated online, and the computational cost of performing this kind of
evaluation in a short time (Callado et al., 2009).
Packet level analysis can also be used to evaluate packet flows and the load distribution of
network traffic inside a data center (Kandula et al., 2009), providing valuable information
about the behavior of a distributed system and about the dimension, capacity, and usage
of network resources. However, with packet level analysis it is not possible to evaluate
application messages, protocols, and their states.
Although much work has been done to improve DPI performance (Fernandes et al.,
2009; Antonello et al., 2012), the evaluation of application states through traffic analysis
decreases the processing capacity of DPI for evaluating large amounts of network traffic.
With the growth of link speeds, Internet traffic exchange, and the use of distributed systems
to provide Internet services (Sigelman et al., 2010), new approaches are needed to deal
with the analysis of the growing amount of network traffic, and to permit the efficient
evaluation of distributed systems through network traffic analysis.
MapReduce (Dean and Ghemawat, 2008), which was proposed for the distributed processing
of large datasets, can be an option to deal with large amounts of network traffic.
MapReduce is a programming model and an associated implementation for processing
and generating large datasets. It has become an important programming model and
distribution platform for processing large amounts of data, with diverse use cases in
academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce is a restricted
programming model that easily and automatically parallelizes the execution of user
functions and provides transparent fault tolerance (Dean and Ghemawat, 2008). Based on
combinators from functional languages, it provides a simple programming paradigm for
parallel processing that is increasingly being used for data-intensive applications in cloud
computing environments.
MapReduce can be used for network packet level analysis (Lee et al., 2011), which
evaluates each packet individually to obtain information from the network and transport
layers. Lee et al. (2011) proposed an approach to perform network packet level analysis
through MapReduce, using network traces split into packets, processing each one
individually to extract indicators from IP, TCP, and UDP. However, for profiling an
application through network traffic analysis it is necessary to perform deep packet
inspection, in order to evaluate the content of the application layer, to evaluate application
protocols, and to reassemble application messages.


Because the approach proposed by Lee et al. (2011) is not able to evaluate more than
one packet per MapReduce iteration or to analyse application messages, a new MapReduce
approach is necessary to perform DPI algorithms for profiling applications through
network traffic analysis.
The kind of workload submitted for processing impacts the behaviour and performance of
MapReduce (Tan et al., 2012; Groot, 2012), requiring specific configuration to obtain
optimal performance. Information about the occupation of the MapReduce phases, about
the processing characteristics (whether the job is I/O or CPU bound), and about the mean
duration of Map and Reduce tasks can be used to optimize MapReduce parameter
configurations, in order to improve resource allocation and task scheduling.
Although studies have been conducted to understand, analyse, and improve workload
management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation
that characterizes the MapReduce behaviour or identifies its optimal configuration to
achieve the best performance for packet level analysis and DPI.

1.2 Problem Statement

MapReduce can express several kinds of problems, but not all. MapReduce does not
efficiently express computations over incremental, dependent, or recursive data (Bhatotia
et al., 2011; Lin, 2012), because it adopts batch processing and functions executed
independently, without shared state or data. Although MapReduce is restrictive, it provides
a good fit for many problems that involve processing large datasets. Its expressiveness
limitations may be reduced by decomposing problems into multiple MapReduce iterations,
or by combining MapReduce with other programming models for sub-problems (Lämmel,
2007; Lin, 2012), although the decomposition into iterations increases the completion
time of MapReduce jobs (Lämmel, 2007).
DPI algorithms require the evaluation of one or more packets to retrieve information from
application layer messages; this represents a data dependency when mounting an
application message from network packets, and it is a restriction on using MapReduce for
DPI. Because the MapReduce approach of Lee et al. (2011) for packet level analysis
processes each packet individually, it cannot be used to evaluate more than one packet per
Map function or to efficiently reassemble an application message from network traces.
Thus a new approach is necessary to use MapReduce to perform DPI, evaluating the
effectiveness of MapReduce to express DPI algorithms.
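To make the contrast concrete, the hypothetical sketch below shows what "one whole block per Map call" looks like against the Hadoop Java API. Everything besides the Hadoop classes (the class name, AppMessage, reassemble) is a placeholder of mine, not the dissertation's actual implementation, which is described in Chapter 3.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of the whole-block direction explored in this dissertation:
// a custom input format hands each Map call an entire block of the capture
// (not one packet), so a single call can hold per-flow state, reassemble TCP
// segments and parse application-layer messages before emitting indicators.
// AppMessage and reassemble() are placeholders for DPI logic, not real APIs.
public abstract class DpiBlockMapper
        extends Mapper<LongWritable, BytesWritable, Text, IntWritable> {

    /** Placeholder for a reconstructed application-layer message. */
    public interface AppMessage {
        String type();           // e.g. a JXTA message type
        int payloadLength();
    }

    /** Placeholder for TCP reassembly plus application-protocol parsing over a block. */
    protected abstract List<AppMessage> reassemble(byte[] traceBlock);

    @Override
    protected void map(LongWritable blockOffset, BytesWritable block, Context context)
            throws IOException, InterruptedException {
        // getBytes() returns the backing array; only the first getLength() bytes are valid.
        byte[] data = new byte[block.getLength()];
        System.arraycopy(block.getBytes(), 0, data, 0, block.getLength());

        for (AppMessage msg : reassemble(data)) {
            // One indicator per reconstructed message, e.g. its payload size keyed by type.
            context.write(new Text(msg.type()), new IntWritable(msg.payloadLength()));
        }
    }
}
```

By contrast, in the packet-level scheme of Lee et al. (2011) each Map call sees a single packet, so no per-flow state can be accumulated inside the Map function.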


In elastic environments, such as cloud computing providers, where users can request or
discard resources dynamically, it is important to know how to perform provisioning and
resource allocation in an optimal way. To run MapReduce jobs efficiently, the allocated
resources need to be matched to the workload characteristics, and they should be sufficient
to meet a requested processing capacity or deadline (Lee, 2012).
The main performance evaluations of MapReduce concern text processing (Zaharia
et al., 2008; Chen et al., 2011; Jiang et al., 2010; Wang et al., 2009), where the input
data are split into blocks and into records, to be processed by parallel and independent
Map functions. Although studies have been conducted to understand, analyse, and
improve workload decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no
evaluation that characterizes the MapReduce behavior or identifies its optimal configuration
to achieve the best performance for packet level analysis and DPI. Thus, a characterization
of MapReduce jobs for packet level analysis and DPI is necessary, in order to permit their
optimal configuration for the best performance, and to obtain information that can be used
to predict or simulate the completion time of a job with given resources, in order to
determine whether the job will finish by the deadline with the allocated resources
(Lee, 2012).
The goal of this dissertation is to analyse the processing capacity problem of measuring
distributed systems through network traffic analysis, proposing a solution able to perform
deep inspection of distributed application traffic, in order to evaluate distributed systems
at a data center, using commodity hardware and cloud computing services, in a minimally
intrusive way. We developed an approach based on MapReduce to evaluate the behavior
of distributed systems through DPI, and we evaluated the effectiveness of MapReduce for
a DPI algorithm and its completion time scalability through node addition to the cluster,
measuring a JXTA-based application using virtual machines of a cloud computing
provider. We also evaluated the MapReduce performance for packet level analysis and
DPI, characterizing the behavior followed by the MapReduce phases, the processing
capacity scalability, and the speed-up. In this evaluation we assessed the impact caused by
variations of input size, block size, and cluster size.

1.3 Contributions

We analyse the processing capacity problem of distributed system measurement through
network traffic analysis. The work presented in this dissertation provides the following
contributions:


1. We proposed an approach to implement DPI algorithms through MapReduce,
   using whole blocks as input for Map functions. We showed the effectiveness of
   MapReduce for a DPI algorithm that extracts indicators from distributed
   application traffic, and we showed the MapReduce completion time scalability,
   through node addition to the cluster, for DPI on virtual machines of a cloud
   computing provider;

2. We characterized the behavior followed by the MapReduce phases for packet
   level analysis and DPI, showing that this kind of job is intense in the Map phase
   and highlighting points for improvement;

3. We described the processing capacity scalability of MapReduce for packet
   level analysis and DPI, evaluating the impact caused by variations in input,
   cluster, and block size;

4. We showed the speed-up obtained with MapReduce for DPI, with variations of
   input, cluster, and block size.

1.4 Dissertation Organization

The remainder of this dissertation is organized as follows.
In Chapter 2, we provide background information on network traffic analysis and
MapReduce, and we also review previous work related to the measurement of distributed
applications at runtime and to the use of MapReduce for network traffic analysis.
In Chapter 3, we look at the problem of distributed application monitoring and the
restrictions on using MapReduce for profiling application traffic. There are limitations on
the capacity to process large amounts of network packets in a short time and on the
scalability to process network traffic where there are variations of throughput and resource
demand. To address this problem, we present an approach for profiling application traffic
using MapReduce. Experiments show the effectiveness of our approach for profiling
applications through DPI and MapReduce, and show the completion time scalability
achieved in a cloud computing provider.
In Chapter 4, we present a performance evaluation of MapReduce for network traffic
analysis. Due to the lack of evaluations of MapReduce for traffic analysis and the
peculiarity of this kind of data, this chapter deeply evaluates the performance of
MapReduce for packet level analysis and DPI of distributed application traffic, evaluating
the MapReduce scalability, speed-up, and the behavior followed by the MapReduce
phases. The experiments evidence the predominant phases in this kind of MapReduce job,
and show the impact caused by the input size, block size, and number of nodes on the job
completion time and on the scalability achieved through the use of MapReduce.
In Chapter 5, we conclude the work, summarize our contributions, and present
future work.


2 Background and Related Work
No one knows it all. No one is ignorant of everything. We all know
something. We are all ignorant of something.
—PAULO FREIRE

In this chapter, we provide background information on network traffic analysis, JXTA,
and MapReduce, and we also review previous studies related to the measurement of
distributed applications and to the use of MapReduce for network traffic analysis.

2.1 Background

2.1.1 Network Traffic Analysis

Network traffic measurement can be divided into active and passive measurement, and
a measurement can be performed at the packet or flow level. In packet level analysis,
the measurements are performed on each packet transmitted across the measurement
point. Common packet inspection only analyses the content up to the transport layer,
including the source address, destination address, source port, destination port, and
protocol type, but packet inspection can also analyse the packet payload, performing deep
packet inspection.
Risso et al. (2008) presented a taxonomy of the methods that can be used for network
traffic analysis. According to Risso et al. (2008), Packet Based No State (PBNS) operates
by checking the value of some fields present in each packet, such as the TCP or UDP
ports, and is therefore computationally very simple. Packet Based Per Flow State (PBFS)
requires a session table to manage the session identification (source/destination address,
transport-layer protocol, source/destination port) and the corresponding application layer
protocol, in order to be able to scan the payload looking for a specific rule, usually an
application-layer signature, which increases the processing complexity of this method.
Message Based Per Flow State (MBFS) operates on messages instead of packets. This
method requires a TCP/IP reassembler to handle IP fragments and TCP segments. In this
case, memory requirements increase because of the additional state information that must
be kept for each session and because of the buffers required by the TCP/IP reassembler.
Message Based Per Protocol State (MBPS) interprets exactly what each application sends
and receives. An MBPS processor understands not only the semantics of the messages,
but also the different phases of a message exchange, because it has a full understanding of
the protocol state machine. Memory requirements become even larger, because this
method needs to take into account not only the state of the transport session, but also the
state of each application layer session. Processing requirements are also the highest,
because protocol conformance analysis requires processing the entire application data,
while the previous methods are limited to the first packets of each session.
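To make the difference in state concrete, the sketch below (my own illustration, not code from Risso et al. or from this dissertation; the Packet fields and method names are hypothetical) contrasts a stateless PBNS check with a PBFS session table.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch contrasting two methods of Risso et al.'s taxonomy.
public class TaxonomyExample {

    static class Packet {
        String srcIp, dstIp;
        int srcPort, dstPort;
        String transport;   // "TCP" or "UDP"
        byte[] payload;
    }

    // PBNS: stateless, per-packet check of header fields only (very cheap).
    static boolean pbnsLooksLikeHttp(Packet p) {
        return "TCP".equals(p.transport) && (p.srcPort == 80 || p.dstPort == 80);
    }

    // PBFS: a session table keeps per-flow state, so that once a flow has been
    // labelled (for example by a payload signature found in an earlier packet),
    // later packets of the same flow inherit the label without being re-scanned.
    private final Map<String, String> sessionTable = new HashMap<String, String>();

    String pbfsClassify(Packet p, String signatureFoundInPayload) {
        String flowKey = p.srcIp + ":" + p.srcPort + "->" + p.dstIp + ":" + p.dstPort
                + "/" + p.transport;   // a real table would also match the reverse direction
        if (signatureFoundInPayload != null) {
            sessionTable.put(flowKey, signatureFoundInPayload);
        }
        String label = sessionTable.get(flowKey);
        return label != null ? label : "unknown";
    }
}
```

MBFS and MBPS would add a TCP/IP reassembler and an application-protocol state machine on top of such a table, which is where the extra memory and processing costs described above come from.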
Figure 2.1 illustrates the difference between packet level analysis and DPI over PCAP
files: packet level analysis evaluates each packet individually, while DPI requires the
evaluation of more than one packet, reassembling packets to obtain an application
message.

Figure 2.1 Differences between packet level analysis and deep packet inspection

DPI refers to examining both the packet header and the complete payload to look for
predefined patterns or rules. A pattern or rule can be a particular TCP connection, defined
by source and destination IP addresses and port numbers; it can also be the signature string
of a virus, or a segment of malicious code (Piyachon and Luo, 2006). Antonello et al.
(2012) argue that many critical network services rely on the inspection of the packet
payload, instead of only looking at the information in packet headers. Although DPI
systems are essentially more accurate for identifying application protocols and application
messages, they are also resource-intensive and may not scale well with growing link
speeds. MBFS, MBPS, and DPI evaluate the content of the application layer, so it is
necessary to recognize the content of the evaluated message, and encrypted messages can
make this kind of evaluation infeasible.

2.1.2 JXTA

JXTA is a language and specification for peer-to-peer networking; it attempts to formulate
standard peer-to-peer protocols, in order to provide an infrastructure for building peer-to-peer
applications, through basic functionalities for peer resource discovery, communication,
and organization. JXTA introduces an overlay on top of the existing physical network,
with its own addressing and routing (Duigou, 2003; Halepovic and Deters, 2003).
According to the JXTA specification (Duigou, 2003), JXTA peers communicate through
messages transmitted by pipes, which are an abstraction of virtual channels composed of
input and output channels for peer-to-peer communication. Pipes are not bound to a
physical location; each pipe has its own unique ID, and a peer can carry its pipe even when
its physical network location changes. Pipes are asynchronous, unidirectional, and
unreliable, but bi-directional and reliable services are provided on top of them. JXTA uses
source-based routing: each message carries its routing information as a sequence of peers,
and peers along the path may update this information. The JXTA socket adds reliability
and bi-directionality to JXTA communications through a layer of abstraction on top of the
pipes (Antoniu et al., 2005), and it provides an interface similar to the POSIX socket
specification. JXTA messages are XML documents composed of well-defined and ordered
message elements.
Halepovic and Deters (2005) proposed a performance model, describing important
metrics to evaluate JXTA throughput, scalability, services, and the JXTA behavior across
different versions. Halepovic et al. (2005) analysed JXTA performance to show the
increasing cost and latency under higher workloads and concurrent requests, and suggested
further evaluations of JXTA scalability with large peer groups in direct communication.
Halepovic (2004) notes that network traffic analysis is a feasible approach to the
performance evaluation of JXTA-based applications, but does not adopt it, due to the lack
of JXTA traffic characterization. Although there are performance models and evaluations
of JXTA, there are no evaluations of its current versions and no mechanisms to evaluate
JXTA applications at runtime. Because JXTA is still used for building peer-to-peer systems,
such as U-Store (Fonseca et al., 2012), which motivates our research, a solution is
necessary to measure JXTA-based applications at runtime and provide information about
their behavior and performance.

2.1.3 MapReduce

MapReduce (Dean and Ghemawat, 2008) is a programming model and a framework for
processing large datasets through distributed computing, providing fault tolerance and high
scalability for big data processing. The MapReduce model was designed for unstructured
data processed by clusters of commodity hardware. Its functional style of Map and
Reduce functions automatically parallelizes and executes large jobs over a cluster.
MapReduce also handles failures, application deployment, task duplication, and aggregation
of results, thereby allowing programmers to focus on the core logic of applications.
An application executed through MapReduce is called a job. The input data of a
job, which is stored in a distributed file system, is split into even-sized blocks and
replicated for fault tolerance. Figure 2.2 shows the input dataset splitting adopted by
MapReduce.

Figure 2.2 MapReduce input dataset splitting into blocks and into records


Initially the input dataset is split into blocks and stored in the adopted distributed file
system. During the execution of a job, each split is assigned to be processed by a Mapper;
thus the number of splits of the input determines the number of Map tasks of a MapReduce
job. Each Mapper reads its split from the distributed file system and divides it into records,
to be processed by the user-defined Map function. Each Map function generates
intermediate data from the evaluated block, which will be fetched, ordered by key, and
processed by the Reducers to generate the output of the MapReduce job.
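As a concrete and deliberately generic illustration of this flow, the sketch below shows a minimal Hadoop job in the standard word-count style, using the org.apache.hadoop.mapreduce API of the Hadoop 1.x era; it is not the dissertation's DPI code, only an example of how user-defined Map and Reduce functions consume records and produce intermediate data.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // With the default TextInputFormat, each record handed to map() is one line
    // of the split: the key is its byte offset and the value is the line itself.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE);   // intermediate (key, value) pair
                }
            }
        }
    }

    // The Reducer receives the intermediate values grouped by key and aggregates them.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");      // constructor form used in Hadoop 1.x
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For the network traces studied in this dissertation the records are packets (or whole blocks of packets) rather than text lines, but the split, record, and intermediate-data flow is the same.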
A MapReduce job is divided into Map and Reduce tasks, which are composed of the
user-defined Map and Reduce functions. The execution of these tasks can be grouped
into phases, representing the Map and Reduce phases, and Reduce tasks can be further
divided into additional phases, namely the Shuffle and Sort phases. A job is submitted by
a user to the master node, which selects worker nodes with idle slots and assigns Map or
Reduce tasks to them.
The execution of a Map task can be divided into two phases. In the first, the Map
phase reads the task's split from the distributed file system, parses it into records, and
applies the user-defined Map function to each record. In the second, after the user-defined
Map function has been applied to each input record, the commit phase registers the final
output with the TaskTracker, which then informs the JobTracker that the task has finished
executing. The output of the Map phase is consumed by the Reduce phase.
The execution of a Reduce task can be divided into three phases. The first phase,
called the Shuffle phase, fetches the Reduce task's input data, where each Reduce task is
assigned a partition of the key space produced by the Map phase. The second phase, called
the Sort phase, groups records with the same key. The third phase, called the Reduce
phase, applies the user-defined Reduce function to each key and its values (Kavulya et al.,
2010).
A Reduce task cannot fetch the output of a Map task until the Map has finished and
committed its output to disk. Only after receiving its partition from all Map outputs does
the Reduce task start the Sort phase; until this happens, the Reduce task remains in the
Shuffle phase. After the Sort phase, the Reduce task enters the Reduce phase, in which it
executes the user-defined Reduce function for each key and its values. Finally, the output
of the Reduce function is written to a temporary location on the distributed file system
(Condie et al., 2010).
MapReduce worker nodes are configured to concurrently execute up to a defined number
of Map and Reduce tasks, according to the number of Map and Reduce slots. Each worker
node of a MapReduce cluster is configured with a fixed number of Map slots and another
fixed number of Reduce slots, which define the number of Map or Reduce tasks that can
be executed concurrently per node. During job executions, if all available slots are
occupied, pending tasks must wait until some slots are freed. If the number of tasks in the
job is bigger than the number of available slots, Maps or Reduces are first scheduled to
execute on all available slots; these tasks compose the first wave of tasks, which is
followed by subsequent waves. For example, if an input is broken into 200 blocks and
there are 20 Map slots in a cluster, the number of Map tasks is 200 and the Map tasks are
executed through 10 waves of execution (Lee et al., 2012). The number and sizes of the
waves can aid the configuration of tasks for improved cluster utilization (Kavulya et al.,
2010).
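Using notation introduced here for clarity (not the author's), the example above is simply:

\[
\text{number of waves} \;=\; \left\lceil \frac{\text{number of Map tasks}}{\text{total Map slots}} \right\rceil \;=\; \left\lceil \frac{200}{20} \right\rceil \;=\; 10
\]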
The Shuffle phase of the first Reduce wave may be significantly different from the
Shuffle phase of the subsequent Reduce waves. This happens because the Shuffle phase
of the first Reduce wave overlaps with the entire Map phase, and hence it depends on the
number of Map waves and their durations (Verma et al., 2012b).
Each Map task is independent of the other Map tasks, meaning that all Mappers can
be performed in parallel on multiple machines. The number of concurrent Map tasks in a
MapReduce system is limited by the number of slots and by the number of blocks into
which the input data was divided. Reduce tasks can also be performed in parallel during
the Reduce phase, and the number of Reduce tasks in a job is specified by the application
and by the number of Reduce slots per node.
MapReduce tries to achieve data locality in its job executions, which means that a Map
task and the input data block it will process should be located as close to each other as
possible, so that the Map task can read its input data block incurring as little network
traffic as possible.
Hadoop1 is an open source implementation of MapReduce, which relies on HDFS
for distributed data storage and replication. HDFS is an implementation of the Google File
System (Ghemawat et al., 2003), which was designed to store large files, and it was adopted
by the MapReduce system as the distributed file system to store its files and intermediate data.
The input data type and workload characteristics impact MapReduce performance,
because each application has a different bottleneck resource and requires a specific
configuration to achieve optimal resource utilization (Kambatla et al., 2009). Hadoop has
a set of configuration parameters whose default values are based on the typical
configuration of cluster machines and on the requirements of a typical application, which
usually processes text-like inputs, although the optimal resource utilization of MapReduce
depends on the resource consumption profile of the application.
1 http://hadoop.apache.org/
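For illustration only, the snippet below sets a few of the parameters mentioned above programmatically, using Hadoop 1.x property names (the era of this work); the specific values are arbitrary examples, not tuning recommendations from this dissertation.

```java
import org.apache.hadoop.conf.Configuration;

public class ExampleTuning {
    public static Configuration buildConf() {
        Configuration conf = new Configuration();
        // HDFS block size (Hadoop 1.x name); it also drives the default split
        // size and therefore the number of Map tasks for a given input.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        // Concurrent Map and Reduce slots per worker node (TaskTracker).
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        // Number of Reduce tasks requested for a job.
        conf.setInt("mapred.reduce.tasks", 8);
        // In-memory buffer (MB) used to sort Map output before spilling to disk.
        conf.setInt("io.sort.mb", 200);
        return conf;
    }
}
```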


Because the input data type and workload characteristics of MapReduce jobs impact
MapReduce performance, it is necessary to evaluate the MapReduce behavior and
performance for different purposes. Although much work has been done to understand
and analyse MapReduce for different input data types and workloads (Lu et al., 2012;
Groot, 2012), there is no evaluation that characterizes the MapReduce behavior and
identifies its optimal configuration for packet level analysis and DPI.

2.2 Related Work

2.2.1 Distributed Debugging

Modern Internet services are often implemented as complex, large-scale distributed
systems. Information about the behavior of complex distributed systems is necessary
to evaluate and improve their performance, but understanding distributed system behavior
requires observing related activities across many different components and machines
(Sigelman et al., 2010).
The evaluation of distributed applications is a challenge, due to the cost of monitoring
distributed systems and the lack of performance measurement of large scale distributed
applications at runtime. To reproduce the behavior of a complex distributed system in
a test environment, it is necessary to reproduce each relevant configuration parameter
of the system (Gupta et al., 2011), which is a difficult effort, and is even more evident and
complex in cases where faults only occur when the system is under high load (Loiseau
et al., 2009).
Gupta et al. (2011) presented a methodology and framework for large scale tests, able
to obtain resource configurations and scale close to those of a large scale system, through
the use of an emulated scalable network, multiplexed virtual machines, and resource
dilation. Gupta et al. (2011) showed its accuracy, scalability, and realism for network tests.
However, it cannot obtain the same accuracy as an evaluation of a real system at runtime,
nor can it diagnose, in a short time, a problem that occurred in a production environment.
According to Sambasivan et al. (2011), debugging tools are needed to help the
identification and understanding of the root causes of the diverse performance problems
that can arise in distributed systems. A request flow can be seen as the path and timing of a
request in a distributed system, representing the flow of individual requests within
and across the components of a distributed system. There are many cases for which
the comparison of request-flow traces is useful; it can help to diagnose performance changes
resulting from modifications made during software development or from upgrades of a
deployed system. It can also help to diagnose behaviour changes resulting from component
degradation, resource leakage, or workload changes.
Sigelman et al. (2010) reported on Dapper, Google's tracing framework for large
production distributed systems, which states three concrete design goals: low overhead,
application-level transparency, and scalability. These goals were achieved by restricting
Dapper's core tracing instrumentation to Google's ubiquitous threading, control flow, and
RPC library code. Dapper provides valuable insights about the evaluation of distributed
systems through flows and procedure calls, but its implementation depends on instrumenting
the component responsible for message communication in the distributed system, which
may not be available in a black-box system.
Some techniques have been developed for the performance evaluation of distributed
systems. Mi et al. (2012) proposed an approach, based on end-to-end request trace logs, to
identify the primary causes of performance problems in cloud computing systems. Nagaraj
et al. (2012) compared logs of distributed systems to diagnose performance problems,
using machine learning techniques to analyse logs and to explore information about states
and event times. Sambasivan et al. (2011) used request flows to find performance
modifications in distributed systems, comparing request flows across periods and ranking
them based on their impact on the system's performance. Although these approaches
evaluate requests, flows, and events of distributed systems, traffic analysis was not used as
an approach to provide the desired information.
Aguilera et al. (2003) proposed an approach to isolate performance bottlenecks in
distributed systems, based on message-level trace activity and algorithms for inferring
the dominant paths of a distributed system. Although network traffic was considered as a
source from which to extract the desired information, a distributed approach was not
adopted for data processing.
Yu et al. (2011) presented SNAP, a scalable network-application profiler to evaluate
the interactions between applications and the network. SNAP passively collects TCP
statistics and socket logs, and correlates them with network resources to indicate problem
locations. However, SNAP does not evaluate application traffic, nor does it use distributed
computing to process network traffic.

2.2.2 MapReduce for Network Traffic Analysis

Lee et al. (2010) proposed a network flow analysis method using MapReduce, where the
network traffic was captured, converted to text, and used as input to Map tasks. As a result,
improvements in fault tolerance and computation time were shown when compared with
flow-tools2. The conversion time from binary network traces to text represents a relevant
additional cost, which can be avoided by adopting binary data as the input data of
MapReduce jobs.
Lee et al. (2011) presented a Hadoop-based packet trace processing tool to process
large amounts of binary network traffic. A new input type for Hadoop was developed,
the PcapInputFormat, which encapsulates the complexity of processing captured binary
PCAP traces and of extracting the packets through the Libpcap library (Jacobson et al.,
1994). Lee et al. (2011) compared their approach with CoralReef3, a network traffic
analysis tool that also relies on Libpcap; the results of the evaluation showed a completion
time speed-up for a case that processes packet traces of more than 100 GB. This approach
implemented a packet level evaluation, to extract indicators from IP, TCP, and UDP,
evaluating the job completion time achieved with different input sizes and two cluster
configurations. The authors implemented their own component to save network traces
into blocks, and the developed PcapInputFormat relies on a timestamp-based heuristic,
using a sliding window, to find the first packet of each block. These implementations for
iterating over the packets of a network trace can present an accuracy limitation, compared
with the accuracy obtained by Tcpdump4 and Libpcap for the same functionality.
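For concreteness, the sketch below (my own illustration, not Lee et al.'s code) shows the basic record iteration that any libpcap trace reader must perform: after the 24-byte global file header, each packet record consists of a 16-byte header (timestamp, captured length, original length) followed by the captured bytes. A distributed reader such as PcapInputFormat has the extra problem of finding the first such record boundary inside an arbitrary HDFS block, which is what the timestamp heuristic mentioned above addresses.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Sequentially walks the packet records of a libpcap (PCAP) trace and prints
// packet/byte counts. Assumes a little-endian capture file (the common
// 0xa1b2c3d4 magic written on little-endian hosts); a robust reader must check
// the magic number to decide the byte order.
public class PcapWalker {

    private static long readUInt32LE(DataInputStream in) throws IOException {
        byte[] b = new byte[4];
        in.readFully(b);
        return (b[0] & 0xFFL) | ((b[1] & 0xFFL) << 8) | ((b[2] & 0xFFL) << 16) | ((b[3] & 0xFFL) << 24);
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(new byte[24]);                 // skip the global PCAP file header
            long packets = 0, wireBytes = 0;
            while (true) {
                long inclLen, origLen;
                try {
                    readUInt32LE(in);                   // record timestamp, seconds
                } catch (EOFException end) {
                    break;                              // clean end of trace
                }
                readUInt32LE(in);                       // record timestamp, microseconds
                inclLen = readUInt32LE(in);             // bytes actually captured in this record
                origLen = readUInt32LE(in);             // original packet length on the wire
                in.readFully(new byte[(int) inclLen]);  // packet bytes: skip, or parse headers here
                packets++;
                wireBytes += origLen;
            }
            System.out.println(packets + " packets, " + wireBytes + " bytes on the wire");
        } finally {
            in.close();
        }
    }
}
```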
The approach proposed by Lee et al. (2011) is not able to evaluate more than one
packet per MapReduce iteration, because each block is divided into packets that are
evaluated individually by the user-defined Map function. Therefore, a new MapReduce
approach is necessary to perform DPI algorithms, which require the reassembly of more
than one packet to mount an application message, in order to evaluate message contents,
application states, and application protocols.

2.3 Chapter Summary

In this chapter, we presented background information on network traffic analysis,
JXTA, and MapReduce, and we also reviewed previous studies related to the measurement
of distributed applications and to the use of MapReduce for network traffic analysis.
According to the background and related work evaluated, the detection of error causes,
and the diagnosis and reproduction of errors of distributed systems, are challenges that
motivate efforts to develop less intrusive mechanisms for monitoring and debugging
distributed applications at runtime. Network traffic analysis is one option for distributed
systems measurement, although there are limitations on the capacity to process large
amounts of network traffic in a short time, and on the scalability to process network traffic
where there is variation of resource demand.

2 www.splintered.net/sw/flow-tools/
3 http://www.caida.org/tools/measurement/coralreef
4 http://www.tcpdump.org/
Although MapReduce can be used for packet level analysis, an approach is necessary to
use MapReduce for DPI, in order to evaluate distributed systems at a data center through
network traffic analysis, using commodity hardware and cloud computing services, in a
minimally intrusive way. Due to the lack of evaluations of MapReduce for traffic analysis
and the peculiarity of this kind of data, it is necessary to evaluate the performance of
MapReduce for packet level analysis and DPI, characterizing the behavior followed by the
MapReduce phases, its processing capacity scalability, and its speed-up, over variations of
the most important MapReduce configuration parameters.


3 Profiling Distributed Applications Through Deep Packet Inspection
Life is really simple, but we insist on making it complicated.
—CONFUCIUS

In this chapter, we first look at the problems of distributed application monitoring, of the
processing capacity for network traffic, and of the restrictions on using MapReduce for
profiling the network traffic of distributed applications.
Network traffic analysis can be used to extract performance indicators from the
communication protocols, flows, throughput, and load distribution of a distributed system. In
this context, network traffic analysis can enrich diagnoses and provide a mechanism for
measuring distributed systems in a passive way, with low overhead and low dependency
on developers.
However, there are limitations on the capacity to process large amounts of network
traffic in a short time, and on the scalability of processing capacity needed to handle network
traffic over variations of throughput and resource demands. To address this problem, we
present an approach for profiling application network traffic using MapReduce. Experiments
show the effectiveness of our approach for profiling a JXTA-based distributed
application through DPI, and its completion time scalability through node addition, in a
cloud computing environment.
In Section 3.1 we begin this chapter by motivating the need for an approach using
MapReduce for DPI; then we describe, in Section 3.2, the proposed architecture and the
DPI algorithm to extract indicators from the network traffic of a JXTA-based distributed
application. Section 3.3 presents the adopted evaluation methodology and the experiment
setup used to evaluate our proposed approach. The obtained results are presented in
Section 3.4 and discussed in Section 3.5. Finally, Section 3.6 concludes and summarizes
this chapter.

3.1 Motivation

Modern Internet services and cloud computing infrastructures are commonly implemented
as distributed systems, to provide services with high performance, scalability, and
reliability. Cloud computing SLAs require a short time to identify, diagnose, and solve problems
in the infrastructure, in order to avoid negative impacts on the provided quality of
service.
Monitoring and performance analysis of distributed systems became more necessary
with the growth of cloud computing and the use of distributed systems to provide services
and infrastructure (Fox et al., 2009). In distributed systems development, maintenance, and
administration, the detection of error causes and the diagnosis and reproduction of errors
are challenges that motivate efforts to develop less intrusive mechanisms for debugging
and monitoring distributed applications at runtime (Armbrust et al., 2010). Distributed
measurement systems (Massie et al., 2004) and log analysers (Oliner et al., 2012) provide
relevant information about some aspects of a distributed system. However, this information
can be complemented by correlating it with information from network traffic analysis,
making these tools more effective and increasing the information sources available to
ubiquitously evaluate a distributed system.
Low overhead, transparency and scalability are common requirements for an efficient
solution for measuring distributed systems. Many approaches have been proposed in this
direction, using instrumentation or logging, which cause overhead and a dependency on
developers. It is possible to diagnose and evaluate the performance of distributed
applications by evaluating information from communication protocols, flows, throughput
and load distribution (Sambasivan et al., 2011; Mi et al., 2012). This information can be
collected through network traffic analysis, enriching a diagnosis and also providing a way
to measure distributed systems passively, with low overhead and low dependency on
developers.
Network traffic analysis is one option for evaluating the performance of distributed
systems (Yu et al., 2011), although there are limitations on the capacity to process a large
number of network packets in a short time (Loiseau et al., 2009; Callado et al., 2009) and
on the scalability needed to process network traffic under varying throughput and
resource demands. To obtain information about the behaviour of distributed systems from
network traffic, it is necessary to use DPI and evaluate information from application
states, which requires additional effort in comparison with traditional DPI approaches,
which usually do not evaluate application states.
Although much work has been done to improve DPI performance (Fernandes et al., 2009;
Antonello et al., 2012), the evaluation of application states still decreases the capacity of
DPI to process large amounts of network traffic. With the growth of link speeds, Internet
traffic exchange and the use of distributed systems to provide Internet services (Sigelman
et al., 2010), new approaches are needed to deal with the analysis of the growing amount
of network traffic and to permit the efficient evaluation of distributed systems through
network traffic analysis.
MapReduce (Dean and Ghemawat, 2008) has become an important programming model
and distribution platform for processing large amounts of data, with diverse use cases in
academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce can be used
for packet-level analysis: Lee et al. (2011) proposed an approach that splits network
traces into packets and evaluates each packet individually, extracting indicators from the
IP, TCP and UDP headers of the network and transport layers.
However, for profiling distributed applications through network traffic analysis, it is
necessary to analyse the content of more than one packet, up to the application layer, to
evaluate application messages and their protocols. Due to TCP and message segmentation,
the desired application message may be split into several packets. Therefore, to perform
deep packet inspection it is necessary to evaluate more than one packet per MapReduce
iteration, so that packets can be reassembled into application messages and information
can be retrieved about application sessions, states and protocols.
DPI refers to examining both the packet header and the complete payload to look for
predefined patterns or rules, which can be a signature string or an application message.
According to the taxonomy presented by Risso et al. (2008), deep packet inspection can
be classified as message based per flow state (MBFS), which analyses application
messages and their flows, and as message based per protocol state (MBPS), which
analyses application messages and their application protocol states; the latter is what is
needed to evaluate distributed applications through network traffic analysis and extract
application indicators.


MapReduce is a restricted programming model that parallelizes user functions
automatically and provides transparent fault tolerance (Dean and Ghemawat, 2008), based
on combinators from functional languages. MapReduce does not efficiently express
incremental, dependent or recursive computations (Bhatotia et al., 2011; Lin, 2012),
because it adopts batch processing and executes functions independently, without shared
state.
Although restrictive, MapReduce is a good fit for many problems involving the processing
of large datasets. Its expressiveness limitations may also be reduced by decomposing a
problem into multiple MapReduce iterations, or by combining MapReduce with other
programming models for subproblems (Lämmel, 2007; Lin, 2012), although this may not
be optimal in some cases. DPI algorithms require the evaluation of one or more packets to
retrieve information from application messages; this represents a data dependency when
assembling an application message and is a restriction on the use of MapReduce for DPI.
Because the Lee et al. (2011) approach processes each packet individually, it cannot be
used efficiently to evaluate more than one packet and reassemble an application message
from a network trace, which makes a new approach necessary for using MapReduce to
perform DPI and evaluate application messages.
To process large amounts of network traffic using commodity hardware, in order to
evaluate the behaviour of distributed systems at runtime, and because the effectiveness
and processing capacity of MapReduce for DPI had not yet been evaluated, we developed
a MapReduce-based approach to deeply inspect distributed application traffic, using
Hadoop, an open source implementation of MapReduce.
In this chapter, we evaluate the effectiveness of MapReduce for a DPI algorithm and its
completion time scalability through node addition, to measure a JXTA-based application,
using virtual machines of Amazon EC2¹, a cloud computing provider. The main
contributions of this chapter are:
1. To provide an approach to implement DPI algorithms using MapReduce;
2. To show the effectiveness of MapReduce for DPI;
3. To show the completion time scalability of MapReduce for DPI, using virtual
machines of cloud computing providers.
1 http://aws.amazon.com/ec2/


3.2 Architecture

In this section we present the architecture of the proposed approach to capture and process
the network traffic of distributed applications.
To monitor distributed applications through network traffic analysis, specific points of a
data center must be monitored to capture the desired application network traffic. An
approach is also needed to process a large amount of network traffic in an acceptable
time. According to Sigelman et al. (2010), fresh information enables a faster reaction to
production problems, so the information must be obtained as soon as possible, although a
trace analysis system operating on hours-old data is still valuable for monitoring
distributed applications in a data center.
In this direction, we propose a pipelined process that captures network traffic, stores it
locally, transfers it to a distributed file system, and evaluates the network trace to extract
application indicators. We use MapReduce, implemented by Apache Hadoop, to process
application network traffic and extract application indicators, providing an efficient and
scalable solution for DPI and for profiling application network traffic in a production
environment, using commodity hardware.
The architecture for network traffic capture and processing is composed of four main
components: the SnifferServer (shown in Figure 3.1), which captures, splits and stores
network packets into HDFS for batch processing through Hadoop; the Manager, which
orchestrates the collected data and the job executions and stores the generated results; the
AppParser, which converts network packets into application messages; and the
AppAnalyzer, which implements Map and Reduce functions to extract the desired
indicators.

Figure 3.1 Architecture of the SnifferServer to capture and store network traffic


Figure 3.1 shows the architecture of the SnifferServer and its placement at monitoring
points of a data center. The SnifferServer captures network traffic from specific points
and stores it into HDFS for batch processing through Hadoop. The Sniffer executes
user-defined monitoring plans guided by a specification of places, times, traffic filters and
the amount of data to be captured. According to a user-defined monitoring plan, the
Sniffer starts capturing the desired network traffic through Tcpdump, which saves the
traffic in binary files known as PCAP files. The collected traffic is split into files of a
predefined size, saved in the local SnifferServer file system, and transferred to HDFS
only when each file has been completely written to the local file system of the
SnifferServer. The SnifferServer must be connected to the network where the monitored
target nodes are connected, and must be able to communicate with the other nodes that
compose the HDFS cluster.
During the execution of a monitoring plan, the network traffic must first be captured,
split into even-sized files and stored into HDFS. Through Tcpdump, a widely used
LibPCAP-based network traffic capture tool, the packets are captured and split into PCAP
files of 64MB, which is the default block size of HDFS, although this block size may be
configured to different values.
HDFS is optimized to store large files, but internally each file is split into blocks with
a predefined size. Files that are greater than the HDFS block size must be split into blocks
with size equal to or smaller than the adopted block size, and must be spread among
machines in the cluster.
Because LibPCAP, used by Tcpdump, stores network packets in binary PCAP files, and
because providing HDFS with an algorithm for splitting PCAP files into packets is
complex, PCAP file splitting can be avoided by adopting files no larger than the HDFS
block size; alternatively, Hadoop can be provided with an algorithm to split PCAP files
into packets, so that arbitrary PCAP files can be stored into HDFS.
We adopted the approach that saves the network trace into PCAP files with the adopted
HDFS block size, using the split functionality provided by Tcpdump, because splitting
PCAP files into packets demands additional computing time and increases the complexity
of the system. Thus, the network traffic is captured by Tcpdump, split into even-sized
PCAP files, stored into the local file system of the SnifferServer, and periodically
transferred to HDFS, which is responsible for replicating the files across the cluster.
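As a concrete illustration of the transfer step, the sketch below uses Hadoop's FileSystem API to push completed PCAP files from the SnifferServer's local disk into HDFS. It is only a minimal sketch of the idea, not the SnifferServer's actual code; the local directory and HDFS destination paths are hypothetical.

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Pushes closed PCAP files from the local capture directory to HDFS. */
    public class PcapUploader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up the cluster settings (core-site.xml)
            FileSystem hdfs = FileSystem.get(conf);
            File localDir = new File("/var/capture/closed");   // hypothetical directory of finished PCAP files
            Path remoteDir = new Path("/traces/captured/");    // hypothetical HDFS destination directory

            for (File pcap : localDir.listFiles()) {
                if (!pcap.getName().endsWith(".pcap")) {
                    continue;                                  // skip files still being written under another name
                }
                // copy the file to HDFS and delete the local copy after a successful transfer
                hdfs.copyFromLocalFile(true, new Path(pcap.getAbsolutePath()), remoteDir);
            }
            hdfs.close();
        }
    }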
In the MapReduce framework, the input data is split into blocks, which are further split
into small pieces, called records, to be used as input for each Map function. We adopt the
use of entire blocks, with size defined by the HDFS block size, as input for each Map
function, instead of dividing a block into records. With this approach, it is possible to
evaluate more than one packet per MapReduce task and to assemble an application
message from the network traffic. It is also possible to obtain more processing time per
Map function than in the approach where each Map function receives only one packet as
input.
Unlike the approach presented by Lee et al. (2011), which only permits evaluating one
packet per Map function, our approach makes it possible to evaluate many packets from a
PCAP file per Map function and to reassemble application messages whose content was
divided into many packets for transfer over TCP.
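One way to realize this splitting policy in Hadoop is an input format that refuses to split files and hands each Map task a single record containing the path of one PCAP file, matching the behaviour described for our Map functions below. The class is an illustrative sketch under that assumption, with a hypothetical name; it is not the dissertation's actual implementation.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    /** Delivers each PCAP file as exactly one record, so a Map task sees a whole block. */
    public class WholePcapInputFormat extends FileInputFormat<NullWritable, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;                                     // never divide a PCAP file into records
        }

        @Override
        public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException {
            return new RecordReader<NullWritable, Text>() {
                private boolean delivered = false;
                private final Text path = new Text();

                @Override
                public void initialize(InputSplit s, TaskAttemptContext c) {
                    path.set(((FileSplit) s).getPath().toString());
                }

                @Override
                public boolean nextKeyValue() {
                    if (delivered) {
                        return false;
                    }
                    delivered = true;                          // a single record: the file path
                    return true;
                }

                @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
                @Override public Text getCurrentValue() { return path; }
                @Override public float getProgress() { return delivered ? 1.0f : 0.0f; }
                @Override public void close() { }
            };
        }
    }

A driver would then register such a class through job.setInputFormatClass, and each map() call would open the referenced file with the PCAP parsing library.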
Figure 3.2 shows the architecture for processing distributed application traffic through
Map and Reduce functions, implemented by the AppAnalyzer, which is deployed at the
Hadoop nodes, is managed by the Manager, and has its generated results stored in a
distributed database.

Figure 3.2 Architecture for network traffic analysis using MapReduce

The communication between components is either blocking or non-blocking: blocking
communication was adopted in cases that require high consistency, and non-blocking
communication was adopted in cases where eventual consistency can be used to obtain
better response time and scalability.


AppAnalyzer is composed of Mappers and Reducers for specific application protocols
and indicators. AppAnalyzer extends AppParser, which provides protocol parsers to
transform network traffic into programmable objects, providing a high level abstraction
to handle application messages from network traffic.
The Manager provides functionality for users to create monitoring plans specifying the
places, times and amount of data to be captured. The amount of data to be processed and
the number of Hadoop nodes available for processing are important factors for obtaining
an optimal completion time of MapReduce jobs and for generating fresh information that
allows a faster reaction to production problems of the monitored distributed system. Thus,
after the network traffic is captured and the PCAP files are stored into HDFS, the
Manager permits the selection of the number of files to be processed, and then schedules
a MapReduce job for this processing. After each MapReduce job execution, the Manager
is also responsible for storing the generated results into a distributed database.
We adopted a distributed database with eventual consistency and high availability, based
on Amazon's Dynamo (DeCandia et al., 2007) and implemented by Apache Cassandra²,
to store the indicator results generated by the AppAnalyzer. With eventual consistency,
we expect gains from fast write and read operations, reducing the blocking time of these
operations.
The AppAnalyzer provides Map and Reduce functions for evaluating specific protocols
and the desired indicators. Each Map function receives as input the path of a PCAP file
stored into HDFS; this path is assigned by Hadoop's data locality control, which tries to
delegate each task to a node that has a local replica of the data or that is near a replica.
The file is then opened and each network packet is processed, to reassemble messages and
flows and to extract the desired indicators.
During data processing, the indicators are extracted from application messages and saved
in a SortedMapWritable object, which is ordered by timestamp. SortedMapWritable is a
sorted collection of values which will be used by the Reduce functions to summarize each
evaluated indicator. In our approach, each evaluated indicator is extracted and saved into
an individual Hadoop result file, which is stored into HDFS.
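The fragment below sketches how such a timestamp-ordered collection can be built with Hadoop's SortedMapWritable before being emitted by a Map function. The class name and value types are illustrative assumptions, not taken from the AppAnalyzer source.

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SortedMapWritable;

    /** Accumulates (timestamp, value) samples of one indicator in capture order. */
    public class IndicatorSeries {
        private final SortedMapWritable samples = new SortedMapWritable();

        /** Adds one sample; SortedMapWritable keeps entries sorted by key, here the capture timestamp. */
        public void add(long timestampMicros, double value) {
            samples.put(new LongWritable(timestampMicros), new DoubleWritable(value));
        }

        /** Returns the writable map, ready to be used as the value of a Map output pair. */
        public SortedMapWritable asWritable() {
            return samples;
        }
    }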
MapReduce usually splits blocks into records to be used as input for Map functions, but
we adopt whole files as input for Map tasks, to be able to perform DPI and reassemble
application messages whose content was divided into several TCP packets, due to TCP
segmentation or to an implementation decision of the evaluated application. If an
application message is smaller than the maximum segment size (MSS), one TCP packet
2 http://cassandra.apache.org/


can transport one or more application messages, but if an application message is greater
than the MSS, the message is split into several TCP packets, according to TCP
segmentation. Thus, it is necessary to evaluate the full content of several TCP segments to
recognize application messages and their protocols.
If an application message has its packets spread across two or more blocks, the Map
function can emit intermediate data for this unevaluated message, grouping it by flow and
by its individual identification, and the Reduce function can then reassemble the message
and evaluate it.
To evaluate the effectiveness of our approach, we developed a pilot project to extract
application indicators from the traffic of a JXTA-based distributed application; this
application implements a distributed backup system based on JXTA Sockets. To analyse
JXTA-based network traffic, we developed JNetPCAP-JXTA (Vieira, 2012b), which
parses network traffic into Java JXTA messages, and JXTAPerfMapper and
JXTAPerfReducer, which extract application indicators from the JXTA Socket
communication layer through Map and Reduce functions.
JNetPCAP-JXTA is written in Java and provides methods to convert byte arrays into Java
JXTA messages, using an extension of the default JXTA library for Java, known as
JXSE³. With JNetPCAP-JXTA, we are able to parse all kinds of messages defined by the
JXTA specification. JNetPCAP-JXTA relies on the JNetPCAP library to support the
instantiation and inspection of LibPCAP packets. JNetPCAP was adopted due to its
performance when iterating over packets, the large number of functionalities provided to
handle packet traces, and the recent update activity for this library.
JXTAPerfMapper implements a Map function that receives as input the path of a PCAP
file stored into HDFS; the content of the specified file is then processed to extract the
number of JXTA connection requests and the number of JXTA message arrivals at a
server peer, and to evaluate the round-trip time of each piece of content transmitted over
a JXTA Socket. If a JXTA message is greater than the TCP PDU size, the message is split
into several TCP segments, due to TCP segmentation. Additionally, in JXTA network
traffic one TCP packet can transport one or more JXTA messages, due to the buffer
window size used by the Java JXTA Socket implementation to segment its messages.
Because more than one JXTA message can be transported per packet and because of TCP
segmentation, it is necessary to reassemble more than one packet and to inspect the full
content of each TCP segment to recognize all possible JXTA messages, instead of
evaluating only a message header or the signature of individual packets, as is commonly
done
3 http://jxse.kenai.com/


in DPI or by widely used traffic analysis tools, such as Wireshark⁴, which is unable to
recognize all JXTA messages in a captured network trace, because it does not identify
when two or more JXTA messages are transported in the same TCP packet.
JXTAPerfMapper implements a DPI algorithm to recognize, sort and reassemble TCP
segments into JXTA messages, which is shown in Algorithm 1.
Algorithm 1 JXTAPerfMapper
for all tcpPacket do
    if isJXTA or isWaitingForPendings then
        parsePacket(tcpPacket)
    end if
end for

function parsePacket(tcpPacket)
    parseMessage
    if isMessageParsed then
        updateSavedFlows
        if hasRemain then
            parsePacket(remainPacket)
        end if
    else
        savePendingMessage
        lookForMoreMessages
    end if
end function
For each TCP packet of the PCAP file, the algorithm checks whether it is a JXTA message
or part of a JXTA message that was not fully parsed and is waiting for its complement; if
one of these conditions is true, a parse attempt is made, using the JNetPCAP-JXTA
functionality, until the packet content has been fully examined. As a TCP packet may
contain one or more JXTA messages, if a message is fully parsed, another parse attempt is
made on the content not consumed by the previous parse. If the content belongs to a
JXTA message and the parse attempt is not successful, the TCP payload is stored with its
TCP flow identification as a key, and all subsequent TCP packets that match the flow
identification will be sorted and used to attempt to assemble a new JXTA message, until
the parser succeeds.
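The flow-keyed buffering described above can be pictured as a small helper around a map from flow identifiers to accumulated payload bytes. The class below is only an illustrative sketch of that data structure, with hypothetical names; it is not extracted from the JXTAPerfMapper source.

    import java.io.ByteArrayOutputStream;
    import java.util.HashMap;
    import java.util.Map;

    /** Buffers the payload of partially parsed JXTA messages, keyed by TCP flow. */
    public class PendingMessageBuffer {
        // key: flow identification such as "srcIp:srcPort-dstIp:dstPort"
        private final Map<String, ByteArrayOutputStream> pending = new HashMap<String, ByteArrayOutputStream>();

        /** Appends a segment's payload to its flow and returns all bytes gathered so far. */
        public byte[] append(String flowKey, byte[] payload) {
            ByteArrayOutputStream buffer = pending.get(flowKey);
            if (buffer == null) {
                buffer = new ByteArrayOutputStream();
                pending.put(flowKey, buffer);
            }
            buffer.write(payload, 0, payload.length);
            return buffer.toByteArray();                      // handed to the parser for a new attempt
        }

        /** Discards the buffered bytes once a JXTA message has been successfully parsed. */
        public void clear(String flowKey) {
            pending.remove(flowKey);
        }
    }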
With these characteristics, inspecting JXTA messages and extracting application
indicators requires more effort than other cases of DPI. For this kind of traffic analysis
4 http://www.wireshark.org/


the memory requirements become even larger, because the analysis needs to take into
account not only the state of the transport session, but also the state of each application
layer session. The required processing power is also the highest, because protocol
conformance analysis requires processing the entire application data (Risso et al., 2008).
As previously shown in Figure 3.2, the AppAnalyzer is composed of Map and Reduce
functions, respectively JXTAPerfMapper and JXTAPerfReducer, that extract performance
indicators from the JXTA Socket communication layer, a JXTA communication
mechanism that implements reliable message exchange and obtains the best throughput
among the communication layers provided by the Java JXTA implementation.
JXTA Socket messages are transported over TCP, but the JXTA Socket also implements
its own control for data delivery, retransmission and acknowledgements. Each message of
a JXTA Socket is part of a Pipe that represents a connection established between the
sender and the receiver. In a JXTA Socket communication, two Pipes are established, one
from sender to receiver and the other from receiver to sender, which transport content
messages and acknowledgement messages, respectively. To evaluate and extract
performance indicators from a JXTA Socket, the messages must be sorted, grouped and
linked with their respective content and acknowledgement Pipes.
The content transmitted over a JXTA Socket is split into byte-array blocks and stored in
reliability messages that are sent to the destination, which is expected to return an
acknowledgement message confirming their arrival. The time between the message
delivery and the moment the acknowledgement is sent back is called the round-trip time
(RTT); it may vary according to the system load and may indicate a possible overload of a
peer. In the Java JXTA implementation, each block received or to be sent is queued by the
JXTA implementation until the system is ready to process a new block. This waiting time
to handle messages can impact the response time of the system, increasing the message
RTT.
JXTAPerfMapper and JXTAPerfReducer evaluate the RTT of each content block
transmitted over a JXTA Socket, and also extract information about the number of
connection requests and message arrivals over time. Each Map function evaluates the
packet trace to assemble JXTA messages, Pipes and Sockets. The parsed JXTA messages
are sorted by their sequence number and grouped by their Pipe identification, to compose
the Pipes of a JXTA Socket. As soon as the messages are sorted and grouped, the RTT is
obtained, its value is associated with its key, and it is written as an output of the Map
function.
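The RTT computation itself reduces to pairing each content message with the acknowledgement that refers to the same sequence number, as sketched below. This is a minimal illustration of the pairing logic under that assumption; the class and method names are hypothetical and independent of the JXTAPerfMapper source.

    import java.util.HashMap;
    import java.util.Map;

    /** Pairs JXTA reliability messages with their acknowledgements to compute per-block RTTs. */
    public class RttCalculator {
        // sequence number of a content message -> capture timestamp in microseconds
        private final Map<Integer, Long> sentAt = new HashMap<Integer, Long>();

        /** Records the capture time of a content message travelling on the content Pipe. */
        public void onContentMessage(int sequenceNumber, long timestampMicros) {
            sentAt.put(Integer.valueOf(sequenceNumber), Long.valueOf(timestampMicros));
        }

        /** Returns the RTT in microseconds, or -1 if the matching content message was not observed. */
        public long onAcknowledgement(int sequenceNumber, long timestampMicros) {
            Long sent = sentAt.remove(Integer.valueOf(sequenceNumber));
            return sent == null ? -1L : timestampMicros - sent.longValue();
        }
    }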
The Reduce function defined by JXTAPerfReducer receives as input a key and a collection
of values, which are the evaluated indicator and its collected values, respectively, and
then generates an individual file with the results of each evaluated indicator.
The requirements for extending these Map and Reduce functions to address other
application indicators, such as throughput or number of retransmissions, are that each
indicator must be represented by an intermediate key, which is used by MapReduce for
grouping and sorting, and that the collected values must be associated with their key.
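Following that contract, a reducer can simply merge the timestamp-ordered samples received for one indicator key and emit them as a single record, as in the sketch below. The class uses Hadoop's mapreduce API with a hypothetical name; the actual JXTAPerfReducer may differ, for instance in how it arranges the per-indicator output files.

    import java.io.IOException;

    import org.apache.hadoop.io.SortedMapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Merges the timestamp-ordered samples of one indicator and emits them as one record. */
    public class IndicatorReducer extends Reducer<Text, SortedMapWritable, Text, SortedMapWritable> {

        @Override
        protected void reduce(Text indicator, Iterable<SortedMapWritable> partials, Context context)
                throws IOException, InterruptedException {
            SortedMapWritable merged = new SortedMapWritable();
            for (SortedMapWritable partial : partials) {
                merged.putAll(partial);                       // samples remain ordered by timestamp key
            }
            // one output record per indicator key; each reducer writes its output to its own HDFS file
            context.write(indicator, merged);
        }
    }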

3.3 Evaluation

In this section we perform an experiment to evaluate the effectiveness of MapReduce in
expressing DPI algorithms and its completion time scalability for profiling distributed
applications through DPI; our scope was therefore limited to evaluating the AppAnalyzer,
the AppParser and the Hadoop environment from the architecture presented above.

3.3.1 Evaluation Methodology

For this experimental evaluation, we adopted a methodology based on aspects of the GQM
(Goal-Question-Metric) template (Basili et al., 1994) and on the systematic approach to
performance evaluation defined by Jain (1991).
Two questions were defined to achieve our goal:
• Q1: Can MapReduce express DPI algorithms and extract application indicators from
the network traffic of distributed applications?
• Q2: Is the completion time of MapReduce for DPI proportionally scalable with the
addition of worker nodes?
To answer these questions, the metrics described in Table 3.1 were evaluated; they
capture the number of indicators extracted from distributed application traffic and the
behaviour of the completion time scalability obtained as the number of worker nodes in a
MapReduce cluster varies. The completion time scalability evaluates how the completion
time decreases with node addition to a MapReduce cluster, for processing a defined input
dataset.
This experimental evaluation adopts the factors and levels described in Table 3.2, which
represent the number of worker nodes of a MapReduce cluster and the input size used in
MapReduce jobs. These factors make it possible to evaluate the scalability behaviour of
MapReduce over variations in the selected factors.

Table 3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI of a JXTA-based network traffic

Metric                          Description                                                          Question
M1: Number of Indicators        Number of application indicators extracted from a distributed       Q1
                                application traffic.
M2: Proportional Scalability    Verify if the completion time decreases proportionally to the       Q2
                                number of worker nodes.

Table 3.2 Factors and levels to evaluate the defined metrics

Factor                    Levels
Number of worker nodes    3 up to 19
Input size                16GB and 34GB

Our testing hypotheses are defined in Tables 3.3 and 3.4, which describe the null and
alternative hypotheses for each previously defined question. Table 3.3 describes our
hypotheses and Table 3.4 presents the notation used to evaluate them.
Table 3.3 Hypotheses to evaluate the defined metrics

Q1, null hypothesis H0num.indct: It is not possible to use MapReduce for extracting application indicators from network traffic.
Q1, alternative hypothesis H1num.indct: It is possible to use MapReduce for extracting application indicators from network traffic.
Q2, null hypothesis H0scale.prop: The completion time of MapReduce for DPI scales proportionally to node addition.
Q2, alternative hypothesis H1scale.prop: The completion time of MapReduce for DPI does not scale proportionally to node addition.

The hypotheses H1num.indct and H0num.indct were defined to evaluate whether
MapReduce can be used to extract application indicators from network traffic; for this
evaluation, we analysed the number of indicators extracted from a JXTA-based network
traffic, represented by µnum.indct.
It is common to see statements saying that MapReduce scalability is linear, but achieving
linear scalability in distributed systems is a difficult task. Linear scalability happens when
a parallel system does not lose performance while scaling (Gunther, 2006), so that node
addition implies a proportional performance gain in completion time or processing
capacity. We defined the hypotheses H1scale.prop and H0scale.prop to evaluate the
completion time scalability behaviour of MapReduce, testing whether it provides
proportional completion time scalability. In these hypotheses, t represents the completion
time for executing a job j, s represents the cluster size and n represents the evaluated
multiplication factor.


Table 3.4 Hypothesis notation

Hypothesis      Notation                                          Question
H1num.indct     µnum.indct > 0                                    Q1
H0num.indct     µnum.indct ≤ 0                                    Q1
H1scale.prop    µscale.prop: ∀n ∈ N*, sn = s · n ⇒ tn ≠ t/n       Q2
H0scale.prop    µscale.prop: ∀n ∈ N*, sn = s · n ⇒ tn = t/n       Q2

The multiplication factor n is the factor by which the evaluated cluster size is increased.
H0scale.prop states that, for a specific MapReduce job and input data and for every
natural n greater than zero, a new cluster size sn obtained by multiplying the previous
cluster size s by n implies a reduction of the previous job time t by the same factor,
resulting in a time tn equal to t divided by n.

3.3.2 Experiment Setup

To evaluate the MapReduce effectiveness for application traffic analysis and its completion
time scalability, we performed two sets of experiments, grouped by the input size analysed,
with variation in the number of worker nodes.
Network traffic captured from a JXTA-based distributed backup system, which uses the
JXTA Socket communication layer for data transfer between peers, was used as input for
the MapReduce jobs. The network traffic was captured from an environment composed of
six peers, where one server peer receives data from five concurrent client peers, to be
stored and replicated to other peers. During traffic capture, the server peer creates a
JXTA Socket Server to accept JXTA Socket connections and receive data through the
established connections.
For each data backup, one client peer establishes a connection with a server peer and
sends messages with the content to be stored; if the content is bigger than the maximum
JXTA message size, it is transferred through two or more JXTA messages. For our
experiment, we adopted the backup of files with randomly defined content sizes, with
values between 64KB and 256KB.
The captured network traffic was saved into datasets of 16GB and 34GB, split into 35 and
79 files of 64MB, respectively, and stored into HDFS, to be processed as described in
Section 3.2, in order to extract the following indicators from the JXTA Socket
communication layer: round-trip time, number of connection requests per time and
number of messages received by one server peer per time.
For each experiment set, Algorithm 1, implemented by JXTAPerfMapper and
JXTAPerfReducer, was executed, and the completion time and processing capacity for
profiling a JXTA-based distributed application through DPI were measured over different
numbers of worker nodes. Each experiment was executed 30 times to obtain reliable
values (Chen et al., 2011), within a confidence interval of 95% and a maximum error ratio
of 5%. The experiments were performed using virtual machines of Amazon EC2, with
nodes running Linux kernel 3.0.0-16, Hadoop version 0.20.203, a block size of 64MB and
the data replicated 3 times over HDFS. All virtual machines had 2 virtual cores, 2.5 EC2
Compute Units and 1.7GB of RAM.

3.4 Results

From the analysed JXTA traffic, we extracted three indicators: the number of JXTA
connection requests per time, the number of JXTA messages received per time, and the
round-trip time of JXTA messages, which is defined as the time between the arrival of a
content message from a client peer and the JXTA acknowledgement sent back by the
server peer. The extracted indicators are shown in Figure 3.3.

Figure 3.3 JXTA Socket trace analysis

Figure 3.3 shows the extracted indicators, exhibiting the behaviour measured at the JXTA
Socket communication layer during concurrent data transfers, with a server peer receiving
JXTA Socket connection requests and messages from the concurrent client peers of the
distributed backup system.
The three indicators extracted from the network traffic of a JXTA-based distributed
application, using MapReduce to perform the DPI algorithm, are important indicators for
evaluating a JXTA-based application (Halepovic and Deters, 2005). With these indicators
it is possible to evaluate a distributed system and to obtain a better understanding of the
behaviour of a JXTA-based distributed application. Through the extracted information it
is possible to evaluate important metrics, such as the load distribution, the response time
and the negative impact caused by an increasing number of messages received by a peer.
Using MapReduce to perform a DPI algorithm, it was possible to extract the three
application indicators from the network traffic, so we obtained µnum.indct = 3, which
rejects the null hypothesis H0num.indct, stating that µnum.indct ≤ 0, and confirms the
alternative hypothesis H1num.indct, that µnum.indct > 0.
Figures 3.4(a) and 3.4(b) illustrate how the addition of worker nodes to a Hadoop cluster
reduces the mean completion time, and how the completion time scales when profiling 16
GB and 34 GB of network traffic trace.
[Figure 3.4 Completion time scalability of MapReduce for DPI: (a) scalability to process 16 GB; (b) scalability to process 34 GB. Completion time (s) is plotted against the number of nodes.]

In both graphs, the behaviour of the completion time scalability is similar: it does not
follow a linear function, and the scalability gains obtained through node addition are more
significant in smaller clusters and less significant in bigger clusters. This scalability
behaviour highlights the importance of evaluating the relation between the costs and
benefits of node additions in a MapReduce cluster, due to the non-proportional gain
obtained with node addition.
Tables 3.5 and 3.6 present the results of the experiments to deeply inspect 16 GB and 34
GB of network traffic trace, respectively, showing the number of Hadoop nodes used for
each experiment, the mean completion time in seconds, its margin of error, the processing
capacity achieved and the relative processing capacity per node in the cluster.
Table 3.5 Completion time to process 16 GB split into 35 files

Nodes   Time (s)   Margin of Error   MB/s     (MB/s)/node
3       322.53     0.54              50.80    16.93
4       246.03     0.67              66.59    16.65
6       173.17     0.56              94.61    15.77
8       151.73     1.55              107.98   13.50
10      127.17     1.11              128.84   12.88

Table 3.6 Completion time to process 34 GB split into 79 files

Nodes   Time (s)   Margin of Error   MB/s     (MB/s)/node
4       464.33     0.32              74.98    18.75
8       260.60     0.76              133.60   16.70
12      189.07     1.18              184.14   15.35
16      167.13     0.81              208.32   13.02
19      134.47     1.53              258.91   13.63

In our experiments, we achieved a maximum mean processing capacity of 258.91 MB per
second, in a cluster with 19 worker nodes processing 34 GB. For a cluster with 4 nodes
we achieved mean processing capacities of 66.59 MB/s and 74.98 MB/s to process 16 GB
and 34 GB of network traffic trace, respectively, which indicates that the processing
capacity may vary as a function of the amount of data processed and the number of files
used as input, and that the input size is an important factor to be analysed in MapReduce
performance evaluations.
The results show that, for the evaluated scenario and application, the completion time
decreases as the number of nodes in the cluster increases, but not proportionally to node
addition and not as a linear function, as can be observed in Figures 3.4(a) and 3.4(b).
Tables 3.5 and 3.6 also show values that confirm the non-proportional completion time
scalability. For example, Table 3.5 shows that when a cluster with 4 nodes processing
16 GB was scaled out to 8 nodes, the number of nodes was multiplied by 2, but we
achieved a gain of only 1.62 times in completion time.
To evaluate our stated hypotheses H1scale.prop and H0scale.prop based on this example,
we have the measured s2 = 8 and the calculated s · n = sn given by 4 · 2 = 8 = s2, which
confirms sn = s · n. We also have the measured t2 = 151.73 and the calculated t/n given by
246.03/2 = 123.01 ≠ t2, which rejects tn = t/n and confirms tn ≠ t/n. Therefore, with the
measured results the null hypothesis H0scale.prop was rejected and the alternative
hypothesis H1scale.prop was confirmed, which states that the completion time of
MapReduce for DPI does not scale proportionally to node addition.

3.5 Discussion

In this section, we discuss the measured results and evaluate their meaning, restrictions
and opportunities. We also discuss possible threats to the validity of our experimental
results.

3.5.1 Results Discussion

Distributed systems analysis, root cause detection and error reproduction are challenges
that motivate efforts to develop less intrusive mechanisms for profiling and monitoring
distributed applications at runtime. Network traffic analysis is one option for evaluating
distributed systems, although there are limitations on the capacity to process a large
amount of network traffic in a short time, and on the completion time scalability needed
to process network traffic when resource demands vary.
According to the evaluated results of using MapReduce to profile, through DPI, the
network traffic of a JXTA-based distributed backup system, it is important to analyse the
possible gains from node addition to a MapReduce cluster, because node addition
provides different gains depending on the cluster size and the input size. For example,
Table 3.6 shows that the addition of 4 nodes to a cluster with 12 nodes produces a
reduction of 11% in completion time and an improvement of 13% in processing capacity,
while the addition of the same number of nodes (4) to a cluster with 4 nodes produces a
reduction of 43% in completion time and an improvement of 78% in processing capacity.
The scalability behaviour of MapReduce for DPI highlights the importance of evaluating
the relation between the costs and benefits of node additions to a MapReduce cluster,
because the gains obtained with node addition are related to the current and future cluster
size and to the input size to be processed.
Growing the number of nodes in the cluster increases costs due to greater cluster
management effort, data replication, allocation of tasks to the available nodes and failure
handling. Also, as the cluster grows, the cost of merging and sorting the data processed by
the Map tasks increases (Jiang et al., 2010), since this data can be spread across a larger
number of nodes.


In smaller clusters, the probability that a node holds a replica of the input data is greater
than in bigger clusters with the same replication factor (Zaharia et al., 2010). In bigger
clusters there are more candidate nodes to which a task can be delegated, but the
replication factor limits the benefits of data locality to the number of nodes that store a
replica of the data. This increases the cost of scheduling and distributing tasks in the
cluster, and also increases the costs of data transfer over the network.
The kind of workload submitted for processing by MapReduce impacts the behaviour and
performance of MapReduce (Tan et al., 2012; Groot, 2012), requiring specific
configuration to obtain optimal performance. Although studies have been done to
understand, analyse and improve workload management decisions in MapReduce (Lu
et al., 2012; Groot, 2012), there is no evaluation that characterizes the MapReduce
behaviour or identifies its optimal configuration for achieving the best performance for
packet-level analysis and DPI. Thus, it is necessary to deeply understand the behaviour of
MapReduce when processing network traces and what optimizations can be done to better
explore the potential provided by MapReduce for packet-level analysis and DPI.

3.5.2 Possible Threats to Validity

Due to budget and time restrictions, our experiments were performed with small cluster
sizes and small input sizes compared with benchmarks that evaluate MapReduce
performance and scalability (Dean and Ghemawat, 2008). However, relevant performance
evaluations and reports of real MapReduce production traces show that the majority of
MapReduce jobs are small and executed on a small number of nodes (Zaharia et al.,
2008; Wang et al., 2009; Lin et al., 2010; Zaharia et al., 2010; Kavulya et al., 2010; Chen
et al., 2011; Guo et al., 2012).
Although MapReduce has been designed to handle big data, the use of input data on the
order of gigabytes has been reported in realistic production traces (Chen et al., 2011),
and this input size has been used in relevant MapReduce performance analyses (Zaharia
et al., 2008; Wang et al., 2009; Lin et al., 2010).
Improvements in MapReduce performance and proposed schedulers have focused on
problems related to small jobs; for example, Facebook's fairness scheduler aims to
provide fast response times for small jobs (Zaharia et al., 2010; Guo et al., 2012). The fair
scheduler attempts to guarantee service levels for production jobs by maintaining job
pools composed of a smaller number of nodes than the total number of nodes of the data
center, in order to maintain a minimum share, dividing the excess capacity among all jobs
or pools (Zaharia et al., 2010).
According to Zaharia et al. (2010), 78% of Facebook's MapReduce jobs have up to 60
Map tasks. Our evaluated datasets were composed of 35 and 79 files, which implies the
same respective numbers of Map tasks, since our approach evaluates an entire block per
Map task.

3.6 Chapter Summary

In this chapter, we presented an approach for profiling application traffic using
MapReduce, and evaluated its effectiveness for profiling applications through DPI and its
completion time scalability in a cloud computing environment.
We proposed a MapReduce-based solution for the deep inspection of distributed
application traffic, in order to evaluate the behaviour of distributed systems at runtime,
using commodity hardware, in a minimally intrusive way, through a scalable and
fault-tolerant approach based on Hadoop, an open source implementation of MapReduce.
MapReduce was used to implement a DPI algorithm that extracts application indicators
from the traffic of a JXTA-based distributed backup system. We adopted a splitting
approach without dividing blocks into records: the network trace was split into files no
larger than the HDFS block size, to avoid the cost and complexity of providing HDFS
with an algorithm for splitting the network trace into blocks, and to use a whole block as
input for each Map function, making it possible to reassemble two or more packets and
rebuild JXTA messages from the packets of the network traces within each Map function.
We evaluated the effectiveness of MapReduce for a DPI algorithm and its completion
time scalability over different sizes of network traffic used as input and different cluster
sizes. We showed that the MapReduce programming model can express DPI algorithms
and extract application indicators from application network traffic, using virtual machines
of a cloud computing provider, for DPI of large amounts of network traffic. We also
evaluated its completion time scalability, showing the scalability behaviour, the
processing capacity achieved, and the influence of the number of nodes and the input data
size on the processing capacity for DPI.
It was shown that the completion time scalability of MapReduce for DPI does not follow
a linear function, with more significant scalability gains from node addition in small
clusters, and less significant gains in bigger clusters.
According to the results, the input size and cluster size have a significant impact on the
processing capacity and completion time of MapReduce jobs for DPI. This highlights the
importance of evaluating the best input size and cluster size to obtain optimal performance
in MapReduce jobs, but it also indicates the need for further evaluation of the influence of
other important factors on MapReduce performance, in order to provide better
configuration, input size selection and machine allocation in a cluster, and to provide
valuable information for performance tuning and prediction.


4 Evaluating MapReduce for Network Traffic Analysis

All difficult things have their origin in that which is easy, and great
things in that which is small.
—LAO TZU

The use of MapReduce for distributed data processing has been growing and achieving
benefits for different workloads. MapReduce can be used for distributed traffic analysis,
although network traffic traces have characteristics that differ from the data commonly
processed through MapReduce, which in general is divisible, text-like data, while network
traces are binary and may impose restrictions on splitting when processed through
distributed approaches.
Due to the lack of evaluation of MapReduce for traffic analysis and the peculiarity of this
kind of data, this chapter deeply evaluates the performance of MapReduce for
packet-level analysis and DPI of distributed application traffic, evaluating its scalability,
speed-up and the behaviour of the MapReduce phases. The experiments provide evidence
about the predominant phases in this kind of MapReduce job, and show the impact of
input size, block size and number of nodes on completion time and scalability.
This chapter is organized as follows. We first describe the motivation for a MapReduce
performance evaluation for network traffic analysis in Section 4.1. We then present the
evaluation plan and methodology adopted in Section 4.2, and the results are presented in
Section 4.3. Section 4.4 discusses the results and Section 4.5 summarizes the chapter.

4.1 Motivation

It is possible to measure, evaluate and diagnose distributed applications through the
evaluation of information from communication protocols, flows, throughput and load
distribution (Mi et al., 2012; Nagaraj et al., 2012; Sambasivan et al., 2011; Aguilera
et al., 2003; Yu et al., 2011). This information can be collected through network traffic
analysis, but to retrieve application information from network traces it is necessary to
recognize the application protocol and deeply inspect the traffic to retrieve details about
its behaviour, sessions and states.
MapReduce can be used for the offline evaluation of distributed applications, analysing
application traffic inside a data center, through packet-level analysis (Lee et al., 2011),
which evaluates each packet individually, and through DPI (Vieira et al., 2012b,a), which
adopts a different approach to data splitting, in which a whole block is processed without
division into individual packets, due to the need to reassemble two or more packets to
retrieve information from the application layer and evaluate application messages and
protocols.
The kind of workload submitted for processing by MapReduce impacts the behaviour and
performance of MapReduce (Tan et al., 2012; Groot, 2012), requiring specific
configuration to obtain optimal performance. Information about the occupation of the
MapReduce phases, the processing characteristics (whether the job is I/O- or CPU-bound)
and the mean duration of Map and Reduce tasks can be used to optimize parameter
configurations and to improve resource allocation and task scheduling.
The main evaluations of MapReduce concern text processing (Zaharia et al., 2008; Chen
et al., 2011; Jiang et al., 2010; Wang et al., 2009), where the input data is split into blocks
and records to be processed by parallel and independent Map functions. For the
distributed processing of network traffic traces, which are usually binary, splitting the data
into packets is a concern and, in some cases, the data may need to be processed without
splitting, especially when packet reassembly is required to extract application information
from the application layer.
Although work has been done to understand, analyse and improve workload management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation to
characterize the MapReduce behaviour or to identify its optimal configuration to achieve
the best performance for packet level analysis and DPI.
Due to the lack of evaluation of MapReduce for traffic analysis and the peculiarity of this
kind of data, it is necessary to understand the behaviour of MapReduce when processing
network traces and what optimizations can be done to better explore the potential
provided by MapReduce for packet-level analysis and DPI.
This chapter evaluates MapReduce performance for network packet-level analysis and
DPI using Hadoop, characterizing the behaviour of the MapReduce phases, the scalability
and the speed-up over variations of input, block and cluster sizes. The main contributions
of this chapter are:
1. Characterization of the behaviour of the MapReduce phases for packet-level analysis
and DPI;
2. Description of the scalability behaviour and its relation to important MapReduce
factors;
3. Identification of the performance provided by the block sizes adopted for different
cluster sizes;
4. Description of the speed-up obtained for DPI.

4.2 Evaluation

The goal of this evaluation is to characterize the behaviour of the MapReduce phases, its
scalability characteristics under node addition and the speed-up achieved with MapReduce
for packet-level analysis and DPI. Thus, we performed a performance measurement and
evaluation of MapReduce jobs that execute packet-level analysis and DPI algorithms.
To evaluate MapReduce for DPI, Algorithm 1, implemented by JXTAPerfMapper and
JXTAPerfReducer, was used and applied to new factors and levels. To evaluate
MapReduce for packet-level analysis, a port counter algorithm developed by Lee et al.
(2011) was used, which divides a block into packets and processes each packet
individually to count the number of occurrences of TCP and UDP port numbers. The
same algorithm was also evaluated using the splitting approach that processes a whole
block per Map function, without dividing a block into records or packets, and the two
approaches for packet-level analysis were compared.
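For reference, a port counter expressed as a Map function could look like the sketch below, which assumes each map() call receives the raw bytes of one captured Ethernet/IPv4 frame and that the transport header starts at a fixed offset (no VLAN tags or IP options). It is an illustrative sketch with a hypothetical class name, not the P3 or CountUpDriver source; the matching reducer would simply sum the counts per port, as in a word count.

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Counts TCP/UDP port occurrences, one captured packet per map() call. */
    public class PortCountMapper extends Mapper<LongWritable, BytesWritable, IntWritable, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final IntWritable port = new IntWritable();

        @Override
        protected void map(LongWritable offset, BytesWritable packet, Context context)
                throws IOException, InterruptedException {
            byte[] bytes = packet.getBytes();
            // Assumes Ethernet (14 bytes) + IPv4 without options (20 bytes): the source and
            // destination ports occupy bytes 34-35 and 36-37 of the frame for TCP and UDP.
            if (bytes.length < 38) {
                return;                                       // too short to carry transport ports
            }
            int srcPort = ((bytes[34] & 0xFF) << 8) | (bytes[35] & 0xFF);
            int dstPort = ((bytes[36] & 0xFF) << 8) | (bytes[37] & 0xFF);
            port.set(srcPort);
            context.write(port, ONE);
            port.set(dstPort);
            context.write(port, ONE);
        }
    }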

4.2.1 Evaluation Methodology

For this evaluation, we adopted a methodology based on the systematic approach to
performance evaluation defined by Jain (1991), which consists of the definition of the
goal, metrics, factors and levels for a performance study.
The goal is to characterize the behaviour of the MapReduce phases, the scalability through
node addition and the speed-up achieved with MapReduce for packet-level analysis and
DPI, in order to understand the impact of each factor on MapReduce performance for this
kind of input data and to be able to configure MapReduce for optimal performance over
the evaluated factors.
Table 4.1 Metrics for evaluating MapReduce for DPI and packet level analysis

Metric              Description
Completion Time     Completion time of MapReduce jobs
Phases Time         Time consumed by each MapReduce phase in the total completion time of MapReduce jobs
Phases Occupation   Relative time consumed by each MapReduce phase in the total completion time of MapReduce jobs
Scalability         Processing capacity increase obtained with node addition in a MapReduce cluster
Speed-up            Improvement in completion time against the same algorithm implemented without distributed processing

Table 4.1 describes the evaluated metrics: the completion time of MapReduce jobs, the
relative and absolute time of each MapReduce phase within the total job time, the
processing capacity scalability, and the speed-up against non-distributed processing.
The experiments adopt the factors and levels described in Table 4.2. The selected factors
were chosen due to their importance for MapReduce performance evaluations and their
adoption in relevant previous research (Jiang et al., 2010; Chen et al., 2011; Shafer et al.,
2010; Wang et al., 2009).
Table 4.2 Factors and Levels

Factor                    Levels
Number of worker nodes    2 up to 29
Block size                32MB, 64MB and 128MB
Input size                90Gb and 30Gb

Hadoop logs are a valuable source of information about the Hadoop environment and its
job executions; important MapReduce indicators and information about jobs, tasks,
attempts, failures and topology are logged by Hadoop during its execution. The data used
in this performance evaluation was extracted from Hadoop logs.
To extract information from Hadoop logs and evaluate the selected metrics, we developed
Hadoop-Analyzer (Vieira, 2013), an open source and publicly available tool that extracts
and evaluates MapReduce indicators, such as job completion time and the distribution of
the MapReduce phases, from the logs generated by Hadoop during its job executions.
With Hadoop-Analyzer it is possible to generate graphs of the extracted indicators and
thereby evaluate the desired metrics.
Hadoop-Analyzer relies on Rumen (2012) to extract raw data from Hadoop logs and
generate structured information, which is processed and shown in graphs generated
through R (Eddelbuettel, 2012) and Gnuplot (Janert, 2010), such as the results presented
in Section 4.3.

4.2.2 Experiment Setup

Network traffic traces of distributed applications were captured to be used as input for the
MapReduce jobs of our experiments; these traces were divided into files whose size was
defined by the block size adopted in each experiment, and the files were then stored into
HDFS, following the process described in the previous chapter. The packets were
captured using Tcpdump and split into files of 32MB, 64MB and 128MB.
For the packet-level analysis and DPI evaluation, two dataset sizes were captured from
network traffic transferred between nodes of distributed systems. One dataset was 30Gb
of network traffic, divided into 30 files of 128MB, 60 files of 64MB and 120 files of
32MB. The other dataset was 90Gb of network traffic, split into 90 files of 128MB, 180
files of 64MB and 360 files of 32MB.
For the DPI experiments through MapReduce, we used network traffic captured from the
same JXTA-based application described in Section 3.3.2, but with different trace and file
sizes. To evaluate MapReduce for packet-level analysis, we processed network traffic
captured from data transferred between 5 clients and one server of a data storage service
provided through the Internet, known as Dropbox¹.
To evaluate MapReduce for packet level analysis and DPI, one driver was developed
for each case of network traffic analysis, with one version using MapReduce and another
without it.
CountUpDriver implements packet-level analysis for a port counter of network traces,
which records how many times a port appears in TCP or UDP packets; its implementation
is based on processing a whole block as input for each Map function, without splitting,
with the block size defined by the HDFS block size. Furthermore, a port counter
implemented with P3 was evaluated; this implementation is a version of the tool presented
by Lee et al.
1 http://www.dropbox.com/


(2011), which adopts an approach that divides a block into packets and processes each
packet individually, without dependent information between packets.
JxtaSocketPerfDriver implements DPI to extract, from JXTA (Duigou, 2003) network
traffic, the round-trip time of JXTA messages, the number of connection requests per time
and the number of JXTA Socket messages exchanged between JXTA clients and a JXTA
Socket server. JxtaSocketPerfDriver uses whole files as input for each Map function, with
size defined by the HDFS block size, in order to reassemble JXTA messages whose
content was divided into many TCP packets.
One TCP packet can transport one or more JXTA messages at a time, which makes it
necessary to evaluate the full content of the TCP segments to recognize all possible JXTA
messages, instead of evaluating only a message header or signature. The round-trip time
of JXTA messages is calculated as the time between a client peer sending a JXTA
message and receiving the confirmation of the message's arrival. To evaluate the
round-trip time it is necessary to keep track of the requests and of which responses
correspond to each request, and thus it is necessary to analyse several packets to retrieve
and evaluate information about the application's behaviour and states.
To analyse the speed-up provided by MapReduce against a single machine execution,
two drivers were developed that use the same dataset and implement the same algorithms
implemented by CountUpDriver and JxtaSocketPerfDriver, but without distributed processing. These drivers are respectively CountUpMono and JxtaSocketPerfMono.
The source code of all implemented drivers, and of the other implementations that support the use of MapReduce for network traffic analysis, is open source and publicly available at Vieira (2012a).
The experiments were performed on a 30-node Hadoop-1.0.3 cluster composed of nodes with four 3.2GHz cores, 8GB of RAM and 260GB of available hard disk space, running Linux kernel 3.2.0-29. Hadoop was used as our MapReduce implementation, configured to permit a maximum of 4 Map and 1 Reduce task per node; we also set the JVM child task option to -Xmx1500m and io.sort.mb to 400.
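A sketch of this configuration in terms of the standard Hadoop 1.x properties is shown below. The property names (mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, mapred.child.java.opts and io.sort.mb) are the usual Hadoop 1.x ones and the values are those stated above, but the snippet is illustrative rather than a copy of the experiment's actual configuration files, where these entries would normally live in mapred-site.xml.

import org.apache.hadoop.conf.Configuration;

// Illustrative Hadoop 1.x settings matching the values reported in the text.
public class ExperimentConfiguration {
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // 4 Map slots per node
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1); // 1 Reduce slot per node
        conf.set("mapred.child.java.opts", "-Xmx1500m");           // heap for child task JVMs
        conf.setInt("io.sort.mb", 400);                            // map-side sort buffer, in MB
        return conf;
    }
}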
For the CountUpDriver and JxtaSocketPerfDriver drivers, the number of Reducers was defined as a function of the number of Reduce slots per node, given by numReducers = 0.95 × numNodes × maxReducersPerNode (Kavulya et al., 2010). The driver implemented with P3 (Lee et al., 2011) adopts a fixed number of Reducers, defined as 10 by the available version of P3. Each experiment was executed 20 times to obtain reliable values (Chen et al., 2011), within a confidence interval of 95% and a maximum error ratio of 5%.
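As a worked instance of this rule, assuming one Reduce slot per node as configured above and truncation of the fractional result: for a 6-node cluster, numReducers = 0.95 × 6 × 1 = 5.7, i.e., 5 Reducers, which matches the 5 Reducers reported for the 6-node case in Section 4.3; for the full 29-node cluster, numReducers = 0.95 × 29 × 1 = 27.55, i.e., 27 Reducers.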


4.3 Results

Two dataset sizes of network traffic were used during the experiments: 30Gb and 90Gb. Each dataset was processed by MapReduce jobs that implement packet level analysis and DPI, in Hadoop clusters with the number of worker nodes varying between 2 and 29 and block sizes of 32MB, 64MB and 128MB.
Each dataset was also processed by the same algorithms implemented without distributed processing, to evaluate the speed-up achieved. Table 4.3 shows the execution times obtained by the non-distributed processing, implemented and executed through JxtaSocketPerfMono and CountUpMono on a single machine, with the resource configuration described in Subsection 4.2.2.
Table 4.3 Non-Distributed Execution Time in seconds

Block    JxtaSocketPerfMono        CountUpMono
         90Gb        30Gb          90Gb      30Gb
32MB     1,745.35    584.92        872.40    86.71
64MB     1,755.40    587.02        571.33    91.76
128MB    1,765.50    606.50        745.25    94.82

Figure 4.1 shows the completion time and speed-up of the DPI Algorithm 1 used to extract indicators from a JXTA-based distributed application. The completion time is the job time of JxtaSocketPerfDriver, and the speed-up represents the gain in execution time of JxtaSocketPerfDriver over JxtaSocketPerfMono when processing 90Gb of network traffic.
[Plot: completion time in seconds (left axis) and speed-up (right axis) versus number of worker nodes (2 to 29), with curves for block sizes of 32MB, 64MB and 128MB.]

Figure 4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application
network traffic


According to Figure 4.1, JxtaSocketPerfDriver performs better than JxtaSocketPerfMono across all factor variations. With 2 nodes, the best speed-up was 3.70 times, achieved with a 128MB block; the maximum speed-up overall was 16.19 times, with 29 nodes and a 64MB block. The speed-up achieved with a block size of 32MB was initially the worst case, but it increased with node additions and, for the 29-node cluster, became better than that of 128MB blocks and close to the speed-up achieved with 64MB blocks.
The completion time with a 32MB block size decreased with every node addition, whereas the cases with block sizes of 64MB and 128MB present no significant reduction in completion time in clusters with more than 25 nodes. According to Figure 4.1, the completion time does not decrease linearly with node addition, and the improvement in completion time was less significant when the dataset was processed by more than 14 nodes, especially for the cases that adopted blocks of 64MB and 128MB.
Figure 4.2 shows the processing capacity of MapReduce applied to DPI of 90Gb of JXTA-based application traffic, over variations of cluster size and block size. The processing capacity was evaluated by the throughput of network traffic processed and by the relative throughput, defined as the processing capacity achieved per allocated node.
[Plot: throughput in Mbps (left axis) and throughput per node in Mbps (right axis) versus number of worker nodes, with curves for block sizes of 32MB, 64MB and 128MB.]

Figure 4.2 DPI Processing Capacity for 90Gb

The processing capacity achieved for DPI of 90Gb using a block size of 64MB was 159.89 Mbps with 2 worker nodes, increasing up to 869.43 Mbps with 29 worker nodes. For the same case, the relative processing capacity achieved was 79.94 Mbps/node with 2 nodes and 29.98 Mbps/node with 29 nodes, showing a decrease of relative processing capacity with the growth of the MapReduce cluster size.
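These relative values follow directly from the definition of relative throughput as the throughput divided by the number of allocated nodes: 159.89 Mbps / 2 nodes ≈ 79.94 Mbps per node, and 869.43 Mbps / 29 nodes ≈ 29.98 Mbps per node.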
Although the processing capacity increased, the relative processing capacity, defined as the processing capacity per allocated node, decreased with every node addition. This behaviour indicates that MapReduce presents a reduction of efficiency as the cluster size grows (Gunther, 2006), which highlights the importance of evaluating the cost of node allocation against its benefits for completion time and processing capacity.
Figures 4.1 and 4.2 also show the difference in performance achieved with different block sizes, and its relation to the cluster size. Blocks of 128MB achieved a higher throughput in clusters of up to 14 nodes, while blocks of 64MB performed better in clusters bigger than 14 worker nodes.
Figures 4.3(a) and 4.3(b) show the behaviour of MapReduce phases to DPI of 90Gb.
[Plot: (a) cumulative time in seconds and (b) percentage of completion time spent in the MapReduce phases (Map, Map and Shuffle, Shuffle, Sort, Reduce, Setup, Cleanup, Others), per cluster size (2 to 29 nodes) and block size (32MB, 64MB, 128MB). (a): Phases Time for DPI. (b): Phases Distribution for DPI.]

Figure 4.3 MapReduce Phases Behaviour for DPI of 90Gb


MapReduce execution can be divided into Map, Shuffle, Sort and Reduce phases, although Shuffle tasks can start before the conclusion of all Map tasks, so Map and Shuffle tasks can overlap. Under the Hadoop default configuration, the overlap between Map and Shuffle tasks starts after 5% of the Map tasks are concluded; from that point, Shuffle tasks run until the Map phase ends.
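In Hadoop 1.x this threshold is controlled by the mapred.reduce.slowstart.completed.maps property, whose default value is 0.05; the snippet below is only an illustration of how the overlap point could be moved, and was not part of the experiment setup.

import org.apache.hadoop.conf.Configuration;

// Illustrative: start Reduce (shuffle) tasks only after 50% of the Map tasks
// have finished, instead of the Hadoop 1.x default of 5%.
public class SlowstartExample {
    public static Configuration delayedShuffle() {
        Configuration conf = new Configuration();
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.50f);
        return conf;
    }
}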
In Figures 4.3(a) and 4.3(b) we show this overlap between Map and Shuffle tasks as a specific phase, represented as the "Map and Shuffle" phase. The time consumed by Setup and Cleanup tasks was also considered, for a better visualization of how the execution time of Hadoop jobs is divided.
Figure 4.3(a) shows the cumulative time of each MapReduce phase within the total job time. For DPI, the Map time, which comprises the Map and the "Map and Shuffle" phases, consumes the major part of the job execution time and is the phase that varies most with the number of nodes, but no significant time reduction is achieved with more than 21 nodes for block sizes of 64MB or 128MB.
The Shuffle time, corresponding to the Shuffle work performed after all Map tasks are completed, presented low variation with node addition. The Sort and Reduce phases required relatively low execution times and do not appear in some bars of the graph. Setup and Cleanup tasks consumed an almost constant time, independently of cluster size or block size.
Figure 4.3(b) shows the percentage of each MapReduce phase within the total job completion time. We also considered an additional phase, called Others, which represents the time consumed by cluster management tasks, such as scheduling and task assignment. The behaviour of the phase occupation is similar across all block sizes evaluated, except that the Map time does not decrease with node addition in clusters using a block size of 128MB with more than 21 nodes.
With the variation of cluster size, a relative reduction in Map time was observed along with a relative increase in the time of the Shuffle, Setup and Cleanup phases. As observed in Figure 4.3(a), Setup and Cleanup tasks consume an almost constant absolute time, independently of cluster size and block size; as nodes are added and the total completion time decreases, the time consumed by Setup and Cleanup tasks therefore becomes a larger share of the total execution time.
According to Figures 4.3(a) and 4.3(b), the Map phase is predominant in MapReduce jobs for DPI, and the reduction of the total job completion time with node addition is related to the decrease of the Map phase time. Thus, improvements in Map phase execution for DPI workloads can produce the most significant gains in reducing the total job completion time for DPI.
Figure 4.4 shows the comparison between the completion times of CountUpDriver and P3 for packet level analysis of 90Gb of network traffic, over variations of cluster size and block size.
[Plot: completion time in seconds versus number of worker nodes (2 to 28) for CountUpDriver and P3, with block sizes of 32MB, 64MB and 128MB.]

Figure 4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the
approach with and without splitting into packets

P3 achieves a better completion time than CountUpDriver across all factors, showing that a divisible-file approach performs better for packet level analysis, and that block size is a significant factor for both approaches, given the significant impact on completion time caused by the adoption of blocks of different sizes.
With the variation in the number of nodes, a block size of 128MB achieved better completion times up to 10 nodes, but no further improvement in completion time was obtained with node addition in clusters with more than 10 nodes. Blocks of 32MB and 64MB only present a significant completion time difference in clusters of up to 14 nodes; in clusters bigger than 14 nodes, a similar completion time was achieved for both block sizes, still better than the completion time achieved with blocks of 128MB.
Figures 4.5(a) and 4.5(b) show, respectively, the completion time and speed-up of P3 and CountUpDriver against CountUpMono for packet level analysis, with variation in the number of nodes and block size. For both cases, the use of a block size of 128MB provides the best completion time in smaller clusters, up to 10 nodes, but a worse completion time in clusters with more than 21 nodes. For both evaluations, the speed-up adopting 128MB blocks scales up to 10 nodes, but in bigger clusters no speed-up gain was achieved with node addition.
[Plot: completion time in seconds (left axis) and speed-up (right axis) versus number of worker nodes (2 to 28), with curves for block sizes of 32MB, 64MB and 128MB. (a): P3 evaluation. (b): CountUpDriver evaluation.]

Figure 4.5 CountUp completion time and speed-up of 90Gb

Using 32MB blocks, the completion time improved with every node addition, which yields a speed-up improvement for every cluster size, although this block size did not provide a better completion time than the other block sizes in any case.
The adoption of 32MB blocks provided a better speed-up than the other block sizes in clusters with more than 14 nodes, because the time consumed by CountUpMono to process 90Gb divided into 32MB files was larger than the time consumed with the other block sizes, as shown in Table 4.3.


Figures 4.6(a) and 4.6(b) show the processing capacity of P3 and CountUpDriver to
perform packet level analysis of 90Gb of network traffic, over variation of cluster size
and block size.
[Plot: throughput in Mbps (left axis) and throughput per node in Mbps (right axis) versus number of worker nodes, with curves for block sizes of 32MB, 64MB and 128MB. (a): P3 processing capacity. (b): CountUpDriver processing capacity.]

Figure 4.6 CountUp processing capacity for 90Gb

Using a block size of 64MB, P3 achieved a throughput of 413.16 Mbps with 2 nodes and a maximum of 1606.13 Mbps with 28 nodes, while its relative throughput for the same configurations was 206.58 Mbps/node and 55.38 Mbps/node. The processing capacity for packet level analysis, evaluated for P3 and CountUpDriver, follows the same behaviour shown in Figure 4.2. Additionally, it is possible to observe a convergent decrease of relative processing capacity for all block sizes evaluated, starting at a cluster size of 14 nodes, where the relative throughput achieved by all block sizes is quite similar.
Figure 4.6(b) shows an increase in relative processing capacity with the addition of 2 nodes to a cluster with 4 nodes. For packet level analysis of 90Gb, MapReduce achieved its best per-node processing efficiency with 6 nodes, which provides 24 Mappers and 5 Reducers per Map and Reduce wave. With the adopted variation in the number of Reducers according to cluster size, 5 Reducers achieved better processing efficiency and a significant reduction in Reduce time, as shown in Figure 4.7(b).
Figures 4.7(a) and 4.7(b) show the cumulative time per phase during a job execution.
[Plot: cumulative time in seconds spent in the Map, Map and Shuffle, Shuffle, Sort, Reduce, Setup and Cleanup phases, per cluster size (2 to 28 nodes) and block size (32MB, 64MB, 128MB). (a): MapReduce Phases Times of P3. (b): MapReduce Phases Times for CountUpDriver.]

Figure 4.7 MapReduce Phases time of CountUp for 90Gb

The behaviour of the MapReduce phases for packet level analysis is similar to the behaviour observed for DPI: Map time is predominant, Map and Shuffle time stops decreasing with node addition once the cluster grows beyond a certain size, and the Sort and Reduce phases consume little execution time. The exception is that the Shuffle phase consumes more time in packet level analysis jobs than in DPI, especially in smaller clusters.


For packet level analysis, the amount of intermediate data generated by the Map functions is bigger than the amount generated when MapReduce is used for DPI: packet level analysis generates an intermediate record for each packet evaluated, whereas for DPI it is necessary to evaluate more than one packet to generate an intermediate record. The Shuffle phase is responsible for sorting and transferring the Map outputs to the Reducers as inputs; thus, the amount of intermediate data generated by the Map tasks and the network transfer cost impact the Shuffle phase time.
Figures 4.8(a) and 4.8(b) show the percentage of each phase on job completion time
of P3 and CountUpDriver, respectively.
[Plot: percentage of completion time spent in the Map, Map and Shuffle, Shuffle, Sort, Reduce, Setup, Cleanup and Others phases, per cluster size (2 to 28 nodes) and block size (32MB, 64MB, 128MB). (a): Phases Distribution for P3. (b): Phases Distribution for CountUpDriver.]

Figure 4.8 MapReduce Phases Distribution for CountUp of 90Gb

As with the behaviour observed in Figure 4.3(b), the Map and Shuffle phases consume more relative time than all other phases, across all factors. For packet level analysis, however, the Map phase occupation decreases significantly with node addition only when block sizes are 32MB or 64MB, following the completion time behaviour observed in Figures 4.5(a) and 4.5(b).
The same experiments were conducted for the dataset of 30Gb of network traffic, and the results of the MapReduce phases evaluation presented behaviour quite similar to the results already presented for the 90Gb experiments, for DPI and packet level analysis.
Relevant differences were identified in the speed-up, completion time and scalability evaluation, as shown by Figures 4.9(a) and 4.9(b), which exhibit the completion time and processing capacity scalability of MapReduce for DPI of 30Gb of network traffic, with variations in cluster size and block size.
[Plot: (a) completion time in seconds and speed-up versus number of worker nodes (2 to 29); (b) throughput in Mbps and throughput per node versus number of worker nodes; curves for block sizes of 32MB, 64MB and 128MB. (a): DPI Completion Time and Speed-up of MapReduce for 30Gb of a JXTA-application network traffic. (b): DPI Processing Capacity of 30Gb.]

Figure 4.9 DPI Completion Time and Processing Capacity for 30Gb


The completion time of DPI of 30Gb scales significantly up to 10 nodes; beyond that, the experiment presents no further gains with node addition when using a block size of 128MB, and presents a slight increase of completion time in the cases using blocks of 32MB and 64MB. This behaviour is the same observed for the job completion time for 90Gb, shown in Figure 4.1, but with significant scaling only up to 10 nodes for the dataset of 30Gb, while scaling up to 25 nodes was achieved for 90Gb.
Figure 4.9(a) shows that a completion time of 199.12 seconds was obtained with 2 nodes using a 128MB block, decreasing to 87.33 seconds with 10 nodes and the same block size, while 474.44 and 147.12 seconds, respectively, were obtained by the same configurations for DPI of 90Gb, as shown in Figure 4.1.
Although the 90Gb case (Figure 4.1) processed a dataset 3 times bigger than the 30Gb case (Figure 4.9(a)), the completion time achieved in all 90Gb cases was smaller than 3 times the completion time for 30Gb. For the cases with 2 and 10 nodes using a 128MB block, processing the 90Gb dataset consumed respectively only 2.38 and 1.68 times more time than the 30Gb dataset.
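These ratios follow directly from the completion times above: 474.44 / 199.12 ≈ 2.38 and 147.12 / 87.33 ≈ 1.68, both well below the 3 times growth of the input size.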
Figure 4.9(b) shows the processing capacity for DPI of 30Gb. The maximum speed-up achieved for DPI of 30Gb was 7.90 times, using 32MB blocks and 29 worker nodes, while the maximum speed-up achieved for DPI of 90Gb was 16.19 times, with 29 nodes.
From these results, it is possible to conclude that MapReduce is more efficient for bigger datasets, and that in some cases it can be more efficient to accumulate input data and process a larger amount at once. Therefore it is important to analyse the dataset size to be processed and to quantify the ideal number of allocated nodes for each job, in order to avoid wasting resources.

4.4 Discussion

In this section, we discuss the measured results and evaluate their meaning, restrictions
and opportunities. We also discuss possible threats to the validity of our experiment.

4.4.1 Results Discussion

According to the processing capacity presented in our experimental results for the evaluation of MapReduce for packet level analysis and DPI, the adoption of MapReduce provided high processing capacity and a speed-up in completion time when compared with a solution without distributed processing, making it possible to evaluate a large amount of network traffic and to extract information about the distributed applications of an evaluated data center.
The block size adopted and the number of nodes allocated to data processing are important factors for obtaining an efficient job completion time and processing capacity scalability. Some benchmarks show that MapReduce performance can be improved by an optimal block size choice (Jiang et al., 2010), with better performance reported for bigger block sizes. We evaluated the impact of the block size for packet level analysis and DPI workloads; blocks of 128MB provided a better completion time for smaller clusters, but blocks of 64MB performed better in bigger clusters. Thus, to obtain an optimal completion time when adopting bigger block sizes, it is also necessary to evaluate the node allocation for the MapReduce job, because variations in block size and cluster size can significantly impact the completion time.
The different processing capacities achieved for the 30Gb and 90Gb datasets highlight the efficiency of MapReduce for dealing with bigger data, and show that it can be more efficient to accumulate input data and process a larger amount at once. Therefore, it is important to analyse the dataset size to be processed and to quantify the ideal number of allocated nodes for each job, in order to avoid wasting resources.
The evaluation of the dataset size and of the optimal number of nodes is important to understand how to schedule MapReduce jobs and resource allocation through specific Hadoop schedulers, such as the Capacity Scheduler and the Fair Scheduler (Zaharia et al., 2010), in order to avoid wasting resources by allocating nodes that will not produce significant gains (Verma et al., 2012a). Thus, the variation of processing capacity achieved in our experiments highlights the importance of evaluating the cost of node allocation and its benefits, and the need to evaluate the ideal size of pools in the Hadoop cluster, so as to balance the cluster size allocated to process a given input size against the resource sharing of the Hadoop cluster.
The MapReduce processing capacity does not scale proportionally to node addition; in some cases there is no significant increase in processing capacity with node addition, as shown in Figure 4.1, where jobs using block sizes of 64MB and 128MB in clusters with more than 14 nodes, for DPI of 90Gb, present no significant completion time gain with node addition.
The number of execution waves is a factor that must be evaluated (Kavulya et al., 2010) when MapReduce scalability is analysed, because the decrease in execution time is related to the number of execution waves necessary to process all the input data. The number of execution waves is defined by the slots available for the execution of Map and Reduce tasks; for example, if a MapReduce job is divided into 10 tasks in a cluster with 5 available slots, then 2 execution waves are necessary for all tasks to be executed.
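As a rough illustration under the setup of Section 4.2.2, assuming 4 Map slots per node and, for the whole-block drivers, one Map task per stored file: the 90Gb dataset stored in 90 files of 128MB yields 90 Map tasks, so a 14-node cluster (56 Map slots) needs 2 Map waves, a 21-node cluster (84 slots) still needs 2 waves, and a 25-node cluster (100 slots) fits all Map tasks in a single wave; beyond that point, adding nodes cannot reduce the number of Map waves any further.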
Figure 4.9(a) shows a case of DPI of 30Gb, using a block size of 128MB, in which there was no reduction of completion time for cluster sizes bigger than 10 nodes, because there was no reduction in the number of execution waves. In our experiments, however, cases with a reduction of execution waves also presented no significant reduction of completion time, such as the cases using a block size of 128MB in clusters with 21 nodes or more, for DPI and packet level analysis, as shown in Figure 4.1. Thus, node addition and task distribution must be evaluated for resource usage optimization and to avoid additional or unnecessary costs with machines and power consumption.
The comparison of completion time between CountUpDriver and P3 shows that P3, which splits the data into packets, performs better than CountUpDriver, which processes a whole block without splitting. When a whole block is processed as input, the local node parallelism is limited to the number of slots per node, while in the divisible approach each split can be processed independently, increasing the possible parallelism. Because some cases require data without splitting, such as DPI and video processing (Pereira et al., 2010), improvements for this issue must be evaluated, considering better schedulers, data location and task assignment.
The behavioural evaluation of the MapReduce phases showed that the Map phase is predominant in the total execution time for packet level analysis and DPI, with Shuffle being the second most expressive phase. Shuffle can overlap the Map phase, and this condition must be considered in MapReduce evaluations, especially in our case, since the overlap of Map and Shuffle represents more than 50% of the total execution time.
A long "Map and Shuffle" phase means that Shuffle tasks run in parallel with Map tasks for a long time, and that slots remain allocated to Shuffle tasks that will only conclude after all Map tasks have finished, even though these Shuffle tasks may take longer than the time required to read and process the generated intermediate data. If slots are allocated to Shuffle tasks that are merely waiting for the Map phase to conclude, those slots could be used for other task executions, which could accelerate the job completion time.
With the increase of cluster size and the reduction of job completion time, the Map phase showed a proportional decrease, while the Shuffle phase increased with the growth of the number of nodes. With more nodes, the intermediate data generated by the Map tasks is spread over more nodes, which are responsible for shuffling the data and sending it to specific Reducers, increasing the amount of remote I/O from Mappers to Reducers and the number of data sources for each Reducer. The Shuffle phase may represent a bottleneck (Zhang et al., 2009) for scalability and could be optimized, due to I/O restrictions (Lee et al., 2012; Akram et al., 2012) and data locality issues in the Reduce phase (Hammoud and Sakr, 2011).
The information extracted from the analysed results about the performance obtained with specific cluster, block and input sizes is important for configuring MapReduce resource allocation and specialized schedulers, such as the Fair Scheduler (Zaharia et al., 2008), which defines pool sizes and resource shares for MapReduce jobs. With this information about the performance achieved with specific resources, it is possible to configure MapReduce parameters so as to balance the resource allocation against the expected completion time or resource sharing (Zaharia et al., 2008, 2010).

4.4.2 Possible Threats to Validity

In this chapter we evaluated, for packet level analysis, a port counter implemented with P3. We used a version of this implementation published on the Lee et al. (2011) website (https://sites.google.com/a/networks.cnu.ac.kr/yhlee/p3), obtained in February 2012, when a complete binary version was available; this binary version is currently not available.
Part of the P3 source code was published later, but not all the code necessary to compile the binary libraries required to evaluate the P3 implementation of a port counter. It is therefore important to highlight that the results obtained through our evaluation refer to the P3 version obtained in February 2012 from the Lee et al. (2011) website.
It is also important to highlight that DPI can present restrictions for evaluating encrypted messages, and that the obtained results are specific to the input datasets, factors, levels and experiment setup used in our evaluation.

4.5 Chapter Summary

In this chapter, we evaluated the performance of MapReduce for packet level analysis and DPI of application traffic. We evaluated how input data size, block size and cluster size impact the MapReduce phases, the job completion time, the processing capacity scalability and the speed-up achieved in comparison with the same algorithm executed by a non-distributed implementation.
The results show that MapReduce presents high processing capacity for dealing with massive application traffic analysis. The behaviour of the MapReduce phases over variations of block size and cluster size was evaluated; we verified that packet level analysis and DPI are Map-intensive jobs, with the Map phase consuming more than 70% of the execution time and the Shuffle phase being the second predominant phase.
We showed that input size, block size and cluster size are important factors to be considered to achieve a better job completion time and to explore MapReduce scalability and efficient resource allocation, given the variation in completion time caused by the block size adopted and, in some cases, the lack of increase in processing capacity with node addition to the cluster.
We also showed that using a whole block as input for Map functions achieves poorer performance than using divisible data; therefore, more evaluation is necessary to understand how this case can be handled and improved.

5 Conclusion and Future Work
The softest things in the world overcome the hardest things in the world.
—LAO TZU

Distributed systems have been adopted for building modern Internet services and cloud computing infrastructure. The detection of error causes, and the diagnosis and reproduction of errors in distributed systems, are challenges that motivate efforts to develop less intrusive mechanisms for monitoring and debugging distributed applications at runtime.
Network traffic analysis is one option for distributed systems measurement, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic under varying resource demand.
In this dissertation we proposed an approach to perform deep inspection of the network traffic of distributed applications, in order to evaluate distributed systems at a data center through network traffic analysis, using commodity hardware and cloud computing services, in a minimally intrusive way. We developed an approach based on MapReduce to evaluate the behavior of a JXTA-based distributed system through DPI.
We evaluated the effectiveness of MapReduce to implement a DPI algorithm and its completion time scalability to measure a JXTA-based application, using virtual machines of a cloud computing provider. We also evaluated in depth the performance of MapReduce for packet level analysis and DPI, characterizing the behavior of the MapReduce phases, its processing capacity scalability and its speed-up, over variations of input size, block size and cluster size.

5.1 Conclusion

With our proposed approach, it is possible to measure the network traffic behavior of distributed applications that generate intensive network traffic, through the offline evaluation of information from the production environment of a distributed system, making it possible to use the evaluated indicators to diagnose problems and analyse the performance of distributed systems.
We showed that the MapReduce programming model can express algorithms for DPI, such as Algorithm 1, implemented to extract application indicators from the network traffic of a JXTA-based distributed application. We analysed the completion time scalability achieved for different numbers of nodes in a Hadoop cluster composed of virtual machines, with different sizes of network traffic used as input. We showed the processing capacity and the completion time scalability achieved, and also the influence of the number of nodes and of the input data size on the processing capacity for DPI using virtual machines of Amazon EC2, for a selected scenario.
We evaluated the performance of MapReduce for packet level analysis and DPI of application traffic, using commodity hardware, and showed how input data size, block size and cluster size cause relevant impacts on the MapReduce phases, the job completion time, the processing capacity scalability and the speed-up achieved in comparison against the same execution by a non-distributed implementation.
The results showed that, although MapReduce presents good processing capacity using cloud services or commodity computers for dealing with massive application traffic analysis, it is necessary to evaluate the behaviour of MapReduce for processing specific data types, in order to understand its relation with the available resources and with the configuration of MapReduce parameters, and to obtain optimal performance in specific environments.
We showed that the MapReduce processing capacity scalability is not proportional to the number of allocated nodes, and that the relative processing capacity decreases with node addition. We showed that input size, block size and cluster size are important factors to be considered to achieve a better job completion time and to explore MapReduce scalability, due to the observed variation in completion time caused by the different block sizes adopted. Also, in some cases, the processing capacity does not scale with node addition to the cluster, which highlights the importance of allocating resources according to the workload and input data, in order to avoid wasting resources.


We verified that packet level analysis and DPI are Map-intensive jobs, with the Map phase consuming more than 70% of the total job completion time and the Shuffle phase being the second predominant phase. We also showed that using a whole block as input for Map functions achieved a poorer completion time than the approach that splits the block into records.

5.2 Contributions

We attempted to address the processing capacity problem of measuring distributed systems through network traffic analysis; the results of the work presented in this dissertation provide the contributions below:
1. We proposed an approach to implement DPI algorithms through MapReduce, using whole blocks as input for Map functions. We showed the effectiveness of MapReduce for a DPI algorithm that extracts indicators from distributed application traffic, and the completion time scalability of MapReduce for DPI, using virtual machines of a cloud provider;
2. We developed JNetPCAP-JXTA (Vieira, 2012b), an open source parser to extract
JXTA messages from network traffic traces;
3. We developed Hadoop-Analyzer (Vieira, 2013), an open source tool to extract indicators from Hadoop logs and generate graphs of specified metrics;
4. We characterized the behavior of the MapReduce phases for packet level analysis and DPI, showing that this kind of job is Map-intensive and highlighting points that can be improved;
5. We described the processing capacity scalability of MapReduce for packet
level analysis and DPI, evaluating the impact caused by variations in input
size, cluster size and block size;
6. We showed the speed-up obtained with MapReduce for DPI, with variations in
input size, cluster size and block size;
7. We published two papers reporting our results, as follows:
(a) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Evaluating Performance of Distributed Systems with MapReduce and Network Traffic Analysis. In ICSEA 2012, The Seventh International Conference on Software Engineering Advances. Xpert Publishing Services.
(b) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Measuring
Distributed Applications Through MapReduce and Traffic Analysis. In Parallel
and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference
on, pages 704 - 705.

5.2.1 Lessons Learned

The contributions cited are of scientific and academic scope, with implementations and evaluations little explored in the literature. With the development of this work, some important lessons were also learned.
During this research, different approaches for evaluating distributed systems of cloud computing providers were studied. In this period, we could see the importance of performance evaluation in cloud computing environments, and the recent efforts to diagnose and evaluate systems in the production environment of a data center. Furthermore, the growth of the Internet and of resource utilization makes it necessary to have solutions able to evaluate large amounts of data in a short time, with low performance degradation of the evaluated system.
MapReduce has grown as a general-purpose solution for big data processing, but it is not a solution for all kinds of problems, and its performance depends on several parameters. Research has been done to improve MapReduce performance through analytical modelling, simulation and measurement, but the most relevant contributions in this direction were guided by realistic workload evaluations from large MapReduce clusters.
We learned that, despite the facilities MapReduce provides for distributed processing, its performance is influenced by the environment, the network topology, the workload, the data type and several specific parameter configurations. Therefore, an evaluation of MapReduce behavior using data from a realistic environment will provide more accurate and broader results, while in controlled experiments the results are more restricted and limited to the evaluated metrics and factors.


5.3 Future Work

Because of the time constraints imposed on the master's degree, this dissertation addresses some problems, but others remain open or are emerging from the current results. Thus, the following issues should be investigated as future work:
• Evaluation of all components of the proposed approach. This dissertation evaluated JNetPCAP-JXTA, AppAnalyzer and its implementation to evaluate a JXTA-based distributed application; it is still necessary to evaluate the SnifferServer, the Manager and the whole system working together, analysing their impact on the measured distributed system and the scalability achieved;
• Development of a technique for the efficient evaluation of distributed systems through information extracted from network traffic. This dissertation addressed the problem of processing capacity for measuring distributed systems through network traffic analysis, but an efficient approach to diagnose problems of distributed systems is still necessary, using information about flows, connections, throughput and response time obtained from network traffic analysis;
• Development of an analytic model and simulations, using the information about MapReduce behavior for network traffic analysis measured in this dissertation, to reproduce its characteristics and enable the evaluation and prediction of some cases of MapReduce for network traffic analysis.


Bibliography

Aguilera, M. K., Mogul, J. C., Wiener, J. L., Reynolds, P., and Muthitacharoen, A. (2003).
Performance debugging for distributed systems of black boxes. SIGOPS Oper. Syst.
Rev., 37(5).
Akram, S., Marazakis, M., and Bilas, A. (2012). Understanding scalability and performance requirements of I/O-intensive applications on future multicore servers. In
Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Antonello, R., Fernandes, S., Kamienski, C., Sadok, D., Kelner, J., Godor, I., Szabo, G.,
and Westholm, T. (2012). Deep packet inspection tools and techniques in commodity
platforms: Challenges and trends. Journal of Network and Computer Applications.
Antoniu, G., Hatcher, P., Jan, M., and Noblet, D. (2005). Performance evaluation of jxta
communication layers. In Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE
International Symposium on, volume 1, pages 251 – 258 Vol. 1.
Antoniu, G., Cudennec, L., Jan, M., and Duigou, M. (2007). Performance scalability
of the jxta p2p framework. In Parallel and Distributed Processing Symposium, 2007.
IPDPS 2007. IEEE International, pages 1 –10.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G.,
Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. (2010). A view of cloud
computing. Commun. ACM, 53, 50–58.
Basili, V. R., Caldiera, G., and Rombach, H. D. (1994). The goal question metric
approach. In Encyclopedia of Software Engineering. Wiley.
Bhatotia, P., Wieder, A., Akkuş, I. E., Rodrigues, R., and Acar, U. A. (2011). Large-scale incremental data processing with change propagation. In Proceedings of the 3rd
USENIX conference on Hot topics in cloud computing, HotCloud’11, pages 18–18,
Berkeley, CA, USA. USENIX Association.
Callado, A., Kamienski, C., Szabo, G., Gero, B., Kelner, J., Fernandes, S., and Sadok, D.
(2009). A survey on internet traffic identification. Communications Surveys Tutorials,
IEEE, 11(3), 37 –52.


Chen, Y., Ganapathi, A., Griffith, R., and Katz, R. (2011). The case for evaluating
MapReduce performance using workload suites. In Modeling, Analysis Simulation of
Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International
Symposium on.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. (2010).
MapReduce online. In Proceedings of the 7th USENIX conference on Networked
systems design and implementation, pages 21–21.
Cox, L. P., Murray, C. D., and Noble, B. D. (2002). Pastiche: making backup cheap and
easy. SIGOPS Oper. Syst. Rev., 36, 285–298.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large
clusters. Commun. ACM, 51, 107–113.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon’s highly
available key-value store. SIGOPS Oper. Syst. Rev., 41, 205–220.
Duigou, M. (2003). Jxta v2.0 protocols specification. Technical report, IETF Internet
Draft.
Eddelbuettel, D. (2012). R in action. Journal of Statistical Software, Book Reviews, 46(2),
1–2.
Fernandes, S., Antonello, R., Lacerda, T., Santos, A., Sadok, D., and Westholm, T. (2009).
Slimming down deep packet inspection systems. In INFOCOM Workshops 2009,
IEEE.
Fonseca, A., Silva, M., Soares, P., Soares-Neto, F., Garcia, V., and Assad, R. (2012). Uma
proposta arquitetural para serviços escaláveis de dados em nuvens. In Proceedings of
the VIII Workshop de Redes Dinâmicas e Sistemas P2P.
Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin,
A., and Stoica, I. (2009). Above the clouds: A berkeley view of cloud computing.
Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep.
UCB/EECS, 28.
Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. SIGOPS
Oper. Syst. Rev.


Groot, S. (2012). Modeling I/O interference in data intensive map-reduce applications. In
Applications and the Internet (SAINT), 2012 IEEE/IPSJ 12th International Symposium
on.
Gunther, N. (2006). Guerrilla Capacity Planning: A Tactical Approach to Planning for
Highly Scalable Applications and Services. Springer.
Guo, Z., Fox, G., and Zhou, M. (2012). Investigation of data locality and fairness in
MapReduce. In Proceedings of third international workshop on MapReduce and its
Applications Date, MapReduce ’12.
Gupta, D., Vishwanath, K. V., McNett, M., Vahdat, A., Yocum, K., Snoeren, A., and
Voelker, G. M. (2011). Diecast: Testing distributed systems with an accurate scale
model. ACM Trans. Comput. Syst., 29, 4:1–4:48.
Halepovic, E. (2004). Performance evaluation and benchmarking of the JXTA peer-to-peer platform. Ph.D. thesis, University of Saskatchewan.
Halepovic, E. and Deters, R. (2003). The costs of using jxta. In Peer-to-Peer Computing,
2003. (P2P 2003). Proceedings. Third International Conference on, pages 160 – 167.
Halepovic, E. and Deters, R. (2005). The jxta performance model and evaluation. Future
Gener. Comput. Syst., 21, 377–390.
Halepovic, E., Deters, R., and Traversat, B. (2005). Jxta messaging: Analysis of feature-performance tradeoffs and implications for system design. In R. Meersman and
Z. Tari, editors, On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and
ODBASE, volume 3761, pages 1097–1114. Springer Berlin / Heidelberg.
Hammoud, M. and Sakr, M. (2011). Locality-aware reduce task scheduling for MapReduce. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third
International Conference on, pages 570 –576.
Jacobson, V., Leres, C., and McCanne, S. (1994). libpcap. http://www.tcpdump.org/.
Jain, R. (1991). The art of computer systems performance analysis - techniques for
experimental design, measurement, simulation, and modeling. Wiley professional
computing. Wiley.
Janert, P. K. (2010). Gnuplot in action: understanding data with graphs. Manning,
Greenwich, CT.


Jiang, D., Ooi, B. C., Shi, L., and Wu, S. (2010). The performance of MapReduce: an
in-depth study. Proc. VLDB Endow.
Kambatla, K., Pathak, A., and Pucha, H. (2009). Towards optimizing hadoop provisioning
in the cloud. In Proc. of the First Workshop on Hot Topics in Cloud Computing.
Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. (2009). The nature
of data center traffic: measurements & analysis. In Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement conference, IMC ’09, pages 202–208,
New York, NY, USA. ACM.
Kavulya, S., Tan, J., Gandhi, R., and Narasimhan, P. (2010). An analysis of traces from
a production MapReduce cluster. In Cluster, Cloud and Grid Computing (CCGrid),
2010 10th IEEE/ACM International Conference on.
Lämmel, R. (2007). Google’s MapReduce programming model - revisited. Sci. Comput.
Program., 68(3), 208–237.
Lee, G. (2012). Resource Allocation and Scheduling in Heterogeneous Cloud Environments. Ph.D. thesis, University of California, Berkeley.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., and Moon, B. (2012). Parallel data
processing with MapReduce: a survey. SIGMOD Rec.
Lee, Y., Kang, W., and Son, H. (2010). An internet traffic analysis method with MapReduce. In Network Operations and Management Symposium Workshops (NOMS Wksps),
2010 IEEE/IFIP, pages 357 –361.
Lee, Y., Kang, W., and Lee, Y. (2011). A hadoop-based packet trace processing tool. In
Proceedings of the Third international conference on Traffic monitoring and analysis,
TMA’11.
Lin, H., Ma, X., Archuleta, J., Feng, W., Gardner, M., and Zhang, Z. (2010). Moon:
MapReduce on opportunistic environments. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 95–106.
ACM.
Lin, J. (2012). Mapreduce is good enough? if all you have is a hammer, throw away
everything that’s not a nail! Big Data.


Loiseau, P., Goncalves, P., Guillier, R., Imbert, M., Kodama, Y., and Primet, P.-B. (2009).
Metroflux: A high performance system for analysing flow at very fine-grain. In
Testbeds and Research Infrastructures for the Development of Networks Communities
and Workshops, 2009. TridentCom 2009. 5th International Conference on, pages 1 –9.
Lu, P., Lee, Y. C., Wang, C., Zhou, B. B., Chen, J., and Zomaya, A. Y. (2012). Workload
characteristic oriented scheduler for MapReduce. In Parallel and Distributed Systems
(ICPADS), 2012 IEEE 18th International Conference on, pages 156 –163.
Massie, M. L., Chun, B. N., and Culler, D. E. (2004). The Ganglia distributed monitoring
system: design, implementation, and experience. Parallel Computing, 30(7), 817 –
840.
Mi, H., Wang, H., Yin, G., Cai, H., Zhou, Q., and Sun, T. (2012). Performance problems
diagnosis in cloud computing systems by mining request trace logs. In Network
Operations and Management Symposium (NOMS), 2012 IEEE.
Nagaraj, K., Killian, C., and Neville, J. (2012). Structured comparative analysis of
systems logs to diagnose performance problems. In Proceedings of the 9th USENIX
conference on Networked Systems Design and Implementation, NSDI’12.
Oliner, A., Ganapathi, A., and Xu, W. (2012). Advances and challenges in log analysis.
Commun. ACM, 55(2), 55–61.
Paul, D. (2010). JXTA-Sim2: A Simulator for the core JXTA protocols. Master’s thesis,
University of Dublin, Ireland.
Pereira, R., Azambuja, M., Breitman, K., and Endler, M. (2010). An architecture for
distributed high performance video processing in the cloud. In Cloud Computing
(CLOUD), 2010 IEEE 3rd International Conference on.
Piyachon, P. and Luo, Y. (2006). Efficient memory utilization on network processors
for deep packet inspection. In Proceedings of the 2006 ACM/IEEE Symposium on
Architecture for Networking and Communications Systems, pages 71–80. ACM.
Risso, F., Baldi, M., Morandi, O., Baldini, A., and Monclus, P. (2008). Lightweight,
payload-based traffic classification: An experimental evaluation. In Communications,
2008. ICC’08. IEEE International Conference on, pages 5869–5875. IEEE.


Rumen (2012). Rumen, a tool to extract job characterization data from job tracker
logs. http://hadoop.apache.org/docs/MapReduce/r0.22.0/rumen.html. [Accessed December 2012].
Sambasivan, R. R., Zheng, A. X., De Rosa, M., Krevat, E., Whitman, S., Stroucken, M.,
Wang, W., Xu, L., and Ganger, G. R. (2011). Diagnosing performance changes by
comparing request flows. In Proceedings of the 8th USENIX conference on Networked
systems design and implementation, NSDI’11.
Shafer, J., Rixner, S., and Cox, A. (2010). The hadoop distributed filesystem: Balancing
portability and performance. In Performance Analysis of Systems Software (ISPASS),
2010 IEEE International Symposium on.
Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D.,
Jaspan, S., and Shanbhag, C. (2010). Dapper, a large-scale distributed systems tracing
infrastructure. Technical report, Google, Inc.
Tan, J., Meng, X., and Zhang, L. (2012). Coupling scheduler for MapReduce/hadoop. In
Proceedings of the 21st international symposium on High-Performance Parallel and
Distributed Computing, HPDC ’12.
Verma, A., Cherkasova, L., Kumar, V., and Campbell, R. (2012a). Deadline-based
workload management for MapReduce environments: Pieces of the performance
puzzle. In Network Operations and Management Symposium (NOMS), 2012 IEEE.
Verma, A., Cherkasova, L., and Campbell, R. (2012b). Two sides of a coin: Optimizing
the schedule of MapReduce jobs to minimize their makespan and improve cluster
performance. In Modeling, Analysis Simulation of Computer and Telecommunication
Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Vieira, T. (2012a). hadoop-dpi. http://github.com/tpbvieira/hadoop-dpi.
Vieira, T. (2012b). jnetpcap-jxta. http://github.com/tpbvieira/jnetpcap-jxta.
Vieira, T. (2013). hadoop-analyzer. http://github.com/tpbvieira/hadoop-analyzer.
Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012a). Evaluating
performance of distributed systems with MapReduce and network traffic analysis. In
ICSEA 2012, The Seventh International Conference on Software Engineering Advances.
Xpert Publishing Services.


Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012b). Measuring
distributed applications through MapReduce and traffic analysis. In Parallel and
Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pages
704 –705.
Wang, G., Butt, A., Pandey, P., and Gupta, K. (2009). A simulation approach to evaluating
design decisions in MapReduce setups. In Modeling, Analysis Simulation of Computer
and Telecommunication Systems, 2009. MASCOTS ’09. IEEE International Symposium
on.
Yu, M., Greenberg, A., Maltz, D., Rexford, J., Yuan, L., Kandula, S., and Kim, C. (2011).
Profiling network performance for multi-tier data center applications. In Proceedings
of the 8th USENIX conference on Networked systems design and implementation,
NSDI’11.
Yuan, D., Zheng, J., Park, S., Zhou, Y., and Savage, S. (2011). Improving software
diagnosability via log enhancement. In ACM SIGARCH Computer Architecture News,
volume 39, pages 3–14. ACM.
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. (2008). Improving
MapReduce performance in heterogeneous environments. In Proceedings of the 8th
USENIX conference on Operating systems design and implementation, OSDI’08.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I.
(2010). Delay scheduling: a simple technique for achieving locality and fairness
in cluster scheduling. In Proceedings of the 5th European conference on Computer
systems, EuroSys ’10.
Zhang, S., Han, J., Liu, Z., Wang, K., and Feng, S. (2009). Accelerating MapReduce
with distributed memory cache. In Parallel and Distributed Systems (ICPADS), 2009
15th International Conference on.
Zheng, Z., Yu, L., Lan, Z., and Jones, T. (2012). 3-dimensional root cause diagnosis
via co-analysis. In Proceedings of the 9th international conference on Autonomic
computing, pages 181–190. ACM.
