
Big Data Engines

BINARY BATCH PROCESSING


Overview
• Signal processing has always been concerned with extracting
information from data
• And thus it is very near to the heart of Big Data.
• Techniques such as sampling, filtering, and the DFT apply to data
collection.
• This data sometimes exceeds the computing power or storage of a
single computer.
• The importance of signal processing will grow as we dive deeper
into the information age.
Types of data targeted for processing with Big Data tools
• Tabular
• Binary data
• Batch (snapshots of measurements from sensors)
• Large static datasets
Aims of the project
• Faster processing
• Explore the scope of Big Data tools in processing measurements
• Distributed processing
• Comparison among different Big Data engines, in terms of:
• Ease of use
• Speed of processing
• Tools available
Batch processing
• Batch processing is one method of computing over a large dataset.
• The process involves breaking work up into smaller pieces, scheduling
each piece on an individual machine.
• Then reshuffling the data based on the intermediate results, and
calculating and assembling the final result.
• These steps are often referred to as a distributed map-reduce algorithm.
• This is the strategy used by Apache Hadoop's MapReduce.
• Batch processing is most useful when dealing with very large datasets
that require quite a bit of computation; a minimal sketch of these steps follows.
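
To make these steps concrete, below is a minimal single-machine sketch in plain Python of the map → shuffle → reduce pattern, applied to hypothetical batches of (sensor, value) measurements. The sensor names and values are made up for illustration; a real engine would run the map and reduce calls on different machines.

from collections import defaultdict

# Hypothetical input: batches ("chunks") of (sensor_id, value) measurements.
# In a real deployment each chunk would live on a different worker node.
chunks = [
    [("s1", 2.0), ("s2", 3.5), ("s1", 4.0)],
    [("s2", 1.5), ("s3", 7.0), ("s1", 6.0)],
]

def map_chunk(chunk):
    # Map step: emit (key, value) pairs; the key is the sensor id and the
    # value is a (sum, count) pair so partial results can be combined later.
    return [(sensor, (value, 1)) for sensor, value in chunk]

# Each chunk is processed independently (sequentially here, in parallel on a cluster).
intermediate = [pair for chunk in chunks for pair in map_chunk(chunk)]

# Shuffle step: group the intermediate results by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

def reduce_values(values):
    # Reduce step: combine the partial (sum, count) pairs into a mean.
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return total / count

final = {key: reduce_values(values) for key, values in groups.items()}
print(final)  # e.g. {'s1': 4.0, 's2': 2.5, 's3': 7.0}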
Big Data frameworks
• Batch-only frameworks
• Stream-only frameworks
• Hybrid frameworks
Hadoop
• Hadoop is an Apache project that was an early open-source success in
big data.
• It consists of a distributed filesystem called HDFS, with a cluster
management and resource scheduler on top called YARN (Yet Another
Resource Negotiator).
• Batch processing capabilities are provided by the MapReduce
computation engine.
• Other computational and analysis systems can be run alongside
MapReduce in modern Hadoop deployments.
MapReduce
• This model is a framework for processing and generating large-scale
datasets with parallel and distributed algorithms.
• It is composed of two phases: Map and Reduce.
• The framework splits the input data and distributes it across the cluster,
then the same operation is performed on each split in parallel.
• Finally, the results are aggregated and returned to the master node.
• The framework manages all task scheduling and monitoring, and re-executes
failed tasks (a minimal sketch of a MapReduce job follows).
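
As one hedged illustration of the two phases, the sketch below uses the third-party mrjob Python library, which can run a MapReduce job locally or submit it to a Hadoop cluster. The word-count task and file name are illustrative assumptions, not part of the project.

# word_count.py -- a minimal MapReduce job using the third-party mrjob library.
# Run locally with:   python word_count.py input.txt
# (mrjob can also submit the same job to a Hadoop cluster with -r hadoop.)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Map phase: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce phase: sum the counts emitted for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()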
Hadoop Distributed File System (HDFS)
• This paradigm is applied when the amount of data is too much for a
single machine.
• It contains two types of nodes: a master node and multiple slave nodes.
• It stores files in blocks; the default block size is 64 MB (128 MB in more recent Hadoop versions).
• All HDFS files are replicated multiple times, which facilitates the parallel
processing of large amounts of data (a short sketch of interacting with HDFS follows).
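
A hedged sketch of copying a dataset into HDFS from Python by calling the standard hdfs dfs command-line client. The paths and file name are illustrative, and a configured Hadoop client is assumed to be available on the machine.

import subprocess

def hdfs(*args):
    # Thin wrapper around the standard "hdfs dfs" command-line client.
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Illustrative paths -- adjust to the actual cluster layout.
hdfs("-mkdir", "-p", "/data/measurements")                # create a directory on HDFS
hdfs("-put", "measurements.txt", "/data/measurements/")   # upload a local file
hdfs("-ls", "/data/measurements")                         # list the directory contents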
Batch Processing Model
The processing functionality of Hadoop comes from the MapReduce engine.
MapReduce's processing technique follows the map, shuffle, reduce algorithm
using key-value pairs. The basic procedure involves the following steps (a minimal sketch follows the list):
• Reading the dataset from the HDFS filesystem
• Dividing the dataset into chunks and distributing them among the available nodes
• Applying the computation on each node to the subset of data (the intermediate
results are written back to HDFS)
• Redistributing the intermediate results to group by key
• "Reducing" the value of each key by summarizing and combining the results
calculated by the individual nodes
• Writing the calculated final results back to HDFS
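
A minimal sketch of this procedure using Hadoop Streaming, which lets the map and reduce steps be written as ordinary Python scripts that read from standard input and write key/value pairs to standard output. The per-sensor sum, input format, and HDFS paths are illustrative assumptions.

#!/usr/bin/env python3
# mapper.py -- Map step for Hadoop Streaming: read raw text lines from stdin
# (assumed format: "<sensor_id> <value>") and emit "key<TAB>value" pairs.
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        sensor_id, value = parts
        print(f"{sensor_id}\t{value}")

#!/usr/bin/env python3
# reducer.py -- Reduce step: stdin arrives sorted by key after the shuffle,
# so values for the same sensor are adjacent and can be summed in one pass.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")

The two scripts would then be submitted with the Hadoop Streaming jar, roughly as: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact jar path depends on the installation). The framework handles reading from HDFS, shuffling by key, and writing the results back.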
Apache Spark
• Apache Spark is a fast and general engine for large-scale data processing based on the
MapReduce model.
• This framework aims at performing fast distributed computing on Big Data by using in-memory
primitives.
• This platform allows user programs to load data into memory and query it repeatedly,
making it a well-suited tool for online and iterative processing (especially for ML algorithms).
• Spark is based on distributed data structures called Resilient Distributed Datasets (RDDs), as sketched below.
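
A minimal PySpark sketch of working with an RDD, assuming a local Spark installation; the file path and the per-sensor sum are illustrative only.

from pyspark import SparkContext

# Start Spark locally; on a cluster the master URL would point at the cluster manager.
sc = SparkContext("local[*]", "rdd-sketch")

# Illustrative input file with lines like "s1 2.0"; the path is an assumption.
lines = sc.textFile("measurements.txt")

# Build an RDD of (sensor_id, value) pairs and keep it in memory for repeated queries.
pairs = lines.map(lambda line: line.split()) \
             .map(lambda p: (p[0], float(p[1]))) \
             .cache()

# Sum the measurements per sensor in parallel and collect the result to the driver.
totals = pairs.reduceByKey(lambda a, b: a + b)
print(totals.collect())

sc.stop()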

Spark SQL
• It introduces DataFrames, a data structure for structured (and semi-structured) data.
• DataFrames make it possible to embed SQL queries in Spark programs, as sketched below.
• It provides SQL language support, with command-line interfaces and ODBC/JDBC drivers.
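
A minimal Spark SQL sketch, assuming a local Spark installation; the column names and the query are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Build a DataFrame from in-memory rows; in practice it could be read from
# CSV/JSON/Parquet files with spark.read.*.
df = spark.createDataFrame(
    [("s1", 2.0), ("s2", 3.5), ("s1", 4.0)],
    ["sensor", "value"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("measurements")

spark.sql(
    "SELECT sensor, AVG(value) AS avg_value FROM measurements GROUP BY sensor"
).show()

spark.stop()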
Apache Flink
• An open-source framework for distributed stream and batch data
processing.
• It focuses on processing large volumes of data with very low latency and
high fault tolerance on distributed systems.
• Its fault tolerance makes it well suited for streaming data processing.
• The trick is that it periodically generates consistent snapshots; in case of failure, the
system falls back on these snapshots (a minimal sketch follows).
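
A minimal sketch using Flink's Python API (PyFlink), assuming it is installed; it enables periodic checkpoints (the consistent snapshots mentioned above) and runs a small computation on an in-memory collection, with illustrative data and checkpoint interval.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Take a consistent snapshot (checkpoint) of the job state every 10 seconds;
# on failure Flink restarts the job from the latest snapshot.
env.enable_checkpointing(10000)

# Illustrative in-memory input of (sensor_id, value) pairs.
ds = env.from_collection([("s1", 2.0), ("s2", 3.5), ("s1", 4.0)])

# Sum the values per sensor key and print the running results.
ds.key_by(lambda x: x[0]) \
  .reduce(lambda a, b: (a[0], a[1] + b[1])) \
  .print()

env.execute("flink-sketch")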
