
A Review Paper on Big Data Databases

Thakur Shani Prakashchand1, Sahil Mehta2


Department of Computer Science and Engineering, Bahra University, Waknaghat, HP, India
Abstract
Big data is an excessive amount of imprecise data, in a variety of formats, generated from a
variety of sources at rapid speed. It is among the most-used buzzwords across research, industry,
and academia. Big data is not limited to a data perspective; it has emerged as a field that
includes associated technologies, tools, and real-world applications. As organizations
increasingly look to big data to deliver valuable business insights, it has become clear that the
traditional relational database management systems (RDBMS) that have been the standard for
the past 30 years are not up to the task of handling these new data requirements. As a result, a
variety of big data database options have emerged. While the technologies differ, they are all
designed to overcome the limitations of RDBMS and enable organizations to extract value from
their data. The objective of this paper is to discuss a variety of big data databases, namely
MongoDB, Apache Cassandra, Hive, and HBase.

Key Words: Big Data, database, MongoDB, Cassandra, HBase, Hive.

CHAPTER I
INTRODUCTION
Big data is high volume, high velocity, and high variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimization. There is no doubt that today's systems process huge amounts of data every day.
For example, Facebook's Hive data warehouse holds 300 PB of data with a high incoming daily
rate; without high-speed data generation and capture, such volumes could not accumulate so
quickly. According to IBM, 90% of the data in the world today has been created over the last two
years alone. High variety (i.e. unstructured data) is another important aspect of big data. It refers
to information that does not have a pre-defined data model or format. Traditional data
processing systems (e.g. relational data warehouses) may handle large volumes of rigid relational
data, but they are not flexible enough to process semi-structured or unstructured data. New
technologies have had to be developed to handle data from varied sources, e.g. text, social
networks, image data, etc.
The problem with the relational model is that it has scalability issues: performance degrades
rapidly as data volume increases. This led to the development of a new data model, NoSQL.
Because of the high scalability provided by NoSQL, it came to be seen as a major competitor
to the relational database model. Unlike RDBMS, NoSQL databases are designed to easily scale
out as and when they grow. Most NoSQL systems have removed multi-platform support and
some unnecessary features of RDBMS, making them much more lightweight and efficient
than their RDBMS counterparts. The NoSQL data model does not guarantee the ACID properties
(Atomicity, Consistency, Isolation, and Durability); instead it offers the BASE properties
(Basically Available, Soft state, Eventual consistency), in line with the trade-offs described
by the CAP (Consistency, Availability, Partition tolerance) theorem.
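
The BASE trade-off above can be made concrete with a small, purely illustrative Python sketch (not tied to any particular NoSQL product; the `Replica` class and key names are invented for illustration): two replicas briefly disagree after a write (soft state), and a background anti-entropy pass using a last-write-wins rule makes them converge (eventual consistency).

```python
class Replica:
    """A toy replica that stores key -> (value, timestamp) pairs."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        self.data[key] = (value, ts)

    def read(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

def anti_entropy(a, b):
    """Merge two replicas: for each key, the newest timestamp wins
    (last-write-wins, a common convergence rule in BASE-style stores)."""
    for key in set(a.data) | set(b.data):
        ea = a.data.get(key, (None, -1))
        eb = b.data.get(key, (None, -1))
        winner = ea if ea[1] >= eb[1] else eb
        a.data[key] = winner
        b.data[key] = winner

r1, r2 = Replica(), Replica()
r1.write("user:42", "alice", ts=1)   # the write lands on replica 1 only
print(r2.read("user:42"))            # None: replicas diverge (soft state)
anti_entropy(r1, r2)                 # background synchronization
print(r2.read("user:42"))            # "alice": eventual consistency
```

Under this model the system stays available for reads and writes at every replica, at the cost of a window during which replicas may return stale data.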

CHAPTER II
CASSANDRA DB
2.1 Introduction:
Apache Cassandra is a massively scalable open source non-relational database that offers
continuous availability, linear-scale performance, operational simplicity, and easy data
distribution across multiple data centers and cloud availability zones. Cassandra was originally
developed at Facebook, was open sourced in 2008, and became a top-level Apache project in
2010. It is a free and open source distributed NoSQL database management system designed
to handle large amounts of data. It is a Java-based system that can be managed and monitored
via JMX (Java Management Extensions). It does not follow the full relational data model.
2.2 Cassandra Features:
Cassandra provides a number of key features and benefits for those looking to use it as the
underlying database for modern online applications:
Massively scalable architecture – a masterless design where all nodes are the same, which
provides operational simplicity and easy scale-out.
Linear scale performance – the ability to add nodes without downtime produces predictable
increases in performance.
Continuous availability – offers redundancy of both data and node function, which eliminates
single points of failure and provides constant uptime.
Transparent fault detection and recovery – nodes that fail can easily be restored or replaced.
Flexible and dynamic data model – supports modern data types with fast writes and reads.
Strong data protection – a commit-log design ensures no data loss, and built-in security with
backup/restore keeps data protected and safe.
Multi-data-center replication – support for writes/reads across data centers (in multiple
geographies) and multiple cloud availability zones.
Data compression – data compressed by up to 80% without performance overhead.
CQL (Cassandra Query Language) – an SQL-like language that makes moving from a relational
database very easy.
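
The masterless scale-out described above rests on hash partitioning: each node owns a range of tokens on a ring, and a row's partition key hashes to the nodes that store it. The following is a simplified Python sketch of that idea (class and node names are invented; real Cassandra uses the Murmur3 partitioner and virtual nodes rather than the MD5-of-name scheme shown here):

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy Cassandra-style partitioning: nodes own token ranges on a
    ring, and a partition key is stored on the first node at or after
    its token, plus the next replication_factor - 1 nodes."""
    def __init__(self, nodes, replication_factor=2):
        self.rf = replication_factor
        # (token, node) pairs sorted by token position on the ring
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, partition_key):
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, self._token(partition_key)) % len(self.ring)
        # walk clockwise around the ring to pick rf consecutive owners
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]

ring = TokenRing(["node-a", "node-b", "node-c"])
owners = ring.replicas("sensor-17")
print(owners)  # two distinct nodes hold this partition
```

Because ownership is determined purely by hashing, any node can route any request, which is what removes the single master and allows linear scale-out.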
CHAPTER III
MONGO DB
3.1 Introduction
MongoDB is a free and open source cross-platform document database program, classified as a
NoSQL database. It uses JSON-like documents with flexible schemas. It is developed by MongoDB
Inc. and published under the GNU Affero General Public License and the Apache License. As a
non-relational database system, it generally works with semi-structured data, making MongoDB a
good example of a non-relational database. It is mainly used in the field of large databases, and
its main advantage is its scalability. It follows BASE transactions and also handles failures, so
it is well suited to applications that hold large amounts of data. As the number of records
increases, MongoDB shows a comparatively small growth in execution time, which is why it can
be preferred for better performance. In short, MongoDB (a NoSQL database) handles high
transaction loads, but at the cost of some data integrity guarantees.
3.2 MongoDB Features:
AD HOC QUERIES: It supports field, range, and regular-expression queries. Queries can also
include user-defined JavaScript functions.
INDEXING: Fields can be indexed with primary and secondary indices.
REPLICATION: A replica set contains two or more copies of the data, with members playing the
roles of primary and secondary replicas. All writes go to the primary, and reads go to the
primary by default.
FILE STORAGE: It can be used as a file system with load balancing and data replication. This
feature, called GridFS, is included with the MongoDB drivers.
AGGREGATION: The aggregation framework enables users to obtain the kind of results for
which the SQL GROUP BY clause is used. It includes the $lookup operator, which can join
documents.
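
To illustrate what the aggregation framework computes, here is a pure-Python analogue of a single `$group`/`$sum` stage over schema-flexible documents (the collection, field names, and `group_sum` helper are invented for illustration; this is not the MongoDB driver API):

```python
from collections import defaultdict

# A toy "collection": JSON-like documents whose shapes need not match.
orders = [
    {"_id": 1, "customer": "ada",  "total": 30, "items": ["book"]},
    {"_id": 2, "customer": "ada",  "total": 20, "items": ["pen", "ink"]},
    {"_id": 3, "customer": "alan", "total": 50},   # no "items" field at all
]

def group_sum(docs, key, value_field):
    """Rough analogue of the aggregation stage
    {"$group": {"_id": "$<key>", "total": {"$sum": "$<value_field>"}}}."""
    acc = defaultdict(int)
    for doc in docs:
        # documents missing the field contribute 0, mirroring $sum's behavior
        acc[doc[key]] += doc.get(value_field, 0)
    return dict(acc)

print(group_sum(orders, "customer", "total"))  # {'ada': 50, 'alan': 50}
```

Note how the third document simply omits a field rather than storing a NULL: this is the schema flexibility that distinguishes the document model from a fixed relational schema.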

CHAPTER IV
APACHE HIVE
4.1 Introduction
It is a data warehouse software project built on top of Apache Hadoop for providing data
summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in
various databases and file systems that integrate with Hadoop. Traditional SQL queries must be
implemented in the MapReduce Java API to execute SQL applications and queries over
distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries
(HiveQL) into the underlying Java without the need to implement queries in the low-level Java
API. Since most data warehousing applications work with SQL-based querying languages, Hive
aids portability of SQL-based applications to Hadoop. While initially developed by Facebook,
Apache Hive is used and developed by other companies such as Netflix and the Financial
Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive
included in Amazon Elastic MapReduce on Amazon Web Services.
4.2 Apache Hive Features:
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems
such as Amazon S3 filesystem.

It provides an SQL-like query language called HiveQL with schema on read and transparently converts
queries to MapReduce, Apache Tez and Spark jobs.

To accelerate queries, it provides indexes, including bitmap indexes. Other features of Hive include:

Indexing for query acceleration; index types include compaction and bitmap indexes as of
version 0.10, with more index types planned.

Different storage types such as plain text, RCFile, HBase, ORC, and others.

Metadata storage in a relational database management system, significantly reducing the time to
perform semantic checks during query execution.

Operating on compressed data stored in the Hadoop ecosystem, using algorithms
including DEFLATE, BWT, Snappy, etc.

Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types. Hive
supports extending the UDF set to handle use cases not supported by built-in functions.

SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.
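
To see what "implicitly converted into MapReduce" means, the following Python sketch simulates the job that a HiveQL query like `SELECT word, COUNT(*) FROM docs GROUP BY word` might compile down to (the input lines and variable names are invented; real Hive emits distributed MapReduce, Tez, or Spark jobs, not single-process Python):

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data big", "data hive"]

# Map phase: emit a (key, 1) pair for each word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring equal keys together (here: a simple sort).
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each key -- the GROUP BY result.
counts = {key: sum(v for _, v in grp)
          for key, grp in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 2, 'data': 2, 'hive': 1}
```

Hive's value is precisely that analysts write the one-line SQL query and never see this map/shuffle/reduce plumbing, which the engine generates and distributes for them.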

CHAPTER V
APACHE HBASE

5.1 Introduction
It is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data
store. This means that HBase can leverage the distributed processing paradigm of the
Hadoop Distributed File System (HDFS) and benefit from Hadoop’s MapReduce
programming model. It is meant to host large tables with billions of rows and potentially
millions of columns, and to run across a cluster of commodity hardware. But beyond its Hadoop
roots, HBase is a powerful database in its own right that blends real-time query capabilities
with the speed of a key/value store and offline or batch processing via MapReduce. In short,
HBase allows you to query for individual records as well as derive aggregate analytic reports
across a massive amount of data.
As a little bit of history, Google was faced with a challenging problem: How could it provide
timely search results across the entire Internet? The answer was that it essentially needed to
cache the Internet and define a new way to search that enormous cache quickly. It defined
the following technologies for this purpose:
Google File System: A scalable distributed file system for large distributed data-intensive
applications
BigTable: A distributed storage system for managing structured data that is designed to scale
to a large size: petabytes of data across thousands of commodity servers
MapReduce: A programming model and an associated implementation for processing and
generating large data sets
It was not too long after Google published these documents that open source implementations
of them began to appear, and in 2007, Mike Cafarella released code for an open source
BigTable implementation that he called HBase. Since then, HBase has become a top-level
Apache project that runs at Facebook, Twitter, and Adobe, to name a few. HBase is not a
relational database and requires a different approach to modeling your data.
5.2 Apache HBase Features:
1) HBase is not an "eventually consistent" data store, which makes it very suitable for tasks
such as high-speed counter aggregation (strongly consistent reads/writes).

2) HBase tables are distributed on the cluster via regions, and regions are automatically split
and re-distributed as the data grows (automatic sharding).

3) Automatic RegionServer failover.

4) HBase supports HDFS out of the box as its distributed file system (Hadoop/HDFS
integration).

5) HBase supports massively parallelized processing via MapReduce, with HBase as both
source and sink (MapReduce).

6) HBase provides an easy-to-use Java API for programmatic access (Java client API).

7) HBase also supports Thrift and REST for non-Java front ends (Thrift/REST API).

8) HBase supports a block cache and Bloom filters for high-volume query optimization
(block cache and Bloom filters).
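
The BigTable-style data model behind these features can be sketched as a sorted, multi-versioned map: row key → column family:qualifier → {timestamp: value}. The Python class below is an invented illustration of that shape (names like `ToyHBaseTable` are hypothetical; it is not the HBase client API):

```python
from collections import defaultdict

class ToyHBaseTable:
    """Sketch of HBase's data model: a map of
    row key -> (column family, qualifier) -> {timestamp: value}."""
    def __init__(self, families):
        # Column families are fixed when the table is created;
        # qualifiers within a family can be added freely per row.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        assert family in self.families, "unknown column family"
        self.rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, as a default Get does."""
        versions = self.rows[row].get((family, qualifier), {})
        return versions[max(versions)] if versions else None

t = ToyHBaseTable(families={"info"})
t.put("user#1", "info", "name", "Ada", ts=1)
t.put("user#1", "info", "name", "Ada L.", ts=2)   # newer version, same cell
print(t.get("user#1", "info", "name"))            # "Ada L."
```

Because cells keep timestamped versions and rows are keyed by a single sortable row key, point lookups and row-range scans stay fast even across billions of rows, which is what makes the region-based sharding above practical.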
CHAPTER VI
CONCLUSION
Each of these databases has its own functionality, works in a different manner, and is useful in
different cases. The Apache Cassandra database is the right choice when you need scalability and
high availability without compromising performance. MongoDB is an open-source document
database that provides high performance, high availability, and automatic scaling. HBase is a
distributed column-oriented database built on top of the Hadoop file system. It provides quick
random access to huge amounts of structured data. It is suitable for Online Analytical Processing.
The Apache Hive data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage and queried using SQL syntax.

