
Big Data Questions

12. Explain GFS and HDFS.


Google File System (GFS) is a scalable distributed file system (DFS) created
by Google Inc. and developed to accommodate Google’s expanding data
processing requirements. GFS provides fault tolerance, reliability, scalability,
availability and performance to large networks and connected nodes. GFS is
made up of several storage systems built from low-cost commodity hardware
components. It is optimized to accommodate Google's different data use and
storage needs, such as its search engine, which generates huge amounts of data
that must be stored.
The Google File System capitalized on the strength of off-the-shelf servers
while minimizing hardware weaknesses.
GFS is also known as GoogleFS.
The features of Google file system are as follows:

1. GFS was designed for high fault tolerance.


2. Master and chunk servers can be restarted in a few seconds and with such a
fast recovery capability, the window of time in which data is unavailable can
be greatly reduced.
3. Each chunk is replicated in at least three places, so the system can tolerate at least two
simultaneous failures for any single chunk of data.
4. The shadow master handles the failure of the GFS master.
5. For data integrity, GFS computes checksums on every 64 KB block in each chunk.
6. GFS achieves the goals of high availability and high performance.
7. It demonstrates how to support large-scale processing workloads on commodity hardware
that is designed to tolerate frequent component failures and optimized for huge files that are
mostly appended to and then read.

HDFS:-
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. It employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
HDFS is a key part of many Hadoop ecosystem technologies, as it provides a
reliable means of managing pools of big data and supporting related big data
analytics applications.

Features of Hadoop HDFS


3.1. Fault Tolerance
Fault tolerance in HDFS refers to how well the system keeps working under unfavorable
conditions and how it handles such situations. HDFS is highly fault-tolerant: data is divided
into blocks, and multiple copies of each block are created on different machines in the
cluster (the number of replicas is configurable). So whenever any machine in the cluster
goes down, a client can still access its data from another machine that holds the same copy
of the data blocks. HDFS also maintains the replication factor by creating replicas of data
blocks on another rack, so if a machine suddenly fails, a user can access the data from
slaves present in another rack.
3.2. High Availability
HDFS is a highly available file system. Data is replicated among the nodes in the HDFS
cluster by creating replicas of the blocks on other slaves in the cluster. Whenever a user
wants to access this data, they can read it from the slave that contains its blocks and is
nearest in the cluster. During unfavorable situations, such as the failure of a node, a user can
still access the data from other nodes, because duplicate copies of the blocks containing the
user's data exist on other nodes in the HDFS cluster.
3.3. Data Reliability
HDFS is a distributed file system that provides reliable data storage and can store data in
the range of hundreds of petabytes. HDFS divides the data into blocks, stores those blocks
on the nodes of the cluster, and creates a replica of each block on other nodes, which
provides fault tolerance. If a node containing data goes down, a user can still access that
data from other nodes that hold a copy of it in the HDFS cluster. By default HDFS creates 3
copies of each data block, so data remains quickly available to users and the risk of data
loss is minimal. Hence HDFS is highly reliable.

3.4. Replication
Data replication is one of the most important and distinctive features of HDFS. In HDFS,
data is replicated to avoid data loss under unfavorable conditions such as the crash of a
node or a hardware failure. Data is divided into blocks that are replicated across a number
of machines in the cluster, and HDFS maintains the replication factor over time, creating
new replicas of user data on different machines when needed. Whenever any machine in the
cluster crashes, the user can access the data from other machines that contain the blocks of
that data, so user data is not lost.
3.5. Scalability
As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when
requirements increase. There are two scalability mechanisms available: vertical scalability –
adding more resources (CPU, memory, disk) to the existing nodes of the cluster – and
horizontal scalability – adding more machines to the cluster. The horizontal approach is
preferred, since the cluster can be scaled from tens of nodes to hundreds of nodes on the fly,
without any downtime.

3.6. Distributed Storage

All of the features above are achieved via distributed storage and replication. HDFS data is
stored in a distributed manner across the nodes of the cluster: data is divided into blocks, the
blocks are stored on the nodes of the cluster, and replicas of each block are created and
stored on other nodes. So if a single machine in the cluster crashes, the data can still be
accessed from the other nodes that contain its replicas.
13. Specify the role of NameNode and DataNode in HDFS.
Namenode
NameNode is the centerpiece of HDFS.
NameNode is also known as the Master
NameNode only stores the metadata of HDFS – the directory tree of all files in the
file system, and tracks the files across the cluster.
NameNode does not store the actual data or the dataset. The data itself is actually
stored in the DataNodes.
NameNode knows the list of the blocks and its location for any given file in
HDFS. With this information NameNode knows how to construct the file from
blocks.
NameNode is so critical to HDFS that when the NameNode is down, the HDFS/Hadoop
cluster is inaccessible and considered down.
NameNode is a single point of failure in a Hadoop cluster.
NameNode is usually configured with a lot of memory (RAM), because the block
locations are held in main memory.
DataNode
DataNode is responsible for storing the actual data in HDFS.
DataNode is also known as the Slave
NameNode and DataNode are in constant communication.
When a DataNode starts up, it announces itself to the NameNode along with the list
of blocks it is responsible for.
When a DataNode is down, it does not affect the availability of data or the cluster; the
NameNode will arrange replication of the blocks managed by the DataNode that is
unavailable.
DataNode is usually configured with a lot of hard disk space, because the actual
data is stored in the DataNodes.
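The division of labour between the two can be seen from a client program. Below is a minimal sketch using the Hadoop FileSystem Java API (the path /user/hadoop/sample.txt is a hypothetical example): the client asks only for metadata, namely the block locations, which is exactly what the NameNode serves, while the block contents themselves live on the DataNodes listed in the result.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/sample.txt"));
        // Metadata held by the NameNode: block offsets, lengths and the DataNodes holding each replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " datanodes=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}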
14. What is HBase? Give a detailed note on its features.
HBase is a distributed column-oriented database built on top of the Hadoop file system.
It is an open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access
to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.

Features of HBase
 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.
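As a small illustration of the random read/write access HBase provides on top of HDFS, here is a hedged sketch using the standard HBase Java client (it assumes a running cluster and an existing table "users" with a column family "info"; all names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}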
15. What is metadata, what information does it provide, and explain the role of the
NameNode in an HDFS cluster.

Metadata means "data about data". Although the "meta" prefix means "after" or
"beyond", it is used to mean "about" in epistemology. Metadata is defined as the data
providing information about one or more aspects of the data; it is used to summarize
basic information about data which can make tracking and working with specific data
easier.

Some examples include:

· Means of creation of the data

· Purpose of the data

· Time and date of creation

· Creator or author of the data

· Location on a computer network where the data was created

· Standards used

· File size

· Data quality

· Source of the data

· Process used to create the data

How does a file get stored?

Suppose a client wants to write a file into HDFS. So, the following steps will be
performed internally during the whole HDFS write process:-

 The client will divide the files into blocks and will send a write request to the
NameNode.

 For each block, the NameNode will provide the client with a list containing the IP addresses
of the DataNodes (depending on the replication factor, 3 by default) where the block has
to be copied eventually.

 The client will copy the first block into the first DataNode and then the other copies
of the block will be replicated by the DataNodes themselves in a sequential manner.
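From the client's point of view, the whole write path above is hidden behind a single call. Below is a minimal sketch with the Hadoop FileSystem Java API (the target path is a hypothetical example): create() contacts the NameNode for the list of DataNodes, and the bytes written are then replicated along the DataNode pipeline.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode where to place the blocks; the DataNodes replicate them
        try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}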

NameNode works as Master in Hadoop cluster. Below listed are the main
function performed by NameNode:
1. Stores metadata of actual data. E.g. Filename, Path, No. of Data Blocks,
Block IDs, Block Location, No. of Replicas, Slave related configuration
2. Manages File system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and
renaming files and directories.
6. As the NameNode keeps metadata in memory for fast retrieval, a huge
amount of memory is required for its operation. It should be hosted on
reliable hardware.

1. What is NoSQL? Advantages of it.


 NoSQL is a non-relational DMS, that does not require a fixed schema,
avoids joins, and is easy to scale. NoSQL database is used for
distributed data stores with humongous data storage needs. NoSQL is
used for Big data and real-time web apps. For example, companies like
Twitter, Facebook, Google that collect terabytes of user data every
single day.
 NoSQL database stands for "Not Only SQL" or "Not SQL." Though a
better term would be "NoREL", NoSQL caught on. Carlo Strozzi introduced the
NoSQL concept in 1998.
 Traditional RDBMS uses SQL syntax to store and retrieve data for
further insights. Instead, a NoSQL database system encompasses a
wide range of database technologies that can store structured, semi-
structured, unstructured and polymorphic data.

Advantages of NoSQL
 Can be used as Primary or Analytic Data Source
 Big Data Capability
 No Single Point of Failure
 Easy Replication
 No Need for Separate Caching Layer
 It provides fast performance and horizontal scalability.
 Can handle structured, semi-structured, and unstructured data with
equal effect
 Uses an object-oriented programming style, which is easy to use and flexible
 NoSQL databases don't need a dedicated high-performance server
 Support Key Developer Languages and Platforms
 Simpler to implement than an RDBMS
 It can serve as the primary data source for online applications.
 Handles big data which manages data velocity, variety, volume, and
complexity
 Excels at distributed database and multi-data center operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without
downtime or service disruption

Types of NoSQL Databases

Key Value Pair Based


Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.

Key-value pair storage databases store data as a hash table where each key
is unique, and the value can be a JSON, BLOB(Binary Large Objects), string,
etc.

For example, a key-value pair may contain a key like "Website" associated
with a value like "Guru99".

It is one of the most basic types of NoSQL databases. This kind of NoSQL
database is used for collections, dictionaries, associative arrays, etc. Key-value
stores help the developer store schema-less data; they work well, for example, for
shopping cart contents.

Redis, Amazon Dynamo and Riak are some examples of key-value store databases;
Dynamo and Riak are based on Amazon's Dynamo paper.
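A minimal sketch of the key/value idea in Java, using the Jedis client for Redis (it assumes a Redis server running locally on the default port; the key and value mirror the "Website"/"Guru99" example above):

import redis.clients.jedis.Jedis;

public class KeyValueDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("Website", "Guru99");        // store a key/value pair
            String value = jedis.get("Website");   // look the value up by its key
            System.out.println("Website = " + value);
        }
    }
}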
Column-based
Column-oriented databases work on columns and are based on the BigTable
paper by Google. Every column is treated separately, and the values of a single
column are stored contiguously.

They deliver high performance on aggregation queries like SUM, COUNT, AVG,
MIN, etc., because the data is readily available in a column.

Column-based NoSQL databases are widely used for data warehouses, business
intelligence, CRM and library card catalogs.

HBase, Cassandra and Hypertable are examples of column-based databases.

Document-Oriented:
Document-oriented NoSQL databases store and retrieve data as a key-value pair,
but the value part is stored as a document. The document is stored in JSON or
XML format. The value is understood by the DB and can be queried.

Relational vs. Document

In a relational database you have rows and columns, and you have to know in
advance what columns you have. In a document database, each record is stored as
a JSON-like document; you do not have to define the schema up front, which makes
it flexible.

The document type is mostly used for CMS systems, blogging platforms, real-time
analytics and e-commerce applications. It should not be used for complex
transactions that require multiple operations or queries against varying aggregate
structures.

Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMSs.
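A hedged sketch of the document model in Java, using the MongoDB sync driver (it assumes a local mongod on the default port; the database, collection and field names are illustrative). Note that the whole record is inserted as one JSON-like document without declaring any columns up front:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");
            MongoCollection<Document> products = db.getCollection("products");
            Document doc = new Document("name", "keyboard")        // nested, schema-less record
                    .append("price", 24.99)
                    .append("tags", java.util.Arrays.asList("electronics", "accessories"));
            products.insertOne(doc);
            Document found = products.find(eq("name", "keyboard")).first();
            System.out.println(found.toJson());
        }
    }
}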

Graph-Based
A graph database stores entities as well as the relations amongst those entities.
An entity is stored as a node and a relationship as an edge. An edge gives a
relationship between nodes, and every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph
database is multi-relational in nature. Traversing relationships is fast because they
are already captured in the DB and there is no need to calculate them.

Graph databases are mostly used for social networks, logistics and spatial data.

Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based
databases.
2. Difference between SQL and NoSQL.

Key Differences between SQL and NoSQL
 SQL databases are relational; NoSQL databases are non-relational.
 SQL databases use a fixed, predefined schema; NoSQL databases have dynamic schemas for unstructured data.
 SQL databases scale vertically (bigger servers); NoSQL databases scale horizontally (more servers).
 SQL databases are table-based; NoSQL databases are key-value, document, column-family or graph stores.
 SQL databases emphasize ACID properties; many NoSQL databases relax consistency (eventual consistency) in favor of availability and scale.
 SQL is a better fit for complex, multi-row transactions; NoSQL is a better fit for huge volumes of rapidly changing, semi-structured or unstructured data.

3. Explain in detail interacting with the Hadoop ecosystem.

Hadoop Ecosystem Components


The Hadoop ecosystem consists of a number of components that make Hadoop so
powerful and that have created several Hadoop job roles. The main Hadoop
ecosystem components are HDFS and its components, MapReduce, YARN, Hive,
Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill,
Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie.
Hadoop Ecosystem and Its Components

The main Hadoop ecosystem components are discussed below, one by one, in detail.

2.1. Hadoop Distributed File System


It is the most important component of the Hadoop ecosystem. HDFS is the
primary storage system of Hadoop. The Hadoop Distributed File System (HDFS) is
a Java-based file system that provides scalable, fault-tolerant, reliable and
cost-efficient data storage for big data. HDFS is a distributed file system that
runs on commodity hardware. The default HDFS configuration is sufficient for many
installations, but large clusters usually need tuning. Users interact with HDFS
directly through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS- NameNode and
DataNode. Let’s now discuss these Hadoop HDFS Components-

i. NameNode
It is also known as the Master node. The NameNode does not store actual data or
datasets; it stores metadata, i.e. the number of blocks, their locations, on
which rack and on which DataNode the data is stored, and other details. The
namespace it manages consists of files and directories.
Tasks of HDFS NameNode
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system namespace operations such as naming, opening and closing files and
directories.
ii. DataNode
It is also known as the Slave. The HDFS DataNode is responsible for storing actual data
in HDFS. The DataNode performs read and write operations as per the requests of
the clients. Each block replica on a DataNode consists of two files on the local file
system: the first file holds the data and the second records the block's metadata, which
includes checksums for the data. At startup, each DataNode connects to its
corresponding NameNode and performs a handshake that verifies the namespace ID and
software version of the DataNode. If a mismatch is found, the DataNode shuts down
automatically.
Tasks of HDFS DataNode
 DataNode performs operations like block replica creation, deletion, and
replication according to the instruction of NameNode.
 DataNode manages data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.

4. List and explain the HDFS commands.

 fsck
HDFS Command to check the health of the Hadoop file system.

Command: hdfs fsck /

 ls
HDFS Command to display the list of Files and Directories in HDFS.

Command: hdfs dfs -ls /

 mkdir
HDFS Command to create the directory in HDFS.

Usage: hdfs dfs -mkdir /directory_name

 touchz
HDFS Command to create a file in HDFS with file size 0 bytes.

Usage: hdfs dfs -touchz /directory/filename

 du
HDFS Command to check the file size.

Usage: hdfs dfs -du -s /directory/filename


 cat
HDFS Command that reads a file on HDFS and prints the content of that file to the
standard output.

Usage: hdfs dfs -cat /path/to/file_in_hdfs

 text
HDFS Command that takes a source file and outputs the file in text format.

Usage: hdfs dfs -text /directory/filename

 copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.

Usage: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>

 copyToLocal
HDFS Command to copy the file from HDFS to Local File System.

Usage: hdfs dfs -copyToLocal <hdfs source> <localdst>

 put
HDFS Command to copy single source or multiple sources from local file system to the
destination file system.

Usage: hdfs dfs -put <localsrc> <destination>

 get
HDFS Command to copy files from hdfs to the local file system.

Usage: hdfs dfs -get <src> <localdst>

 count
HDFS Command to count the number of directories, files, and bytes under the paths that
match the specified file pattern.

Usage: hdfs dfs -count <path>

 rm
HDFS Command to remove the file from HDFS.
Usage: hdfs dfs -rm <path>

 rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.

Usage: hdfs dfs -rm -r <path>

 cp
HDFS Command to copy files from source to destination. This command allows multiple
sources as well, in which case the destination must be a directory.

Usage: hdfs dfs -cp <src> <dest>

 mv
HDFS Command to move files from source to destination. This command allows multiple
sources as well, in which case the destination needs to be a directory.

Usage: hdfs dfs -mv <src> <dest>

 expunge
HDFS Command that makes the trash empty.

Command: hdfs dfs -expunge

 rmdir
HDFS Command to remove the directory.

Usage: hdfs dfs -rmdir <path>

Command: hdfs dfs -rmdir /user/hadoop

 usage
HDFS Command that returns the help for an individual command.

Usage: hdfs dfs -usage <command>

 help
HDFS Command that displays help for given command or all commands if none is
specified.
Command: hdfs dfs -help
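The same kinds of operations can also be performed programmatically through the Hadoop FileSystem Java API. A minimal sketch (all paths are illustrative), with the roughly equivalent shell command noted next to each call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(new Path("/user/hadoop/demo"));                      // hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),               // hdfs dfs -copyFromLocal
                new Path("/user/hadoop/demo/local.txt"));
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop/demo"))) {   // hdfs dfs -ls
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.delete(new Path("/user/hadoop/demo"), true);                // hdfs dfs -rm -r
        fs.close();
    }
}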

5. Limitations of Hadoop 1.0

Hadoop 1.x has the following Limitations/Drawbacks:

 It is only suitable for Batch Processing of Huge amount of Data, which is


already in Hadoop System.
 It is not suitable for Real-time Data Processing.
 It is not suitable for Data Streaming.
 It supports up to 4,000 nodes per cluster.
 It has a single component : JobTracker to perform many activities like
Resource Management, Job Scheduling, Job Monitoring, Re-scheduling Jobs
etc.
 JobTracker is the single point of failure.
 It does not support multi-tenancy.
 It supports only one Name Node and One Namespace per Cluster.
 It does not support Horizontal Scalability.
 It runs only Map/Reduce jobs.
 It follows a slot concept in MapReduce to allocate resources (memory, CPU).
It has static map and reduce slots, which means that once resources are assigned to
map/reduce tasks, they cannot be re-used even though some slots are idle.

For example: suppose 10 map and 10 reduce tasks are running with 10 + 10 slots
to perform a computation. All the map tasks are doing their work, but all the
reduce slots are idle; we cannot use these idle reduce slots for any other purpose.
6. Explain about the 5 V's.

The five V's are Velocity, Volume, Value, Variety, and Veracity.

Velocity

First let's talk about velocity. Velocity refers to the speed at which
vast amounts of data are being generated, collected and analyzed. Every day
the number of emails, Twitter messages, photos, video clips, etc. increases at
lightning speed around the world. Every second of every day, data is
increasing. Not only must it be analyzed, but the speed of transmission and
access to the data must also remain nearly instantaneous to allow for real-time
access to websites, credit card verification and instant messaging. Big data
technology now allows us to analyze the data while it is being generated,
without ever putting it into databases.

Volume

Volume refers to the incredible amounts of data generated each second from
social media, cell phones, cars, credit cards, M2M sensors, photographs,
video, etc. The amounts of data have, in fact, become so large that we
can no longer store and analyze the data using traditional database
technology. We now use distributed systems, where parts of the data are
stored in different locations and brought together by software. With just
Facebook alone there are 10 billion messages, 4.5 billion times that the “like”
button is pressed, and over 350 million new pictures are uploaded every
day. Collecting and analyzing this data is clearly an engineering challenge of
immensely vast proportions.

Value

When we talk about value, we’re referring to the worth of the data being
extracted. Having endless amounts of data is one thing, but unless it can be
turned into value it is useless. While there is a clear link between data and
insights, this does not always mean there is value in Big Data. The most
important part of embarking on a big data initiative is to understand the costs
and benefits of collecting and analyzing the data to ensure that ultimately the
data that is reaped can be monetized.

Variety

Variety refers to the different types of data we can now use. Data today
looks very different from data in the past. We no longer have only
structured data (name, phone number, address, financials, etc.) that fits
neatly into a data table. Much of today's data is unstructured; in fact, around 80% of
the world's data falls into this category, including photos, video sequences,
social media updates, etc. New and innovative big data technology now
allows structured and unstructured data to be harvested, stored, and used
simultaneously.

Veracity

Last, but certainly not least, there is veracity. Veracity is the quality or
trustworthiness of the data. Just how accurate is all this data? For example,
think about all the Twitter posts with hashtags, abbreviations, typos, etc., and
the reliability and accuracy of all that content. Gleaning loads and loads of
data is of no use if it is not accurate or trustworthy. Another
good example relates to the use of GPS data. Often the GPS will "drift"
off course as you move through an urban area, because satellite signals are lost as
they bounce off tall buildings or other structures. When this happens, location
data has to be fused with another data source, such as road data or data from an
accelerometer, to provide accurate positions.
7. Write about the challenges with big data.

1. Dealing with data growth


The most obvious challenge associated with big data is simply storing and analyzing all
that information. In its Digital Universe report, IDC estimates that the amount of
information stored in the world's IT systems is doubling about every two years. By 2020,
the total amount will be enough to fill a stack of tablets that reaches from the earth to
the moon 6.6 times. And enterprises have responsibility or liability for about 85 percent
of that information.

Much of that data is unstructured, meaning that it doesn't reside in a database.


Documents, photos, audio, videos and other unstructured data can be difficult to search
and analyze.

It's no surprise, then, that the IDG report found, "Managing unstructured data is growing
as a challenge – rising from 31 percent in 2015 to 45 percent in 2016."

In order to deal with data growth, organizations are turning to a number of


different technologies. When it comes to storage, converged and hyperconverged
infrastructure and software-defined storage can make it easier for companies to scale
their hardware. And technologies like compression, deduplication and tiering can
reduce the amount of space and the costs associated with big data storage.

On the management and analysis side, enterprises are using tools like NoSQL
databases, Hadoop, Spark, big data analytics software, business intelligence
applications, artificial intelligence and machine learning to help them comb through their
big data stores to find the insights their companies need.

2. Generating insights in a timely manner


Of course, organizations don't just want to store their big data — they want to use that
big data to achieve business goals. According to the NewVantage Partners survey, the
most common goals associated with big data projects included the following:

1. Decreasing expenses through operational cost efficiencies


2. Establishing a data-driven culture
3. Creating new avenues for innovation and disruption
4. Accelerating the speed with which new capabilities and services are deployed
5. Launching new product and service offerings

All of those goals can help organizations become more competitive — but only if they
can extract insights from their big data and then act on those insights quickly. PwC's
Global Data and Analytics Survey 2016 found, "Everyone wants decision-making to be
faster, especially in banking, insurance, and healthcare."

To achieve that speed, some organizations are looking to a new generation of ETL
and analytics tools that dramatically reduce the time it takes to generate reports. They
are investing in software with real-time analytics capabilities that allows them to
respond to developments in the marketplace immediately.

3. Recruiting and retaining big data talent


But in order to develop, manage and run those applications that generate insights,
organizations need professionals with big data skills. That has driven up demand for big
data experts — and big data salaries have increased dramatically as a result.

The 2017 Robert Half Technology Salary Guide reported that big data engineers were
earning between $135,000 and $196,000 on average, while data scientist salaries
ranged from $116,000 to $163,500. Even business intelligence analysts were very well
paid, making $118,000 to $138,750 per year.

In order to deal with talent shortages, organizations have a couple of options. First,
many are increasing their budgets and their recruitment and retention efforts. Second,
they are offering more training opportunities to their current staff members in an attempt
to develop the talent they need from within. Third, many organizations are looking to
technology. They are buying analytics solutions with self-service and/or machine
learning capabilities. Designed to be used by professionals without a data science
degree, these tools may help organizations achieve their big data goals even if they do
not have a lot of big data experts on staff.

4. Integrating disparate data sources


The variety associated with big data leads to challenges in data integration. Big data
comes from a lot of different places — enterprise applications, social media streams,
email systems, employee-created documents, etc. Combining all that data and
reconciling it so that it can be used to create reports can be incredibly difficult. Vendors
offer a variety of ETL and data integration tools designed to make the process easier,
but many enterprises say that they have not solved the data integration problem yet.
In response, many enterprises are turning to new technology solutions. In the IDG
report, 89 percent of those surveyed said that their companies planned to invest in new
big data tools in the next 12 to 18 months. When asked which kind of tools they were
planning to purchase, integration technology was second on the list, behind data
analytics software.

5. Validating data
Closely related to the idea of data integration is the idea of data validation. Often
organizations are getting similar pieces of data from different systems, and the data in
those different systems doesn't always agree. For example, the ecommerce system
may show daily sales at a certain level while the enterprise resource planning (ERP)
system has a slightly different number. Or a hospital's electronic health record (EHR)
system may have one address for a patient, while a partner pharmacy has a different
address on record.

The process of getting those records to agree, as well as making sure the records are
accurate, usable and secure, is called data governance. And in the AtScale 2016 Big
Data Maturity Survey, the fastest-growing area of concern cited by respondents was
data governance.

Solving data governance challenges is very complex and usually requires a


combination of policy changes and technology. Organizations often set up a group of
people to oversee data governance and write a set of policies and procedures. They
may also invest in data management solutions designed to simplify data governance
and help ensure the accuracy of big data stores — and the insights derived from them.

6. Securing big data


Security is also a big concern for organizations with big data stores. After all, some big
data stores can be attractive targets for hackers or advanced persistent threats (APTs).

However, most organizations seem to believe that their existing data security
methods are sufficient for their big data needs as well. In the IDG survey, less than half
of those surveyed (39 percent) said that they were using additional security measures for
their big data repositories or analyses. Among those who do use additional measures,
the most popular include identity and access control (59 percent), data encryption (52
percent) and data segregation (42 percent).

7. Organizational resistance
It is not only the technological aspects of big data that can be challenging — people can
be an issue too.

In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms
were committed to creating a data-driven culture, but only 37.1 percent said they had
been successful with those efforts. When asked about the impediments to that culture
shift, respondents pointed to three big obstacles within their organizations:
 Insufficient organizational alignment (4.6 percent)
 Lack of middle management adoption and understanding (41.0 percent)
 Business resistance or lack of understanding (41.0 percent)

In order for organizations to capitalize on the opportunities offered by big data, they are
going to have to do some things differently. And that sort of change can be
tremendously difficult for large organizations.

The PwC report recommended, "To improve decision-making capabilities at your


company, you should continue to invest in strong leaders who understand data’s
possibilities and who will challenge the business."

One way to establish that sort of leadership is to appoint a chief data officer, a step that
NewVantage Partners said 55.9 percent of Fortune 1000 companies have taken. But
with or without a chief data officer, enterprises need executives, directors and
managers who are going to commit to overcoming their big data challenges, if they
want to remain competitive in the increasingly data-driven economy.

8. Explain the various operational modes of Hadoop cluster configuration.


Different Hadoop Modes

1. Local Mode or Standalone Mode


Standalone mode is the default mode in which Hadoop runs. Standalone mode is mainly
used for debugging, where you don't really use HDFS.
You can use the local file system for both input and output in standalone mode.
You also don't need to do any custom configuration in the files mapred-site.xml,
core-site.xml and hdfs-site.xml.
Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for
all input and output. Here is the summarized view of standalone mode:
• Used for debugging purposes
• HDFS is not used
• Uses the local file system for input and output
• No need to change any configuration files
• The default Hadoop mode

2. Pseudo-distributed Mode
The pseudo-distributed mode is also known as a single-node cluster, where both the
NameNode and DataNode reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node.
Such a configuration is mainly used for testing, when we don't need to think about the
resources and other users sharing the resources.
In this architecture, a separate JVM is spawned for every Hadoop component, and they
communicate across network sockets, effectively producing a fully functioning
and optimized mini-cluster on a single host.
Here is the summarized view of pseudo-distributed mode:
• A single-node Hadoop deployment is considered pseudo-distributed mode
• All the master and slave daemons run on the same node
• Mainly used for testing purposes
• The replication factor is ONE for blocks
• Changes are required in all three configuration files: mapred-site.xml,
core-site.xml and hdfs-site.xml (a minimal example is sketched below)
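A minimal sketch of what those single-node settings amount to, expressed programmatically through Hadoop's Configuration class (the URI and values are illustrative; normally they would live in core-site.xml and hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PseudoDistributedConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // normally loads the *-site.xml files
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // core-site.xml: single-node NameNode
        conf.set("dfs.replication", "1");                     // hdfs-site.xml: one replica per block
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}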

3. Fully-Distributed Mode (Multi-Node Cluster)


This is the production mode of Hadoop, where multiple nodes are running. Here data is
distributed across several nodes and processing is done on each node. Master and slave
services run on separate nodes in fully-distributed Hadoop mode.
• Production phase of Hadoop
• Separate nodes for master and slave daemons
• Data is distributed and used across multiple nodes
In Hadoop development, each Hadoop mode has its own benefits and drawbacks.
Fully distributed mode is certainly the one Hadoop is mainly known for, but there is no
point in engaging all those resources during the testing or debugging phase, so the
standalone and pseudo-distributed Hadoop modes also have their own significance.
9. Explain the syntax of Pig programming with a suitable example.

Every Pig program consists of three parts: loading, transforming and dumping the data.

Below is a sample data-load command (the file path, delimiter and field names are placeholders).
We provide the file location, which can be a directory or a specific file, and we select the load
function through which data is parsed from the file. The PigStorage function parses each line in
the file and splits the data based on the argument provided with the function to generate the
fields. We provide the schema (field names with data types) in the load function after the
keyword 'as'.

data = LOAD '/path/to/input' USING PigStorage(',') AS (name:chararray, value:int);
modified_data = GROUP data BY name;
counts = FOREACH modified_data GENERATE group, COUNT(data);

We can either dump the processed data or store it in a file, based upon the requirements.
Using the DUMP operator, the processed data is displayed on the standard output; STORE
writes it to a path (the output path below is a placeholder).

STORE counts INTO '/path/to/output';
DUMP counts;

User Defined Functions

This is an important feature of the Pig Latin language. It allows the present
evaluation functions and mathematical functions to be extended with custom,
user-written functions. Users can write custom functions in Java, Python, Jython,
Ruby, Groovy and JavaScript. Java functions are written by extending the
evaluation function class, and the resulting scripts or jars have to be added to the
Pig library using the 'register' command.

In the case of Java, the required data types are imported from the respective
classes and the custom function is written by extending the respective class. In the
case of Jython, the script is registered using Jython, which imports what is required
to interpret the Jython script. The output schema for every function is specified so
that Pig can parse the data. The same goes for JavaScript. In the case of Ruby, the
'pigudf' library is extended and JRuby is used to register the script. In the case of a
Python UDF, the Python command line is used, and the data is streamed in and out
to execute the script.

Executing Pig programs

A Pig program can be executed in three ways. We can write a Pig script file
containing all the commands and execute it from the command line. We can use
the interactive shell, Grunt, to execute commands line by line; it can also run
scripts using the run or exec commands. Finally, we can execute the required
commands from any program by using the PigRunner class, which provides access
to run the commands programmatically.

To run the program as a script, run the following command (the script name is a
placeholder). The script can be stored in HDFS, from where it can be distributed to
other machines when the program runs in cluster mode.

$ pig wordcount.pig

10. Types of data types in Hive

Types of Data Types in Hive


Hive data types are mainly classified into 5 major categories; let's discuss
them one by one:

a. Primitive Data Types in Hive


Primitive data types are further divided into 4 types, which are as follows:

 Numeric Data Type


 Date/Time Data Type
 String Data Type
 Miscellaneous Data Type

i. Numeric Data Type


The Hive numeric data types are classified into two types:
 Integral Data Types
 Floating Data Types
* Integral Data Types
Integral Hive data types are as follows-
 TINYINT (1-byte (8 bit) signed integer, from -128 to 127)
 SMALLINT (2-byte (16 bit) signed integer, from -32, 768 to 32, 767)
 INT (4-byte (32-bit) signed integer, from –2,147,483,648to
2,147,483,647)
 BIGINT (8-byte (64-bit) signed integer, from –
9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
Floating Data Types
Floating Hive data types are as follows-
 FLOAT (4-byte (32-bit) single-precision floating-point number)
 DOUBLE (8-byte (64-bit) double-precision floating-point number)
 DECIMAL (Arbitrary-precision signed decimal number)

ii. Date/Time Data Type


The second category of Apache Hive primitive data type is Date/Time data
types. The following Hive data types comes into this category-

 TIMESTAMP (Timestamp with nanosecond precision)


 DATE (date)
 INTERVAL
iii. String Data Type
String data types are the third category under Hive data types. Below are the
data types that come into this-

 STRING (Unbounded variable-length character string)


 VARCHAR (Variable-length character string)
 CHAR (Fixed-length character string)
iv. Miscellaneous Data Type
Miscellaneous data types has two types of Hive data types-
 BOOLEAN (True/false value)
 BINARY (Byte array)

b. Complex Data Types in Hive


In this category of Hive data types following data types are come-

 Array
 MAP
 STRUCT
 UNION
i. ARRAY
An ordered collection of fields. The fields must all be of the same type.
Syntax: ARRAY<data_type>
E.g. array (1, 2)
ii. MAP
An unordered collection of key-value pairs. Keys must be primitives; values
may be any type. For a particular map, the keys must be the same type, and
the values must be the same type.
Syntax: MAP<primitive_type, data_type>
E.g. map('a', 1, 'b', 2).
iii. STRUCT
A collection of named fields. The fields may be of different types.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
E.g. struct('a', 1, 1.0), named_struct('col1', 'a', 'col2', 1, 'col3', 1.0)
iv. UNION
A value that may be one of a number of defined data. The value is tagged with
an integer (zero-indexed) representing its data type in the union.
Syntax: UNIONTYPE<data_type, data_type, ...>
E.g. create_union(1, 'a', 63)
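As a hedged illustration of how these complex types are declared and queried, here is a small Java sketch that sends HiveQL over JDBC (it assumes a HiveServer2 instance at localhost:10000; the table and column names are made up for the example):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveComplexTypesDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // Hive JDBC driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS employee_demo ("
                    + " name STRING,"
                    + " skills ARRAY<STRING>,"
                    + " scores MAP<STRING, INT>,"
                    + " address STRUCT<city:STRING, zip:INT>)");
            // Element, key and field access for the complex types
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, skills[0], scores['math'], address.city FROM employee_demo")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getString(2)
                            + " " + rs.getInt(3) + " " + rs.getString(4));
                }
            }
        }
    }
}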
c. Column Data Types in Hive
Column Hive data types are further divided into 6 categories:

 Integral Type
 Strings
 Timestamp
 Dates
 Decimals
 Union Types
Let us discuss these Hive Column data types one by one-
Hive Column Data Types
i. Integral type
In this category of Hive data types following 4 data types are come-

 TINYINT
 SMALLINT
 INT/INTEGER
 BIGINT
By default, integral literals are assumed to be INT. When the data range
exceeds the range of INT, we need to use BIGINT. If the data range is smaller
than that of INT, we use SMALLINT, and TINYINT is smaller than SMALLINT.
Type       Postfix   Example
TINYINT    Y         100Y
SMALLINT   S         100S
BIGINT     L         100L

ii. Strings
The string data types in Hive can be specified with either single quotes (') or
double quotes ("). Apache Hive uses C-style escaping within the strings.

Data Type   Length
VARCHAR     1 to 65535
CHAR        255

* VARCHAR
VARCHAR Hive data types are created with a length specifier (between 1 and
65535), which defines the maximum number of characters allowed in the
character string.
* CHAR
CHAR Hive data types are similar to VARCHAR, but they are fixed-length,
meaning that values shorter than the specified length are padded with spaces;
the trailing spaces are not significant during comparisons. 255 is the
maximum fixed length.
iii. Timestamp
Hive supports the traditional UNIX timestamp with optional nanosecond
precision. Timestamps in text files are interpreted using the format
"yyyy-mm-dd hh:mm:ss[.fffffffff]".

iv. Dates
DATE values are described in a particular year/month/day (YYYY-MM-DD)
format, e.g. DATE '2017-01-01'. These types don't have a time-of-day
component. The supported range of values is 0000-01-01 to 9999-12-31.

v. Decimals
In Hive, the DECIMAL type is similar to Java's BigDecimal format; it
represents immutable arbitrary-precision decimal numbers.
In Apache Hive 0.11 and 0.12 the precision of the DECIMAL type is fixed and
limited to 38 digits.
From Apache Hive 0.13, users can specify scale and precision when creating tables
with the DECIMAL data type using the DECIMAL(precision, scale) syntax. If the
scale is not specified, it defaults to 0 (no fractional digits); if no precision
is specified, it defaults to 10. The syntax and an example are below:

CREATE TABLE foo (
  a DECIMAL,        -- defaults to DECIMAL(10,0)
  b DECIMAL(9, 7)
);

vi. Union Types


Union Hive data types are collections of heterogeneous data types. Using
create_union we can create an instance. The syntax and an example are
below:
CREATE TABLE union_test(foo UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
SELECT foo FROM union_test;
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
d. Literals Data Types in Hive
In Hive, the following literal types are used:

Apache Hive Literals Data Types


 Floating Point Types
 Decimal Type
i. Floating Point Types
Floating point literals are simply numbers with decimal points. This type of data
is treated as the DOUBLE data type in Hive.

ii. Decimal Type


This type is a floating-point value with a higher range than the DOUBLE data
type; the decimal type range is approximately -10^-308 to 10^308.
e. Null Value Data Types in Hive
In this Hive Data Types, missing values are represented by the special
value NULL.

11. Role of driver code, mapper code and reducer code in MapReduce.

Mapper code:-
 We have created a class Map that extends the class Mapper which is already

defined in the MapReduce Framework.


 We define the data types of input and output key/value pair after the class
declaration using angle brackets.
 Both the input and output of the Mapper is a key/value pair.
 Input:
o The key is nothing but the offset of each line in the text file: LongWritable
o The value is each individual line of the input: Text
 Output:
o The key is the tokenized words: Text
o We have the hardcoded value in our case which is 1: IntWritable
o Example – Dear 1, Bear 1, etc.
 We have written Java code (sketched below) that tokenizes each word and assigns
it a hardcoded value equal to 1.
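A sketch of the Map class described above (the classic word-count mapper; the package declaration is omitted):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: line offset (LongWritable) / line text (Text)
// Output key/value: word (Text) / hardcoded count of 1 (IntWritable)
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // e.g. (Dear, 1), (Bear, 1), ...
        }
    }
}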

Reducer code:-

 We have created a class Reduce which extends class Reducer like that of Mapper.
 We define the data types of input and output key/value pair after the class
declaration using angle brackets as done for Mapper.
 Both the input and the output of the Reducer is a key-value pair.
 Input:
o The key nothing but those unique words which have been generated after
the sorting and shuffling phase: Text
o The value is a list of integers corresponding to each key: IntWritable
o Example – Bear, [1, 1], etc.
 Output:
o The key is all the unique words present in the input text file: Text
o The value is the number of occurrences of each of the unique
words: IntWritable
o Example – Bear, 2; Car, 3, etc.
 We aggregate the values present in the list corresponding to each key and
produce the final answer (see the sketch below).
 The reduce() method is called once for each unique word; the number of reducers
for the job can be configured (for example in mapred-site.xml).
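A matching sketch of the Reduce class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: word (Text) / list of 1s (IntWritable) after shuffle and sort, e.g. (Bear, [1, 1])
// Output: word (Text) / total count (IntWritable), e.g. (Bear, 2)
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // aggregate the 1s for this word
        }
        context.write(key, new IntWritable(sum));
    }
}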

Driver code:-

 In the driver class, we set the configuration of our MapReduce job to run in
Hadoop.
 We specify the name of the job, and the data types of the input/output of the mapper and
reducer.
 We also specify the names of the mapper and reducer classes.
 The paths of the input and output folders are also specified.
 The method setInputFormatClass() is used to specify how the Mapper will read
the input data, i.e. what the unit of work will be. Here, we have chosen
TextInputFormat so that a single line is read by the mapper at a time from the input
text file.
 The main() method is the entry point for the driver. In this method, we instantiate
a new Configuration object for the job (see the sketch below).
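A sketch of the driver that ties the Map and Reduce classes above together (the input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // new Configuration object for the job
        Job job = Job.getInstance(conf, "word count");       // job name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(Map.class);                       // mapper and reducer classes
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);                   // output key/value data types
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);      // one line per map() call
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output folder (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}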

MST UNIT 3 & 4


1. What is machine learning? Techniques.

Machine learning is a subset of artificial intelligence which focuses mainly on machines

learning from their experience and making predictions based on that experience. It enables
computers or machines to make data-driven decisions rather than being explicitly
programmed to carry out a certain task. These programs or algorithms are designed
in such a way that they learn and improve over time as they are exposed to new data.
Machine Learning Technique #1: Regression

If you’re looking for a great conversation starter at the next party you go to, you could
always start with “You know, machine learning is not so new; why, the concept
of regression was first described by Francis Galton, Charles Darwin’s half cousin, all the
way back in 1875”. Of course, it will probably be the last party you get an invite to for a
while.
But the concept is simple enough. Francis Galton was looking at the sizes of sweet peas
over many generations. We know that if you selectively breed peas for size, you can get
larger ones.

But if you let nature take its course, you see a variety of sizes. Eventually, even bigger
peas will produce smaller offspring and “regress to the mean”. Basically, there’s a typical
size for a pea and although things vary, they don’t “stay varied” (as long as you don’t
selectively breed).

The same principle applies to monkeys picking stocks. On more than one occasion there
have been stock-picking competitions (WSJ has done them, for example) where a
monkey will beat the pros. Great headline. But what happens next year or the year after
that? Chances are that monkey, which is just an entertaining way of illustrating
“random,” will not do so well. Put another way, its performance will eventually regress to
the mean.

What this means is that in this simple situation, you can predict what the next result will
be (with some kind of error). The next generation of pea will be the average size, with
some variability or uncertainty (accommodating smaller and larger peas). Of course, in
the real world things are a little more complicated than that.
In a typical regression scatter plot, we don't have a single mean value like pea size. We've
got a straight line with a slope, and two values to work with instead of just one. Instead of
variability around a single value, here we've got variability in a two-dimensional plane around
an underlying line.

The individual data points scatter around a best-fit line, and based on that line you could
make a prediction about what would happen if, say, the next data point was a 70 on the X
axis. (That prediction would not be a single definitive value, but rather a projected value
with some degree of uncertainty, just like for the pea sizes we looked at earlier.)

Regression algorithms are used to make predictions about numbers. For example, with
more data, we can:

1. Make predictions about customer lifetime value, perhaps spotting potentially


valuable customers before they have declared themselves by the volume of
their purchases
2. Predict the optimal pricing for a product to maximize revenue or profit
3. Predict house prices, for companies that want to send out those property
newsletters
The straight line in the graph is an example of linear regression, but looking at those
three examples above, I’d be surprised if any of them fit well to a straight line. And in
fact, the underlying line behind your data doesn’t have to be straight. It could be an
exponential, a sine wave or some arbitrary curve. And there are algorithms and
techniques to find the best fit to the underlying data no matter what shape the
underlying line is.

Furthermore, I’ve given you a two-dimensional diagram there. If you were trying to
predict house prices, for example, you’d include many more factors than just two: size,
number of rooms, school scores, recent sales, size of garden, age of house and more.

Finally, perhaps my favorite example of regression is this approach to measuring the


quality of Bordeaux wine.
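To make the idea of a best-fit line concrete, here is a minimal least-squares sketch in Java with made-up numbers (it is not tied to any of the data sets mentioned above):

public class SimpleLinearRegression {
    public static void main(String[] args) {
        // Toy data: x values and observed y values (illustrative only)
        double[] x = {10, 20, 30, 40, 50, 60};
        double[] y = {12, 19, 33, 41, 48, 61};

        double meanX = mean(x), meanY = mean(y);
        double covXY = 0, varX = 0;
        for (int i = 0; i < x.length; i++) {
            covXY += (x[i] - meanX) * (y[i] - meanY);
            varX  += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = covXY / varX;                // least-squares slope
        double intercept = meanY - slope * meanX;   // least-squares intercept

        System.out.printf("best-fit line: y = %.3f * x + %.3f%n", slope, intercept);
        System.out.printf("prediction at x = 70: %.3f%n", slope * 70 + intercept);
    }

    private static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }
}

The prediction at x = 70 is, as discussed above, only a projected value; the spread of the points around the fitted line is what determines its uncertainty.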
Machine Learning Technique #2: Classification
Let’s move on to classification. And now I want you to pretend you're back in preschool
and I'll play the role of teacher trying hard to teach a room of children about fruit
(presumably fruit-hating children if they've got to this age without knowing what a
banana is).

While you kids don't know about fruit, the good news for you is that I do. You don’t
have to guess (at least initially). I’m going to show you lots of pieces of fruit and tell you
what each one is. And so, like children in a preschool, you will learn how to classify fruit.
You’ll look at things like size, color, taste, firmness, smell, shape and whatever else
strikes your fancy as you attempt to figure out what it is that makes an apple, an apple,
as opposed to a banana.

Once I've gone through 70 percent to 80 percent of the basket, we can move onto the
next stage. I’ll show you a fruit (that I have already identified) and ask you “What is it?”
Based on the learning you’ve done, you should be able to classify that new fruit
correctly.

In fact, by testing you on fruit that I’ve already classified correctly, I can see how well
you’ve learned. If you do a good job, then you’re ready for some real work which in a
non-kindergarten situation, would mean deploying that trained model into production.
If of course the results of the test weren’t good enough that would mean the model
wasn’t ready. Perhaps we need to start again with more data, better data, or a different
algorithm.

We call this approach “supervised learning” because we’re testing your ability to get the
right answers, and we have got lots of correct examples to work with since we have a
whole basket that has been correctly classified.

That idea of using part of the basket for training and the rest for testing is also
important. That's how techniques like this make sure that the training worked or,
alternatively, that the training didn't work and a new approach is needed.

Note that the basket of fruit we worked with had only four kinds of fruit: apples,
bananas, strawberries and oranges. So, if you were presented with a cherry to classify,
the result would be somewhat
unpredictable. It would depend what the algorithm found to be important in
differentiating the others. The point here of course, is that if you want to recognize
cherries then the model should be trained on them.

Picture a chart showing a data set that has been grouped into two different classes,
separated by a dividing line, with a few outliers: colored points that fall on the wrong side
of the line. This emphasizes the point that these algorithms aren't magic and may not get
everything right. It could also be the case that with different approaches or algorithms, we
could do a better job classifying these data points and identifying them correctly.

Summarizing the previous entry, classification enables you to find membership in


a known class. Examples of known classes? Let’s go back to customer segmentation. I
know who my high-value customers are today. What did they look like some time ago?
By using them as a training class, I could train a model to spot a valuable customer
earlier.
Another example is customer churn. We know who’s left us. Let’s train a model on that
class and then see if we do a better job of spotting other churners before they churn.
This kind of approach is what triggers those unexpected offers from companies who
think you are about to leave them.

Insurance companies pay out on claims and they've got a historical set of claims that
they have already classified into "good claims" and ones that need "further
investigation". Train a classification algorithm on all those old claims, and perhaps you
can do a better job of spotting dubious claims when they come in.

One additional point.

In all these cases, it’s important to have lots of data available to train on. The more data
you have, the better the training (more accurate, wider range of situations etc.). One of
the reasons (of course there are others) for building a data lake is to have easy access to
more data for machine learning algorithms.
Machine Learning Technique #3: Clustering
Alert readers should have noticed that this is the same bowl of fruit used in the
classification example. Yes, this was done on purpose. Same fruit, but a different
approach.
This time we’re going to do clustering, which is an example of unsupervised learning.
You're back in preschool and the same teacher is standing in front of you with the same
basket of fruit.

But this time, as I hand the stuff out, I'm not going to tell you "This is a banana." Instead
I'm effectively going to say, “Do these things have any kind of natural grouping?”
(Which is a complex concept for a pre-schooler, but work with me for a moment).

You’ll look at them and their various characteristics, and you might end up with several
piles of fruit that look like “squidgy red things”, “curved yellow things”, “small green
things” and “larger red or green things”.

To clarify, what you did (in your role as preschoolers/machine learning algorithm) is
group the fruits in that way. What the teacher (or the human supervising the machine
learning process) did was to come up with meaningful names for those different piles.
This is likely the same process used to do the customer segmentation mentioned in the
previous blog. Having found logical groupings of customers, somebody came up with a
shorthand way to name or describe each grouping.

Here’s a real-world cluster diagram. With these data points you can see five separate
clusters. Those little arrows represent part of the process of calculating the clusters and
their boundaries: basically, pick arbitrary centers, calculate which points belong to which
cluster, move each center to the actual center of its cluster, and repeat until the
movements of the centers are sufficiently small.
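As an illustration of that pick-assign-recenter loop, here is a minimal NumPy sketch of the standard k-means procedure on made-up two-dimensional data (the blob locations and the choice of k=3 are assumptions for the example):

import numpy as np

def kmeans(points, k, tol=1e-4, max_iter=100, seed=0):
    """The assign-then-recenter loop described above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # 1. Pick arbitrary starting centers (here: k random points from the data).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the mean of the points assigned to it.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # 4. Stop once the centers barely move.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

# Toy data: three synthetic blobs standing in for the clusters in the diagram.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 4])])
labels, centers = kmeans(data, k=3)
print(centers)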

This approach is very common for customer segmentation. You could evaluate credit
risk, or even things like the similarity between written documents. Basically, if you look
at a mass of data and don’t know how to logically group it, then clustering is a good
place to start.

Machine Learning Technique #4: Anomaly Detection

Sometimes you’re not trying to group like things together. Maybe you don’t much care
about all the things that blend in with the flock. What you’re looking for is something
unusual, something different, something that stands out in some way.

This approach is called anomaly detection. You can use this to find things that are
different, even if you can’t say up front how they are different. It’s fairly easy to spot the
outliers here, but in the real world, those outliers might be harder to find.

One health provider used anomaly detection to look at claims for medical services and
found a dentist billing at the extraordinarily high rate of 85 fillings per hour. That's 42
seconds per patient to get the numbing shot, drill the bad stuff out and put the filling
in.

Clearly that's suspicious and needs further investigation. Just by looking at masses of
data (and there were millions of records) it would not have been obvious that you were
looking for something like that.

Of course, it might also throw up the fact that one doctor only ever billed on
Thursdays. Anomalous, yes. Relevant, probably not. Anomaly detection can throw up
the outliers for you to evaluate to see if they need further investigation.

Finding a dentist billing for too much work is a relatively simple anomaly. If you knew to
look at billing rates (which will not always be the case), you could find this kind of issue
using other techniques. But anomaly detection could also apply to more complex
scenarios. Perhaps you are responsible for some mechanical equipment where things
like pressure, flow rate and temperature are normally in sync with each other: one goes
up, they all go up; one goes down, they all go down. Anomaly detection could identify
the situation where two of those variables go up and the other one goes down. That
would be really hard to spot with any other technique.
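As a rough sketch of how that multivariate case might be caught, the snippet below trains scikit-learn's IsolationForest on synthetic sensor readings in which pressure, flow rate and temperature normally move together; the data and the contamination setting are invented for the illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Normal behaviour: all three variables driven by the same load, so they move together.
load = rng.normal(size=1000)
normal = np.column_stack([load, load, load]) + rng.normal(scale=0.1, size=(1000, 3))

# A handful of faulty readings: pressure and flow go up while temperature goes down.
faults = np.column_stack([
    rng.uniform(2, 3, size=5),
    rng.uniform(2, 3, size=5),
    rng.uniform(-3, -2, size=5),
])

readings = np.vstack([normal, faults])
model = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = model.predict(readings)   # -1 marks points the model considers anomalous
print("flagged rows:", np.where(flags == -1)[0])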


2. Explain in detail about ML tools.

 Azure Machine Learning SDK for Python

 See the full reference for the Azure Machine Learning SDK for Python.

What is it? Azure Machine Learning is a cloud service that you can use to develop and deploy
machine-learning models. You can track your models as you build, train, scale, and
manage them by using the Python SDK. Deploy models as containers and run them
in the cloud, on-premises, or on Azure IoT Edge.
Supported editions: Windows (conda environment: AzureML), Linux (conda environment: py36)
Typical uses: General machine-learning platform
How is it configured or installed? Installed with GPU support
How to use or run it: As a Python SDK and in the Azure CLI. Activate the conda environment AzureML on the Windows edition or py36 on the Linux edition.
Link to samples: Sample Jupyter notebooks are included in the AzureML directory under notebooks.
Related tools: Visual Studio Code, Jupyter
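For orientation only, a minimal sketch of the Python SDK (assuming the v1 azureml-core package and a config.json downloaded from the Azure portal; the experiment name and logged metric are made up):

# Hedged sketch: assumes azureml-core (SDK v1) is installed and config.json is present.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                             # reads ./config.json for workspace details
exp = Experiment(workspace=ws, name="demo-experiment")   # hypothetical experiment name

run = exp.start_logging()     # start an interactive run
run.log("accuracy", 0.91)     # illustrative metric value
run.complete()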

 H2O

What is it? An open-source AI platform that supports in-memory, distributed, fast, and
scalable machine learning.
Supported versions: Linux
Typical uses: General-purpose distributed, scalable machine learning
How is it configured or installed? H2O is installed in /dsvm/tools/h2o.
How to use or run it: Connect to the VM by using X2Go. Start a new terminal, and run java -jar
/dsvm/tools/h2o/current/h2o.jar. Then start a web browser and connect
to http://localhost:54321.
Link to samples: Samples are available on the VM in Jupyter under the h2o directory.
Related tools: Apache Spark, MXNet, XGBoost, Sparkling Water, Deep Water
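A minimal Python sketch of the same workflow (assuming the h2o Python package and a CSV of your own; the file path and column choices below are placeholders):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()                                         # connects to (or starts) a local H2O cluster
frame = h2o.import_file("path/to/your_data.csv")   # placeholder path
train, test = frame.split_frame(ratios=[0.8], seed=42)

# Assume the last column is the target and the rest are predictors.
# (For classification you would first convert the target column with .asfactor().)
model = H2OGradientBoostingEstimator(ntrees=50, seed=42)
model.train(x=frame.columns[:-1], y=frame.columns[-1], training_frame=train)
print(model.model_performance(test_data=test))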

 There are several other machine-learning libraries on DSVMs, such as the popular
scikit-learn package that's part of the Anaconda Python distribution for DSVMs. To
check out the list of packages available in Python, R, and Julia, run the respective
package managers.
 LightGBM

What is it? A fast, distributed, high-performance gradient-boosting (GBDT, GBRT, GBM, or
MART) framework based on decision tree algorithms. It's used for ranking,
classification, and many other machine-learning tasks.
Supported versions: Windows, Linux
Typical uses: General-purpose gradient-boosting framework
How is it configured or installed? On Windows, LightGBM is installed as a Python package. On Linux, the command-line executable is in /opt/LightGBM/lightgbm, the R package is installed, and Python packages are installed.
Link to samples: LightGBM guide
Related tools: MXNet, XGBoost
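For orientation, a minimal sketch of the LightGBM Python package on synthetic data (the parameters are illustrative, not a recommended configuration):

import numpy as np
import lightgbm as lgb

# Synthetic binary-classification data.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

train_set = lgb.Dataset(X, label=y)
params = {"objective": "binary", "metric": "binary_logloss", "verbosity": -1}
booster = lgb.train(params, train_set, num_boost_round=50)

preds = booster.predict(X)    # predicted probability of the positive class
print("train accuracy:", ((preds > 0.5) == y).mean())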

 Rattle

What is it? A graphical user interface for data mining by using R.
Supported editions: Windows, Linux
Typical uses: General UI data-mining tool for R
How to use or run it: As a UI tool. On Windows, start a command prompt, run R, and then inside R, run rattle(). On Linux, connect with X2Go, start a terminal, run R, and then inside R, run rattle().
Link to samples: Rattle
Related tools: LightGBM, Weka, XGBoost

 Vowpal Wabbit

What is it? A fast, open-source, out-of-core learning system library
Supported editions: Windows, Linux
Typical uses: General machine-learning library
How is it configured or installed? Windows: MSI installer; Linux: apt-get
How to use or run it: As an on-path command-line tool (C:\Program Files\VowpalWabbit\vw.exe on Windows, /usr/bin/vw on Linux)
Link to samples: Vowpal Wabbit samples
Related tools: LightGBM, MXNet, XGBoost

 Weka

What is it? A collection of machine-learning algorithms for data-mining tasks. The algorithms can
be either applied directly to a data set or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization.
Supported editions: Windows, Linux
Typical uses: General machine-learning tool
How to use or run it: On Windows, search for Weka on the Start menu. On Linux, sign in with X2Go, and then go to Applications > Development > Weka.
Link to samples: Weka samples
Related tools: LightGBM, Rattle, XGBoost

 XGBoost

What is it? A fast, portable, and distributed gradient-boosting (GBDT, GBRT, or GBM)
library for Python, R, Java, Scala, C++, and more. It runs on a single machine, and
on Apache Hadoop and Spark.
Supported editions: Windows, Linux
Typical uses: General machine-learning library
How is it configured or installed? Installed with GPU support
How to use or run it: As a Python library (2.7 and 3.5), R package, and on-path command-line tool (C:\dsvm\tools\xgboost\bin\xgboost.exe for Windows and /dsvm/tools/xgboost/xgboost for Linux)
Link to samples: Samples are included on the VM, in /dsvm/tools/xgboost/demo on Linux, and C:\dsvm\tools\xgboost\demo on Windows.
Related tools: LightGBM, MXNet
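For orientation, a minimal sketch using the scikit-learn-style wrapper from the XGBoost Python package on synthetic data (parameters are illustrative only):

import numpy as np
from xgboost import XGBClassifier

# Synthetic binary-classification data.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = (X[:, 0] > 0.5).astype(int)

# scikit-learn style wrapper around the XGBoost booster.
model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print("train accuracy:", (model.predict(X) == y).mean())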

4. Market basket analysis.

Market Basket Analysis is a modelling technique based upon the theory that if
you buy a certain group of items, you are more (or less) likely to buy another
group of items. For example, if you are in an English pub and you buy a pint
of beer and don't buy a bar meal, you are more likely to buy crisps (US. chips)
at the same time than somebody who didn't buy beer.

The set of items a customer buys is referred to as an itemset, and market basket
analysis seeks to find relationships between purchases.

Typically the relationship will be in the form of a rule:

IF {beer, no bar meal} THEN {crisps}.


The probability that a customer will buy beer without a bar meal (i.e. that the
antecedent is true) is referred to as the support for the rule. The conditional
probability that a customer will purchase crisps is referred to as
the confidence.
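As a worked illustration of those two definitions, here is a small Python sketch with four invented baskets (the basket contents are made up for the example, not data from the text):

# Toy transactions; True/False flags whether the basket contained the item.
transactions = [
    {"beer": True,  "bar_meal": False, "crisps": True},
    {"beer": True,  "bar_meal": True,  "crisps": False},
    {"beer": True,  "bar_meal": False, "crisps": True},
    {"beer": False, "bar_meal": False, "crisps": False},
]

# Rule: IF {beer, no bar meal} THEN {crisps}.
antecedent = [t for t in transactions if t["beer"] and not t["bar_meal"]]
both = [t for t in antecedent if t["crisps"]]

support = len(antecedent) / len(transactions)   # P(beer, no bar meal) = 2/4 = 0.5
confidence = len(both) / len(antecedent)        # P(crisps | beer, no bar meal) = 2/2 = 1.0
print(f"support={support:.2f}, confidence={confidence:.2f}")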

The algorithms for performing market basket analysis are fairly
straightforward (Berry and Linoff is a reasonable introductory resource for
this). The complexities mainly arise in exploiting taxonomies, avoiding
combinatorial explosions (a supermarket may stock 10,000 or more line
items), and dealing with the large amounts of transaction data that may be
available.

A major difficulty is that a large number of the rules found may be trivial for
anyone familiar with the business. Although the volume of data has been
reduced, we are still asking the user to find a needle in a haystack. Requiring
rules to have a high minimum support level and a high confidence level risks
missing any exploitable result we might have found. One partial solution to
this problem is differential market basket analysis, as described below.

How is it used?

In retailing, most purchases are made on impulse. Market basket analysis gives clues
as to what a customer might have bought if the idea had occurred to them. (For some
real insights into consumer behavior, see Why We Buy: The Science of Shopping by
Paco Underhill.)

As a first step, therefore, market basket analysis can be used in deciding the
location and promotion of goods inside a store. If, as has been observed,
purchasers of Barbie dolls are more likely to buy candy, then high-
margin candy can be placed near to the Barbie doll display. Customers who
would have bought candy with their Barbie dolls had they thought of it will now
be suitably tempted.

But this is only the first level of analysis. Differential market basket
analysis can find interesting results and can also eliminate the problem of a
potentially high volume of trivial results.

In differential analysis, we compare results between different stores, between
customers in different demographic groups, between different days of the week,
different seasons of the year, etc.

If we observe that a rule holds in one store, but not in any other (or does not
hold in one store, but holds in all others), then we know that there is
something interesting about that store. Perhaps its clientele are different, or
perhaps it has organized its displays in a novel and more lucrative way.
Investigating such differences may yield useful insights which will improve
company sales.

Other Application Areas

Although Market Basket Analysis conjures up pictures of shopping carts and
supermarket shoppers, it is important to realize that there are many other areas in
which it can be applied. These include:

 Analysis of credit card purchases.
 Analysis of telephone calling patterns.
 Identification of fraudulent medical insurance claims. (Consider cases where common rules are broken.)
 Analysis of telecom service purchases.

Note that despite the terminology, there is no requirement for all the items to
be purchased at the same time. The algorithms can be adapted to look at a
sequence of purchases (or events) spread out over time. A predictive market
basket analysis can be used to identify sets of item purchases (or events) that
generally occur in sequence — something of interest to direct marketers,
criminologists and many others.

5. RDBMS to HBase

Start MySQL service:

Start the MySQL service by using the below command in the terminal:

sudo service mysqld start

Once the Mysql service is started, enter Mysql shell using the below command in the
terminal.
Login to MySQL shell:

mysql -u root -p
Password: cloudera

In the above command -u represents the user name and -p represents the password. Here
username is root and password to Mysql shell is cloudera.
Show databases:

show databases;

As we have mentioned earlier, we will be using the emp database in our example, which is
already available in MySQL.
Use database emp:
Follow the below command to use the database emp:

Use emp;
Show tables:
Let us use show tables command to list the tables which are present in the database emp.

Show tables;

We can observe from the above image there is our example table employee in the
database emp.
Describe table:
We can use below command to describe employee table schema.

Describe employee;

The DESCRIBE TABLE command lists the following information about each column:
 Column name
 Type schema
 Type name
 Length
 Scale
 Nulls (yes/no)
Display the contents of the table employee:
We can use below command to display all the columns present in the table employee.
select * from employee;

Grant all permission:


We can use below command to grant superuser permission to root.

grant all on *.* to 'root'@'localhost' with grant option;

MySQL privileges are critical to the utility of the system as they allow each of the users to
access and utilize only the areas needed to perform their work functions. This is meant to
prevent a user from accidentally accessing an area where he or she should not have access.
Additionally, this adds to the security of the MySQL server. When you connect to a MySQL
server, the host from which we connect and the user name we specify determines our
identity. With this information, the server then grants privileges based upon this identity.
The above step finishes the Mysql part.
Now, we need to create a new table in Hbase to import table contents from Mysql
database. So, follow the below steps to import the contents from Mysql to Hbase.
Enter Hbase shell:
Use below command to enter HBase shell.

hbase shell
Create table:
We can use create command to create a table in Hbase.

create 'Academp','emp_details'

We can observe from the above output that we have created a new table in HBase with the
name Academp and column family emp_details.
Scan table:
We can use the scan command to see a table's contents in HBase.

scan 'Academp'

We can observe from the above output that no contents are available yet in the table Academp.
Sqoop import command:
Now use the below command to import the MySQL employee table into the HBase Academp table.

sqoop import --connect jdbc:mysql://localhost/emp --username root --password cloudera \
  --table employee --hbase-table Academp --column-family emp_details --hbase-row-key id -m 1


Scan HBase table:
Now again use the scan 'Academp' command to see the table contents, which have been
successfully imported from the MySQL employee table.

scan 'Academp'

We can observe from the above output that we have successfully imported the contents of a
MySQL table into an HBase table using Sqoop.
6. RDBMS to Hive

What is Sqoop Import?


Sqoop is a tool from Apache using which bulk data can be imported or exported from a
database like MySQL or Oracle into HDFS.

Now, we will discuss how we can efficiently import data from MySQL to Hive using Sqoop.
But before we move ahead, we recommend you to take a look at some of the blogs that we
put out previously on Sqoop and its functioning.
Beginners Guide for Sqoop
Sqoop Tutorial for Incremental Imports
Export Data from Hive to MongoDB
Importing Data from MySQL to HBase

How do we Use Sqoop?


In this example, we will be using the table Company1 which is already present in the
MySQL database.
We can use the describe command to see the schema of the Company1 table.
Describing the Table Schema
describe Company1;

The DESCRIBE TABLE command lists the following information about each column:
 Column name
 Type schema
 Type name
 Length
 Scale
 Nulls (Yes/No)
Displaying the Table Contents
We can use the following commands to display all the columns present in the
table Company1.
select * from Company1;

Granting All Permissions to Root and Flush the Privileges


We can use the following commands to grant superuser permissions to root and to flush the privileges.
grant all on *.* to 'root'@'localhost' with grant option;
flush privileges;

MySQL privileges are critical to the utility of the system as they allow each of their users to
access and utilize only those areas that are needed to perform their work functions. This is
meant to prevent a user from accidentally accessing an area which they should not have
access to.
Additionally, this adds to the security of the MySQL server.
Whenever someone connects to a MySQL server, their identities are determined by the host
used to connect them and the user name specified. With this information, the server grants
privileges based upon the identity determined.
The above step finishes the MySQL part.
Now, let us open a new terminal and enter Sqoop commands to import data from
MySQL to Hive table.
I. A Sqoop command is used to transfer selected columns from MySQL to Hive.
Now, use the following command to import selected columns from the
MySQL Company1 table to the Hive Company1Hive table.
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --split-by EmpId \
  --columns EmpId,EmpName,City --table Company1 --target-dir /myhive \
  --hive-import --create-hive-table --hive-table default.Company1Hive -m 1
The above Sqoop command will create a new table with the
name Company1Hive in the Hive default database and transfer the 3 mentioned
column (EmpId, EmpName and City) values from the MySQL table Company1 to the
Hive table Company1Hive.
Displaying the Contents of the Table Company1Hive
Now, let us see the transferred contents in the table Company1Hive.
select * from Company1Hive;

II. Sqoop command for transferring a complete table data from MySQL to Hive.
In the previous example, we transferred only the 3 selected columns from the MySQL
table Company1 to the Hive default database table Company1Hive.
Now, let us go ahead and transfer the complete table from the table Company1 to a
new Hive table by following the command given here:
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --table Company1 \
  --target-dir /myhive --hive-import --create-hive-table --hive-table default.Company2Hive -m 1

The above given Sqoop command will create a new table with the
name Company2Hive in the Hive default database and will transfer all this data
from the MySQL table Company1 to the Hive table Company2Hive.
In Hive, let us now see the transferred contents in the table Company2Hive.
select * from Company2Hive;
We can observe from the above screenshot that we have successfully transferred
these table contents from the MySQL to a Hive table using Sqoop.
Next, we will do the reverse, i.e., we will export table contents from the Hive table
to the MySQL table.
III. Export command for transferring the selected columns from Hive to MySQL.
In this example we will transfer the selected columns from Hive to MySQL. For this,
we need to create a table before transferring the data from Hive to the MySQL
database. We should follow the command given below to create a new table.
create table Company2(EmpId int, EmpName varchar(20), City varchar(15));

The above command creates a new table named Company2 in the MySQL database
with three columns: EmpId, EmpName, and City.
Let us use the select statement to see the contents of the table Company2.
Select * from Company2;

We can observe that in the screenshot shown above, the table contents are empty.
Let us use the Sqoop command to load this data from Hive to MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P \
  --columns EmpId,EmpName,City --table Company2 \
  --export-dir /user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1

The Sqoop command given above will transfer the 3 mentioned column (EmpId,
EmpName, and City) values from the Hive table Company2Hive to the MySQL
table Company2.
Displaying the Contents of the Table Company2
Now, let us see the transferred contents in the table Company2.
select * from Company2;
We can observe from the above image that we have now successfully transferred
data from Hive to MySQL.
IV. Export command for transferring the complete table data from Hive to
MySQL.
Now, let us transfer this complete table from the Hive table Company2Hive to a
MySQL table by following the command given below:
create table Company2Mysql(EmpId int, EmpName varchar(20), Designation
varchar(15), DOJ varchar(15), City varchar(15), Country varchar(15));

Let us use the select statement to see the contents of the table Company2Mysql.
select * from Company2Mysql;

We observe in the screenshot given above that the table contents are empty. Let us
use a Sqoop command to load this data from Hive to MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --table Company2Mysql \
  --export-dir /user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1

The above given Sqoop command will transfer the complete data from the Hive
table Company2Hive to the MySQL table Company2Mysql.
Displaying the Contents of the Table Company2Mysql
Now, let us see the transferred contents in the table Company2Mysql.
select * from Company2Mysql;
We can see here in the screenshot how we have successfully exported table contents
from Hive to MySQL using Sqoop. We can follow the above steps to transfer this
data between Apache Hive and the structured databases.

Ques. 6. Market basket analysis.

Market basket analysis is identifying items in the supermarket which customers are more likely to buy
together.
e.g., Customers who bought pampers also bought beer

This is important for supermarkets, both to arrange their items in a consumer-convenient manner and to
come up with promotions that take item affinity into consideration.

Frequent Item set Mining and Association Rule Learning

Frequent item set mining is a sub-area of data mining that focuses on identifying frequently co-occurring
items. Once the frequent item sets are ready, we can come up with rules to derive associations
between items.
e.g., Frequent item set = {pampers, beer, milk}, association rule = {pampers, milk ---> beer}

There are two popular approaches for frequent item set mining and association rule
learning, as given below:

Apriori algorithm
FP-Growth algorithm

To explain the above algorithms, let us consider an example with 4 customers making 4 transactions in
a supermarket that stocks 7 items in total, as given below:
Transaction 1: Jana’s purchase: egg, beer, pampers, milk
Transaction 2: Abi’s purchase: carrot, milk, pampers, beer
Transaction 3: Mahesha’s purchase: perfume, tissues, carrot
Transaction 4: Jayani’s purchase: perfume, pampers, beer

Item index
1: egg, 2: beer, 3: pampers, 4: carrot, 5: milk, 6: perfume, 7: tissues

Using Apriori algorithm

The Apriori algorithm identifies frequent item sets by starting with individual items and extending the
item set by one item at a time. This is known as the candidate generation step.
The algorithm relies on the property that any subset of a frequent item set is also frequent.

Transaction: Items
1: 1, 2, 3, 5
2: 4, 5, 3, 2
3: 6, 7, 4
4: 6, 3, 2

Minimum Support

Minimum support is used to prune the associations that are less frequent.

Support of an item set = number of transactions containing the item set / total number of transactions. Item sets whose support falls below the chosen minimum support threshold are pruned.

For example, let's say we define the minimum support as 0.5.

The support for egg is 1/4 = 0.25 (0.25 < 0.5), so it is eliminated. The support for beer is 3/4
= 0.75 (0.75 > 0.5), so it is considered for further processing.

Calculation of support for all items

size of the candidate itemset = 1

itemset: support
1: 0.25: eliminated
2: 0.75
3: 0.75
4: 0.5
5: 0.5
6: 0.5
7: 0.25: eliminated

remaining items: 2, 3, 4, 5, 6

extend candidate itemset by 1


size of the candidate itemsets = 2

itemset: support
2, 3: 0.75
2, 4: 0.25: eliminated
2, 5: 0.5
2, 6: 0.25: eliminated
3, 4: 0.25: eliminated
3, 5: 0.5
3, 6: 0.25: eliminated
4, 5: 0.25: eliminated
4, 6: 0.25: eliminated
5, 6: 0.25: eliminated

remaining itemsets: {2, 3}, {2, 5}, {3, 5}

extend candidate itemset by 1


size of the candidate itemsets = 3

2, 3, 5: 0.5
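So the only frequent three-item set is {2, 3, 5}, i.e. {beer, pampers, milk}, which is exactly where an association rule such as {pampers, milk ---> beer} comes from. As a check, here is a small Python sketch (written for this worked example, not taken from any particular library) that reproduces the level-by-level supports above:

# A small Apriori sketch that reproduces the supports computed above.
transactions = [
    {1, 2, 3, 5},   # Jana: egg, beer, pampers, milk
    {2, 3, 4, 5},   # Abi: carrot, milk, pampers, beer
    {4, 6, 7},      # Mahesha: perfume, tissues, carrot
    {2, 3, 6},      # Jayani: perfume, pampers, beer
]
MIN_SUPPORT = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: single items that meet the minimum support.
all_items = {item for t in transactions for item in t}
frequent = [frozenset({item}) for item in all_items if support({item}) >= MIN_SUPPORT]

# Candidate generation: extend frequent itemsets by one item and re-check support.
while frequent:
    print(sorted((tuple(sorted(s)), support(s)) for s in frequent))
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1}
    frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]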
