Course Logistics
Chapter Topics: Refer to the course objective and content PDF.

Course Objective
Introductions: About your instructor. About you: Experience with Hadoop? Experience as a developer? Expectations from the course?

Module 1: Introduction to BIG Data and its Need
Lesson 1: Introduction to BIG Data
Lesson 2: Big Data Analytics and why it's a need now
Lesson 3: Real Time Case Studies
Lesson 4: Traditional vs. Big Data Approach
Lesson 5: Technologies within the Big Data Eco System

Module Objectives
At the end of this module, you will learn:
Introduction to BIG Data
A few examples of BIG Data
Big Data real-time case studies
Why Big Data is a BUZZ and why it's a need now
Big Data Analytics
Comparison between the Traditional and Big Data approaches
Technologies within the Big Data Eco System

Introduction to BIG Data
At the end of this lesson, you will learn:
What is Big Data?
The 3 Vs of BIG Data
A few examples of Big Data
Why Big Data is a BUZZ!

Lesson 1: Introduction
What is BIG Data? When you hear the term BIG Data, what is the first instant thought? Volume, right? Massive, huge, enormous quantities of digital stuff. But it's not just the volume that makes BIG Data difficult to manage and analyze; it's also the Variety and Velocity!

Big Data: Insight
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data is equipped to handle the day-to-day data explosion. Big data is difficult to work with using most relational database management systems. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.

What do we Mean by Big Data?
BIG Data has three defining attributes: the 3 Vs.
They are Data Volume, Data Variety & Data Velocity. Together, the 3 Vs constitute a comprehensive definition of BIG Data.

Volume: Using millions of transactions & events to analyze trends and perform forecasts! Turning 12 terabytes of Tweets created each day into improved product sentiment analysis! Converting 350 billion annual meter readings to better predict power consumption!

Velocity: Using fast-paced real-time transactions for predictive analysis! Scrutinizing 5 million trade events created each day to identify potential fraud! Analyzing 500 million daily call detail records in real time to predict customer churn faster!

Variety: Collectively analyzing all forms of data (text, sensor data, audio, video, click streams, log files) gives new insights! Monitoring 100s of live video feeds from surveillance cameras to target points of interest! Exploiting 80% data growth in images, video and documents to improve customer satisfaction!

3 Vs of BIG Data
Volume: Terabytes, Records, Transactions
Velocity: Batch Load, Near-Time Data, Real-Time Data
Variety: Structured, Unstructured, Semi-Structured

Understanding BIG Data - Summary
"Data that's an order of magnitude greater than you are accustomed to" - Gartner Analyst Doug Laney
"BIG Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand Database Management Tools" - Wikipedia
The 3 Vs: Volume, Velocity & Variety

A Few Examples of BIG Data
Facebook handles 40 billion photos from its user base and has more than 901 million active users generating social interaction data. RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems. 10,000 payment card transactions are made every second around the world. 340 million tweets are sent per day. That's nearly 4,000 tweets per second.
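As a quick sanity check on the tweets-per-second figure, the arithmetic can be verified in a few lines (an illustrative calculation, not part of any Big Data toolset):

```python
# Convert 340 million tweets per day into tweets per second.
tweets_per_day = 340_000_000
seconds_per_day = 24 * 60 * 60          # 86,400 seconds in a day
tweets_per_second = tweets_per_day / seconds_per_day
print(round(tweets_per_second))          # 3935 -- "nearly 4,000"
```

The same back-of-the-envelope style applies to the other volume and velocity figures quoted above.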
More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones. Boeing jet engines produce terabytes of operational information every 30 minutes they turn. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing; multiply that by the more than 25,000 flights flown each day. Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.

Why BIG Data is a BUZZ!
A BIG Data platform can be used to analyze semi-structured & unstructured data along with raw structured data. BIG Data solutions are ideal for iterative & exploratory analysis when business measures cannot be pre-determined using a structured data set. Big Data can be used to support Predictive Analytics and provide predictive enterprise solutions using all forms of real transactions, in contrast to traditional DWBI. A few case studies for BIG Data would be: performing IT log analytics, identifying fraud detection patterns, sentiment analytics using social media feeds, executing usage analytics in the energy sector, and analyzing competitor market penetration.

So what does BIG Data mean to a business?
Profile customers & gain customer trust
Determine pricing strategies
Identify competitive advantages
Better target advertising
Strengthen customer service

Summary
In this chapter you have learned:
What is Big Data?
The 3 Vs of BIG Data
A few examples of Big Data
Why Big Data is a BUZZ!

BIG Data Analytics & Why it's a Need Now?
At the end of this lesson, you will learn:
What is Big Data Analytics? Its advantages and challenges.
Why it has become a need now?
Big data as a complete solution.
Big Data Analytics implementation.

Lesson 2: What is BIG Data Analytics
Big data analytics is the process of examining large amounts of data of a variety of types (structured, semi-structured or unstructured) to uncover hidden patterns, unknown correlations and other useful information.
The primary goal of big data analytics is to help companies make better business decisions by analyzing data sources that may be left untapped by conventional business intelligence (BI) programs. Underlying data may include web server logs, Internet clickstream data, social media activity reports, mobile-phone call detail records, information captured by sensors, IT logs, etc.

Challenges?
Lack of skill set
High initial cost involvement
Challenges in integrating BIG Data
Little awareness of technologies
Unavailability of a mature BIG Data toolset

Advantages?
Making sense out of unstructured data
Optimized usage of organizational data
Value added to existing BI solutions
More accurate BI results
Best bet to make better business decisions

Why BIG Data Analytics is a need now?
Information is at the center of a new wave of opportunity, and organizations need deeper insights. 1 in 3 business leaders frequently makes business decisions based on information they do not trust or do not have! 1 in 2 business leaders say they do not have access to the relevant information they require to do their job! 83% of CIOs cited BI as part of their visionary plans to enhance their competitiveness. 60% of CEOs need to do a better job capturing & understanding information rapidly in order to make swift business decisions. 44x as much data & content in the coming decade; 80% of the world's available data is unstructured or semi-structured; 35 zettabytes of data by 2020. A BIG Data platform helps you combine varied data forms for making decisions.

Why BIG Data Analytics is a need now?
A BIG Data platform provides multi-channel customer sentiment analytics: who are the BIGGEST influencers and what are they saying (Social Network, Web, Call Centre)? What do people think about your company or product?

Why BIG Data Analytics is a need now?
New Information Sources: Twitter produces 7 TB of data every day; Facebook produces 10 TB of data every day; 2 billion Internet users as of now; 4.6 billion mobile phones worldwide; new media channels emerging every day.
Traditional Sources: Steady growth of traditional data; enormous satellite data growth; digitization makes for exponential data growth.
The future continues to bring new data sources with high data volume. A BIG Data platform ensures consolidation of ever-growing, varied data sets.

Why BIG Data Analytics is a need now?
Imagine if we could: predict infections in premature newborns 24 hours earlier (Physician)? Apply social relationships of customers to prevent churn (Call Centre Rep)? Adjust credit lines as transactions are occurring to account for risk fluctuations (Loan Officer)? Determine whom to offer discounts at time of sale instead of offering to all (Sales Associate)? A BIG Data platform can be used across industries for making analytic decisions.

BIG Data: The Solution
Bring together any data source @ any velocity to generate insight. Analyze a variety of data @ enormous volume. Insight on streaming data. Large-volume structured data analysis.
Velocity, Variety, Volume: multi-channel customer sentiment analytics; predict weather patterns to optimize capital expenditure; make risk decisions based on real-time transactional data; identify criminals & threats from disparate audio/video; find life-threatening conditions in time to intervene.

Implementing BIG Data Analytics - Different Approaches
Interactive Exploration: For Data Analysts & Data Scientists who want to discover real-time patterns as they emerge from their BIG Data content. Latency: Low. Platform: HBase, NoSQL, Analytic DBMS.
Operational Reporting: For executives & Operational Managers who want summarized, pre-built, periodic reports on their BIG Data content. Latency: Medium. Platform: Hive, NoSQL, Analytic DBMS.
Indirect Batch Analysis: For Data Analysts & Operational Managers who want to analyze data trends based on predefined questions in their BIG Data content. Latency: High. Platform:
Hadoop, NoSQL, Analytic DBMS. Connectivity for the three approaches: Native; Native, SQL; ETL.

BIG Data Platform Connectivity Architecture
BIG Data BI Platform with an In-Memory Engine; BIG Data BI Platform with an OLAP Engine and Data Mart; Reports & Dashboards; Multidimensional Analysis. Connectivity options: Native, Native SQL, ETL.

Summary
In this chapter you have learned:
What is Big Data Analytics? Its advantages and challenges.
Why it has become a need now?
Big data as a complete solution.
Big Data Analytics implementation.

Traditional Analytics vs. Big Data Analytics
At the end of this lesson, you will learn:
The Traditional Approach
The BIG Data Approach
Traditional vs. Big Data Approach
How BIG Data complements the traditional Enterprise Data Warehouse

Lesson 3: The Traditional Approach: Business Requirements Drive Solution Design
Business defines requirements (what questions should we ask?); IT designs a solution with a set structure & functionality; business executes queries to answer questions over and over; new requirements require redesign & rebuild.
Well suited to: high-value, structured data; repeated operations & processes; relatively stable sources; well-understood requirements.
Stretched by: highly valuable data and content; exploratory analysis; volatile sources; changing requirements.

The BIG Data Approach: Information Sources Drive Creative Discovery
Can be implemented for structured or unstructured data; exploratory operations & processes; relatively unstable sources; unknown business requirements.
Business & IT identify available information sources; IT delivers a platform that enables creative exploration of all available data & content; business determines what questions to ask by exploring data & relationships; new insights drive integration into traditional technology.

Traditional and BIG Data Approaches
Traditional Approach vs. BIG Data Approach
Traditional Approach (Structured & Repeatable Analysis): Business users determine what questions to ask; IT structures the data to answer those questions. Examples: monthly sales reports, profitability analysis, customer surveys.
BIG Data Approach (Iterative & Exploratory Analysis): IT delivers a platform to enable creative discovery; business explores what questions could be asked. Examples: brand sentiment, product strategy, maximizing utilization.

BIG Data Complements the Traditional Enterprise Data Warehouse
Enterprise integration brings together traditional sources (the Data Warehouse) and new sources (the BIG Data platform). BIG Data shouldn't be a silo; it must be an integrated part of your Enterprise Information Architecture.

Traditional DW Analytics Platform vs. BIG Data Analytics Platform
- Gigabytes to terabytes | Petabytes to exabytes
- Centralized data structure | Distributed data structure
- Structured | Semi-structured & unstructured
- Relational data model | Flat schemas
- Batch-oriented data load process | Aimed at near-real-time analysis of the data
- Analytics based on historical trends | Analytics based on real-time data
- Data generated using conventional methods (data entry) | Data generated using unconventional methods like RFID, sensor networks, etc.

Summary
In this chapter you have learned:
The Traditional Approach
The BIG Data Approach
Traditional vs. Big Data Approach
How Big Data complements the traditional Enterprise Data Warehouse
Traditional Analytics vs.
Big Data Analytics

Real Time Case Studies
At the end of this lesson, you will learn:
Big Data Analytics: Use Cases
Big Data to Predict Your Customers' Behaviors
When to consider a Big Data solution
Big Data real-time case studies

Lesson 4: BIG Data Analytics - Use Cases
Web/E-Commerce/Internet: Integrated website analytics
Retail: Competitive pricing, customer segmentation, predictive buying behavior, market campaign management
Government/Defense: Intelligence analysis, threat analytics
Telecommunications: Customer experience analytics
Healthcare & Pharmaceutical: Drug discovery
Insurance: Customer segmentation, service response optimization
Financial Services: Fraud detection analytics, risk modeling & analysis
Manufacturing: Inventory optimization
Energy & Utilities: Customer experience analytics, service quality optimization
Media & Content: Customer satisfaction analytics, dispatch optimization

Big Data to Predict Your Customers' Behaviors
Retailers like Wal-Mart and Kohl's are making use of sales, pricing, and economic data, combined with demographic and weather data, to fine-tune merchandising store by store and anticipate the appropriate timing of store sales. Online dating services like eHarmony and Match.com are constantly observing activity on their sites to optimize their matching algorithms and predict who will hit it off with whom. Google search queries on flu symptoms and treatments reveal weeks in advance what flu-related volumes hospital emergency departments can expect. BIG Data provides the capacity to predict the future before your rivals can, whether they're companies or criminals. Currently, the NYPD is using a Big Data platform to fight crime in Manhattan.

When to Consider a Big Data Solution
Big Data solutions are ideal for analyzing not only raw structured data, but also semi-structured and unstructured data from a wide variety of sources.
Big Data solutions are ideal when all, or most, of the data needs to be analyzed (versus a sample of the data), or when a sampling of data isn't nearly as effective as a larger data set from which to derive analysis. Big Data solutions are ideal for iterative and exploratory analysis when business measures on the data are not predetermined.

Big Data Real Time Case Study
TXU Energy Smart Electric Meters: Because of smart meters, electricity providers can read the meter once every 15 minutes rather than once a month. This not only eliminates the need to send someone for meter reading, but because the meter is read once every fifteen minutes, electricity can be priced differently for peak and off-peak hours. Pricing can be used to shape the demand curve during peak hours, eliminating the need to create additional generating capacity just to meet peak demand and saving electricity providers millions of dollars' worth of investment in generating capacity and plant maintenance costs.

Big Data Real Time Case Study (Contd.)
T-Mobile USA: T-Mobile USA has integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections. By leveraging social media data (Big Data) along with transaction data from CRM and billing systems, T-Mobile USA has been able to cut customer defections in half in a single quarter.

Big Data Real Time Case Study (Contd.)
US Xpress: US Xpress, provider of a wide variety of transportation solutions, collects about a thousand data elements ranging from fuel usage to tire condition to truck engine operations to GPS information, and uses this data for optimal fleet management and to drive productivity, saving millions of dollars in operating costs.
Big Data Real Time Case Study (Contd.)
McLaren's Formula One racing team: McLaren's Formula One racing team uses real-time car sensor data during races, identifies issues with its racing cars using predictive analytics, and takes corrective actions proactively before it's too late!

Summary
In this chapter you have learned:
Big Data Analytics: Use Cases
Big Data to Predict Your Customers' Behaviors
When to consider a Big Data solution
Big Data real-time case studies, like TXU smart meters, T-Mobile, US Xpress and McLaren's Formula One racing team

Technologies within the Big Data Eco System
At the end of this lesson, you will learn:
BIG Data Landscape
BIG Data Key Components
Components of Analytical Big Data Processing
Hadoop at a Glance
Conclusion

Lesson 5: BIG Data Landscape
Hardware (storage, servers, networking): Vendors include Dell, HP, Arista, IBM, Cisco, EMC, NetApp.
BIG Data Distributions (open-source Hadoop distributions, enterprise Hadoop distributions, non-Hadoop BIG Data frameworks): Vendors/providers include Apache, Cloudera, Hortonworks, IBM, EMC, MapR, LexisNexis.
Data Management Components (distributed file sources, NoSQL databases, Hadoop-optimized data warehouses, data integration, data quality & governance): Vendors/providers include Apache, DataStax, Pervasive, Couchbase, IBM, Oracle, Informatica, Syncsort, Talend.
Analytics Layer (analytic application development platforms, advanced analytic applications): Vendors/providers include Apache, Karmasphere, Hadapt, Attivio, 1010data, EMC, SAS Institute, Digital Reasoning, Revolution Analytics.
Application Layer (data visualization tools, BI applications): Vendors include Datameer, ClickFox, Platfora, Tableau Software, Tresata, IBM, SAP, MicroStrategy, Pentaho, QlikTech, Jaspersoft.
Services (consulting, training, technical support, software/hardware maintenance, hosting/BIG Data as a Service, cloud): Vendors include Tresata, Tidemark, Think Big Analytics, Amazon Web Services, Accenture, Cloudera, Hortonworks.
BIG Data Key Components
The MapReduce engine for processing; a location-aware file system (e.g. HDFS) with job & task trackers; NoSQL databases (e.g. HBase, Cassandra); higher-level languages and abstraction layers (Pig, Hive as a DW, Cascading); management & security (Kerberos); ETL (Extract, Transform & Load) & modeling tools (CR-X); fast-loading analytic databases for structured sources (e.g. Greenplum, Netezza); and analytic applications (ClickFox, Merced, etc.).

Components of Analytical Big Data Processing
Raw massive data: kept on cheap commodity machines/servers, organized as nodes and clusters.
File systems such as the Hadoop Distributed File System (HDFS), which manage the storage and retrieval of data and the metadata required for computation. Other file systems or databases such as HBase (a NoSQL tabular store) or Cassandra (a NoSQL eventually-consistent key-value store) can also be used.
Computation engine: Instead of writing in Java, higher-level languages such as Pig (part of Hadoop) can be used, simplifying the writing of computations.
Data warehouse layer: Hive is a data warehouse layer built on top of Hadoop, developed by Facebook programmers for the BIG Data platform.
Cascading is a thin Java library that sits on top of Hadoop and allows suites of MapReduce jobs to be run and managed as a unit. It is widely used to develop special tools.
Semi-automated modeling tools such as CR-X allow models to be developed interactively at great speed, and can help set up the database that will run the analytics.
Analytic database: Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, load & reload the data for the analytic models. ISV big data analytical packages such as ClickFox and Merced run against the database to help address business issues.

Hadoop at a Glance
It is not advisable to dig out the hole for a pool using only an ice cream scooper; you need a big tool. Hadoop is one such tool!
Apache Hadoop is an open-source project inspired by the BIG Data research of Google. Hadoop is the best available tool for processing and storing herculean amounts of Big Data. Hadoop throws thousands of computers at a big data problem, rather than using a single computer. In Hadoop parlance, a group of coordinated computers is called a cluster & the individual computers in the cluster are called nodes. Hadoop makes data mining, analytics, and processing of big data cheap and fast when compared with other toolsets. Hadoop is cheap, fast, flexible & scales to large amounts of big data storage & computation. Hadoop: a big tool for BIG Data.

Conclusion
Looking at the data explosion, the real issue is not acquiring large amounts of data or storing that data; it is what you do with your BIG Data! With BIG Data and BIG Data Analytics, it's possible to:
Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
Mine customer data for insights that drive new strategies for customer acquisition, retention, campaigns, etc.
Quickly identify the customers who matter the most.
Generate retail coupons at the point of sale based on the customer's current and past purchases.
Send tailored recommendations at just the right time, while customers are in the right location.
Analyze data from social media to detect new market trends and changes in demand.
Use clickstream analysis and data mining to detect fraudulent behavior.
Determine root causes of failures, issues & defects by investigating user sessions, network logs & sensors.
The working principle behind all big data platforms is to move the query to the data to be processed, not the data to the query processor.
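That data-locality principle can be illustrated with a small sketch in plain Python (not the Hadoop API; the node data and the `local_count` helper are hypothetical): each "node" runs the computation against the block it already holds, and only tiny partial results travel back, rather than the raw data itself.

```python
# Illustrative sketch of "move the query to the data":
# each node computes a partial result locally; only the partials move.

def local_count(block):
    """Runs where the block lives; ships back one small number."""
    return sum(1 for record in block if record["amount"] > 100)

# Hypothetical data blocks held by three different nodes
blocks = [
    [{"amount": 250}, {"amount": 30}],    # node 1
    [{"amount": 120}],                    # node 2
    [{"amount": 99}, {"amount": 101}],    # node 3
]

partials = [local_count(b) for b in blocks]   # computed "on" each node
total = sum(partials)                          # only small partials cross the network
print(total)  # 3
```

Shipping three integers instead of five raw records is a trivial saving here, but at petabyte scale it is the difference between a feasible and an infeasible job.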
It's time to move on: don't just look in the rear-view mirror while driving the car (traditional BI), but also look a step forward and get into predictive analytics using the power of BIG Data, helping the organization take the right decision at the right point in time.

Summary
In this chapter you have learned:
BIG Data Landscape
BIG Data Key Components
Components of Analytical Big Data Processing
Hadoop at a Glance
Conclusion

Module 2: Introduction to Apache Hadoop and its Ecosystem
Lesson 1: The Motivation for Hadoop
Lesson 2: Hadoop: Concepts and Architecture
Lesson 3: Hadoop and the Data Warehouse: When and Where to Use Which
Lesson 4: Introducing Hadoop Ecosystem Components

Module Objectives
At the end of this module, you will learn:
Introduction to Apache Hadoop
The motivation for Hadoop
The basic concepts of Hadoop
Hadoop architecture
The Hadoop Distributed File System (HDFS) and MapReduce
Right usage and scenarios for Hadoop
Introduction to key Hadoop Ecosystem projects

The Motivation for Hadoop
At the end of this lesson, you will learn:
What problems exist with traditional large-scale computing systems
What requirements an alternative approach should have
How Hadoop addresses those requirements

Lesson 1: Traditional Large-Scale Computation
Traditionally, computation has been processor-bound: relatively small amounts of data, with a significant amount of complex processing performed on that data. For decades, the primary push was to increase the computing power of a single machine: faster processors, more RAM. Distributed systems evolved to allow developers to use multiple machines for a single job: MPI (Message Passing Interface), PVM (Parallel Virtual Machine), Condor.

Distributed Systems: Problems
Programming for traditional distributed systems is complex: data exchange requires synchronization, finite bandwidth is available, and temporal dependencies are complicated. It is difficult to deal with partial failures of the system. Ken Arnold, CORBA designer:
"Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure." Developers spend more time designing for failure than they do actually working on the problem itself. (CORBA: Common Object Request Broker Architecture)

Distributed Systems: Data Storage
Typically, data for a distributed system is stored on a SAN. At compute time, data is copied to the compute nodes. This is fine for relatively limited amounts of data.

The Data-Driven World
Modern systems have to deal with far more data than was the case in the past. Organizations are generating huge amounts of data, and that data has inherent value and cannot be discarded. Examples: Facebook - over 70 PB of data; eBay - over 5 PB of data. Many organizations are generating data at a rate of terabytes per day.

Data Becomes the Bottleneck
Moore's Law has held firm for over 40 years: processing power doubles every two years, so processing speed is no longer the problem. Getting the data to the processors becomes the bottleneck. Quick calculation: with a typical disk data transfer rate of 75 MB/sec, the time taken to transfer 100 GB of data to the processor is approx. 22 minutes!
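The quick calculation above is easy to reproduce (an illustrative back-of-the-envelope check, using 1 GB = 1024 MB):

```python
# How long does it take to read 100 GB off a single disk at 75 MB/sec?
data_gb = 100
rate_mb_per_sec = 75
seconds = (data_gb * 1024) / rate_mb_per_sec   # 102,400 MB / 75 MB/s
minutes = seconds / 60
print(round(minutes, 1))   # 22.8 -- roughly the 22 minutes quoted
```

Reading the same 100 GB in parallel from, say, 100 disks would cut this to well under a minute, which is exactly the opportunity Hadoop exploits.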
This assumes sustained reads; the actual time will be worse, since most servers have less than 100 GB of RAM available. A new approach is needed.

Partial Failure Support: The system must support partial failure. Failure of a component should result in a graceful degradation of application performance, not complete failure of the entire system.
Data Recoverability: If a component of the system fails, its workload should be assumed by the still-functioning units in the system. Failure should not result in the loss of any data.
Component Recovery: If a component of the system fails and then recovers, it should be able to rejoin the system without requiring a full restart of the entire system.
Consistency: Component failures during execution of a job should not affect the outcome of the job.
Scalability: Adding load to the system should result in a graceful decline in performance of individual jobs, not failure of the system. Increasing resources should support a proportional increase in load capacity.

Hadoop's History
Hadoop is based on work done by Google in the late 1990s/early 2000s, specifically on the papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004. This work takes a radically new approach to the problem of distributed computing and meets all the requirements we have for reliability and scalability. Core concept: distribute the data as it is initially stored in the system. Individual nodes can work on data local to those nodes, so no data transfer over the network is required for initial processing.

Core Hadoop Concepts
Applications are written in high-level code; developers need not worry about network programming, temporal dependencies or low-level infrastructure. Nodes talk to each other as little as possible; developers should not write code which communicates between nodes ("shared nothing" architecture). Data is spread among machines in advance; computation happens where the data is stored, wherever possible. Data is replicated multiple times on the system for
increased availability and reliability.

Hadoop: Very High-Level Overview
When data is loaded into the system, it is split into blocks, typically 64 MB or 128 MB. Map tasks (the first part of the MapReduce system) work on relatively small portions of data: typically a single block. A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible. Many nodes work in parallel, each on its own part of the overall dataset.

Fault Tolerance
If a node fails, the master will detect that failure and re-assign the work to a different node on the system. Restarting a task does not require communication with nodes working on other portions of the data. If a failed node restarts, it is automatically added back to the system and assigned new tasks. If a node appears to be running slowly, the master can redundantly execute another instance of the same task, and the results from the first to finish will be used. This is known as speculative execution.

Summary
In this chapter you have learned:
What problems exist with traditional large-scale computing systems
What requirements an alternative approach should have
How Hadoop addresses those requirements

Hadoop: Concepts and Architecture
At the end of this lesson, you will learn:
What Hadoop is all about
Hadoop components
What features the Hadoop Distributed File System (HDFS) provides
HDFS architecture
The concepts behind MapReduce

Lesson 2: The Hadoop Project
Hadoop is an open-source project overseen by the Apache Software Foundation, originally based on papers published by Google in 2003 and 2004. Hadoop committers work at several different organizations, including Yahoo!, Facebook and LinkedIn.

Hadoop Components
Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and MapReduce. There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. Many are discussed later in the course. A set of machines
running HDFS and MapReduce is known as a Hadoop Cluster. Individual machines are known as nodes. A cluster can have as few as one node, or as many as several thousand. More nodes = better performance!

Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster; each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times, with replicas stored on different nodes. This ensures both reliability and availability.

HDFS
The data file is broken up into 64 MB or 128 MB blocks. The data blocks are replicated 3 times and scattered amongst the workers.

Hadoop Components: MapReduce
MapReduce is the system used to process data in the Hadoop cluster. It consists of two phases: Map, and then Reduce. Between the two is a stage known as the shuffle and sort. Each Map task operates on a discrete portion of the overall dataset, typically one HDFS block of data. After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase. Much more on this later!
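The Map → shuffle & sort → Reduce flow described above can be simulated in a few lines of plain Python (a conceptual sketch of word count, not the Hadoop API; the function names are illustrative):

```python
# Minimal simulation of the three MapReduce stages: word count.
from collections import defaultdict

def map_phase(block):
    """Map: emit (key, value) pairs from one block of input."""
    for line in block:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in grouped}

blocks = [["the cat sat"], ["the cat ran"]]          # two HDFS-style blocks
pairs = [p for b in blocks for p in map_phase(b)]    # Map runs per block
print(reduce_phase(shuffle_and_sort(pairs)))
# {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

In a real cluster, each `map_phase` call would run on the node holding that block, and the shuffle would move the grouped pairs across the network to the Reduce nodes.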
HDFS Basic Concepts
HDFS is a filesystem written in Java, based on Google's GFS. It sits on top of a native filesystem such as ext3, ext4 or xfs, and provides redundant storage for massive amounts of data using commodity (relatively low-cost) computers.

HDFS Basic Concepts (Contd.)
HDFS performs best with a modest number of large files: millions, rather than billions, of files, each typically 100 MB or more. Files in HDFS are write-once; no random writes to files are allowed. HDFS is optimized for large, streaming reads of files, rather than random reads.

How Files Are Stored
Files are split into blocks; each block is usually 64 MB or 128 MB. Data is distributed across many machines at load time. Different blocks from the same file will be stored on different machines; this provides for efficient MapReduce processing (see later). Blocks are replicated across multiple machines, known as DataNodes. Default replication is three-fold, meaning that each block exists on three different machines. A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located. This is known as the metadata.

How Files Are Stored: Example
The NameNode holds metadata for the two files (Foo.txt and Bar.txt). The DataNodes hold the actual blocks. Each block is 64 MB or 128 MB in size, and each block is replicated three times on the cluster.

More on the HDFS NameNode
The NameNode daemon must be running at all times; if the NameNode stops, the cluster becomes inaccessible. Your system administrator will take care to ensure that the NameNode hardware is reliable! The NameNode holds all of its metadata in RAM for fast access, and keeps a record of changes on disk for crash recovery. A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode. Be careful: the Secondary NameNode is not a backup NameNode!
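The block-splitting and replication scheme above can be sketched in plain Python (an illustrative model, not the real NameNode; the real placement policy is rack-aware, and `place_blocks` and the node names here are hypothetical):

```python
# Toy model of NameNode metadata: file -> blocks -> replica locations.
import itertools

BLOCK_SIZE = 128   # MB, a common HDFS block size
REPLICATION = 3    # HDFS default replication factor

def place_blocks(file_size_mb, datanodes):
    """Return metadata mapping each block id to its replica DataNodes."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE)      # ceiling division
    nodes = itertools.cycle(datanodes)
    metadata = {}
    for block_id in range(num_blocks):
        # Three distinct nodes hold replicas of this block
        metadata[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return metadata

meta = place_blocks(300, ["dn1", "dn2", "dn3", "dn4"])
print(len(meta))   # 3 -- a 300 MB file becomes 3 blocks (128 + 128 + 44)
print(meta[0])     # ['dn1', 'dn2', 'dn3']
```

Note that, as the "Points to Note" below explain, the final 44 MB block occupies only 44 MB on disk; blocks are a logical unit, not a pre-allocated one.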
- CDH4 introduces NameNode High Availability, so the NameNode is no longer a single point of failure. It features an Active and a Standby NameNode.

HDFS: Points To Note
- Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used.
- Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files. This will be set by the system administrator.
- Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster.
- When a client application wants to read a file, it communicates with the NameNode to determine which blocks make up the file and which DataNodes those blocks reside on. It then communicates directly with the DataNodes to read the data, so the NameNode will not be a bottleneck.

Accessing HDFS
- Applications can read and write HDFS files directly via the Java API (covered later in the course).
- Typically, files are created on a local filesystem and must be moved into HDFS. Likewise, files stored in HDFS may need to be moved to a machine's local filesystem.
- Access to HDFS from the command line is achieved with the hadoop fs command.

hadoop fs Examples
hadoop fs Examples (contd)
hadoop fs Examples (contd)

Hands-On Exercise: Using HDFS
Aside: The Training Virtual Machine
- During this course, you will perform numerous Hands-On Exercises using the Training Virtual Machine (VM).
- The VM has Hadoop installed in pseudo-distributed mode. This essentially means that it is a cluster comprised of a single node.
- Using a pseudo-distributed cluster is the typical way to test your code before you run it on your full cluster. It operates almost exactly like a real cluster; a key difference is that the data replication factor is set to 1, not 3.

Hands-On Exercise: Using HDFS
In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS. Please refer to the Hands-On Exercise Manual.

What is MapReduce
MapReduce is a method for distributing a task across multiple nodes. Each
node processes data stored on that node, where possible.
- Consists of two phases: Map and Reduce.

Features of MapReduce
- Automatic parallelization and distribution
- Fault-tolerance
- Status and monitoring tools
- A clean abstraction for programmers
- MapReduce programs are usually written in Java, but can be written in any language using Hadoop Streaming. (All of Hadoop itself is written in Java.)
- MapReduce abstracts all the housekeeping away from the developer, who can concentrate simply on writing the Map and Reduce functions.

Giant Data: MapReduce and Hadoop
In 2010, Facebook sat on top of a mountain of data; just one year later it had grown from 21 to 30 petabytes. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State Building in New York. Enterprises like Google and Facebook use the MapReduce approach to process petabyte-range volumes of data. For some analyses it is an attractive alternative to SQL databases, and Apache Hadoop exists as an open source implementation.

MapReduce: Automatically Distributed
Processing and analyzing such data needs to take place as a distributed process on multiple machines. However, this kind of processing has always been very complex, and much time is spent solving recurring problems: processing in parallel, distributing data to the compute nodes, and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the MapReduce framework.

MapReduce Framework
The MapReduce framework breaks data processing down into map, shuffle, and reduce phases. Processing runs mainly in parallel on multiple compute nodes.

MapReduce: Map Phase
The Map Phase
The Shuffle Phase
The Reduce Phase

MapReduce Programming Example: Search Engine
A web search engine is a good example of the use of MapReduce.
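The map, shuffle, and reduce phases can be simulated in plain Python for the classic word-count problem. This sketch mirrors the flow of a MapReduce job but is not Hadoop API code: the map function emits (word, 1) pairs, the shuffle groups values by key in sorted key order, and the reduce function sums each group.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair of (word, 1) for every word in the line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values for each intermediate key, in sorted key order
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Sum the occurrence counts for one word
    return (key, sum(values))

lines = ["the cat sat", "the cat ran"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate))
# counts == {"cat": 2, "ran": 1, "sat": 1, "the": 2}
```

In real Hadoop the mappers and reducers run on different machines and the shuffle moves data across the network, but the logical data flow is exactly this pipeline.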
A set of MapReduce programs is used to implement the PageRank algorithm, which Google uses to evaluate the relevance of a page on the web.
Map Method:
Reduce Method:
Schematic process of a MapReduce computation
The Use of a Combiner
The use of a combiner makes sense for arithmetic operations in particular.

MapReduce: The Big Picture

The Five Hadoop Daemons
Hadoop is comprised of five separate daemons:
- NameNode: holds the metadata for HDFS
- Secondary NameNode: performs housekeeping functions for the NameNode; it is not a backup or hot standby for the NameNode
- DataNode: stores actual HDFS data blocks
- JobTracker: manages MapReduce jobs, distributing individual tasks to machines running the TaskTracker
- TaskTracker: instantiates and monitors individual Map and Reduce tasks

The Five Hadoop Daemons (contd)
- Each daemon runs in its own Java Virtual Machine (JVM).
- No node on a real cluster will run all five daemons, although this is technically possible.
- We can consider nodes to be in two different categories:
- Master Nodes: run the NameNode, Secondary NameNode, and JobTracker daemons. Only one of each of these daemons runs on the cluster.
- Slave Nodes: run the DataNode and TaskTracker daemons. A slave node will run both of these daemons.

Basic Cluster Configuration
Basic Cluster Configuration (Contd)
- On very small clusters, the NameNode, JobTracker and Secondary NameNode can all reside on a single machine. It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes.
- Each dotted box on the previous diagram represents a separate Java Virtual Machine (JVM).

Submitting A Job
- When a client submits a job, its configuration information is packaged into an XML file.
- This file, along with the .jar file containing the actual program code, is handed to the JobTracker.
- The JobTracker then parcels out individual tasks to TaskTracker nodes.
- When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task.
- TaskTracker nodes can be configured to run multiple tasks at the same
time, if the node has enough processing power and memory.

MapReduce: The JobTracker
- MapReduce jobs are controlled by a software daemon known as the JobTracker, which resides on a master node.
- Clients submit MapReduce jobs to the JobTracker.
- The JobTracker assigns Map and Reduce tasks to other nodes on the cluster. These nodes each run a software daemon known as the TaskTracker.
- The TaskTracker is responsible for actually instantiating the Map or Reduce task and reporting progress back to the JobTracker.

MapReduce: Terminology
- A job is a full program: a complete execution of Mappers and Reducers over a dataset.
- A task is the execution of a single Mapper or Reducer over a slice of data.
- A task attempt is a particular instance of an attempt to execute a task. There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution (see later) can also result in more task attempts than completed tasks.

MapReduce: The Mapper
MapReduce: The Mapper (contd)
- The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a line of a file at a time: the key is the byte offset into the file at which the line starts, and the value is the contents of the line itself. Typically the key is considered irrelevant.
- If the Mapper writes anything out, the output must be in the form of key/value pairs.

Example Mapper: Upper Case Mapper
Example Mapper: Explode Mapper
Example Mapper: Filter Mapper
Example Mapper: Changing Keyspaces

MapReduce: The Reducer
- After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list.
- This list is given to a Reducer. There may be a single Reducer or multiple Reducers; this is specified as part of the job configuration (see later).
- All values associated with a particular intermediate key are guaranteed to go to the same Reducer.
- The intermediate keys, and their value lists, are passed to the Reducer in
sorted key order. This step is known as the shuffle and sort.
- The Reducer outputs zero or more final key/value pairs, which are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key.

Example Reducer: Sum Reducer
Example Reducer: Identity Reducer
MapReduce Example: Word Count
MapReduce Example: Word Count (Contd)
MapReduce Example: Word Count (Contd)

MapReduce: Data Locality
- Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS. If this is not possible, the Map task will have to transfer the data across the network as it processes that data.
- Once the Map tasks have finished, data is transferred across the network to the Reducers. Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers: in general, all Mappers will have to communicate with all Reducers.

MapReduce: Is Shuffle and Sort a Bottleneck?
- It appears that the shuffle and sort phase is a bottleneck, because the reduce method in the Reducers cannot start until all Mappers have finished.
- In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work. This avoids a huge amount of data transfer starting as soon as the last Mapper finishes.
- Note that this behavior is configurable: the developer can specify the percentage of Mappers which should finish before Reducers start retrieving data. The developer's reduce method still does not start until all intermediate data has been transferred and sorted.

MapReduce: Is a Slow Mapper a Bottleneck?
- It is possible for one Map task to run more slowly than the others, perhaps due to faulty hardware or just a very slow machine.
- It would appear that this would create a bottleneck, because the reduce method in the Reducer cannot start until every Mapper has finished.
- Hadoop uses speculative execution to mitigate this: if a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data. The results of the first Mapper to finish will be used, and Hadoop will kill off the Mapper which is still running.

Hands-On Exercise: Running A MapReduce Job
In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster. Please refer to the Hands-On Exercise Manual.

Summary
In this chapter you have learned:
- What Hadoop is all about
- The components in Hadoop
- The concept and detailed architecture of HDFS
- The features the Hadoop Distributed File System (HDFS) provides
- The concepts behind MapReduce
- A few illustrations of MapReduce and how it works in real time

Hadoop and the Data Warehouse: When and Where to Use Which
At the end of this lesson, you will learn to:
- Answer the question: when should I use Hadoop, and when should I put the data into a data warehouse?
- Identify the Hadoop differentiators
- Identify the data warehouse differentiators
- Decide when and where to use which

Lesson 3: Hadoop and the Data Warehouse
Figure 1. Before: Data flow of meter reading done manually
Figure 2. After: Meter reading every 5 or 60 minutes via smart meters

Hadoop Differentiators
- Hadoop is the repository and refinery for raw data.
- Hadoop is a powerful, economical and active archive.

Data Warehouse Differentiators
- Data warehouse performance
- Integrated data that provides business value
- Interactive BI tools for end users

While there are certain use cases that are distinct to Hadoop or the data warehouse, there is also overlap where either technology could be effective.
The following table is a good starting place for helping to decide which platform to use based on your requirements.

When and Where to Use Which

Summary
In this chapter you have:
- Explored the question: when should I use Hadoop, and when should I put the data into a data warehouse?
- Learned the Hadoop and data warehouse differentiators
- Discussed when and where to use which

Introducing Hadoop Ecosystem Components
At the end of this lesson, you will have a quick overview of a few key Hadoop ecosystem projects: Hive, Pig, Flume, Sqoop, Oozie and HBase. The details of each one, with demos and hands-on exercises, will be covered in a separate module.

Lesson 4

Other Ecosystem Projects: Introduction
- The term "Hadoop core" refers to HDFS and MapReduce.
- Many other projects exist which use Hadoop core: either both HDFS and MapReduce, or just HDFS.
- Most are Apache projects or Apache Incubator projects. Some others are not hosted by the Apache Software Foundation; these are often hosted on GitHub or a similar repository.
- We will investigate many of these projects later in the course. Following is an introduction to some of the most significant projects.

Hive
- Hive is an abstraction on top of MapReduce. It allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
- Hive uses the HiveQL language, which is very similar to SQL.
- The Hive Interpreter runs on a client machine. It turns HiveQL queries into MapReduce jobs and submits those jobs to the cluster.
- Note: this does not turn the cluster into a relational database server. It is still simply running MapReduce jobs; those jobs are created by the Hive Interpreter.

Hive (contd)

Pig
- Pig is an alternative abstraction on top of MapReduce. It uses a dataflow scripting language called PigLatin.
- The Pig interpreter runs on the client machine. It takes the PigLatin script, turns it into a series of MapReduce jobs, and submits those jobs to the cluster.
- As with Hive, nothing magical happens on the cluster: it is still simply running MapReduce jobs.

Pig (Contd)
Flume
- Flume provides a method to import data into HDFS as it is generated, rather than batch-processing the data later. For example: log files from a Web server.
A high-level diagram

Sqoop
- Sqoop provides a method to import data from tables in a relational database into HDFS. It does this very efficiently via a Map-only MapReduce job.
- It can also go the other way, populating database tables from files in HDFS.
- We will investigate Sqoop later in the course.

Oozie
- Oozie allows developers to create a workflow of MapReduce jobs, including dependencies between jobs.
- The Oozie server submits the jobs to the cluster in the correct sequence.
- We will investigate Oozie later in the course.

HBase
- HBase is "the Hadoop database", a NoSQL datastore.
- It can store massive amounts of data: gigabytes, terabytes, and even petabytes of data in a table.
- It scales to provide very high write throughput: hundreds of thousands of inserts per second.
- It copes well with sparse data: tables can have many thousands of columns, even if most columns are empty for any given row.
- It has a very constrained access model: insert a row, retrieve a row, do a full or partial table scan. Only one column (the row key) is indexed.

HBase vs Traditional RDBMSs

Summary
In this chapter you have learned about the different Hadoop ecosystem projects, namely:
- Hive
- Pig
- Sqoop
- Flume
- Oozie
- HBase (a Hadoop datastore)
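The constrained HBase access model described above (put a row, get a row by its row key, scan a full or partial key range) can be sketched with a toy table in plain Python. This is a conceptual illustration, not the HBase client API; the row keys and column names are made up for the example.

```python
class ToyTable:
    """A toy key-value table illustrating HBase's constrained access model:
    only the row key is indexed, and the only operations are put, get,
    and a (full or partial) scan in row-key order."""

    def __init__(self):
        self.rows = {}                      # row key -> {column: value}

    def put(self, row_key, columns):
        # Insert or update a row; missing columns stay absent (sparse data)
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Point lookup by row key only; no secondary indexes
        return self.rows.get(row_key)

    def scan(self, start=None, stop=None):
        # Full or partial table scan in sorted row-key order
        for key in sorted(self.rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, self.rows[key]

t = ToyTable()
t.put("row1", {"cf:name": "alice"})
t.put("row2", {"cf:name": "bob"})
t.put("row9", {"cf:name": "carol"})
partial = [k for k, _ in t.scan(start="row1", stop="row3")]   # ["row1", "row2"]
```

Note what is missing compared with an RDBMS: no joins, no secondary indexes, no ad-hoc queries on column values. That constraint is what lets HBase scale to very high write throughput.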