Ebook551 pages3 hours

Learning Cascading

Name: Learning Cascading
Author: Michael Covert
ISBN: 9781785285233

By Michael Covert and Victoria Loewengart

Rating: 0 out of 5 stars

()

Read preview

About this ebook

About This Book

Understand how Cascading fits into the big data landscape and hides the complexity of MapReduce to enable the development of streamlined, maintainable, and concise applications
Develop a real-life Cascading application that can be easily customized for your specific needs
Learn basic and advanced features of Cascading through a practical, hands-on approach with step-by-step instructions and code samples

Who This Book Is For

This book is intended for software developers, system architects and analysts, big data project managers, and data scientists who wish to deploy big data solutions using the Cascading framework. You must have a basic understanding of the big data paradigm and should be familiar with Java development techniques.

Skip carousel

LanguageEnglish

PublisherPackt Publishing

Release dateMay 29, 2015

ISBN9781785285233

Author

Michael Covert

Related authors

Skip carousel

Related to Learning Cascading

Related ebooks

Skip carousel

Programming MapReduce with Scalding
Ebook
Programming MapReduce with Scalding
byAntonios Chalkiopoulos
Rating: 0 out of 5 stars
0 ratings
Cassandra 3.x High Availability - Second Edition
Ebook
Cassandra 3.x High Availability - Second Edition
byRobbie Strickland
Rating: 0 out of 5 stars
0 ratings
Hadoop Blueprints
Ebook
Hadoop Blueprints
byAnurag Shrivastava
Rating: 0 out of 5 stars
0 ratings
Spark for Data Science
Ebook
Spark for Data Science
bySrinivas Duvvuri
Rating: 0 out of 5 stars
0 ratings
OpenStack Sahara Essentials
Ebook
OpenStack Sahara Essentials
byOmar Khedher
Rating: 0 out of 5 stars
0 ratings
OpenStack Object Storage (Swift) Essentials
Ebook
OpenStack Object Storage (Swift) Essentials
byAmar Kapadia
Rating: 0 out of 5 stars
0 ratings
Real-Time Big Data Analytics
Ebook
Real-Time Big Data Analytics
byShilpi
Rating: 5 out of 5 stars
5/5
SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform
Ebook
SQL Server Big Data Clusters: Data Virtualization, Data Lake, and AI Platform
byBenjamin Weissman
Rating: 0 out of 5 stars
0 ratings
Learning Apache Cassandra - Second Edition
Ebook
Learning Apache Cassandra - Second Edition
bySandeep Yarabarla
Rating: 0 out of 5 stars
0 ratings
HDInsight Essentials - Second Edition
Ebook
HDInsight Essentials - Second Edition
byRajesh Nadipalli
Rating: 0 out of 5 stars
0 ratings
Parallel Python with Dask
Ebook
Parallel Python with Dask
byTim Peters
Rating: 0 out of 5 stars
0 ratings
Parallel Python with Dask: Perform distributed computing, concurrent programming and manage large dataset
Ebook
Parallel Python with Dask: Perform distributed computing, concurrent programming and manage large dataset
byTim Peters
Rating: 0 out of 5 stars
0 ratings
JavaScript Concurrency
Ebook
JavaScript Concurrency
byBoduch Adam
Rating: 0 out of 5 stars
0 ratings
Deep Learning for Numerical Applications with SAS
Ebook
Deep Learning for Numerical Applications with SAS
byHenry Bequet
Rating: 0 out of 5 stars
0 ratings
Fast Data Processing with Spark 2 - Third Edition
Ebook
Fast Data Processing with Spark 2 - Third Edition
byKrishna Sankar
Rating: 0 out of 5 stars
0 ratings
Optimizing Hadoop for MapReduce
Ebook
Optimizing Hadoop for MapReduce
byKhaled Tannir
Rating: 0 out of 5 stars
0 ratings
Mastering Apache Cassandra - Second Edition
Ebook
Mastering Apache Cassandra - Second Edition
byNishant Neeraj
Rating: 0 out of 5 stars
0 ratings
Hadoop Essentials
Ebook
Hadoop Essentials
byShiva Achari
Rating: 5 out of 5 stars
5/5
The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform
Ebook
The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform
byMatt How
Rating: 0 out of 5 stars
0 ratings
Monitoring Hadoop
Ebook
Monitoring Hadoop
byGurmukh Singh
Rating: 0 out of 5 stars
0 ratings
MariaDB High Performance
Ebook
MariaDB High Performance
byPierre MAVRO
Rating: 0 out of 5 stars
0 ratings
Scala for Data Science
Ebook
Scala for Data Science
byBugnion Pascal
Rating: 0 out of 5 stars
0 ratings
Hyper-V 2016 Best Practices
Ebook
Hyper-V 2016 Best Practices
byBenedict Berger
Rating: 0 out of 5 stars
0 ratings
React Components
Ebook
React Components
byChristopher Pitt
Rating: 0 out of 5 stars
0 ratings
Learning Couchbase
Ebook
Learning Couchbase
byPotsangbam Henry
Rating: 0 out of 5 stars
0 ratings
Mastering Redis
Ebook
Mastering Redis
byJeremy Nelson
Rating: 0 out of 5 stars
0 ratings
Developing with Docker
Ebook
Developing with Docker
byJarosław Krochmalski
Rating: 5 out of 5 stars
5/5
Practical hapi: Build Your Own hapi Apps and Learn from Industry Case Studies
Ebook
Practical hapi: Build Your Own hapi Apps and Learn from Industry Case Studies
byKanika Sud
Rating: 0 out of 5 stars
0 ratings
Predictive Analytics Using Rattle and Qlik Sense
Ebook
Predictive Analytics Using Rattle and Qlik Sense
byFerran Garcia Pagans
Rating: 0 out of 5 stars
0 ratings
Real-time Analytics with Storm and Cassandra
Ebook
Real-time Analytics with Storm and Cassandra
byShilpi Saxena
Rating: 0 out of 5 stars
0 ratings

Computers For You

Skip carousel

The Invisible Rainbow: A History of Electricity and Life
Ebook
The Invisible Rainbow: A History of Electricity and Life
byArthur Firstenberg
Rating: 4 out of 5 stars
4/5
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
bySteven Cooper
Rating: 4 out of 5 stars
4/5
101 Awesome Builds: Minecraft® Secrets from the World's Greatest Crafters
Ebook
101 Awesome Builds: Minecraft® Secrets from the World's Greatest Crafters
byTriumph Books
Rating: 4 out of 5 stars
4/5
Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics
Ebook
Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics
byGary Smith
Rating: 4 out of 5 stars
4/5
CompTIA IT Fundamentals (ITF+) Study Guide: Exam FC0-U61
Ebook
CompTIA IT Fundamentals (ITF+) Study Guide: Exam FC0-U61
byQuentin Docter
Rating: 0 out of 5 stars
0 ratings
The ChatGPT Millionaire Handbook: Make Money Online With the Power of AI Technology
Ebook
The ChatGPT Millionaire Handbook: Make Money Online With the Power of AI Technology
byTJ Books
Rating: 0 out of 5 stars
0 ratings
Slenderman: Online Obsession, Mental Illness, and the Violent Crime of Two Midwestern Girls
Ebook
Slenderman: Online Obsession, Mental Illness, and the Violent Crime of Two Midwestern Girls
byKathleen Hale
Rating: 4 out of 5 stars
4/5
Elon Musk
Ebook
Elon Musk
byWalter Isaacson
Rating: 4 out of 5 stars
4/5
YouTube: How to Build and Optimize Your First YouTube Channel, Marketing, SEO, Tips and Strategies for YouTube Channel Success
Ebook
YouTube: How to Build and Optimize Your First YouTube Channel, Marketing, SEO, Tips and Strategies for YouTube Channel Success
byTommy Swindali
Rating: 4 out of 5 stars
4/5
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Ebook
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
bySeth Stephens-Davidowitz
Rating: 4 out of 5 stars
4/5
The Simulation Hypothesis: An MIT Computer Scientist Shows Why AI, Quantum Physics and Eastern Mystics All Agree We Are In a Video Game
Ebook
The Simulation Hypothesis: An MIT Computer Scientist Shows Why AI, Quantum Physics and Eastern Mystics All Agree We Are In a Video Game
byRizwan Virk
Rating: 5 out of 5 stars
5/5
Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
Ebook
Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
byCea West
Rating: 5 out of 5 stars
5/5
Alan Turing: The Enigma: The Book That Inspired the Film The Imitation Game - Updated Edition
Ebook
Alan Turing: The Enigma: The Book That Inspired the Film The Imitation Game - Updated Edition
byAndrew Hodges
Rating: 4 out of 5 stars
4/5
Grokking Algorithms: An illustrated guide for programmers and other curious people
Ebook
Grokking Algorithms: An illustrated guide for programmers and other curious people
byAditya Bhargava
Rating: 4 out of 5 stars
4/5
Remote/WebCam Notarization : Basic Understanding
Ebook
Remote/WebCam Notarization : Basic Understanding
byJeannie Eunice Franks
Rating: 3 out of 5 stars
3/5
People Skills for Analytical Thinkers
Ebook
People Skills for Analytical Thinkers
byGilbert Eijkelenboom
Rating: 5 out of 5 stars
5/5
Ultimate Guide to Mastering Command Blocks!: Minecraft Keys to Unlocking Secret Commands
Ebook
Ultimate Guide to Mastering Command Blocks!: Minecraft Keys to Unlocking Secret Commands
byTriumph Books
Rating: 5 out of 5 stars
5/5
CompTIA Security+ Practice Questions
Ebook
CompTIA Security+ Practice Questions
byIP Specialist
Rating: 2 out of 5 stars
2/5
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Ebook
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
byWalter Shields
Rating: 4 out of 5 stars
4/5
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
Ebook
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
byCea West
Rating: 4 out of 5 stars
4/5
Machine Learning for Beginners: An Introduction for Beginners, Why Machine Learning Matters Today and How Machine Learning Networks, Algorithms, Concepts and Neural Networks Really Work
Ebook
Machine Learning for Beginners: An Introduction for Beginners, Why Machine Learning Matters Today and How Machine Learning Networks, Algorithms, Concepts and Neural Networks Really Work
bySteven Cooper
Rating: 4 out of 5 stars
4/5
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
byArthur T. Brooks
Rating: 0 out of 5 stars
0 ratings
Deep Search: How to Explore the Internet More Effectively
Ebook
Deep Search: How to Explore the Internet More Effectively
byAlan Pearce
Rating: 5 out of 5 stars
5/5
Dawn of the New Everything: Encounters with Reality and Virtual Reality
Ebook
Dawn of the New Everything: Encounters with Reality and Virtual Reality
byJaron Lanier
Rating: 4 out of 5 stars
4/5
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
Ebook
How to Create Cpn Numbers the Right way: A Step by Step Guide to Creating cpn Numbers Legally
byAlex Parkinson
Rating: 4 out of 5 stars
4/5
Procreate for Beginners: Introduction to Procreate for Drawing and Illustrating on the iPad
Ebook
Procreate for Beginners: Introduction to Procreate for Drawing and Illustrating on the iPad
byAaron Smith
Rating: 0 out of 5 stars
0 ratings
Summary of Dotcom Secrets: by Russell Brunson - The Underground Playbook for Growing Your Company Online with Sales Funnels - A Comprehensive Summary
Ebook
Summary of Dotcom Secrets: by Russell Brunson - The Underground Playbook for Growing Your Company Online with Sales Funnels - A Comprehensive Summary
byAlexander Cooper
Rating: 5 out of 5 stars
5/5
The Hacker Crackdown: Law and Disorder on the Electronic Frontier
Ebook
The Hacker Crackdown: Law and Disorder on the Electronic Frontier
byBruce Sterling
Rating: 4 out of 5 stars
4/5
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
Ebook
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
byDexter Jackson
Rating: 4 out of 5 stars
4/5
Master Builder Roblox: The Essential Guide
Ebook
Master Builder Roblox: The Essential Guide
byTriumph Books
Rating: 4 out of 5 stars
4/5

Related podcast episodes

Skip carousel

Troubleshooting Kafka In Production: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
Podcast episode
Troubleshooting Kafka In Production: Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
byData Engineering Podcast
0 ratings
0% found this document useful
Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems: Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
Podcast episode
Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems: Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
byData Engineering Podcast
0 ratings
0% found this document useful
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
Podcast episode
Building An Internal Database As A Service Platform At Cloudflare: Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
byData Engineering Podcast
0 ratings
0% found this document useful
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library: Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
Podcast episode
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library: Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
byData Engineering Podcast
0 ratings
0% found this document useful
Addressing The Challenges Of Component Integration In Data Platform Architectures: Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.
$Addressing The Challenges Of Component Integration In Data Platform Architectures: Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.$
$Addressing The Challenges Of Component Integration In Data Platform Architectures: Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.$
Podcast episode
Addressing The Challenges Of Component Integration In Data Platform Architectures: Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.
byData Engineering Podcast
0 ratings
0% found this document useful
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Podcast episode
Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary: Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
byData Engineering Podcast
0 ratings
0% found this document useful
Harnessing Generative AI For Creating Educational Content With Illumidesk: Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.
Podcast episode
Harnessing Generative AI For Creating Educational Content With Illumidesk: Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.
byData Engineering Podcast
0 ratings
0% found this document useful
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Podcast episode
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
byData Engineering Podcast
0 ratings
0% found this document useful
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Podcast episode
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
byData Engineering Podcast
0 ratings
0% found this document useful
Understanding Time-Series Database Patterns
Podcast episode
Understanding Time-Series Database Patterns
byThe Cloudcast
0 ratings
0% found this document useful
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams: With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.
Podcast episode
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams: With all of the messaging about treating data as a product it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.
byData Engineering Podcast
0 ratings
0% found this document useful
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
Podcast episode
Adding An Easy Mode For The Modern Data Stack With 5X: The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.
byData Engineering Podcast
0 ratings
0% found this document useful
Evolution of Public Cloud Integrators
Podcast episode
Evolution of Public Cloud Integrators
byThe Cloudcast
0 ratings
0% found this document useful
Using Data To Illuminate The Intentionally Opaque Insurance Industry: The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
Podcast episode
Using Data To Illuminate The Intentionally Opaque Insurance Industry: The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
byData Engineering Podcast
0 ratings
0% found this document useful
Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack: If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
Podcast episode
Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack: If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
byData Engineering Podcast
0 ratings
0% found this document useful
Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine: Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.
Podcast episode
Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine: Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.
byData Engineering Podcast
0 ratings
0% found this document useful
Designing Data Transfer Systems That Scale: The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
Podcast episode
Designing Data Transfer Systems That Scale: The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his careeer to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
byData Engineering Podcast
0 ratings
0% found this document useful
Scalable, Serverless Database Platforms
Podcast episode
Scalable, Serverless Database Platforms
byThe Cloudcast
0 ratings
0% found this document useful
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer: Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Podcast episode
Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer: Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
byData Engineering Podcast
0 ratings
0% found this document useful
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
Podcast episode
An Overview Of The Sate Of Data Orchestration In An Increasingly Complex Data Ecosystem: Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can manage complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its application to help inform its implementation in your environment.
byData Engineering Podcast
0 ratings
0% found this document useful
Spanner Myths Busted with Pritam Shah and Vaibhav Govil: This week, we’re busting myths around Google Cloud Spanner with our guests Pritam Shah and Vaibhav Govil. and host this episode and learn about the fantastic capabilities of Cloud Spanner. Our guests give us a quick run-down of Spanner database...
Podcast episode
Spanner Myths Busted with Pritam Shah and Vaibhav Govil: This week, we’re busting myths around Google Cloud Spanner with our guests Pritam Shah and Vaibhav Govil. and host this episode and learn about the fantastic capabilities of Cloud Spanner. Our guests give us a quick run-down of Spanner database...
byGoogle Cloud Platform Podcast
0 ratings
0% found this document useful
New Trends in Serverless
Podcast episode
New Trends in Serverless
byThe Cloudcast
0 ratings
0% found this document useful
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
Podcast episode
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development: Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
byData Engineering Podcast
0 ratings
0% found this document useful
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Podcast episode
Defining A Strategy For Your Data Products: The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
byData Engineering Podcast
0 ratings
0% found this document useful
How Redpanda Extracts Business Value from Data Events with Alex Gallego
Podcast episode
How Redpanda Extracts Business Value from Data Events with Alex Gallego
byScreaming in the Cloud
0 ratings
0% found this document useful
Developing Multi-Cloud Skills
Podcast episode
Developing Multi-Cloud Skills
byThe Cloudcast
0 ratings
0% found this document useful
Composable Data Analytics
Podcast episode
Composable Data Analytics
byThe Cloudcast
0 ratings
0% found this document useful
Powering Vector Search With Real Time And Incremental Vector Indexes: The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
Podcast episode
Powering Vector Search With Real Time And Incremental Vector Indexes: The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
byData Engineering Podcast
0 ratings
0% found this document useful
Scalable Databases on Kubernetes
Podcast episode
Scalable Databases on Kubernetes
byThe Cloudcast
0 ratings
0% found this document useful
Data Sharing Across Business And Platform Boundaries: Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Podcast episode
Data Sharing Across Business And Platform Boundaries: Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
byData Engineering Podcast
0 ratings
0% found this document useful

Skip carousel

Darq
PC Pro Magazine
Article
Darq
Jul 9, 2022
3 min read
Build A Dynamic App Security Pipeline
Linux Format
Article
Build A Dynamic App Security Pipeline
Sep 21, 2021
8 min read
“We Should Pay Attention To The Way That A New Language Can Redefine The Limits Of Computing”
PC Pro Magazine
Article
“We Should Pay Attention To The Way That A New Language Can Redefine The Limits Of Computing”
Feb 11, 2021
7 min read
Is Quantum Computing Ready For Prime Time?
APC
Article
Is Quantum Computing Ready For Prime Time?
Oct 9, 2023
4 min read
Is Quantum Computing Ready For Prime Time?
Maximum PC
Article
Is Quantum Computing Ready For Prime Time?
Nov 7, 2023
4 min read
Machine-learning On Your Android Phone?
APC
Article
Machine-learning On Your Android Phone?
Dec 30, 2019
4 min read
All Your Database Are Belong To Us
Linux Format
Article
All Your Database Are Belong To Us
Apr 6, 2021
7 min read
Is Quantum Computing Ready For Prime Time?
PC Pro Magazine
Article
Is Quantum Computing Ready For Prime Time?
Sep 7, 2023
7 min read
Build A Static Analysis Development Pipeline
Linux Format
Article
Build A Static Analysis Development Pipeline
Jul 27, 2021
9 min read
Enterprise Soaring Success
Linux Format
Article
Enterprise Soaring Success
Aug 27, 2019
7 min read
The Big Cloud Question: How Can You Protect Your Assets On Someone Else’s Servers?
PC Pro Magazine
Article
The Big Cloud Question: How Can You Protect Your Assets On Someone Else’s Servers?
Aug 10, 2023
I write this as an old fool who remembers sitting in endless meetings and presentations back when the whole concept of the cloud was starting up. So I can tell you that, right from the beginning, while vendors were pitching the all-encompassing busin
6 min read
Open Success
Linux Format
Article
Open Success
Nov 17, 2020
“ClickHouse was developed for Yandex Metrics (the Russian equivalent of Google Analytics) as a data store and was Apache 2 licenced in 2016. In 2020. Altinity picked up $4m in funding to help it finish off a ClickHouse cloud service that’s in private
1 min read
Benchmark your SSD
APC
Article
Benchmark your SSD
Nov 2, 2020
4 min read
Manage Your Apps!
Linux Format
Article
Manage Your Apps!
Nov 14, 2023
17 min read
“It’s Very Important To Takea Long, Hard, Cynical Look At Your Servere State”
PC Pro Magazine
Article
“It’s Very Important To Takea Long, Hard, Cynical Look At Your Servere State”
Feb 10, 2022
8 min read
Powering Costing With Artificial Intelligence: The Case Of Vodafone Procurement
The European Business Review
Article
Powering Costing With Artificial Intelligence: The Case Of Vodafone Procurement
May 25, 2021
8 min read
Data Fabric
PC Pro Magazine
Article
Data Fabric
Aug 13, 2020
3 min read
Scan Cloud RTX Virtual Workstation
PC Pro Magazine
Article
Scan Cloud RTX Virtual Workstation
Aug 7, 2022
2 min read
Art Beyond The Canvas
Linux Format
Article
Art Beyond The Canvas
May 2, 2023
9 min read
REDUCING IT COSTS FOR SMEs
PC Pro Magazine
Article
REDUCING IT COSTS FOR SMEs
Jan 5, 2023
Where is your company’s data stored? It’s all the rage to push data up into the cloud and to make it someone else’s problem. However, this is rarely the real outcome. While I would accept that a well-run data centre is likely to be more robust than a
4 min read
Manipulate Data Like A Pro With Pandas
Linux Format
Article
Manipulate Data Like A Pro With Pandas
Jul 27, 2021
7 min read
It’s Great When You’re K8s
Linux Format
Article
It’s Great When You’re K8s
Oct 18, 2022
8 min read
Join the Pod, Man!
Linux Format
Article
Join the Pod, Man!
May 30, 2023
8 min read
How to Move From CrashPlan for Home to Another Backup Solution
MacWorld
Article
How to Move From CrashPlan for Home to Another Backup Solution
Sep 14, 2017
8 min read
DataStax The Real-time Data Company, Unveiled “Change Data Capture” (CDC) for Astra DB
Techfastly
Article
DataStax The Real-time Data Company, Unveiled “Change Data Capture” (CDC) for Astra DB
May 1, 2022
3 min read
Mocap In The Cloud: The Future Of Motion Capture
3D World
Article
Mocap In The Cloud: The Future Of Motion Capture
Nov 30, 2021
4 min read
Rolling The Database As A Service
Linux Format
Article
Rolling The Database As A Service
Aug 27, 2019
A couple of times during our conversation, Robin alluded to the fact that DataStax has now set its eyes on helping users eradicate some of the day-to-day operational complexity from their workflow. The DataStax Apache Cassandra as a Service is one of
2 min read
Poisoning The Well
Linux Format
Article
Poisoning The Well
Jan 11, 2022
4 min read
AWS vs Azure
Linux Format
Article
AWS vs Azure
Aug 22, 2023
9 min read
Salesforce Adding Einstein Analytics Al To Tableau Platform
Techfastly
Article
Salesforce Adding Einstein Analytics Al To Tableau Platform
Feb 4, 2021
3 min read

Related categories

Skip carousel

Reviews for Learning Cascading

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Learning Cascading - Michael Covert

Learning Cascading

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. The Big Data Core Technology Stack

Reviewing Hadoop

Hadoop architecture

HDFS – the Hadoop Distributed File System

The NameNode

The secondary NameNode

DataNodes

MapReduce execution framework

The JobTracker

The TaskTracker

Hadoop jobs

Distributed cache

Counters

YARN – MapReduce version 2

A simple MapReduce job

Beyond MapReduce

The Cascading framework

The execution graph and flow planner

How Cascading produces MapReduce jobs

Summary

2. Cascading Basics in Detail

Understanding common Cascading themes

Data flows as processes

Understanding how Cascading represents records

Using tuples and defining fields

Using a Fields object, named field groups, and selectors

Data typing and coercion

Defining schemes

Schemes in detail

TupleEntry

Understanding how Cascading controls data flow

Using pipes

Creating and chaining

Pipe operations

Each

Splitting

GroupBy and sorting

Every

Merging and joining

The Merge pipe

The join pipes – CoGroup and HashJoin

CoGroup

HashJoin

Default output selectors

Using taps

Flow

FlowConnector

Cascades

Local and Hadoop modes

Common errors

Putting it all together

Summary

3. Understanding Custom Operations

Understanding operations

Operations and fields

The Operation class and interface hierarchy

The basic operation lifecycle

Contexts

FlowProcess

OperationCall

An operation processing sequence and its methods

Operation types

Each operations

Filters

Filter calling sequence

Built-in filters

Function

Function calling sequence

Built-in functions

Every operations

Aggregator

Aggregator calling sequence

Built-in aggregators

Buffers

Buffer calling sequence

Built-in buffers

Assertions

ValueAssertion calling sequence

GroupAssertion calling sequence

AssertionLevel

Using assertions

Built-in assertions

A note about implementing BaseOperation methods

Summary

4. Creating Custom Operations

Writing custom operations

Writing a filter

Writing a function

Writing an aggregator

Writing a custom assertion

Writing a buffer

Identifying common use cases for custom operations

Putting it all together

Summary

5. Code Reuse and Integration

Creating and using subassemblies

Built-in subassemblies

Creating a new custom subassembly

Using custom subassemblies

Using cascades

Building a complex workflow using cascades

Skipping a flow in a cascade

Intermediate file management

Dynamically controlling flows

Instrumentation and counters

Using counters to control flow

Using existing MapReduce jobs

Using fluent programming techniques

The FlowDef fluent interface

Integrating external components

Flow and cascade events

Using external JAR files

Using Cascading as insulation from big data migrations and upgrades

Summary

6. Testing a Cascading Application

Debugging a Cascading application

Getting your environment ready for debugging

Using Cascading local mode debugging

Setting up Eclipse

Remote debugging

Using assertions

The Debug() filter

Managing exceptions with traps

Checkpoints

Managing bad data

Viewing flow sequencing using DOT files

Testing strategies

Unit testing and JUnit

Mocking

Integration testing

Load and performance testing

Summary

7. Optimizing the Performance of a Cascading Application

Optimizing performance

Optimizing Cascading

Optimizing Hadoop

A note about the effective use of checkpoints

Summary

8. Creating a Real-world Application in Cascading

Project description – Business Intelligence case study on monitoring the competition

Project scope – understanding requirements

Understanding the project domain – text analytics and natural language processing (NLP)

Conducting a simple named entity extraction

Defining the project – the Cascading development methodology

Project roles and responsibilities

Conducting data analysis

Performing functional decomposition

Designing the process and components

Creating and integrating the operations

Creating and using subassemblies

Building the workflow

Building flows

Managing the context

Building the cascade

Designing the test plan

Performing a unit test

Performing an integration test

Performing a cluster test

Performing a full load test

Refining and adjusting

Software packaging and delivery to the cluster

Next steps

Summary

9. Planning for Future Growth

Finding online resources

Using other Cascading tools

Lingual

Pattern

Driven

Fluid

Load

Multitool

Support for other languages

Hortonworks

Custom taps

Cascading serializers

Java open source mock frameworks

Summary

A. Downloadable Software

Contents

Installing and using

Index

Learning Cascading

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: May 2015

Production reference: 1250515

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-891-3

www.packtpub.com

Credits

Authors

Michael Covert

Victoria Loewengart

Reviewers

Allen Driskill

Bernie French

Supreet Oberoi

Commissioning Editor

Veena Pagare

Acquisition Editor

Vivek Anantharaman

Content Development Editor

Prachi Bisht

Technical Editors

Ruchi Desai

Ankita Thakur

Copy Editors

Sonia Michelle Cheema

Ameesha Green

Project Coordinator

Shipra Chawhan

Proofreaders

Stephen Copestake

Safis Editing

Indexer

Monica Ajmera Mehta

Graphics

Disha Haria

Production Coordinator

Nilesh R. Mohite

Cover Work

Nilesh R. Mohite

Foreword

The Cascading project was started in 2007 to complete the promise that Apache Hadoop was indirectly making to people like me—that we can dramatically simplify data-oriented application development and deployment. This can be done not only from a tools perspective, but more importantly, from an organizational perspective. Take a thousand machines and make them look like one: one storage layer and a computing layer. This promise means I would never have to ask our IT group for another storage array, more disk space, or another overpriced box to manage. My team and I could just throw our data and applications at the cluster and move on. The problem with this story is that we would have to develop our code against the MapReduce model, forcing us to think in MapReduce. I've only ever written one MapReduce application and it was a terrible experience. I did everything wrong.

Cascading was originally designed to help you, the developer, use your existing skills and do the right thing initially, or more importantly, understand whether assumptions about your data were right and wrong so that you can quickly compensate.

In the past couple of years, we have seen the emergence of new models improving on what MapReduce started. These models are driven by new expectations around latency and scale. Thinking in MapReduce is very difficult, but at least you can reason in it. Some of the newer models do not provide enough mental scaffolding to even help you reason out your problem.

Cascading has been evolving to help insulate you from the need to intimately know these new models while allowing businesses to leverage them as they become stable and available. This work started with Cascading 2.0 a few years ago. Cascading 3.0, still under development at the time of writing this, has evolved to a much deeper level, making a new promise to developers that they can write data-oriented applications once, and with little effort can adapt to new models and infrastructure.

What shouldn't come as a surprise is that Cascading has also grown to become its own ecosystem. It supports nearly every type of data source and has been ported to multiple JVM-based programming languages, including Python (Jython), Ruby (JRuby), Scala (via Scalding), and Clojure (via Cascalog). Cascading has a vibrant community of developers and users alike. Having been adopted by companies, such as Twitter, Etsy, eBay, and others, Cascading is the foundation for many advanced forms of analytics from machine learning, genomics, to being the foundation of new data-oriented languages, DSLs, and APIs. It is also the foundation of many commercial products that advertise Hadoop compatibility.

I am very enthusiastic about this book. Here, at Concurrent, we see the need for this book because it consolidates so much information in one place. This book provides details that are not always well documented or can be difficult to find. It contains many concrete coding examples, how-to tips, systems integration strategies, performance and tuning steps, debugging techniques, and advice on future directions to take. Also included is a real-life, end-to-end application that can serve as a reference architecture for development using Cascading. This book contains much that is valuable to developers, managers, and system administrators.

Chris K. Wensel

CTO, Concurrent, Inc.

About the Authors

Michael Covert, CEO, Analytics Inside LLC, has significant experience in a variety of business and technical roles. Michael is a mathematician and computer scientist and is involved in machine learning, deep learning, predictive analytics, graph theory, and big data. He earned a bachelor's of science degree in mathematics with honors and distinction from The Ohio State University. He also attended it as a PhD student, specializing in machine learning and high-performance computing. Michael is a Cloudera Hadoop Certified Developer.

Michael served as the vice president of performance management in Whittman-Hart, Inc., based in Chicago, and as the chief operating officer of Infinis, Inc., a business intelligence consulting company based in Columbus, Ohio. Infinis merged with Whittman-Hart in 2005. Prior to working at Infinis, Michael was the vice president of product development and chief technology officer at Alta Analytics, and the producer of data mining and visualization software. In addition to this, he has served in technology management roles for Claremont Technology Group, Inc., where he was the director of advanced technology.

I would like to thank the editors of Packt Publishing who guided us through this writing process and made significant contributions to what is contained here. I would also like to thank Bernie, Allen, and Supreet who expended tremendous effort through many hours of technical review, and in doing so, helped to ensure that this book was of the highest quality.

More than anyone else, I would like to thank my wife, Kay, for her support while writing this book, and more so, throughout my entire adult existence. Without her, most of the good things that have occurred in my life would not have been possible. I also would like thank my dogs, Ellie and Josie, who served as constant, faithful companions through all of this. Lastly, I would like to thank my coauthor, business partner, and most of all, my friend, Victoria Loewengart. She has not only helped make all this possible, but working with her also made it fun and rewarding.

Victoria Loewengart, COO, Analytics Inside LLC, is an innovative software systems architect with a proven record of bringing emerging technologies to clients through discovery, design, and integration. Additionally, Victoria spent a large part of her career developing software technologies that extract information from unstructured text. Victoria has published numerous articles on topics ranging from text analytics to intelligence analysis and cyber security. Her book An Introduction to Hacking & Crimeware: A Pocket Guide was published by IT Governance, UK, in January 2012. Victoria earned a bachelor's degree in computer science from Purdue University and a master's degree in intelligence studies from the American Military University.

I am indebted to the editors of Packt Publishing who did a fantastic job in making sure that this book sees the light of day. Also, I would like to thank our reviewers Bernie, Allen, and Supreet for putting a lot of time and effort into making sure that this book addresses all the necessary issues and does not have technical errors.

On a personal note, I would like to thank my wonderful husband, Steven Loewengart, for his never-ending love, patience, and support through the process of writing this book. I would also like to thank my parents, Boris and Fanya Rudkevich, for their encouragement and faith in me. I am grateful to my daughters, Gina and Sophie, for keeping me young and engaged in life. Finally, I would like to thank my coauthor, friend, and business partner, Mike Covert, for his wisdom and mentorship, and without whom, this book would have not happened.

About the Reviewers

Allen Driskill holds a BS in computer science from Murray State University and an MS in information science from The Ohio State University, where he has also been an instructor.

Since 1974, Allen has been a software developer in various fields and guises. He has developed business systems on a variety of small and large platforms for NCR, Burroughs, IBM, online information delivery systems, search and retrieval, hierarchical storage and management, telecommunications, imbedded systems for various handheld devices, and NLP systems for chemical information analysis.

He is currently employed by Chemical Abstracts Service in Columbus, Ohio, as a senior research scientist. In this capacity, he is currently using a large Hadoop cluster to process chemical information from a variety of massive databases.

Bernie French has over 20 years of experience in bio and cheminformatics. His interests are in the field of graph-based knowledge and analytic systems. He has developed analytic solutions within the Hadoop ecosystem using MapReduce, Cascading, and Giraph. He has also created analytic solutions using Spark and GraphX technologies. Bernie has a PhD in molecular biology from The Ohio State University and was an NIH Fellow at the The Wistar Institute, where he specialized in molecular and cellular biology. He was also an NIH hematology/oncology Fellow at The Ohio State University's Comprehensive Cancer Center.

Supreet Oberoi is the vice president of field engineering at Concurrent, Inc. Prior to this, he was the director of big data application infrastructure for American Express, where he led the development of use cases for fraud, operational risk, marketing, and privacy on big data platforms. He holds multiple patents in data engineering and has held executive and leadership positions at Real-Time Innovations, Oracle, and Microsoft.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Big data is the new must have of this century. Suddenly, everyone wants to manage huge amounts of data and find patterns in it, which they did not see before. The problem, however, is that big data is not one, but a whole slew of technologies, which work together to produce the desired outcome. New technologies are emerging, and existing ones are evolving very quickly. The design and development of big data systems is very complex, with an incredibly steep learning curve and not a whole lot of prior experience and best practices to rely on. This is where Cascading comes in. Cascading sits on top of the core big data frameworks and makes design and development of big data applications intuitive and fun! Cascading significantly simplifies and streamlines application development, job creation, and job scheduling. Cascading is an open source software, and many large organizations prefer to use it to manage their big data systems over more complicated solutions, such as MapReduce.

We discovered Cascading after our applications started to get to the level of complexity when pure MapReduce turned into a blood-pressure-raising nightmare. Needless to say, we fell in love with Cascading. Now, we train other developers on it, and evangelize it every chance we get. So, this is how our book came about. Our vision was to provide a book to the Cascading user community that will help them accelerate the development of complex, workflow-based Cascading applications, while still keeping their sanity intact so that they can enjoy life.

This book will teach you how to quickly develop practical Cascading applications, starting with the basics and gradually progressing into more complex topics. We start with a look under the hood, how Cascading relates to core big data technologies, such as Hadoop MapReduce, and future emerging technologies, such as Tez, Spark, Storm, and others. Having gained an understanding of underlying technologies, we follow with a comprehensive introduction to the Cascading paradigm and components using well-tested code examples that go beyond the ones in the open domain that exist today. Throughout this book, you will receive expert advice on how to use the portions of a product that are undocumented or have limited documentation. To deepen your knowledge and experience with Cascading, you will work with a real-life case study using Natural Language Processing to perform text analysis and search large volumes of unstructured text. We conclude with a look to the future, and how Cascading will soon run on additional big data fabrics, such as Spark and Tez.

Cascading has rapidly gained popularity, and obtaining development skills in this product is a very marketable feature for a big data professional. With in-depth instructions and hands-on practical approaches, Learning Cascading will ensure your mastery of Cascading.

What this book covers

Chapter 1, The Big Data Core Technology Stack, introduces you to the concepts of big data and the core technologies that comprise it. This knowledge is essential as the foundation of learning Cascading.

Chapter 2, Cascading Basics in Detail, we will begin our exploration of Cascading. We will look at its intended purpose, how it is structured, how to develop programs using it base classes, and how it then manages the execution of these programs.

Chapter 3, Understanding Custom Operations, is based on the newly acquired Cascading knowledge. We will learn more advanced Cascading features that will truly unleash the power of the Cascading API. We will concentrate on an in-depth discussion about operations and their five primary types.

Chapter 4, Creating Custom Operations, will show you how to actually write a custom operation for each type of operation. This is done after we've completely explored and understood the anatomy of operations in the previous chapter.

Chapter 5, Code Reuse and Integration, introduces you to learning how to assemble reusable code, integrating Cascading with external systems, and using the Cascading instrumentation effectively to track and control execution. Specifically, we will discuss how Cascading fits into the overall processing infrastructure.

Chapter 6, Testing a Cascading Application, provides in-depth information on how to efficiently test a Cascading application using many techniques.

Chapter 7, Optimizing the Performance of a Cascading Application, provides in-depth information on how to efficiently optimize a Cascading application and configure the underlying framework (Hadoop) for maximum performance.

Chapter 8, Creating a Real-world Application in Cascading, takes everything we learned and creates a real-world application in Cascading. We will use a practical case study, which could be easily adapted to a functional component of a larger system within your organization.

Chapter 9, Planning for Future Growth, provides information that will allow you to continue to advance your skills.

Appendix, Downloadable Software, provides information that will allow you to download all the code from the book.

What you need for this book

Software:

Eclipse 4.x (Kepler, Luna, or beyond) or other Java IDE

Java 1.6 or above (1.7 preferred)

Cascading 2,2 or above (2.6.2 preferred)

Hadoop Virtual Machine is optional, but it'll be nice to have (Cloudera or Hortonworks are preferred)

OS:

If you have a Hadoop Virtual Machine: Windows 7 or above

If no Hadoop Virtual Machine: Linux (CentOS 6 and above preferred)

Hadoop version 2 (YARN) running MapReduce version 1 is preferred. Hadoop MapReduce (0.23 or beyond) will also work

Who this book is for

This book is intended for software developers, system architects, system analysts, big data project managers, and data scientists.

The reader of this book should have a basic understanding of the big data paradigm, Java development skills, and be interested in deploying big data solutions in their organization, specifically using the Cascading open source framework.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: A key script, such as hadoop-env.sh, is needed in instances where environment variables are set.

A block of code is set as follows:

public class

Enjoying the preview?

Page 1 of 1

Learning Cascading

About this ebook

Michael Covert

Related authors

Related to Learning Cascading

Related ebooks

Computers For You

Related podcast episodes

Related articles

Related categories

Reviews for Learning Cascading

What did you think?

Book preview

Learning Cascading - Michael Covert

Table of Contents

Learning Cascading

Learning Cascading

Credits

Foreword

About the Authors

About the Reviewers

Support files, eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions