Assignment 1 Spec

REVA UNIVERSITY
School of Computing and Information Technology
B. Tech VII Sem (For all Sections)

Big Data and Hadoop Assignment Specifications
NOTE:
1. Check the following links appended in the last section of this document for reference. The three
sample project specifications are also listed
2. Mini-Project Exhibition shall be held on Saturday, 10th of November 2019, 9:00 am to 1:00 pm
Assignment-1:
Submission Deadline: Thursday, 6th of September 2019, 4:30

pm. (Assignments not submitted in time is considered as NOT
SUBMITTED and given ZERO)
Question/Specification:
The student is expected to perform the following tasks and submit the report in electronic form to
the MOODLE within the deadline:
1. Choose an appropriate sub-area of Big Data and Hadoop.
For example: Big Data Clustering, Big Data Classification, Big Data Transformation, Big Data
Regression, Big Data Recommenders, and so, on…
2. Choose the appropriate Problem Domain.
For Example: Healthcare data (related to some disease), Text Document Data (related to some
topic), Image Data (related to any field such as agriculture…), JSON Data, Key-Value paired Data,
Graph Data, etc.
3. Corner the Problem Statement by clearly mentioning Aims and Objectives.
4. Perform the comprehensive Literature Review in the Problem Domain.
The following are the questions that are to be answered in the literature Review.
a. What is the scope of the Problem intended to be solved?
b. What are the other related Problems existing in that area?
c. What are the serious challenges/issues in the Problem Defined?
d. What are the existing solutions that are present for the Problem?
e. What are the limitations of these existing solutions?
f. Conclusion with the formulation of a Problem by coining the Research Gap.
5. Bibliography
Page 1 of 5
Assignment-2:
Submission Deadline: Friday, 2nd of November 2019, 4:30 pm.

(Assignments not submitted in time is considered as NOT
SUBMITTED and given ZERO)
Question/Specification:
The student is expected to perform the following tasks and submit the report in electronic form to
the MOODLE within the deadline:
1. Basic System Design of the Project.(UML Diagrams/ER Diagrams/Data Flow Diagrams… You
can search for the latest tools that are used in the industry for designing the Big Data Applications)
2. Performing the System Analysis followed the System Design. (Check through the fundamentals
of System Design and Analysis)
3. Skeletal representation of the algorithms used.
4. Functional and Software Requirements.
5. Discussion of Results and Experimental Results.
6. Writing the conclusion of the work done.
7. Bibliography
Resources (References or Links):

Free links available for Big Data Hadoop use cases.
Link 1: Spark Streaming and Kafka Integration
You can perform sentiment analysis on demonetization using Apache Pig using below link:
Link 2: Sentiment Analysis on Demonetization – Pig Use Case
Also perform analysis on Aviation data using Apache Pig, Hive and Tableau.
Link 3: Aviation data analysis using Apache Pig
Link 4: Aviation Data Analysis Using Apache Hive
Learn to implement the scheduling of Hadoop Jobs using using Jenkins and Rundeck.
Link 5: Scheduling Hadoop Jobs using Jenkins.
Link 6: Scheduling Hadoop Jobs using Rundeck.
Since Machine Learning is heavily implemented along with Big Data technologies like Hadoop
and spark,I am giving some use cases on machine learning with Spark.
Link 7: Machine Learning with Spark - Part 1
Link 9: Machine Learning with Spark on Bank Use Case - Part 3
Link 12: Distributed SQL engine for Big Data
Find the below links for small use cases on MapReduce in hadoop.
The below two links will help you to get hold on Map Reduce concepts:
Link 13: Map Reduce Use Case - Uber Data Analysis
Link 14: MapReduce Use Case-Youtube Data Analysis
Link 15: Map reduce Use case – Titanic Data Analysis
Page 2 of 5
Below links are related to sentiment analysis using Hadoop's various components like Pig
and Hive.
Link 16: Pig Use Case - Weblog Analysis
Link 17: Pig Use Case – The Daily Show Data Analysis Part – I
Link 18: Pig Use Case – The Daily Show Data Analysis Part – II
Link 19:Determining Popular Hashtags in Twitter Using Pig
Link 20: Sentiment Analysis on Twitter – TimeZone wise analysis
Link 21: Hive Use case – Counting Hashtags Using Hive
Link 22: Sentiment Analysis on Tweets with Apache Pig Using AFINN Dictionary
Link 23: Sentiment Analysis on Tweets with Apache Hive Using AFINN Dictionary
Link 24: Pokemon Data Analysis using Apache Hive
For beginner’s level use cases in Spark , refer the below links:
Link 25: HealthCare Use Case With Apache Spark
Link 26: Introduction to Spark RDD and Basic Operations in RDD
Link 27: Analyzing New York Crime Data Using SparkSQL
Link 28: Spark Use Case – Travel Data Analysis
Link 29: Spark Use Case – Uber Data Analysis
Link 30: Spark Use Case – Analyzing MovieLens Dataset
Link 31: Spark Use Case – Social Media Analysis
Link 32: Spark SQL Use Case – 911 -Emergency Helpline Number Data Analysis
Link 33: Building Spam filter Engine using Spark.
Apart from these free links there are few paid programs which will help to get real exposure in Big
Data Projects.
Link: Mastering Hadoop With Industrial Projects.
Sample Project Specifications:

Project One: eCommerce Data Analysis
A leading e-commerce company, MyCart, is planning to analyze its existing products and
customer behavior. It has been continuously receiving huge amounts of data about its different
products, registered users and their behavior in terms of placing orders and subsequent actions
performed on those orders. Different products belong to different categories and have a certain
amount of discount and profit percentage associated with them. Users are spread across different
locations, and depending on their behavior, MyCart wants to capture their purchasing patterns and
detect possible fraudulent activities. It receives files on a daily basis and processes them using Big
Data tools in order to:
Page 3 of 5
 Gain competitive benefit
 Design a marketing campaign
 Detect possible fraud
Implementation Details
Technical Details
First, data will be collected and stored in the HDFS. Next, for the process of Data Enrichment, we
will load this data into a NoSQL database like, HBase. After enriching this data, we need to
validate it, post which, you can use any of the Hadoop technologies like, MapReduce, Hive, or Pig
for data analysis. Results will be finally exported to both, the RDBMS and NoSQL database. This
job will be automated by using schedulers that will allow us to extract the outcomes on a daily
basis.
Feasibility Study
This project is helpful in defining the purchasing patterns of people over the company's website.
The project also helps in identifying fraud users and in finding out the net worth of the cancelled
products across the city among other things. With the outcomes of this analysis, the company can
derive solutions for improving their products in warehouses and how to eradicate frauds, and much
more.
Infrastructure Required
The project work will be carried out in a virtual environment, wherein, the Hadoop cluster, HBase,
and other required tools will be installed on a single machine using the Oracle virtual box/VMware
RAM: Min 4 GB
OS: Windows/Linux/Mac
Processor: Dual core processor or above
Software Required
Apache Hadoop, Apache Hive, Apache HBase, Apache Sqoop, MySQL.

All the above mentioned tools are open source and requires no prior permission to download and
install.
Project Two: Banking Data Analysis
The Banking Sector is one of the many areas that gets flooded with data every minute. A single
bank usually has multiple branches with its multiple customer details. Details are important as they
can help in making informed decisions. Many banks make use of the traditional RDBMS for data
storage, but this traditional system is not able to handle the huge amount of data pouring in. Using
Hadoop, we will be storing and analyzing banking data to come up with multiple insights.
Implementation Details
Data Ingestion: Process of bringing raw data into the Hadoop storage unit.
Data Encryption: Encryption of highly sensitive data. Data migration from RDBMS to Hadoop
will also be password protected.
Page 4 of 5
Feasibility Study
As real banks do not share details of their customers, we have created a dummy data set that has
proper columns and other essential details. This will provide us with a way to analyze banking
data using Hadoop and come up with multiple insights.
Infrastructure Required
The project work will be completed in the virtual environment, wherein, the Hadoop cluster,
HBase and, other required tools will be installed on a single machine using the Oracle Virtual
Box/VMware.
RAM: Min 4 GB
OS: Windows/Linux/Mac
Processor: Dual core processor or above
Software Required
Apache Hadoop, Apache Hive, Apache HBase, Apache Sqoop, and MySQL
All the above mentioned tools are open source and require no prior permission to download and
install them
Project Three: Music Data Analysis
A leading music-catering company, MyRadio, is planning to analyze large amounts of data that it
receives from its mobile app and the website. MyRadio wants to track the behavior of its users,
classify them, calculate the royalties associated with songs and make appropriate business
strategies. As the data is very huge, we will be using an open-source framework of Apache
Hadoop, a NoSQL database called Hbase, and a few other tools for analysis.
In order to achieve the objectives, we can sub-divide the projects into the following phases:
 Data ingestion
 Understanding the data
 Data validation
 Data enrichment
 Post-data enrichment steps
 Data analysis
 Optimizations
 Post analysis
Page 5 of 5

Assignment 1 Spec

Enviado por

Dados do documento

Descrição original:

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Assignment 1 Spec

Enviado por

Direitos autorais:

Formatos disponíveis

REVA UNIVERSITY

School of Computing and Information Technology

B. Tech VII Sem (For all Sections)

Submission Deadline: Thursday, 6th of September 2019, 4:30

Submission Deadline: Friday, 2nd of November 2019, 4:30 pm.

Resources (References or Links):

Sample Project Specifications:

Apache Hadoop, Apache Hive, Apache HBase, Apache Sqoop, MySQL.

Você também pode gostar