Big Data Fundamentals: An Introductory Course for Big Data and Related Technologies
Table of Contents
Abstract
Introduction
Methodology
Appendix
Module Evaluation Form
BIG DATA FUNDAMENTALS – An Introductory 2
Abstract
This project responds to the growing industry demand for big data skills among individuals
graduating with a Bachelor of Science in Computer Science (BSCS) degree. Big data is a growing
field, and any BSCS program needs to make such a course available to its graduating students.
Currently, the available courses do not meet the needs of BSCS students for several reasons:
they are not designed for the online learning modality, they are based on the standard semester
or quarter system, and they do not take into account the unique needs of working students on an
eight-week schedule. As of our cohort, the team did not see any course available to students
that meets this critical need. This introductory course, “Big Data Fundamentals,” will meet the
needs of future BSCS online students. It will provide a thorough understanding of the concepts
in the big data field, the technologies used, and the skills needed to be successful
Introduction
Big data is a growing field, not mere hype or a buzzword used by individuals and
corporations to attract attention. Big data has revolutionized our lives in ways unimaginable,
and it has also become one of the fastest growing industries across the globe. It has become
one of the most sought-after skills in the high-tech industry. Corporations and academia are
pursuing ways to address this need by developing courses and publishing papers on the latest,
most cutting-edge technology related to the field.
However, before corporations and academia focus on the deep technical aspects, it is
important to educate individuals who are new to the field about the basic concepts and
technologies related to Big Data. It is equally important to educate individuals about the
resources available to learn and practice these skills, so that they do not become overwhelmed
and reluctant to take up careers in the field. Given the ubiquitous need for “Big Data” skills,
it is critical that every student enrolled in an undergraduate program in a technical field,
whether IT or computer science, has such a course available through the program.
However, before we delve further into what is and is not available and the quality of the
available resources, we need to define “Big Data”. According to Gartner, Big Data is defined as
“innovative forms of information processing that enable enhanced insight, decision making, and
process automation.” Forbes magazine defines it as “a collection of data from traditional and
digital sources that represent a source for ongoing discovery and analysis”.
Big data is playing a key role in almost every field including healthcare, sports, retail,
entertainment, policy making, business management, social media, etc. Organizations are using
analytics to harness data and to identify new opportunities, run operations more effectively,
reduce costs, improve customer satisfaction, etc. Alistair Croll, author of Lean Analytics: Use Data
to Build a Better Startup Faster, refers to Big Data as the new superpower according to his
Clearly, Big Data has become that magical power that is not only at the forefront of all the
industries mentioned but it is also at the forefront of the top ten tech skills in demand.
Bentley University commissioned a study to find which skills are growing in demand. By
looking at millions of job listings posted on more than 40,000 online job sites, the job
analytics firm Burning Glass determined which skills saw the biggest increases in demand
between 2011 and 2015. Based on the information from the study and recent job market
reports, Forbes published a report, Technical Skills with the Biggest Increases in Demand.
According to the report, demand for individuals with Big Data skills increased by 3,977%,
the biggest increase compared to other technical skills, including programming languages,
data analytics and visualization, Apache Hadoop, etc. To be successful, individuals are
expected to have a solid grounding in computer science, be comfortable with numbers, have
an in-depth understanding of particular analytic models and algorithms, and have an overall
idea of different big data visualization and analytics tools.
Given all this information, it is critical that every BSCS program provides a course that
covers all these concepts and is a part of the learning path so that students gain Big Data
Our research objective is to explore this idea of learning about Big Data technologies
further. We will examine what options are available to students for Big Data learning and job
skills preparation. If such options exist, do these courses and resources provide the
information needed to better prepare students for Big Data and related jobs? Furthermore, we
will explore whether these courses are suited and aligned to the online learning
environment.
After analyzing that information, we will design a “Big Data Fundamentals” course that
will better prepare BSCS students for the Big Data jobs. The final course layout will be
designed keeping in mind the specific needs of the industry and provide BSCS online graduates
with the necessary knowledge, skills, and resources to have successful careers in the Big Data
field. We will also explore ways to ensure that this course allows students to always stay in
intended to be an introductory course in the area of Big Data which will cover many concepts.
This course is intended for individuals who are completely new to the field of Big Data. It will
help the learners understand the basics of the Big Data field, increase knowledge of the Big Data
and help learners understand why Big Data has become one of the fastest and most sought-after
This course will help learners become comfortable with the terminology and the core
concepts behind big data problems, applications, and systems. It will help students understand
how they can apply and use these skills in their current or future career choices.
• Describe the Big Data field with examples of real world big data problems
• Learn about programming models used for scalable big data analysis
looked at information, made connections, and arrived at conclusions to meet their needs. These
service based on the information. The data was collected via feedback, surveys, observation
notes, etc. Businesses have relied on databases, used data collection, and used data analytics
technologies to spot patterns and predict market trends for a long time. Government uses data
collection and analysis for tax purposes, national census analysis, etc. Because of the
challenges related to the costs and the process of analyzing and storing such datasets, these
data have been produced in tightly controlled ways using sampling techniques that limit their
scope and size.
However, the data industry has transformed drastically in the past decade with the advent
of big data. The “3 Vs” are the key characteristics that differentiate big data from the data we
knew a decade ago: volume, velocity (speed), and variety. Volume refers to the sheer amount of
data that is generated from various sources. A shopping transaction now includes all the
items that an individual bought, including details like size, color, etc. This is a massive
volume compared to the final sales receipt data that was once generated as a result of a single
transaction. Another area that has witnessed burgeoning growth is social data and mobile phone
data usage, especially among teenagers. Velocity refers to the speed at which the data is
generated and analyzed in real time. People are all too familiar with sports statistics that are
generated during each play of a basketball game and displayed in real time. Finally, variety
refers to the range of data from various sources like data from financial systems, social
In the current era of big data and internet of things, for companies like Google and
Facebook “data is the new currency and with the amounts these companies have they could buy
anyone”. Big data is not limited to shopping, entertainment, and sports statistics. It has led,
and can continue to lead, to unimaginable medical breakthroughs that can treat illnesses and help
human beings have longer and healthier lives. With the tools and products available, efforts are underway to
treat human illnesses at the genetic level. The unprecedented growth of companies like Google,
Facebook, and eBay is a testimony that the big data field is poised for continued growth. There
is a huge need for professionals with big data knowledge who can understand and apply their
Given all the above background information, it is clear that all these companies are
looking to develop solutions that address the need, to educate their employees, and to give
back to the community in the form of open source projects. A number of companies have
understand a few technical terms (Hadoop, types of data) and historical background related to
the developments in the big data field. Data types refer to three different types of data:
structured, semi-structured, and unstructured data. In very simple terms, structured data is
organized information that is stored according to specific rules and follows a data model, like
data stored in relational databases, Excel spreadsheets, etc. Semi-structured data is organized
data that may not necessarily follow a specific data model or standard, like information written
or stored in a Word document. Unstructured data is information that is not stored according to
a predefined data model or rules, such as a collection of facts or random dates.
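
The three data types described above can be illustrated with a minimal Python sketch. The sample records, constant names, and helper functions below are hypothetical and chosen only for illustration. JSON is used for the semi-structured case because it is a commonly cited example of that category; that choice is an assumption and is not taken from the text above.

```python
import csv
import io
import json

# Structured: every record follows the same fixed columns (a data model).
STRUCTURED_CSV = "order_id,item,size,color\n1001,shirt,M,blue\n1002,shoes,9,black\n"

# Semi-structured: fields are labeled, but records may differ in shape and nesting.
SEMI_STRUCTURED_JSON = '{"order_id": 1003, "item": "hat", "notes": {"gift": true}}'

# Unstructured: free text with no predefined data model.
UNSTRUCTURED_TEXT = "Customer called on 3 May and said the blue shirt ran small."


def load_structured(text):
    """Parse CSV text into a list of rows that all share the same columns."""
    return list(csv.DictReader(io.StringIO(text)))


def load_semi_structured(text):
    """Parse JSON text into a (possibly nested) dictionary."""
    return json.loads(text)


def search_unstructured(text, keyword):
    """With no schema to query, fall back to a simple keyword search."""
    return keyword.lower() in text.lower()


if __name__ == "__main__":
    rows = load_structured(STRUCTURED_CSV)
    print([row["color"] for row in rows])      # query by column name
    record = load_semi_structured(SEMI_STRUCTURED_JSON)
    print(record["notes"]["gift"])             # navigate nested fields
    print(search_unstructured(UNSTRUCTURED_TEXT, "shirt"))
```

The point of the sketch is the access pattern: structured data can be queried by column, semi-structured data has to be navigated field by field, and unstructured data offers nothing to query beyond the raw content itself.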
application development framework that provides the ability to store and process massive amounts
of data. Currently the most effective programming model to process big data is MapReduce, “a
Internet companies like Google and Facebook invested in technologies like Hadoop because
they realized that they had petabytes of structured, semi-structured, and unstructured data
about their users and the usage of their products. In order to extract value out of this large
amount of data, they decided to create open source projects that would encourage people to find solutions to this
programming model and an associated implementation for processing and generating large data
sets.” Yahoo created an open source product, Hadoop, based on information published in
Google’s research papers. Facebook created a product, Hive, which provides a SQL interface
over Hadoop. Hive interprets unstructured data as RDBMS tables, allows developers to write
SQL queries over the data, and translates the queries into MapReduce jobs to produce the
desired results. Many corporations are leading efforts to build big data products and services.
Teradata, IBM Big Data Analytics, MongoDB, DataStax Big Data, Cloudera, Amazon Web
Services, and Splunk are a few of the key players in this industry. Each company is focused on
some aspect of big data technology, but none has developed a complete big data platform.
Most of them have been working on one specific aspect of big data and developing it into a
robust product or service. For example, Splunk has spent the last few years doing research and
development in the area of machine data. Its product focuses on producing real-time analytics
to identify patterns and trends in machine data. It works with Hadoop and NoSQL data for
analysis and visualization of unstructured data.
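
To make the MapReduce programming model mentioned earlier more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop or Hive API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and exist only for this illustration. It counts words, which is roughly the kind of grouping and aggregation that a SQL `GROUP BY` query over the same data would express.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
```

In a real cluster the map and reduce steps would run in parallel on many machines, and the framework would handle the shuffle, data distribution, and failure recovery. The sketch only shows how a problem is split into independent map and reduce functions.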
Based on our initial scan of available courses, we saw many options for teaching Big Data,
but each has some shortcomings: a course either goes too deep into one topic or misses some
key concepts. The courses we reviewed include those available on MOOC platforms like Coursera,
Udacity, etc. We also reviewed courses available on corporate learning websites like
Hortonworks, Cloudera, and Simplilearn. Simplilearn has a large array of technical courses on
topics like Hadoop, SAS, Apache Spark, and R, but we did not find an introductory course on
Big Data. The second option is Cloudera, which provides very technical options like CCP Spark
and Hadoop Developer certification, but each course is very expensive.
Hortonworks is another popular name in Big Data, but its courses are expensive and do not have
any accreditation.
Most of the courses available in academia are based on the quarter system or the
semester system. After reviewing multiple courses, the key concerns identified were that courses
lacked relevant content, presented concepts that were too technical for beginners, did not
provide information about the latest technology trends, or were not meant for a target audience with a varying degree of
The key stakeholders who stand to benefit from this project are:
The first target audience for this project is BSCS online students at CSUMB. Most of
the students enrolled in this program are very motivated and looking to learn and acquire skills
in Computer Science that can be very helpful in the real world. This perspective comes from
the fact that the majority of them are working full time and have decided to enroll in this program
• Extremely motivated
• Eager to learn and use the learning to be more effective in current job
The other key stakeholders for this project are the faculty who are teaching various courses in
the BSCS program. They are always looking for ways to improve the program and also include
courses that help enhance the learning experience and improve job prospects for their
graduates. At a personal level, it will also help the faculty understand what the students need
and how students learn best if they have a course designed by students who have been a target
audience themselves and understand the challenges of the online learning environment. Besides
focusing on the subject matter, this course will provide the layout for what online students feel is
Methodology
We will start by collecting information about students and their interests. We will also
gather information from students to develop a baseline of their understanding and awareness
of the field. The focus will be to understand how well these students understand the
field and, if they do, whether they are aware of resources available, if any, to help them develop the
Based on our personal experience as undergraduate students in the online program and
information available. Often, they need a simple and easy start before delving deeper. Also,
sometimes the information presented in courses is too technical for an introductory course. We
will collect input from students and also examine data available from various resources to
Finally, we plan to connect with industry experts to ensure that we have ongoing input
during the process. This will allow us to stay on track and ensure that all the information we are
collecting and planning to use in designing the proposed course is in sync with the current
trends of the industry. We have also applied for memberships and are attending meetups and
conferences related to the field. In some cases, we were unable to get membership, but the
publicly available information is valuable and helps guide us through the
process. For our research project, we are relying mostly on open source big data tools to learn
and collect information, like Hadoop, High Performance Computing Cluster (HPCC),
to help learners understand the concepts, none of them is designed specifically for students and
the target audience, which has a unique set of requirements and background. Since the BSCS course
at CSUMB is on an eight-week schedule, it differs from the quarter or semester system.
The course that we plan to develop will be aligned to the eight-week schedule without
compromising the quality of learning or the achievement of the objectives.
At this point it will be more theory and concepts based. However, we will try to provide
some hands-on experience that will be applicable to the course. The goal is to provide enough
exposure to the learners such that they can learn about the field and be able to extend this
learning on their own. We plan to develop a course that provides the information, tools, and
resources that will help the learners understand Big Data. It will introduce them to
available resources that they can use to become well versed in the field of Big Data.
Design Details
Since the goal is to ensure that students are able to fully understand the concepts covered in the
designated time of eight weeks, we will pay specific attention to the material covered in
each module and the length of the term, which is eight weeks for the online BSCS students.
Also, the goal is to include relevant videos and resources, including resources for hands-on activities, to
• Dimensions of Scalability
• Understanding MapReduce
• Data management
deal with any ethical or legal issues. Likewise, we do not anticipate any such issues when the
project is implemented.
However, we do plan to use some open source resources and technologies, such as:
• Hadoop
• Spark
• Videos: YouTube videos, videos available on academic websites like Khan Academy
• Informational Documents
• Images
Open source resources have their own specific licenses and usage restrictions. We plan to
review those licenses and usage restrictions to ensure that we meet all the requirements.
Also, some of the videos, online content, and images have copyright and permissions guidelines.
If and when we use any such resource, we
intend to provide details of the product or service so that when the project is implemented there
Project Scope
Timeline
• Develop Outline/Modules
Week 2
• Each team member takes ownership of module
Project Budget
There is no expected cost since the team will be relying on using open resources and personal
Resources Needed
• Internet connection
• Software
Milestones
• Include any other online resources (academic papers) for each module
As a team, we do anticipate certain risks, and these fall into a few categories:
• Collaboration
member
• Technical issues/risks
One of the challenges has been individual team members’ work schedules and personal
situations. Technical risks include the availability of all software and developing activities where we
need open software. Often unexpected issues arise when open source software is involved
However, as a team we have discussed backup plans where we will modify the modules
to include relevant links and details on how to use the software if we are unable to design the
hands-on experience due to technical or time limitations. Finally, there is always a risk that
SMEs will be unavailable due to their personal and professional commitments.
Final Deliverables
The final deliverable will be a course in the format used by most CSUMB BSCS courses
hosted on iLearn. For each week, there will be a topic and related videos, readings, etc. The
team also intends to include a quiz or some other form of assessment for each week. As discussed
above, due to time restrictions, we may not have an assessment for each module. In that case, we
intend to outline the key elements that need to be evaluated for that week, which will
provide the blueprint for developing an assessment. We plan to host it on
Google Sites for now, because of its ease of collaboration, cross-team review, easy updates,
and availability of each document's history. The overall outline of the course will be four modules
• Dimensions of Scalability
• Understanding MapReduce
• Data management
SMEs, BSCS students at CSUMB and other institutions who may be interested in reviewing the
material.
• Reach out to at least 3 BSCS students (past and/or current) to review the course
• Ongoing feedback from Industry Experts, SMEs (as per their availability)
Team Members
The entire team is involved in creating all documents and proposals. Besides that, all
members will be working to design the outline and develop an overview of each module. Once
that is complete, each team member will take ownership of a module based on their
individual interest and expertise. Finally, all modules will be put together and all team members
will review the entire course and provide feedback on individual modules, course flow, topic
cohesiveness, etc. The division of labor and roles were evaluated and assigned based on interest
The team members with their roles and responsibilities are as follows:
• Brian Brooks
Advanced topics modules. He will also provide support as an SME since he has
extensive experience working with industrial data and related technologies.
• Devorah Akhamzadeh
o Devorah has experience in programming and interest in learning new topics. She
is exploring and doing further research on specific topics that would be useful to
add in one of the modules that she plans to develop. She will be available as
• Seema Khan
module and working with all team members to ensure that course development
References
Chaudhuri, S., Dayal, U., & Narasayya, V. (2011). An overview of business intelligence
technology. Communications of the ACM, 54(8), 88.
Dean, J., & Ghemawat, S. (2008). MapReduce. Communications of the ACM, 51(1), 107.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of
the ACM, 55(10), 78.
Lourenço, J. R., Cabral, B., Carreiro, P., Vieira, M., & Bernardino, J. (2015). Choosing the right
NoSQL database for the job: A quality attribute evaluation. Journal of Big Data, 2(1).
Pybus, J., Cote, M., & Blanke, T. (2015). Hacking the social life of Big Data. Big Data &
Society, 2(2).
Smale, S. (n.d.). The Mathematics of Learning: Dealing With Data. 2005 International
Conference on Neural Networks and Brain.
Teng, P., Li, H., & Zhang, X. (2015). Survey on Visualization Layout for Big Data. Intelligence
Science and Big Data Engineering. Big Data and Machine Learning Techniques. Lecture
Notes in Computer Science, 384-394.
Tsai, C., Lai, C., Chao, H., & Vasilakos, A. V. (2015). Big data analytics: A survey. Journal of
Big Data, 2(1).
Wu, X., Zhu, X., Wu, G., & Ding, W. (2014). Data mining with big data. IEEE Transactions on
Knowledge and Data Engineering, 26(1), 97-107.
Appendix
Module/Module Title:
Date:
Below are a series of statements. Please respond by circling the number you feel most reflects
your opinion.
Response options: Strongly Agree, Agree, Neither Agree or Disagree, Disagree, Strongly Disagree
The module fulfilled the objectives 5 4 3 2 1
readily be understood
experience
learning