Big Data and Data Mining: A Study of (Characteristics, Factory Work,
Security Threats and Solutions for Big Data, Data Mining Architecture,
Challenges & Solutions with Big Data)
1 author:
Ameer Sameer
University of Babylon
All content following this page was uploaded by Ameer Sameer on 20 April 2016.
In the name of Allah, the Most Gracious, the Most Merciful.
Highly exalted be Allah, the true King! Do not hasten with the Koran before its revelation
has been completed to you, but say: 'Lord, increase me in knowledge.'
Overview
Big data has been defined as "Information assets characterized by such a
High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value".
- Volume: big data doesn't sample; it just observes and tracks what
happens.
- Variety: big data draws from text, images, audio, and video; it also
completes missing pieces through data fusion.
- Machine learning: big data often doesn't ask why and simply detects
patterns.
- Digital footprint: big data is often a cost-free byproduct of digital
interaction.
Machine learning is a type of artificial intelligence (AI) that provides
computers with the ability to learn without being explicitly programmed.
It focuses on the development of computer programs that can teach
themselves to grow and change when exposed to new data. The process of
machine learning is similar to that of data mining: both search through
data to look for patterns. However, instead of extracting data for human
comprehension, as data mining applications do, machine learning uses the
detected patterns to adjust program actions accordingly. Machine learning
algorithms are often categorized as supervised or unsupervised.
Supervised algorithms apply what has been learned from labeled examples
in the past to new data. Unsupervised algorithms draw inferences from
datasets that have no labels.
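The difference between the two categories can be illustrated with a minimal sketch. The data set, the labels, and the function names below are invented for illustration and are not taken from the study; a nearest-centroid rule stands in for a supervised algorithm and a two-cluster k-means for an unsupervised one.

```python
from statistics import mean


def train_nearest_centroid(samples, labels):
    """Supervised: learn one centroid (mean value) per known label."""
    grouped = {}
    for value, label in zip(samples, labels):
        grouped.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in grouped.items()}


def predict(centroids, value):
    """Apply what was learned to new data: pick the closest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(samples, iterations=10):
    """Unsupervised: split unlabeled 1-D data into two clusters (k-means, k=2)."""
    low, high = min(samples), max(samples)
    for _ in range(iterations):
        near_low = [s for s in samples if abs(s - low) <= abs(s - high)]
        near_high = [s for s in samples if abs(s - low) > abs(s - high)]
        if not near_high:  # all points collapsed into one cluster
            break
        low, high = mean(near_low), mean(near_high)
    return low, high


# Hypothetical measurements (heights in cm) with human-provided labels.
heights_cm = [150, 152, 155, 180, 183, 185]
labels = ["short", "short", "short", "tall", "tall", "tall"]

centroids = train_nearest_centroid(heights_cm, labels)
prediction = predict(centroids, 178)   # uses the labels seen in training
cluster_centers = two_means(heights_cm)  # never sees the labels
```

The supervised part needs the labels to learn from, while the unsupervised part recovers the two groups from the values alone.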
The term Big data is used to describe a massive volume of both structured
and unstructured data that is so large that it's difficult to process using
traditional database and software techniques.
Figure 1: Data mining concept
For the most part, structured data refers to information with a high degree
of organization, such that inclusion in a relational database is seamless
and readily searchable by simple, straightforward search engine
algorithms or other search operations; whereas unstructured data is
essentially the opposite. The lack of structure makes compilation a time-
and energy-consuming task. A company would benefit at every level of the
business from a data analysis mechanism that reduces the costs
unstructured data adds to the organization.
Spreadsheets, on the other hand, would be considered structured data,
which can be quickly scanned for information because it is properly
arranged in a relational database system. The problem that unstructured
data presents is one of volume; most business interactions are of this
kind, requiring a huge investment of resources to sift through and extract
the necessary elements, as in a web-based search engine. Because the
pool of information is so large, current data mining techniques often miss
a substantial amount of the information that's out there, much of which
could be game-changing data if efficiently analyzed.
Structured data are numbers and words that can be easily categorized and
analyzed.
These data are generated by things like network sensors embedded in
electronic devices, smart phones, and global positioning system (GPS)
devices. Structured data also include things like sales figures, account
balances, and transaction data.
Unstructured data include more complex information, such as customer
reviews from commercial websites, photos and other multimedia, and
comments on social networking sites. These data cannot easily be
separated into categories or analyzed numerically.
- Volume: The quantity of generated and stored data. The size of the data
determines its value and potential insight, and whether it can actually
be considered big data or not.
- Variety: The type and nature of the data. Knowing this helps the people
who analyze it to use the resulting insight effectively.
- Velocity: The speed at which the data is generated and processed to
meet the demands and challenges that lie in the path of growth and
development.
- Variability: Inconsistency of the data set can hamper the processes
that handle and manage it.
- Veracity: The quality of captured data can vary greatly, affecting
accurate analysis.
Examples:
Government
On 4 October 2012, the first presidential debate between President
Barack Obama and Governor Mitt Romney triggered more than 10
million tweets within 2 hours.
Private Sector
Flickr, a public picture sharing site, received 1.8 million photos per
day, on average, from February to March 2012 [5]. Assuming the size of
each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of
storage every single day.
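The Flickr storage estimate can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 1,000,000 MB), and the yearly figure is an extrapolation added for illustration, not a number from the study.

```python
PHOTOS_PER_DAY = 1_800_000   # average daily uploads quoted above
PHOTO_SIZE_MB = 2            # assumed size of one photo
MB_PER_TB = 1_000_000        # decimal units: 1 TB = 10^6 MB

# Daily storage demand in terabytes.
daily_tb = PHOTOS_PER_DAY * PHOTO_SIZE_MB / MB_PER_TB

# Extrapolation (assumption): the same rate sustained for 365 days, in petabytes.
yearly_pb = daily_tb * 365 / 1000

print(f"{daily_tb:.1f} TB per day, about {yearly_pb:.2f} PB per year")
```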
(ii) Variety- The next aspect of Big Data is its variety. Data analysts
need to know the category to which the data belongs. This helps the
people who closely analyze the data to use it to their advantage, and so
upholds the importance of the Big Data.
(iii) Velocity- The term "velocity" in this context refers to the speed
at which data is generated and processed to meet the demands and
challenges that lie ahead in the path of growth and development.
(iv) Variability- This factor can be a problem for those who analyse the
data. It refers to the inconsistency which the data can show at times,
hampering the ability to handle and manage the data effectively.
Factory work may have a 6C system
Public sector agencies can catch fraud and other threats in real-time.
– CCTV camera footage
Recommender system
- An unauthorized user may access files and could execute arbitrary code
or carry out further attacks.
- An unauthorized user may eavesdrop on or sniff data packets being sent
to the client.
- etc.
Security Solution
Authentication is the process of verifying a user's or system's identity
before it is allowed to access the system. Authentication methods such as
Kerberos can be employed for this.
- etc.
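The idea of verifying identity before granting access can be sketched with a much simpler stand-in than Kerberos: a salted password check built on Python's standard library. This is not the Kerberos protocol; the function names `enroll` and `authenticate` and the iteration count are assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen for illustration


def enroll(password: str) -> tuple[bytes, bytes]:
    """Store a random salt and a derived key, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def authenticate(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-derive the key from the claimed password and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)


# Hypothetical user record created at registration time.
salt, stored = enroll("correct horse")
```

A call such as `authenticate("correct horse", salt, stored)` succeeds, while any other password is rejected before access is granted.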
(to remove noise and inconsistent data, to handle the missing data fields,
etc.) and data integration (to combine data from multiple sources).
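The two preprocessing steps named above, handling missing data fields and combining data from multiple sources, can be sketched as follows. The records, field names, and the choice of mean imputation are assumptions made for illustration; the study does not prescribe a particular method.

```python
def clean(records, field):
    """Fill missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r.get(field) is not None]
    fill = sum(observed) / len(observed)
    return [
        {**r, field: r[field] if r.get(field) is not None else fill}
        for r in records
    ]


def integrate(left, right, key):
    """Combine two sources by joining records that share the same key."""
    lookup = {r[key]: r for r in right}
    return [{**rec, **lookup[rec[key]]} for rec in left if rec[key] in lookup]


# Hypothetical sources: a sales table with a gap and a customer table.
sales = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 200.0},
]
customers = [{"id": 1, "region": "north"}, {"id": 2, "region": "south"}]

cleaned = clean(sales, "amount")
merged = integrate(cleaned, customers, "id")
```

Records without a match in the second source are dropped here (an inner join); keeping them instead would be an equally valid design choice.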
KDD process
annual and quarterly comparisons and trends to detailed daily sales
analysis.
Traditionally, data mining and knowledge discovery were performed
manually. As time passed, the amount of data in many systems grew
beyond terabyte size and could no longer be handled manually.
Moreover, discovering underlying patterns in data is considered essential
for any business to succeed. As a result, several software tools were
developed to discover hidden data and make assumptions, which formed
a part of artificial intelligence.
The KDD process has reached its peak in the last 10 years. It now houses
many different approaches to discovery, including inductive learning,
Bayesian statistics, semantic query optimization, knowledge acquisition
for expert systems and information theory. The ultimate goal is to
extract high-level knowledge from low-level data.
Web mining
Web mining allows you to look for patterns in data through content
mining, structure mining, and usage mining. Content mining is used to
examine data collected by search engines and Web spiders. Structure
mining is used to examine data related to the structure of a particular Web
site and usage mining is used to examine data related to a particular user's
browser as well as data gathered by forms the user may have submitted
during Web transactions.
The information gathered through Web mining is evaluated (sometimes
with the aid of software graphing applications) by using traditional data
mining parameters such as clustering and classification, association, and
examination of sequential patterns.
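Two of the Web mining types described above can be sketched with simple counting. The link list and the visit log are invented for illustration: structure mining is reduced here to counting inbound links per page, and usage mining to counting page visits.

```python
from collections import Counter

# Structure mining (sketch): hypothetical (source page, target page) hyperlinks.
links = [
    ("home", "products"),
    ("home", "about"),
    ("blog", "products"),
    ("about", "products"),
]
inbound = Counter(target for _, target in links)

# Usage mining (sketch): hypothetical (user, visited page) log entries.
visits = [
    ("u1", "home"),
    ("u1", "products"),
    ("u2", "home"),
    ("u2", "blog"),
    ("u2", "products"),
]
page_hits = Counter(page for _, page in visits)

most_linked_page, link_count = inbound.most_common(1)[0]
```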
CRM software
- data source
Databases, data warehouses, the World Wide Web (WWW), text files and
other documents are the actual sources of data. Data mining needs large
volumes of historical data to be successful. Organizations usually store
data in databases or data warehouses. Data warehouses may contain one
or more databases, text files, spreadsheets or other kinds of
information repositories. Sometimes, data may reside even in plain text
files or spreadsheets. The World Wide Web, or the Internet, is another
big source of data.
Different Processes
The database or data warehouse server contains the actual data that is
ready to be processed. Hence, the server is responsible for retrieving the
relevant data based on the data mining request of the user.
The data mining engine is the core component of any data mining system.
It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering,
prediction, time-series analysis etc.
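As one example of what such a module computes, the association task is commonly expressed through the support and confidence of a rule. The sketch below uses an invented basket data set and hypothetical function names; it illustrates the measures only and is not the architecture's actual engine.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```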
The graphical user interface module communicates between the user and
the data mining system. This module helps the user use the system easily
and efficiently without knowing the real complexity behind the process.
When the user specifies a query or a task, this module interacts with the
data mining system and displays the result in an easily understandable
manner.
- knowledge base.
The knowledge base is helpful in the whole data mining process. It might
be useful for guiding the search or evaluating the interestingness of the
result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the knowledge
base to make the result more accurate and reliable. The pattern evaluation
module interacts with the knowledge base on a regular basis to get inputs
and also to update it.
Each component of a data mining system has its own role and importance
in completing data mining efficiently. These modules need to interact
correctly with each other in order to complete the complex process of
data mining successfully.
Sequence or path analysis - looking for patterns where one event leads
to another later event
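A minimal sketch of path analysis counts how often one event is directly followed by another across sessions. The session data and event names are invented for illustration.

```python
from collections import Counter


def transition_counts(sessions):
    """Count how often each event is immediately followed by another event."""
    counts = Counter()
    for events in sessions:
        # Pair every event with its successor in the same session.
        counts.update(zip(events, events[1:]))
    return counts


# Hypothetical event sequences, one list per user session.
sessions = [
    ["search", "view", "cart", "buy"],
    ["search", "view", "exit"],
    ["view", "cart", "buy"],
]

counts = transition_counts(sessions)
```

Frequent pairs such as `("view", "cart")` point to paths where one event tends to lead to a later one.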
Figure 4. A conceptual view of the Big Data processing framework
query from a database with billions of records, is divided into many small
tasks, each of which runs on one or more cluster nodes.
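The divide-and-combine idea can be sketched on a single machine. In this illustration a thread pool stands in for the cluster nodes, the record set is a small list of integers rather than billions of rows, and the query (counting multiples of 7) is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_tasks):
    """Divide the record set into roughly equal chunks, one per task."""
    size = -(-len(records) // n_tasks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_matches(chunk):
    """The small task: answer the query for one chunk only."""
    return sum(1 for r in chunk if r % 7 == 0)


records = list(range(1, 1001))  # stand-in for a very large table

# Each worker plays the role of one node processing one small task.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(count_matches, split(records, 4)))

# Combine the partial answers into the final query result.
total = sum(partial)
```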
In Big Data, semantic and application knowledge refers to several aspects
related to rules, policies, user information and application information.
The most important aspects in this tier are 1) information sharing and
its confidentiality; and 2) domain and application knowledge.
engines or recommendation systems. Complex relationship networks in
data. In the context of Big Data, there exist relationships between
individuals. On the Internet, individuals are webpages and the pages
linking to each other via hyperlinks form a complex network. There also
exist social relationships between individuals forming complex social
networks, such as big relationship data from Facebook, Twitter,
LinkedIn, and other social media, including call detail records (CDR),
device and sensor information, GPS and geocoded map data, massive
image files transferred by the Managed File Transfer protocol, web text
and click-stream data, scientific information, e-mail, and so on. To deal
with complex relationship networks, emerging research efforts have
begun to address the issues of structure-and-evolution, crowds-and-
interaction, and information-and-communication.
Challenges
- Volume of the Big Data- size of the Big Data grows continuously.
Solutions
- Hadoop
- MapReduce
- Apache S4
- Storm
- Apache Mahout
- etc.
Hadoop
– Model level
– Knowledge level.
At the data level, each local site uses its local data to calculate data
statistics and shares this information with the other sites in order to
obtain the global data distribution.
At the model level, each site produces local patterns by mining its local
data. By sharing these local patterns with other local sites, a single
global pattern can be produced. At the knowledge level, model
correlation analysis investigates the relevance between models generated
from various data sources to determine how closely the data sources are
correlated with each other, and how to form accurate decisions based on
models built from autonomous sources.
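The data-level step can be sketched as follows: every site shares only a small summary of its local data, and the summaries are combined into a global statistic. The site data and function names are invented for illustration, and the global mean is just one example of a statistic that can be assembled this way.

```python
def local_summary(values):
    """Computed at each site: only the count and the sum leave the site."""
    return len(values), sum(values)


def global_mean(summaries):
    """Combine the shared summaries without ever seeing the raw local data."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count


# Hypothetical local data held by three autonomous sites.
site_a = [10, 12, 14]
site_b = [20, 22]
site_c = [30]

summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
overall_mean = global_mean(summaries)
```

Sharing summaries rather than raw records keeps the communication cost low and leaves the local data under each site's control.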
Conclusions
Big Data consists of large, complex, growing data sets with numerous,
independent sources. With the fast development of networking, data
storage, and data gathering capacity, Big Data is now quickly increasing
in all science and engineering domains, as well as animal, genetic and
biomedical sciences.
used in web browsers, messages like MMS and SMS. Image data include
artwork, pictures with text, and still images taken by a digital camera.
Audio data contains sound, MP3 songs, speech and music. Video data
include time-aligned sequences of frames, such as MPEG videos from
desktops, cell phones and video cameras.
Reference
1- Dr. S. Vijayarani (Assistant Professor) and Ms. A. Sakila (M.Phil
Research Scholar), "Multimedia Mining Research: An Overview," Department
of Computer Science, Bharathiar University, Coimbatore, January 2015.
Lei Xu, Chunxiao Jiang (Member, IEEE), Jian Wang (Member, IEEE),
5- "Big Data and Big Data Mining: Study of Approaches, Issues and Future
Scope," International Journal of Engineering Trends and Technology, Dec
2014.
"Data Mining with Big Data," IEEE Transactions on Knowledge and Data
Engineering, 26(1):97-107, January 2014.
https://www.researchgate.net/profile/Ameer_Sameer
https://www.linkedin.com/in/ameer-sameer-452693107
http://www.slideshare.net/AmeerSameer
http://facebook.com/ameer.Mee/