

Big Data and Data Mining

Prepared by: Ameer Sameer Hamood

University of Babylon - Iraq

Information Technology - Information Networks

﴿ بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ ﴾

﴿ فَتَعَالَى اللَّهُ الْمَلِكُ الْحَقُّ وَلَا تَعْجَلْ بِالْقُرْآنِ مِن قَبْلِ أَن يُقْضَىٰ إِلَيْكَ وَحْيُهُ وَقُل رَّبِّ زِدْنِي عِلْمًا ﴾

( In the Name of Allah, the Merciful, the Most Merciful )

Highly exalted be Allah, the true King! Do not hasten with the Koran before its revelation
has been completed to you, but say: 'Lord, increase me in knowledge.'

Overview

• Introduction to Big data

- Structured vs. Unstructured


- The Problem with Unstructured Data

• Big Data Characteristics

• Factory work for Big data

• Applications of Big Data

• Security Threats & Solution for Big Data

• Introduction to Data Mining

- Steps of the KDD process


- Web mining
- Customer relationship management (CRM)

• Data Mining Architecture

• Data mining challenges & Solutions with big data

• Big Data Mining Tools

• Big Data Mining Algorithm

Introduction to Big data

In a 2001 research report and related lectures, META Group (now


Gartner) analyst Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing volume (amount
of data), velocity (speed of data in and out), and variety (range of data
types and sources). Gartner, and now much of the industry, continue to
use this "3Vs" model for describing big data. In 2012, Gartner updated its
definition as follows: "Big data is high volume, high velocity, and/or high
variety information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process optimization."
Gartner's definition of the 3Vs is still widely used, and in agreement with
a consensual definition that states that "Big Data represents the

Information assets characterized by such a High Volume, Velocity and
Variety to require specific Technology and Analytical Methods for its
transformation into Value".

Volume: big data doesn't sample; it just observes and tracks what happens

Velocity: big data is often available in real-time

Variety: big data draws from text, images, audio, video; plus it completes
missing pieces through data fusion

Machine learning: big data often doesn't ask why and simply detects patterns.

Digital footprint: big data is often a cost-free byproduct of digital interaction.

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer
programs that can teach themselves to grow and change when exposed to
new data. The process of machine learning is similar to that of data
mining. Both systems search through data to look for patterns. However,
instead of extracting data for human comprehension -- as is the case in
data mining applications -- machine learning uses that data to detect
patterns and adjust program actions accordingly. Machine
learning algorithms are often categorized as being supervised or
unsupervised. Supervised algorithms can apply what has been learned in
the past to new data. Unsupervised algorithms can draw inferences from
datasets.
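To make the supervised/unsupervised distinction concrete, here is a minimal sketch, assuming scikit-learn is available; the toy data set and the chosen algorithms are illustrative assumptions, not part of the original text.

```python
# Contrast a supervised classifier with an unsupervised clustering algorithm.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: 150 two-dimensional points in 3 groups, with labels y available.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: learn from labelled examples, then apply what was learned to new data.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted class of a new point:", clf.predict([[0.0, 0.0]]))

# Unsupervised: draw inferences (cluster structure) without any labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster of the same point:", km.predict([[0.0, 0.0]]))
```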

Facebook's News Feed uses machine learning to personalize each


member's feed. If a member frequently stops scrolling in order to read or
"like" a particular friend's posts, the News Feed will start to show more of
that friend's activity earlier in the feed. Behind the scenes, the software is
simply using statistical analysis and predictive analytics to identify
patterns in the user's data and using those patterns to populate the News Feed.
Should the member no longer stop to read, like or comment on the
friend's posts, that new data will be included in the data set and the News
Feed will adjust accordingly.

The term Big data is used to describe a massive volume of both structured
and unstructured data that is so large that it's difficult to process using
traditional database and software techniques.

Figure 1: Data mining concept

Structured vs. Unstructured

For the most part, structured data refers to information with a high degree
of organization, such that inclusion in a relational database is seamless
and readily searchable by simple, straightforward search engine
algorithms or other search operations; whereas unstructured data is
essentially the opposite. The lack of structure makes compilation a time- and energy-consuming task. It would be beneficial for a company across all business strata to find a mechanism of data analysis that reduces the costs unstructured data adds to the organization.

The Problem with Unstructured Data

Of course, if it were possible or feasible to instantly transform unstructured data into structured data, then creating intelligence from unstructured data would be easy. However, structured data is akin to machine language, in that it makes information much easier to deal with using computers; whereas unstructured data is (loosely speaking) usually for humans, who don't easily interact with information in a strict database format.

Email is an example of unstructured data, because while the busy inbox of a corporate human resources manager might be arranged by date, time or size, if it were truly fully structured it would also be arranged by exact subject and content, with no deviation or spread, which is impractical, because people don't generally speak about precisely one subject even in focused emails.

Spreadsheets, on the other hand, would be considered structured data,
which can be quickly scanned for information because it is properly
arranged in a relational database system. The problem that unstructured
data presents is one of volume; most business interactions are of this
kind, requiring a huge investment of resources to sift through and extract
the necessary elements, as in a web-based search engine. Because the
pool of information is so large, current data mining techniques often miss
a substantial amount of the information that's out there, much of which
could be game-changing data if efficiently analyzed.

BrightPlanet's Solution for Unstructured Data

BrightPlanet's Deep Web harvesting platform provides a robust solution for collecting both structured and unstructured data from the Internet. BrightPlanet takes a unique approach to "connecting" those unconnected strands of information through the use of metadata. BrightPlanet's Deep Web harvesting technology employs multiple threads to mass-harvest scalable quantities of unstructured data. Harvests are based on multiple user-developed queries, with results (web pages, PDFs, XLS, PPT, XML, etc.) qualified through customizable filters. BrightPlanet developed four scoring algorithms that index the information based on relevancy to further qualify the documents returned, ensuring the user sees only highly relevant content. The final user interface displays the qualified results in a searchable database based on customizable facets (URL, filetype, source category, people mentioned, places mentioned, companies mentioned, custom keywords, etc.).

Finding a way to analyze and create intelligence from the wealth of


unstructured data available on the Web can be expected to endow an
organization with the direct benefit of drastic increases in overall
effectiveness and speed of decision-making and implementation.

Structured data

are numbers and words that can be easily categorized and analyzed.
These data are generated by things like network sensors embedded in
electronic devices, smart phones, and global positioning system (GPS)
devices. Structured data also include things like sales figures, account
balances, and transaction data.

Unstructured data include more multifarious information, such as customer reviews from commercial websites, photos and other multimedia, and comments on social networking sites. These data cannot easily be separated into categories or analyzed numerically.
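As a small illustration of the difference, the sketch below (in Python, with hypothetical review text) turns a handful of unstructured customer reviews into structured rows that could be loaded into a relational table; the field layout and regular expression are assumptions made for the example.

```python
import re
import csv
import io

reviews = [
    "2016-03-01 rating: 4/5 - Fast delivery, happy with the phone.",
    "2016-03-02 rating: 2/5 - Battery died after two days.",
]

rows = []
for text in reviews:
    # Pull date, rating and free-text comment out of the unstructured line.
    m = re.match(r"(\d{4}-\d{2}-\d{2}) rating: (\d)/5 - (.*)", text)
    if m:
        rows.append({"date": m.group(1), "rating": int(m.group(2)),
                     "comment": m.group(3)})

# The structured rows can now go straight into a relational table or CSV.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "rating", "comment"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```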

Big Data Characteristics

- Volume

amount of data

- Velocity

the speed at which data is collected, acquired, generated, or processed

- Variety

different data types, such as audio, video, and image data (mostly unstructured data)

- Variability

semantics, or the variability of meaning in language.

- Veracity

The quality of captured data can vary greatly, affecting accurate analysis.

Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability: Inconsistency of the data set can hamper processes to handle and manage it.

Veracity: The quality of captured data can vary greatly, affecting accurate analysis.

Examples:

Government

On 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours.

Private Sector

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.

Facebook handles 40 billion photos from its user base.

Flickr, a public picture-sharing site, received 1.8 million photos per day on average from February to March 2012 [5]. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day.

(i) Volume- The name Big Data itself is a term which is related to size, and hence size is the characteristic that first needs to be considered when dealing with Big Data.

(ii) Variety- The next aspect of Big Data is its variety. This means that the category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people, who are closely analyzing the data and are associated with it, to effectively use the data to their advantage and thus uphold the importance of the Big Data.

(iii) Velocity- The term 'velocity' in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development.

(iv) Variability- This is a factor which can be a problem for those who analyse the data. It refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.

(v) Complexity- Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that is supposed to be conveyed by them. This situation is therefore termed the 'complexity' of Big Data.

Factory work may have a 6C system

Cyber-physical systems (CPS) first evolved from mechatronic devices


augmented with communication capabilities and connectivity. A CPS is
therefore a system composed of physical entities, such as mechanisms
controlled or monitored by computer-based algorithms. Today, a
precursor generation of cyber-physical systems can be found in areas as
diverse as aerospace, automotive, chemical processes, civil infrastructure,
energy, healthcare, manufacturing, transportation, entertainment, and
consumer appliances. CPS involves transdisciplinary approaches,
merging theory of cybernetics, mechatronic design, and design and
process science. This generation is often referred to as embedded
systems. In embedded systems the emphasis tends to be more on the
computational elements, and less on an intense link between the
computational and physical elements. CPS is also similar to the Internet of Things (IoT), sharing the same basic architecture; nevertheless, CPS presents a higher combination and coordination between physical and computational elements.

Applications of Big Data

Healthcare organizations can achieve better insight into disease trends


and patient treatments.

Public sector agencies can catch fraud and other threats in real-time.

Applications of Multimedia data

– To find travelling patterns of travelers

– CCTV camera footage

– Photos and Videos from social network

Recommender system

- Weather apps, Netflix, etc.

Security Threats for Big Data

- An unauthorized user may access files and could execute arbitrary code
or carry out further attacks.

- An unauthorized user may eavesdrop on or sniff data packets being sent to a client.

- An unauthorized client may read/write a data block of a file.

- An unauthorized client may gain access privileges.

- etc..

Security Solution

Security of big data can be enhanced by using the techniques of


authentication, authorization, encryption and audit trails. There is always a possibility of security violations through unintended or unauthorized access, or through inappropriate access by privileged users. The following are some of the methods used for protecting big data:

- Use secure communication

Implement secure communication between nodes and between nodes and


applications. This requires an SSL/TLS implementation that actually
protects all network communications rather than just a subset.

The privacy of data is also a huge concern in the context of Big Data. There is great public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources, so unauthorized use of private data needs to be prevented. To protect privacy, two common approaches are used. One is to restrict access to the data by adding certification or access control to the data entries, so that sensitive information is accessible to a limited group of users only. The other approach is to anonymize data fields such that sensitive information cannot be pinpointed to an individual record. For the first approach, a common challenge is to design secure certification or access control mechanisms so that no sensitive information can be misused by unauthorized individuals. For data anonymization, the main objective is to inject randomness into the data to ensure a number of privacy goals.
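As a rough illustration of these two approaches, the following Python sketch masks a direct identifier and injects Laplace-style randomness into a sensitive numeric field; the field names, salt, and noise scale are illustrative assumptions, and NumPy is assumed to be installed.

```python
import hashlib
import numpy as np

record = {"name": "Alice Smith", "city": "Babylon", "balance": 1520.0}

# 1) Restrict access to direct identifiers: replace the name with a salted hash
#    so the record can no longer be read as an individual's entry.
salt = "site-specific-secret"  # illustrative value only
record["name"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:12]

# 2) Anonymize a sensitive numeric field by injecting Laplace noise,
#    as done in differential-privacy-style anonymization.
record["balance"] = round(record["balance"] + np.random.laplace(loc=0.0, scale=50.0), 2)
print(record)
```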

- Using authentication methods

Authentication is the process of verifying a user's or system's identity before accessing the system. Authentication methods such as Kerberos can be employed for this.

Kerberos is a computer network authentication protocol which works on


the basis of 'tickets' to allow nodes communicating over a non-secure
network to prove their identity to one another in a secure manner.

- Use file encryption

Encryption ensures confidentiality and privacy of user information, and it


secures the sensitive data. Encryption protects data if malicious users or
administrators gain access to data and directly inspect files, and renders
stolen files or copied disk images unreadable. File layer encryption
provides consistent protection across different platforms regardless of
OS/platform type. Encryption meets the requirements for big data security. Open source products are available for most Linux systems, and commercial products additionally offer external key management and full support. This is a cost-effective way to deal with several data security threats.
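The sketch below shows file-layer encryption with the open source Python `cryptography` package (Fernet); it is one possible illustration, not a reference to any specific product mentioned above, and the file names are hypothetical.

```python
from cryptography.fernet import Fernet

# Create a small plaintext file to stand in for a sensitive data file.
with open("customers.csv", "w") as f:
    f.write("id,name,balance\n1,Alice,1520.0\n")

key = Fernet.generate_key()      # in practice, obtained from key management
fernet = Fernet(key)

with open("customers.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("customers.csv.enc", "wb") as f:
    f.write(ciphertext)          # a stolen copy of this file is unreadable without the key

print(fernet.decrypt(ciphertext).decode())   # decryption is only possible with the key
```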

- Implementing access controls

Authorization is the process of specifying access control privileges for a user or system in order to enhance security.

- Use key management

File layer encryption is not effective if an attacker can access encryption


keys. Many big data cluster administrators store keys on local disk drives
because it's quick and easy, but it's also insecure, as keys can be collected by the platform administrator or an attacker. Use a key management service to distribute keys and certificates, and manage different keys for each group, application, and user.
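One pattern a key management service commonly supports is envelope encryption: each file gets its own data key, and only a wrapped (encrypted) copy of that key is stored with the file. The following is a minimal sketch of the idea, again using the `cryptography` package; in practice the master key would live inside the key management service, not in the script.

```python
from cryptography.fernet import Fernet

master_key = Fernet.generate_key()   # held by the key management service, not on the cluster
kms = Fernet(master_key)

def encrypt_file_bytes(data: bytes):
    data_key = Fernet.generate_key()          # per-file data key
    ciphertext = Fernet(data_key).encrypt(data)
    wrapped_key = kms.encrypt(data_key)       # the data key is never stored in the clear
    return ciphertext, wrapped_key

def decrypt_file_bytes(ciphertext: bytes, wrapped_key: bytes):
    data_key = kms.decrypt(wrapped_key)       # requires access to the key management service
    return Fernet(data_key).decrypt(ciphertext)

ct, wk = encrypt_file_bytes(b"row1,row2,row3")
print(decrypt_file_bytes(ct, wk))
```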
- Logging

To detect attacks, diagnose failures, or investigate unusual behavior, we


need a record of activity. Unlike less scalable data management
platforms, big data is a natural fit for collecting and managing event data.
Many web companies start with big data particularly to manage log files.
Logging gives us a place to look when something fails, or when we suspect the system might have been hacked. So, to meet the security requirements, we need to audit the entire system on a periodic basis.

- etc…

Introduction to Data Mining

In the 1960s, statisticians used terms like "Data Fishing" or "Data


Dredging" to refer to what they considered the bad practice of analyzing
data without an a-priori hypothesis. The term "Data Mining" appeared
around 1990 in the database community; for a short time in the 1980s, the phrase "database mining" was also used.

In the academic community, the major forums for research started in 1995, when the First International Conference on Knowledge Discovery and Data Mining (KDD-95) was held. The term "data mining" is often treated as a synonym for another term, "knowledge discovery from data" (KDD), which highlights the goal of the mining process. To obtain useful knowledge from data, the following steps are performed in an iterative way (see Figure 2). Data mining has attracted more and more attention in recent years, probably because of the popularity of the "big data" concept. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. As a highly application-driven discipline, it has been successfully applied to many domains, such as business intelligence, Web search, scientific discovery, digital libraries, etc.

Steps of the KDD process

Step 1: Data preprocessing. Basic operations include data selection (to


retrieve data relevant to the KDD task from the database), data cleaning

(to remove noise and inconsistent data, to handle the missing data fields,
etc.) and data integration (to combine data from multiple sources).

Step 2: Data transformation. The goal is to transform data into forms


appropriate for the mining task, that is, to find useful features to represent
the data. Feature selection and feature transformation are basic
operations.

Step 3: Data mining. This is an essential process where intelligent


methods are employed to extract data patterns (e.g. association rules,
clusters, classification rules, etc.).

Step 4: Pattern evaluation and presentation. Basic operations include


identifying the truly interesting patterns which represent knowledge, and
presenting the mined knowledge in an easy-to-understand fashion.
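The following compact Python sketch walks through the four steps on a toy customer table, assuming pandas and scikit-learn are available; the column names and the choice of clustering as the mining step are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Step 1: data preprocessing -- select relevant columns and clean missing values.
raw = pd.DataFrame({"customer": [1, 2, 3, 4],
                    "age": [23, 45, None, 36],
                    "spend": [120.0, 560.0, 80.0, 300.0]})
data = raw[["age", "spend"]].fillna(raw["age"].median())

# Step 2: data transformation -- bring the features onto a comparable scale.
features = StandardScaler().fit_transform(data)

# Step 3: data mining -- here, clustering customers into two groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Step 4: pattern evaluation and presentation -- attach the discovered
# pattern back to the records in an easy-to-read form.
raw["segment"] = labels
print(raw)
```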

Figure 2: An overview of the KDD process.

In computing, a data warehouse (DW or DWH), also known as an


enterprise data warehouse (EDW), is a system used for reporting and
data analysis. DWs are central repositories of integrated data from one
or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers
throughout the enterprise. Examples of reports could range from

annual and quarterly comparisons and trends to detailed daily sales
analysis.

The data stored in the warehouse is uploaded from the operational


systems (such as marketing, sales, etc.). The data may pass through an operational data store for
additional operations before it is used in the DW for reporting.


Remember that data warehousing is a process that must occur before


any data mining can take place. In other words, data warehousing is
the process of compiling and organizing data into one common
database, and data mining is the process of extracting meaningful data
from that database. The data mining process relies on the data
compiled in the data warehousing phase in order to detect meaningful
patterns.
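As a toy illustration of this warehouse-then-mine ordering, the sketch below compiles records from two hypothetical operational feeds into one SQLite database and then queries the integrated store for a simple pattern; all table and feed names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")           # the "warehouse"
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Warehousing: load and integrate data from separate operational sources.
marketing_feed = [("north", 120.0), ("south", 80.0)]
store_feed = [("north", 200.0), ("south", 40.0), ("north", 60.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", marketing_feed + store_feed)

# Mining (very loosely): query the integrated data for a simple trend.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC"):
    print(region, total)
```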

Data mining is the process of semi-automatically analyzing large databases to find patterns that are:

– valid: hold on new data with some certainty

– useful: should be possible to act on the item

– understandable: humans should be able to interpret the


pattern

Also known as Knowledge Discovery in Databases (KDD)

Data mining is also defined as the computational process of discovering patterns in large data sets.

What does Knowledge Discovery in Databases (KDD) mean?

Knowledge discovery in databases (KDD) is the process of discovering


useful knowledge from a collection of data. This widely used data mining
technique is a process that includes data preparation and selection, data
cleansing, incorporating prior knowledge on data sets and interpreting
accurate solutions from the observed results. Major KDD application
areas include marketing, fraud detection, telecommunication and
manufacturing.

Traditionally, data mining and knowledge discovery were performed
manually. As time passed, the amount of data in many systems grew to
larger than terabyte size, and could no longer be maintained manually.
Moreover, for the successful existence of any business, discovering
underlying patterns in data is considered essential. As a result, several
software tools were developed to discover hidden data and make
assumptions, which formed a part of artificial intelligence.

The KDD process has reached its peak in the last 10 years. It now houses
many different approaches to discovery, which includes inductive
learning, Bayesian statistics, semantic query optimization, knowledge
acquisition for expert systems and information theory. The ultimate goal
is to extract high-level knowledge from low-level data.

Data mining techniques are used in many research areas, including


mathematics, cybernetics, genetics and marketing. Web mining, a type of
data mining used in customer relationship management (CRM), takes
advantage of the huge amount of information gathered by a Web site to
look for patterns in user behavior.

Web mining

In customer relationship management (CRM), Web mining is the


integration of information gathered by traditional data mining
methodologies and techniques with information gathered over the World
Wide Web. (Mining means extracting something useful or valuable from
a baser substance, such as mining gold from the earth.) Web mining is
used to understand customer behavior, evaluate the effectiveness of a
particular Web site, and help quantify the success of a marketing
campaign.

Web mining allows you to look for patterns in data through content
mining, structure mining, and usage mining. Content mining is used to
examine data collected by search engines and Web spiders. Structure
mining is used to examine data related to the structure of a particular Web
site and usage mining is used to examine data related to a particular user's
browser as well as data gathered by forms the user may have submitted
during Web transactions.

The information gathered through Web mining is evaluated (sometimes
with the aid of software graphing applications) by using traditional data
mining parameters such as clustering and classification, association, and
examination of sequential patterns.

Customer relationship management (CRM)

is a term that refers to practices, strategies and technologies that


companies use to manage and analyze customer interactions and data
throughout the customer lifecycle, with the goal of improving business
relationships with customers, assisting in customer retention and driving
sales growth. CRM systems are designed to compile information on
customers across different channels -- or points of contact between the
customer and the company -- which could include the company's website,
telephone, live chat, direct mail, marketing materials and social media.
CRM systems can also give customer-facing staff detailed information on
customers' personal information, purchase history, buying preferences
and concerns.

CRM software

CRM software consolidates customer information and documents into a


single CRM database so business users can more easily access and
manage it. The other main functions of this software include recording
various customer interactions (over email, phone calls, social media or
other channels, depending on system capabilities), automating various
workflow processes such as tasks, calendars and alerts, and giving
managers the ability to track performance and productivity based on
information logged within the system.

Data Mining Architecture

The major components of any data mining system are:

- data source

Database, data warehouse, World Wide Web (WWW), text files and
other documents are the actual sources of data. You need large volumes
of historical data for data mining to be successful. Organizations usually
store data in databases or data warehouses. Data warehouses may contain
one or more databases, text files, spreadsheets or other kinds of

information repositories. Sometimes, data may reside even in plain text
files or spreadsheets. World Wide Web or the Internet is another big
source of data.

Different Processes

The data needs to be cleaned, integrated and selected before passing it to


the database or data warehouse server. As the data is from different
sources and in different formats, it cannot be used directly for the data
mining process because the data might not be complete and reliable. So,
first data needs to be cleaned and integrated. Again, more data than
required will be collected from different data sources and only the data of
interest needs to be selected and passed to the server. These processes are
not as simple as we think. A number of techniques may be performed on
the data as part of cleaning, integration and selection.

- data warehouse server

The database or data warehouse server contains the actual data that is
ready to be processed. Hence, the server is responsible for retrieving the
relevant data based on the data mining request of the user.

- data mining engine

The data mining engine is the core component of any data mining system.
It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering,
prediction, time-series analysis etc.

- pattern evaluation module

The pattern evaluation module is mainly responsible for the measure of


interestingness of the pattern by using a threshold value. It interacts with
the data mining engine to focus the search towards interesting patterns.

- graphical user interface

The graphical user interface module communicates between the user and
the data mining system. This module helps the user use the system easily
and efficiently without knowing the real complexity behind the process.
When the user specifies a query or a task, this module interacts with the

data mining system and displays the result in an easily understandable
manner.

- knowledge base.

The knowledge base is helpful in the whole data mining process. It might
be useful for guiding the search or evaluating the interestingness of the
result patterns. The knowledge base might even contain user beliefs and
data from user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the knowledge
base to make the result more accurate and reliable. The pattern evaluation
module interacts with the knowledge base on a regular basis to get inputs
and also to update it.

Each and every component of data mining system has its own role and
importance in completing data mining efficiently. These different
modules need to interact correctly with each other in order to complete
the complex process of data mining successfully.

Figure 3: Data mining architecture

Data mining parameters (tasks)

Association - looking for patterns where one event is connected to


another event

Sequence or path analysis - looking for patterns where one event leads
to another later event

Classification - looking for new patterns (May result in a change in the


way the data is organized but that's ok)

Clustering - finding and visually documenting groups of facts not


previously known

Forecasting - discovering patterns in data that can lead to reasonable


predictions about the future (This area of data mining is known as
predictive analytics.)

Predictive analytics is the branch of data mining concerned with the


prediction of future probabilities and trends. The central element of
predictive analytics is the predictor, a variable that can be measured for
an individual or other entity to predict future behavior. For example, an
insurance company is likely to take into account potential driving safety
predictors such as age, gender, and driving record when issuing car
insurance policies.

Multiple predictors are combined into a predictive model, which, when


subjected to analysis, can be used to forecast future probabilities with an
acceptable level of reliability. In predictive modeling, data is collected, a
statistical model is formulated, predictions are made and the model is
validated (or revised) as additional data becomes available. Predictive
analytics are applied to many research areas, including meteorology,
security, genetics, economics, and marketing.
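A minimal sketch of such a predictive model is shown below, assuming scikit-learn is available; the predictors (age, recent violations) and the tiny synthetic data set are purely illustrative, not real insurance data.

```python
from sklearn.linear_model import LogisticRegression

# Predictors per applicant: [age, violations in the last 5 years]
X = [[18, 3], [22, 2], [25, 1], [30, 0], [45, 0], [50, 1], [60, 0], [19, 4]]
y = [1, 1, 0, 0, 0, 0, 0, 1]   # 1 = filed a claim, 0 = did not

# Combine the predictors into a single predictive model.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Forecast the claim probability for a new applicant.
new_applicant = [[21, 2]]
print("claim probability:", round(model.predict_proba(new_applicant)[0][1], 2))
```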

Data mining challenges & Solutions with big data

The main challenge for an intelligent database is handling Big Data. The important thing is to scale to the large amount of data and to provide solutions to this problem guided by the HACE (Heterogeneous, Autonomous, Complex and Evolving) theorem.

Figure 4: A conceptual view of the Big Data processing framework.

For an intelligent knowledge database system to handle Big Data, the


essential key is to scale up to the extremely large volume of data and
provide actions for the characteristics featured by the HACE theorem.
Figure. 4 shows a conceptual view of the Big Data processing framework,
which includes three tiers from inside out with considerations on data
accessing and computing (Tier I), data isolation and domain knowledge
(Tier II), and Big Data mining algorithms (Tier III).

Tier I: Big Data Mining Platform (Data Accessing & Computing):

In typical data mining systems, the mining procedures require computationally intensive units for data analysis and comparisons. For Big Data mining, because the amount of data is so massive that a single personal computer (PC) cannot handle it, a typical Big Data processing framework relies on cluster computers with a high-performance computing platform, with a data mining task executed by running parallel computing tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of cluster nodes. The function of the software module is to make sure that a single data mining task, such as finding the best match for a query in a database with billions of records, is divided into many small tasks, each of which runs on one or more cluster nodes.
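On a single machine, the same divide-and-conquer idea can be sketched with Python's multiprocessing module: one matching task is split into chunks, the chunks run in parallel, and the partial results are combined. The query, records, and scoring function below are illustrative stand-ins for a real cluster workload.

```python
from multiprocessing import Pool
from difflib import SequenceMatcher

QUERY = "big data mining"
# Toy "database" of records; a real system would hold billions of rows.
records = ["data warehousing", "big data mining tools", "web mining",
           "machine learning", "big data analytics"] * 1000

def best_in_chunk(chunk):
    # Score every record in this chunk against the query and keep the best one.
    return max(chunk, key=lambda r: SequenceMatcher(None, QUERY, r).ratio())

if __name__ == "__main__":
    chunks = [records[i::4] for i in range(4)]       # split into 4 small tasks
    with Pool(processes=4) as pool:
        partial = pool.map(best_in_chunk, chunks)    # run the small tasks in parallel
    best = max(partial, key=lambda r: SequenceMatcher(None, QUERY, r).ratio())
    print("best match:", best)
```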

Tier II: data isolation and domain knowledge

In Big Data, semantic and application knowledge refers to several aspects related to the rules, policies, user information and application information. The most important aspects in this tier are 1) information sharing and its confidentiality, and 2) domain and application knowledge.

Information Sharing and its confidentiality

Information sharing is a crucial goal for all systems involving multiple parties. While the goal of sharing is clear, a real-world concern is that Big Data applications are related to sensitive information, such as banking transactions and medical records. Simple data exchange does not resolve privacy concerns; for example, public revelation of an individual's personal locations and movements over time can have serious repercussions for privacy. To protect privacy, two common approaches are to 1) limit access to the data, such as adding certification or access control to the data entries, so sensitive information is accessible by a limited group of users only, and 2) remove or anonymize data fields such that sensitive information cannot be pinpointed to an individual record.

Domain and Application Knowledge

Domain and application knowledge provides necessary information for


designing Big Data mining algorithms and systems. In a simple case,
Application knowledge can help to identify right features for modeling
the essential data. The domain and application knowledge can also help
design feasible business objectives by using Big Data analytical
techniques.

Tier III: Big Data Mining Algorithms

Local knowledge and Model synthesis for Multiple Information


Sources

As Big Data applications are featured with independent sources and


decentralized controls, collecting all distributed data sources into a centralized site for mining is prohibitively expensive due to the transmission cost and privacy concerns. More specifically, the global mining can be organized as a two-step (local mining and global correlation) process, carried out at the data, model, and knowledge levels. At the data level, each local site can calculate the data
statistics based on the local data sources and exchange the statistics
between sites to achieve a global data distribution view. At the model or
pattern level, each site can carry out local mining activities, with respect
to the localized data, to discover local patterns. By exchanging patterns
between multiple sources, new global patterns can be synthetized by
aggregating patterns across all sites [2]. At the knowledge level, model
correlation analysis finds out the importance between models generated
from different data sources to determine how relevant the data sources are
correlated with each other, and how to form accurate decisions based on
models built from autonomous sources.
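A minimal sketch of the data-level step is shown below: each site shares only summary statistics (count and sum) rather than raw records, and a global view (here, the global mean) is derived from the exchanged statistics. Site names and values are illustrative.

```python
# Raw data that stays local to each site.
local_sites = {
    "site_A": [10.0, 12.0, 11.5, 9.0],
    "site_B": [20.0, 22.5, 19.0],
    "site_C": [15.0, 14.0],
}

# Each site exchanges only (count, sum) -- never the raw records.
summaries = {name: (len(vals), sum(vals)) for name, vals in local_sites.items()}

# The global data distribution view is built from the exchanged statistics.
total_count = sum(c for c, _ in summaries.values())
total_sum = sum(s for _, s in summaries.values())
print("global mean from exchanged statistics:", round(total_sum / total_count, 3))
```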

Mining from Meager, Tentative, and Partial Data

Meager, tentative, and partial data are defining features of Big Data applications. When data are meager, the number of data points is too few for deriving consistent conclusions. Tentative data are a special type of data where each data set is no longer deterministic but is subject to some random or inaccurate distributions. Absent values can be caused by different realities, such as the failure of a sensor node, or by regular policies that intentionally skip some values. While most modern data mining algorithms have built-in solutions to handle absent values, data imputation is an established research field that seeks to fill in absent values in order to produce enhanced models.

Mining Complex and Dynamic Data

The growth of Big Data is driven by the fast growth of complex data and their changes in volume and in nature. Documents posted on WWW servers, Internet backbones, social networks, communication networks, transportation networks, and so on are all featured with complex data. Complex dependency structures within the data raise the difficulty for our knowledge systems. Big Data complexity is presented in many aspects, including complex, diverse data types, complex intrinsic semantic relations in data, and complex association networks among data. In Big Data, data types include structured data, unstructured data, semistructured data, and so on. In particular, there are relational databases, text, hypertext, image, audio and video data, and so on.

Complex intrinsic semantic associations in data: news on the web, comments on Facebook, pictures on Picasa, and video clips on YouTube may all discuss the same academic award-winning event at the same time. There is no doubt that there are strong semantic associations in these data. Mining complex semantic associations from "text-image-video" data will significantly help improve the performance of application systems such as search engines or recommendation systems.

Complex relationship networks in data: in the context of Big Data, there exist relationships between individuals. On the Internet, individuals are webpages, and the pages linking to each other via hyperlinks form a complex network. There also exist social relationships between individuals forming complex social networks, such as big relationship data from Facebook, Twitter, LinkedIn, and other social media, including call detail records (CDR), device and sensor information, GPS and geocoded map data, massive image files transferred by managed file transfer protocols, web text and click-stream data, scientific information, e-mail, and so on. To deal with complex relationship networks, emerging research efforts have begun to address the issues of structure-and-evolution, crowds-and-interaction, and information-and-communication.

Challenges

- Location of Big Data sources- Commonly Big Data are stored in


different locations

- Volume of the Big Data- size of the Big Data grows continuously.

- Hardware resources- RAM capacity

- Privacy- Medical reports, bank transactions

- Having domain knowledge

- Getting meaningful information

Solutions

- Parallel computing programming

- An efficient computing platform will not have centralized data storage; instead, the platform will use large-scale distributed storage.

- Restricting access to the data

Big Data Mining Tools

- Hadoop

- MapReduce

- Apache S4

- Storm

- Apache Mahout

- etc …

Hadoop

Hadoop is a scalable, open source, fault-tolerant virtual grid operating system architecture for data storage and processing. It runs on commodity hardware and uses HDFS, a fault-tolerant, high-bandwidth clustered storage architecture. It runs MapReduce for distributed data processing and works with both structured and unstructured data.

- Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, workflow, and configuration administration.

- MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

MapReduce consists of two functions, map() and reduce(). The mapper performs the tasks of filtering and sorting, and the reducer performs the task of summarizing the results.
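The sketch below imitates the map()/reduce() model in plain Python (an in-memory illustration of the programming model, not the Hadoop framework itself): the mapper emits (word, 1) pairs, the pairs are grouped by key in a shuffle step, and the reducer summarizes each group.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data and data mining", "mining big data"]

def mapper(doc):
    # Emit one (word, 1) pair per word -- the filtering/sorting side of the model.
    for word in doc.split():
        yield (word, 1)

def reducer(word, counts):
    # Summarize all counts for one key.
    return (word, sum(counts))

# Map phase
pairs = [pair for doc in documents for pair in mapper(doc)]

# Shuffle/sort phase: group intermediate pairs by key
pairs.sort(key=itemgetter(0))

# Reduce phase
results = [reducer(word, (c for _, c in group))
           for word, group in groupby(pairs, key=itemgetter(0))]
print(results)
```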

Big Data Mining Algorithm

Big Data applications have many sources from which to gather information. If we wanted to mine the data centrally, we would need to gather all distributed data at a centralized site, but this is prohibitive because of the high data transmission cost and privacy concerns.

Instead, mining is carried out at several levels, so that correlations or patterns can be discovered from the combined variety of sources.

Global data mining is done through a two-step process:

– Model level

– Knowledge level.

At the data level, each local site uses its local data to calculate data statistics and shares this information with the other sites in order to obtain a global view of the data distribution.

At the model level, each site produces local patterns by mining its local data. By sharing these local patterns with other local sites, a single set of global patterns can be produced. At the knowledge level, model correlation analysis investigates the relevance between models generated from various data sources to determine how the data sources are correlated with each other, and how to form accurate decisions based on models built from autonomous sources.

Conclusions

Big Data consists of huge, complex, growing data sets with numerous, independent sources. With the fast development of networking, data storage, and data-gathering capacity, Big Data are now quickly increasing in all science and engineering domains, as well as in the biological, genetic and biomedical sciences.

Data mining has been successfully applied to many domains, such as


business intelligence, Web search, scientific discovery, digital libraries,
etc.

Q: Is there multimedia mining or not? And if there is, how can knowledge be extracted?

Multimedia data mining is used for extracting interesting information from multimedia data sets, such as audio, video, images, graphics, speech, text and combinations of several data types, which are all converted from different formats into digital media. Multimedia mining is a subfield of data mining which is used to find interesting information or implicit knowledge from multimedia databases. Multimedia data are classified into five types: (i) text data, (ii) image data, (iii) audio data, (iv) video data and (v) electronic and digital ink. Text data can be used in web browsers and in messages like MMS and SMS. Image data can be used in artwork and pictures with text, such as still images taken by a digital camera. Audio data contain sound, MP3 songs, speech and music. Video data include time-aligned sequences of frames, such as MPEG videos from desktops, cell phones and video cameras.

Electronic and digital ink is a sequence of time-aligned 2D or 3D coordinates from a stylus, a light pen, data glove sensors, or similar graphical devices; these are stored in a multimedia database and used to develop multimedia systems.

References

1- S. Vijayarani and A. Sakila, "Multimedia Mining Research – An Overview," Department of Computer Science, Bharathiar University, Coimbatore, January 2015.

2- Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, and Yong Ren, "Information Security in Big Data: Privacy and Data Mining," Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

3- Jaseena K.U. and Julie M. David, "Issues, Challenges, and Solutions: Big Data Mining," Department of Computer Applications, M.E.S College, Marampally, Aluva, Cochin, India.

4- Deepak S. Tamhane and Sultana N. Sayyad, "Big Data Analysis Using HACE Theorem," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 1, January 2015.

5- "Big Data and Big Data Mining: Study of Approaches, Issues and Future Scope," International Journal of Engineering Trends and Technology, December 2014.

6- "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, 26(1):97-107, January 2014.

7- "Undefined by Data: A Survey of Big Data Definitions."

https://www.researchgate.net/profile/Ameer_Sameer

https://www.linkedin.com/in/ameer-sameer-452693107

http://www.slideshare.net/AmeerSameer

http://facebook.com/ameer.Mee/


