
St. Paul University Philippines
Tuguegarao City, Cagayan 3500

DIT 310 Data Mining


2nd Trimester A.Y. 2017-2018
Job Sheet No. 1
Name: Romeo T. Balingao Date Submitted: November 7, 2017
Program/Year: DIT - 2 Professor: Byron Joseph A. Hallar, DIT

Activity 1 - Data Mining in the News

A. Using GOOGLE Search Engine


DATA MINING REVEALS THE SIX BASIC EMOTIONAL ARCS OF STORYTELLING
Scientists at the Computational Story Laboratory have analyzed novels to identify the
building blocks of all stories to reveal the emotional arcs of storytelling.
By: Emerging Technology from the arXiv July 6, 2016

Back in 1995, Kurt Vonnegut gave a lecture in which he described his theory about
the shapes of stories. In the process, he plotted several examples on a blackboard. "There
is no reason why the simple shapes of stories can't be fed into computers," he said. "They
are beautiful shapes." The video is available on YouTube.

Vonnegut was representing in graphical form an idea that writers have explored
for centuries: that stories follow emotional arcs, that these arcs can have different
shapes, and that some shapes are better suited to storytelling than others.

Vonnegut mapped out several arcs in his lecture. These include the simple arc
encapsulating "man falls into hole, man gets out of hole" and the more complex one of
"boy meets girl, boy loses girl, boy gets girl."

Vonnegut is not alone in attempting to categorize stories into types, although he
was probably the first to do it in graphical form. Aristotle was at it over 2,000 years before
him, and many others have followed in his footsteps.

However, there is little agreement on the number of different emotional arcs that
arise in stories or their shape. Estimates vary from three basic patterns to more than 30.
But there is little in the way of scientific evidence to favor one number over another.

Today, that changes thanks to the work of Andrew Reagan at the Computational
Story Lab at the University of Vermont in Burlington and a few colleagues. These guys
have used sentiment analysis to map the emotional arcs of over 1,700 stories and then
used data-mining techniques to reveal the most common arcs. "We find a set of six core
trajectories which form the building blocks of complex narratives," they say.

Their method is straightforward. The idea behind sentiment analysis is that words
have a positive or negative emotional impact. Words can therefore serve as a measure of
the emotional valence of a text and of how it changes from moment to moment. Measuring
the shape of a story arc is then simply a question of assessing the emotional polarity of
the story at each instant and tracking how it changes.

Reagan and co do this by analyzing the emotional polarity of word windows and
sliding these windows through the text to build up a picture of how the emotional valence
changes. They performed this task on over 1,700 English works of fiction that had each
been downloaded from the Project Gutenberg website more than 150 times.
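The sliding-window idea can be sketched in a few lines of Python. The toy valence lexicon, window size, and sample text below are illustrative assumptions, not the actual lexicon or parameters Reagan and co used:

```python
# Hypothetical mini-lexicon mapping words to emotional valence scores.
VALENCE = {"love": 1.0, "happy": 1.0, "hole": -0.5, "lost": -1.0, "dies": -1.0}

def emotional_arc(words, window=3):
    """Average valence of each sliding window of words through the text."""
    scores = []
    for i in range(len(words) - window + 1):
        chunk = words[i:i + window]
        scores.append(sum(VALENCE.get(w, 0.0) for w in chunk) / window)
    return scores

text = "boy meets girl boy lost girl boy finds love".split()
arc = emotional_arc(text)  # dips where "lost" appears, rises toward "love"
```

Plotting such a sequence of window scores over the length of a novel gives the emotional arc the article describes.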

Finally, they used a variety of data-mining techniques to tease apart the different
emotional arcs present in these stories.

The results make for interesting reading. Reagan and co say that their techniques
all point to the existence of six basic emotional arcs that form the building blocks of more
complex stories. They are also able to identify the stories that are the best examples of
each arc.

The six basic emotional arcs are these:

A steady, ongoing rise in emotional valence, as in a rags-to-riches story such as
Alice's Adventures Under Ground by Lewis Carroll.
A steady, ongoing fall in emotional valence, as in a tragedy such as Romeo and Juliet.
A fall then a rise, such as the man-in-a-hole story discussed by Vonnegut.
A rise then a fall, such as the Greek myth of Icarus.
Rise-fall-rise, such as Cinderella.
Fall-rise-fall, such as Oedipus.

Finally, the team looks at the correlation between the emotional arc and the number
of story downloads to see which types of arc are most popular. It turns out the most
popular are stories that follow the Icarus and Oedipus arcs and stories that follow more
complex arcs that use the basic building blocks in sequence. In particular, the team says
the most popular are stories involving two sequential man-in-hole arcs and a Cinderella
arc followed by a tragedy.

Of course, many books follow more complex arcs at more fine-grained resolution.
Reagan and co's method does not capture the changes in emotional polarity that occur on
the level of paragraphs, for example. But instead, it captures the much broader emotional
arcs involved in storytelling. Their story arcs are available here.

That's interesting work that provides empirical evidence for the existence of basic
story arcs for the first time. It also provides an important insight into the nature of
storytelling and its appeal to the human psyche.

It also sets the scene for more ambitious work. Reagan and co look mainly at
works of fiction in English. It would be interesting to see how emotional arcs vary
according to language or culture, how they have varied over time and also how factual
books compare.

Vonnegut famously outlined his theory of story shapes in his master's thesis in
anthropology at the University of Chicago. It was summarily rejected because, in
Vonnegut's words, it "was so simple, and looked like too much fun." Today he would surely
be amused but unsurprised.

B. Using YAHOO Search Engine

Data Mining DNA for Polycystic Ovary Syndrome (PCOS) Genes


Article written by Erin Spain, August 18, 2015

This article/study shows that scientists identified PCOS susceptibility regions that appear
to be unique to European women.

First study of its kind to investigate PCOS in the genomes of women of European
ancestry
Learning more about genes associated with PCOS can lead to therapies, disease
risk prediction
PCOS affects seven to 10 percent of women and there is no FDA-approved
treatment or cure
PCOS increases risk for type 2 diabetes in adolescent girls and young women more
than 4-fold

CHICAGO - Polycystic ovary syndrome (PCOS) has been passed down in many families
for generations, causing reproductive and metabolic health problems for millions of
women around the world. Yet its cause remains unknown despite more than 80 years of
research since the disorder was first described in 1935.

A new Northwestern Medicine genome-wide association study of PCOS, the first of its
kind to focus on women of European ancestry, has provided important new insights into
the underlying biology of the disorder.

Using the DNA of thousands of women and genotyping nearly 700,000 genetic markers
from each individual, an international team led by investigators from Northwestern
Medicine has identified two new genetic susceptibility regions that appear to be unique
to European women with PCOS, as well as one region also present in Chinese women
with PCOS.

Most importantly, one of these new regions contains the gene for the pituitary
gonadotropin hormone FSH (follicle-stimulating hormone), providing evidence that
disruption of this pathway, which regulates ovarian function, plays an essential role in
the development of PCOS.

The study was published August 18 in the journal Nature Communications.

"Identifying the genes associated with PCOS gives us clues about the biological pathways
that cause the disorder," said Dr. Andrea Dunaif, senior author of the study.
"Understanding these pathways can lead to new treatments and disease prevention
approaches."

Dunaif is the Charles F. Kettering Professor of Endocrinology and Metabolism at
Northwestern University Feinberg School of Medicine. She also is a physician at
Northwestern Memorial Hospital.

"There are no FDA-approved drugs for PCOS," Dunaif said. "We can manage symptoms
and improve fertility in patients but, without understanding the cause of PCOS, we cannot
cure the condition."

"Large-scale genetic analyses that became possible after the mapping of the human
genome allow us to investigate the entire genome for disease-causing DNA variants," she
said. "This information will be critical for the development of effective treatments for PCOS
and for genetic testing to identify at-risk girls before the onset of symptoms."

PCOS affects seven to 10 percent of reproductive-age women worldwide with symptoms
such as increased male-pattern hair growth, weight gain, irregular periods and infertility.
These symptoms are due to increased production of male hormones, in particular
testosterone, by the ovary and disordered secretion of the pituitary gonadotropins, LH
(luteinizing hormone) and FSH, resulting in anovulation (the absence of ovulation).

As a result of previous research from Dunaif's lab, PCOS also is associated with insulin
resistance and is now recognized as a leading risk factor for type 2 diabetes in young
women. Her group has been at the forefront of genetic studies of PCOS and has shown
that male relatives and the children of affected women are at increased risk of type 2
diabetes and reproductive problems.

The Northwestern study complements a recent genome-wide association study of Chinese
women that identified 11 PCOS susceptibility regions of the genome. The regions include
the genes for the receptors for the gonadotropins LH and FSH. LH and FSH, acting through
these receptors, regulate the production of ovarian hormones, such as estradiol and
testosterone (ovaries make both male and female hormones), as well as the maturation
and ovulation of the egg. Taken together with the Northwestern study's finding of the gene
for the FSH hormone itself, these analyses implicate genes regulating gonadotropins and
their actions in causing PCOS.

"For a number of years, researchers had been thinking that it was testosterone produced
by the ovary that was a major problem in PCOS, but our study did not find signals for
genes regulating testosterone," Dunaif said. "In contrast, we did find a signal for the FSH
gene, which is produced in the pituitary gland at the base of the brain. This suggests that
FSH, in either how it acts on the ovary or how it is secreted, is very important in the
development of PCOS. This is a new way of thinking about the biology of PCOS."

The Northwestern study included three phases with three different sets of DNA. All of the
DNA was from women of European ancestry, those with PCOS and those without it (the
controls).

"My lab focuses on big data analysis such as this genome-wide analysis to identify genetic
variants that are associated with increased disease risk," said M. Geoffrey Hayes, lead
author of the study, who specializes in statistical genetics at Feinberg. "We genotyped
nearly 700,000 markers across the genome in each of nearly 4,000 women from the U.S.
in the first phase. Then we replicated our findings with two additional groups of more than
4,500 women from the U.S. and Europe."

Hayes is an assistant professor of endocrinology at Feinberg and an assistant professor
of anthropology at Northwestern's Weinberg College of Arts and Sciences.

Altogether, nearly 9,000 individual DNA samples were used in the study, with samples
provided by Northwestern Medicine and partner institutions around the world. The
NUgene Project, a genomic biobank sponsored by the Center for Genetic Medicine at
Northwestern in partnership with Northwestern Medicine, provided many of the control
samples.

"The next step in this research is to use the genetic variations in the genes that confer
increased risk for PCOS and build models to investigate how the biologic pathways
implicated are disrupted functionally," Hayes said. "One of the three gene regions
identified in our study in Europeans was also found in Chinese women, suggesting that
there may be some shared genetic susceptibility to PCOS in Europeans and Chinese, who
diverged evolutionarily more than 40,000 years ago. We plan to DNA sequence the regions
that we found in common with the Chinese."

The Northwestern scientists are partnering with Professor Zi-Jiang Chen and her
colleagues at Shandong University, PRC, to advance this work.

More genome-wide association studies of PCOS are needed in other racial and ethnic
groups. Next, the Northwestern scientists plan to investigate the genomes of women of
African ancestry with the disorder. These studies will provide insight into the shared
genetic basis for PCOS and will aid in finding the genes that are key players in the
development of the disease across ethnicities.

https://news.northwestern.edu/stories/2015/08/data-mining-dna-for-
polycystic-ovary-syndrome-genes

Activity 2. Scavenger Hunt

1. What is data mining? In your answer, address the following:

Data mining is the practice of automatically searching large stores of data to
discover patterns and trends that go beyond simple analysis. Data mining uses
sophisticated mathematical algorithms to segment the data and evaluate the
probability of future events. Data mining is also known as Knowledge Discovery in
Data (KDD).

The key properties of data mining are:

Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases

Data mining can answer questions that cannot be addressed through simple
query and reporting techniques.

Automatic Discovery
Data mining is accomplished by building models. A model uses an algorithm to
act on a set of data. The notion of automatic discovery refers to the execution of
data mining models.

Data mining models can be used to mine the data on which they are built, but
most types of models are generalizable to new data. The process of applying a
model to new data is known as scoring.
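The build-then-score cycle described above can be illustrated with a minimal sketch. A toy nearest-centroid classifier stands in for a real mining algorithm, and the feature values and class labels are invented for the example:

```python
def build_model(rows, labels):
    """Learn one centroid (mean feature vector) per class from historical data."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(col) for col in zip(*members)]
            for label, members in groups.items()}

def score(model, row):
    """Apply the model to a new, unseen record -- i.e., scoring."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], row))

# Build the model on data describing past behavior...
model = build_model([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]],
                    ["buyer", "buyer", "non-buyer", "non-buyer"])
# ...then generalize it to a record it has never seen.
prediction = score(model, [1.1, 0.9])
```

The point of the sketch is the separation of the two steps: the model is built once on known data, then applied repeatedly to new data.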

a. Is it another hype?

Answer:

No. There are a number of use cases where Big Data technologies make perfect sense
and can result in a substantial return on investment, such as unknown or frequently
changing requirements, immense scalability at lower cost, unknown structured data, and
real-time analytics:

Unknown or Frequently Changing Requirements: One of the greatest aspects of Big
Data is the ability to preserve data sources as a whole, instead of structuring them into
a well-defined set of tables and columns. This provides a high degree of flexibility, which
no traditional data warehouse solution can match. Often, in a traditional data warehouse
scenario, when a business user decides that he needs additional information that is
available in the operational systems but not on his reports, he ends up waiting for months
to get it, since it is not loaded into the data warehouse and the whole ETL process and
table structure need to be updated. Big Data solutions overcome this challenge by simply
dumping all data from source systems and defining its structure whenever needed.
Immense Scalability at Lower Cost: One of the greatest promises of Big Data is
scalability at lower cost, with technologies such as Hadoop parallelizing immense

data storage and querying on relatively low-cost, less-specialized hardware.
On the downside, such technologies usually drive up operational and
administrative costs, sometimes resulting in an even higher total cost of ownership
than traditional solutions. Still, a feasibility assessment is worth doing; for
organizations dealing with immense data sets, it could yield substantial cost
savings over traditional data warehousing setups.
Unknown Structured Data: Although relatively infrequent, the structures of data
sources are sometimes unknown during development and need to be discovered
when needed. This is especially useful when the business needs quick results and
insights and the BI teams have no time to spend months on source data
discovery. With their "dump and ask questions later" approach, Big Data
technologies can be of great use in these scenarios.
Readiness and Need for Real-Time Analytics: One of the most common use
cases of Big Data technologies nowadays is real-time reporting and marketing.
With relatively better abilities to process data fast, Big Data solutions enable some
organizations (especially those that already have the ability to make use of real-
time insights) to decide and act faster.
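The "dump all data, define its structure whenever needed" idea from the first use case above can be sketched in a few lines. The raw records, field names, and question asked are all hypothetical:

```python
import json

# Raw events are dumped as-is, with no schema imposed at write time.
raw_store = [
    '{"user": "a1", "event": "click", "page": "/home"}',
    '{"user": "a2", "event": "purchase", "amount": 19.99}',
    '{"user": "a1", "event": "purchase", "amount": 5.00}',
]

# Later, a new question arrives: total revenue per user. The "schema"
# (which fields matter) is defined only now, at read time.
revenue = {}
for line in raw_store:
    record = json.loads(line)
    if record.get("event") == "purchase":
        revenue[record["user"]] = revenue.get(record["user"], 0.0) + record["amount"]
```

Nothing about the stored records had to change to answer the new question; a traditional warehouse would have required an ETL and table-structure update first.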

b. Is it a simple transformation or application of technology developed from
databases, statistics, machine learning, and pattern recognition?

Answer:

c. We have presented a view that data mining is the result of the evolution of
database technology. Do you think that data mining is also the result of
the evolution of machine learning research? Can you present such views
based on the historical progress of this discipline? Address the same for
the fields of statistics and pattern recognition.

Answer:

d. Describe the steps involved in data mining when viewed as a process of
knowledge discovery.

Answer:
The Data Mining Process

Figure 1-1 illustrates the phases, and the iterative nature, of a data mining project.
The process flow shows that a data mining project does not stop when a particular
solution is deployed. The results of data mining trigger new business questions, which
in turn can be used to develop more focused models.

Figure 1-1: The Data Mining Process



Problem Definition
This initial phase of a data mining project focuses on understanding the project
objectives and requirements. Once you have specified the project from a business
perspective, you can formulate it as a data mining problem and develop a preliminary
implementation plan.

For example, your business problem might be: "How can I sell more of my product
to customers?" You might translate this into a data mining problem such as: "Which
customers are most likely to purchase the product?" A model that predicts who is most
likely to purchase the product must be built on data that describes the customers who
have purchased the product in the past. Before building the model, you must assemble
the data that is likely to contain relationships between customers who have purchased
the product and customers who have not purchased the product. Customer attributes
might include age, number of children, years of residence, owners/renters, and so on.

Data Gathering and Preparation

The data understanding phase involves data collection and exploration. As you
take a closer look at the data, you can determine how well it addresses the business
problem. You might decide to remove some of the data or add additional data. This is also
the time to identify data quality problems and to scan for patterns in the data.

The data preparation phase covers all the tasks involved in creating the case table
you will use to build the model. Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table, case, and attribute selection
as well as data cleansing and transformation. For example, you might transform
a DATE_OF_BIRTH column to AGE; you might insert the average income in cases where
the INCOME column is null.
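The two transformations just mentioned can be sketched with the standard library alone. The records, the reference date, and the values are invented for the example; the column names match the text:

```python
from datetime import date

cases = [
    {"DATE_OF_BIRTH": date(1980, 5, 1), "INCOME": 40000},
    {"DATE_OF_BIRTH": date(1990, 5, 1), "INCOME": None},
    {"DATE_OF_BIRTH": date(1970, 5, 1), "INCOME": 60000},
]

today = date(2017, 11, 7)  # fixed reference date so the result is reproducible

# Impute missing INCOME with the average of the known values.
known = [c["INCOME"] for c in cases if c["INCOME"] is not None]
avg_income = sum(known) / len(known)

for c in cases:
    # Transform DATE_OF_BIRTH into AGE, adjusting if the birthday hasn't passed yet.
    dob = c.pop("DATE_OF_BIRTH")
    c["AGE"] = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if c["INCOME"] is None:
        c["INCOME"] = avg_income
```

In a real project these steps would run inside the ETL or data preparation pipeline, but the logic is the same.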

Additionally, you might add new computed attributes in an effort to tease information
closer to the surface of the data. For example, rather than using the purchase amount,
you might create a new attribute: "Number of Times Purchase Amount Exceeds $500 in a
12-Month Period." Customers who frequently make large purchases may also be related
to customers who respond or don't respond to an offer.
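Computing such a derived attribute is a single pass over the transaction records. The customer IDs, dates, and amounts below are invented:

```python
from datetime import date

# Hypothetical transaction records: (customer, purchase date, amount).
purchases = [
    ("c1", date(2017, 1, 10), 800),
    ("c1", date(2017, 6, 2), 120),
    ("c1", date(2017, 9, 30), 650),
    ("c2", date(2017, 3, 15), 300),
]

# Derived attribute: per customer, how many purchases over $500
# fell within the 12-month window.
start, end = date(2017, 1, 1), date(2017, 12, 31)
big_purchases = {}
for customer, day, amount in purchases:
    if start <= day <= end and amount > 500:
        big_purchases[customer] = big_purchases.get(customer, 0) + 1
```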

Thoughtful data preparation can significantly improve the information that can be
discovered through data mining.

Model Building and Evaluation

In this phase, you select and apply various modeling techniques and calibrate the
parameters to optimal values. If the algorithm requires data transformations, you will
need to step back to the previous phase to implement them.

In preliminary model building, it often makes sense to work with a reduced set of data
(fewer rows in the case table), since the final case table might contain thousands or
millions of cases.

At this stage of the project, it is time to evaluate how well the model satisfies the
originally stated business goal (phase 1).
If the model is supposed to predict customers who are likely to purchase a
product, does it sufficiently differentiate between the two classes?
Is there sufficient lift?
Are the trade-offs shown in the confusion matrix acceptable?
Would the model be improved by adding text data? Should transactional
data such as purchases (market-basket data) be included?
Should costs associated with false positives or false negatives be
incorporated into the model?
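Two of the measures named in the questions above, the confusion matrix and lift, can be computed directly from a toy set of labels. Both the actual and predicted values here are invented for illustration:

```python
# 1 = purchased, 0 = did not purchase (invented evaluation data).
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# The four cells of the confusion matrix.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

# Lift: how much more often positives occur among the records the model
# flags, compared with the base rate in the whole data set.
base_rate = sum(actual) / len(actual)
flagged = [a for a, p in zip(actual, predicted) if p == 1]
lift = (sum(flagged) / len(flagged)) / base_rate
```

A lift well above 1 means the model concentrates likely purchasers far better than random selection would, which is the practical test of the business goal.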

Knowledge Deployment
Knowledge deployment is the use of data mining within a target environment. In the
deployment phase, insight and actionable information can be derived from data.

Deployment can involve scoring (the application of models to new data), the extraction
of model details (for example, the rules of a decision tree), or the integration of data mining
models within applications, data warehouse infrastructure, or query and reporting tools.

2. How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database:
A data warehouse is a repository of information collected from multiple
sources, over a history of time, stored under a unified schema, and used for data
analysis and decision support, whereas a database is a collection of interrelated
data that represents the current status of the stored data. There could be multiple
heterogeneous databases where the schema of one database may not agree with

the schema of another. A database system supports ad hoc queries and online
transaction processing.

Differences between Operational Database Systems and Data Warehouses.


Similarities between a data warehouse and a database: Both are repositories of
information, storing huge amounts of persistent data.

3. Define each of the following data mining functionalities: characterization,
discrimination, association and correlation analysis, classification,
regression, clustering, and outlier analysis. Give examples of each data
mining functionality, using a real-life database that you are familiar with.

4. What are the major challenges of mining a huge amount of data (e.g.,
billions of tuples) in comparison with mining a small amount of data (e.g.,
a data set of a few hundred tuples)?
