Back in 1995, Kurt Vonnegut gave a lecture in which he described his theory about
the shapes of stories. In the process, he plotted several examples on a blackboard. "There
is no reason why the simple shapes of stories can't be fed into computers," he said. "They
are beautiful shapes." The video is available on YouTube.
Vonnegut was representing in graphical form an idea that writers have explored
for centuries: that stories follow emotional arcs, that these arcs can have different
shapes, and that some shapes are better suited to storytelling than others.
Vonnegut mapped out several arcs in his lecture. These include the simple arc
encapsulating "man falls into hole, man gets out of hole" and the more complex one of
"boy meets girl, boy loses girl, boy gets girl."
However, there is little agreement on the number of different emotional arcs that
arise in stories or their shape. Estimates vary from three basic patterns to more than 30.
But there is little in the way of scientific evidence to favor one number over another.
Today, that changes thanks to the work of Andrew Reagan at the Computational
Story Lab at the University of Vermont in Burlington and a few pals. These guys
have used sentiment analysis to map the emotional arcs of over 1,700 stories and then
used data-mining techniques to reveal the most common arcs. "We find a set of six core
trajectories which form the building blocks of complex narratives," they say.
Their method is straightforward. The idea behind sentiment analysis is that words
have a positive or negative emotional impact, so the words in a text can serve as a
measure of its emotional valence and how that valence changes from moment to moment.
Measuring the shape of a story arc is then simply a question of assessing the emotional
polarity of the story at each instant and tracking how it changes.
St. Paul University Philippines
Tuguegarao City, Cagayan 3500
Reagan and co do this by analyzing the emotional polarity of word windows and
sliding these windows through the text to build up a picture of how the emotional valence
changes. They performed this task on over 1,700 English works of fiction that had each
been downloaded from the Project Gutenberg website more than 150 times.
Finally, they used a variety of data-mining techniques to tease apart the different
emotional arcs present in these stories.
The results make for interesting reading. Reagan and co say that their techniques
all point to the existence of six basic emotional arcs that form the building blocks of more
complex stories. They are also able to identify the stories that are the best examples of
each arc.
Finally, the team looks at the correlation between the emotional arc and the number
of story downloads to see which types of arc are most popular. It turns out the most
popular are stories that follow the Icarus and Oedipus arcs and stories that follow more
complex arcs that use the basic building blocks in sequence. In particular, the team says
the most popular are stories involving two sequential man-in-hole arcs and a Cinderella
arc followed by a tragedy.
Of course, many books follow more complex arcs at a finer-grained resolution.
Reagan and co's method does not capture the changes in emotional polarity that occur on
the level of paragraphs, for example. Instead, it captures the much broader emotional
arcs involved in storytelling. Their story arcs are available here.
That's interesting work that provides empirical evidence for the existence of basic
story arcs for the first time. It also provides an important insight into the nature of
storytelling and its appeal to the human psyche.
It also sets the scene for more ambitious work. Reagan and co looked mainly at
works of fiction in English. It would be interesting to see how emotional arcs vary
according to language or culture, how they have varied over time, and also how factual
books compare.
Vonnegut famously outlined his theory of story shapes in his master's thesis in
anthropology at the University of Chicago. It was summarily rejected, in Vonnegut's
words, because it "was so simple, and looked like too much fun." Today he would surely
be amused but unsurprised.
This article/study shows that scientists have identified PCOS susceptibility regions that
appear to be unique to European women:
First study of its kind to investigate PCOS in the genomes of women of European
ancestry
Learning more about genes associated with PCOS can lead to therapies, disease
risk prediction
PCOS affects seven to 10 percent of women and there is no FDA-approved
treatment or cure
PCOS increases risk for type 2 diabetes in adolescent girls and young women more
than 4-fold
CHICAGO - Polycystic ovary syndrome (PCOS) has been passed down in many families
for generations, causing reproductive and metabolic health problems for millions of
women around the world. Yet its cause remains unknown despite more than 80 years of
research since the disorder was first described in 1935.
A new Northwestern Medicine genome-wide association study of PCOS, the first of its
kind to focus on women of European ancestry, has provided important new insights into
the underlying biology of the disorder.
Using the DNA of thousands of women and genotyping nearly 700,000 genetic markers
from each individual, an international team led by investigators from Northwestern
Medicine has identified two new genetic susceptibility regions that appear to be unique
to European women with PCOS, as well as one region also present in Chinese women
with PCOS.
Most importantly, one of these new regions contains the gene for the pituitary
gonadotropin hormone FSH (follicle-stimulating hormone), providing evidence that
disruption in this pathway that regulates ovarian function plays an essential role in the
development of PCOS.
"Identifying the genes associated with PCOS gives us clues about the biological
pathways that cause the disorder," said Dr. Andrea Dunaif, senior author of the study.
"Understanding these pathways can lead to new treatments and disease prevention
approaches."
"There are no FDA-approved drugs for PCOS," Dunaif said. "We can manage symptoms
and improve fertility in patients but, without understanding the cause of PCOS, we cannot
cure the condition."
"Large-scale genetic analyses that became possible after the mapping of the human
genome allow us to investigate the entire genome for disease-causing DNA variants," she
said. "This information will be critical for the development of effective treatments for PCOS
and for genetic testing to identify at-risk girls before the onset of symptoms."
As a result of previous research from Dunaif's lab, PCOS also is associated with insulin
resistance and is now recognized as a leading risk factor for type 2 diabetes in young
women. Her group has been at the forefront of genetic studies of PCOS and has shown
that male relatives and the children of affected women are at increased risk of type 2
diabetes and reproductive problems.
"For a number of years, researchers had been thinking that it was testosterone produced
by the ovary that was a major problem in PCOS, but our study did not find signals for
genes regulating testosterone," Dunaif said. "In contrast, we did find a signal for the FSH
gene, which is produced in the pituitary gland at the base of the brain. This suggests that
FSH, in either how it acts on the ovary or how it is secreted, is very important in the
development of PCOS. This is a new way of thinking about the biology of PCOS."
The Northwestern study included three phases with three different sets of DNA. All of the
DNA was from women of European ancestry, those with PCOS and those without it (the
controls).
"My lab focuses on big data analysis such as this genome-wide analysis to identify genetic
variants that are associated with increased disease risk," said M. Geoffrey Hayes, lead
author of the study, who specializes in statistical genetics at Feinberg. "We genotyped
nearly 700,000 markers across the genome in each of nearly 4,000 women from the U.S.
in the first phase. Then we replicated our findings with two additional groups of more than
4,500 women from the U.S. and Europe."
Altogether, nearly 9,000 individual DNA samples were used in the study with samples
provided from Northwestern Medicine and partner institutions around the world. The
NUgene Project, a genomic biobank sponsored by the Center for Genetic Medicine at
Northwestern in partnership with Northwestern Medicine, provided many of the control
samples.
"The next step in this research is to use the genetic variations in the genes that confer
increased risk for PCOS and build models to investigate how the biologic pathways
implicated are disrupted functionally," Hayes said. "One of the three gene regions
identified in our study in Europeans was also found in Chinese women, suggesting that
there may be some shared genetic susceptibility to PCOS in Europeans and Chinese, who
diverged evolutionarily more than 40,000 years ago. We plan to DNA sequence the
regions that we found in common with the Chinese."
The Northwestern scientists are partnering with Professor Zi-Jiang Chen and her
colleagues at Shandong University, PRC to advance this work.
More genome-wide association studies of PCOS are needed in other racial and ethnic
groups. Next the Northwestern scientists plan to investigate the genomes of women of
African ancestry with the disorder. These studies will provide insight into the shared
genetic basis for PCOS and will aid in the finding of the genes that are key players in the
development of the disease across ethnicities.
https://news.northwestern.edu/stories/2015/08/data-mining-dna-for-polycystic-ovary-syndrome-genes
Data mining can answer questions that cannot be addressed through simple
query and reporting techniques.
Automatic Discovery
Data mining is accomplished by building models. A model uses an algorithm to
act on a set of data. The notion of automatic discovery refers to the execution of
data mining models.
Data mining models can be used to mine the data on which they are built, but
most types of models are generalizable to new data. The process of applying a
model to new data is known as scoring.
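The build-then-score cycle described above can be sketched in a few lines of plain Python. This is a deliberately toy model: the data, the threshold rule, and the `score` function are all invented for illustration and do not reflect any particular data mining tool's API.

```python
from statistics import mean

# Historical cases the model is built on: (age, purchased?) pairs.
history = [(22, 0), (25, 0), (31, 0), (38, 1), (45, 1), (52, 1)]

# "Build" a trivial model: learn a decision threshold halfway between
# the mean age of buyers and the mean age of non-buyers.
buyers = [age for age, bought in history if bought == 1]
others = [age for age, bought in history if bought == 0]
threshold = (mean(buyers) + mean(others)) / 2

def score(age):
    """Apply the model to one new case, returning a purchase prediction."""
    return 1 if age >= threshold else 0

# Scoring: applying the built model to data it has never seen.
new_customers = [29, 41, 60]
predictions = [score(age) for age in new_customers]
print(predictions)  # [0, 1, 1]
```

A real model (a decision tree, a regression, a clustering) is built and scored the same way: one step fits parameters to historical cases, a separate step applies the fitted model to new records.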
a. Is it another hype?
Answer:
No. There are a number of use cases where big data technologies make perfect
sense and could result in a substantial return on investment, such as unknown or
frequently changing requirements and immense scalability at lower cost.
c. We have presented a view that data mining is the result of the evolution of
database technology. Do you think that data mining is also the result of
the evolution of machine learning research? Can you present such views
based on the historical progress of this discipline? Address the same for
the fields of statistics and pattern recognition.
Answer:
The Data Mining Process
Figure 1-1 illustrates the phases, and the iterative nature, of a data mining project.
The process flow shows that a data mining project does not stop when a particular
solution is deployed. The results of data mining trigger new business questions, which
in turn can be used to develop more focused models.
Problem Definition
This initial phase of a data mining project focuses on understanding the project
objectives and requirements. Once you have specified the project from a business
perspective, you can formulate it as a data mining problem and develop a preliminary
implementation plan.
For example, your business problem might be: "How can I sell more of my product
to customers?" You might translate this into a data mining problem such as: "Which
customers are most likely to purchase the product?" A model that predicts who is most
likely to purchase the product must be built on data that describes the customers who
have purchased the product in the past. Before building the model, you must assemble
the data that is likely to contain relationships between customers who have purchased
the product and customers who have not purchased the product. Customer attributes
might include age, number of children, years of residence, owners/renters, and so on.
Data Gathering and Preparation
The data understanding phase involves data collection and exploration. As you
take a closer look at the data, you can determine how well it addresses the business
problem. You might decide to remove some of the data or add additional data. This is also
the time to identify data quality problems and to scan for patterns in the data.
The data preparation phase covers all the tasks involved in creating the case table
you will use to build the model. Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table, case, and attribute selection
as well as data cleansing and transformation. For example, you might transform
a DATE_OF_BIRTH column to AGE; you might insert the average income in cases where
the INCOME column is null.
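The two transformations just mentioned can be sketched in plain Python. The rows, column names, and "as of" date are made up to mirror the hypothetical DATE_OF_BIRTH, AGE, and INCOME columns:

```python
from datetime import date
from statistics import mean

# Hypothetical case rows with a DATE_OF_BIRTH column and a
# sometimes-null INCOME column.
cases = [
    {"DATE_OF_BIRTH": date(1980, 5, 1),  "INCOME": 52000},
    {"DATE_OF_BIRTH": date(1990, 7, 15), "INCOME": None},
    {"DATE_OF_BIRTH": date(1975, 1, 30), "INCOME": 48000},
]

today = date(2024, 1, 1)  # fixed date so the example is reproducible

# Transform DATE_OF_BIRTH into AGE (subtract one year if the
# birthday has not yet occurred this year).
for row in cases:
    born = row.pop("DATE_OF_BIRTH")
    row["AGE"] = (today.year - born.year
                  - ((today.month, today.day) < (born.month, born.day)))

# Insert the average of the known incomes where INCOME is null.
average_income = mean(r["INCOME"] for r in cases if r["INCOME"] is not None)
for row in cases:
    if row["INCOME"] is None:
        row["INCOME"] = average_income

print(cases)
```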
Additionally, you might add new computed attributes in an effort to tease information
closer to the surface of the data. For example, rather than using the purchase amount,
you might create a new attribute: "Number of Times Purchase Amount Exceeds $500 in a
12-month time period." Customers who frequently make large purchases may also be
related to customers who respond or don't respond to an offer.
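A computed attribute like the one above might be derived as follows; the purchase log and the $500 threshold are invented for illustration:

```python
from collections import Counter

# Hypothetical purchase log from one 12-month window:
# (customer_id, amount) pairs.
purchases = [
    ("c1", 120.0), ("c1", 650.0), ("c1", 980.0),
    ("c2", 45.0),  ("c2", 300.0),
    ("c3", 510.0),
]

# Count, per customer, how many purchases exceeded $500.
big_buys = Counter(cid for cid, amount in purchases if amount > 500)

# Build the computed attribute; customers with no qualifying
# purchase get a count of 0.
feature = {cid: big_buys.get(cid, 0) for cid, _ in purchases}
print(feature)  # {'c1': 2, 'c2': 0, 'c3': 1}
```

The new column can then be joined onto the case table alongside the raw attributes.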
Thoughtful data preparation can significantly improve the information that can be
discovered through data mining.
Model Building and Evaluation
In this phase, you select and apply various modeling techniques and calibrate the
parameters to optimal values. If the algorithm requires data transformations, you will
need to step back to the previous phase to implement them.
In preliminary model building, it often makes sense to work with a reduced set of data
(fewer rows in the case table), since the final case table might contain thousands or
millions of cases.
At this stage of the project, it is time to evaluate how well the model satisfies the
originally stated business goal (phase 1).
If the model is supposed to predict customers who are likely to purchase a
product, does it sufficiently differentiate between the two classes?
Is there sufficient lift?
Are the trade-offs shown in the confusion matrix acceptable?
Would the model be improved by adding text data? Should transactional
data such as purchases (market-basket data) be included?
Should costs associated with false positives or false negatives be
incorporated into the model?
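The confusion-matrix trade-offs and lift referred to in these questions can be computed by hand. This sketch uses made-up predictions and outcomes, not results from any real model:

```python
# Hypothetical outcomes (1 = purchased) and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Confusion-matrix cells: these are the trade-offs to weigh, e.g.
# the cost of a wasted offer (fp) vs. a missed buyer (fn).
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

# Lift: how much more common buyers are among the cases the model
# flags than in the population as a whole.
base_rate = sum(actual) / len(actual)
flagged = [a for a, p in zip(actual, predicted) if p == 1]
lift = (sum(flagged) / len(flagged)) / base_rate

print(tp, fp, fn, tn, round(lift, 2))  # 3 1 1 5 1.88
```

A lift well above 1.0 means targeting the model's flagged customers beats targeting customers at random, which is one concrete way to answer "is there sufficient lift?"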
Knowledge Deployment
Knowledge deployment is the use of data mining within a target environment. In the
deployment phase, insight and actionable information can be derived from data.
Deployment can involve scoring (the application of models to new data), the extraction
of model details (for example the rules of a decision tree), or the integration of data mining
models within applications, data warehouse infrastructure, or query and reporting tools.
2. How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database:
A data warehouse is a repository of information collected from multiple
sources over a history of time, stored under a unified schema, and used for data
analysis and decision support; a database, by contrast, is a collection of interrelated
data that represents the current status of the stored data. There could be multiple
heterogeneous databases where the schema of one database may not agree with
the schema of another. A database system supports ad-hoc query and on-line
transaction processing.
Similarities between a data warehouse and a database:
Both are organized collections of interrelated data that support the storage,
retrieval, and querying of that data.
4. What are the major challenges of mining a huge amount of data (e.g.,
billions of tuples) in comparison with mining a small amount of data (e.g., a
data set of a few hundred tuples)?