Data and Knowledge Visualization in Knowledge Discovery Process

Visualization Support for a User-Centered KDD Process
TuBao Ho
Japan Advanced Institute of
Science and Technology
Tatsunokuchi, Ishikawa
923-1292 Japan
+81-761-51-1730
bao@jaist.ac.jp
TrongDung Nguyen
923-1292 Japan
+81-761-51-1732
nguyen@jaist.ac.jp
ABSTRACT
Viewing knowledge discovery as a user-centered process that
requires an effective collaboration between the user and the
discovery system, our work aims to support an active role of the
user in that process by developing synergistic visualization tools
integrated in our discovery system D2MS. These tools can
visualize the entire process of knowledge discovery in
order to help the user with data preprocessing, selecting mining
algorithms and parameters, evaluating and comparing discovered
models, and taking control of the whole discovery process. Our
case studies with two medical datasets on meningitis and stomach
cancer show that, with visualization tools in D2MS, the user gains
better insight into each step of the knowledge discovery process as
well as the relationship between data and discovered knowledge.
analysis [2] and a high potential in knowledge discovery in

databases [3], [9]. In this paper we introduce the knowledge
discovery system D2MS (Data Mining with Model Selection) that
has two main contributions to visual knowledge discovery. First
are its efficient visualizers for large multidimensional databases,
discovered rules, and hierarchical structures, as well as a synergistic
visualization of data and knowledge. In particular, the novel
visualization technique T2.5D (Trees 2.5 Dimensions) for large
hierarchical structures can be seen as an alternative to powerful
techniques for representing large hierarchical structures such as
cone trees [14] or hyperbolic trees [8]. Second is its tight
integration of the visualizers with functions in each step of the
knowledge discovery process for supporting the model selection
purpose.
2. MODEL SELECTION IN D2MS
Keywords
model selection, knowledge discovery process,
knowledge visualization, the user's active role.
DungDuc Nguyen
923-1292 Japan
+81-761-51-1732
dungduc@jaist.ac.jp
data and
1. INTRODUCTION
The process of knowledge discovery in databases (KDD) can be
viewed as inherently consisting of five steps: (1) understanding the
application domain, (2) data preprocessing, (3) data mining, (4)
post-processing, and (5) applying discovered knowledge, where
each step requires many decisions to be made by the user [10].
To find implicit but potentially useful patterns/models in large
databases, one cannot expect simply to push a large amount of data
into a KDD system without the user's participation. In other
words, the KDD process can alternatively be viewed as a process
of model selection, i.e., one in which the user chooses the most
interesting discovered patterns/models, or the algorithms and their
settings for obtaining such patterns/models, in a given application.
Model selection in KDD is a complicated human-centered and
domain-centered process in which the participation of the user
plays a key role in its success.
Figure 1 shows a conceptual architecture of D2MS, where the Data
Mining component currently includes a decision tree learning subsystem
(CABRO [11]) and a rule learning subsystem (LUPC [6]).
2.1 User-centered model selection

The interestingness of discovered patterns/models is commonly
characterized by several criteria: evidence indicates the
significance of a finding measured by a statistical criterion;
redundancy amounts to the similarity of a finding with respect to
other findings and measures to what degree a finding follows
from another one; usefulness relates a finding to the goal of the
users; novelty includes the deviation from prior knowledge of the
user or system; simplicity refers to the syntactical complexity of
the presentation of a finding; and generality is determined by the
fraction of the population a finding refers to. Interestingness
can be seen as a function of the above criteria, and it strongly
depends on the user as well as his/her domain knowledge.
Visualization has proven its effectiveness in exploratory data
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGKDD '02, July 23-26, 2002, Edmonton, Alberta, Canada.
Copyright 2002 ACM 1-58113-567-X/02/0007...$5.00.
Figure 1: Conceptual architecture of the system
The key idea of our solution to model selection in D2MS is to
support an effective participation of the user in this process.
Concretely, D2MS first supports the user in doing trials on
combinations of algorithms and their parameter settings in order
to produce competing models, and then it supports the user in
evaluating them quantitatively and qualitatively by providing both
performance metric values and visualizations of these
models (Figure 2).
2.2 Plan and plan manager

Model selection in D2MS mainly involves the three steps of
data preprocessing, data mining, and post-processing, as shown in
Figure 1. There are three phases in doing model selection in
D2MS, all managed by the plan management module: (i)
registering plans of selected algorithms and their settings; (ii)
executing the plans to discover models; (iii) selecting appropriate
models by a comparative evaluation of competing models. These
phases can be carried out interactively in all three steps of the KDD
process.
The first phase is to register plans. A plan is an ordered list of
algorithms associated with their parameter settings that can yield
a model or an intermediate result when executed. The plans
are represented in a tree form called a plan tree, whose nodes are
selected algorithms associated with their settings (the top-left
window in Figure 3). The nodes on a path of the plan tree must
follow the order of preprocessing, data mining, and
postprocessing. A plan may contain several algorithms (nodes) of
the preprocessing and postprocessing steps, for example filling
missing values by natural cluster-based mean-and-mode and then
discretizing continuous attributes by an entropy-based algorithm in
preprocessing. A plan can be edited in D2MS during the KDD
process.
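The plan tree described above can be sketched as a small data structure. This is only an illustrative sketch, not the D2MS implementation: the class name PlanNode, the helper functions, and the parameter names in the example are all assumptions introduced here. Each root-to-leaf path is treated as one plan, and a plan is accepted only if its steps follow the order preprocessing, data mining, postprocessing.

```python
from dataclasses import dataclass, field

# Step names in the order required along any path of the plan tree.
STEP_ORDER = {"preprocessing": 0, "data_mining": 1, "postprocessing": 2}


@dataclass
class PlanNode:
    """One algorithm with its parameter settings (hypothetical structure)."""
    algorithm: str
    step: str                                   # a key of STEP_ORDER
    params: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, child: "PlanNode") -> "PlanNode":
        self.children.append(child)
        return child


def plans(root):
    """Enumerate every root-to-leaf path; each path is one plan."""
    if not root.children:
        return [[root]]
    return [[root] + tail for child in root.children for tail in plans(child)]


def is_valid_plan(plan):
    """A plan is valid when its steps never go backwards in STEP_ORDER."""
    ranks = [STEP_ORDER[node.step] for node in plan]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


# Two preprocessing nodes followed by two competing mining algorithms.
# The parameter names and values are made up for illustration.
root = PlanNode("fill_missing_values", "preprocessing", {"method": "mean-and-mode"})
disc = root.add(PlanNode("discretize", "preprocessing", {"method": "entropy-based"}))
disc.add(PlanNode("CABRO", "data_mining", {"measure": "R-measure"}))
disc.add(PlanNode("LUPC", "data_mining", {"min_accuracy": 0.9}))

all_plans = plans(root)
```

Sharing the preprocessing prefix between the two mining nodes is what makes the tree form convenient: both plans reuse the same intermediate result instead of repeating the preprocessing.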
The second phase is to execute registered plans. While gradually
registering a plan, the user can run an algorithm just after adding it
to the plan, then evaluate its results before deciding whether to
continue this plan from its current stage with other algorithms, or
to backtrack and try the plan with another algorithm or setting.
The user can also run a plan after fully registering it, or even
register a number of plans and then run them all together. The
intermediate results, the discovered models, and their summaries
and exported forms are automatically created and stored in the
model base.
activated to show the user the corresponding data or model.
Otherwise, if the user double-clicks on an algorithm node, a dialog
box will be activated for the user to enter or change the
parameters of that algorithm. Therefore, the plan visualizer can
serve as a "hub" for the user to activate other visualizers easily.
Another function of the plan visualizer is to allow the user to follow
the discovery process. While a plan tree is running, the
finished part of the plan tree is shown in gray and the
currently running node blinks. The user can then easily
suspend, continue, or run the plan tree step by step.
3. DATA AND MODEL VISUALIZATION

3.1 Data Visualization
We have chosen the parallel coordinates technique for visualizing
2D tabular datasets defined by n rows and p columns. D2MS
improves parallel coordinates in several ways to adapt
3.1.1 Viewing original data

The basic idea of viewing a p-dimensional dataset by parallel
coordinates is to use p equally spaced axes, each parallel to
one of the screen axes and corresponding to one attribute, with the
ends of each axis corresponding to the minimum and maximum values of
that dimension. Each data instance is represented as a polyline that
crosses each axis at a position proportional to its value for that
dimension. This view gives the user a rough idea of the
distribution of the data over the values of each attribute; in particular, the
colors of different classes can in many cases show clearly how
classes differ from each other. An example of the stomach
cancer data is visualized in the bottom-left window in Figure 4,
where the dataset is shown in the top-left window.
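The mapping from data values to axis positions can be sketched as follows. This is a minimal illustration of the general parallel coordinates idea for numeric attributes, not code from D2MS; the function name axis_positions and the sample rows are assumptions, and placing a constant attribute at mid-axis is a choice made here.

```python
def axis_positions(rows):
    """Map each row of numeric values to positions in [0, 1], one per axis.

    Position 0.0 is the axis end for the attribute's minimum value and
    1.0 is the end for its maximum value.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for row in rows:
        line = []
        for value, lo, hi in zip(row, lows, highs):
            # A constant attribute has no spread; place it mid-axis.
            line.append(0.5 if hi == lo else (value - lo) / (hi - lo))
        polylines.append(line)
    return polylines


# Three made-up instances with p = 3 attributes.
rows = [(10, 1.0, 200), (20, 3.0, 100), (30, 2.0, 100)]
polylines = axis_positions(rows)
```

Each inner list gives the heights at which one instance's polyline crosses the p axes; drawing then only needs the x coordinate of each axis.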
3.1.2 Summarizing data

This view is important because the dataset may be very large. The key
idea is not to view original data points but to view their
summaries on parallel attributes. Like WinViz [9], D2MS uses bar
charts in the place of attribute values on each axis. The bar charts
on each axis have the same height (depending on the number of
possible attribute values) and different widths that signify the
frequencies of attribute values. D2MS also interactively provides
common statistics on each attribute, such as mean or mode, median,
variance, box plots, etc. The top-right window in Figure 4 shows
the summaries of the stomach cancer data.
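A possible way to compute the bar widths for one axis is sketched below. The function name bar_widths and its max_width parameter are assumptions; the only property taken from the text is that widths are proportional to the frequencies of the attribute values.

```python
from collections import Counter


def bar_widths(values, max_width=1.0):
    """Return {attribute value: bar width}, with width proportional to frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: max_width * n / total for value, n in counts.items()}


# A made-up categorical attribute with four instances.
sex = ["M", "F", "M", "M"]
widths = bar_widths(sex)
```

Because only one counter per attribute is kept, the summary costs the same to draw for a thousand instances as for a million.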
The third phase is the selection of appropriate models by the user. D2MS
provides a summary table presenting performance metrics of
discovered models according to executed plans (the bottom-right
window in Figure 3). However, the user can also evaluate each model
in depth by visualizing it, browsing its structure, checking its
relationship with the dataset, etc. (the top-middle and top-right
windows in Figure 3). The user can also visualize several models
simultaneously to compare them. By gaining insight into
competing models, the user can make a better selection
of models.
2.3 Process visualization

The plan manager allows the user to create, edit, and manage
plans with the help of the plan visualizer. In Figure 3, the plan
visualizer displays a plan tree in the top-left window. Each node
of the plan tree represents a
dataset, a model, or an algorithm. If the user double-clicks on a
data or model node, the data or model visualization will be
Figure 2: An illustration of user-centered model selection
number of instances of the class covered by the rule over the total
number of instances in the class. This view gives a first
observation of the rule quality.
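The per-class ratio described here (instances of a class covered by the rule, divided by the total number of instances in that class) can be computed as in the sketch below. The function names, the dictionary representation of rules and instances, and the toy data are assumptions introduced for illustration and are not taken from the stomach cancer dataset.

```python
def matches(rule, instance):
    """True when the instance has every attribute-value pair of the antecedent."""
    return all(instance.get(attr) == value for attr, value in rule.items())


def class_ratios(rule, instances, class_attr="class"):
    """For each class: instances of that class covered by the rule / class size."""
    sizes, covered = {}, {}
    for inst in instances:
        label = inst[class_attr]
        sizes[label] = sizes.get(label, 0) + 1
        if matches(rule, inst):
            covered[label] = covered.get(label, 0) + 1
    return {label: covered.get(label, 0) / n for label, n in sizes.items()}


# Made-up instances with hypothetical attribute names.
instances = [
    {"sex": "F", "stage": 3, "class": "dead"},
    {"sex": "F", "stage": 3, "class": "dead"},
    {"sex": "M", "stage": 3, "class": "dead"},
    {"sex": "F", "stage": 3, "class": "alive"},
    {"sex": "F", "stage": 1, "class": "alive"},
]
rule = {"sex": "F", "stage": 3}
ratios = class_ratios(rule, instances)
```

A rule that predicts one class well shows a high ratio for that class and low ratios for the others, which is the first impression of rule quality this view is meant to give.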
3.2.2 Viewing rules and data
The subset of instances covered by a rule is visualized together
with the rule by parallel coordinates or by summaries on parallel
coordinates. From this subset of instances, the user can see the set
of rules that each cover some of these instances, or the user
can smoothly change the values of an attribute in the rule to see
other related possible rules. These operations help
the user evaluate the quality of the rule: a rule is good if the
instances covered by it are not recognized by other rules, and
vice versa. The rules for a class can be displayed together, along with
the instances of that class and of other classes covered by these
rules.
Figure 3: A screen shot of D2MS in selecting models learned
from Wisconsin breast cancer data. The top-left window
shows the plan tree; the top-middle window shows a tightly-coupled
view of a decision tree learned by CABRO; the top-right
window shows a rule set learned by LUPC; the bottom-left
window displays intermediate computation results;
and the bottom-right window shows the summary table of
performance metrics of discovered models.
3.1.3 Querying data

This view supports hypothesis generation and hypothesis testing
by the user. It allows the user to view subsets of the dataset
determined by queries. There are three types of queries: (i) based
on a value of the class attribute, where the query determines the
subset of all instances belonging to the indicated class; (ii) based
on a value of a descriptive attribute, where the query determines
the subset of all instances having this value; (iii) based on a
conjunction of attribute-value pairs, where the query determines
the subset of all instances satisfying this conjunction. The queries
can be specified by just using point-and-click. The subset of
instances matching the query is visualized in viewing data mode
and in summarizing data mode. The gray regions on each axis
show the proportions of the specified instances on the values of this
attribute (as shown in the bottom-right window in Figure 4).
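All three query types reduce to filtering by a conjunction of attribute-value pairs, since types (i) and (ii) are conjunctions with a single pair. The sketch below illustrates this; the function names and the four sample rows are invented for illustration and are not records from the stomach cancer dataset. Reading the gray regions as "matching instances divided by all instances having that value" is an assumption about how the proportions are defined.

```python
from collections import Counter


def query(instances, conditions):
    """Return the instances satisfying every attribute-value pair in `conditions`."""
    return [inst for inst in instances
            if all(inst.get(attr) == value for attr, value in conditions.items())]


def gray_proportions(subset, instances, attribute):
    """For each value of `attribute`: matching instances / all instances with it."""
    total = Counter(inst[attribute] for inst in instances)
    selected = Counter(inst[attribute] for inst in subset)
    return {value: selected[value] / n for value, n in total.items()}


# Invented rows; only the attribute names echo those used in the paper.
patients = [
    {"liver_metastasis": 3, "serosal_invasion": 3, "class": "death within 90 days"},
    {"liver_metastasis": 3, "serosal_invasion": 1, "class": "alive"},
    {"liver_metastasis": 0, "serosal_invasion": 3, "class": "death within 90 days"},
    {"liver_metastasis": 0, "serosal_invasion": 0, "class": "alive"},
]

by_class = query(patients, {"class": "alive"})                  # type (i)
by_value = query(patients, {"liver_metastasis": 3})             # type (ii)
by_conjunction = query(patients, {"liver_metastasis": 3,
                                  "serosal_invasion": 3})       # type (iii)
gray = gray_proportions(by_conjunction, patients, "class")
```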
3.3 Tree Visualization

D2MS provides several visualization techniques that allow the
user to visualize effectively large hierarchical structures.
3.3.1 Different modes of viewing hierarchical structures
The D2MS tree visualizer provides multiple views of trees or
hierarchical structures (Figure 5).
Tightly-coupled views: The global view shows the tree
structure with nodes of the same small size and without labels;
it can therefore display a tree fully or a large part of it,
depending on the tree size. The detailed view shows the tree
structure and nodes with their labels, together with
operations to display node information. The global view is
associated with a field-of-view or panner (a wire-frame box)
that corresponds to the detailed view [7].
Customizing views: Initially, according to the user's choice,
the tree is either displayed fully or with only the root node
and its direct sub-nodes. The tree can then be collapsed or
expanded partially or fully from the root or from any
intermediate node. Any subtree rooted at an
uncollapsed node can be collapsed into one node. Thus, the user is
able to interactively customize views of the tree to meet
his/her needs and interests. Also, the user is provided with a
focus view on one class and its relation to other classes in the
whole hierarchical structure, shown in different colors.
Tiny mode with fish-eye view: Note that no current
visualization technique allows us to display efficiently the
entire tree when it has, say, ten thousand nodes. The
tightly-coupled views are extended with three viewing
modes according to the user's choice: normal size, small size
and tiny size. Fish-eye is an interesting variant of the classic
overview-detail browser, proposed in [4]. This view distorts
the magnified image so that the center of interest is displayed
at high magnification, and the rest of the image is
progressively compressed. In the D2MS tree visualizer, we
define three fish-eye components as follows: (i) Focal point
f: some node of current interest in the tree; (ii) Distance from
focal point f to a node x: D(f, x) = d(f, x), where d(x, y)
between two points x and y on the tree is the number of links
intervening on the path connecting them in the tree; (iii)
Level of detail, importance, resolution: LOD(x) = -d(r, x),
where r is the root of the tree.
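The three components can be computed on a tree stored as a child-to-parent map, as in the sketch below. The text does not state how the components are combined; subtracting the distance from the level of detail follows the general degree-of-interest formulation of fish-eye views in [4], and whether D2MS combines them in exactly this way is an assumption. All function names and the example tree are introduced here for illustration.

```python
def path_to_root(parent, x):
    """Nodes from x up to the root, following the child -> parent map."""
    path = [x]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def d(parent, x, y):
    """Number of links on the tree path connecting x and y."""
    up_x = path_to_root(parent, x)
    steps_from_x = {node: i for i, node in enumerate(up_x)}
    for j, node in enumerate(path_to_root(parent, y)):
        if node in steps_from_x:            # lowest common ancestor
            return steps_from_x[node] + j
    raise ValueError("nodes are not in the same tree")


def degree_of_interest(parent, root, focus, x):
    """LOD(x) - D(f, x); the combination rule is an assumption (see lead-in)."""
    lod = -d(parent, root, x)               # LOD(x) = -d(r, x)
    distance = d(parent, focus, x)          # D(f, x) = d(f, x)
    return lod - distance


# Example tree: r has children a and b; a has children c and d.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
doi = {x: degree_of_interest(parent, "r", "c", x) for x in parent}
```

With focus c, the nodes r, a, and c on the path from the root to the focus all get the same value, while d and b score lower, so a display threshold keeps the focus together with its ancestors and drops distant detail first.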
3.2 Rule Visualization

A rule is a pattern related to several attribute-values and a subset
of instances. What matters in visualizing a rule is how this local
structure is viewed in its relation to the whole dataset, and how
the view supports the user's evaluation of the rule's interestingness.
D2MS's rule visualizer allows the user to visualize rules in the
form antecedent → consequent, where antecedent is a conjunction
of attribute-value pairs, and consequent is a conjunction of
attribute-value pairs in the case of association rules and a value of the class
attribute in the case of prediction rules. A rule is simply displayed by
the subset of parallel coordinates included in antecedent and
consequent. D2MS's rule visualizer has the following
functions:
3.2.1 Viewing rules
Each rule is displayed by a polyline that goes through the axes
containing the attribute-values occurring in the antecedent part of the
rule, leading to the consequent part of the rule, which is displayed
in a different color. In the case of prediction rules, the ratio
associated with each class in the class attribute corresponds to the
3.3.2 Trees 2.5 Dimensions

T2.5D is inspired by the work of Reingold and Tilford [13], which
draws tidy trees in reasonable time and storage. Unlike the
tightly-coupled and fish-eye views, which can be seen as location-based
views (views of objects in a region), T2.5D can be seen as a
relation-based view (a view of related objects). The starting point of
T2.5D is the observation that a large tree consists of many subtrees
that usually need not be viewed simultaneously. The
key idea of T2.5D is to represent a large tree in a virtual 3D space
(subtrees are overlapped to reduce occupied space) while each
subtree of interest is displayed in a 2D space. To this end, T2.5D
determines the fixed position of each subtree (its root node) on
two axes X and Y and, in addition, dynamically computes a
Z-order for this subtree on an imaginary axis Z. A subtree with a
given Z-order is displayed "above" those of its siblings that have higher
Z-orders.
When visualizing and navigating a tree, at each moment the
Z-order of all nodes on the path from the root to a node in focus in
the tree is set to zero by T2.5D. The active wide path to a node in
focus, which contains all nodes on the path from the root to this
node in focus and their siblings, is displayed at the front of the
screen with highlighted colors to give the user a clear view. Other
parts of the tree remain in the background to provide an image of
the overall structure. With Z-order, T2.5D can give the user the
impression that trees are drawn in a 3D space. The user can easily
change the active wide path by choosing another node in focus
[12].
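The active wide path can be derived from parent and child maps, as in the following sketch. The function names and the example tree are assumptions. The text only specifies that nodes on the root-to-focus path get Z-order zero; the value 1 used for every other node is a placeholder, since the way T2.5D computes the remaining Z-orders is not described here.

```python
def active_wide_path(parent, children, focus):
    """Return (root-to-focus path, set of path nodes plus their siblings)."""
    path = []
    node = focus
    while node is not None:
        path.append(node)
        node = parent[node]
    wide = set(path)
    for node in path:
        if parent[node] is not None:
            wide.update(children[parent[node]])   # add the siblings of node
    return path[::-1], wide


def z_orders(parent, children, focus):
    """Z-order 0 on the root-to-focus path, placeholder 1 everywhere else."""
    path, _ = active_wide_path(parent, children, focus)
    on_path = set(path)
    return {node: 0 if node in on_path else 1 for node in parent}


# Example tree: r -> a, b; a -> c, d; b -> e.
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
children = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}

path, wide = active_wide_path(parent, children, "c")
z = z_orders(parent, children, "c")
```

Choosing another focus only requires recomputing this path and the Z-orders, while the X and Y positions of the subtrees stay fixed.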
We have experimented with T2.5D on various real and artificial
datasets. It has been verified that T2.5D can handle trees
with more than 20,000 nodes well, and that more than 1,000 nodes can be
displayed together on the screen [12]. Figure 7 illustrates a pruned
tree of 1795 nodes learned from stomach cancer data and drawn
by T2.5D (note that the original screen with colors gives a better
view than this black-and-white screen).
4. VISUALIZATION IN THE KDD PROCESS
Figure 2 shows that, in D2MS, in order to take control of the
KDD process the user needs visualization support to decide what
task to do next and which algorithms and parameters are right for
that task. For example, after examining the data with the data
visualizers, the user can decide whether the data require discretization
and, if they do, what kind of discretization algorithm is
suitable for that data. In this section we describe in detail how
the user uses these visualization tools to manage the KDD
process.
4.1 The whole process visualization with plan visualization
The framework of D2MS allows many algorithms and visualizers
to work together in an integrated environment. This gives the
user great flexibility in combining these algorithms and
visualizers in order to achieve a better result; however, it may
also make tasks more complicated. With many
algorithms involved and many views activated to display
different data or models, it is hard to remember and understand
the relationships among them.
To solve that problem, the plan visualizer is designed as a "hub" to
control these algorithms and visualizers. When the user
double-clicks on a node in a plan tree, if that node describes an applied
algorithm, the selected set of parameters will be displayed in a
parameter dialog box. If the node describes a dataset or a model,
the corresponding visualizer will be activated to provide the user with
views of that data or model. By following the relationships among
nodes in the plan tree, the user can easily track the relationships
among activated views and applied algorithms.
4.2 Data visualization in the KDD process

With the above three views of data, D2MS integrates data
visualization into different KDD steps by displaying and
interactively changing these views of data at any time. In the first
step of collecting data and formulating the problem, the user can,
and often needs to, view the original dataset and its summarization.
The visual analysis of collected data may help the user to identify
important or redundant attributes or new attributes to be added.
Data visualization has proven to be significant in the data
preprocessing step, which consists of functions for data cleaning,
integration, transformation and reduction. For example, many
discretization algorithms provide alternative solutions for dividing a
numerical attribute into intervals, and the visual data query on the
discretized attribute and the class attribute can give insights for
the decision. Data visualization is also very significant in the data
mining step with data query mode, and particularly in the
evaluation step in its synergistic combination with rule and tree
visualization.
4.3 Model Visualization in the KDD process

Figure 4: Rule visualization in D2MS, top-left window shows
the list of discovered rules, the middle-left and the top-right
windows show a rule under inspection, and bottom window
displays the instances covered by that rule.
A model that can be understood is a model that can be trusted. A
data mining algorithm that uses a human-understandable model
can be checked easily by domain experts, providing much needed
semantic validity to the model. To that end, model visualization
provides much help to the user.
There are several ways to support the user in evaluating the
quality of a rule together with other measures such as the coverage
and accuracy of the rule. For example, if two rules predicting a
target class have the same support and confidence, the one that
wrongly covers more instances belonging to classes other
than the target class would be considered worse.
Figure 4 illustrates rule visualization in D2MS, where the top-left
and bottom-left windows display a discovered rule, and the
top-right and bottom-right windows show the instances covered by
that rule.
Three main criteria for selecting hierarchical models are their
size, accuracy and understandability. Tree size and accuracy
can be evaluated quantitatively, and of the two, accuracy is
widely considered to be of great importance. The
understandability of trees is difficult to quantify or measure,
and the idea here is to use the tree visualizer to support the
users' understanding.
In the current version of CABRO in D2MS, the user can generate
new models, each composed of an attribute selection measure
chosen from the gain-ratio, the gini-index [1], χ² and R-measure;
a pruning technique chosen from error-complexity, reduced-error and
pessimistic error [1]; and a discretization technique chosen from the
entropy-based and error-based techniques.
For each model candidate, the D2MS tree visualizer graphically
displays the corresponding pruned tree, its size, and its
prediction error rate. It offers the user multiple views of these
trials and helps the user compare the results of trials in order to
make his/her final selection of techniques/models of interest.
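The space of model candidates described above is the Cartesian product of the three choices, which gives 4 x 3 x 2 = 24 configurations. The sketch below enumerates them and orders trial results by error rate and tree size. The dictionary keys, the rank function, and the tie-breaking rule are assumptions, and the tree sizes and error rates are placeholder values rather than measurements from CABRO.

```python
from itertools import product

measures = ["gain-ratio", "gini-index", "chi-square", "R-measure"]
pruning = ["error-complexity", "reduced-error", "pessimistic-error"]
discretization = ["entropy-based", "error-based"]

# One candidate per combination of measure, pruning and discretization.
candidates = [
    {"measure": m, "pruning": p, "discretization": disc}
    for m, p, disc in product(measures, pruning, discretization)
]


def rank(results):
    """Sort trial results by error rate, breaking ties by smaller tree size."""
    return sorted(results, key=lambda r: (r["error_rate"], r["tree_size"]))


# Placeholder results for three trials (not real measurements).
results = [
    {"config": candidates[0], "tree_size": 120, "error_rate": 0.21},
    {"config": candidates[1], "tree_size": 80, "error_rate": 0.18},
    {"config": candidates[2], "tree_size": 60, "error_rate": 0.18},
]
best = rank(results)[0]
```

Such a ranking covers only the quantitative criteria; understandability still has to be judged by the user through the tree visualizer.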
D2MS tree visualizer is used not only in inducing decision trees
but also in classifying unknown objects. It plays the role of the
interface for visual explanation of the matching process, in a way
similar to the explanation in knowledge-based systems. D2MS
tree visualizer supports three modes of matching an unknown
object according to the way that the unknown object is declared.
The whole record of the unknown object is read from a
database: D2MS directly shows the leaf node that matches the
object. The path from the tree root to that leaf node is
highlighted. Information accumulated along the path can be
viewed at any node.
Values of attributes are given by the user when answering
the system's questions: Questions about attributes are
asked according to the hierarchical structure in a top-down
manner from the root. From a menu, the user chooses one
value in the list of discrete values of the attribute or enters a
numerical value in the case of a continuous attribute. Questions
are asked dynamically according to the stepwise refinement
of the matching process.
The user declares values of attributes he/she knows: The user
is able to select attributes that he/she wishes to query on.
These attributes can be selected from the attribute list with
corresponding values. Once the attribute-value pairs are
entered, the tree visualizer will limit the regions of the tree
to those that partially satisfy the data. The system will then ask
additional questions to complete the match.
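The first and third matching modes can be illustrated on a small tree stored as nested dictionaries. This is a sketch under stated assumptions, not the D2MS matching code: the tree is hand-made with generic attributes (it is not a tree learned by CABRO), only discrete attributes are handled, and the function names are introduced here.

```python
# Internal nodes test an attribute; leaves carry a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "windy",
                  "branches": {False: {"label": "play"}, True: {"label": "stay"}}},
        "rainy": {"label": "stay"},
    },
}


def match_record(node, record):
    """Mode 1: follow a complete record from the root; return (path, label)."""
    path = []
    while "label" not in node:
        value = record[node["attribute"]]
        path.append((node["attribute"], value))
        node = node["branches"][value]
    return path, node["label"]


def candidate_leaves(node, known, path=()):
    """Mode 3: return (path, label) for every leaf consistent with known values."""
    if "label" in node:
        return [(list(path), node["label"])]
    attr = node["attribute"]
    leaves = []
    for value, child in node["branches"].items():
        if attr in known and known[attr] != value:
            continue                      # branch contradicts a declared value
        leaves += candidate_leaves(child, known, path + ((attr, value),))
    return leaves


full_path, label = match_record(tree, {"outlook": "sunny", "windy": True})
partial = candidate_leaves(tree, {"windy": False})
```

In the second mode the system would ask for the value of each tested attribute along the way; here the undeclared attributes simply leave several candidate leaves open, which corresponds to the regions the visualizer keeps highlighted.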
5. A CASE STUDY
This section illustrates the utility of synergistic visualization of
data and knowledge of D2MS in extracting knowledge from a
stomach cancer dataset.
Figure 5: Multiple views of generated trees in D2MS. The top
window shows the T2.5D view, while the bottom window
shows the tightly-coupled views of the generated decision tree
from stomach cancer data.
5.1 The stomach cancer dataset

The stomach cancer dataset, collected at the National Cancer
Center in Tokyo during 1962-1991, is a very valuable resource for
research. It contains data on 7,520 patients described
originally by 83 numeric and categorical attributes, such as location,
combined resection, pre-operative complication, post-operative
complication, etc. The problem is to find predictive and
descriptive rules for the class of patients who died within 90 days
after operation, among a total of five classes: "death within 90 days",
"death after 90 days", "death after 5 years", "alive", and "unknown".
Several well-known data mining systems have been applied to
this task. However, the obtained results were far from
expectations: the rules have low support and confidence, and usually
relate to only a small percentage of patients of the target class.
5.2 Mining rules with visual LUPC

The D2MS visualization tools associated with LUPC allow us to
examine the data and to gain better insight into complex data
before learning. While the viewing mode of original data offers
an intuition about the distribution of individual attributes and
instances, the summarizing and querying modes can suggest
irregular or rare events to be investigated, or guide the choice of
biases that could be used to narrow the huge search space. It is
commonly known that patients who have the symptom
"liver metastasis" at any of levels 1, 2, or 3 will certainly not
survive. Also, "serosal invasion = 3" is a typical symptom of the
class "death within 90 days." With the visualization tools, we
found several unusual events. For example, among 2329 patients
in the class "alive", 5 have heavy metastasis of level 3,
and 1 and 8 have metastasis of level 2 and 1, respectively.
Moreover, querying the data allows us to verify some significant
combinations of symptoms, such as "liver metastasis = 3" and
"serosal invasion = 3", as shown in Figure 6.
It is commonly known that patients cannot survive when liver
metastasis occurs aggressively. Learning methods, when applied to
this dataset, often yield rules for the class "death within 90 days"
containing "liver metastasis" that are considered acceptable but
Figure 6: Visualization of data suggested rare events to be
investigated.
not useful by domain experts. Also, these discovered rules usually
cover only a subset of patients of this class. This low coverage
means that there are patients of the class who are not covered by
the "liver metastasis" rules and are, therefore, difficult to detect.
Using visual interactive LUPC, we ran different trials and
specified parameters and constraints to find only rules that do not
contain the attribute "liver_metastasis" and/or its combination
with two other typical attributes, "Peritoneal_metastasis" and
"Serosal_invasion." Below is a rule with accuracy 100%
discovered by LUPC that can be seen as a rare and irregular event
in the target class.
Rule 8
[accuracy = 1.0 (4/4), cover = 0.001 (4/6712)]
IF
category = R AND sex = F AND proximal_third = 3
THEN
The prediction of rare events is becoming particularly interesting.
When supposing that some attribute-value pairs may characterize
some rare and/or significant events, LUPC, thanks to its
associated visualization tools, allows us to examine effectively the
hypothesis space and identify rare rules with any given small
support or confidence. An example is to find rules in the class
"alive" that contain the symptom "liver_metastasis." Such events
are certainly rare and influence human decision making. We
found rare events in the class "alive"; for example, male patients
with "liver metastasis" at the serious level 3 can survive, with an
accuracy of 50%.
Rule i
[accuracy = 0.500 (2/4); cover = 0.001 (4/6712)]
IF
sex = M AND type = B1 AND liver_metastasis = 3
AND middle_third = 1
THEN
class = alive
6. CONCLUSION
We have presented the knowledge discovery system D2MS with
support for model selection integrated with visualization. We
emphasize the crucial role of the user's participation in the model
selection process of knowledge discovery and have developed
data, rule and tree visualizers in D2MS to support such
participation. Our basic idea is to use the right visualization techniques
in the right places, and to integrate visualization into the
steps of the knowledge discovery process. D2MS with its
visualization support has been used and has shown advantages in
extracting knowledge from a real-world application on stomach
cancer data.
AND middle_third = 1
class = death within 90 days
REFERENCES
[1] Breiman, L., Friedman, J., Olshen, R., and Stone, C.,
Classification and Regression Trees, Belmont, CA:
Wadsworth, 1984.
[2] Card, S.K., Mackinlay, J.D., and Shneiderman, B., Readings in
Information Visualization, Morgan Kaufmann, 1999.
[3] Fayyad, U.M., Grinstein, G.G., and Wierse, A., Information
Visualization in Data Mining and Knowledge Discovery,
Morgan Kaufmann, 2002.
[4] Furnas, G.W., "The FISHEYE View: A New Look at
Structured Files", Bell Laboratories Technical
Memorandum, #81-11221-9, 1981.
[5] Han, J. and Cercone, N., "RuleViz: A Model for Visualizing
Knowledge Discovery Process", Sixth Inter. Conf. on
Knowledge Discovery and Data Mining, 2000, pp. 244-253.
[6] Ho, T.B., Nguyen, D.D., and Kawasaki, S., "Mining
Prediction Rules from Minority Classes", Inter. Workshop
Rule-Based Data Mining, Tokyo, 2001, pp. 254-264.
[7] Kumar, H.P., Plaisant, C., and Shneiderman, B., "Browsing
Hierarchical Data with Multi-Level Dynamic Queries and
Pruning", Inter. Journal of Human-Computer Studies, 46(1),
pp. 103-124, 1997.
[8] Lamping, J. and Rao, R., "The Hyperbolic Browser: A Focus
+ Context Techniques for Visualizing Large Hierarchies",
Journal of Visual Languages and Computing, 7(1), pp. 33-55, 1997.
[9] Lee, H.Y., Ong, H.L., and Quek, L.H., "Exploiting
Visualization in Knowledge Discovery", First Inter. Conf. on
Knowledge Discovery and Data Mining, 1995, pp. 198-203.
[10] Mannila, H., "Methods and Problems in Data Mining", Inter.
Conf. on Database Theory, Springer, 1997, pp. 41-55.
[11] Nguyen, T.D. and Ho, T.B., "An Interactive Graphic System
for Decision Tree Induction", Journal of Japanese Society
for Artificial Intelligence, Vol. 14, N. 1, pp. 131-138, 1999.
[12] Nguyen, T.D., Ho, T.B., and Shimodaira, H., "A
Visualization Tool for Interactive Learning of Large
Decision Trees", Twelfth IEEE Inter. Conf. on Tools with
Artificial Intelligence, 2000, pp. 28-35.
[13] Reingold, E.M. and Tilford, J.S., "Tidier Drawings of
Trees", IEEE Transactions on Software Engineering, Vol.
SE-7, No. 2, pp. 223-228, 1981.
[14] Robertson, G.G., Mackinlay, J.D., and Card, S.K.,
"Cone Trees: Animated 3D Visualization of
Hierarchical Information", ACM Conf. on Human
Factors in Computing Systems, 1991, pp. 189-194.
