CONTENT MINING
Avinash N. Bhute
Research Scholar, VJTI, Mumbai, Maharashtra State, India
email: anbhute@gmail.com
Harsha Bhute
Pursuing ME (Computer Network), Sinhgad College of Engineering, Pune, Maharashtra State, India
email: harshaavi@gmail.com
Dr. B. B. Meshram
Professor & Head, VJTI, Mumbai, Maharashtra State, India
email: bandumeshram@yahoo.co.in
Keywords: Graph edit distance, Graph isomorphism, Graph similarity, probabilistic approach, relaxation approach
Abstract: Web mining is the application of machine learning (data mining) techniques to web-based data for the
purpose of learning or extracting knowledge. Web mining encompasses a wide variety of techniques,
including soft computing. Web mining methodologies are generally classified into three distinct
categories: web structure mining, web usage mining, and web content mining. In web content mining we
examine the actual content of web pages and perform some knowledge discovery procedure. In this paper
we discuss the concepts of graph similarity, graph distance, and graph matching, as
they form a basis for the novel approaches. The purpose of the paper is to give a literature survey of
the various methods that are used to determine similarity, distance, and matching between graphs.
$$ d_\phi(G_1, G_2) = \sum_{x, y \in V_1} \left| d_{G_1}(x, y) - d_{G_2}(\phi(x), \phi(y)) \right| \tag{6} $$

where φ is a 1-to-1 mapping (but not necessarily an isomorphism) between G1 and G2. Here |...| is the
standard absolute value operation. If φ is an isomorphism (i.e. G1 ≅ G2), then d_φ(G1,G2)=0; if G1 and
G2 are not isomorphic, then d_φ(G1,G2)>0. This leads to a definition of distance between two graphs,
denoted d(G1,G2). Here again we see the idea of examining all the possible matching functions (φ, in
the notation of the current method; M in the notation of graph edit distance) between two graphs in
order to determine the distance between them. The authors also go on to show that if the graphs meet
certain requirements then we can make some other, less expensive calculations. For example, if G1 and
G2 are connected graphs with equal numbers of nodes, then we can determine the ...

... representation (such as the matrix M used for graph matching) into a continuous representation.
Thus we can transform a discrete optimization problem (exact graph matching using discrete matrix M)
into a continuous optimization problem. Compared to the state space search approach, relaxation is a
non-linear optimization approach. [S. Gold, 2002] applied relaxation to the graph matching problem.
They posed the problem of attributed graph matching as an optimization problem; the goal is then to
minimize the objective function E. The authors use the graduated assignment algorithm to find an M
which minimizes E. The general procedure of the algorithm is as follows:
(1) Start with some valid initial matrix M0.
(2) Determine a first order Taylor expansion of E about M0 yielding: ...

$$ J(M, C) = \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij}^2 f(C_{ij}) + \eta \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij} (1 - M_{ij}) \tag{10} $$

where M is now a fuzzy membership matrix (0≤Mij≤1) that relates the degree of match between nodes, C
is a compatibility matrix between nodes (rather than edges as above), and η is a control parameter.
The summations in Eq. 10 are under the constraint that (i,j) ≠ (|V1|+1,|V2|+1); the extra nodes in the
graphs are dummy nodes similar to slack variables. The authors then go on to derive the necessary
update equations for M and C in order to minimize J(M,C) and propose an algorithm which updates these
matrices in an alternating fashion.

IX. MEAN AND MEDIAN OF GRAPHS

In addition to the graph matching approaches we have described, we should also mention the concepts
of mean and median of a set of graphs [S. Günter, 2002]. These do not explicitly give us an indication
of graph similarity, but are useful in summarizing a group of graphs. This is useful in applications
such as clustering, where we need to represent a group of graphs by some exemplar graph that
represents the cluster. The mean of two graphs [S. Günter, 2002] G1 and G2 is a graph g such that:
d(G1,g)=d(G2,g)
and
d(G1,G2)=d(G1,g)+d(g,G2)
...

... operations (given the lowest cost editing function between the graphs) for the given α in order to
determine the mean graph. In [H. Bunke and A. Kandel, 2000], a theoretical proof is given that any
graph g such that mcs(G1,G2)⊆g⊆G1 or mcs(G1,G2)⊆g⊆G2 is a mean of G1 and G2. Thus the problem becomes
finding a graph that is a supergraph of the maximum common subgraph, but a subgraph of one of the
original graphs. Finally, we have the concept of the median of a set of graphs, which acts like a
representative of the set. The median of a set of graphs S [H. Bunke, 2001] is a graph g∈S such that g
has the lowest average distance to all elements in S:

$$ g = \arg\min_{s \in S} \frac{1}{|S|} \sum_{G \in S} d(s, G) $$

Since g∈S, it is straightforward (and relatively inexpensive) to simply compute the average distance
to all graphs for each graph in S. Further, the median of a set of graphs always exists; it may or may
not also be a mean.

X. CONCLUSIONS

In this paper we have given a survey of the most popular methods for determining graph similarity.
Graph isomorphism finds an exact 1-to-1 matching between identical graphs and was the earliest
approach to graph matching. Unfortunately, it cannot handle inexact graph matching. Graph edit
distance is a popular approach that can deal with inexact matching. It determines the cost of a
sequence of edit operations needed to transform one graph into another. Methods such as state space
search and relaxation have also been applied to the
problem of determining graph similarity. These techniques are often used to provide a sub-optimal
approximation when the original problem is NP-Complete or has a high potential for combinatorial
explosion. For example, state space search can be used if we represent the matching or edit sequences
between graphs as states, and then execute a search strategy for the state with the lowest cost.
As we have seen, the approaches often have similarities with one another. For example, probability
can be seen not just in the Bayesian approach, but also in the cost functions of graph edit distance
and some state space search approaches.

REFERENCES

B. Huet and E. R. Hancock, 1999, “Shape recognition from large image libraries by inexact graph matching”, Pattern Recognition Letters, Vol. 20, 1999, pp. 1259–1269.
C. Apte, F. Damerau, and S. M. Weiss, 1997, “Automated Learning of Decision Rules for Text Categorization”, ACM Transactions on Information Systems, Vol. 12, 1994, pp. 233–251.
G. Chartrand, G. Kubicki, and M. Schultz, 1998, “Graph similarity and distance in graphs”, Aequationes Mathematicae, Vol. 55, 1998, pp. 129–145.
G. Levi, 1972, “A note on the derivation of maximal common subgraphs of two directed or undirected graphs”, Calcolo, Vol. 9, 1972, pp. 341–354.
H. Bunke and A. Kandel, 2000, “Mean and maximum common subgraph of two graphs”, Pattern Recognition Letters, Vol. 21, 2000, pp. 163–168.
H. Bunke and K. Shearer, 1998, “A graph distance metric based on the maximal common subgraph”, Pattern Recognition Letters, Vol. 19, 1998, pp. 255–259.
H. Bunke, 1999, “Error Correcting Graph Matching: On the Influence of the Underlying Cost Function”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 9, September 1999, pp. 917–922.
H. Bunke, X. Jiang, and A. Kandel, 2000, “On the minimum Common Supergraph of Two Graphs”, Computing, Vol. 65, 2000, pp. 13–25.
H. Bunke, S. Günter, and X. Jiang, 2001, “Towards Bridging the Gap between Statistical and Structural Pattern Recognition: Two New Concepts in Graph Matching”, in Advances in Pattern Recognition - ICAPR 2001, S. Singh, N. Murshed, and W. Kropatsch (Eds.), Springer Verlag, LNCS 2013, 2001, pp. 1–11.
H. Bunke, 1997, “On a relation between graph edit distance and maximum common subgraph”, Pattern Recognition Letters, Vol. 18, 1997, pp. 689–694.
H. Bunke, 2000, “Recent developments in graph matching”, Proceedings of the 15th International Conference on Pattern Recognition, Vol. 2, Barcelona, 2000, pp. 117–124.
J. J. McGregor, 1982, “Backtrack search algorithms and the maximal common subgraph problem”, Software Practice and Experience, Vol. 12, 1982, pp. 23–34.
J. T. L. Wang, K. Zhang, and G.-W. Chirn, 1995, “Algorithms for Approximate Graph Matching”, Information Sciences, Vol. 82, 1995, pp. 45–74.
K. Haris, S. N. Efstratiadis, N. Maglaveras, C. Pappas, J. Gourassas, and G. Louridas, 1999, “Model-Based Morphological Segmentation and Labeling of Coronary Angiograms”, IEEE Transactions on Medical Imaging, Vol. 18, No. 10, October 1999, pp. 1003–1015.
K.-C. Tai, 2003, “The tree-to-tree correction problem”, Journal of the Association for Computing Machinery, Vol. 26, No. 3, 2003, pp. 422–433.
M.-L. Fernández and G. Valiente, 2001, “A graph distance metric combining maximum common subgraph and minimum common supergraph”, Pattern Recognition Letters, Vol. 22, 2001, pp. 753–758.
McCallum and K. Nigam, 1998, “A comparison of event models for Naive Bayes text classification”, AAAI–98 Workshop on Learning for Text Categorization, 1998.
R. A. Wagner and M. J. Fischer, 2001, “The String-to-String Correction Problem”, Journal of the Association for Computing Machinery, Vol. 21, 2001, pp. 168–173.
R. C. Wilson and E. R. Hancock, 1997, “Structural Matching by Discrete Relaxation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 6, June 1997, pp. 634–
S. Gold and A. Rangarajan, 2002, “A Graduated Assignment Algorithm for Graph Matching”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 4, April 1996, pp. 377–388.
S. Günter and H. Bunke, 2002, “Self-organizing map for clustering in the graph domain”, Pattern Recognition Letters, Vol. 23, 2002, pp. 405–417.
S. Medasani, R. Krishnapuram, and Y. S. Choi, 2001, “Graph Matching by Relaxation of Fuzzy Assignments”, IEEE Transactions on Fuzzy Systems, Vol. 9, No. 1, February 2001, pp. 173–182.
S. Wei, S. Jun, and Z. Huicheng, 2001, “A fingerprint recognition system by use of graph matching”, Proceedings of SPIE, Vol. 4554, 2001, pp. 141–146.
Sanfeliu and K. S. Fu, 2003, “A distance measure between attributed relational graphs for pattern recognition”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, 1983, pp. 353–363.
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, 1997, Introduction to Algorithms, The MIT Press: Cambridge, Massachusetts, 1997.
T. P. and C. Siva Ram Murthy, 1999, “Optimal task allocation in distributed systems by graph matching and state space search”, The Journal of Systems and Software, Vol. 46, 1999, pp. 59–75.
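
The median of a set of graphs described in Section IX, and the two conditions that define a mean, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the method of any cited paper: graphs are assumed to have uniquely labelled nodes (so the maximum common subgraph reduces to the shared nodes and shared edges), and the distance 1 - |mcs(G1,G2)| / max(|G1|,|G2|), with size counted as nodes plus edges, is a placeholder chosen for the example. The names make_graph, size, mcs_distance, median_graph and is_mean are hypothetical.

```python
from math import isclose


def make_graph(edges):
    """Build an undirected graph with uniquely labelled nodes from (u, v) pairs."""
    edge_set = frozenset(frozenset(e) for e in edges)
    node_set = frozenset(n for e in edge_set for n in e)
    return (node_set, edge_set)


def size(graph):
    """Graph size, assumed here to be the number of nodes plus edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Placeholder distance (an assumption): 1 - |mcs(g1, g2)| / max(|g1|, |g2|).

    With unique node labels the maximum common subgraph is simply the
    shared nodes and shared edges, so no search is required.
    """
    largest = max(size(g1), size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common = (g1[0] & g2[0], g1[1] & g2[1])
    return 1.0 - size(common) / largest


def median_graph(graphs, dist):
    """Return the member of `graphs` with the lowest average distance to all members."""
    def average_distance(g):
        return sum(dist(g, h) for h in graphs) / len(graphs)
    return min(graphs, key=average_distance)


def is_mean(g1, g2, g, dist):
    """Check d(G1,g) = d(G2,g) and d(G1,G2) = d(G1,g) + d(g,G2)."""
    d1, d2 = dist(g1, g), dist(g2, g)
    return (isclose(d1, d2, abs_tol=1e-12)
            and isclose(dist(g1, g2), d1 + d2, abs_tol=1e-12))


if __name__ == "__main__":
    docs = [
        make_graph([("a", "b"), ("b", "c")]),
        make_graph([("a", "b"), ("b", "c"), ("c", "d")]),
        make_graph([("a", "b"), ("x", "y")]),
    ]
    median = median_graph(docs, mcs_distance)
    print("index of the median graph:", docs.index(median))
```

Because the median is restricted to members of S, the search costs |S|^2 distance evaluations; any other graph distance can be passed in place of the placeholder through the dist argument.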
Avinash N. Bhute (ACM M ’09, CSI M ’09, ISTE LM ’05) is an Assistant Professor at Sinhgad College of
Engineering, Pune. He received his bachelor's degree in Computer Science and Engineering from Amravati
University in 1999 and his M.Tech. from Bharati Vidhyapith, Pune, in 2005. He is an active member of
ACM, CSI, ISTE, and the Indian Science Congress Association. He recently reviewed the book “Software
Engineering”, 7th Edition, by Stephen Schach, McGraw Hill Publication. He has published seven
international papers and 12 national papers. His current research interests include knowledge
discovery in databases, mining the Web, ontology, artificial intelligence, and software engineering.