Overview of Graph Similarity Technique in Web Content Mining

OVERVIEW OF GRAPH SIMILARITY TECHNIQUES FOR WEB CONTENT MINING
Avinash N. Bhute
Research Scholar, VJTI, Mumbai, Maharashtra State, India
Email: anbhute@gmail.com
Harsha Bhute
Pursuing ME (Computer Network), Sinhgad College of Engineering, Pune, Maharashtra State, India
Email: harshaavi@gmail.com
Dr. B. B. Meshram
Professor & Head, VJTI, Mumbai, Maharashtra State, India
Email: bandumeshram@yahoo.co.in
Keywords: Graph edit distance, Graph isomorphism, Graph similarity, probabilistic approach, relaxation approach
Abstract: Web mining is the application of machine learning (data mining) techniques to web-based data for the
purpose of learning or extracting knowledge. Web mining encompasses a wide variety of techniques,
including soft computing. Web mining methodologies are generally classified into three distinct
categories: web structure mining, web usage mining, and web content mining. In web content mining we
examine the actual content of web pages and perform some knowledge discovery procedure. In this paper
we discuss the concepts of graph similarity, graph distance, and graph matching, as
they form a basis for the novel approaches. The purpose of the paper is to give a literature survey of
the various methods that are used to determine similarity, distance, and matching between graphs.
1 INTRODUCTION

In this paper we discuss the concepts of graph similarity, graph distance, and graph matching, as they form a basis for the novel approaches. The purpose of the paper is to give a literature survey of the various methods that are used to determine similarity, distance, and matching between graphs. These topics are closely related to inexact graph matching, and several practical applications that utilize graph similarity or graph matching have been reported, many of them in the field of image processing. Haris et al. performed labeling of coronary angiograms with graph matching [K. Haris, 1999]. In [C. Siva Ram Murthy, 1999] a method for allocating tasks in multiprocessor systems using graphs and graph matching is described. In [B. Huet, 1999] a graph matching method for shape recognition in image databases is described.

In this paper we are specifically interested in using graph techniques for dealing with web document content. Traditional learning methods applied to the tasks of text or document classification and categorization, such as rule induction [C. Apte, 1997] and Bayesian methods [A. McCallum,1998], are based on a vector model of document representation or an even simpler Boolean model. Similarity of graphs in domains outside of information retrieval has largely been studied under the topic of graph matching. In many applications the input graph is not expected to be an exact match to any database graph, since the input graph is either previously unseen or assumed to be corrupted with some amount of noise. Thus we sometimes refer to this area as error-tolerant or inexact graph matching. As mentioned above, a number of graph matching applications have been reported in the literature; for a recent survey see [H.Bunke, 2000]. We are not aware, however, of any graph matching applications that deal with content-based categorization and classification of web or text documents.
2 GRAPH AND SUBGRAPH ISOMORPHISM

Here we describe graph and subgraph isomorphism, beginning with definitions of graph and subgraph. A graph $G$ [H. Bunke, 2000][Wang, 1995] is a 4-tuple $G=(V, E, \alpha, \beta)$, where $V$ is a set of nodes (also called vertices), $E \subseteq V \times V$ is a set of edges connecting the nodes, $\alpha: V \to \Sigma_V$ is a function labeling the nodes, and $\beta: V \times V \to \Sigma_E$ is a function labeling the edges ($\Sigma_V$ and $\Sigma_E$ being the sets of labels that can appear on the nodes and edges, respectively). For brevity, we may abbreviate $G$ as $G=(V,E)$ by omitting the labeling functions. A graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ is a subgraph [H. Bunke,1997] of a graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$, denoted $G_1 \subseteq G_2$, if $V_1 \subseteq V_2$, $E_1 \subseteq E_2 \cap (V_1 \times V_1)$, $\alpha_1(x) = \alpha_2(x)\ \forall x \in V_1$, and $\beta_1((x,y)) = \beta_2((x,y))\ \forall (x,y) \in E_1$. Conversely, graph $G_2$ is also called a supergraph of $G_1$.

When we say that two graphs are isomorphic, we mean that the graphs contain the same number of nodes and there is a direct 1-to-1 correspondence between the nodes in the two graphs such that the edges between nodes and all labels are preserved. Formally, a graph $G_1 = (V_1, E_1, \alpha_1, \beta_1)$ and a graph $G_2 = (V_2, E_2, \alpha_2, \beta_2)$ are said to be isomorphic [H. Bunke,1997], denoted $G_1 \cong G_2$, if there exists a bijective function $f: V_1 \to V_2$ such that $\alpha_1(x) = \alpha_2(f(x))$ for all $x \in V_1$ and $\beta_1((x,y)) = \beta_2((f(x),f(y)))$ for all $(x,y) \in V_1 \times V_1$. Such a function $f$ is also called a graph isomorphism between $G_1$ and $G_2$.

There is also the notion of subgraph isomorphism, meaning a graph is isomorphic to a part of (i.e. a subgraph of) another graph. Given a graph isomorphism $f$ between graphs $G_1$ and $G_2$ as defined above and another graph $G_3$, if $G_2 \subseteq G_3$ then $f$ is a subgraph isomorphism [H. Bunke, 2000] between $G_1$ and $G_3$. Subgraph isomorphism tells us whether one graph appears as part of another graph.

Formally, the similarity between two graphs $G_1$ and $G_2$, denoted $s(G_1,G_2)$, is a function that has the following properties:
(1) $0 \le s(G_1,G_2) \le 1$
(2) $s(G_1,G_2) = 1 \Leftrightarrow G_1 \cong G_2$
(3) $s(G_1,G_2) = s(G_2,G_1)$
(4) if $G_1$ is more similar to $G_2$ than to $G_3$, then $s(G_1,G_2) \ge s(G_1,G_3)$

One problem with defining similarity in this way is that it is not clear what case causes $s(G_1,G_2)=0$. This comes from the fact that we have no concept of an exact "opposite" of a graph. We do, however, have the idea of complements of graphs. A complement [T. H. Cormen, 1997] of a graph $G$, denoted $G^C$, is the fully connected version of $G$ from which the edges in $G$ have been removed: $E^C = \{(u,v) \mid (u,v) \notin E\}$.

Fig. 1. A graph G and its complement G^C.

However, a graph may be isomorphic to its complement (Fig. 1), so it does not necessarily hold that $s(G,G^C)=0$. Given this limitation, the usual method of determining numeric similarity between graphs is to use a distance measure. A distance metric [H. Bunke, 2000][H. Bunke,1998][M.-L. Fernández,2001] between two graphs, denoted $d(G_1,G_2)$, is a function that has the following properties:
(1) boundary condition: $d(G_1,G_2) \ge 0$
(2) identical graphs have zero distance: $d(G_1,G_2) = 0 \Leftrightarrow G_1 \cong G_2$
(3) symmetry: $d(G_1,G_2) = d(G_2,G_1)$
(4) triangle inequality: $d(G_1,G_3) \le d(G_1,G_2) + d(G_2,G_3)$

We note that it is possible to transform a similarity measure into a distance measure, for example by:

$$d(G_1,G_2) = 1 - s(G_1,G_2) \qquad (1)$$

Given the similarity conditions above, this equation satisfies the boundary, identity, and symmetry conditions of a distance; whether the triangle inequality holds depends on the particular similarity measure. Other equations are also possible for converting between distance and similarity. Throughout the rest of this paper we will see several proposed distance measures, some of which have been created from a similarity measure.

3 GRAPH EDIT DISTANCE

Edit distance is a method that is used to measure the difference between symbolic data structures such as trees [K.-C. Tai, 2003] and strings [R. A. Wagner ,2001]. It is also known as the Levenshtein distance, from early work in error correcting/detecting codes that allowed insertion and deletion of symbols [G. Levi, 1972]. The concept is straightforward. Various operations are defined on the structures, such as deletion, insertion, and renaming of elements. A cost function is associated with each operation, and the minimum cost needed
to transform one structure into the other using the operations is the distance between them. Edit distance has also been applied to graphs, as graph edit distance [A. Sanfeliu, 2003]. The operations in graph edit distance are insertion, deletion, and re-labeling of nodes and edges. Formally, an editing matching function (or an error correcting graph matching, ecgm [H. Bunke,1997]) between two graphs $G_1$ and $G_2$ is defined as a bijective mapping function $M: G_x \to G_y$, where $G_x \subseteq G_1$ and $G_y \subseteq G_2$. The following six edit operations on the graphs, which are implied by the mapping $M$, are also defined:
(1) If a node $v \in V_1$ but $v \notin V_x$ then we delete node $v$ with cost $c_{nd}$.
(2) If a node $v \in V_2$ but $v \notin V_y$ then we insert node $v$ with cost $c_{ni}$.
(3) If $M(v_i) = v_j$ for $v_i \in V_x$ and $v_j \in V_y$ and $\alpha_1(v_i) \ne \alpha_2(v_j)$ then we substitute node $v_i$ with node $v_j$ with cost $c_{ns}$.
(4) If an edge $e \in E_1$ but $e \notin E_x$ then we delete edge $e$ with cost $c_{ed}$.
(5) If an edge $e \in E_2$ but $e \notin E_y$ then we insert edge $e$ with cost $c_{ei}$.
(6) If $M(e_i) = e_j$ for $e_i \in E_x$ and $e_j \in E_y$ and $\beta_1(e_i) \ne \beta_2(e_j)$ then we substitute edge $e_i$ with edge $e_j$ with cost $c_{es}$.

Usually the cost coefficients $c$ are application dependent. In the error correcting graph matching sense, they can be related to the probability of the operations (errors) occurring. We assume that the cost coefficients are non-negative and do not depend on the node or edge to which they are applied (i.e. the costs are constant for each operation). The edit distance between two graphs [H. Bunke,1997], denoted $d(G_1,G_2)$, is defined as the cost $\gamma(M)$ of the mapping $M$ for which that cost is lowest. More formally:

$$d(G_1,G_2) = \min_{M} \gamma(M) \qquad (2)$$

Thus the distance between two graphs is the cost of an editing function which transforms one graph into the other via edit operations and which has the lowest cost among all such editing functions. The advantage of the graph edit distance approach is that it is easy to understand and straightforward to apply. The disadvantage is that the costs for the edit operations (six parameter values) need to be determined for each application. In [H. Bunke, 1999], Bunke gives an examination of cost functions for graph edit distance.

4 MAXIMUM COMMON SUBGRAPH / MINIMUM COMMON SUPERGRAPH APPROACH

Bunke has shown [H. Bunke,1997] that there is a direct relationship between graph edit distance and the maximum common subgraph between two graphs. Specifically, the two are equivalent under certain restrictions on the cost functions. A graph $g$ is a maximum common subgraph (mcs) [H. Bunke,1997] of graphs $G_1$ and $G_2$, denoted $mcs(G_1,G_2)$, if: (1) $g \subseteq G_1$, (2) $g \subseteq G_2$, and (3) there is no other subgraph $g'$ ($g' \subseteq G_1$, $g' \subseteq G_2$) such that $|g'| > |g|$. (Here $|g|$ is usually taken to mean $|V|$, i.e. the number of nodes in the graph; it is used to indicate the "size" of a graph.) Similarly, there is the complementary idea of minimum common supergraph. A graph $g$ is a minimum common supergraph (MCS) [H. Bunke, 2000] of graphs $G_1$ and $G_2$, denoted $MCS(G_1,G_2)$, if: (1) $G_1 \subseteq g$, (2) $G_2 \subseteq g$, and (3) there is no other supergraph $g'$ ($G_1 \subseteq g'$, $G_2 \subseteq g'$) such that $|g'| < |g|$. Methods for determining the mcs are given in [G. Levi, 1972][J. J. McGregor, 1982].

The general approach is to create a compatibility graph for the two given graphs, and then find the largest clique within it. What Bunke has shown is that when computing the editing matching function based on graph edit distance, the function with the lowest cost is equivalent to the maximum common subgraph between the two graphs under certain conditions on the cost coefficients. This is intuitively appealing, since the maximum common subgraph is the part of both graphs that is unchanged by deleting or inserting nodes and edges. To edit graph $G_1$ into graph $G_2$, one only needs to perform the following steps:
(1) Delete nodes and edges from $G_1$ that do not appear in $mcs(G_1,G_2)$
(2) Perform any node or edge substitutions
(3) Add the nodes and edges from $G_2$ that do not appear in $mcs(G_1,G_2)$

Following this observation that the size of the maximum common subgraph is related to the similarity between two graphs, Bunke and Shearer [H. Bunke,1998] introduced a distance measure based on mcs. They defined the following distance measure:

$$d_{MCS}(G_1,G_2) = 1 - \frac{|mcs(G_1,G_2)|}{\max(|G_1|,|G_2|)} \qquad (3)$$
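As a rough illustration of the mcs-based distance in Eq. (3), the following Python sketch computes it for a restricted case. It assumes that every node carries a unique label (for example, a term from a document), so that nodes can be identified by their labels and the maximum common subgraph reduces to the shared nodes and shared edges; it also ignores edge labels. The general problem requires a search over node correspondences, which this sketch does not attempt. The names `mcs_unique_labels` and `d_mcs` and the example graphs are hypothetical and do not come from the surveyed papers.

```python
from typing import Set, Tuple

# A graph is a pair (node labels, directed edges between labels).
Graph = Tuple[Set[str], Set[Tuple[str, str]]]


def mcs_unique_labels(g1: Graph, g2: Graph) -> Graph:
    """Maximum common subgraph under the ASSUMPTION of unique node labels.

    With unique labels, a node can only correspond to the node with the
    same label, so the common subgraph is a plain set intersection.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    common_nodes = nodes1 & nodes2
    common_edges = {
        (u, v)
        for (u, v) in edges1 & edges2
        if u in common_nodes and v in common_nodes
    }
    return common_nodes, common_edges


def d_mcs(g1: Graph, g2: Graph) -> float:
    """Distance of Eq. (3), with graph size taken as the number of nodes."""
    largest = max(len(g1[0]), len(g2[0]))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common_nodes, _ = mcs_unique_labels(g1, g2)
    return 1.0 - len(common_nodes) / largest


# Two small illustrative graphs sharing the nodes "web" and "mining".
g1: Graph = ({"web", "graph", "mining"},
             {("web", "mining"), ("graph", "mining")})
g2: Graph = ({"web", "mining", "content", "data"},
             {("web", "mining"), ("content", "mining")})

print(d_mcs(g1, g2))  # 1 - 2/4
```

No cost coefficients appear anywhere in the computation, which is the practical advantage this measure has over graph edit distance.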
In Eq. (3), $\max(x,y)$ is the usual maximum of two numbers $x$ and $y$, and $|\ldots|$ indicates the size of a graph (usually taken to be the number of nodes in a graph). This distance measure has four important properties [H. Bunke, 2000]. First, it is restricted to producing a number in the interval [0,1]. Second, the distance is 0 only when the two graphs are identical. Third, the distance between two graphs is symmetric. Fourth, it obeys the triangle inequality, which ensures the distance measure behaves in an intuitive way. For example, if we have two dissimilar objects (i.e. there is a large distance between them), the triangle inequality implies that a third object which is similar (i.e. has a small distance) to one of those objects must be dissimilar to the other. The advantage of this approach over the graph edit distance method is that it does not require the determination of any cost coefficients or other parameters.

5 STATE SPACE SEARCH APPROACH

Depending on the size of the graphs and the costs associated with the edit operations, finding the lowest cost mapping may require an exhaustive examination of all possible matchings. If we do not need the exact distance between graphs, we can perform other types of sub-optimal search. These searches may not find the global minimum of the cost function, but they can be performed more quickly (since we do not need to find all of the possible matching functions) and still yield acceptable results. Each matching function we consider becomes a state in a search space. The cost $\gamma(M)$ for a state $M$ becomes the value we attempt to minimize through the search. $M$ is actually a graph isomorphism between subgraphs of the two graphs being matched; it specifies the operations needed to edit one graph into the other graph. Neighbors of a state $M$ can be determined by adding/deleting nodes and edges to/from these subgraphs along with their corresponding isomorphic matching; these neighbor states indicate the creation (or removal) of a single matching between a node or edge in the two graphs (i.e. a change in the edit operations). Once the matching is represented in such a manner, many techniques become available for performing the search, including hill climbing, genetic algorithms, simulated annealing, and so forth. These searches may not find the optimal solution, but for some applications (such as graph matching for retrieval of images or documents) this may not be a concern. These techniques are also sensitive to initialization and parameter selection, so performance can vary widely. For a more detailed description of this technique, as well as experimental results comparing different search and initialization strategies, we refer the reader to [Wang, 1995].

6 PROBABILISTIC APPROACHES

In this section we give a summary of the approach proposed by Wilson and Hancock [R. C. Wilson, 1997], which is based on probability theory. In the probabilistic method, we attempt to match a data graph $G_D$ and a stored model graph $G_M$. These graphs are attributed graphs. An attributed graph [36] is a graph $G=(V,E,A)$, where $A$ is a set of attributes associated with each node, $A=\{x_v \mid v \in V\}$. The attributes in the data graph are to be matched to those in the model graph, such that the matched nodes have the same or similar attributes. Edges may also have associated attributes in this model, but they are not considered in this approach. Next, we have the concept of the super-clique of a node. A super-clique [R. C. Wilson,1997] of a node $i$ in graph $G=(V,E)$ is defined as $C_i = \{i\} \cup \{j \mid (j,i) \in E\}$. In other words, the super-clique of a node $i$ is the set of nodes which contains $i$ and all nodes connected to it by edges. We attempt to match all super-cliques in the data graph with super-cliques in the model graph.

The set of all possible matches between super-clique $C_i$ in the data graph $G_D$ and super-cliques $S_j$ in the model graph $G_M$ is called a dictionary [R. C. Wilson , 1997] and is denoted $\Theta_i$. To cope with size differences between the data and model super-cliques, we allow dummy (or null) nodes $\phi$ to be inserted into $S_j$ so that both super-cliques have equal numbers of nodes. The function matching a node in $C_i$ to a node in $S_j$ is $f: V_D \to V_M$.

The probability of matching errors (a node in the data graph is matched to the wrong node in the model graph) is denoted $P_e$, and the probability of structural errors (a node in the data graph is matched to a dummy node in the model graph) is denoted $P_\phi$. Given these definitions, some assumptions, and through application of Bayes' rule and other probability theoretic constructions, Wilson and Hancock arrive at a mathematical description for the probability of a super-clique matching between two graphs (denoted $P(\Gamma_j)$ for super-clique $C_j$).
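The super-clique definition used by the probabilistic approach, $C_i = \{i\} \cup \{j \mid (j,i) \in E\}$, can be sketched in a few lines of Python. This is only an illustration of the set construction, not of Wilson and Hancock's matching procedure; it assumes an undirected graph whose edges are stored once as pairs, and the function names `super_clique` and `all_super_cliques` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def super_clique(node: Hashable, edges: Iterable[Edge]) -> Set[Hashable]:
    """Return the node together with every node joined to it by an edge.

    ASSUMPTION: edges are undirected, so (u, v) connects u to v and v to u.
    """
    members = {node}
    for u, v in edges:
        if u == node:
            members.add(v)
        if v == node:
            members.add(u)
    return members


def all_super_cliques(nodes: Iterable[Hashable],
                      edges: Iterable[Edge]) -> Dict[Hashable, Set[Hashable]]:
    """Super-clique of every node, keyed by the node."""
    edge_list = list(edges)  # allow the edges to be scanned once per node
    return {i: super_clique(i, edge_list) for i in nodes}


# A path graph 1 - 2 - 3 - 4 used purely as an example.
example_edges = {(1, 2), (2, 3), (3, 4)}
cliques = all_super_cliques({1, 2, 3, 4}, example_edges)
print(cliques[2])  # node 2 with its neighbours 1 and 3
```

In the matching itself, each data-graph super-clique built this way would be compared against model-graph super-cliques of the same size, with null nodes padding the smaller one.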
The super-clique matching probability is given by

$$P(\Gamma_j) = \frac{K_{C_j}}{|\Theta_j|} \sum_{S_i \in \Theta_j} \exp\{-(k_e H(\Gamma_j,S_i) + k_\phi[\Psi(\Gamma_j,S_i) - \Psi(\Gamma_j)])\} \qquad (4)$$

where

$$K_{C_j} = [(1-P_e)(1-P_\phi)]^{|C_j|} \qquad (5)$$

Here $k_e$ and $k_\phi$ are constants determined by $P_e$ and $P_\phi$, $H(\Gamma_j,S_i)$ is the Hamming distance between the super-clique of the data graph under the mapping $f$ and the super-clique of the model graph, $\Psi(\Gamma_j,S_i) = |C_j| - |S_i|$ (i.e. the number of null nodes inserted into $S_i$), and $\Psi(\Gamma_j)$ is the number of nodes in $C_j$ which are mapped onto null nodes in $S_i$.

7 DISTANCE PRESERVATION APPROACH

Chartrand, Kubicki, and Schultz [G. Chartrand, 1998] describe an approach for graph distance calculation based on preserving the distance between nodes. The idea comes from the fact that when two graphs are isomorphic, the distances (meaning in this context the number of edges traversed) between every pair of nodes are identical in both graphs. Given a graph $G=(V,E)$, the distance between two nodes $x,y \in V$, denoted $d_G(x,y)$, is defined as the minimum number of edges that need to be traversed when traveling from $x$ to $y$ [G. Chartrand,, 1998]. Further, the $\phi$-distance [G. Chartrand,1998] between two graphs $G_1$ and $G_2$, denoted $d_\phi(G_1,G_2)$, is defined as

$$d_\phi(G_1,G_2) = \sum_{x,y \in V_1} |d_{G_1}(x,y) - d_{G_2}(\phi(x),\phi(y))| \qquad (6)$$

where $\phi$ is a 1-to-1 mapping (but not necessarily an isomorphism) between $G_1$ and $G_2$. Here $|\ldots|$ is the standard absolute value operation. If $\phi$ is an isomorphism (i.e. $G_1 \cong G_2$), then $d_\phi(G_1,G_2)=0$; if $G_1$ and $G_2$ are not isomorphic, then $d_\phi(G_1,G_2)>0$. This leads to a definition of distance between two graphs, denoted $d(G_1,G_2)$:

$$d(G_1,G_2) = \min_{\phi} d_\phi(G_1,G_2) \qquad (7)$$

Here again we see the idea of examining all the possible matching functions ($\phi$ in the notation of the current method; $M$ in the notation of graph edit distance) between two graphs in order to determine the distance between them. The authors also go on to show that if the graphs meet certain requirements then we can make some other, less expensive calculations. For example, if $G_1$ and $G_2$ are connected graphs with equal numbers of nodes, then a lower bound on their distance can be determined from the total distance of each graph, in other words the sum of distances between all pairs of nodes in a graph. Further theoretical contributions related to this approach can be found in [G. Chartrand, G. Kubicki, and M. Schultz, 1998].

8 RELAXATION APPROACHES

Some early algorithms for determining exact graph matching (isomorphism) used a matching matrix $M$ which indicates the compatibility of nodes in the two graphs being matched. If the $i$th row and $j$th column element of $M$, denoted $M_{ij}$, is 1, then node $i$ in graph $G_1$ is matched with node $j$ in graph $G_2$; otherwise there is no match and $M_{ij}=0$. There are constraints on the matrix $M$ so that each row has exactly one 1 and no column has more than one 1. Such a representation and the algorithms applied to it for determining graph matching are straightforward; however, they can require generating all the permutations of possible node matchings over the matrix.

In order to improve time complexity, we can instead attempt to approximate the optimal solution by finding good sub-optimal solutions. A method that is sometimes used to do this for graph matching problems is called relaxation (or more specifically, discrete relaxation). Put simply, discrete relaxation is a method of transforming a discrete representation (such as the matrix $M$ used for graph matching) into a continuous representation. Thus we can transform a discrete optimization problem (exact graph matching using the discrete matrix $M$) into a continuous optimization problem. Compared to the state space search approach, relaxation is a non-linear optimization approach. Gold and Rangarajan [S. Gold,2002] applied relaxation to the graph matching problem. They posed the problem of attributed graph matching as an optimization problem in which the goal is to minimize an objective function $E$. The authors use the graduated assignment algorithm to find an $M$ which minimizes $E$. The general procedure of the algorithm is as follows:
(1) Start with some valid initial matrix $M^0$.
(2) Determine a first order Taylor expansion of $E$ around $M^0$, yielding:

$$Q_{ai} = -\frac{\partial E}{\partial M_{ai}}\Big|_{M=M^0} = \sum_{b=1}^{|V_1|} \sum_{j=1}^{|V_2|} M^0_{bj} C_{aibj} \qquad (8)$$
(3) Use relaxation to create a continuous representation of $M^0$:

$$M^0_{ai} = e^{\beta Q_{ai}} \qquad (9)$$

where $\beta$ is a control parameter that is slowly increased as the procedure runs.
(4) Update the matrix $M$ by a normalization procedure over both rows and columns.
(5) Repeat until convergence or until an iteration limit is reached.

Medasani, Krishnapuram, and Choi [S. Medasani, 2001] gave a procedure based on fuzzy assignments and relaxation similar to the method just described. The objective function for this approach is

$$J(M,C) = \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij}^2 f(C_{ij}) + \eta \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij}(1 - M_{ij}) \qquad (10)$$

where $M$ is now a fuzzy membership matrix ($0 \le M_{ij} \le 1$) that relates the degree of match between nodes, $C$ is a compatibility matrix between nodes (rather than edges as above), and $\eta$ is a control parameter. The summations in Eq. (10) are under the constraint that $(i,j) \ne (|V_1|+1,|V_2|+1)$; the extra nodes in the graphs are dummy nodes similar to slack variables. The authors then go on to derive the necessary update equations for $M$ and $C$ in order to minimize $J(M,C)$ and propose an algorithm which updates these matrices in an alternating fashion.

9 MEAN AND MEDIAN OF GRAPHS

In addition to the graph matching approaches we have described, we should also mention the concepts of mean and median of a set of graphs [S. Günter, 2002]. These do not explicitly give us an indication of graph similarity, but are useful in summarizing a group of graphs. This is useful in applications such as clustering, where we need to represent a group of graphs by some exemplar graph that represents the cluster. The mean of two graphs [S. Günter, 2002] $G_1$ and $G_2$ is a graph $g$ such that:

$$d(G_1,g) = d(G_2,g)$$

and

$$d(G_1,G_2) = d(G_1,g) + d(g,G_2)$$

In other words, a mean of two graphs $G_1$ and $G_2$ is a graph $g$ that is equidistant from $G_1$ and $G_2$ and whose distance from $G_1$ or $G_2$ is equal to half the distance between $G_1$ and $G_2$. Clearly the mean will depend on the distance function chosen, and there may be more than one graph satisfying these conditions; it is also possible that no mean exists for a given pair of graphs. The weighted mean of two graphs [H. Bunke, 2001] $G_1$ and $G_2$ is a graph $g$ such that:

$$d(G_1,g) = \alpha\, d(G_1,G_2)$$

and

$$d(G_1,G_2) = d(G_1,g) + d(g,G_2)$$

where $0 < \alpha < 1$. If $\alpha = 0.5$, the weighted mean is the mean defined above. An algorithm for finding the weighted mean of two graphs is given in [H. Bunke, 2001]. The method involves finding a subset of editing operations (given the lowest cost editing function between the graphs) for the given $\alpha$ in order to determine the mean graph. In [H. Bunke and A. Kandel, 2000], a theoretical proof is given that any graph $g$ such that $mcs(G_1,G_2) \subseteq g \subseteq G_1$ or $mcs(G_1,G_2) \subseteq g \subseteq G_2$ is a mean of $G_1$ and $G_2$. Thus the problem becomes finding a graph that is a supergraph of the maximum common subgraph, but a subgraph of one of the original graphs. Finally, we have the concept of the median of a set of graphs, which acts like a representative of the set. The median of a set of graphs $S$ [H. Bunke, 2001] is a graph $g \in S$ such that $g$ has the lowest average distance to all elements in $S$:

$$g = \arg\min_{s \in S} \frac{1}{|S|} \sum_{G \in S} d(s,G)$$

Since $g \in S$, it is straightforward (and relatively inexpensive) to simply compute the average distance to all graphs for each graph in $S$. Further, the median of a set of graphs always exists; it may or may not also be a mean.

10 CONCLUSIONS

In this paper we have given a survey of the most popular methods for determining graph similarity. Graph isomorphism finds an exact 1-to-1 matching between identical graphs and was the earliest approach to graph matching. Unfortunately, it cannot handle inexact graph matching. Graph edit distance is a popular approach that can deal with inexact matching. It determines the cost of a sequence of edit operations needed to transform one graph into another. Methods such as state space search and relaxation have also been applied to the
problem of determining graph similarity. These techniques are often used to provide a sub-optimal approximation when the original problem is NP-complete or has a high potential for combinatorial explosion. For example, state space search can be used if we represent the matching or edit sequences between graphs as states, and then execute a search strategy for the state with the lowest cost.

As we have seen, the approaches often have similarities with one another. For example, probability can be seen not just in the Bayesian approach, but also in the cost functions of graph edit distance and some state space search approaches.

REFERENCES

B. Huet and E. R. Hancock, 1999, “Shape recognition from large image libraries by inexact graph matching”, Pattern Recognition Letters, Vol. 20, 1999, pp. 1259–1269.
C. Apte, 1997, F. Damerau, and S. M. Weiss, “Automated Learning of Decision Rules for Text Categorization”, ACM Transactions on Information Systems, Vol. 12, 1994, pp. 233–251.
G. Chartrand, G. Kubicki, and M. Schultz, 1998, “Graph similarity and distance in graphs”, Aequationes Mathematicae, Vol. 55, 1998, pp. 129–145.
G. Levi, 1972, “A note on the derivation of maximal common subgraphs of two directed or undirected graphs”, Calcolo, Vol. 9, 1972, pp. 341–354.
H. Bunke and A. Kandel, 2000, “Mean and maximum common subgraph of two graphs”, Pattern Recognition Letters, Vol. 21, 2000, pp. 163–168.
H. Bunke and K. Shearer, 1998, “A graph distance metric based on the maximal common subgraph”, Pattern Recognition Letters, Vol. 19, 1998, pp. 255–259.
H. Bunke, 1999, “Error Correcting Graph Matching: On the Influence of the Underlying Cost Function”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 9, September 1999, pp. 917–922.
H. Bunke, 2000, X. Jiang, and A. Kandel, “On the minimum Common Supergraph of Two Graphs”, Computing, Vol. 65, 2000, pp. 13–25.
H. Bunke, 2001, S. Günter, and X. Jiang, “Towards Bridging the Gap between Statistical and Structural Pattern Recognition: Two New Concepts in Graph Matching”, in Advances in Pattern Recognition - ICAPR 2001, S. Singh, N. Murshed, and W. Kropatsch (Eds.), Springer Verlag, LNCS 2013, 2001, pp. 1–11.
H. Bunke, 1997, “On a relation between graph edit distance and maximum common subgraph”, Pattern Recognition Letters, Vol. 18, 1997, pp. 689–694.
H. Bunke, 2000, “Recent developments in graph matching”, Proceedings of the 15th International Conference on Pattern Recognition, Vol. 2, Barcelona, 2000, pp. 117–124.
J. J. McGregor, 1982, “Backtrack search algorithms and the maximal common subgraph problem”, Software Practice and Experience, Vol. 12, 1982, pp. 23–34.
J. T. L. Wang, 1995, K. Zhang, and G.-W. Chirn, “Algorithms for Approximate Graph Matching”, Information Sciences, Vol. 82, 1995, pp. 45–74.
K. Haris, 1999, S. N. Efstratiadis, N. Maglaveras, C. Pappas, J. Gourassas, and G. Louridas, “Model-Based Morphological Segmentation and Labeling of Coronary Angiograms”, IEEE Transactions on Medical Imaging, Vol. 18, No. 10, October 1999, pp. 1003–1015.
K.-C. Tai, 2003, “The tree-to-tree correction problem”, Journal of the Association for Computing Machinery, Vol. 26, No. 3, 1979, pp. 422–433.
M.-L. Fernández and G. Valiente, 2001, “A graph distance metric combining maximum common subgraph and minimum common supergraph”, Pattern Recognition Letters, Vol. 22, 2001, pp. 753–758.
A. McCallum and K. Nigam, 1998, “A comparison of event models for Naive Bayes text classification”, AAAI–98 Workshop on Learning for Text Categorization, 1998.
R. A. Wagner and M. J. Fischer, 2001, “The String-to-String Correction Problem”, Journal of the Association for Computing Machinery, Vol. 21, 1974, pp. 168–173.
R. C. Wilson and E. R. Hancock, 1997, “Structural Matching by Discrete Relaxation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 6, June 1997, pp. 634–
S. Gold and A. Rangarajan, 2002, “A Graduated Assignment Algorithm for Graph Matching”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 4, April 1996, pp. 377–388.
S. Günter and H. Bunke, 2002, “Self-organizing map for clustering in the graph domain”, Pattern Recognition Letters, Vol. 23, 2002, pp. 405–417.
S. Medasani, R. Krishnapuram, and Y. S. Choi, 2001, “Graph Matching by Relaxation of Fuzzy Assignments”, IEEE Transactions on Fuzzy Systems, Vol. 9, No. 1, February 2001, pp. 173–182.
S. Wei, S. Jun, and Z. Huicheng, 2001, “A fingerprint recognition system by use of graph matching”, Proceedings of SPIE, Vol. 4554, 2001, pp. 141–146.
A. Sanfeliu and K. S. Fu, 2003, “A distance measure between attributed relational graphs for pattern recognition”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, 1983, pp. 353–363.
T. H. Cormen, 1997, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, The MIT Press: Cambridge, Massachusetts, 1997.
T. P. and C. Siva Ram Murthy, 1999, “Optimal task allocation in distributed systems by graph matching and state space search”, The Journal of Systems and Software, Vol. 46, 1999, pp. 59–75.
Avinash N. Bhute (ACM M’09, CSI M’09, ISTE LM’05) is an Assistant Professor at Sinhgad College of Engineering, Pune. He received his bachelor's degree in Computer Science and Engineering from Amravati University in 1999 and his M.Tech. from Bharati Vidhyapith, Pune in 2005. He is an active member of ACM, CSI, ISTE, and the Indian Science Congress Association. He recently reviewed the book “Software Engineering”, 7th Edition, by Stephen Schach, McGraw Hill. He has published seven international papers and 12 national papers. His current research interests include knowledge discovery in databases, web mining, ontology, artificial intelligence, and software engineering.

Harsha Bhute (CSI M’03, ISTE M’08) is a student pursuing a Master of Engineering in Computer Network at S.C.O.E., Pune. She received her bachelor's degree in Computer Science and Engineering from Amravati University, Maharashtra, India in 1999. She was a lecturer at Govt. Polytechnic for 6 years. She has been an active member of CSI since 2003. She has published two international and seven national papers. Her areas of interest include mobile communication, wireless networks, and system programming.

Dr. B. B. Meshram (CSI LM’95, IE ’95) is Professor and Head of the Computer Technology Department at VJTI, Matunga, Mumbai, Maharashtra State, India. He received his bachelor's, master's, and doctoral degrees in computer engineering. He has participated in more than 16 refresher courses to meet the needs of current technology. He has chaired more than 10 AICTE STTP programs. He has received appreciation for lectures at Manchester and Cardiff University, UK. He has contributed more than 50 research papers to national and international journals. He is a life member of the Computer Society of India and the Institute of Engineers. His current research interests are in databases, data warehousing, data mining, intelligent systems, web engineering, and network security.
