
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 4, APRIL 2011

A Link-Analysis Extension of Correspondence Analysis for Mining Relational Databases


Luh Yen & Marco Saerens, Information Systems Research Unit (ISYS/LSM) & Machine Learning Group (MLG), Université catholique de Louvain (UCL), Belgium
Email: {luh.yen,marco.saerens}@uclouvain.be

François Fouss, Management Department (LSM), Facultés Universitaires Catholiques de Mons (FUCaM), Belgium
Email: francois.fouss@fucam.ac.be

Abstract—This work introduces a link-analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random-walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller, Markov chain containing only the elements of interest and preserving the main characteristics of the initial chain is extracted by stochastic complementation [41]. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion-map subspace [42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined and to multiple correspondence analysis when the database takes the form of a simple star schema. On the other hand, a kernel version of the diffusion-map distance, generalizing the basic diffusion-map distance to directed graphs, is also introduced and the links with spectral clustering are discussed. Several datasets are analyzed by using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.

I. INTRODUCTION

Traditional statistical, machine-learning, pattern-recognition, and data-mining approaches (see, for example, [28]) usually assume a random sample of independent objects from a single relation. Many of these techniques have gone through the extraction of knowledge from data (typically extracted from relational databases), almost always leading, in the end, to the classical double-entry tabular format containing features for a sample of the population. These features are then used in order to learn from the sample, provided that it is representative of the population as a whole. However, real-world data coming from many fields (such as the World Wide Web, marketing, social networks, or biology; see [16]) are often multi-relational and interrelated. The work recently performed in statistical relational learning [22], aiming at working with such datasets, incorporates research topics such as link analysis [36], [63], web mining [1], [9], social-network analysis [8], [66], or graph mining [11]. All these research fields intend to find and exploit links between objects (in addition to features, as is also the case in the field of spatial statistics [13], [53]), which could be of various types and involved in different kinds of relationships. The focus of the techniques has moved from the analysis of the features describing each instance belonging to the population of interest (attribute-value analysis) to the analysis of the links existing between these instances (relational analysis), in addition to the features.

This paper proposes precisely a link-analysis-based technique allowing us to discover relationships existing between elements of a relational database or, more generally, a graph. More specifically, this work is based on a random walk through the database defining a Markov chain having as many states as elements in the database. Suppose, for instance, we are interested in analyzing the relationships between elements contained in two different tables of a relational database. To this end, a two-step procedure is developed. First, a much smaller, reduced Markov chain, only containing the elements of interest (typically the elements contained in the two tables) and preserving the main characteristics of the initial chain, is extracted by stochastic complementation [41]. An efficient algorithm for extracting the reduced Markov chain from the large, sparse Markov chain representing the database is proposed. Then, the reduced chain is analyzed by, for instance, projecting the states in the subspace spanned by the right eigenvectors of the transition matrix ([42], [43], [46], [47]; called the basic diffusion map in this paper), or by computing a kernel principal-component analysis [54], [57] on a diffusion-map kernel computed from the reduced graph and visualizing the results. Indeed, a valid graph kernel based on the diffusion-map distance, extending the basic diffusion map to directed graphs, is introduced. The motivations for developing this two-step procedure are twofold. First, the computation would be cumbersome, if not impossible, when dealing with the complete database. Second, in many situations, the analyst is not interested in studying all the relationships between all elements of the database, but only a subset of them. Moreover, if the whole set of elements in the database were analyzed, the resulting mapping would be averaged out by the numerous relationships and elements we are not interested in; for instance, the principal axes would be completely different and would no longer reflect the relationships between the elements of interest only.


Therefore, reducing the Markov chain by stochastic complementation allows us to focus the analysis on the elements and relationships we are interested in. Interestingly enough, when dealing with a bipartite graph (i.e., the database only contains two tables linked by one relation), stochastic complementation followed by a basic diffusion map is exactly equivalent to simple correspondence analysis. On the other hand, when dealing with a star-schema database (one central table linked to several tables by different relations), this two-step procedure reduces to multiple correspondence analysis. The proposed methodology therefore extends correspondence analysis to the analysis of a relational database. In short, this paper has three main contributions: (i) a two-step procedure for analyzing weighted graphs or relational databases is proposed; (ii) it is shown that the suggested procedure extends correspondence analysis; and (iii) a kernel version of the diffusion-map distance, applicable to directed graphs, is introduced. The paper is organized as follows. Section II introduces the basic diffusion-map distance and its natural kernel on a graph. Section III introduces some basic notions of stochastic complementation of a Markov chain. Section IV presents the two-step procedure for analyzing the relationships between elements of different tables and establishes the equivalence between the proposed methodology and correspondence analysis in some special cases. Section V presents some illustrative examples involving several datasets, while Section VI is the conclusion.

II. THE DIFFUSION-MAP DISTANCE AND ITS NATURAL KERNEL MATRIX

In this section, the basic diffusion-map distance [42], [43], [46], [47] is briefly reviewed and some of its theoretical justifications are detailed. Then, a natural kernel matrix is derived from the diffusion-map distance, providing a meaningful similarity measure between nodes.

A. Notations and definitions

Let us consider that we are given a weighted, directed graph G, possibly defined from a relational database in the following, obvious, way: each element of the database is a node and each relation corresponds to a link (for a detailed procedure allowing to build a graph from a relational database, see [20]). The associated adjacency matrix A is defined in a standard way as a_ij = [A]_ij = w_ij if node i is connected to node j, and a_ij = 0 otherwise (say G has n nodes in total). The weight w_ij > 0 of the edge connecting node i and node j is set to a larger value when the affinity between i and j is important. If no information about the strength of the relationship is available, we simply set w_ij = 1 (unweighted graph). We further assume that there are no self-loops (w_ii = 0 for i = 1, ..., n) and that the graph has a single connected component; that is, any node can be reached from any other node. If the graph is not connected, there is no relationship at all between the different components and the analysis has to be performed separately on each of them.

It is therefore to be hoped that the graph modeling the relational database does not contain too many disconnected components (this can be considered as a limitation of our method). Partitioning a graph into connected components from its adjacency matrix can be done in O(n^2) (see for instance [56]). Based on the adjacency matrix, the Laplacian matrix L of the graph is defined in the usual manner: L = D - A, where D = Diag(a_i.) is the generalized outdegree matrix with diagonal entries d_ii = [D]_ii = a_i. = Σ_{j=1}^{n} a_ij. The column vector d = diag(a_i.) is simply the vector containing the outdegree of each node. Furthermore, the volume of the graph is defined as v_g = vol(G) = Σ_{i=1}^{n} d_ii = Σ_{i,j=1}^{n} a_ij. Usually, we are dealing with symmetric adjacency matrices, in which case L is symmetric and positive semidefinite (see for instance [10]). From this graph, we define a natural random walk through the graph in the usual way by associating a state to each node and assigning a transition probability to each link. Thus, a random walker can jump from element to element and each element therefore represents a state of the Markov chain describing the sequence of visited states. A random variable s(t) contains the current state of the Markov chain at time step t: if the random walker is in state i at time t, then s(t) = i. The random walk is defined by the following single-step transition probabilities of jumping from any state i = s(t) to an adjacent state j = s(t+1): P(s(t+1) = j | s(t) = i) = a_ij / a_i. = p_ij. The transition probabilities only depend on the current state and not on the past ones (first-order Markov chain). Since the graph is connected, the Markov chain is irreducible, that is, every state can be reached from any other state. If we denote the probability of being in state i at time t by x_i(t) = P(s(t) = i) and we define P as the transition matrix with entries p_ij, the evolution of the Markov chain is characterized by x(t+1) = P^T x(t), with x(0) = x_0, where T denotes the matrix transpose. This provides the state probability distribution x(t) = [x_1(t), x_2(t), ..., x_n(t)]^T at time t once the initial distribution x(0) is known. Moreover, we will denote by x_i(t) the column vector containing the probability distribution of finding the random walker in each state at time t when starting from state i at time t = 0. That is, the entries of the vector x_i(t) are x_ij(t) = P(s(t) = j | s(0) = i), j = 1, ..., n. Since the Markov chain represents a random walk on the graph G, the transition matrix is simply P = D^{-1} A. Moreover, if the adjacency matrix A is symmetric, the Markov chain is reversible and the steady-state vector, π, is simply proportional to the degree of each state [48], d (which has to be normalized in order to obtain a valid probability distribution). Moreover, this implies that all the eigenvalues (both left and right) of the transition matrix are real.
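To make the notation of this section concrete, the following short Python (numpy) sketch builds the adjacency, outdegree, Laplacian, and transition matrices for a small graph and checks that the degree-proportional vector is the stationary distribution of the associated random walk. This is an illustration added here, not code from the paper; the four-node graph and all variable names are hypothetical.

# Illustration of the notation of Section II-A on a small undirected graph.
import numpy as np

# Hypothetical 4-node undirected graph (symmetric adjacency matrix, no self-loops).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = A.sum(axis=1)                # outdegrees a_i.
D = np.diag(d)                   # generalized outdegree matrix
L = D - A                        # Laplacian matrix
P = np.linalg.inv(D) @ A         # transition matrix P = D^{-1} A (rows sum to one)

# For an undirected graph, the steady-state vector is proportional to the degrees.
pi = d / d.sum()
assert np.allclose(pi @ P, pi)   # stationarity: pi^T P = pi^T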


B. The diffusion-map distance

In our two-step procedure, a diffusion-map projection, based on the so-called diffusion-map distance, will be performed after stochastic complementation. Now, since the original definition of the diffusion-map distance deals only with undirected, aperiodic Markov chains, it will first be assumed in this section that the reduced Markov chain, obtained after stochastic complementation, is indeed undirected, aperiodic, and connected, in which case the corresponding random walk defines an irreducible reversible Markov chain. Notice that it is not required that the original adjacency matrix be irreducible and reversible; these assumptions are only required for the reduced adjacency matrix obtained after stochastic complementation (see the discussion in Section III-A). Moreover, some of these assumptions will be relaxed in Section II-C, when introducing the diffusion-map kernel, which is well-defined even if the graph is directed. The original derivation of the diffusion map, introduced independently by Nadler et al. and Pons & Latapy [42], [43], [46], [47], is detailed in this section, but other interpretations of this mapping appeared in the literature (see the discussion at the end of this section). Moreover, the basic diffusion map is closely related to correspondence analysis, as detailed in Section IV. For an application of the basic diffusion map to dimensionality reduction, see [35].

Since P is aperiodic, irreducible, and reversible, it is well known that all the eigenvalues of P are real and the eigenvectors are also real (see, e.g., [7], p. 202). Moreover, all its eigenvalues lie in [-1, +1], and the eigenvalue 1 has multiplicity one [7]. With these assumptions, Nadler et al. and Pons & Latapy [42], [43], [46], [47] proposed to use as distance between states i and j,

d_ij^2(t) = Σ_{k=1}^{n} (x_ik(t) - x_jk(t))^2 / π_k    (1)
          ∝ (x_i(t) - x_j(t))^T D^{-1} (x_i(t) - x_j(t))    (2)

since, for a simple random walk on an undirected graph, the entries of the steady-state vector are proportional to the generalized degree of each node (the total of the elements of the corresponding row of the adjacency matrix [48]). This distance, called the diffusion-map distance, corresponds to the sum of the squared differences between the probability distributions of being in any state after t transitions when starting (i.e., at time t = 0) from two different states, state i and state j. In other words, two nodes are similar when they diffuse through the network, and thus influence the network, in a similar way. This is a natural definition which quantifies the similarity between two states based on the evolution of the state probability distribution. Of course, when i = j, d_ij(t) = 0. Nadler et al. [42], [43] showed that this distance measure has a simple expression in terms of the right eigenvectors of P:

d_ij^2(t) = Σ_{k=1}^{n} λ_k^{2t} (u_ki - u_kj)^2    (3)

where u_ki = [u_k]_i is component i of the kth right eigenvector, u_k, of P and λ_k is its corresponding eigenvalue. As usual, the λ_k are ordered by decreasing modulus, so that the contributions to the sum in Equation (3) decrease with k. On the other hand, x_i(t) can easily be expressed [42], [43] in the space spanned by the left eigenvectors of P, the v_k,

x_i(t) = (P^T)^t e_i = Σ_{k=1}^{n} λ_k^t v_k u_k^T e_i = Σ_{k=1}^{n} (λ_k^t u_ki) v_k    (4)

where e_i is the ith column of I, e_i = [0, ..., 0, 1, 0, ..., 0]^T, with the 1 in position i. The resulting mapping aims to represent each state i in an n-dimensional Euclidean space with coordinates (|λ_2^t| u_2i, |λ_3^t| u_3i, ..., |λ_n^t| u_ni), as in Equation (4) (the first right eigenvector is trivial and is therefore discarded). Dimensions are ordered by decreasing modulus, |λ_k^t|. This original mapping introduced by Nadler and coauthors will be referred to as the basic diffusion map in this paper, in contrast with the diffusion-map kernel (K_DM) that will be introduced in the next section. The weighting factor, D^{-1}, in Equation (2) is necessary to obtain Equation (3) since the v_k are not orthogonal. Instead, it can easily be shown that we have v_i^T D^{-1} v_j = δ_ij, which aims to redefine the inner product as <x, y> = x^T D^{-1} y, where the metric of the space is D^{-1} [7]. Notice also that there is a close relationship between spectral clustering (the mapping provided by the normalized Laplacian matrix; see for instance [15], [45], [65]) and the basic diffusion map. Indeed, a common embedding of the nodes consists of representing each node by the coordinates of the smallest nontrivial eigenvectors (corresponding to the smallest eigenvalues) of the normalized Laplacian matrix, L~ = D^{-1/2} L D^{-1/2}. More precisely, if u_k is the kth largest right eigenvector of the transition matrix P and l_k is the kth smallest non-trivial eigenvector of the normalized Laplacian matrix L~, we have (see Appendix A for the proof and details)

u_k = D^{-1/2} l_k    (5)

and l_k is associated to eigenvalue (1 - λ_k). A subtle, still important, difference between this mapping and the one provided by the basic diffusion map concerns the order in which the dimensions are sorted. Indeed, for the basic diffusion map, the eigenvalues of the transition matrix P are ordered by decreasing modulus. For this spectral-clustering model, the eigenvalues are sorted by decreasing value (and not modulus), which can result in a different representation if P has large negative eigenvalues. This shows that the mappings provided by spectral clustering and by the basic diffusion map are closely related.
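The following Python (numpy) sketch, continuing the hypothetical example above (it reuses the matrices A, P, and pi and is not code from the paper), illustrates Equations (1), (3), and (4) numerically: the diffusion-map distance computed directly from the t-step distributions coincides with its spectral expression, provided the right eigenvectors of P are normalized so that v_i^T Diag(π)^{-1} v_j = δ_ij, which is one convenient choice consistent with the discussion above.

# Numerical illustration of Equations (1) and (3) (a sketch, not the authors' code).
import numpy as np

t = 3
Pt = np.linalg.matrix_power(P, t)        # row i of P^t is x_i(t) = (P^T)^t e_i

# Diffusion-map distance between states i and j, Equation (1).
i, j = 0, 3
d2_direct = np.sum((Pt[i, :] - Pt[j, :]) ** 2 / pi)

# Spectral expression, Equation (3). For a reversible chain,
# Diag(pi)^{1/2} P Diag(pi)^{-1/2} is symmetric; its orthonormal eigenvectors w_k
# give right eigenvectors u_k = Diag(pi)^{-1/2} w_k of P.
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
lam, W = np.linalg.eigh(S)
U = np.diag(1.0 / np.sqrt(pi)) @ W
d2_spectral = np.sum(lam ** (2 * t) * (U[i, :] - U[j, :]) ** 2)
assert np.allclose(d2_direct, d2_spectral)

# Basic diffusion-map coordinates: |lambda_k^t| u_k, ordered by decreasing modulus,
# discarding the trivial first eigenvector.
order = np.argsort(-np.abs(lam))
coords = ((np.abs(lam[order]) ** t) * U[:, order])[:, 1:]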


Notice that at least three other justifications of this eigenvector-based mapping appeared before in the literature and are briefly reviewed here. (i) It has been shown that the entries of the subdominant right eigenvector of the transition matrix P of an aperiodic, irreducible, reversible Markov chain can be interpreted as a relative distance to its stationary distribution (see [60], Section 1.8.1, or the appendix of [18]). This distance may be regarded as an indicator of the number of iterations required to reach this equilibrium position, if the system starts in the state from which the distance is being measured. These quantities are only relative, but they serve as a means of comparison among the states [60]. (ii) The same embedding can be obtained by minimizing the criterion Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij (z_i - z_j)^2 = 2 z^T L z [10], [26], subject to z^T D z = 1, therefore penalizing the nodes having a large outdegree [74]. Here, z_i is the coordinate of node i on the axis and the vector z contains the z_i. The problem sums up to finding the smallest non-trivial eigenvector of (I - P), which is the same as the second largest eigenvector of P, and this is once more similar to the basic diffusion map. Notice that this mapping has been rediscovered and reinterpreted by Belkin & Niyogi [2], [3] in the context of nonlinear dimensionality reduction. (iii) The last justification of the basic diffusion map, introduced in [15], is based on the concept of 2-way partitioning of a graph [58]. Minimizing a normalized cut criterion while imposing that the membership vector is centered with respect to the metric D leads to exactly the same embedding as in the previous interpretation. Moreover, some authors [72] showed that applying a specific cut criterion to bipartite graphs leads to simple correspondence analysis. Notice that the second justification (i) leads to exactly the same mapping as the basic diffusion map, while the third and fourth justifications, (ii) and (iii), lead to the same embedding space, but with a possibly different ordering and rescaling of the axes. More generally, these mappings are, of course, also related to graph embedding and nonlinear dimensionality reduction, which have been highly studied topics in recent years, especially in the manifold-learning community (see, e.g., [21], [30], [37], [67], for recent surveys or developments). Experimental comparisons with popular nonlinear dimensionality reduction techniques are presented in the experimental section.

C. A kernel view of the diffusion-map distance

We now introduce¹ a variant of the basic diffusion-map model introduced by Nadler et al. and Pons & Latapy [42], [43], [46], [47], which is still well-defined when the original graph is directed. In other words, we do not assume that the initial adjacency matrix A is symmetric in this section. This extension presents several advantages in comparison with the original basic diffusion map: (i) the kernel version of the diffusion map is applicable to directed graphs, while the original model is restricted to undirected graphs; (ii) the extended model induces a valid kernel on a graph; (iii) the resulting matrix has the nice property of being symmetric positive definite, so that the spectral decomposition can be computed on a symmetric positive definite matrix; and (iv) the resulting mapping is displayed in a Euclidean space in which the coordinate axes are set in the directions of maximal variance by using (uncentered, if the kernel is not centered) kernel principal-component analysis [54], [57] or multidimensional scaling [6], [12]. This kernel-based technique will be referred to as the diffusion-map kernel PCA or the KDM PCA.

Let us define W = (Diag(π))^{-1}, where π is the stationary distribution of the finite Markov chain. Remember that if the adjacency matrix is symmetric, the stationary distribution of the natural random walk is proportional to the degree of the nodes, so that W ∝ D^{-1} [48]. The diffusion-map distance is therefore redefined as

d_ij^2(t) = (x_i(t) - x_j(t))^T W (x_i(t) - x_j(t))    (6)

Since x_i(t) = (P^T)^t e_i, Equation (6) becomes

d_ij^2(t) = (e_i - e_j)^T P^t W (P^T)^t (e_i - e_j)    (7)
          = (e_i - e_j)^T K_DM (e_i - e_j)
          = [K_DM]_ii + [K_DM]_jj - [K_DM]_ij - [K_DM]_ji

where we defined

K_DM(t) = P^t W (P^T)^t    (8)

referred to as the diffusion-map kernel. Thus, the matrix K_DM is the natural kernel (inner-product matrix) associated to the squared diffusion-map distances [6], [12]. It is clear that this matrix is symmetric positive semidefinite and contains inner products in a Euclidean space where the node vectors are exactly separated by d_ij(t) (the proof is straightforward and can be found in [17], Appendix D, where the same reasoning was applied to the commute-time kernel). It is therefore a valid kernel matrix. Performing an (uncentered, if the kernel is not centered) principal-component analysis (PCA) in this embedding space aims to change the coordinate system by putting the new coordinate axes in the directions of maximal variance. From the theory of classical multidimensional scaling [6], [12], it suffices² to compute the m first eigenvalues/eigenvectors of K_DM and to consider that these eigenvectors, multiplied by the square root of the corresponding eigenvalues, are the coordinates of the nodes in the principal-component space spanned by these eigenvectors (see [6], [12]; for a similar application with the commute-time kernel, see [51]; for the general definition of kernel PCA, see [54], [55]). In other words, we compute the m first eigenvalues/eigenvectors of K_DM: K_DM w_k = λ_k w_k, where the w_k are orthonormal. Then, we represent each node i in an m-dimensional Euclidean space with coordinates (√λ_1 w_1i, √λ_2 w_2i, ..., √λ_m w_mi), where w_ki = [w_k]_i corresponds to element i of the eigenvector w_k associated to eigenvalue λ_k; this is the vector representation of state i in the principal-component space. It can easily be shown that, when the initial graph is undirected, this PCA based on the kernel matrix K_DM is similar to the diffusion map introduced in the last section, up to an isometry. Indeed, by the classical theory of multidimensional scaling [6], [12], the eigenvectors of the kernel matrix K_DM, multiplied by the square root of the corresponding eigenvalues, define coordinates in a Euclidean space where the observations are exactly separated by the distance d_ij(t). Since this is exactly the property of the basic diffusion map (Equation (3)), both representations are similar up to an isometry. Notice that the resulting kernel matrix can easily be centered [40] by HK_DM H with H = I - ee^T/n, where e is a column vector all of whose elements are 1 (i.e., e = [1, 1, ..., 1]^T); H is called the centering matrix. This aims to place the origin of the coordinates of the diffusion map at the center of gravity of the node vectors.
2 The proof must be slightly adapted in order to account for the fact that the kernel matrix is not centered, as in [50].

1 Part of the material of this section was published in a conference paper [19] presenting a similar idea.
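As a complement to the derivation of Section II-C, here is a small Python (numpy) sketch of the diffusion-map kernel of Equation (8) and of the resulting kernel PCA embedding. It is only an illustration under the same hypothetical example as before (it reuses A, P, and pi from the earlier sketches); the choices of t and of the number of dimensions m are arbitrary.

# Sketch of the diffusion-map kernel (Equation (8)) and of the K_DM PCA.
import numpy as np

t, m = 3, 2
n = P.shape[0]
W_mat = np.diag(1.0 / pi)                 # W = Diag(pi)^{-1}
Pt = np.linalg.matrix_power(P, t)
K = Pt @ W_mat @ Pt.T                     # K_DM(t) = P^t W (P^T)^t

# Squared diffusion-map distance recovered from the kernel, Equation (7).
i, j = 0, 3
d2_from_kernel = K[i, i] + K[j, j] - K[i, j] - K[j, i]

# Optional centering H K H with H = I - e e^T / n.
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

# Kernel PCA: eigenvectors of the kernel scaled by the square roots of the eigenvalues.
vals, vecs = np.linalg.eigh(Kc)
order = np.argsort(-vals)
vals, vecs = vals[order], vecs[:, order]
coords = vecs[:, :m] * np.sqrt(np.maximum(vals[:m], 0.0))   # row i = embedding of node i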



D. Links between the basic diffusion map and the kernel diffusion map

While both represent the graph in a Euclidean space where the nodes are exactly separated by the distances defined by Equation (2), and thus provide exactly the same embedding, the mappings are, however, different for each method. Indeed, the coordinate system in the embedding space differs for each method. In the case of the basic diffusion map, the eigenvector u_k represents the kth coordinate of the nodes in the embedding space. However, in the case of the diffusion-map kernel, since a kernel PCA is performed, the first coordinate axis corresponds instead to the direction of maximal variance in terms of the diffusion-map distance (Equation (2)). Therefore, the coordinate system used by the diffusion-map kernel is actually different from the one used by the basic diffusion map. Putting the coordinate system in the directions of maximal variance, and thus computing a kernel PCA, is probably more natural. We now show that there is a close relationship between the two representations. Indeed, from Equation (4), we easily observe that the mapping provided by the basic diffusion map remains the same as a function of the parameter t, up to a scaling of each coordinate/dimension (only the scaling changes). This is in fact not the case for the kernel diffusion map. In fact, the mapping provided by the diffusion-map kernel tends to the one provided by the basic diffusion map for growing values of t in the case of an undirected graph. Indeed, it can be shown that the kernel matrix can be rewritten as K_DM ∝ U Λ^{2t} U^T, where U contains the right eigenvectors of P, the u_k, as columns and Λ is the diagonal matrix of the corresponding eigenvalues. In this case, when t is large, every additional dimension has a very small contribution in comparison with the previous ones. This fact will be illustrated in the experimental section (i.e., Section V). In practice, we observed that the two mappings are already almost identical when t is equal to 5 or 6 (see for instance Figure 3 in Section V).

III. ANALYZING RELATIONS BY STOCHASTIC COMPLEMENTATION

In this section, the concept of stochastic complementation [41] is briefly reviewed and applied to the analysis of a graph through the random-walk-on-a-graph model. From the initial graph, a reduced graph containing only the nodes of interest, and which is much easier to analyze, is built.

A. Computing a reduced Markov chain by stochastic complementation

Suppose we are interested in analyzing the relationship between two sets of nodes of interest. A reduced Markov chain can be computed from the original chain in the following manner. First, the set of states is partitioned into two subsets: S_1, corresponding to the nodes of interest to be analyzed, and S_2, corresponding to the remaining nodes, to be hidden. We further denote by n_1 and n_2 (with n_1 + n_2 = n) the number of states in S_1 and S_2, respectively; usually n_2 ≫ n_1. Thus, the transition matrix is partitioned as

         S_1    S_2
P = S_1 [ P_11  P_12 ]    (9)
    S_2 [ P_21  P_22 ]

The idea is to censor the useless elements by masking them during the random walk. That is, during any random walk on the original chain, only the states belonging to S_1 are recorded; all the other reached states, belonging to the subset S_2, are censored and therefore not recorded. One can show that the resulting reduced Markov chain obtained by censoring the states of S_2 is the stochastic complement of the original chain [41]. Thus, performing a stochastic complementation allows us to focus the analysis on the tables and elements representing the factors/features of interest. The reduced chain inherits all the characteristics of the original chain; it simply censors the useless states. The stochastic complement P_c of the chain, partitioned as in Equation (9), is defined as (see for instance [41])

P_c = P_11 + P_12 (I - P_22)^{-1} P_21    (10)

It can be shown that the matrix P_c is stochastic, that is, the sum of the elements of each row is equal to 1 [41]; it therefore corresponds to a valid transition matrix between the states of interest. We will assume that this resulting stochastic matrix is aperiodic and irreducible, that is, primitive [48]. Indeed, Meyer showed in [41] that if the initial chain is irreducible or aperiodic, so is the reduced chain. Moreover, even if the initial chain is periodic, the reduced chain frequently becomes aperiodic by stochastic complementation [41]. One way to ensure the aperiodicity of the reduced chain is to introduce a small positive quantity on the diagonal of the adjacency matrix A, which does not fundamentally change the model. Then, P has nonzero diagonal entries and the stochastic complement, P_c, is primitive (see [41], Theorem 5.1). Let us show that the reduced chain also represents a random walk on a reduced graph G_c containing only the nodes of interest. We therefore partition the matrices A, D, and L as

A = [ A_11  A_12 ]    D = [ D_1  O  ]    L = [ L_11  L_12 ]
    [ A_21  A_22 ]        [ O    D_2 ]       [ L_21  L_22 ]

from which we easily find P_c = D_1^{-1} (A_11 + A_12 (D_2 - A_22)^{-1} A_21) = D_1^{-1} A_c, where we defined A_c = A_11 + A_12 (D_2 - A_22)^{-1} A_21. Notice that if A is symmetric (the graph G is undirected), A_c is symmetric as well. Since P_c is stochastic, we deduce that the diagonal matrix D_1 contains the row sums of A_c and that the entries of A_c are positive. The reduced chain thus corresponds to a random walk on the graph G_c whose adjacency matrix is A_c. Moreover, the corresponding Laplacian matrix of the graph G_c can be obtained by

L_c = D_1 - A_c = (D_1 - A_11) - A_12 (D_2 - A_22)^{-1} A_21 = L_11 - L_12 L_22^{-1} L_21    (11)

since L_12 = -A_12 and L_21 = -A_21. If the adjacency matrix A is symmetric, L_11 (L_22) is positive definite since it is obtained from the positive semidefinite matrix L by deleting the rows associated to S_2 (S_1) and the corresponding columns, therefore eliminating the linear relationship. Notice that L_c is simply the Schur complement of L_22 [27]. Thus, for an undirected graph G, instead of directly computing P_c, it is more interesting to compute L_c, which is symmetric positive definite, and from which P_c can easily be deduced: P_c = I - D_1^{-1} L_c, directly following from L_c = D_1 - A_c; see the next section for a proposal of an iterative computation of L_c.
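The following Python (numpy) sketch illustrates Equations (10) and (11) on a hypothetical five-node graph (it is an added illustration, not the authors' code): the stochastic complement is computed directly, its rows are checked to sum to one, and the equivalent reduced adjacency and Laplacian matrices are verified.

# Sketch of stochastic complementation (Equations (10)-(11)) on a toy partition.
import numpy as np

# New hypothetical symmetric adjacency matrix: nodes 0,1 are the nodes of interest (S1),
# nodes 2,3,4 are censored (S2).
A = np.array([[0., 0., 1., 1., 0.],
              [0., 0., 0., 1., 1.],
              [1., 0., 0., 1., 0.],
              [1., 1., 1., 0., 1.],
              [0., 1., 0., 1., 0.]])
s1, s2 = np.array([0, 1]), np.array([2, 3, 4])

D = np.diag(A.sum(axis=1))
P = np.linalg.inv(D) @ A
P11, P12 = P[np.ix_(s1, s1)], P[np.ix_(s1, s2)]
P21, P22 = P[np.ix_(s2, s1)], P[np.ix_(s2, s2)]

# Stochastic complement, Equation (10): a valid transition matrix on S1.
Pc = P11 + P12 @ np.linalg.inv(np.eye(len(s2)) - P22) @ P21
assert np.allclose(Pc.sum(axis=1), 1.0)

# Equivalent reduced adjacency and Laplacian matrices of Section III-A.
A11, A12 = A[np.ix_(s1, s1)], A[np.ix_(s1, s2)]
A21, A22 = A[np.ix_(s2, s1)], A[np.ix_(s2, s2)]
D1, D2 = np.diag(A.sum(axis=1)[s1]), np.diag(A.sum(axis=1)[s2])
Ac = A11 + A12 @ np.linalg.inv(D2 - A22) @ A21
Lc = D1 - Ac
assert np.allclose(Pc, np.linalg.inv(D1) @ Ac)

Note that, in this toy example, the two nodes of interest are not directly linked in A, yet the complementation creates a link between them through the censored nodes, which is exactly the behavior exploited by the two-step procedure.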


B. Iterative computation of L_c for large, sparse graphs

In order to compute L_c from Equation (11), we need to evaluate L_22^{-1} L_21. We now show how this matrix can be computed iteratively by using, for instance, a simple Jacobi-like or Gauss-Seidel-like algorithm. In order to simplify the development, let us pick one column, say column j, of L_21 and denote it as b_j; there are only n_1 such columns. The problem is thus to compute x_j = L_22^{-1} L_21 e_j (column j of L_22^{-1} L_21) such that

L_22 x_j = (D_2 - A_22) x_j = b_j    (12)

By transforming this last equation, we obtain x_j = D_2^{-1} (A_22 x_j + b_j) = P_22 x_j + D_2^{-1} b_j, which leads to the following iterative updating rule

x̂_j ← P_22 x̂_j + D_2^{-1} b_j    (13)

where x̂_j is an approximation of x_j. This procedure converges since the matrix L_22 is irreducible and has weak diagonal dominance [70]. Initially, one can start, for instance, with x̂_j = 0. The computation of Equation (13) fully exploits the sparseness since only the non-zero entries of P_22 contribute to the update. Of course, D_2^{-1} b_j is pre-calculated. Once all the x̂_j have been computed in turn (only n_1 in total), the matrix X containing the x̂_j as columns is constructed; we thus have L_22^{-1} L_21 = X. The final step consists of computing L_c = L_11 - L_12 X. The time complexity of this algorithm is n_1 times the complexity of solving one sparse system of n_2 linear equations. Now, if the graph is undirected, the matrix L_22 is positive definite. Recent numerical analysis techniques have shown that positive definite sparse linear systems in an n-by-n matrix with m nonzero entries can be solved in time O(mn), for instance by using conjugate gradient techniques [59], [64]. Thus, apart from the matrix multiplication L_12 X, the complexity is O(n_1 n_2 m), where m is the number of non-zero entries of L_22. In practice, the matrix is usually very sparse, with m = εn_2 for some small ε, resulting in O(ε n_1 n_2^2) with n_1 ≪ n_2. The second step, i.e., the mapping by the basic diffusion map or by the diffusion-map kernel PCA, has a negligible complexity since it is performed on a reduced n_1 × n_1 matrix. Of course, more sophisticated iterative algorithms can be used as well (see for instance [25], [49], [59]).
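Before moving to Section IV, here is a minimal Python (numpy) sketch of the Jacobi-like iteration of Equation (13) for one column of L_21, reusing the matrices A21, A22, and D2 of the previous toy sketch (again an illustration, not the authors' implementation; a dense toy matrix is used even though the point of the scheme is to exploit sparsity).

# Sketch of the iterative scheme of Equation (13) for one column b_j of L_21.
import numpy as np

L21 = -A21                                 # L_21 = -A_21
L22 = D2 - A22
D2_inv = np.linalg.inv(D2)
P22 = D2_inv @ A22

j = 0
b = L21[:, j]
x = np.zeros_like(b)                       # start from x_j = 0
for _ in range(1000):                      # x <- P22 x + D2^{-1} b_j
    x_new = P22 @ x + D2_inv @ b
    if np.max(np.abs(x_new - x)) < 1e-12:
        x = x_new
        break
    x = x_new

assert np.allclose(L22 @ x, b, atol=1e-8)  # x solves L_22 x_j = b_j

In a sparse implementation, the product P22 @ x would of course be carried out with a sparse matrix-vector multiplication, which is where the announced complexity gain comes from.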
IV. ANALYZING THE REDUCED MARKOV CHAIN WITH THE BASIC DIFFUSION MAP: LINKS WITH CORRESPONDENCE ANALYSIS

Once a reduced Markov chain containing only the nodes of interest has been obtained, one may want to visualize the graph in a low-dimensional space preserving, as accurately as possible, the proximity between the nodes. This is the second step of our procedure. For this purpose, we propose to use the diffusion maps introduced in Section II-B and Section II-C. Interestingly enough, computing a basic diffusion map on the reduced Markov chain is equivalent to correspondence analysis in two special cases of interest: a bipartite graph and a star-schema database. Therefore, the proposed two-step procedure can be considered as a generalization of correspondence analysis. Correspondence analysis (see for instance [23], [24], [31], [62]) is a widely used multivariate statistical analysis technique which is still the subject of much research effort (see for instance [5], [29]). As stated for instance in [34], simple correspondence analysis aims to provide insights into the dependence of two categorical variables. The relationships between the attributes of the two categorical variables are usually analyzed through a biplot [23], a two-dimensional representation of the attributes of both variables. The coordinates of the attributes on the biplot are obtained by computing the eigenvectors of a matrix. Many different derivations of simple correspondence analysis have been developed, allowing for different interpretations of the technique, such as maximizing the correlation between two discrete variables, reciprocal averaging, categorical discriminant analysis, scaling and quantification of categorical variables, performing a principal-component analysis based on the chi-square distance, optimal scaling, dual scaling, etc. [34]. Multiple correspondence analysis is the extension of simple correspondence analysis to a larger number of categorical variables.

A. Simple correspondence analysis

As stated before, simple correspondence analysis (see for instance [23], [24], [31], [62]) aims to study the relationships between two random variables x1 and x2 (the features), each having mutually exclusive, categorical, outcomes, denoted as attributes. Suppose the variable x1 has n_1 observed attributes and the variable x2 has n_2 observed attributes, each attribute being a possible outcome value for the feature. An experimenter makes a series of measurements of the features x1, x2 on a sample of v_g individuals and records the outcomes in a frequency (also called contingency) table, f_ij, containing the number of individuals having both attribute x1 = i and attribute x2 = j. In our relational database, this corresponds to two tables, each table corresponding to one variable and containing the set of observed attributes (outcomes) of the variable. The two tables are linked by a single relation (see Figure 1 for a simple example). This situation can be modeled as a bipartite graph where each node corresponds to an attribute and links are only defined between attributes of x1 and attributes of x2. The weight associated to each link is set to w_ij = f_ij, quantifying the strength of the relationship between i and j. The associated n × n adjacency matrix and the corresponding transition matrix can be factorized as

A = [ O     A_12 ]    P = [ O     P_12 ]    (14)
    [ A_21  O    ]        [ P_21  O    ]


Fig. 1. Trivial example of a single relation between two variables, Document and Word. The Document table contains outcomes of documents while the Word table contains outcomes of words.


where O is a matrix full of zeroes. Suppose we are interested in studying the relationships between the attributes of the first variable x1, which correspond to the n_1 first elements. By stochastic complementation (see Equation (10)), we easily obtain P_c = P_12 P_21 = D_1^{-1} A_12 D_2^{-1} A_21. Computing the diffusion map for t = 1 aims to extract the subdominant right-hand eigenvectors of P_c, which exactly corresponds to correspondence analysis (see for instance [24], Equation (4.3.5)). Moreover, it can easily be shown that P_c has only real non-negative eigenvalues, and thus ordering the eigenvalues by modulus is equivalent to ordering them by value. In correspondence analysis, eigenvalues reflect the relative importance of the dimensions: each eigenvalue is the amount of inertia a given attribute explains in the frequency table [31]. The basic diffusion map after stochastic complementation on this bipartite graph therefore leads to the same results as simple correspondence analysis. Relationships between simple correspondence analysis and link-analysis techniques have already been highlighted. For instance, Zha et al. [72] showed the equivalence of a normalized cut performed on a bipartite graph and simple correspondence analysis. On the other hand, Saerens et al. investigated the relationships between Kleinberg's HITS algorithm [33] and correspondence analysis [18] or principal-component analysis [50].
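The equivalence just described can be checked numerically. The short Python (numpy) sketch below (an added illustration on a hypothetical contingency table, not the authors' code) forms the stochastic complement of the bipartite random walk, hiding the attributes of x2, and extracts its subdominant right eigenvectors, which are the simple-correspondence-analysis scores of the attributes of x1.

# Sketch: stochastic complementation on a bipartite graph = simple correspondence analysis.
import numpy as np

# Hypothetical frequency (contingency) table: 3 attributes of x1 (rows), 4 of x2 (columns).
F = np.array([[10., 2., 0., 1.],
              [ 1., 8., 3., 0.],
              [ 0., 1., 6., 5.]])
D1 = np.diag(F.sum(axis=1))          # row totals
D2 = np.diag(F.sum(axis=0))          # column totals

# Pc = D1^{-1} A12 D2^{-1} A21 with A12 = F and A21 = F^T (a stochastic matrix).
Pc = np.linalg.inv(D1) @ F @ np.linalg.inv(D2) @ F.T
assert np.allclose(Pc.sum(axis=1), 1.0)

# Subdominant right eigenvectors of Pc, weighted by their eigenvalues
# (basic diffusion map with t = 1), give the correspondence-analysis coordinates.
vals, vecs = np.linalg.eig(Pc)
order = np.argsort(-vals.real)
vals, vecs = vals.real[order], vecs.real[:, order]
scores = vecs[:, 1:3] * vals[1:3]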

B. Multiple correspondence analysis

Multiple correspondence analysis assigns a numerical score to each attribute of a set of p > 2 categorical variables [23], [62]. Suppose the data are available in the form of a star schema: the individuals are contained in a main table and the categorical features of these individuals, such as education level, gender, etc., are contained in p auxiliary, satellite, tables. The corresponding graph is built naturally by defining one node for each individual and for each attribute, while a link between an individual and an attribute is defined when the individual possesses this attribute. This configuration is known as a star schema [32] in the data warehouse or relational database fields (see Figure 2 for a trivial example). Let us first renumber the nodes in such a way that the attribute nodes appear first and the individual nodes last. Thus, the attributes-to-individuals matrix will be denoted by A_12; it contains a 1 on the (i, j) entry when the individual j has attribute i, and 0 otherwise. The individuals-to-attributes matrix, the transpose of the attributes-to-individuals matrix, is A_21. Thus, the adjacency matrix of the graph is

A = [ O     A_12 ]    (15)
    [ A_21  O    ]

Now, the individuals-to-attributes matrix exactly corresponds to the data matrix A_21 = X containing, as rows, the individuals and, as columns, the attributes. Since the different features are coded as indicator (dummy) variables, a row of the X matrix contains a 1 if the individual has the corresponding attribute and 0 otherwise. We thus have A_21 = X and A_12 = X^T. Assuming binary weights, the matrix D_1 contains on its diagonal the frequencies of each attribute, that is, the number of individuals having this attribute. On the other hand, D_2 contains p on each element of its diagonal since each individual has exactly one attribute for each of the p features (attributes corresponding to a feature are mutually exclusive). Thus, D_2 = pI and P_12 = D_1^{-1} A_12, P_21 = D_2^{-1} A_21. Suppose we are first interested in the relationships between attribute nodes, therefore hiding the individual nodes contained in the main table. By stochastic complementation (Equation (10)), the corresponding attribute-attribute transition matrix is

P_c = D_1^{-1} A_12 D_2^{-1} A_21 = (1/p) D_1^{-1} A_12 A_21 = (1/p) D_1^{-1} X^T X = (1/p) D_1^{-1} F    (16)

where the element f_ij of the frequency matrix F = X^T X, also called the Burt matrix, contains the number of co-occurrences of the two attributes i and j, that is, the number of individuals having both attribute i and attribute j. The largest non-trivial right eigenvector of the matrix P_c represents the scores of the attributes in a multiple correspondence analysis. Thus, computing the eigenvalues and eigenvectors of P_c and displaying the nodes with coordinates proportional to the eigenvectors, weighted by the corresponding eigenvalue, exactly corresponds to multiple correspondence analysis (see [62], Equation (10)). This is precisely what we obtain when computing the basic diffusion map on P_c with t = 1. Indeed, as for simple correspondence analysis, it can easily be shown that P_c has real non-negative eigenvalues, and thus ordering the eigenvalues by modulus is equivalent to ordering them by value. If we are interested in the relationships between the elements of the main table (the individuals) instead of the attributes, we obtain

P_c = (1/p) A_21 D_1^{-1} A_12 = (1/p) X D_1^{-1} X^T    (17)

which, once again, is exactly the result obtained by multiple correspondence analysis (see [62], Equation (9)).
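Analogously, Equation (16) can be illustrated with a few lines of Python (numpy); the indicator matrix below is hypothetical (five individuals described by p = 2 categorical features coded as dummy variables) and the sketch is not the authors' code.

# Sketch of Equation (16): attribute-attribute transition matrix of a star-schema database.
import numpy as np

X = np.array([[1., 0., 0., 1., 0.],   # each row: one attribute per feature
              [1., 0., 0., 0., 1.],
              [0., 1., 0., 1., 0.],
              [0., 1., 0., 0., 1.],
              [0., 0., 1., 1., 0.]])
p = 2                                  # number of features (each row of X sums to p)

F = X.T @ X                            # Burt matrix of attribute co-occurrences
D1 = np.diag(X.sum(axis=0))            # attribute frequencies
Pc = (1.0 / p) * np.linalg.inv(D1) @ F # Equation (16)
assert np.allclose(Pc.sum(axis=1), 1.0)

# The largest nontrivial right eigenvectors of Pc give the multiple-correspondence-analysis scores.
vals, vecs = np.linalg.eig(Pc)
order = np.argsort(-vals.real)
scores = vecs.real[:, order][:, 1:3]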


Fig. 2. Trivial example of a star-schema relation between a main variable, Individual, and auxiliary variables, Gender, Education level, Marital status, Location, Native language. Each table contains outcomes of the corresponding random variable.

V. EXPERIMENTS

This experimental section aims to answer four research questions. (1) How do the graph mappings provided by the kernel PCA based on the diffusion-map kernel (KDM PCA) compare with the basic diffusion-map projection? (2) Does the proposed two-step procedure (stochastic complementation + diffusion map) provide realistic subgraph drawings? (3) How does the diffusion-map kernel combined with stochastic complementation compare to other popular dimensionality reduction techniques? (4) Does stochastic complementation accurately preserve the structural information?

A. Graph embedding

Two simple graphs are studied in order to illustrate the visualization of the graph structure by diffusion maps alone (without stochastic complementation): the Zachary karate club [71] and the dolphins social network [38].

Zachary karate club. Zachary studied the relations between the 34 members of a karate club. A disagreement between the club instructor (node 1) and the administrator (node 34) resulted in the split of the club into two parts. Each member of the club is represented by a node of the graph and the weight between nodes (i, j) is set to the number of times member i and member j met outside the club. The Ucinet drawing of the network is shown in Figure 3(a). The friendship network built between members allows us to discover how the club split, but its mapping also allows us to study the proximity between the different members. It can be observed from the graph embeddings provided by the basic diffusion map (Figure 3(c)) that the value of the parameter t has no influence on the node positions, but only on the axis scaling (remember Equation (4)). On the contrary, the influence of the value of the parameter t is clearly visible on the KDM PCA mapping (Figure 3(d)). However, the projection of a graph on a two-dimensional space usually leads to a loss of information. The information preservation ratio can be estimated as (λ_1 + λ_2)/Σ_i λ_i; it was computed for the 2-D mapping of the network using the basic diffusion map and the KDM PCA, and is shown in Figure 3(b). It can be observed that the ratio increases with t but, overall, the information is better preserved with the KDM PCA than with the basic diffusion map. This is due to a better choice of the projection axes for the KDM PCA, which are oriented in the directions of maximal variance. Since a proximity on the mapping can be interpreted as a strong relationship between members, the social community structure is clearly observable visually from Figures 3(c)-(d). For the KDM PCA, the choice of the value of the parameter t depends on the graph property that the user wishes to highlight. When t is low (t = 1), nodes with few relevant connections are considered as outliers and rejected from the community. On the contrary, when t increases, the effect of marginal nodes fades and only the global member structure, similar to the basic diffusion map, is visible.

Dolphins social network. This unweighted graph represents the associations between bottlenose dolphins from the same community, observed over a period of seven years. Only the basic diffusion map with t = 1 is shown (since, once again, the parameter t has no influence on the mapping), while the KDM PCA is computed for t = 1, 3, and 6. It can be observed from the KDM PCA (see Figure 4) that one dolphin (the 61st member, Zig) is rejected far away from the community due to his lack of interaction with the other dolphins. His only contact is also poorly connected to the remaining community. As explained in Section II and confirmed in Figure 4, the basic diffusion map and the KDM PCA mappings become similar when t increases. For instance, when t = 6 or more, the outlier is no longer highlighted and only the main structure of the network is visible. Two main subgroups can be identified from the mapping (notice that Newman and Girvan [44] actually identified 4 sub-communities by clustering).

Fig. 3. Zachary karate social network: (a) Ucinet projection of the Zachary karate network, (b) the information preservation ratio of the 2-D projection as a function of the parameter t for the basic diffusion map and the KDM PCA, (c) the graph mapping obtained by the basic diffusion map (t = 1 and 4), and (d) the graph mapping obtained by a KDM PCA (t = 1 and 4).

B. Analysing the effect of stochastic complementation on a real-world dataset

This second experiment aims to illustrate the two-step mapping procedure, i.e., first applying a stochastic complementation and then computing the KDM PCA, on a real-world dataset, the newsgroups data. However, for illustrative purposes, the procedure is first applied on a toy example.

Toy example. The graph (see Figure 5) is composed of four objects (e1, e2, e3, and e4) belonging to two different classes (c1 and c2). Each object is also connected to one or several of the five attributes (a1, ..., a5). The reduced graph mapping obtained by our two-step procedure allows us to highlight the relations between the attribute values (i.e., the a nodes) and the classes (i.e., the c nodes). To achieve this goal, the e nodes are eliminated by performing a stochastic complementation: only the a and c nodes are kept. The resulting subgraph is displayed on a 2-D plane by performing a KDM PCA. Since the connectivity between nodes a1 and a2 (a3, a4, and a5) is larger than with the remaining nodes, these two (three) nodes are close together on the resulting map. Moreover, node c1 (c2) is highly connected to nodes a1, a2 (a3, a4, a5) through indirect links and is therefore displayed close to these nodes.

Newsgroups dataset.


Fig. 5. Toy example illustrating our two-step procedure (stochastic complementation followed by KDM PCA).

The real-world dataset studied in this section is the newsgroups dataset (available from http://people.csail.mit.edu/jrennie/20Newsgroups/). It is composed of about 20,000 unstructured documents, taken from 20 discussion groups (newsgroups) of the Usenet diffusion list. For ease of interpretation, we decided to limit the dataset size by randomly sampling 150 documents out of three slightly correlated topics (sport/baseball, politics/mideast, and space/general; 50 documents from each topic). These 150 documents are preprocessed as described in [68], [69]. The resulting graph is composed of 150 document nodes, 564 term nodes, and 3 topic nodes representing the topics of the documents. Each document is connected to its corresponding topic node with a weight fixed to one. The weights between the documents and the terms are set equal to the tf.idf factor and normalized in order to obtain weights between 0 and 1 [68], [69]. Thus, the class (or topic) nodes are connected to the document nodes of the corresponding topics, and each document is also connected to the terms contained in the document.


Fig. 4. Dolphins social network: the graph mapping obtained by the basic diffusion map (t = 1; upper left figure) and using the KDM PCA (t = 1, 3, and 6).

Drawing a parallel with our illustrative example (see Figure 5), topic nodes correspond to c-nodes, document nodes to e-nodes, and terms to a-nodes. The goal of this experiment is to study the similarity between the terms and the topics through their connections with the document nodes. The reduced Markov chain is computed by setting S1 to the nodes of the graph corresponding to the terms and the topics. The remaining nodes (the documents) are rejected into the subgroup S2. Figure 6 shows the embedding of the terms used in the 150 documents of the newsgroups subset, as well as the topics of the documents. The KDM PCA quickly provides the same mapping as the basic diffusion map when increasing the value of t. However, it can be observed on the embedding with t = 1 that several nodes are rejected outside the triangle, far away from the other nodes of the graph. A new mapping where the terms corresponding to the nodes are also displayed (for visualization convenience, only terms cited by 25 documents or more are shown) for the KDM PCA with t = 1 is shown in Figure 7. It can be observed that a group of terms is clustered near each topic node, denoting their relevance to this topic (e.g., player, win, hit, and team for the sport topic).


Fig. 6. Newsgroups dataset: example of stochastic complementation followed by a basic diffusion map (upper left figure) and by a KDM PCA with t = 1 (upper right), t = 3 (lower left), and t = 5 (lower right). Term and topic nodes are displayed jointly.


We also observe that terms lying in between two topics are commonly used by both topics (human, nation, and european seem to be words used in discussions on politics as well as on space), or are centered in the middle of the three topics (common terms without any specificity, such as work, usa, or make). Globally, the projection provides a good representation of the use of the terms in the different discussion groups for both the basic diffusion map and the KDM PCA. Terms rejected outside the triangle are often cited by only a few documents and seem to have few links with other terms. They are probably off topic, as, for instance, the series of terms on printers in the outlier group on the left of the sport topic (printer, packard, or hewlett).

C. Graph reduction influence and embedding comparison

Fig. 7. Newsgroups dataset: example of stochastic complementation followed by a KDM PCA (t = 1). Only terms cited by at least 25 documents and peripheral terms are displayed.

The objective of this experiment is twofold. The first aim is to study the influence of the stochastic complementation on the graph mapping. The second is to compare five popular dimensionality reduction methods, namely the diffusion-map kernel PCA (KDM PCA or simply KDM), the Laplacian Eigenmap (LE) [3], the Curvilinear Component Analysis (CCA) [14], Sammon's nonlinear Mapping (SM) [52], and the classical MultiDimensional Scaling [6], [12] based on geodesic distances (MDS). For CCA, SM, and MDS, the distance matrix is given by the shortest-path distance computed on the reduced graph whose weights are set to the inverse of the entries of the adjacency matrix obtained by stochastic complementation.

Notice that the MDS method computed from the geodesic distance on a graph is also known as the ISOMAP method, after [61]. Given that the resulting reduced Markov chain is usually dense, the time complexity of each algorithm is as follows. For the KDM PCA, LE, and MDS, the problem is to compute the d dominant eigenvectors of a square matrix, since the graph is mapped on a d-dimensional space; this is O(τ d n_1^2), where n_1 is the number of nodes of interest being displayed and τ is the number of iterations of the power method. For SM and CCA, the complexity is about O(τ n_1^2), where τ is the number of iterations (these algorithms are iterative by nature).


On the other hand, computing the shortest-path distance matrix takes O(n_1^2 log(n_1)). Thus, each algorithm has a time complexity between O(n_1^2) and O(n_1^3). In this experiment, we address the task of classification of unlabeled nodes in partially labeled graphs, that is, semi-supervised classification on a graph [73]. Notice that the goal of this experiment is not to design a state-of-the-art semi-supervised classifier; rather it is to study the performance of the proposed method in comparison with other embedding methods. Three graphs are investigated. The first graph is constructed from the well-known Iris dataset [4]. The weight (affinity) between the nodes representing samples is provided by w_ij = exp[-d_ij^2 / σ^2], where d_ij is the Euclidean distance in the feature space and σ^2 is simply the sample variance. The classes are the three iris species. The second graph is extracted from the IMDb movie database [39]. 1126 movies are linked to form a connected graph: an edge is added between two movies if they share common production companies. Each node is classified as a high- or low-income movie (two classes). The last graph, extracted from the CORA dataset [39], is composed of scientific papers from three topics. A citation graph is built upon the dataset, where two papers are linked if the first paper cites the second one. The tested graph contains 1410 nodes divided into three classes representing machine-learning research topics. For each of these three graphs, extra nodes are added to represent the class labels (called the class nodes). Each class node is connected to the graph nodes of the corresponding class. Moreover, in order to define cross-validation folds, these graph nodes are randomly split into training sets and test sets (called the training nodes and the test nodes, respectively), the edges between the test nodes and the class nodes being removed. The graph is then reduced to the test nodes and the class nodes by stochastic complementation (the training nodes are rejected into the S2 subset and thus censored), and projected into a two-dimensional space by applying one of the projection algorithms described before. If the relationship between the test nodes and the class nodes is accurately reconstructed in the reduced graph, these nodes from the test set should be projected close to the class node of their corresponding class. We report the classification accuracy for several labeling rates, i.e., proportions of unlabeled nodes which constitute the test set. The proportion of test nodes varies between 50% of the graph nodes (2-fold cross-validation) and 10% (10-fold cross-validation). This means that the proportion of training nodes left apart (censored) by stochastic complementation increases with the number of folds. The whole cross-validation procedure is repeated 10 times (10 runs) and the classification accuracy averaged over these 10 runs is reported, as well as the 95% confidence interval. For classification, the assigned label of each test node is simply the label provided by the nearest class node, in terms of Euclidean distance in the two-dimensional embedding space. This permits assessing whether the class information is correctly preserved during stochastic complementation and 2-D dimensionality reduction. The parameter t of the KDM PCA is set to 5, in view of our preliminary experiments.

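Continuing that sketch (it reuses A, n, classes, test_mask, and y from above), the censoring and classification steps of the protocol could look as follows; the plain eigenvector projection below merely stands in for whichever embedding method is being compared, and is not the paper's KDM PCA kernel:

```python
import numpy as np

def stochastic_complement(P, keep, censor):
    """Reduce a Markov chain to the 'keep' states (Meyer's stochastic complement)."""
    P11 = P[np.ix_(keep, keep)]
    P12 = P[np.ix_(keep, censor)]
    P21 = P[np.ix_(censor, keep)]
    P22 = P[np.ix_(censor, censor)]
    return P11 + P12 @ np.linalg.solve(np.eye(len(censor)) - P22, P21)

P = A / A.sum(axis=1, keepdims=True)          # transition matrix of the full graph
keep = np.r_[np.where(test_mask)[0], np.arange(n, n + len(classes))]
censor = np.where(~test_mask)[0]              # training nodes are censored
Pc = stochastic_complement(P, keep, censor)

# Simple 2-D projection on the two largest non-trivial right eigenvectors.
vals, vecs = np.linalg.eig(Pc)
order = np.argsort(-vals.real)[1:3]
Z = vecs[:, order].real

# Nearest class node (the last len(classes) rows of Z) in Euclidean distance.
Z_test, Z_class = Z[:-len(classes)], Z[-len(classes):]
pred = np.argmin(((Z_test[:, None, :] - Z_class[None, :, :]) ** 2).sum(-1), axis=1)
accuracy = (pred == y[test_mask]).mean()
print(f"nearest-class-node accuracy: {accuracy:.3f}")
```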
The parameter t of the KDM PCA is set to 5, in view of our preliminary experiments. Figures 8(a)-(c) show the classification accuracy, as well as the 95% confidence interval, obtained on the three investigated graphs for different training/test set partitionings (folds). The x-axis represents the number of folds, and thus an increasing proportion of nodes left apart (censored) by stochastic complementation (from 0% and 50% up to 90%). As a baseline, the whole original graph (corresponding to a single fold and referred to as 1-fold) is also projected without removing any class link and without performing any stochastic complementation; this situation represents the ideal case, since all the class information is kept. All the methods should obtain a good accuracy score in this setting, and this is indeed what is observed.

First, we observe that, although obtaining very good performance when projecting the original graph (1-fold), CCA and SM perform poorly when the number of folds, and thus the amount of censored nodes, increases. On the other hand, LE is quite unstable, performing poorly on the CORA dataset. This means that stochastic complementation combined with CCA, SM, or LE does not work properly. On the contrary, the performance of KDM PCA and MDS remains fairly stable; for instance, the average decrease in performance of KDM PCA is around 10% in comparison with the mapping of the original graph (from 1-fold to 2-fold, where 50% of the nodes are censored), which remains reasonable. MDS offers a good alternative to KDM PCA, showing competitive performance; however, it involves the computation of the all-pairs shortest-path distances.

These results are confirmed when displaying the mappings. Figures 8(d)-(h) show a mapping example of the test nodes as well as the class nodes (the white markers) of the CORA graph, for the 10-fold cross-validation setting. Thus, only 10% of the graph nodes are unlabeled and projected, after stochastic complementation of the remaining 90% of the nodes. It can be observed that the Laplacian eigenmap (Figure 8(e)) and the MDS (Figure 8(h)) manage to separate the different classes, but mostly in terms of angular similarity. On the KDM PCA mapping (Figure 8(d)), the class nodes are well-located, at the center of the set of nodes belonging to their class. On the other hand, the mappings provided by CCA and SM after stochastic complementation do not accurately preserve the class information.

D. Discussion of the results

Let us now come back to our research questions. As a first observation, we can say that the two-step procedure (stochastic complementation followed by a diffusion-map projection) provides an embedding in a low-dimensional subspace from which useful information can be extracted. Indeed, the experiments show that highly related elements are displayed close together, while poorly related elements tend to be drawn far apart. This is quite similar to correspondence analysis, to which the procedure is closely related. Secondly, it seems that stochastic complementation reasonably preserves proximity information when combined with a diffusion map (KDM PCA) or an ISOMAP projection (MDS).


[Figure 8 appears here. Panels (a)-(c): classification rate as a function of the number of folds for the IRIS, IMDB, and CORA graphs, for the five compared methods (KDM, LE, CCA, SM, MDS). Panels (d)-(h): two-dimensional mappings of the CORA test nodes produced by KDM PCA, LE, CCA, SM, and MDS; see the caption below.]
Fig. 8. (a)-(c) Classification accuracy obtained by the five compared projection methods for the Iris ((a), 3 classes), IMDb ((b), 2 classes), and CORA ((c), 3 classes) datasets, respectively, as a function of the training/test set partitioning (number of folds). (d)-(h) The mapping of 10% of the CORA graph (10-fold setting) obtained by the five projection methods. The compared methods are the diffusion-map kernel ((d), KDM PCA, or KDM), the Laplacian eigenmap ((e), LE), the curvilinear component analysis ((f), CCA), the Sammon nonlinear mapping ((g), SM), and the multidimensional scaling or ISOMAP ((h), MDS). The class label nodes are represented by white markers.


For the diffusion map, this is normal since both stochastic complementation and the diffusion-map distance are based on a Markov-chain model; stochastic complementation is the natural technique for censoring states of a Markov chain. On the contrary, stochastic complementation should not be combined with a Laplacian eigenmap, a curvilinear component analysis, or a Sammon nonlinear mapping, since the resulting mapping is not accurate. Finally, the KDM PCA provides exactly the same results as the basic diffusion map when t is large. However, when the parameter t is low, the resulting projection tends to highlight the outlier nodes and to magnify the relative differences between nodes. It is therefore recommended to display a whole range of mappings for several different values of t.
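As an aside, the effect of t can be inspected cheaply; the sketch below (plain NumPy, on a toy row-stochastic matrix, using the standard diffusion-map coordinates lambda_k^t u_k rather than the KDM PCA kernel itself) shows how the spread of the mapping shrinks as t grows:

```python
import numpy as np

# Toy row-stochastic matrix standing in for a reduced transition matrix.
rng = np.random.default_rng(0)
A = rng.random((30, 30))
A = (A + A.T) / 2.0                      # symmetric affinities -> real eigenvalues of P
np.fill_diagonal(A, 0.0)
P = A / A.sum(axis=1, keepdims=True)

vals, vecs = np.linalg.eig(P)
order = np.argsort(-np.abs(vals))[1:3]   # two largest non-trivial eigenvalues
for t in (1, 3, 5, 10):
    coords = vecs[:, order].real * (vals[order].real ** t)  # mapping at diffusion time t
    print(f"t={t:2d}  axis spread={np.ptp(coords, axis=0)}")
```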
VI. CONCLUSIONS AND FURTHER WORK

This work introduced a link-analysis technique for analyzing relationships existing in relational databases. The database is viewed as a graph, where the nodes correspond to the elements contained in the tables and the links correspond to the relations between the tables. A two-step procedure is defined for analyzing the relationships between elements of interest contained in a table, or in a subset of tables. More precisely, this work (1) proposes to use stochastic complementation for extracting, from the original graph, a subgraph containing the elements of interest and (2) introduces a kernel-based extension of the basic diffusion map for displaying and analyzing the reduced subgraph. It is shown that the resulting method is closely related to correspondence analysis. Several datasets are analyzed using this procedure, showing that it seems well-suited for analyzing relationships between elements. Indeed, stochastic complementation considerably reduces the original graph and allows the analysis to focus on the elements of interest, without having to define a state of the Markov chain for each element of the relational database. However, one fundamental limitation of the method is that the relational database could contain too many disconnected components, in which case our link-analysis approach is almost useless. Moreover, it is clearly not always an easy task to extract a graph from a relational database, especially when the database is huge. These are the two main drawbacks of the proposed two-step procedure.

Further work will be devoted to the application of this methodology to fuzzy SQL queries or fuzzy information retrieval. The objective is to retrieve not only the elements strictly complying with the constraints of the SQL query, but also the elements that almost comply with these constraints and are therefore close to the target elements. We will also evaluate the proposed methodology on real relational databases.

ACKNOWLEDGMENTS

Part of this work has been funded by projects with the Région Wallonne and the Belgian Politique Scientifique Fédérale. We thank these institutions for giving us the opportunity to conduct both fundamental and applied research. We also thank Professor Jef Wijsen, from the Université de Mons, Belgium, and Dr John Lee, from the Université catholique de Louvain, Belgium, for the interesting discussions and suggestions about this work.

APPENDIX

A. Some links between the basic diffusion map and spectral clustering

Suppose the right eigenvectors and eigenvalues of the matrix P are u_k and \lambda_k. Then the matrix I - P has the same eigenvectors as P, with eigenvalues given by (1 - \lambda_k). Therefore, the eigenvectors of P associated with the largest eigenvalues correspond to the eigenvectors of I - P associated with the smallest eigenvalues. Assuming a connected undirected graph, we will now express the eigenvectors of P in terms of the eigenvectors \tilde{l}_k of the normalized Laplacian matrix, \tilde{L} = D^{-1/2} L D^{-1/2}. We thus have

(I - P) u_k = (1 - \lambda_k) u_k    (18)

But

I - P = D^{-1}(D - A) = D^{-1} L = D^{-1/2} (D^{-1/2} L D^{-1/2}) D^{1/2} = D^{-1/2} \tilde{L} D^{1/2}    (19)


Inserting this result into Equation (18) provides \tilde{L} D^{1/2} u_k = (1 - \lambda_k) D^{1/2} u_k. Thus, the (unnormalized) eigenvectors of \tilde{L} are \tilde{l}_k = D^{1/2} u_k, and they are associated with the eigenvalues (1 - \lambda_k).
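This relation is easy to check numerically; the following short sketch (assuming NumPy, on a small path graph) verifies that D^{1/2} u_k is indeed an eigenvector of the normalized Laplacian with eigenvalue 1 - lambda_k:

```python
import numpy as np

# Small connected undirected graph (path graph on 5 nodes).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
d = A.sum(axis=1)
P = np.diag(1.0 / d) @ A                                  # random-walk transition matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_tilde = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt      # normalized Laplacian

lam, U = np.linalg.eig(P)                                 # right eigenvectors u_k, eigenvalues lambda_k
for k in range(5):
    v = np.sqrt(d) * U[:, k]                              # candidate eigenvector D^{1/2} u_k
    assert np.allclose(L_tilde @ v, (1 - lam[k]) * v, atol=1e-10)
print("D^{1/2} u_k are eigenvectors of the normalized Laplacian with eigenvalues 1 - lambda_k")
```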

REFERENCES

[1] P. Baldi, P. Frasconi, and P. Smyth. Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, 2003.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585–591. MIT Press, 2001.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
[4] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. [http://www.ics.uci.edu/mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[5] J. Blasius, M. Greenacre, P. Groenen, and M. van de Velden. Special journal issue on correspondence analysis and related methods. Computational Statistics and Data Analysis, 53(8):3103–3106, 2008.
[6] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
[7] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999.
[8] P. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, 2006.
[9] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Elsevier Science, 2003.
[10] F. R. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[11] D. J. Cook and L. B. Holder. Mining Graph Data. Wiley and Sons, 2006.
[12] T. Cox and M. Cox. Multidimensional Scaling, 2nd ed. Chapman and Hall, 2001.
[13] N. Cressie. Statistics for Spatial Data. Wiley, 1991.
[14] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148–154, 1997.
[15] C. Ding. Spectral clustering. Tutorial presented at the 16th European Conference on Machine Learning (ECML 2005), 2005.
[16] P. Domingos. Prospects and challenges for multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5(1):80–83, 2003.
[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.


[18] F. Fouss, J.-M. Renders, and M. Saerens. Links between Kleinberg's hubs and authorities, correspondence analysis and Markov chains. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pages 521–524, 2003.
[19] F. Fouss, L. Yen, A. Pirotte, and M. Saerens. An experimental investigation of graph kernels on a collaborative recommendation task. Proceedings of the 6th International Conference on Data Mining (ICDM 2006), pages 863–868, 2006.
[20] F. Geerts, H. Mannila, and E. Terzi. Relational link-based ranking. Proceedings of the 30th Very Large Data Bases Conference (VLDB), pages 552–563, 2004.
[21] X. Geng, D.-C. Zhan, and Z.-H. Zhou. Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(6):1098–1107, 2005.
[22] L. Getoor, editor. Introduction to Statistical Relational Learning. MIT Press, 2007.
[23] J. Gower and D. Hand. Biplots. Chapman & Hall, 1995.
[24] M. J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press, 1984.
[25] A. Greenbaum. Iterative Methods for Solving Linear Systems. Society for Industrial and Applied Mathematics, 1997.
[26] K. M. Hall. An r-dimensional quadratic placement algorithm. Management Science, 17(8):219–229, 1970.
[27] D. A. Harville. Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.
[28] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer-Verlag, 2009.
[29] H. Hwang, W. Dhillon, and Y. Takane. An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents. Psychometrika, 71(1):161–171, 2006.
[30] A. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, 2008.
[31] R. Johnson and D. Wichern. Applied Multivariate Statistical Analysis, 6th ed. Prentice Hall, 2007.
[32] R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, 2002.
[33] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[34] P. Kroonenberg and M. Greenacre. Correspondence analysis. In Encyclopedia of Statistical Sciences, 2nd ed. (S. Kotz, founder and editor-in-chief), pages 1394–1403, 2006.
[35] S. Lafon and A. B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.
[36] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[37] J. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2007.
[38] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and S. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Can geographic isolation explain this unique trait? Behavioral Ecology and Sociobiology, 54(4):396–405, 2003.
[39] S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.
[40] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.
[41] C. D. Meyer. Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Review, 31(2):240–272, 1989.
[42] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. Advances in Neural Information Processing Systems 18, pages 955–962, 2005.
[43] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21:113–127, 2006.
[44] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.
[45] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, pages 849–856, Vancouver, Canada, 2001. MIT Press.

[46] P. Pons and M. Latapy. Computing communities in large networks using random walks. Proceedings of the International Symposium on Computer and Information Sciences (ISCIS 2005). Lecture Notes in Computer Science, Springer-Verlag, 3733:284–293, 2005.
[47] P. Pons and M. Latapy. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10(2):191–218, 2006.
[48] S. Ross. Stochastic Processes, 2nd ed. Wiley, 1996.
[49] Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd ed. Society for Industrial and Applied Mathematics, 2003.
[50] M. Saerens and F. Fouss. HITS is principal component analysis. Proceedings of the 2005 IEEE/WIC/ACM International Joint Conference on Web Intelligence, pages 782–785, 2005.
[51] M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. Proceedings of the 15th European Conference on Machine Learning (ECML 2004). Lecture Notes in Artificial Intelligence, vol. 3201, Springer-Verlag, Berlin, pages 371–383, 2004.
[52] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5):401–409, 1969.
[53] O. Schabenberger and C. Gotway. Statistical Methods for Spatial Data Analysis. Chapman & Hall, 2005.
[54] B. Scholkopf and A. Smola. Learning with Kernels. The MIT Press, 2002.
[55] B. Scholkopf, A. Smola, and K. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[56] R. Sedgewick. Algorithms in C. Addison-Wesley, 1990.
[57] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[58] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.
[59] D. Spielman and S.-H. Teng. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. arXiv, http://arxiv.org/abs/cs/0607105, 2007.
[60] W. J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.
[61] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[62] M. Tenenhaus and F. Young. An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50(1):91–119, 1985.
[63] M. Thelwall. Link Analysis: An Information Science Approach. Elsevier, 2004.
[64] L. Trefethen and D. Bau. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.
[65] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[66] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[67] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51, 2007.
[68] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens. Graph nodes clustering based on the commute-time kernel. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007), 2007.
[69] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens. Graph nodes clustering with the sigmoid commute-time kernel: A comparative study. Data & Knowledge Engineering, 68:338–361, 2009.
[70] D. M. Young and R. T. Gregory. A Survey of Numerical Mathematics. Dover Publications, 1988.
[71] W. W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.
[72] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon. Bipartite graph partitioning and data clustering. In Proceedings of the ACM 10th International Conference on Information and Knowledge Management (CIKM 2001), pages 25–32, 2001.
[73] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers, 2009.
[74] J. Y. Zien, M. D. Schlag, and P. K. Chan. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1389–1399, 1999.
