ISSN: 1690-4524 SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 8 - NUMBER 1 - YEAR 2010 43
i and j, the other diagonal sum b+c represents the total number of mismatches between i and j. The total sum of the 2x2 table, a+b+c+d, is always equal to n.

Table 2 [5] lists definitions of 76 binary similarity and distance measures used over the last century, where S and D are similarity and distance measures, respectively.

Table 2 Definitions of Measures for binary data

(1) $S_{\mathrm{JACCARD}} = \frac{a}{a+b+c}$
(2) $S_{\mathrm{DICE}} = \frac{2a}{2a+b+c}$
(3) $S_{\mathrm{CZEKANOWSKI}} = \frac{2a}{2a+b+c}$
(4) $S_{\mathrm{3W\text{-}JACCARD}} = \frac{3a}{3a+b+c}$
(5) $S_{\mathrm{NEI\&LI}} = \frac{2a}{(a+b)+(a+c)}$
(6) $S_{\mathrm{SOKAL\&SNEATH\text{-}I}} = \frac{a}{a+2b+2c}$
(7) $S_{\mathrm{SOKAL\&MICHENER}} = \frac{a+d}{a+b+c+d}$
(8) $S_{\mathrm{SOKAL\&SNEATH\text{-}II}} = \frac{2(a+d)}{2a+b+c+2d}$
(9) $S_{\mathrm{ROGER\&TANIMOTO}} = \frac{a+d}{a+2(b+c)+d}$
(10) $S_{\mathrm{FAITH}} = \frac{a+0.5d}{a+b+c+d}$
(11) $S_{\mathrm{GOWER\&LEGENDRE}} = \frac{a+d}{a+0.5(b+c)+d}$
(19) $D_{\mathrm{MANHATTAN}} = b+c$
(20) $D_{\mathrm{MEAN\text{-}MANHATTAN}} = \frac{b+c}{a+b+c+d}$
(21) $D_{\mathrm{CITYBLOCK}} = b+c$
(22) $D_{\mathrm{MINKOWSKI}} = (b+c)^{1/1}$
(23) $D_{\mathrm{VARI}} = \frac{b+c}{4(a+b+c+d)}$
(24) $D_{\mathrm{SIZEDIFFERENCE}} = \frac{(b+c)^2}{(a+b+c+d)^2}$
(25) $D_{\mathrm{SHAPEDIFFERENCE}} = \frac{n(b+c)-(b-c)^2}{(a+b+c+d)^2}$
(26) $D_{\mathrm{PATTERNDIFFERENCE}} = \frac{4bc}{(a+b+c+d)^2}$
(27) $D_{\mathrm{LANCE\&WILLIAMS}} = \frac{b+c}{2a+b+c}$
(28) $D_{\mathrm{BRAY\&CURTIS}} = \frac{b+c}{2a+b+c}$
(29) $D_{\mathrm{HELLINGER}} = 2\sqrt{1-\frac{a}{\sqrt{(a+b)(a+c)}}}$
(30) $D_{\mathrm{CHORD}} = \sqrt{2\left(1-\frac{a}{\sqrt{(a+b)(a+c)}}\right)}$
(31) $S_{\mathrm{COSINE}} = \frac{a}{\left(\sqrt{(a+b)(a+c)}\right)^2}$
(32) $S_{\mathrm{GILBERT\&WELLS}} = \log a - \log n - \log\left(\frac{a+b}{n}\right) - \log\left(\frac{a+c}{n}\right)$
(33) $S_{\mathrm{OCHIAI\text{-}I}} = \frac{a}{\sqrt{(a+b)(a+c)}}$
(34) $S_{\mathrm{FORBES\text{-}I}} = \frac{na}{(a+b)(a+c)}$
(35) $S_{\mathrm{FOSSUM}} = \frac{n(a-0.5)^2}{(a+b)(a+c)}$
(36) $S_{\mathrm{SORGENFREI}} = \frac{a^2}{(a+b)(a+c)}$
(37) $S_{\mathrm{MOUNTFORD}} = \frac{a}{0.5(ab+ac)+bc}$
(43) $S_{\mathrm{JOHNSON}} = \frac{a}{a+b} + \frac{a}{a+c}$
(44) $S_{\mathrm{DENNIS}} = \frac{ad-bc}{\sqrt{n(a+b)(a+c)}}$
(45) $S_{\mathrm{SIMPSON}} = \frac{a}{\min(a+b,\,a+c)}$
(46) $S_{\mathrm{BRAUN\&BANQUET}} = \frac{a}{\max(a+b,\,a+c)}$
(47) $S_{\mathrm{FAGER\&McGOWAN}} = \frac{a}{\sqrt{(a+b)(a+c)}} - \frac{\max(a+b,\,a+c)}{2}$
(48) $S_{\mathrm{FORBES\text{-}II}} = \frac{na-(a+b)(a+c)}{n\min(a+b,\,a+c)-(a+b)(a+c)}$
(49) $S_{\mathrm{SOKAL\&SNEATH\text{-}IV}} = \frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)$
(50) $S_{\mathrm{GOWER}} = \frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$
(51) $S_{\mathrm{PEARSON\text{-}I}} = \chi^2$, where $\chi^2 = \frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}$
(52) $S_{\mathrm{PEARSON\text{-}II}} = \left(\frac{\chi^2}{n+\chi^2}\right)^{1/2}$
(53) $S_{\mathrm{PEARSON\text{-}III}} = \left(\frac{\rho}{n+\rho}\right)^{1/2}$, where $\rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$
(54) $S_{\mathrm{PEARSON\&HERON\text{-}I}} = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$
(55) $S_{\mathrm{PEARSON\&HERON\text{-}II}} = \cos\left(\frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right)$
(56) $S_{\mathrm{SOKAL\&SNEATH\text{-}III}} = \frac{a+d}{b+c}$
(57) $S_{\mathrm{SOKAL\&SNEATH\text{-}V}} = \frac{ad}{\left((a+b)(a+c)(b+d)(c+d)\right)^{0.5}}$
(58) $S_{\mathrm{COLE}} = \frac{\sqrt{2}\,(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}$
(59) $S_{\mathrm{STILES}} = \log_{10}\frac{n\left(|ad-bc|-\frac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}$
(60) $S_{\mathrm{OCHIAI\text{-}II}} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$
(61) $S_{\mathrm{YULEQ}} = \frac{ad-bc}{ad+bc}$
(62) $D_{\mathrm{YULEQ}} = \frac{2bc}{ad+bc}$
(63) $S_{\mathrm{YULEw}} = \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}$
(64) $S_{\mathrm{KULCZYNSKI\text{-}I}} = \frac{a}{b+c}$
(65) $S_{\mathrm{TANIMOTO}} = \frac{a}{(a+b)+(a+c)-a}$
(66) $S_{\mathrm{DISPERSON}} = \frac{ad-bc}{(a+b+c+d)^2}$
(67) $S_{\mathrm{HAMANN}} = \frac{(a+d)-(b+c)}{a+b+c+d}$
(68) $S_{\mathrm{MICHAEL}} = \frac{4(ad-bc)}{(a+d)^2+(b+c)^2}$
(69) $S_{\mathrm{GOODMAN\&KRUSKAL}} = \frac{\sigma-\sigma'}{2n-\sigma'}$, where $\sigma = \max(a,b)+\max(c,d)+\max(a,c)+\max(b,d)$ and $\sigma' = \max(a+c,\,b+d)+\max(a+b,\,c+d)$
(70) $S_{\mathrm{ANDERBERG}} = \frac{\sigma-\sigma'}{2n}$
(71) $S_{\mathrm{BARONI\text{-}URBANI\&BUSER\text{-}I}} = \frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}$
(72) $S_{\mathrm{BARONI\text{-}URBANI\&BUSER\text{-}II}} = \frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c}$
(73) $S_{\mathrm{PEIRCE}} = \frac{ab+bc}{ab+2bc+cd}$
(74) $S_{\mathrm{EYRAUD}} = \frac{n^2\left(na-(a+b)(a+c)\right)}{(a+b)(a+c)(b+d)(c+d)}$
(75) $S_{\mathrm{TARANTULA}} = \frac{a/(a+b)}{c/(c+d)} = \frac{a(c+d)}{c(a+b)}$
(76) $S_{\mathrm{AMPLE}} = \frac{a/(a+b)}{c/(c+d)} = \frac{a(c+d)}{c(a+b)}$

Whether negative matches, d, should be included in binary similarity measures has been an ongoing issue [9, 12, 15, 16, 17, 18, 26, 27]. The negative match inclusive measures include the Sokal & Michener, Roger & Tanimoto, Faith, Ochiai II, Cole, Gower, Pearson I, and Stiles measures, among others. The negative match exclusive measures include the Jaccard, Tanimoto, Dice & Sorenson, Kulczynski I, Ochiai I, Mountford, Sorgenfrei, and Simpson measures, among others. Sokal et al. argued that negative matches do not necessarily indicate any similarity between two objects [27], because an almost infinite number of attributes may be lacking in both objects.

In cases where the two binary states are not equally important, such as in asymmetric binary data, the positive matches are usually more significant than the negative matches [1, 6, 10, 26]. Faith included the negative matches but gave them only half credit, while giving full credit to the positive matches, in eqn (10) [11]. In [4], different weights for positive and negative matches were studied. Weighted similarity measures such as the weighted Hamming distance or azzoo [4] are not covered in this paper.

Historically, all the binary measures listed above have performed meaningfully in their respective fields. The binary similarity coefficients proposed by Peirce, Yule, and Pearson in the 1900s contributed to the evolution of the various correlation based binary similarity measures. The Jaccard coefficient, proposed in 1901, is still widely used in fields such as ecology and biology. The inclusion or exclusion of negative matches was actively discussed by Sokal & Sneath in the 1960s and by Goodman & Kruskal in the 1970s. In Figure 1, the measures are arranged in historical order.
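The difference between negative match inclusive and exclusive measures can be illustrated with a minimal Python sketch. It computes Jaccard (eqn 1), Sokal & Michener (eqn 7), and Faith (eqn 10) for two binary vectors, then recomputes them after appending attributes that are absent in both. The function names, the example vectors, and the convention that b counts positions where the first vector is 0 and the second is 1 are assumptions made for illustration. Returning NaN for a zero denominator is also an assumption, not something this section specifies.

```python
import math
from typing import Sequence, Tuple


def otu_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length 0/1 vectors.

    Assumed convention: a = both 1, b = x is 0 and y is 1,
    c = x is 1 and y is 0, d = both 0.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1
        elif yi:
            b += 1
        elif xi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def _ratio(num: float, den: float) -> float:
    # Assumption: an undefined ratio is reported as NaN.
    return num / den if den else math.nan


def jaccard(a: int, b: int, c: int, d: int) -> float:
    """Eqn (1): ignores negative matches d."""
    return _ratio(a, a + b + c)


def sokal_michener(a: int, b: int, c: int, d: int) -> float:
    """Eqn (7): gives full credit to negative matches."""
    return _ratio(a + d, a + b + c + d)


def faith(a: int, b: int, c: int, d: int) -> float:
    """Eqn (10): gives half credit to negative matches."""
    return _ratio(a + 0.5 * d, a + b + c + d)


# Hypothetical example vectors.
x = [1, 1, 1, 1, 0, 0, 0]
y = [1, 1, 1, 0, 1, 0, 0]

base = otu_counts(x, y)
# Append three attributes that both objects lack (extra negative matches).
padded = otu_counts(x + [0] * 3, y + [0] * 3)

for name, measure in (("Jaccard", jaccard),
                      ("Sokal & Michener", sokal_michener),
                      ("Faith", faith)):
    print(f"{name}: {measure(*base):.4f} -> {measure(*padded):.4f}")
```

With these vectors the counts change from (3, 1, 1, 2) to (3, 1, 1, 5). Jaccard stays at 0.6, Sokal & Michener rises from 5/7 to 0.8, and Faith moves from 4/7 to 0.55, so only the negative match inclusive measures respond to the added shared absences.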
[Figure 1: measures arranged in historical order; surviving labels: Correlation Based; Baroni-Urbani & Buser I, II]
Figure 2 Hierarchical Clustering Result of Random Binary Data Set
4. CONCLUSIONS

Numerous binary similarity and distance measures have been used in various fields. Each is defined differently according to its own synthetic properties. Some include negative matches and some do not. Some use a simple count difference and some use a more complicated correlation. In this survey, we collected 76 binary similarity and distance measures used over the last century, classified them through hierarchical clustering, and observed close relationships among some of the measures. We expect that the relationships between pairs of measures will help researchers select a more accurate measure for binary data analysis in various domains.

5. REFERENCES

[1] Baroni-Urbani, C., Buser, M.W., (1976), “Similarity of Binary Data”, Systematic Zoology, Vol. 25, No. 3, pp. 251-259.
[2] Cha, S.-H., Srihari, S.N., (2000), “A fast nearest neighbor search algorithm by filtration”, Pattern Recognition 35, pp. 515-525.
[3] Cha, S.-H., Tappert, C.C., (2003), “Optimizing Binary Feature Vector Similarity Measure using Genetic Algorithm”, ICDAR, Edinburgh, Scotland.
[4] Cha, S.-H., Yoon, S., Tappert, C.C., (2006), “Enhancing Binary Feature Vector Similarity Measures”, Journal of Pattern Recognition Research I.
[5] Choi, S.-S., (2008), “Correlation Analysis of Binary Similarity Measures and Dissimilarity Measures”, Doctorate dissertation, Pace University.
[6] Clifford, H., Stephenson, W., (1975), “An Introduction to Numerical Taxonomy”, Academic Press, New York.
[7] Cormack, R.M., (1971), “A review of classification”, Journal of the Royal Statistical Society, Series A, 134, pp. 321-353.
[8] Driver, H.E., Kroeber, A.L., (1932), “Quantitative Expression of Cultural Relationships”, University of California Press.
[9] Dunn, G., Everitt, B.S., (1982), “An Introduction to Mathematical Taxonomy”, Cambridge University Press.
[10] Faith, D.P., (1983), “Asymmetric binary similarity measures”, Oecologia, Vol. 57, No. 3, pp. 287-290.
[11] Faith, D.P., Minchin, P.R., Belbin, L., (1987), “Compositional dissimilarity as a robust measure of ecological distance”, Journal of Plant Ecology, Volume 69, Numbers 1-3.
[12] Finley, J.P., (1884), “Tornado prediction”, The American Meteorological Journal, 1, 85-8.
[13] Forbes, S.A., (1907), “On the local distribution of certain Illinois fishes. An essay in statistical ecology”, Bulletin of the Illinois State Laboratory of Natural History.
[14] Forbes, S.A., (1925), “Method of determining and measuring the associative relations of species”, Science 61, 524.
[15] Gilbert, G.K., (1884), “Finley’s tornado predictions”, The American Meteorological Journal, 1, 166-72.
[16] Goodman, L.A., Kruskal, W.H., (1954), “Measures of association for cross classifications”, Journal of the American Statistical Association 49, 732-764.
[17] Goodman, L.A., Kruskal, W.H., (1959), “Measures of association for cross classifications II. Further discussion and references”, Journal of the American Statistical Association 54, 123-163 (pp. 35-75).
[18] Goodman, L.A., Kruskal, W.H., (1963), “Measures of association for cross classifications III. Approximate sampling theory”, Journal of the American Statistical Association 58, 310-364.
[19] Hubalek, Z., (1982), “Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation”, Biological Reviews, Vol. 57-4, 669-689.
[20] Jaccard, P., (1901), “Étude comparative de la distribution florale dans une portion des Alpes et des Jura”, Bull Soc Vaudoise Sci Nat 37:547-579.
[21] Jackson, D.A., Somers, K.M., Harvey, H.H., (1989), “Similarity Coefficients: Measures of Co-Occurrence and Association or Simply Measures of Occurrence?”, The American Naturalist, Vol. 133, No. 3, pp. 436-453.
[22] Kuhns, J.L., (1965), “The continuum of coefficients of association”, Statistical Association Methods for Mechanized Documentation, (Edited by Stevens et al.) National Bureau of Standards, Washington, 33-39.
[23] Michael, E.L., (1920), “Marine ecology and the coefficient of association: a plea in behalf of quantitative biology”, Ecology 8, 54-59.
[24] Michael H., (1976), “Binary coefficients: A theoretical and empirical study”, Mathematical Geology, Volume 8, Number 2, April, 1976.
[25] Smith, J.R., Chang, S.-F., (1996), “Automated binary texture feature sets for image retrieval”, International Conf. Acoust., Speech, Signal Processing, Atlanta, GA.
[26] Sneath, P.H.A., Sokal, R.R., (1973), “Numerical Taxonomy: The Principles and Practice of Numerical Classification”, W.H. Freeman and Company, San Francisco.
[27] Sokal, R.R., Sneath, P.H., (1963), “Principles of Numerical Taxonomy”, W.H. Freeman, San Francisco.
[28] Tubbs, J.D., (1989), “A note on binary template matching”, Pattern Recognition, 22(4):359-365.
[29] Willett, P., Barnard, J.M., Downs, G.M., (1998), “Chemical similarity searching”, Chem Inf Comput Sci 38: 983-996.
[30] Willett, P., (2003), “Similarity-based approaches to virtual screening”, Biochemical Society Transactions 31, 603-606.
[31] Zhang, B., Srihari, S.N., (2003), “Binary vector dissimilarities for handwriting identification”, Proceedings of SPIE, Document Recognition and Retrieval X, p 15-166.