Automatic Extraction of Topic Maps based Argumentation Trails
Marco Büchler, Lutz Maicher*, Frederik Baumgardt, Benjamin Bock*
Natural Language Processing Group, University of Leipzig, Germany
[mbuechler | maicher | fbaumgardt | bock]@informatik.uni-leipzig.de
Motivation and Introduction
Research on small worlds in natural language corpora, hypertext structures on the web, and social networks such as co-authorship networks has shown that the average path length between two arbitrary nodes is generally not larger than seven. The general problem is the discovery of the shortest path between these nodes, especially if the edges of the graph are only partially known and the distance is greater than two.
Academic and political debates face a similar problem. In many cases a relationship is supposed to exist between two terms of a specific domain, which form the origin and the endpoint of an argument. The closer connection between the two, however, is unidentified and becomes the essence of a discourse. Our approach discloses the relevant connections between the origin and the endpoint of an argument. We model this relationship as a connected path of terms, based on co-occurrences. This path is called an argumentation trail.
A co-occurrence is a directed edge between two terms c(ti, tj) that is extracted automatically using (different) statistical methods. An argumentation trail a(t1, tn) between two arbitrary terms t1 and tn is an ordered list of co-occurrences providing a connected path from t1 to tn. The distance d of an argumentation trail is the number of co-occurrences in this list. The distance between the terms t1 and tn is the length of the shortest argumentation trail between them.

* Topic Maps Lab: http://www.topicmapslab.de/
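Since the trail and distance definitions above amount to shortest paths in a co-occurrence graph, they can be sketched with a plain breadth-first search. This is a minimal illustration in Python, not the authors' implementation; the adjacency-dict graph and the example terms are invented for demonstration:

```python
from collections import deque

def shortest_trail(graph, source, target):
    """Breadth-first search for the shortest argumentation trail.

    graph: dict mapping each term to the set of terms it co-occurs with.
    Returns the trail as a list of terms, or None if no path exists.
    """
    if source == target:
        return [source]
    queue = deque([[source]])
    visited = {source}
    while queue:
        trail = queue.popleft()
        for neighbour in graph.get(trail[-1], ()):
            if neighbour == target:
                return trail + [neighbour]
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(trail + [neighbour])
    return None

# Invented toy graph; the distance d is the number of edges in the trail.
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(shortest_trail(g, "a", "d"))           # ['a', 'b', 'c', 'd']
print(len(shortest_trail(g, "a", "d")) - 1)  # distance 3
```

BFS suffices here because all co-occurrence edges count equally towards the distance; significance-weighted trails would need Dijkstra instead.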
The general problem is the calculation of the shortest and most significant argumentation trails between two terms. Similar to the small-world example above, argumentation trails of any distance greater than one are not obvious. The extraction of relevant argumentation trails becomes more complex as the density of the co-occurrence graph and the distance between origin and endpoint increase. To support sensemaking and other discourse-supporting techniques in academic and political debates, we introduce a method for the automatic extraction of argumentation trails.
Topic Maps are used for representing highly networked and interlinked domains. Furthermore, Topic Maps is a semantic integration technology, because each topic is a hub for all available information about a specific subject. Among other applications, topic maps are used for sensemaking and knowledge federation. For these applications, the integration of argumentation trails with further background information is necessary.

Therefore, the approach presented in this paper combines the idea of the automatic generation of argumentation trails with the formal representation of these trails as topic maps. In the evaluation we assess the quality of this first proof of concept.
State of the Art
A graph is an intuitive representation of relations between words. More formally, a graph can be expressed as G=(V,E), where V is a set of vertices (nodes, words) and E a subset of edges of V×V. In Natural Language Processing (NLP) the set V of nodes can be understood as the set of a corpus' word types. The set of edges E can be computed by co-occurrence analysis [Bue08, Bue09]. Typically, tens or hundreds of millions of co-occurrences can be extracted, which is why measures are necessary for computing an edge's significance. In the early 1990s some basic measures like mutual information [CH89] were introduced. However, this measure displays numerical problems for very rare co-occurrences. As a result, in 1993 an adaptation of the log-likelihood measure was introduced by [Dun93] which can better handle infrequent events. The sets of most significant edges, however, differ considerably between the two measures: whilst the log-likelihood ratio prefers more frequent co-occurrences, mutual information rates less frequent edges as more significant [Bue08, Eve05]. The log-likelihood ratio is therefore better suited for exploring and understanding a new domain, since it computes more general word associations, whereas if the domain is well-known, less frequent information becomes more relevant for users [BB04].
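Both significance measures can be computed from a 2×2 contingency table of sentence (or window) counts. The following Python sketch of Dunning's log-likelihood ratio and pointwise mutual information is illustrative, not the authors' implementation; note that 6.63, the cut-off used in the evaluation below, is the 1% critical value of the chi-square distribution with one degree of freedom:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) from a 2x2 contingency table:
    k11 = windows containing both words, k12/k21 = only one of the two,
    k22 = neither. Larger values indicate stronger association."""
    def entropy_term(*ks):
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2 * (entropy_term(k11, k12, k21, k22)
                - entropy_term(k11 + k12, k21 + k22)
                - entropy_term(k11 + k21, k12 + k22))

def mutual_information(k11, k12, k21, k22):
    """Pointwise mutual information of the word pair (log2 scale)."""
    n = k11 + k12 + k21 + k22
    return math.log2(k11 * n / ((k11 + k12) * (k11 + k21)))

# A strongly associated pair clears the chi-square 1% threshold of 6.63 ...
print(llr(100, 50, 50, 10000) > 6.63)     # True
# ... while statistically independent counts score (near) zero:
print(abs(llr(10, 10, 10, 10)) < 1e-9)    # True
```

The entropy formulation avoids computing expected counts explicitly and is numerically stable for the sparse tables typical of rare co-occurrences.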
One elementary feature of a graph is the small-world property, which describes the average path length between two different nodes [WS98, Bar00]. Research on small worlds is based on the work of Milgram [Mil67]. Several evaluations and applications on natural language corpora, hypertext structures on the web, and co-authorships on publications [CS01] have shown that the average path length is very small and generally not larger than seven.
Similar to lexical chains, argumentation trails form a minimum spanning tree of words that have the same or similar contexts [MWW07, MWH08]. However, there are differences in use cases and texts. Lexical chaining is often used in text summarisation [BCP01] or word sense disambiguation [GK03], where an approach that slides from paragraph to paragraph [MWH08] is useful. Since ancient text corpora are only fragmentarily preserved (caused by, e.g., natural decomposition and the deliberate deletion of person or city names), an approach working directly on a co-occurrence graph is chosen here.
Automatic Topic Maps Generation
Topic Maps (ISO 13250), the international industry standard for semantic information representation and integration, is an implementation of the subject-centric modelling approach1 [MB08]. A topic map is a subject-centric domain model consisting of topics, as subject proxies, and associations between them. Each topic can carry a set of typed names for the subject. Furthermore, occurrences allow representing typed properties of the subject. The associations between the topics are typed, role-based and n-ary. In summary, Topic Maps provides a subject-centric modelling approach and a full set of basic modelling constructs, like names, occurrences and full-featured associations, for convenient domain modelling. For a more comprehensive introduction to Topic Maps we refer to [AM05, Ma07a]. A topic map can be seen as a set or a stream of statements about subjects [LH08].
Besides the expressive and flexible modelling constructs, Topic Maps provides a powerful integration model [Ma07b, TMDM]. This integration model assures that two topics representing the same subject will always be merged. Technically, the subject of a topic is identified by a set of URIs called subject identifiers. Whenever two topics in a topic map have one subject identifier in common, they are automatically merged. Hence it is guaranteed that in a topic map there is always only one information hub for each subject. This powerful integration model is the foundation for the usage of Topic Maps as an integration technology.
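The merging rule described above — topics sharing a subject identifier collapse into one information hub — can be illustrated with a deliberately simplified topic model. Plain dicts stand in for a TMDM implementation, and the PSI URLs are invented for the example:

```python
def merge_topics(topics):
    """Merge all topics that share at least one subject identifier, so
    that each subject keeps exactly one information hub.

    topics: list of dicts with 'identifiers' (set of URIs) and 'names'.
    This is a simplified sketch, not a TMDM-conformant merge."""
    merged = []
    for topic in topics:
        ids, names = set(topic["identifiers"]), set(topic["names"])
        # collect every already-merged topic sharing an identifier
        overlapping = [m for m in merged if m["identifiers"] & ids]
        for m in overlapping:
            ids |= m["identifiers"]
            names |= m["names"]
            merged.remove(m)
        merged.append({"identifiers": ids, "names": names})
    return merged

topics = [
    {"identifiers": {"http://psi.example.org/herodot"},
     "names": {"Herodot"}},
    {"identifiers": {"http://psi.example.org/herodot",
                     "http://psi.example.org/herodotus"},
     "names": {"Herodotus"}},
    {"identifiers": {"http://psi.example.org/krates"},
     "names": {"Krates"}},
]
print(len(merge_topics(topics)))   # 2
```

Because merged identifier sets grow transitively, two topics with no identifier in common still merge if a third topic bridges them — exactly the hub behaviour the integration model guarantees.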
The subject-centric modelling approach supports the (semi-automatic) generation of subject-centric web portals and other interfaces to the highly interlinked data [MB08]. Combined with interchange protocols like TMRAP [Ga06] or TMIP [Ba05b], these applications simultaneously feed the web of linked data [Be06].
1 According to the Topic Maps standards a subject is "anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever" [TMDM]. In short, a subject is anything that can be a topic of conversation. Simplified, subject-centric modelling enforces that for each relevant subject exactly one proxy is created within the domain model. Consequently, proxies become the unique information access points for all information about their subjects.
In the context of the work presented in this paper, the generation of Topic Maps data is an important issue. The following table summarises the general categories of approaches to creating Topic Maps data.

For the global interoperability and usability of generated Topic Maps data, two issues are important: (1) the domain ontology and (2) the subject identifiers used at type and instance level [Ma07a].
The domain ontology formalises the domain knowledge behind the data and can be used for the optimisation and generation of the data-consuming applications [Bo08]. The ontology of the Topic Maps data created by the work presented in this paper is shown in Figure 2.
For the integration of the generated topic maps with other information about the represented subjects, adequate subject identifiers must be used at the type and, most importantly, at the instance level [Ma07b]. The methodology section sketches the approach for choosing the subject identifiers in the argumentation trails.
2 http://www.isotopicmaps.org/ctm/
3 http://www.topicmapslab.de/glossary/XTM
Methodology
Exploring argumentation trails in a semantic network is closely related to searching the k shortest paths from a source to a target node in an undirected graph, where the number of paths k is substituted by a maximum path length. The k-shortest-paths problem is applicable in many fields and has been extensively studied, with the number of publications approaching 100. The four most widely recognised methods are those of Yen [Yen71], Lawler [Law72], Katoh [KIM82] and Hoffman [HP59]. Yen's algorithm is a naive application of Dijkstra's shortest-path algorithm, with complexity O(kn³), where k is the number of paths and n the number of nodes in the graph. Lawler and Katoh improve upon Yen by compartmentalising the paths, Lawler by a constant factor and Katoh to complexity O(kn²). Even before Yen, Hoffman introduced a different idea based on the precalculation of shortest paths for all nodes, also resulting in complexity O(kn²).

In highly data-rich environments, as described in this paper, memory constraints become an issue as well. Thus, of the above algorithms only Yen's was feasible, but much too slow. Early trials demonstrated the need for a custom-made method tailored to the specific problem.
A drastic reduction of the search space was necessary. In the following approach to explore argumentation trails with a maximum length of 3, we utilise topological properties that help us reduce the actual amount of data.

Instead of searching for paths between the source s0 and target t0 nodes, we search for connections and overlaps between the neighborhoods (Ns = {s1,..,sn}, Nt = {t1,..,tm}) of both endpoints. For a neighbor of the source or target node to be included in an argumentation trail, it must be incident to, or part of, the neighborhood surrounding the endpoint on the opposite side of the trail. Thus, in a graph G=(V,E) we search for nodes v that are either members of both Ns and Nt (v∈Ns ∩ Nt) or incident to a node in the opposite neighborhood ((v, ti)∈E for v∈Ns, or (v, si)∈E for v∈Nt).
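Under these conditions, the neighborhood-based search for trails up to length 3 might look as follows. This is an illustrative Python sketch over an adjacency-dict graph, not the authors' implementation (which, per the footnotes, builds on Java graph libraries):

```python
def trails_up_to_3(graph, s, t):
    """Enumerate argumentation trails of length <= 3 between s and t by
    intersecting and connecting the neighborhoods of the two endpoints.
    graph: dict mapping each term to the set of co-occurring terms."""
    ns = graph.get(s, set()) - {t}   # neighborhood of the source
    nt = graph.get(t, set()) - {s}   # neighborhood of the target
    trails = []
    if t in graph.get(s, set()):
        trails.append([s, t])        # direct edge: length 1
    for v in ns & nt:
        trails.append([s, v, t])     # shared neighbor: length 2
    for v in ns:                     # edge bridging the neighborhoods: length 3
        for w in graph.get(v, set()) & nt:
            if w not in (s, t, v):
                trails.append([s, v, w, t])
    return trails

# Invented toy graph for demonstration.
g = {"s": {"a", "b", "t"}, "t": {"s", "b", "c"},
     "a": {"s", "c"}, "b": {"s", "t"}, "c": {"a", "t"}}
for trail in sorted(trails_up_to_3(g, "s", "t"), key=len):
    print(" -> ".join(trail))
```

The point of the reduction is that only the two neighborhoods and the edges between them are ever inspected, instead of the full graph that a generic k-shortest-paths algorithm would have to hold in memory.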
Fig. 1: Path selection on the topology of the endpoints' neighborhoods
Fig. 2: Schema (TMCL)8 of the Topic Maps export for argumentation trails
4 http://jung.sourceforge.net/
5 http://prefuse.org/
6 http://www.topicmapslab.de/projects/tiny_TiM
7 http://www.isotopicmaps.org/tmcl/
Results – Graph and argumentation trail properties
The underlying co-occurrence graph is based on a corpus of about 5.5 million sentences and 87 million word tokens. A co-occurrence in the graphs shown in Table 1 is significant if it occurs at least three times and has a minimum log-likelihood ratio of 6.63. All columns of Table 1 labelled 2) to 7) are different subgraphs of 1). In columns 2) to 4) the minimum word frequency is 2; additionally, the 100, 300 and 500 most frequent words were excluded, respectively. Column 5) of Table 1 shows a smaller subgraph based only on named entities. Whilst column 6) expands all named entities of column 5) by normalised9 equal words, column 7) works on both a normalised corpus and a normalised named entity list.
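The normalisation referred to here (see footnote 9: lowercasing and removal of diacritics) can be sketched with Unicode decomposition. Whether the original pipeline handled all scripts, e.g. polytonic Greek, exactly this way is an assumption:

```python
import unicodedata

def normalise(word):
    """Lowercase a word and strip diacritics: decompose to NFD, then
    drop all combining marks (the accents and breathings themselves)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

print(normalise("Büchler"))       # buchler
print(normalise("ARISTOTELÊS"))   # aristoteles
```

Under this normalisation, differently accented spellings of the same name collapse to one node, which is what lets columns 6) and 7) expand the named-entity subgraph.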
Comparing the average degree of the underlying co-occurrence graph in row e) with the average degree within the edge-reduced argumentation trails in row g), it is obvious that the path-finding algorithm reduces the degree dramatically. However, the average degree of a node in an argumentation trail in row g) is significantly smaller than the degree of the inner nodes of an argumentation trail, given in rows h) and i) for trails of length two and three, respectively. This is caused by the more central role of hubs within an argumentation trail.
Table 1: Some properties of argumentation trails, including the characteristic features of the underlying co-occurrence graph10 11
8 The schema is created with the TMCL editor Onotoa (http://onotoa.topicmapslab.de/). The graphical notation used is a non-normative GTM level 1 syntax (http://www.isotopicmaps.org/gtm/) proposed by the Topic Maps Lab and implemented by Onotoa. The namespace "eaqua" must be resolved to "http://psi.eaqua.net/ontology/" and the namespace "concept" to "http://psi.eaqua.net/corpora/[corpusname]/".
9 Normalised: all letters are lowercased and diacritics are removed.
10 Column labels: 1) complete graph, 2) top 100 stop words and words with a frequency of 1 removed, 3) SW=300, min. freq.=2, 4) SW=500, min. freq.=2, 5) only named entities (NE), 6) normalised named entities, 7) named entities on a normalised corpus.
11 Row labels: a) number of nodes, b) number of co-occurrences, c) number of significant co-occurrences, d) percentage, e) average degree, f) number of trails, g) average degree, h) average degree of internal node (trail length 2), i) average degree of internal node (trail length 3).
A further result of Table 1 is shown in column 5). Row d) describes the ratio of significant co-occurrences to found co-occurrences. In table cell 5d) this ratio is significantly larger than all other ratios; the next section is therefore restricted to this data set.
Use Case – argumentation trails for Classical Studies
In Classics there are many use-case scenarios for argumentation trails. On the one hand, such trails can be used for exploring new domains (e.g. new centuries) by looking at the way in which different terms are related. On the other hand, ancient texts are strongly fragmented. In those cases one can, for example, observe a person A and know the context B of the fragmentary document. Using argumentation trails one can then observe how both concepts belong together, based on other texts of the same time frame. Furthermore, the found trails can be filtered more rigorously than in Figure 3b) by using other words from the fragmentary text. As a result, one obtains a virtual expansion of the document's story.
Figure 3: a) Connection between two words with a low number of trails; b) large trail cloud between two words
Typically, one observes trails like those in Figure 3 in such graphs. From a common start and end point, trails are found that differ in only one node (third column in Figure 3a). The differing nodes of the black and red trails of Figure 3a are Krates and Herodot. Searching for both words in the corpus, one finds 46 sentences that contain both. The counterexample is shown in Figure 3b: the black and red trails have only the start and end point in common. Hence two completely different argumentation trail threads exist.
Further Work and Conclusion
As mentioned in the introduction, this paper is a proof of concept. We examined the feasibility of the automatic extraction of argumentation trails and their usage as a discourse-enriching technique in academic or political debates.

The automatic generation has been identified as difficult. However, some very interesting results have been achieved and should be the basis of further research.

As shown in Table 1, the number of trails needs to be reduced dramatically. This might be achieved, e.g., by semantic preclustering or by author restrictions. With semantic preclustering, a trail is rejected if every node belongs to a different and completely unrelated semantic cluster. In contrast to lexical chaining [MWH08], this step is necessary because it is difficult to build a reliable "document-based summary" from text fragments. Author restrictions can be used to reject trails whose edges are computed from completely different sets of authors or works.
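One possible reading of this author restriction — reject a trail unless at least one author (or work) contributed to every one of its edges — can be sketched as follows. Both the function and the example data are hypothetical; the paper does not specify the exact criterion:

```python
def passes_author_restriction(trail, edge_authors):
    """Keep a trail only if some author contributed to every edge.

    trail: list of terms forming the trail.
    edge_authors: dict mapping an unordered term pair (frozenset) to the
    set of authors whose texts produced that co-occurrence."""
    author_sets = [edge_authors[frozenset(edge)]
                   for edge in zip(trail, trail[1:])]
    # reject if the edges come from completely disjoint author sets
    return bool(set.intersection(*author_sets))

# Hypothetical example data.
edge_authors = {
    frozenset({"krates", "kyniker"}): {"Diogenes Laertios"},
    frozenset({"kyniker", "diogenes"}): {"Diogenes Laertios", "Lukian"},
}
print(passes_author_restriction(
    ["krates", "kyniker", "diogenes"], edge_authors))   # True
```

A weaker variant would only require adjacent edges to share an author; requiring a global intersection, as above, is the strictest interpretation of the restriction.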
Furthermore, trails containing network hubs should be weighted lower to avoid distorting the results; this effect is visible in Table 1 as well. All of these complexity-reduction approaches are necessary to compute trails on more complex graphs.
In the field of visualisation, a stronger clustering of trails is necessary. As depicted in Figure 3a), there exist two almost equal trails differing in only two nodes. By clustering such trails into more globally relevant argumentation trail threads, the understanding of more complex trail clouds, as shown in Figure 3b, becomes easier and faster.

Additionally, the typing of nodes will be done by typed significant terms (e.g. literary, geographic or dating classifications) [BHG08]. The same holds for typing or naming the edges in the argumentation trails. Such enrichment will additionally support a stronger integration with Topic Maps. Generally, the work in this paper does not cover the problem of integrating the generated argumentation trails (as topic maps) with further background information in sufficient detail.
Argumentation Trails and Topic Maps
Owing to the historical roots of Topic Maps, the technology focuses on the aggregation of information about subjects (especially at the instance level, like persons, projects, etc.). The idea is to collect and document information about a subject from different "perspectives", whereby contradictions are expected. The integration of "facts" and "discourses" about subjects is a long-standing tradition in Topic Maps, which today is coined sensemaking [Pa08] and knowledge federation. Argumentation trails support discourses, ease the creation of new hypotheses and open new views on the data. Combined with further background information, they are a sensemaking and discourse-supporting tool for academic and political debates. By using Topic Maps and adequate subject identifiers, the concepts in the argumentation trails can be (instantly) integrated with other data or applications dealing with the same subjects.
References
[AM05] Ahmed, K.; Moore, G.: An introduction to Topic Maps. In: The Architecture Journal
5, 2005.
[Ba04b] Barta, R.: Virtual and Federated Topic Maps. In: Proceedings of XML Europe,
Amsterdam (2004).
[Ba05b] Barta, R.: TMIP, A RESTful Topic Maps Interaction Protocol. In: Proceedings of Extreme Markup Languages 2005, Montréal. Online available at: http://www.mulberrytech.com/Extreme/Proceedings/xslfo-pdf/2005/Barta01/EML2005Barta01.pdf
[Bar00] Barabasi, A.L. et al .: Scale-free characteristics of random networks: the topology of
the World-wide web, Physica A (281)70-77, 2000
[BB04] Baroni, M.; Bisi, S.: Using cooccurrence statistics and the web to discover synonyms
in a technical language. Proceedings of LREC 2004.
[BCP01] Brunn, M., Chali Y., Pinchak C. J.: Text Summarization Using Lexical Chains. 2001
[Be06] Berners-Lee, T.: Linked Data. Online available at:
http://www.w3.org/DesignIssues/LinkedData.html (2009-02-20)
[BM06] Böhm, K.; Maicher, L.: Real-time Generation of Topic Maps from Speech Streams.
In: Proceedings of First International Workshop on Topic Maps Research and Applications
(TMRA'05), Leipzig; Springer LNAI 3873, (2006).
[Bo08] Bock, B.: Topic-Maps-Middleware. Modellgetriebene Entwicklung kombinierbarer domänenspezifischer Topic-Maps-Komponenten. Diploma thesis at University of Leipzig (2008).
[Ga06] Garshol, L. M.: TMRAP – Topic Maps Remote Access Protocol. In: Maicher, L.;
Park, J. (Hrsg.): Charting the Topic Maps Research and Applications Landscape. LNAI 3873,
Springer:Berlin (2006).
[GK03] Galley, M., McKeown, K.: Improving Word Sense Disambiguation in Lexical
Chaining. 2003.
[He08] Heuer, L.: Streaming Topic Maps API. In: Maicher, L.; Garshol, L.M. (eds.): Subject-centric computing. Proceedings of TMRA 2008. Leipzig (2008).
[HP59] Hoffman, W.; Pavley, R.: A method for the solution of the nth best path problem. Journal of the Association for Computing Machinery (ACM) 1959; 6:506-514.
[KIM82] Katoh, N.; Ibaraki, T.; Mine, H.: An efficient algorithm for k shortest simple paths. Networks 1982; 12:411-427.
[Kle00] Kleinberg, J.: The small-world phenomenon: An algorithmic perspective. Proc. 32nd
ACM Symposium on Theory of Computing, 2000.
[Law72] Lawler, E. L.: A procedure for computing the k best solutions to discrete optimisation
problems and its application to the shortest path problem. In: Management Science, Theory
Series 1972; 18:401-405.
[LK08] Lachica, R.; Karabeg, D.: Metadata Creation in Socio-semantic Tagging Systems:
Towards Holistic Knowledge Creation and Interchange. In: Maicher, L.; Garshol, L.M.:
Scaling Topic Maps. LNAI 4999, Springer:Berlin (2008).
[Ma07a] Maicher, L.: Autonome Topic Maps. Zur dezentralen Erstellung von implizit und
explizit vernetzten Topic Maps in semantisch heterogenen Umgebungen. Doctoral thesis at
University of Leipzig (2007).
[Ma07b] Maicher, L.: The Impact of Semantic Handshakes. In: Maicher, L.; Sigel, A.; Garshol,
L. M.: Leveraging the Semantics of Topic Maps. LNAI 4438, Springer, Berlin (2007).
[Ma08] Maicher, L.: Musica migrans - Mapping the Movement of Migrant Musicians.
Presentation held at the Topic Maps User Conference 2008, Oslo. Slides available at (April 10,
2008): http://www.topicmaps.com/tm2008/maicher.pdf
[Mai08] Maicher, L.: Mapping between the Dublin Core Abstract Model DCAM and the
TMDM. In: Maicher, L.; Garshol, L.M.: Scaling Topic Maps. LNAI 4999, Springer, Berlin.
[MB08] Maicher, L.; Bock, B.: ActiveTM - The Factory for Domain-customised Portal
Engines. In: Proceedings of I-Media’08, Graz (2008).
[Mil67] Milgram, S.: The small world problem. In: Psychology Today 2, pp. 60-67, 1967.
[MWH08] Mehler, A.; Waltinger, U.; Heyer, G.: Towards Automatic Content Tagging: Enhanced Web Services in Digital Libraries Using Lexical Chaining. In: 4th International Conference on Web Information Systems and Technologies (WEBIST '08), Funchal, Portugal, 2008.
[MWW07] Mehler, A.; Waltinger, U.; Wegner, A.: A Formal Text Representation Model Based on Lexical Chaining. In: Proceedings of the KI 2007 Workshop on Learning from Non-Vectorial Data (LNVD 2007), September 10, Osnabrück, pp. 17-26. University of Osnabrück, 2007.
[Pa08] Park, J.: Topic Maps, Dashboards and Sensemaking. In: Maicher, L.; Garshol, L.M.:
Subject-centric computing. Proceedings of TMRA 2008. Leipzig, (2008).
[Va05] Vatant, B.: Tools for semantic interoperability: hubjects. Working Paper. Online
available at: http://www.mondeca.com/lab/bernard/hubjects.pdf
[WS98] Watts, D.J., S.H. Strogatz: Collective dynamics of ‘small-world’ networks. In: Nature
393:440-442, 1998.
[Yen71] Yen, J.Y.: Finding the k shortest loopless paths in a network. In: Management Science
1971; 17:712-716.