G. Singh (2010) Use of semantic technology for geocoding entry descriptions from historic gazetteers. In: Proceedings of the GIScience 2010 doctoral colloquium, Zurich, Switzerland, September 2010 / J.O. Wallgrün, A.-K. Lautenschütz. - Heidelberg: Akademische Verlagsgesellschaft, 2010. - 86 p.; ISBN 978-3-89838-640-1. pp. 75-80

Use of Semantic Technology for Geocoding Entry Descriptions from Historic Gazetteers
Gaurav SINGH¹
Department of Geo-information Processing, Faculty of ITC, University of Twente, The Netherlands

1. Background and Motivation

Our work is motivated by the wish to improve the applicability of spatial data infrastructures (SDI) to societal problems, especially in developing contexts, through the smart use of semantic technology. One important problem is geocoding from textual descriptions, which is needed, among others, in societal networks that use SMS-based communication. Short messages may hold important information about road conditions, emergency situations or public health matters. This paper reports on a study using somewhat standardized locality descriptions.

Figure 1. An example entry in a gazetteer [14].

In historic times (i.e., pre-GPS days), when the media for collecting and sharing information were limited, locations and their names were recorded in situ as textual descriptions. For instance, in ornithology, a series of gazetteers was published by Paynter and Traylor in the 1980s-1990s [13, 14]. In these gazetteers, a locality description contains a number of statements that can be interpreted as hints to its position. For example, the text in Figure 1 shows that the position of ABRILONGO is unknown. There are, however, hints in the description, such as elevation, distance to another locality, and reference papers, that may help in determining the position of the locality. Typically, the descriptions consist of spatial objects and references to other such objects through spatial relations. A spatial object, such as a mountain, river, or town, is represented geometrically as a point, line or polygon, or as a collection of these. Spatial relations refer to other spatial objects, and may be further parameterized by distance, bearing or otherwise. Regularly, a spatial relation is further qualified, as in "near left bank of Rio São Francisco".

¹ Corresponding Author: Department of Geo-information Processing, Faculty of ITC, University of Twente, P.O. Box - 06, 7500 AE Enschede, The Netherlands; E-mail: singh09721@itc.nl.

Figure 2. Spatial relations based on distance and direction [14].

Distance/bearing relations should be interpreted differently when referring to different types of spatial object. For example (Figure 2), a proximity relation should be given a different meaning depending on whether it refers to a building, a river or a state province. Some of the common spatial relations that our descriptions contain are proximity ("close to", "near"), containment ("within", "in") and other relations ("above", "over"). One can easily deduce from the examples of Figure 2 that various parts of the descriptions belong to a bygone era. For example, the league,² as a distance unit, has since been replaced by miles and kilometers, the latter probably being used with more accuracy today than the league was in its time. Likewise, there are changes in place names, changes in administrative boundaries, and changes in the footprints of spatial objects (lakes may have shrunk, cities may have grown, rivers may have changed course). The adopted philosophy is to model these spatial objects and spatial relations in an ontology, and to do justice to the possibilities of change. The spatial relations used in the descriptions thus help us in identifying the spatial object being referred to. To clarify this further, in Figure 3 Ribeiro da Vereda is understood to be a river, stream or tributary, and not a state, city, mountain or lake, since the spatial relation used in its reference is "above the mouth of".

The problem is to determine a geocode or improve on an existing one, to determine its level of accuracy, and to state our confidence in the result. Each hint comes with a different accuracy. The question here is how we can quantify these accuracies, and use them to determine the accuracy of the overall geocoding process. Geocoding from textual descriptions is non-trivial and important because it bridges textual with spatial intelligence, and allows putting described places on the map. Here, it also helps to remove textual ambiguity and to locate places of historic interest.
This will eventually allow us to geocode many more museum specimens. Hence, our main objective is to use semantic technology to resolve the geocodes of localities that are marked as "Not located" by the authors of the gazetteer, and to improve on uncertain geocodes.
² An itinerary measure of distance, varying in different countries, but usually estimated roughly at about 3 miles; app. never in regular use in England, but often occurring in poetical or rhetorical statements of distance. (Source: Oxford English Dictionary)
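
As an illustration of how a distance/bearing hint could be turned into a candidate position, the following is a minimal sketch and not the method used in this study. It assumes a spherical Earth with a mean radius of 6371 km, takes the league as roughly 3 miles (the rough estimate quoted from the Oxford English Dictionary; the real value varied by country and era), and uses made-up coordinates for the reference locality. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumption: spherical Earth, mean radius
KM_PER_MILE = 1.609344
MILES_PER_LEAGUE = 3.0     # assumption: rough estimate; varied in practice


def leagues_to_km(leagues):
    """Convert leagues to kilometres under the 3-mile assumption."""
    return leagues * MILES_PER_LEAGUE * KM_PER_MILE


def destination_point(lat_deg, lon_deg, bearing_deg, distance_km):
    """Great-circle destination from a start point, bearing and distance."""
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    brg = math.radians(bearing_deg)
    ang = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    lat2 = math.asin(math.sin(lat1) * math.cos(ang)
                     + math.cos(lat1) * math.sin(ang) * math.cos(brg))
    lon2 = lon1 + math.atan2(math.sin(brg) * math.sin(ang) * math.cos(lat1),
                             math.cos(ang) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


# Hypothetical hint: "2 leagues N of" a reference locality at (-15.0, -45.0).
candidate = destination_point(-15.0, -45.0, 0.0, leagues_to_km(2))
print(candidate)
```

Because the length of a league is itself uncertain, a fuller treatment would carry a range of distances through this computation and report a region rather than a single point.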

Figure 3. Sensitivity of Spatial relations to spatial objects [14].

2. Research Questions

The following research questions have been laid down:

1. How can the various spatial objects and relations present in the gazetteer descriptions be characterized ontologically?³
2. How can ontologies help in the proper interpretation of descriptions in the gazetteer, while keeping track of the historic changes in the spatial objects mentioned in them?
3. What kind of reasoning mechanism is required to reason over distance and direction, and how does that translate into GIS-based queries on the datasets?

3. Related Work

In the mentioned gazetteers, place name and location are provided as text, following a semi-structured syntax. Extracting a geocode from this is not easy, and is addressed by both Geographic Information Extraction (GIE) and Geographic Information Retrieval (GIR) from textual descriptions. In [12], challenges in GIR are described and an overview of projects like SPIRIT⁴ and geo-X-walk⁵ is provided. The work in [15] used an ontology-based approach to disambiguate geographical names from publicly available geographic gazetteers, using linguistic knowledge obtained from WordNet. The ontology can be extended beyond its current gazetteer function to display information items related to ontology instances. Such an ontology could also be utilized to determine an optimal geocode from hints in the descriptions. Its creation may help to resolve homonyms, and to derive a geocode for localities present in statements in the gazetteers. Recent work on GIR to identify and disambiguate place names is mentioned in [6].

Early research by [5] used a PPI (Prepositional Phrase Interpreter) and parsers for natural language location descriptions to convert these to spatial relations. That work applies fuzzy concepts of cardinal directions, like "north of", and other spatial relations. Early papers by Frank [2,3] on qualitative and temporal reasoning are useful in solving these problems, while the work by Mark and Frank [1,4] addresses some of the natural language issues in GIS. Another dimension is spatial data heterogeneity [9], in which the challenge lies in representing geographic places through ontologies and reasoning over them. The author of [8] discusses the development of data models based on an ontology framework to capture events in space and time, such as the U.S. elections of 2008, to support visualization and analysis. The common thread between that research and ours is the development of ontology-based data models. However, our work goes beyond this and also aims to find and/or improve the geocode of localities from textual descriptions with the use of semantic technology. Recent developments in this area have been reported in [10, 11].

³ With reference to the discussion in the previous section, in which it was specified how a multitude of spatial relations can be interpreted differently when referring to the same or different spatial objects. We call this phenomenon the overloading of relations with the spatial object(s) involved.
⁴ SPIRIT (Spatially-Aware Information Retrieval on the Internet) http://www.geo-spirit.org/
⁵ Geo-X-walk http://hds.essex.ac.uk/geo-X-walk/

4. Methodological Approach

As a first step, we OCRed and corrected all 8000+ entries in the Paynter and Traylor gazetteer for Brazil [14]. As a second step, we use Natural Language Processing (NLP) tools to parse the descriptions and analyse their syntactic structure. This helps to extract the various noun and verb phrases, and to associate geographic concepts and relationships with them. We then apply Named Entity Recognition (NER) to classify parts of the textual description into categories such as Place name, Location, Person, and Bibliographic reference. We will apply an existing topographic ontology to associate semantic knowledge with the geographic concepts extracted from the textual descriptions. We plan to devise techniques to make uncertainty and/or ambiguity explicit, and to reduce these using the parts of the descriptions that express distance, direction or time. Further spatial analysis on fundamental data (such as elevation, land cover, transportation, hydrological and administrative data) will aid in the interpretation of geocodes from the texts. The use of existing online gazetteers is also foreseen as a later quality improvement.

In all, 11% of the entries in our use case corpus have an unknown geocode, and a further 8% have a doubtful geocode. The spatial accuracy of known entries is at the 1-minute level. This is important because in Brazilian natural history, information about collecting sites requires precision: whether a site lies on the left or the right bank of a river is crucial. With this work we hope to improve substantially on all cases. Moreover, the intact data set will allow mutual consistency checks, for instance, on explorer travels as documented through the gazetteer entries.
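
To give a feel for what accuracy at the 1-minute level means on the ground, the sketch below converts one arc-minute into kilometres. It is an illustration only and assumes a spherical Earth with a mean radius of 6371 km; the sample latitude of 15° S is an arbitrary value within Brazil, not one taken from the gazetteer, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def arc_minute_km(lat_deg):
    """Return (north-south, east-west) ground length of one arc-minute in km."""
    one_minute_rad = math.radians(1.0 / 60.0)
    north_south = EARTH_RADIUS_KM * one_minute_rad
    # Meridians converge towards the poles, so east-west length shrinks.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


ns, ew = arc_minute_km(-15.0)
print(f"1 arc-minute at 15 S: {ns:.2f} km N-S, {ew:.2f} km E-W")
```

Under these assumptions one arc-minute spans roughly 1.85 km north-south, which is wider than many rivers, so a geocode at that resolution cannot by itself settle which bank a collecting site was on.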

5. Research Carried Out

The gazetteer case data [14] consist of 8000+ entries. We digitized and corrected the original entries, and developed a database holding locality name, state name, known geocode and description for all corrected entries. For textual analysis of these descriptions we are using the Stanford NLP tool,⁶ a statistical parser that works on the input sentence and provides its syntactic structure in the form of a parse tree, and its grammatical relations in the form of typed dependencies (Figure 4). Natural language is ambiguous and unstructured, and to deal with this we need a formal model such as an ontology. The knowledge stored in an ontology is useful in combination with reasoning tools to produce new facts that are hidden in the NLP output, and to further enhance the knowledge base.

⁶ Stanford NLP Tool - http://nlp.stanford.edu/software/lex-parser.shtml

Figure 4. Output from Stanford parser.

We propose to use the ontology to annotate the leaves of the parse tree (NNP, NNS, IN, JJ)⁷ with the concepts and relations present in the ontology. The most commonly found relations in our gazetteer descriptions are signalled by prepositions or subordinating conjunctions (IN). For example, "in", "within", "at", "on", "between", "close to", "near" and "in vicinity" can be easily recognized. An extra challenge is that the data contain both Brazilian Portuguese and English names. To overcome this problem, we are characterizing the various spatial objects that appear in Brazilian Portuguese in the descriptions with the related English terms in the ontology. For example, the term Fazenda means Farm and Campos means Fields. Simultaneously, we are also identifying other base data, such as elevation, land cover and transport network, that are required in the project. Since this is research in progress, our future work will focus on developing reasoning techniques for our use case corpus.
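
The recognition of relation phrases and the mapping of Portuguese feature terms can be sketched as follows. This is a simplified illustration, not the pipeline used in this work: plain string matching stands in for the parser's IN tags and for the ontology, the relation list and the two glossary entries are the ones mentioned in the text, the function names are hypothetical, and the sample sentence is invented.

```python
import re

# Relation phrases listed in the text; longer phrases come first so that
# "in vicinity" is tried before "in".
RELATIONS = ["in vicinity", "close to", "between", "within", "near",
             "in", "at", "on"]

# Portuguese feature terms and the English equivalents given in the text.
GLOSSARY = {"fazenda": "Farm", "campos": "Fields"}

_RELATION_RE = re.compile(
    r"\b(" + "|".join(re.escape(r) for r in RELATIONS) + r")\b",
    re.IGNORECASE)


def find_relations(description):
    """Return the spatial relation phrases found, in order of appearance."""
    return [m.group(1).lower() for m in _RELATION_RE.finditer(description)]


def translate_feature_terms(description):
    """Return (Portuguese term, English concept) pairs found in the text."""
    pairs = []
    for word in re.findall(r"[^\W\d_]+", description):
        concept = GLOSSARY.get(word.lower())
        if concept is not None:
            pairs.append((word, concept))
    return pairs


# Invented example sentence, not a real gazetteer entry.
sample = "Fazenda Boa Vista, near left bank of Rio São Francisco"
print(find_relations(sample))
print(translate_feature_terms(sample))
```

String matching of this kind over-matches, since a word such as "on" or "at" is not always used spatially; relying on the parser's part-of-speech tags and on ontology concepts is what would make the annotation dependable.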

6. Research Contribution to GI Science

The innovation in this research is in using semantic technology for geocoding from semi-structured textual descriptions. This will allow us to trace the routes followed by collectors in historic times.

⁷ Part-Of-Speech (POS) Tags - http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html

References
[1] A. U. Frank and D. M. Mark, Language Issues for Geographical Information Systems. In Geographic Information Systems: Principles and Application, edited by D. J. Maguire, M. F. Goodchild and D. W. Rhind, 1991.
[2] A. U. Frank, Qualitative Temporal Reasoning in GIS - Ordered Time Scales. In Sixth International Symposium on SDH, Edinburgh, Scotland, Vol. 1, eds. T. C. Waugh and R. G. Healey, Taylor and Francis, U.K., 1994.
[3] A. U. Frank, Qualitative spatial reasoning: cardinal directions as an example. International Journal of Geographical Information Systems 10(3) (1996), 269-290.
[4] D. M. Mark and A. U. Frank, Concepts of space and spatial language. (1989) http://drc.ohiolink.edu/handle/2374.OX/20698 accessed on November 13, 2009.
[5] D. N. Chin, M. McGranaghan and T. T. Chen, Understanding location descriptions in the LEI system. Proceedings of the fourth conference on Applied Natural Language Processing. Stuttgart, Germany, Association for Computational Linguistics: 138-143, 1994.
[6] D. S. Batista, M. J. Silva, F. M. Couto and B. Behera, Geographic signatures for semantic retrieval. Proceedings of the 6th Workshop on Geographic Information Retrieval. Zurich, Switzerland, ACM: 1-8, 2010.
[7] J. H. Hong, M. J. Egenhofer and A. U. Frank, On the Robustness of Qualitative Distance- and Direction-Reasoning. In the Proceedings of Auto-Carto 12, 1995.
[8] K. Grossner, Modelling and Measuring Campaign 2008 to Support Visualization and Analysis. In Workshop of Media Arts, Science, and Technology (MAST), Santa Barbara, U.S., 2009.
[9] K. Janowicz, The Role of Place for the Spatial Referencing of Heritage Data. In the Cultural Heritage of Historic European Cities and Public Participatory GIS workshop, U.K., 2009.
[10] M. Piotrowski, S. Läubli and M. Volk, Towards mapping of alpine route descriptions. Proceedings of the 6th Workshop on Geographic Information Retrieval. Zurich, Switzerland, ACM: 1-2, 2010.
[11] M. Piotrowski, Leveraging back-of-the-book indices to enable spatial browsing of a historical document collection. Proceedings of the 6th Workshop on Geographic Information Retrieval. Zurich, Switzerland, ACM: 1-2, 2010.
[12] Ø. Vestavik, Geographic Information Retrieval: An Overview. International ODRL workshop, Vienna, Austria, 2004.
[13] R. A. Paynter Jr., M. A. Traylor and B. Winter, Ornithological Gazetteer of Bolivia, Harvard University Press, Cambridge, Massachusetts, 1975.
[14] R. A. Paynter Jr. and M. A. Traylor Jr., Ornithological Gazetteer of Brazil. Museum of Comparative Zoology, Harvard University Press, Cambridge, Massachusetts, 2 volumes, 1991.
[15] R. Volz, J. Kleb and W. Mueller, Towards ontology-based disambiguation of geographic identifiers. Proceedings of World Wide Web, Banff, Canada, May 8-12, 2007.
