Where in the world are you?

Geolocation and language identification in Twitter

Mark Graham, Scott A. Hale, Devin Gaffney
Oxford Internet Institute, University of Oxford
All authors contributed equally to this paper.

Abstract

The movements of ideas and content between locations and languages are unquestionably crucial concerns to researchers of the information age, and Twitter has emerged as a central, global platform on which hundreds of millions of people share knowledge and information. A variety of research has attempted to harvest locational and linguistic metadata from tweets in order to understand important questions related to the 300 million tweets that flow through the platform each day. However, much of this work is carried out with only limited understandings of how best to work with the spatial and linguistic contexts in which the information was produced. Furthermore, standard, well-accepted practices have yet to emerge. As such, this paper studies the reliability of key methods used to determine language and location of content in Twitter. It compares three automated language identification packages to Twitter's user interface language setting and to a human coding of languages in order to identify common sources of disagreement. The paper also demonstrates that in many cases user-entered profile locations differ from the physical locations users are actually tweeting from. As such, these open-ended, user-generated, profile locations cannot be used as useful proxies for the physical locations from which information is published to Twitter.

Keywords: Geography, Language, Twitter

This is a preprint copy of the following forthcoming article: Graham, M., Hale, S. A., and Gaffney, D. (2013). Where in the World are You? Geolocation and Language Identification in Twitter. Professional Geographer. Forthcoming.

Microblogging services such as Twitter allow researchers, marketers, activists and governments unprecedented access to digital trails of data as users share information and communicate online. Patterns of information exchange on platforms that rely on user-generated content have been used recently in scholarly research about community (Gruzd, Wellman, and Takhteyev 2011), information diffusion (Romero, Meeder, and Kleinberg 2011), politics (Bruns and Burgess 2011), religion (Shelton, Zook, and Graham 2013), crisis response (Zook et al. 2010; Palen et al. 2011), and many other topics. Such data are also important to governments and marketers seeking to understand trends and patterns ranging from customer/citizen feedback to the mapping of health pandemics (Graham and Zook 2011). Twitter in particular, with its large and international user base (there are now over 350 million users on the platform), has been the source of much scholarly research.

Content passed through Twitter remains decontextualized, however, unless we find ways to reattach it to geography. In other words, we don't just want to know what is said, but we also want to know where it is said and to whom it is said. As such, the attributes of language and location are crucial for understanding the geographies of online flows of information and the ways that they might reveal underlying economic, social, political and environmental trends and patterns. Yet, both language and location are challenging to deduce in the short messages that pass through Twitter, and no well-accepted methodology for their extraction and analysis has been articulated. This point is especially salient because of the increasing number of studies, journalistic accounts, and real-world applications that rely on harvested locational and language data from Twitter.
Therefore, in order to provide a useful starting point for future research on Twitter (and indeed other micro-blogging platforms), this paper compares several approaches to working with geographic information in Twitter in order to better understand the strengths and limitations of each.

The short size of posts (140 characters on Twitter) presents a challenge to accurate language identification due to the fact that most language identification algorithms are trained on larger sized documents (Carter, Tsagkias, and Weerkamp 2011). In addition, the style of writing on Twitter using abbreviations and acronyms complicates language classification. In many instances, researchers have simply relied on the user interface language of a user's account or used an off-the-shelf language detection package without consideration of its suitability for use on short, informal text phrases. The disagreement of several studies on the most used languages in Twitter (Honeycutt and Herring 2009; Semiocast 2010; Hong, Convertino, and Chi 2011; Takhteyev, Gruzd, and Wellman 2011) highlights the difficulty of language detection. All four studies agree English is the most used language, but give percentages ranging from 50 percent (Semiocast 2010) to 72.5 percent (Takhteyev, Gruzd, and Wellman 2011). The purpose of our work is not to study the prominence of different languages on the platform, but is rather to highlight important methodological issues related to language identification in order for future research to more critically engage in geolinguistic analyses.

Accurately determining location in messages sent through Twitter is also a significant challenge. The most apparent method is to consider the profile information that is directly provided by a user (e.g. the text "Washington, DC" in Figure 1) in response to an account set-up question: "Where in the world are you?" However, this question, which allows users to input any text
string to describe their location (referred to in this paper as 'profile location'), is often hard to geolocate correctly (the open-ended text could just as easily say "Edinburgh, Scotland," "Barad-dûr, Mordor, Middle-earth," or simply "here"). High error rates, missing data and non-standardized text in profile locations have forced some researchers wishing to employ this geographic data to use smaller samples and labor-intensive manual coding of profile locations (e.g. Takhteyev, Gruzd, and Wellman 2011).

Figure 1. Screenshot from Barack Obama's Twitter profile page.

An alternate approach that some researchers have adopted is to narrow their samples to only use geocoded tweets. Depending on the user's privacy settings and the geolocation method used, these tweets have either an exact location specified as a pair of latitude and longitude coordinates or an approximate location specified as a rectangular bounding box. This type of geographic information (referred to in this paper as 'device location') represents the location of the machine or device that a user used to send a message on Twitter. More precisely, the data are derived from either the user's device itself (using the Global Positioning System [GPS]) or by detecting the location of the user's Internet Protocol (IP) address. Precise coordinates are almost certainly from devices with built-in GPS receivers (e.g. phones and tablets). Bounding boxes, however, can result from privacy settings applied to GPS data or from GeoIP data. Irrespective of these limitations, device locations are challenging for users to manually manipulate, and, because they are structured data, are easily interpreted by computers.

However, only a small portion of users publish geocoded tweets, and it is unlikely that they form a representative sample of the broader universe of content (i.e. the division between geocoding and non-geocoding users is almost certainly biased by factors such as social-economic status, location, education, etc.). From a sample of 19.6 million tweets collected by the authors (these data were collected using Twitter's 'statuses/sample stream' collection method with 'spritzer access') over nineteen days in June 2011, only 0.7 percent of tweets contained structured geolocation information.
As such, the extremely low proportion of information with attached device locations means that researchers either have to work with data that are likely highly skewed or devise effective methods to work with the profile location that is attached to all of the tweets that do not contain explicitly geocoded device location information.

This paper deals with these gaps of knowledge related to language and location in two primary ways. First, it explores the accuracy of a range of language detection methods on tweets, which, by definition, are short and often contain informal phrasings and abbreviations. It identifies common sources of errors and compares performance over four research locations, each comprising a large variety of languages. Second, it compares various location information within tweets (profile location, device location, timezone information) and the accuracy with which geolocation algorithms can interpret the free-form profile location information. In performing this work, we are able to refine methods that can be employed to map and measure the geolinguistic contours of people's information trails on Twitter. Doing so will ultimately allow future work to build on this research in order to create more accurate and nuanced understandings of the clouds of digital information that overlay our planet.

Related Work

A variety of methods have been employed in looking at Twitter's geolinguistic contours. Hong et al. (2011) used two automated tools to determine the language of a tweet: LingPipe and the Google Language Application Programming Interface (API), while Semiocast (2010) use an internal proprietary tool. Carter et al. (2011) and Gottron and Lipka (2010) discuss several of the challenges with language identification on short texts, the largest being that most language detection algorithms have been developed and trained on full documents that are longer and better formulated than the short text snippets that pass through Twitter. Carter et al. (2011) focus on microblog posts and develop two approaches (priors) to enhance performance: a link-based approach to consider the language of linked-to content and a blogger-based approach to aggregate tweets on a per account basis to form a larger document to classify. They find both approaches improve accuracy, but still leave room for further improvement. Hale (2012a) used the Compact Language Detection (CLD) kit, part of Google Chrome, for detecting the language of blogs in conjunction with the presence of certain keywords. He found these two methods in combination to be 95 percent accurate on a sample of 965 blogs about the Haitian earthquake. The CLD has since been used for creating visualizations of language on Twitter (Fischer 2011), but its accuracy has not yet been evaluated for short posts passed through Twitter.

While geographic metadata in device locations (i.e. precise coordinates) are unlikely to be subject to much debate about their validity, the self-reported profile location field in a user's profile is problematic because of its unstructured nature. However, it remains that the usage of profile locations is often contemplated in papers that discuss the virtual data shadows to geographically bound situations such as the Arab Spring of 2011 or the Iran election protests of 2009 (Hecht et al. 2011; Lotan et al. 2011; Gaffney 2010). Takhteyev et al. (2011) attempted an automated coding of profile locations with an unnamed tool, but ultimately decided to hand code profile location details due to high error rates. Vieweg et al. (2010) also hand-coded profile locations, but also manually used tweet content in addition to profile content in determining the user's physical location. Java et al. (2007) used the Yahoo geocoding API, which attempts to assign a precise location to self-reported profile locations. However, the accuracy of such geocoding algorithms to profile location data on Twitter has not been previously determined. Most importantly, Hecht et al. (2011) show the need for great caution in finding that only 66 percent of the Twitter profiles they examined by hand had valid geographic information. 18
percent were blank and 16 percent had non-geographic information, mostly made of popular culture references. As a result, geocoding APIs will likely struggle with this input.

In contrast to the free-form nature of the profile location, Krishnamurthy et al. (2008) opt to use timezone (UTC offset) information in a user's profile to get a user's local time and thereby approximate longitude. Although it is impossible to determine latitude using this method, it remains that such a strategy can improve our best guesses about profile locations. However, it is unclear how many people actually set an accurate timezone. This is particularly a concern for users that employ third-party clients instead of visiting the Twitter website itself (within our sample fewer than 50 percent of tweets are created on Twitter's own website).

Newer research is developing methods to locate users based on the text content of their tweets, the time of day users tweet at, and/or the location of the users they are following or followed by (Cheng et al. 2010; Eisenstein et al. 2010; Hecht et al. 2011; Wing and Baldridge 2011; Mahmud 2012; Sadilek et al. 2012). All of these approaches, however, have only been developed and evaluated using tweets in the English language and/or geocoded tweets from the United States. This paper does not consider these developing approaches, but evaluates two off-the-shelf geocoding services and assesses their accuracy and performance across four different regions, only one of which is in the United States. In the manual examination of profile locations, the paper also raises hints of possible challenges these newer approaches will have to overcome to be geographically and linguistically broader. The paper also provides important insights into the disaccord between profile and device locations, which is important for the data used to develop and test these new approaches.

In sum, it is important to be aware of the myriad, overlapping and complex
ways in which location is ascribed to information in Twitter before attempting to employ it in any geographic analyses. The following section more closely examines the accuracy and sources of error in a range of methods used to extract location and language from Twitter in order to build upon existing work and more clearly articulate how this information collected from Twitter might be of use for research in geography.

Methods

Between 10 November and 16 December, 2011, 111,143,814 tweets were collected by using the statuses/filter method of Twitter's streaming API¹ (these data are mapped in Figure 2). The method allows tweets to be collected from within a user-specified bounding box, which was drawn as a 180 by 360 degree box in this study (or a bounding box that encompasses the whole planet). Tweets sampled this way from the streaming API only include tweets with an explicit GeoIP or GPS 'device' location. The search API, which may make guesses as to users' location, was not used in this study. While some rate limiting errors are met when a large box is built (Twitter defines a maximum data rate, and any additional data above that limit is dropped from the stream), this effect is measurable. Rate limiting was only noticed during times that were coincident with some North American weekday peak hours, and only 1.1 percent of our files were ultimately affected by any significant errors of data. Data were otherwise collected constantly with a few intermittent and brief crashes, and all downloaded information was stored in tab-separated files.
¹ This method captures tweets that are geocoded by both IP addresses and GPS-enabled devices.

Figure 2. Map of geotagged tweets captured between November 10 and December 16, 2011
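The bounding-box filtering described in this section can be illustrated with a minimal Python sketch that tests whether a device location falls inside a box. It assumes each box is ordered (west, south, east, north) in decimal degrees, which the paper does not state explicitly; the constant and function names are hypothetical, and the four site boxes are the values the paper gives in its footnote on the urban areas.

```python
# Assumption: boxes are (west, south, east, north) in decimal degrees.
# The whole-planet box corresponds to the 180 by 360 degree box used for collection.
WHOLE_PLANET = (-180.0, -90.0, 180.0, 90.0)

# Site boxes as listed in the paper's footnote (coordinate order is assumed).
SITE_BOXES = {
    "Cairo": (31.1, 29.95, 31.54, 30.28),
    "Montreal": (-74.0, 45.33, -73.35, 45.78),
    "San Diego": (-117.3, 32.43, -116.74, 32.9),
    "Tokyo": (139.3, 35.4, 140.2, 35.9),
}


def in_box(lon, lat, box):
    """Return True if the point (lon, lat) lies inside or on the edge of box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def site_for_point(lon, lat):
    """Return the name of the research site containing the point, or None."""
    for site, box in SITE_BOXES.items():
        if in_box(lon, lat, box):
            return site
    return None
```

For example, `site_for_point(139.69, 35.69)` falls inside the Tokyo box, while a point in the mid-Atlantic matches only `WHOLE_PLANET`.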

From our sample of 111 million tweets, 1,000 tweets were randomly selected from each of four metropolitan areas² (Cairo, Montreal, San Diego and Tokyo). These areas were selected by the research team as research sites that are characterized by interesting geographic, linguistic and cultural differences. To avoid over-representation by heavy users, we only include a maximum of one tweet per user in the sample. The location of every tweet was determined by the device location (GeoIP/GPS) recorded by Twitter.

Rather than relying on one particular algorithm, library or toolkit to establish a higher degree of accuracy, validity or certainty about the data, a central aim of this paper is to compare existing solutions. As such, the research team reviewed existing available geolocation and language identification solutions and evaluated them on their ease of use, throughput, and thoroughness. Based on these results, three language detection and two geolocation services were ultimately selected (Table 1). Custom scripts employing each of the automated language identification algorithms (Alchemy, the Compact Language Detection kit and Xerox) and each of the geolocation solutions (Google and Yahoo) were written and the results of these algorithms were stored in a database. While other services certainly exist and the services considered in this paper are not exhaustive or authoritative, these services currently constitute the more easily implemented off-the-shelf
² The bounding boxes that we used to define the four urban areas are as follows: Cairo (31.1, 29.95, 31.54, 30.28), Montreal (-74, 45.33, -73.35, 45.78), San Diego (-117.3, 32.43, -116.74, 32.9), Tokyo (139.3, 35.4, 140.2, 35.9). In all cases, outer ring roads were used to determine the approximate extent of each city's conurbation. While this approach is relatively imprecise, we deemed it equally problematic to establish a consistent bounding box size for all sample cities (due to the significantly different sizes of the four urban areas).

solutions that are available. In particular, since Google switched its Language Detection API to a paid service, many researchers and companies have tried the CLD kit, but it has not been compared with other solutions. CLD also forms part of the language-detection augmentation offered by DataSift, a Twitter data reseller, and hence is likely used by many commercial companies working with Twitter data. All of the language detection algorithms surveyed are pre-trained and immediately usable for any piece of text. This allows comparison apart from the specific data used to train the algorithm.

Table 1. Language and Location Services overview

Service | Type | Use | Throughput | Thoroughness
Compact Language Detection Kit | Language Detection | C/C++ library with Ruby and Python bindings available | No limits; service is executed locally | 161 languages
Alchemy API | Language Detection | Web Service | 1,000 requests / day / API Key | 97 languages
Xerox Open Source | Language Detection | Web Service | No discernible limits | 46 languages
Twitter UI Language | User selected value | Delivered with tweet | n/a | 33 languages
Google Geocoding API | Geolocation | Web Service | 2,500 requests / day / IP |
Yahoo PlaceFinder | Geolocation | Web Service | 50,000 requests / day / API Key |

In order to test language detection tools, we also randomly selected 1,000 tweets from each of our four study regions. These messages were then manually coded by the study's authors for the primary language of the tweet and disagreements were resolved through discussion. The study's authors collectively have experience with Arabic, English, German, Japanese, Korean, Mandarin, Persian, Spanish and Thai. Where words from multiple languages were found in a single tweet, the tweet was coded for the most abundant, primary language of the tweet.

Findings

Language

When examining the manual coding of language in 4,000 sample tweets, overall intercoder agreement was high between human coders (as shown in Table 2). Agreement was calculated with Fleiss' kappa. The measure ranges from -1, complete disagreement, to 1, perfect agreement, and, compared with percent agreement, better accounts for agreement between coders that could occur by simple chance. Conflicts between human coders were generally due to multiple languages being used within a single tweet. Tweets containing auto-generated text (e.g. automatically generated messages from Foursquare) often contained this mix of languages. In comparison to the human coders, the agreement between different language detection algorithms can be seen to be much lower (Table 2).
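To make the agreement measure concrete, the following is a minimal Python sketch of the standard Fleiss' kappa calculation. It is an illustration only, not the authors' script: the function name and the input layout (one list of labels per tweet, one label per coder) are assumptions made here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must have the same number of raters (at least two).
    Returns NaN when chance agreement is 1 (only one category ever used).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Because the expected-agreement term grows when one category dominates, a set of tweets that is mostly in one language can show high percent agreement and still a modest kappa.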

Table 2. Human and algorithm agreement on language (Fleiss' Kappa) with and without UI Language

Research Site | Humans: Coders | Humans: Fleiss' Kappa | Machines: Without UI Language | Machines: With UI Language
Cairo | 3 | 0.814 | 0.525 | -
Montreal | 3 | 0.779 | 0.587 | 0.461
San Diego | 3 | 0.868 | 0.513 | 0.440
Tokyo | 3 | 0.724 | 0.485 | 0.329
Overall | 3 | 0.888 | 0.630 | 0.579*

* Arabic was not available as a UI language choice at the time of the study; so, Cairo is not included in the overall statistics.

If the agreed human coding is treated as a 'gold standard', the Compact Language Detection kit and Alchemy matched human classifications most closely, although all methods are within one standard deviation of each other (Table 3). Alchemy in general performed better than CLD with the significant exception of Tokyo. In the best case, Alchemy agreed with human coders on 91 percent of tweets in San Diego. However, even this only translates to an intercoder agreement with a Fleiss' kappa of 0.644, which better accounts for agreement that could occur by simple chance as explained above. The best overall score was achieved by the CLD, which agreed with the human classification in 76.4 percent of cases. This translates to an intercoder agreement with a Fleiss' kappa of 0.670. Yet, it remains that both of these scores are much lower than the overall intercoder agreement between the human coders, which had a Fleiss' kappa of 0.888. The relatively high percent agreement scores and lower kappa scores suggest that disagreement between the algorithms and human coders occurred more often on less frequently appearing languages.

Nevertheless, the CLD is more apt for large datasets as it is the only local, fully offline method here considered. The code for CLD is also open-source and could be adopted, although the training corpora used to create the language identification fingerprints are unknown and unavailable. Analysis shows the CLD performed particularly well in differentiating text in different Asian scripts. CLD and Alchemy did not do so well in Cairo where a number of Arabic-language tweets were written in the "Arabic chat alphabet" (i.e. Arabic using Latin characters). Whereas CLD nearly always classified text in Japanese, Korean, Chinese, and Arabic correctly when these languages were written in their usual scripts, it failed in all 89 cases to classify Arabic written with Latin characters correctly. Indeed, all of the language identification algorithms considered here failed to accurately classify these messages written in the Arabic chat alphabet.

It should also be pointed out that the user-interface language of Twitter users was a useful indicator of language in some research sites. It corresponded with the human coding of language for more than 75 percent of the tweets from Montreal, San Diego and Tokyo. At the time of data collection, there was not an option to set the Twitter user-interface language to Arabic, which likely explains why the user-interface language of users in Cairo only agrees with human coders for 45 percent of the tweets collected there.

Table 3. Algorithm agreement with human coders on language. Fleiss' Kappa/Scott's Pi with percent agreement in parentheses. The standard deviation on percent agreement is about 0.5 in all cases.

Research Site | Alchemy | Xerox | CLD | Twitter UI Lang
Cairo | 0.510 (69.6%) | 0.374 (55.9%) | 0.464 (61.7%) | -0.215 (44.9%)†
Montreal | 0.653 (83.1%) | 0.548 (73.9%) | 0.550 (74.7%) | 0.462 (75.6%)
San Diego | 0.644 (90.9%) | 0.501 (84.0%) | 0.487 (82.1%) | 0.565 (90.6%)
Tokyo | 0.017 (56.2%) | 0.029 (57.0%) | 0.418 (87.2%) | 0.278 (83.0%)
Overall | 0.609 (75.0%) | 0.534 (67.7%) | 0.670 (76.4%) | 0.714 (83.1%)*

† Arabic was not available as a UI language choice at the time of the study; so, this value should be interpreted with caution.
* Arabic was not available as a UI language choice at the time of the study; so, Cairo is not included in the overall statistics.

Overall, language identification of tweets is difficult for human and machine coders alike. One preprocessing step that could improve results is to remove auto-generated text and non-language specific text (e.g. emoticons). It will be important to train machine algorithms on informal scripts (e.g. Arabic chat alphabet) in addition to classical scripts. The suitability of off-the-shelf language identification packages and the appropriateness of the user-interface language setting vary by research site; so, the best algorithm will likely depend on the specific research questions and study location.
Geolocation

To better understand how self-reported profile locations might be used to map the geography of information in Twitter, the Yahoo and Google geolocation algorithms were applied to 1,000 randomly selected users from each research site. Ideally, this study would apply the geolocation algorithms to the entirety of the users in each site, but rate limitations (i.e. the number of allowed requests per minute) prevent this from completing in a reasonable time frame. We found 16 percent of the location fields in our sample of 1,000 users from each research site were blank, which is similar to the 18 percent of profiles that Hecht et al. (2011) found blank in a general (geocoded and non-geocoded) sample of Twitter, although it is much lower than the 28 percent of profiles that Cheng et al. (2010) found blank in their general sample of Twitter profiles.

Overall, Yahoo's and Google's geolocation algorithms perform similarly (Table 4). Excluding the blank profile locations, 94.5 percent of attempts using Yahoo's PlaceFinder and 86.2 percent of attempts using Google's Geocoder placed the user in some location. However, many of these locations were outside of the bounding boxes defining each research site. On average, only 53.7 percent of attempts with Yahoo and 54.5 percent of attempts with Google placed the user within the bounding box from which they originally tweeted. If blank profiles are included, these percentages drop to 45.0 percent for Yahoo and to 45.8 percent for Google, which would be closer to the likely upper bound on the actual percentage of users in a general sample from Twitter that could be placed correctly by geolocation algorithms alone.

Besides returning a geographic position, each geolocation service may report that it failed to geolocate the input. Comparing the values, it is clear that while Yahoo's algorithm geocodes more profile locations than Google's algorithm, many of these additional locations do not fall within the bounding boxes. Google's algorithm tends to fail to geocode a larger number of profile locations at all compared to Yahoo. However, of the locations Google's algorithm does geocode, more of these locations are within the bounding boxes of the research sites than Yahoo (Table 4). This is particularly apparent in the Tokyo research site where Google declines to determine a location for many more profile locations than Yahoo.

Table 4: Results of the geolocation of 1,000 randomly selected user profiles from each research site.

Research Site | Cairo | Montreal | San Diego | Tokyo | Overall
Blank | 201 | 138 | 133 | 170 | 642
Yahoo: In Box | 431 | 469 | 446 | 456 | 1802
Yahoo: Out of Box | 336 | 356 | 359 | 335 | 1386
Yahoo: Failed | 32 | 37 | 62 | 39 | 170
Google: In Box | 444 | 561 | 407 | 419 | 1831
Google: Out of Box | 295 | 224 | 348 | 197 | 1064
Google: Failed | 60 | 77 | 112 | 214 | 463
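The in-box percentages quoted for the two geocoders follow directly from the overall counts in Table 4. The short Python sketch below shows that arithmetic; the dictionary layout and function name are illustrative assumptions, while the counts are taken from the table.

```python
# Overall counts from Table 4 (4,000 profiles: 1,000 per research site).
TOTAL_PROFILES = 4000
BLANK_PROFILES = 642
OVERALL_COUNTS = {
    "Yahoo": {"in_box": 1802, "out_of_box": 1386, "failed": 170},
    "Google": {"in_box": 1831, "out_of_box": 1064, "failed": 463},
}


def placement_rates(service):
    """Return percentage rates for one geocoder, with and without blank profiles."""
    counts = OVERALL_COUNTS[service]
    non_blank = TOTAL_PROFILES - BLANK_PROFILES
    placed = counts["in_box"] + counts["out_of_box"]
    return {
        # Share of non-blank profiles given any location at all.
        "placed_excluding_blank": 100.0 * placed / non_blank,
        # Share of non-blank profiles placed inside the tweet's bounding box.
        "in_box_excluding_blank": 100.0 * counts["in_box"] / non_blank,
        # Same, but counting blank profiles as failures.
        "in_box_including_blank": 100.0 * counts["in_box"] / TOTAL_PROFILES,
    }
```

Rounded to one decimal place, the in-box rates reproduce the figures in the text (53.7 and 54.5 percent excluding blanks; 45.0 and 45.8 percent including them), and each service's three counts plus the blank profiles sum to 4,000.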

Profile locations outside of the relevant bounding box may be due to users tweeting from a different location than that written in their profiles or due to geocoder error. To resolve this ambiguity, the authors manually examined all user locations that failed to geolocate or that geolocated outside of the relevant bounding box with both Google and Yahoo (Table 5). The largest portion (35.6 percent) of these profile locations were legitimate geographical locations outside of the bounding boxes. This suggests that users do not update their profile locations with great frequency. The second largest portion (24.1 percent) was non-geographic text (e.g. "Neverland") or generic, non-specific locations (e.g. Earth, a peach orchard). After this, a large portion (21.2 percent) was more general geographic locations that included the relevant research site (e.g. Japan, California or Kantou [the eastern half of Japan including Tokyo]).

The analysis also suggested ways to improve geolocation accuracy. Of the 5.8 percent of locations that were actually within the bounding boxes, abbreviations of place names were the most common reason for the geolocator to fail. Beyond this, another 4.2 percent of profile locations had multiple locations, one of which was the relevant study area. Finally, 9.1 percent of the tweets actually had latitude and longitude coordinates in the profile location field along with additional text (usually the name of an app placing the information in the profile location). Google and Yahoo recognized latitude and longitude coordinates without additional text, but any additional text causes both geocoders to fail (or in a smaller number of cases geolocate to a location that didn't correspond with the coordinates in the profile). All three of these situations, abbreviations, multiple locations, and latitude/longitude coordinates, could likely be handled by preprocessing the data for these possibilities. This is especially applicable when targeting a single area, where a list of likely abbreviations might be more easily created.

Table 5: Human analysis of profiles failing to geolocate or geolocating outside of the relevant bounding box.

Outcome | Yahoo | Google
Profile location blank | 642 | 642
Geolocated within bounding area | 1802 | 1831
Geolocated outside bounding area | 1386 | 1064
- Geolocated within bounding box by other algorithm | 206 | 107
- Identified as within bounding area by human coder | 72 | 39
- Not within bounding box (Human coder) | 481 | 462
- More general (e.g. country, state, region) | 282 | 269
- Multiple locations including within bounding area | 56 | 39
- Latitude, Longitude pair | 21 | 10
- Invalid / Generic (e.g. la la land, earth) | 268 | 138
Failed to geolocate | 170 | 463
- Geolocated within bounding box by other algorithm | 5 | 75
- Identified as within bounding area by human coder | 4 | 37
- Not within bounding box (Human coder) | 1 | 18
- More general (e.g. country, state, region) | 4 | 16
- Multiple locations including within bounding area | 1 | 18
- Latitude, Longitude pair | 102 | 113
- Invalid / Generic (e.g. la la land, earth) | 53 | 186

One vital caveat to these results is that the researchers coded data in the most naive form available. For both Yahoo's PlaceFinder and Google's Geocoder, additional options exist that may increase the accuracy of results. Yahoo's PlaceFinder returns multiple locations ordered by relevance for a given string (multiple locations can be, and are, returned routinely, such as Oxford, UK, and Oxford, Mississippi, when using a string of "Oxford"). This relevance, or confidence score, is between zero and one hundred and is shown with every returned result. This confidence score could be used to reject all results when every result is below a certain threshold. This would increase the number of locations that fail to geocode at all, but would likely raise the

accuracy of profile locations that do geolocate. Furthermore, Google's API allows researchers to set a location hint in order to better capture data (potentially) originating from the region of focus.

Given the significant difficulties associated with geolocating profile locations, timezones have also been seen as a more reliable metric for approximating location (Krishnamurthy, Gill, and Arlitt 2008). Our dataset, however, suggests that many users have incorrectly configured timezones in their profiles (Table 6). Users select their timezone on the Twitter site from a predefined list. Several options have the same UTC offset (e.g. for UTC+9 a user can select "Tokyo" or "Seoul") although the daylight saving or summer time rules may differ. It is likely that some users are traveling or have purposefully set different timezones from the locations in which they are using the service; however, the fact that only 57 percent of users tweeting in Montreal had set an east-coast timezone (much less the specific option of "Eastern Time (US & Canada)") indicates many users likely do not set their timezone correctly. In addition, many users tweeting from Montreal had other UTC-5 timezones selected (231 users in our sample of 1,000 users had set their timezone to Quito, for example), suggesting that caution is needed in interpreting the timezone more specifically than the UTC offset. Across all research sites 69.2 percent of users had selected a timezone with a UTC offset that corresponded to the device location information in the tweet (Table 6).

The low number of users correctly setting their timezone may be influenced partially by the large number of 3rd party client devices used to tweet. Overall, only 23.6 percent of tweets captured across all our research sites were published via the Twitter website with the remainder sent using 3rd party applications.
Geocoded tweets may be more likely to be sent from 3rd party applications than non-geocoded tweets, however; so, the percentage of tweets sent from 3rd party applications in our sample of only geocoded tweets is likely higher than the percentage would be in a general sample of tweets including both geocoded and non-geocoded tweets.

Ultimately, our findings related to location point to the significant challenges associated with automatically identifying geographic references in unstructured text. It is important for researchers to be aware of these difficulties if they want to move beyond the limitations of only relying on the unrepresentative amount of information tagged with device locations.

Table 6. Timezone information has been seen as another proxy for location; however, this information is not routinely provided by all users.

                 All users captured                    Geotagging user sample
Research Site    N        City-specific  Correct UTC   N       City-specific  Correct UTC
                          timezone       offset                timezone       offset
Cairo            1,952    54.6%          55.4%         1,000   55.5%          56.3%
Montreal         5,235    41.7%          57.0%         1,000   40.1%          57.3%
San Diego        9,292    57.1%          60.5%         1,000   58.6%          63.2%
Tokyo            55,573   68.2%          72.3%         1,000   70.1%          73.8%
Overall          72,052   64.4%          69.2%         4,000   56.1%          62.7%
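As an illustration of the offset comparison summarized in Table 6, the following Python sketch checks whether a profile timezone shares the UTC offset of the research site a tweet was sent from. It is not the code used for this paper. The timezone labels cover only a handful of options rather than Twitter's full predefined list, the offsets are standard-time values with daylight saving ignored, and the function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Standard-time UTC offsets (hours) for a few timezone labels.
# Assumption: illustrative subset only; daylight saving time is ignored.
TIMEZONE_OFFSETS = {
    "Tokyo": 9,
    "Seoul": 9,
    "Cairo": 2,
    "Eastern Time (US & Canada)": -5,
    "Quito": -5,
    "Pacific Time (US & Canada)": -8,
}

# Standard-time UTC offset of each research site.
SITE_OFFSETS = {"Cairo": 2, "Montreal": -5, "San Diego": -8, "Tokyo": 9}


def offset_matches(profile_timezone: Optional[str], site: str) -> bool:
    """Return True when the profile timezone has the same UTC offset as the site."""
    if profile_timezone is None or profile_timezone not in TIMEZONE_OFFSETS:
        return False  # missing or unknown timezone counts as not matching
    return TIMEZONE_OFFSETS[profile_timezone] == SITE_OFFSETS[site]


def share_with_correct_offset(users: Iterable[Tuple[Optional[str], str]]) -> float:
    """Fraction of (profile_timezone, site) pairs whose UTC offsets agree."""
    pairs = list(users)
    if not pairs:
        return 0.0
    return sum(offset_matches(tz, site) for tz, site in pairs) / len(pairs)
```

Under this rule a Montreal user with "Quito" selected counts as having a correct offset even though the city-specific timezone is wrong, which is the distinction between the two percentage columns in Table 6.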


Discussion and Conclusions

Over three hundred million users publish hundreds of millions of short messages every day on Twitter. As a result, this content has been used by researchers from fields as diverse as epidemiology, politics, marketing and geography to better understand, map and measure large-scale social, economic and political trends and patterns. However, much of this analysis is carried out with only limited understandings of how best to work with the spatial and linguistic contexts in which the information was produced. As such, it has been necessary to study the reliability of key methods used to determine language and location of content in Twitter.

This paper found that there are significant challenges to accurately determining the language of tweets in an automated manner.[3] None of the language identification methods tested in this paper is able to match the accuracy of human coding by multiple coders. The informal writing style, short length of tweets, use of multiple languages within a single tweet, and the presence of non-language specific content such as Uniform Resource Locators (URLs) and emoticons complicate the identification of language and limit accuracy.

The utility of the user-interface language setting varies across regions and languages. As of January 2013, Twitter has thirty-three user-interface languages available, which covers many major languages, but misses many key African, middle-eastern, Indian and Asian languages (e.g. Afrikaans, Zulu, Bengali, Marathi, Persian, Vietnamese and Javanese). The importance of these omissions will depend on the design of the study. It is important to note, however, that even when a user-interface language is present users writing primarily in that language may still not use that setting. This may be the case when a new language setting is introduced and user adoption lags or for font/device compatibility concerns (Warschauer, et al. 2002) among other reasons.
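The non-language content noted above (URLs, emoticons and similar tokens) can be stripped temporarily before a tweet is passed to a language identifier. The Python sketch below shows one way to do this and was not part of the analysis in this paper. The regular expressions are assumptions that cover only common URL forms, @-mentions and a small set of Western-style emoticons, and the function names are hypothetical. The unaltered tweet is kept alongside the cleaned text so that nothing is lost for later analysis.

```python
import re

# Assumption: simple patterns for illustration; real tweets need broader coverage.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
# Western-style emoticons such as :) ;-) :D :-( =P
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o*']?[)\](\[dDpP/\\|]+(?!\w)")
WHITESPACE_RE = re.compile(r"\s+")


def strip_non_language_tokens(tweet: str) -> str:
    """Remove URLs, @-mentions and emoticons, then collapse whitespace."""
    text = URL_RE.sub(" ", tweet)       # URLs first, so ':/' in links is not read as an emoticon
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def prepare_for_language_id(tweet: str) -> dict:
    """Pair the unaltered tweet with the cleaned text used for language detection."""
    return {"original": tweet, "cleaned": strip_non_language_tokens(tweet)}
```

Only the cleaned string would be handed to a package such as CLD or Alchemy; hashtags are left in place here because they often carry words in the tweet's language.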
The user-interface language setting also does not capture multilingual users who write in multiple languages on the platform (Eleta and Golbeck 2012).

Nevertheless, the CLD kit and Alchemy show useful promise as automated language identification packages. The former in particular has a great amount of flexibility as it can both be run offline and be modified as it is open-source. The best language identification algorithm for a particular study will depend on a number of factors. One important factor is the languages an algorithm is trained to identify and the scripts of these languages it is trained with (e.g. Arabic in Latin characters). CLD performed much better than Alchemy in Japan, but Alchemy performed better in our other research sites. Yet, neither recognized the Arabic chat alphabet: a key omission that is likely to be mirrored in other informal and transliterated alphabets. Cases such as San Diego, however, show relative success in language detection. We may then conclude that on some level, the context of the study matters when considering algorithmic approaches to language identification. Always running multiple language detection algorithms and reviewing subsets of the results with human coders, as in the work with Twitter of Carter et al. (2011) and the work with blogs of Hale (2012a), may give insight into the biases and reliability of different automated approaches and flag up potential issues on a specific sample of tweets.

One method to potentially increase the accuracy of off-the-shelf language identification packages is to pre-process the tweets to temporarily remove emoticons, URLs and other non-language specific text (something not attempted in this paper). Text automatically generated by third party services (e.g. Foursquare) often resulted in a mix of languages within a single tweet; so, identifying and temporarily removing this text could likely also increase the accuracy of off-the-shelf language identification packages. While removing such text temporarily may improve language identification, the text itself may nonetheless be useful for further analysis (e.g. studies of link diffusion across languages, e.g. Hale 2012b), and so researchers may wish to retain a copy of the unaltered tweet. Studying the effects of these pre-processing measures and the effects of using link content or grouping several tweets by a single user together for language identification as in Carter et al. (2011) will be a useful avenue for further research.

The linguistic and geographic analysis of short, micro-blog texts is still an area of active research without any established best practices. Further studies to compare various methods and new approaches (such as crowdsourcing with Amazon's Mechanical Turk) are needed in order to identify concerns and possible future areas of improvement.

[3] It is our hope that future work will also consider the possibilities of using crowdsourced labor to accurately spatially reference tweets.

The paper also compared open-ended profile locations within four research sites in order to better understand how useful profile locations might be for studying the geography of information. Importantly, it finds that the geolocation results of profile locations are not a useful proxy for device locations (i.e. the place in which the information was disseminated), and identifies several reasons for this discord. This is an important finding not only for the social science analysis of where users are or perceive themselves to be, but also for computer science research which often uses geocoded tweets to evaluate the performance of new location classification approaches. For instance, Sadilek et al.
(2012) demonstrate that when the location of a subset of users is known, it is then possible to infer the location of the friends of these users. The present work suggests that there will be an important difference in where users are placed depending on whether profile location or device location is used to create the starting set of users with a known location. The subset of users that geocode and the subset of users with clear place names in their profiles are distinct, and importantly, this paper has found that even when both the profile and device location are valid, they do not always correspond. Similarly, there is a danger in relying on the device location as the baseline, true location of the user in training new geolocation algorithms based on text content (a practice used in, for example, Eisenstein et al. 2010; Wing and Baldridge 2011; Mahmud 2012).

The paper identifies three main reasons for the lack of correlation between profile and device location. First, consistent with previous work (e.g. Hecht et al. 2011), a large number of profiles contained invalid, non-geographic text or simply larger geographic regions (countries, states). Secondly, adding to the literature, this paper finds that a large number of users tweeting within the study areas had profile locations set to locations outside of the study area; this likely resulted from users commuting, traveling or simply not having updated their profile locations. Finally, several users were within the relevant study site but wrote their profile location information in such a way that the geolocation algorithms used failed to correctly code it. In addition to the recommendations by Hecht et al. (2011) to preprocess profile location information for fictitious names, this paper finds it is important to preprocess profile locations to handle abbreviations, lists of multiple locations, and latitude-longitude coordinates surrounded by other text.
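The three preprocessing cases named above could be handled along the following lines. This Python sketch is illustrative and was not tested in this paper. The abbreviation table is a hypothetical example for one set of study areas, the separator list is an assumption, and all function names are invented for the example.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical abbreviation list; a real list would be compiled per study area.
ABBREVIATIONS = {"sd": "San Diego", "mtl": "Montreal", "nyc": "New York City"}

# A decimal "lat,lon" pair embedded anywhere in the string.
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,2}(?:\.\d+)?)\s*,\s*(-?\d{1,3}(?:\.\d+)?)")
# Separators between listed locations. Commas are deliberately not used,
# because strings such as "Oxford, UK" name a single place.
SPLIT_RE = re.compile(r"\s*(?:/|\||&|;|\band\b)\s*", re.IGNORECASE)


def extract_coordinates(text: str) -> Optional[Tuple[float, float]]:
    """Return the first plausible (latitude, longitude) pair found in free text."""
    for match in COORD_RE.finditer(text):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return lat, lon
    return None


def candidate_places(text: str) -> List[str]:
    """Split a multi-location string and expand known abbreviations."""
    places = []
    for part in SPLIT_RE.split(text):
        part = part.strip(" .,")
        if part:
            places.append(ABBREVIATIONS.get(part.lower(), part))
    return places


def preprocess_profile_location(text: str) -> dict:
    """Prefer embedded coordinates; otherwise return place strings to geocode."""
    coords = extract_coordinates(text)
    if coords is not None:
        return {"coordinates": coords, "places": []}
    return {"coordinates": None, "places": candidate_places(text)}
```

Each string in `places` would then be sent to the geocoder separately, so a profile such as "SD / Tokyo" yields two candidate locations rather than one failed lookup.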
These steps should be investigated along with tweaking the available parameters to geolocation services. Several profiles had more general geographic boundaries (regions, states, countries), suggesting that the success at being able to place users within a geographic region will vary with the specificity of the region. Attempts to simply locate users within a country are more likely to be successful than trying to locate users to a specific city or metropolitan area. For city-level areas, local gazetteers might be useful (an approach not tested here), but the analysis in this paper highlights the importance of supplementing such a list with common abbreviations, misspellings, other-language names, and transliterations of place names.

Timezone information, specifically the Coordinated Universal Time offset (UTC-offset), while not perfect and showing differences in accuracy across study sites, seems to often correspond with the user's current location. UTC-offsets also have the value of being more easily processed than the free-form profile locations; however, UTC-offsets only give an indication of longitude and not latitude.

Because of the significant challenges associated with geolocating content and profiles in Twitter, it is tempting to associate certain languages with an assumed geographic origin of content. However, this paper demonstrates the large need for caution in using language as a proxy for location. Within each of the four research sites considered in this paper, a mix of languages was found, suggesting that focusing on language as a proxy for location can lead to two issues. First, such a strategy would miss other language users located within the location, and second, it would likely capture users outside of the location of interest. Future work should look at the dispersion of various languages to determine to what extent language use clusters within certain geographic areas.

Although this paper highlights the challenges associated with accurately understanding the geography of information in Twitter, this should not lead us to discount the usefulness of profile locations as a means of geolocating content.
Profile locations tell us much about how users perceive, present, and place themselves, and this paper has expanded on two methods that can be used to geolocate that unstructured information. Most importantly, the majority of the 300 million accounts on Twitter contain some type of profile location, whereas only a small proportion of tweets contain any structured device location. As such, further research and additional human coding of profile locations might be needed in order to accurately determine how well profile locations compare with device locations, how we might best geolocate profile locations, and the ways in which the geolocation of profile information might be linguistically or geographically contingent.

Authors

MARK GRAHAM is the Director of Research and a Research Fellow at the Oxford Internet Institute. He is also a Visiting Research Associate at the University of Oxford's School of Geography and the Environment. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

SCOTT A. HALE is a research assistant and doctoral candidate at the Oxford Internet Institute, University of Oxford, interested in language separation online and the effects of platform design upon the transmission of information between speakers of different languages.

DEVIN GAFFNEY is senior developer at Little Bird in Portland, Oregon. He was previously affiliated with the Oxford Internet Institute as an MSc candidate and a Fell Fund grant-backed research assistant under Mark Graham. His research interests primarily focus on quantitative analyses and methodologies of social media data.

References

Bruns, A. and Burgess, J. E. 2011. #Ausvotes: How Twitter covered the 2010 Australian federal election. Communication, Politics and Culture 44: 37-56. http://eprints.qut.edu.au/47816/.

Carter, S., Tsagkias, M., and Weerkamp, W. 2011. Semi-supervised priors for microblog language identification. In Dutch-Belgian Information Retrieval Workshop (DIR 2011). http://wouter.weerkamp.com/downloads/dir2011-lid.pdf.

Cheng, Z., Caverlee, J., and Lee, K. 2010. You are where you tweet: A content-based approach to geo-locating Twitter users. In CIKM '10: 19th ACM International Conference on Information and Knowledge Management.

Eisenstein, J., O'Connor, B., Smith, N. A., and Xing, E. P. 2010. A latent variable model for geographic lexical variation. In EMNLP '10: 2010 Conference on Empirical Methods in Natural Language Processing, 1277-1287.

Eleta, I., and Golbeck, J. 2012. Bridging languages in social networks: How multilingual users of Twitter connect language communities. Proceedings of the American Society for Information Science and Technology 49(1): 1-4. doi:10.1002/meet.14504901327.

Fischer, E. 2011. Language communities of Twitter. http://www.flickr.com/photos/walkingsf/6277163176/in/photostream.

Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5): 378-382.

Gaffney, D. 2010. #iranElection: Quantifying online activism. In Proceedings of Web Science 10: Extending the Frontiers of Society On-Line. Raleigh, NC, USA: Web Science Trust. http://journal.webscience.org/295/.

Graham, M. and Zook, M. 2011. Visualizing global cyberscapes: Mapping user-generated placemarks. Journal of Urban Technology 18: 115-132.

Gruzd, A., Wellman, B., and Takhteyev, Y. 2011. Imagining Twitter as an imagined community. American Behavioral Scientist 55: 1294-1318.

Gwet, K. L. 2010. Handbook of inter-rater reliability. 2nd edition. Gaithersburg: Advanced Analytics.

Hale, S. A. 2012a. Net Increase? Cross-lingual linking in the blogosphere. Journal of Computer-Mediated Communication 17(2): 135-151.

Hale, S. A. 2012b. Impact of platform design on cross-language information exchange. In Proceedings of the 30th International Conference on Human Factors in Computing Systems, CHI '12. ACM.

Halley, E. 1731. A proposal of a method for finding the longitude at sea within a degree, or twenty leagues. Philosophical Transactions (1683-1775) 37 (January 1, 1731): 185-195.

Hecht, B., Hong, L., Suh, B., and Chi, E. 2011. Tweets from Justin Bieber's heart: The dynamics of the location field in user profiles. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, 237-246. New York, NY, USA: ACM.

Honeycutt, C. and Herring, S. C. 2009. Beyond Microblogging: Conversation and collaboration via Twitter. In System Sciences, 2009. HICSS '09. 42nd Hawaii International Conference on System Sciences (HICSS-42), 1-10.

Hong, L., Convertino, G., and Chi, E. 2011. Language matters in Twitter: A large scale study. In International AAAI Conference on Weblogs and Social Media, 518-521.

Krishnamurthy, B., Gill, P., and Arlitt, M. 2008. A few chirps about Twitter. In Proceedings of the First Workshop on Online Social Networks, 19-24. New York, NY, USA: ACM.

Lotan, G., Graeff, E., Ananny, M., Gaffney, D., Pearce, I., and Boyd, D. 2011. The revolutions were tweeted: Information flows during the 2011 Tunisian and Egyptian revolutions. International Journal of Communication 5: 1375-1405.

Mahmud, J. 2012. Where is this tweet from? Inferring home locations of Twitter users. In ICWSM '12: Sixth International AAAI Conference on Weblogs and Social Media.

Palen, L., Vieweg, S., and Anderson, K. M. 2011. Supporting 'everyday analysts' in time- and safety-critical situations. The Information Society Journal 27(1): 52-62.

Romero, D. M., Meeder, B., and Kleinberg, J. 2011. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter. In Proceedings of the 20th International Conference on World Wide Web, 695-704. New York, NY, USA: ACM.

Sadilek, A., Kautz, H., and Bigham, J. P. 2012. Finding your friends and following them to where you are. In WSDM '12: Fifth ACM International Conference on Web Search and Data Mining, 723-732.

Semiocast. 2010. Half of messages on Twitter are not in English: Japanese is the second most used language. Semiocast Press Release, Paris, France.

Shelton, T., Zook, M., and Graham, M. 2013. The technology of religion: Mapping religious cyberscapes. The Professional Geographer 65. doi:10.1080/00330124.2011.614571.

Takhteyev, Y., Gruzd, A., and Wellman, B. 2011. Geography of Twitter networks. Social Networks: 1-26. doi:10.1016/j.socnet.2011.05.006.

Vieweg, S., Hughes, A. L., Starbird, K., and Palen, L. 2010. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. Proceedings of the 28th International Conference on Human Factors in Computing Systems (pp. 1079-1088). New York, NY, USA: ACM.

Warschauer, M., Said, G. R. E., & Zohry, A. 2002. Language choice online: Globalization and identity in Egypt. Journal of Computer-Mediated Communication 7(4). Retrieved from http://jcmc.indiana.edu/vol7/issue4/warschauer.html.

Wing, B. P., and Baldridge, J. 2011. Simple supervised document geolocation with geodesic grids. In ACL '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 955-964. Portland, OR.

Zook, M., Graham, M., Shelton, T., and Gorman, S. 2010. Volunteered geographic information and crowdsourcing disaster relief: A case study of the Haitian earthquake. World Medical & Health Policy 2 (Jul): 7-33.
