
XVIII CONGRESO DE LA SOCIEDAD ESPAÑOLA PARA EL PROCESAMIENTO DEL LENGUAJE NATURAL 81

achieved important advances in the automatic detection of hyponyms and hypernyms through statistical methods (above all those that support a machine learning process), these proposals seem not to take into account the claims of semantic theories (whether of a logical-formal or a functional-cognitive orientation) that could explain the linguistic nature underlying lexical relations.

Rephrasing an idea of Manning and Schütze (1999), it could be said that the experiments carried out to identify such relations place a high value on the quantity of hyponyms and hypernyms that can be generated, depending on the probabilistic method employed, rather than on asking whether the results show how semantic compositionality operates to generate hyponymy/hypernymy relations. Resolving this question is not trivial: for Pustejovsky (1995), Partee (1995), Jackendoff (2010) and Pinker (2011), the analysis of the compositionality phenomenon underlying words and phrases¹ is essential to understanding how the semantics of any natural language works, so that a better grasp of it can have a positive impact on the results obtained in the extraction of lexical relations.

In this proposal we analyze a phenomenon of semantic compositionality manifested in hyponymy/hypernymy relations: the selection of relational adjectives that are conceptually complex and linked to the expression of more specific concepts, since they project a set of properties onto the hypernym, as opposed to qualifying adjectives, which provide an assessment of little or no relevance to the domain in question.

Our article is organized as follows: in section 2 we describe the stages and methods commonly applied in the extraction of hyponyms and hypernyms. In section 3 we detail the problems raised by the mere application of probabilistic methods to evaluate the degree of Precision of the extraction, without taking into account the conceptual particularities projected in specialized knowledge domains. In section 4 we present a perspective on the classification of inclusion relations relevant to the exploration carried out in this work. In section 5 we establish our objects of study: the distinction between qualifying and relational adjectives, the principle of semantic compositionality, and the relation between hypernyms and their lexical fields. In section 6 we present our heuristics for generating a set of qualifying adjectives and thereby filtering out non-relevant hyponyms. In section 7 we describe our experiment and show our results, and finally in section 8 we offer a discussion along with some preliminary comments on the results obtained.

2 Stages and methods considered in the extraction of hyponyms and hypernyms

As we have noted above, a good part of the progress achieved in hyponym and hypernym extraction tasks is the result of applying hybrid methods. In summary, and following the accounts given by Girju, Badulescu and Moldovan (2006), Ritter, Soderland and Etzioni (2009), and Ortega et al. (2011), all of these consider at least some of the following stages:

 Selection of a set of seed instances that characterize the relation of interest.
 From this set, a collection of lexico-syntactic patterns projecting hyponyms and hypernyms is inferred, e.g.: el <hipónimo> es un <hiperónimo> ('the <hyponym> is a <hypernym>'), etc.
 Using this set of learned patterns, new instances of the relation are obtained by exploring a text corpus (or, where applicable, the Web). This process is repeated until no new instances can be generated.
 An assessment is made of the confidence level shown by the candidate hyponyms and hypernyms obtained, as well as by the inferred patterns. For this purpose, the tools most widely used have been word-association measures

¹ In this paper we use the term phrase in a sense close to sintagma, the term regularly used in linguistic studies produced in Spain (e.g., Bosque and Demonte, 1999). By contrast, in Latin America, including Mexico, the concept frase has seen wider use, applied both in formal syntax studies and in computational linguistics work (e.g., Pineda and Meza, 2003; Galicia and Gelbukh, 2007).

Copyright © 2012. Universitat Jaume I. Servei de Comunicació i Publicacions. All rights reserved.

<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de
Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Created from bibliotecauptsp on 2019-09-28 09:35:08.
such as PMI (Church and Hanks, 1990; Hearst, 1992; Pantel and Pennacchiotti, 2006), entropy measurements between word pairs (Ryu and Choi, 2005), and vector computations to measure the conceptual distance between words (Ritter, Soderland and Etzioni, 2009).
 As a supporting mechanism to corroborate whether the candidate hyponyms and hypernyms hold a canonical relation, authors such as Hearst (1992) and Ritter, Soderland and Etzioni (2009) use the lexical database WordNet (Fellbaum, 1998) as a reference source.
 Once the information content that a pair of words shares as hyponym and hypernym has been corroborated, an evaluation follows to determine the degree of Precision and Recall (Van Rijsbergen, 1979) achieved with the method employed, making adjustments with an F-measure where required (Ortega, Villaseñor and Montes, 2007; Ortega et al., 2011).

3 Problems in selecting hyponyms and hypernyms pertinent to a knowledge domain

Considering the extraction methods just mentioned, a characteristic feature of all of them is that they prioritize the use of association measures to assess the degree of conceptual closeness or distance between a pair of words entering a possible hyponymy/hypernymy relation, together with an evaluation that allows determining how precise the detected hyponyms and hypernyms are, and how many more could be recovered through adjustments that improve the performance of the system designed for this task.

Although this priority responds to practical needs (namely, offering a list of reliable candidate hyponyms and hypernyms, within a reasonable processing time, from which regular patterns of constitution can be deduced), there are problems that cannot be solved effectively by relying solely on probabilistic criteria; they call instead for a more detailed analysis of the role that concrete phenomena play when choosing hyponyms and hypernyms that are relevant to a knowledge domain.

On the level of linguistic analysis, most of these methods have focused on finding new instances of hyponyms and hypernyms from a set of seed instances recognizable in sentence contexts (Hearst, 1992; Pantel and Pennacchiotti, 2006; Ritter, Soderland and Etzioni, 2009; Ortega, Villaseñor and Montes, 2007; Ortega et al., 2011). However, the potential for hyponymy relations that a hypernym can generate in its function as the head of a noun phrase has not yet been considered. In line with Croft and Cruse (2004), we believe that a single-word hypernym plus a semantic feature can generate relevant hyponyms that account for the structure of a knowledge domain, and likewise reflect classification perspectives of a hypernym.

Following this idea, in this work we focus on noun + adjective phrases, bearing in mind the semantic function that adjectives fulfill as units that express and prioritize conceptual features, whose selection can be conditioned by the knowledge domain in which they appear, as is the case of medical terminology.

We therefore consider that, if the observation made by Croft and Cruse is not taken into account, the mere use of information measures to recognize word pairs as hyponym candidates poses important difficulties, above all when determining whether such hyponyms are conceptually relevant or not for the specialists of a given knowledge domain.

4 Inclusion relations between hyponyms and hypernyms

For Croft and Cruse (2004), hyponymy-hypernymy relations are inclusion relations, in particular of two types: simple hyponymy and taxonymy. Simple hyponymy can be represented linguistically as X is a Y. Taxonymy, in turn, can be exemplified by the linguistic construction X is a type/class of Y. The latter relation is more discriminating than simple hyponymy and generally gives rise to a taxonomic relation. In addition, Croft and Cruse point out that in many cases
where a good hyponym is not a good taxonym of a hypernym, there is a direct definition of the hyponym in terms of the hypernym plus a simple semantic feature, e.g.:

Semental = Caballo macho ('stud' = 'male horse')

Despite this, it is not possible to explain why in some cases certain hyponyms with a simple semantic feature could indeed represent a good taxonomy, and in other cases not. An example of this is:

Compositionality C(cuchara, w) = (de té, de café, de sopa…)

The above taxonomy emphasizes the function of the object cuchara ('spoon': teaspoon, coffee spoon, soup spoon), and so may be relevant for some purposes. On the other hand, in the following case:

Compositionality C(cuchara, w) = (redonda, profunda, grande…)

we have features ('round, deep, large') that are conceptually simple and of little or no use for building a classification of the hypernym. A question that arises here is: could conceptually simple features be indicative of non-relevant hyponyms? If the answer is affirmative, then it is necessary to discern whether a given relation exhibits conceptually simple or complex features, so as to help locate hyponyms that express general assessments, versus those that configure the hierarchical conceptual network underlying a specialized domain.

5 Relations between relational adjectives and nouns

According to Demonte (1999), the adjective, besides being a grammatical category that modifies the noun, is also a word class with very precise formal characteristics, as well as a semantic category, since there are meanings that are best expressed by means of adjectives.

From a terminological standpoint, Saurí (1997) notes that adjectives are important in the construction of terms, since conceptually they can insert semantic features that establish clear boundaries between entities or events belonging to a knowledge domain (e.g., inflamación intestinal 'intestinal inflammation' versus inflamación gastrointestinal 'gastrointestinal inflammation'), which helps particularize their meaning in contrast with concepts of the general language (e.g., enfermedades gástricas 'gastric diseases' versus enfermedades del estómago 'diseases of the stomach').

Returning to what Demonte (1999) proposes, there are two classes of adjectives that assign properties to nouns: qualifying adjectives and relational adjectives. The difference between the two lies in the number of properties each one carries, as well as in the way they are linked to the noun. On the one hand, qualifying adjectives refer to a constitutive feature of the modified noun, exhibiting or characterizing a single physical property, such as color, shape or character: libro azul 'blue book', señora delgada 'slim lady', hombre simpático 'friendly man', and the like.

On the other hand, relational adjectives refer to a set of properties or characteristics that can be linked to a concrete entity or event, e.g.: puerto marítimo 'seaport', vaca lechera 'dairy cow', paseo campestre 'country walk', etc.

Given the above, our proposal consists in focusing attention on relational adjectives and, in order to tell them apart from qualifying ones, we rely on the observations made by Demonte to differentiate them.

5.1 Semantic compositionality

The alternation between relational and qualifying adjectives can be explained in terms of semantic compositionality. For our purposes, we understand semantic compositionality as a principle that regulates the assignment of specific meanings to each of the lexical units composing a phrase structure, depending on the syntactic configuration that such a structure assumes (Partee, 1995). In this way, the combinations the lexical units enter into determine the global meaning of a phrase or sentence, generating not only isolated lexical units but also blocks that refer to specific concepts (Jackendoff, 2002).

Following Pustejovsky (1995), as well as Croft and Cruse (2004), we can consider that these blocks refer to specific concepts, since their selection of meaning features (or qualia structures) is directly influenced by the knowledge domain in which they are immersed.

Thus, a term such as inflamación gastrointestinal operates as a hyponym of the taxonym type, richer in specific information than a simple hypernym such as inflamación. The configuration of
this specific meaning is due to a process of semantic compositionality, introduced to establish differences between related concepts.

5.2 Hypernyms and their lexical fields

The hypernym, given its status as a generic category, can stand in direct relation with more than one modifier reflecting specific concepts or categories (e.g., enfermedad cardiovascular 'cardiovascular disease'), or simply context-sensitive assessments (e.g., enfermedad rara 'rare disease'). Thus, for the hypernym enfermedad we find a set of 132 relations, of which 76 can be considered relevant (58%). If we consider an association measure such as the normalized pointwise mutual information (PMI) proposed by Bouma (2009), traditionally used in collocation extraction, the 10 most relevant relations are those shown in table 1:

Table 1. Adjectives with the highest PMI

C(enfermedad, wi)  | PMI
Transmisible       | 0.59
Prevenible         | 0.52
Diarreica          | 0.45
Diverticular       | 0.44
Indicadora         | 0.41
Autoinmunitaria    | 0.39
Aterosclerótica    | 0.39
Meningocócica      | 0.39
Cardiovascular     | 0.38
Pulmonar           | 0.37

As can be seen in table 1, there are two non-relevant adjectives among the first 10 relations. Looking at the data in table 2, it becomes clearer how many non-relevant adjectives are related to enfermedad: 40% of these first 50 adjectives:

Table 2. First 50 adjectives with the highest PMI

C(enfermedad, wi): transmisible, prevenible, diarreica, diverticular, indicadora, autoinmunitaria, aterosclerótica, meningocócica, cardiovascular, pulmonar, afecto, febril, agravante, hepática, seudogripal, periodontal, sujeto, bacteriano, emergente, benigno, parasitaria, postrombotica, bacteriémica, coexistente, catastrófica, exclusiva, vectorial, supurativa, infecciosa, debilitante, digestiva, invasora, rara, inflamatoria, esporádica, antimembrana, predisponente, ulcerosa, contagiosa, cardiaca, sistémica, activa, grave, prexistente, miocárdica, somática, fulminante, atribuible, linfoproliferativa.

On the other hand, if we consider a relational adjective from table 2, for example cardiovascular, we find that it likewise modifies a set of nouns, as shown in table 3:

Table 3. Nouns modified by the relational adjective cardiovascular

C(wi, cardiovascular): efecto, problema, función, evento, relación, examen, inestabilidad, trastorno, enfermedad, bypass, causa, beneficio, sistema, reparador, descompensación, cirugía, operación, mortalidad, aparato, educación, síntoma, eficiencia, episodio, riesgo, investigación, manifestación, afección, medicamento, director, muerte, salud

C(wi, rara): congreso, televisión, enfermedad, complicación, infancia, niño, color, obesidad, mhc, nucleótido, sustancia, mutación, trastorno, grupo, meconio, epistaxis, derecha, síndrome, cáncer, alelo, forma, caso, párpado

Hence, both the hypernym and the adjective, whether relational or qualifying, can be linked to other elements, a situation that shows how the principle of compositionality operates here, detracting from the Precision of association measures in detecting useful relations.

6 Linguistic heuristics for filtering relevant hyponyms

In order to obtain a stop list of qualifying adjectives from the same input source, we apply Demonte's (1999) criteria for distinguishing qualifying from relational adjectives, together with a word-order criterion indicating the precedence of adjectives with respect to nouns. Broadly, our heuristics consider:

1. The possibility that an adjective is used predicatively: El método es importante ('The method is important'). To obtain these constructions we consider the following regular expression: <VSFIN><ADJ>.

2. Whether an adjective takes part in comparisons, so that its meaning is modified by degree adverbs: relativamente rápido ('relatively fast'). To obtain these constructions we consider the expression: <ADV><ADJ>.

3. The precedence of the adjective with respect to the noun: Una grave enfermedad ('A serious illness'): <ART><ADJ><NC>.
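These heuristics can be read as simple matches over POS-tag sequences. The following is a minimal sketch of that filtering step, not the authors' code: the tagged sentences, tag spellings (TreeTagger-style, e.g. VSfin for a finite form of ser) and function name are invented for illustration.

```python
# Toy POS-tagged sentences as (word, tag) pairs; invented examples
# echoing the paper's own: "el método es importante",
# "relativamente rápido", "una grave enfermedad",
# "enfermedad cardiovascular".
TAGGED = [
    [("el", "ART"), ("método", "NC"), ("es", "VSfin"), ("importante", "ADJ")],
    [("relativamente", "ADV"), ("rápido", "ADJ")],
    [("una", "ART"), ("grave", "ADJ"), ("enfermedad", "NC")],
    [("enfermedad", "NC"), ("cardiovascular", "ADJ")],
]

# The three heuristic patterns from section 6, as tag sequences.
PATTERNS = [
    ("VSfin", "ADJ"),      # 1. predicative use
    ("ADV", "ADJ"),        # 2. degree modification
    ("ART", "ADJ", "NC"),  # 3. prenominal position
]

def stoplist_adjectives(sentences, patterns):
    """Collect every adjective matched by any pattern: these are the
    candidate qualifying adjectives to put on the stop list."""
    found = set()
    for sent in sentences:
        tags = [t for _, t in sent]
        for pat in patterns:
            n = len(pat)
            for i in range(len(tags) - n + 1):
                if tuple(tags[i:i + n]) == pat:
                    # keep the word tagged ADJ inside the match
                    for w, t in sent[i:i + n]:
                        if t == "ADJ":
                            found.add(w)
    return found

print(sorted(stoplist_adjectives(TAGGED, PATTERNS)))
# → ['grave', 'importante', 'rápido']
```

Adjectives never caught by any of the three patterns (here cardiovascular, which only appears postnominally) survive the filter and remain candidates for relational, domain-relevant modifiers.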
7 Methodology

The methodology we followed to develop our experiment is detailed below.

7.1 Analysis corpus

Our corpus consists of a set of documents from the medical domain, basically diseases of the human body and related topics (surgeries, treatments, tests, etc.), drawn from MedlinePlus in Spanish (www.ncbi.nlm.nih.gov/pubmed/). In addition, two textbooks on related topics were added. In total, the document collection contains 750 thousand words. We selected a medical domain for reasons of availability of textual resources in digital format. Moreover, we assume that the selection of this domain does not impose strong restrictions on the generalization of results.

7.2 Extraction tools

The programming language used to automate all the required tasks was Python, specifically the NLTK module (Bird, Klein and Loper, 2009). Likewise, we relied on lexico-semantic patterns, which have a higher degree of generality, so we assume as input a corpus with part-of-speech tagging. Part-of-speech tagging was performed with TreeTagger (Schmid, 1994).

7.3 Automatic extraction of DCs

Following the methodology proposed by Sierra et al. (2010), as well as Acosta, Sierra and Aguilar (2011), we extract a set of the most frequent hypernyms detected in definitional contexts (DCs) and take them to a second stage of hyponym extraction, considering only adjectives as modifiers of the hypernym.

7.4 Building a stop list of non-relevant adjectives

At this point we assume that qualifying adjectives present conceptually simple features of little use for generating relevant hyponyms. Accordingly, we automatically obtain the set of adjectives that will make up the stop list by applying the heuristics described above.

7.5 Extraction of hyponyms derived from hypernyms

After filtering out qualifying adjectives, we obtain all the adjectival modifiers, along with their PMI scores. We then evaluate our method by comparing the Precision, Recall and F-measure levels obtained by both approaches: PMI and linguistic heuristics.

8 Results

In this section we show the results obtained with our method.

8.1 Initial production of candidate relations

We carried out our experiment considering hypernyms with an occurrence frequency of 5 or higher, following the criteria proposed by Acosta, Sierra and Aguilar (2011). Table 4 shows the first 10 hypernyms ranked by their level of relation productivity (PR) and the relations considered relevant (RRs) for the analysis domain, together with the initial Precision obtained (P). We take this initial production, guided only by the potential hypernym, as our baseline:

Table 4. Most frequent hypernyms

Hypernym     | PR  | RRs | P
Enfermedad   | 132 | 76  | 58
Infección    | 125 | 69  | 55
Tratamiento  | 112 | 39  | 35
Vacuna       | 79  | 41  | 52
Problema     | 67  | 40  | 60
Afección     | 64  | 38  | 59
Trastorno    | 61  | 45  | 74
Examen       | 60  | 33  | 55
Dolor        | 54  | 26  | 48
Célula       | 47  | 22  | 47

8.2 Ranking by PMI

We consider the PMI measure in the normalized version proposed by Bouma (2009), whose normalization addresses two fundamental concerns: using association measures whose values have a fixed interpretation, and reducing sensitivity to low occurrence frequencies in the data. The formula for normalized PMI is the following:
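The formula itself did not survive extraction of the original. Bouma's (2009) normalized PMI is standard, however, and can be restated as follows, where p(x, y) is the co-occurrence probability of a word pair (here, hypernym and adjective) and p(x), p(y) are the marginal probabilities:

```latex
\mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)},
\qquad
\mathrm{npmi}(x, y) = \frac{\mathrm{pmi}(x, y)}{-\log p(x, y)}
```

The normalization bounds the score in [-1, 1]: 1 when the two words only occur together, 0 under independence, and tending to -1 when they never co-occur, which provides the fixed interpretation and the reduced sensitivity to low frequencies mentioned above.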
From the results obtained, we observe that when PMI thresholds are set in order to filter relevant relations, although Recall remains relatively high as we raise the threshold, Precision is only marginally affected, as shown in table 5:

Table 5. Precision, Recall and F-measure by PMI threshold

Threshold          | R   | P  | F
Initial production | 100 | 51 | 68
Threshold > 0      | 98  | 51 | 67
Threshold >= 0.10  | 93  | 53 | 68
Threshold >= 0.20  | 83  | 55 | 66

8.3 Filtering non-relevant adjectives

Following Croft and Cruse's (2004) remarks on hyponyms that can be described by a hypernym plus a feature, we consider that discarding qualifying adjectives, which frequently are not related to important terms or concepts, makes it possible to improve the extraction of relevant hyponyms. The results obtained are summarized in table 6:

Table 6. Precision, Recall and F-measure by heuristic

Heuristic                                              | R  | P  | F
Gradability of adjectives                              | 84 | 67 | 75
Gradability and precedence of adjectives               | 82 | 69 | 75
Gradability, precedence and predication of adjectives  | 77 | 76 | 76

Table 6 shows better performance in Precision and Recall using the heuristics, compared with setting PMI thresholds. However, one phenomenon we observe is that the heuristics can also sweep up a set of relevant relational adjectives; the error rate found in our corpus is shown in table 7.

Table 7. Error rate obtained for the linguistic heuristics

Pattern        | Error rate (%)
<ADV><ADJ>     | 18
<VSFIN><ADJ>   | 17
<ADJ><NC>      | 15

Demonte (1999) points out that the semantic features that best distinguish qualifying from relational adjectives are gradability and polarity. It should be noted that the latter feature was not considered in our analysis, given the complexity involved in handling it computationally. Instead, we chose to rely on the heuristic of adjective precedence with respect to a noun, which, for the corpus under analysis, yields a lower error rate. Based on an exploration of the Spanish corpus of the Sketch Engine (Kilgarriff et al., 2004), we observe that the proposed heuristics do yield more than 95% qualifying adjectives; however, when specialized domains are considered, these heuristics are not as precise.

9 Final remarks

In this work we presented a comparison between two approaches for obtaining relevant hyponymy relations that can arise from a hypernym within a medical knowledge domain. The results obtained empirically support the idea put forward by Croft and Cruse (2004): a good hyponym is not necessarily a good taxonym of a hypernym.

The key point in this discussion is that we can generate a large number of relevant hyponyms that have a hypernym as their head. Unfortunately, given the generic nature of single-word hypernyms, these can be directly linked to a large number of modifiers at the adjective and prepositional-phrase level.

In this work we considered only adjectival modifiers, among which we observed a large number of qualifying and relational adjectives, the latter being, from our perspective, better hyponym candidates for a hypernym. The high degree of compositionality present in the relation between hypernyms and relational adjectives is notable, and it detracts from the Precision of association measures in selecting the relevant relations. It is precisely in these scenarios that the regularity of language, as Manning and Schütze (1999) mention, makes methods for disambiguation, parsing and, in our particular case, the extraction of relevant hyponyms, acquire great importance.

References

Acosta, O., C. Aguilar, and G. Sierra. 2010. A Method for Extracting Hyponymy-Hypernymy Relations from Specialized Corpora Using Genus Terms. In: Proceedings of the Workshop in Natural Language Processing and Web-based Technologies 2010, pages 1-10, Universidad Nacional de Córdoba (Argentina).

Acosta, O., G. Sierra, and C. Aguilar. 2011. Extraction of Definitional Contexts using Lexical Relations. International Journal of Computer Applications, 34(6): 46-53.

Berland, M., and E. Charniak. 1999. Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 57-64, Orlando (USA).

Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. O'Reilly, Sebastopol (Cal., USA).

Bosque, I., and V. Demonte. 1999. Gramática descriptiva de la lengua española, 3 volumes, Espasa-Calpe (Madrid).

Bouma, G. 2009. Normalized (Pointwise) Mutual Information in Collocation Extraction. In: From Form to Meaning: Processing Texts Automatically. Proceedings of the Biennial GSCL Conference, pages 31-40, Gunter Narr Verlag (Tübingen).

Church, K., and P. Hanks. 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1): 22-29.

Croft, W., and D. Cruse. 2004. Cognitive Linguistics. Cambridge University Press (Cambridge, UK).

Cruse, D. 1986. Lexical Semantics. Cambridge University Press (Cambridge, UK).

Demonte, V. 1999. El adjetivo. Clases y usos. La posición del adjetivo en el sintagma nominal. In: Gramática descriptiva de la lengua española, Vol. 1, Chap. 3, pages 129-215, Espasa-Calpe (Madrid).

Galicia, S., and A. Gelbukh. 2007. Investigaciones en análisis sintáctico del español. Instituto Politécnico Nacional (México DF).

Girju, R., A. Badulescu, and D. Moldovan. 2006. Automatic Discovery of Part-Whole Relations. Computational Linguistics, 32(1): 83-135.

Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of COLING-92, pages 539-545, Nantes (France).

Jackendoff, R. 2002. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford University Press (Oxford, UK).

Kilgarriff, A., P. Rychly, P. Smrz, and D. Tugwell. 2004. The Sketch Engine. In: Proceedings of the 11th EURALEX International Congress, pages 105-116, Lorient (France).

Manning, Ch., and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (Mass., USA).

Ortega, R., L. Villaseñor, and M. Montes. 2007. Using lexical patterns for extracting hyponyms from the Web. In: Proceedings of MICAI, LNCS, Springer (Berlin).

Ortega, R., C. Aguilar, L. Villaseñor, M. Montes, and G. Sierra. 2011. Hacia la identificación de relaciones de hiponimia/hiperonimia en Internet. Revista Signos. Estudios de Lingüística, 44(75): 68-84.

Pantel, P., and M. Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: Proceedings of the Conference on Computational Linguistics, ACL, Sydney (Australia).

Partee, B. 1995. Lexical Semantics and Compositionality. In: Invitation to Cognitive Science, Part I: Language, pages 311-36, MIT Press, Cambridge (Mass., USA).

Pineda, L., and I. Meza. 2003. Un modelo para la perífrasis española y el sistema de pronombres clíticos en HSPG. Estudios de Lingüística Aplicada, 38: 45-67.

Pinker, S. 1997. How the Mind Works. Norton & Company (New York).

Pustejovsky, J. 1995. The Generative Lexicon. MIT Press, Cambridge (Mass., USA).

Ritter, A., S. Soderland, and O. Etzioni. 2009. What is This, Anyway: Automatic Hypernym Discovery. In: Papers from the AAAI Spring Symposium, pages 88-93.

Ryu, P., and K. Choi. 2005. An Information-Theoretic Approach to Taxonomy Extraction for Ontology Learning. In: Ontology Learning from Text: Methods, Evaluation and Applications, pages 15-28, IOS Press (Amsterdam).

Saurí, R. 1997. Tractament lexicogràfic dels adjectius: aspectes a considerar. Papers de l'IULA: Monografies, Universitat Pompeu Fabra (Barcelona).

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the International Conference on New Methods in Language Processing. www.ims.uni-stuttgart.de~schmid.TreeTagger.

Sierra, G., R. Alarcón, C. Aguilar, and C. Bach. 2010. Definitional verbal patterns for semantic relation extraction. In: Probing Semantic Relations: Exploration and Identification in Specialized Texts, pages 73-96, John Benjamins Publishing (Amsterdam/Philadelphia).

Snow, R., D. Jurafsky, and A. Ng. 2006. Semantic Taxonomy Induction from Heterogeneous Evidence. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 801-808, Sydney (Australia).

Van Rijsbergen, K. 1979. Information Retrieval. Butterworths (Ontario, Canada).
2nd International Workshop on Exploiting Large Knowledge Repositories (E-LKR)
1st International Workshop on Automatic Text Summarization for the Future (ATSF)

Organizers:

 Ernesto Jiménez-Ruiz (University of Oxford)
 María José Aramburu (Universitat Jaume I)
 Roxana Dánger (Imperial College London)
 Antonio Jimeno-Yepes (National Library of Medicine, USA)
 Horacio Saggion (Universitat Pompeu Fabra)
 Elena Lloret (Universidad de Alicante)
 Manuel Palomar (Universidad de Alicante)

2nd International Workshop on Exploiting Large Knowledge Repositories (E-LKR)
Very large knowledge repositories (LKR) are being created, published and exploited in a wide range of fields, including Bioinformatics, Biomedicine, Geography, e-Government, and many others. Some well-known examples of LKRs include Wikipedia, large-scale Bioinformatics databases and ontologies such as those published by the EBI or the NIH (e.g. UMLS, GO), and government data repositories such as data.gov. These repositories are publicly available and can be used openly. Their exploitation offers many possibilities for improving current information systems, and opens new challenges and research opportunities in the information processing, databases and semantic web areas.

The main goal of this workshop is to bring together researchers who are working on the creation of new LKRs in any domain, or on their exploitation for specific information processing tasks such as data analysis, text mining, natural language processing and visualization, as well as for knowledge engineering issues such as knowledge acquisition, validation and personalization.

Research, demo and position papers showing the benefits that exploiting LKRs can bring to the information processing area are especially welcome at this workshop.

1st International Workshop on Automatic Text Summarization for the Future (ATSF)
Research on automatic text summarization started over 50 years ago and, although mature in some application domains (e.g., news), it faces new challenges in the current context of user‐generated on‐line content and social networks.

<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de
Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Created from bibliotecauptsp on 2019-09-28 09:35:08.

Information on the Web is constantly updated, sometimes without any quality control; an important proportion of the information is informal and ephemeral, a typical example being that of opinions and messages on the Internet.

 What techniques can be used to produce appropriate summaries in this context?
 How to measure relevance of ill‐formed input?
 How to produce understandable summaries from noisy texts?
 How to identify the most relevant information in a set of opinions?

High‐quality documentation, such as technical/scientific articles and patents, has not received all the attention that the field deserves. Given the explosion of technical documentation available on the Web and in intranets, scientists and research and development facilities face a true scientific information deluge: summarization should be a key instrument not only for reducing the information content but also for measuring information relevance in context, providing users with adequate answers in context.

 What techniques can be used to extract knowledge from complex technical documents?
 How to compile the information back into a well‐formed summary?
 How to measure relevance in a network of scientific articles, beyond mere citation counts?

Another open summarization research topic is non‐extractive summarization: the generation of a concise summary that is not a set of sentences taken from the input. This is a very difficult problem, since summarization systems must be able to easily adapt from one domain to another in order to recognize what is important and how to produce a coherent text from a textual or conceptual representation.

The workshop “Automatic Text Summarization for the Future” aims to bring together researchers and practitioners of natural language processing to address the aforementioned and related issues.

Los artículos completos de este taller han sido publicados en: http://ceur‐ws.org/Vol‐882/

PROGRAMA

A Challenge for Automatic Text Summarization
Leo Wanner, ICREA and DTIC, UPF

Towards an ontology based large repository for managing heterogeneous knowledge resources
Nizar Ghoula, Gilles Falquet

Enhancing the expressiveness of linguistic structures
José Mora, José A. Ramos, Guadalupe Aguado de Cea

Integrating large knowledge repositories in multiagent ontologies
Herlina Jayadianti, Carlos B. Sousa Pinto, Lukito Nugroho, Paulus Insap Santosa

A proposal for a European large knowledge repository in advanced food composition tables for assessing dietary intake
Oscar Coltell, Francisco Madueño, Zoe Falomir, Dolores Corella

Disambiguating automatically‐generated semantic annotations for Life Science open registries
Antonio Jimeno, Rafael Berlanga Llavori, María Pérez Catalán

Redundancy reduction for multi‐document summaries using A* search and discriminative training
Ahmet Aker, Trevor Cohn, Robert Gaizauskas

A dependency relation‐based method to identify attributive relations and its application in text summarization
Shamima Mithun, Leila Kosseim

Short Papers

Using biomedical databases as knowledge sources for large‐scale text mining
Fabio Rinaldi

Exploiting the UMLS metathesaurus in the ontology alignment evaluation initiative
Ernesto Jiménez‐Ruiz, Bernardo Cuenca Grau, Ian Horrocks

Statements of interest

KB_Bio_101: a repository of graph‐structured knowledge
Vinay K Chaudhri, Michael Wessel, Stijn Heymans

If it's on web it's yours!
Abdul Mateen Rajput

TASS ‐ Taller de Análisis de Sentimientos en la SEPLN

Organizadores:
 Julio Villena (Daedalus, SA)
 Sara Lana (Universidad Politécnica de Madrid)
 Alfonso Ureña (Universidad de Jaén)

According to the Merriam‐Webster dictionary, reputation is the overall quality or character of a given person or organization as seen or judged by people in general, or, in other words, the general recognition by other people of some characteristics or abilities for a given entity. Specifically, in business, reputation comprises the actions of a company and its internal stakeholders along with the perception of consumers about the business. Reputation affects attitudes like satisfaction, commitment and trust, and drives behaviour like loyalty and support. In turn, reputation analysis is the process of tracking, investigating and reporting an entity's actions and other entities' opinions about those actions. It covers many factors to calculate the market value of reputation. Reputation analysis has come into wide use as a major factor of competitiveness in the increasingly complex marketplace of personal and business relationships among people and companies.

Currently, market research is typically performed using user surveys. However, the rise of social media such as blogs and social networks and the increasing amount of user‐generated content in the form of reviews, recommendations, ratings and any other form of opinion has led to the creation of an emerging trend towards online reputation analysis. The so‐called sentiment analysis, i.e., the application of natural language processing and text analytics to identify and extract subjective information from texts, is the first step towards online reputation analysis and is becoming a promising topic in the field of marketing and customer relationship management, as social media and the associated word‐of‐mouth effect are turning out to be the most important source of information for companies about their customers' sentiments towards their brands and products.

Sentiment analysis is a major technological challenge. The task is so hard that even humans often disagree on the sentiment of a given text. The fact that issues that one individual finds acceptable or relevant may not be the same for others, along with multilingual aspects, cultural factors and different contexts, makes it very hard to classify a text written in a natural language as carrying a positive or negative sentiment. And the shorter the text is, for example when analyzing Twitter messages or short comments on Facebook, the harder the task becomes.

Within this context, TASS is an experimental evaluation workshop, organized as a satellite event of the SEPLN 2012 Conference held on September 7th, 2012 at Jaume I University in Castellón de la Plana, Comunidad Valenciana, Spain, to foster research in the field of sentiment analysis in social media, specifically focused on the Spanish language. The main objective is to promote the application of existing state‐of‐the‐art algorithms and techniques and the design of new ones for the implementation of complex systems able to perform sentiment analysis based on short text opinions extracted from social media messages (specifically Twitter) published by a series of representative personalities.

The challenge task is intended to provide a benchmark forum for comparing the latest
approaches in this field. In addition, with the creation and release of the fully tagged corpus,
we aim to provide a benchmark dataset that enables researchers to compare their algorithms
and systems.

PROGRAMA

Overview of TASS 2012 ‐ Workshop on Sentiment Analysis at SEPLN
Julio Villena‐Román, Janine García‐Morera, Cristina Moreno‐García, Linda Ferrer‐Ureña, Sara Lana‐Serrano, José Carlos González‐Cristóbal, Adam Westerski, Eugenio Martínez‐Cámara, M. Ángel García‐Cumbreras, M. Teresa Martín‐Valdivia, L. Alfonso Ureña‐López .......................... 94

TASS: Detecting Sentiments in Spanish Tweets
Xabier Saralegi Urizar, Iñaki San Vicente Roncal .......................... 103

Techniques for Sentiment Analysis and Topic Detection of Spanish Tweets: Preliminary Report
Antonio Fernández Anta, Philippe Morere, Luis Núñez Chiroque, Agustín Santos .......................... 112

The L2F Strategy for Sentiment Analysis and Topic Classification
Fernando Batista, Ricardo Ribeiro .......................... 125

Sentiment Analysis of Twitter messages based on Multinomial Naive Bayes
Alexandre Trilla, Francesc Alías .......................... 129

UNED at TASS 2012: Polarity Classification and Trending Topic System
Tamara Martín‐Wanton, Jorge Carrillo de Albornoz .......................... 131

UNED @ TASS: Using IR techniques for topic‐based sentiment analysis through divergence models
Angel Castellano González, Juan Cigarrán Recuero, Ana García Serrano .......................... 140

SINAI en TASS 2012
Eugenio Martínez Cámara, M. Angel García Cumbreras, M. Teresa Martín Valdivia, L. Alfonso Ureña López .......................... 147

Lexicon‐Based Sentiment Analysis of Twitter Messages in Spanish
Antonio Moreno‐Ortiz, Chantal Pérez‐Hernández .......................... 156

TASS - Workshop on Sentiment Analysis at SEPLN
TASS - Taller de Análisis de Sentimientos en la SEPLN

Julio Villena-Román, Janine García-Morera, Cristina Moreno-García, Linda Ferrer-Ureña
DAEDALUS
{jvillena, jgarcia, cmoreno}@daedalus.es

Sara Lana-Serrano
DIATEL - Universidad Politécnica de Madrid
slana@diatel.upm.es

José Carlos González-Cristóbal, Adam Westerski
GSI - Universidad Politécnica de Madrid
{jgonzalez, westerski}@dit.upm.es

Eugenio Martínez-Cámara, M. Ángel García-Cumbreras, M. Teresa Martín-Valdivia, L. Alfonso Ureña-López
Universidad de Jaén
{emcamara, maite, magc, laurena}@ujaen.es

Resumen: Este artículo describe el desarrollo de TASS, taller de evaluación experimental en el contexto de la SEPLN para fomentar la investigación en el campo del análisis de sentimiento en los medios sociales, específicamente centrado en el idioma español. El principal objetivo es promover el diseño de nuevas técnicas y algoritmos y la aplicación de los ya existentes para la implementación de complejos sistemas capaces de realizar un análisis de sentimientos basado en opiniones de textos cortos extraídos de medios sociales (concretamente Twitter). Este artículo describe las tareas propuestas, el contenido, formato y las estadísticas más importantes del corpus generado, los participantes y los diferentes enfoques planteados, así como los resultados generales obtenidos.
Palabras clave: TASS, análisis de reputación, análisis de sentimientos, medios sociales

Abstract: This paper describes TASS, an experimental evaluation workshop within SEPLN to
foster the research in the field of sentiment analysis in social media, specifically focused on
Spanish language. The main objective is to promote the application of existing state-of-the-art
algorithms and techniques and the design of new ones for the implementation of complex
systems able to perform a sentiment analysis based on short text opinions extracted from social
media messages (specifically Twitter) published by representative personalities. The paper
presents the proposed tasks, the contents, format and main statistics of the generated corpus, the
participant groups and their different approaches, and, finally, the overall results achieved.
Keywords: TASS, reputation analysis, sentiment analysis, social media.

1 Introduction

According to the Merriam‐Webster dictionary¹, reputation is the overall quality or character of a given person or organization as seen or judged by people in general, or, in other words, the general recognition by other people of some characteristics or abilities for a given entity.

Specifically, in business, reputation comprises the actions of a company and its internal stakeholders along with the perception of consumers about the business. Reputation affects attitudes like satisfaction, commitment and trust, and drives behavior like loyalty and support.

In turn, reputation analysis is the process of tracking, investigating and reporting an entity's actions and other entities' opinions about those actions. It covers many factors to calculate the market value of reputation. Reputation analysis has come into wide use as a major factor of competitiveness in the increasingly complex marketplace of personal and business relationships among people and companies.

Currently, market research is typically performed using user surveys. However, the rise of social media such as blogs and social networks and the increasing amount of user‐generated content in the form of reviews, recommendations, ratings and any other form of opinion has led to the creation of an emerging trend towards online reputation analysis. The so‐called sentiment analysis, i.e., the application of natural language processing and text analytics to identify and extract subjective information from texts, is the first step towards online reputation analysis and is becoming a promising topic in the field of marketing and customer relationship management, as social media and the associated word‐of‐mouth effect are turning out to be the most important source of information for companies about their customers' sentiments towards their brands and products.

Sentiment analysis is a major technological challenge. The task is so hard that even humans often disagree on the sentiment of a given text. The fact that issues that one individual finds acceptable or relevant may not be the same for others, along with multilingual aspects, cultural factors and different contexts, makes it very hard to classify a text written in a natural language as positive or negative. And the shorter the text is, for example when analyzing Twitter messages or short comments on Facebook, the harder the task becomes.

Within this context, TASS², which stands for Taller de Análisis de Sentimientos en la SEPLN (Workshop on Sentiment Analysis at SEPLN, in English), is an experimental evaluation workshop, organized as a satellite event of the SEPLN 2012 Conference, held on September 7th, 2012 at Jaume I University in Castellón de la Plana, Comunidad Valenciana, Spain, to promote research in the field of sentiment analysis in social media, initially focused on Spanish, although it could be extended to any language.

The main objective is to improve the existing techniques and algorithms and to design new ones in order to perform sentiment analysis on short text opinions extracted from social media messages (specifically Twitter) published by a series of important personalities. The challenge task is intended to provide a benchmark forum for comparing the latest approaches in this field. In addition, with the creation and release of the fully tagged corpus, we aim to provide a benchmark dataset that enables researchers to compare their algorithms and systems.

¹ http://www.merriam-webster.com/
² http://www.daedalus.es/TASS

2 Description of tasks

Two tasks are proposed for the participants in this first edition: sentiment analysis and trending topic coverage. Groups may participate in both tasks or just in one of them. Along with the submission of experiments, participants are encouraged to submit a paper to the workshop in order to describe their systems to the audience in a regular workshop session together with special invited speakers. Submitted papers are reviewed by the program committee.

2.1 Task 1: Sentiment Analysis

This task consists of performing an automatic sentiment analysis to determine the polarity of each message in the test corpus. The evaluation metrics used to evaluate and compare the different systems are the usual measurements of precision (1), recall (2) and F‐measure (3), calculated over the full test set, as shown in Figure 1.

Figure 1: Evaluation metrics

2.2 Task 2: Trending topic coverage

In this case, the technological challenge is to build a classifier to identify the topic of the text,
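The Task 1 metrics of Figure 1 are the standard per-label precision (1), recall (2) and F-measure (3) computed over the full test set. A minimal sketch of that computation follows; the polarity labels used in the toy example are illustrative placeholders, not necessarily the official TASS tagset.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-label precision (1), recall (2) and F1 (3) over a test set."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
    fp = sum(1 for g, p in zip(gold, predicted) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, predicted) if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold/system labels (illustrative tagset).
gold = ["P", "N", "P", "NEU", "N"]
pred = ["P", "P", "P", "NEU", "N"]
print(precision_recall_f1(gold, pred, "P"))  # precision 2/3, recall 1.0, F1 0.8
```

Averaging the per-label scores (macro-averaging) gives figures comparable to the macro-averaged F1 rates reported for the participating systems.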

dataset and it achieves a maximum macro‐averaged F1 measure rate of 36.28%.

4.5 LSI – UNED

(Martín‐Wanton and Carrillo de Albornoz, 2012) presents the participation of the UNED group in TASS. For polarity classification, they propose an emotional concept‐based method. The original method makes use of an affective lexicon to represent the text as the set of emotional meanings it expresses, along with advanced techniques to identify negations and intensifiers, their scope and their effect on the emotions affected by them. Besides, the method addresses the problem of word ambiguity, taking into account the contextual meaning of terms by using a word sense disambiguation algorithm.

On the other hand, for topic detection, the system is based on a probabilistic model (Twitter‐LDA). They first build for each topic of the task a lexicon of words that best describe it, thus representing each topic as a ranking of discriminative words. Moreover, a set of events is retrieved based on a probabilistic approach that was adapted to the characteristics of Twitter. To determine which of the topics corresponds to each event, the topic with the highest statistical correlation was obtained by comparing the ranking of words of each topic with the ranking of words most likely


to belong to the event. The experimental results achieved show the adequacy of their approach for the task, as shown later.

4.6 LSI – UNED 2

(Castellano, Cigarrán and García Serrano, 2012) describes the research done for the workshop by the second team of the LSI group at UNED. Their proposal addresses sentiment and topic detection from an Information Retrieval (IR) perspective, based on language divergences. Kullback‐Leibler Divergence (KLD) is used to generate both the polarity and the topic models, which will be used in the IR process. In order to improve the accuracy of the results, they propose several approaches focused on building language models that consider not only the textual content associated to each tweet but, as an alternative, the named entities or adjectives detected as well. Results show that modeling the tweet set using named entities and adjectives improves the final precision results and, as a consequence, their representativeness in the model compared with the use of common terms. General results are promising (fifth and fourth position in each of the proposed tasks), indicating that an IR and language models based approach may be an alternative to other classical proposals focused on the application of classification techniques.

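The Kullback-Leibler divergence underlying the UNED 2 approach compares two language models. A minimal sketch of smoothed KLD between unigram models estimated from token lists follows; the add-alpha smoothing constant and the toy token lists are illustrative assumptions, not the exact modeling choices of Castellano et al.

```python
import math
from collections import Counter

def kld(p_tokens, q_tokens, alpha=0.01):
    """Kullback-Leibler divergence D(P||Q) between two smoothed
    unigram language models estimated from token lists."""
    vocab = set(p_tokens) | set(q_tokens)
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        divergence += p * math.log(p / q)
    return divergence

# Differing models yield a strictly positive divergence;
# identical models yield (numerically) zero.
print(kld("great great film loved it".split(), "awful film hated it".split()))
```

In an IR setting such as the one described above, a tweet would be assigned to the polarity or topic model with the lowest divergence from the tweet's own language model.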

4.7 SINAI – Universidad de Jaén

The participation of the SINAI research group of the University of Jaén is described in (Martínez Cámara et al., 2012). For the first task, they have chosen a supervised machine learning approach, in which they have used SVM for classifying the polarity. Text features included are unigrams, emoticons, positive and negative words, and intensity markers.

In the second task, they have also used SVM for the topic classification, but several bags of words (BoW) have been used with the goal of improving the classification performance. One BoW has been obtained using the Google AdWords Keyword Tool, which allows entering a term and directly returns the top N related concepts. The second BoW has been generated from the hashtags of the training tweets, for each category.

4.8 Universidad de Málaga (UMA)

(Moreno‐Ortiz and Pérez‐Hernández, 2012) describes the participation of the group at the Facultad de Filosofía y Letras of the Universidad de Málaga. They use a lexicon‐based approach to Sentiment Analysis (SA). These approaches differ from the more common machine‐learning based approaches in that the former rely solely on previously generated lexical resources that store polarity information for lexical items, which are then identified in the texts, assigned a polarity tag, and finally weighed to come up with an overall score for the text. Such SA systems have been proved to perform on par with supervised, statistical systems, with the added benefit of not requiring a training set. However, it remains to be seen

Ureña López, L. Alfonso. SINAI at TASS 2012. TASS 2012 Working Notes.
Moreno-Ortiz, Antonio; Pérez-Hernández, Chantal. Lexicon-Based Sentiment Analysis of Twitter Messages in Spanish. TASS 2012 Working Notes.
TASS: Detecting Sentiments in Spanish Tweets
TASS: Detección de Sentimientos en Tuits en Español

Xabier Saralegi Urizar
Elhuyar Fundazioa
Zelai Haundi 3, 20170 Usurbil
x.saralegi@elhuyar.com

Iñaki San Vicente Roncal
Elhuyar Fundazioa
Zelai Haundi 3, 20170 Usurbil
i.sanvicente@elhuyar.com

Resumen: Este artículo describe el sistema presentado por nuestro grupo para la tarea de análisis de sentimiento enmarcada en la campaña de evaluación TASS 2012. Adoptamos una aproximación supervisada que hace uso de conocimiento lingüístico. Este conocimiento lingüístico comprende lematización, etiquetado POS, etiquetado de palabras de polaridad, tratamiento de emoticonos, tratamiento de negación, y ponderación de polaridad según el nivel de anidamiento sintáctico. También se lleva a cabo un preprocesado para el tratamiento de errores ortográficos. La detección de las palabras de polaridad se hace de acuerdo a un léxico de polaridad para el castellano creado en base a dos estrategias: proyección o traducción de un léxico de polaridad de inglés al castellano, y extracción de palabras divergentes entre los tuits positivos y negativos correspondientes al corpus de entrenamiento. Los resultados de la evaluación final muestran un buen rendimiento del sistema así como una notable robustez tanto para la detección de polaridad a alta granularidad (65% de exactitud) como a baja granularidad (71% de exactitud).
Palabras clave: TASS, Análisis de sentimiento, Minería de opiniones, Detección de polaridad
Abstract: This article describes the system presented for the sentiment analysis task of the TASS 2012 evaluation campaign. We adopted a supervised approach that includes some linguistic knowledge-based processing for preparing the features. The processing comprises lemmatisation, POS tagging, tagging of polarity words, treatment of emoticons, treatment of negation, and weighting of polarity words depending on the syntactic nesting level. A pre-processing step for the treatment of spelling errors is also performed. Detection of polarity words is done according to a polarity lexicon built in two ways: projection to Spanish of an English lexicon, and extraction of divergent words between the positive and negative tweets of the training corpus. Evaluation results show good performance and also good robustness of the system, both for fine-grained (65% accuracy) and coarse-grained (71% accuracy) polarity detection.

Keywords: TASS, Sentiment analysis, Opinion mining, Polarity detection

1 Introduction

Knowledge management is an emerging research field that is very useful for improving productivity in different activities. Knowledge discovery, for example, is proving very useful for tasks such as decision making and market analysis. With the explosion of Web 2.0, the Internet has become a very rich source of user-generated information, and research areas such as opinion mining or sentiment analysis have attracted many researchers. Being able to identify and extract the opinions of users about topics or products would enable many organizations to obtain global feedback on their activities. Some studies (O'Connor et al., 2010) have pointed out that such systems could perform as well as traditional polling systems, but at a much lower cost. In this context, social media like Twitter constitute a very valuable source when seeking opinions and sentiments.

The TASS evaluation challenge consisted of two tasks: predicting the sentiment of Spanish tweets, and identifying the topic of


the tweets. The TASS evaluation workshop aims "to provide a benchmark forum for comparing the latest approaches in this field". Our team only took part in the first task, which involved predicting the polarity of a number of tweets with respect to a 6-category classification, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. It must be noted that most works in the literature only classify sentiments as positive or negative, and only in a few papers are neutral and/or objective categories included. We developed a supervised system based on a polarity lexicon and a series of additional linguistic features.

The rest of the paper is organized as follows. Section 2 reviews the state of the art in the polarity detection field, placing special interest on sentence-level detection, and on Twitter messages in particular. The third section describes the system we developed, the features we included in our supervised system and the experiments we carried out over the training data. The next section presents the results we obtained with our system, first on the training set and later on the test data-set. The last section draws some conclusions and future directions.

2 State of the Art

Much work has been done in the last decade in the field of sentiment labelling. Most of these works are limited to polarity detection. Determining the polarity of a text unit (e.g., a sentence or a document) usually includes using a lexicon composed of words and expressions annotated with prior polarities (Turney, 2002; Kim and Hovy, 2004; Riloff, Wiebe, and Phillips, 2005; Godbole, Srinivasaiah, and Skiena, 2007). Much research has been done on the automatic or semi-automatic construction of such polarity lexicons (Riloff and Wiebe, 2003; Esuli and Sebastiani, 2006; Rao and Ravichandran, 2009; Velikovich et al., 2010).

Regarding the algorithms used in sentiment classification, although there are approaches based on averaging the polarity of the words appearing in the text (Turney, 2002; Kim and Hovy, 2004; Hu and Liu, 2004; Choi and Cardie, 2009), machine learning methods have become the more widely used approach. Pang et al. (2002) proposed a unigram model using Support Vector Machines which does not need any prior lexicon to classify movie reviews. Read (2005) confirmed the necessity to adapt the models to the application domain, and (Choi and Cardie, 2009) address the same problem for polarity lexicons.

In the last few years many researchers have turned their efforts to microblogging sites such as Twitter. As an example, (Bollen, Mao, and Zeng, 2010) have studied the possibility of predicting stock market results by measuring the sentiments expressed in Twitter about it. The special characteristics of the language of Twitter require a special treatment when analyzing the messages. A special syntax (RT, @user, #tag, ...), emoticons, ungrammatical sentences, vocabulary variations and other phenomena lead to a drop in the performance of traditional NLP tools (Foster et al., 2011; Liu et al., 2011). In order to solve this problem, many authors have proposed a normalization of the text, as a pre-process of any analysis, reporting an improvement in the results. Brody (2011) deals with the word lengthening phenomenon, which is especially important for sentiment analysis because it usually expresses emphasis of the message. (Han and Baldwin, 2011) use morphophonemic similarity to match variations with their standard vocabulary words, although only 1:1 equivalences are treated; e.g., "imo = in my opinion" would not be identified. Instead, they use an Internet slang dictionary to translate some of those expressions and acronyms. Liu et al. (2012) propose combining three strategies, including letter transformation, "priming" effect, and misspelling corrections.

Once the normalization has been performed, traditional NLP tools may be used to analyse the tweets and extract features such as lemmas or POS tags (Barbosa and Feng, 2010). Emoticons are also good indicators of polarity (O'Connor et al., 2010). Other features analyzed in sentiment analysis such as discourse information (Somasundaran et al., 2009) can also be helpful. (Speriosu et al., 2011) explore the possibility of exploiting the Twitter follower graph to improve polarity classification, under the assumption that people influence one another or have shared affinities about topics. (Barbosa and Feng, 2010; Kouloumpis, Wilson, and Moore, 2011) combined polarity lexicons with machine learning for labelling the sentiment of tweets. Sindhwani and Melville (2008) adopt a semi-supervised approach using a polarity lexicon combined with label propagation.

A common problem of the supervised approaches is to gather labelled data for training. In the case of the TASS challenge, we would tackle this problem should we want to collect additional training data. In order to automatically build annotated corpora, (Go, Bhayani, and Huang, 2009) collect tweets containing the ":)" emoticon and regard them as positive, and likewise for the ":(" emoticon. Kouloumpis (2011) uses a similar approach based on the most common positive and negative hashtags. Barbosa (Barbosa and Feng, 2010) rely on existing web services such as Twend or Tweetfeel to collect annotated emoticons. One major problem of the aforementioned strategies is that only positive and negative tweets can be collected.

3 Experiments

3.1 Training Data

The training data Ct provided by the organization consists of 7,219 twitter messages (see Table 1). Each tweet is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. 6 levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+) and no sentiment (NONE). The numbers of tweets corresponding to P+ and NONE are higher than the rest. NEU is the class including the least tweets. In addition, each message includes its Twitter ID, the creation date and the twitter user ID.

Polarity   #tweets   % of #tweets
P+           1,764       24.44%
P            1,019       14.12%
NEU            610        8.45%
N            1,221       16.91%
N+             903       12.51%
NONE         1,702       23.58%
Total        7,219      100%

Table 1: Polarity classes distribution in corpus Ct.

3.2 Polarity Lexicon

We created a new polarity lexicon for Spanish Pes from two different sources:

a) An existing English polarity lexicon (Wilson et al., 2005) Pen was automatically translated into Spanish by using an English-Spanish bilingual dictionary Den−es (see Table 2). Despite Pen including neutral words, only positive and negative ones were selected and translated. Ambiguous translations were solved manually by two annotators. Altogether, 7,751 translations were checked. Polarity was also checked and corrected during this manual annotation. It must be noted that as all translation candidates were checked, many variants of the same source word were selected in many cases. Finally, 2,164 negative words and 1,180 positive words were included in the polarity lexicon (see fifth column of Table 3). We detected a significant number of OOV words (35%) in this translation process (see second and third columns of Table 3). Most of these words were inflected forms: pasts (e.g., "terrified"), plurals (e.g., "winners"), adverbs (e.g., "vibrantly"), etc. So they were not dealt with.

          #headwords   #pairs   avg. #translations
Den−es    15,134       31,884   2.11

Table 2: Characteristics of the Den−es bilingual dictionary.

b) As a second source for our polarity lexicon, words were automatically extracted from the training corpus Ct. In order to extract the words most associated with a certain polarity, let us say positive, we divided the corpus into two parts: positive tweets and the rest of the corpus. Using the Log-likelihood ratio (LLR) we obtained the ranking of the most salient words in the positive part with respect to the rest of the corpus. The same process was conducted to obtain negative candidates. The top 1,000 negative and top 1,000 positive words were manually checked. Among them, 338 negative and 271 positive words were selected for the polarity lexicon (see sixth column in Table 3). We found a higher concentration of good candidates among the best ranked candidates (see Figure 1).
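The LLR ranking just described can be sketched as follows. This is an illustrative implementation of Dunning's log-likelihood ratio over whitespace-tokenised tweets; the function names are invented for the example and this is not the code used for the experiments.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11/k12 = frequency of the word in the target part / in the rest,
    k21/k22 = remaining token counts of each part."""
    def h(*ks):  # sum of k*ln(k), with the convention 0*ln(0) = 0
        return sum(k * math.log(k) for k in ks if k > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22) + h(n)
                - h(k11 + k12) - h(k21 + k22)
                - h(k11 + k21) - h(k12 + k22))

def salient_words(target_tweets, rest_tweets, top_n=1000):
    """Rank the words of the target part against the rest of the corpus
    by LLR, as done to obtain positive and negative candidates."""
    tf = Counter(w for t in target_tweets for w in t.split())
    rf = Counter(w for t in rest_tweets for w in t.split())
    n_t, n_r = sum(tf.values()), sum(rf.values())
    scores = {w: llr(tf[w], rf[w], n_t - tf[w], n_r - rf[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words concentrated in one part of the corpus receive a high score, while words spread evenly over both parts score near zero, which is why good polarity candidates concentrate among the best ranked words.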
3.3 Supervised System

Although some preliminary experiments were conducted using an unsupervised approach, we chose to build a supervised classifier, because it allowed us to combine the various features more effectively. We used the SMO implementation of the Support Vector Machine algorithm included in the Weka (Hall et al., 2009) data mining software. Default configuration was used. All the classifiers built over the training data were evaluated by means of the 10-fold cross-validation strategy, except for the one including additional training data (see section 3.3.6 for details).

polarity   English words in Pen   Words translated by Den−es   Translation candidates   Manually selected candidates   Manually selected from Ct   Final lexicon Pes
negative   4,144                  2,416                        3,480                    2,164                          271                         2,435
positive   2,304                  2,057                        2,271                    1,180                          338                         1,518
Total      6,878                  4,473                        5,751                    3,344                          609                         3,953

Table 3: Statistics of the polarity lexicons.

Figure 1: Precision of candidates from Ct depending on LLR ranking intervals (100 candidates per interval {1-100, 101-200, ...}).

As mentioned in section 2, microblogging in general and Twitter in particular suffers from a high presence of spelling errors. This hampers any knowledge-based processing as well as supervised methods. We rejected the use of spell-correctors such as the Google spell-checker because they try to treat many correct words that they do not know. Therefore, we apply some heuristics in order to pre-process the tweets and solve the main problems we detected in the training corpus:

• Replication of characters (e.g., "Sueñooo"): Sequences of the same character are replaced by a single character when the pre-edited word is not included in Freeling's¹ dictionary and the post-edited word appears in Freeling's dictionary.

• Abbreviations (e.g., "q", "dl", ...): A list of abbreviations is created from the training corpus. These abbreviations are expanded before the lemmatisation process.

• Overuse of upper case (e.g., "MIRA QUE BUENO"): Upper case is used to give more intensity to the tweet. If we detect a sequence of two words all the characters of which are upper case and which are included in Freeling's dictionary as common, we change them to lower case.

• Normalization of urls: The complete url is replaced by the "URL" string.

¹ http://nlp.lsi.upc.edu/freeling

3.3.1 Baseline

As baseline we implemented a unigram representation using all lemmas in the training corpus as features (15,069 altogether). Lemmatisation was done by using Freeling. Contrary to (Pang, Lee, and Vaithyanathan, 2002), we stored the frequency of the lemmas in a tweet. Although using presence performed slightly better in the baseline configuration (the improvement was not significant), as other features were included, we achieved better results by using frequency. Thus, for the sake of simplicity, all the experiments shown make use of the frequency.

3.3.2 Selection of Polarity Words (SP)

Only lemmas corresponding to words included in the polarity lexicon Pes (see section 3.2) were selected as features. This allows the system to focus on features that express the polarity, without further noise. Another effect is that the number of features decreases significantly (from 15,069 to 3,730), thus reducing the computational costs of the model. In our experiments, relying on the polarity lexicon (see Table 4) clearly outperforms the unigram-based baseline. The rest of the features were tested on top of this configuration.

3.3.3 Emoticons and Interjections (EM)

Emoticons and interjections are very strong expressions of sentiments. A list of emoticons was collected from a Wikipedia article about emoticons, and all of them were classified as positive (e.g., ":)", ":D", ...) or negative (e.g., ":(", "u u", ...). 23 emoticons were classified as positive and 35 as negative. A list of 54 negative (e.g., "mecachis", "sniff", ...) and 28 positive (e.g., "hurra", "jeje", ...) interjections, including variants modelled by regular expressions, was also collected from different webs as well as from the training corpora. The frequency of each emoticon and interjection type (positive or negative) is included as a feature of the classifier.

The number of upper-case letters in the tweet was also used as an orthographical clue. In Twitter, where it is not possible to use letter styling, people often use the upper case to emphasize their sentiments (e.g., GRACIAS), and hence, a large number of upper-case letters would denote subjectivity. So, the relative number of upper-case letters in a tweet is also included as a feature.

According to the results (see Table 4), these clues did not provide a significant improvement. Nevertheless, they did show a slight improvement. Moreover, other literature shows that such features indeed help to detect the polarity (Kouloumpis, 2011). The low impact of these features could be explained by the low density of such elements in our data-set: only 622 out of 7,219 tweets in the training data (8.6%) include emoticons or interjections. Emoticon, interjection and capitalization features were included in our final model.

3.3.4 POS Information (PO)

Results reported in the literature are not clear as to whether POS information helps to determine the polarity of the texts (Kouloumpis, 2011), but POS tags are useful for distinguishing between subjective and objective texts. Our hypothesis is that certain POS tags are more frequent in opinion messages, e.g., adjectives. In our experiments POS tags provided by Freeling were used. We used as a feature the frequency of the POS tags in a message.

Results in Table 4 show that this feature provides a notable improvement and it is especially helpful for detecting objective messages (see the difference in F-score between SP and SP+PO for the NONE class).

3.3.5 Frequency of Polarity Words (FP)

The SP classifier does not interpret the polarity information included in the lexicon. We explicitly provide that information as a feature to the classifier. Furthermore, without the polarity information, the classifier will be built taking into account only those polarity words appearing in the training data. Including the polarity frequency information explicitly, the polarity words included in Pes but not in the training corpus will be used by the classifier. By dealing with those OOV polarity words, our intention is to make our system more robust.

Two new features are created to encode the polarity information: a score of the positivity and a score of the negativity of a tweet. In principle, positive words in Pes add 1 to the positivity score and negative words add 1 to the negativity score. However, depending on various phenomena, the score of a word can be altered. These phenomena are explained below.

Treatment of Negations and Adverbs
The polarity of a word changes if it is included in a negative clause. The polarity of a word also increases or decreases depending on the adverb which modifies it. We created a list of increasing (e.g., "mucho", "absolutamente", ...) and decreasing (e.g., "apenas", "poco", ...) adverbs. If an increasing adverb modifying a polarity word is detected, the polarity is increased (+1). If it is a decreasing adverb, the polarity of the word is decreased (−1). Syntactic information provided by Freeling is used for detecting these cases.

Syntactic Nesting Level
The importance of the word in the tweet determines the influence it can have on the polarity of the whole tweet. We measured the importance of each word w by calculating the relative syntactic nesting level ln(w). The lower the syntactic level, the less important it is. The relative syntactic nesting level is computed as the inverse of the syntactic nesting level (1/ln(w)).

Features/Metric   Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline          0.45            0.574   0.267   0.137   0.368   0.385   0.578
SP                0.484           0.594   0.254   0.098   0.397   0.422   0.598
SP+PO             0.496           0.596   0.245   0.093   0.414   0.438   0.634
SP+EM             0.49            0.612   0.253   0.097   0.402   0.428   0.6
SP+FP             0.514           0.633   0.261   0.115   0.455   0.438   0.613
All               0.523           0.648   0.246   0.111   0.463   0.452   0.657
ALL+AC1           0.523           0.647   0.248   0.116   0.46    0.451   0.655

Table 4: Accuracy results obtained on the evaluation of the training data. Columns 3 to 8 show F-scores for each of the class values.
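As an illustration of how the scores of section 3.3.5 combine negation, adverb modification and the nesting weight, the following sketch recomputes the positivity and negativity of one analysed tweet. The word lists and the (lemma, negated, nesting level) input are toy stand-ins for the lexicon Pes and the Freeling analysis, and all names are invented for the example.

```python
# Toy lexicon and adverb lists standing in for Pes and the lists of the paper.
POSITIVE = {"bueno", "genial"}
NEGATIVE = {"malo", "triste"}
INCREASING = {"mucho", "absolutamente"}
DECREASING = {"apenas", "poco"}

def tweet_scores(analysed_tweet):
    """analysed_tweet: list of (lemma, negated, nesting_level) triples,
    standing in for the syntactic analysis provided by Freeling.
    Returns the (positivity, negativity) scores of the tweet."""
    pos, neg = 0.0, 0.0
    prev = None
    for lemma, negated, level in analysed_tweet:
        if lemma in POSITIVE or lemma in NEGATIVE:
            score = 1
            if prev in INCREASING:    # intensifying adverb: +1
                score += 1
            elif prev in DECREASING:  # attenuating adverb: -1
                score -= 1
            # A word inside a negative clause flips to the opposite score.
            is_positive = (lemma in POSITIVE) != negated
            score /= level            # weight by 1/ln(w), the inverse nesting level
            if is_positive:
                pos += score
            else:
                neg += score
        prev = lemma
    return pos, neg
```

For instance, "mucho bueno" at nesting level 1 contributes 2 to the positivity score, while a negated "bueno" at level 2 contributes 0.5 to the negativity score.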
3.3.6 Using Additional Corpora (AC)

Additional training data were retrieved using the Perl Net::Twitter API. Different searches were conducted during June 2012 using the attitude feature of the twitter search. Using this feature, users can search for tweets expressing either positive or negative opinion. The search is based on emoticons as in (Go et al., 2009). Retrieved tweets were classified according to their attitude.

Corpora/Tweets   P        N       Total
Ctw              11,363   9,865   21,228

Table 5: Characteristics of the tweet corpus collected from Twitter.

The corpus Ctw including the retrieved tweets (see Table 5) was used in two ways. On the one hand, we used it to find new words for our polarity lexicon Pes, by using the automatic method described in section 3.2. The first 500 positive candidates and 500 negative candidates were manually checked. Altogether, 110 positive words and 95 negative ones (AC1) were included in the polarity lexicon Pes. According to the results (see ALL+AC1 in Table 4), these new polarity words do not provide any improvement. The reason is that the most relevant polarity words included in the training corpus Ct are already included in Pes, as explained in section 3.2. In order to measure the contribution of these words better, evaluation was carried out against the test corpus, where more OOV polarity words would be likely to appear (see section 4).

On the other hand (AC2), we added Ctw to the training data, in the hypothesis that more training data would lead to a better model, although polarity strength was not distinguished. Thus, only P and N examples are obtained. In order to evaluate the effect of the new data, the original training data were divided into two parts: 85% (6,137 tweets) for training (Ct−train) and 15% (1,082) for testing (Ct−test). The test data were randomly selected and the proportions of the polarity classes were maintained equal in both parts. Our first classifier (ALL+AC2) was trained with all the retrieved tweets included in Ctw as well as the tweets in Ct−train. Results show (see Table 6) that accuracy decreased when using these data for training. A second experiment was carried out (ALL+AC2-OOV), adding to the training data Ct−train only those tweets of Ctw containing at least one word w from Pes but not appearing in the training corpus (w ∈ Pes ∧ freq(w, Ct−train) = 0). Only 7.9% of the retrieved tweets were added. Results were still unsatisfactory, and so, additional training data were left out of the final model.

It must be noted that the tweet retrieval effort was very simple, due to the limited time we had to develop the system. We conclude that these additional training data were unhelpful due to the differences with the original data provided: Ctw contained many more ungrammatical structures and nonstandard tokens than the original data; the dates of the tweets were different, which could even lead to topic and vocabulary differences; and especially, the fact that the additional data collected did not include neutral or objective tweets, and neither did it include different degrees of polarity in the case of positive and negative tweets.

Features/Metric   #training examples   Accuracy
ALL               6,137                0.573
ALL+AC2           27,365               0.507
ALL+AC2-OOV       7,807                0.569

Table 6: Results obtained by including additional examples in the training data.
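The corpus manipulation just described can be sketched as follows: a stratified 85/15 split of the training data, and the AC2-OOV filter that keeps only the tweets of Ctw containing a lexicon word unseen in Ct−train. This is a toy illustration with invented names, not the pipeline actually used.

```python
import random
from collections import defaultdict

def stratified_split(tweets, labels, train_ratio=0.85, seed=0):
    """Random split keeping the polarity class proportions equal in both
    parts, as done for Ct-train (85%) / Ct-test (15%)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tweet, label in zip(tweets, labels):
        by_class[label].append(tweet)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train += [(t, label) for t in items[:cut]]
        test += [(t, label) for t in items[cut:]]
    return train, test

def ac2_oov_filter(extra_tweets, polarity_lexicon, train_tweets):
    """Keep only the extra tweets containing some polarity word w with
    w in P_es and freq(w, Ct-train) = 0."""
    train_vocab = {w for t in train_tweets for w in t.split()}
    oov_polarity = polarity_lexicon - train_vocab
    return [t for t in extra_tweets if oov_polarity & set(t.split())]
```

Splitting per class before shuffling is what keeps the class proportions equal in both parts; a plain random split over the whole corpus would only approximate them.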
4 Evaluation and Results

The evaluation test-set Ce provided by the organization consists of 60,798 twitter messages (see Table 7) annotated as explained in section 3.1. Only one run of results was allowed for submission. Although the results include classification into 6 categories (5 polarities + NONE), the results were also given on a 4-category basis (3 polarities + NONE). For the 4-category results, all tweets regarded as positive are grouped into a single category, and the same is done for negative tweets. Table 8 presents the results for both evaluations using the best scored classifiers in the training process. In addition to the accuracy results, Table 8 shows F-scores for each class for the 6-category classification.

Polarity   #tweets   % of #tweets
P+          20,745       34.12%
P            1,488        2.45%
NEU          1,305        2.15%
N           11,287       18.56%
N+           4,557        7.5%
NONE        21,416       35.22%
Total       60,798      100%

Table 7: Polarity classes distribution in test corpus Ce.

The first thing we notice is that the results obtained with the test data are better than those achieved with the training data for all configurations. The best system (ALL+AC1) achieves 0.653 of accuracy, while the same system scored 0.523 of accuracy in training. Even the baseline shows the same tendency. Regarding the differences between configurations, the tendencies observed in the cross-validation evaluation of the training data are confirmed in the evaluation of the test data. Then again, the improvement of ALL+AC1 over Baseline is also higher in the test data-based evaluation than in the training cross-validation evaluation: a 16.22% improvement in accuracy over the baseline was obtained in training cross-validation, while in the test data evaluation the improvement rose to 23.91%. P+ and NONE are the classes our classifier identifies best, NEU and P being the classes with the worst performance (tables 4 and 8). If we look at the distribution of the polarity classes (tables 1 and 7), we can see that the proportion of the P+ and NONE classes increases significantly in the test data with respect to the training data. By contrast, the NEU and P classes decreased dramatically. The distribution difference together with the performance of the system regarding specific classes could explain the difference in accuracy between test and training evaluations. It remains unclear to us why the F-scores for all the classes improved with respect to the training phase. We should analyse the characteristics of the training and test corpora, looking for differences in the samples and annotation.

As for the results of the individual classes, it is worth mentioning that neutral tweets are very difficult to classify because they do contain polarity words. We looked at the confusion matrix (both for training and test evaluations) and it shows that wrongly classified NEU tweets are evenly distributed between the other classes, except for the NONE class, with almost no NEU tweets classified as NONE. Most of the NEU tweets contain positive and negative sentences, which leads us to think that a discourse treatment could be useful in order to determine the importance of each sentence with respect to the whole tweet. In the case of positive tweets, P tweets, many of them are classified as P+.

In the experiment (AC1) described in section 3.3.6 we did not obtain any improvement by adding the words extracted from an additional corpus of tweets to the polarity lexicon Pes. If we take into account that the most significant words of the training corpus (Ct) were already included in Pes, it could be expected that the words in AC1 would have little effect on the training data. In the evaluation against the test data, where the vocabulary is larger, the AC1 lexicon provides a slight improvement (see the difference between ALL and ALL+AC1 in Table 8).

Metric/System   Acc. (4 cat.)   Acc. (6 cat.)   P+      P       NEU     N       N+      NONE
Baseline        0.616           0.527           0.638   0.214   0.139   0.483   0.471   0.587
ALL             0.702           0.641           0.752   0.323   0.166   0.563   0.564   0.683
ALL+AC1         0.711           0.653           0.753   0.32    0.167   0.566   0.566   0.685

Table 8: Results obtained on the evaluation of the test data.

5 Conclusions

We have presented an SVM classifier for detecting the polarity of Spanish tweets. Our system effectively combines several features based on linguistic knowledge. In our case, using a semi-automatically built polarity lexicon improves the system performance significantly over a unigram model. Other features such as POS tags, and especially word polarity statistics, were also found to be helpful. In our experiments, including external training data was unsuccessful. However, our approach was very simple, and so, a more exhaustive experimentation should be carried out in order to obtain conclusive results. In any case, the system shows robust performance when it is evaluated against test data different from the training data.

There is still much room for improvement. Tweet normalization was naïvely implemented. Some authors (Pang and Lee, 2004; Barbosa and Feng, 2010) have obtained positive results by including a subjectivity analysis phase before the polarity detection step. We would like to explore that line of work. Lastly, it would be worthwhile conducting in-depth research into the creation of polarity lexicons, including domain adaptation and treatment of word senses.

Acknowledgments

This work has been partially funded by the Industry Department of the Basque Government under grant IE11-305 (knowTOUR project).

References

Barbosa, Luciano and Junlan Feng. 2010. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 36–44, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bollen, Johan, Huina Mao, and Xiao-Jun Zeng. 2010. Twitter mood predicts the stock market. arXiv:1010.3003, October.

Brody, Samuel and Nicholas Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 562–570. Association for Computational Linguistics.

Choi, Yejin and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP '09, pages 590–598, Stroudsburg, PA, USA. Association for Computational Linguistics.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SENTIWORDNET: a publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06), pages 417–422.

Foster, Jennifer, Ozlem Cetinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. #hardtoparse: POS tagging and parsing the twitterverse. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, August.

Go, A., R. Bhayani, and L. Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–12.

Godbole, N., M. Srinivasaiah, and S. Skiena. 2007. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), pages 219–222.

Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November.

Han, Bo and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368–378, Portland, Oregon, USA, June. Association for Computational Linguistics.

Hu, M. and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177.

Kim, Soo-Min and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the 20th international conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kouloumpis, E., T. Wilson, and J. Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Fifth International AAAI Conference on Weblogs and Social Media.

Liu, Fei, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1035–1044, Jeju Island, Korea, July. Association for Computational Linguistics.

Liu, X., S. Zhang, F. Wei, and M. Zhou. 2011. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon.

O'Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media, May.

Pang, Bo and Lillian Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the

Riloff, Ellen and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 105–112.

Somasundaran, Swapna, Galileo Namata, Janyce Wiebe, and Lise Getoor. 2009. Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 170–179, Stroudsburg, PA, USA. Association for Computational Linguistics.

Speriosu, Michael, Nikita Sudan, Sid Up-
42nd annual meeting of the Association adhyay, and Jason Baldridge. 2011.
for Computational Linguistics, ACL ’04, Twitter polarity classification with label
Stroudsburg, PA, USA. Association for propagation over lexical links and the fol-
Computational Linguistics. lower graph. In Proceedings of the First
Workshop on Unsupervised Learning in
Pang, Bo, Lillian Lee, and Shivakumar NLP, EMNLP ’11, pages 53–63, Strouds-
Vaithyanathan. 2002. Thumbs up?: sen- burg, PA, USA. Association for Computa-
timent classification using machine learn- tional Linguistics.
ing techniques. In Proceedings of the Turney, Peter D. 2002. Thumbs up
ACL-02 conference on Empirical methods or thumbs down?: semantic orienta-
in natural language processing - Volume tion applied to unsupervised classifica-
10, EMNLP ’02, pages 79–86, Strouds- tion of reviews. In Proceedings of the
burg, PA, USA. Association for Compu- 40th Annual Meeting on Association for
tational Linguistics.
Copyright © 2012. Universitat Jaume I. Servei de Comunicació i Publicacions. All rights reserved.

Computational Linguistics - ACL ’02,


Rao, Delip and Deepak Ravichandran. 2009. page 417, Philadelphia, Pennsylvania.
Semi-supervised polarity lexicon induc- Velikovich, Leonid, Sasha Blair-Goldensohn,
tion. In Proceedings of the 12th Kerry Hannan, and Ryan McDon-
Conference of the European Chapter ald. 2010. The viability of web-
of the Association for Computational derived polarity lexicons. In Human
Linguistics, EACL ’09, pages 675–682, Language Technologies: The 2010
Stroudsburg, PA, USA. Association for Annual Conference of the North
Computational Linguistics. American Chapter of the Association
Read, Jonathon. 2005. Using emoticons for Computational Linguistics, HLT
to reduce dependency in machine learning ’10, pages 777–785, Stroudsburg, PA,
techniques for sentiment classification. In USA. Association for Computational
Proceedings of the ACL Student Research Linguistics.
Workshop, ACLstudent ’05, pages 43–48, Wilson, Theresa, Paul Hoffmann, Swapna
Stroudsburg, PA, USA. Association for Somasundaran, Jason Kessler, Janyce
Computational Linguistics. Wiebe, Yejin Choi, Claire Cardie,
Ellen Riloff, and Siddharth Pat-
Riloff, E., J. Wiebe, and W. Phillips. wardhan. 2005. OpinionFinder.
2005. Exploiting subjectivity classifica- In Proceedings of HLT/EMNLP on
tion to improve information extraction. Interactive Demonstrations -, pages
In Proceeding of the national conference 34–35, Vancouver, British Columbia,
on Artificial Intelligence, volume 20, page Canada.
1106.
Techniques for Sentiment Analysis and Topic Detection of Spanish Tweets: Preliminary Report∗

Técnicas de análisis de sentimientos y detección de asunto de tweets en español: informe preliminar

Antonio Fernández Anta — Institute IMDEA Networks, Madrid, Spain
Philippe Morere† — ENSEIRB-MATMECA, Bordeaux, France
Luis Núñez Chiroque — Institute IMDEA Networks, Madrid, Spain
Agustín Santos — Institute IMDEA Networks, Madrid, Spain

<i>XVIII Congreso de la Asociación Española para el Procesamiento del Lenguaje Natural</i>, edited by Llavorí, Rafael Berlanga, et al., Universitat Jaume I. Servei de Comunicació i Publicacions, 2012. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bibliotecauptsp/detail.action?docID=4184256.
Resumen: El análisis de sentimientos y la detección de asunto son nuevos problemas que están en la intersección del procesamiento del lenguaje natural y la minería de datos. El primero intenta determinar si un texto es positivo, negativo o neutro, mientras que el segundo intenta identificar la temática del texto. Se está invirtiendo un esfuerzo significativo en la construcción de soluciones efectivas para estos dos problemas, principalmente para textos en inglés. Usando un corpus de tweets en español, presentamos aquí un análisis comparativo de diversas aproximaciones y técnicas de clasificación para estos problemas. Los datos de entrada son preprocesados usando técnicas y herramientas propuestas en la literatura, junto con otras específicamente propuestas aquí y que tienen en cuenta las peculiaridades de Twitter. Después, se han utilizado clasificadores populares (de hecho, se han usado todos los clasificadores de WEKA). Debido a su alto número, no todos los resultados obtenidos son presentados.
Palabras clave: Análisis de sentimientos, detección de asunto.
Abstract: Sentiment analysis and topic detection are new problems that are at the intersection of natural language processing (NLP) and data mining. Sentiment analysis attempts to determine if a text is positive, negative, or neither, while topic detection attempts to identify the subject of the text. A significant amount of effort has been invested in constructing effective solutions for these problems, mostly for English texts. Using a corpus of Spanish tweets, we present a comparative analysis of different approaches and classification techniques for these problems. The data is preprocessed using techniques and tools proposed in the literature, together with others specifically proposed here that take into account the characteristics of Twitter. Then, popular classifiers have been used (in particular, all the classifiers of WEKA have been evaluated). Due to their high number, not all the results obtained are presented here.
Keywords: Sentiment analysis, topic detection.
∗ Partially funded by the Spanish Ministerio de Economía y Competitividad.
† Work partially done while visiting Institute IMDEA Networks.

1 Introduction

With the proliferation of online reviews, ratings, recommendations, and other forms of online opinion expression, there is a growing interest in techniques for automatically extracting the information they embody. Two of the problems that have been posed to achieve this are sentiment analysis and topic detection, which are at the intersection of natural language processing (NLP) and data mining. Sentiment analysis attempts to determine whether a text is positive, negative, or neither, possibly providing degrees within each type. Topic detection, in turn, attempts to identify the main subject of a given text. Research on both problems is very active, and a number of methods and techniques have been proposed in the literature to solve them. Most of these techniques focus on English texts and study large documents. In our work, we are interested in languages other than English and in micro-texts. In particular, we are interested in sentiment and topic classification applied to Spanish Twitter micro-blogs. Spanish is increasingly present on the Internet, and Twitter has become a popular medium to publish thoughts and information, with its own characteristics. For instance, publications in Twitter take the form of tweets (i.e., Twitter messages), which are micro-texts with a maximum of 140 characters. In Spanish tweets it is common to find specific Spanish elements (SMS abbreviations, hashtags, slang). The combination of these two aspects makes this a distinctive research topic, with potentially deep industrial applications.

The motivation of our research is twofold. On the one hand, we would like to know whether the usual approaches that have proved effective with English text are also effective with Spanish tweets. On the other, we would like to identify the best (or at least a good) technique for Spanish tweets. For this second question, we would like to evaluate the techniques proposed in the literature, and possibly propose new ad hoc techniques for our specific context. In our study, we try to sketch out a comparative study of several schemes for term weighting, linguistic preprocessing (stemming and lemmatization), term definition (e.g., based on uni-grams or n-grams), the combination of several dictionaries (sentiment, SMS abbreviations, emoticons, spelling, etc.), and the use of several classification methods. When possible, we have used freely available tools, like the Waikato Environment for Knowledge Analysis (WEKA, an open-source software package that consists of a collection of machine learning algorithms for data mining) (at University of Waikato, 2012).

1.1 Related Work

As mentioned above, sentiment analysis, also known as opinion mining, is a challenging Natural Language Processing (NLP) problem. Due to its tremendous value for practical applications, it has received a lot of attention, and it is perhaps one of the most widely studied topics in the NLP field. Pang and Lee (Pang and Lee, 2008) offer a comprehensive survey of sentiment analysis and opinion mining research. Liu (Liu, 2010), in turn, reviews and discusses a wide collection of related works. Although most of the research conducted focuses on English texts, the number of papers on the treatment of other languages is increasing every day. Examples of research papers on Spanish texts are (Brooke, Tofiloski, and Taboada, 2009; Martínez-Cámara, Martín-Valdivia, and Ureña-López, 2011; Martínez Cámara et al., 2011).

Most of the algorithms for sentiment analysis and topic detection use a collection of data to train a classifier that is later used to process the real data. The (training and real) data is processed before being used for (building or applying) the classifier, in order to correct errors and extract the main features (to reduce the required processing time or memory). Many different techniques have been proposed for these phases. For instance, different classification methods have been proposed, like Naive Bayes, Maximum Entropy, Support Vector Machines (SVM), BBR, KNN, or C4.5. In fact, there is no final agreement on which of these classifiers is the best. For instance, Go et al. (Go, Bhayani, and Huang, 2009) report similar accuracy with classifiers based on Naive Bayes, Maximum Entropy, and SVM.

Regarding the preprocessing of the data (texts in our case), one of the first decisions to be made is which elements will be used as basic terms. Laboreiro et al. (Laboreiro et al., 2010) explore tweet tokenization (or symbol segmentation) as the first key task for text processing. Once single words or terms are available, typical choices are using uni-grams, bi-grams, n-grams, or parts of speech (POS). Again, there is no clear conclusion on which is the best option, since Pak and Paroubek (Pak and Paroubek, 2010) report the best performance with bi-grams, while Go (Go, Bhayani, and Huang, 2009) present better results with unigrams. The preprocessing phase may also involve word-level processing of the input texts: stemming, spelling and/or semantic analysis. Tweets are usually very short, having emoticons like :) or :-), or abbreviated (SMS) words like "Bss" for "Besos" ("kisses"). Agarwal et al. (Agarwal et al., 2011) propose the use of several dictionaries: an emoticon dictionary and an acronym dictionary. Other preprocessing tasks that have been proposed are contextual spell-checking and name normalization (Kukich, 1992).

One important question is whether the algorithms and techniques proposed for one type of data can be directly applied to tweets. This could be very convenient, since a corpus of Spanish reviews of movies (from Muchocine¹) has already been collected and studied (Cruz et al., 2008; Martínez Cámara et al., 2011). Unfortunately, Twitter data poses new and different challenges, as discussed by Agarwal et al. (Agarwal et al., 2011) when reviewing some early and recent results on sentiment analysis of Twitter data (e.g., (Go, Bhayani, and Huang, 2009; Bermingham and Smeaton, 2010; Pak and Paroubek, 2010)). Engström (Engström, 2004) has also shown that the bag-of-features approach is topic-dependent, and Read (Read, 2005) demonstrated that models are also domain-dependent.

¹http://www.muchocine.net

These papers, as expected, use a broad spectrum of tools for the extraction and classification processes. For feature extraction, FreeLing (Padró et al., 2010), a powerful open-source language processing package, has been proposed. We use it as an analyzer and for lemmatization. For classification, Justin et al. (Justin et al., 2010) report very good results using WEKA (at University of Waikato, 2012; Hall et al., 2009), which is one of the most widely used tools for the classification phase. Other authors proposed the use of additional libraries like LibSVM (Chang and Lin, 2011). In contrast, some authors (e.g., (Phuvipadawat and Murata, 2010)) propose the utilization of Lucene (Lucene, 2005) as index and text search engine.

Most of the references above have to do with sentiment analysis, since this is a very popular problem. However, the problem of topic detection is also becoming popular (Sriram et al., 2010), among other reasons, to identify trending topics (Allan, 2002; Bermingham and Smeaton, 2010; Lee et al., 2011). Due to the real-time nature of Twitter data, most works (Mathioudakis and Koudas, 2010; Sankaranarayanan et al., 2009; Vakali, Giatsoglou, and Antaris, 2012; Phuvipadawat and Murata, 2010) are interested in breaking-news detection and tracking. They propose methods for the classification of tweets in an open (dynamic) set of topics. Instead, in this work we are interested in a closed (fixed) set of topics. However, we explore all the indexing and clustering techniques proposed, since most of them could be applied to the sentiment analysis process.

1.2 Contributions

In this paper we have explored the performance of several preprocessing, feature extraction, and classification methods on a corpus of Spanish tweets, both for sentiment analysis and for topic detection. The different methods considered can be classified into almost orthogonal families, so that a different method can be selected from each family to form a different configuration. In particular, we have explored the following families of methods.

Term definition and counting In this family it is decided what constitutes a basic term to be considered by the classification algorithm. The different alternatives are using single words (uni-grams) or groups of words (bi-grams, tri-grams, n-grams) as basic terms. Of course, the aggregation of all these alternatives is possible, but it is typically never used because it results in a huge number of different terms, which makes the processing hard or even impossible. Each of the different terms that appears in the input data is called an attribute by classification algorithms. Once the term formation is defined, the list of attributes in the input data is found, and the occurrences of each attribute are counted.

Stemming and lemmatization One of the main differences between Spanish and English is that English is a weakly inflected language, in contrast to Spanish, a highly inflected one. A part of our work is the stemming and lemmatization process. In order to reduce the feature dimension (number of attributes), each word can be reduced to either its lemma (canonical form) (e.g., "cantabamos" is reduced to its infinitive "cantar") or its stem (e.g., "cantabamos" is reduced to "cant"). One interesting question is to compare how well the usual stemming and lemmatization processes perform with Spanish words.

Word processing and correction Several dictionaries are available to correct the
words and thus reduce the noise caused by mistakes. A spell checker can be used to correct typos. Other grammar dictionaries can replace emoticons, SMS abbreviations, and slang terms by their meaning in correct Spanish. In addition, any special-term dictionary can be applied to get a context in a tweet (i.e., an affective word list can give us the tone of a text, which is relevant for sentiment analysis). Finally, it is possible to use a morphological analyzer to determine the type of each word. Thus, a word-type filter can be applied to the tweets.

Valence shifters By default, once the decision of what constitutes a basic term is made, each term has the same weight in a tweet. A clear improvement to this term-counting method is the processing of valence shifters and negative words. Examples of negative words are "no", "ni", or "sin" ("not", "neither", "without"), while examples of valence shifters are "muy" or "poco" ("very", "little"). These words are useful for sentiment classification since they change and/or revert the strength of a neighboring term.

Tweet semantics The above approaches can be improved by processing specific tweet artifacts such as author tags, hashtags, and URLs (links) provided in the text. The author tags act like a history of the tweets of a specific person. Because this person will most likely post tweets about the same topic, this might be relevant for topic detection. Additionally, the hashtags are a great indicator of the topic of a tweet, whereas retrieving keywords from the web page linked within a tweet allows overcoming the limit of 140 characters and thus improves the effectiveness of the estimation. Another way to overcome this limit is to look up the keywords of a tweet in a search engine to retrieve other words of the same context.

Classification methods In addition to these variants, we have explored the full spectrum of classification methods provided by WEKA.

We can construct a large set of (more than 100 thousand) different methods by combining features from all the described families. As this number of combinations is too high, we had to reduce it manually, choosing a subset of all the methods that is manageable and that we think is the most relevant. We hope the reader finds the subset we present satisfactory.

The rest of the paper is structured as follows. In Section 2 we describe in detail the different techniques that we have implemented or used. In Section 3 we describe our evaluation scenario and the results we have obtained. Finally, in Section 4 we present some conclusions and open problems.

2 Methodology

In this section we give the details of how the different methods considered have been implemented in our system. A summary of these parameters is presented in Table 1.

2.1 Term Definition and Processing

n-grams As we mentioned, a term is the basic element that will be considered by the classifiers. These terms will be sets of n words (n-grams), with single words (unigrams) as a special case. The value of n is defined in our algorithm with the parameter n-gram (see Table 1). The reason for considering the use of n-grams with n > 1 (instead of always restricting the terms to individual words) is that they are particularly effective at recognizing common expressions of a language. Also, by keeping a word in its context, it is possible to differentiate its different meanings. For example, in the sentences "estoy cerca" ("I am close") and "cierro la cerca" ("I close the fence"), using 2-grams allows detecting the two different meanings of the word "cerca". As the words stay in their context, an n-gram carries more information than the sum of the information of its n words: it also carries the context information. (Using uni-grams, every single word is a term, and any context information is lost.)

When using n-grams, n is a parameter that highly influences performance. A high value of n allows catching more context information, since the combinations of words are less probable. On the other hand, rare combinations mean fewer occurrences in the data set, which means that a bigger data set is needed to obtain good results. Also, the larger n is, the longer the attribute list. In addition, since tweets are short, choosing a large n would result in n-grams of almost the size of a tweet, which would make little sense. We found that, in practice, having n larger than 3 did not improve the results.
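As a concrete illustration of the term-definition step just described, the following Python sketch builds the attribute list from n-grams (plus, optionally, unigrams) with a configurable minimum-frequency threshold. It is our own minimal re-implementation for illustration, not the authors' WEKA-based code, and the function and parameter names are ours.

```python
from collections import Counter

def extract_terms(tweets, n=2, only_ngrams=False, min_count=5):
    """Build the attribute list: n-grams (optionally plus unigrams)
    that occur at least min_count times in the data set."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.split()
        # n-grams keep each word in its context
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
        # unless only_ngrams is set, single words are terms too
        if not only_ngrams:
            counts.update(words)
    return {term for term, c in counts.items() if c >= min_count}
```

With n = 2, the bigrams "estoy cerca" and "la cerca" become distinct attributes, so the two readings of "cerca" in the example above are kept apart even though the unigram "cerca" is shared.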
Parameter/flag Description Process
n-gram Number of words that form a term Both
Only n-gram Whether words are also terms Both
Use input data Whether the input data is used to define attributes Both
Lemma/Stem Which technique is used to extract the root of words Both
Correct words Whether a dictionary is used to correct misspellings Both
SMS Whether an emoticons and SMS dictionary is used Both
Word types Types of words to be processed Both
Affective dictionary Whether an affective dictionary is used to define attributes Sentiment
Negation Whether negations are considered Sentiment
Weight Whether valence shifters are considered Sentiment
Hashtags Whether hashtags are considered as attributes Topic
Author tags Whether author tags are considered as attributes Topic
Links Whether data from linked web pages is used Topic
Search engine Whether a search engine is used Topic

Table 1: Parameters and flags that define a configuration of our algorithm.
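For concreteness, the parameters of Table 1 can be pictured as a single configuration record, one choice per family. The following Python dataclass is our own illustrative rendering; the field names, types, and defaults are ours, not taken from the authors' system.

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    """One configuration = one choice per family (cf. Table 1)."""
    n_gram: int = 1            # number of words that form a term (both tasks)
    only_ngram: bool = False   # whether single words are also terms
    use_input_data: bool = True
    lemma: bool = True         # True = lemmatize, False = stem
    correct_words: bool = False
    sms: bool = False          # expand SMS abbreviations and emoticons
    word_types: tuple = ()     # e.g. ("noun", "verb"); empty = keep all
    # sentiment analysis only
    affective_dictionary: bool = False
    negation: bool = False
    weight: bool = False
    # topic detection only
    hashtags: bool = False
    author_tags: bool = False
    links: bool = False
    search_engine: bool = False
```

Enumerating combinations of these fields is what produces the very large method space mentioned in Section 1.2.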

Consequently, we limit n to be no larger than 3.

Of course, it is possible to combine n-grams with several values of n. We only consider the possibility of combining two such values, and one of them has to be n = 1. This is controlled with the flag Only n-gram (see Table 1), which says whether only n-grams (with n > 1) are considered as terms, or individual words (unigrams) are considered as well. In the latter case, the lists of attributes of both cases are merged. The drawback of merging is the high number of entries in the final attribute list. Hence, when doing this, a threshold is used to remove all the attributes that appear too few times in the data set, as they are considered noise. We require that an attribute appear at least 5 times in the data set to be considered. Also, a second threshold is used to remove ambiguous attributes. For example, the entry "ha sido" ("has been") can be found in tweets independently of their topic or sentiment and can be safely removed. This threshold has been set to 85%, which means that more than 85% of the occurrences of an entry have to be for a specific topic or sentiment for the entry to be kept.

Processing Terms The processing of terms involves first building the list of attributes, which is the list of different terms that appear in the data set of interest. In principle, the data set used to identify attributes is formed at least by all the tweets that are provided as input to the algorithm, but there are cases in which we do not use them. For instance, when using an affective dictionary (see below) we may not use the input data. This is controlled with a parameter that we denote Use input data (see Table 1). Moreover, even if the input data is processed, we may filter it and only keep some of it; for instance, we may decide to use only nouns. This can be controlled with the parameter Word types (see Table 1), which is described below. In summary, the list of attributes is built from the input data (if so decided), preprocessed as determined by the rest of the parameters (e.g., filtered by Word types), and potentially from additional data (like the affective dictionary).

Once the list of attributes is constructed, a vector is created for each tweet in the input data. This vector has one position for each attribute, so that the value at that position is the number of occurrences of the attribute in the tweet. This value can be modified in some tweets if the occurrence of an attribute is near a valence shifter (see below). Once this process is completed, the list of attributes and the list of vectors obtained from the tweets are the data passed to the classifier.

2.2 Stemming and Lemmatization

When creating the list of attributes from a collection of terms, different forms of the same word will be found (e.g., singular/plural, masculine/feminine). Including each form as a different attribute would make the list unnecessarily long. Hence, typically only the root of each word is used in the attribute list. The root can take the form of the lemma or the stem of the word. The process of extracting it is called lemmatization or stemming, respectively. Lemmatization preserves the meaning and type of a word (e.g., the words "buenas" and "buenos" become "bueno"). We have used the FreeLing software to perform this processing, since it can provide the lemma of those words that are in its dictionary. After lemmatization, there are no plurals or other inflected forms, but two words with the same root but different types may still appear. Stemming, in turn, reduces the list of attributes even more. A stem is a word whose affixes have been removed. Stemming might lose the meaning and any morphological information that the original word had (e.g., the words "aparca", a verb, and "aparcamiento", a noun, both become "aparc"). The Snowball (Sno, 2012) stemmer has been used in our experiments.

We have decided to always use one of the two processes. Which one is used in a particular configuration is controlled with the parameter Lemma/Stem (see Table 1).

2.3 Word Processing and Correction

As mentioned above, one of the possible preprocessing steps of the data before extracting attributes and vectors is to correct spelling errors. Whether or not this step is taken is controlled with the flag Correct words (see Table 1). If correction is done, the algorithm uses the Hunspell dictionary (Hun, 2012) (an open-source spell-checker) to perform it.

Another optional preprocessing step (controlled with the flag SMS) expands the emoticons, shorthand notations, and slang commonly used in SMS messages, which are not understandable by the Hunspell dictionary. The use of these abbreviations is common in tweets, given the limitation to 140 characters. An SMS dictionary (dic, 2012) is used to do the preprocessing. It transforms the SMS notations into words understandable by the main dictionary. Also, the emoticons are replaced by words that describe their meaning. For example, :-) is replaced by feliz ("happy") and :-( by triste ("sad"). Emoticons tend to have strong emotional semantics. Hence, this process helps estimate the sentiment of the tweets that contain emoticons.

We have observed that the information of a sentence is mainly located in a few keywords. These keywords have a different type according to the information we are interested in. For topic estimation, the keywords are mainly nouns and verbs, whereas for sentiment analysis they are adjectives and verbs. For example, in the sentence La pelicula es buena ("The movie is good"), the only word carrying the topic information is the noun pelicula, which is very specific to the cinema topic. Besides, the word that best reflects the sentiment of the sentence is the adjective buena, which is positive. Also, in the sentence El equipo ganó el partido ("The team won the match"), the verb ganó carries information for both topic and sentiment analysis: the verb ganar is used very often in the soccer and sport topics and has a positive sentiment. We allow filtering the words of the input data by their type with the parameter Word types (see Table 1). The filtering is done using the FreeLing software, which is used to retrieve the type of each word.

When performing sentiment analysis, we have found it useful to have an affective dictionary, whose use is controlled with the flag Affective dictionary (see Table 1). We have used an affective dictionary developed by Martín García (García, 2009). This dictionary consists of a list of words that have a positive or negative meaning, expanded with their polarity "P" or "N" and their strength "+" or "-". For example, the words bueno ("good") and malo ("bad") are respectively positive and negative with no strength, whereas the words mejor ("best") and peor ("worse") are respectively positive and negative with a positive strength. As a first approach, we have not intensively used the polarity and the strength of the affective words in the dictionary. Its use only forces the words it contains to be added as attributes. This has the advantage of drastically reducing the size of the attribute list, especially if the input data is filtered. Observe that the use of this dictionary for sentiment analysis is very pertinent, since the affective words carry the tweet polarity information. In a more advanced future approach, the characteristics of the words could be used to compute weights. Since not all the words in our affective dictionary may appear in the corpus we have used, we have built artificial vectors for the learning machine. There is one artificial vector per sentiment analysis category (positive+, positive, negative, negative+, none), which has been built counting one occurrence of those words whose polarity and strength match the appropriate category.
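The artificial-vector construction just described can be sketched as follows. The dictionary encoding (a map from word to a (polarity, strength) pair, with None for "no strength"), the mapping of the "+" categories to positive strength, and all names are our own assumptions, since the paper does not fix a data format.

```python
def artificial_vectors(affective_dict, attributes):
    """Build one artificial training vector per sentiment category,
    counting one occurrence of each dictionary word whose polarity
    and strength match the category."""
    # category -> (polarity, strength); encoding assumed, not from the paper
    categories = {
        "positive+": ("P", "+"),
        "positive":  ("P", None),
        "negative":  ("N", None),
        "negative+": ("N", "+"),
    }
    index = {attr: i for i, attr in enumerate(attributes)}
    vectors = {"none": [0] * len(attributes)}  # matches no affective word
    for cat, (pol, strength) in categories.items():
        vec = [0] * len(attributes)
        for word, (p, s) in affective_dict.items():
            if word in index and p == pol and s == strength:
                vec[index[word]] = 1
        vectors[cat] = vec
    return vectors
```

Each artificial vector thus behaves like a synthetic tweet containing exactly the dictionary words of its category, so the classifier sees every affective word at least once even when the corpus does not cover it.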
2.4 Valence Shifters
There are two different aspects of valence shifting that are used in our methods. First, we may take into account negations, which can invert the sentiment of positive and negative terms in a tweet. Second, we may take weighted words, which are intensifiers or weakeners, into account. Whether these cases are processed is controlled by the flags Negation and Weight (see Table 1).
Negations are words that reverse the sentiment of other words. For example, in the sentence La película no es buena ("The movie is not good"), the word buena is positive, whereas it should be read as negative because of the negation no. We process negations as follows: whenever a negation word is found, the sign of the 3 terms that follow it is reversed. This allows us to differentiate a positive buena from a negative buena. The area of effect of the negation is restricted in order to avoid false negative words in more sophisticated sentences.
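The negation step described above can be sketched as follows; the list of negation words is illustrative, and term signs are represented as +1/-1.

```python
# Sketch of the described negation handling: when a negation word is
# found, the sign of the 3 following terms is inverted.
NEGATIONS = {"no", "ni", "nunca"}   # illustrative negation words

def apply_negations(tokens, window=3):
    """Return (token, sign) pairs; sign is -1 inside a negation window."""
    signed, countdown = [], 0
    for tok in tokens:
        if tok in NEGATIONS:
            countdown = window          # open a new 3-term window
            continue
        sign = -1 if countdown > 0 else 1
        countdown = max(0, countdown - 1)
        signed.append((tok, sign))
    return signed

print(apply_negations(["la", "pelicula", "no", "es", "buena"]))
```

With this input, buena ends up with sign -1, which is the behaviour the example in the text calls for.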
Other valence shifters are words that change the degree of the expressed sentiment. Examples are muy ("very"), which increases the degree, and poco ("little"), which decreases it. These words were included in the dictionary developed by Martín García (García, 2009) as words with positive or negative strength but no polarity. If the flag Weight is set, our algorithm finds these words in the tweets and changes the weight of the 3 terms following them. If the valence shifter has positive strength, the weight is multiplied by 3; if it is negative, by 0.5.
Copyright © 2012. Universitat Jaume I. Servei de Comunicació i Publicacions. All rights reserved.
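The weighting step mirrors the negation window above and can be sketched as follows; the shifter list and its multipliers follow the factors stated in the text, but the word list itself is illustrative.

```python
# Sketch of the weighting step: a valence shifter with positive strength
# multiplies the weight of the 3 following terms by 3; one with negative
# strength multiplies it by 0.5. The shifter list is illustrative.
SHIFTERS = {"muy": 3.0, "poco": 0.5}    # word -> multiplier

def apply_weights(tokens, window=3):
    """Return (token, weight) pairs after applying shifter windows."""
    weighted, countdown, factor = [], 0, 1.0
    for tok in tokens:
        if tok in SHIFTERS:
            factor, countdown = SHIFTERS[tok], window
            continue
        weight = factor if countdown > 0 else 1.0
        countdown = max(0, countdown - 1)
        weighted.append((tok, weight))
    return weighted

print(apply_weights(["la", "pelicula", "es", "muy", "buena"]))
```

A real implementation would also need to handle overlapping shifters and their interaction with negation; this sketch ignores those cases.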
2.5 Twitter Artifacts
It has been noticed that, with the previous methods, not all the potential data contained in the tweets is used. There are several frequent elements in tweets that carry a significant amount of information, among others the following.
• Hashtags (any word starting with "#"). They are used to identify messages about the same topic. Hashtags are very helpful for topic estimation, since some of them may carry more topic information than the rest of the tweet. For example, if a tweet contains #BAR, which is the hashtag of the Barcelona soccer team, it can almost doubtlessly be classified as a soccer tweet.
• References (a "@" followed by the username of the referenced user). They are used to reference other Twitter users; any user can be referenced. For example, @username means the tweet is answering a tweet of username, or referring to him/her. References are interesting because some users appear more frequently in certain topics and are more likely to tweet about them. A similar behaviour can be found for sentiment.
• Links (a URL). Because of the character limitation of tweets, users often include URLs of webpages where more details about the message can be found. This may help to obtain more context, especially for topic detection.
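The extraction of these three kinds of artifacts can be sketched with simple patterns; the regular expressions below are our own simplification (real tweet tokenization needs more care with punctuation and Unicode).

```python
import re

# Sketch: extract the three kinds of Twitter artifacts described above.
HASHTAG = re.compile(r"#\w+")        # words starting with "#"
MENTION = re.compile(r"@\w+")        # "@" followed by a username
URL = re.compile(r"https?://\S+")    # simplistic URL pattern

def extract_artifacts(tweet):
    """Return the hashtags, references and links found in a tweet."""
    return {
        "hashtags": HASHTAG.findall(tweet),
        "references": MENTION.findall(tweet),
        "links": URL.findall(tweet),
    }

tweet = "@user gran partido #futbol http://example.com/match"
print(extract_artifacts(tweet))
```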
In our algorithms, we have the possibility of including hashtags and references as attributes. This is controlled by the flags Hashtags and Author tags (see Table 1), respectively. We believe that these options are just a complement to the previous methods and cannot be used alone, because we have found that the number of hashtags and references in the tweets is too small.
We also provide the possibility of adding to the terms of a tweet the terms obtained from the web pages linked from the tweet. This is controlled by the flag Links. A first approach could have been to retrieve the whole source code of the linked page, extract all the terms it contains, and keep the ones that match the attribute list. Unfortunately, there are too many terms, and the menus of the pages introduce noise which degrades the results. The approach we have chosen is to keep only the keywords of the pages: we retrieve only the text within the HTML tags h1, h2, h3 and title. The results with this second method are much better, since the keywords are directly related to the topic.
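The keyword-extraction step (keeping only text inside h1, h2, h3 and title) can be sketched with the standard-library HTML parser; the sample page below is illustrative.

```python
from html.parser import HTMLParser

# Sketch: keep only the text inside the h1, h2, h3 and title tags of a
# linked page, ignoring the rest of the document (body text, menus...).
class KeywordExtractor(HTMLParser):
    KEEP = {"h1", "h2", "h3", "title"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a kept tag
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.keywords.append(data.strip())

page = ("<html><head><title>Liga BBVA</title></head>"
        "<body><h1>Barcelona wins</h1><p>Long article body...</p></body></html>")
parser = KeywordExtractor()
parser.feed(page)
print(parser.keywords)  # only title and heading text survive
```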
Because of the short length of the tweets, our estimations often suffer from a lack of words. We found a solution to this problem in several papers (Banerjee, Ramanathan, and Gupta, 2007; Gabrilovich and Markovitch, 2005; Rahimtoroghi and Shakery, 2011) that use web sources (like Wikipedia or the Open Directory) to complete tweets. The web is a mine of information, and search engines can be used to retrieve it. We have used this technique to obtain many keywords and a context from just a few words taken from the tweets. For implementation reasons, Bing (Bin, 2012) was chosen for the process. The title and description of the first 10 results of the search are kept and processed in the same way as the words of the tweet.
We found out that we obtain better results by searching in Bing with only the nouns contained in the tweet; therefore, this is the option we chose. The activation of this option is controlled with the flag Search engine.
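The expansion step can be sketched as follows. The `web_search` stub is hypothetical and stands in for the real search-engine call (the system described here used Bing); it is assumed to return (title, description) pairs for the top results.

```python
# Sketch of the context-expansion step. `web_search` is a hypothetical
# stub standing in for a real search-engine API call; it returns
# (title, description) pairs for the top results of the query.
def web_search(query, top=10):
    # Placeholder result set; a real implementation would call an API.
    return [("Liga BBVA resultados", "Resultados de la jornada")][:top]

def expand_tweet(nouns, top=10):
    """Query the engine with the tweet's nouns; collect the result text
    so it can be processed like the words of the tweet itself."""
    extra = []
    for title, description in web_search(" ".join(nouns), top):
        extra.extend(title.split())
        extra.extend(description.split())
    return extra

print(expand_tweet(["liga", "resultados"]))
```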

2.6 Classification Methods
The Waikato Environment for Knowledge Analysis (WEKA) (at University of Waikato, 2012) is a collection of machine learning algorithms that can be used for classification and clustering. The workbench includes algorithms for classification, regression, clustering, attribute selection and association rule mining. Almost all popular classification algorithms are included. WEKA includes several Bayesian methods, decision tree learners, random trees and forests, etc. It also provides several separating hyperplane approaches and lazy learning methods.
Since we use WEKA as our learning machine, it is worth knowing that each element in the learning machine's data set will be called an attribute, and each element of the data itself will be called a vector. (These correspond to the attributes and vectors we have been handling above.) WEKA uses a specific file format, ARFF (Attribute-Relation File Format), to describe the attributes and the vectors it uses to learn. This file first consists of a list of all the attributes, whose order is directly related to the order of the vectors' values. The second part of the file consists of a list of vectors, each one representing a tweet. Thus, each tweet adds a vector (line) to the file, whereas an attribute adds a line in the first part of the file and a value in each vector.
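The shape of such a file can be sketched as follows; the attribute and class names are illustrative, and real ARFF files support more attribute types than the numeric ones shown here.

```python
# Sketch: the shape of the ARFF file handed to WEKA. The attribute
# order in the header fixes the order of values in each data row
# (one row = one tweet). Names below are illustrative.
def to_arff(relation, attributes, classes, rows):
    lines = ["@relation " + relation]
    for attr in attributes:
        lines.append("@attribute %s numeric" % attr)
    lines.append("@attribute class {%s}" % ",".join(classes))
    lines.append("@data")
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines)

arff = to_arff("tweets", ["bueno", "malo"], ["P", "N"],
               [([1, 0], "P"), ([0, 2], "N")])
print(arff)
```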
The different parameters described in Table 1 form a configuration that tells our algorithm which attributes to choose and how to create the vectors. The output of this algorithm is an ARFF file for the configuration and the input data. In general, some of the parameters intend to reduce the size of this file, mainly for two reasons. First, it has been noticed that WEKA is more efficient when there is a smaller number of attributes. Second, a smaller file avoids lack-of-memory issues: a great amount of memory, proportional to the file size, is needed while WEKA builds a model.
Once the ARFF file is available, we are able to run all the classification algorithms that WEKA provides. However, due to time limits we will concentrate below on only a few.

3 Experimental Results
3.1 Data Sets
We have used a corpus of tweets provided for the TASS workshop at the SEPLN 2012 conference (TAS, 2012) as input data set. This set contains about 70,000 tweets provided as tuples (ID, date, userID). Additionally, over 7,000 of the tweets were given as a small training set with both a topic classification (chosen among politics, economy, technology, literature, music, cinema, entertainment, sports, soccer or others) and a sentiment (or polarity) classification (chosen among strong positive, positive, neutral, negative, strong negative or none). The data set was shuffled so that topics and sentiments are randomly distributed. Due to the large time taken by the experiments with the large data set, most of the experiments presented here have used the small data set, with 5,000 tweets for training and 2,000 for evaluation.
3.2 Configurations for the Submitted Results
We tested multiple configurations with all the WEKA classifiers to choose the one with the highest accuracy to be submitted to the TASS challenge. Different configurations gave the best results for sentiment analysis and topic detection. For instance, for topic detection the submitted results were obtained with a Complement Naive Bayes classifier on attributes and vectors obtained from the input data by applying neither lemmatization nor stemming, filtering the words and keeping only nouns, and using hashtags and author tags. The accuracy reported by the challenge organizers on the large data set is 45.24%.
Regarding sentiment (polarity), the submitted results were obtained by first classifying the tweets into 5 subsets using the topic detection algorithm, and then running the sentiment analysis algorithm within each subset. The latter used Naive Bayes Multinomial on data preprocessed by using the affective dictionary, filtering words and keeping only adjectives and verbs (adjectives were stemmed, and verbs were lemmatized), using the SMS dictionary, and processing negations at the sentence level. The accuracy reported on the large data set was 36.04%.
Since the mentioned results were submitted, we have worked on making the algorithm more flexible, so that it is simpler to activate and deactivate certain processes. This has led to a slightly different behaviour from the submitted version, but we believe it has resulted in an improvement in accuracy.

3.3 Process to Obtain the New Experimental Results
As mentioned, the algorithm used for obtaining the new experimental results is more flexible and can be configured with the parameters defined in Table 1. In addition, all classification methods of WEKA can be used. Unfortunately, it is infeasible to execute all possible configurations with all possible classification methods. Hence, we have made some decisions to limit the number of experiments.
First, we have chosen only five classification algorithms from those provided by WEKA. In particular, we have chosen the methods Ibk, Complement Naive Bayes, Naive Bayes Multinomial, Random Committee, and SMO. This set tries to cover the most popular classification techniques. Several configurations of the parameters from Table 1 will be evaluated with these 5 methods.
Second, we have chosen for each of the two problems (topic and sentiment) a basic configuration. In each case, the basic configuration is as close as possible to the configuration used to obtain the submitted results. (Since the algorithm has been modified to add flexibility, the exact submitted configuration could not be used.) The reason for choosing these as basic configurations is that they were found to be the most accurate among those explored before submission. Then, starting from this basic configuration, a sequence of derived configurations is tested. In each derived configuration, one of the parameters of the basic configuration was changed, in order to explore the effect of that parameter on the performance. Finally, for each classification method a new configuration is created and tested with the parameter settings that maximized the accuracy.
The accuracy values computed in each of the configurations with the five methods on the small data set are presented in Figures 1 and 2. In both figures, Configuration 1 is the basic configuration. The derived configurations are numbered 2 to 9. (Observe that each accuracy value that improves over the accuracy of the basic configuration is shown in boldface.) Finally, the last 5 configurations of each figure correspond to the parameter settings that gave the highest accuracy in the prior configurations for a method (in the order Ibk, Complement Naive Bayes, Naive Bayes Multinomial, Random Committee, and SMO).

3.4 Topic Estimation Results
As mentioned, Figure 1 presents the accuracy results for topic detection on the small data set, under the basic configuration (Configuration 1), configurations derived from this one by toggling every parameter one by one (Configurations 2 to 9), and the seemingly best parameter settings for each classification method (Configurations 10 to 14). Observe that there is no derived configuration with the search engine flag set. This is because the ARFF file generated in that configuration after searching the web as described above (even for the small data set) was extremely large, and the experiment could not be completed.
The first fact to be observed in Figure 1 is that Configuration 1, which is supposed to be similar to the one used for the submitted results, seems to have a better accuracy with some methods (more than 56% versus 45.24%). However, it must be noted that this accuracy has been computed with the small data set (while the value of 45.24% was obtained with the large one). A second observation is that, in the derived configurations, there is no parameter whose change of setting drastically improves the accuracy. This also applies to the rightmost configurations, which combine the best collection of parameter settings.
Finally, it can be observed that the largest accuracy is obtained by Configuration 2 with Complement Naive Bayes. This configuration is obtained from the basic one by simply removing the word filter that allows only