
Language classification by identification of characterizing features
Alessia Trucchia and Laura Caramanna

Consorzio interuniversitario CASPUR

Abstract. Identifying the elements that characterize a written language is a complex process, which requires an appropriate representation and analysis of the elements that make up words and of their succession in sentences. The analysis strategy, based on the representation of texts through N-grams, aims first to identify groups of similar languages and then to extract, for each of them, the elements that most strongly characterize it.

Keywords: N-Gram, Text Categorization, Correspondence Analysis, Oblique Principal Component Analysis.

1.1 Introduction

Text analysis is important for classifying electronic documents in order to recognize the language they are written in. The amount of data needed to represent documents can be a problem. Several authors propose the N-gram approach to solve this task (see, for example, Cavnar and Trenkle, 1994). An N-gram is a sequence of N adjacent letters, and it is well described by Beesley (1998). In our work we use the Tri-gram representation of documents and calculate the frequency of each sequence of three letters. The idea is that some Tri-grams are peculiar to a language: for example, a word ending in "di" is more likely to be an Italian word than a French word, and a word ending in "es" is more likely to be French. The N-gram approach is a very robust method for language analysis: it is not affected by the presence of different kinds of textual errors, since errors in texts tend to affect only a limited number of N-grams. For this reason N-gram based methods are extensively used in text analysis. In our application we also consider the beginning and the ending of a word by appending blanks to the string (we use the underscore character to represent blanks). Furthermore, in order to analyze different character encodings, we have transformed the letters into their numerical codes, derived from the UTF-8 standard.

We have considered the translation of the Universal Declaration of Human Rights into the main West European languages. We have also considered several local dialects and some East European languages.
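The Tri-gram representation with underscore padding can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the function and variable names are our own):

```python
from collections import Counter

def trigrams(text: str) -> Counter:
    """Count the Tri-grams in a text, padding each word with underscores
    so that word beginnings and endings produce their own Tri-grams."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"          # '_' stands for the blank, as in the paper
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts

# The Tri-gram "di_" marks a word ending in "di", which the paper
# notes is typical of Italian rather than French.
counts = trigrams("citta di mare")
```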

1.2 Preliminary classification of languages

Let X be an n × m matrix containing the n observed frequencies of Tri-grams for all the m languages (in our case we have 15378 Tri-grams and 39 languages). It is possible to classify the languages using an Oblique Principal Component Analysis (OPCA), which divides the set of numeric variables X_j, j = 1, ..., m, into disjoint hierarchical clusters. Clusters are formed so as to maximise the variance explained by each cluster. In detail, the clustering procedure uses the following steps:
1. Perform a Principal Component Analysis on each cluster, retaining the first two components;
2. Choose the cluster to split by selecting the one with the maximum second eigenvalue;
3. Split the chosen cluster in two by performing an orthoblique rotation and assigning each variable to the rotated component with which it has the highest squared correlation.
Steps 1, 2 and 3 are repeated until there are no more clusters to split, that is, until the second eigenvalue of each cluster is less than or equal to one. Note that at the beginning of the procedure there is only one cluster containing all the variables, so step 2 is not required. The resulting clusters are not constrained to be orthogonal; this makes it possible to find the factors that best represent the language clusters and improves the interpretability of the resulting structure. Associated with each cluster is the first principal component of its variables, that is, the linear combination of the variables that explains as much variance as possible. Figure 1.1 shows the result of the clustering procedure: the algorithm stops after 7 iterations, forming 8 clusters.

Fig. 1.1: Language clustering procedure based on the oblique principal component cluster analysis.
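The divisive procedure described above can be sketched as follows. This is a simplified illustration in Python with NumPy, not the authors' code: it omits the orthoblique rotation of step 3 and assigns variables directly to the unrotated components, so on data whose two leading eigenvalues are nearly equal the split may differ from an actual OPCA.

```python
import numpy as np

def second_eigenvalue(Z, cluster):
    """Second eigenvalue of the correlation matrix of the variables in `cluster`."""
    if len(cluster) < 2:
        return 0.0
    R = np.corrcoef(Z[:, cluster], rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[-2]

def split_clusters(X, threshold=1.0):
    """Divisive variable clustering: repeatedly split the cluster with the
    largest second eigenvalue until every second eigenvalue is <= threshold."""
    Z = (X - X.mean(0)) / X.std(0)              # standardise the variables
    clusters = [list(range(X.shape[1]))]        # step 2 skipped: one initial cluster
    while True:
        eigs = [second_eigenvalue(Z, c) for c in clusters]
        k = int(np.argmax(eigs))
        if eigs[k] <= threshold:                # stopping rule of the paper
            return clusters
        c = clusters.pop(k)
        R = np.corrcoef(Z[:, c], rowvar=False)
        _, V = np.linalg.eigh(R)
        scores = Z[:, c] @ V[:, -2:]            # scores on the two leading PCs
        # squared correlation of each variable with each component
        corr2 = np.array([[np.corrcoef(Z[:, v], scores[:, j])[0, 1] ** 2
                           for j in range(2)] for v in c])
        halves = [[v for v, s in zip(c, corr2.argmax(1)) if s == i] for i in (0, 1)]
        if not halves[0] or not halves[1]:      # degenerate split: leave c whole
            return clusters + [c]
        clusters += halves
```

On two blocks of strongly intra-correlated variables, the procedure recovers the blocks and then stops, since each block's second eigenvalue falls below one.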

The language clusters identified, together with the variance explained by the first principal component of each, are shown in Table 1.1. The overall explained variance is 57%.

Cluster 1 (56%): Esperanto, Occitan Auvergnat, Catalan, French, Galician, Portuguese, Occitan Languedocien, Romanian, Spanish, Sardinian
Cluster 2 (85%): Slovenian, Bosnian, Croatian, Serbian
Cluster 3 (68%): Danish, German, Norwegian, Swedish
Cluster 4 (50%): Corsican, Sammarinese, Friulian, Picard, Italian, Maltese
Cluster 5 (41%): Breton, Basque, Dutch, Finnish, Hungarian, Luxembourgish
Cluster 6 (61%): Irish Gaelic, Scottish Gaelic, Welsh
Cluster 7 (65%): Czech, Polish, Slovak
Cluster 8 (48%): English, Lithuanian, Latin

Table 1.1: Languages associated with each cluster and proportion of variance explained by its first principal component (in parentheses).

It is interesting to observe that Sardinian belongs to the cluster including Spanish, Catalan, etc.; it is known that Sardinian has been strongly influenced by Spanish and Catalan. Furthermore, Corsican and Italian are classified in the same cluster: Corsican, originally considered an Italian dialect, was subsequently recognised as a language by the French government. Correlations among the clusters express relations between language groups. The analysis of the inter-cluster correlations shows that cluster 1 and cluster 4 are the most strongly related (with a correlation of 0.63). Although Italian and Sardinian are not in the same cluster, they belong to the two most strongly correlated clusters. Analogously, Corsican is in the cluster most closely related to French. Table 1.2 reports the inter-cluster correlation matrix in detail.
          Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7 Cluster8
Cluster1  1.00     0.26     0.41     0.63     0.47     0.27     0.26     0.51
Cluster2  0.26     1.00     0.14     0.31     0.19     0.13     0.53     0.20
Cluster3  0.41     0.14     1.00     0.32     0.58     0.19     0.12     0.35
Cluster4  0.63     0.31     0.32     1.00     0.35     0.27     0.25     0.45
Cluster5  0.47     0.19     0.58     0.35     1.00     0.31     0.19     0.35
Cluster6  0.27     0.13     0.19     0.27     0.31     1.00     0.19     0.27
Cluster7  0.26     0.53     0.12     0.25     0.19     0.19     1.00     0.19
Cluster8  0.51     0.20     0.35     0.45     0.35     0.27     0.19     1.00

Table 1.2: Inter-cluster correlation matrix.
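As a sketch of where the entries of Table 1.2 come from, the correlations between cluster components could be computed as below. This is an illustration under a simplifying assumption, not the authors' code: we take the unrotated first principal component of each cluster as its component.

```python
import numpy as np

def cluster_component_correlations(X, clusters):
    """Correlation matrix between the first principal component scores of
    each variable cluster (the kind of statistic reported in Table 1.2)."""
    Z = (X - X.mean(0)) / X.std(0)
    scores = []
    for c in clusters:
        if len(c) == 1:
            scores.append(Z[:, c[0]])          # a singleton is its own component
            continue
        R = np.corrcoef(Z[:, c], rowvar=False)
        _, V = np.linalg.eigh(R)
        scores.append(Z[:, c] @ V[:, -1])      # scores on the first PC
    return np.corrcoef(np.array(scores))
```

Note that the sign of a principal component is arbitrary, so in this sketch only the magnitudes of the resulting correlations are meaningful.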

1.3 A methodology to identify language characteristics

The matrix X contains the frequencies of each Tri-gram in every language. Given this, the relations between row-profiles (Tri-grams) and column-profiles (languages) can be investigated using Correspondence Analysis. Therefore, to associate each Tri-gram with a single language, we have first applied Correspondence Analysis and then considered the projections of the row-profiles and column-profiles into the complete factorial space (that is, a 38-dimensional space); to measure the proximity between row and column projections we have used the cosine of the angle between them. In particular it is possible to define:

cos(r_i, c_j) = <r_i, c_j> / (||r_i|| ||c_j||)    (1.1)

where r_i and c_j are, respectively, the projections of the i-th row-profile and the j-th column-profile (i = 1, ..., n; j = 1, ..., m) in the factorial space, and cos(r_i, c_j) is the cosine of the angle between them. For the i-th row-profile we compute the value of (1.1) for each column-profile and associate the row-profile with the column-profile that yields the maximum cosine. Repeating this operation for every Tri-gram associates each of them with one language only, which is useful for identifying the Tri-grams that distinguish one language from another. Moreover, the cosine values allow the Tri-grams to be sorted by their influence on a language.
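The association step can be sketched end-to-end: compute the Correspondence Analysis coordinates from the SVD of the standardised residuals, then take, for each row, the column with the largest cosine as in (1.1). This is an illustrative implementation, not the authors' code:

```python
import numpy as np

def assign_trigrams(F):
    """Associate each Tri-gram (row of F) with one language (column of F)
    by the cosine between their Correspondence Analysis projections.
    F is the n x m contingency table of Tri-gram frequencies."""
    P = F / F.sum()                              # correspondence matrix
    r, c = P.sum(1), P.sum(0)                    # row and column masses
    # standardised residuals; their SVD yields the CA coordinates
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * s) / np.sqrt(r)[:, None]         # row principal coordinates
    cols = (Vt.T * s) / np.sqrt(c)[:, None]      # column principal coordinates
    cos = (rows @ cols.T) / (np.linalg.norm(rows, axis=1)[:, None]
                             * np.linalg.norm(cols, axis=1))
    return cos.argmax(1), cos                    # best language per Tri-gram
```

On a small table where each of the first three Tri-grams occurs overwhelmingly in one language, each is assigned to that language; a Tri-gram with a roughly uniform profile carries little discriminating information and its assignment is unstable.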

1.4 Results

Applying this methodology we reduce the number of Tri-grams considered for each language to the significant ones. The reduction for the principal languages is shown in Table 1.3.

Language   Tri-grams present   Significant Tri-grams
Italian    1341                101
German     1656                437
French     1540                239
English    1531                189
Spanish    1462                141

Table 1.3: Number of Tri-grams originally present in each language and number of significant Tri-grams derived from the association obtained by Correspondence Analysis.

It is interesting to notice that the number of Tri-grams is drastically reduced; e.g. for Italian we can consider 101 Tri-grams instead of 1341. Finally, using the Correspondence Analysis projections, we can also extract some examples of significant Tri-grams:

The, th for English; Sur, leu, les for French; di, in, lla for Italian; Und for German; Y l for Spanish.

1.5 Conclusion and further remarks

The illustrated strategy allows us to single out interesting similarities among several languages, but the most important result of this work is the strong data reduction, which can be a good starting point for text recognition methodologies. In fact, it can simplify text classification aimed at recognizing the language a document is written in. Further developments include repeating the analysis using larger documents derived from various sources. The languages considered should also be chosen according to their diffusion across countries.

References
1. Beesley K. R. (1998) Language identifier: a computer program for automatic natural-language identification of on-line text. Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, Oct 12-16, 47-54.
2. Cavnar W. B. and Trenkle J. M. (1994) N-gram-based text categorization. Symposium on Document Analysis and Information Retrieval, University of Nevada, 161-176.
3. Harman H. H. (1976) Modern Factor Analysis, Third Edition. Chicago: University of Chicago Press.
