
Performance Analysis of Local and Global Feature Selection Methods for Text Document Classification

Supervisor
Associate Prof. Dr. Kashif Javed

Presented by
Muhammad Sajid Ali
2015-MS-EE-048
Introduction
• Nowadays, the web is a main source of ever-increasing unstructured textual data. Approximately 70-80% of an organization's information is stored in unstructured text format.
• Text document classification is the task of arranging text documents into predefined categories.
• Categorizing documents into known categories helps to find documents related to user queries. The large number of terms in text documents is a major challenge for the accuracy of a classifier.
• Text classification (TC) is mainly a two-stage process: training and prediction.
Contd.
• A major problem of text classification (TC) is its high-dimensional feature space, i.e. a very large number of features. Therefore, feature selection (FS) methods are used to speed up the classification task, reduce irrelevant and redundant features, and increase classifier performance.
• The bag-of-words approach is used for text representation.
• The latest trend in FS is to select a feature vector by combining both global and local FS schemes.
• The main objective of this research is to analyze state-of-the-art feature selection (FS) methods, including both global and local FS methods.
Process of Text Classification
Learning: Read documents → Feature Extraction (Tokenization, Stop Words Removal, Stemming) → Feature Selection (VGFSS, IGFSS, IG, GI, NDM, BNS) → Label

Prediction: Input documents → Feature Extraction → Classifier → Label
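To make this flow concrete, here is a minimal sketch of the learning and prediction phases using scikit-learn; the tiny in-line corpus and the chi-square selector (standing in for the FS metrics listed above) are illustrative assumptions, not the actual experimental setup.

```python
# Schematic sketch of the two-phase flow above: learning, then prediction.
# The tiny in-line corpus and the chi-square selector are illustrative
# stand-ins for the real datasets and FS metrics (VGFSS, IGFSS, IG, GI, NDM, BNS).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = ["the wolf hunts the deer", "the zebra eats grass", "the pelican catches fish"]
train_labels = ["C1", "C2", "C3"]

pipeline = Pipeline([
    ("extract", CountVectorizer(stop_words="english")),  # tokenization + stop-word removal
    ("select", SelectKBest(chi2, k=3)),                  # feature selection step
    ("classify", MultinomialNB()),                       # classifier training
])

pipeline.fit(train_docs, train_labels)                       # learning phase
print(pipeline.predict(["a hungry wolf chases the deer"]))   # prediction phase
```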
Text Document Representation
• Bag of Words (BoW) is the most commonly used text representation scheme, in which terms or words are represented in a vector based on their frequency of occurrence, called the term count (tc).
• BoW does not take the order of words in a document into consideration: D = (tw_1, tw_2, tw_3, ..., tw_r), where tw_i is the weight of the i-th word in the vocabulary of r words, W = (w_1, w_2, w_3, ..., w_r).
• The most commonly used term weighting scheme is tf.idf (a small sketch follows).
• In the vector space model, each term, which is also called a feature, represents one dimension of the text data.
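As a small illustration of the BoW representation and tf.idf weighting, the sketch below vectorizes two made-up documents with scikit-learn; the example sentences are invented and only serve to show the term-count and tf.idf vectors.

```python
# Small sketch of the BoW representation: each document becomes a vector of
# term counts (tc) over the vocabulary W, and tf.idf then reweights those
# counts. The two example documents are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the wolf and the deer", "the zebra and the elephant"]

counter = CountVectorizer()
tc = counter.fit_transform(docs)              # term-count vectors (word order is lost)
print(counter.get_feature_names_out())        # vocabulary W = (w_1, ..., w_r)
print(tc.toarray())                           # document vectors D = (tw_1, ..., tw_r)

tfidf = TfidfTransformer()
print(tfidf.fit_transform(tc).toarray())      # tf.idf-weighted document vectors
```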
Feature Extraction: A Pre-processing Stage
Tokenization: conversion of long text documents into lists of tokens or words.
Stopword Removal: stop words such as "the", "an", "a", "and", etc. occur too frequently to give any information about the subject matter, hence they should be removed.
Stemming: conversion of words to their root form, e.g. connection → connect, taught → teach.
Feature Selection (FS) Methods
• Feature selection methods select subsets of features from pre-processed documents.
• The task of FS methods is to reduce the high dimensionality of the feature space by eliminating redundant and irrelevant features while preserving those features that have a strong correspondence to specific classes.
• Feature selection methods broadly fall into three categories: filter, wrapper and embedded.
• Filter-based methods fall into two categories, global and local feature selection schemes, which are further subdivided into one-sided and two-sided metrics.
Global and Local Feature Selection Methods
• In a local feature selection scheme, a class-based local score is computed for each class and considered as the final score.
• A global feature selection scheme (GFSS) assigns a score to each feature based on its discriminating power and then selects the top N features.
• The multiple local scores of a feature, one per class, are combined into a single global score (a small sketch follows the toy table below).
• Local feature selection methods are used for labelling features.
• The variable global feature selection scheme (VGFSS) and the improved global feature selection scheme (IGFSS) are global feature selection schemes.
Terms   Global DFS score   C1   C2   C3   Class Label
wolf 1 1 0 0 C1
zebra 1 0 1 0 C2
pelican 1 0 0 1 C3
elephant 0.8 0 0.8 0 C2
fish 0.75 0.75 0 0 C1
bat 0.71 0 0 0.71 C3
cow 0.71 0 0 0.71 C3
horse 0.67 0.08 0.53 0.07 C2
tiger 0.64 0.24 0.4 0 C2
rat 0.63 0 0 0.63 C3
toad 0.6 0.24 0 0.36 C3
deer 0.58 0 0.31 0.27 C2
buffalo 0.57 0 0.57 0 C2
duck 0.56 0 0 0.56 C3
hen 0.56 0 0 0.56 C3
mouse 0.5 0.14 0.19 0.17 C2
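Using the first few rows of the toy table above, the sketch below shows one way the class-based local scores can be combined into a global score (here by summation, which roughly matches the Global DFS column) and how the arg-max class provides the label; the globalization policy is an illustrative assumption.

```python
# Sketch of combining class-based local scores into one global score and a
# class label, using the first few rows of the toy table above. Summation is
# used as the globalization policy here (it roughly matches the Global DFS
# column); weighted averages or the maximum are other common choices.
local_scores = {
    "wolf":     [1.00, 0.00, 0.00],
    "zebra":    [0.00, 1.00, 0.00],
    "elephant": [0.00, 0.80, 0.00],
    "tiger":    [0.24, 0.40, 0.00],
    "mouse":    [0.14, 0.19, 0.17],
}

global_scores = {t: sum(s) for t, s in local_scores.items()}                      # globalization
class_labels = {t: f"C{s.index(max(s)) + 1}" for t, s in local_scores.items()}    # labelling

N = 3
top_n = sorted(global_scores, key=global_scores.get, reverse=True)[:N]            # top-N selection
print([(t, round(global_scores[t], 2), class_labels[t]) for t in top_n])
```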
Odds Ratio (OR) and Mutual Information (MI): Local Feature Selection Metrics
Terms   Global_DFS   OR_C1   OR_C2   OR_C3   Membership   Class Label
wolf 1 6.65821148 -5.26678654 -5.4547399 positive C1
zebra 1 -5.506032 6.658211483 -5.8615301 positive C2
pelican 1 -5.8215968 -5.98868469 6.65821148 positive C3
elephant 0.8 -5.101538 6.247927513 -5.4547399 positive C2
fish 0.75 6.08037342 -4.70043972 -4.886132 positive C1
bat 0.71 -5.101538 -5.26678654 5.93073734 positive C3
cow 0.71 -5.101538 -5.26678654 5.93073734 positive C3
horse 0.68 -1.2661282 6.247927513 -3.1142961 positive C2
tiger 0.64 1.90777271 2.981222793 -6.178487 negative C3
rat 0.63 -4.5374341 -4.70043972 5.357552 positive C3
toad 0.6 1.90777271 -5.98868469 1.81915135 negative C2
deer 0.58 -6.0803734 2.206161151 0.95968254 negative C1
buffalo 0.57 -3.5982593 4.700439718 -3.9341121 positive C2
duck 0.56 -3.5982593 -3.7548875 4.39231742 positive C3
hen 0.56 -3.5982593 -3.7548875 4.39231742 positive C3
mouse 0.5 3.59825932 3.754887502 -4.3923174 negative C3
Contd.
• The odds ratio represents the odds of a term occurring in the positive class relative to the odds of it occurring in the negative class.
• The odds ratio is a local metric which assigns class membership in the form of positive and negative values.
• Mutual information is a local FS metric used to measure the information contained by a term t_i.
• It assigns higher scores to rare terms.
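A minimal sketch of how OR and (pointwise) MI scores can be computed from a 2×2 term/class contingency table is given below; the document counts and the small smoothing constant are made-up illustrative values, not taken from any of the datasets used in this work.

```python
# Minimal sketch of the OR and MI local scores from a 2x2 contingency table:
#   a  = docs of class c containing term t     b = docs of other classes containing t
#   c_ = docs of class c without t             d = docs of other classes without t
# A small constant (eps) is added only to avoid division by zero / log(0);
# the counts themselves are made-up illustrative numbers.
import math

def odds_ratio(a, b, c_, d, eps=0.5):
    p_t_pos = (a + eps) / (a + c_ + 2 * eps)      # P(t | c)
    p_t_neg = (b + eps) / (b + d + 2 * eps)       # P(t | not c)
    return math.log((p_t_pos * (1 - p_t_neg)) / ((1 - p_t_pos) * p_t_neg))

def mutual_information(a, b, c_, d, eps=0.5):
    n = a + b + c_ + d
    # pointwise MI: log( P(t, c) / (P(t) * P(c)) )
    return math.log(((a + eps) * n) / ((a + b + 2 * eps) * (a + c_ + 2 * eps)))

# e.g. "wolf" appears in 9 of 10 documents of class C1 and in 1 of 90 others
print(odds_ratio(9, 1, 1, 89))          # strongly positive -> positive member of C1
print(mutual_information(9, 1, 1, 89))  # MI favours such rare, class-specific terms
```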
Global Feature Selection Metrics: DFS, IG, BNS and ACC2
• Distinguishing feature selection (DFS), information gain (IG), bi-normal separation (BNS) and the balanced accuracy measure (ACC2) fall into the category of GFSS.
• Each metric computes a local score of each feature for each class and then combines these scores to form a global score for that feature.
• These metrics then select the top N scoring features. However, features of some categories may not be included in the top N, which degrades the performance of the classifier.
• Global score for the DFS metric:
  DFS(t_i) = Σ_{j=1}^{M} P(C_j | t_i) / ( P(t̄_i | C_j) + P(t_i | C̄_j) + 1 ), where M is the number of classes.
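The DFS formula above can be computed directly from per-class document counts, as in the following sketch; the counts in the example calls are made up for illustration.

```python
# Minimal sketch of the DFS formula above, computed from document counts:
#   docs_with_t[j]   = documents of class j that contain term t_i
#   docs_in_class[j] = total documents in class j
# The example counts are made up; smoothing of empty classes is ignored.
def dfs_score(docs_with_t, docs_in_class):
    total_with_t = sum(docs_with_t)
    total_docs = sum(docs_in_class)
    score = 0.0
    for j in range(len(docs_in_class)):
        p_c_given_t = docs_with_t[j] / total_with_t                                # P(Cj | ti)
        p_not_t_given_c = (docs_in_class[j] - docs_with_t[j]) / docs_in_class[j]   # P(~ti | Cj)
        docs_other = total_docs - docs_in_class[j]
        p_t_given_not_c = (total_with_t - docs_with_t[j]) / docs_other             # P(ti | ~Cj)
        score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
    return score

# A term occurring only in class C1 gets the maximum DFS score of 1:
print(dfs_score(docs_with_t=[10, 0, 0], docs_in_class=[10, 10, 10]))
# A term spread evenly over all classes scores much lower:
print(dfs_score(docs_with_t=[5, 5, 5], docs_in_class=[10, 10, 10]))
```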
Improved Global Feature Selection Scheme (IGFSS)
• Selecting only the top N features may discard important features for some classes.
• The IGFSS metric tackles this issue by selecting an equal number of features from each class:
• Count the number of positive and negative features in each class.
• Find the positive and negative feature ratios, i.e. pfr = 1 - nfr.
• Design a criterion for an equal split of features across the classes.
Flow of IGFSS
Based on the OR metric score, compute the positive and negative features in each class
↓
Choose the negative feature ratio (nfr) for each class, then select the maximum nfr
↓
Compute the positive feature ratio: pfr = 1 - nfr
↓
Design an equal-split criterion: Equal_split(C_j) = N / (total number of classes)
↓
Compute the selection criterion for choosing the numbers of positive and negative features from each class (see the sketch below)
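A rough sketch of this flow is given below; it assumes each term already carries a global score, a class label and a positive/negative membership flag (as in the OR table earlier), and its tie-breaking and rounding choices are simplifying assumptions rather than the exact rules of the original IGFSS paper.

```python
# Rough sketch of the IGFSS flow above. Each term is assumed to already carry
# a global score, a class label and a positive/negative membership flag (as
# in the OR table earlier). Tie-breaking and rounding are simplified here;
# shortages of positive or negative terms in a class are simply ignored.
def igfss_select(terms, n_features, n_classes):
    """terms: list of dicts with keys 'term', 'score', 'class', 'membership'."""
    # 1) negative feature ratio per class, keeping the maximum nfr
    nfr = 0.0
    for c in range(1, n_classes + 1):
        in_class = [t for t in terms if t["class"] == c]
        if in_class:
            neg = sum(1 for t in in_class if t["membership"] == "negative")
            nfr = max(nfr, neg / len(in_class))
    pfr = 1.0 - nfr                                   # 2) positive feature ratio
    per_class = n_features // n_classes               # 3) equal split: N / no. of classes

    selected = []
    for c in range(1, n_classes + 1):                 # 4) pick a pfr/nfr mix per class
        ranked = sorted((t for t in terms if t["class"] == c),
                        key=lambda t: t["score"], reverse=True)
        pos = [t for t in ranked if t["membership"] == "positive"]
        neg = [t for t in ranked if t["membership"] == "negative"]
        n_pos = round(per_class * pfr)
        selected += pos[:n_pos] + neg[:per_class - n_pos]
    return [t["term"] for t in selected]
```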
Procedure/Methodology
• Datasets play a key role in determining the efficacy of FS methods. The response of the local and global FS methods will be discussed on different datasets whose characteristics range from balanced to imbalanced and from binary to multi-class distributions.
• Five datasets will be used in this research (WebKB, 20NG, Ohsumed10, Reuters10, and Classic4). WebKB, 20NG, and Classic4 are balanced datasets, whereas Ohsumed10 and Reuters10 are highly imbalanced.
• Text classifiers are required for the performance analysis of FS methods. The most commonly used text classifiers are Naïve Bayes and SVM. Text datasets are high dimensional, and high-dimensional spaces are more likely to be linearly separable, so simple linear-kernel classifiers perform best in such spaces; these two classifiers are therefore chosen for the performance analysis. A small toy dataset will also be used to give a better idea of how the feature selection schemes work.
• Two state-of-the-art machine learning algorithms will be used as classifiers, i.e. support vector machines (SVM) and Naïve Bayes, while the micro- and macro-F1 measures will be used for performance analysis (a minimal setup sketch follows).
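A minimal sketch of this kind of evaluation setup on 20NG (one of the five datasets listed above) is shown below; the tf.idf representation and the default scikit-learn hyper-parameters are assumptions, not necessarily the settings used in the thesis experiments.

```python
# Sketch of the evaluation setup described above: a linear-kernel SVM and
# Naive Bayes on the 20NG dataset, scored with micro- and macro-F1.
# The tf.idf representation and default hyper-parameters are assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vec = TfidfVectorizer(stop_words="english")
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

for name, clf in [("SVM", LinearSVC()), ("NB", MultinomialNB())]:
    pred = clf.fit(X_train, train.target).predict(X_test)
    print(name,
          "micro-F1:", round(f1_score(test.target, pred, average="micro"), 3),
          "macro-F1:", round(f1_score(test.target, pred, average="macro"), 3))
```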
Performance Evaluation Measures
• The F1 measure is the harmonic mean of precision and recall, where precision and recall are defined as:
  precision = (correctly classified positive instances) / (total number of predicted positive instances)
  recall = (correctly classified positive instances) / (total number of actually positive instances)
• The micro-F1 measure does not take the class distribution into consideration and tends to be biased towards large classes, while the macro-F1 measure does consider the class distribution by locally averaging the precision and recall values of the individual classes.
  micro-F1 = (2 × precision × recall) / (precision + recall), computed from the counts pooled over all classes
  macro-F1 = (1/r) Σ_{j=1}^{r} (2 × precision_j × recall_j) / (precision_j + recall_j), where r is the total number of categories
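The micro- and macro-F1 definitions above can be computed directly from per-class true-positive, false-positive and false-negative counts, as in the sketch below; the toy label vectors are made up for illustration.

```python
# Small sketch of micro- and macro-F1 computed directly from the definitions
# above; the toy true/predicted label vectors are made up.
from collections import Counter

y_true = ["C1", "C1", "C2", "C2", "C2", "C3"]
y_pred = ["C1", "C2", "C2", "C2", "C3", "C3"]

classes = sorted(set(y_true))
tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1
        fn[t] += 1

# micro-F1 pools the counts over all classes (biased towards large classes)
P = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
R = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
print("micro-F1:", 2 * P * R / (P + R))

# macro-F1 averages the per-class F1 values (treats every class equally)
f1s = []
for c in classes:
    p = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
    r = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
    f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
print("macro-F1:", sum(f1s) / len(classes))
```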
Time Schedule
Year 2018 (January - June)
Activities:
• Literature survey
• Pre-processing of data
• Experimental setup and test-bed creation
• Performance analysis and evaluation of classifiers
• Thesis writing and submission
Questions and Discussion