1 Introduction
The wheat crop is one of the most important crops for Egypt because it is used in making Baladi bread, which is considered a vital food source for the Egyptian population owing to its high nutritive value. Wheat contains many important minerals, such as potassium and calcium, in addition to vitamins A, E, and K. Egypt was among the top 20 wheat-producing countries globally in 2014, with 9.3 million tons per year, according to the international wheat production statistics published by the Food and Agriculture Organization. Thus, wheat is considered one of the most strategic food crops in Egypt: it is a staple of the Egyptian diet, providing more than one-third of the daily caloric intake of the Egyptian population and more than 45% of the total daily protein consumed by Egyptians. Data mining techniques play an important role as a Knowledge Discovery in Databases (KDD) tool by extracting hidden patterns from exhaustive, multidimensional data records.
The goal of classification is to assign unknown objects to one of a set of predefined groups. Its target function maps the input attribute space to one of the predefined groups by adding a class label to the input [2]. Recently, the number of classifiers used to solve such classification problems has reached 179 classifiers from 17 families, as in [3]. The common types of classifiers are probabilistic, discriminant, decision tree, and support vector machine. Each type has its pros and cons, and its performance shows high variation in accuracy according to the number of patterns, the size of the training samples, and the number of input variables [4]. Many studies show that the Neural Network (NN) classifier is used in many applications: it can handle noisy data and works well with high-dimensional data. However, it has some limitations: an NN can fall into local minima, and it is difficult to determine the number of neurons needed in the hidden layer. Also, an NN needs a long time to build a training model, and it does not work well with small training samples. On the other hand, decision trees achieve high accuracy in classification problems [5], but they suffer from overfitting and from fragmentation problems related to the nature of the data or to sub-trees being replicated in the decision tree [6]. Therefore, given these drawbacks of NNs and decision trees, in this paper we utilize the SVM classifier, which has the following advantages:
- High generalization ability, which helps overcome the overfitting problem.
- Easy handling of nonlinear data points.
- High performance on high-dimensional datasets.
- High performance with small training data.
Given these advantages of SVM, we first reduce the effects of the high dimensionality of the feature space by using PCA [7] as a preprocessing step. SVM multiclass is then utilized with different types of kernel functions for identifying and predicting Egyptian wheat diseases. The main objective of this research work is to increase the national income from the wheat crop by preventing wheat losses. This will reduce costs and minimize the use of agrochemicals, aiming to reduce the losses of wheat yield due to dwarfing of the plants and malformation of the leaves. Besides, it will prevent the contamination of other crops within the same area. The main contribution of this research is to compare the results and measure the performance metrics of various data mining classifiers for early prediction of Egyptian wheat diseases. The rest of this paper is organized as follows: Section 2 describes the literature review. Section 3 presents the proposed system. Section 4 describes the wheat dataset. Section 5 explains the evaluation metrics. The experimental results and discussion are illustrated in Section 6. Finally, the conclusion and future work are given in Section 7.
2 Proposed system
The proposed system for early prediction of wheat diseases based on SVM multiclass, shown in Fig. 1, is composed of the following stages: (1) Building the Egyptian knowledge base of wheat diseases, based on a powerful web-based tool developed by the Central Laboratory for Agricultural Expert Systems (CLAES) [14]. (2) Pre-processing: replacing all missing values in the dataset with null as nominal attribute values. (3) Dimensionality reduction: performing a principal component analysis to reduce the high number of dataset attributes. (4) Classification: building an SVM multiclass model based on training data that contains the names of the wheat diseases and their related symptoms.
Count, and Weight). Second, to obtain the best performance, the missing values have been replaced by a constant null value, because missing entries can be inconsistent with the other recorded data.
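The stages above can be sketched as a small end-to-end pipeline. This is an illustrative sketch in Python with scikit-learn, not the WEKA implementation used in the paper; the data, fill value, and class labels are assumptions for demonstration only:

```python
# Illustrative sketch of the proposed pipeline (impute -> PCA -> SVM)
# using scikit-learn; the paper's experiments were run in WEKA.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Hypothetical symptom matrix: rows = wheat samples, columns = symptoms.
rng = np.random.default_rng(0)
X = rng.random((60, 10))
X[rng.random(X.shape) < 0.1] = np.nan      # simulate missing values
y = rng.integers(0, 3, size=60)            # 3 hypothetical disease classes

model = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),  # stage 2
    ("pca", PCA(n_components=0.95)),   # stage 3: keep 95% of the variance
    ("svm", SVC(kernel="poly")),       # stage 4: SVM multiclass
])
model.fit(X, y)
print(model.predict(X[:5]))
```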
For a linearly separable two-class problem with labels $y_i = \pm 1$, every training point must satisfy

$$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \qquad (1)$$

Equation (1) combines the two classes $y_i = \pm 1$. The margin between the two planes is $\frac{2}{\|\mathbf{w}\|}$, and we need to find $\mathbf{w}$ and $b$ that minimize the objective function:

$$\min_{\mathbf{w},\,b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \ \forall i \qquad (2)$$
For nonlinearly separable classes, equation (2) after adding a penalty term becomes

$$\min_{\mathbf{w},\,b,\,\xi} \ \frac{\|\mathbf{w}\|^2}{2} + C\sum_{i=1}^{r}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i \qquad (3)$$

The second term $C\sum_{i=1}^{r}\xi_i$ is added as a penalty, where $\xi_i$ is a slack variable that handles the non-separable data. By using the Lagrange function, this optimization problem is solved as:

$$L(\mathbf{w}, b, \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\right] \quad \text{subject to} \quad \alpha_i \ge 0, \ \forall i \qquad (4)$$
where $\alpha_1, \alpha_2, \ldots, \alpha_N$ are the Lagrange multipliers and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$. The SVM multiclass problem can be solved as a single optimization problem, as in equation (5), where the $m$-th decision function $\mathbf{w}_m^T\phi(\mathbf{x}_i) + b_m$ separates the training data of class $m$ from the other classes, and $\phi$ denotes the mapping function:

$$\min_{\mathbf{w},\,b,\,\xi} \ \frac{1}{2}\sum_{m=1}^{n}\mathbf{w}_m^T\mathbf{w}_m + C\sum_{i=1}^{L}\sum_{m \ne y_i}\xi_i^m$$
$$\text{subject to} \quad \mathbf{w}_{y_i}^T\phi(\mathbf{x}_i) + b_{y_i} \ge \mathbf{w}_m^T\phi(\mathbf{x}_i) + b_m + 2 - \xi_i^m,$$
$$\xi_i^m \ge 0, \quad i = 1, \ldots, L, \quad m \in \{1, \ldots, n\} \setminus \{y_i\} \qquad (5)$$

By solving equation (5), a new sample can be assigned to the class whose decision function gives the largest value.
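The multiclass decompositions compared later in the paper (1-against-1, 1-against-all, and random error-correcting output codes) can be sketched as follows. This is an assumed scikit-learn illustration on a standard toy dataset, not the WEKA setup used in the experiments:

```python
# Sketch of three multiclass decomposition strategies wrapped around a
# binary SVM; the paper evaluates the same strategies in WEKA.
from sklearn.datasets import load_iris
from sklearn.multiclass import (OneVsOneClassifier, OneVsRestClassifier,
                                OutputCodeClassifier)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
strategies = {
    "1-against-1": OneVsOneClassifier(SVC(kernel="poly")),
    "1-against-all": OneVsRestClassifier(SVC(kernel="poly")),
    # Random error-correcting output codes (ECOC), cf. Fig. 7.
    "random ECOC": OutputCodeClassifier(SVC(kernel="poly"),
                                        code_size=2, random_state=0),
}
for name, clf in strategies.items():
    clf.fit(X, y)
    print(name, round(clf.score(X, y), 3))
```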
3 Wheat dataset
The WEKA data mining tool has been used in the experiments. Prediction of wheat diseases is determined using a dataset that includes 285 instances, 63 attributes, and 24 classes of wheat diseases. A description of the attributes is shown in Table 1.
Table 1. Description of dataset attributes (excerpt)

Attribute            Description
Variety              Sakha 8, Giza 157, Sakha 61, Giza 160, Sakha 69, Giza 162
Names of diseases    Genetic flecking, Downy mildew, Barley yellow dwarf, Aphids, Leaf rust
4 Metrics of Evaluation
The main performance metrics are described and calculated using different formulas, as shown in Fig. 3.
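For a binary confusion matrix, the measures used in this section can be computed directly from the four counts. The sketch below uses illustrative counts, not values from the paper's experiments:

```python
# Computing the evaluation metrics from a binary confusion matrix.
# The counts below are illustrative, not taken from the paper.
import math

tp, fp, fn, tn = 6, 1, 2, 11
n = tp + fp + fn + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # sensitivity
specificity = tn / (tn + fp)

# Matthews Correlation Coefficient: +1 perfect, 0 random, -1 inverse.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Cohen's kappa: observed agreement corrected for chance agreement.
p_observed = accuracy
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} MCC={mcc:.3f} kappa={kappa:.3f}")
```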
Many experiments were applied to the wheat dataset within the WEKA platform for several selected classification algorithms in order to evaluate the performance of the proposed model. Our approach is based on a comparison between a number of classifiers: the J48 decision tree, Random Forests (RFs), K-Nearest Neighbor (KNN), Naïve Bayes (NB), ANN, and SVM. The variance parameter of PCA was changed from 0.95 to 1.0 to allow flexibility in choosing a suitable number of output dimensions. The accuracy results obtained with PCA are presented in Fig. 4. For SVM, for example, the number of attributes is reduced from 63 to 45 while the accuracy before and after applying PCA remains very close, as shown in Fig. 4. In Table 3, the accuracy of most classifiers is very close after applying PCA and reducing the dimensionality. The accuracy of the NN and Naïve Bayes classifiers has increased, and the time to build the model is decreased, especially in the ANN case: from 24.92 seconds to 7.49 seconds. We noticed that the accuracy of SMO (96.1404%) is the best in comparison to the other classifiers, followed by J48 (93.33%), while RFs and KNN give the same accuracy (91.6%).
Fig. 4. Accuracy of each classifier (SVM, J48, RFs, KNN, ANN, Naïve Bayes) with and without PCA.
Table 4 shows the results of the different classifiers based on different evaluation measures: the mean absolute error, root mean squared error, and kappa statistic as numeric values, while the root relative squared error and relative absolute error are calculated as percentages over all test samples. Other performance measures have also been computed, such as precision, recall, and the Matthews Correlation Coefficient (MCC) [24], which is applied as a linear correlation coefficient and used as a measure of the quality of classifications. The MCC ranges between +1 (perfect prediction), 0 (average random prediction), and -1 (inverse prediction). Fig. 5 shows the comparison between the different evaluation parameters of the classifiers. The specificity metric for both RFs and KNN has the same value (91.6), the best recall metric is for SVM (96.1), and the lowest specificity is for Naïve Bayes (87). Several experiments were run based on fine tuning of the SVM parameters, and the best parameters were then chosen to achieve the best accuracy.
Fig. 5. Comparison of the classifiers across the evaluation measures: Accuracy, Mean Absolute Error (MAE), Relative Absolute Error (RAE, %), Kappa Statistic (KS), Root Mean Squared Error (RMSE), and Root Relative Squared Error (RRSE, %).
A number of experiments were run based on fine tuning, and the best parameters were then chosen for SVM, as shown in Table 5. The ε-insensitive loss function affects the number of support vectors and the smoothness of the SVM's response; both the complexity and the generalization capability of the SVM depend on its value. Parameter C controls the trade-off between margin maximization and the SVM's errors on the training data. The gamma parameter is a free parameter that affects the kernel functions. In addition, the accuracy of the SVM multiclass was tested using several types of kernel functions. Fig. 6 shows that the polynomial kernel function achieved the best accuracy, 96.1%. The effects of the different multiclass methods are shown in Fig. 7; the random correcting code achieves the best result (96.14%) in conjunction with SVM multiclass because it enhances the generalization ability of the binary classifiers.
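The kernel comparison and parameter tuning described above can be sketched as follows. This is an assumed scikit-learn illustration on a toy dataset (the paper's experiments were run in WEKA, and WEKA's PUK kernel has no direct scikit-learn equivalent, so a different set of kernels is shown):

```python
# Sketch of comparing SVM kernels and tuning C/gamma via grid search,
# mirroring the 10-fold cross-validation protocol used in the paper.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare kernels with 10-fold cross-validation.
for kernel in ("poly", "rbf", "linear", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10)
    print(kernel, round(scores.mean(), 3))

# Fine-tune C (margin/error trade-off) and gamma (kernel width).
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]},
                    cv=10)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```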
Fig. 6. Accuracy of SVM multiclass with different kernel functions (Normalized PolyKernel, RBF kernel, PUK kernel, and Polynomial kernel); the polynomial kernel achieved the best accuracy at 96.14%, with the other kernels ranging from 68.07% to 85.26%.
Fig. 7. Accuracy of SVM multiclass methods (1-against-all, Random Correction Code, and 1-against-1); the Random Correction Code achieved the best accuracy at 96.14%, with the other two methods at 91.50% and 90.17%.
In this paper, an SVM-based wheat disease prediction model using PCA is proposed. The dataset is composed of 24 kinds of wheat diseases with different symptoms (63 attributes); 285 instances were used for training and testing the proposed model. In this research work, we utilized different types of SVM kernels with 10-fold cross-validation and measured their effects on accuracy. In addition, the effects of PCA on the number of attributes, the training time, and the final accuracy were also obtained. The experimental results showed that the random ECOC technique, which decomposes the multiclass problem into a set of binary problems, in conjunction with the maximum-probability voting method, achieved the best accuracy (96.1%) compared to the J48, RFs, KNN, ANN, and Naïve Bayes classifiers, which respectively achieved 93.3%, 91.6%, 91.6%, 90.5%, and 87.0%. In future work, we will utilize deep neural network classifiers, which can deal with a huge number of attributes, in conjunction with extracting features directly from plant images using pattern recognition techniques.
References
1. Jiang, Heling, An Yang, Fengyun Yan & Hong Miao (2016). Research on Pattern Analysis and Data Classification Methodology for Data Mining and Knowledge Discovery. International Journal of Hybrid Information Technology, 9(3), 179-188.
3. Hakizimana Leopord, Wilson Kipruto Cheruiyot & Stephen Kimani (2016). A Survey and Analysis of Classification and Regression Data Mining Techniques for Diseases Outbreak Prediction in Datasets. The International Journal of Engineering and Science (IJES), 5(9), 2319-1813.
5. Brijain R. Patel & Kushik K. Rana (2014). A survey on decision tree algorithm for classification. International Journal of Engineering Development and Research, 2(1), 2321-9939.
7. Zhang Jian & Zhang Wei (2010). Support vector machine for recognition of cucumber leaf diseases. IEEE, 5, 264-266.
11. Usama Mokhtar, Mona A. S. Ali, Aboul Ella Hassanien & Hesham Hefny (2015). Tomato leaves diseases detection approach based on Support Vector Machines. IEEE, 246-250.
12. Haiguang Wang, Guanlin Li, Zhanhong Ma & Xiaolong Li (2012). Application of neural networks to image recognition of plant diseases. 2012 International Conference on Systems and Informatics (ICSAI 2012), IEEE, 2159-2164.
13. Rumpf, T., A.-K. Mahlein, U. Steiner, E.-C. Oerke, H.-W. Dehne & L. Plümer (2010). Early detection and classification of plant diseases with support vector machines based on hyperspectral reflectance. Computers and Electronics in Agriculture, 74, 91-99.
14. Sannakki, Sanjeev S., Vijay S. Rajpurohit, V. B. Nargund & Pallavi Kulkarni (2013). Diagnosis and classification of grape leaf diseases using neural networks. In Computing, Communications and Networking Technologies (ICCCNT), 2013 Fourth International Conference on, IEEE, 1-5.
15. Rafea, Ahmed (2010). Web-Based Domain Specific Tool for Building Plant Protection Expert Systems. INTECH Open Access Publisher, 193-203.
16. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann & Ian H. Witten (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
17. P. Subbuthai, Azha Periasamy & S. Muruganand (2012). Identifying the character by applying PCA method using Matlab. International Journal of Computer Applications, 60(1), 1-4.
18. Hervé Abdi & Lynne J. Williams (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
19. A. Basu, C. Watters & M. Shepherd (2002). Support Vector Machines for Text Categorization. 36th IEEE Hawaii International Conference on System Sciences, 1-7.
20. Hwanjo Yu & Sungchul Kim (2012). SVM tutorial: classification, regression, and ranking. In Handbook of Natural Computing, Springer Berlin Heidelberg, 479-506.
21. Nello Cristianini & John Shawe-Taylor (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.