DOI: 10.1111/exsy.12135
Abstract: Early prediction of school dropout is a serious problem in education, but it is not an easy issue to resolve. On the one hand, there are many factors that can influence student retention. On the other hand, the traditional classification approach used to solve this problem normally has to be implemented at the end of the course in order to gather maximum information and achieve the highest accuracy. In this paper, we propose a methodology and a specific classification algorithm to discover comprehensible prediction models of student dropout as early as possible. We used data gathered from 419 high school students in Mexico. We carried out several experiments to predict dropout at different steps of the course, to select the best indicators of dropout and to compare our proposed algorithm against several well-known classical and imbalanced classification algorithms. Results show that our algorithm was able to predict student dropout within the first 4-6 weeks of the course and is trustworthy enough to be used in an early warning system.
Keywords: predicting dropout, classification, educational data mining, grammar-based genetic programming
1. Introduction
Predicting student dropout in high school is an important issue in education because it affects a great many students in schools and institutions all over the world, and it usually results in financial loss, lower graduation rates and an inferior school reputation in the eyes of all involved (Neild et al., 2007). The definition of dropout differs among researchers, but in any event, if an institution loses a student by whatever means, the institution has a lower retention rate. The early identification of vulnerable students who are prone to dropping their courses is crucial to the success of any school retention strategy. To reduce this problem, it is necessary to detect students who are at risk as early as possible and to provide timely care and intervention that keeps them from quitting their studies (Heppen & Bowles, 2008). Seidman (1996) proposed a well-known retention formula showing that early identification of students at risk, combined with intensive and continuous intervention, is the key to reducing dropout levels.
Developing and using an early warning system (EWS) is therefore a good solution for detecting students at high risk of dropout as early as possible. An EWS is any system designed to alert decision makers to potential dangers; its purpose is to allow the problem to be prevented before it becomes irreversible.
2. Background
Tinto's model (Tinto, 1975) is the most widely accepted model in the student retention literature. Tinto claims that students' decision to persist in or drop out of their studies is strongly related to their degree of academic and social integration at university. Classification algorithms (Kumar & Verma, 2012) are, on the other hand, the most widely applied data mining technique for predicting student dropout, as we describe next. A first example is the work of Lykourentzou et al. (2009), in which several classification techniques [feedforward neural networks, support vector machines (SVMs), probabilistic ensemble simplified fuzzy ARTMAP and a decision scheme] were applied to dropout prediction in e-learning courses using data from students of the University of Athens. The most successful technique in promptly and accurately predicting dropout-prone students was the decision scheme.
In another comparative analysis, several classification methods (artificial neural networks, decision trees, SVMs and logistic regression) were used to develop early models identifying the first-year students most likely to drop out (Delen, 2010).
Dropout rates in Mexico by educational level and school year:

Educational level             Age (years)   Dropout 2010-2011 (%)   Dropout 2011-2012 (%)   Dropout 2012-2013 (%)
Primary school                6 to 12       0.8                     0.7                     0.7
Secondary school              12 to 15      5.6                     5.4                     5.1
High school (preparatoria)    15 to 18      14.5                    13.9                    13.1
Higher education              over 18       8.2                     8.0                     7.9
The data for this study came from a public university located in the Midwestern United States. SVMs performed best, followed by decision trees, neural networks and logistic regression. Other similar work (Zhang et al., 2010) used three classification algorithms (naive Bayes, SVM and decision tree) on university student data in order to improve student retention in higher education. The specific attributes used were the following: average mark, online learning system information, library information, nationality, university entry certificate, course award, current study level, study mode, age, gender, and so on. Different configurations of the algorithms were tested in order to find the optimal result; naive Bayes achieved the highest prediction accuracy, while the decision tree had the lowest. A related work (Kovacic, 2010) used different classification tree methods [Chi-square Automatic Interaction Detector (CHAID), exhaustive CHAID, QUEST and classification and regression tree (CART)] for the early prediction of student success. It explored the sociodemographic variables (age, gender, ethnicity, education, work status and disability) and study-environment variables (course programme and course block) that may influence the persistence or dropout of students at the Open Polytechnic of New Zealand. It found that the most important factors separating successful from unsuccessful students were ethnicity, course programme and course block, and that the most successful classification method was CART. Class association rules (CARs) were also applied to predict student dropout as soon as possible (Antunes, 2010). A CAR is a rule in which the consequent is a single proposition related to the class attribute. The data set used in that study came from the results of students enrolled during the previous 5 years in an undergraduate programme at Instituto Superior Técnico in Lisbon, and it contained 16 attributes about weekly exercises, tests and exams.
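To make the CAR format concrete, a rule of this kind over such attributes might look as follows (a hypothetical illustration of the format, not a rule reported by Antunes):

    IF exercise_3 = not_delivered AND test_1_mark < 10 THEN class = dropout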
[Table: summary of the related work described above, with columns Subject, Reference and Results. Only fragments survived extraction; the recoverable entries pair subjects such as "to predict student retention in university", "to identify students at risk in higher education", "to anticipate undergraduate students' failure as soon as possible" and "to predict freshman retention in university" with the references Zhang et al., 2010; Kovacic, 2010; and Antunes, 2010 (CAR).]
CHAID, Chi-square Automatic Interaction Detector; CART, classification and regression tree; CAR, class association rules.
Confusion matrix for a binary (dropout) classifier:

                      Actual positive   Actual negative
Predicted positive    TP                FP
Predicted negative    FN                TN

TP, true positive; TN, true negative; FP, false positive; FN, false negative.
4. Data set
The data set used in this work comes from 419 students enrolled in the Academic Unit Preparatoria at the Autonomous University of Zacatecas in Mexico. All students were about 15 years old and were registered in the first year of the preparatoria (high school). In this study, we used only the information from the first semester, that is, the semester in which most students drop out. In fact, in our case, 13.6% of the students dropped out, as shown in Figure 2.
All the data were gathered from different sources and on different occasions during the period from August to December 2012. Figure 3 shows the specific steps at which the student information was gathered. We used these stages for collecting the information in accordance with the particular characteristics of the Mexican Academic Program II, but our proposed methodology can be implemented in other institutions by simply changing the number of stages and their dates to suit each institution's own characteristics.
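As a minimal sketch of how such a staged calendar could be parameterized for another institution (the step labels and dates below are illustrative, not the exact calendar used in this study):

```python
# Hypothetical data-collection calendar: each step maps to the date at which
# its attributes become available; an institution adjusts stages and dates.
from datetime import date

COLLECTION_STEPS = {
    "0": date(2012, 8, 1),     # before the semester: admission-exam scores
    "I": date(2012, 8, 20),    # enrolment: general student information
    "II": date(2012, 9, 28),   # first partial marks (illustrative date)
    "VI": date(2012, 12, 14),  # end of the semester (illustrative date)
}

def available_steps(today: date) -> list[str]:
    """Steps whose data has already been collected by a given date."""
    return [s for s, d in COLLECTION_STEPS.items() if d <= today]

print(available_steps(date(2012, 10, 1)))  # -> ['0', 'I', 'II']
```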
Step 0 took place before the beginning of the semester and contained previous marks/scores; at this stage, the only available information about students came from the admission exam. Step I took place just at the beginning of the semester and provided general information about school enrolment. Once students were enrolled, we obtained new information at each of the subsequent steps.
5. Experiments
We carried out three experiments in order to test our methodology and to compare the performance of our proposed ICRM2 algorithm against five classical and four well-known imbalanced classification algorithms publicly available in the WEKA data mining software (Witten et al., 2011).
5.1. Experiment 1
In this first experiment, we predicted dropout by using all the attributes available at each step of the course, that is, all attributes accumulated from the beginning of the course up to the corresponding stage. We executed the following classical classification algorithms:
[Table: number of attributes available at each step of the course (Steps 0-VI). Only fragments survived extraction: 2 attributes at Step 0, 10 at Step I, 11 at Step II and 26 at the final step. The accompanying list of classical algorithms was not recoverable.]
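The paper's runs used WEKA; as a rough, self-contained sketch of this experimental loop in Python with scikit-learn (the classifier choices, synthetic data and step boundaries below are placeholders, not the authors' exact setup):

```python
# Sketch of the Experiment 1 evaluation loop: for each step of the course,
# train each classifier on the attributes accumulated so far.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_students, n_attributes = 419, 26
X = rng.normal(size=(n_students, n_attributes))   # stand-in for the real records
y = (rng.random(n_students) < 0.136).astype(int)  # ~13.6% dropout, as in the paper

# Cumulative number of attributes available at each step (hypothetical cut points).
steps = {"0": 2, "I": 10, "II": 11, "VI": 26}

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for step, n_attr in steps.items():
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X[:, :n_attr], y, cv=cv)
        print(f"step {step:>2} ({n_attr:2d} attrs) {name:20s} acc={scores.mean():.3f}")
```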
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (4)

$TP\ rate = \dfrac{TP}{TP + FN}$   (5)
$TN\ rate = \dfrac{TN}{TN + FP}$   (6)

$GM = \sqrt{TP\ rate \times TN\ rate}$   (7)
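A minimal sketch of these measures in Python (the function and variable names are ours, not from the paper):

```python
from math import sqrt

def dropout_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation measures of equations (4)-(7)."""
    tp_rate = tp / (tp + fn)                           # eq. (5): sensitivity
    tn_rate = tn / (tn + fp)                           # eq. (6): specificity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # eq. (4)
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "gm": sqrt(tp_rate * tn_rate),                 # eq. (7): geometric mean
    }

# Example: 50 dropouts caught, 7 missed, 340 retained correctly, 22 false alarms.
print(dropout_metrics(tp=50, tn=340, fp=22, fn=7))
```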
[Tables: results of Experiments 1 and 2, reporting TP rate, TN rate, accuracy, geometric mean (GM), AUC and execution time for each algorithm at each step of the course. The numeric cells survived extraction but could not be re-associated with their row and column headers.]
$AUC = \dfrac{1 + TP\ rate - FP\ rate}{2}$   (8)
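In code, this single-operating-point AUC approximation (our naming) is simply:

```python
def auc_single_point(tp_rate: float, fp_rate: float) -> float:
    """Eq. (8): AUC estimated from one (TP rate, FP rate) operating point."""
    return (1.0 + tp_rate - fp_rate) / 2.0

print(auc_single_point(tp_rate=0.9, fp_rate=0.1))  # FP rate = 1 - TN rate; -> 0.9
```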
[Tables: Experiment 3 comparison against algorithms for imbalanced data (C45-SMOTE, SVM-SMOTE, C45-CS, SVM-CS, GPCOACH-H and ICRM), reporting TP rate, TN rate, accuracy, GM and AUC at each step of the course. The numeric cells survived extraction but could not be re-associated with their row and column headers.]
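The SMOTE-based baselines in this comparison oversample the minority (dropout) class before training. A minimal sketch with imbalanced-learn and scikit-learn (this mirrors the spirit of the C45-SMOTE baseline; it is not the authors' exact setup, and the data here is synthetic):

```python
# SMOTE oversampling followed by a C4.5-like decision tree.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(419, 10))
y = (rng.random(419) < 0.136).astype(int)  # ~13.6% minority class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_tr, y_tr)  # balance classes

clf = DecisionTreeClassifier(random_state=1).fit(X_bal, y_bal)
print("test accuracy:", clf.score(X_te, y_te))
```

The cost-sensitive (-CS) variants can be approximated in the same setting by training on the original imbalanced data with a class-weighted learner, for example DecisionTreeClassifier(class_weight={0: 1, 1: 6}).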
6. Discovered models
Two examples of the different models discovered by our ICRM2 algorithm in each experiment are shown and discussed below.
8. Conclusions
The proposed methodology has been shown to be valid for predicting high school dropout. We carried out several experiments using data from 419 first-year Mexican high school students. It is important to note that most of the current research applying EDM to the problem of student dropout has focused on the specific case of higher education. Little research into compulsory-education dropout rates has been conducted, and what has been done remains limited.
GBGP algorithm
BEGIN
1. WHILE there are remaining instances AND the number of rules is lower than the maximum allowed REPEAT
1.1. To initialise the population for a class
It generates the individuals (rules) of the population to learn a given class.
1.2. To do parent selection
It selects the parents to which the genetic operators are applied.
1.3. To do crossover
It mixes the genetic information of the parents to generate offspring.
1.4. To do mutation
It mutates the offspring to facilitate the exploration of the search space.
1.5. To evaluate
It evaluates the fitness of the new rules.
1.6. To update the population
It selects the best rules from the parent population and the offspring, keeping the population size constant with the best rules for the next generation.
1.7. IF the algorithm has more generations to run,
GOTO step 1.2 and iterate the next generation;
ELSE
CONTINUE with step 1.8.
1.8. To select the best rule and append it to the rule set
It selects the best rule from the population using the fitness function and appends it to the rule set.
1.9. To remove the instances covered by the rule
Instances covered by the rule are removed from the training examples so that new rules can be learned on the remaining instances.
END WHILE
2. To return the rule set
END
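As a compact sketch of this sequential-covering loop in Python (the rule representation, fitness function and genetic operators are simplified stand-ins for the grammar-based components; nothing here reproduces ICRM2 itself):

```python
# Simplified sequential-covering loop in the spirit of the GBGP pseudocode.
# A rule is a dict {attribute_index: required_value}.
import random

random.seed(0)

def covers(rule, x):
    return all(x[a] == v for a, v in rule.items())

def fitness(rule, data):
    """Laplace-corrected precision of the rule for the positive class (step 1.5)."""
    covered = [(x, y) for x, y in data if covers(rule, x)]
    tp = sum(1 for _, y in covered if y == 1)
    return (tp + 1) / (len(covered) + 2)

def random_rule(n_attrs, values):
    return {a: random.choice(values) for a in random.sample(range(n_attrs), 2)}

def crossover(r1, r2):                      # step 1.3: mix parent conditions
    child = dict(r1)
    child.update({a: v for a, v in r2.items() if random.random() < 0.5})
    return child

def mutate(rule, n_attrs, values):          # step 1.4: perturb one condition
    rule = dict(rule)
    rule[random.randrange(n_attrs)] = random.choice(values)
    return rule

def learn_rules(data, n_attrs, values, max_rules=5, pop=30, gens=20):
    rules, remaining = [], list(data)
    while remaining and len(rules) < max_rules:                        # step 1
        population = [random_rule(n_attrs, values) for _ in range(pop)]  # 1.1
        for _ in range(gens):                                          # 1.7
            parents = random.sample(population, 2)                     # 1.2
            child = mutate(crossover(*parents), n_attrs, values)       # 1.3, 1.4
            population.append(child)
            population.sort(key=lambda r: fitness(r, remaining), reverse=True)
            population = population[:pop]                              # 1.6
        best = population[0]                                           # 1.8
        rules.append(best)
        remaining = [(x, y) for x, y in remaining if not covers(best, x)]  # 1.9
    return rules                                                       # step 2

# Tiny synthetic data set: 3 discrete attributes, binary class.
data = [([random.choice("abc") for _ in range(3)], random.randint(0, 1))
        for _ in range(100)]
print(learn_rules(data, n_attrs=3, values=list("abc")))
```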
References
AHA, D. and D. KIBLER (1991) Instance-based learning algorithms, Machine Learning, 6, 37-66.
ANTUNES, C. (2010) Anticipating students' failure as soon as possible, in Handbook of Educational Data Mining, CRC Press, 353-364.
The authors
Cristobal Romero
Cristóbal Romero received the BSc and PhD Degrees in
computer science from the University of Granada,
Granada, Spain, in 1996 and 2003, respectively. He is
currently an Associate Professor in the Department of
Computer Science and Numerical Analysis, University of
Cordoba, Spain. He has authored or co-authored more than
50 international publications, 20 of them published in
international journals. He is a member of the Knowledge
Discovery and Intelligent Systems Research Laboratory,
and his research interests include applying data mining in
e-learning systems. He is a member of the IEEE Computer
Society, the International Educational Data Mining
(EDM) Working Group, and the steering committee of the
EDM Conferences.
Carlos Márquez-Vera
Carlos Márquez-Vera is a Professor in the Academic Preparatory Unit of the Autonomous University of Zacatecas, Mexico. He received the MSc Degree in Physics Education from the University of Havana, Cuba, in 1997. He is currently a PhD student at the University of Córdoba, Spain, and his
research interests lie in Educational Data Mining.
Alberto Cano
Alberto Cano was born in Cordoba, Spain, in 1987. He is
currently an Assistant Professor in the Department of
Computer Science at the Virginia Commonwealth University,
USA, where he heads the High-Performance Data Mining
Lab. He was previously a researcher at the University of
Córdoba, Spain, as a member of the Knowledge Discovery
and Intelligent Systems research group. His research is
focused on soft computing, machine learning, data mining,
general-purpose computing on graphics processing units
(GPGPU), and parallel computing.
Sebastian Ventura
Sebastián Ventura was born in Córdoba, Spain, in 1966. He
received the BSc and PhD Degrees from the University of
Cordoba, in 1989 and 1996, respectively. He is currently
Associate Professor in the Department of Computer
Science and Numerical Analysis, University of Cordoba,
where he heads the Knowledge Discovery and Intelligent
Systems Research Laboratory. He is the author or co-author of more than 90 international publications, 35 of
which have been published in international journals. He
has also been engaged in 12 research projects (being the
coordinator of three of them) supported by the Spanish
and Andalusian governments and the European Union,
concerning several aspects of the area of evolutionary
computation, machine learning, and data mining and its
applications. His current main research interests are in the
fields of soft computing, machine learning, and data mining
and its applications.