
Title
Computational methods for identifying and classifying questions in online collaborative learning discourse of Hong Kong students

Advisor(s)
Law, NWY; Chan, KP

Author(s)
Wong, On-wing

Issued Date
2013

URL
http://hdl.handle.net/10722/188758

Rights
The author retains all proprietary rights (such as patent rights) and the right to use in future works.

Computational methods for identifying and classifying questions in online collaborative learning discourse of Hong Kong students

by

WONG On Wing

Degree of Master of Philosophy

January 2013

Abstract of thesis entitled

Computational methods for identifying and classifying questions in online collaborative learning discourse of Hong Kong students

Submitted by

WONG On Wing

for the degree of Master of Philosophy
at The University of Hong Kong
in January 2013

This study aims to investigate automated question detection and classification methods to support teachers in monitoring the progression of discussion in the Computer-Supported Collaborative Learning (CSCL) discourse of Hong Kong students. Questioning is an important component of CSCL. Through the analysis of question types in CSCL discourse, teachers can get a general idea of how an inquiry is constructed. This study is an attempt to take up this time-consuming task of question classification with techniques developed in machine learning. In general, the performance of machine learning algorithms improves as the amount of empirical training data increases, making the amount of training data a determining factor in their performance. Machine-learning-based question classification algorithms may therefore not be able to detect question types for which only a small amount of training data is available. In order not to miss such questions, an extra step to detect the occurrence of all question types might be needed.

One Chinese and one English dataset were collected from an online discussion platform. These datasets were selected for comparing the performance of question detection and classification in the two languages, and a sentence is defined as the unit of analysis. Question detection is the process of distinguishing questions from other types of discourse act. A hybrid method is proposed that combines the rule-based question mark method and the machine-learning-based syntax method for question detection. This method achieves a 94.8% f1-score and 98.9% accuracy in English question detection, and a 94.8% f1-score and 93.9% accuracy in Chinese question detection.

While question detection focuses mainly on the identification of questions, question classification concentrates on the categorization of questions. The literature shows that the tree kernel method is almost a standard method for question classification. The classification of English verification and reason questions using the tree kernel method both attained f1-scores above 80%. Although the precision of Chinese question classification using the same settings remains at a similar level, the recall drops greatly. This result indicates that the syntax-based tree kernel method may not be appropriate for classifying Chinese questions. In order to improve the Chinese question classification result, Case-Based Reasoning (CBR) is introduced. CBR is a method that retrieves from a database the example case(s) sharing the maximum similarity with the test case. In this study, similarity is measured by the lexemes that compose a question. Although the implementation of the CBR method improves the recall, it also causes a great drop in precision. Considering the high precision of the tree kernel method and the wide coverage of the CBR method, a hybrid method is proposed to combine the two. The experimental results show that the f1-score of the hybrid method for multi-class classification surpasses those of the tree kernel and CBR methods. This indicates that the hybrid method can generally improve the result of Chinese question classification. (Word Count: 477)

Computational methods for identifying and classifying questions in online collaborative learning discourse of Hong Kong students

by

WONG On Wing
B.Sc. H.K.B.U.; M.Sc. H.K.U.

A thesis submitted in partial fulfilment of the requirements for the Degree of Master of Philosophy
at The University of Hong Kong
January 2013

Declaration

I declare that this thesis represents my own work, except where due acknowledgement is made, and that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualification.

Signed ___________________________
WONG On Wing

Acknowledgements

This thesis is dedicated to my beloved parents, sisters and husband. Thank you very much for walking side by side with me through this exciting journey. This thesis could not have been completed without their countless care and support.

I would like to take this opportunity to thank my supervisor, Prof. Nancy Law, for her continued support during all these years of study. She pushed me to surpass myself, and I learnt a lot from the intellectual conversations with her.

In addition, I would also like to thank my co-supervisor, Dr. K. P. Chan, for his generous comments and suggestions on my study. He led me to an advanced level of computer science research.

Last but not least, I need to thank all my friends and colleagues for their support. All of you are the angels of my life. I am especially indebted to Dr. Emily Oon for her encouragement and sincerity throughout my research. No matter how things go, I will always remember all of your help.

Thanks be to God for making this miracle happen.


Table of Contents
Declaration ..................................................................................................... i
Acknowledgements ........................................................................................ ii
Table of Contents ........................................................................................... iii
List of Figures ................................................................................................ vi
List of Tables ................................................................................................. vii

Chapter 1  INTRODUCTION ........................................................................ 1
1.1  The need for an automated approach for analyzing CSCL discourse .... 1
1.2  Computational Methods for question detection ..................................... 2
1.3  Computational Methods for question classification ............................... 4
1.4  Taxonomy for analyzing questions in CSCL .......................................... 6
1.5  Additional challenges to computational analysis of CSCL discourse
     written by Hong Kong students ............................................................... 7
1.6  Research Questions ................................................................................. 8
1.7  Definition of terms .................................................................................. 8

Chapter 2  LITERATURE REVIEW ............................................................. 10
2.1  Computational Methods for detection of questions in CSCL
     discourse .................................................................................................. 10
2.2  Computational Methods for question classification ............................... 12
2.3  Case-based reasoning ............................................................................. 15
2.4  Taxonomy for analyzing questions in CSCL .......................................... 17

Chapter 3  METHOD .................................................................................... 23
3.1  Selection of dataset ................................................................................. 23
3.2  Data format and data preparation ........................................................... 24
3.3  Manual question classification ............................................................... 25
3.4  Automated Question Detection .............................................................. 26
3.5  Automated Question Classification ........................................................ 30
3.5.1  Syntax-based Tree kernel method ....................................................... 30
3.5.2  Lexeme-based Case-based reasoning .................................................. 32
3.5.3  Hybrid method for question classification .......................................... 33

Chapter 4  RESULTS ..................................................................................... 35
4.1  Results of manual question detection and question classification ......... 35
4.2  Results of automated question detection ................................................ 37
4.2.1  Result of English question detection ................................................... 37
4.2.2  Result of Chinese question detection .................................................. 38
4.3  Results of automated question classification ......................................... 40
4.3.1  Question classification using tree kernel method ................................ 40
4.3.2  Question classification using Case-based Reasoning method ............ 44
4.3.3  Question classification using the hybrid method ................................ 46

Chapter 5  DISCUSSION .............................................................................. 49
5.1  Hybrid method for question detection .................................................... 49
5.1.1  Hybrid method for English question detection ................................... 50
5.1.2  Hybrid method for Chinese question detection ................................... 55
5.1.3  A comparison of Chinese and English questions ................................ 57
5.2  Syntax-based tree kernel method for question classification ................. 62
5.2.1  Non-fact oriented question types ......................................................... 63
5.2.2  English questions by non-native English speakers .............................. 66
5.2.3  Chinese question classification ........................................................... 69
5.3  Lexeme-based case-based reasoning for Chinese question
     classification ............................................................................................ 72
5.4  Hybrid method for Chinese question classification ................................ 76
5.5  Question taxonomy ................................................................................. 78
5.5.1  The reliability of question taxonomy for automatic question
       classification ......................................................................................... 78
5.5.2  The distribution of questions in Chinese and English dataset ............. 79

Chapter 6  IMPLICATIONS, RECOMMENDATIONS, LIMITATION AND
CONCLUSION .............................................................................................. 83
6.1  Main Findings ......................................................................................... 83
6.2  Implication and Recommendation .......................................................... 86
6.3  Limitation of this study ........................................................................... 87
6.4  Future Work ............................................................................................ 88
6.5  Conclusion .............................................................................................. 89
6.6  Implication and Recommendation .......................................................... 90
6.7  Limitation of this study ........................................................................... 92
6.8  Future Work ............................................................................................ 92
6.9  Conclusion .............................................................................................. 93

Appendix I ...................................................................................................... 95
List of lexemes with Information Gain higher than or equal to 0.011 ........... 95
References ...................................................................................................... 97

List of Figures
Figure 1. The procedure of the hybrid method for question classification ... 34
Figure 2. Parse tree of question "Why?" ....................................................... 52
Figure 3. Parse tree of question "Teapot?" .................................................... 53
Figure 4. Parse tree of question "CFC lead to global warming?" ................. 53
Figure 5. Parse tree of question "money comes from where?" ..................... 54
Figure 6. Parse tree of attribute question "Can u tell me where have many
wind?" ............................................................................................................ 65
Figure 7. Parse tree of attribute question "Where are the rubbish?" ............. 65
Figure 8. Common subset trees of attribute questions "Can u tell me where
have many wind?" and "Where are the rubbish?" ......................................... 65
Figure 9. Parse tree of others question "Can u explain?" .............................. 65
Figure 10. Common subset trees of attribute question "Can u tell me where
have many wind?" and elaboration question "Can u explain?" ..................... 66
Figure 11. Parse tree of procedure question "How can we police …?" ......... 68
Figure 12. The correct parse tree of question "How can we police …?" ....... 68
Figure 13. Parse tree of procedure question "How we protect the
environment?" ................................................................................................ 69
Figure 14. The correct parse tree of question "How can we police …?" ....... 69
Figure 15. Direct translation of the question "Hong Kong … police …?" .... 69
Figure 16. Parse tree of reason question "…?" .............................................. 71
Figure 17. Parse tree of reason question "…?" .............................................. 71
Figure 18. Parse tree of sentence "…" ........................................................... 72
Figure 19. Syntactic structure of question "…?" ........................................... 74
Figure 20. Parse tree of attribute question "…?" ........................................... 75
Figure 21. Parse tree of attribute question "…?" ........................................... 75

List of Tables
Table 1. Patterns for question types (Lee et al., 2000) .................................. 13
Table 2. Categories of Question (Graesser & Person, 1994, p. 111) ............. 19
Table 3. Question categorization used in the present study based on the
purpose of the questions in inquiry ................................................................ 20
Table 4. Distribution of Questions in English dataset ................................... 36
Table 5. Distribution of Questions in Chinese dataset ................................... 36
Table 6. English question detection results with QM, Syntax and
QM+Syntax methods ..................................................................................... 37
Table 7. Chinese question detection results with QM, Syntax and
QM+Syntax methods ..................................................................................... 38
Table 8. Ten-fold stratified cross-validation result of English verification
question using tree kernel method ................................................................. 41
Table 9. Ten-fold stratified cross-validation result of English reason
question using tree kernel method ................................................................. 41
Table 10. Ten-fold stratified cross-validation result of English attribute
question using tree kernel method ................................................................. 41
Table 11. Ten-fold stratified cross-validation result of Chinese verification
question using tree kernel method ................................................................. 42
Table 12. Ten-fold stratified cross-validation result of Chinese reason
question using tree kernel method ................................................................. 43
Table 13. Ten-fold stratified cross-validation result of Chinese attribute
question using tree kernel method ................................................................. 43
Table 14. Average of the ten-fold stratified cross-validation result of
multi-class classification using tree kernel method ....................................... 43
Table 15. Ten-fold stratified cross-validation result of Chinese verification
question using CBR method .......................................................................... 45
Table 16. Ten-fold stratified cross-validation result of Chinese reason
question using CBR method .......................................................................... 45
Table 17. Ten-fold stratified cross-validation result of Chinese attribute
question using CBR method .......................................................................... 46
Table 18. Ten-fold stratified cross-validation result of multi-class
classification using CBR ............................................................................... 46
Table 19. Ten-fold stratified cross-validation results of Chinese verification
question with tree+CBR method ................................................................... 47
Table 20. Ten-fold stratified cross-validation results of Chinese reason
question with tree+CBR method ................................................................... 47
Table 21. Ten-fold stratified cross-validation results of Chinese attribute
question with tree+CBR method ................................................................... 48
Table 22. Ten-fold stratified cross-validation results of multi-class
classification by feeding only the negative returns in all tree kernels to
CBR method as input ..................................................................................... 48
Table 23. Classification result of English Verification, Reason and
Attribute questions ......................................................................................... 64
Table 24. Classification result of Chinese verification, reason and attribute
questions with tree kernel method ................................................................. 70
Table 25. A comparison of the question classification result by CBR and
tree kernel method ......................................................................................... 73
Table 26. A comparison of the tree kernel, CBR and tree kernel + CBR
method for question classification ................................................................. 76
Table 27. A comparison of the multi-class classification performance with
tree kernel, CBR and tree+CBR method ....................................................... 77

Chapter 1  INTRODUCTION

1.1 The need for an automated approach for analyzing CSCL discourse

This study aims to investigate automated question detection and classification methods to support teachers in monitoring the progression of discussion in the Computer-Supported Collaborative Learning (CSCL) discourse of Hong Kong students. The increasing popularity of CSCL in education has posed a great challenge for teachers in keeping track of the large amounts of postings submitted by students for monitoring and facilitation purposes. Teachers currently have to read through all of the students' postings to gain an understanding of how students' online discussions progress. This method may provide an in-depth understanding of students' discourse, but teachers may not have sufficient time to read through students' postings in detail, and the excessive time spent may even adversely affect the overall quality of teaching if it leaves teachers insufficient time to attend to their other responsibilities.

A number of tools have been developed in recent years to monitor students' progress in online discussion. The Analytic Toolkit (ATK) (Burtis, 1998) is one example. ATK contains a set of measurements developed by the Knowledge Building team at the University of Toronto for capturing the level of participants' engagement through their contributions (i.e. notes, keywords, scaffolds, references, etc.) on Knowledge Forum (Burtis, 1998). PolyCAFe (Rebedea, Dascalu, & Trausan-Matu, 2011) is another system, equipped with a social network analysis module that allows tutors to assess the social perspective of an online conversation. These two examples make use of participatory statistics for the assessment of students' progress. However, tool development research in CSCL is not limited to quantitative analysis. TagHelper (Rosé et al., 2008) is an attempt to leverage the success of natural language processing techniques for assessing online conversation. With the same orientation as TagHelper, the focus of this study is on automated methods for the identification and classification of questions in online collaborative learning discourse, to cater for teachers' need to formulate appropriate pedagogical facilitation decisions by monitoring the progress of online discourse without having to spend excessive time in reading.

Progressive inquiry is one of the core ideas of CSCL, and an essential component of a genuine progressive inquiry is the setting up of an inquiry question (Hakkarainen, 2003). Questions in inquiry may serve the functions of goal setting, guiding cognitive processing, activating prior knowledge, focusing attention, promoting cognitive monitoring, and promoting displays of knowledge (Burbules, 1993, in Hmelo-Silver & Barrows, 2008). Through the analysis of the types of question in CSCL discourse, teachers can get a general idea of the way an inquiry is constructed and what constitutes an inquiry. This study is an attempt to take up this time-consuming task of question classification with techniques developed in machine learning. The basic idea of machine learning is to learn from experience, or more precisely from empirical data. This learning process will be referred to as training in the remainder of this thesis. In general, the performance of machine learning algorithms improves as the amount of empirical training data increases, making the amount of training data a determining factor in an algorithm's performance. However, the distribution of questions with different functions might be uneven in CSCL discourse (Hmelo-Silver & Barrows, 2008). A machine-learning-based question classification algorithm may not be able to detect question types with small amounts of training data. In order not to miss such questions, an extra step to detect the occurrence of all question types might be needed.

1.2 Computational Methods for question detection

Question detection is the task of identifying the presence of questions in discourse. Precision, recall, f1-score and accuracy are commonly used in question detection research as performance metrics. Precision is the percentage of items classified as positive that are actually positive. Recall is the percentage of positive items which are correctly classified as positive. F1-score is the harmonic mean of precision and recall. Accuracy is the percentage of correctly classified items, both positive and negative (Forman, 2003).
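Since these four metrics recur throughout the thesis, a minimal sketch of their computation from binary confusion-matrix counts may make the definitions concrete; the counts in the example are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from binary confusion-matrix counts:
    true/false positives (tp, fp) and false/true negatives (fn, tn)."""
    precision = tp / (tp + fp)   # of the items labelled positive, the truly positive share
    recall = tp / (tp + fn)      # of the truly positive items, the share that was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all items labelled correctly
    return precision, recall, f1, accuracy

# Hypothetical counts: 90 questions found correctly, 10 false alarms,
# 5 questions missed, 895 non-questions correctly passed over.
print(classification_metrics(90, 10, 5, 895))  # (0.9, ~0.947, ~0.923, 0.985)
```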

Research on question detection is scarce in the field of natural language processing. One possible reason is that the room for improvement is relatively small compared to other natural language processing problems. The experiment by Li, Si, Lyu, King, & Chang (2011) showed that a rule-based method using the question mark alone as an indicator can detect 84.6% of the questions in an hour's worth of English tweets on Twitter (2,045 tweets), with 96.9% of the detections being correct. With such a good question detection result, one might question why further work on question identification methods for CSCL discourse is needed. The foremost reason is that a large portion of questions in online discussion might not end with a question mark (Cong, Wang, Lin, Song, & Sun, 2008). This phenomenon has prompted researchers to turn to other directions for question detection, namely machine learning approaches and interaction-based approaches.
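A minimal sketch of the question mark rule just discussed is given below; the inclusion of the full-width mark "？" is an assumption made here to cover Chinese text, not a detail reported by Li et al. (2011). The example sentences are taken from the figure captions of this thesis.

```python
QUESTION_MARKS = ("?", "？")  # ASCII mark plus the full-width mark used in Chinese text

def is_question_by_rule(sentence: str) -> bool:
    """Rule-based detection: treat a sentence as a question iff it ends
    with a question mark once trailing whitespace is removed."""
    return sentence.rstrip().endswith(QUESTION_MARKS)

print(is_question_by_rule("CFC lead to global warming?"))  # True
print(is_question_by_rule("money comes from where"))       # False: a real question the rule misses
```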

Both machine learning and rule-based approaches require the generation of rules. The only difference is that in the machine learning approach, the rules used for classifying questions are not explicitly set by human experts but generated through a training process. It is found in the literature that the use of syntactic patterns for question classification can yield around 90% precision (Wang & Chau, 2010). In theory, if a sufficient amount of training data is provided, the machine learning approach would have the capability to detect most types of questions in discourse; however, coded data in real-world contexts, especially in Hong Kong, is scarce. The interaction-based approach, as a third approach, differs from the above two. It does not rely on the linguistic characteristics of questions but on the behavioral patterns of the different user groups making the postings. As shown in some research (Hong & Davison, 2009), people who newly join a group have a higher tendency to ask questions than other types of group members. This approach is quite attractive, as it requires the least natural language processing effort, but research on the relationship between students' behavior and their tendency to ask questions is relatively limited in the CSCL context. Hence, it is still uncertain whether the interaction-based approach can be implemented for analyzing CSCL discourse.

Based on the satisfactory performance of question detection by the rule-based question mark method and the machine-learning-based syntactic analysis method reported in the literature, it is highly possible that a hybrid method combining the two approaches would further improve the performance of question detection. It is a goal of this study to experiment with this hybrid method to see if it can function effectively in the Hong Kong context. The term "effectively" here refers to reasonable scores for precision, recall, f1-score and accuracy.
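The hybrid logic itself is simple, and a sketch may help fix the idea. The `syntax_classifier` object below is hypothetical: any trained model exposing a boolean prediction, such as the SVM over syntactic parses proposed in this study.

```python
def detect_question(sentence, syntax_classifier):
    """Hybrid detection: apply the cheap question mark rule first, and
    pass only the sentences it misses to a trained syntax-based
    classifier.

    `syntax_classifier` is hypothetical here: any object exposing a
    predict(sentence) -> bool method, e.g. an SVM trained on
    parse-tree features.
    """
    if sentence.rstrip().endswith(("?", "？")):
        return True
    return syntax_classifier.predict(sentence)
```

Applying the rule first preserves its near-perfect precision, while the classifier recovers unmarked questions; this ordering is the intuition behind the combination evaluated in Chapter 4.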

1.3 Computational Methods for question classification

The techniques for automatic question classification have a long history, arising from the development of question-answering (QA) systems. The history of QA systems dates back to 1959 (Simmons, 1965), and the establishment of the TREC question-answering track in 1999 marked another milestone in question-answering history. Question answering tasks involve the classification of question types and the retrieval of answers from databases or the internet. Since the strategy in a QA system is to classify questions based on their answer types, this type of technique is often referred to as answer type identification (Prager, Chu-Carroll, & Czuba, 2002). The early approaches to question type classification relied on hand-crafted rule-matching mechanisms (Hull, 1999; Lee, Oh, Huang, Kim, & Choi, 2000; Pasca & Harabagiu, 2001; Prager, Radev, Brown, Coden, & Samn, 1999). These approaches categorize questions based on rules set by human experts. However, natural language is highly ambiguous, and it is quite unlikely, even for experts, to exhaustively generate all the classification rules in advance. This limitation has moved research on question classification from rule-based approaches to machine learning approaches (Bloehdorn & Moschitti, 2007; Bu, Zhu, Hao, & Zhu, 2010; Carlson, Cumby, Rosen, & Roth, 1999; Day, Ong, & Hsu, 2007; Hakan, 2007; X. Li & Roth, 2002; Suzuki, Taira, Sasaki, & Maeda, 2003; D. Zhang & Lee, 2003).

A question classifier is an algorithm that assigns questions to different question categories. Common types of question classifier include Nearest Neighbor (NN), Naïve Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW) and Support Vector Machine (SVM), with SVM being the most popular. SVM is a binary classification approach that uses a hyperplane, i.e. a plane in a high-dimensional space, to separate data points in feature space. The SVM aims at finding the best hyperplane that separates the training data. In cases where the data points cannot be separated by a linear hyperplane, they need to be transformed into a high-dimensional feature space by a mapping function. A kernel function implicitly transforms the problem into a much higher-dimensional space, so that it is more likely that a hyperplane separating the two classes can be found. The experiment by Zhang & Lee (2003) demonstrated that SVM with a tree kernel yields the best question classification results. As with the question detection problem, precision, recall, f1-score and accuracy are the metrics commonly used to measure classification performance in question classification tasks.
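The separability argument above can be seen in a toy example. The sketch below uses scikit-learn, an assumption of this illustration rather than the toolkit of the study, to fit a linear and a kernelized SVM on XOR-patterned points that no single hyperplane in the input space can separate.

```python
from sklearn.svm import SVC

# XOR-patterned toy data: not separable by any line in the 2-D input space.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print(linear.score(X, y))  # below 1.0: no separating hyperplane exists here
print(rbf.score(X, y))     # 1.0: the kernel implicitly maps the points to a
                           # higher-dimensional space where one does exist
```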

Besides the issue of choosing a question classifier, the question taxonomy used is another crucial concern in question classification. The taxonomies for question classification are mostly derived from analyzing the types of answer which questions may elicit (Harabagiu et al., 2000). This way of classification is fundamentally different from the way adopted in CSCL, where the classification of questions is intended to help teachers or researchers understand how students are actually engaged in collaborative learning activities. Commonly used question classification frameworks in CSCL may reveal the level of epistemology (Hakkarainen, 2003), socio-cognitive engagement (Scardamalia & Bereiter, 1991), depth of reasoning (Graesser & Person, 1994), or group dynamics (Hmelo-Silver & Barrows, 2008). As not all classification schemes lend themselves equally to computational linguistic analysis, an investigation into which question classification framework is better suited for automatic analysis of students' questions in CSCL discourse is needed.

1.4 Taxonomy for analyzing questions in CSCL

The performance of a question classifier depends on the question taxonomy adopted. There are three criteria for the selection of a suitable question taxonomy: 1) the relevance of the taxonomy to inquiry, 2) the reliability of content analysis with the taxonomy by independent human coders (Landis & Koch, 1977), and 3) the likelihood that the questions in the taxonomy can be differentiated by computational linguistic algorithms.

The first criterion concerns the meaningfulness of the question taxonomy for understanding the progression of inquiry. If the taxonomy is irrelevant to the theory of collaborative learning, it should not be chosen. The second criterion considers the reliability of the taxonomy. Cohen's Kappa coefficient (Cohen, 1960) is one of the metrics of inter-coder reliability for question taxonomies with nominal-scale variables. It measures the agreement between two coders while taking into account agreement by chance. High inter-coder reliability ensures that the taxonomy is well defined, and can be meaningfully and consistently operationalized at least by humans. The last criterion focuses on the linguistic features of the question categories. Since automatic question classification is based on syntactic and/or semantic structure, it is probably impossible for an algorithm to yield good classification results if the linguistic features of the question categories are not distinctive. To date, most of the question classification frameworks adopted by the CSCL community pay attention mainly to the first and second criteria, and work focusing on the third criterion is relatively rare. It is believed that the third criterion deserves more attention, as an advancement of understanding on this criterion would facilitate the development of computational tools for investigating CSCL discourse and, as a result, increase the productivity of CSCL.

1.5 Additional challenges to computational analysis of CSCL discourse written by Hong Kong students

Chinese and English were legislated as the official languages of Hong Kong in 1974 under the Official Languages Ordinance (Man, 2006). However, the majority of the Hong Kong population uses Cantonese as their usual language (Census and Statistics Department, 2012). Most students have limited exposure to English outside formal schooling, and they consider their knowledge of English irrelevant to their first-language knowledge (Law, 2005), that is, Chinese, or more precisely Cantonese. Question detection and classification algorithms derived from native English speakers' discourse may therefore not function satisfactorily in a CSCL environment where most participants are non-native English speakers.

Besides, language use in online discourse is usually casual and informal (Cong et al., 2008), and hence more likely to be ungrammatical and colloquial. Code-mixing is one of the most significant phenomena found in oral conversations in Hong Kong. Code-mixing refers to the alternation of languages within a sentence or a clause (Li, 2008), as in "… send … file …?" ("Can you send the file to me?"), where English terms denote the important ideas in a Cantonese sentence. Since most algorithms for question detection and classification are designed to analyze only a single language, the effectiveness of these algorithms for analyzing discourse containing different languages has to be explored.

1.6 Research Questions

As mentioned, natural language processing techniques for question detection and classification have been well studied for the English language, but similar work on the Chinese language in the Hong Kong context is lacking. Research on the application of question detection and classification techniques to CSCL discourse data is also scarce. Tools for question detection and classification would help teachers to quickly get a general idea of the progression of inquiry and to make corresponding adjustments in their scaffolding. The present study aims to address these concerns. The research questions are as follows:

1. Is a hybrid method combining the rule-based question mark method and the machine-learning-based syntactic parsing method an effective method for detecting questions in the CSCL discourse of Hong Kong students?

2. Is the tree kernel method an effective method for classifying questions in the CSCL discourse of Hong Kong students?

3. Is a hybrid method combining the lexeme-based case-based reasoning method and the syntax-based tree kernel method an effective method for classifying Chinese questions in the CSCL discourse of Hong Kong students?

4. Is a question taxonomy based on the functions of questions in inquiry suitable for the automatic classification of questions in the CSCL discourse of Hong Kong students?

1.7 Definition of terms

Below is a list of terms which will be used throughout this thesis:

Accuracy: the percentage of correctly classified items, both positive and negative (Forman, 2003).

F1-score: the harmonic mean of precision and recall (Forman, 2003).

Feature: in classification, a feature is a characteristic of an object which distinguishes it from other objects.

Information gain: a metric measuring the reduction of uncertainty after a feature is taken into consideration for the classification task.

Machine learning: an approach in which a program is said "to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (Mitchell, 1997, p. 2).

Precision: the percentage of items classified as positive that are actually positive (Forman, 2003).

Question classifier: an algorithm that classifies questions into different categories of questions.

Syntactic parsing: the process of retrieving the syntactic structure of a sentence.

Chapter 2  LITERATURE REVIEW

2.1 Computational Methods for detection of questions in CSCL discourse

Questions are a fundamental component of the human cognitive process (Graesser, Baggett, & Williams, 1996). They are ubiquitous in social interaction, serving functions that range from interrogation and clarification to offers, requests and challenges (Enfield, Stivers, & Levinson, 2010). Moreover, questions are mostly stated in informal language and embedded in various descriptive sentences (Wang & Chau, 2010). These characteristics increase the difficulty of detecting questions in discourse arising from social interaction. Three main types of machine-based question detection approach can be identified from the literature.

The first type is the hand-crafted rule-based approach, which matches the input data against a predefined set of rules established by human experts. The question mark is usually used as a marker for rule-based detection of questions. The rule-based approach is easy to implement and requires no algorithm training process. It is especially suitable for situations where training data is limited. Cong et al. (2008) and Li et al. (2011) used the question mark as an indicator for detecting questions in their online forum and Twitter datasets respectively, and both achieved a precision of around 97%. However, the robustness of the rule-based approach is not guaranteed (Wang & Chau, 2010): the experimental results on question detection with rule-based methods reported in the literature vary greatly. Hong & Davison (2009) reported a precision of only 65% when using the question mark to detect questions in their Ubuntu Linux Community Forum dataset. These results tell us that extra measures are needed to improve the reliability of the rule-based approach.

Another type is the interaction-based approach, in which questions are identified not from the content of the postings but from the behavioral patterns of the different participants contributing to the discussion. Hong & Davison (2009) believed that in the Yahoo! Answers system new users are more inclined to ask questions than to answer them, and hence that questions are more likely to be found in posts made by new users. However, there exists no agreement among the CSCL community on the correlation between the role of users in discussions and the occurrence of questions. Since the validity of the interaction-based approach for detecting questions is not yet confirmed in the CSCL literature, it is not chosen as the approach for detecting questions in CSCL discourse in this study.

The machine learning approach is a third approach to question detection, which may overcome the weakness of the rule-based approach. Machine learning, as a core subarea of artificial intelligence, has the goal of reducing the effort of manually generating rules through a process of learning from the empirical output of manual processing by humans. In machine learning, "a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (Mitchell, 1997, p. 2). Pattern mining, one of the machine learning methods, is the generation of frequent patterns from empirical data. The literature shows that syntactic structure is a reliable pattern that can be detected in naturally occurring discussions. Wang & Chau's experiment (2010) achieved 90.06% precision and 78.19% recall by using syntactic patterns to identify questions in a community-based question-asking system. Syntactic structure is the tree representation of a sentence, showing the relationship between the different lexical items in the sentence.
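As an illustration of what such patterns look like, the sketch below (assuming the nltk library, not necessarily the toolchain of this study) reads a plausible Stanford-style bracketed parse of the question "Why?" (cf. Figure 2; the exact tree there may differ) and lists its production rules, one simple form of minable syntactic pattern.

```python
from nltk import Tree

# A plausible Stanford-style bracketed parse of the question "Why?";
# the exact tree in the thesis's Figure 2 may differ.
parse = Tree.fromstring("(ROOT (SBARQ (WHADVP (WRB Why)) (. ?)))")

# The tree's production rules are one simple kind of syntactic pattern
# that can be mined across many questions and compared for frequency.
for production in parse.productions():
    print(production)
# ROOT -> SBARQ
# SBARQ -> WHADVP .
# WHADVP -> WRB
# WRB -> 'Why'
# . -> '?'
```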

A robust syntactic analyzer is crucial to the performance of question pattern recognition. The Stanford Syntactic Parser (hereafter Stanford Parser) is an implementation of the Cocke-Younger-Kasami (CYK) parsing algorithm. It is described as an unlexicalized probabilistic context-free grammar (PCFG) parser, as it annotates only the function words (closed-class words) instead of the lexical words. It makes two further adaptations to tackle problems found in the Penn Treebank (Taylor, Marcus, & Santorini, 2003). The first is to annotate each non-terminal node with its parent node, to address the over-generalized part-of-speech tags found in the Penn Treebank. The second is to implement horizontal and vertical markovization to break down complex parses and better represent the probability of the trees, addressing the uneven distribution of trees in the Penn Treebank. The experimental results of Klein & Manning (2003) show that the Stanford Parser can achieve an f1-score of 86.3%. Since the Stanford Parser achieves a satisfactory performance, it is believed that it can also generate the syntactic structure of questions with a similarly high f1-score, and it is therefore chosen as the recognizer of the syntactic patterns of questions in online discussions for this study. Taken together with Wang & Chau's (2010) result above, this indicates that the performance of machine learning using syntactic patterns is comparable to that of the rule-based question mark method for question detection.

The above investigation indicates that the rule-based question mark method and the machine-learning-based syntax method are reliable methods for detecting English questions in online discussion. However, the literature lacks investigation of the performance of the two approaches in detecting questions generated by students who may not be proficient in the languages used. This study aims to verify the effectiveness of these two approaches for question detection in the online discussions of Hong Kong students, in both English and Chinese, and to test whether a hybrid method combining the two can improve the performance of question detection.

2.2 Computational Methods for question classification

Question classification refers to the task of categorizing questions into a predefined set of mutually exclusive question categories using natural language processing techniques. Most methods for question classification found in the literature were developed for analyzing the fact-oriented questions formulated in question-answering tasks, with the aim of assisting the process of automated answer generation. Thus, questions are mainly categorized according to the type of answer they solicit. The early approaches used rule-based pattern-matching methods to classify questions. Table 1 lists an example of question types with their corresponding patterns (Lee et al., 2000). From these patterns, we can observe that the category of a question is determined by a combination of the wh-word and the predicate of the question.

Table 1. Patterns for question types (Lee et al., 2000)

Question Type    Patterns
Person           Who, Who's, Whom~, ~man's name, ~woman's name
Location         Where~, What + location (city, country)~, In what + location (city, country)~, What nationality~
Organization     What company~, What institution~
Time             When~, What time~, How many + Time (years, months, days)~
Currency         How much~ spend, rent, cost, money, price~
Measure          How much, How many, How + Adjective
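A hedged sketch of how such patterns can be operationalized is given below; the regular expressions are illustrative renderings of a few rows of Table 1, not Lee et al.'s original rules.

```python
import re

# Illustrative renderings of a few Table 1 rows as regular expressions.
PATTERNS = [
    ("Person",   re.compile(r"^\s*who(m|'s)?\b", re.I)),
    ("Location", re.compile(r"^\s*(where|(in\s+)?what\s+(city|country)|what\s+nationality)\b", re.I)),
    ("Time",     re.compile(r"^\s*(when|what\s+time|how\s+many\s+(years|months|days))\b", re.I)),
    ("Currency", re.compile(r"^\s*how\s+much\b.*\b(spend|rent|cost|money|price)", re.I)),
]

def classify_by_rule(question: str) -> str:
    """Return the first question type whose pattern matches, mimicking
    hand-crafted rule matching; anything unmatched falls through."""
    for qtype, pattern in PATTERNS:
        if pattern.search(question):
            return qtype
    return "Unknown"

print(classify_by_rule("Who ran this experiment?"))      # Person
print(classify_by_rule("How much does the flat cost?"))  # Currency
```

Even this small rule set hints at the maintenance burden discussed next: every new question type or phrasing requires another hand-written pattern.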

There are a number of limitations in using a manual process to establish the matching rules that link specific question types with their corresponding word patterns. The first and foremost limitation is that it requires a large amount of human effort to construct the mapping rules. In addition, the representation of the mapping from lexical items to fine-grained question classes may be very complex and difficult to construct manually. Finally, a manually constructed mapping is restricted to a particular domain and cannot be easily transferred to a new context (X. Li & Roth, 2002). Since the machine learning approach requires little human manipulation compared with the rule-based approach, question classification methods have gradually shifted to machine learning approaches.

A question classifier is an algorithm that classifies questions into different question categories. Zhang & Lee (2003) compared the performance of five common classifiers, namely the Nearest Neighbors (NN) algorithm, Naïve Bayes (NB), Decision Tree (DT), Sparse Network of Winnows (SNoW) and Support Vector Machine (SVM), on question classification, and found that SVM outperforms the other classifiers when the training data size exceeds 1,000.
SVM is a binary linear classifier that uses a maximum-margin hyperplane to separate data points in feature space. However, most situations in real-world contexts are not linearly separable, so the input space needs to be converted into a linearly separable high-dimensional feature space by a mapping function $\phi$. If the dimension is very high, it is computationally infeasible to generate the feature vectors explicitly (Suzuki et al., 2003). A kernel function, as defined in equation (1), can be used to avoid this problem:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \qquad (1)$$

The kernel function allows the incorporation of prior knowledge into the metric measuring the similarity between two data points without creating an explicit numeric feature space. Among the different types of kernel functions, the tree kernel is the one that yields the best results in question classification: a comparison by Zhang & Lee (2003) shows that the tree kernel gives better question classification performance than the bag-of-words and n-gram linear kernels.
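To make the mechanics concrete, the sketch below pairs scikit-learn's SVC with a deliberately simplified tree kernel that counts matching production rules between parse trees; true subset-tree kernels of the kind used by Zhang & Lee (2003) count common tree fragments recursively, but they plug into the SVM in the same way. The parses and labels here are toy assumptions, not data from this study.

```python
import numpy as np
from collections import Counter
from nltk import Tree
from sklearn.svm import SVC

def production_kernel(t1, t2):
    """A simplified tree kernel: the inner product of the two trees'
    production-rule count vectors (a valid kernel, though cruder than
    the subset-tree kernels used in the question classification
    literature)."""
    c1, c2 = Counter(t1.productions()), Counter(t2.productions())
    return sum(c1[p] * c2[p] for p in c1 if p in c2)

trees = [Tree.fromstring(s) for s in (
    "(ROOT (SBARQ (WHADVP (WRB Why)) (. ?)))",
    "(ROOT (SBARQ (WHADVP (WRB How)) (. ?)))",
    "(ROOT (S (NP (NN Teapot)) (. ?)))",
    "(ROOT (S (NP (NN Money)) (. ?)))",
)]
labels = [1, 1, 0, 0]  # toy labels: wh-question vs. not

# Gram matrix of pairwise kernel values, passed to the SVM directly in
# place of explicit feature vectors.
gram = np.array([[production_kernel(a, b) for b in trees] for a in trees])
clf = SVC(kernel="precomputed").fit(gram, labels)
print(clf.predict(gram))  # should reproduce the toy labels
```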

In addition to comparing the performance of different kernel functions for SVM, Zhang & Lee (2003) also compared question classification performance between coarse-grained and fine-grained question category definitions. The comparison shows that the trend of improvement for the fine-grained definition with increasing amounts of training data is generally consistent with that for the coarse-grained definition. However, a few points are worth attention. First, the results for the fine-grained question category definition are generally lower than those for the coarse-grained definition. This suggests that question classification technology works best with coarse-grained question category definitions. Second, the performance of coarse-grained question classification saturates at 4,000 training examples, while there is no sign of saturation for the fine-grained definition; it is assumed that the same trend of improvement applies to the fine-grained categories, but that saturation would be reached at a later stage. This result indicates that a coarse-grained question category definition is the better choice for a question classification task with an insufficient amount of training data. Third, CSCL discourse question data generated from authentic learning contexts is not easily available in large quantities for training the algorithm; it is a concern of this study to strike a balance between the size of the training data and the performance of the algorithm. Fourth, the results reviewed above are based on experiments with large English datasets. It is as yet unknown whether the same results can be attained in the classification of Chinese questions. Most Chinese questions are wh-in-situ questions and do not exhibit the wh-movement found in English questions; that is, the wh-element in a Chinese question does not undergo overt movement.

2.3 Case-based reasoning

The syntax-based tree kernel method derived from the classification of English questions may not be appropriate for classifying Chinese questions. In order to tackle this foreseeable problem, there is a need to explore methods which are not syntax-based. Since Chinese questions are wh-in-situ, the lexical items used determine whether a sentence is a question or not. Therefore, a lexeme-based approach, commonly used in the field of natural language processing, is implemented to alleviate the problem of syntax-based approaches for Chinese question classification. As the bag-of-words linear kernel method has been shown to be less effective than the tree kernel method (D. Zhang & Lee, 2003), the present study explores the use of case-based reasoning (CBR), a direct retrieval approach without any generalization of the training data, for Chinese question classification. Aamodt & Plaza (1994) defined case-based reasoning as a paradigm to solve a new problem "by remembering a previous similar situation and by reusing information and knowledge of that situation" (p. 40). A key assumption of CBR is that a problem can be solved by the solution applied to a similar problem. The first case-based reasoning system, the CYRUS question-answering system, was developed by Kolodner (1983). CBR differs from conventional AI approaches in that no generalization of knowledge is carried out.

As CBR was developed in the problem-solving domain, a case in CBR represents a problem situation. Cases which have been remembered are called past cases, and a database of past cases is called the case base. The process of CBR is a cyclic process comprising four main stages (Aamodt & Plaza, 1994):

1. Case retrieval
2. Case reuse
3. Case revision
4. Case retention

Before discussing case retrieval further, we need to understand the distance measures in CBR. A case is represented as an n-dimensional vector, and the distance between two cases is a number in the range [0, 1]. The distance between a past case $p$ and a new case $n$ is calculated by summing the differences between the corresponding attribute values of the two cases and normalizing the result to the interval [0, 1]; the similarity between $p$ and $n$ is its complement (Schmitt & Bergmann, 1999):

$$sim(p, n) = 1 - d(p, n), \qquad \text{where } d(p, n) = \frac{1}{k}\sum_{i=1}^{k} \left| p_i - n_i \right|$$

and $k$ is the number of attributes describing a case. The past case with the smallest distance to the new case is retrieved from the case base.
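A minimal sketch of the retrieve-and-reuse steps may help. Here the cases are questions represented by their lexemes, and similarity is a plain Jaccard overlap, which stands in for (but is not necessarily identical to) the lexeme-based measure used later in this study; the case base is a toy assumption.

```python
def lexeme_similarity(q1, q2):
    """Jaccard overlap between the lexeme sets of two questions."""
    a, b = set(q1), set(q2)
    return len(a & b) / len(a | b)

def retrieve_and_reuse(case_base, new_lexemes):
    """Stages 1-2 of the CBR cycle: retrieve the most similar past case
    and reuse its label; revision and retention would later update the
    case base with feedback from the environment."""
    best = max(case_base, key=lambda c: lexeme_similarity(c["lexemes"], new_lexemes))
    return best["label"]

# A toy case base of past questions (Chinese questions would be stored
# as their segmented lexemes in practice).
case_base = [
    {"lexemes": ["why", "sky", "blue"], "label": "reason"},
    {"lexemes": ["where", "rubbish"],   "label": "attribute"},
]
print(retrieve_and_reuse(case_base, ["why", "sea", "blue"]))  # reason
```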

After a past case is retrieved from the case base, its solution is reused as the solution for the test case. The solution is then applied in the real environment, and feedback is collected from the environment or users for the revision of the case. Depending on the problem descriptor of each algorithm, different information is retained for future retrieval. An advantage of using CBR for question classification over other machine learning approaches is that it requires no generalization over the past cases; therefore, cases without any common characteristics can still be grouped into the same category. This feature is especially important for processing natural language, with its wide range of syntactic, lexical and semantic variation.

2.4 Taxonomy for analyzing questions in CSCL

The question taxonomies for question-answering tasks are constructed to assist the machine in retrieving an answer quickly and accurately from a database. The number of question categories and the organization of the taxonomy vary greatly across different research. Some question taxonomies are organized into multiple levels, in the hope that question classification based on a hierarchical taxonomy would perform better than classification based on a single-level taxonomy (X. Li & Roth, 2002), but to date no supporting evidence has been found for this claim. In addition, there is a great difference in the number of question categories found in different studies, ranging from 32 (Ittycheriah, Franz, & Roukos, 2001) to 150 (Suzuki et al., 2003). There is no standardized taxonomy in the question-answering domain for analyzing questions.

Since the aim of question classification in this study is not to identify the answer type but to classify questions based on their functions in the context of CSCL inquiry, the taxonomies mentioned above do not suit the needs of this study. In CSCL, a genuine inquiry cannot happen without asking questions (Hakkarainen, 1998). Further, an inquiry is not just a question-and-response pair, but a guided process of socio-cognition. Participants take responsibility for their own learning, and questioning is a way for them to make the gaps in their understanding explicit. Since deep-reasoning questions are highly correlated with deeper levels of cognition (Graesser & Person, 1994), the GPH scheme, a question taxonomy proposed by Graesser & Person (1994) for studying the depth of reasoning, is chosen as the question taxonomy of this study. Table 2 lists the question categories in the GPH scheme.

The shortcoming of this scheme is that the question categories are not mutually exclusive (Graesser & Person, 1994): a question may be assigned to one or more question categories. This feature violates the basic assumption in the computation of Cohen's Kappa coefficient that all categories are mutually exclusive. Hence, a refinement of the scheme is needed to eliminate the overlap between categories before it can serve the purpose of the present study. One possible refinement is to group question categories with similar functions into one category. The interpretation, causal antecedent, causal consequence, goal orientation, expectational and enablement questions in the GPH scheme are all explanation-driven questions and are thus grouped into the question category reason. The concept completion, feature specification and quantification questions all solicit the property of a phenomenon and are grouped into the question category attribute. The question category request in the GPH scheme and some other question categories, such as the task-oriented questions (Hmelo-Silver & Barrows, 2008), are not directly related to the progression of inquiry; these types of questions are put into the question category others and are not considered in this present study. Finally, the names of some question categories have been changed for ease of understanding. Table 3 shows the question taxonomy proposed in this study. It is based on the function of questions in CSCL inquiry instead of the factual information that a question solicits. This type of taxonomy is quite different from the ones used in the question classification literature. It awaits validation whether the question classification techniques derived from the classification of factual questions are suitable for the present classification of questions based on their functions in inquiries.

Table 2. Categories of Question (Graesser & Person, 1994, p. 111). Each category is listed with its abstract specification, followed by an example.

Short answer
  Verification: Is a fact true? Did an event occur? Example: Is the answer 5?
  Disjunctive: Is X or Y the case? Is X, Y, or Z the case? Example: Is gender or female the variable?
  Concept completion: Who? What? What is the referent of a noun argument slot? Example: Who ran this experiment?
  Feature specification: What qualitative attribute does entity X have? Example: What are the properties of a bar graph?
  Quantification: What is the value of a quantitative variable? How many? Example: How many degrees of freedom are on this variable?

Long answer
  Definition: What does X mean? Example: What is a t test?
  Example: What is an example label or instance of the category? Example: What is an example of a factorial design?
  Comparison: How is X similar to Y? How is X different from Y? Example: What is the difference between a t test and an F test?
  Interpretation: What concept or claim can be inferred from a static or active pattern of data? Example: What is happening in this graph?
  Causal antecedent: What state or event causally led to an event or state? Example: How did this experiment fail?
  Causal consequence: What are the consequences of an event or state? Example: What happens when this level decreases?
  Goal orientation: What are the motives or goals behind an agent's action? Example: Why did you put decision latency on the y-axis?
  Instrumental/procedural: What instrument or plan allows an agent to accomplish a goal? Example: How do you present the stimulus on each trial?
  Enablement: What object or resource allows an agent to perform an action? Example: What device allows you to measure stress?
  Expectational: Why did some expected event not occur? Example: Why isn't there an interaction?
  Judgmental: What value does the answerer place on an idea or advice? Example: What do you think of this operational definition?
  Assertion: The speaker makes a statement indicating he lacks knowledge or does not understand an idea. Example: I don't understand main effects.
  Request/Directive: The speaker wants the listener to perform an action. Example: Would you add those numbers together?

Table 3. Question categorization used in the present study based on the purpose of the questions in inquiry

Question type: Definition
  GPH question type: Definition
  Requesting for: Meaning of a word/phrase/event
  Examples: What is energy crisis? What does global warming refer to?

Question type: Option
  GPH question type: Disjunctive
  Requesting for: Selection between two objects (this or that) based on fact
  Examples: Does the factory give out clean water or harmful water? Do electrons behave as particles or wave?

Question type: Attribute
  GPH question type: Concept completion, feature specification, quantification
  Requesting for: Information on some properties (i.e. dimension, background, quantity, function, composition, location, duration, etc.) of an object/event
  Examples: What is the size of the smallest whale? Where is the notice? What is the most powerful energy?

Question type: Clarification
  GPH question type: Assertion
  Requesting for: Clearing of uncertainty for factual points
  Examples: What do you mean by wasting energy? What are you talking about?

Question type: Comparison
  GPH question type: Comparison
  Requesting for: Evaluation of the differences between two or more objects
  Examples: What is the difference between fuel and energy? What are the differences between these types of games?

Question type: Reason
  GPH question type: Interpretation, causal antecedent, causal consequence, goal orientation, expectational, enablement
  Requesting for: Causal antecedent/rationale for an event/phenomenon
  Examples: Why do water become brown color? Why do we need to save water?

Question type: Procedure
  GPH question type: Instrumental/procedural
  Requesting for: Instructions/process necessary to accomplish a goal (step-wise); or underlying principles/mechanisms of an event
  Examples: How to make a water filter? How can we make clean water far away from dirty water so that we can make dirty water clean step by steps? How can we protect these animals? How do planktons give birth to the next generation?

Question type: Example
  GPH question type: Example
  Requesting for: Examples/evidence to support an opinion/claim
  Examples: Can you use a everyday example to explain it? Can I have some examples of online games?

Question type: Opinion
  GPH question type: Judgment
  Requesting for: Any feedback/feelings/experiences/answers with personal opinion, as perceived by the questioner
  Examples: How would you think a family can help to save power? Can you give me some feedback on how you feel?

Question type: Verification
  GPH question type: Verification
  Requesting for: Confirmation of factual/hypothetical points for an object/situation; not expecting an answer
  Examples: Is it safe to drink the water only through looking at it? Is carbon dioxide toxic and harmful? Do you really think the government will care about our living standard?

Question type: Others
  Requesting for: Other types of question

Chapter 3

METHOD

3.1 Selection of dataset


The dataset for this study is collected from an archive owned by a
project1 that supports the professional development of teachers in a network
promoting the use of knowledge building pedagogy in primary and
secondary schools in Hong Kong. The archive contains online discussions on
Knowledge Forum by students from Hong Kong using either English or
Chinese as the medium of discussion.

In order to test the performance of the proposed question detection and classification methods, five discussion forums in Chinese and five discussion forums in English, from classrooms in which there has been relatively extensive engagement by students, are selected from the archive of discussions collected from Hong Kong primary and secondary schools. Although both Chinese and English are official languages of Hong Kong, we cannot presume that Hong Kong students are as proficient in both languages as native speakers.

Firstly, the Chinese language in Hong Kong is westernized because of the historical background of Hong Kong. Passive-voice constructions, which standard Chinese would normally rephrase in the active voice, are one of the most significant examples of the westernization of the Chinese language found in Hong Kong. It is expected that the performance of Chinese question detection and classification might be affected by this factor.

Moreover, the proficiency of the English language in Hong Kong is another issue for the sample data. The majority of the population in Hong Kong has Cantonese as the first language, and the chance of communicating in English in daily life is rather rare. The English proficiency of Hong Kong citizens may therefore be some distance from that of native English speakers. The 2013 Business English Index & Globalization of English Research (Global English, 2013) ranks Hong Kong 21st in the level of business English. This result reflects that the level of English in Hong Kong is some distance from the proficiency of the top-ranked regions, and hence it is predictable that the language in the dataset may not be fully grammatical.

1 Individual and organizational learning in a professional development network for knowledge building as pedagogical innovation (Project code: HKU 747911H, Approval Ref.: EA1310310, dated 2010 April 13, project duration 30 months)

Besides the above factors, the language proficiency of students may also vary across ages. In order to capture this variation, the dataset for each language contains three discussion forums from primary school students and two from secondary school students, so that discourse with various levels of language proficiency is included in the datasets.

3.2 Data format and data preparation


The dataset for this study is directly collected from Knowledge Forum. Knowledge Forum (http://www.knowledgeforum.com) is an online learning platform designed to engage participants in collaborative learning activities (Scardamalia, 2002). It contains functions such as message posting, graphical message organization, content referencing and scaffolding. The data in Knowledge Forum are stored as a collection of tuples called a tuplestore. Each tuple contains a unique tuple ID and a variable length of data. The tuple-based method poses no restriction on the size of the data in a tuple, enhancing the flexibility of the content that can be stored in the database.

The first step of the data preparation is to convert the database from the tuplestore format to a relational database format. A main reason for this transformation is that a relational database provides a more flexible and optimized way to query data than a tuplestore. During the conversion, all personal identifiers of the authors were removed. The resulting relational database contains only the identifier of the discussion (discussion ID), the discussion title, the message identifier (message ID), the message title, the message content and the message ID of the parent note, which identifies the reply sequence of the messages.

3.3 Manual question classification


Qualitative content analysis is used for analyzing questions in the datasets. All questions are classified solely on the basis of their grammatical forms. According to Zhang & Wildemuth (2009), content analysis consists of eight steps. The first step is the "preparation of data". Since our data are already in written text format, no transformation is needed. The second step is to "define the unit of analysis". We choose the sentence as the unit of analysis for this study. A sentence is defined as a sequence of words delimited by a full stop, question mark, exclamation mark, colon, semi-colon, ellipsis, tilde, carriage return or line feed. Because a message may contain more than one question, the sentence is adopted as the unit of analysis in order to capture all occurrences of questions in a message. The third step is to "develop categories and a coding scheme". The preliminary taxonomy for classifying questions is a refinement of the categorization of explanatory-reasoning questions proposed by Graesser & Person (1994). New categories may emerge inductively during the course of content analysis (Miles & Huberman, 1994); such categories of question are grouped under the question category others. A coding manual, containing the name, description and examples of each question category, is prepared and given to each human coder for reference. The fourth step is to "test your coding scheme on a sample text". About one tenth of the dataset is selected as the sample text. Two coders code the sample text separately using the coding manual provided, and the level of agreement is calculated. If the level of consistency is low, the two coders may need to revise the coding rules iteratively until a sufficient level of consistency is reached (Weber, 1990). The reliability of the content analysis is measured by Cohen's Kappa coefficient (Cohen, 1960). According to the benchmark suggested by Landis & Koch (1977) for interpreting the Cohen's Kappa coefficient, results at or above 0.61 are considered to show substantial to almost perfect agreement. The fifth step is to "code all the text". The remaining data in the datasets are coded by the two coders, and each piece of data is coded only once by a single coder. The sixth step is to "assess your coding consistency". Since no new category of question emerged in steps four and five, the process of re-calculating the inter-coder reliability can be omitted. The seventh and eighth steps are to "draw conclusions from the coded data" and to "report your methods and findings". Given that the objective of the manual content analysis in this study is to prepare training and testing data for the question detection and classification algorithms, the last two steps are irrelevant to the present study and are therefore omitted.
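A minimal Python sketch of the sentence tokenization rule defined in the second step is given below; the regular expression is an illustrative assumption covering the ASCII and full-width forms of the delimiters listed above.

    import re

    # Full stop, question mark, exclamation mark, colon, semi-colon, ellipsis,
    # tilde, carriage return and line feed, in ASCII and full-width variants.
    _DELIMS = r"([.?!:;~\u3002\uff1f\uff01\uff1a\uff1b\u2026\r\n]+)"

    def split_sentences(text):
        """Split a message into sentences, keeping each end punctuation mark."""
        parts = re.split(_DELIMS, text)
        sentences = []
        for i in range(0, len(parts) - 1, 2):       # pair text with its delimiter
            s = (parts[i] + parts[i + 1]).strip()
            if s:
                sentences.append(s)
        if parts[-1].strip():                        # trailing text without delimiter
            sentences.append(parts[-1].strip())
        return sentences

    print(split_sentences("How can we protect these animals? We should act now."))
    # -> ['How can we protect these animals?', 'We should act now.']

Retaining the delimiter matters here because the question mark itself is the feature used by the QM detection method described in the next section.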

3.4 Automated Question Detection


Previous research on question detection has indicated that the rule-based question mark method and the machine-learning-based syntax method can both yield satisfactory question detection results in a specific domain. This study experiments with whether a hybrid method combining the two can achieve better question detection performance than either method alone. Below is an introduction to the rule-based, machine-learning and hybrid methods, and to the experiment that verifies the performance of the hybrid method.

Firstly, the rule-based question mark method identifies questions based on the occurrence of a question mark at the end of a sentence. Wang's study (2010) achieved a precision of 94.12% using the rule-based question mark method for question detection. The definition of the method is stated below.

QM method: A statement ending with a question mark is a question.

Secondly, the machine-learning-based syntax method identifies questions based on the occurrence of the subtree structures characteristic of questions. The Stanford parser (Klein & Manning, 2003) is chosen as the syntactic analyzer for both Chinese and English in this study. It is an unlexicalized PCFG parser that analyzes the syntactic structure of natural language based on probabilities estimated from a treebank. It can be trained on any treebank which has the same syntactic bracketing format as the Penn Treebank.

The Stanford parser is packaged with a language model trained on the Penn Treebank. The Penn Treebank contains three million words of parsed text from a wide range of genres, such as IBM computer manuals, nursing notes, Wall Street Journal articles and transcribed telephone conversations (Taylor et al., 2003). The Penn Treebank contains two question-related tags, namely SQ and SBARQ. SBARQ marks a direct question introduced by a wh-word or wh-phrase, and SQ marks the subconstituent of SBARQ excluding the wh-word or wh-phrase (Marcus, 1993, p. 321). Since the Penn Treebank is commonly used in natural language processing research, it is selected as the treebank for training the Stanford parser to analyze the syntactic structure of sentences in the English dataset.

While the Penn Treebank is used for English syntactic analysis, the Sinica Treebank (F. Chen, Tsai, Chen, & Huang, 1999) is used for training the Stanford parser to analyze sentences in the Chinese dataset. The Sinica Treebank is chosen because it is one of the largest treebanks of modern Chinese, containing sixty-one thousand tree structures extracted from the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus), the first balanced traditional Chinese corpus. Most of the text collected in the Sinica Treebank is in Traditional Chinese, and the text covers the genres of news reports, advertisements, novels, scripts, meeting minutes, textbooks, bulletin board conversations, etc. It is believed that the language found in the Sinica Treebank would be the most akin to the language used by Hong Kong students. However, two features of the Sinica Treebank that differ from the Penn Treebank make its direct use impossible for the machine-learning-based question detection method in this study. Firstly, the syntactic bracketing of the Sinica Treebank is different from that of the Penn Treebank. (1) shows a bracketed question found in the Sinica Treebank.


(1)  S(agent:NP(Head:Naeb:)|negation:Dc:|epistemics:Dbaa:
     |Head:VE2:|goal:VP(deontics:Dbab:|Head:VJ2:
     |goal:NP(Head:Nac:))|particle:Td:)#(QUESTIONCATEGORY)

In order to use the Sinica Treebank for training the Stanford parser, the Sinica Treebank is first converted into the format of the Penn Treebank. (2) shows the tree structure of (1) after the conversion.

(2)  (S (agent:NP (Head:Naeb:))
        (negation:Dc:)
        (epistemics:Dbaa:)
        (Head:VE2:)
        (goal:VP (deontics:Dbab:)
                 (Head:VJ2:)
                 (goal:NP (Head:Nac:)))
        (particle:Td:)
        ?)
Besides the difference in bracketing format, the absence of a tag for questions in the Sinica Treebank tagset would also cause the proposed method, which detects questions through the occurrence of a question subtree, to fail. (2), for example, is a question, but no tag indicates that the constituent is a question. One way to tackle this problem is to create a question tag SQ and use it to replace the root node S in (2). (3) shows the syntactic structure of (2) after the substitution.

(3)  (SQ (agent:NP (Head:Naeb:))
         (negation:Dc:)
         (epistemics:Dbaa:)
         (Head:VE2:)
         (goal:VP (deontics:Dbab:)
                  (Head:VJ2:)
                  (goal:NP (Head:Nac:)))
         (particle:Td:)
         ?)
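One possible way to implement this bracketing conversion is sketched below in Python; it handles only the pipe-delimited format illustrated in (1), assumes the trailing "#..." annotation has been stripped beforehand, and is not the actual converter used in this study.

    def sinica_to_penn(node):
        """Convert one Sinica-style bracketing, e.g. 'S(a:NP(Head:X)|b:Y)',
        into Penn-style '(S (a:NP (Head:X)) (b:Y))'."""
        i = node.find("(")
        if i == -1:                       # leaf: keep the role:POS:word token
            return "(" + node + ")"
        label, body = node[:i], node[i + 1:-1]
        children, depth, start = [], 0, 0
        for j, ch in enumerate(body):     # split on '|' at bracket depth 0
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "|" and depth == 0:
                children.append(body[start:j])
                start = j + 1
        children.append(body[start:])
        return "(" + label + " " + " ".join(sinica_to_penn(c) for c in children) + ")"

    raw = "S(agent:NP(Head:Naeb:)|negation:Dc:)#(QUESTIONCATEGORY)"
    print(sinica_to_penn(raw.split("#", 1)[0]))
    # -> (S (agent:NP (Head:Naeb:)) (negation:Dc:))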
Syntax method: A statement which contains one or more question tags, SQ or SBARQ, in its syntactic structure is marked as a question.
The hybrid method is a combination of the QM and Syntax methods. All sentences are first processed by the QM method, and all the negative returns from the QM method are then passed to the Syntax method for further processing. Any sentence marked as positive by either the QM or the Syntax method is regarded as a question by the hybrid method.

Hybrid method: A two-step process combining the QM and Syntax methods. A statement which is marked as positive by either the QM or the Syntax method is marked as a question.
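The three detection rules can be sketched in Python as follows; the set of constituent labels is assumed to come from an external parse by the Stanford parser, and the function names are illustrative.

    def is_question_qm(sentence):
        """QM method: a sentence ending with a question mark is a question."""
        return sentence.rstrip().endswith(("?", "\uff1f"))  # ASCII and full-width

    def is_question_syntax(parse_labels):
        """Syntax method: the parse contains an SQ or SBARQ constituent.
        `parse_labels` is assumed to be the set of constituent labels
        produced by the parser for the sentence."""
        return bool({"SQ", "SBARQ"} & set(parse_labels))

    def is_question_hybrid(sentence, parse_labels):
        """Hybrid method: QM first; QM-negative sentences go to Syntax."""
        return is_question_qm(sentence) or is_question_syntax(parse_labels)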
In order to compare the performance of Chinese and English question detection, the experiments on Chinese and English question detection are conducted separately. The experiment on each language is divided into the three cases below. These cases are established to verify whether the proposed hybrid method can improve the performance of question detection over the other two methods.

Case    Method
1       QM method
2       Syntax method
3       QM + Syntax method

The sentence is chosen as the unit of analysis of this study. The online discussions in the Chinese and English datasets are tokenized into sentences before the process of automatic question detection. The collection of sentences forms the testing data for this experiment, and the results of automatic question detection are compared against the hand-coded question detection results. Precision, recall, f1-score and accuracy are chosen as the primary figures of merit. These metrics inform us of the effectiveness of each method for detecting questions in online discussion by Hong Kong students.
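For reference, with TP, FP, TN and FN denoting the numbers of true positives, false positives, true negatives and false negatives respectively, the four metrics are defined as:

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1-score  = 2 × precision × recall / (precision + recall)
    accuracy  = (TP + TN) / (TP + TN + FP + FN)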

3.5 Automated Question Classification


3.5.1 Syntax-based Tree kernel method
A Support Vector Machine (SVM) is a binary classifier. A binary classifier can only determine whether the input data belong to a target class or not. As a result, one SVM is needed for each category of question, and the combination of multiple SVMs forms a multi-class classifier. In the situation where an input would be assigned to two or more question categories, the category with the highest margin score is selected. SVM, as a kernel-based technique, allows one to incorporate domain-specific knowledge into the classifier using a kernel function. In general terms, a kernel function is a similarity measurement between two inputs. The kernel function used in this study is based on the tree kernel defined by Bloehdorn & Moschitti (2007) for calculating the similarity of the syntactic structures of two sentences.

A syntactic structure is a tree structure. A tree T is defined as a connected acyclic graph, that is, a graph in which every vertex is connected by edges and there exists a unique path between any two vertices. A tree kernel is designed to measure the similarity of two trees. In technical terms, a naive tree kernel measures whether the sets of production rules of two trees are the same. However, in the context of natural language processing it is rare for two syntactic trees to be derived from exactly the same set of production rules. The sentences "I love eating apple" and "I love eating apple very much" support this claim: the two sentences are nearly the same, but they have different syntactic structures because the second includes an extra adverbial phrase "very much". In order to solve this problem, researchers explored the use of subtrees for the comparison.

A subtree is an internal tree structure found within a tree. A subtree must include at least two vertices connected by an edge. There is no constraint on the type of leaf of a subtree; a non-terminal symbol can also form the leaf of a subtree. The collection of all internal subtrees of a tree is called a subset tree, and the space of tree fragments is defined as F = {f_1, f_2, ..., f_|F|}, where f_i is the i-th tree fragment. The tree kernel counts the common subtrees between the two input trees. Below is the definition of the tree kernel function and the corresponding conditions proposed by Bloehdorn & Moschitti (2007, p. 862):

Definition: Given trees T1 and T2, the tree kernel K is defined as:

    K(T1, T2) = Σ_{n1 ∈ N_{T1}} Σ_{n2 ∈ N_{T2}} Δ(n1, n2)                (1)

    Δ(n1, n2) = Σ_{i=1}^{|F|} I_i(n1) · I_i(n2)                          (2)

where N_{T1} and N_{T2} are the sets of nodes of T1 and T2, and I_i(n) is an indicator function equal to 1 if fragment f_i is rooted at node n and 0 otherwise. Δ(n1, n2) calculates the number of common fragments rooted at nodes n1 and n2. It can be computed in polynomial time with the following definition, in which the parameter λ is added to control the influence of the size of the tree on the result:

1. If the productions at n1 and n2 are different, then Δ(n1, n2) = 0.
2. If the productions at n1 and n2 are the same, and n1 and n2 are pre-terminals, then Δ(n1, n2) = λ.
3. If the productions at n1 and n2 are the same, and n1 and n2 are not pre-terminals, then

    Δ(n1, n2) = λ Π_{j=1}^{nc(n1)} (1 + Δ(ch(n1, j), ch(n2, j)))         (3)

where nc(n1) is the number of children of n1, ch(n, j) is the j-th child of node n, and λ is the decay parameter.
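As an illustration of this recursion, below is a minimal Python sketch, assuming each node object exposes a production() method (its grammar rule, with the word folded in for pre-terminals) and a children list; it sketches the computation described above and is not the implementation used in the experiments.

    def delta(n1, n2, lam=0.4):
        """Common fragments rooted at n1 and n2, damped by decay parameter lam.
        Assumption: a pre-terminal node has an empty children list and its
        word is folded into the value returned by production()."""
        if n1.production() != n2.production():
            return 0.0                       # condition 1: different productions
        if not n1.children:
            return lam                       # condition 2: matching pre-terminals
        result = lam                         # condition 3: same internal production
        for c1, c2 in zip(n1.children, n2.children):
            result *= 1.0 + delta(c1, c2, lam)
        return result

    def tree_kernel(nodes1, nodes2, lam=0.4):
        """K(T1, T2): sum of delta over all node pairs of the two trees."""
        return sum(delta(n1, n2, lam) for n1 in nodes1 for n2 in nodes2)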

A stratified cross-validation methodology is used for evaluating the performance of the tree kernel method for classifying questions in CSCL discourse. In order to compare the performance of Chinese and English question classification, the experiments on the Chinese and English datasets are conducted separately. The datasets contain a collection of questions with their corresponding question category labels. Each dataset is divided into ten subsets, each with a similar distribution of question categories. Nine subsets of the data are used for training the SVM model and the remaining subset is used for testing the trained model. A round of training and testing is referred to as an iteration, and there are ten iterations for each selected question category. The performance of the trained model for each question category is the average of the measurements (i.e. precision, recall, f1-score and accuracy) over the ten iterations.
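A sketch of this evaluation loop in Python is shown below, using scikit-learn's StratifiedKFold purely as an illustration; train_svm and evaluate are hypothetical stand-ins for the tree kernel SVM training and scoring steps.

    from sklearn.model_selection import StratifiedKFold

    def cross_validate(questions, labels, train_svm, evaluate, folds=10):
        """Stratified k-fold loop; returns the per-fold metric dictionaries."""
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in skf.split(questions, labels):
            model = train_svm([questions[i] for i in train_idx],
                              [labels[i] for i in train_idx])
            scores.append(evaluate(model,
                                   [questions[i] for i in test_idx],
                                   [labels[i] for i in test_idx]))
        return scores  # average precision/recall/f1 over the ten iterations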

3.5.2 Lexeme-based Case-based reasoning


Case-based reasoning is based on the concept of solving a new problem by reusing past cases residing in memory. It is a method frequently adopted by human beings for solving new problems (Aamodt & Plaza, 1994). The algorithm of case-based reasoning can be described as a two-step process: 1) interpret the input problem, and 2) retrieve the solution of a similar problem from the memory space. The technical challenge of case-based reasoning is to identify the correct set of parameters to represent a problem case. An extreme approach to problem representation is to use every feature of a problem; the consequences are low computational efficiency and the presence of noise.

In this study, each question is represented as a case, with its corresponding question category as the solution of that case. A case is represented as an n-dimensional vector over the lexemes found in all questions, and the value of each attribute is the count of a particular lexeme in the question. As discussed, not all lexemes are relevant to the determination of the question category, and information gain is used as the criterion for feature selection. Information gain measures the number of bits saved in the determination of a class when a given feature is provided. Each selected lexeme forms one feature of a case, and the count of that lexeme in a question becomes the feature value. The similarity between a past case p and a new case n is computed as before (Schmitt & Bergmann, 1999):

    sim(p, n) = 1 − (1/m) Σ_{i=1}^{m} |p_i − n_i|

where m is the number of selected lexemes and p_i and n_i are the normalized counts of the i-th lexeme in p and n respectively.

K-Nearest Neighbor (K-NN) is implemented for the selection of past cases from the case base. The K cases with the smallest distance to the test case are retrieved from the case base, and the majority question category of the retrieved cases becomes the question category of the test case, where K = 3. In order to guarantee a complete separation of the training and testing data, ten-fold stratified cross-validation is used. The performance of the trained model for each question category is the average of the measurements (i.e. precision, recall, f1-score and accuracy) over the ten iterations.
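Below is a minimal Python sketch of this lexeme-based retrieval, assuming the questions have already been tokenized (e.g. by the Stanford Segmenter) and the lexicon reduced by information gain; the helper names are illustrative.

    from collections import Counter

    def to_vector(tokens, lexicon):
        """Count occurrences of each selected lexeme (feature) in a question."""
        counts = Counter(tokens)
        return [counts[lex] for lex in lexicon]

    def knn_classify(case_base, query_vec, k=3):
        """Retrieve the k nearest past cases and vote on the question category.
        `case_base` is a list of (vector, category) pairs."""
        def dist(v):
            return sum(abs(a - b) for a, b in zip(v, query_vec)) / len(v)
        nearest = sorted(case_base, key=lambda c: dist(c[0]))[:k]
        votes = Counter(cat for _, cat in nearest)
        return votes.most_common(1)[0][0]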

3.5.3 Hybrid method for question classification


The hybrid method for question classification is a combination of the syntax-based tree kernel method and the lexeme-based case-based reasoning method. Figure 1 shows the procedure of the hybrid method. All questions are first passed to the SVM model of each question category for analysis. The SVM scores from each SVM model are then passed to the SVM results processor. In the processor, a question is assigned to the question category with the highest positive SVM score, while questions with negative scores from all SVM models are sent to the case-based reasoning algorithm, which determines their question category.
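A compact sketch of this cascade is given below, assuming svm_models maps each question category to a margin-scoring function and cbr_classify is the classifier of section 3.5.2; both are hypothetical interfaces used only for illustration.

    def classify_hybrid(question, svm_models, cbr_classify):
        """Tree kernel SVMs first; CBR handles questions all SVMs reject."""
        scores = {cat: score(question) for cat, score in svm_models.items()}
        best_cat, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score > 0:               # at least one SVM accepts the question
            return best_cat
        return cbr_classify(question)    # all margins negative: fall back to CBR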

Figure 1. The procedure of the hybrid method for question classification


Chapter 4

RESULTS

4.1 Results of manual question detection and question classification
The manual process of question analysis was divided into two stages. In the first stage, two coders manually selected questions from the Chinese and English datasets. Despite the fact that no guideline was given to the two coders for identifying questions, Cohen's Kappa coefficients of 0.79 for Chinese question detection and 0.85 for English question detection were achieved. A value between 0.61 and 0.80 is considered substantial agreement (Landis & Koch, 1977). It is surprising that the two coders could achieve substantial agreement without any guideline on the identification of questions. This may tell us that people situated in the same community may share similar implicit criteria or expectations about the forms of questions even without systematic linguistic training. The main discrepancy between the two raters concerned questions consisting of a single noun phrase, such as "Teapot?" The rater who regarded this as a non-question believed that such a sentence is a rhetorical question and does not serve the function of inquiry. The other rater, on the contrary, believed that it is a verification question that verifies the idea expressed. Since it is difficult to make an objective judgement on whether such a question is rhetorical or a verification question, the agreement between the two coders after discussion was that this type of sentence would be regarded as a question, the reason being that the question mark marks the questioner's intention to ask a question.

In the second stage, after the identification of questions, the two coders categorized the questions in accordance with the question taxonomy described in section 2.4. The Cohen's Kappa coefficients for Chinese and English question classification are 0.89 and 0.90 respectively for all question categories combined. These values indicate an almost perfect inter-rater reliability (Landis & Koch, 1977) for both the Chinese and English datasets. Besides, the similarity of the two Cohen's Kappa coefficients also informs us that language is not a factor affecting the reliability of the question taxonomy. Table 4 and Table 5 show the distribution of questions in the English and Chinese datasets respectively.

Table 4. Distribution of questions in the English dataset

Question type    Frequency    % of different types of questions
Verification     318          38.3%
Reason           172          20.7%
Attribute        116          14.0%
Procedural       116          14.0%
Definition       32           3.9%
Others           31           3.7%
Clarification    23           2.8%
Opinion          11           1.3%
Example                       0.6%
Option                        0.5%
Comparison                    0.2%

Table 5. Distribution of questions in the Chinese dataset

Question type    Count    % of different types of questions
Verification     434      45.8%
Attribute        150      15.8%
Reason           147      15.5%
Procedural       77       8.1%
Opinion          45       4.8%
Others           29       3.1%
Option           21       2.2%
Clarification    17       1.8%
Definition       14       1.5%
Example          10       1.1%
Comparison                0.4%

It is noticed that both the Chinese and English datasets contain a certain amount of questions classified as others. This category includes some meaningful question categories which are missing from the question taxonomy of this study. Those questions might serve the purpose of monitoring the discussion, providing a suggestion, asking for elaboration or facilitating a consensus between members of the discussion. It is believed that these question categories would inform us about the students' agency in regulating their own inquiry. However, this is out of the scope of this study, and how these questions relate to the inquiry has not been further analyzed.

4.2 Results of automated question detection


Three methods for question detection were implemented in this study. The first is the rule-based question mark (QM) method. This method uses only the question mark as an indicator of a question: any sentence which ends with a question mark is regarded as a question. The second is the machine-learning-based syntax (Syntax) method, which uses the Stanford parser to analyze the syntactic structure of sentences; a sentence containing the substructure characteristic of questions is identified as a question. The third is the hybrid (QM + Syntax) method, which combines the QM and Syntax methods. Sentences are first analyzed by the QM method, and all negative returns from the QM method are then passed to the Syntax method for further analysis.

4.2.1 Result of English question detection


Table 6. English question detection results with QM, Syntax and QM+Syntax methods

              Precision    Recall    F1-score    Accuracy
QM            96.8%        87.7%     92.0%       98.5%
Syntax        96.6%        59.6%     73.8%       95.9%
QM + Syntax   96.0%        93.5%     94.8%       98.9%

Note. Total number of questions = 830 and total number of sentences = 8811.

Table 6 shows the English question detection results of the three methods. All three methods have precision and accuracy above 95%. There are three findings which can give us a better understanding of the nature of English questions by Hong Kong students. The first observation is that the precision of the QM method is lower than 100%: not all sentences which end with a question mark are questions. The second observation is that the recall of the Syntax method is only 59.6%, which means that around 40% of the questions cannot be detected by the Syntax method. One possible explanation is that those questions may be ungrammatical. The last observation is that the result of the rule-based QM method is very close to, or even better than, that of the machine-learning-based Syntax method. Machine learning has long been recognized as an approach surpassing rule-based approaches. However, this result demonstrates that, by choosing representative features, it is possible for a rule-based method to outperform a machine-learning method. The experimental results show that neither the rule-based QM method nor the machine-learning-based Syntax method alone is the most desirable method for English question detection. As a combination of the QM and Syntax methods, the QM + Syntax method yields the highest f1-score among the three methods. Although the extent of the improvement is small, this experiment still shows that a hybrid method combining the rule-based and machine-learning-based methods can yield a better result than the QM or Syntax method alone.

4.2.2 Result of Chinese question detection


Table 7. Chinese question detection results with QM, Syntax and QM+Syntax methods

              Precision    Recall    F1-score    Accuracy
QM            98.3%        91.5%     94.8%       93.9%
Syntax        0%           0%        0%          89.5%
QM + Syntax   98.3%        91.5%     94.8%       93.9%

Note. Total number of questions = 948 and total number of sentences = 9039.

Table 7 shows the results of Chinese question detection, which are quite different from those of English question detection. One of the most significant differences is the precision of the Syntax method: the precision of the Syntax method for Chinese question detection is 0%. This result informs us that the Syntax method, using the Sinica Treebank as training data, fails to detect any questions in the dataset. There are two possible causes of this failure. The first is the low percentage of questions in the Sinica Treebank. Since the treebank contains only around 1.3% questions, this low frequency would affect the performance of the statistics-based Stanford syntactic parser used in the Syntax method. Another possible cause is the absence of a tag for questions in the tagset of the Sinica Treebank. Although a question tag was constructed in this study to replace the root node of the questions in the treebank, the poor performance clearly indicates that this way of substitution may not be appropriate.

Although the question detection performance is disappointing, this result does not imply that the Stanford parser fails to analyze the syntactic structure of Chinese questions. As stated in Levy & Manning (2003), the f1-score for Chinese parsing with the Stanford parser trained on the Penn Chinese Treebank is 78.8%. This figure shows that the parser can analyze the syntactic structure of Chinese sentences with satisfactory results. It is highly possible that the performance of the Syntax method could be improved if the problem of the missing tag for questions were solved. A detailed discussion of the syntactic structure of Chinese questions is presented in Chapter five to investigate the possibility of using syntax for detecting Chinese questions.

In addition to the poor precision and recall of the Syntax method, its high accuracy also informs us that accuracy may not be a suitable metric for measuring the performance of question detection. The main reason is that the distribution of questions and non-questions is highly skewed, and most of the correct detections are contributed by the correct classification of non-questions. A solution to this problem would be a balanced corpus. However, questions normally occur less often than non-questions in real-life contexts, and it is quite unlikely to have a corpus with equal amounts of questions and non-questions. Hence, accuracy should not be considered a suitable metric for question detection.
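To illustrate this with the dataset counts: the Chinese dataset contains 948 questions among 9039 sentences, so a trivial detector that labels every sentence as a non-question would attain an accuracy of (9039 − 948) / 9039 ≈ 89.5% without detecting a single question, which is exactly the accuracy of the Syntax method reported in Table 7.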

4.3 Results of automated question classification


Previous research (Hakan, 2007; D. Zhang & Lee, 2003) demonstrates that SVM with a tree kernel surpasses other methods for classifying questions generated by native English speakers. However, there is a lack of studies on the performance of SVM for classifying questions in online asynchronous discussions by non-native English speakers, such as Hong Kong school students. There are two main concerns about using the well-accepted tree kernel method for classifying questions in the discourse of Hong Kong students.

Firstly, the tree kernel method is only well tested for classifying questions from English-speaking environments. Those questions are more likely to be grammatical than questions generated by non-native English speakers. The majority of students in Hong Kong have Chinese as their first language and seldom use English in their daily communications. Their English proficiency is likely to be lower than that of native English speakers, and it is unrealistic to expect the language generated by Hong Kong students to be fully grammatical. It is as yet unknown whether the existence of ungrammatical forms of questions would affect the performance of the tree kernel method.

Secondly, Chinese is a language totally different from English, and its questions are not governed by the same kind of well-formed grammatical rules. The syntax-based tree kernel approach may therefore not be fully applicable for analyzing Chinese questions. The following sections show the results of the tree kernel method for classifying questions in the English and Chinese datasets.

4.3.1 Question classification using tree kernel method


4.3.1.1 Result of English question classification
Since the amount of training data is a determining factor in the performance of the tree kernel method, a training set with a small data size would strongly affect the result of question classification. Hence, only those question types with more than 100 counts are selected for the analysis. The three question types selected are attribute, reason and verification questions. Table 8, Table 9 and Table 10 show the ten-fold stratified cross-validation results for these three types of questions using the tree kernel method.

Table 8. Ten-fold stratified cross-validation result of English verification question using tree kernel method

            R1      R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   100.0%  87.5%  81.3%  87.5%  90.3%  93.6%  78.8%  90.0%  90.3%  87.5%  89.0%
Recall      80.7%   90.3%  83.9%  90.3%  90.3%  93.6%  86.7%  90.0%  93.3%  93.3%  89.0%
f1-score    89.3%   88.9%  82.5%  88.9%  90.3%  93.6%  82.5%  90.0%  91.8%  90.3%  89.0%

Note. Total number of verification questions = 306; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

Table 9. Ten-fold stratified cross-validation result of English reason question using tree kernel method

            R1      R2      R3     R4      R5      R6     R7      R8      R9      R10     Avg.
Precision   100.0%  100.0%  93.3%  100.0%  100.0%  92.3%  100.0%  100.0%  100.0%  100.0%  99.0%
Recall      82.4%   47.1%   82.4%  70.6%   68.8%   75.0%  87.5%   68.8%   87.5%   75.0%   74.0%
f1-score    90.3%   64.0%   87.5%  82.8%   81.5%   82.8%  93.3%   81.5%   93.3%   85.7%   84.7%

Note. Total number of reason questions = 165; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

Table 10. Ten-fold stratified cross-validation result of English attribute question using tree kernel method

            R1      R2     R3     R4      R5     R6     R7     R8      R9     R10    Avg.
Precision   100.0%  50.0%  66.7%  100.0%  63.6%  85.7%  50.0%  100.0%  80.0%  85.7%  78.0%
Recall      58.3%   41.7%  33.3%  54.6%   63.6%  54.6%  18.2%  81.8%   36.4%  54.4%  50.0%
f1-score    73.7%   45.5%  44.4%  70.6%   63.6%  66.7%  26.7%  90.0%   50.0%  66.6%  60.9%

Note. Total number of attribute questions = 113; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

The classification results with the tree kernel method are exceptionally good. The precisions for verification, reason and attribute questions are 89%, 99% and 78% respectively, and the recalls for verification and reason questions are both above 70%. The only unsatisfactory result is the recall of attribute questions: the average recall of attribute questions is just 50%, which means the algorithm missed half of the attribute questions in the dataset. The wide variation in the syntactic structures of attribute questions may be a cause of this low recall. An attribute question inquires about the property of an object or event. For example, "What is the colour of lion?" and "Which part of this vehicle is broken?" are both attribute questions, but their syntactic structures are quite different. Hence, if the training data do not cover the syntactic structures of all questions, the algorithm may fail to detect the questions with the missing structures. This hypothesis is supported by the result at round 8, where the recall is 31.8% higher than the average recall. This result shows that when the rare cases do not occur in the testing data, the algorithm performs quite well.

It was one of our concerns that the questions generated by Hong Kong students might be ungrammatical. This encouraging result suggests that this factor should not be a concern for question classification. A detailed discussion of the reason is given in Chapter 5.

4.3.1.2 Result of Chinese question classification


Coincidentally, verification, reason and attribute questions are also the question types with more than 100 counts in the Chinese dataset. These three types of questions were therefore chosen for the Chinese question classification experiments. Table 11, Table 12 and Table 13 show the ten-fold stratified cross-validation results for these three types of questions using the tree kernel method. It is found that, throughout the ten iterations of validation, there exists only one instance with a positive score for more than one question category. This indicates that the questions in each category possess distinctive syntactic structures and that separating the classification into three SVMs has no influence on the classification result.

Table 11. Ten-fold stratified cross-validation result of Chinese verification question using tree kernel method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   67.9%  74.5%  78.6%  66.7%  73.8%  66.7%  67.9%  71.4%  76.7%  69.8%  71.4%
Recall      83.7%  81.4%  76.7%  83.7%  72.1%  74.4%  83.7%  83.3%  78.6%  71.4%  78.9%
f1-score    75.0%  77.8%  77.6%  74.2%  72.9%  70.3%  75.0%  76.9%  77.6%  70.6%  74.8%

Note. Total number of verification questions = 427; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

Table 12. Ten-fold stratified cross-validation result of Chinese reason question using tree kernel method

            R1      R2      R3      R4      R5      R6      R7      R8      R9     R10     Avg.
Precision   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  91.7%  100.0%  99.2%
Recall      73.3%   60.0%   60.0%   46.7%   33.3%   57.1%   42.9%   71.4%   78.6%  57.1%   58.0%
f1-score    84.6%   75.0%   75.0%   63.6%   50.0%   72.7%   60.0%   83.3%   84.6%  72.7%   72.2%

Note. Total number of reason questions = 153; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

Table 13. Ten-fold stratified cross-validation result of Chinese attribute question using tree kernel method

            R1      R2      R3      R4      R5      R6      R7      R8      R9      R10    Avg.
Precision   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  66.7%  96.7%
Recall      18.8%   25.0%   43.8%   20.0%   40.0%   6.7%    13.3%   20.0%   20.0%   26.7%  23.4%
f1-score    31.6%   40.0%   60.9%   33.3%   57.1%   12.5%   23.5%   33.3%   33.3%   38.1%  36.4%

Note. Total number of attribute questions = 145; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation.

Table 14. Average of the ten-fold stratified cross-validation results of multi-class classification using tree kernel method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   89.3%  91.5%  92.9%  88.9%  91.3%  88.9%  89.3%  90.5%  89.5%  78.8%  89.1%
Recall      58.6%  55.5%  60.2%  50.1%  48.5%  46.1%  46.6%  58.3%  59.0%  51.7%  53.5%
f1-score    63.7%  64.3%  71.2%  57.1%  60.0%  51.9%  52.8%  64.5%  65.2%  60.5%  61.1%

The above results illustrate that the precision of Chinese question classification using the tree kernel remains at a high level; the precisions of reason and attribute questions are even close to 100%. However, the recalls for all three question types are less satisfactory. The recall of attribute questions is just 23.4%. A reason for this drop may be that Chinese is more ambiguous and the syntactic variation is larger in Chinese attribute questions than in English attribute questions. It is highly probable that some variations of syntactic structure found in the testing data are not available in the training data, and as a result the tree kernel approach fails to correctly identify the attribute questions in the testing data.

4.3.2 Question classification using Case-based Reasoning method
Since the English question classification results using the tree kernel method are satisfactory, we do not explore further methods to improve the performance of English question classification. The remainder of this chapter focuses on methods to improve the results of Chinese question classification.

As a wh-in-situ language, Chinese forms questions through the lexemes occurring in them rather than through the syntactic structure of the question. Case-based reasoning (CBR) is implemented to explore the importance of word usage for the classification of question types. CBR is a method to retrieve from a database the example case(s) sharing the maximum similarity with the test case. Each past case is regarded as an individual object, and past cases of the same type do not need to share anything in common. This characteristic suits the analytical needs of this study: since questions might contain different content words, it is difficult for algorithms such as SVM to generalize from instances with a wide variation of lexemes. This may be the reason why the performance of SVM using bag-of-words features for question classification is reported to be worse than that of SVM using syntactic structure (X. Li & Roth, 2002).

The CBR method allows one to plug in any type of features for the analysis. A detailed description of the CBR method is given in section 3.5.2. Considering that the goal of this experiment is to examine the importance of word usage for classifying questions in Chinese, lexemes are chosen as the features for question classification. A lexeme is a word that conveys a complete meaning, and a total of 2542 lexemes were extracted from all questions in the Chinese dataset. This lexicon was generated from the tokenization results of the Stanford Segmenter (Tseng, Chang, Andrew, Jurafsky, & Manning, 2005). Some of the lexemes found may be irrelevant to the determination of question types. In order to include only the lexemes which qualify as attributes for question classification, lexemes are selected based on their information gain. Information gain is a metric measuring the reduction in uncertainty in the categorization of a class Y given an attribute X, and it is commonly used for feature selection. Different information gain thresholds were experimented with for selecting the features for CBR, and it was found that an information gain greater than or equal to 0.011 yields the best classification result. With the use of information gain, the number of lexemes was reduced by 91% to 219 (see Appendix I). Table 15, Table 16 and Table 17 show the CBR results for the three types of questions examined in section 4.3.1.2.
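For reference, the information gain used for this feature selection is the standard definition:

    IG(Y; X) = H(Y) − H(Y | X)

where H(Y) = −Σ_y P(y) log2 P(y) is the entropy of the question category distribution and H(Y | X) = Σ_x P(x) H(Y | X = x) is the conditional entropy of the category given the value of the lexeme feature. A lexeme was retained as a feature when IG(Y; X) ≥ 0.011.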

Table 15. Ten-fold stratified cross-validation result of Chinese verification question using CBR method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   70.4%  58.2%  73.5%  58.8%  63.8%  62.5%  71.7%  68.1%  66.7%  63.8%  65.7%
Recall      88.4%  74.4%  83.7%  69.8%  86.0%  81.4%  88.4%  76.2%  71.4%  71.4%  79.1%
f1-score    78.4%  65.3%  78.3%  63.8%  73.3%  70.7%  79.2%  71.9%  69.0%  67.4%  71.7%

Note. Total number of verification questions = 427; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Table 16. Ten-fold stratified cross-validation result of Chinese reason question using CBR method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   90.9%  66.7%  71.4%  53.8%  80.0%  76.9%  83.3%  71.4%  71.4%  70.0%  73.6%
Recall      66.7%  53.3%  66.7%  46.7%  26.7%  71.4%  71.4%  71.4%  71.4%  50.0%  59.6%
f1-score    76.9%  59.3%  69.0%  50.0%  40.0%  74.1%  76.9%  71.4%  71.4%  58.3%  64.7%

Note. Total number of reason questions = 153; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Table 17. Ten-fold stratified cross-validation result of Chinese attribute question using CBR method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   44.4%  35.3%  64.3%  53.3%  56.3%  36.4%  46.2%  50.0%  50.0%  38.5%  47.5%
Recall      50.0%  37.5%  56.3%  53.3%  60.0%  26.7%  40.0%  46.7%  33.3%  33.3%  43.7%
f1-score    47.1%  36.4%  60.0%  53.3%  58.1%  30.8%  42.9%  48.3%  40.0%  35.7%  45.2%

Note. Total number of attribute questions = 145; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.


Table 18. Ten-fold stratified cross-validation result of multi-class classification using CBR

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   68.6%  53.4%  69.7%  55.3%  66.7%  58.6%  67.1%  63.2%  62.7%  57.4%  62.3%
Recall      68.4%  55.1%  68.9%  56.6%  57.6%  59.8%  66.6%  64.8%  58.7%  51.6%  60.8%
f1-score    67.5%  53.7%  69.1%  55.7%  57.1%  58.5%  66.3%  63.9%  60.1%  53.8%  60.6%

The recalls of all three question types are better than with the tree kernel method. The largest improvement is found in the attribute questions, while only a little improvement can be observed for the verification and reason questions. The average recall of the attribute questions increases from 23.4% to 43.7%. This result informs us that some question types may have more distinctive syntactic structures than others; for question types with a wide variation of syntactic structures, classification with lexemes may improve the result.

4.3.3 Question classification using the hybrid method


Although CBR improves the recall, it also causes a drop in precision. Based on the high precision of the tree kernel method and the improved recall of CBR, it is hypothesized that an algorithm combining the two methods could yield a more satisfactory result in Chinese question classification. To take advantage of the high precision of the tree kernel method, the data are first processed by the tree kernel method and the negative returns are then passed to CBR for further processing. The mechanism of the hybrid method is shown in Figure 1. Two experiments are designed based on this heuristic method. The first experiment uses a bi-class approach to classify each question category separately, using the negative returns of each of the three tree kernels as input for the CBR. Table 19, Table 20 and Table 21 present the results of the first experiment. The second experiment is a multi-class classification that passes the questions returned as negative by all three tree kernels to the CBR as input. Table 22 shows the result of the second experiment.

The hybrid method is a combination of the syntax-based tree kernel method and the lexeme-based case-based reasoning method. The experimental results show that the recalls of the three types of Chinese questions with the hybrid method are the highest among the three question classification methods discussed in this chapter. This result indicates that both syntax and lexemes should be used as features for question classification. However, the introduction of lexemes as features also caused a drop in precision: the precision of the three question types with the hybrid method is lower than with the tree kernel method. This reflects that the lexeme-based case-based reasoning has included lexemes which are irrelevant to the classification of questions. In order to improve the result, an extra mechanism might be needed to filter out the irrelevant lexemes.

Table 19. Ten-fold stratified cross-validation results of Chinese verification question with tree+CBR method

            R1      R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   65.2%   58.5%  70.2%  58.3%  60.3%  59.4%  60.6%  63.8%  64.9%  60.0%  62.1%
Recall      100.0%  88.4%  93.0%  97.7%  95.3%  88.4%  93.0%  88.1%  88.1%  85.7%  91.8%
f1-score    78.9%   70.4%  80.0%  73.0%  73.9%  71.0%  73.4%  74.0%  74.7%  70.6%  74.0%

Note. Total number of verification questions = 427; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Table 20. Ten-fold stratified cross-validation results of Chinese reason question with tree+CBR method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   92.9%  71.4%  75.0%  53.8%  87.5%  80.0%  83.3%  73.3%  85.7%  76.9%  78.0%
Recall      86.7%  66.7%  80.0%  46.7%  46.7%  85.7%  71.4%  78.6%  85.7%  71.4%  72.0%
f1-score    89.7%  69.0%  77.4%  50.0%  60.9%  82.8%  76.9%  75.9%  85.7%  74.1%  74.2%

Note. Total number of reason questions = 153; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Table 21. Ten-fold stratified cross-validation results of Chinese attribute question with tree+CBR method

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   44.4%  35.3%  64.3%  65.0%  56.3%  36.4%  46.2%  50.0%  50.0%  46.7%  49.4%
Recall      50.0%  37.5%  56.3%  68.4%  60.0%  26.7%  40.0%  46.7%  33.3%  46.7%  46.6%
f1-score    47.1%  36.4%  60.0%  66.7%  58.1%  30.8%  42.9%  48.3%  40.0%  46.7%  47.7%

Note. Total number of attribute questions = 145; {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Table 22. Ten-fold stratified cross-validation results of multi-class classification by feeding only the negative returns from all tree kernels to the CBR method as input

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    Avg.
Precision   72.8%  63.0%  77.1%  74.2%  78.3%  60.4%  72.1%  63.9%  71.5%  65.3%  69.8%
Recall      72.5%  64.2%  74.3%  61.4%  62.9%  59.9%  61.2%  64.3%  66.7%  61.1%  64.9%
f1-score    70.5%  62.2%  74.6%  62.2%  64.6%  57.9%  62.6%  62.9%  67.1%  61.7%  64.6%

Note. {Rx | 1 ≤ x ≤ 10} represents the ten rounds of validation; k = 3.

Chapter 5

DISCUSSION

5.1 Hybrid method for question detection


This study proposed a hybrid method combining the rule-based QM method and the machine-learning-based Syntax method for detecting questions in naturally occurring online discussions by Hong Kong students. The QM method consists of the rule that identifies every sentence ending with a question mark as a question. This rule is established on the fact that the question mark is used in both Chinese and English to mark an interrogative statement. However, question marks are often found missing in casual forms of conversation (Cong et al., 2008), such as online discussion, and this affects the performance of question detection when only the question mark is used as an indicator. The Syntax method serves as a complement to the QM method. The Stanford parser, an unlexicalized PCFG parser, is chosen as the syntactic analyzer in this study. The parser is trained on the Penn Treebank and the Sinica Treebank for English and Chinese syntactic analysis respectively. The performance of the PCFG parser for question detection is determined by the probability of the production rules for questions found in the treebanks. In the hybrid method, all sentences are first processed by the QM method, and the negative returns from the QM method are passed to the Syntax method for further analysis. Any positive return from either the QM or the Syntax method is identified as a question.

The rule-based QM method and the machine-learning-based Syntax
method are selected as the baselines for comparison with the hybrid method.
Different units of analysis, such as a message or a paragraph, can be chosen
depending on the focus of investigation; a sentence is chosen as the unit
of analysis for this experiment. A sentence is defined in this study as a unit
of syntax, which may or may not contain a verb, and concludes with
appropriate end punctuation. A general expectation of a simple sentence is that it
should carry a single piece of meaning. If two separable meanings are
conveyed in the same sentence, it can be divided into two individual
sentences along those meanings. It is observed in both the Chinese and English
datasets that students may put different questions into one sentence. For
example, "Why do we always think of buy electricity from other places, Don't
Hong Kong have potential to develop reusable energy?" This sentence can
actually be divided into two questions. The correct forms should be "Why do
we always think of buy (consider buying) electricity from other places?" and
"Don't Hong Kong have potential to develop reusable energy?" This kind of
sentence tokenization requires the tokenizer to analyze the semantics of
sentences, which is out of the scope of this study. Hence, in the present study,
the sentences were tokenized based on end punctuation.

5.1.1 Hybrid method for English question detection


The results shown in Table 6 indicate that the hybrid method yields a
higher f1-score than both the baseline QM and Syntax methods. The recall
of the hybrid method is boosted from 87.7% in the QM method and
59.6% in the Syntax method to 93.5%. This increase marks a
significant improvement in the identification of questions by the hybrid
method. The following is an analysis of the performance of the hybrid
method.

In the hybrid method, a sentence is first processed by the QM
method and then passed to the Syntax method for further analysis. Although
the QM method has exceptionally high precision, the direct incorporation
of this method into the hybrid method has induced some wrong detections
caused by usages of the question mark other than question asking.
A hyperlink, such as http://www.sciscape.org/news_detail.php?, is one type of
text that is wrongly detected. The question mark in the hyperlink informs
the web server that parameters will be passed in the request.
This type of wrong detection contributed 56% of the false positive
returns in this study. Another type of wrong detection is the emoticon in
online discussion. An emoticon is "a group of keyboard characters that typically
represents a facial expression or suggests an attitude or emotion and that is
used especially in computerized communications" (Emoticon [Def. 1],
n.d.).

"?_?" is an example of an emoticon found in the dataset of this study.
Since the question mark is used as an indicator of the boundary between
sentences, this emoticon is divided into two separate sentences which end
with question marks, and the QM method identifies the emoticon as two
questions. A possible solution to the above wrong detections is to build extra
rules to filter out sentences which contain only hyperlinks or
emoticons.
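A sketch of such filtering rules, applied before the QM check, might look as follows; the two patterns are illustrative only, and a production filter would need a broader inventory:

    import re

    _HYPERLINK = re.compile(r"https?://\S+")
    _EMOTICON = re.compile(r"[?_^.;:()\-~*]{2,}")

    def is_spurious_question(sentence):
        # True if the sentence is only a hyperlink (e.g. ...news_detail.php?)
        # or a punctuation-only emoticon (e.g. ?_?), so its question mark
        # should not be read as question asking.
        s = sentence.strip()
        return bool(_HYPERLINK.fullmatch(s) or _EMOTICON.fullmatch(s))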

The second part of the hybrid method uses the unlexicalized PCFG
Stanford parser to analyze the syntactic structure of sentences. The Penn
Treebank is used to train the syntactic parser for the Syntax method.
There are two constituent tags for questions, namely SQ and SBARQ,
in the tagset of the Penn Treebank. SBARQ represents direct questions
beginning with a wh-pronoun, and SQ represents yes/no questions and the
constituents of SBARQ (Santorini, 1990). Since the sentence structures of
most sentences in the English dataset are generally complex, it is
highly possible that a question forms only a portion of a whole sentence. In
order not to miss such questions, in this study any sentence whose
syntactic structure contains either SBARQ or SQ is regarded as a
question.

The experiment results show that the Syntax method can detect
questions which cannot be identified by the QM method, such as "What will
happen if the gouverment (government) close down to (too) many schools
and it is not enough for the pupils to study.", "Why don't they have some
clean water.", "What is greenhouse gas" and "Can you write it again,
please...". Although these questions do not end with question marks, all of
them can be detected by the Syntax method because they contain
syntactically identifiable question substructures. However, the low recall also
reveals a problem of the Syntax method. There are two possible
explanations for this low recall. One is that the Penn
Treebank does not cover the syntactic structures of questions used by Hong
Kong students. The other is that the missing questions
are ungrammatical, so that they cannot be detected by the Stanford Parser.

In order to understand the cause of the low recall, we need to
investigate the nature of the questions which were missed in the
identification. The questions missed by the Syntax method can be
generalized into four main categories. The first category is questions
with only a single wh-adverb, or a conjunction before the wh-adverb, such as
"Why" or "But Why?" Figure 2 shows the syntactic structure of the question
"Why?". The figure shows that the sentence is regarded only as a fragment
(FRAG) of a complete sentence. A simple solution to this problem
is to add production rules that generate this type of question to the Penn
Treebank, but whether this solution would affect the
accuracy of syntactic processing needs validation.

Figure 2. Parse tree of question "Why?"

The second category of questions is very similar to the first category. It
includes questions with only a noun phrase (NP). Figure 3 shows the
syntactic structure of the example question "Teapot?". From the figure, we
can see that the parent node of the NNP (proper noun) node is X, where X
means uncertain category. Since a statistical parser is used for the syntactic
analysis, the occurrence of the uncertain category may reflect that sentences
containing only a single noun phrase are very rare in the Penn Treebank. As
this category of question is in the form of a normal statement, the only feature
that marks it as interrogative is the presence of the question mark.
Hence, the QM method is the only method that can identify this category of
questions.


Figure 3. Parse tree of question "Teapot?"

The third category of questions is a statement that ends with a question
mark. Such questions are syntactically indistinguishable from statements. Figure 4 shows
the syntactic structure of the example question "CFC lead to global
warming?". The syntactic structure of this category of question is the same
as that of a complete statement with a subject and a verb phrase. As for the second
category, the occurrence of a question mark is the only feature indicating
that the statement is interrogative. Hence, the QM method is
the only method that can identify this category of questions.

Figure 4. Parse tree of question "CFC lead to global warming?"


Figure 5. Parse tree of question "money comes from where?"

The last category is ill-formed questions, for example,
"money comes from where?" The correct form of this question should be
"Where does the money come from?", and its syntactic structure is shown in
Figure 5. This example demonstrates two common types of grammatical
errors. One is wrong word order, and the other is the omission of the word
"do" in the question. Since most of the text in the Penn Treebank is
collected from official documents or newspaper reports, the majority of
sentences in the Penn Treebank should be grammatical, and the treebank may
not include ill-formed sentences such as the one illustrated in this example.
Different approaches have been proposed to handle ill-formed sentences. One
of them is to construct an error grammar in which each grammar rule can be
augmented with a probability for analysis by a probabilistic parser such as the
Stanford Parser (Foster & Vogel, 2004). It is believed that the parsing result
might be improved by using the error grammar as a complement to the Penn
Treebank. However, the construction of an error grammar requires a large
collection of ungrammatical sentences. It is not yet possible to construct such
a grammar with the limited number of ungrammatical sentences included in
the dataset of this study, and such an improvement might be carried out in
future research.

In addition, there exist some instances which cannot be tackled by either
the QM or the Syntax method. "My new title is How can we use it?", for
example, is a sentence which is wrongly identified as a question by both
methods. The purpose of this sentence is to introduce the
new title of the speaker's message instead of asking a question. Both
methods fail to identify this example as a non-question. It is quite
difficult to tackle this problem without understanding the
semantics of statements. A counter example is "My question is how can we
use it?". This counter example has the same syntactic constituents as the
original example, but it is obviously a question. The two examples can
only be distinguished if our algorithm can differentiate the meanings of
"new title" and "question". Another example is the question "Look carefuly
(carefully) what did I write?" This sentence can be interpreted as "Look
carefully! What did I write?", a combination of exclamatory and
interrogative sentences. However, it can also be interpreted as "Look
carefully at what I wrote." Both interpretations require alteration of the
original sentence, and any alteration might cause deviation from the original
meaning. This kind of ambiguity can hardly be resolved without
consulting the author about the original meaning of the sentence.

5.1.2 Hybrid method for Chinese question detection


The result of the Chinese question detection indicates that the hybrid
method could not bring any improvement over the QM method. The
precision and recall of the hybrid method are 98.3% and 91.5% respectively,
which are the same as the results attained by the QM method. As shown in the
result, the Syntax method fails to identify any questions from the online
discourse, and all the questions in the hybrid method are identified by the
QM method. The only type of wrong detection by the QM method is caused
by the inclusion of hyperlinks in the discussion messages, as also
observed in the English dataset. As stated in section 5.1.1, a question
mark may be found at the end of a hyperlink to mark the boundary
between the hyperlink and the parameters to be passed to the server. All of
the false positive counts were caused by the inclusion of hyperlinks.

Concerning the exceptionally good question detection performance and
its ease of implementation, it is believed that the QM method is the most
desirable method for teachers to detect questions in online discussions
mediated in Chinese by Hong Kong students.

There is no doubt that the unsatisfactory result of the Syntax method for
Chinese question detection is caused by the limited number of questions in,
and the missing question tag in the tagset of, the Sinica Treebank. One way
to tackle this problem is to switch to another treebank for the question
detection experiment. However, it is impossible to select a suitable
treebank without understanding the problems found in the current
setting. An investigation of the Sinica Treebank and of the parsing results
from a parser trained with the Sinica Treebank can help us make a better
decision in the selection of a treebank.

The Sinica Treebank has a tagset with a different focus from that of
the Penn Treebank. The Sinica Treebank has a strong focus on lexical
content, while the Penn Treebank was built on an assumption of
context-freedom. This difference is revealed by the way the
part-of-speech for nouns is defined in the two treebanks. The Penn Treebank
differentiates nouns by two criteria: 1) singular or plural and 2) common noun or
proper noun, while the Sinica Treebank derives eleven part-of-speech tags
for nouns, including tags for location, direction, time, quantifier, etc. From the
way nouns are classified in the Sinica Treebank, we can see that
the Sinica Treebank emphasizes the meaning of words in Chinese, and
this characteristic leads to a wider variation of syntactic structures in
the treebank. Since the Stanford parser is an unlexicalized parser, the
inclusion of lexical content in the Sinica Treebank violates the
theoretical assumption of the Stanford parser and as a result affects the
performance of syntactic analysis. A lexicalized parser, such as the one proposed
by Chen (1996), could be implemented in a future study to test whether the
use of a lexicalized parser would improve the results of Chinese question
detection.

In addition, the segmentation of Chinese sentences is also a problem
that might affect the accuracy of syntactic analysis. Appendix I shows a list
of lexemes segmented by the Stanford Segmenter. 同意嗎 (do you agree),
for example, is a lexeme selected from Appendix I. This lexeme is actually
composed of two lexemes: 1) 同意 (agree) and 2) 嗎 (ma), but the
segmenter wrongly tokenized it as a single lexeme. An investigation of the
segmentation results of the training questions in the Chinese dataset shows
that around 45.5% of segmented questions contain segmentation errors. It is
found from the segmentation errors that lexemes are mostly segmented as
morphologically complex words, such as (can bring). Tseng et al.
(2005) report that the preference for morphologically complex words is one of
the causes of segmentation errors. Since those wrongly segmented lexemes
are not included in the corpus from which the treebank was generated, the
syntactic parser may fail to identify the part-of-speech of these lexemes and can
only rely on the other lexemes in the sentence to determine the
syntactic structure. In a future study, we might remove the conditions
that favour morphologically complex words from the segmenter and test
whether the segmentation results improve.

5.1.3 A comparison of Chinese and English questions


Chinese is very different from Indo-European languages such as
English. The characteristics of Chinese interrogative sentences are quite
different from those of English interrogative sentences (Yuan & Jurafsky, 2005). One
of these differences is the lack of subject-verb inversion. Subject-verb inversion is
used in English to convert a statement into a question. List (1) demonstrates
that a statement is turned into a question by shifting the verb to be into the
subject position.

List (1) Statement: He is a policeman.
        Question: Is he a policeman?

Such a form of inversion is not found in Chinese. A question can be formed
by merely attaching a question mark, with or without a question marker, at
the end of a statement. Examples (1), (2), (3) and (4) show that questions can
be formed from different kinds of lexical items, such as an adjective, a noun
phrase, a verb phrase or even a full sentence.
(1) [ADJ] Hard-working?

(2) [NP] The function of Chlorophyll?

(3) [VP] Encourage them to be independent?

(4) [SUBJ+Verb] You take care (of it)?

However, it is interesting that such a phenomenon is found in the
questions in the English dataset as well. Figure 4 shows the example
sentence "CFC lead to global warming?". This question is formed by
attaching a question mark at the end of a complete sentence. The existence
of such a phenomenon makes detection of questions by syntactic
structure fail, and we can only rely on the question mark to detect these
questions. It can thus be argued that the question mark is the most essential
criterion for the detection of questions.

Besides, Chinese questions have the characteristic of wh-in-situ (Aoun
& Li, 1993). This means that the wh-element remains in the same position as
the corresponding element in the answer. For example:

(5) Brother ate an apple.

(6) What did brother eat?

The wh-element (what) remains in the same position as (apple). The
syntactic structures of (5) and (6) should therefore be very similar to
each other, and this makes the use of syntactic structure for question detection
fail.

In addition, the existence of question markers is another unique
characteristic found only in Chinese questions. A question marker is a
morpheme found at the end of a question, right before the question mark.
嗎 (ma), 呢 (ne) and 吧 (ba) are common question markers found at the end
of questions. Examples (7), (8) and (9) show questions which end with the
three question markers. In the Chinese dataset, 38.4% of questions are marked
by question markers. However, 48.2% of those questions contain wrong
segmentation of the question markers, such as 關係嗎 (relationship ma)
segmented as one lexeme. This factor affects the question detection
performance of the Syntax method. An improvement to the current hybrid
method could be made by adding the question marker as a criterion for
question detection, as sketched after the examples below.
(7)
Do you agree with what I said?

(8)
There should be a deeper meaning and propose for education?

(9)
What is my chase?
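A minimal sketch of this additional criterion, assuming the sentence has already been segmented into lexemes; the particle set contains only the three markers discussed above, and a fuller inventory would be needed in practice:

    QUESTION_MARKERS = {"嗎", "呢", "吧"}  # sentence-final interrogative particles

    def ends_with_question_marker(tokens):
        # Strip trailing punctuation, then test the last remaining lexeme.
        i = len(tokens)
        while i > 0 and tokens[i - 1] in {"?", "？", "。", "！", "!"}:
            i -= 1
        return i > 0 and tokens[i - 1] in QUESTION_MARKERS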
Besides question markers, the A-not-A question is another unique
characteristic of Chinese questions. A-not-A questions have the format
A + NEG + A, where A can be an adjective, verb or noun (Huang & Chen,
2008). Examples (10) and (11) show examples of A-not-A questions.

(10) Is it cold today?

(11) Will you visit Japan?
In addition to these characteristics of Chinese questions, the questions in
the Chinese dataset also demonstrate a characteristic which is unique to the
Cantonese-speaking environment: the direct transcription of the oral language
into written form. Below are a few examples of such questions:

(12) What is your strength?

(13) Do you know that?

(14) What do you want to say?

(15) Could you stop initiating some nonsense topics?

(16) What is the relationship?

(17) (Do you think) my intention is to make the earth even worse?
Two types of transcription are shown in the above examples. One type is
the use of a Chinese character with the same pronunciation as a replacement
for a Cantonese word. One example is the Chinese character 咩 (mei).
咩 is an onomatopoeia in Chinese representing the sound of a
sheep, but it is used in (12) and (14) as a replacement for the interrogative
word (what), and in (16) as a substitution for the question marker 嗎 (ma).
Another example is the morpheme 唔 (ng), which is frequently used in
A-not-A questions as a replacement for the word 不 (bu). The other type of
transcription is the use of English characters to replace Cantonese words:
the "d" and "ge" in (14) and (16), respectively, are substitutions for the
Chinese auxiliary word 的 (de).

The above discussion illustrates that Chinese and English questions
have quite a few different characteristics. However, we still find in the
datasets a common characteristic: the occurrence of code-mixing.
Code-mixing refers to the situation in which multiple languages are used in
the same sentence during communication. Example (18) is found in the
Chinese dataset, while (19) is retrieved from the English dataset. Since most
natural language processing methods are designed to handle a single
language, code-mixing increases the difficulty of language processing and
may affect the correctness of syntactic analysis.

(18) DNA
     Does plant have DNA?

(19) How can we police
     How can we improve the efficiency of police?

It is generally regarded that English and Chinese belong to two
different language systems, and the natural language processing methods for
one language may not be fully applicable to the other. This claim
is correct to some extent, especially in the processing of the syntactic
structures of the two languages. However, the two languages as found in our
datasets also share some similarities. Firstly, the question mark is used in both
languages as the punctuation for questions, and satisfactory results were
attained by using only the QM method to detect questions in both languages.
Secondly, the phenomenon of code-mixing is found in both the Chinese and
English datasets. Questions with code-mixing may contain lexemes from
more than one language, and this poses a great challenge to the
corpus-based Syntax method. Since a corpus normally includes lexemes of a
single language, an extra process to translate the text into a single language
might be needed in order to process questions with code-mixing, as sketched
below.
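A minimal sketch of such a normalization step is shown below; the translate argument is a hypothetical lexicon lookup, and the CJK range test is a simplification:

    def normalize_code_mixing(tokens, translate):
        # Replace Chinese lexemes with their English translations so that a
        # single-language parser can process the sentence; other tokens pass through.
        def is_cjk(tok):
            return any("\u4e00" <= ch <= "\u9fff" for ch in tok)
        return [translate(tok) if is_cjk(tok) else tok for tok in tokens]

    # Illustrative use with a toy lexicon:
    lexicon = {"警察": "police"}
    normalize_code_mixing(["How", "can", "we", "improve", "警察"],
                          lambda t: lexicon.get(t, t))
    # -> ['How', 'can', 'we', 'improve', 'police']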

5.2 Syntax-based tree kernel method for question classification

Question classification is the task of classifying questions according to a
predefined question taxonomy. There are three main concerns for the
implementation of the tree kernel method for classifying questions by Hong Kong
students.

Firstly, most research on question classification focuses on the
problem of classifying questions according to the factual information that a
question requests. The tree kernel method is proven to give satisfactory
results for this type of problem (D. Zhang & Lee, 2003). The basic
mechanism of the tree kernel method is the comparison of the subtree
structures between the syntactic structures of questions in the
training and testing data. The success of this method indicates that
syntactic information is important for determining the answer type
of a factual question. In recent years, the question classification
community has been gaining interest in the classification of questions other than
factual questions, such as definition questions (Cui, Kan, Chua, & Xiao,
2004). However, the amount of research on the classification of question types
such as explanation or procedure questions is still limited. It is not yet known
whether a method derived from the task of classifying fact-oriented
questions is suitable for classifying other question types.

Secondly, most research on the tree kernel method experiments with
questions generated by native English speakers. As discussed in section 5.1.1,
the English questions generated by Hong Kong students demonstrate
some characteristics different from those generated by native English
speakers. It is not yet known whether these unique characteristics
have any influence on the classification results.

Finally, the Chinese language used in Hong Kong is also a major concern
of this experiment. First of all, the Chinese language in Hong Kong has
some unique characteristics which distinguish it from modern standard
Chinese. Also, Chinese and English belong to different language families. It
awaits validation whether the tree kernel method derived from the
classification of English questions is suitable for classifying Chinese
questions generated by Hong Kong students.

5.2.1 Non-fact oriented question types


Since most question taxonomies found in the question classification
literature are fact-oriented, a question taxonomy was constructed in this study
for analyzing questions based on their functions in inquiry. The details of the
taxonomy are discussed in section 3.3. Verification, reason and attribute
questions are selected for the question classification experiment. These
question categories are chosen because each of them has more than 100
counts in both the English and Chinese datasets. Of these three types of question,
only the attribute question is fact-oriented; the other two are
not. The results shown in Table 8, Table 9 and Table 10 are summarized
below in Table 23. Through the comparison shown in Table 23, we may
understand whether the tree kernel method is suitable for classifying questions
of a non-fact-oriented nature.

Table 23. Classification result of English verification, reason and attribute questions

              Precision   Recall   f1-score
Verification  89.0%       89.0%    89.0%
Reason        99.0%       74.0%    84.7%
Attribute     78.0%       50.0%    60.9%

These results illustrate that the classification of the non-fact-oriented
questions is even better than that of the fact-oriented attribute question. The
precision of the reason question is close to 100%, and the f1-scores of the
verification and reason questions are both above 80%. In contrast, the
classification result of the fact-oriented attribute question is relatively poor.
Although the precision of the attribute question remains at a high level, the recall
drops greatly. One possible explanation is the wide range of
syntactic structures included in the attribute question category. Figure 6 and Figure 7
show the syntactic structures of the attribute questions "Can u (you) tell me
where have many wind?" and "Where are the rubbish?" Both of these
questions are attribute questions asking for location information, but only two
common subset trees are found in these two questions (see Figure 8).


Figure 6. Parse tree of attribute question Can u tell me where have many wind?

Figure 7. Parse tree of attribute question Where are the rubbish?

Figure 8. Common subset trees of attribute questions Can u tell me where have many
wind? and Where are the rubbish?

Figure 9 shows an elaboration question "Can u (you) explain?"
Although the type of this question is different from the question "Can u (you)
tell me where have many wind?", these two questions have eleven
common subset trees (see Figure 10).

Figure 9. Parse tree of others question Can u explain?


Figure 10. Common subset trees of attribute question Can u tell me where have many
wind? and elaboration question Can u explain?

These examples reveal the problem of the current tree kernel method in
handling indirect questions. One possible solution is a mechanism that
identifies the relative clause "where have many wind" in Figure 6 and gives
a higher weighting to the subset trees within the relative clause; a sketch of
the clause extraction follows.
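A sketch of the clause-extraction step, assuming NLTK-style parse trees; SBAR is the Penn Treebank label for embedded clauses, and the weighting scheme itself is left open:

    def embedded_clauses(parse_tree, labels=("SBAR", "SBARQ", "SQ")):
        # Collect embedded clause subtrees so that subset trees rooted inside
        # them can be given a higher weight in the kernel computation.
        return [t for t in parse_tree.subtrees() if t.label() in labels]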

5.2.2 English questions by non-native English speaker


The literature shows that the tree kernel method gives the best
results when classifying questions generated in a native English speaking
environment, but there is little evidence that this performance can
be achieved by applying the same method to questions by
non-native English speakers. The high f1-scores of 89% and 84.7% obtained
for the classification of verification and reason questions in this study
indicate that the tree kernel has little problem handling questions by
non-native English speakers. The only issue that might need our
attention is the code-mixing found in the questions generated by
Hong Kong students. The following is an illustration of how code-mixing
affects the performance of the tree kernel method.
An example of code-mixing is "How can we police ?". In this
example, the lexical items for "improve" and "efficiency" are substituted by
their Chinese terms. This substitution leads to a wrong parse, as shown in
Figure 11; Figure 12 shows the correct syntactic structure for comparison.
To deal with this type of code-mixing, we can first translate all Chinese
lexical items back to English and then pass the translated question to the
syntactic parser for further analysis. Figure 13 shows the syntactic structure
of the same question as Figure 11 after translation. This syntactic structure
is literally the same as the one shown in Figure 12. However, this proposed
solution may fail if code-mixing replaces a phrase instead of individual
lexical items. An example is "Hong Kong police ?" (How could
we improve the efficiency of the Hong Kong police?). There are two main
problems for the translation. First of all, the Chinese lexical items are
grouped into a clause, and word alignment is needed to map the Chinese
lexical items onto English ones. Besides, a grammatical English question may
not be formed by direct mapping of the Chinese lexical items onto English
ones. Figure 15 shows the result of a direct translation of the question; it is
obvious that the translation is not grammatical. One way to handle
this problem is to use example-based machine translation (Nagao, 1984).
Since most machine translation methods are designed to translate text from
one single language to another, those methods may not fit the need of
translation with code-mixing. The example-based machine translation
method leverages experience from previous translations as examples for
future translation. It is believed to be the most suitable method
for handling code-mixing in the context of this study.


Figure 11. Parse tree of procedure question How can we police ?

Figure 12. The correct parse tree of question How can we police ?


Figure 13. Parse tree of procedure question "How we protect the environment?"

Figure 14. The correct parse tree of question How can we police ?

Figure 15. Direct translation of the question Hong Kong police ?

5.2.3 Chinese question classification


As described in the last section, the tree kernel method is recognized as
one of the best classification methods for handling questions formulated in a
native English speaking environment. However, there is hardly any literature
focusing on the classification of questions by Hong Kong students. Table 24
summarizes the Chinese question classification results shown in Table 11,
Table 12, Table 13 and Table 14.

Table 24. Classification result of Chinese verification, reason and attribute questions with
tree kernel method

              Precision   Recall   f1-score
Verification  71.2%       78.9%    74.7%
Reason        99.2%       58.0%    72.2%
Attribute     96.7%       25.1%    38.4%
Multi-class   89.1%       53.5%    61.1%

The Chinese question classification results follow a similar pattern to the
English question classification results. However, the general performance of
Chinese question classification is poorer than that of English question
classification. It is noticeable that the precision of the reason question remains
at a level similar to that of the English reason question, but the recall drops
greatly. It is believed that investigating the cause of the missing questions
may shed light on ways to improve the method for Chinese question
classification.

The variation of syntactic structure is one of the main causes of the
unsatisfactory performance of Chinese question classification. Figure 16 and
Figure 17 show the syntactic structures of two reason questions. Both
questions have the form (why) [A] (have) [B]? However,
the syntactic structures in the two figures are completely different. One
main cause of this problem is that the Sinica Treebank, the
treebank used for training the parser for Chinese syntactic analysis, is highly
lexicalized. The lexemes (Hong Kong), (kids in Hong Kong),
(plant) and (venom) should all have the part-of-speech of noun;
however, they are tagged with three different part-of-speech tags, as
shown in the figures below. The lexicalized nature of the Sinica Treebank
has introduced a wider variation of syntactic structures. It is believed that by
reducing the variety of part-of-speech tags, similar syntactic structures
would result for the examples below. In addition, a further step can be
taken to handle the syntactic variation found in Chinese questions: using
tree edit distance (K. Zhang & Shasha, 1989) to replace the subset
tree comparison. Tree edit distance is a cost function measuring the delete,
insert and relabel operations needed to transform one tree into another.
Our assumption is that, after replacing the part-of-speech tags for all nouns,
the difference between the two examples would be scaled down
to two lexemes, namely (only) and (some), so that the required tree
transformation would be limited. Whether this hypothetical solution can
handle the problem of syntactic variation needs to be validated in a future
study.
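As an illustration of the proposed replacement, the third-party zss package implements the Zhang and Shasha (1989) algorithm cited above; the toy trees below are illustrative only:

    from zss import Node, simple_distance  # pip install zss

    # Two toy parse trees differing in one constituent label.
    a = Node("S", [Node("NP", [Node("N")]), Node("VP", [Node("V"), Node("NP")])])
    b = Node("S", [Node("NP", [Node("N")]), Node("VP", [Node("V"), Node("PP")])])

    print(simple_distance(a, b))  # minimal number of relabel/insert/delete operations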

Figure 16. Parse tree of reason question ?

Figure 17. Parse tree of reason question ?


Besides, another characteristic of the Sinica Treebank may also be
a cause of the poor classification result: its preference for
flat tree structures. As shown in Figure 18, the syntactic structure of the sentence
(The tourism board also arranged a number of outing activities) is rather flat
and has a depth of 4. This characteristic limits the number of subset trees and
thus reduces the probability of identifying the same subset tree in two
questions.

Figure 18. Parse tree of sentence

The above discussion illustrates that the selection of treebank is a
determining factor in the performance of question classification. The Penn
Chinese Treebank has the same context-free assumption as the Penn
(English) Treebank. It would be possible to carry out a study with the Penn Chinese
Treebank to understand the extent to which the selection of treebank
affects the performance of the tree kernel method.

5.3 Lexeme-based case-based reasoning for Chinese question classification

The comparison of Figure 16 and Figure 17 in the last section
not only informs us that syntactic processing is a cause of the disappointing
Chinese question classification results, but also demonstrates the
importance of the lexical sequence for the decision of question type. A method
that compares the occurrence of lexemes in questions may be able to
identify the questions shown in the two figures as the same type of
question.

Case-based reasoning is chosen in this study for comparing the
composition of lexemes between testing and training questions. Each
question is represented as a case containing a list of attributes that record
the occurrence counts of particular lexemes in the question. Although a
question may be composed of a long string of lexemes, not all lexemes
provide the same amount of information for determining the
question type. In order to select only the relevant lexemes, only those
with an information gain of more than 0.011 were selected for further
processing. Table 25 compares the question classification results of the
syntax-based tree kernel and the lexeme-based CBR, drawn from Table 11,
Table 12, Table 13, Table 15, Table 16 and Table 17; a sketch of the case
construction and retrieval follows the table.

Table 25. A comparison of the question classification result by CBR and tree kernel method

            Verification        Reason              Attribute
            CBR      Tree       CBR      Tree       CBR      Tree
Precision   65.7%    71.2%      73.6%    99.2%      47.5%    96.7%
Recall      79.1%    78.9%      59.6%    58.0%      43.7%    25.1%
f1-score    71.7%    74.7%      64.7%    72.2%      45.2%    38.4%
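A minimal sketch of the case construction and retrieval described above is given below; the shared-lexeme similarity is an assumption standing in for the exact measure used, while the information gain threshold of 0.011 and k = 3 come from the study:

    import math
    from collections import Counter

    def information_gain(cases, labels, lexeme):
        # IG(lexeme) = H(labels) - H(labels | lexeme present/absent).
        def entropy(ys):
            n = len(ys)
            return -sum(c / n * math.log2(c / n) for c in Counter(ys).values()) if n else 0.0
        present = [y for toks, y in zip(cases, labels) if lexeme in toks]
        absent = [y for toks, y in zip(cases, labels) if lexeme not in toks]
        n = len(labels)
        return entropy(labels) - (len(present) / n) * entropy(present) \
                               - (len(absent) / n) * entropy(absent)

    def classify_knn(case_base, query_tokens, k=3):
        # Retrieve the k stored cases sharing the most lexemes with the query,
        # then take a majority vote over their question-type labels.
        scored = sorted(case_base,
                        key=lambda case: len(set(case[0]) & set(query_tokens)),
                        reverse=True)
        votes = Counter(label for _, label in scored[:k])
        return votes.most_common(1)[0][0]

Lexemes whose information gain falls below the 0.011 threshold would be dropped before the cases are built.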

As discussed in section 5.2.3, the wide variation of syntactic structure
demonstrated in the Chinese questions affects the performance of the
syntax-based tree kernel method. The recall of the attribute question by the tree
kernel method is only 25.1%. The implementation of CBR greatly
improves the recall of the attribute question, boosting it from 25.1%
to 43.7%. CBR can identify 88.89% of the attribute questions correctly
classified by the tree kernel method, while the tree kernel method can only
identify 49.25% of those correctly classified by CBR. The
CBR method can tackle two technical problems of the tree kernel method.
The first is the occurrence of uncertain components, and the second
is the difference in syntactic structure between questions with similar lexical
sequences.

Firstly, consider the problem of uncertain components in
syntactic structure, as illustrated in Figure 19. The figure shows an example
question (Besides virtue, what kind of fault can you observe?), in which the
phrase (Besides virtue) is recognized as an uncertain syntactic component. The
occurrence of an uncertain syntactic component affects the correctness of syntactic
processing and, consequently, the result of syntax-based question
classification. CBR uses the lexical items instead of the syntactic structure
for question classification: it retrieves the questions from the case base which
share the maximum number of common lexemes with the test question. The similar
question retrieved from the case base is (What is the method to protect it?).
This question shares two common lexical components with the example
question, namely (what) and 呢 (ne). This example demonstrates that
questions need not share a large number of common lexemes to be
identified by the lexeme-based case-based reasoning method.

Figure 19. Syntactic structure of question ?

Besides the existence of uncertain components, the variation of
syntactic structure is also a problem for the syntax-based tree kernel method.
Figure 20 and Figure 21 show the syntactic structures of two attribute
questions. The syntactic structures of these two questions vary greatly, so
the tree kernel method fails to identify them as similar
questions. However, the CBR method can easily recognize the two questions
as the same type of question through the identification of a similar phrase
(any method) found in both questions. This example reinforces the
understanding that lexical content is an important feature for the
determination of question types.

Figure 20. Parse tree of attribute question?

Figure 21. Parse tree of attribute question ?

Although the CBR method demonstrates some improvement over the
tree kernel method, this improvement cannot compensate for
its low precision. The precision of attribute questions achieved
by the CBR method is much lower than the 96.7% precision achieved by the tree
kernel method. Since k-NN is used for the retrieval of cases from the
database, this measure is biased toward question categories with a higher
frequency. The wrong classification of the definition question
? as an attribute question illustrates this problem. Firstly, the
frequency of attribute questions is much higher than that of definition questions.
Secondly, the lexical items (is) and (what) are usually found in
attribute questions. These two factors together induce a higher number of
matches from the attribute questions than from the definition questions, and hence
the example question is wrongly classified as an attribute question. A possible
solution is to generate hand-crafted rules that assign a
higher weighting to lexemes which are crucial for deciding the
question type, such as (literal meaning) in this case; a sketch of such a
weighted similarity follows.
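A sketch of such a weighted similarity is given below; the weight table is entirely hypothetical, with a placeholder key standing in for the decisive lexeme:

    # Hand-crafted weights for decisive lexemes; keys and values are illustrative only.
    LEXEME_WEIGHTS = {"<literal-meaning-lexeme>": 3.0}

    def weighted_overlap(case_tokens, query_tokens, weights=LEXEME_WEIGHTS):
        # Shared lexemes count once by default; decisive lexemes count more,
        # reducing the k-NN bias toward high-frequency categories.
        return sum(weights.get(tok, 1.0) for tok in set(case_tokens) & set(query_tokens))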

5.4 Hybrid method for Chinese question classification


A hybrid method is proposed to leverage the high precision of the
syntax-based tree kernel method and the improved recall of the
lexeme-based CBR method. Two experiments have been carried out to
understand the performance of the combined approach for bi-class and
multi-class classification. Table 26 summarizes the bi-class classification
results of the tree kernel, CBR and combined approaches from Table 11,
Table 12, Table 13, Table 15, Table 16, Table 17, Table 19, Table 20 and
Table 21.

Table 26. A comparison of the tree kernel, CBR and tree kernel + CBR methods for question
classification

            Verification               Reason                     Attribute
            Tree    CBR     Tree+CBR   Tree    CBR     Tree+CBR   Tree    CBR     Tree+CBR
Precision   71.2%   65.7%   62.1%      99.2%   73.6%   78.0%      96.7%   47.5%   49.4%
Recall      78.9%   79.1%   91.8%      58.0%   59.6%   72.0%      25.1%   43.7%   46.6%
f1-score    74.7%   71.7%   74.0%      72.2%   64.7%   74.2%      38.4%   45.2%   47.7%

Table 26 shows that the hybrid method achieves the highest f1-score
among the three methods in classifying reason and attribute questions. Even
though the f1-score for the verification question is not the highest among the
three methods, the result is very close to that of the tree kernel method. This
indicates that the hybrid method in general performs better
than the other two methods.


An interesting result is found in the classification of the attribute
questions. Not much improvement was made by the hybrid method, and its
recall is similar to that of the CBR method. This indicates that
most questions found by the tree kernel method can also be identified by the CBR
method, which again supports our hypothesis that lexical content is
important for the determination of question category.

More importantly, it is found from the experiment results that most
unclassified questions exhibit segmentation errors. For example, the
lexemes (kun) and (where) in the question ? (So,
which place is more suitable?) are Cantonese terms, and the segmenter fails
to tokenize this question correctly. Segmentation is an important
pre-processing step for both the tree kernel and CBR methods, and a poor
segmentation result may cause wrong classifications in both.
As noted earlier, 45.5% of the questions in the training data contain
segmentation errors. In a future study, we may test the classification results
with ICTCLAS (H. Zhang, Hong-kui, Xiang, & Liu, 2003), a segmenter with
a reported recall of 97%, to see whether the performance of the hybrid
method improves.

Finally, Table 27 summarizes the multi-class classification results shown in
Table 14, Table 18 and Table 22. The f1-score of the hybrid method for
multi-class classification is the highest among the three methods, indicating
that the general performance of the hybrid method for multi-class
classification is the best of the three. Besides, based on the fact that the tree
kernel method gives the highest precision of question classification, only the
negative returns from all tree kernels are passed to the CBR as input. This
eliminates unnecessary input to the CBR, and the results show that with this
arrangement both the precision and the recall of the CBR method are
improved (a sketch of this cascade follows Table 27).

Table 27. A comparison of the multi-class classification performance with tree kernel, CBR
and tree+CBR methods

            Tree    CBR     Tree+CBR
Precision   89.1%   62.3%   69.8%
Recall      53.5%   60.8%   64.9%
f1-score    61.1%   60.6%   64.6%
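A minimal sketch of this cascade is shown below; the two classifier arguments are hypothetical callables standing in for the trained per-category tree-kernel classifiers and the CBR retriever:

    def classify_hybrid(question, tree_kernel_classifiers, cbr_classify):
        # Stage 1: run every bi-class tree-kernel classifier (high precision);
        # a positive return is accepted immediately.
        for label, classifier in tree_kernel_classifiers.items():
            if classifier(question):
                return label
        # Stage 2: only questions rejected by all tree kernels reach CBR.
        return cbr_classify(question)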

5.5 Question taxonomy


5.5.1 The reliability of question taxonomy for automatic
question classification
Three criteria are listed in section 1.4 for the selection of a question
taxonomy for this study. The first criterion is the relevance of the taxonomy
to CSCL. The second is the reliability of the taxonomy for manual
coding. The third is the reliability of the taxonomy for automatic
question classification.

Firstly, this taxonomy is a refinement of a framework for
explanatory reasoning during learning, and it has been used in other CSCL
research (Hmelo-Silver & Barrows, 2008) for investigating students'
knowledge construction progression. It is therefore believed that the question
taxonomy of this study is suitable for analyzing the inquiry in CSCL
discourse. Secondly, the Cohen's Kappa coefficients of 0.89 for the English
dataset and 0.9 for the Chinese dataset demonstrate that this taxonomy is
reliable for manual coding. Lastly, although there is variation in
the performance of question classification across different question
detection and classification methods, a satisfactory performance can still be
achieved in some question categories. This reflects that the linguistic
characteristics may vary across question categories, and some
categories may not have distinctive linguistic characteristics
which can be used by the algorithms for question classification. A further
analysis of the automatic question classification results might need to
consider whether there is any need to restructure the question categories so that
each category has linguistic characteristics distinctive from the other
question categories. In summary, the question taxonomy in this study can
generally satisfy all three criteria listed at the beginning of the thesis and
should be regarded as an appropriate question taxonomy for the purpose of
automatic question classification.

5.5.2 The distribution of questions in the Chinese and English datasets

The results in section 4.1 show that the English and Chinese datasets
have similar distributions of questions. The three most frequently occurring
question types are verification, reason and attribute questions. The similarity in
question distribution may be due to the fact that both datasets were extracted
from the same online collaborative learning platform, Knowledge Forum.
Knowledge Forum was designed on a strong theoretical foundation,
and hence students who participate in discussions on Knowledge
Forum are likely to engage in similar modes of inquiry. The goal of inquiry
in Knowledge Forum is to deepen the mutual understanding of concepts
possessed by a community of learners. An inquiry is not regarded as a
process of approaching the truth; it is, rather, a continuous process for the
advancement of concepts. This form of inquiry is akin to the inquiry of the
scientific community (Bereiter, 1994).

Scientific inquiry should not be solely information seeking; it should
be explanation-driven instead. This type of inquiry is quite different from
the question-answering practice of traditional classrooms. In the
traditional classroom, the teacher is the authority who possesses the final
answer, and the correctness of students' answers is assessed by their
similarity to the model answer. Higher-order
explanation-driven questions are less likely to be found in traditional
classrooms. As opposed to the traditional classroom, the inquiry in
Knowledge Forum is student-centred: students are responsible for
identifying their gaps in understanding, and questioning is a resource for
students to synthesize cognitive conflicts. The goal of inquiry is to
generate a better explanation of a phenomenon, or more precisely to
increase the explanatory coherence of a concept. The explanation-seeking
question, as a type of question for eliciting explanations, is always regarded
by educators (Chan, Lee, & Van Aalst, 2001) as an indicator of the
deepening of inquiry. The comparison, reason, procedural and example
questions of the question taxonomy in this study belong to the
explanation-oriented questions. These questions constitute 35.5% and 25.1%
of the total number of questions in the English and Chinese datasets
respectively. A higher percentage of explanation-oriented questions is found
in the English dataset than in the Chinese dataset. This result may suggest
that the inquiry in the English dataset is more oriented toward explanation
than that in the Chinese dataset. Since we do not possess the background
information of the discussions in the two datasets, it is impossible to draw
any conclusion on the difference between the English and Chinese datasets.
However, other researchers might in the future explore this topic to
investigate the influence of language on the nature of inquiry.

Besides the differences shown above, there are two phenomena which
we find particularly interesting. The first concerns the high
occurrence of verification questions. The results in section 4.1 show that
verification questions contribute 38.3% and 45.8% of the total number of
questions in the English and Chinese datasets respectively. The percentages of
verification questions in both datasets are higher than the total percentage of all
explanation-oriented questions. This result is quite similar to
that attained in van Boxtel's study (2000), in which 59% of the total
number of questions were verification questions.

According to Graesser and Person's question taxonomy (Graesser
& Person, 1994), questions can be broadly divided into two categories,
namely long-answer questions and short-answer questions. The verification
question is categorized under this taxonomy as a short-answer question.
Short-answer questions can be satisfied with a few words or a phrase, and
deep reasoning patterns are less likely to be uncovered in short-answer
questions (Graesser & Person, 1994). In addition, some researchers
(Hakkarainen & Sintonen, 2002) equate verification questions with
yes-no questions, and regard the value of verification questions to an inquiry as
relatively low. If the verification question contributes little to the
progression of inquiry, then what is the implication of the high percentage of
verification questions found in the datasets? Does the high proportion of
verification questions merely inform us that the students are only capable of
asking low-level questions? The following examples may shed light on the value of
verification questions.
"Will the HK Economic (economy) will be better ten years later?" and
"But there is a problem, can we choose which company to use after setting
up the 3rd company?" are two examples of verification questions found in
the datasets of this study. Both are classified as verification
questions based on their grammatical form, but it is obvious that neither
can be satisfied with a short answer composed of only a
few words. The first question, "Will the HK Economic (economy) will be
better ten years later?", requests a prediction of the Hong Kong
economy ten years later. The answerer needs to rely on previously acquired
knowledge to make such a prediction. This question not only
activates the prior knowledge of the answerer, but also makes that prior
knowledge explicit to all group members. As opposed to the
first question, the second question, "But there is a problem, can we choose
which company to use after setting up the 3rd company?", does not request
knowledge from the answerer. Although it is in the
format of a question, the purpose of the questioner is to express his/her concern
over the low possibility that the introduction of a third power company
could bring competition into the power supply market in Hong Kong. This
statement is more like an opinion than a question; its purpose is to initiate a
discussion on the concern put forward by the questioner. As suggested by
van Boxtel (2000), the verification question is a resource for constructing and
monitoring the common ground shared among the members of a group.
The examples of verification questions illustrated above indicate that the
verification question is not just "a question request for yes/no responses to
factual questions" (Hmelo-Silver & Barrows, 2008, p. 61); it can also
elicit the prior knowledge possessed by other group members or by the
person who asks the question. However, the study of the role of verification
questions in inquiry is quite limited in the field of CSCL. We hope that this
discussion of verification questions can facilitate other researchers to revisit
the roles of the verification question in inquiry.

The second phenomenon concerns the particularly low occurrence of
some question types. Both the Chinese and English datasets contain seven
question types with less than 5% occurrence each. The frequency of questions
is often used as an indicator of the quality of inquiry (Lai & Law, 2012). If
frequency is related to the quality of inquiry, would the low occurrence
rate inform us that these question types are irrelevant to the quality of
inquiry? The comparison question is selected here to illustrate
the value of questions with low occurrence. The comparison question is one of
the question types with less than 1% occurrence in both datasets.
(What are the differences and
similarities between the living condition of plants and animals?) is an
example of a comparison question, in which the questioner tries to
compare the living conditions of plants and animals.
Comparison plays an important role in knowledge advancement. Science is
not only about memorizing a large number of facts; an objective of science
is to recognize and restructure facts so that a more complex structure of
knowledge can be formulated, and it is through comparison that people are
facilitated in organizing the acquired information in a more systematic way.
Questions like comparison questions are regarded as high-level questions;
however, their occurrence in discourse is relatively low. If we continue to use
question counts as a metric for the quality of inquiry, we may
miss the influence of such questions in the discourse. It is recommended
to revisit the current methods of quantitative analysis to cater for the bias in
the distribution of different types of questions in discourse.


Chapter 6

IMPLICATIONS, RECOMMENDATIONS, LIMITATIONS AND CONCLUSION
6.1 Main Findings

The findings of this study are presented according to the four research
questions listed in Chapter one.

1. Is a hybrid method combining the rule-based question mark method and
the machine-learning-based syntactic parsing method an effective method
for detecting questions in CSCL discourse of Hong Kong students?

English question detection

The combination of the question mark and syntactic analysis for
English question detection achieves a 94.8% f1-score and 98.9%
accuracy, which is the highest among the QM, Syntax and QM +
Syntax methods. This indicates that the combination of the question mark and
shallow syntactic analysis is an effective method for detecting
questions in the English CSCL discourse of Hong Kong students.

Chinese question detection

The best performance of Chinese question detection is achieved by the
QM method, with a 94.8% f1-score and 93.9% accuracy. Adding the Syntax
method leaves the f1-score unchanged while the accuracy drops to 89.5%;
this drop is mainly caused by the unsatisfactory result of the syntactic
analysis. Two factors determine the accuracy of the syntactic analysis. One
is that Chinese questions are syntactically ambiguous: the
syntactic structures of some Chinese questions are literally the same as those
of statements. The other is that the statistical parser is not able to
parse questions correctly, given the limited number of training questions
and the absence of question tags in the Sinica Treebank.
Concerning the high f1-score and accuracy of the QM method, it is
recommended as the method for detecting questions in Chinese
discourse.

2. Is the tree kernel method an effective method for classifying questions in
CSCL discourse of Hong Kong students?

English question classification

The question classification results vary across different types of
questions. The classification of verification and reason questions both
achieves f1-scores above 80%. However, for question types with
a wide variation of syntactic structures, such as the attribute question,
the results are not as satisfactory as those for the verification and reason
questions: the f1-score of the attribute question is only 60.9%. The
variation in the performance of question classification informs us that
the syntax-based tree kernel approach is an effective approach for
classifying English questions with distinctive syntactic structures. For
question types without distinctive syntactic structures, an extra
mechanism may be needed to identify the key clauses or phrases in
questions, for example, to identify the embedded clause "where
have many wind" in the question "Can u tell me where have many
wind?"

Chinese question classification

The classification results for Chinese questions are generally poorer than
those for English question classification. Although the f1-scores of the
verification and reason questions remain above 70%,
the f1-score of the attribute question is much worse than the other two,
at only 38.4%. The investigation of
the classification results reveals that the main cause of this
unsatisfactory result is the problem of syntactic processing. Figure 16
and Figure 17 show two Chinese questions which share most lexical
items in common but have completely different syntactic structures.

3.

Is a hybrid method combining the lexeme-based case-based reasoning


method and syntax-based tree kernel method an effective method for
classifying Chinese questions in CSCL discourse of Hong Kong
students?

The experiment on using lexical items as attributes for CBR demonstrates that the occurrence of lexical items is an important feature for determining question types. The implementation of CBR can improve the recall of all three types of question; however, it also causes a great drop in precision. A hybrid method, as illustrated in Figure 1, can leverage the high precision of the tree kernel method and the improved recall of the CBR method. The result shows that the f1-score of the hybrid method for multi-class classification surpasses those of the tree kernel and CBR methods. This indicates that the hybrid method can generally improve the result of Chinese question classification.
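
A minimal sketch of this hybrid rule is given below. The Jaccard-style lexeme overlap used as the CBR similarity measure, the confidence threshold, and the toy case base are all assumptions for illustration; the tree kernel classifier appears only as a placeholder prediction and score.

# Sketch of the hybrid rule: trust the high-precision tree kernel label
# when its decision score is confident, otherwise fall back to CBR
# retrieval over lexeme overlap.
def lexeme_similarity(a, b):
    """Jaccard-style overlap between two collections of lexemes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cbr_classify(test_lexemes, case_base):
    """Return the label of the stored case most similar to the test case."""
    best = max(case_base, key=lambda c: lexeme_similarity(test_lexemes, c["lexemes"]))
    return best["label"]

def hybrid_classify(test_lexemes, tk_label, tk_score, case_base, threshold=0.0):
    return tk_label if tk_score > threshold else cbr_classify(test_lexemes, case_base)

case_base = [
    {"lexemes": ["為什麼", "會", "有", "颱風"], "label": "reason"},
    {"lexemes": ["是", "不是", "真的"], "label": "verification"},
]
# The tree kernel is unconfident (score -0.3), so the CBR retrieval decides.
print(hybrid_classify(["為什麼", "香港", "有", "颱風"], "attribute", -0.3, case_base))
# -> 'reason'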

4. Is a question taxonomy based on the functions of questions in inquiry suitable for the automatic classification of questions in CSCL discourse of Hong Kong students?

The original G&P (1994) question taxonomy was used by Hmelo-Silver (2003) for analyzing the inquiry in CSCL discourse. This study has made a few adaptations to simplify the taxonomy. The substantial inter-coder reliability between two human coders shows that the question taxonomy discussed in section 2.4 can achieve a high reliability for analyzing questions from online discussions of Hong Kong school students. However, the results of the automatic question analysis disagree to a certain extent, particularly for the attribute question, with the inter-coder reliability achieved by the human coders. These results not only reveal the weaknesses of the automated method in processing some types of questions; they may also indicate that the linguistic features, both syntactic and lexical, of the questions in the attribute class are not distinctive enough.
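
For reference, inter-coder reliability of this kind is typically measured with Cohen's kappa (Cohen, 1960), where "substantial" agreement corresponds to the 0.61-0.80 band of Landis and Koch (1977). The following is a minimal sketch of the computation over two coders' labels; the label sequences are toy illustrations, not the study's data.

# A minimal sketch of Cohen's kappa for two coders' question-type labels.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)

coder_a = ["verification", "reason", "attribute", "reason", "verification"]
coder_b = ["verification", "reason", "reason", "reason", "verification"]
print(round(cohens_kappa(coder_a, coder_b), 3))  # -> 0.667, i.e. substantial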

6.2 Implications and Recommendations

The present study has a few implications for automated question analysis and CSCL research. Firstly, it is commonly found in the literature that the same question classification method is used for analyzing all questions found in discourse. The results of this study have shown that classification results vary across different types of questions, which may imply that different classification methods should be selected for processing different types of question. Secondly, this study has also shown that the automatic processing of naturally occurring questions does not need a large amount of training data. The amount of training data is always a concern for machine-learning-based methods. This study has shown that a training set of fewer than 500 instances, as for the verification question, can still yield a satisfactory question classification result. This finding is especially important for CSCL research: since the amount of processed CSCL data is limited, researchers are reluctant to explore automated methods for discourse analysis. This study sets an example for future research that automated question classification might not need a large amount of training data. Thirdly, state-of-the-art methods for the automated analysis of CSCL discourse mainly focus on replicating human judgement based on a well-established CSCL framework. There has been no exploration of how the results of computer algorithms can feed back to CSCL researchers on the validity of the CSCL framework. As illustrated in the previous section, attribute questions exhibit a wide range of syntactic and lexical variation; researchers may therefore consider whether this question category is distinctive enough or whether it should be re-organized into different question categories. Lastly, most automated discourse analysis research in CSCL still focuses on assessment based on the quantity of a particular discourse act. However, as discussed in section 5.5.2, quality is also an important consideration in the assessment of the discussion. This insight may form a cornerstone for the integration of computational methods into the assessment of CSCL discourse.
Furthermore, it is believed that the tool for automatic question identification and classification described in this study would have an impact on computer-supported collaborative learning if it were available to students and teachers in their daily teaching and learning. A basic usage is the instant filtering of questions. A discussion can span a few months, and it is difficult for both teachers and students to trace its development. This tool enables teachers and students to quickly identify all questions found in the discussion. Besides, the tool may also serve as a dashboard reflecting the healthiness of a discussion through a report on the quantity and quality of its questions. Questioning is the cornerstone of inquiry-based learning. If the quantity of questions in a discussion is small, or the students are focusing on fact-based questions, it may be a sign of a lack of inquiry, or of students focusing only on information exchange. The teacher of such a discussion, even a less experienced one, would understand that there is a need to intervene in the discussion and encourage the students to raise more questions that are fruitful for the discussion.
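
A hedged sketch of what such a dashboard report could compute is shown below. It assumes each sentence has already been labelled by the detection and classification pipeline described in this study; the threshold for flagging a fact-based discussion is an arbitrary illustration.

# Sketch of a dashboard report: question counts by type, plus a simple
# flag when the discussion looks dominated by fact-based questions.
from collections import Counter

def question_dashboard(labelled_sentences):
    """labelled_sentences: list of (sentence, question_type or None)."""
    counts = Counter(label for _, label in labelled_sentences if label)
    total = sum(counts.values())
    fact_heavy = total > 0 and counts.get("verification", 0) / total > 0.7
    return {"total_questions": total, "by_type": dict(counts),
            "mostly_fact_based": fact_heavy}

discussion = [
    ("Is it true?", "verification"),
    ("I agree with you.", None),
    ("Why does the wind blow?", "reason"),
]
print(question_dashboard(discussion))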

6.3 Limitations of this study

One big limitation of this study is the amount of data used for the experiments. The training data for both Chinese and English question classification are below 1000 instances. This amount of data is quite small compared to the large amount included in the TREC QA collection. It is yet unknown whether the question detection and classification results can be further improved by increasing the training data. Besides, the contextual information of the discussion is not considered in this study. However, factors such as age, gender, subject, school, etc. may influence the question-asking behaviour of students. It is still uncertain whether the research results generated from this study are applicable for analyzing questions generated by students from different backgrounds.

6.4 Future Work


Throughout the work attempted in the present study, we have identified several possible future research directions which may bring significant contributions to the CSCL and natural language processing communities. The first three suggestions listed below relate to natural language processing, while the last one relates to CSCL research.

First and foremost, it is essential to revisit the feasibility of using an unlexicalized PCFG parser for processing Chinese sentences. This study clearly demonstrates the lexicalized characteristics of Chinese sentences. Future experiments should explore the use of a lexicalized parser for Chinese syntactic processing.

Secondly, this study has highlighted a few unique characteristics of the language used by Hong Kong students, but this result is based only on a small amount of online discussion. It is believed that by increasing the scope of this study to include more online discussions by students from different schools and grades, we might gain a more in-depth understanding of the unique characteristics of the language used by Hong Kong students. This knowledge would help the computational linguistics community to develop tools which are more suitable for the analysis of languages in Hong Kong.

Thirdly, the learning module of the CBR has not yet been implemented in this study. The learning module is important to the mechanism of CBR: its performance can be improved through case revision and case retention. It is believed that the incorporation of the learning module could improve the result of question classification.

Moreover, the analysis in sections 5.2.3 and 5.3 shows that the current syntax-based and lexeme-based Chinese question classification methods take into consideration some irrelevant linguistic information when determining the question category. It is believed that the Chinese question classification result might be improved by implementing some mechanism to filter out the irrelevant information.
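
One concrete filtering mechanism, in the spirit of the information-gain ranking of lexemes in Appendix I, would be to keep only lexemes whose information gain with respect to the question category reaches a threshold. A minimal sketch follows; the toy samples and the reuse of Appendix I's 0.011 threshold are illustrative assumptions.

# Sketch of filtering lexeme features by information gain (IG) with
# respect to the question category, keeping lexemes with IG >= 0.011.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(lexeme, samples):
    """samples: list of (set_of_lexemes, label); IG of presence/absence."""
    with_lex = [lbl for lex, lbl in samples if lexeme in lex]
    without = [lbl for lex, lbl in samples if lexeme not in lex]
    n = len(samples)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_lex, without) if part)
    return entropy([lbl for _, lbl in samples]) - cond

samples = [
    ({"為什麼", "颱風"}, "reason"),
    ({"是", "不是"}, "verification"),
    ({"為什麼", "風"}, "reason"),
    ({"是", "真的"}, "verification"),
]
kept = [w for w in ["為什麼", "是", "颱風"] if information_gain(w, samples) >= 0.011]
print(kept)  # lexemes retained as classification features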

Finally, as described at the beginning of this thesis, the development of computational tools could help teachers to make pedagogical decisions. An experiment can be carried out in a future study to understand how this kind of computation-aided analysis tool can help teachers in their teaching.

6.5 Conclusion

Automated question detection and classification of CSCL discourse are challenging topics. The question taxonomy and the question analysis technology are two big challenges in their own right. First of all, there exists no standardized framework for analyzing questions in CSCL discourse. Actually, standardization would never happen in the context of education, since there is yet no agreement on the core question in education: what is learning? If one tries to come up with a standard method for analyzing all questions in CSCL discourse, I would say he/she is heading to a dead end. So, what is the point of this study? I must restate here that the purpose of this study is not to establish any standard for the classification of inquiry in CSCL discourse. Rather, the aim of this study is to explore the possibility of developing automated methods for analyzing the CSCL discourse of Hong Kong students. Through the investigation of the automated question detection and classification results, we have gained an in-depth understanding of the unique characteristics of the language used by Hong Kong students, such as code-mixing, and suggestions have been provided for tackling the problems encountered. It is believed that other researchers can build on this understanding to formulate automated tools which are more suitable for analyzing the language used by Hong Kong students.


In conclusion, the continuous improvement of natural language processing technology has opened a gateway for CSCL researchers to make use of these technologies to understand the regularities of human cognitive processes. It is believed that the incorporation of these technologies into the facilitation of learning will have a significant impact on future education. Here and now, we are witnessing a pivotal moment in the future of education.


Appendix I
List of lexemes with Information Gain higher than or equal to 0.011


References
Aamodt, A., & Plaza, E. (1994). Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 7(1), 39–59.
Aoun, J., & Li, Y. A. (1993). Wh-Elements in Situ: Syntax or LF? Linguistic Inquiry, 24(2), 199–238.
Bereiter, C. (1994). Implications of Postmodernism for Science, or, Science as Progressive Discourse. Educational Psychologist, 29(1), 3–12.
Bloehdorn, S., & Moschitti, A. (2007). Structure and semantics for expressive text kernels. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), 861. doi:10.1145/1321440.1321561
Bu, F., Zhu, X., Hao, Y., & Zhu, X. (2010). Function-based question classification for general QA. (October), 1119–1128.
Burbules, N. C. (1993). Dialogue in teaching: Theory and practice. New York: Teachers College Press.
Burtis, J. (1998). Analytic Toolkit for Knowledge Forum. Centre for Applied Cognitive Science, The Ontario Institute for Studies in Education/University of Toronto.
Carlson, A. J., Cumby, C. M., Rosen, J. L., & Roth, D. (1999). The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101 (pp. 1–14).
Census and Statistics Department. (2012). Population Aged 5 and Over by Usual Language, 2001, 2006 and 2011 (A107). http://www.census2011.gov.hk/en/main-table/A107.html.
Chan, C. K. K., Lee, E. Y. C., & Van Aalst, J. (2001). Assessing and Fostering Knowledge Building Inquiry and Discourse. Paper presented at the IKIT Summer Institute 2001. Toronto, ON.

Chen, F., Tsai, P.-F., Chen, K., & Huang, C. (1999). Sinica Treebank. Computational Linguistics and Chinese Language Processing, 4(2), 97–104.
Chen, K. (1996). A Model for Robust Chinese Parser. Computational Linguistics and Chinese Language Processing, 1(1), 183–204.
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. doi:10.1177/001316446002000104
Cong, G., Wang, L., Lin, C.-Y., Song, Y.-I., & Sun, Y. (2008). Finding question-answer pairs from online forums. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08), 467. doi:10.1145/1390334.1390415
Cui, H., Kan, M., Chua, T., & Xiao, J. (2004). A Comparative Study on Sentence Retrieval for Definitional Question Answering. SIGIR Workshop on Information Retrieval for Question Answering.
Day, M.-Y., Ong, C.-S., & Hsu, W.-L. (2007). Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach. 2007 IEEE International Conference on Information Reuse and Integration, 203–208. doi:10.1109/IRI.2007.4296621
Emoticon [Def. 1]. (n.d.). Merriam-Webster Online. Retrieved January 29, 2013, from http://www.easybib.com/reference/guide/apa/dictionary
Enfield, N. J., Stivers, T., & Levinson, S. C. (2010). Question–response sequences in conversation across ten languages: An introduction. Journal of Pragmatics, 42(10), 2615–2619. doi:10.1016/j.pragma.2010.04.001
Forman, G. (2003). An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3, 1289–1305.
Foster, J., & Vogel, C. (2004). Parsing Ill-formed Text using an Error Grammar. Artificial Intelligence Review, Special AICS2003 Issue (pp. 1–24).

Global English. (2013). 2013 Business English Index & Globalization of English Research.
Graesser, A. C., Baggett, W., & Williams, K. (1996). Question-driven Explanatory Reasoning. Applied Cognitive Psychology, 10, 17–33.
Graesser, A. C., & Person, N. K. (1994). Question Asking During Tutoring. American Educational Research Journal, 31(1), 104–137. doi:10.3102/00028312031001104
Hakan, S. (2007). A Re-examination of Question Classification. In J. Nivre, H.-J. Kaalep, K. Muischnek, & M. Koit (Eds.), NODALIDA 2007 Conference Proceedings (pp. 394–397).
Hakkarainen, K. (1998). Epistemology of scientific inquiry and computer-supported collaborative learning. Unpublished doctoral thesis, University of Toronto.
Hakkarainen, K. (2003). Progressive inquiry in a Computer-supported biology class. Journal of Research in Science Teaching, 40(10), 1072–1088.
Hakkarainen, K., & Sintonen, M. (2002). The Interrogative Model of Inquiry and Computer-Supported Collaborative Learning. Science & Education, 11, 25–43.
Harabagiu, S. M., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Girju, R., et al. (2000). FALCON: Boosting Knowledge for Answer Engines. Proceedings of the 9th Text Retrieval Conference (TREC-9), 479–488.
Hmelo-Silver, C. E. (2003). Analyzing collaborative knowledge construction: multiple methods for integrated understanding. Computers & Education, 41(4), 397–420. doi:10.1016/j.compedu.2003.07.001
Hmelo-Silver, C. E., & Barrows, H. S. (2008). Facilitating Collaborative Knowledge Building. Cognition and Instruction, 26(1), 48–94. doi:10.1080/07370000701798495
Hong, L., & Davison, B. D. (2009). A classification-based approach to question answering in discussion boards. Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 171–178. doi:10.1145/1571941.1571973
Huang, S., & Chen, K. (2008). Knowledge Representation and Sense Disambiguation for Interrogatives in E-HowNet. Computational Linguistics and Chinese Language Processing, 13(3), 255–278.
Hull, D. A. (1999). Xerox TREC-8 Question Answering Track Report. Proceedings of the 8th Text Retrieval Conference (TREC-8) (pp. 743–752).
Ittycheriah, A., Franz, M., & Roukos, S. (2001). IBM's Statistical Question Answering System - TREC 10. Proc. of TREC 2001.
Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), 423–430. doi:10.3115/1075096.1075150
Kolodner, J. L. (1983). Maintaining organization in a dynamic long-term memory. Cognitive Science, 7, 243–280.
Lai, M., & Law, N. (2012). Questioning and the quality of knowledge constructed in a CSCL context: a study on two grade-levels of students. Instructional Science. doi:10.1007/s11251-012-9246-1
Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159–174.
Law, M. C. (2005). The Acquisition of English Subject-Verb Agreement By Cantonese Speakers (Master of Arts dissertation). Retrieved from the HKU Scholars Hub.
Lee, K.-S., Oh, J.-H., Huang, J., Kim, J.-H., & Choi, K.-S. (2000). TREC-9 Experiments at KAIST: QA, CLIR and Batch Filtering. Proceedings of the 9th Text Retrieval Conference (TREC-9) (Vol. 44, pp. 303–316).
Levy, R., & Manning, C. D. (2003). Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003 (pp. 439–446).

Li, B., Si, X., Lyu, M. R., King, I., & Chang, E. Y. (2011). Question identification on Twitter. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), 2477. doi:10.1145/2063576.2063996
Li, D. C. S. (2008). Understanding mixed code and classroom code-switching. New Horizons in Education, 56(3), 75–87.
Li, X., & Roth, D. (2002). Learning question classifiers: the role of semantic information. Natural Language Engineering, 12, 229–249. doi:10.1017/S1351324905003955
Man, S. S. (2006). First Language Influencing Hong Kong Students' English Learning (Master of Arts dissertation). Retrieved from the HKU Scholars Hub.
Marcus, M. P. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics.
Miles, M., & Huberman, A. M. (1994). Qualitative Data Analysis. Thousand Oaks, CA: Sage Publications.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.
Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In A. Elithorn & R. Banerji (Eds.), Artificial and Human Intelligence. Elsevier B.V.
Pasca, M., & Harabagiu, S. M. (2001). High Performance Question/Answering. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 366–374). New Orleans, LA.
Prager, J., Chu-Carroll, J., & Czuba, K. (2002). Statistical answer-type identification in open-domain question answering. Proceedings of the Second International Conference on Human Language Technology Research (p. 150). Morristown, NJ, USA: Association for Computational Linguistics. doi:10.3115/1289189.1289276

Prager, J., Radev, D., Brown, E., Coden, A., & Samn, V. (1999). The Use of Predictive Annotation for Question Answering in TREC8. Proceedings of the 8th Text Retrieval Conference (TREC-8) (pp. 399–410).
Rebedea, T., Dascalu, M., & Trausan-Matu, S. (2011). Automatic Assessment of Collaborative Chat Conversations with PolyCAFe. In F. Wild, M. Wolpers, C. D. Kloos, D. Gillet, & R. M. C. García (Eds.), Towards Ubiquitous Learning, 6th European Conference on Technology Enhanced Learning, EC-TEL 2011 (pp. 299–312). Palermo, Italy: Springer.
Rosé, C., Wang, Y.-C., Cui, Y., Arguello, J., Stegmann, K., Weinberger, A., & Fischer, F. (2008). Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning. International Journal of Computer-Supported Collaborative Learning, 3(3), 237–271. doi:10.1007/s11412-007-9034-0
Santorini, B. (1990). Part-of-Speech Tagging Guidelines for the Penn Treebank Project.
Scardamalia, M. (2002). CSILE/Knowledge Forum. Education technology: An encyclopedia.
Scardamalia, M., & Bereiter, C. (1991). Higher Levels of Agency for Children in Knowledge Building: A Challenge for the Design of New Knowledge Media. The Journal of the Learning Sciences, 1(1), 37–68. doi:10.1207/s15327809jls0101_3
Schmitt, S., & Bergmann, R. (1999). Applying Case-Based Reasoning Technology for Product Selection and Customization in Electronic Commerce Environments, (27068), 1–15.
Simmons, R. F. (1965). Answering English Questions by Computer: A Survey. Communications of the ACM, 8(1), 53–70.
Suzuki, J., Taira, H., Sasaki, Y., & Maeda, E. (2003). Question classification using HDAG kernel. Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, 61–68. doi:10.3115/1119312.1119320

Taylor, A., Marcus, M. P., & Santorini, B. (2003). The Penn Treebank: an overview. In A. Abeillé (Ed.), Treebanks: Building and Using Parsed Corpora (pp. 5–22). Dordrecht, Netherlands: Kluwer Academic Publishers.
Tseng, H., Chang, P., Andrew, G., Jurafsky, D., & Manning, C. (2005). A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005. Fourth SIGHAN Workshop on Chinese Language Processing.
Van Boxtel, C. (2000). Collaborative concept learning: Collaborative learning tasks, student interaction, and the learning of physics concepts. Unpublished doctoral thesis, Utrecht University, Utrecht, The Netherlands.
Wang, K., & Chua, T.-S. (2010). Exploiting Salient Patterns for Question Detection and Question Retrieval in Community-based Question Answering. Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1155–1163).
Weber, R. P. (1990). Basic Content Analysis. Newbury Park, CA: Sage Publications.
Yuan, J., & Jurafsky, D. (2005). Detection of Questions in Chinese Conversational Speech. Automatic Speech Recognition and Understanding, 2005 (pp. 47–52).
Zhang, D., & Lee, W. S. (2003). Question classification using support vector machines. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), 26. doi:10.1145/860442.860443
Zhang, H., Yu, H.-K., Xiong, D., & Liu, Q. (2003). HHMM-based Chinese Lexical Analyzer ICTCLAS. Proc. of SIGHAN Workshop.
Zhang, K., & Shasha, D. (1989). Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing, 18(6), 1245–1262. doi:10.1137/0218082
Zhang, Y., & Wildemuth, B. M. (2009). Qualitative Analysis of Content. In B. M. Wildemuth (Ed.), Applications of Social Research Methods to Questions in Information and Library Science (pp. 309–319). Westport, CT: Libraries Unlimited.
