
RESEARCH ON WORLD UNIVERSITY RANKINGS.

A PROJECT REPORT
Submitted in partial fulfillment for the award of the course
Data Mining Techniques (SWE2009)
by

ESWAR HARI KUMAR (15MIS0390)
PREM VUDATHA (15MIS0192)
VINEETH KUMAR (15MIS0256)
HARRY LIDIL (15MIS0391)

Under the guidance of


Sathyamoorthi.E
Assistant Professor (senior)

School of Information Technology & Engineering


Abstract:
Our project analyses the World University Rankings data published by Times
Higher Education, which is widely regarded as one of the most influential and
closely watched university measures. Founded in the United Kingdom in 2010, it
has been criticized for its commercialization and for undermining institutions
that do not teach in English.

The dataset contains the factors used to rank each university, such as its
teaching rating and its citations score. In this project we analyse the top
universities using R.

DataSet Information:
The dataset has 14 attributes:
1. world_rank - world rank of the university. Contains rank ranges and equal
ranks (e.g. =94 and 201-250)
2. university_name - name of the university
3. country - country of each university
4. teaching - university score for teaching (the learning environment)
5. international - university score for international outlook (staff, students,
research)
6. research - university score for research (volume, income and reputation)
7. citations - university score for citations (research influence)
8. income - university score for industry income (knowledge transfer)
9. total_score - total score for the university, used to determine rank
10. num_students - number of students at the university
11. student_staff_ratio - number of students divided by number of staff
12. international_students - percentage of students who are international
13. female_male_ratio - female student to male student ratio
14. year - year of the ranking (2011 to 2016 inclusive)

The dataset contains 2603 tuples, some of which have missing values.

Data Mining Techniques Used:


The techniques used in our project are
● Finding the optimal number of clusters using the Elbow method.
● Clustering the data using the K-means and Hierarchical algorithms.
● Visualising the data using various plots.

Code:
First we need to clean the data. The dataset has many missing values and
several columns that are not numeric, so we first convert those columns to
numeric.

Cleaning.R:
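# Read the data, treating empty strings and "-" as missing values
times <- read.csv("C:\\Users\\Eswar\\Desktop\\timesData.csv",
                  na.strings = c("", "-"), encoding = "UTF-8")
# Check for duplicated rows; if there are none, this step can be ignored
i <- duplicated(times)
which(i)
times[i, ]
# Keep only the unique rows that have no missing values
clean.data <- unique(times[complete.cases(times), ])
summary(clean.data)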

#Data cleaning
#rank---useless
times$world_rank
rnk <- as.character(times$world_rank)
unique(rnk)
cbind(rnk, as.numeric(rnk))
#get rid of = and ranges (look up regular expressions!)
rnk <- sub(pattern = "=", "", rnk)
rnk <- sub(pattern = "-.*", "", rnk)
rnk <- as.numeric(rnk)
rnk
summary(rnk)
times$world_rank <- rnk
#Intl
intl <- as.character(times$international)
unique(intl)
intl[intl == '-'] <- NA
unique(intl)
intl <- as.numeric(intl)
summary(intl)
times$international <- intl

#inc
inc <- as.character(times$income)
unique(inc)
inc[inc == '-'] <- NA
unique(inc)
inc <- as.numeric(inc)
summary(inc)
times$income <- inc

#totalscore
totalscore <- as.character(times$total_score)
unique(totalscore)
totalscore[totalscore == '-'] <- NA
unique(totalscore)
totalscore <- as.numeric(totalscore)
summary(totalscore)
times$total_score <- totalscore
#Students
ns <- as.character(times$num_students)
unique(ns)
cbind(ns, as.numeric(ns))
ns <- gsub(pattern = ",", "", ns)  # gsub removes every comma, not only the first
ns <- as.numeric(ns)
summary(ns)
times$num_students <- ns

# international_students
times$international_students
instu <- as.character(times$international_students)
unique(instu)
cbind(instu, as.numeric(instu))
instu <- sub(pattern = "%", "", instu)
instu <- as.numeric(instu)
instu
summary(instu)
times$international_students <-instu

# female_male_ratio--female
fe <- as.character(times$female_male_ratio)
unique(fe)
cbind(fe, as.numeric(fe))
fe <- sub(pattern = "-", "", fe)
fe <- sub(pattern = ":.*", "", fe)
fe <- as.numeric(fe)
fe
summary(fe)

#female_male_ratio--male
ma <- as.character(times$female_male_ratio)
unique(ma)
cbind(ma, as.numeric(ma))
ma <- sub(pattern = "-", "", ma)
ma <- sub(pattern = ".*:", "", ma)
ma <- as.numeric(ma)
ma
summary(ma)

#ratio
ratio<-fe/ma
ratio
times$female_male_ratio <- ratio
summary(ratio)

summary(times)
#Data cleaning end
View(times)
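
The string-to-number conversions used in Cleaning.R can be checked on a few hand-made values before running them on the full file. The sketch below is only an illustration: the input vectors are made up to mimic the raw column formats, and the variable names are not part of the project code.

```r
# Toy values mimicking the raw columns (made up for illustration)
rank_raw <- c("1", "=94", "201-250")
students <- c("2,243", "20,152", NA)
intl_pct <- c("27%", "5%", NA)
fm_ratio <- c("33 : 67", "50 : 50", NA)

# Strip "=" and keep the lower bound of a rank range
rank_num <- as.numeric(sub("-.*", "", sub("=", "", rank_raw)))

# Remove thousands separators and percent signs
students_num <- as.numeric(gsub(",", "", students))
intl_num <- as.numeric(sub("%", "", intl_pct))

# Split "female : male" into its two parts and form the ratio
female <- as.numeric(sub(":.*", "", fm_ratio))
male <- as.numeric(sub(".*:", "", fm_ratio))
fm_num <- female / male

print(rank_num)
print(students_num)
print(intl_num)
print(round(fm_num, 3))
```

Missing entries stay NA through every step, so they can still be dropped later with na.omit.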

The cleaned data still contains missing values. We now remove the rows that
contain NAs.

> nrow(times)
[1] 2603
> times1 <- na.omit(times)
> nrow(times1)
[1] 954
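
Since na.omit drops a row as soon as any one column is missing, it can help to see where the gaps are before removing them. The following is a minimal sketch on a made-up data frame (the values and the name toy are assumptions, not project data):

```r
# Toy data frame with gaps (values are made up)
toy <- data.frame(
  teaching    = c(95.6, NA, 49.2, 30.1),
  income      = c(34.5, 80.0, NA, 28.7),
  total_score = c(96.1, 71.3, NA, 45.0)
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))

# Keep only rows with no missing value in any column
toy_complete <- na.omit(toy)

print(na_per_column)
print(nrow(toy))
print(nrow(toy_complete))
```

In the project data the same step reduces 2603 rows to 954, so a per-column count like this shows which attributes are responsible for most of the loss.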

Finding the optimal number of clusters:


We consider only the universities ranked in the year 2012.
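# NOTE: ctimes is not defined in this report. It is presumably the cleaned data
# (times1) with year recoded so that 2 corresponds to 2012.
data12 <- subset(ctimes, year == 2, select =
    c("country", "teaching", "international", "income", "student_staff_ratio",
      "international_students", "total_score"))
# Standardise the numeric columns before clustering
scaled_data <- scale(data12[sapply(data12, is.numeric)])
set.seed(123)
# Compute and plot wss for k = 1 to k = 15.
k.max <- 15
wss <- sapply(1:k.max,
    function(k){kmeans(scaled_data, k, nstart = 50, iter.max = 15)$tot.withinss})
wss
plot(1:k.max, wss,
    type = "b", pch = 19, frame = FALSE,
    xlab = "Number of clusters K",
    ylab = "Total within-clusters sum of squares")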
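
To make the elbow idea concrete without the ranking file, here is a small self-contained sketch on synthetic data. The three groups, their means and the name toy_scaled are all assumptions chosen for illustration; only base R is used.

```r
set.seed(123)

# Synthetic stand-in for the scaled score columns: three well-separated groups
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1 to k = 8
k_max <- 8
wss <- sapply(1:k_max, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25, iter.max = 30)$tot.withinss
})

print(data.frame(k = 1:k_max, wss = round(wss, 2)))
```

The within-cluster sum of squares falls sharply up to the true number of groups and only slowly afterwards; the k at which the curve bends is taken as the candidate number of clusters.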
The other method used is the NbClust method.
library(NbClust)
nb <- NbClust(scaled_data, diss=NULL, distance = "euclidean",
min.nc=2, max.nc=5, method = "kmeans",
index = "all", alphaBeale = 0.1)

The data is then plotted and grouped into clusters as follows:
data12 <- subset(ctimes, year == 2, select =
    c("country", "teaching", "international", "income", "student_staff_ratio",
      "international_students", "total_score"))
library(ggplot2)
ggplot(data12, aes(teaching, total_score, color = country)) +
    geom_point()
set.seed(20)
Cluster <- kmeans(data12, 3, nstart = 20)
Cluster
Cluster$cluster <- as.factor(Cluster$cluster)
ggplot(data12, aes(teaching, total_score, color = Cluster$cluster)) +
geom_point()
plot(data12, col = Cluster$cluster)
library(cluster)
clusplot(data12, Cluster$cluster, main="2D rep", shade=TRUE,
labels=2,lines=0)

#Plotting international and total score.


library(factoextra)
ggplot(data12, aes(international, total_score , color = country )) +
geom_point()
set.seed(20)
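# Select the two plotted columns by name ([, 2,7] picks only column 2)
Cluster <- kmeans(data12[, c("international", "total_score")], 3, nstart = 20)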
Cluster
Cluster$cluster <- as.factor(Cluster$cluster)
ggplot(data12, aes(international, total_score, color =
Cluster$cluster)) + geom_point()
#Similarly we do it for other available factors.
#Plotting international_students and total score.
library(factoextra)
ggplot(data12, aes(international_students, total_score , color =
country )) + geom_point()
set.seed(20)
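# Select the two plotted columns by name ([, 2,7] picks only column 2)
Cluster <- kmeans(data12[, c("international_students", "total_score")], 3, nstart = 20)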
Cluster
Cluster$cluster <- as.factor(Cluster$cluster)
ggplot(data12, aes(international_students, total_score, color =
Cluster$cluster)) + geom_point()
plot(data12, col = Cluster$cluster)

#Plotting research and total_score


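library(factoextra)
# research is not among the columns selected into data12, so it is selected here
res12 <- subset(ctimes, year == 2, select = c("country", "research", "total_score"))
ggplot(res12, aes(research, total_score, color = country)) +
    geom_point()
set.seed(20)
Cluster <- kmeans(res12[, c("research", "total_score")], 3, nstart = 20)
Cluster
Cluster$cluster <- as.factor(Cluster$cluster)
ggplot(res12, aes(research, total_score, color = Cluster$cluster)) +
    geom_point()
plot(res12, col = Cluster$cluster)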

#Plotting citations and total_score.


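library(factoextra)
# citations is not among the columns selected into data12, so it is selected here
cit12 <- subset(ctimes, year == 2, select = c("country", "citations", "total_score"))
ggplot(cit12, aes(citations, total_score, color = country)) +
    geom_point()
set.seed(20)
Cluster <- kmeans(cit12[, c("citations", "total_score")], 3, nstart = 20)
Cluster
Cluster$cluster <- as.factor(Cluster$cluster)
# The "+" must end the line; a line that starts with "+" is a separate statement
ggplot(cit12, aes(citations, total_score, color = Cluster$cluster)) +
    geom_point()
plot(cit12, col = Cluster$cluster)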
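
The pairwise clustering pattern used above (pick one score and total_score, run k-means, store the labels as a factor) can be tried on synthetic data. This is a minimal sketch: the data frame toy and its three tiers are made up, and the columns are selected by name rather than by position.

```r
set.seed(20)

# Made-up scores for 60 universities in three loose tiers
toy <- data.frame(
  teaching    = c(rnorm(20, 85, 4), rnorm(20, 55, 4), rnorm(20, 30, 4)),
  total_score = c(rnorm(20, 88, 4), rnorm(20, 60, 4), rnorm(20, 40, 4))
)

# Select the two columns by name rather than by position
pair <- toy[, c("teaching", "total_score")]
fit <- kmeans(pair, centers = 3, nstart = 20)

# Store the assignment as a factor column, ready to be used as a plot colour
toy$cluster <- factor(fit$cluster)

print(table(toy$cluster))
print(round(fit$centers, 1))
```

Keeping the labels in the data frame itself, instead of referring to Cluster$cluster from inside the plotting call, keeps the rows and their cluster labels together if the data is later filtered or sorted.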
Hierarchical clustering:
library(factoextra)
# Manhattan distance matrix (computed for reference; hkmeans() below does not use it)
dist.res <- dist(data12, method = "manhattan")
# Hierarchical k-means clustering results
res.hk <- hkmeans(data12, 3)
res.hk
hkmeans_tree(res.hk, cex = 0.6)
# Visualize the dendrogram
fviz_dend(res.hk, cex = 0.6) +
    labs(title = "Hierarchical Clustering")
View(times1)
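
For comparison, plain agglomerative clustering on a Manhattan distance matrix can be done with base R alone. The sketch below is an assumption-laden illustration, not the project's method: the data is synthetic, and average linkage is chosen arbitrarily.

```r
set.seed(1)

# Made-up scaled scores for 12 universities in two loose groups
toy <- rbind(
  matrix(rnorm(12, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(12, mean = 3, sd = 0.3), ncol = 2)
)
rownames(toy) <- paste0("U", 1:12)

# Pairwise Manhattan distances, then agglomerative clustering
d <- dist(toy, method = "manhattan")
hc <- hclust(d, method = "average")

# Cut the tree into two groups
groups <- cutree(hc, k = 2)
print(groups)
```

Unlike hkmeans, this route actually uses the distance matrix, so the choice of distance method affects the resulting tree.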
Visualisations of top countries year-wise:
[Figures for 2012, 2013, 2014, 2015 and 2016 are not reproduced here.]

Literature Survey:
Paper-1:
Is measuring the knowledge creation of universities possible?: A review of
university rankings by Gokcen Arkali Olcay, Melih Bulu
Abstract:
University ranking indexes are considered very useful benchmarking tools in comparing the performance
of universities around the world. Being placed in these prestigious indexes provides a strong
advertisement for a university and helps them to attract high-quality students and academicians all over
the world. This study aims to explore the leading global university rankings to determine the similarities
and differences in terms of their ranking criteria, main indicators, modeling choices, and the effects of
these on the rankings. Designating the Times Higher Education World Rankings as the base ranking, a
comprehensive comparison of the positions of the top universities of the base index with the matched
positions of the same universities under other leading indexes including ARWU, QS, Leiden, and URAP
is given. Correlations highlight the significant differences among some indexes even in measuring the
same criterion such as teaching or research. The synopsis of the leading university rankings reveals the
variability in the actual places of the best universities across different indexes. Some top universities in
one leading index do not even take a position on the list of another leading index. The variability in the
actual lists is partially due to the variety and weighting of the indicators used.
Link: https://www.sciencedirect.com/science/article/pii/S004016251630021X

Paper-2:
Integrated assessment and ranking of universities by fuzzy inference by
Misir Mardanov, Ramin Rzayev, Zenal Jamalov, Alla Khudatova
Abstract:
For the evaluation and the subsequent ranking of universities for the quality of educational services
provided by them it is proposed two approaches: statistical based on weighted estimates of key indicators
of universities, and verbal based on the application of fuzzy inference engine. By applying these methods
to the evaluation of five randomly selected universities (hypothetical alternatives), it is obtained their
aggregate estimations (ratings) and corresponding two ways of ranking. Statistical and fuzzy inference
based verbal approaches are considered for assessing the ES quality of universities. Based on the results of
realization of these approaches appropriate ratings and corresponding rankings were obtained for five
universities characterized by arbitrarily chosen data.
Link: https://www.sciencedirect.com/science/article/pii/S1877050917324432

Paper-3:
Do the technical universities exhibit distinct behaviour in global university
rankings? A Times Higher Education (THE) case study by
Carmen Perez-Esparrells, Enrique Orduna-Malea
Abstract:
In this study a first-world technical university ranking composed of 137 universities belonging to
the THE ranking was presented. Among these institutions, we can discover the national
technological flagships in their home countries that bring new advances for scientific and
technology transfers and the majority of commercialisation of university research results. The
results demonstrate a distinct statistical behaviour of TUs, characterised by moderate-to-high
scores in Industry Income and a low performance in Research scores if compared with the
remaining non-Technical universities of the Top-800 sample in THE ranking.
Given the weights applied to each of the dimensions (30% Research; 30% Citations; 30%
Teaching; 7.5% International Outlook; and 2.5% Industry Income), the original ranking
manifestly diminishes the potential of TUs’ performances. When simulating alternative scenarios
(especially the “soft” and “strong” scenarios), in which the Industry Income weight is slightly
increased (because of decreasing Research and Citations), the results show an overall increase in
scores of the majority of TUs. A total of 83.2% of TUs (114) increase their overall scores, some
of them significantly (17 universities would benefit from an increase of more than five points in
their final overall score within the strong scenario).
Link: https://www.sciencedirect.com/science/article/pii/S2212567115008382
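
The effect of the weights quoted in this abstract can be illustrated with a short calculation. In the sketch below only the five weights come from the abstract; the pillar scores are made up, and the alternative weighting is a hypothetical example rather than the paper's "soft" or "strong" scenario.

```r
# Weights as quoted in the abstract (percent)
weights <- c(research = 30, citations = 30, teaching = 30,
             international = 7.5, income = 2.5)

# Made-up pillar scores for one technical university
scores <- c(research = 40, citations = 60, teaching = 45,
            international = 55, income = 95)

# Overall score as the weighted mean of the pillar scores
overall <- sum(weights * scores[names(weights)]) / sum(weights)

# Hypothetical alternative: move 5 points each from research and citations to income
alt <- weights
alt[c("research", "citations")] <- alt[c("research", "citations")] - 5
alt["income"] <- alt["income"] + 10
overall_alt <- sum(alt * scores[names(alt)]) / sum(alt)

print(overall)
print(overall_alt)
```

With these made-up scores the overall score is 50 under the quoted weights and 54.5 under the hypothetical ones, which shows how a university with a high Industry Income score gains when that weight is raised.
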
Paper-4:
Higher education, high-impact research, and world university rankings: A case of
India and comparison with China published by K.S. Reddy, En Xie, Qingqing
Tang.
Abstract:
The aim of this paper was to examine the state of the higher education system, high-impact
research metrics, and world university rankings in India. Nested within the exploratory research
framework, we collected relevant data from archival sources and accomplished our goals based
on inductive and deductive logic. First, an overview of higher education and government
schemes for academic research was presented. Second, a theoretical note on the academic
scholarship and the determinants of high-impact research was described, as was the progress of
research metrics for three categories (all subjects, BMA and EEF), and the most recent world
university rankings were reported. In particular, the indicators of high-impact management
research, business school accreditation and rankings, and abstracting and indexing of publishing
journals were deeply discussed. Third, we outlined various potential challenges in Indian
university education and suggested fruitful policy guidelines for improving accessible practices
and university performance.
Link: https://www.sciencedirect.com/science/article/pii/S2405883116300478

Paper-5:
Rankings and university performance: A conditional multidimensional approach by
Cinzia Daraio, Andrea Bonaccorsi, Léopold Simar
Abstract:
In conclusion, we argue that a geographical analysis of world university rankings that considers
different rankings and scrutinizes the ranking data on a variety of scales, such as tiers of
institutions, cities and countries, adds three important dimensions to interdisciplinary debates
about university league tables. First, it illustrates the partiality of this discourse through its focus
on one segment of global higher education dominated by Anglo-American research practices in
the natural and technical sciences. Second, it outlines the even more specific perspectives of
different rankings on these partial representations. In our view, this further undermines the
authority that public discourse tends to grant world university rankings and confirms that any
representations of academic performance provide necessarily limited accounts of material and
reputational geographies. Finally, our comparative, geographical and disaggregating analysis has
revealed wider structures and dynamics within the dominant sphere of global higher education,
but it has also stressed that other measures and subject-specific perspectives would produce very
different geographies.
Link: https://www.sciencedirect.com/science/article/abs/pii/S0377221715000843

Conclusion:
In our project, we analysed the Times World University Rankings and clustered
the 2012 rankings in detail. The optimal number of clusters that we found was
either 2 or 3. We clustered the attributes using the chosen number of clusters
and sorted them into groups. The quality of a clustering can be judged by its
between_SS / total_SS value. We have also identified the year-wise progress of
some universities using plots.
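
The between_SS / total_SS value printed by kmeans can also be computed directly from the fitted object. The following is a minimal sketch on synthetic data (the matrix toy and its three groups are assumptions for illustration):

```r
set.seed(20)

# Made-up two-column data with three loose groups
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

fit <- kmeans(toy, centers = 3, nstart = 20)

# Share of the total variation that is explained by the cluster split
ratio <- fit$betweenss / fit$totss
print(round(100 * ratio, 1))

# The total splits into a between-cluster and a within-cluster part
print(all.equal(fit$totss, fit$betweenss + fit$tot.withinss))
```

A ratio close to 1 means that most of the variation lies between clusters rather than inside them. The ratio always rises as k grows, so it is best compared across clusterings with the same number of clusters.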
