R Project
R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2). The associated Bioconductor project provides many additional R packages for statistical data analysis in the life sciences, such as tools for microarray, next-generation sequencing and genome analysis. R runs on all common operating systems (2-4). Bioconductor also facilitates the inclusion of biological metadata from literature sources such as PubMed, and provides access to powerful statistical and graphical methods.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/
Install R
1- Install the latest release of R following the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the > prompt (shown in red) is where you type commands.
4- Any text beginning with the hash symbol # is a comment and is ignored by R.
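As a quick check that the console works, the lines below can be typed at the > prompt. This is a minimal sketch; the variable names radius and area are made up for illustration.

```r
# This whole line is a comment and is ignored by R
radius <- 2              # text after the hash symbol is ignored too
area <- pi * radius^2    # pi is a built-in constant in R
print(area)              # prints the computed area in the console
```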
References
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html
Install packages in R
1- To connect to Bioconductor and install packages, type the following in the R console:
source("http://bioconductor.org/biocLite.R")
2- To request installation, type:
biocLite()
3- Install the packages "RISmed" and "tm" by typing:
biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing:
biocLite("ggplot2")
Package RISmed downloads content from NCBI databases.
Package tm provides text mining functionality.
Package ggplot2 is for data visualization.
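On recent R releases the biocLite script is, as far as I know, no longer supported by Bioconductor. Since all three packages also have CRAN pages, a possible alternative is to check what is missing and install it with install.packages. This is only a sketch: missing_packages is a hypothetical helper written for this example, and the install line is left commented out so nothing is installed by accident.

```r
# Packages used in this tutorial
needed <- c("RISmed", "tm", "ggplot2")

# Hypothetical helper: returns the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to.install <- missing_packages(needed)
if (length(to.install) > 0) {
  message("Not installed yet: ", paste(to.install, collapse = ", "))
  # install.packages(to.install)  # uncomment to install from CRAN
} else {
  message("All packages are already installed")
}
```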
References
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html
The R Console
Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file = "title_oncolytic_virus.txt")
# export both title and mesh results as a text file to view as a table with Excel
# mh.list (the mesh terms for each article) must already exist; the step that creates it is not shown here
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10, ] # view first 10 results
write.table(tit.mh, file = "tit_mesh_oncolytic_virus.txt", row.names = FALSE, sep = "\t")
# open the file in Excel
Type getwd() in the R console to display the R working directory. In my case:
[1] "C:/Documents and Settings/PMarqui/My Documents"
Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus). Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
Start the Text Mining Analysis
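The folder can also be created from the console instead of the file manager. The sketch below assumes a temporary folder in place of the real working directory and writes two tiny stand-in files with made-up content; in practice base.dir would be getwd() and the two files would be the ones exported earlier.

```r
# Assumption: a temporary folder stands in for the working directory (getwd())
base.dir <- tempdir()

# Create the corpus folder; showWarnings = FALSE keeps it quiet if it already exists
corpus.dir <- file.path(base.dir, "OncolyticVirus")
dir.create(corpus.dir, showWarnings = FALSE)

# Stand-in files with made-up content, named like the files in this tutorial
writeLines(c("Oncolytic virus therapy", "Virus replication in tumors"),
           file.path(corpus.dir, "title_oncolytic_virus.txt"))
writeLines(c("Oncolytic Viruses", "Neoplasms"),
           file.path(corpus.dir, "mesh_oncolytic_virus.txt"))

# List the folder content to confirm both files are in place
list.files(corpus.dir)
```

With real files, file.copy("title_oncolytic_virus.txt", corpus.dir) would move a copy of an exported file into the folder.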
# Type the following in the R Console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl = list(reader = readPlain))
# Note that "OncolyticVirus" is the name of the newly created folder. In DirSource("...") you must use the name given to the folder containing the 2 text files.
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern = "[^[:alnum:][:space:]]", replacement = " ") # replaces punctuation with a space; as written, this pattern also replaces the dash "-"
my.corpus <- tm_map(my.corpus, tolower) # conversion to lower case letters
# Note: in tm 0.6 and later, base functions such as gsub and tolower need to be wrapped in content_transformer(), e.g. tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get their common origin
Document matrix
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix, displaying all the terms found in either of the 2 documents
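To see what the cleaning steps and the term-document matrix amount to, here is a rough base R sketch on two made-up one-line documents. It is not how tm works internally, and it skips stopword removal and stemming; the names docs, clean, tokens, count_terms and tdm are chosen for this example only.

```r
# Two made-up documents standing in for the title and mesh text files
docs <- c(title = "Oncolytic viruses: a new class of anti-cancer agents!",
          mesh  = "Oncolytic Viruses; Neoplasms; Oncolytic Virotherapy")

clean <- tolower(docs)                              # lower case
clean <- gsub("[^[:alnum:][:space:]]", " ", clean)  # punctuation (dash included) becomes a space
clean <- gsub("[[:space:]]+", " ", clean)           # collapse runs of whitespace
clean <- trimws(clean)                              # drop leading and trailing spaces

# Split each document into words; the list keeps the names "title" and "mesh"
tokens <- strsplit(clean, " ", fixed = TRUE)

# Count every term in every document: rows are terms, columns are documents
all.terms <- sort(unique(unlist(tokens)))
count_terms <- function(tok) as.integer(table(factor(tok, levels = all.terms)))
tdm <- vapply(tokens, count_terms, integer(length(all.terms)))
rownames(tdm) <- all.terms

tdm.df <- as.data.frame(tdm)   # same shape as my.corpus.df: one column per document
tdm.df
```

Terms that occur in only one document get a count of 0 in the other column, which is what the later subset steps rely on.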
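# The step that sorts my.corpus.df by mesh term frequency is not shown here; by analogy with Part 2 it is assumed to be:
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE), ]
xx <- my.corpus.df[1:50, ] # assign the 50 most frequent mesh terms to xx, to keep the original data frame
# view the top 5 most frequent mesh terms; head(xx, 5) is equivalent
xx[1:5, ]
# sort the 50 most frequent mesh terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE), ]
library(ggplot2) # needed for ggplot()
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), colour = "darkblue") + coord_flip() + theme_bw() # colour, not fill, sets the colour of the default point shape
p1 <- last_plot() + scale_x_discrete(limits = Terms)
p1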
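The sort, cut and re-sort pattern used before plotting can be tried on a small made-up table first. The sketch below uses invented counts and a single column named mesh; drop = FALSE is needed here only because a one-column data frame would otherwise collapse to a plain vector.

```r
# Made-up term counts; row names are the terms
counts <- data.frame(mesh = c(2, 9, 5, 7),
                     row.names = c("cell", "virus", "tumor", "therapi"))

# 1. sort in decreasing order so the most frequent terms come first
sorted <- counts[order(counts$mesh, decreasing = TRUE), , drop = FALSE]

# 2. keep the top 3 terms (the tutorial keeps 50)
top3 <- head(sorted, 3)

# 3. re-sort in increasing order, so the largest count ends up at the top
#    of the axis after coord_flip()
top3 <- top3[order(top3$mesh), , drop = FALSE]
rownames(top3)
```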
Part 2
# Continue by typing the following code in the R Console.
# Now select the most frequent title terms: sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
xy[1:5, ] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
# Plot the 50 most frequent title terms
require(ggplot2)
# The plotting call itself is not shown here; the three lines below mirror Part 1 and are an assumption
Terms <- rownames(xy)
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count)) + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits = Terms)
p2
my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt > 0) # subset common terms in the 2 documents
my.corpus.sub1.df[200:300, 1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt == 0 & title_oncolytic_virus.txt > 0) # terms present in title but not in mesh
my.corpus.sub2.df[200:300, 1:2] # view some of the terms (200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt == 0) # terms present in mesh but not in title
my.corpus.sub3.df[200:300, 1:2] # view some of the terms
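The three subsets are easier to follow on a tiny made-up table. In this sketch the columns are simply called mesh and title, and the counts are invented.

```r
# Made-up counts for four terms
toy <- data.frame(mesh  = c(3, 0, 2, 0),
                  title = c(1, 4, 0, 0),
                  row.names = c("virus", "therapi", "neoplasm", "mice"))

both       <- subset(toy, mesh > 0 & title > 0)    # terms found in both documents
title.only <- subset(toy, mesh == 0 & title > 0)   # in title but not in mesh
mesh.only  <- subset(toy, mesh > 0 & title == 0)   # in mesh but not in title

rownames(both)
rownames(title.only)
rownames(mesh.only)
```

When a subset has fewer rows than the range asked for (for example 200:300), indexing returns rows filled with NA; head(both, 10) or both[seq_len(min(10, nrow(both))), ] avoids that.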
# Correlate terms in title and mesh
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518
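By default cor() computes the Pearson correlation coefficient. Term counts are often skewed, with many zeros and a few large values, so it may be worth comparing with the rank-based Spearman coefficient. The counts below are made up for illustration.

```r
# Made-up counts for five terms in the two documents
title.count <- c(5, 3, 0, 2, 0)
mesh.count  <- c(4, 0, 1, 2, 0)

r.pearson  <- cor(title.count, mesh.count)                       # default method = "pearson"
r.spearman <- cor(title.count, mesh.count, method = "spearman")  # based on ranks

r.pearson
r.spearman
```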
For part 2
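# Plot the 50 most frequent title terms together with the mesh counts of those same terms
Terms <- rownames(xy) # the three vectors below must refer to the rows of xy
Mesh.count <- xy$mesh_oncolytic_virus.txt
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))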
plot 3: most frequent title terms with the corresponding mesh terms
# most frequent mesh terms
top50.mh.ti <- rbind(xx, xy) # combine top 50 mesh and title terms
Terms <- rownames(top50.mh.ti) # assign rownames to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4
# plot 5: most frequent title terms and most frequent mesh terms
library("reshape2")
# The reshape2 package is used to transform wide format data by means of the melt function. The melt function takes data in wide format and stacks a set of columns into a single column of data.
# melt needs columns named Term, title and msh, so build them from the vectors Terms, titl and msh
top50.df <- data.frame(Term = Terms, title = titl, msh = msh)
top50.melt <- melt(top50.df, id.vars = "Term", measure.vars = c("title", "msh"))
top50.melt
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5 <- p + scale_x_discrete(limits = unique(as.character(top50.melt$Term)))
p5
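What melt does to a wide table can be sketched in base R without reshape2. This is an illustration of the wide-to-long idea on made-up counts, not the reshape2 implementation; the column names variable and value are used here only to match the names seen in the plotting code.

```r
# Made-up wide table: one row per term, one column per document
wide <- data.frame(Term  = c("virus", "tumor"),
                   title = c(5, 3),
                   msh   = c(4, 0),
                   stringsAsFactors = FALSE)

# Long format: the title and msh columns are stacked into a single value column,
# and a variable column records which column each value came from
long <- data.frame(Term     = rep(wide$Term, times = 2),
                   variable = rep(c("title", "msh"), each = nrow(wide)),
                   value    = c(wide$title, wide$msh),
                   stringsAsFactors = FALSE)
long
```

Each term now appears once per measured column, which is the shape ggplot needs to colour points by variable.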
Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.