Search PubMed with R: Part 1 and Part 2

R Project

R is a free software environment for statistical computing, data manipulation, calculation, and graphical display (1,2). The associated Bioconductor project provides many additional R packages for statistical data analysis in the life sciences, such as tools for microarray, next-generation sequencing, and genome analysis. R runs on all common operating systems (2-4). Together, R and Bioconductor give access to powerful statistical and graphical methods and facilitate the inclusion of biological metadata from literature sources such as PubMed.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/
Install R
1- Install the latest release of R following the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the red > prompt is where you type commands.
4- Any text that follows the hash symbol (#) on a line is treated as a comment and ignored by R.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html
Install packages in R
1- In the R console, type the following to connect to Bioconductor:
source("http://bioconductor.org/biocLite.R")
2- To request installation, type:
biocLite()
3- Install the packages "RISmed" and "tm" by typing:
biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing:
biocLite("ggplot2")
Note: current Bioconductor releases have replaced biocLite with the BiocManager package. Because RISmed, tm, and ggplot2 are all distributed on CRAN, install.packages(c("RISmed", "tm", "ggplot2")) also installs them.

Package RISmed downloads content from NCBI databases.
Package tm provides text mining functionality.
Package ggplot2 is for data visualization.
References:
1- Bioconductor: http://www.bioconductor.org/
2- RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
3- tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
4- ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book, also http://cran.r-project.org/web/packages/ggplot2/index.html
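On a current R installation, the three packages can also be installed directly from CRAN. The following is a minimal sketch rather than part of the original slides: the helper name missing_pkgs is hypothetical, and the install step is guarded with interactive() so that nothing is downloaded unless the code is typed at the console.

```r
# Packages used in this tutorial (all three are distributed on CRAN)
needed <- c("RISmed", "tm", "ggplot2")

# Hypothetical helper: return the packages that are not yet installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(needed)

# Install only what is missing; skipped when R runs non-interactively
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Checking first avoids reinstalling packages that are already present.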
The R Console
Query PubMed titles for oncolytic virus using RISmed

# Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic]
# Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file = "title_oncolytic_virus.txt")
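EUtilsSummary sends the search to the NCBI E-utilities service. As an illustration only, the sketch below shows how a query string such as "oncolytic virus[Majr]" could be encoded into a request URL. It assumes the standard esearch endpoint; RISmed builds its own request internally, and the helper name esearch_url is hypothetical.

```r
# Hypothetical helper: build an E-utilities esearch URL for a PubMed query
esearch_url <- function(query, db = "pubmed", retmax = 1000) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # Percent-encode spaces, brackets, and other reserved characters
  term <- utils::URLencode(query, reserved = TRUE)
  paste0(base, "?db=", db, "&term=", term, "&retmax=", as.integer(retmax))
}

url <- esearch_url("oncolytic virus[Majr]")
print(url)
```

The [Majr] field tag restricts the search to records where the term is a MeSH major topic, which is why the query printed by RISmed reads "oncolytic viruses"[MeSH Major Topic].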
Query PubMed MeSH topics for oncolytic virus using RISmed

# Continue to type the following in the R console:
mh <- Mesh(fetch.onc)
# for each record, join its MeSH headings into one string separated by ";"
mh.per.row <- lapply(seq_along(mh), function(i) {
  mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
  paste(mh.df.rbind$Heading, collapse = ";")
})
mh.list <- unlist(mh.per.row)
# export MeSH results as text file
write(mh.list, file = "mesh_oncolytic_virus.txt")
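The collapsing step above can be tried without a PubMed connection. This is a sketch on made-up data: mh.toy is a hypothetical stand-in for the output of Mesh(fetch.onc), assumed here to be a list with one data frame per record (with a Heading column) and NA for records without MeSH terms.

```r
# Toy stand-in for Mesh(fetch.onc); the values are invented for illustration
mh.toy <- list(
  data.frame(Heading = c("Oncolytic Viruses", "Neoplasms"),
             Type = c("Descriptor", "Descriptor")),
  NA,
  data.frame(Heading = "Oncolytic Virotherapy", Type = "Descriptor")
)

# Join the headings of one record with ";"; records without MeSH terms give ""
collapse_headings <- function(rec) {
  if (!is.data.frame(rec)) return("")
  paste(rec$Heading, collapse = ";")
}

mh.toy.list <- vapply(mh.toy, collapse_headings, character(1))
print(mh.toy.list)
```

Keeping one string per record, including empty ones, keeps the MeSH vector the same length as the title vector, which matters when the two are bound together as columns.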
View results in Excel

# export both title and MeSH results as a tab-separated text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10, ] # view first 10 results
write.table(tit.mh, file = "tit_mesh_oncolytic_virus.txt", row.names = FALSE, sep = "\t")
# open the file in Excel
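To check what the exported file looks like before opening it in Excel, the table can be read back into R. A minimal sketch with invented titles and MeSH strings, written to a temporary file rather than the working directory:

```r
# Invented example data standing in for onc.tit and mh.list
titles <- c("Title A", "Title B")
mesh <- c("Term 1;Term 2", "Term 3")
toy <- cbind(title = titles, mesh = mesh)

# Write a tab-separated file, then read it back
out.file <- tempfile(fileext = ".txt")
write.table(toy, file = out.file, row.names = FALSE, sep = "\t")
back <- read.delim(out.file, stringsAsFactors = FALSE)
print(back)
```

Because the MeSH terms are joined with ";" and the columns are separated by tabs, each record stays on one row with exactly two columns.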
[Figure: the exported table opened in Excel, with one column containing titles and one column containing the corresponding MeSH terms]
Preparing for Text Mining Analysis
1- Type getwd() in the R console to display the R working directory. In my case:
[1] "C:/Documents and Settings/PMarqui/My Documents"
2- Create a new folder in the R working directory and give it a name (for example, OncolyticVirus).
3- Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt.
4- Start the text mining analysis.
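The folder can also be created from within R instead of the file manager. The sketch below uses a temporary directory as a stand-in for the working directory and writes placeholder files so that it runs on its own; in the tutorial you would use getwd() and the two files exported earlier.

```r
# Stand-in for the R working directory (use getwd() in a real session)
work.dir <- tempdir()

# Placeholder versions of the two exported files
files <- file.path(work.dir, c("title_oncolytic_virus.txt", "mesh_oncolytic_virus.txt"))
writeLines("placeholder title", files[1])
writeLines("placeholder mesh", files[2])

# Create the corpus folder and copy both files into it
corpus.dir <- file.path(work.dir, "OncolyticVirus")
dir.create(corpus.dir, showWarnings = FALSE)
copied <- file.copy(files, corpus.dir, overwrite = TRUE)
print(list.files(corpus.dir))
```

The folder should contain only these two files, because every file in it becomes one document of the corpus.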
Text Mining Analysis

# Type the following in the R console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl = list(reader = readPlain))
# Note that "OncolyticVirus" is the name of the newly created folder. In DirSource("...") you must use the name given to the folder containing the 2 text files.
# With tm version 0.6 or later, wrap base R functions such as gsub and tolower in content_transformer(), e.g. tm_map(my.corpus, content_transformer(tolower)).
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern = "[^[:alnum:][:space:]]", replacement = " ") # replaces every character that is not a letter, digit, or whitespace with a space; this includes the dash, so use the pattern "[^[:alnum:][:space:]-]" to keep "-"
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)

# Continue and type the following code in the R console:
my.corpus <- tm_map(my.corpus, tolower) # converts to lower case letters
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get their common origin
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix, displaying all the terms in any of the 2 documents
my.corpus.df[200:250, 1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of my.corpus.df to keep the original data frame for later
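To see what the cleaning steps and the term-document matrix do, the same idea can be reproduced in base R on two short strings. This is a simplified sketch, not the tm implementation: the texts are invented, toy.stopwords is a tiny stand-in for stopwords("english"), and stemming is left out.

```r
# Two invented documents standing in for the MeSH and title files
docs <- c(
  mesh  = "Oncolytic Viruses; Neoplasms; Oncolytic Virotherapy",
  title = "Oncolytic viruses for the treatment of neoplasms"
)
toy.stopwords <- c("for", "the", "of") # assumption: tiny stand-in stopword list

# Clean one text and return its terms
clean_terms <- function(text) {
  text <- gsub("[^[:alnum:][:space:]]", " ", text) # same pattern as in the tutorial
  text <- tolower(text)
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% toy.stopwords]
}

term.list <- lapply(docs, clean_terms)
all.terms <- sort(unique(unlist(term.list)))

# Count each term per document: rows are terms, columns are documents
toy.tdm <- vapply(
  term.list,
  function(w) as.integer(table(factor(w, levels = all.terms))),
  integer(length(all.terms))
)
rownames(toy.tdm) <- all.terms
print(toy.tdm)
```

Each row is a term and each column a document, which is the layout that my.corpus.df has after the conversion to a data frame.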

# sort the data frame by the most frequent MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE), ]
# assign the 50 most frequent MeSH terms to xx
xx <- my.corpus.df[1:50, ]
# view the top 5 most frequent MeSH terms; head(xx, 5) is equivalent
xx[1:5, ]
# sort the 50 most frequent MeSH terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE), ]

# Plot the 50 most frequent MeSH terms using the ggplot2 library
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p1 <- last_plot() + scale_x_discrete(limits = Terms)
p1

View the 50 most frequent MeSH terms
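The sort, select, and re-sort pattern used for the plot can be wrapped in a small function. A sketch on an invented data frame: the function name top_terms and the column names are hypothetical, and the counts are made up.

```r
# Invented term counts; row names are the terms
toy.df <- data.frame(
  mesh_count  = c(12, 3, 40, 7, 25),
  title_count = c(5, 9, 31, 0, 2),
  row.names   = c("cell", "gene", "virus", "mice", "tumor")
)

# Hypothetical helper: keep the n most frequent terms for one column,
# returned in increasing order so the largest ends up at the top of a flipped plot
top_terms <- function(df, column, n) {
  sorted <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  top <- head(sorted, n)
  top[order(top[[column]]), , drop = FALSE]
}

top3 <- top_terms(toy.df, "mesh_count", 3)
print(top3)
```

Passing the row names of the result to scale_x_discrete(limits = ...) is what makes ggplot2 keep this order instead of sorting the terms alphabetically.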
Part 2

# Continue and type the following code in the R console.
# Now select the most frequent title terms: sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
xy[1:5, ] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
# Plot the 50 most frequent title terms
require(ggplot2)
Terms <- rownames(xy) # terms must come from xy so that they match Title.count
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits = Terms)
p2

View the 50 most frequent title terms

# Continue and type the following code in the R console: create separate data frames for each frequency type

my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt > 0) # terms common to the 2 documents
my.corpus.sub1.df[200:300, 1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt == 0 & title_oncolytic_virus.txt > 0) # terms present in title but not in MeSH
my.corpus.sub2.df[200:300, 1:2] # view some of the terms (200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt == 0) # terms present in MeSH but not in title
my.corpus.sub3.df[200:300, 1:2] # view some of the terms
# correlate term counts in title and MeSH
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518

# the following generates a term frequency vector from the term-document matrix
termFrequency <- rowSums(as.matrix(my.corpus.matrix))
my.tdm <- TermDocumentMatrix(my.corpus, control = list(minWordLength = 1)) # in current tm versions the equivalent control option is wordLengths = c(1, Inf)
my.tdm
# A term-document matrix (2632 terms, 2 documents)
# the following selects the terms in each document that occur at least 100 times
findFreqTerms(my.tdm[, 1], lowfreq = 100)
findFreqTerms(my.tdm[, 2], lowfreq = 100)
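The three subsets, the correlation, and the frequency cut-off can be tried on a small invented table. This is a base R sketch under assumed data: the column names mesh and title and all counts are made up, and the threshold line only mimics what findFreqTerms does with lowfreq.

```r
# Invented counts per term for the two documents
toy.df <- data.frame(
  mesh  = c(120, 0, 15, 40),
  title = c(150, 30, 0, 110),
  row.names = c("virus", "novel", "human", "cell")
)

common     <- subset(toy.df, mesh > 0 & title > 0)   # in both documents
title.only <- subset(toy.df, mesh == 0 & title > 0)  # in titles only
mesh.only  <- subset(toy.df, mesh > 0 & title == 0)  # in MeSH only

# Pearson correlation between the two count columns
r <- cor(toy.df$title, toy.df$mesh)

# Terms occurring at least 100 times in the title column
frequent.title <- rownames(toy.df)[toy.df$title >= 100]

print(rownames(common))
print(frequent.title)
```

Splitting the terms this way shows which vocabulary is shared between titles and indexing and which terms are specific to one of them.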
For part 2

# Code for plot 3: most frequent title terms with the corresponding MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
# Plot the 50 most frequent title terms and the MeSH counts of those same terms
Terms <- rownames(xy) # terms must come from xy so that they match the counts
Title.count <- xy$title_oncolytic_virus.txt
Mesh.count <- xy$mesh_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))
p3 <- last_plot() + coord_flip() + scale_x_discrete(limits = Terms)
p3
Plot 3: most frequent title terms with the corresponding MeSH terms

# Code for plot 4: most frequent title terms and most frequent MeSH terms
top50.mh.ti <- rbind(xx, xy) # combine top 50 MeSH and title terms
Terms <- rownames(top50.mh.ti) # assign row names to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4

Plot 4: most frequent title terms and most frequent MeSH terms
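When a term is among the top terms of both tables, rbind keeps both rows and alters the repeated row name to keep row names unique, so the term is plotted twice. The sketch below shows this on invented data, together with one possible alternative (an assumption, not the method used in the slides) that selects each term once from the full table.

```r
# Invented top-term tables that share the term "virus"
xx.toy <- data.frame(mesh = c(40, 25), title = c(31, 2), row.names = c("virus", "tumor"))
xy.toy <- data.frame(mesh = c(40, 3), title = c(31, 9), row.names = c("virus", "gene"))

# Stacking keeps four rows; the repeated row name is made unique by rbind
stacked <- rbind(xx.toy, xy.toy)
print(rownames(stacked))

# Alternative: take the union of the terms and index the full table once
all.toy <- data.frame(mesh = c(40, 25, 3), title = c(31, 2, 9),
                      row.names = c("virus", "tumor", "gene"))
keep <- union(rownames(xx.toy), rownames(xy.toy))
deduped <- all.toy[keep, ]
print(deduped)
```

In the tutorial the full table corresponds to my.corpus.df, whose row names are the terms.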

my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
xy[1:5, ] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
top50.mh.ti <- rbind(xx, xy) # combine top 50 MeSH and title terms
top50.mh.ti$Term <- rownames(top50.mh.ti) # store the terms in a column
rownames(top50.mh.ti) <- NULL # drop the row names now that the terms are in the Term column
colnames(top50.mh.ti)[1] <- "msh" # change column name
colnames(top50.mh.ti)[2] <- "title" # change column name
# plot 5: most frequent title terms and most frequent MeSH terms
library("reshape2")
# reshape2 is used to transform wide-format data by means of the melt function. The melt function takes data in wide format and stacks a set of columns into a single column of data.

top50.melt <- melt(top50.mh.ti, measure.vars = c("title", "msh"))
top50.melt
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5 <- p + scale_x_discrete(limits = unique(top50.melt$Term)) # each term appears twice after melt, so keep unique values
p5
Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
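What melt does to the table can be illustrated in base R. This sketch does not call reshape2; it builds the long format by hand from an invented two-row table, using the column names Term, variable, and value that melt produces for this kind of input.

```r
# Invented wide-format table: one row per term, one column per count
wide <- data.frame(
  msh   = c(40, 25),
  title = c(31, 2),
  Term  = c("virus", "tumor"),
  stringsAsFactors = FALSE
)

# Long format: one row per term and measured column
long <- data.frame(
  Term     = rep(wide$Term, times = 2),
  variable = rep(c("title", "msh"), each = nrow(wide)),
  value    = c(wide$title, wide$msh),
  stringsAsFactors = FALSE
)
print(long)
```

The long format lets ggplot2 map the variable column to colour, so both counts are drawn by a single geom_point layer.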

# plot 5: most frequent title terms and most frequent MeSH terms
p5 <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5
Plot 5: most frequent title terms and most frequent MeSH terms
