
Search PubMed with R: Part 1 and Part 2

R Project

R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2). For those interested, the associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next-generation sequence and genome analysis. The R software is free and runs on all common operating systems (2-4). Bioconductor facilitates the inclusion of biological metadata from literature databases such as PubMed and provides access to powerful statistical and graphical methods.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/

Install R
1- Install the latest release of R according to the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the red > prompt is where you type commands.
4- Any text in R that follows the hash symbol # is treated as a comment and ignored.
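As a quick check that the console works, the prompt and comment rules above can be tried with a few lines. This is a minimal sketch; the variable names radius and area are illustrative and do not come from the original slides.

```r
# This whole line is a comment and is ignored by R
radius <- 2                # assignment; the trailing comment is ignored too
area <- pi * radius^2      # pi is a built-in constant
print(round(area, 2))      # prints [1] 12.57
```

Typing a variable name on its own (for example area) also prints its value at the > prompt.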

References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html

Install packages in R
1- Type the following in the R console to connect to Bioconductor: source("http://bioconductor.org/biocLite.R")
2- Request installation of the packages by typing: biocLite()
3- Install the packages "RISmed" and "tm" by typing (see next slide): biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing: biocLite("ggplot2")
Package RISmed downloads content from NCBI databases.
Package tm provides text mining functionality.
Package ggplot2 is for data visualization.
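The biocLite installer has since been retired by the Bioconductor project in favour of the BiocManager package. Because the references list CRAN as the home of all three packages, a plain install.packages() call is a reasonable alternative. The sketch below assumes the packages are still available on CRAN; missing_packages is a hypothetical helper written for this example, not part of any package.

```r
# Packages used in this tutorial (assumed to be available on CRAN)
needed <- c("RISmed", "tm", "ggplot2")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Install only in an interactive session, so that sourcing this
# script never triggers a download by accident
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking first avoids reinstalling packages that are already present.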
References:
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html

The R Console

Query PubMed titles for oncolytic virus using RISmed


Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file = "title_oncolytic_virus.txt")

Query PubMed MeSH topics for oncolytic virus using RISmed


# Continue to type the following in the R console:
mh <- Mesh(fetch.onc)
# collapse the MeSH headings of each article into one ";"-separated string
mh.per.row <- lapply(1:length(mh), function(i) {
  mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
  paste(mh.df.rbind$Heading, collapse = ";")
})
mh.list <- unlist(mh.per.row)
# export the MeSH results as a text file
write(mh.list, file = "mesh_oncolytic_virus.txt")
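To see what the collapsing step produces without querying PubMed, the same idea can be tried on toy data. This is a sketch under assumptions: the list mesh_toy only imitates the shape of the Mesh() result (one data frame with a Heading column per article), the NA entry for an article without MeSH terms is assumed, and collapse_headings is a hypothetical helper.

```r
# Toy stand-in for Mesh(fetch.onc): one data frame of headings per article,
# with NA assumed for an article that has no MeSH terms (illustrative data)
mesh_toy <- list(
  data.frame(Heading = c("Oncolytic Viruses", "Neoplasms"),
             stringsAsFactors = FALSE),
  NA,
  data.frame(Heading = "Oncolytic Virotherapy", stringsAsFactors = FALSE)
)

# Hypothetical helper: join one article's headings with ";"
collapse_headings <- function(x) {
  if (!is.data.frame(x)) return("")   # no MeSH terms for this article
  paste(x$Heading, collapse = ";")
}

# vapply returns a character vector directly, so no unlist() is needed
mesh_per_article <- vapply(mesh_toy, collapse_headings, character(1))
mesh_per_article
```

Each article ends up as exactly one string, which keeps the MeSH vector aligned with the title vector row by row.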

View results in Excel

# export both title and MeSH results as a text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10, ] # view first 10 results
write.table(tit.mh, file = "tit_mesh_oncolytic_virus.txt", row.names = FALSE, sep = "\t")
# open the file in Excel
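Before opening the file in a spreadsheet, the export can be checked by reading it back into R. The sketch below uses made-up titles and terms and writes to a temporary file (out_file is an illustrative name), so it does not touch the real results.

```r
# Illustrative data standing in for onc.tit and mh.list
titles <- c("Title A", "Title B")
mesh   <- c("Term1;Term2", "Term3")
tab    <- cbind(onc.tit = titles, mh.list = mesh)

# Write a tab-separated file to a temporary location
out_file <- file.path(tempdir(), "tit_mesh_example.txt")
write.table(tab, file = out_file, row.names = FALSE, sep = "\t")

# Read it back: one row per article, one column per field
check <- read.delim(out_file, stringsAsFactors = FALSE)
dim(check)
```

Because the MeSH terms are joined with ";" rather than tabs, each article stays on a single row when the file is opened as a tab-delimited table.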

[Screenshot: the exported table in Excel, with one column containing the titles and one column containing the corresponding MeSH terms]

Preparing for Text Mining Analysis

Type getwd() in the R console to display the R working directory. In my case:
[1] "C:/Documents and Settings/PMarqui/My Documents"
Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus).
Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
Start the Text Mining Analysis.
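The folder can also be created and filled from within R instead of by hand. The following is a sketch only: a temporary directory stands in for the working directory, and two dummy files stand in for the exported results.

```r
# Illustration only: tempdir() stands in for getwd(), and two small
# dummy files stand in for the exported title and MeSH results
work_dir <- tempdir()
files <- c("title_oncolytic_virus.txt", "mesh_oncolytic_virus.txt")
writeLines("example title", file.path(work_dir, files[1]))
writeLines("example;mesh", file.path(work_dir, files[2]))

# Create the corpus folder and copy both files into it
corpus_dir <- file.path(work_dir, "OncolyticVirus")
dir.create(corpus_dir, showWarnings = FALSE)
copied <- file.copy(file.path(work_dir, files), corpus_dir, overwrite = TRUE)

list.files(corpus_dir)
```

list.files() returns the names in alphabetical order, so the MeSH file comes before the title file. The same order determines the column order of the term-document matrix built later.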

Text Mining Analysis


# Type the following in the R console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl = list(reader = readPlain))
# Note that "OncolyticVirus" refers to the name of the newly created folder. In DirSource("...") you must use the name given to the folder containing the 2 text files.
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern = "[^[:alnum:][:space:]]", replacement = " ") # replaces every character that is not a letter, digit or whitespace with a space (this pattern also removes the dash "-")
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)
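The effect of the punctuation pattern can be checked on a single string with base R, without building a corpus. In this sketch the sample sentence is made up, and the whitespace squeeze is a base-R approximation of what stripWhitespace does.

```r
# Made-up sample text with double spaces, a dash and punctuation
raw <- "Oncolytic  virus-based therapy (phase 2):   results!"

# Same pattern as in the corpus step: anything that is not a letter,
# digit or whitespace becomes a space (the dash is replaced as well)
no_punct <- gsub("[^[:alnum:][:space:]]", " ", raw)

# Collapse runs of whitespace and trim the ends
squeezed <- gsub("[[:space:]]+", " ", no_punct)
cleaned  <- trimws(squeezed)
cleaned   # "Oncolytic virus based therapy phase 2 results"
```

Replacing punctuation with spaces creates new runs of whitespace, so squeezing whitespace after this step, rather than before it, gives cleaner text.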

Text Mining Analysis


# Continue and type the following code in the R console:
my.corpus <- tm_map(my.corpus, tolower) # converts to lower case letters
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get their common origin (stem)

Document matrix

my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix, displaying all the terms found in either of the 2 documents
my.corpus.df[200:250, 1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of the my.corpus.df data frame to keep the original for later
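To make the shape of the resulting data frame concrete, a term-document matrix can be built by hand for two tiny documents. This is a base-R sketch of the idea, not the tm implementation; the documents and the names docs, tokens and tdm are illustrative.

```r
# Two made-up, already cleaned documents
docs <- c(mesh  = "virus neoplasm virus therapy",
          title = "virus therapy trial")

# Split into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are documents
tdm <- vapply(tokens,
              function(tok) as.vector(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

tdm_df <- as.data.frame(tdm)
tdm_df
```

A term that is absent from a document gets a count of 0, which is what later makes it possible to separate terms found in only one of the two files.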

Text Mining Analysis


# Continue and type the following code in the R console:
# sort the data frame by the most frequent MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE), ]
# assign the 50 most frequent MeSH terms to xx
xx <- my.corpus.df[1:50, ]
# view the top 5 most frequent MeSH terms; "head(xx, 5)" is equivalent
xx[1:5, ]
# sort the 50 most frequent MeSH terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE), ]
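The sort, select and re-sort pattern is easier to follow on a small table. The sketch below uses invented counts and keeps the top 2 rows instead of the top 50.

```r
# Invented term counts; row names play the role of the terms
freq <- data.frame(mesh  = c(3, 10, 0, 7),
                   title = c(5, 2, 8, 1),
                   row.names = c("gene", "virus", "trial", "tumor"))

# 1. sort by MeSH count, largest first
by_mesh <- freq[order(freq$mesh, decreasing = TRUE), ]

# 2. keep the top rows (2 here, 50 in the tutorial)
top2 <- by_mesh[1:2, ]

# 3. re-sort the selection in increasing order for plotting
top2_for_plot <- top2[order(top2$mesh), ]
rownames(top2_for_plot)   # "tumor" "virus"
```

The final increasing sort matters because coord_flip() draws the first row at the bottom of the plot, so the most frequent term ends up at the top.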

Text Mining Analysis


# Continue and type the following code in the R console:
# Plot the 50 most frequent MeSH terms using the ggplot2 library
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p1 <- last_plot() + scale_x_discrete(limits = Terms) # keeps the terms in sorted order on the axis
p1

Text Mining Analysis


View the 50 most frequent MeSH terms

Part 2

Text Mining Analysis


# Continue and type the following code in the R console:
# now select the most frequent title terms; therefore sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
xy[1:5, ] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
# Plot the 50 most frequent title terms
require(ggplot2)
Terms <- rownames(xy) # term labels that match the title counts in xy
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count), stat = "identity", fill = "darkblue") + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits = Terms)
p2

Text Mining Analysis


View the 50 most frequent title terms

Text Mining Analysis


# Continue and type the following code in the R console:
# Create separate data frames for each frequency type
my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt > 0) # terms common to the 2 documents
my.corpus.sub1.df[200:300, 1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt == 0 & title_oncolytic_virus.txt > 0) # terms present in title but not in MeSH
my.corpus.sub2.df[200:300, 1:2] # view some of the terms (200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt > 0 & title_oncolytic_virus.txt == 0) # terms present in MeSH but not in title
my.corpus.sub3.df[200:300, 1:2] # view some of the terms

# Correlate term counts in title and MeSH
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518
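The three subsets and the correlation can be reproduced on a small invented table to see which rows each condition selects. The counts below are illustrative and unrelated to the PubMed results.

```r
# Invented counts for four terms in the two documents
freq <- data.frame(mesh  = c(4, 1, 0, 2),
                   title = c(2, 1, 3, 0),
                   row.names = c("virus", "therapy", "trial", "neoplasm"))

both       <- subset(freq, mesh > 0 & title > 0)   # terms in both documents
title_only <- subset(freq, mesh == 0 & title > 0)  # terms only in titles
mesh_only  <- subset(freq, mesh > 0 & title == 0)  # terms only in MeSH

# Pearson correlation (the default of cor) between the two count columns
r <- cor(freq$title, freq$mesh)

rownames(both)
rownames(title_only)
rownames(mesh_only)
```

The three subsets are mutually exclusive, and together they cover every term that occurs at least once.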

Text Mining Analysis


# the line below generates a term frequency vector (total count of each term across the documents)
termFrequency <- rowSums(as.matrix(my.corpus.matrix))
my.tdm <- TermDocumentMatrix(my.corpus, control = list(minWordLength = 1))
my.tdm
# A term-document matrix (2632 terms, 2 documents)
# the lines below select, for each document, the terms of the term-document matrix that occur at least 100 times
findFreqTerms(my.tdm[, 1], lowfreq = 100)
findFreqTerms(my.tdm[, 2], lowfreq = 100)
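The frequency filter amounts to a threshold on one column of the count matrix. The sketch below mirrors that idea in base R with invented counts; frequent_terms is a hypothetical helper, and it is not guaranteed to return terms in the same order as findFreqTerms.

```r
# Invented counts: rows are terms, columns are documents
counts <- matrix(c(120, 5, 300,    # mesh column
                   99, 100, 0),    # title column
                 ncol = 2,
                 dimnames = list(c("virus", "trial", "cell"),
                                 c("mesh", "title")))

# Hypothetical helper: terms whose count in one document is >= lowfreq
frequent_terms <- function(m, column, lowfreq) {
  rownames(m)[m[, column] >= lowfreq]
}

frequent_terms(counts, "mesh", 100)    # "virus" "cell"
frequent_terms(counts, "title", 100)   # "trial"
```

The bound is inclusive, so a term that occurs exactly 100 times is kept.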

For part 2

Text Mining Analysis


# Code for plot 3: most frequent title terms with the corresponding MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]

# Plot the 50 most frequent title terms and the MeSH counts of those same terms
Terms <- rownames(xy) # term labels that match the counts in xy
Title.count <- xy$title_oncolytic_virus.txt
Mesh.count <- xy$mesh_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))
p3 <- last_plot() + coord_flip()
p3 <- last_plot() + scale_x_discrete(limits = Terms)
p3

plot 3: most frequent title terms with the corresponding mesh terms

Text Mining Analysis


# Code for plot 4: most frequent title terms and most frequent MeSH terms
top50.mh.ti <- rbind(xx, xy) # combine top 50 MeSH and title terms
Terms <- rownames(top50.mh.ti) # assign row names to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4
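One detail of rbind() is worth knowing here: a term that is in both top-50 lists appears twice in the combined data frame, and R makes the second row name unique by appending a numeric suffix. The sketch below shows this on invented data, along with one possible way to keep each term once; the names top_mesh, top_title and combined are illustrative.

```r
# Invented top lists that share the term "virus"
top_mesh  <- data.frame(mesh = c(9, 7), title = c(4, 0),
                        row.names = c("virus", "neoplasm"))
top_title <- data.frame(mesh = c(9, 1), title = c(4, 6),
                        row.names = c("virus", "trial"))

# Plain stacking keeps both copies; the duplicate row name is made unique
stacked <- rbind(top_mesh, top_title)
rownames(stacked)

# Possible alternative: add only the title terms not already present
new_terms <- setdiff(rownames(top_title), rownames(top_mesh))
combined  <- rbind(top_mesh, top_title[new_terms, ])
rownames(combined)
```

Whether to keep or drop the duplicates is a presentation choice. With the plain stacking used in the slides, shared terms are simply drawn twice at the same coordinates.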

Text Mining Analysis


plot 4: most frequent title terms and most frequent mesh terms

Text Mining Analysis


my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE), ]
xy <- my.corpus.df[1:50, ] # assign the 50 most frequent title terms to xy
xy[1:5, ] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE), ]
top50.mh.ti <- rbind(xx, xy) # combine top 50 MeSH and title terms
top50.mh.ti$Term <- rownames(top50.mh.ti) # store the terms in their own column
rownames(top50.mh.ti) <- NULL # reset the row names of the data frame
colnames(top50.mh.ti)[1] <- "msh" # change column name
colnames(top50.mh.ti)[2] <- "title" # change column name

# plot 5: most frequent title terms and most frequent mesh terms

Text Mining Analysis

# plot 5: most frequent title terms and most frequent MeSH terms
library("reshape2")

# reshape2 is used to transform wide-format data by means of the melt function. The melt function takes data in wide format and stacks a set of columns into a single column of data.

top50.melt <- melt(top50.mh.ti, measure.vars = c("title", "msh"))
top50.melt
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5 <- last_plot() + scale_x_discrete(limits = unique(top50.melt$Term))
p5
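What melt does to the table can be spelled out in base R on a two-row example. This is a sketch of the wide-to-long idea only, not the reshape2 implementation; the data and the names wide and long are illustrative.

```r
# Wide format: one row per term, one column per count type
wide <- data.frame(msh   = c(9, 7),
                   title = c(4, 0),
                   Term  = c("virus", "neoplasm"),
                   stringsAsFactors = FALSE)

# Long format: the two count columns are stacked into a single "value"
# column, and "variable" records which column each value came from
long <- data.frame(
  Term     = rep(wide$Term, times = 2),
  variable = rep(c("title", "msh"), each = nrow(wide)),
  value    = c(wide$title, wide$msh),
  stringsAsFactors = FALSE
)
long
```

The long layout lets a single colour = variable mapping draw both series in one ggplot call.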

Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.

Text Mining Analysis


# plot 5: most frequent title terms and most frequent mesh terms

p5 <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5

plot 5: most frequent title terms and most frequent mesh terms

