
Tools for Statistical Analysis of Microarray Data
Matt Ritchie
mritchie@wehi.edu.au
WEHI Bioinformatics

cDNA Microarray Story


Thousands of cDNA
sequences spotted on a
glass array
Hybridise and scan array
Image Analysis

[Screenshot: Spot output table, one row per spot, with columns indexs, grid.r, grid.c, spot.r, spot.c, area, Gmean, Gmedian, GIQR, Rmean, Rmedian, RIQR, bgGmean, bgGmed, bgGSD, bgRmean. The same rows are listed under "Raw Data (Post Image Analysis)".]


Outline

About R and packages available for analysis of cDNA microarray data
Object-oriented structure

An example analysis
Plotting capabilities
Options for normalization
Selecting differentially expressed genes

Lunching!

The R environment
Free, open-source implementation of the S language (GNU S); S-PLUS is the commercial counterpart
A language and environment for statistical computing and graphics
Runs on Linux, Windows and MacOS
Add-on libraries available for specialised statistical routines
Bioconductor bundle
Statistics for Microarray Analysis (SMA) for cDNA microarrays

Installing R
http://cran.r-project.org/
R Binaries link

Downloading Libraries
http://www.bioconductor.org
Released Packages link

Downloading Libraries
http://cran.r-project.org/
Package sources link to SMA

The Bioconductor project produces an open source software framework that will assist biologists and statisticians working in bioinformatics, with primary emphasis on inference using DNA microarrays.
Specific packages for cDNA arrays (Thanks to Jean and Sandrine!)
marrayInput
marrayPlots
marrayNorm

Class structure to keep track of standard microarray information
Array layout [marrayLayout]
Gene names and experimental information [marrayInfo]
Raw Cy3 and Cy5 data, gene names etc. [marrayRaw]
Normalized log-ratios (M values) and intensities (A values) [marrayNorm]

marrayRaw
Slot        Class
maRf        matrix
maRb        matrix
maLayout    marrayLayout
maGf        matrix
maGnames    marrayInfo
maNotes     character
maGb        matrix
maW         matrix
maTargets   marrayInfo

An Example
4 microarrays (human)
Cell line 1 vs Cell line 2 (AML1 vs GFP)
[Design diagram: AML1 and GFP, with labels 1 and 3]
Image analysis done using Spot
1723.spot
1737.spot
1738.spot
1739.spot
Array Layout: 20 x 20 x 4 x 12
Gene names (19200 spots on each array)
8kHela.gal

Basic Steps in Analysis


Read in data
Exploratory plots
Calculate spot intensities and log-ratios
A = (log2R + log2G)/2
M = log2R - log2G

Normalization (intensity-based)
Rank genes
Output results to file
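
The two formulas in the steps above can be tried directly in base R. This is only a minimal sketch: the intensity values are made up for illustration, no background correction is applied, and the slides obtain M and A through the marray classes rather than by hand.

```r
# Hypothetical foreground intensities for three spots (not taken from the slides)
R <- c(1024, 512, 2048)   # red (Cy5) channel
G <- c(512, 512, 4096)    # green (Cy3) channel

M <- log2(R) - log2(G)          # log-ratio
A <- (log2(R) + log2(G)) / 2    # average log-intensity

print(data.frame(R, G, M, A))
```

A spot with equal red and green intensity gets M = 0, and each doubling of red relative to green adds 1 to M.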

Getting Started
Load libraries: marrayInput, marrayPlots, marrayNorm and sma

library(sma)
help.start()

Read in Data
widget.marrayRaw()

Raw Data (Post Image Analysis)


indexs  grid.r  grid.c  spot.r  spot.c  area     Gmean  Gmedian      GIQR     Rmean  Rmedian      RIQR   bgGmean  bgGmed     bgGSD   bgRmean
     0       1       1       1       1   121  782.6529      786    0.2713   460.595      441  0.507772  626.9174     604  0.223515  366.1927
     1       1       1       1       2   112  795.1161      787    0.2531  480.4107      430   0.69478  661.4129     599  0.260361  405.6269
     2       1       1       1       3   115  640.2261      612  0.228259  415.7043      383  0.637267  594.4429     582  0.197048  354.1096
     3       1       1       1       4   144  636.4306      606  0.219474  387.3472      370  0.615804  586.6806     591  0.211949  358.0463
     4       1       1       1       5   136  683.8015      665  0.278104   447.875      421  0.590317  607.2785     595  0.185998  366.8059
     5       1       1       1       6   122  704.6721      686  0.252427  437.0902      394  0.574541  583.7286     579  0.225869  366.0238
     6       1       1       1       7   128  641.3516      617  0.235919  374.4453      344  0.712276  593.4861     570  0.208001  367.9306
     7       1       1       1       8   131  669.0992      675  0.228259  390.2748      352   0.65764  575.6381     569  0.204065  351.1619
     8       1       1       1       9   101  570.2475      582  0.204065  355.4455      315  0.852071  581.6649     573   0.21606  341.2147
     9       1       1       1      10   148  596.3851      579  0.221592  370.6149      357  0.667829  577.9004     571   0.19484  355.7835
    10       1       1       1      11   159  581.1635      578  0.199796  360.3962      329  0.576348  568.0617     567  0.232283  339.0044
    11       1       1       1      12   140  574.3071      565  0.167933  369.6929      340  0.565665  566.9535     562  0.175134  353.5767
    12       1       1       1      13   150    559.56      553   0.18629  327.4133      306  0.568426  562.5439     550  0.200304  354.1974
    13       1       1       1      14   135  576.6963      577  0.174049   363.963      352  0.664224  577.2553     577  0.185848  335.1957
    14       1       1       1      15   101  615.6931      603  0.202645   365.703      352  0.581229   576.665     572  0.197331  330.7488
    15       1       1       1      16   138  596.9565      592   0.21408  356.5725      341  0.636239  570.4036     557  0.186544  336.5067
    16       1       1       1      17   128  1076.281     1073  0.310304  646.0469      611  0.608136  621.8974     558  0.225912  361.0154
    17       1       1       1      18   127  1059.654     1043  0.238826  651.7402      625  0.590226       617     556  0.224742  342.3889
    18       1       1       1      19   134  1284.164     1335   0.19153  774.4776      753  0.472759  605.0754     564  0.226288  351.4824
    19       1       1       1      20   121  1238.289     1255  0.220095  787.5041      774  0.479653    632.79     565  0.207882    369.41
    20       1       1       2       1   128  879.0469      888  0.340957  485.3828      455  0.637088   639.602     603  0.225747  403.9005
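
The index columns of the Spot output follow the array layout: spot 20 is the first spot of the second row in the first grid. The sketch below maps a 0-based index to grid and spot coordinates. The function name spot_coords is hypothetical. The ordering (spot column fastest, then spot row, then grid column, then grid row) and the reading of the 4 x 12 grid as 4 grid columns by 12 grid rows are assumptions, not something the slides state.

```r
# Map a 0-based spot index (the "indexs" column) to grid/spot coordinates.
# Assumptions: spot.c varies fastest, then spot.r, then grid.c, then grid.r;
# each grid holds 20 x 20 spots and there are 4 grid columns.
spot_coords <- function(index, n_spot_r = 20, n_spot_c = 20, n_grid_c = 4) {
  per_grid <- n_spot_r * n_spot_c
  spot_c <- index %% n_spot_c + 1
  spot_r <- (index %/% n_spot_c) %% n_spot_r + 1
  grid_c <- (index %/% per_grid) %% n_grid_c + 1
  grid_r <- index %/% (per_grid * n_grid_c) + 1
  data.frame(indexs = index, grid.r = grid_r, grid.c = grid_c,
             spot.r = spot_r, spot.c = spot_c)
}

n_spots <- 20 * 20 * 4 * 12                 # spots per array
coords  <- spot_coords(c(0, 19, 20, n_spots - 1))
print(n_spots)
print(coords)
```

Under these assumptions the last index on the array lands in the last spot of the last grid, which is a quick sanity check on the layout dimensions.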

Array Layout, Gene names

Slots of aml.raw
Slot        Contents
maRf        19200x4
maRb        19200x4
maLayout    20x20x4x12
maGf        19200x4
maGnames    marrayInfo
maNotes     character
maGb        19200x4
maW         0x0
maTargets   marrayInfo

Access to slots: 3 options. For the red foreground intensities (maRf) in the object aml.raw:
aml.raw@maRf
slot(aml.raw, "maRf")
maRf(aml.raw)
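
The three access styles are general S4 behaviour and can be tried without the marray packages. The sketch below uses a toy class: ToyRaw and toyRf are hypothetical names standing in for marrayRaw and its maRf() accessor.

```r
library(methods)

# Toy S4 class standing in for marrayRaw (hypothetical, one slot only)
setClass("ToyRaw", representation(maRf = "matrix"))

# Accessor function in the style of maRf()
setGeneric("toyRf", function(object) standardGeneric("toyRf"))
setMethod("toyRf", "ToyRaw", function(object) object@maRf)

toy <- new("ToyRaw", maRf = matrix(1:6, nrow = 3))

a  <- toy@maRf            # direct slot access with @
b  <- slot(toy, "maRf")   # slot() takes the slot name as a string
c3 <- toyRf(toy)          # accessor method

print(identical(a, b) && identical(b, c3))
```

Accessor functions are usually preferred in user code because they keep working if the internal slot layout of a class changes.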

Exploratory Plots - maImage


help(maImage)

Individual Channels Cy3


maImage(aml.raw[,1], x="maGf", main="1723: Gf")

Spot Intensity
maImage(aml.raw[,1], x="maA", col=heat.colors(20),
main="1723: A")

Exploratory Plots - maBoxplot


help(maBoxplot)

maBoxplot Log-ratios
maBoxplot(aml.raw[,1], x="maPrintTip", y="maM",
main="1723: Non-normalized M by print-tip")

maBoxplot Intensity
maBoxplot(aml.raw[,1], x="maPrintTip", y="maA",
main="1723: A by print-tip")
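
What a boxplot stratified by print-tip shows can be sketched in base R with simulated data. Only the layout sizes (4 x 12 grids of 400 spots) come from the slides. The log-ratios and the per-tip shifts are made up, and this does not reproduce maBoxplot itself.

```r
set.seed(1)

# Simulated log-ratios for 4 x 12 print-tip groups of 400 spots each
n_tips <- 4 * 12
tip <- factor(rep(seq_len(n_tips), each = 400))
tip_effect <- rnorm(n_tips, sd = 0.3)        # hypothetical per-tip shift
M <- rnorm(length(tip), mean = tip_effect[as.integer(tip)], sd = 0.5)

# One box per print-tip group; boxes far from zero point to a tip effect
tip_medians <- tapply(M, tip, median)
boxplot(M ~ tip, xlab = "Print-tip group", ylab = "M",
        main = "Simulated M by print-tip", outline = FALSE)
abline(h = 0, lty = 2)
```

Groups whose boxes sit well above or below the dashed line are the ones a print-tip based normalization would shift the most.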

Normalization
help(maNorm)

Normalization
aml.norm.n <- maNorm(aml.raw, norm="none")
aml.norm.l <- maNorm(aml.raw, norm="loess")
aml.norm.p <- maNorm(aml.raw, norm="printTipLoess")
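
The idea behind intensity-based loess normalization can be sketched with base R's loess() on simulated data: fit a smooth curve of M against A and subtract it. The data, the shape of the dye bias, and the span and degree settings are assumptions chosen for illustration. This is not the internal code of maNorm.

```r
set.seed(42)

n <- 2000
A <- runif(n, min = 6, max = 14)       # simulated average log-intensity
bias <- 0.2 + 0.02 * (A - 10)^2        # hypothetical intensity-dependent dye bias
M <- bias + rnorm(n, sd = 0.2)         # simulated log-ratios with no true change

fit <- loess(M ~ A, span = 0.4, degree = 1)  # smooth trend of M against A
M_norm <- residuals(fit)                     # M minus the fitted trend

print(c(before = mean(M), after = mean(M_norm)))
```

A print-tip version of the same idea would fit a separate curve within each print-tip group instead of one curve for the whole array.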

Slots of aml.norm.n
Slot         Contents
maA          19200x4
maM          19200x4
maLayout     20x20x4x12
maMloc       0x0
maGnames     marrayInfo
maNotes      character
maMscale     0x0
maW          0x0
maTargets    marrayInfo
maNormCall   call

Diagnostic Plots - maPlot


help(maPlot)

Normalization
maPlot(aml.norm.n[,1], z=NULL, pch=".", main="1723 Non-normalized MA Plot")
maPlot(aml.norm.p[,1], z="maPrintTip", pch=".",
main="1723 Print-tip normalized MA Plot")

maPlots
maPlot(aml.norm.n[,3], z="maPrintTip", pch=".",
main="1738 Non-normalized MA Plot")
maPlot(aml.norm.p[,3], z="maPrintTip", pch=".",
main="1738 Print-tip normalized MA Plot")

Selecting Differentially Expressed Genes
Ave M, variance M
t-like test
B statistic - Lönnstedt and Speed (2002)
Posterior log odds

help(stat.bayesian)
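
The simpler ranking statistics in the list above (average M, variance of M, and a t-like statistic) can be computed per gene in base R. The matrix below is a made-up example of 3 genes on 4 arrays. The B statistic is not reproduced here.

```r
# Toy matrix of normalized log-ratios: 3 genes (rows) x 4 arrays (columns)
M <- rbind(geneA = c(1.0, 1.2, 0.8, 1.0),
           geneB = c(0.1, -0.1, 0.2, -0.2),
           geneC = c(-0.5, -0.7, -0.6, -0.6))

n     <- ncol(M)
Mbar  <- rowMeans(M)               # average M per gene
Mvar  <- apply(M, 1, var)          # variance of M per gene
tstat <- Mbar / sqrt(Mvar / n)     # one-sample t-like statistic

# Rank genes by the absolute value of the t-like statistic
ranking <- order(abs(tstat), decreasing = TRUE)
print(data.frame(Mbar, Mvar, tstat)[ranking, ])
```

With only a few arrays, a gene with a tiny variance can get a large t-like statistic by chance, which is one motivation for statistics that moderate the variance.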

Selecting Differentially Expressed Genes
aml.M <- sweep(maM(aml.norm.p), 2, c(1,1,-1,1), FUN="*")
aml.bayesian <- stat.bayesian(aml.M, nb=4, nw=2)
plot.bayesian(aml.M, nb=4, nw=2)
26 genes where B > 0
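
The sweep() call multiplies each array's column of M values by a sign, so that all arrays point in the same direction before averaging. A minimal sketch on a toy matrix follows. Treating array 3 as a dye-swap is an assumption based on the -1 in the sign vector, since the slides do not say so explicitly.

```r
# Toy log-ratio matrix: 2 genes x 4 arrays; array 3 has the opposite sign
M <- matrix(c(1, 1, -1, 1,
              2, 2, -2, 2), nrow = 2, byrow = TRUE)

signs <- c(1, 1, -1, 1)   # one multiplier per array (column)

# MARGIN = 2 applies the multipliers across columns
M_flipped <- sweep(M, MARGIN = 2, STATS = signs, FUN = "*")

print(M_flipped)
print(rowMeans(M_flipped))
```

Without the sign flip, the third array would partly cancel the other three when the M values are averaged per gene.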

Output Results
# Genes with positive log odds (B > 0); index1 is not defined on the slides
sel <- aml.bayesian$lods > 0
write.table(cbind(maGnames(aml.raw)@maLabels[index1][sel],
aml.bayesian$lods[sel],
aml.bayesian$Xprep$Mbar[sel]), "Results.txt",
row.names=FALSE, col.names=c("Gene Names", "B", "Ave M"), sep="\t",
quote=FALSE)
Gene Name                                                                                            B  Ave M
AA916325;aldo-keto reductase family 1, member C3 (3-alpha hydroxysteroid dehydrogenase, type II)  0.85  -0.63
AI949576;annexin A3                                                                               0.99  -0.59
AI927438;hemoglobin, beta                                                                         1.73  -1.39
AI969657;UDP-Gal:betaGlcNAc beta 1,4- galactosyltransferase, polypeptide 2                        0.07  -0.42
AI952285;EST, Highly similar to CA34_HUMAN COLLAGEN ALPHA 3(IV) CHAIN PRECURSOR [H.sapiens]       0.14  -0.61
AA598601;insulin-like growth factor binding protein 3                                             4.72  -0.84
N80129;metallothionein 1L                                                                         1.12   1.04
AA169469;pyruvate dehydrogenase kinase, isoenzyme 4                                               2.31  -0.72
AA676466;argininosuccinate synthetase                                                             0.14  -0.47
AA504348;ESTs, Highly similar to topoisomerase II alpha {C-terminal} [H.sapiens]                  0.81  -0.53
AA127096;enigma (LIM domain protein)                                                              3.35   0.83
AA446103;lectin, mannose-binding, 1                                                               3.07  -0.72
AI381323;creatine kinase, muscle                                                                  1.55  -0.24
AA430504;ubiquitin carrier protein E2-C                                                           0.73  -0.35
H70775;ESTs                                                                                       0.90  -0.40
AA126265;calnexin                                                                                 0.62  -0.66
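
The filter-and-write pattern can be tried on a small stand-alone data frame. The gene labels and values below are made up, and the output goes to a temporary file rather than Results.txt.

```r
# Toy results (labels and values are made up for illustration)
res <- data.frame(gene = c("g1", "g2", "g3", "g4"),
                  B    = c(1.7, -0.4, 0.1, -2.3),
                  Mbar = c(-1.4, 0.2, -0.4, 0.1))

keep <- res$B > 0                       # genes with positive log odds
out_file <- tempfile(fileext = ".txt")
write.table(res[keep, ], out_file, row.names = FALSE,
            col.names = c("Gene Names", "B", "Ave M"),
            sep = "\t", quote = FALSE)

# Read the file back to check what was written
back <- read.delim(out_file, check.names = FALSE)
print(back)
```

Building the logical filter once and reusing it for every column keeps the rows of the output aligned.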

Basic Steps in Analysis


Read in data: widget.marrayInput()
Exploratory plots: maImage(), maBoxplot(), maPlot()
Calculate spot intensities and log-ratios
A = 0.5*(log2R + log2G)
M = log2R - log2G
Normalization (intensity-based): maNorm()
Rank genes (using B statistic): stat.bayesian()
Output results to file: write.table()

Summary
R and the libraries used here are free!
Class structure for common array objects
Standard functions for plotting and normalization
Heading in a user-friendly direction
GUIs
functions well documented - help() command

First released May 2002


More analysis methods coming OR write your own!

Links
R - http://www.r-project.org/
Bioconductor - http://www.bioconductor.org/
SMA - http://www.stat.berkeley.edu/users/terry/zarray/Html/index.html
marray Tutorials - http://oz.berkeley.edu/~terry/zarray/Course/ (Labs 1 & 2)
WEHI Bioinformatics - http://www.wehi.edu.au/bioweb/index.html
Gordon Smyth's Microarray Site - http://www.statsci.org/ (Take the microarrays link)

References
Dudoit, S. and Yang, Y. H. (2002). Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data.
In Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L. (editors), The Analysis of Gene Expression Data: Methods and Software. Springer, New York. (To appear).

Acknowledgements
WEHI Genetics
and Bioinformatics
Gordon Smyth
Natalie Thorne
Terry Speed
Asa Wirapati
Hamish Scott
Joelle Michaud

UC Berkeley
Jean Yang
Sandrine Dudoit
Ben Bolstad
