Image Analysis: Pre-Processing of Affymetrix Arrays

1
Lecture 3
Pre-Processing of Affymetrix Arrays
Stat 697K, CS 691K,
Microbio 690K
2
Image Analysis
Based on information from Terry
Speed's group, UC Berkeley
Bolstad, Bioinformatics 2003
Irizarry, Biostatistics 2003
4
Affymetrix Terminology
Probe: an oligonucleotide of 25 bases (25-mer).
Each gene or portion of a gene is represented by 9 to 22
probes that uniquely identify a gene (current standard = 11).
Perfect match (PM): A 25-mer complementary to a reference
sequence of interest (e.g., part of a gene).
Mismatch (MM): same as PM but with a single base change
for the middle (13th) base. Purpose is to measure non-
specific binding and background noise.
Probe-pair: a (PM,MM) pair.
Probe-pair set: a collection of probe-pairs for a gene.
5
Image Analysis
Affymetrix arrays are processed using
MicroArray Suite MAS 5.1 software, from
Affymetrix
Images are scanned with lasers that generate
excitation light at 488 nanometers (nm)
The scanner produces one image which is
stored as a DAT file (~50 MB)
6
Affymetrix Chips
Each probe cell has millions of copies of the 25-
mer immobilized on the slide
On a typical chip, there are ~200,000 probe cells
11 pairs x 10,000 genes
7
Image Analysis
Gridding: MAS 5.1 software overlays a grid onto
the image, to locate probe cell centers
The user can adjust the grid
8
Image Analysis in MAS 5.1
[Figure panels: misaligned grid; adjusting the grid]
9
Expression Calculation
The raw data = DAT image files are converted to CEL
files
Each probe cell: 10x10 pixels
Signal for each probe cell (PM and MM):
1) Remove outer 36 pixels (reduces noise)
2) 8x8 pixels remain
3) The probe cell signal is the 75th percentile of the
8x8 pixel values
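The three steps above can be sketched in a few lines. This is only an illustration, not the MAS 5.1 implementation: the function name `probe_cell_signal` is hypothetical, and the linear-interpolation rule used for the percentile is an assumption (the slides do not say which percentile convention the software uses).

```python
def probe_cell_signal(cell):
    """Return the 75th percentile of the inner 8x8 pixels of a 10x10 cell."""
    if len(cell) != 10 or any(len(row) != 10 for row in cell):
        raise ValueError("expected a 10x10 grid of pixel values")
    # Drop the outer one-pixel border (36 pixels), leaving 64 inner pixels.
    inner = sorted(v for row in cell[1:9] for v in row[1:9])
    # Percentile by linear interpolation between order statistics
    # (an assumed convention, chosen for illustration).
    pos = 0.75 * (len(inner) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(inner) - 1)
    return inner[lo] + (pos - lo) * (inner[hi] - inner[lo])


# Toy cell: a bright border (1000) around inner pixels with values 0..63.
cell = [[1000] * 10 for _ in range(10)]
for r in range(8):
    for c in range(8):
        cell[r + 1][c + 1] = r * 8 + c

signal = probe_cell_signal(cell)  # the border pixels do not affect the result
```

In the toy cell the border is deliberately extreme, which shows why trimming the edge pixels reduces noise from neighboring cells.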
Background: Average of the lowest 2% of probe cell
values in a region of the chip is taken as the
background value for that region and subtracted
from values in the region
chip is divided into 16 regions (called sectors)
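A rough sketch of the background step for a single sector, assuming the probe cell values for that sector are already collected in a list. The function name `subtract_sector_background` is hypothetical, and the sketch ignores any smoothing between sectors that the real software may apply.

```python
def subtract_sector_background(values, fraction=0.02):
    """Subtract the mean of the lowest `fraction` of values in one sector."""
    # Number of values that make up the lowest 2% (at least one value).
    n_low = max(1, int(len(values) * fraction))
    background = sum(sorted(values)[:n_low]) / n_low
    corrected = [v - background for v in values]
    return corrected, background


# Toy sector with 100 probe cell values: 100, 101, ..., 199.
sector = list(range(100, 200))
corrected, bg = subtract_sector_background(sector)
```

On a real chip this would be repeated for each of the 16 sectors, each with its own background value.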
10
Use only center 8x8 pixels for signal
11
Affymetrix File Summary
1) DAT file: Image file, ~10^7 pixels, ~50 MB.
2) CEL file: Cell intensity file, probe level PM
and MM values (view with MAS 5.1, or read into
BioConductor), ~7 MB
3) CDF file: Chip Description File. Contains probe
locations on the chip. Describes which probes
go in which probe sets and the location of probe-
pair sets, ~7 MB
built into R and BioConductor (homework)
just need to identify the type of chip for BioConductor
12
4) EXP file: Contains sample information, fluidics
settings and scanner settings (small, ~1 kB)
13
5) RPT file: Quality report file (small, ~2 kB)
Main results:
a) percent present genes: should be 40-50%, but
at least greater than 25%
b) average background signal: should be less
than 100
c) how well the labeling reaction went
For further detail, see the Affymetrix manual
and website
http://www.bcm.edu/mcfweb/affy_qc.htm
14
5) RPT file: Quality report file
As long as the values are similar across the chips in an
experiment, the chips are good to use.
http://www.bcm.edu/mcfweb/affy_qc.htm
15
Affymetrix Files
We will not be analyzing the DAT files
We will be using the probe level CEL files in R
The CDF files are part of BioConductor and do
not need to be read in
BioConductor has built-in files for Affymetrix
16
6 Affymetrix Chips
[Figure; axis label: Log Intensity]
HuGeneFL Chips
Data from Wright et al 2002, human fibroblasts (involved in wound repair)
Cannot combine chips before normalization
17
Multiple Slides Normalization Methods
Extension of within-slide normalization.
Scale normalization step may be skipped if chips
have approximately the same distributions.
There is a trade-off between the gains achieved
by scale normalization and the possible increase
in variability introduced.
18
Why Normalize Affymetrix Data?
Total brightness differs among slides
Background is different among slides
Some causes of systematic measurement
variation include:
Different amounts of RNA
The hybridization reaction may proceed more fully to
equilibrium in one array than another
Hybridization conditions may vary across arrays
Scanner settings are often different
19
Types of Variation
Interesting variation
Gene expression differences
Obscuring variation
Sample preparation (labeling difference)
Array manufacturing (hybridization difference)
Array processing (scanner difference)
20
Multiple Affymetrix Chip Normalization
Normalize across slides, to combine information
from multiple slides
Can treat a pair of chips like the Red and Green channels in
cDNA arrays, and use the same approaches
as in cDNA technology
Or, select one array as the baseline array, and
use it as a benchmark
baseline chosen as best quality, or the chip with the
median total intensity of all chips
some packages automatically choose baseline array
(RMA)
21
Scale Normalization
Many variations of this:
Scaling individual intensities so median or
mean intensities are the same across all arrays
Scaling individual intensities so the total
intensity on an array is the same across all
arrays
Built into Bioconductor package affy,
normalization method = constant (uses mean,
using probe level intensities before
summarization)
http://www.bea.ki.se/staff/reimers/Web.Pages/Normalization.Intro.htm
22
Location Transformations: subtract
or add a constant to all values
Mean & Median centering are examples of location transformations
(see cDNA lecture)
23
Scale Transformations
Scale transformation = multiply all values by a constant
Scale transformations shift the median of the distribution and
change its spread (the distribution is stretched or compressed)
24
Normalization by Scaling
When comparing multiple arrays (with one sample or multiple
samples):
Assume the overall distribution of RNA intensity values does not
change much between samples
Most genes change very little in intensity across samples
Simplest approach assumes average gene expression is the same
for all arrays
This makes sense:
We are starting with equal quantities of RNA for the samples
we are going to compare
Therefore the average hybridization should be the same for all
samples
Source: Mark Reimers,
http://www.bea.ki.se/staff/reimers/Web.Pages/Normalization.Intro.htm
25
Scale Normalization using Affymetrix MAS
5.1 Software
1) Choose a baseline array
2) For each array i (besides baseline), multiply
each probe expression value by:
(probe value) x
[(mean expression on baseline array) /
(mean expression on array i)]
Results in each array having the same mean
intensity as baseline array
26
Scale Normalization
(background notes)
Affymetrix uses the 2% trimmed mean of the probeset
values. The probeset value is the expression value for a
gene, i.e. after summarization of probe-level data.
Definition: 2% trimmed mean: the mean after excluding the
highest and lowest 2% of probeset values
In the Bioconductor package affy,
normalization method = constant (uses mean,
using probe level intensities before
summarization)
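A small sketch of the 2% trimmed mean as defined above. The function name `trimmed_mean` is hypothetical, and rounding the number of trimmed values down to a whole number is an assumption made for illustration.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# 100 toy probeset values: 1..98 plus two extreme outliers.
values = list(range(1, 99)) + [10_000, -10_000]
tm = trimmed_mean(values)           # outliers are trimmed away
plain = sum(values) / len(values)   # ordinary mean, for comparison
```

With 100 values, 2 are dropped at each end, so the two outliers no longer influence the result.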
27
Affymetrix Scale Normalization using
BioConductor
[Figure panels: before normalization; after scale normalization]
28
Example
Step 1: calculate average intensity
for each slide
          Slide 1      Slide 2      Slide 3
          (baseline)
Probe 1   10           25           50
Probe 2   20           40           70
Probe 3   25           50           60
Probe 4   15           45           40
Mean      17.5         40           55
29
Step 2: Multiply each column by the baseline (Slide 1) average, 17.5,
then divide by that column's own average from Step 1
          Slide 1      Slide 2         Slide 3
          (baseline)
Probe 1   10           25 * 17.5/40    50 * 17.5/55
Probe 2   20           40 * 17.5/40    70 * 17.5/55
Probe 3   25           50 * 17.5/40    60 * 17.5/55
Probe 4   15           45 * 17.5/40    40 * 17.5/55
Mean      17.5
30
Column average intensities are now all
the same
          Slide 1      Slide 2      Slide 3
          (baseline)
Probe 1   10           10.94        15.91
Probe 2   20           17.5         22.27
Probe 3   25           21.88        19.09
Probe 4   15           19.69        12.73
Mean      17.5         17.5         17.5
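The worked example can be reproduced with a short script. This is a sketch using the plain mean, as in the example; the function name `scale_normalize` is hypothetical, and the assignment of values to Probe 1-4 follows the row order recovered from the slide.

```python
def scale_normalize(arrays, baseline=0):
    """Scale every array so that its mean equals the baseline array's mean."""
    means = [sum(a) / len(a) for a in arrays]
    target = means[baseline]
    # Each value is multiplied by (baseline mean) / (own array mean).
    return [[v * target / m for v in a] for a, m in zip(arrays, means)]


# Probes 1-4 on each slide, values taken from the example.
slides = [
    [10, 20, 25, 15],  # Slide 1 (baseline), mean 17.5
    [25, 40, 50, 45],  # Slide 2, mean 40
    [50, 70, 60, 40],  # Slide 3, mean 55
]
normalized = scale_normalize(slides)
new_means = [sum(a) / len(a) for a in normalized]  # all approximately 17.5
```

Note that only the means are matched: the spread of values within Slide 2 and Slide 3 still differs from that of the baseline.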
31
Disadvantage of Scale Normalization
If a poor baseline is chosen, the results are poorer
The normalized chips do not all have the same
distribution
32
Quantile Normalization
Introduced by Bolstad et al. 2003
Goal: To make the distribution of probe intensities the
same for every chip
The normalization distribution is calculated by averaging
each quantile across chips
Advantage: do not need a baseline array
normalizes a group of arrays at the same time
without specifying any one as the baseline array
It works with probe-level data
normalization method = quantiles
33
Quantile Normalization at PM Probe
Level
Each gene has 11 PM probes
Definition: a quantile is a cut point in the sorted
data: e.g. the 20th percentile (0.2 quantile) has 20% of
the data below it.
The algorithm gives each array the same
distribution by calculating the mean of each
quantile and substituting it as the data value in
the original dataset
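The algorithm described above can be sketched as follows. This is a minimal illustration rather than the Bioconductor implementation: the function name `quantile_normalize` is hypothetical, all chips are assumed to have the same number of probes, and ties are simply broken by position.

```python
def quantile_normalize(chips):
    """Give every chip the same distribution: the mean of each quantile."""
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must have the same number of probes")
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the i-th smallest value across chips.
    reference = [sum(s[i] for s in sorted_chips) / len(chips) for i in range(n)]
    result = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # substitute the quantile mean
        result.append(out)
    return result


chips = [
    [5, 2, 3, 4],
    [4, 1, 6, 2],
    [3, 4, 6, 8],
]
normalized = quantile_normalize(chips)
```

After normalization each chip contains exactly the same set of values, and each probe keeps its rank within its own chip.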
34
Columns are chips
Source: Ben Bolstad
35
Constant vs. Quantile Normalization
[Figure panels: before normalization; after constant normalization; after quantile normalization (distributions all the same)]
36
Remarks
For quantile normalization, the distribution
functions are effectively estimated by the sample
quantiles
Quantile normalization is fast
Variability of expression measures across chips
is reduced after normalization compared to
constant normalization and no normalization
Removes necessity of choosing baseline array
choosing poor baseline gives poorer results
37
Quantile Normalization Illustration
5 Affymetrix chips (version HG-U95A, HG=human
genome) of human liver cell lines (Bolstad et al. 2003)
Use M vs. A plots, where M and A are for each pair of
chips, using probe-level data
Plot pairwise PM probes for each pair of the 5 chips,
10 pairs in all:
\[ \binom{5}{2} = \frac{5!}{2!\,3!} = 10 \]
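A sketch of how the M and A values for one pair of chips might be computed, and of the pair count above. The definitions used here (M = difference of log2 intensities, A = average of log2 intensities) are the usual ones from the cDNA setting and are an assumption, since this slide does not spell them out; the function name `ma_values` is hypothetical.

```python
import math
from itertools import combinations


def ma_values(chip_x, chip_y):
    """Return (M, A) lists for two chips of positive PM intensities."""
    m, a = [], []
    for x, y in zip(chip_x, chip_y):
        lx, ly = math.log2(x), math.log2(y)
        m.append(lx - ly)        # M: log ratio
        a.append((lx + ly) / 2)  # A: mean log intensity
    return m, a


# All unordered pairs among 5 chips (indices 0..4).
pairs = list(combinations(range(5), 2))

# Toy PM intensities for one pair of chips.
m, a = ma_values([8, 16, 32], [4, 16, 128])
```

If two chips were perfectly normalized replicates, the M values would scatter around zero at every value of A.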
38
M vs. A plots of chip pairs: before quantile normalization
Bolstad et al., 2003
39
M vs. A plots of chip pairs: after quantile normalization
40
Quantile Normalization Illustration
27 HG-U95 Affymetrix arrays for different dilutions of
human liver tissue and central nervous system cell line
Black line is the distribution of all 27 arrays after quantile
normalization, i.e. all have the same distribution
41
Rank-Invariant Normalization
Idea: if a gene is differentially expressed
between 2 experiments, it should have a higher
rank in one array than another
42
Select a subset of probes (or genes) that are
non-differentially expressed, as the basis for
normalization
- similar to house-keeping gene idea
Fit a normalization curve through non-
differentially-expressed probes
normalization method = invariantset
Introduced by: Li and Wong 2001b, Tseng et al.
2001, Schadt et al. 2001, Stuart et al. 2001
43
Rank Invariant Normalization
Rank probes in each array separately
Probes whose ranks in the two arrays differ by less than a
threshold, e.g. within 500 out of 150,000, are
labeled as rank invariant
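A toy sketch of this rank comparison for two arrays. The function names `ranks` and `rank_invariant` are hypothetical, the threshold is treated as inclusive for illustration, and ties are ignored.

```python
def ranks(values):
    """Rank of each value within its array (1 = lowest intensity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def rank_invariant(array_a, array_b, max_rank_diff):
    """Indices of probes whose ranks differ by at most `max_rank_diff`."""
    ra, rb = ranks(array_a), ranks(array_b)
    return [i for i in range(len(array_a)) if abs(ra[i] - rb[i]) <= max_rank_diff]


# Probe 4 (0-based) drops from the second-highest rank to the lowest.
a = [10, 20, 30, 40, 50, 60]
b = [12, 19, 33, 41, 5, 65]
invariant = rank_invariant(a, b, max_rank_diff=1)
```

The probe whose rank changes sharply is excluded, which is how differentially expressed genes are kept out of the normalization set.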
44
Green = rank invariant set, not affected by differentially expressed
genes in lower right corner, therefore different normalization than
yellow line (here, normalized values are based on subtracting x values)
Yellow = smoothing spline (different method, similar idea to Lowess):
affected by lower right corner, which could be differentially expressed
genes
Li & Wong, GB, 2001
2 different samples in
an array set
y-axis is baseline
45
Comparison to Other Methods
Rank invariant method was not compared to
quantile normalization, scale normalization or no
normalization in Li & Wong 2001
Rank invariant method works well if there are a
small number of differentially expressed genes
in an experiment
BioConductor: Quantiles and Li & Wong method
are most widely used.
46
QQ plots (quantile-quantile plots)
A QQ plot is a graphical technique for
determining if 2 data sets come from populations
with a common distribution
It is a plot of the quantiles of the 1st data set
against the quantiles of the 2nd data set
If the 2 sets come from a population with the
same distribution, the points should fall on a
45-degree line
Built in function of R (qqplot, qqnorm)
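For readers without R at hand, the points of a QQ plot can be computed directly. This is a sketch only: the helper names `quantile` and `qq_points` are hypothetical, and the interpolation rule for quantiles is one of several common conventions, so the numbers need not match R's `qqplot` exactly.

```python
def quantile(sorted_values, q):
    """Quantile q (0..1) of pre-sorted data, by linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def qq_points(x, y, n_points=9):
    """Pairs (quantile of x, quantile of y) at evenly spaced probabilities."""
    xs, ys = sorted(x), sorted(y)
    probs = [(i + 1) / (n_points + 1) for i in range(n_points)]
    return [(quantile(xs, p), quantile(ys, p)) for p in probs]


x = [3, 1, 4, 1.5, 9, 2, 6, 5, 3.5, 8, 7]
same = qq_points(x, x)                        # falls on the 45-degree line
shifted = qq_points(x, [v + 3 for v in x])    # parallel to it, offset by 3
```

A shift between two arrays shows up as a line parallel to the 45-degree line, while a difference in scale changes the slope.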
47
QQ plot of normalized array vs baseline
2 replicate arrays should have same distribution
This is close to 45-degree line
Li & Wong, GB, 2001
48
Other Normalization Methods
Use Stable Genes
Definition: genes that exhibit similar expression across a
large number of tissues and conditions, but are allowed to
deviate from this level in a small number of cases
451 standard genes found across tissue samples in HuGE
Index (Hsiao et al. 2001, Physiol Genomics; used HU6800
array). HuGE Index (Human Gene Expression Index,
www.HugeIndex.org) is a public repository for gene
expression data on normal human tissues using high-density
oligonucleotide arrays.
49
Use Stable Genes
HG_U95 and HG_U133 arrays contain 100 normalization
control probe sets validated to have constant expression
across tissue types (Affymetrix 2002)
More databases to obtain and validate such genes? E.g. Su et
al. (2002) Large-scale analysis of the human and mouse
transcriptomes. PNAS
50
Credit
- Steve Qin
- Cheng Li
- Jun S. Liu
- Wing Wong
- Robert Gentleman
- Yee Hwa Yang
- Sandrine Dudoit
- Percy Luu
- Terry Speed
- Debashis Ghosh
- Rafael Irizarry
- Rebecca Fry
- Leona Samson
- David Hoyle
- Mark Reimers
- Ben Bolstad
- Fred Wright
These slides are based in large part on lectures by Steve
Qin, University of Michigan, with generous permission.
51
(background notes)
Example from Li & Wong, Genome Biology
2001
58 human brain sample microarrays
Choose a baseline array (array 11)
Normalize each array to the baseline array,
pairwise (baseline + 1 more)
HU6800 array, with ~140,000 probes
Expect a probe for a non-differentially
expressed gene to have similar intensity
ranks in two arrays
52
(background notes)
Rank probes in each pair of arrays
separately
expect a probe for non-differentially-expressed
genes to have the same rank in both arrays
Iterative procedure to identify rank invariant
set of probes (non-differentially-expressed
genes)
53
Identifying Rank-Invariant Probes
in a Pair of Arrays
(background notes)
1) Rank probes in each array separately
2) Calculate proportional rank difference (PRD)
for each probe:
\[ \mathrm{PRD}_p = \frac{(\text{rank}_p \text{ in array 1}) - (\text{rank}_p \text{ in array 2})}{\text{total number of probes}} \]
3) Threshold:
If |PRD_p| < 0.003, then a probe is rank-invariant
4) Repeat steps 1)-3) iteratively until the rank
invariant set does not change
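Steps 1) to 4) can be sketched as a loop that re-ranks the surviving probes on each pass. This is an illustration under stated assumptions, not the Li & Wong code: the function names are hypothetical, a single fixed threshold is used, the absolute value of the PRD is compared to the threshold, and ties are ignored.

```python
def ranks(values):
    """Rank of each value within its list (1 = lowest intensity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def iterative_rank_invariant(array_1, array_2, threshold=0.003, max_iter=100):
    """Indices of probes that remain rank-invariant once the set is stable."""
    keep = list(range(len(array_1)))
    for _ in range(max_iter):
        # Step 1: rank the current probes in each array separately.
        r1 = ranks([array_1[i] for i in keep])
        r2 = ranks([array_2[i] for i in keep])
        n = len(keep)
        # Steps 2-3: keep probes whose proportional rank difference is small.
        new_keep = [i for i, a, b in zip(keep, r1, r2) if abs(a - b) / n < threshold]
        # Step 4: stop when the set no longer changes.
        if new_keep == keep:
            break
        keep = new_keep
    return keep


# Ten toy probes; the last one is strongly down-regulated in array 2.
array_1 = [float(v) for v in range(1, 11)]
array_2 = [v + 0.5 for v in array_1[:9]] + [0.0]
# A large threshold is used only because the toy data set is tiny.
invariant = iterative_rank_invariant(array_1, array_2, threshold=0.2)
```

In the toy data the first pass removes the changed probe, and the second pass confirms that the remaining nine probes have identical ranks in both arrays.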
54
Example of Iterative Procedure
(background notes)
For total number of probes = 140,000
Proportional rank difference of 0.003 =
absolute rank difference of 420.
Find 10,000 probes with rank difference
within 420
Repeat process for new set of 10,000
probes
Repeat until the number of points in new
set does not decrease
Li & Wong, GB, 2001
55
Variations on Method
(background notes)
Threshold:
PRD < 0.003 (rank difference of 420) for low-ranking genes
(based on average rank)
- e.g. low ranks: rank 50 on array 1, rank 70 on
array 2
PRD < 0.007 (rank difference of 980) for high-ranking genes
- e.g. high ranks: rank 136,000 on array 1, rank
136,200 on array 2 (fewer points at high intensity)
** The larger threshold, 980, applies to high-ranking genes
(the threshold is interpolated between low- and high-ranking genes)
56
Rank Invariant Normalization Curve
(background notes)
A piecewise linear median line is fit through the
rank invariant probes
similar to Lowess curve fitting
The normalization curve is subtracted from each
probe in the non-baseline array
similar to Lowess normalization
Baseline array is not changed
all other arrays are changed in comparison to
baseline
