By
Br. Solymar Peraza Crespo
Carried out under the supervision of
Academic Tutor: Prof. Alfredo Ríos
Industrial Tutor: Dr. Juan Ramón González Ruiz
FINAL REPORT FOR COURSES IN TECHNICAL COOPERATION AND SOCIAL DEVELOPMENT
ABSTRACT
The PRBB, Parc de Recerca Biomèdica de Barcelona (Barcelona Biomedical Research Park), needed to classify an intensity matrix containing the information of several genetic probes for a group of individuals, so that a cluster analysis could then identify the classification into populations according to the copy number of each gene.
The project was carried out in four stages. First, familiarization with genetic terminology and with the statistical methods used in the earlier phases of the project. Second, development and implementation of a method to determine the copy number of each genetic probe, taking the distribution of the data into account. Third, classification of the individuals into populations using the k-medoids algorithm. Fourth, evaluation of the results obtained.
The results were satisfactory and provide a basis for future publications in statistical genetics.
KEYWORDS: Mixture estimation, clustering, genetic classification, genetics, SNPs, CNVs, HapMap Project, epidemiology.
Approved with distinction: _______
Nominated for the award: _______
Sartenejas, April 2009
TABLE OF CONTENTS
INTRODUCTION
Background
Justification of the Project
General Objectives:
Specific Objectives:
Structure of the Report
CHAPTER 1
THE COMPANY: THE PRBB
(PARC DE RECERCA BIOMÈDICA DE BARCELONA)
1.1. History and founding of the PRBB
1.2. Centres that make up the PRBB
1.2.1. Centre for Research in Environmental Epidemiology (CREAL)
1.2.2. Hospital del Mar (IMAS)
1.2.3. Municipal Institute of Medical Research (IMIM)
1.2.4. Department of Experimental and Health Sciences, Pompeu Fabra University (CEXS-UPF):
1.2.5. Centre for Genomic Regulation (CRG)
1.2.6. Centre of Regenerative Medicine in Barcelona (CMRB)
1.2.7. Institute of High Technology (IAT)
CHAPTER 2
THEORETICAL FOUNDATIONS
2.1. Genetic notions and concepts:
2.2. DNA (deoxyribonucleic acid)
2.3. Gene:
2.4. Alleles:
2.5. Genetic polymorphism:
There are several types of polymorphism:
2.6. Copy number variations (CNV)
2.6.1. Categorical variable:
2.6.2. Quantitative variable
2.7. Contingency tables
2.8. Normal or Gaussian distribution:
2.9. Kappa coefficient of agreement:
2.10. Maximum Likelihood Estimation
2.11. Mixtures of distributions:
2.12. EM (Expectation Maximization):
2.13. Cluster analysis (clustering)
2.13.1. Unsupervised:
a) Non-hierarchical cluster analysis
b) Hierarchical cluster analysis
2.13.2. Supervised clustering:
2.13.3. Distances used in the different clustering methods
a) Euclidean distance
b) Manhattan distance
c) Minkowski distance
d) Supremum distance
e) Canberra distance
f) Binary distance
g) Ward distance
2.14. Discriminant analysis:
2.14.1. Descriptive discriminant analysis:
2.14.2. Predictive discriminant analysis:
CHAPTER 3
THE HAPMAP PROJECT AND
PRE-PROCESSING OF THE DATA
3.1. The HapMap Project:
3.2. Pre-processing of the data:
CHAPTER 4
METHODOLOGY
CHAPTER 5
RESULTS
5.1. Gaussian mixtures:
5.2. Classification of the populations (unsupervised clustering)
5.3. Discriminant analysis (supervised clustering)
5.3.1. Supervised clustering of the classified data:
5.3.2. Supervised clustering of the original data:
5.4. Comparison of the classifications:
INTRODUCTION
Problem Statement
The PRBB needed a method to determine the copy number (factors) of each genetic marker in a group of individuals, taking into account the distribution of the population for each probe. The resulting data would then be used for a multivariate analysis and a classification of the individuals, in order to determine whether that classification distinguishes populations.
Background
In recent years, the study of the role that genes play in different areas of science, such as medicine, has grown rapidly. In particular, the relationship between certain genes and complex diseases has received a great deal of attention. One of the clearest examples in medicine is epidemiology. Many years of research have been devoted to studying in detail the environmental factors associated with the most common diseases, such as cancer, cardiovascular disease, or AIDS. Epidemiological studies now also examine the involvement of certain genes, as well as their interaction with known environmental factors.
Justification of the Project
Because obtaining genetic data is expensive, a way is needed to infer the information of actual interest, in this case the copy number, from what is currently measured, which is an intensity for each gene and each individual. Methods are therefore sought that reveal this information, so that the feasibility of classifying individuals with the new data can then be evaluated, all with a view to detecting the genes directly related to the differentiation of populations [...]
General Objectives:
The research to be carried out by the PRBB has the general objective of determining how many copies of each genetic probe each individual has, and then determining how many populations can be distinguished on the basis of these data.
Specific Objectives:
CHAPTER 1
THE COMPANY: THE PRBB
(PARC DE RECERCA BIOMÈDICA DE BARCELONA)
1.1. History and founding of the PRBB
The Barcelona Biomedical Research Park (Parc de Recerca Biomèdica de Barcelona, PRBB) was inaugurated in May 2006, after five years of construction and some twenty years of work to build a scientific infrastructure able to compete with the best European centres. The PRBB is a campus for the intensive production of knowledge in biomedicine and the health sciences, notable for its critical mass, its high-level research staff and its international character.
It is one of the largest biomedical research hubs in southern Europe. The PRBB, an initiative of the Generalitat de Catalunya, the Barcelona City Council and Pompeu Fabra University (UPF), is a large scientific infrastructure physically connected to the Hospital del Mar in Barcelona, bringing together six closely coordinated public research centres.
The centres that make up the PRBB seek to decipher the enigmas of life and address society's health problems. Their research staff are noted for their discoveries in the search for answers to today's major health problems, and for their contribution to improving quality of life and expanding knowledge. The commitment is manifold: from generating new knowledge in the health and life sciences to transferring technology and knowledge to the world of [...]
Cancer
Epidemiology and Public Health
Inflammatory and cardiovascular processes
Biomedical Informatics
Neuropsychopharmacology
CHAPTER 2
THEORETICAL FOUNDATIONS
[...] 23 unpaired chromosomes which, on combining (egg and sperm), form a new cell with 46 chromosomes. The result is a human being who is genetically unique and whose design is determined in equal parts by the father and the mother. Fig. 2.3 shows a strand of DNA.
Fig. 2.3
Human beings have approximately 30,000 genes, located at specific positions called loci (singular: locus), which determine the growth, development and functioning of our biochemical and physical systems.
2.3. Gene:
The concept of a gene varies with the phenomenon being described. If the focus is on the transmission of information or on mutation, the unit regarded as a gene may be the pair of nitrogenous bases or the chromosome itself. In the context of evolution, the gene is the smallest unit capable of being selected. A gene can also be defined as a segment of DNA containing the information needed to produce a specific protein. It is also widely known as a hereditary factor that controls a trait, such as eye colour, height, hair colour, hereditary diseases and, probably, many other things not yet discovered.
2.4. Alleles:
An allele is each of the alternative forms a gene can take, that is, its possible variants. Alleles differ in sequence, and these differences can show up as changes in the function of the gene. Most mammals are diploid: they carry two alleles of each gene, one from each parent, and each pair of alleles sits at the same locus, or position, on the chromosome.
Alleles can differ in sequence or in function. Those that differ in sequence have differences such as insertions, deletions or nucleotide substitutions. Alleles that differ in function may or may not have known sequence differences, but are assessed by the way they affect the organism.
According to their expression in the phenotype, they can be classified as:
[...] nucleotide. SNPs make up as much as 90% of all human genomic variation and occur, on average, every 100 to 300 bases along the human genome. Two thirds of SNPs correspond to the substitution of a cytosine (C) by a thymine (T). These variations in the DNA sequence can affect how individuals respond to diseases, bacteria, [...]
As with any type of genetic variation, CNVs can vary in frequency and occurrence between populations, telling us something about our shared history. As a result of our common origin, the great majority of CNVs (89%) are shared among the various populations studied. Fig. 2.6 shows the main types of variation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1)$$
where $\mu$ (mu) is the mean and $\sigma^2$ is the variance.
Fig. 2.7
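As a quick check of equation (1), the following R sketch evaluates the density by hand and compares it with R's built-in `dnorm`. The helper name `gauss_density` is illustrative and does not come from the report.

```r
# Gaussian density written directly from equation (1).
# gauss_density is a hypothetical helper used only for illustration.
gauss_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)
# Largest absolute difference against the built-in implementation.
max_diff <- max(abs(gauss_density(x, mu = 0, sigma = 1) - dnorm(x, mean = 0, sd = 1)))
```

Note that `dnorm` is parameterised by the standard deviation, not the variance.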
2.9. Kappa coefficient of agreement:
It measures the degree of agreement between two vectors of mutually exclusive categories. It is preferred over other agreement indices because it corrects for the share of agreement that may be due to chance; that is, it indicates how far the observed agreement exceeds the agreement expected by chance.
$$\kappa = \frac{P_o - P_e}{1 - P_e} \qquad (2)$$
$$P_o = \frac{\text{No. of agreements}}{\text{No. of agreements} + \text{No. of disagreements}} \qquad (3)$$
$$P_e = \sum_{i=1}^{n} (p_{i1}\, p_{i2}) \qquad (4)$$
Where:
n = number of categories
i = index of the category (from 1 to n)
$p_{i1}$ = proportion of occurrences of category i for observer 1.
$p_{i2}$ = proportion of occurrences of category i for observer 2.
Kappa       Degree of agreement
< 0         no agreement
0 - 0.2     negligible
0.2 - 0.4   low
0.4 - 0.6   moderate
0.6 - 0.8   good
0.8 - 1     very good
Table 1
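Equations (2) to (4) can be sketched in a few lines of base R. The function name `cohen_kappa` and the two rating vectors are invented for illustration; this is not the `concord` routine used in the appendix.

```r
# Cohen's kappa for two label vectors, following equations (2)-(4).
# cohen_kappa is a hypothetical helper, not the function used in the report.
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  lev <- sort(unique(c(a, b)))
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  n <- sum(tab)
  po <- sum(diag(tab)) / n                         # observed agreement, eq. (3)
  pe <- sum((rowSums(tab) / n) * (colSums(tab) / n))  # chance agreement, eq. (4)
  (po - pe) / (1 - pe)                             # eq. (2)
}

rater1 <- c(1, 1, 2, 2, 3, 3, 1, 2)   # made-up labels from observer 1
rater2 <- c(1, 1, 2, 3, 3, 3, 1, 1)   # made-up labels from observer 2
kappa_value <- cohen_kappa(rater1, rater2)
```

For these made-up vectors the two observers agree on 6 of 8 items and the chance agreement is 21/64, so kappa is 27/43, roughly 0.63, which falls in the "good" band of Table 1.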
2.10. Maximum Likelihood Estimation
Most statistical procedures assume that the data follow some mathematical model defined by an equation in which one or more parameters are unknown. This raises the problem of calculating or estimating those unknown parameters from the information gathered in a study designed for that purpose.
Maximum likelihood is one of the most versatile procedures for estimating the parameters of a probability distribution, since it can be applied in a wide range of situations.
Definition of the estimation problem:
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a sample that we believe follows a probability distribution $p(x \mid \Theta)$ with parameters $\Theta$. We want to estimate the parameters $\Theta^*$ that best fit the sample.
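$$L(\Theta \mid X) = \prod_{i=1}^{n} p(x_i \mid \Theta) \qquad (5)$$
Then
$$\Theta^* = \arg\max_{\Theta} L(\Theta \mid X) \qquad (6)$$
$$\frac{\partial \log L(\Theta \mid X)}{\partial \Theta} = 0 \qquad (7)$$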
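As an illustration of equations (5) to (7), the sketch below estimates the mean and standard deviation of a simulated normal sample by numerically maximising the log-likelihood and compares the result with the closed-form solution. The simulated data, the starting values and the log-scale parameterisation of the standard deviation are all assumptions made for the example.

```r
set.seed(1)
x <- rnorm(500, mean = 2, sd = 1.5)   # simulated sample, for illustration only

# Negative log-likelihood of a normal sample. The standard deviation is
# parameterised on the log scale so the optimiser never proposes a negative value.
neg_loglik <- function(par, x) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

fit <- optim(c(0, 0), neg_loglik, x = x)          # Nelder-Mead by default
mle_numeric <- c(mu = fit$par[1], sigma = exp(fit$par[2]))

# Closed-form maximum likelihood estimates for the normal distribution.
mle_closed <- c(mu = mean(x), sigma = sqrt(mean((x - mean(x))^2)))
```

The two sets of estimates should agree to within the tolerance of the optimiser. The closed-form standard deviation divides by n rather than n - 1, which is what maximising the likelihood yields.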
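$$p(x \mid \Theta) = \sum_{j=1}^{M} q_j\, p_j(x \mid \theta_j) \qquad (12)$$
with $\Theta = \{q_j, \theta_j\}$ and $\sum_{j=1}^{M} q_j = 1$.
$$\log L(\Theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{M} q_j\, p_j(x_i \mid \theta_j) \qquad (13)$$
$$\frac{\partial \log L(\Theta \mid X)}{\partial \Theta} = 0$$
Figures 2.8 and 2.9 show an example of a data set that cannot be estimated [...]
Fig. 2.8 and Fig. 2.9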
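$$p(z \mid \Theta) = p(x, y \mid \Theta) = p(x \mid y, \Theta)\, p(y \mid \Theta) \qquad (14)$$
In this case $L(\Theta \mid Z) = L(\Theta \mid X, Y)$ cannot be evaluated, because Y is unknown.
$$Q(\Theta \mid \Theta^{g}) = \int \log L(\Theta \mid X, y)\, p(y \mid X, \Theta^{g})\, dy = E[\log p(x, y \mid \Theta) \mid X, \Theta^{g}] \qquad (15)$$
The EM algorithm then searches for the optimal parameters, starting from arbitrary ones:
$$Q(\Theta \mid \Theta^{(t)}) = E[\log p(x, y \mid \Theta) \mid X, \Theta^{(t)}] \qquad (16)$$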
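For a mixture of univariate Gaussians, the expectation in equation (16) reduces to posterior membership probabilities (E-step) and the maximisation has closed-form weighted updates (M-step). The following is a minimal sketch under that assumption; the function name `em_gaussian_mixture`, the simulated data and the fixed number of iterations are illustrative choices, not the report's implementation.

```r
# EM for a univariate Gaussian mixture with M = length(mu) components.
# em_gaussian_mixture and its arguments are illustrative names.
em_gaussian_mixture <- function(x, mu, sigma, q, iters = 200) {
  for (it in seq_len(iters)) {
    # E-step: posterior probability that x[i] belongs to component j.
    dens <- sapply(seq_along(mu), function(j) q[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)
    # M-step: weighted updates of proportions, means and standard deviations.
    nj <- colSums(resp)
    q <- nj / length(x)
    mu <- colSums(resp * x) / nj
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nj)
  }
  list(q = q, mu = mu, sigma = sigma, posterior = resp)
}

set.seed(42)
# Simulated intensities around 1 and 2, for illustration only.
x <- c(rnorm(300, mean = 1, sd = 0.1), rnorm(200, mean = 2, sd = 0.1))
fit <- em_gaussian_mixture(x, mu = c(0.8, 2.3), sigma = c(0.3, 0.3), q = c(0.5, 0.5))
```

A fixed iteration count keeps the sketch short; a practical version would stop once the log-likelihood of equation (13) no longer increases appreciably.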
[...] based on the attributes of these data. It begins with a sample of k data points chosen at random from the original data matrix, which serve as the initial centroids of the k clusters to be formed. The distance matrix is computed from these centroids to each of the remaining data points, and each point is assigned to its nearest centroid. The distance matrix is then recomputed, replacing each centroid with the mean of the data assigned to it, and the algorithm repeats the previous step.
The objective is to minimize the dissimilarity among elements within each cluster and to maximize the dissimilarity between elements that fall in different clusters.
The algorithm is as follows:
1. Take as input a data set S and the number of clusters to form, k.
2. Select the initial centroids of the K groups: $c_1, c_2, \ldots, c_K$.
3. Assign each observation $x_i$ of S to the cluster $C(i)$ whose centroid $c_{C(i)}$ is closest to $x_i$. That is, $C(i) = \arg\min_{1 \le k \le K} \lVert x_i - c_k \rVert$.
4. For each cluster, recompute its centroid from the elements it contains, minimizing the within-cluster sum of squares. That is,
$$WSS = \sum_{k=1}^{K} \sum_{C(i)=k} \lVert x_i - c_k \rVert^2 \qquad (18)$$
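The four steps above can be sketched in base R as follows. The function `simple_kmeans` is a hypothetical, minimal version written for illustration; in practice R's built-in `kmeans` would normally be used. The test data are simulated.

```r
# Minimal k-means following steps 1-4; simple_kmeans is an illustrative name.
simple_kmeans <- function(S, k, max_iter = 100) {
  S <- as.matrix(S)
  centers <- S[sample(nrow(S), k), , drop = FALSE]   # step 2: random initial centroids
  cluster <- rep(0L, nrow(S))
  for (it in seq_len(max_iter)) {
    # Step 3: squared distance from every observation to every centroid.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(S, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")   # index of the nearest centroid
    if (all(new_cluster == cluster)) break               # assignments stable: stop
    cluster <- new_cluster
    # Step 4: each centroid becomes the mean of its assigned observations.
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j, ] <- colMeans(S[cluster == j, , drop = FALSE])
    }
  }
  wss <- sum((S - centers[cluster, , drop = FALSE])^2)   # equation (18)
  list(cluster = cluster, centers = centers, wss = wss)
}

set.seed(1)
# Two simulated, well-separated groups of 50 points each.
S <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
           matrix(rnorm(100, 5, 0.3), ncol = 2))
fit <- simple_kmeans(S, k = 2)
```

Because the initial centroids are random, different seeds can lead to different local minima of the WSS on harder data; running several random starts and keeping the lowest WSS is a common remedy.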
[...] that centre. It has the advantage of being less vulnerable to extreme values.
A medoid can be defined as the best-centred point within the group of data.
The algorithm is as follows:
1. Start with an arbitrary number of medoids (k < n), chosen by the user and placed arbitrarily within the data set.
2. Assign each data point to the medoid it is most similar to. Here similarity is defined by the distance (Euclidean, Manhattan or Minkowski).
3. Select a new set of medoids at random.
4. Compute the cost C of replacing the previous set of medoids with the new one.
5. If C > 0, revert to the previous set of medoids; if C < 0, adopt the new set of medoids and recompute the groups.
6. Repeat steps 2 to 5 until the medoids no longer change.
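In R, k-medoids is available as `pam()` in the `cluster` package, which is also the function called in the appendix. The sketch below runs it on simulated data with one extreme observation. Note that `pam()` searches with a build phase followed by a swap phase rather than the purely random replacement of step 3, so it is a related procedure and not a literal transcription of the steps above.

```r
library(cluster)   # provides pam()

set.seed(7)
# Two simulated groups of 30 points each, plus one extreme observation.
S <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2))
S <- rbind(S, c(20, 20))

fit <- pam(S, k = 2, metric = "manhattan")
med_rows <- fit$id.med        # row indices of the medoids (actual observations)
labels   <- fit$clustering    # cluster label assigned to each row
```

Because each medoid is one of the observations, a single extreme point cannot drag the cluster centre away from the bulk of the data the way it shifts a mean.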
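Fig. 2.10 Dendrogram
a) Euclidean distance
Let $x_j = (x_{j1}, x_{j2}, \ldots, x_{jm})$ be the j-th row of the data matrix $X_{n \times m}$. [...]
$$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2} = \sqrt{(x_i - x_j)^T (x_i - x_j)} \qquad (19)$$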
b) Manhattan distance
It is given by:
$$b_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}| \qquad (20)$$
c) Minkowski distance
It is given by:
$$m_{ij} = \left( \sum_{k=1}^{m} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda} \qquad (21)$$
e) Canberra distance
It is given by:
$$c_{ij} = \sum_{k=1}^{m} \frac{|x_{ik} - x_{jk}|}{|x_{ik} + x_{jk}|} \qquad (23)$$
f) Binary distance
It is used when the data are binary, that is, zeros and ones. It is computed by counting the bits that differ between $x_i$ and $x_j$, considering only positions where at least one of the two bits is non-zero.
g) Ward distance
Also known as the incremental sum of squares. The proximity measure between groups i, j and c is given by:
[...] (24)
where i = a, b, c
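Most of the distances listed in this section are available through the base R function `dist`. The sketch below computes them for two small made-up vectors so the values can be checked by hand; the exponent `p = 3` for the Minkowski distance is an arbitrary choice for the example.

```r
x <- rbind(a = c(1, 2, 3), b = c(4, 6, 3))   # coordinate differences: 3, 4, 0

d_euclid    <- dist(x, method = "euclidean")         # sqrt(3^2 + 4^2 + 0^2)
d_manhattan <- dist(x, method = "manhattan")         # 3 + 4 + 0
d_minkowski <- dist(x, method = "minkowski", p = 3)  # (3^3 + 4^3 + 0)^(1/3)
d_supremum  <- dist(x, method = "maximum")           # max(3, 4, 0)
d_canberra  <- dist(x, method = "canberra")          # 3/5 + 4/8 + 0/6

# Binary distance: proportion of differing bits among positions
# where at least one of the two vectors is non-zero.
bits <- rbind(c(1, 0, 1, 0), c(1, 1, 0, 0))
d_binary <- dist(bits, method = "binary")
```

`dist` returns an object of class "dist" holding the lower triangle of the distance matrix; `as.matrix()` converts it to a full square matrix when needed.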
CHAPTER 3
THE HAPMAP PROJECT AND
PRE-PROCESSING OF THE DATA
3.1. The HapMap Project:
The HapMap project is an initiative launched in October 2002 that aims to build a catalogue of the genetic variants that occur most commonly in humans, about which fairly little was known. It was created to describe what these variants are and how they are distributed across populations and regions of the world. The project itself does not use the information collected in studies relating genes to diseases, but it is designed to provide information that other researchers can use for that purpose, with the aim of developing new methods of prevention, diagnosis and treatment.
It is an effort by several countries to identify and catalogue genetic similarities and differences in human beings, a collaboration among scientists from Japan, the United Kingdom, Canada, China, Nigeria and the United States. All the information generated by the project is publicly accessible.
The purpose is to compare the genetic sequences of different individuals in order to identify regions of variable sequence that are shared or frequent. Making these data freely accessible helps biomedical researchers find the genes related to certain diseases, as well as to the response to certain therapeutic drugs.
[...] provide valuable experience in carrying out [...]
CHAPTER 4
METHODOLOGY
4.1. First phase: HapMap Project and pre-processing of the data
This phase covers obtaining the data and the preliminary work done by the PRBB researchers to obtain a smaller group of genes with a high concentration of SNPs and CNVs, in order to facilitate testing and the development of the method proposed later.
4.2. Second phase: Classification of the data into copy numbers (Gaussian mixture model)
Our study uses a selection of 144 genes (or genetic markers) for 272 individuals, stored in a matrix $X_{272 \times 144}$ with entries $x_{ij}$; that is, $x_{ij}$ is the intensity of gene j in individual i.
For each genetic probe we want to classify the individuals into a number of classes C using the continuous variable x, taking into account the variability of the data in each case. Some probes clearly show the peaks that separate the classes present, while others are harder to read by eye, as fig. 4.11 shows:
Fig. 4.11
We want to model the underlying variable C using the observed variable x. To do so, we propose a finite mixture model with C components:
(25)
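For orientation, a finite mixture of C Gaussian components is commonly written in the generic form below. The symbols $\pi_c$, $\mu_c$ and $\sigma_c$ are assumed notation chosen for this sketch and may differ from the notation of equation (25).

```latex
f(x) = \sum_{c=1}^{C} \pi_c \, \frac{1}{\sigma_c\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu_c)^2}{2\sigma_c^2}\right),
\qquad \pi_c \ge 0, \quad \sum_{c=1}^{C} \pi_c = 1
```

Each component density has the form of equation (1), and the weights play the role of the proportions $q_j$ in equation (12).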
        C1        C2      ...   Cc
x1      0.0001    0.95    ...   0.0007
...
xm      0.00098   0.98    ...   0.0007
Table 2
For example, in figure 4.12 the data point a would be assigned to class C2, since its probability of belonging to C1 is, in this case, far lower.
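The assignment rule illustrated by Table 2 can be sketched as follows: compute, for each intensity, the posterior probability of every class under the fitted mixture and keep the most probable one. The mixture parameters and intensities below are invented for the example and are not the report's fitted values.

```r
# Posterior class probabilities for each intensity under a two-component mixture.
# All parameter values here are invented for illustration.
q     <- c(0.3, 0.7)     # mixing proportions
mu    <- c(0.5, 1.0)     # component means (normalised intensity)
sigma <- c(0.08, 0.08)   # component standard deviations

x <- c(0.45, 0.52, 0.93, 1.05)
dens <- sapply(seq_along(mu), function(j) q[j] * dnorm(x, mu[j], sigma[j]))
posterior <- dens / rowSums(dens)            # one row per datum, one column per class
colnames(posterior) <- paste0("C", seq_along(mu))

# Index of the most probable class for each datum.
assigned <- max.col(posterior, ties.method = "first")
```

With these values the first two intensities are assigned to class 1 and the last two to class 2, and each row of `posterior` sums to one, as in Table 2.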
[...] chosen by the user. As fig. 4.14 shows, all data below this parameter are considered to have zero copies:
That is, class 1, corresponding to zero copies, is assigned to all values below the threshold.
[...] (27)
where [...] with [...]
The value of [...] (28)
[...] performs a classification both for the unclassified data and for the new class data obtained with the mixture model explained previously.
This lets us compare the classification of the individuals before and after the process. In our case we know that the samples come from three groups or populations of different ancestry, and we want to see whether variations in copy number allow us to tell these populations apart.
To compare the results, we build a contingency table of the classes obtained by the k-medoids classification against the vector recording which population each individual belongs to, together with a kappa coefficient measuring the agreement between the two.
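A contingency table of this kind can be built with `table()`. The sketch below uses simulated data with three well-separated groups labelled CEU, CHB and YRI; the data, the group centres and the number of variables are invented, and only the mechanics of the comparison are meant to carry over.

```r
library(cluster)   # provides pam()

set.seed(3)
pop <- rep(c("CEU", "CHB", "YRI"), each = 20)      # known labels (simulated)
centers <- c(CEU = 0, CHB = 3, YRI = 6)            # invented group centres
# 60 individuals (rows) by 5 variables (columns); row i is centred on its group.
X <- matrix(rnorm(60 * 5, mean = centers[pop], sd = 0.4), ncol = 5)

cl  <- pam(X, k = 3)$clustering
tab <- table(cluster = cl, population = pop)
```

Cluster numbers are arbitrary, so a perfect classification shows up as exactly one non-zero cell in every row and column of `tab` rather than as a filled diagonal. The cluster labels must be matched to the populations before computing kappa.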
We then perform a discriminant analysis, in which the classification (clustering) of the individuals is recomputed in a supervised way using the R function discrimin. The individuals are first classified according to the intensity of each genetic probe $x_{ij}$ and then according to the copy number of each gene $c_{ij}$, and in each case the result is compared with the original groups (CEU, YRI, CHB). This gives an idea of how accurate the results are once each probe intensity has been converted into a copy number.
The discriminant analysis will subsequently make it possible to identify the variables that discriminate between two or more previously defined groups and to establish differences between those groups. The idea is to identify the genes that are relevant to the differentiation of the populations. This phase is still under study [...]
CHAPTER 5
RESULTS
5.1. Gaussian mixtures:
An experimental data matrix was used, with the information of 144 genetic markers (columns) for 272 individuals (rows) belonging to three populations, to which we applied the normal mixture model described in the previous chapter.
Fig. 5.16 shows the classification of one of the 144 probes into the classes 0, 1, 2 and 3 copies of the gene. For these data the threshold is 0.2.
This classification is carried out for each of the probes, and the result is a matrix of 144 genetic markers (columns) and 272 [...]
The k-medoids analysis of the original matrix suggests that the best classification is obtained with k = 2 or 3, that is, with 2 or 3 populations, but it does not indicate which of the two is better. For the classified data, by contrast, the best classification is clearly obtained with K = 3 populations.
We know a priori that the data come from three populations. The classification obtained is therefore better for the discrete data, that is, for the matrix produced by the latent class model, since the other one does not clearly discern between two and three populations.
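The text does not state here which criterion was used to compare values of k. One common option with `pam()` is the average silhouette width, sketched below on simulated data with three groups. This is an assumption made for illustration, not necessarily the criterion behind the results above.

```r
library(cluster)   # provides pam() and its silhouette information

set.seed(11)
grp <- rep(1:3, each = 25)                         # three simulated groups
X <- matrix(rnorm(75 * 4, mean = 3 * grp, sd = 0.5), ncol = 4)

ks <- 2:5
# Average silhouette width of the PAM solution for each candidate k.
avg_width <- sapply(ks, function(k) pam(X, k = k)$silinfo$avg.width)
best_k <- ks[which.max(avg_width)]
```

A clearly highest average width at a single k corresponds to the unambiguous case described for the classified data, whereas nearly equal widths at k = 2 and k = 3 would correspond to the ambiguous case described for the original matrix.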
5.3. Discriminant analysis (supervised clustering)
5.3.1. Supervised clustering of the classified data:
Here we see the clusters obtained with the discrimin function for the data $c_{ij}$ produced by the normal mixture model, compared with the true classification of the populations. Fig. 5.18 shows that in both cases the clusters are perfectly separated.
Groups 1, 2 and 3 (discrimin)
Data xij
CE
CH
YRI
CE
CH
YRI
(2*PA-1) = 0.793103
53
16
Kappa coefficient:
128
117
56
48
25
58
(2*PA-1) = 0.739464
Table 3
APPENDICES
library(mixdist)   # Assumed source of mixgroup(), mixparam() and mix()
cop <- c(0, 1, 2, 3, 4, 5, 6)        # Number of copies
cen <- c(0.1, 0.5, 1, 1.5, 2, 2.5)   # Values on which the groups ("grup") are centred
th <- 0.2                            # Threshold: everything below it is taken as 0 copies
# Load the helper functions from files chosen interactively
asignaClase.R <- file.choose(); source(asignaClase.R)
search.threshold.R <- file.choose(); source(search.threshold.R)
# <Main loop: normalizes, decides which partition to use and classifies the data>
suppressWarnings(
  for (i in 1:marc) {
    # Normalize each probe by the mean of its values above 0.5
    datosnorm[, i] <- datos[, i] / mean(datos[which(datos[, i] > 0.5), i])
    x <- datosnorm[, i]
    xx <- cut(x, vec)                      # Subgroups by range
    tt <- table(xx)                        # Counts per range (recomputed for every probe)
    poc <- which((1 <= tt) & (tt <= 10))   # Ranges with few data points
    much <- which(tt > 10)                 # Ranges with many data points
    xpm <- c(); b <- 0                     # Initialization
    for (j in seq_along(much)) {
      # Positions of x that fall in a range with many data points
      a <- which((x > vec[much[j]]) & (x <= vec[much[j] + 1]))
      xpm <- c(xpm, a)
    }
    p1 <- sort(xpm)                        # Positions 1, sorted
    xad <- x[p1]                           # High-density vector
    yy <- mixgroup(xad)                    # The mixture is fitted only to the high-density data
    grup <- cen[much]                      # Centres of the groups present, c(0,0.5,1,1.5,2,2.5)
    # (many values at zero are not a problem: that case was handled above in the if)
    num <- length(grup)
    par.ini <- mixparam(grup, rep(0.1, num))
    res <- mix(yy, par.ini)
    count <- i
    xmc[p1, i] <- asignaClase(res, xad)
    ## Sometimes a component with a larger standard deviation than another
    ## causes misclassifications; search.threshold() is used to fix this
    th2 <- search.threshold(res, xmc[p1, i])
    if (length(th2) >= 2) {
      for (j in 1:length(xad)) {
        for (k in 2:length(th2)) {
          if (xad[j] <= th2[1]) xmc[p1[j], i] <- 1
          if ((xad[j] >= th2[k - 1]) & (xad[j] <= th2[k])) xmc[p1[j], i] <- k
          if (xad[j] >= th2[k]) xmc[p1[j], i] <- k + 1
        }
      }
    }
    # Corrects the problem of, for example, assigning class 1
    # to values that lie above 0.5
    if (min(grup) > 0.5) { xmc[p1, i] <- xmc[p1, i] + which(cen == min(grup)) - 2 }
    ## 3) Low-density data: classification
    if (length(poc) != 0) {
      xpm <- c()
      nc <- c(1, 2, 3, 4)
      for (j in seq_along(poc)) {
        # Positions of x that fall in a range with few data points
        a <- which((x > vec[poc[j]]) & (x <= vec[poc[j] + 1]))
        # Assign (by hand) the corresponding copy number:
        # ranges c(0,0.5,1,1.5,2,2.5) map to c(0, 1, 2, 3, 4, 5)
        x[a] <- poc[j] - 1
        xpm <- c(xpm, a)
      }
      p2 <- sort(xpm)                      # Positions 2
      xmc[p2, i] <- x[p2]
    }
  }
)
2) k_meoides.R
# Program that classifies the individuals according to the information from
# the genetic markers and compares the result obtained between the
# integer data matrix XMC and the continuous data matrix,
# using the packages <<ade4>>, <<cluster>>, <<concord>>, <<maptree>> and
# <<graphics>> of the R program (freeware)
# Required files: XMC.dum
source("C:/Documents and Settings /XMC.dum")
xn<-datosnorm
ls()
## xn=datosnorm is the matrix of normalized continuous data
## pop are the true populations
## xmc are the integer data obtained in
## datos[,i] are the unnormalized data
## hx are the populations selected by pam with k=3
library(ade4)
library(cluster)
library(maptree) # Plots for hierarchical clustering
require(graphics)
library (concord) # Kappa test
## PAM analysis (k-medoids)
# Continuous data XN
dxnc = daisy(xn, metric = c("euclidean")) ## normalized data
hxc2= pam(dxnc, k=2)
hxc3= pam(dxnc, k=3)
hxc4= pam(dxnc, k=4)
t2
t3
# Kappa reliability coefficient, continuous vs. discrete data, pam k=3
concord<-matrix(c(hxe3$cluster,hxc3$cluster),nrow=2);
scores.to.counts(t(concord));
kppa3=cohen.kappa(t( concord),"score")
kppa1 ##Table hxe vs. pop
kppa2 ##Table hxc vs. pop
kppa3 ##Table hxe vs. hxc
dump(ls(),"C:/Documents and Settings/K_meoides.dum")
3) Correspondencias Multiples.R
# Program that performs Multiple Correspondence Analysis
# of the genetic markers and compares the result obtained between the
# integer data matrix XMC and the continuous data matrix,
# using the packages <<ade4>>, <<cluster>>, <<concord>>, <<maptree>>
# of the R program (freeware)
# Required files: K_meoides.dum
library(ade4)
library(cluster)
plot(rte.pop.3.c)
####################################################
hxc3= pam(dxnc, k=3)
###################################################
pobfc3<-factor(hxc3$clustering)
levels(pobfc3)
dd.3.c<-dudi.pca(xn,scannf=FALSE)
dn.3.c<-between(dd.3.c,pobfc3,scannf=FALSE)
# CONTINUOUS DATA, K=3, PAM
plot(dn.3.c)
# Rand test
rte.3.c<-randtest(dn.3.c,nrepet = 999)
plot(rte.3.c)
####################################################
hxc4= pam(dxnc, k=4)
###################################################
pobfc4<-factor(hxc4$clustering)
levels(pobfc4)
dd.4.c<-dudi.pca(xn,scannf=FALSE)
dn.4.c<-discrimin(dd.4.c,pobfc4,scannf=FALSE)
# CONTINUOUS DATA, K=4, PAM
plot(dn.4.c)
# Rand test
rte.4.c<-randtest(dn.4.c,nrepet = 999)
plot(rte.4.c)
####################################################
## B) Integer data XMC
dxne = daisy(xmc, metric = c("gower"))
hxe2= pam(dxne, k=2)
###################################################
pobfe2<-factor(hxe2$clustering)
levels(pobfe2)
xmc.f=c()
xmc.f=data.frame(apply(xmc,2,as.factor))
dd.2.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA, K=2, PAM
dn.2.e<-discrimin(dd.2.e, nf=9,pobfe2,scannf=FALSE)
plot(dn.2.e)
# Rand test
rte.2.e<-randtest(dn.2.e,nrepet = 999)
plot(rte.2.e)
####################################################
levels(pop)
xmc.f=c()
xmc.f=data.frame(apply(xmc,2,as.factor))
dd.pop.3.e<-dudi.acm(xmc.f,scannf=FALSE,nf=3)
# INTEGER DATA, K=3, POP
dn.pop.3.e<-discrimin(dd.pop.3.e, nf=3,pop,scannf=FALSE)
s.class(dn.pop.3.e$li, pop, xax = 1, axesell=FALSE,yax = 2, sub = "Scores and classes", csub = 2, clab = 1.5)
plot(dn.pop.3.e)
# Rand test
rte.pop.3.e<-randtest(dn.pop.3.e,nrepet = 999)
plot(rte.pop.3.e)
####################################################
hxe3= pam(dxne, k=3)
###################################################
pobfe3<-factor(hxe3$clustering)
levels(pobfe3)
xmc.f=data.frame(apply(xmc,2,as.factor))
dd.3.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA, K=3, PAM
dn.3.e<-discrimin(dd.3.e, pobfe3,scannf=FALSE)
plot(dn.3.e)
# Rand test
rte.3.e<-randtest(dn.3.e,nrepet = 999)
plot(rte.3.e)
####################################################
hxe4= pam(dxne, k=4)
###################################################
pobfe4<-factor(hxe4$clustering)
levels(pobfe4)
xn.f2=c()
xmc.f=data.frame(apply(xmc,2,as.factor))
dd.4.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA, K=4, PAM
dn.4.e<-discrimin(dd.4.e, pobfe4,scannf=FALSE)
plot(dn.4.e)
# Rand test
rte.4.e<-randtest(dn.4.e,nrepet = 999)
plot(rte.4.e)
####################################################
APPENDIX B. Paper related to the project: Latent Class Model to Assess
Association between Copy Number and Disease in Targeted Studies.
APPENDIX C. Paper describing the data preprocessing in detail:
Identification of copy number variants define genomic differences among major
human ethnic groups.
Genes and Disease Program, Center for Genomic Regulation, Barcelona, Spain
e-mail addresses:
JRG: jrgonzalez@creal.cat
IS: isubirana@imim.es
GE: georgia.escaramis@crg.es
SP: speraza@creal.cat
AC: acaceres@creal.cat
XE: xavier.estivill@crg.es
LA: lluis.armengol@crg.es
Abstract
Background: Copy number variations (CNVs) might play an important role by altering dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypical consequences. Therefore, determining whether a CNV is associated
or not with a given disease might be relevant in understanding genesis and progression
of human diseases. In this paper, we present a framework to assess association between
CNVs and disease in case-control studies. We extend the model to analyze discrete traits
and adjust for confounding covariates.
Results: Through simulation studies, we have shown that our method outperforms
other simple methods based on using pre-defined thresholds to define copy number status.
We illustrate the method using a real data example in a controlled MLPA experiment
showing good results.
Conclusions: We illustrate that our method is robust and achieves maximal theoretical
power since it accommodates the possible misclassification error when copy number status
is established. We have made the software freely available, and it will be included in the R
package MLPAstats.
Background
With the recent technological advances, different genome-wide studies have uncovered an
unprecedented number of structural variants in the human genome [1, 2, 3], mainly in
the form of copy number variations (CNVs). The important number of genes and other
regulatory elements encompassed by those variable regions, make CNVs very likely to
have functional and, ultimately, phenotypical consequences [4, 5]. In fact, recent studies
have correlated the number of copies of specific genes with different degrees of disease
predisposition [6, 7, 8], showing that the identification of DNA copy number is important
in understanding genesis and progression of human diseases.
Several techniques and platforms have been developed for genome-wide analysis of
DNA copy number, such as array-based comparative genomic hybridization (aCGH).
The goal of this approach is to identify contiguous DNA segments where copy number
changes are present. The ability of aCGH to discern between different numbers of copies is
limited, thus the use of different kinds of quantitative techniques is required for targeted
and more precise analysis of genomic regions. For known CNVs, real-time PCR assays
can be applied to study the copy number status of given loci in case and control groups.
Individuals are typically binned into copy number categories using pre-defined thresholds.
Currently, Multiplex Ligation-dependent Probe Amplification (MLPA) [9] has also been
used to quantify copy number classes. This method allows the analysis of several loci at
the same time in a single assay. MLPA is normally used to test differences in gains
and losses among test and control samples [10] but it can also be used in the context of
association studies in case-control or cohort settings [11, 12].
Statistical methods used in CNV-disease association studies are very simple. Quantitative methods give CNV measurements for each individual as a continuous variable.
Copy number status is then usually inferred using pre-defined thresholds, and differences in copy number distribution between cases and
controls are subsequently assessed using $\chi^2$, Fisher or Mann-Whitney tests [6, 13, 14]. However, the distribution of CNV measurements is continuous and multimodal, meaning that peak intensity
should be considered as a mixture of curves. On many occasions, these curves overlap with
different underlying distributions. Therefore, scoring copy number by binning and then
assessing the association may lead to misclassification and hence to false findings.
To overcome this difficulty, we propose a latent class (LC) model to assess association
between CNVs and disease which incorporates possible misclassification in scoring copy
number status. After inferring copy numbers using Gaussian finite mixture distributions,
the model assesses the relationship among the trait and a CNV with a mixture of generalized linear models. Association is then assessed using a likelihood ratio procedure.
We validate and compare our method with the existing methods through a simulation
study. We then illustrate how to test association between two CNVs in a case-control
study using a real data set.
Methods
Inference of copy number status
Let us assume that we observe I individuals from a given population, that consists of
C mutually exclusive latent classes c = 1, . . . , C (e.g. copy number status). Instead
of observing these classes, we observe a surrogate variable, X, that corresponds to a
continuous variable arising from any quantitative method. For instance, in targeted
studies using MLPA or real-time PCR, X corresponds to peak intensities for each CNV.
In the context of a whole genome scan, one may have quantitative data from Illumina or
Affymetrix array, where for each probe, the variable X corresponds to a ratio of intensities.
Figure 1 shows possible patterns that peak intensities may have. Some variants clearly
show different underlying copy number status with multimodal peak intensities (CNV2,
CNV4 and CNV6). In other cases, where the existence of different copy numbers is not
clear, inferring copy number by binning the data may be difficult or unfeasible.
For each CNV variant, we are interested in classifying the individuals into the C classes
using the surrogate variable X. We propose to model the unobserved latent classes using
a finite mixture model with C components of the form
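\[
f(x \mid \theta) = \sum_{c=1}^{C} \pi_c \, N(x \mid \mu_c, \sigma_c^2), \tag{1}
\]
where $N(\cdot \mid \mu_c, \sigma_c^2)$ is the Gaussian distribution with $\theta$ denoting all model parameters
(e.g., $\theta = (\mu_c, \sigma_c^2)$, $c = 1, \ldots, C$), and $x$ is the surrogate variable that corresponds to the
quantitative measure of the copy number status. For the component weights $\pi_c$ it holds that
\[
\sum_{c=1}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \geq 0, \quad c = 1, \ldots, C.
\]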
The value of $C$ to be used is chosen by applying the Bayesian Information Criterion (BIC)
[15]. It should be pointed out that on some occasions, especially when there are individuals
with 0 copies, the intensity distributions (see CNV2 and CNV4 in Figure 1) are very close
to 0. In this situation, the estimation of the parameters involved in (1) tends to fail
since the intensities of individuals with 0 copies are not normally distributed.
In these situations we propose to fit the following mixture model to determine the latent
classes
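\[
f(x \mid \theta) = \pi_1 \, I_{\{x \leq \epsilon\}} + \sum_{c=2}^{C} \pi_c \, N(x \mid \mu_c, \sigma_c^2) \, I_{\{x > \epsilon\}}, \tag{2}
\]
where $\epsilon$ is given by the user, $\pi_1 = \frac{1}{I} \sum_{i=1}^{I} I_{\{x_i \leq \epsilon\}}$,
$\pi_1 + \sum_{c=2}^{C} \pi_c = 1$ and $\pi_c \geq 0$, $c = 2, \ldots, C$.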
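As a rough illustration of model (2), the sketch below estimates the weight of the zero-copy class as the share of intensities at or below a user-chosen cutoff and evaluates the mixture on either side of it. It is written in Python and is not the authors' R code; the cutoff name `eps`, the function names and the toy intensities are assumptions made for this example.

```python
import math

def zero_class_weight(xs, eps):
    """Share of individuals whose intensity is at or below the cutoff eps."""
    return sum(1 for x in xs if x <= eps) / len(xs)

def mixture_density(x, eps, pi1, pis, mus, sigmas):
    """Model (2): constant weight pi1 below eps, Gaussian classes above it.

    pis, mus and sigmas describe classes 2..C only (hypothetical values).
    """
    if x <= eps:
        return pi1
    return sum(
        p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(pis, mus, sigmas)
    )

# Made-up intensities: three individuals sit essentially at zero.
xs = [0.0, 0.01, 0.02, 0.48, 0.55, 0.97, 1.02, 1.05]
pi1 = zero_class_weight(xs, eps=0.05)
```

The remaining weights must then satisfy the constraint that they add up to one minus `pi1`.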
The posterior probabilities are used to segment data by assigning each individual to
a given copy number status that will correspond to the class with maximum posterior
probability (MAP). After fitting this finite mixture model, we can perform a goodness-of-fit test using a $\chi^2$ test statistic. Finite mixture parameters can be estimated using the
EM algorithm [16, 17] or Newton-type procedures [17]. Then, the posterior probability
that the individual $i$ with an observed value $x$ belongs to copy number class $j$ is given by
\[
w_{ij} = P(j \mid x, \theta) = \frac{\pi_j \, N(x \mid \mu_j, \sigma_j^2)}{\sum_{c} \pi_c \, N(x \mid \mu_c, \sigma_c^2)}. \tag{3}
\]
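The posterior weights of equation (3) and the MAP assignment can be sketched in a few lines. This is a Python illustration, not the authors' implementation; the two-class parameter values are hypothetical and only mimic a 2-copy versus 3-copy situation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_weights(x, pis, mus, sigmas):
    """Posterior probability of each class for one intensity x, equation (3)."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

def map_class(weights):
    """1-based index of the class with maximum posterior probability."""
    return max(range(len(weights)), key=lambda c: weights[c]) + 1

# Hypothetical two-class mixture, not fitted to any real data.
pis, mus, sigmas = [0.8, 0.2], [1.0, 1.5], [0.15, 0.15]
w = posterior_weights(1.2, pis, mus, sigmas)
```

Keeping the full vector `w` rather than only `map_class(w)` is what later allows the classification uncertainty to be carried into the association model.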
\[
P(y_i \mid C_i = c, \beta) = \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}} \tag{4}
\]
Now, we consider that copy number status is measured with error (i.e., the latent class
is not known). Therefore, we model the probability of being a case as a mixture of
$C$ binomial variables in the following way
\[
P(y_i \mid \beta) = \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta),
\]
where $w_{ic}$ is the posterior probability that the individual $i$ belongs to copy number class
$c$ given in (3). Therefore, assuming conditional independence of case-control status given
latent class, the likelihood function for model parameters can be written as
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}. \tag{5}
\]
It is straightforward to see that we can compute the odds ratio (OR) of belonging to class
$c$ with respect to a given reference $r$ as
\[
OR_{c/r} = e^{\beta_c - \beta_r}. \tag{6}
\]
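A minimal sketch of likelihood (5) and the odds ratio (6) for a binary trait follows. It is a Python illustration under the assumption that the posterior weights are already available as a list of rows `W[i][c]`; all names and numbers are hypothetical.

```python
import math

def p_binary(y, beta_c):
    """P(y | C = c, beta) = exp(y * beta_c) / (1 + exp(beta_c)), y in {0, 1}."""
    return math.exp(y * beta_c) / (1.0 + math.exp(beta_c))

def log_likelihood(ys, W, betas):
    """Logarithm of likelihood (5); W[i][c] is the posterior weight w_ic."""
    return sum(
        math.log(sum(w * p_binary(y, b) for w, b in zip(w_i, betas)))
        for y, w_i in zip(ys, W)
    )

def odds_ratio(betas, c, r):
    """OR of class c versus reference class r, equation (6), 0-based indices."""
    return math.exp(betas[c] - betas[r])

# Toy data: four individuals, two latent classes.
ys = [1, 0, 1, 0]
W = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]
betas = [0.0, math.log(2.0)]
ll = log_likelihood(ys, W, betas)
```

With one-hot weights the expression reduces to an ordinary logistic likelihood with one intercept per copy number class.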
Quantitative traits
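We now consider the case where our phenotype, $Y$, is continuous. We assume that
$Y \mid c \sim N(\beta_c, \sigma^2)$. In this case, conditionally on cluster $c$ we have that
\[
P(y_i \mid C_i = c, \beta, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}}, \tag{7}
\]
where $\eta_{ic} \equiv \beta_c$.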
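And, similarly to the case of discrete traits, the likelihood function for model parameters
is given by
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta, \sigma) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}. \tag{8}
\]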
In this case we are interested in evaluating the difference between the mean effect of individuals with $c$ copies and $r$ copies. This can simply be computed as
\[
\bar{y}_{c/r} = \beta_c - \beta_r.
\]
Model with covariates
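On some occasions researchers are interested in assessing the effect of CNVs adjusted for
other covariates, $Z_1, \ldots, Z_K$ (normally called confounding variables). In this case, the
likelihood function can be written as
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, Z, \beta_c, \gamma),
\]
where, for discrete traits,
\[
P(y_i \mid C_i = c, Z, \beta_c, \gamma) = \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}}, \tag{9}
\]
for quantitative traits,
\[
P(y_i \mid C_i = c, Z, \beta_c, \gamma, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}}, \tag{10}
\]
and
\[
\eta_{ic} = \beta_c + \gamma_1 Z_{i1} + \ldots + \gamma_K Z_{iK}. \tag{11}
\]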
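The role of the linear predictor (11) in both trait types can be sketched as follows. This is a Python illustration rather than the authors' code; the symbol used here for the covariate coefficients (`gammas`) and every function name are assumptions made for the example.

```python
import math

def linear_predictor(beta_c, gammas, z_i):
    """eta_ic = beta_c + gamma_1 * z_i1 + ... + gamma_K * z_iK, equation (11)."""
    return beta_c + sum(g * z for g, z in zip(gammas, z_i))

def p_binary(y, eta):
    """Equation (9) for a binary trait y in {0, 1}."""
    return math.exp(y * eta) / (1.0 + math.exp(eta))

def p_quantitative(y, eta, sigma):
    """Equation (10) for a quantitative trait with shared standard deviation."""
    return math.exp(-((y - eta) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def individual_likelihood(y, w_i, betas, gammas, z_i, sigma=None):
    """One factor of the likelihood: sum over classes of w_ic * P(y_i | C_i = c, Z).

    A binary trait is assumed when sigma is None, a quantitative one otherwise.
    """
    total = 0.0
    for w, b in zip(w_i, betas):
        eta = linear_predictor(b, gammas, z_i)
        total += w * (p_binary(y, eta) if sigma is None else p_quantitative(y, eta, sigma))
    return total
```

The covariate part of the predictor is shared across classes, so only the class-specific term changes inside the sum.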
Parameter estimation
In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity let $\Theta \equiv (\beta, \gamma, \sigma)$ (notice
that for discrete traits $\sigma = 1$). We consider that the $w_{ic}$ are known and that they are given
by the surrogate variable $X$ from equation (3). Therefore, they can be plugged into the
log-likelihood resulting in
\[
\log P(\mathbf{Y} \mid C_i = c, Z, \Theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, Z, \Theta). \tag{12}
\]
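Here $P(y_i \mid C_i = c, Z, \Theta)$ is given by equations (9) and (10) for discrete and quantitative
traits, respectively. The maximum likelihood estimators (MLE) of the model parameters
maximize this log-likelihood function. We propose to use a Newton-Raphson procedure
to find parameter estimates. The $k$-th component of the score, $S$, is given by
\[
S_k(y \mid C, \Theta) = \frac{\partial \log P(\mathbf{Y} \mid \Theta)}{\partial \Theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_k}}{\sum_{c=1}^{C} h_{ic}},
\]
and the $(k, k')$-th element of the Hessian, $H$, is
\[
H_{kk'}(\Theta) = \frac{\partial^2 \log P(\mathbf{Y} \mid \Theta)}{\partial \Theta_k \, \partial \Theta_{k'}} = \sum_{i=1}^{I} \frac{\left( \sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \Theta_k \partial \Theta_{k'}} \right) \left( \sum_{c=1}^{C} h_{ic} \right) - \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_k} \right) \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_{k'}} \right)}{\left( \sum_{c=1}^{C} h_{ic} \right)^2},
\]
where
\[
h_{ic} \equiv w_{ic} \, P(y_i \mid C_i = c, Z, \Theta).
\]
Formulas for the derivatives of $h_{ic}$ for covariates and for discrete and quantitative traits
are given in the Appendix.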
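To make the procedure concrete, here is a minimal Newton-Raphson sketch for the simplest case only: a binary trait, no covariates, and posterior weights treated as known constants. It is a Python illustration, not the authors' R functions; the small linear solver and all names are assumptions introduced for the example. The score and Hessian follow the quotient forms above, with the class-specific derivatives of $h_{ic}$ taken from the Appendix.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def score_and_hessian(ys, W, betas):
    """Score vector and Hessian of the log-likelihood with fixed weights W[i][c]."""
    C = len(betas)
    S = [0.0] * C
    H = [[0.0] * C for _ in range(C)]
    p = [1.0 / (1.0 + math.exp(-b)) for b in betas]
    for y, w_i in zip(ys, W):
        h = [w * math.exp(y * b) / (1.0 + math.exp(b)) for w, b in zip(w_i, betas)]
        # First and second derivatives of h_ic with respect to its own beta_c.
        dh = [h[c] * (y - p[c]) for c in range(C)]
        d2h = [dh[c] * (y - p[c]) - h[c] * (p[c] - p[c] ** 2) for c in range(C)]
        tot = sum(h)
        for k in range(C):
            S[k] += dh[k] / tot
            for l in range(C):
                second = d2h[k] if k == l else 0.0
                H[k][l] += (second * tot - dh[k] * dh[l]) / tot ** 2
    return S, H

def newton_raphson(ys, W, betas, tol=1e-10, max_iter=50):
    """Iterate beta <- beta - H^{-1} S until the step is smaller than tol."""
    for _ in range(max_iter):
        S, H = score_and_hessian(ys, W, betas)
        step = solve(H, S)
        betas = [b - s for b, s in zip(betas, step)]
        if max(abs(s) for s in step) < tol:
            break
    return betas
```

When every individual is assigned to one class with weight one, the estimates reduce to the log-odds of being a case within each class, which gives a simple way to check the implementation.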
MLE can be used to estimate the OR, under the multiplicative model, between individuals with $c$ copy number status with respect to a reference category (e.g., individuals
with $r$ copy number status) as
\[
\widehat{OR}_{c/r} = e^{\hat{\beta}_c - \hat{\beta}_r}. \tag{13}
\]
Similarly, when analyzing continuous traits, the estimated mean effect between individuals with $c$ copies and $r$ copies is
\[
\hat{\bar{y}}_{c/r} = \hat{\beta}_c - \hat{\beta}_r. \tag{14}
\]
\[
\mathrm{Var}(\hat{\Theta}) \tag{15}
\]
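Therefore, we can compute a 95% confidence interval (CI95%) for $OR_{c/r}$ using the expression
\[
CI_{1-\alpha}(OR_{c/r}) = \exp\left\{ (\hat{\beta}_c - \hat{\beta}_r) \pm z_{\alpha/2} \sqrt{\mathrm{Var}(\hat{\Theta})_{[c,c]} + \mathrm{Var}(\hat{\Theta})_{[r,r]} - 2\,\mathrm{Var}(\hat{\Theta})_{[c,r]}} \right\}, \tag{16}
\]
and, for the mean difference of a continuous trait,
\[
CI_{1-\alpha}(\bar{y}_{c/r}) = (\hat{\beta}_c - \hat{\beta}_r) \pm z_{\alpha/2} \sqrt{\mathrm{Var}(\hat{\Theta})_{[c,c]} + \mathrm{Var}(\hat{\Theta})_{[r,r]} - 2\,\mathrm{Var}(\hat{\Theta})_{[c,r]}}, \tag{17}
\]
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$-th quantile of the standard normal distribution, $\alpha$ is the
desired type-I error, and subindex $[\cdot, \cdot]$ denotes the position in the inverse of Fisher's
information matrix.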
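The interval for the odds ratio can be computed directly from the two estimates and the corresponding entries of their covariance matrix. The sketch below is a Python illustration; the function name and the example covariance matrix are hypothetical, and the covariance matrix is assumed to have been obtained elsewhere (for example from the inverse of the information matrix).

```python
import math
from statistics import NormalDist

def or_confidence_interval(beta_c, beta_r, cov, c, r, alpha=0.05):
    """(1 - alpha) confidence interval for OR_{c/r}.

    cov is the covariance matrix of the estimates; c and r are 0-based
    positions of the two classes in that matrix.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(cov[c][c] + cov[r][r] - 2.0 * cov[c][r])
    diff = beta_c - beta_r
    return math.exp(diff - z * se), math.exp(diff + z * se)

# Hypothetical estimates: log-OR of log(2) with uncorrelated standard errors.
cov = [[0.04, 0.0], [0.0, 0.05]]
lo, hi = or_confidence_interval(math.log(2.0), 0.0, cov, 0, 1)
```

Because the interval is built on the log scale and then exponentiated, it is symmetric around the estimated OR only in a multiplicative sense.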
Hypothesis testing
We propose to use a likelihood ratio test to assess disease association, by taking as reference the model without copy number variable. Twice the increase in the log-likelihood
provides the asymptotic $\chi^2$ statistic that tests $H_0: \beta_1 = \beta_2 = \ldots = \beta_C$. On many
occasions, we are interested in studying the trend on copy number status (e.g., additive
model). This can be done by generalizing equation (11) in the form
\[
\eta_{ic} = \sum_{m=1}^{M} D_{icm} \, \beta_{cm}. \tag{18}
\]
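For the additive model, the design matrix has one row $(1, c - 1)$ per copy number class,
\[
D = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & C - 1 \end{pmatrix},
\]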
and the trend hypothesis on copy number status is tested using a likelihood ratio test
comparing this model with the null model. Notice that this formulation allows us to
accommodate different or common effects for each latent class. In this case, parameter
estimates are obtained as previously illustrated. Formulas for the derivatives involved
in the score and Hessian, where coefficients are not shared by each latent class, can be
found in the Appendix. R functions for the methods discussed in this paper
are freely available at http://www.creal.cat/jrgonzalez/software.htm and they will be
included in the R package MLPAstats.
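The likelihood ratio test itself needs only the two maximized log-likelihoods and the number of extra parameters. The following Python sketch is an illustration under stated assumptions: the degrees of freedom are supplied by the caller (for the test of equal class effects this would be the number of classes minus one, and one for the trend test), and the chi-square tail probability is computed with the standard recurrence for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-x / 2.0), 2              # closed form for df = 2
    while k < df:
        # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return the statistic 2 * (l_alt - l_null) and its asymptotic p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods for a null model and a two-extra-parameter model.
stat, pval = likelihood_ratio_test(-100.0, -97.0, 2)
```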
Results
Simulation study
We performed computer simulation studies to examine empirically the properties of the
parameter estimators developed in the previous sections. The specific goals of these
studies were: (i) to evaluate the performance of the proposed likelihood ratio trend
test based on the latent class model for different CNV measurement distributions; (ii) to
examine the effect of sample size ($I$) on the distributional properties of the estimators;
(iii) to examine the bias and mean square error (MSE) of the estimators; (iv) to validate
whether the variances of the parameter estimates obtained using the observed information matrix
are correct. Simulations were performed as follows. To study (i) we simulated a binary
trait using 300 cases and 300 controls. The unobserved copy number status (i.e., the latent
classes) was simulated with 3 different copy number statuses ($C = 3$) and a
proportion of individuals in each category set equal to $\pi = (0.5, 0.4, 0.1)$. The trend OR
was set equal to 1.5. The observed ratio intensities ($X$ variable) were simulated as a finite
mixture of $C$ normal distributions using different means, $\mu$, and variances, $\sigma^2$, to assess
whether the separation of clusters and their variance affect the power.
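A data set of this kind can be generated along the following lines. This Python sketch is only an approximation of the design described above: it samples individuals prospectively instead of fixing 300 cases and 300 controls, and the baseline log-odds, the component means and the standard deviations are assumed values chosen for the example.

```python
import math
import random

def simulate(n, pis, mus, sigmas, trend_or, baseline_logit=0.0, seed=1):
    """Draw (latent class, binary trait, intensity) triples.

    The log-odds of disease increase by log(trend_or) per class step, and the
    intensity is drawn from the normal component of the latent class.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = rng.choices(range(len(pis)), weights=pis)[0]   # latent class, 0-based
        eta = baseline_logit + math.log(trend_or) * c
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
        x = rng.gauss(mus[c], sigmas[c])
        data.append((c, y, x))
    return data

# Class proportions and trend OR from the text; means and sds are assumptions.
data = simulate(600, (0.5, 0.4, 0.1), (1.0, 1.5, 2.0), (0.15, 0.15, 0.15), 1.5)
```

In an analysis only `y` and `x` would be treated as observed; the latent class is kept here so that classification errors can be counted.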
To study (ii)-(iv) we simulated binary and quantitative traits. For the binary case,
simulation was performed as above but simulating different scenarios varying the sample
size ($I$), OR and proportion of individuals for each copy number status, $\pi$. Again we
also simulated different CNV distributions varying $\mu$ and $\sigma^2$. For quantitative traits, we
used the same simulation procedure but copy number status was simulated depending
on a fixed mean level of the trait for the copy number status considered as reference and
a desired mean difference with respect to the other categories of copy number status.
Next, we describe the settings for the different simulation parameters. Sample size: We
chose the values $I \in \{50, 300\}$. Although current studies are analyzing thousands of
individuals, these values were chosen to evaluate the performance of our proposed method
in moderately large samples. Copy number status: in this case, as we were interested
in evaluating the performance of parameter estimates, we only simulated two different
copy number statuses, $C = \{1, 2\}$. Odds ratio: To examine the impact of association
between the disease and CNV, we chose two values for OR, $OR \in \{1.3, 2\}$, in order to
consider a moderate association and a strong one. Proportion of cases with normal copy
number status: To evaluate the impact of classes with different numbers of individuals
we set $\pi \in \{(0.8, 0.2), (0.5, 0.5)\}$. Finite mixture: To assess the impact of the distribution of
the intensity ratio, $X$, we simulated two normal distributions with the following parameters:
$\mu = \{1, 1.5\}$, which correspond to having 2 (considered as normal copy number status) and
3 copies, respectively, and $\sigma \in \{(0.15, 0.15), (0.15, 0.2), (0.2, 0.2)\}$. In this case, these
scenarios also helped us to model different situations regarding misclassification or how
latent classes were separated. Supplementary Figure XX shows the different distributions
of quantitative CNV that were simulated.
We compared three different approaches. The first one (NAIVE) was based on assessing
association between disease and copy number status obtained using MAP from the finite
mixture model (2). That is, association was assessed using a $\chi^2$ test from Table 1. The
second one is what biologists normally do when analyzing this kind of data and is based on
assigning CNV status using pre-defined thresholds (THRES). Association is then assessed
using a $\chi^2$ test. As mentioned previously, we simulated data from a mixture of two normals
with means 1 and 1.5. This is equivalent to simulating individuals with 2 and 3 copies,
respectively. In this situation, it is considered that individuals with intensity (or ratio intensity) larger than 1.33 correspond to individuals with 3 copies [10]. The third method
is the one proposed in this paper, based on latent class (LC).
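The difference between the two hard-assignment rules can be seen on a single intensity value. In the Python sketch below the 1.33 cutoff comes from the text, while the mixture parameters are one of the simulated scenarios picked as an assumption for the example; the function names are hypothetical.

```python
import math

def classify_threshold(x, cutoff=1.33):
    """THRES rule: 3 copies if the intensity ratio exceeds the cutoff, else 2."""
    return 3 if x > cutoff else 2

def classify_map(x, pis=(0.8, 0.2), mus=(1.0, 1.5), sigmas=(0.15, 0.2)):
    """NAIVE rule: copy number of the class with the largest pi_c * N(x | mu_c, sigma_c^2).

    The common factor 1 / sqrt(2 * pi) is dropped because it does not change
    which class attains the maximum.
    """
    dens = [
        p * math.exp(-0.5 * ((x - m) / s) ** 2) / s
        for p, m, s in zip(pis, mus, sigmas)
    ]
    return 2 if dens[0] >= dens[1] else 3
```

For an intensity such as 1.32 the two rules disagree under these parameters, which is exactly the kind of borderline observation that the LC approach handles through its posterior weights instead of a hard call.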
Simulation results for evaluating the performance of the likelihood ratio trend test of our
proposed model are shown in Figure 2. Top figures represent the power for all methods
analyzed for two scenarios (other scenarios are given in Supplementary Figures 1 and
2). The left panel shows the power for each method, varying the CNV measurement
distribution with regard to the mean of each latent class, $\mu$, while the right panel gives the same
information but with fixed means and varying variances, $\sigma^2$. The figure also depicts the
distribution of CNV for some scenarios. We observe that our proposed latent class model
performs better in all cases, even when the mixture components of CNVs are not very well separated.
Simulation results to evaluate parameter estimates for discrete traits are presented in
Tables 2, 3 and Figure 3 (Supplementary Figure 3 shows the results for $I = 50$). Similar
results and conclusions are obtained when a quantitative trait is analyzed. Table 2,
Figure 3 and Supplementary Figure 3 summarize the OR comparing individuals with 3
copies with respect to those individuals with 2 copies (reference category) and give the
MSE for two different sample sizes, $I$, two different proportions of individuals having 2
copies, $\pi$, and two different variances for each component of the mixture, $\sigma$. Table 3
compares different methods to compute the standard error of ORs for the different scenarios
previously described. The results compare the asymptotic variance based on the observed information matrix (ASYM) with the empirical variance (EMP). Table 3 also shows
coverage and power of confidence intervals based on the three methods analyzed.
As expected, when the sample size increased, the performance of the estimators of
the finite-dimensional parameters improved (Table 2). In all cases, the LC method performs
better than the other ones. LC is less biased than NAIVE and THRES in all cases, also
showing better MSE. Figure 3 also confirms the better performance of the LC method in
estimating the empirical OR distribution. In particular, the distribution of the LC method is the
closest to the simulated data (Figure 3 and Supplementary Figure 3).
Regarding variance estimates, estimation based on ASYM showed good performance
in all scenarios (Table 3). Despite slightly overestimating the empirical variance (EMP),
the bias was less pronounced for $I = 300$, as expected. Confidence intervals based on the LC
method outperform those obtained by other methods with regard to power.
Table 5 shows the ORs and their 95%CI for the two genes analyzed. The first three
columns show the results considered as the gold standard, since CN status was determined at the laboratory using the PCR technique, while the other columns show the results
obtained after estimating the CN status using our proposed finite mixture model and
computing the ORs using a naive approach (i.e., considering that there is no misclassification) and the LC model that accounts for misclassification. As we can see, the results
are the same for gene 1, since no misclassification is observed (see Figure 4 and Table
4). However, for gene 2, CN status is not determined as easily as in the case of gene 1. This is why we observe different results regarding OR estimation and, more
interestingly, in the P value of association. For instance, the order of magnitude of the
association between the disease and gene 2 is better captured with the LC model than
with the NAIVE approach. As for OR estimates, the analysis using the true CN status shows
that individuals with one copy of gene 2 have a 46% decrease in the risk of having the disease
with respect to those individuals with 0 copies. As the 95%CI shows, this difference is
statistically significant. We reach the same conclusion when we compare individuals
with two copies with subjects with zero copies. Notice that in both cases
we observe that the naive approach underestimates the OR, as the simulation study
showed.
Conclusion
In this paper we have shown that the assessment of association between CNVs and disease
using analysis methods that do not take into account the misclassification of copy number
status (threshold and naive methods) underestimates both p-values and parameter estimates. This is contrary to the need to increase statistical power, which is reduced by the
multiple comparison correction for the simultaneous testing of several loci. False positives
are typically controlled by a dramatic reduction of the nominal p-value, and therefore
very low values are required to reach statistical significance. A precise computation of
these values is essential in genetic association studies.
Here we have proposed a latent class model (LC) that accommodates both the uncertainty of assessing CNV status and possible confounding factors. The parameter estimation procedure allows the estimation of their confidence intervals. The LC model was
remarkably consistent with simulated data. In particular, we found that the p-values obtained with the LC model were closer to the expected values than those obtained
with the threshold and naive methods.
CNVs are assayed quantitatively by a broad range of methods [22] such as array CGH
or Illumina and Affymetrix platforms. These technologies have the ability to identify
thousands of CNVs simultaneously, which makes the analysis of such data computationally demanding. We have found that the LC method is very efficient and, therefore, can
serve this purpose. Specifically, we have observed that the Newton-Raphson optimization
converges in only a few steps (usually 4-6). It is important to note that the finite Gaussian
mixture model, incorporated in the LC model, assumes that cases and controls are comparable. While this is true for MLPA data, for other technologies such as CGH, Illumina
and Affymetrix differential errors can be present between groups of subjects due to DNA
quality or handling. Therefore it would be necessary to assess CNV status for cases and
controls separately before the association analysis based on those platforms.
We maximize the likelihood function assuming fixed weights for each copy number
status that account for possible misclassification. The main advantage of considering
weights as known constants is that the Newton-Raphson procedure is much simpler and faster,
and the Hessian matrix can feasibly be obtained analytically. We therefore assume that copy
number status is independent of being case or control. This assumption was validated in
our simulation studies, where we confirmed that the proposed model captures very well
the nature of the synthetic data and variance estimates. Interestingly, we observed that
the variance estimates using MLE were also reproduced when a bootstrap procedure was
used (see Supplementary Table 1). In the interest of generalization, one can consider
maximizing the likelihood function for both model parameters and weights. If that is
the case, an EM algorithm should be used instead. However, one should bear in mind
that EM does not allow the estimation of the variance of the model parameters and is
computationally expensive, which is a challenge if this method is used in whole genome
scan settings.
In conclusion, we have shown how the LC model can accommodate bi-allelic and
multi-allelic CNVs as well as quantitative traits. We have also presented how it can
incorporate confounding variables. This is of particular importance in complex disease
studies where environmental factors need to be taken into account. The formulation can
also be generalized to assess survival times or counts in longitudinal studies.
Authors' contributions
JRG and IS developed the new statistical methods. JRG wrote the R functions and
the main text of the manuscript and performed the simulation studies. GE and AC
proposed abundant suggestions for developing the models. SP worked on the Gaussian
mixture approach to model quantitative CNV measurements. XE reviewed the paper and
revised its framework. LA and JRG proposed the need for a statistical tool to measure the
biological differences in allele distribution in cohorts of cases and controls, and conceived
the study. All authors have read and approved the final manuscript.
Acknowledgments
The first author wants to thank Xavier Bassagaña for his comments and helpful conversations
about the model proposed. This work has been supported by the Spanish Ministry of
Science and Innovation [MTM2008-02457 to JRG and SAF2008-00357 to XE]; and the
European Commission [AnEUploidy project; FP6-2005-LifeSciHealth contract #037627].
Appendix
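To obtain parameter estimates, we have to maximize the log-likelihood function
\[
\log P(\mathbf{Y} \mid C_i = c, Z, \Theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, Z, \Theta),
\]
where $P(y_i \mid C_i = c, Z, \Theta)$ is given by equations (9) and (10) for discrete and quantitative
traits, respectively. As we have previously mentioned, the $k$-th component of the score,
$S$, is given by
\[
S_k(y \mid C, \Theta) = \frac{\partial \log P(\mathbf{Y} \mid \Theta)}{\partial \Theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_k}}{\sum_{c=1}^{C} h_{ic}}.
\]
The $(k, k')$-th element of the Hessian, $H$, is
\[
H_{kk'}(\Theta) = \frac{\partial^2 \log P(\mathbf{Y} \mid \Theta)}{\partial \Theta_k \, \partial \Theta_{k'}} = \sum_{i=1}^{I} \frac{\left( \sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \Theta_k \partial \Theta_{k'}} \right) \left( \sum_{c=1}^{C} h_{ic} \right) - \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_k} \right) \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \Theta_{k'}} \right)}{\left( \sum_{c=1}^{C} h_{ic} \right)^2},
\]
where $h_{ic} \equiv w_{ic} \, P(y_i \mid C_i = c, Z, \Theta)$.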
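Binary Traits
Binary Traits without covariates
In this case, the $h_{ic}$ function takes the form
\[
h_{ic} = w_{ic} \, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k} = \frac{w_{ic} I_{\{k=c\}} \, y_i e^{y_i \beta_k} (1 + e^{\beta_k}) - w_{ic} I_{\{k=c\}} \, e^{y_i \beta_k} e^{\beta_k}}{(1 + e^{\beta_k})^2} = I_{\{k=c\}} \, h_{ic} (y_i - p_{ic}),
\]
where
\[
p_{ic} = \frac{1}{1 + e^{-\beta_c}}.
\]
And,
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = 0 \text{ for } k \neq k'.
\]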
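Closed-form derivatives such as these are easy to get wrong, so a finite-difference comparison is a useful sanity check. The Python sketch below is an illustration with hypothetical names and values; it compares the two analytic derivatives for the case $k = c$ with central differences.

```python
import math

def h(y, w, beta):
    """h_ic for a binary trait without covariates."""
    return w * math.exp(y * beta) / (1.0 + math.exp(beta))

def dh(y, w, beta):
    """Analytic first derivative: h_ic * (y_i - p_ic)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return h(y, w, beta) * (y - p)

def d2h(y, w, beta):
    """Analytic second derivative: dh * (y_i - p_ic) - h_ic * (p_ic - p_ic^2)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return dh(y, w, beta) * (y - p) - h(y, w, beta) * (p - p * p)

def central_difference(f, x, step=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + step) - f(x - step)) / (2.0 * step)
```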
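Binary Traits with covariates
In this case, the $h_{ic}$ function takes the form
\[
h_{ic} = w_{ic} \, \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}}, \quad \text{where } \eta_{ic} = \beta_c + \sum_{p=1}^{K} \gamma_p z_{ip}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k} = \frac{w_{ic} I_{\{k=c\}} \, y_i e^{y_i \eta_{ic}} (1 + e^{\eta_{ic}}) - w_{ic} I_{\{k=c\}} \, e^{y_i \eta_{ic}} e^{\eta_{ic}}}{(1 + e^{\eta_{ic}})^2} = I_{\{k=c\}} \, h_{ic} (y_i - p_{ic}),
\]
where
\[
p_{ic} = \frac{1}{1 + e^{-\eta_{ic}}}.
\]
And
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = 0 \text{ for } k \neq k'.
\]
For covariates, we have that
\[
\frac{\partial h_{ic}}{\partial \gamma_p} = z_{ip} \, h_{ic} (y_i - p_{ic}),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2} = z_{ip} \frac{\partial h_{ic}}{\partial \gamma_p} (y_i - p_{ic}) - z_{ip}^2 \, h_{ic} (p_{ic} - p_{ic}^2),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p \, \partial \gamma_{p'}} = z_{ip} \frac{\partial h_{ic}}{\partial \gamma_{p'}} (y_i - p_{ic}) - z_{ip} z_{ip'} \, h_{ic} (p_{ic} - p_{ic}^2).
\]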
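Quantitative traits
Quantitative traits without covariates and shared variance
In this case, the $h_{ic}$ function takes the form
\[
h_{ic} = w_{ic} \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k} = I_{\{k=c\}} \, w_{ic} \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \beta_k)^2}{2\sigma^2}} \, \frac{y_i - \beta_k}{\sigma^2} = I_{\{k=c\}} \, h_{ic} \, \frac{y_i - \beta_k}{\sigma^2},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \beta_k) - h_{ic} \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = 0 \text{ for } k \neq k',
\]
\[
\frac{\partial h_{ic}}{\partial \sigma} = w_{ic} \left[ -\frac{1}{\sqrt{2\pi}\,\sigma^2} \, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} + \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} \, \frac{(y_i - \beta_c)^2}{\sigma^3} \right] = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \beta_c)^2,
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma - h_{ic}}{\sigma^2} + (y_i - \beta_c)^2 \, \frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \sigma} = I_{\{k=c\}} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \beta_k).
\]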
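The derivatives with respect to the shared standard deviation can be checked numerically in the same spirit. This Python sketch uses hypothetical names and values and compares the first and second derivatives in $\sigma$ for the quantitative case without covariates against central differences.

```python
import math

def h(y, w, beta, sigma):
    """h_ic for a quantitative trait without covariates."""
    return w * math.exp(-((y - beta) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def dh_dsigma(y, w, beta, sigma):
    """Analytic first derivative: -h/sigma + h * (y - beta)^2 / sigma^3."""
    hv = h(y, w, beta, sigma)
    return -hv / sigma + hv * (y - beta) ** 2 / sigma ** 3

def d2h_dsigma2(y, w, beta, sigma):
    """Analytic second derivative, written in terms of h and dh/dsigma."""
    hv = h(y, w, beta, sigma)
    d = dh_dsigma(y, w, beta, sigma)
    return (-(d * sigma - hv) / sigma ** 2
            + (y - beta) ** 2 * (d * sigma ** 3 - 3.0 * hv * sigma ** 2) / sigma ** 6)

def central_difference(f, x, step=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + step) - f(x - step)) / (2.0 * step)
```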
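Quantitative traits with covariates and shared variance
In this case, the $h_{ic}$ function takes the form
\[
h_{ic} = w_{ic} \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}}, \quad \text{where } \eta_{ic} = \beta_c + \sum_{p=1}^{K} \gamma_p z_{ip}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k} = I_{\{k=c\}} \, h_{ic} \, \frac{y_i - \eta_{ic}}{\sigma^2},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = 0 \text{ for } k \neq k',
\]
\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2,
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma - h_{ic}}{\sigma^2} + (y_i - \eta_{ic})^2 \, \frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \sigma} = I_{\{k=c\}} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic}),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p \, \partial \sigma} = \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic}) \, z_{ip},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p \, \partial \beta_k} = I_{\{k=c\}} \, \frac{z_{ip}}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right],
\]
\[
\frac{\partial h_{ic}}{\partial \gamma_p} = \frac{h_{ic}}{\sigma^2} (y_i - \eta_{ic}) \, z_{ip},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2} = \left( \frac{\partial h_{ic}}{\partial \gamma_p} \right)^2 \frac{1}{h_{ic}} - h_{ic} \, \frac{z_{ip}^2}{\sigma^2}, \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \gamma_p \, \partial \gamma_{p'}} = \frac{\partial h_{ic}}{\partial \gamma_p} \frac{\partial h_{ic}}{\partial \gamma_{p'}} \frac{1}{h_{ic}} - h_{ic} \, \frac{z_{ip} z_{ip'}}{\sigma^2} \text{ for } p \neq p'.
\]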
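Trend test
In this situation we can write the linear predictor of equation (18) as
\[
\eta_{ic} = \beta_1 + \beta_2 (c - 1).
\]
In other words, $\beta_1$ plays the role of an intercept and $\beta_2$ is the slope. In this case we
consider that both $\beta_1$ and $\beta_2$ are shared by each latent class. Writing
$x_{ikc} = \partial \eta_{ic} / \partial \beta_k$ (so that $x_{i1c} = 1$ and $x_{i2c} = c - 1$) and
bearing in mind that $h_{ic} = w_{ic} \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}}$, we have that for discrete traits
\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic} \, x_{ikc} \, (y_i - p_{ic}), \tag{19}
\]
and
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = x_{ikc} \frac{\partial h_{ic}}{\partial \beta_{k'}} (y_i - p_{ic}) - x_{ikc} x_{ik'c} \, h_{ic} (p_{ic} - p_{ic}^2). \tag{20}
\]
For quantitative traits, where $h_{ic} = w_{ic} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}}$, we have that
\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic} \, x_{ikc} \, \frac{y_i - \eta_{ic}}{\sigma^2}, \tag{21}
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \beta_{k'}} = x_{ikc} \frac{\partial h_{ic}}{\partial \beta_{k'}} \frac{y_i - \eta_{ic}}{\sigma^2} - x_{ikc} x_{ik'c} \, \frac{h_{ic}}{\sigma^2}, \tag{22}
\]
\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2, \tag{23}
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma - h_{ic}}{\sigma^2} + (y_i - \eta_{ic})^2 \, \frac{\frac{\partial h_{ic}}{\partial \sigma} \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6}, \tag{24}
\]
and
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k \, \partial \sigma} = x_{ikc} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic}). \tag{25}
\]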
References
[1] Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM, Eichler EE: Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 2006, 79(2):275-290.
[2] Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
[3] Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 2007, 80:91-104.
[4] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97.
[5] Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848-853.
[20] Greenland S: Basic methods for sensitivity analysis of biases. Int J Epi 1996, 25:1107-1115.
[21] Spiegelman D, Rosner B, Logan R: Estimation and inference for logistic regression with covariate misclassification and measurement error, in main study/validation study designs. J Am Stat Assoc 2000, 95:51-61.
[22] Armour J, Barton D, Cockburn D, Taylor G: The detection of large deletions or duplications in genomic DNA. Hum Mut 2002, 20:325-337.
Figure Legend
[Figure: histogram panels CNV1 to CNV6; vertical axis: Frequency.]
[Figure 2 panels: empirical power (p < 10e-3) for the LC, NAIVE, THRESHOLD and TRUE approaches; one panel with fixed means (1.0, 1.4, 1.8) and variances (0.1, 0.1, 0.1), (0.15, 0.15, 0.15) and (0.2, 0.2, 0.2), the other with fixed variance (0.2, 0.2, 0.2).]
Figure 2: Empirical power for simulation studies. Empirical power for the three different approaches analyzed, varying the quality of clustering for underlying copy number status. The left panel is for a fixed set of variances and varying means, while the right panel is for fixed means and varying variances.
[Figure 3 panels: densities of log(OR); legend entries: true, estimated, naive, corrected; panels labelled by proportions (0.5, 0.5) and (0.8, 0.2), OR 1.3 and 2.0, and variances (.15, .15), (.15, .2) and (.2, .2).]
Figure 3: Empirical distribution of effect estimates (log OR) for each copy number status. Results for 1000 simulated case-control data sets (300/300), for different degrees of association (e.g. different OR) and different distributions of quantitative CNV measurements (e.g. varying clustering quality)
[Two figures: density of CNV measurements for individuals 1 to 600, by case-control status (case, control).]
Disease
Cases
Controls
Total
R
S
32
Table 2: Odds ratio (e^β) and mean square error obtained in 1,000 simulations using the three different approaches: NAIVE, THRES and LC (see text for a description of each one). The results are given for different scenarios varying number of individuals (I), proportion of individuals in each copy number status (π), odds ratio (e^β) and variance for CNV quantitative measurements.

  I     π     e^β   Variance       SIM    NAIVE   THRES   LC
  50    0.8   1.3   (0.15,0.15)    1.23   1.17    1.15    1.20
  50    0.8   1.3   (0.2,0.2)      1.24   1.14    1.09    1.21
  50    0.8   1.3   (0.15,0.2)     1.28   1.18    1.15    1.24
  50    0.8   2     (0.15,0.15)    1.60   1.40    1.28    1.48
  50    0.8   2     (0.2,0.2)      1.82   1.36    1.29    1.52
  50    0.8   2     (0.15,0.2)     1.89   1.42    1.33    1.57
  50    0.5   1.3   (0.15,0.15)    1.26   1.24    1.21    1.26
  50    0.5   1.3   (0.2,0.2)      1.32   1.28    1.25    1.35
  50    0.5   1.3   (0.15,0.2)     1.26   1.23    1.20    1.26
  50    0.5   2     (0.15,0.15)    2.04   1.94    1.83    2.05
  50    0.5   2     (0.2,0.2)      2.04   1.76    1.68    2.05
  50    0.5   2     (0.15,0.2)     2.06   1.78    1.72    1.99
  300   0.8   1.3   (0.15,0.15)    1.30   1.25    1.18    1.30
  300   0.8   1.3   (0.2,0.2)      1.32   1.25    1.15    1.34
  300   0.8   1.3   (0.15,0.2)     1.30   1.22    1.16    1.29
  300   0.8   2     (0.15,0.15)    2.01   1.87    1.49    2.01
  300   0.8   2     (0.2,0.2)      2.03   1.70    1.36    1.99
  300   0.8   2     (0.15,0.2)     2.03   1.62    1.38    1.86
  300   0.5   1.3   (0.15,0.15)    1.31   1.27    1.26    1.30
  300   0.5   1.3   (0.2,0.2)      1.30   1.23    1.22    1.30
  300   0.5   1.3   (0.15,0.2)     1.30   1.24    1.23    1.29
  300   0.5   2     (0.15,0.15)    2.00   1.87    1.77    2.00
  300   0.5   2     (0.2,0.2)      2.00   1.72    1.66    2.02
  300   0.5   2     (0.15,0.2)     2.00   1.76    1.71    1.97
Table 3: Empirical coverage and power obtained in 1,000 simulations using the three different approaches: NAIVE, THRES and LC (read text to have a description of each one). The results are given for different scenarios varying number of individuals (I), proportion of individuals in each copy number status (π), odds ratio (e^β) and variance for CNV quantitative measurements. The table also shows the variance of parameter estimates using the asymptotic (ASYM) variance compared with the empirical (EMP) variance.

                                                        Coverage (%)                   Power (%)
  I     π     e^β   Variance       EMP      ASYM      SIM    NAIVE  THRES  LC       SIM    NAIVE  THRES  LC
  50    0.8   1.3   (0.15,0.15)    0.5821   0.5898    94.2   96.2   95.8   96.8     6.6    5.4    6.4    4.6
  50    0.8   1.3   (0.2,0.2)      0.5679   0.6605    93.0   94.0   93.0   96.2     5.2    4.8    4.2    3.6
  50    0.8   1.3   (0.15,0.2)     0.5326   0.5846    96.6   96.2   95.4   97.4     6.8    4.8    4.8    3.0
  50    0.8   2     (0.15,0.15)    0.6382   0.6512    94.2   92.6   89.0   94.0     22.0   16.8   11.2   15.4
  50    0.8   2     (0.2,0.2)      0.6103   0.7057    92.8   92.2   82.8   95.2     16.8   9.4    7.4    7.0
  50    0.8   2     (0.15,0.2)     0.6174   0.6407    95.6   87.0   79.4   93.0     19.4   10.6   9.8    9.6
  50    0.5   1.3   (0.15,0.15)    0.4168   0.4367    94.0   94.2   95.2   93.8     11.6   10.0   9.2    10.0
  50    0.5   1.3   (0.2,0.2)      0.4298   0.4838    94.6   93.8   94.0   95.4     12.6   7.0    7.0    7.2
  50    0.5   1.3   (0.15,0.2)     0.3984   0.4578    95.2   95.2   95.2   95.6     11.4   8.6    9.4    8.2
  50    0.5   2     (0.15,0.15)    0.4231   0.4495    95.6   94.6   93.8   94.6     39.4   32.4   32.4   32.6
  50    0.5   2     (0.2,0.2)      0.4022   0.5020    97.0   95.0   94.6   98.2     42.2   23.8   23.2   25.2
  50    0.5   2     (0.15,0.2)     0.4345   0.4696    94.4   93.4   94.4   95.6     47.4   30.8   29.8   33.4
  300   0.8   1.3   (0.15,0.15)    0.2291   0.2341    94.0   94.0   89.2   93.2     20.4   15.2   17.0   17.8
  300   0.8   1.3   (0.2,0.2)      0.2208   0.2667    94.6   94.4   88.6   96.4     23.0   17.0   11.0   16.2
  300   0.8   1.3   (0.15,0.2)     0.2192   0.2373    94.2   93.6   89.2   96.0     23.4   15.8   13.2   18.0
  300   0.8   2     (0.15,0.15)    0.2452   0.2610    94.2   93.6   66.0   94.6     85.4   78.4   58.6   79.2
  300   0.8   2     (0.2,0.2)      0.2334   0.2996    95.8   89.8   43.2   96.0     84.2   60.8   42.6   66.6
  300   0.8   2     (0.15,0.2)     0.2455   0.2591    93.8   83.0   43.8   94.6     85.8   62.8   44.8   67.4
  300   0.5   1.3   (0.15,0.15)    0.1711   0.1775    93.6   93.8   94.0   93.8     37.0   30.8   31.2   32.4
  300   0.5   1.3   (0.2,0.2)      0.1709   0.1970    94.4   93.8   92.8   93.6     36.6   24.4   25.0   28.2
  300   0.5   1.3   (0.15,0.2)     0.1582   0.1866    96.8   95.2   94.4   95.2     34.6   22.8   24.6   25.2
  300   0.5   2     (0.15,0.15)    0.1621   0.1823    95.8   95.2   90.4   95.8     98.4   96.8   93.0   97.2
  300   0.5   2     (0.2,0.2)      0.1692   0.2030    96.2   84.0   82.4   96.0     99.2   89.4   90.0   94.2
  300   0.5   2     (0.15,0.2)     0.1793   0.1904    95.4   88.2   83.0   95.2     98.2   92.4   88.2   94.4
                  True copy number status
  Estimated       0      1      2
  Gene 1
    0             426    0      0
    1             0      201    0
    2             0      0      24
  Gene 2
    0             85     0      0
    1             5      287    0
    2             0      73     204
Table 4: Contingency table of estimated and true copy number status for two genes involved in complex disease example
Co
True CN
Ca
OR (CI95%)
Co
Gene 1
0 215 211
1
1 80 121 1.54 (1.09,2.17)
2 6
18 3.06 (1.19,7.85)
P association
0.0027
P trend
5.0 104
Gene 2
0 24 66
1
1 159 201 0.46 (0.27,0.77)
2 108 93 0.31 (0.18,0.54)
P association
7.2 105
P trend
2.1 105
Ca
Estimated CN
ORnaive (CI95%) ORLC (CI95%)
211 215
121 80
18
6
1
1.54 (1.09,2.17)
3.06 (1.19,7.85)
0.0027
5.0 104
1
1.54 (1.10,2.16)
3.06 (1.19,7.87)
0.0023
5.0 104
22 63
129 178
140 119
1
0.44 (0.26,0.75)
0.33 (0.19,0.57)
2.3 104
1.0 104
1
0.47 (0.28,0.88)
0.31 (0.18,0.54)
8.4 105
2.1 105
Table 5: Association analysis of disease status and copy number category using the true copy number status and the status estimated using the proposed finite mixture model.
Additional files
Additional file 1: latent model MLPA sup.pdf, 465.1K
Figure 1
[Histogram panels CNV1 to CNV6; vertical axis: Frequency.]
Figure 2
[Empirical power (p < 10e-3) for the LC, NAIVE, THRESHOLD and TRUE approaches.]
Figure 3
[Densities of log(OR); legend entries: true, estimated, naive, corrected.]
Figure 4
[Density of CNV measurements for individuals 1 to 600, by case-control status (case, control).]
Figure 5
[Density of CNV measurements for individuals 1 to 600, by case-control status (case, control).]
BMC Bioinformatics
BioMed Central
Open Access
Methodology article
doi: 10.1186/1471-2105-10-172
Abstract
Background: Copy number variations (CNVs) may play an important role in disease risk by altering dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypic consequences. Therefore, determining whether a CNV is associated or not with a given disease might be relevant in understanding the genesis and progression of human diseases. Current technology yields CNV probe signals from which copy number status is inferred. Incorporating the uncertainty of CNV calling in the statistical analysis is therefore highly important. In this paper, we present a framework for assessing association between CNVs and disease in case-control studies where uncertainty is taken into account. We also indicate how to use the model to analyze continuous traits and adjust for confounding covariates.
Results: Through simulation studies, we show that our method outperforms other simple methods based on inferring the underlying CNV and assessing association using regular tests that do not propagate call uncertainty. We apply the method to a real data set from a controlled MLPA experiment, with good results. The methodology is also extended to illustrate how to analyze aCGH data.
Conclusion: We demonstrate that our method is robust and achieves maximal theoretical power since it accommodates uncertainty when copy number status is inferred. We have made R functions freely available.
Background
With the recent technological advances, various genome-wide studies have uncovered an unprecedented number of structural variants throughout the human genome [1-3], mainly in the form of copy number variations (CNVs). The considerable number of genes and other
Page 1 of 13
(page number not for citation purposes)
real examples. One of them corresponds to a case-control study using data from an MLPA experiment where the true copy number status is known. The second example belongs to a study where breast cancer cell lines are analyzed using aCGH.
http://www.biomedcentral.com/1471-2105/10/172
Figure 1
CNV quantitative measurements. Examples of CNV data showing different clustering quality and copy number status. [Histogram panels CNV1 to CNV6; vertical axis: Frequency.]
Methods
$$f(x \mid \Theta) = \sum_{c=1}^{C} \pi_c\, N(x \mid \Theta_c), \qquad (1)$$
$$\sum_{c=1}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \geq 0, \; c = 1, \ldots, C.$$
$$f(x \mid \Theta) = \pi_1\, I_{\{x \leq t\}} + \sum_{c=2}^{C} \pi_c\, N(x \mid \eta_c, \sigma_c^2)\, I_{\{x > t\}}, \qquad (2)$$
$$\pi_1 = \frac{\sum_{i=1}^{I} I_{\{x_i \leq t\}}}{I}, \qquad \pi_1 + \sum_{c=2}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \geq 0, \; c = 2, \ldots, C.$$
$$w_{ij} = P(j \mid x_i, \Theta) = \frac{\pi_j\, N(x_i \mid \eta_j, \sigma_j^2)}{\sum_{c} \pi_c\, N(x_i \mid \eta_c, \sigma_c^2)} \qquad (3)$$
$$P(y_i \mid C_i = c, \beta) = \mu_{ic}^{y_i} (1 - \mu_{ic})^{1 - y_i}, \qquad \mathrm{logit}(\mu_{ic}) = \beta_c.$$
$$P(y_i \mid C_i = c, \beta) = \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}. \qquad (4)$$
$$P(y_i \mid \beta) = \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, \beta),$$
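The posterior weights of equation (3) are the uncertainty probabilities that enter all later likelihoods. A minimal sketch of that computation for a single intensity value follows; the function names are hypothetical, the mixing proportions, means and standard deviations are illustrative assumptions, and the mixture parameters are taken as already estimated.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_weights(x, props, means, sds):
    """w_c = pi_c * N(x | eta_c, sigma_c^2) / sum_s pi_s * N(x | eta_s, sigma_s^2)."""
    dens = [p * normal_pdf(x, m, s) for p, m, s in zip(props, means, sds)]
    total = sum(dens)
    return [d / total for d in dens]

# Illustrative three-class mixture (assumed values).
props = [0.5, 0.4, 0.1]
means = [1.0, 1.4, 1.8]
sds = [0.2, 0.2, 0.2]

# An intensity halfway between the first two component means.
w = posterior_weights(1.2, props, means, sds)
print([round(v, 3) for v in w])
```

For an observation equidistant from two components with equal variance, the ratio of their weights reduces to the ratio of the mixing proportions, which is a convenient sanity check.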
              Copy number status
              1     2     ...   C     Total
  Cases       r1    r2    ...   rC    R
  Controls    s1    s2    ...   sC    S
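$$\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic}\, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}. \qquad (5)$$
$$OR_{c/r} = e^{\beta_c - \beta_r} \qquad (6)$$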
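Quantitative traits
We now consider the case where our phenotype, Y, is continuous. We assume that $Y \mid c \sim N(\mu_c, \sigma^2)$. In this case, conditioning on cluster c,
$$P(y_i \mid C_i = c, \beta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - \mu_{ic})^2}{2\sigma^2}}, \qquad (7)$$
where $\mu_{ic} = \beta_c$.
Similar to the case of discrete traits, the likelihood function for model parameters $\beta$ is given by
$$\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic}\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} \qquad (8)$$
$$y_{c/r} = \beta_c - \beta_r. \qquad (9)$$
Covariate Adjustment
In some instances researchers are interested in assessing the effect of CNVs after adjusting for other covariates, $Z_1, \ldots, Z_K$ (usually called confounding variables). In this case, the likelihood function can be written as
$$\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, Z, \beta_c, \gamma), \qquad (10)$$
where, for binary traits,
$$P(y_i \mid C_i = c, Z, \beta_c, \gamma) = \frac{e^{y_i \psi_{ic}}}{1 + e^{\psi_{ic}}},$$
for quantitative traits,
$$P(y_i \mid C_i = c, Z, \beta_c, \gamma, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - \psi_{ic})^2}{2\sigma^2}},$$
and
$$\psi_{ic} = \beta_c + \gamma_1 Z_{i1} + \cdots + \gamma_K Z_{iK}. \qquad (11)$$
Parameter estimation
In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity, let $\theta = (\beta, \gamma, \sigma)$ (notice that for discrete traits $\sigma = 1$). We consider that the weights, $w_{ic}$, are known and that they are given by the surrogate variable X from equation (3). Therefore, they can be used in the log-likelihood calculation, resulting in
$$\log P(Y \mid C, Z, \theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, Z, \theta). \qquad (12)$$
The score function is
$$S_k(y \mid C, \theta) = \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k}}{\sum_{c=1}^{C} h_{ic}},$$
and the Hessian is
$$\frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k\, \partial \theta_{k'}} = \sum_{i=1}^{I} \frac{\left( \sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k\, \partial \theta_{k'}} \right) \left( \sum_{s=1}^{C} h_{is} \right) - \left( \sum_{s=1}^{C} \frac{\partial h_{is}}{\partial \theta_k} \right) \left( \sum_{s=1}^{C} \frac{\partial h_{is}}{\partial \theta_{k'}} \right)}{\left( \sum_{s=1}^{C} h_{is} \right)^2},$$
where
$$h_{ic} = w_{ic}\, P(y_i \mid C_i = c, Z, \theta).$$
Formulae for the derivatives of $h_{ic}$ for covariates and for discrete and quantitative traits are given in the Appendix.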
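To make the log-likelihood and score concrete, the sketch below evaluates both for a binary trait without covariates, treating the weights as known constants. It is an illustration under stated assumptions (toy data, hypothetical function names), not the authors' R implementation.

```python
import math

def class_terms(beta, yi, wi):
    """h_ic = w_ic * exp(y_i * beta_c) / (1 + exp(beta_c)) for one individual."""
    return [w * math.exp(yi * b) / (1.0 + math.exp(b)) for w, b in zip(wi, beta)]

def log_likelihood(beta, y, w):
    """Sum over individuals of log(sum over classes of h_ic)."""
    return sum(math.log(sum(class_terms(beta, yi, wi))) for yi, wi in zip(y, w))

def score(beta, y, w):
    """Gradient of the log-likelihood, using dh_ic/dbeta_c = h_ic * (y_i - p_c)."""
    grad = [0.0] * len(beta)
    for yi, wi in zip(y, w):
        h = class_terms(beta, yi, wi)
        denom = sum(h)
        for c, b in enumerate(beta):
            p = 1.0 / (1.0 + math.exp(-b))
            grad[c] += h[c] * (yi - p) / denom
    return grad

# Toy data (assumed): binary outcomes and per-individual class weights.
y = [1, 0, 1, 1, 0]
w = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.05, 0.15, 0.8],
     [0.6, 0.4, 0.0], [0.1, 0.3, 0.6]]
beta = [-0.2, 0.1, 0.6]

print(log_likelihood(beta, y, w))
print(score(beta, y, w))
```

A Newton-Raphson iteration would combine this score with the Hessian; when every row of `w` is a one-hot vector, the expression reduces to an ordinary logistic log-likelihood with one intercept per class.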
MLE can be used to estimate, under the multiplicative
model, the OR between individuals with copy number
$$OR_{c/r} = e^{\beta_c - \beta_r}. \qquad (13)$$
Similarly, when analyzing continuous traits, the estimated mean effect among individuals with c copies with respect to those with r copies is
$$y_{c/r} = \beta_c - \beta_r. \qquad (14)$$
(15)
$$CI_{1-\alpha}(OR_{c/r}) = \exp\left\{ (\beta_c - \beta_r) \pm z_{\alpha/2} \sqrt{Var(\theta)_{[c,c]} + Var(\theta)_{[r,r]} - 2\,Var(\theta)_{[c,r]}} \right\}, \qquad (16)$$
and for $y_{c/r}$
$$CI_{1-\alpha}(y_{c/r}) = (\beta_c - \beta_r) \pm z_{\alpha/2} \sqrt{Var(\theta)_{[c,c]} + Var(\theta)_{[r,r]} - 2\,Var(\theta)_{[c,r]}}, \qquad (17)$$
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$-th quantile of a standard normal distribution, $\alpha$ is the desired type-I error, and subindex $[\cdot, \cdot]$ denotes the position in the inverse of Fisher's information matrix.
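The interval in equation (16) only needs the two coefficients and three entries of the estimated covariance matrix. A minimal sketch follows, assuming a hypothetical coefficient vector and covariance matrix (the inverse of the information matrix); none of the numbers come from the paper.

```python
import math

def or_confidence_interval(beta, cov, c, r, z=1.959964):
    """Wald interval for OR_{c/r} = exp(beta_c - beta_r); z is the normal quantile."""
    diff = beta[c] - beta[r]
    var = cov[c][c] + cov[r][r] - 2.0 * cov[c][r]
    half_width = z * math.sqrt(var)
    return math.exp(diff), math.exp(diff - half_width), math.exp(diff + half_width)

# Assumed estimates for two copy number classes and their covariance matrix.
beta = [0.0, 0.5]
cov = [[0.04, 0.01],
       [0.01, 0.05]]

or_hat, lower, upper = or_confidence_interval(beta, cov, c=1, r=0)
print(round(or_hat, 3), round(lower, 3), round(upper, 3))
```

Dropping the exponentials gives the interval of equation (17) for the mean difference on the original scale.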
Hypothesis testing
We propose to use a likelihood ratio test to assess disease association, taking the model without the copy number variable as reference. Twice the increase in the log-likelihood provides the asymptotic $\chi^2$ statistic that tests $H_0: \beta_1 = \beta_2 = \cdots = \beta_C$. In many instances, we are interested in studying the trend in effect with respect to copy number status (e.g., additive model). This can be done by generalizing equation (11) in the form
M
y ic =
icmz cm ,
(18)
m =1
1
1 1 1 1
D =
0 0 1 1 C 1 C 1
and the trend hypothesis on copy number status is tested
using a likelihood ratio test, comparing this model with
the null model. Notice that this formulation allows us to
accommodate different or common effects for each
latent class. In this case, parameter estimates are
obtained as shown above. Formulae for the derivatives
obtained in the score and Hessian, where coefficients are
not shared by each latent class, are shown in the
Appendix. R language functions for the methods discussed in this paper are freely available at http://www.creal.cat/jrgonzalez/software.htm [25]
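As an illustration of the likelihood ratio test, the sketch below converts two maximized log-likelihoods into a statistic and p-value for the one-degree-of-freedom case, which is what a trend test with a single slope would use. The log-likelihood values are made up, and the function name is hypothetical. With more degrees of freedom (for example C - 1 when testing equality of all class effects), a general chi-square survival function such as `scipy.stats.chi2.sf` would be needed instead.

```python
import math

def lrt_pvalue_df1(loglik_null, loglik_alt):
    """Likelihood ratio statistic and its chi-square (1 df) upper-tail p-value."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # For 1 df, P(chi2 > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Assumed maximized log-likelihoods for the null and trend models.
stat, pvalue = lrt_pvalue_df1(loglik_null=-412.30, loglik_alt=-406.10)
print(round(stat, 2), pvalue)
```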
Results
Simulation study
We performed computer simulation studies to empirically examine the properties of the parameter estimators developed in the previous sections. The specific goals of these studies were: (i) to evaluate the performance of the proposed likelihood ratio trend test based on the latent class model for a number of CNV measurement distributions; (ii) to examine the effect of sample size (I) on the distributional properties of the estimators; (iii) to examine the bias and mean square error (MSE) of the estimators; (iv) to assess the accuracy of the variance and parameter estimates obtained using the observed information matrix. Simulations were performed as follows: To study (i), we simulated a binary trait using 300 cases and 300 controls. The unobserved copy number statuses (i.e. latent classes) were simulated assuming 3 different copy number statuses (C = 3), with the proportion of individuals in each category set as $\pi$ = (0.5, 0.4, 0.1). The trend OR was set equal to 1.5. The observed signal intensity ratios (X variable) were simulated as a finite mixture of C normal distributions using different means, $\eta$, and variances, $\sigma^2$, to assess whether the separation of clusters and their variance affect power.
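One way to generate data of this kind is sketched below: a latent copy number class is drawn, disease status follows a logistic model with a log-linear trend in the class, and the intensity is drawn from the class-specific normal distribution. Individuals are accepted until the case and control quotas are filled. The intercept, the component means, the standard deviations and the function name are assumptions for illustration; the paper does not specify its sampling code.

```python
import math
import random

def simulate_case_control(n_cases, n_controls, props, means, sds,
                          trend_or, intercept=0.0, seed=1):
    """Return (class, intensity, status) tuples with fixed case/control counts."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        c = rng.choices(range(len(props)), weights=props)[0]   # latent class
        p = 1.0 / (1.0 + math.exp(-(intercept + math.log(trend_or) * c)))
        y = 1 if rng.random() < p else 0                        # disease status
        x = rng.gauss(means[c], sds[c])                         # observed intensity
        if y == 1 and len(cases) < n_cases:
            cases.append((c, x, y))
        elif y == 0 and len(controls) < n_controls:
            controls.append((c, x, y))
    return cases + controls

data = simulate_case_control(300, 300, props=[0.5, 0.4, 0.1],
                             means=[1.0, 1.4, 1.8], sds=[0.2, 0.2, 0.2],
                             trend_or=1.5)
print(len(data), sum(y for _, _, y in data))
```

Note that `sds` holds standard deviations, as required by `random.gauss`, whereas the text parameterizes the mixture by variances.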
Figure 2
Empirical power for simulation studies. Empirical power for the three different approaches analyzed, varying the quality
of clustering for underlying copy number status. Left panel is for fixed variance and varying means, while the right panel is for
fixed mean and varying variances.
                                     e^β                              MSE
  I     π     e^β   Variance       SIM    NAIVE   THRES   LC       NAIVE   THRES   LC
  50    0.8   1.3   (0.15,0.15)    1.23   1.17    1.15    1.20     57      87      42
  50    0.8   1.3   (0.2,0.2)      1.24   1.14    1.09    1.21     107     131     114
  50    0.8   1.3   (0.15,0.2)     1.28   1.18    1.15    1.24     134     148     112
  50    0.8   2     (0.15,0.15)    1.60   1.40    1.28    1.48     54      85      44
  50    0.8   2     (0.2,0.2)      1.82   1.36    1.29    1.52     152     158     126
  50    0.8   2     (0.15,0.2)     1.89   1.42    1.33    1.57     180     253     162
  50    0.5   1.3   (0.15,0.15)    1.26   1.24    1.21    1.26     39      51      32
  50    0.5   1.3   (0.2,0.2)      1.32   1.28    1.25    1.35     82      79      97
  50    0.5   1.3   (0.15,0.2)     1.26   1.23    1.20    1.26     66      72      60
  50    0.5   2     (0.15,0.15)    2.04   1.94    1.83    2.05     40      67      34
  50    0.5   2     (0.2,0.2)      2.04   1.76    1.68    2.05     107     128     92
  50    0.5   2     (0.15,0.2)     2.06   1.78    1.72    1.99     87      107     71
  300   0.8   1.3   (0.15,0.15)    1.30   1.25    1.18    1.30     13      32      10
  300   0.8   1.3   (0.2,0.2)      1.32   1.25    1.15    1.34     27      50      29
  300   0.8   1.3   (0.15,0.2)     1.30   1.22    1.16    1.29     24      42      21
  300   0.8   2     (0.15,0.15)    2.01   1.87    1.49    2.01     21      120     13
  300   0.8   2     (0.2,0.2)      2.03   1.70    1.36    1.99     69      203     43
  300   0.8   2     (0.15,0.2)     2.03   1.62    1.38    1.86     78      189     38
  300   0.5   1.3   (0.15,0.15)    1.31   1.27    1.26    1.30     7       9       5
  300   0.5   1.3   (0.2,0.2)      1.30   1.23    1.22    1.30     15      17      12
  300   0.5   1.3   (0.15,0.2)     1.30   1.24    1.23    1.29     12      14      9
  300   0.5   2     (0.15,0.15)    2.00   1.87    1.77    2.00     11      23      5
  300   0.5   2     (0.2,0.2)      2.00   1.72    1.66    2.02     36      51      15
  300   0.5   2     (0.15,0.2)     2.00   1.76    1.71    1.97     26      37      10
Odds ratio (e^β) and mean square error obtained in 1,000 simulations using the three different approaches, NAIVE, THRES and LC (see text for a description of each). Results are given for different scenarios, varying the number of individuals (I), the proportion of individuals with each copy number status (π), the odds ratio (e^β), and the variance for CNV quantitative measurements.
                  True copy number status
  Estimated       0      1      2
  Gene 1
    0             426    0      0
    1             0      201    0
    2             0      0      24
  Gene 2
    0             85     0      0
    1             5      287    0
    2             0      73     204
Figure 3
Association between Gene 1 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 1 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model. [Axes: individuals (1 to 600) and density; grouping: case-control status (case, control).]
Figure 4
Association between Gene 2 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 2 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model. [Axes: individuals (1 to 600) and density; grouping: case-control status (case, control).]
From the table, we can see that the finite mixture model
gives a perfect classification for gene 1 and some
misclassification for gene 2. Goodness-of-fit test revealed
that the proposed mixture model to determine CNV
status was appropriate (p = 0.6615 and p = 0.1586).
Table 4 shows the ORs and their 95%CI for the two
genes analyzed. The first three columns show the results
obtained in the laboratory using PCR, while the other
columns show the results obtained after estimating the
copy number status using our proposed finite mixture
aCGH example
The analysis of aCGH data requires additional steps to
take into account the dependency across probes. Table 5
shows four steps we recommend for the analysis of this
kind of data. First, MAP should be obtained with an
algorithm that considers probe correlation. We use, in
particular, the CGHcall R program which includes a
mixture model to infer CNV status [18]. Second, we
build blocks/regions of consecutive clones with similar
signatures. To perform this step the CGHregions R
library was used [26]. Third, the association between
the CNV status of blocks and the trait is assessed by
incorporating the uncertainty probabilities in the LC
model. And fourth, corrections for multiple comparisons
must be performed. We use the Benjamini-Hochberg
(BH) correction [27]. This is a heuristic method that is
robust against positive dependence and increasingly
conservative as correlation increases [28].
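The fourth step, the Benjamini-Hochberg correction, can be sketched in a few lines. This is a generic implementation of the standard step-up procedure, with made-up p-values; it is not code from the paper.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values for four regions.
decisions = benjamini_hochberg([0.02, 0.03, 0.035, 0.5])
print(decisions)
```

In this example the two smallest p-values exceed their own thresholds (0.0125 and 0.025), yet they are still rejected because the third one passes its threshold (0.0375); that step-up behaviour is what distinguishes the procedure from a simple per-rank cutoff.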
Table 4: Association analysis of disease status and copy number category using the true copy number status and the estimated status obtained using the finite mixture proposed

                   True CN                            Estimated CN
                   Co    Ca    OR (CI95%)             Co    Ca    ORnaive (CI95%)      ORLC (CI95%)
  Gene 1
    0              210   216   1                      210   216   1                    1
    1              75    126   1.63 (1.16,2.30)       75    126   1.63 (1.16,2.30)     1.63 (1.16,2.30)
    2              6     18    2.92 (1.14,7.49)       6     18    2.92 (1.14,7.49)     2.92 (1.14,7.50)
    P association              0.0027                             0.0027               0.0023
    P trend                    5.0 x 10^-4                        5.0 x 10^-4          5.0 x 10^-4
  Gene 2
    0              24    66    1                      22    63    1                    1
    1              159   201   0.46 (0.27,0.77)       129   178   0.44 (0.26,0.75)     0.47 (0.27,0.82)
    2              108   93    0.31 (0.18,0.54)       140   119   0.33 (0.19,0.57)     0.31 (0.18,0.54)
    P association              7.2 x 10^-5                        2.3 x 10^-4          8.4 x 10^-5
    P trend                    2.1 x 10^-5                        1.0 x 10^-4          2.1 x 10^-5
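The odds ratios based on the true copy number status can be reproduced approximately from the counts alone. The sketch below uses the standard Wald (Woolf) interval for a 2x2 table, which is an assumption about how such intervals are typically computed rather than a description of the authors' code, applied to the Gene 1 counts for one copy versus zero copies (126 cases and 75 controls versus 216 cases and 210 controls).

```python
import math

def odds_ratio_ci(cases_exp, controls_exp, cases_ref, controls_ref, z=1.959964):
    """Odds ratio of an exposed category versus a reference, with a Wald interval."""
    odds_ratio = (cases_exp * controls_ref) / (controls_exp * cases_ref)
    se_log = math.sqrt(1.0 / cases_exp + 1.0 / controls_exp
                       + 1.0 / cases_ref + 1.0 / controls_ref)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)

# Gene 1, true copy number 1 versus 0.
or_1, lo_1, hi_1 = odds_ratio_ci(126, 75, 216, 210)
print(round(or_1, 2), round(lo_1, 2), round(hi_1, 2))
```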
  Significance level    Latent class model    Chi-square test
  10^-6                 1                     0
  10^-5                 4                     2
  10^-4                 27                    10
  10^-3                 64                    41
  10^-2                 117                   93
Results are given for different levels of association, comparing our proposed model with the naive approach that does not consider uncertainty.
significance levels. We observe that incorporating classification uncertainty with the LC model substantially increased the level of association, as compared to the NAIVE approach. The number of positive associations at the 5% significance level after applying the BH correction was 49 and 24 for the LC and NAIVE approaches, respectively.
Discussion
In this paper we have shown that assessing the association between CNVs and disease with analysis methods that do not take into account uncertainty when inferring copy number status leads to larger p-values and underestimates the model parameters. This compounds the need to increase statistical power, which is already reduced by the multiple comparison correction for the simultaneous testing of several loci. False positives are typically controlled by a dramatic reduction in the nominal p-value, such that very low values are required to reach statistical significance. Thus, a precise computation of these values is essential in genetic association studies.
Here we have proposed a latent class model (LC) that
accounts for the uncertainty of assessing CNV status and
also accommodates potential confounding factors. In the
case of analyzing quantitative traits, we also provide
formulae to further propagate call uncertainty, as other
authors have proposed in another context [32]. By
analyzing quantitative traits, we have assumed that the
response variable follows a normal distribution, although
this assumption does not hold in some instances. In this
situation, one possibility is to analyze the log-transformed variable, although log transformation may not be sufficient. The model could easily be extended to fit a
response variable that has any exponential family
distribution (e.g. normal, gamma, Poisson). However,
we have not yet implemented this option in the functions
reported here. The extension of our proposed latent-class model to assess survival time, possibly with right-censored data, is not trivial but could be a very interesting avenue for future investigation. The parameter estimation procedure proposed here allows the estimation of confidence intervals. The LC model was remarkably consistent with simulated data. In particular, we found that the p-values obtained with the LC model were more similar to the expected values than those obtained by the threshold and naive methods.
We maximize the likelihood function, assuming fixed
weights for each copy number status, which accounts for
possible misclassification. The main advantage of treating the weights as known constants is that the Newton-Raphson procedure is much simpler and faster, and the Hessian matrix can be obtained analytically. We
confirmed that the proposed model captures very well
the nature of the synthetic data and variance estimates.
Interestingly, we observed that the variance estimates
using MLE were also reproduced when a bootstrap
procedure was used (see Additional file 1, Table S2). In
the interest of generalization, one can consider maximizing the likelihood function for both model parameters and weights. In that case, an EM algorithm
should be used instead. However, one should bear in
mind that EM does not allow for estimation of the
variance of the model parameters and is computationally
expensive, which may be particularly costly if this
method is used in whole genome scan settings.
Conclusion
We have shown that the LC model can incorporate uncertainty of CNV calling in the analysis. We have also illustrated how to analyze quantitative traits as well as how to accommodate confounding variables. This is of particular importance in complex disease studies where other clinical or biochemical factors need to be taken into account. The formulation can also be generalized to assess survival times or counts in longitudinal studies. The model has shown good performance when analyzing both targeted (MLPA data) and whole genome (aCGH data) studies.
Authors' contributions
JRG and IS developed the new statistical methods. JRG wrote the R functions and the main text of the manuscript and performed the simulation studies. GE and AC made abundant suggestions for developing the models. SP worked on the Gaussian mixture approach to model quantitative CNV measurements. XE reviewed the paper and revised its framework. LA and JRG proposed the need for a statistical tool to measure the biological differences in allele distribution in cohorts of cases and controls, and conceived the study. All authors have read and approved the final manuscript.
Appendix
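To obtain parameter estimates, we maximize the log-likelihood function
$$\log P(Y \mid C, Z, \theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, Z, \theta).$$
The score function is
$$S_k(y \mid C, \theta) = \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k}}{\sum_{c=1}^{C} h_{ic}},$$
and the Hessian is
$$\frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k\, \partial \theta_{k'}} = \sum_{i=1}^{I} \frac{\left( \sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k\, \partial \theta_{k'}} \right) \left( \sum_{s=1}^{C} h_{is} \right) - \left( \sum_{s=1}^{C} \frac{\partial h_{is}}{\partial \theta_k} \right) \left( \sum_{s=1}^{C} \frac{\partial h_{is}}{\partial \theta_{k'}} \right)}{\left( \sum_{s=1}^{C} h_{is} \right)^2},$$
where
$$h_{ic} = w_{ic}\, P(y_i \mid C_i = c, Z, \theta).$$
Herein we provide formulae for the derivatives of $h_{ic}$ for all cases discussed in this paper. Although the following expressions may appear complicated, they are straightforward to program and are included in the R functions available at http://www.creal.cat/jrgonzalez/software.htm.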
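Binary Traits
Binary Traits without covariates
In this case, the $h_{ic}$ function takes the form
$$h_{ic} = w_{ic}\, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}.$$
Therefore,
$$\frac{\partial h_{ic}}{\partial \beta_k} = \frac{w_{ic}\, I_{\{k=c\}}\, y_i\, e^{y_i \beta_k} (1 + e^{\beta_k}) - w_{ic}\, I_{\{k=c\}}\, e^{y_i \beta_k} e^{\beta_k}}{(1 + e^{\beta_k})^2} = I_{\{k=c\}}\, h_{ic}\, (y_i - p_{ic}),$$
where
$$p_{ic} = \frac{1}{1 + e^{-\beta_c}},$$
and
$$\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k'.$$
With covariates, the $h_{ic}$ function takes the form
$$h_{ic} = w_{ic}\, \frac{e^{y_i \psi_{ic}}}{1 + e^{\psi_{ic}}}, \quad \text{where} \quad \psi_{ic} = \beta_c + \sum_{k=1}^{K} \gamma_k z_{ik}.$$
Therefore,
$$\frac{\partial h_{ic}}{\partial \beta_k} = \frac{w_{ic}\, I_{\{k=c\}}\, y_i\, e^{y_i \psi_{ic}} (1 + e^{\psi_{ic}}) - w_{ic}\, I_{\{k=c\}}\, e^{y_i \psi_{ic}} e^{\psi_{ic}}}{(1 + e^{\psi_{ic}})^2} = I_{\{k=c\}}\, h_{ic}\, (y_i - p_{ic}),$$
where
$$p_{ic} = \frac{1}{1 + e^{-\psi_{ic}}},$$
and
$$\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k'.$$
For covariates:
$$\frac{\partial h_{ic}}{\partial \gamma_p} = z_p\, h_{ic}\, (y_i - p_{ic}),$$
$$\frac{\partial^2 h_{ic}}{\partial \gamma_p^2} = z_p\, \frac{\partial h_{ic}}{\partial \gamma_p} (y_i - p_{ic}) - z_p^2\, h_{ic} (p_{ic} - p_{ic}^2),$$
$$\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}} = z_p\, \frac{\partial h_{ic}}{\partial \gamma_{p'}} (y_i - p_{ic}) - z_p z_{p'}\, h_{ic} (p_{ic} - p_{ic}^2).$$
Quantitative traits
Quantitative traits without covariates and shared variance
In this case, the $h_{ic}$ function takes the form
$$h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}.$$
Therefore,
$$\frac{\partial h_{ic}}{\partial \beta_k} = I_{\{k=c\}}\, w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_k)^2}{2\sigma^2}}\, \frac{y_i - \beta_k}{\sigma^2} = I_{\{k=c\}}\, h_{ic}\, \frac{y_i - \beta_k}{\sigma^2},$$
$$\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}}\, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \beta_k) - h_{ic} \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k',$$
$$\frac{\partial h_{ic}}{\partial \sigma} = w_{ic} \left[ -\frac{1}{\sigma^2}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} + \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}\, \frac{(y_i - \beta_c)^2}{\sigma^3} \right] = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \beta_c)^2,$$
$$\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2} + (y_i - \beta_c)^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},$$
$$\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma} = I_{\{k=c\}} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \beta_c).$$
With covariates, the $h_{ic}$ function takes the form
$$h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \psi_{ic})^2}{2\sigma^2}}, \quad \text{where} \quad \psi_{ic} = \beta_c + \sum_{p=1}^{K} \gamma_p z_{ip}.$$
Therefore,
$$\frac{\partial h_{ic}}{\partial \beta_k} = I_{\{k=c\}}\, h_{ic}\, \frac{y_i - \psi_{ic}}{\sigma^2},$$
$$\frac{\partial^2 h_{ic}}{\partial \beta_k^2} = I_{\{k=c\}}\, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \psi_{ic}) - h_{ic} \right], \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k',$$
$$\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \psi_{ic})^2,$$
$$\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2} + (y_i - \psi_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},$$
$$\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma} = I_{\{k=c\}} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \psi_{ic}),$$
$$\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \sigma} = \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \psi_{ic})\, z_{ip},$$
$$\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \beta_k} = I_{\{k=c\}}\, \frac{z_{ip}}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \psi_{ic}) - h_{ic} \right],$$
$$\frac{\partial h_{ic}}{\partial \gamma_p} = \frac{h_{ic}}{\sigma^2} (y_i - \psi_{ic})\, z_{ip},$$
$$\frac{\partial^2 h_{ic}}{\partial \gamma_p^2} = \left( \frac{\partial h_{ic}}{\partial \gamma_p} \right)^2 \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip}^2}{\sigma^2}, \quad \text{and} \quad \frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}} = \frac{\partial h_{ic}}{\partial \gamma_p} \frac{\partial h_{ic}}{\partial \gamma_{p'}} \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip} z_{ip'}}{\sigma^2} \ \text{for } p \neq p'.$$
Trend test
$$\psi_{ic} = \beta_1 + \beta_2 (c - 1).$$
For discrete traits,
$$\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, (y_i - p_{ic}), \qquad (19)$$
and
$$\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}} (y_i - p_{ic}) - x_{ikc}\, x_{ik'c}\, h_{ic} (p_{ic} - p_{ic}^2). \qquad (20)$$
For quantitative traits, where $h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \psi_{ic})^2}{2\sigma^2}}$, we have that
$$\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, \frac{y_i - \psi_{ic}}{\sigma^2}, \qquad (21)$$
and
$$\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}}\, \frac{y_i - \psi_{ic}}{\sigma^2} - x_{ikc}\, x_{ik'c}\, \frac{h_{ic}}{\sigma^2}. \qquad (22)$$
$$\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \psi_{ic})^2, \qquad (23)$$
$$\frac{\partial^2 h_{ic}}{\partial \sigma^2} = -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2} + (y_i - \psi_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6}, \qquad (24)$$
and
$$\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma} = x_{ikc} \left( \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \psi_{ic}). \qquad (25)$$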
Additional material
Additional file 1
Tables and figures for more scenarios of simulation studies.
[http://www.biomedcentral.com/content/supplementary/1471-2105-10-172-S1.pdf]
Acknowledgements
The first author would like to thank Xavier Basagaña for his comments and helpful conversations about the proposed model. Gavin Lucas is also acknowledged for his comments on the final version of the manuscript. The authors also thank one of the reviewers for helpful comments on how to analyze aCGH data. This work was supported by the Spanish Ministry for Science and Innovation [MTM2008-02457 to JRG and SAF2008-00357 to XE]; and the European Commission [AnEUploidy project; FP6-2005-LifeSciHealth contract #037627].
References
1. Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM and Eichler EE: Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 2006, 79(2):275-290.
2. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW and Hurles ME: Global variation in