Universidad Simón Bolívar Decanato de Estudios Profesionales Coordinación de Matemáticas

MULTIVARIATE ANALYSIS TO IDENTIFY COPY-NUMBER-VARIABLE GENES ASSOCIATED WITH DISTINCT POPULATIONS
By
Br. Solymar Peraza Crespo
Sartenejas, April 2009


Prepared under the supervision of
Academic Tutor: Prof. Alfredo Ríos
Industrial Tutor: Dr. Juan Ramón González Ruiz
FINAL REPORT FOR COURSES IN TECHNICAL COOPERATION AND SOCIAL DEVELOPMENT
Submitted to the Universidad Simón Bolívar
in partial fulfillment of the requirements for the degree of Licentiate in Applied Mathematics


Presented by
Solymar Peraza Crespo
Student ID (Carnet): 01-34259
ABSTRACT
The PRBB (Parc de Recerca Biomèdica de Barcelona, the Barcelona Biomedical Research Park) required the classification of an intensity matrix containing information on several genetic probes for a group of individuals, so that a cluster analysis could then be carried out to identify the classification into populations according to the copy number of each gene.
The project was carried out in four stages. The first: familiarization with genetic terminology and with the statistical methods used in earlier phases of the project. The second: development and implementation of a method to determine the copy number of each genetic probe, taking into account the distribution of the data. The third: classification of the individuals into populations using the k-medoids algorithm. The fourth: evaluation of the results obtained.
The results were satisfactory and interesting, and are a source for future publications in statistical genetics.
KEYWORDS: Mixture estimation, Clustering, genetic classification, genetics, SNPs, CNVs, HapMap Project, Epidemiology.
Approved with distinction: _______
Nominated for the award: _______
CONTENTS
INTRODUCTION
  Background
  Project Justification
  General Objectives
  Specific Objectives
  Structure of the Report
CHAPTER 1. THE COMPANY: THE PRBB (PARC DE RECERCA BIOMÈDICA DE BARCELONA)
  1.1. History and founding of the PRBB
  1.2. Centers that make up the PRBB
    1.2.1. Centre for Research in Environmental Epidemiology (CREAL)
    1.2.2. Hospital del Mar (IMAS)
    1.2.3. Municipal Institute of Medical Research (IMIM)
    1.2.4. Department of Experimental and Health Sciences of the Universidad Pompeu Fabra (CEXS-UPF)
    1.2.5. Centre for Genomic Regulation (CRG)
    1.2.6. Center of Regenerative Medicine in Barcelona (CMRB)
    1.2.7. Institute of High Technology (IAT)
CHAPTER 2. THEORETICAL FOUNDATIONS
  2.1. Genetic notions and concepts
  2.2. DNA (deoxyribonucleic acid)
  2.3. Gene
  2.4. Alleles
  2.5. Genetic polymorphism
    There are several types of polymorphisms
  2.6. Copy number variations (CNVs)
    2.6.1. Categorical variable
    2.6.2. Quantitative variable
  2.7. Contingency tables
  2.8. Normal or Gaussian distribution
  2.9. Kappa coefficient of agreement
  2.10. Maximum likelihood estimation
  2.11. Mixtures of distributions
  2.12. EM (Expectation Maximization)
  2.13. Cluster analysis (Clustering)
    2.13.1. Unsupervised
      a) Non-hierarchical cluster analysis
      b) Hierarchical cluster analysis
    2.13.2. Supervised clustering
    2.13.2. Distances used in the various clustering methods
      a) Euclidean distance
      b) Manhattan distance
      c) Minkowski distance
      d) Supremum distance
      e) Canberra distance
      f) Binary distance
      g) Ward distance
  2.14. Discriminant analysis
    2.14.1. Descriptive discriminant analysis
    2.14.2. Predictive discriminant analysis
CHAPTER 3. THE HAPMAP PROJECT AND DATA PRE-PROCESSING
  3.1. HapMap Project
  3.2. Data pre-processing
CHAPTER 4. METHODOLOGY
  4.1. First phase: HapMap Project and data pre-processing
  4.2. Second phase: Classification of the data by copy number (Gaussian mixture model)
    4.2.1. Parameter estimation
    4.2.2. Class selection
    4.2.3. Problem near zero
    4.2.4. Resulting model
  4.3. Third phase: Clustering
    4.3.1. Classification of individuals by copy number
    4.3.2. Classification of individuals by copy number (unsupervised clustering)
CHAPTER 5. RESULTS
  5.1. Gaussian mixtures
  5.2. Classification of the populations (unsupervised clustering)
  5.3. Discriminant analysis (supervised clustering)
    5.3.1. Supervised clustering of the classified data
    5.3.2. Supervised clustering of the original data
  5.4. Comparison of the classifications
CONCLUSIONS AND FUTURE STUDIES
REFERENCES
APPENDICES
  Annex A.
INTRODUCTION
Problem Statement
The PRBB required a method to determine the copy number (factors) at each genetic marker for a group of individuals, taking into account the population distribution for each probe. The individuals would then be classified by means of a multivariate analysis, in order to determine whether the resulting classification distinguishes populations.
Background
The study of the role that genes play in different areas of science, such as medicine, has recently boomed. In particular, the relationship between certain genes and complex diseases has received a great deal of attention in recent years. One of the clearest examples in medicine is epidemiology. Many years of research have been devoted to studying in detail the environmental factors associated with the most common diseases, such as cancer, cardiovascular disease, or AIDS. Epidemiological studies now also examine the involvement of certain genes, as well as their interaction with other known environmental factors.
Project Justification
Given the high cost of obtaining genetic data, it is necessary to develop some way of inferring the information that is actually of interest, which in this case is the copy number, since what is currently measured is an intensity for each gene and each individual. Methods are therefore sought that reveal this information, so that the possibility of classifying individuals with the newly obtained data can then be assessed. The aim is to detect the genes directly related to the differentiation of populations and, in the future, specifically to the diseases mentioned above.
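As an illustration of the step from intensities to copy numbers described above, the following is a minimal sketch, not the method used in the project, of how a copy number could be assigned to each intensity once a Gaussian mixture is available for a probe. The component weights, means, and standard deviations, the intensity values, and the function names are all hypothetical; in practice the parameters would have to be estimated from the data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_copy_number(intensity, components):
    """Return (most probable copy number, posterior probabilities).

    `components` maps a copy number to the (weight, mean, sd) of its
    Gaussian component. All parameter values here are hypothetical.
    """
    joint = {k: w * normal_pdf(intensity, m, s)
             for k, (w, m, s) in components.items()}
    total = sum(joint.values())
    post = {k: v / total for k, v in joint.items()}
    best = max(post, key=post.get)
    return best, post

# Hypothetical mixture for one probe: copy numbers 1, 2 and 3.
COMPONENTS = {1: (0.2, -0.5, 0.15), 2: (0.6, 0.0, 0.15), 3: (0.2, 0.4, 0.15)}

# Hypothetical intensities for five individuals.
intensities = [-0.55, -0.02, 0.05, 0.38, 0.47]
calls = [posterior_copy_number(x, COMPONENTS)[0] for x in intensities]
print(calls)  # [1, 2, 2, 3, 3]
```

Each individual is assigned the copy number whose component has the highest posterior probability, which turns the continuous intensity matrix into a discrete one suitable for classification.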

General Objectives:
The general objective of the research to be carried out by the PRBB is to determine the copy number of each genetic probe for each individual and, subsequently, to determine how many populations can be distinguished on the basis of these data.
Specific Objectives:
- Several methods are proposed for classifying the data:
  o Clustering (partitioning and hierarchical): mclust, hclust, etc.
  o Latent class models (mixture fitting).
- The methods used by these R functions, and how they are implemented, are investigated (this is important given the size of the database).
- The methods are then applied to the data provided.
- Once the copy number of each gene, or genetic probe, has been inferred, a k-medoids analysis is carried out to see whether or not these CNVs successfully discriminate the three types of populations available.
- A multivariate analysis is then carried out on both the classified and the unclassified data, using the prior information on population membership.
- The kappa index of agreement and a contingency table are computed for both classifications.
- The results of each method are evaluated.
- The relationship between copy number variation (CNVs) and the discrimination of populations is established.
- In the future this could be extended to case-control studies in order to find the genes responsible for certain diseases, and thus to detect the genes most relevant to that discrimination.
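The agreement measures named in the objectives can be sketched as follows. This is a minimal illustration of a contingency table and Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels P1, P2, P3 and the two label vectors are hypothetical, and the sketch assumes the cluster labels have already been matched to the known populations.

```python
from collections import Counter

def contingency_table(labels_a, labels_b):
    """Count how often each pair (label in A, label in B) occurs."""
    return Counter(zip(labels_a, labels_b))

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two labelings of the same individuals."""
    n = len(labels_a)
    table = contingency_table(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: proportion of individuals on the diagonal.
    p_o = sum(table[(c, c)] for c in categories) / n
    # Chance agreement: product of the marginal proportions.
    row = Counter(labels_a)
    col = Counter(labels_b)
    p_e = sum(row[c] * col[c] for c in categories) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical known populations and cluster assignments for ten individuals.
known = ["P1", "P1", "P1", "P2", "P2", "P2", "P3", "P3", "P3", "P3"]
clustered = ["P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3", "P3", "P1"]

table = contingency_table(known, clustered)
kappa = cohen_kappa(known, clustered)
print(round(kappa, 3))  # 0.701
```

A kappa of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. The function is undefined when p_e equals 1, that is, when both labelings place every individual in the same single class.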
Structure of the Report
This report is the outcome of the internship project "Multivariate analysis to identify copy-number-variable genes associated with distinct populations", which lasted twenty (20) weeks. It comprises the following chapters:
The first sets out the objectives and phases that were planned in order to successfully develop a model that would solve the latent class problem; the second contains an overview of the PRBB (Parc de Recerca Biomèdica de Barcelona), one of the most important biomedical research centers in Spain and the European Community, where the internship project was carried out; the third explains the origin of the data (the HapMap project) and the preliminary processing applied by the PRBB researchers before the data are discretized and classified; the fourth introduces the theoretical foundations used in developing the multivariate model, as well as the theoretical concepts used in genetics; the fifth describes the methodology developed and applied; the sixth presents the results obtained; the seventh contains the conclusions and recommendations; the eighth contains the bibliographic references; and the ninth the appendix.
CHAPTER 1
THE COMPANY: THE PRBB
(PARC DE RECERCA BIOMÈDICA DE BARCELONA)
1.1. History and founding of the PRBB
The Barcelona Biomedical Research Park (Parc de Recerca Biomèdica de Barcelona, PRBB) was inaugurated in May 2006, after five years of construction and a period of some twenty years of work to build a scientific infrastructure capable of competing with the best European centers. The PRBB is a campus for the intensive production of knowledge in biomedicine and the health sciences, notable for its critical mass, its high-level research staff, and its international character.
It is one of the largest biomedical research hubs in southern Europe. The PRBB, an initiative of the Generalitat of Catalonia, Barcelona City Council, and the Universidad Pompeu Fabra (UPF), is a large scientific infrastructure, physically connected to the Hospital del Mar in Barcelona, that brings together six closely coordinated public research centers.
The centers that make up the PRBB seek to decipher the enigmas of life and the health problems of society. Their research staff stand out for their discoveries in the search for answers to today's major health problems, and for their contribution to a better quality of life and greater knowledge for humanity. The commitment is manifold: from generating new knowledge in the health and life sciences to transferring technology and knowledge to the business world. The activities, laboratories, and facilities are likewise open to any person, organization, or civic association wishing to see them from the inside.
Another very important commitment is the training of scientific personnel. Most of the people who work here are very young, and a good share of the nearly one thousand people in the PRBB building are not Spanish. In fact, more than thirty nationalities are represented, so English is the usual language in our seminars and scientific meetings.
A quick look at the website shows the main lines of research and the technology platforms. The site also has an internal search engine for exploring the scientific publications in depth, since the scientific results are public. The PRBB has also taken on collective commitments regarding the quality of its activities and the prevention of integrity problems in its research: the Code of Good Scientific Practice of the PRBB centers was created, and it is also publicly available. Fig. 1.1 shows the PRBB building.
Fig. 1.1 PRBB headquarters building
1.2. Centers that make up the PRBB

The PRBB scientific project brings together several independent institutions and research centers, each focused on a different aspect of biomedicine.
1.2.1. Centre for Research in Environmental Epidemiology (CREAL)

CREAL's research focuses above all on the environmental determinants of respiratory diseases and cancer, and on the early effects of environmental pollutants in the first years of children's lives. It is research with a very practical purpose, aimed at developing health protection policies that reduce disease and social disability due to environmental exposures. CREAL identifies the environmental determinants of health and promotes their prevention and control.
1.2.2. Hospital del Mar (IMAS)
The Hospital del Mar, which belongs to the Municipal Institute of Health Care (Instituto Municipal de Asistencia Sanitaria, IMAS), is a modern, active university and research hospital that treats conditions of medium and high complexity and has a long history and tradition of service to the city. Fig. 1.2 shows the Hospital del Mar.
Fig. 1.2 Hospital del Mar

1.2.3. Municipal Institute of Medical Research (IMIM)
The IMIM links basic research in a practical way with the clinical reality of the university hospital.
Research at this center is organized into five multidisciplinary programs, built around the following themes:
- Cancer
- Epidemiology and Public Health
- Inflammatory and cardiovascular processes
- Biomedical Informatics
- Neuropsychopharmacology
The scientific output of this research includes close to 400 publications a year in international journals indexed in the Science Citation Index (SCI) and the Social Sciences Citation Index (SSCI), and some 200 in Spanish national journals. This output places the IMIM-Hospital del Mar eighth in the ranking of centers with the highest scientific output in biomedicine in Spain, and fourth in Catalonia. The IMIM-Hospital del Mar is also the Spanish health research center that publishes the highest proportion of work in international collaboration.
1.2.4. Department of Experimental and Health Sciences of the Universidad Pompeu Fabra (CEXS-UPF):
The department invests in training future high-level scientists and offers an interdisciplinary doctoral program taught in English.
1.2.5. Centre for Genomic Regulation (CRG)
The Centre for Genomic Regulation (CRG) is a basic biomedical research center whose aim is to promote excellent basic research in biomedicine, especially in genomics and proteomics.
Its challenge is to understand the genomic basis of disease in order to improve quality of life.
1.2.6. Center of Regenerative Medicine in Barcelona (CMRB)
After the assisted reproduction law was passed in November 2003, it became possible in Spain to carry out research with frozen human embryos and with the stem cells derived from them.
The basic mission of the CMRB is to carry out research with human embryonic stem cells, as well as with various animal models, in order to understand:
- The basic mechanisms of early development and organogenesis.
- The application of cell lines derived from stem cells to diseases in which cells are lost (degenerative diseases), that is, regenerative medicine.
The aim is thus to understand the basic mechanisms of early development and organogenesis, and to find applications for the treatment of degenerative diseases.
1.2.7. Institute of High Technology (IAT)

Its mission is to offer the scientific community and the pharmaceutical industry molecular imaging services based on Positron Emission Tomography (PET) and Magnetic Resonance.
These technologies (PET and cellular imaging) make it possible to visualize biochemical processes in vivo for basic and clinical research.
CHAPTER 2
THEORETICAL FOUNDATIONS
2.1. Genetic notions and concepts:

Genetics is the science that studies biological inheritance, that is, the transmission of morphological and physiological traits from parents to offspring. Genetic information resides in the DNA molecule, which in turn is found in the chromosomes.
The body of every human being is made up of cells, each with 46 chromosomes arranged in 23 pairs of homologous chromosomes. Pairs 1 to 22 are the same in both sexes and are known as autosomes; pair 23 consists of the chromosomes that determine sex. Women have two X chromosomes, and men have one X chromosome and one Y chromosome. While all other cells have 46 chromosomes, the reproductive cells have only 23 unpaired chromosomes. When they combine (egg and sperm), they form a new cell with 46 chromosomes that gives rise to a human being who is genetically unique and whose design is determined by the father and the mother in equal parts. Fig. 2.3 shows a DNA strand.
Fig. 2.3
All human beings have approximately 30,000 genes. These are located at specific places called loci (singular: locus), and they determine the growth, development, and functioning of our biochemical and physical systems.
2.2. DNA (deoxyribonucleic acid)

DNA is located in the cell nucleus and is the genetic material that contains all the information for the phenotypic development of an individual. It is composed of two strands made up of nucleotides. The two strands coil into a double helix and are held together by hydrogen bonds between the nucleotide bases. The genetic information is encoded in the sequence along the molecule, which can make exact copies of itself through a process called replication, thereby passing the information on to daughter cells.
Fig. 2.4 Structure of DNA
2.3. Gene:
The concept of a gene varies with the type of phenomenon being described. If what matters is the transmission of information, or mutation, the unit regarded as a gene may be the nitrogenous base pair or the chromosome itself. In the context of evolution, the gene is the smallest unit capable of being selected. A gene can also be defined as a segment of DNA that contains the information needed to make a specific protein. It is also commonly understood as a hereditary factor that controls a trait, such as eye color, height, hair color, hereditary diseases, and probably many other things yet to be discovered.
2.4. Alleles:
An allele is each of the alternative forms that a gene can take, that is, its possible variants. Alleles differ in their sequence, and the differences may show up as changes in the function of the gene. Most mammals have two alleles of each gene (they are diploid), one from each parent, and each pair of alleles sits at the same locus, or position, on the chromosome.
Alleles can differ in sequence or in function. Those that vary in sequence have differences such as insertions, deletions, or substitutions of nucleotides. Alleles that differ in function may or may not have known sequence differences, but are assessed by the way they affect the organism.
According to their expression in the phenotype, alleles can be classified as:
- Dominant alleles: those expressed in the individual whether it is heterozygous (its chromosomes carry alleles with different information, one dominant and the other recessive) or homozygous (both alleles of the pair are the same).
- Recessive alleles: those that are masked in the phenotype of a heterozygous individual and are expressed only in an individual that is homozygous for the recessive allele.
2.5. Genetic polymorphism:

A genetic polymorphism is a variation in the sequence at a given site of DNA among the individuals of a population, for example an allelic variant produced when one nitrogenous base pair is replaced by a different one. In other words, it is the existence of multiple alleles of a gene in a population. To be considered a genetic polymorphism and not a mutation, the variant must be stably present in the population, with a frequency of at least 1%. Mutations are much less frequent and are usually associated with hereditary diseases. Polymorphisms are normally expressed as different phenotypes. Skin color, for example, is a polymorphism.
A polymorphism may be the substitution of a single nitrogenous base, for example replacing an A (adenine) with a C (cytosine), or it may be more complex, such as the repetition of a given DNA sequence, where a percentage of individuals carry a certain number of copies of that sequence.
Infrequent changes in the base sequence of DNA are not called polymorphisms, since they may be mutations.
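The 1% frequency criterion can be illustrated with a short sketch. It assumes one reading of the criterion, namely that a site counts as polymorphic when at least two alleles each reach the threshold. The genotype data and function names are hypothetical.

```python
from collections import Counter

POLYMORPHISM_THRESHOLD = 0.01  # the 1% criterion stated in the text

def allele_frequencies(genotypes):
    """Relative frequency of each allele in a list of diploid genotypes.

    Each genotype is a two-letter string such as "AC" (one letter per allele).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def is_polymorphic(genotypes, threshold=POLYMORPHISM_THRESHOLD):
    """True if at least two alleles each reach the frequency threshold."""
    freqs = allele_frequencies(genotypes)
    common = [a for a, f in freqs.items() if f >= threshold]
    return len(common) >= 2

# Hypothetical site where allele C is carried by several individuals.
common_site = ["AA"] * 190 + ["AC"] * 9 + ["CC"] * 1
# Hypothetical site where allele C appears in a single individual.
rare_site = ["AA"] * 499 + ["AC"] * 1

print(is_polymorphic(common_site))  # True: C has frequency 11/400 = 2.75%
print(is_polymorphic(rare_site))    # False: C has frequency 1/1000 = 0.1%
```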
There are several types of polymorphisms:
- RFLPs (restriction fragment length polymorphisms).
- SNPs (single nucleotide polymorphisms). SNPs make up as much as 90% of all human genomic variation and occur on average every 100 to 300 bases along the human genome. Two thirds of SNPs involve the substitution of a cytosine (C) by a thymine (T). These variations in the DNA sequence can affect how individuals respond to diseases, bacteria, viruses, chemicals, drugs, and so on.

SNPs located within a coding sequence of DNA may or may not change the resulting amino acid chain. If they do not change it, the SNP is called synonymous (a silent mutation); if they do, it is called non-synonymous. SNPs found in non-coding regions can still have consequences, for example for the binding of transcription factors or by altering the sequence of non-coding RNA. Fig. 2.5 gives a graphical representation of what happens in the DNA strand when an SNP occurs:
Fig. 2.5 Single nucleotide polymorphism
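The distinction between the two kinds of coding SNP can be sketched in a few lines, using the usual convention that a synonymous (silent) SNP leaves the encoded amino acid unchanged and a non-synonymous SNP changes it. The codon table below is only a small excerpt of the standard genetic code, and the function name is hypothetical.

```python
# Small excerpt of the standard genetic code (DNA codons, coding strand).
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GTA": "Val", "GTG": "Val",
    "CTT": "Leu", "CTC": "Leu",
}

def classify_snp(reference_codon, variant_codon):
    """Classify a single-base change inside a codon.

    Returns "synonymous" if both codons encode the same amino acid,
    and "non-synonymous" otherwise.
    """
    differing = sum(a != b for a, b in zip(reference_codon, variant_codon))
    if len(reference_codon) != 3 or len(variant_codon) != 3 or differing != 1:
        raise ValueError("codons must differ at exactly one of three positions")
    same_amino_acid = CODON_TABLE[reference_codon] == CODON_TABLE[variant_codon]
    return "synonymous" if same_amino_acid else "non-synonymous"

print(classify_snp("GAA", "GAG"))  # synonymous: Glu -> Glu
print(classify_snp("GAG", "GTG"))  # non-synonymous: Glu -> Val
```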
2.6. Las Variaciones en nmero de copias o CNV (copy number variations)
Anteriormente era pensado que los genes estaban casi siempre presentes en dos
copias en el genoma humano. Sin embargo descubrimientos recientes han revelado
que largos segmentos de AND pueden variar en el nmero de copias y estas
variaciones pueden derivar en desbalances. Por ejemplo, se han encontrado genes
que normalmente tienen dos copias, con una, tres o ms de tres copias, o incluso en
algunos casos con ninguna.
The differences in our DNA contribute to our uniqueness. These changes influence most traits, including susceptibility to certain diseases. SNPs were once thought to be the most important and prevalent form of variation in DNA, but current studies are revealing that CNVs comprise at least three times the total content of SNPs. Because CNVs often span entire genes, they are thought to play an important role in disease and in the response to drug treatments, as well as offering clues about the evolution of the human genome.
A CNV map is currently being built that is expected to transform medical research in four areas. The first and most important is the search for genes related to common diseases. The second is the study of familial genetic conditions. The third is the study of thousands of developmental defects caused by chromosomal rearrangements: CNV maps are being used to exclude the variation detected in unaffected individuals, which helps researchers pinpoint the exact region whose alteration may be responsible. Fourth, the data generated will also contribute to a more accurate and complete reference sequence of the human genome, which is used by all biomedical scientists.
A surprising finding was that approximately 12% of the human genome shows copy number variation, which suggests that CNVs are more common than previously thought. Around 2,900 genes, or 10% of those known, are spanned by CNVs. Some CNVs found in the general population can be millions of bases long and affect numerous genes without any observable consequence so far. About 2,000 CNVs have been described to date, and thousands more are suspected to exist. For comparison, a gene has approximately 60,000 bases, whereas around 100 CNVs were detected in each genome, with an average size of 250,000 bases.

Most CNVs are benign variants that do not directly cause disease. However, there are many instances in which CNVs affecting critical developmental genes do cause disease; for example, recent studies have listed 17 conditions of the nervous system alone (including Parkinson's disease) that can result from copy number variation.
As with any type of genetic variation, CNVs can vary in frequency and occurrence between populations, which tells us something about our shared history. As a result of our common origin, the great majority of CNVs (89%) are shared among the populations studied. Fig. 2.6 shows the main types of variation:
Fig. 2.6 Types of structural variation

Mathematical and statistical concepts:
Types of variables
2.6.1. Categorical variable:
Any variable whose values place elements into categories; the values are alphanumeric labels or names. Categorical variables can be:
a) Nominal variables: they assign names to the different forms the variable can take. Their possible values are mutually exclusive and have no natural ordering.
b) Ordinal variables: categorical variables whose values have an order. For these we can say that one element of the sample has more of a given characteristic than another, which allows the elements to be ranked. For example: good, fair, and poor.
2.6.2. Quantitative variable:
Variables for which the differences between elements of the sample can be expressed as quantities. Quantitative variables can be:
a) Discrete variables: those whose values are clearly separated from one another. A classic example is family size: a family can have 1, 2, 3, 4, 5, ... children, but not 2.5 or 4.75 children.
b) Continuous variables: also commonly called measurement variables, they can take any real value (integer, fractional, or irrational). Variables of this type are obtained mainly through measurement and are subject to the precision of the measuring instruments.
2.7. Contingency tables

A joint distribution, or contingency table, shows the relationship between two or more variables. Each entry of the table holds the number of cases or individuals that simultaneously have a given level of one factor and a given level of another. It is generally used with two or more categorical variables.
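As an illustration, a contingency table of two categorical variables can be built by counting how often each pair of levels occurs together. The following is a minimal sketch in Python; the function name `contingency_table` and the six example individuals are hypothetical and are not taken from the study data.

```python
from collections import Counter


def contingency_table(a, b):
    """Count how often each pair of levels (a_i, b_i) occurs together."""
    if len(a) != len(b):
        raise ValueError("both variables must have the same length")
    counts = Counter(zip(a, b))
    rows = sorted(set(a))
    cols = sorted(set(b))
    # One row per level of a, one column per level of b.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical data: cluster label and population of six individuals.
clusters = [1, 1, 2, 2, 3, 3]
populations = ["CEU", "CEU", "YRI", "CEU", "CHB", "CHB"]
rows, cols, table = contingency_table(clusters, populations)
for r, counts_row in zip(rows, table):
    print(r, counts_row)
```

Each row of the result corresponds to a cluster and each column to a population, so large off-diagonal counts indicate individuals grouped with a population other than their own.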
2.8. Normal or Gaussian distribution:

Its density function is given by:
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1) $$
where $\mu$ (mu) is the mean and $\sigma$ (sigma) is the standard deviation ($\sigma^2$ is the variance).
The density is bell-shaped and symmetric about the mean (see Fig. 2.7). It is widely used because many phenomena tend to behave approximately according to this distribution, which makes it a suitable model for a large number of statistical variables.
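The density in equation (1) can be evaluated directly. The sketch below is only illustrative; the function name `normal_pdf` is an assumption of this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# The density is symmetric about the mean and peaks there.
peak = normal_pdf(0.0, 0.0, 1.0)
left = normal_pdf(-1.5, 0.0, 1.0)
right = normal_pdf(1.5, 0.0, 1.0)
```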
Fig. 2.7
2.9. Kappa coefficient of agreement:
It measures the degree of agreement between two vectors with mutually exclusive categories. It is preferred over other agreement indices because it corrects for the share of agreement that may be due to chance; that is, it indicates how far the observed agreement exceeds the agreement that could be obtained by pure chance.
The kappa agreement index is defined as:
$$ \kappa = \frac{P_o - P_e}{1 - P_e} \qquad (2) $$
$P_o$, the proportion of observed agreement, is computed as:
$$ P_o = \frac{\text{number of agreements}}{\text{number of agreements} + \text{number of disagreements}} \qquad (3) $$
$P_e$, the proportion of agreement expected by chance, is computed as:
$$ P_e = \sum_{i=1}^{n} p_{i1}\, p_{i2} \qquad (4) $$
where:
n = number of categories
i = index of the category (from 1 to n)
$p_{i1}$ = proportion of occurrences of category i for observer 1
$p_{i2}$ = proportion of occurrences of category i for observer 2
With perfect agreement $\kappa$ equals 1; the denominator $1 - P_e$ represents the maximum possible agreement not attributable to chance.
The kappa coefficient was originally proposed by Cohen in 1960, which is why it is often called Cohen's kappa. It was initially defined for two raters and later generalized to more than two.
The following table can be used to interpret the $\kappa$ index:
kappa        degree of agreement
< 0          no agreement
0 - 0.2      negligible
0.2 - 0.4    low
0.4 - 0.6    moderate
0.6 - 0.8    good
0.8 - 1      very good
Table 1
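Equations (2) to (4) translate into a few lines of code. The following sketch computes Cohen's kappa for two label vectors; the function name `cohen_kappa` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two equally long vectors of categorical labels."""
    n = len(rater1)
    if n == 0 or n != len(rater2):
        raise ValueError("both vectors must be non-empty and of equal length")
    # Observed agreement P_o, equation (3).
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement P_e, equation (4): product of marginal proportions.
    f1, f2 = Counter(rater1), Counter(rater2)
    p_e = sum(f1[c] / n * f2[c] / n for c in set(f1) | set(f2))
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Toy example: P_o = 0.75 and P_e = 0.5, so kappa = 0.5 ("moderate" in Table 1).
r1 = ["a", "a", "b", "b"]
r2 = ["a", "b", "b", "b"]
kappa = cohen_kappa(r1, r2)
```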
2.10. Maximum likelihood estimation
Most statistical procedures assume that the data follow some mathematical model defined by an equation in which one or more parameters are unknown. This raises the problem of calculating or estimating those unknown parameters from the information obtained in a study designed for that purpose.
Maximum likelihood is one of the most versatile procedures for estimating the parameters of a probability distribution, since it can be applied in a wide range of situations.
Definition of the estimation problem:
Let $X = \{x_1, x_2, \dots, x_N\}$ be a sample that we believe follows a probability distribution $p(x \mid \Theta)$ with parameters $\Theta$. We want to estimate the parameters $\Theta^*$ that best fit the sample.
The likelihood function of the parameters given the sample is
$$ L(\Theta \mid X) = \prod_{i=1}^{N} p(x_i \mid \Theta) \qquad (5) $$
Then
$$ \Theta^* = \arg\max_{\Theta} L(\Theta \mid X) \qquad (6) $$
which is obtained by setting the derivative of the log-likelihood to zero:
$$ \frac{\partial \log L(\Theta \mid X)}{\partial \Theta} = 0 \qquad (7) $$
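For a single normal distribution, condition (7) has a well-known closed-form solution: the sample mean and the square root of the mean squared deviation (dividing by N, not N - 1). The sketch below illustrates this; the function names and the five data points are hypothetical.

```python
import math


def normal_log_likelihood(xs, mu, sigma):
    """Log of equation (5) for a normal model with parameters (mu, sigma)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in xs
    )


def normal_mle(xs):
    """Closed-form maximum likelihood estimates of mu and sigma."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by N, the ML estimate
    return mu, math.sqrt(var)


data = [1.0, 2.0, 3.0, 4.0, 5.0]
mu_hat, sigma_hat = normal_mle(data)
best = normal_log_likelihood(data, mu_hat, sigma_hat)
```

Evaluating `normal_log_likelihood` at any other pair of parameters gives a value no larger than `best`, which is what equation (6) states.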
2.11. Mixtures of distributions:

The variable x may come not from a single distribution but from a combination of several. In that case the weighted combination of the distributions is:
$$ p(x \mid \Theta) = \sum_{j=1}^{M} q_j\, p_j(x \mid \theta_j) \qquad (12) $$
with $\Theta = \{q_j, \theta_j\}$ and $\sum_{j=1}^{M} q_j = 1$.
The problem is that, when the maximum likelihood method is applied, the resulting log-likelihood
$$ \log L(\Theta \mid X) = \log \prod_{i=1}^{N} \sum_{j=1}^{M} q_j\, p_j(x_i \mid \theta_j) = \sum_{i=1}^{N} \log \sum_{j=1}^{M} q_j\, p_j(x_i \mid \theta_j) \qquad (13) $$
contains the logarithm of a sum, so the condition
$$ \frac{\partial \log L(\Theta \mid X)}{\partial \Theta} = 0 $$
cannot be solved analytically.
Figures 2.8 and 2.9 show an example of a data set that cannot be fitted adequately with a single normal distribution but is fitted very well by a mixture of two normals:
Fig. 2.8 Data that cannot be fitted with a single normal
Fig. 2.9 The same data set fitted with a mixture of several normals
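To make the comparison concrete, the sketch below evaluates the mixture density (12) and the log-likelihood (13) for normal components, and compares a single normal with a two-component mixture on a small bimodal sample. The data and all parameter values are invented for illustration; they are not the data shown in Figs. 2.8 and 2.9.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, mus, sigmas):
    """Equation (12) with normal components; weights must sum to 1."""
    return sum(q * normal_pdf(x, m, s) for q, m, s in zip(weights, mus, sigmas))


def mixture_log_likelihood(xs, weights, mus, sigmas):
    """Equation (13): sum over observations of the log of the mixture density."""
    return sum(math.log(mixture_pdf(x, weights, mus, sigmas)) for x in xs)


# Hypothetical bimodal sample with modes near -2 and 2.
xs = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
single = mixture_log_likelihood(xs, [1.0], [0.0], [2.0])
double = mixture_log_likelihood(xs, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1])
```

Here `double` is much larger than `single`, reflecting that one broad normal describes the two modes poorly.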
2.12. EM (Expectation Maximization):

EM is a general iterative technique for maximum likelihood estimation of parameters from data in which some information is hidden. It estimates the parameters that describe an underlying probability distribution. It also serves as a complementary analysis to standard hierarchical clustering techniques.
Definition:
Let $Z = (X, Y)$ be a data set in which the data X are observed but the data Y are hidden.
Then:
$$ p(z \mid \Theta) = p(x, y \mid \Theta) = p(x \mid y, \Theta)\, p(y \mid \Theta) \qquad (14) $$
In this case $L(\Theta \mid Z) = L(\Theta \mid X, Y)$ cannot be evaluated directly, because Y is unknown. Y is therefore treated as a random variable and the expectation is taken:
$$ Q(\Theta \mid \Theta^{g}) = \int \log L(\Theta \mid X, y)\, p(y \mid X, \Theta^{g})\, dy = E\left[\log p(x, y \mid \Theta) \mid X, \Theta^{g}\right] \qquad (15) $$
where $\Theta^{g}$ are the proposed parameters.
The EM algorithm then searches for the optimal parameters of $L(\Theta \mid Z) = L(\Theta \mid X, Y)$ in two steps:

E step (expectation): the expectation of the log-likelihood is computed with respect to the hidden data, given the observed data and the current proposed parameters:
$$ Q(\Theta \mid \Theta^{(t)}) = E\left[\log p(x, y \mid \Theta) \mid X, \Theta^{(t)}\right] \qquad (16) $$
M step (maximization): the function Q computed in the E step is maximized with respect to the parameters $\Theta$:
$$ \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)}) \qquad (17) $$
These two steps are repeated until convergence is reached.
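For a one-dimensional mixture of normals, the two steps take a simple form: the E step computes, for each observation, the probability of belonging to each component, and the M step re-estimates weights, means, and standard deviations from those probabilities. The following is a minimal sketch under those assumptions. The function name `em_gaussian_mixture`, the fixed number of iterations, the `min_sigma` safeguard, and the data are all choices of this example, not a description of the R function used in the study.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(xs, mus, sigmas, weights, n_iter=100, min_sigma=1e-3):
    """EM for a 1-D normal mixture, starting from the given initial parameters."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    n, m = len(xs), len(mus)
    for _ in range(n_iter):
        # E step: posterior probability of each component for each observation.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(m)]
            total = sum(joint)
            if total == 0.0:
                raise ValueError("observation has zero density under every component")
            resp.append([v / total for v in joint])
        # M step: weighted re-estimation of the parameters.
        for j in range(m):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)  # avoid collapsing to zero
    return mus, sigmas, weights


# Hypothetical intensities clustered around 1 and 3.
xs = [0.9, 1.0, 1.1, 1.0, 2.9, 3.0, 3.1, 3.0]
mus, sigmas, weights = em_gaussian_mixture(xs, [0.5, 3.5], [1.0, 1.0], [0.5, 0.5])
```

A production implementation would normally stop when the log-likelihood no longer improves rather than after a fixed number of iterations.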
2.13. Cluster analysis (clustering)

Cluster analysis is a multivariate technique that seeks to group variables or observations into classes. The classification can be hierarchical or non-hierarchical. In either case the algorithms use the distance between observations as their measure, and the resulting groups consist of elements that are as close to one another as possible while the groups themselves are as far apart as possible.
Cluster analysis can be:
2.13.1. Unsupervised:
These techniques group the data according to a distance, without using any external information to organize the groups. Depending on how the data are grouped, two types of clustering can be distinguished:
a) Non-hierarchical cluster analysis

This approach creates mutually independent groups with maximum distance between them. Each observation is assigned to exactly one group, and the observations within each group are as close together as possible.
The algorithm normally starts by computing the distance matrix for a number of clusters chosen beforehand by the user, and then iteratively reassigns the observations to the different groups until the internal dispersion of each cluster is minimized.
The best-known non-hierarchical clustering algorithms in genetic studies are:
K-means: an algorithm that partitions n observations into K groups based on their attributes. It starts from a sample of K observations chosen at random from the original data matrix, which are used as the initial centroids of the K clusters to be formed. The distances from these centroids to every other observation are computed, and each observation is assigned to its nearest centroid. Each centroid is then replaced by the mean of the observations assigned to it, the distances are recomputed, and the process is repeated.
The goal is to minimize the dissimilarity of elements within each cluster and to maximize the dissimilarity of elements that fall in different clusters.
The algorithm is as follows:
1. Take as input a data set S and the number of clusters K to form.
2. Select the initial centroids of the K groups: $c_1, c_2, \dots, c_K$.
3. Assign each observation $x_i$ in S to the cluster $C(i)$ whose centroid is closest to $x_i$, that is, $C(i) = \arg\min_{1 \le k \le K} \lVert x_i - c_k \rVert$.
4. For each cluster, recompute its centroid from the elements it contains, which minimizes the within-cluster sum of squares:
$$ WSS = \sum_{k=1}^{K} \sum_{C(i)=k} \lVert x_i - c_k \rVert^2 \qquad (18) $$
5. Return to step 3 until convergence is reached.
6. The final output is a list indicating the cluster to which each observation belongs.
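The steps above can be sketched as follows. To keep the example reproducible, the initial centroids are passed in as an argument instead of being drawn at random as in step 2; the function name `kmeans` and the four points are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """K-means on tuples of coordinates, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each observation to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # step 5: no reassignment means convergence
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Because the result depends on the starting centroids, it is common to run the algorithm from several random starts and keep the solution with the smallest WSS.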
K-medoids:
This algorithm is a more robust version of k-means in which the cluster centers are chosen from among the data points themselves. Like k-means, it partitions the data into groups and then minimizes the distance between each center and the observations assigned to it. It has the advantage of being less sensitive to extreme values.
A medoid can be defined as the most centrally located point within a group of data.
The algorithm is as follows:
1. Start with a number k of medoids (k < n) chosen by the user and placed arbitrarily among the data points.
2. Assign each observation to the medoid to which it is most similar. Here similarity is defined through a distance (Euclidean, Manhattan, or Minkowski).
3. Randomly select a new set of medoids.
4. Compute the cost C of replacing the previous set of medoids with the new one.
5. If C > 0, keep the previous set of medoids; if C < 0, adopt the new set of medoids and recompute the groups.
6. Repeat steps 2 to 5 until the medoids no longer change.
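The sketch below illustrates the idea of swapping medoids when the cost decreases. Note that it is a deterministic variant: instead of proposing a random new set of medoids as in step 3, it tries every single swap of a medoid with a non-medoid and accepts any swap with negative cost C. The function names, the use of the Euclidean distance, and the data are assumptions of this example.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, medoids):
    """Greedy swap search for medoids, starting from the given point indices."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:  # swap cost C = cost - best is negative
                    medoids, best, improved = trial, cost, True
    # Final assignment of each observation to its nearest medoid.
    labels = [
        min(range(len(medoids)), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids, labels = k_medoids(points, [0, 1])
```

Starting from two medoids in the same group, the search ends with one medoid at the center of each group (the points with indices 1 and 4).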
SOM (Self-Organising Maps):
An implementation of neural networks. The algorithm iteratively brings together the patterns that are most alike and moves them away from those that are most different. It is considered more reliable and robust for working with large amounts of noisy data.
One drawback of non-hierarchical methods is that, because they do not produce a dendrogram, they give no picture of the spatial arrangement of the data, which would allow a more intuitive analysis.
b) Hierarchical cluster analysis

These methods are based on a distance matrix. They start with small groups that share a common expression pattern and then build a dendrogram, a tree-shaped graphical representation of the relationships based on closeness or similarity between observations, which is constructed sequentially.
The tree establishes an ordered relationship among the groups defined so far, and the length of its branches gives an idea of the distance between its nodes. All methods follow the same general strategy: each observation is placed in its own node; the two closest observations (genes) are found and joined into a cluster; and the distance matrix is recomputed, replacing the two merged patterns with their average. The user always chooses the method and the type of distance. Fig. 2.10 shows an example of a dendrogram:
Fig. 2.10 Dendrogram
2.13.2. Supervised clustering:

This type of clustering requires prior information about the data. In most biological samples, for example, preliminary information is available that can be used to group new data into clusters. The supervised method learns from this prior information, generally given as a training data set, from which it extracts the rule for classifying new data. Notable methods of this kind include:
SVM (Support Vector Machines): a technique that uses hyperplanes to separate the data, as positive or negative points, in a multidimensional space.
Perceptrons or neural networks: they can discriminate between several different classes and classify many samples at once.
2.13.3. Distances used in the different clustering methods

Let
$$ x_i = (x_{i1}, x_{i2}, \dots, x_{im}), \qquad x_j = (x_{j1}, x_{j2}, \dots, x_{jm}) $$
be the observations in rows i and j of a matrix $X_{n \times m}$. The following distances can be defined:

a) Euclidean distance
Given by:
$$ d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2} = \sqrt{(x_i - x_j)^T (x_i - x_j)} \qquad (19) $$
b) Manhattan distance
Given by:
$$ b_{ij} = \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert \qquad (20) $$
c) Minkowski distance
Given, for an exponent $p \ge 1$, by:
$$ m_{ij} = \left( \sum_{k=1}^{m} \lvert x_{ik} - x_{jk} \rvert^{p} \right)^{1/p} \qquad (21) $$
d) Supremum distance
Given by:
$$ s_{ij} = \sup_{k} \lvert x_{ik} - x_{jk} \rvert \qquad (22) $$
e) Canberra distance
Given by:
$$ c_{ij} = \sum_{k=1}^{m} \frac{\lvert x_{ik} - x_{jk} \rvert}{\lvert x_{ik} + x_{jk} \rvert} \qquad (23) $$
f) Binary distance
Used when the data are binary, that is, zeros and ones. It is computed by counting the bits that differ between $x_i$ and $x_j$, considering only the positions in which at least one of the two bits is nonzero.
g) Ward distance
Also known as the incremental sum of squares. The proximity measure between groups a and b, whose union is group c, is given by:
$$ SSW_c - (SSW_a + SSW_b) $$
where
$$ SSW_i = \sum_{k=1}^{n_i} \sum_{h=1}^{m} (x_{ihk} - \bar{x}_{ih\cdot})^2, \qquad i = a, b, c \qquad (24) $$
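The pointwise distances (19) to (23) can be written compactly for two observations given as sequences of equal length. This is an illustrative sketch: the function names are assumptions, the Minkowski exponent is called `p`, and the Canberra function skips terms whose denominator is zero, which is a convention chosen for this example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """Minkowski distance of order p; p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Terms with a zero denominator are skipped (assumption of this sketch).
    return sum(abs(a - b) / abs(a + b) for a, b in zip(x, y) if a + b != 0)


x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
d_euclid = euclidean(x, y)   # sqrt(9 + 16 + 0)
d_manhattan = manhattan(x, y)  # 3 + 4 + 0
d_sup = supremum(x, y)       # largest coordinate difference
d_canberra = canberra(x, y)  # 3/5 + 4/8 + 0/6
```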
2.14. Discriminant analysis:

Discriminant analysis is a multivariate statistical technique for analysis and classification. It establishes the differences between groups and identifies the variables that best discriminate between two or more previously defined groups. It starts from a sample of N subjects on which P independent variables have been measured; these variables are used to decide the group into which each subject is classified.
To carry out a discriminant analysis, each subject must have scores on one or more quantitative independent variables and a value identifying it as a member of one of the groups. Discriminant analysis can be said to determine whether the groups differ in the means of the variables, and to use those variables to predict which group each member belongs to.
Its main use is to describe the differences between groups on the basis of the values that certain variables take for the individuals in each group. It also helps to classify new individuals into one of the existing groups according to the values those variables take for them, that is, by identifying and using the variables that appear to be the best predictors for each group. It is also used to identify correlations between variables, as well as cause-and-effect relationships.
Discriminant analysis can be divided into:
2.14.1. Descriptive discriminant analysis:
Descriptive discriminant analysis detects the variables that best differentiate the previously defined groups.
2.14.2. Predictive discriminant analysis:

Predictive discriminant analysis determines the group to which each subject belongs. The aim of the classification is to predict the members of each group as accurately as possible.
The discriminant function uses a combination of the values of the predictor variables to classify an object or subject into one of the groups of the criterion variable. It is a derived variable, defined as the weighted sum of the values of the predictor variables. A cutoff score is also used: subjects whose discriminant score is above the cutoff are classified into one group, and those whose score is below it are classified into the other.
The concept of a discriminant function also applies when there are more than two groups. In that case more than one discriminant function is computed to decide the classification of the subjects. The discriminant function seeks to reduce the number of classification errors.
Assumptions:
1. Normality: the data are assumed to be a sample from a multivariate normal distribution. If this assumption does not hold, the resulting values may be invalid.
2. Homogeneity of variances/covariances: the variance/covariance matrices are assumed to be homogeneous across groups. If this assumption does not hold, the results will not be valid.
3. Random sampling: the sample is assumed to be chosen at random, and the scores on a variable are assumed to be independent between subjects. If this criterion is not met, the significance results are not reliable.
4. Correct classification: every observation in the initial classification must be correctly classified.
The advantages of discriminant analysis are that it can identify variables related to a criterion variable; that, once the values of the predictor variables are known, the values of the criterion variable can be predicted; that it identifies the variables that best discriminate between groups; and that it can identify relationships between variables, as well as cause-and-effect relationships.
Its disadvantages are that the analysis involves many complex steps, that it requires a large sample, and that classification errors remain possible.
CHAPTER 3
THE HAPMAP PROJECT AND
DATA PRE-PROCESSING
3.1. The HapMap project:
The HapMap project is an initiative launched in October 2002 to create a catalog of the genetic variants that occur most commonly in human beings, about which relatively little was known. It was created to describe what these variants are and how they are distributed across the different populations and regions of the world. The project itself does not use the collected information in studies relating genes to diseases, but it is designed to provide information that other researchers can use for that purpose, with the aim of developing new methods of prevention, diagnosis, and treatment.
It is an effort by several countries to identify and catalog genetic similarities and differences in human beings: a collaboration among scientists from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. All the information generated by the project is publicly accessible.
The purpose is to compare the genetic sequences of different individuals in order to identify regions of variable sequence that are shared or frequent. Making these data freely accessible helps biomedical researchers find the genes related to certain diseases, as well as to the response to certain therapeutic drugs.
In the initial phase of the project, information was collected from four populations with African, Asian, and European ancestry. The interaction with the members of these populations provides valuable experience in conducting research with identified populations.
Public and private organizations in six countries are participating in the project. The data can be downloaded with minimal restrictions; that is, they are available to almost any researcher who needs them.
Because of the history of the human species, most of the genetic variations or haplotypes found on the chromosomes occur in all human populations. However, some variations may be more common in some populations than in others, and the most recent variations may be found, for example, in only one population. To choose SNPs efficiently it is necessary to look at haplotype frequencies in multiple populations. This also improves the genetic data for more than one population, as well as the ability of researchers to detect the genetic contribution to diseases that are more or less prevalent in different groups.
3.2. Data pre-processing:
The database used in the present study comes from a previous study carried out at the PRBB, whose aim was to select, from the large amount of HapMap data, a smaller set of data concentrating a high number of copy number variations.
The resulting DNA samples comprise 270 people in total, divided into three populations: the Yoruba people of Ibadan, Nigeria, who provide a set of 90 samples; the Japanese and Chinese, comprising 45 samples from Tokyo and 45 from Beijing; and finally 90 samples from the United States and Europe. The samples cannot be linked in any way to the participants, that is, no names, addresses, or personal data are known, but the sex and origin of each individual are known. For more information see ANNEX C.
CHAPTER 4
METHODOLOGY
4.1. First phase: the HapMap project and data pre-processing
This phase comprises obtaining the data and the preliminary work done by the PRBB researchers to obtain a smaller group of genes concentrating a large number of SNPs and CNVs, in order to facilitate the testing and development of the method proposed later.
4.2. Second phase: classification of the data by copy number (Gaussian mixture model)
Our study uses a selection of 144 genes (or genetic markers) for 272 individuals, stored in a matrix $X_{272 \times 144}$ of values $x_{ij}$, where $x_{ij}$ represents the intensity of gene j in individual i.
For each genetic probe we want to classify the individuals into a number of classes C using the continuous variable x, taking into account the variability of the data in each case. Some probes clearly show the peaks that separate the classes present, while for others the classes are harder to infer by eye, as Fig. 4.11 shows:
Fig. 4.11
We want to model the underlying variable C using the observed variable x. To do so, we propose a finite mixture model with C components:
(25)
in which $N(x \mid \theta)$ is the component distribution, $c = 1, \dots, C$ indexes the classes, and $\theta = (\mu_c, \sigma_c)$ denotes the parameters (mean and variance) of the normal components of the mixture.
4.2.1. Parameter estimation:

To estimate the parameters of the mixture we can use the EM algorithm (Expectation Maximization, implemented in the R function mix), since it is designed for maximum likelihood estimation of the parameters that describe an underlying probability distribution, as is the case here.
The algorithm yields the mean and variance of each of the normal distributions fitted to our data.
We then compute the probability that the observation for individual i on the probe belongs to class j:
(26)
4.2.2. Class selection

These probabilities are then used to segment the data, assigning to each individual the copy number with the highest probability:
        C1        C2      ...   Cc
x1      0.0001    0.95    ...   0.0007
...
xm      0.00098   0.98    ...   0.0007
Table 2
For example, in Fig. 4.12 the observation a would be assigned to class C2, since its probability of belonging to C1 is, in this case, far smaller.
Fig. 4.12 Classification of point a in a two-component mixture model (components C1 and C2)
This is done for every point, so that each observation is assigned its corresponding class, as shown in Fig. 4.13, where the navy blue points have zero copies of the gene, the light blue points have one copy, and the red points have two copies of the gene in question:
Fig. 4.13 Plot of a genetic probe classified according to the distribution of its data.
4.2.3. Problem near zero

In some cases, when there are individuals with zero copies of a gene, the intensities obtained are very close to zero and the parameter estimation fails. The intensities of individuals with zero copies are not normally distributed, since nobody has a negative number of copies of a gene.
In this case we can add a threshold parameter chosen by the user. As Fig. 4.14 shows, all observations below this parameter are considered to have zero copies:
Fig. 4.14 Classification of a probe with the new parameter
That is, class 1, corresponding to zero copies, is assigned to all values below the threshold.
4.2.4. Resulting model

Our initial model then becomes:
(27)
where the threshold is given by the user and the indicator function appears, with the value given in
(28)
since class c1 is being forced by setting all data below the threshold to zero.
In this way each observation is assigned its corresponding class.
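The assignment rule described in this section can be sketched as follows: values below the threshold are called zero copies, and every other value is assigned to the normal component with the highest posterior probability. This is only an illustration of the idea, not the model of equations (27) and (28). The function names, the fitted parameters, and the mapping of components to one and two copies are all hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def class_posteriors(x, weights, mus, sigmas):
    """Posterior probability of each normal component given the intensity x."""
    joint = [q * normal_pdf(x, m, s) for q, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [v / total for v in joint]


def assign_copy_number(x, weights, mus, sigmas, copy_numbers, threshold):
    """Zero copies below the threshold, otherwise the most probable component."""
    if x < threshold:
        return 0
    post = class_posteriors(x, weights, mus, sigmas)
    best = max(range(len(post)), key=post.__getitem__)
    return copy_numbers[best]


# Hypothetical fitted parameters for the one-copy and two-copy components.
weights, mus, sigmas = [0.4, 0.6], [0.5, 1.0], [0.08, 0.08]
calls = [
    assign_copy_number(x, weights, mus, sigmas, [1, 2], threshold=0.2)
    for x in (0.05, 0.52, 0.97)
]
```

Applying `assign_copy_number` to every entry of a probe's intensity column would produce one column of the integer matrix of copy numbers used in the clustering phase.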
4.3. Third phase: clustering
4.3.1. Classification of the individuals by copy number
Once we have the data matrix recording how many copies of each gene each individual has (see Fig. 4.15), we perform a cluster analysis to see how the individuals group together according to copy number. That is, we want to see whether there is a relationship that allows individuals with different copy numbers of each gene to be classified.
Fig. 4.15 Matrix of genetic probes classified by copy number
4.3.2. Classification of the individuals by copy number (unsupervised clustering)
For this we use the pam function of the R library cluster, which implements the k-medoids algorithm (a more robust version of k-means). The clustering is performed both on the original, unclassified data and on the new per-class data obtained with the mixture model described above.
This allows us to compare the classification of the individuals before and after the process. In our case we know that the samples come from three groups or populations of different ancestry, and we want to see whether variations in copy number allow these populations to be told apart.
To compare the results we build a contingency table of the classes obtained from the k-medoids clustering against the vector recording the population to which each individual belongs, and we compute a kappa coefficient, which measures the agreement between the two.
We then perform a discriminant analysis in which the classification (clustering) of the individuals is recomputed in a supervised manner, using the R function discrimin. The individuals are classified first according to the intensity $x_{ij}$ of each genetic probe and then according to the copy number $c_{ij}$ of each gene, and in each case the result is compared with the original groups (CEU, YRI, CHB). This gives an idea of how accurate the results are after each probe's intensities have been converted to copy numbers.
Discriminant analysis will subsequently make it possible to identify the variables that discriminate between two or more previously defined groups and to establish the differences between those groups. The idea is to identify the genes that are relevant in differentiating the populations. This phase is still under study, since the aim is to apply the method to differentiate healthy from diseased populations and to discover, for example, the genes related to a given disease or pathology.
CHAPTER 5
RESULTS
5.1. Gaussian mixtures:
An experimental data matrix was used, with information on 144 genetic markers (columns) for 272 individuals (rows) belonging to three populations, to which we applied the normal mixture model described in the previous chapter.
Fig. 5.16 shows the classification of one of the 144 probes into the classes 0, 1, 2, and 3 copies of the gene. For these data the threshold was 0.2.
Fig. 5.16 Classification of a probe.
This classification is carried out for each probe, and the result is a matrix of 144 genetic markers (columns) by 272 individuals (rows) with class values $c_{ij}$ = 0, 1, 2, or 3.
5.2. Classification of the populations (unsupervised clustering):
The resulting matrix of integer data is then used to run a k-medoids clustering, which
assigns each individual to a group according to its copy-number pattern across genes. This
result is then compared with the one obtained for the original data matrix, which is not
classified by copy number. The result of this analysis is shown in Fig. 5.17: the upper
half corresponds to the initial matrix and the lower half to the classified data obtained
with the mixture classification.
Fig. 5.17 Result of the k-medoids analysis
The k-medoids analysis of the original matrix suggests that the best classification is
obtained with k = 2 or 3, that is, with 2 or 3 populations, but it cannot be said which of
the two is better. For the classified data, in contrast, the best classification is clearly
obtained with k = 3 populations.
We know a priori that the data come from three populations. The classification obtained
is therefore better for the discrete data, that is, for the matrix obtained with the latent
class model, since the other one does not discern clearly between two and three
populations.
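The comparison between values of k can be sketched as follows, assuming that the criterion is the average silhouette width reported for `pam` objects (the text does not name the criterion explicitly). The simulated copy-number matrix `xmc_demo` and its three group profiles are invented for the example.

```r
library(cluster)  # provides daisy() and pam()

set.seed(2)
# Simulated copy-number matrix: 3 groups of 20 individuals, 12 probes.
# Each group has its own typical profile (an assumption for this demo).
profiles <- rbind(rep(c(1, 2), 6), rep(c(2, 3), 6), rep(c(0, 2), 6))
groups <- rep(1:3, each = 20)
xmc_demo <- profiles[groups, ]
# Replace about 10% of the entries with a random class to add noise
noise <- matrix(runif(length(xmc_demo)) < 0.1, nrow = nrow(xmc_demo))
xmc_demo[noise] <- sample(0:3, sum(noise), replace = TRUE)

d <- daisy(as.data.frame(xmc_demo), metric = "gower")
avg_sil <- sapply(2:4, function(k) pam(d, k = k)$silinfo$avg.width)
names(avg_sil) <- paste0("k=", 2:4)
avg_sil  # the k with the largest average silhouette width would be preferred
```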
5.3. Discriminant analysis (supervised clustering):
5.3.1. Supervised clustering of the classified data:
Here we show the clusters obtained with the function discrimin for the data cij produced
by the normal mixture model, compared with the true classification of the populations.
Fig. 5.18 shows that in both cases the clusters are perfectly separated.
Fig. 5.18 Groups obtained when classifying the data produced by the normal mixture model
using discriminant analysis. Panels: groups 1, 2 and 3 (discrimin); original populations
CEU, CHB and YRI.
5.3.2. Supervised clustering of the original data:
Here we show the clusters obtained with the function discrimin for the original data xij,
compared with the true classification of the populations. In this case the groups are not
as well separated, which suggests that the data cij give a better classification.
Fig. 5.19 Groups obtained when classifying the original data using discriminant analysis.
Panels: groups 1, 2 and 3 (discrimin); original populations CEU, CHB and YRI.
5.4. Comparison of the classifications:
To compare the classification obtained with the true one, we build a contingency table:
Table 3. Contingency tables of the classification obtained against the true populations
(columns CE, CH, YRI)
Data cij: Kappa coefficient (2*PA-1) = 0.793103
Data xij: Kappa coefficient (2*PA-1) = 0.739464
Cell counts, in the order extracted: 53, 16, 128, 117, 56, 48, 25, 58
The table shows that most of the data lie on the diagonal. This means not only that
having 3 groups is appropriate, but also that the individuals are quite well classified.
This is corroborated by the kappa concordance coefficient, which is very high in both
cases but is higher for the data matrix obtained with the mixture model.
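A minimal sketch of how the agreement measures can be computed from a contingency table in base R is given below. The counts are invented for illustration and are not the values of Table 3; the sketch also assumes that the cluster labels have already been matched to the populations, since labels returned by a clustering are arbitrary.

```r
# Hypothetical 3 x 3 contingency table (clusters in rows, populations in columns)
tab <- matrix(c(50,  3,  2,
                 4, 45,  1,
                 2,  3, 40),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              pop = c("CEU", "CHB", "YRI")))

n  <- sum(tab)
pa <- sum(diag(tab)) / n                       # observed agreement PA
pe <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
two_pa_minus_1 <- 2 * pa - 1                   # quantity reported as (2*PA-1)
cohen_kappa    <- (pa - pe) / (1 - pe)         # Cohen's kappa
c(PA = pa, two_PA_minus_1 = two_pa_minus_1, cohen = cohen_kappa)
```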
CONCLUSIONS AND FUTURE WORK
We found that the proposed Gaussian mixture model determines the number of copies
adequately and with almost no loss of information. In fact, the discrimination of
populations in both cases (with or without the a priori information) gives better results
with the discrete data, that is, the copy numbers obtained with the proposed Gaussian
mixture model, than with the continuous data of the original intensity matrix.
After carrying out the comparisons, that is, after classifying the data with the mixture
model and clustering the data with SNPs and CNVs, we see that these variations are useful
for discriminating populations, in this case people of different races. This points to the
amount of information that could be obtained by studying these recently discovered genetic
variations, about which very little is known so far. In fact, only about 10% of the
function of the simplest genetic variations, such as eye, skin or hair colour, is known.
A next step would be to check whether these SNPs and CNVs make it possible to
differentiate healthy populations from diseased ones: with allergies, with different types
of resistance to HIV, cancer, resistance to certain drugs, and countless other biological
responses that so far are only handled by trial and error.
If so, this methodology could be useful not only for differentiating these populations
but also for determining the CNVs and SNPs associated with the most complex diseases that
have affected humans.
REFERENCES
[1] Kotler P., Jain D. C. and Maesincee S. El marketing se mueve, Paidós, 2002.
[2] Da Costa J. Diccionario de mercadeo directo inglés-español, Panapo, 1996.
[3] [4] Picón E. Segmentación de mercados, Aspectos estratégicos y metodológicos, Prentice Hall, 2004.
[5] Schiffman L. G. and Kanuk L. L. Comportamiento del Consumidor, Editorial Prentice Hall, 1997.
[6] Pérez C. Técnicas de Análisis Multivariante de Datos, Aplicaciones con SPSS, Prentice Hall, 2004.
[7] Johnson D. E. Métodos Multivariados aplicados al análisis de datos, International Thomson Editores, 1998.
APPENDICES
Annex A. Program code

1) Mixdist datos enteros XMC.R
Solymar Peraza Crespo
Dec: 2008
R v7.1
# Program that discretises the matrix of continuous data containing the
# genetic markers of one or several populations by means of clustering,
# using the function <<mix>> from the library <<mixdist>>
# of the R program (freeware)
# Required files: data.dum, asignaClase.R and search.threshold.R
# Output files: XMC.dum
library(mixdist)
# Data preparation:
# Loading the data
rm(list=ls())                              # Remove all leftover variables
data.dum<-file.choose();source(data.dum)   # Loads the data matrix and the populations
# Removal of individuals with missing markers (NA)
datos<-na.exclude(datos)
# Datos[167,108]=NA outlier
# par(mfrow = c(1, 1),bg="lavenderblush")
# boxplot(datos,main='Boxplot de datos sin NA')
datos[161,108]=NA                          # Removal of the outlier
datos<-na.exclude(datos)
# boxplot(datos,main='Boxplot de datos sin NAs y datos atipicos')

# Initialisations:
indi<-nrow(datos)       # Rows: individuals
marc<-ncol(datos)       # Columns: markers
xmc<-datos              # Matrix that will hold the classification
datosnorm<-datos        # Matrix that will hold the normalised data
vec<-c(0,0.2,0.75,1.25,1.75,2.25,2.75,Inf)   # Partition of the data
cop<-c(0, 1,2, 3, 4, 5, 6)                   # Number of copies
cen<-c(0.1,0.5,1,1.5,2,2.5)                  # Values on which the groups "grup" are centred
th=0.2                  # Threshold: everything below it is taken as 0 copies
# Load the helper functions from their files
asignaClase.R<-file.choose();source(asignaClase.R)
search.threshold.R<-file.choose();source(search.threshold.R)
# <Main loop: normalises, decides which partition to use and classifies the data>
suppressWarnings(
for (i in 1:marc){
datosnorm[,i]<-datos[,i]/mean(datos[which(datos[,i]>0.5),i])   # Normalise the data by their mean
x<-datosnorm[,i]
# Subgroups by range
xx<-cut(x,vec)          # Cut the vector into the intervals of vec
# 7 Levels: (0,0.2] (0.2,0.75] (0.75,1.25] (1.25,1.75] (1.75,2.25] (2.25,2.75] (2.75,Inf]
tt<-table(xx)           # Number of values in each of the 7 levels
poc<-which((1<=tt)&(tt<=10))   # Intervals with few values
much<-which(tt>10)             # Intervals with many values
xpm=c();b=0;            # Initialisation
## 1) Data close to zero (in the interval [0,th)):
p0<-which(x<th)         # Positions of the values in the interval [0,th)
if (length(p0)!=0) {xmc[p0,i]<-0}
# Drop the first interval from much when it starts at zero and holds many
# values, so that it is not used in step 2)
if (length(much)>0 && vec[much[1]]==0){much<-much[-1]}
## 2) High-density data: data used in the mixture
##    This step builds a set containing only the intervals with the highest
##    density of data, on which the mixture is then fitted:
for (j in seq_along(much)){
a<-which((x>vec[much[j]])&(x<=vec[much[j]+1]))   # Positions of x in a dense interval
xpm<-c(xpm,a)
}
p1<-sort(xpm)           # Sorted positions
xad<-x[p1]              # High-density vector
yy<-mixgroup(xad)       # The mixture is fitted only to the high-density data
grup<-cen[much]         # Values on which the groups are centred (taken from cen)
# (many values at zero are not a problem, because that was handled in the if above)
num<-length(grup)       # Number of means in the mixture
par.ini<-mixparam(grup, rep(0.1,num))
res<-mix(yy,par.ini)
count=i
xmc[p1,i] <-asignaClase(res,xad)   # Store the classification only at the high-density positions
## A mixture component with a larger standard deviation than another one
## sometimes causes misclassification;
## to fix this we use the function search.threshold()
th2=search.threshold(res,xmc[p1,i])
if (length(th2)>=2){
for (j in 1:length(xad)){
for (k in 2:length(th2)){
if (xad[j]<=th2[1]) xmc[p1[j],i]=1;
if ((xad[j]>=th2[k-1])&(xad[j]<=th2[k])) xmc[p1[j],i]=k;
if (xad[j]>=th2[k]) xmc[p1[j],i]=k+1;
}
}
}
if (min(grup)>0.5){xmc[p1,i]=xmc[p1,i]+which(cen==min(grup))-2}
# Corrects the problem of, for example, assigning class 1
# to values that lie above 0.5
## 3) Low-density data: classification.
if (length(poc)!=0){
for (j in seq_along(poc)){
a<-which((x>vec[poc[j]])&(x<=vec[poc[j]+1]))   # Positions of x in a sparse interval
# Assign (by hand) the corresponding number of copies:
# intervals centred on c(0,0.5,1,1.5,2,2.5) map to c(0,1,2,3,4,5) copies.
# Writing straight into xmc leaves x untouched, so a value that has already
# been assigned cannot fall into the next interval and be assigned again.
xmc[a,i]<-poc[j]-1
}
}
}
)
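The threshold-based reassignment inside the main loop can also be written in vectorised form. The sketch below compares a loop with the same structure as the listing against `findInterval()`, assuming that the thresholds are sorted in increasing order and that no intensity coincides exactly with a threshold. The function names, the thresholds and the simulated intensities are hypothetical.

```r
# Loop-based assignment, following the structure of the listing
assign_loop <- function(xad, th2) {
  cls <- integer(length(xad))
  for (j in seq_along(xad)) {
    for (k in 2:length(th2)) {
      if (xad[j] <= th2[1]) cls[j] <- 1L
      if (xad[j] >= th2[k - 1] && xad[j] <= th2[k]) cls[j] <- k
      if (xad[j] >= th2[k]) cls[j] <- k + 1L
    }
  }
  cls
}

# Vectorised counterpart: number of thresholds not exceeding each value, plus one
assign_vec <- function(xad, th2) findInterval(xad, th2) + 1L

set.seed(3)
xad <- runif(200, 0.2, 3)      # simulated normalised intensities
th2 <- c(0.75, 1.25, 1.75)     # hypothetical sorted thresholds
same <- all(assign_loop(xad, th2) == assign_vec(xad, th2))
same
```

The vectorised form avoids the double loop over individuals and thresholds, which matters when the same step is repeated for every probe.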
2) k_meoides.R
# Program that classifies the individuals according to the information of the
# genetic markers and compares the result obtained for the
# integer data matrix XMC with that for the continuous data matrix,
# using the libraries <<ade4>>, <<cluster>>, <<concord>>, <<maptree>> and
# <<graphics>> of the R program (freeware)
# Required files: XMC.dum
source("C:/Documents and Settings /XMC.dum")
xn<-datosnorm
ls()
## xn=datosnorm is the matrix of normalised continuous data
## pop are the true populations
## xmc are the integer data obtained in
## datos[,i] are the data without normalisation
## hx are the populations selected by pam with k=3
library(ade4)
library(cluster)
library(maptree)   # Plots for hierarchical clustering
require(graphics)
library(concord)   # Kappa test
## PAM analysis (k-medoids)
# Continuous data XN
dxnc = daisy(xn, metric = "euclidean")   ## normalised data
hxc2= pam(dxnc, k=2)
hxc3= pam(dxnc, k=3)   # k=3 and k=4 are used further below and in the next script
hxc4= pam(dxnc, k=4)
# Integer data XMC
dxne = daisy(xmc, metric = "gower")      ## discretised data are used here
hxe2= pam(dxne, k=2)
hxe3= pam(dxne, k=3)
hxe4= pam(dxne, k=4)
## k-medoids plots (continuous and discrete)
op<-par(mfrow = c(2, 3))
plot(hxc2,main= 'continuos k=2')
plot(hxe2,main= 'discretos k=2')
par(op)
## Comparison of the k-medoids classification with the original pop
## Table hxe vs. pop
t1=table(hxe3$clustering,pop);
t1
# Kappa reliability coefficient, integer data: pam vs. pop, k=3
# (rbind keeps one row per rating, so t(concord) has one column per rating)
concord<-rbind(hxe3$clustering,as.integer(pop));
scores.to.counts(t(concord));
kppa1=cohen.kappa(t(concord),"score")
## Table hxc vs. pop
t2=table(hxc3$clustering,pop);
t2
# Kappa reliability coefficient, continuous data: pam vs. pop, k=3
concord<-rbind(hxc3$clustering,as.integer(pop));
kppa2=cohen.kappa(t(concord),"score")
## Table hxe vs. hxc
t3=table(hxe3$clustering,hxc3$clustering);
t3
# Kappa reliability coefficient, continuous vs. discrete pam, k=3
concord<-rbind(hxe3$clustering,hxc3$clustering);
kppa3=cohen.kappa(t(concord),"score")
kppa1 ## hxe vs. pop
kppa2 ## hxc vs. pop
kppa3 ## hxe vs. hxc
dump(ls(),"C:/Documents and Settings/K_meoides.dum")
3) Correspondencias Multiples.R
# Program that performs the multiple correspondence analysis
# of the genetic markers and compares the result obtained for the
# integer data matrix XMC with that for the continuous data matrix,
# using the libraries <<ade4>>, <<cluster>>, <<concord>>, <<maptree>>
# of the R program (freeware)
# Required files: K_meoides.dum
rm(list=ls())
library(ade4)
library(cluster)
library(maptree)   # Plots for hierarchical clustering
require(graphics)
source("C:/Documents and Settings/K_meoides.dum")
## A) Normalised continuous data XN
###################################################
pobfc2<-factor(hxc2$clustering)
levels(pobfc2)
dd.2.c<-dudi.pca(xn,scannf=FALSE)
## CONTINUOUS DATA K=2 PAM
dn.2.c<-between(dd.2.c, pobfc2,scannf=FALSE)
plot(dn.2.c)
# Rand test
rte.2.c<-randtest (dn.2.c,nrepet = 9999)
rte.2.c
plot(rte.2.c)
###################################################
pop
###################################################
levels(pop)
dd.pop.3.c<-dudi.pca(xn,scannf=FALSE)
# CONTINUOUS DATA K=3 POP
dn.pop.3.c<-between(dd.pop.3.c, pop,scannf=FALSE)
plot(dn.pop.3.c)
grid(10, 10, lwd = 3)
s.class(dn.pop.3.c$ls, pop, xax = 1, axesell=FALSE,yax = 2, sub = "Scores and classes", csub = 2, clab = 1.5)
# Rand test
rte.pop.3.c<-randtest(dn.pop.3.c,nrepet = 999)
plot(rte.pop.3.c)
####################################################
###################################################
pobfc3<-factor(hxc3$clustering)   # Grouping factor from the k=3 PAM clustering
levels(pobfc3)
dd.3.c<-dudi.pca(xn,scannf=FALSE)
# CONTINUOUS DATA K=3 PAM
dn.3.c<-between(dd.3.c,pobfc3,scannf=FALSE)
plot(dn.3.c)
# Rand test
rte.3.c<-randtest(dn.3.c,nrepet = 999)
plot(rte.3.c)
####################################################
###################################################
pobfc4<-factor(hxc4$clustering)   # Grouping factor from the k=4 PAM clustering
levels(pobfc4)
dd.4.c<-dudi.pca(xn,scannf=FALSE)
# CONTINUOUS DATA K=4 PAM
dn.4.c<-discrimin(dd.4.c,pobfc4,scannf=FALSE)
plot(dn.4.c)
# Rand test
rte.4.c<-randtest(dn.4.c,nrepet = 999)
plot(rte.4.c)
####################################################
## B) Integer data XMC
dxne = daisy(xmc, metric = "gower")
###################################################
pobfe2<-factor(hxe2$clustering)
levels(pobfe2)
# Convert every marker to a factor, as required by dudi.acm()
xmc.f<-as.data.frame(lapply(as.data.frame(xmc),as.factor))
dd.2.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA K=2 PAM
dn.2.e<-discrimin(dd.2.e, nf=9,pobfe2,scannf=FALSE)
plot(dn.2.e)
# Rand test
rte.2.e<-randtest(dn.2.e,nrepet = 999)
plot(rte.2.e)
####################################################
levels(pop)
# xmc.f (markers as factors) is reused here; it must not be reset before dudi.acm()
dd.pop.3.e<-dudi.acm(xmc.f,scannf=FALSE,nf=3)
# INTEGER DATA K=3 POP
dn.pop.3.e<-discrimin(dd.pop.3.e, nf=3,pop,scannf=FALSE)
s.class(dn.pop.3.e$li, pop, xax = 1, axesell=FALSE,yax = 2, sub = "Scores and classes", csub = 2, clab = 1.5)
plot(dn.pop.3.e)
# Rand test
rte.pop.3.e<-randtest(dn.pop.3.e,nrepet = 999)
plot(rte.pop.3.e)
####################################################
###################################################
pobfe3<-factor(hxe3$clustering)   # Grouping factor from the k=3 PAM clustering
levels(pobfe3)
dd.3.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA K=3 PAM
dn.3.e<-discrimin(dd.3.e, pobfe3,scannf=FALSE)
plot(dn.3.e)
# Rand test
rte.3.e<-randtest(dn.3.e,nrepet = 999)
plot(rte.3.e)
####################################################
###################################################
pobfe4<-factor(hxe4$clustering)   # Grouping factor from the k=4 PAM clustering
levels(pobfe4)
dd.4.e<-dudi.acm(xmc.f,scannf=FALSE)
# INTEGER DATA K=4 PAM
dn.4.e<-discrimin(dd.4.e, pobfe4,scannf=FALSE)
plot(dn.4.e)
# Rand test
rte.4.e<-randtest(dn.4.e,nrepet = 999)
plot(rte.4.e)
####################################################
ANNEX B. Paper related to the project: Latent Class Model to Assess
Association between Copy Number and Disease in Targeted Studies.
ANNEX C. Paper describing the preprocessing of the data in detail:
Identification of copy number variants define genomic differences among major
human ethnic groups.
Latent Class Model to Assess Association between Copy Number and Disease in Targeted Studies
Juan R. González (1,2,3), Isaac Subirana (2,3), Geòrgia Escaramís (2,4), Solymar Peraza (2,1),
Alejandro Cáceres (1), Xavier Estivill (4,2), Lluís Armengol (4)
(1) Center for Research in Environmental Epidemiology (CREAL)
(2) CIBER en Epidemiología y Salud Pública (CIBERESP)
(3) Institut Municipal d'Investigació Mèdica (IMIM)
(4) Genes and Disease Program, Center for Genomic Regulation, Barcelona, Spain
Correspondence to: Dr. Juan R. González
Center for Research in Environmental Epidemiology (CREAL) (room 188)
Barcelona Biomedical Research Park (PRBB)
Plaza Charles Darwin s/n, Barcelona 08003, Spain.
e-mail: jrgonzalez@creal.cat
e-mail addresses:
JRG: jrgonzalez@creal.cat
IS: isubirana@imim.es
GE: georgia.escaramis@crg.es
SP: speraza@creal.cat
AC: acaceres@creal.cat
XE: xavier.estivill@crg.es
LA: lluis.armengol@crg.es
Abstract
Background: Copy number variations (CNVs) might play an important role by altering the dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypical consequences. Therefore, determining whether or not a CNV is associated
with a given disease might be relevant in understanding the genesis and progression
of human diseases. In this paper, we present a framework to assess association between
CNVs and disease in case-control studies. We extend the model to analyze discrete traits
and adjust for confounding covariates.
Results: Through simulation studies, we have shown that our method outperforms
other simple methods based on using pre-defined thresholds to define copy number status.
We illustrate the method using a real data example in a controlled MLPA experiment
showing good results.
Conclusions: We illustrate that our method is robust and achieves maximal theoretical
power since it accommodates the possible misclassification error when copy number status
is established. We have made the software freely available and it will be included in the R
package MLPAstats.
Background
With the recent technological advances, different genome-wide studies have uncovered an
unprecedented number of structural variants in the human genome [1, 2, 3], mainly in
the form of copy number variations (CNVs). The large number of genes and other
regulatory elements encompassed by those variable regions makes CNVs very likely to
have functional and, ultimately, phenotypical consequences [4, 5]. In fact, recent studies
have correlated the number of copies of specific genes with different degrees of disease
predisposition [6, 7, 8], showing that the identification of DNA copy number is important
in understanding the genesis and progression of human diseases.
Several techniques and platforms have been developed for genome-wide analysis of
DNA copy number, such as array-based comparative genomic hybridization (aCGH).
The goal of this approach is to identify contiguous DNA segments where copy number
changes are present. The ability of aCGH to discern between different numbers of copies is
limited, thus the use of different kinds of quantitative techniques is required for targeted
and more precise analysis of genomic regions. For known CNVs, real-time PCR assays
can be applied to study the copy number status of given loci in case and control groups.
Individuals are typically binned into copy number categories using pre-defined thresholds.
Currently, Multiplex Ligation-dependent Probe Amplification (MLPA) [9] has also been
used to quantify copy number classes. This method allows the analysis of several loci at
the same time in a single assay. MLPA is normally used to test differences in gains
and losses among test and control samples [10] but it can also be used in the context of
association studies in case-control or cohort settings [11, 12].
Statistical methods used in CNV-disease association studies are very simple. Quantitative methods give CNV measurements for each individual as a continuous variable.
After that, copy number status is usually inferred by using pre-defined thresholds, and differences in copy number distribution between cases and
controls are subsequently assessed by using $\chi^2$, Fisher or Mann-Whitney tests [6, 13, 14]. However, the distribution of CNV measurements is continuous and multimodal, meaning that peak intensity
should be considered as a mixture of curves. On many occasions, these curves overlap with
different underlying distributions. Therefore, scoring copy number by binning and then
assessing the association may lead to misclassification and hence to false findings.
To overcome this difficulty, we propose a latent class (LC) model to assess association
between CNVs and disease which incorporates possible misclassification in scoring copy
number status. After inferring copy numbers using Gaussian finite mixture distributions,
the model assesses the relationship between the trait and a CNV with a mixture of generalized linear models. Association is then assessed using a likelihood ratio procedure.
We validate and compare our method with the existing methods through a simulation
study. We then illustrate how to test association between two CNVs in a case-control
study using a real data set.
Methods
Inference of copy number status
Let us assume that we observe I individuals from a given population, that consists of
C mutually exclusive latent classes c = 1, . . . , C (e.g. copy number status). Instead
of observing these classes, we observe a surrogate variable, X, that corresponds to a
continuous variable arising from any quantitative method. For instance, in targeted
studies using MLPA or real-time PCR, X corresponds to peak intensities for each CNV.
In the context of a whole genome scan, one may have quantitative data from Illumina or
Affymetrix array, where for each probe, the variable X corresponds to a ratio of intensities.
Figure 1 shows possible patterns that peak intensities may have. Some variants clearly
show different underlying copy number status with multimodal peak intensities (CNV2,
CNV4 and CNV6). In other cases, where the existence of different copy numbers is not
clear, inferring copy number by binning the data may be difficult or unfeasible.
For each CNV variant, we are interested in classifying the individuals into the C classes
using the surrogate variable X. We propose to model the unobserved latent classes using
a finite mixture model with C components of the form
$$ f(x \mid \theta) = \sum_{c=1}^{C} \pi_c \, N(x \mid \mu_c, \sigma_c^2), \tag{1} $$
where $N(\cdot \mid \mu_c, \sigma_c^2)$ is the Gaussian distribution with $\theta$ denoting all model parameters
(e.g., $\theta = (\mu_c, \sigma_c^2)$, $c = 1, \ldots, C$), and x is the surrogate variable that corresponds to the
quantitative measure of the copy number status. For the component weights $\pi_c$ it holds that
$$ \sum_{c=1}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \ge 0, \quad c = 1, \ldots, C. $$
The value of C to be used is chosen by applying the Bayesian Information Criterion (BIC)
[15]. It should be pointed out that on some occasions, especially when there are individuals
with 0 copies, the intensity distributions (see CNV2 and CNV4 in Figure 1) are very close
to 0. In this situation, the estimation procedure for the parameters involved in (1) tends to fail,
since the intensities of individuals with 0 copies are not normally distributed.
In these situations we propose to fit the following mixture model to determine the latent
classes
$$ f(x \mid \theta) = \pi_1 \, \mathbb{I}\{x \le \tau\} + \sum_{c=2}^{C} \pi_c \, N(x \mid \mu_c, \sigma_c^2) \, \mathbb{I}\{x > \tau\}, \tag{2} $$
where $\tau$ is given by the user, $\pi_1 = \frac{1}{I} \sum_{i=1}^{I} \mathbb{I}\{x_i \le \tau\}$, $\mathbb{I}$ denotes an indicator function, and
$$ \pi_1 + \sum_{c=2}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \ge 0, \quad c = 2, \ldots, C. $$
The posterior probabilities are used to segment the data by assigning each individual to
the copy number status that corresponds to the class with maximum posterior
probability (MAP). After fitting this finite mixture model, we can perform a goodness-of-fit test using a $\chi^2$ test statistic. Finite mixture parameters can be estimated using the
EM algorithm [16, 17] or Newton-type procedures [17]. Then, the posterior probability
that individual i with an observed value x belongs to copy number class j is given by
$$ w_{ij} = P(j \mid x, \theta) = \frac{\pi_j \, N(x \mid \mu_j, \sigma_j^2)}{\sum_{c} \pi_c \, N(x \mid \mu_c, \sigma_c^2)}. \tag{3} $$
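The following base R sketch illustrates one way the mixture in equation (2) could be fitted with a plain EM iteration and how the MAP assignment follows from the posterior probabilities in equation (3). It is not the authors' implementation: the function name `fit_cn_mixture`, the fixed number of components, the starting values, the value of the user threshold and the simulated intensities are all assumptions made for the example.

```r
# Sketch of equation (2): a point mass at or below tau plus Gaussian components above it
fit_cn_mixture <- function(x, tau, mu, sigma = rep(0.1, length(mu)), iters = 100) {
  zero <- x <= tau
  pi1  <- mean(zero)                     # weight of the 0-copy class
  xs   <- x[!zero]
  wts  <- rep((1 - pi1) / length(mu), length(mu))
  for (it in seq_len(iters)) {
    # E-step: posterior probability of each Gaussian component, as in equation (3)
    dens <- sapply(seq_along(mu), function(k) wts[k] * dnorm(xs, mu[k], sigma[k]))
    post <- dens / rowSums(dens)
    # M-step: update means, standard deviations and weights
    nk    <- colSums(post)
    mu    <- colSums(post * xs) / nk
    sigma <- sqrt(colSums(post * outer(xs, mu, "-")^2) / nk)
    wts   <- (1 - pi1) * nk / sum(nk)
  }
  cls <- integer(length(x))
  cls[zero]  <- 1L                               # class 1: intensities <= tau
  cls[!zero] <- 1L + apply(post, 1, which.max)   # MAP among the Gaussian classes
  list(weights = c(pi1, wts), mu = mu, sigma = sigma, class = cls)
}

set.seed(4)
# Simulated intensities: 20 near zero, 100 around 1 and 60 around 1.5
x <- c(runif(20, 0, 0.1), rnorm(100, 1, 0.1), rnorm(60, 1.5, 0.1))
fit <- fit_cn_mixture(x, tau = 0.2, mu = c(1, 1.5))
table(fit$class)
```

In practice the number of components would be chosen by BIC, as the text describes, rather than fixed in advance as it is here.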
Latent class model

Discrete traits
Let us suppose that copy number status is associated with a binary phenotype (case-control). The association is typically assessed with a $\chi^2$ test for the contingency Table 1.
Misclassification in the table is introduced when we assign each individual to a given
class c using maximum a posteriori probability (MAP). Thus, this problem can be seen as
an association study with misclassification (measurement error) [18]. It is well known
that misclassification of covariates has important implications for parameter estimates
and statistical inference [19]. Some approaches account for such error [20, 21]. These
are, however, based on performing validation studies in a subsample. In our context, this
is unfeasible because hundreds of genes are normally analyzed at a time, and the technology
may have different sensitivity and specificity for each of the inspected loci. We therefore
propose to use the posterior probability of belonging to each latent class to model the degree of
misclassification regarding the copy number status. We then account for this information
in the association model.
Conditionally on cluster c, we have that
$$ P(y_i \mid C_i = c, \beta) = p_{ic}^{\,y_i} (1 - p_{ic})^{1 - y_i}, \tag{4} $$
where $\beta = (\beta_1, \ldots, \beta_C)$ is our vector of parameters, and
$$ \operatorname{logit}(p_{ic}) \equiv \beta_c . $$
Then, equation (4) can be rewritten as
$$ P(y_i \mid C_i = c, \beta) = \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}} . $$
Now, we consider that copy number status is measured with error (i.e., the latent class
is not known). Therefore, we are modelling the probability of being a case as a mixture of
C binomial variables in the following way
$$ P(y_i \mid \beta) = \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta), $$
where $w_{ic}$ is the posterior probability that individual i belongs to copy number class
c given in (3). Therefore, assuming conditional independence of case-control status given
latent class, the likelihood function for the model parameters can be written as
$$ \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}} . \tag{5} $$
It is straightforward to see that we can compute the odds ratio (OR) of belonging to class
c with respect to a given reference r as
$$ OR_{c/r} = e^{\beta_c - \beta_r} . \tag{6} $$
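As an illustration of equations (5) and (6), the sketch below maximises the latent class likelihood numerically with `optim()` and reports the odds ratio of class 2 with respect to class 1. The function names, the toy data and the posterior weights are invented for the example, and the general-purpose optimiser stands in for whatever procedure the authors' software uses.

```r
# Negative log-likelihood of equation (5); w is an I x C matrix of posterior weights
lc_negloglik <- function(beta, y, w) {
  p   <- plogis(beta)                         # P(case | class c)
  pyc <- outer(y, p) + outer(1 - y, 1 - p)    # P(y_i | C_i = c) for binary y
  -sum(log(rowSums(w * pyc)))
}

fit_lc <- function(y, w) {
  optim(rep(0, ncol(w)), lc_negloglik, y = y, w = w, method = "BFGS")
}

# Toy data: 100 individuals most likely in class 1 (30 cases) and
# 100 most likely in class 2 (50 cases); the weights are deliberately not 0/1
y <- c(rep(1, 30), rep(0, 70), rep(1, 50), rep(0, 50))
w <- rbind(matrix(c(0.9, 0.1), 100, 2, byrow = TRUE),
           matrix(c(0.2, 0.8), 100, 2, byrow = TRUE))
fit <- fit_lc(y, w)
or_2_vs_1 <- exp(fit$par[2] - fit$par[1])     # equation (6) with r = 1
or_2_vs_1
```

When every weight is exactly 0 or 1 the likelihood reduces to an ordinary logistic model on the assigned classes, which is the NAIVE situation the latent class model is meant to improve on.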
Quantitative traits
We now consider the case where our phenotype, Y, is continuous. We assume that
$Y \mid c \sim N(\mu_c, \sigma^2)$. In this case, conditionally on cluster c we have that
$$ P(y_i \mid C_i = c, \beta) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \mu_{ic})^2}{2\sigma^2}}, \tag{7} $$
where
$$ \mu_{ic} \equiv \beta_c . $$
And, similarly to the case of discrete traits, the likelihood function for the model parameters
is given by
$$ \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} . \tag{8} $$
In this case we are interested in evaluating the difference between the mean effect of individuals with c copies and r copies. This can simply be computed as
$$ y_{c/r} = \beta_c - \beta_r . $$
Model with covariates
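On some occasions researchers are interested in assessing the effect of CNVs adjusted for
other covariates, $Z_1, \ldots, Z_K$ (normally called confounding variables). In this case, the
likelihood function can be written as
$$ \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, Z, \beta_c, \gamma), $$
where
$$ P(y_i \mid C_i = c, Z, \beta_c, \gamma) = \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}} \tag{9} $$
for discrete traits, and
$$ P(y_i \mid C_i = c, Z, \beta_c, \gamma, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}} \tag{10} $$
for quantitative traits. In both cases
$$ \eta_{ic} = \beta_c + \gamma_1 Z_{i1} + \ldots + \gamma_K Z_{iK} . \tag{11} $$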
Parameter estimation
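In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity let $\theta \equiv (\beta, \gamma, \sigma)$ (notice
that for discrete traits $\sigma = 1$). We consider that the $w_{ic}$ are known and that they are given
by the surrogate variable X through equation (3). Therefore, they can be plugged into the
log-likelihood, resulting in
$$ \log P(Y \mid Z, \theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic} \, P(y_i \mid C_i = c, Z, \theta). \tag{12} $$
Here $P(y_i \mid C_i = c, Z, \theta)$ is given by equations (9) and (10) for discrete and quantitative
traits, respectively. The maximum likelihood estimators (MLE) of the model parameters
maximize this log-likelihood function. We propose to use a Newton-Raphson procedure
to find parameter estimates. The k-th component of the score, S, is given by
$$ S_k(\theta) = \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k}}{\sum_{c=1}^{C} h_{ic}} . $$
The (k, k')-th element of the Hessian, H, is
$$ H_{kk'}(\theta) = \frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k \, \partial \theta_{k'}} = \sum_{i=1}^{I} \left[ \frac{\sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k \, \partial \theta_{k'}}}{\sum_{c=1}^{C} h_{ic}} - \frac{\left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k} \right) \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_{k'}} \right)}{\left( \sum_{c=1}^{C} h_{ic} \right)^2} \right], $$
where
$$ h_{ic} \equiv w_{ic} \, P(y_i \mid C_i = c, Z, \theta). $$
Formulas for the derivatives of $h_{ic}$ for covariates and for discrete and quantitative traits
are given in the Appendix.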
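For the simplest case, a binary trait without covariates, the derivatives can be written out directly. The following is a sketch under that assumption, with $p_c = e^{\beta_c}/(1 + e^{\beta_c})$ introduced here as shorthand; it is not a substitute for the general formulas of the Appendix.

```latex
\begin{aligned}
h_{ic} &= w_{ic}\,\frac{e^{y_i\beta_c}}{1+e^{\beta_c}}, \qquad
\frac{\partial h_{ic}}{\partial \beta_{c'}} =
\begin{cases}
h_{ic}\,(y_i - p_c), & c' = c,\\
0, & c' \neq c,
\end{cases}\\[4pt]
S_c(\beta) &= \sum_{i=1}^{I} \frac{h_{ic}\,(y_i - p_c)}{\sum_{s=1}^{C} h_{is}} .
\end{aligned}
```

The first derivative follows from $\partial \log P(y_i \mid C_i = c, \beta) / \partial \beta_c = y_i - p_c$, so each individual contributes to the score of class c in proportion to its share $h_{ic} / \sum_s h_{is}$.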
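The MLE can be used to estimate the OR, under the multiplicative model, between individuals with c copy number status with respect to a reference category (e.g., individuals
with r copy number status) as
$$ \widehat{OR}_{c/r} = e^{\hat\beta_c - \hat\beta_r} . \tag{13} $$
Similarly, when analyzing continuous traits, the estimated mean effect between individuals with c copies and r copies is
$$ \hat{y}_{c/r} = \hat\beta_c - \hat\beta_r . \tag{14} $$
The asymptotic variance-covariance matrix of the maximum likelihood estimates of $\theta$ can
be estimated using the observed information matrix, F, as
$$ \widehat{\operatorname{Var}}(\hat\theta) = F^{-1}(\hat\theta) = -H^{-1}(\hat\theta). \tag{15} $$
Therefore, we can compute a 95% confidence interval (CI95%) for $OR_{c/r}$ using the expression
$$ CI_{1-\alpha}(OR_{c/r}) = \exp\left[ (\hat\beta_c - \hat\beta_r) \pm z_{\alpha/2} \sqrt{ \operatorname{Var}(\hat\theta)_{[c,c]} + \operatorname{Var}(\hat\theta)_{[r,r]} - 2 \operatorname{Var}(\hat\theta)_{[c,r]} } \right] \tag{16} $$
and for $y_{c/r}$
$$ CI_{1-\alpha}(y_{c/r}) = (\hat\beta_c - \hat\beta_r) \pm z_{\alpha/2} \sqrt{ \operatorname{Var}(\hat\theta)_{[c,c]} + \operatorname{Var}(\hat\theta)_{[r,r]} - 2 \operatorname{Var}(\hat\theta)_{[c,r]} }, \tag{17} $$
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$-th quantile of the standard normal distribution, $\alpha$ is the
desired type-I error, and the subindex $[\cdot,\cdot]$ denotes the position in the inverse of Fisher's
information matrix.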
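A short sketch of the interval in equation (16) is given below. The estimates `beta_hat` and the variance-covariance matrix `V` are taken as given (in practice they would come from the fitted model), and the numbers used here, as well as the function name `or_ci`, are hypothetical.

```r
# Wald interval for OR_{c/r}, following equation (16)
or_ci <- function(beta_hat, V, cls, ref, alpha = 0.05) {
  est <- unname(beta_hat[cls] - beta_hat[ref])
  se  <- sqrt(V[cls, cls] + V[ref, ref] - 2 * V[cls, ref])
  z   <- qnorm(1 - alpha / 2)          # (1 - alpha/2) quantile of the standard normal
  exp(c(estimate = est, lower = est - z * se, upper = est + z * se))
}

beta_hat <- c(0, log(2))               # hypothetical estimates for classes 1 and 2
V <- diag(c(0.04, 0.05))               # hypothetical variance-covariance matrix
ci <- or_ci(beta_hat, V, cls = 2, ref = 1)
ci
```

Dropping the `exp()` call gives the interval of equation (17) for the mean difference of a quantitative trait.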
Hypothesis testing
We propose to use a likelihood ratio test to assess disease association, taking as reference the model without the copy number variable. Twice the increase in the log-likelihood
provides the asymptotic $\chi^2$ statistic that tests $H_0 : \beta_1 = \beta_2 = \ldots = \beta_C$. On many
occasions, we are interested in studying the trend in copy number status (e.g., additive
model). This can be done by generalizing equation (11) in the form
$$ \eta_{ic} = \sum_{m=1}^{M} D_{icm} \, \theta_{cm}, \tag{18} $$
where D is an $I \times M$ design matrix, and $\theta$ is a vector of dimension M containing the model
parameters. M is the total number of variables included in the model, including copy
number status and confounding variables (e.g., M = C + K). For example, for a trend test on
copy number status without covariates, D will have the form
$$ D = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 & \cdots & C-1 & \cdots & C-1 \end{pmatrix}^{\top} $$
and the trend hypothesis on copy number status is tested using a likelihood ratio test
comparing this model with the null model. Notice that this formulation allows us to
accommodate different or common effects for each latent class. In this case, parameter
estimates are obtained as previously illustrated. Formulas for the derivatives involved
in the score and Hessian where coefficients are not shared by each latent class can be
found in the Appendix. R functions for the methods discussed in this paper
are freely available at http://www.creal.cat/jrgonzalez/software.htm and they will be
included in the R package MLPAstats.
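The likelihood ratio test of $H_0 : \beta_1 = \ldots = \beta_C$ can be sketched as follows for a binary trait without covariates. The likelihood function, the toy data and the posterior weights are invented for the example, and `optim()` is used in place of the Newton-Raphson procedure described above.

```r
# Negative log-likelihood of the latent class model for a binary trait
lc_negloglik <- function(beta, y, w) {
  p   <- plogis(beta)
  pyc <- outer(y, p) + outer(1 - y, 1 - p)   # P(y_i | C_i = c)
  -sum(log(rowSums(w * pyc)))
}

# Toy data with deliberately soft posterior weights
y <- c(rep(1, 30), rep(0, 70), rep(1, 50), rep(0, 50))
w <- rbind(matrix(c(0.9, 0.1), 100, 2, byrow = TRUE),
           matrix(c(0.2, 0.8), 100, 2, byrow = TRUE))

# Full model: one coefficient per class; null model: a single common coefficient
full <- optim(rep(0, ncol(w)), lc_negloglik, y = y, w = w, method = "BFGS")
null <- optim(0, function(b) lc_negloglik(rep(b, ncol(w)), y, w), method = "BFGS")

lrt  <- 2 * (null$value - full$value)        # twice the increase in log-likelihood
pval <- pchisq(lrt, df = ncol(w) - 1, lower.tail = FALSE)
c(statistic = lrt, p.value = pval)
```

A trend test would replace the full model by one in which the class coefficients are constrained to be linear in the copy number, giving one degree of freedom regardless of C.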
Results
Simulation study
We performed computer simulation studies to examine empirically the properties of the
parameter estimators developed in the previous sections. The specific goals of these
studies were: (i) to evaluate the performance of the proposed likelihood ratio trend
test based on the latent class model for different CNV measurement distributions; (ii) to
examine the effect of sample size (I) on the distributional properties of the estimators;
(iii) to examine the bias and mean square error (MSE) of the estimators; (iv) to validate
whether the variances of the parameter estimates obtained using the observed information matrix
are correct. Simulations were performed as follows. To study (i) we simulated a binary
trait using 300 cases and 300 controls. The unobserved copy number status (i.e., the latent
classes) was simulated with 3 different copy number statuses (C = 3), with the
proportion of individuals in each category set equal to $\pi = (0.5, 0.4, 0.1)$. The trend OR
was set equal to 1.5. The observed ratio intensities (X variable) were simulated as a finite
mixture of C normal distributions using different means, $\mu$, and variances, $\sigma^2$, to assess
whether the separation of clusters and their variance affect the power.
To study (ii)-(iv) we simulated binary and quantitative traits. For the binary case,
simulation was performed as above but simulating different scenarios varying the sample
size (I), the OR and the proportion of individuals for each copy number status, $\pi$. Again we
also simulated different CNV distributions varying $\mu$ and $\sigma^2$. For quantitative traits, we
used the same simulation procedure but copy number status was simulated depending
on a fixed mean level of the trait for the copy number status considered as reference and
a desired mean difference with respect to the other categories of copy number status.
Next, we describe the settings for the different simulation parameters. Sample size: we
chose the values I ∈ {50, 300}. Although current studies analyze thousands of
individuals, these values were chosen to evaluate the performance of our proposed method
in moderately large samples. Copy number status: since we were interested
in evaluating the performance of the parameter estimates, we simulated only two
copy number categories, C = {1, 2}. Odds ratio: to examine the impact of the strength of the association
between the disease and the CNV, we chose two values, OR ∈ {1.3, 2}, representing
a moderate and a strong association. Proportion of cases with normal copy
number status: to evaluate the impact of classes with different numbers of individuals,
we set π ∈ {(0.8, 0.2), (0.5, 0.5)}. Finite mixture: to assess the impact of the distribution of
the intensity ratio, X, we simulated two normal distributions with means
μ = (1, 1.5), which correspond to 2 copies (considered the normal copy number status) and
3 copies, respectively, and σ ∈ {(0.15, 0.15), (0.15, 0.2), (0.2, 0.2)}. These
scenarios also allowed us to model different degrees of misclassification, that is, of
separation between the latent classes. Supplementary Figure XX shows the different distributions
of quantitative CNV measurements that were simulated.
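The simulation design just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the R code used for the paper: the function names are hypothetical, and it assumes that π is the class distribution among controls and that each additional copy multiplies the odds of disease by the OR (under case-control sampling the class distribution among cases is then proportional to π_c times the OR raised to the class index).

```python
import random


def class_probs_in_cases(pi, odds_ratio):
    """Class distribution among cases when controls follow `pi` and each
    extra copy multiplies the odds of disease by `odds_ratio` (assumption)."""
    weights = [p * odds_ratio ** c for c, p in enumerate(pi)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_case_control(n_cases, n_controls, pi, mu, sigma, odds_ratio, seed=0):
    """Return a list of (y, c, x): disease status, latent class index, intensity ratio."""
    rng = random.Random(seed)
    classes = list(range(len(pi)))
    groups = ((1, n_cases, class_probs_in_cases(pi, odds_ratio)),
              (0, n_controls, list(pi)))
    data = []
    for y, n, probs in groups:
        for _ in range(n):
            c = rng.choices(classes, weights=probs)[0]  # unobserved copy number class
            x = rng.gauss(mu[c], sigma[c])              # observed intensity ratio
            data.append((y, c, x))
    return data


# One of the scenarios listed above: I = 300 per group, pi = (0.8, 0.2), OR = 2
data = simulate_case_control(300, 300, pi=(0.8, 0.2), mu=(1.0, 1.5),
                             sigma=(0.15, 0.15), odds_ratio=2.0)
```

In an analysis only `y` and `x` would be treated as observed; the class index `c` is kept here so that estimated and true copy number status can be compared.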
We compared three different approaches. The first one (NAIVE) assesses the
association between disease and the copy number status obtained as the maximum a posteriori (MAP) assignment from the finite
mixture model (2). That is, association is assessed using a χ² test on Table 1. The
second one is what biologists normally do when analyzing this kind of data: CNV status is
assigned using pre-defined thresholds (THRES), and association is then assessed
using a χ² test. As mentioned previously, we simulated data from a mixture of two normals
with means 1 and 1.5, which is equivalent to simulating individuals with 2 and 3 copies,
respectively. In this situation, individuals with an intensity (or ratio intensity) larger than 1.33 are considered to have 3 copies [10]. The third method
is the one proposed in this paper, based on latent classes (LC).
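To make the difference between the NAIVE and THRES calls concrete, the following Python sketch assigns copy number status by MAP under a two-component normal mixture and by the 1.33 cut-off, and computes the Pearson χ² statistic of a contingency table such as Table 1. The function names and the parameter values in the example are illustrative assumptions, not the authors' implementation.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probs(x, pi, mu, sigma):
    """Posterior probability of each latent class given the intensity x."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)]
    total = sum(joint)
    return [j / total for j in joint]


def map_class(x, pi, mu, sigma):
    """NAIVE call: class with the largest posterior probability."""
    post = posterior_probs(x, pi, mu, sigma)
    return max(range(len(post)), key=post.__getitem__)


def threshold_class(x, cutoff=1.33):
    """THRES call: class 1 (3 copies) above the cut-off, class 0 (2 copies) otherwise."""
    return 1 if x > cutoff else 0


def chi_square_statistic(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Illustrative mixture parameters (pi, mu, sigma) taken from the simulation settings
pi, mu, sigma = (0.8, 0.2), (1.0, 1.5), (0.15, 0.2)
calls = [(x, map_class(x, pi, mu, sigma), threshold_class(x)) for x in (1.10, 1.32, 1.60)]
```

With these illustrative parameters, an intensity of 1.32 lies below the 1.33 cut-off but has the higher posterior probability under the 3-copy component, so the two rules disagree for that individual.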
Simulation results evaluating the performance of the likelihood ratio trend test of our
proposed model are shown in Figure 2. The top figures represent the power of all the methods
analyzed for two scenarios (other scenarios are given in Supplementary Figures 1 and
2). The left panel shows the power of each method when the CNV measurement
distribution varies in the means of the latent classes, μ, while the right panel gives the same
information for fixed means and varying variances, σ². The figure also depicts the
distribution of the CNV for some scenarios. We observe that our proposed latent class model
performs better in all cases, even when the components of the CNV mixture are not well separated.
Simulation results evaluating parameter estimates for discrete traits are presented in
Tables 2 and 3 and in Figure 3 (Supplementary Figure 3 shows the results for I = 50). Similar
results and conclusions are obtained when a quantitative trait is analyzed. Table 2,
Figure 3 and Supplementary Figure 3 summarize the OR comparing individuals with 3
copies with those with 2 copies (reference category) and give the
MSE for two sample sizes, I, two proportions of individuals having 2
copies, π, and different variances for the components of the mixture, σ. Table 3
compares different methods of computing the standard error of the ORs for the scenarios
previously described. It compares the asymptotic variance based on the observed information matrix (ASYM) with the empirical variance (EMP). Table 3 also shows
the coverage and power of confidence intervals based on the three methods analyzed.
As expected, when the sample size increased, the performance of the estimators of
the finite-dimensional parameters improved (Table 2). The LC method performs
better than the others: it is less biased than NAIVE and THRES and
shows a better MSE in most scenarios. Figure 3 also confirms the better performance of the LC method in
estimating the empirical OR distribution. In particular, the distribution obtained with the LC method is the
closest to that of the simulated data (Figure 3 and Supplementary Figure 3).
Regarding variance estimates, estimation based on ASYM showed good performance
in all scenarios (Table 3). Although ASYM slightly overestimated the empirical variance (EMP),
the bias was less pronounced for I = 300, as expected. Confidence intervals based on the LC
method outperform those obtained by the other methods with regard to power.
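The summary measures reported in Tables 2 and 3 (bias, MSE, empirical variance and coverage) can be computed from the simulation replicates along the following lines. This is a hedged sketch: the function name is hypothetical, and it assumes that the measures are computed on the log-OR scale with Wald-type 95% confidence intervals, which the text does not state explicitly.

```python
def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Bias, MSE, empirical variance and CI coverage over simulation replicates.

    `estimates` are per-replicate estimates (assumed here to be log ORs) and
    `std_errors` their asymptotic standard errors (ASYM)."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    # Empirical variance (EMP): sample variance of the estimates across replicates
    emp_var = sum((e - mean_est) ** 2 for e in estimates) / (n - 1)
    # Coverage: fraction of Wald intervals that contain the true value
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if e - z * s <= true_value <= e + z * s)
    return {"bias": bias, "mse": mse, "emp_var": emp_var, "coverage": covered / n}


# Three made-up replicates around a true log OR of 0.7
summary = summarize_replicates([0.5, 0.7, 0.9], [0.2, 0.2, 0.05], true_value=0.7)
```

Comparing `emp_var` with the average of the squared `std_errors` is the EMP versus ASYM comparison described above.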
Application to real data

The data used for this analysis were generated and kindly provided by one of the coauthors of the current work. Although the data are still unpublished, they are made available through
the website (http://davinci.crg.es/estivill lab) in a blinded manner for reproducing our
findings with the approach presented here and for third-party validation. Some candidate genes were identified after performing a whole-genome scan analysis using aCGH,
in which a pool of controls and a pool of cases were compared. To further investigate the
relationship between the disease and the altered genes, a targeted study including several variants was designed using the MLPA technique. We collected peak intensities of MLPA assays
for 360 cases and 291 controls. Figures 4 and 5 show the intensities for cases and controls
for two selected genes. In both cases, we observe 3 latent classes, corresponding to 0,
1, and 2 copies of the gene. We found that the finite mixture model fits very well (χ²
goodness-of-fit test, P = 0.6615 and P = 0.4888). The main difference between these two
genes is that the copy number status for gene 1 can be established using a threshold method,
while for gene 2 this classification seems more arbitrary. As a consequence,
misclassification should be considered when analyzing gene 2. Table 4 shows the classification of individuals as having 0, 1 or 2 copies, estimated using equation (2), against the true copy
number obtained after breakpoint cloning and assessing allele presence by PCR,
which unequivocally reports the exact number of copies. From the table, we can see that
the finite mixture model gives a perfect classification for gene 1 and some misclassification
for gene 2. We checked the suitability of the mixture model, obtaining P = 0.6615 and
P = 0.1586.
Table 5 shows the ORs and their 95% CIs for the two genes analyzed. The first three
columns show the results considered the gold standard, since CN status was determined in the laboratory using PCR, while the other columns show the results
obtained after estimating the CN status with our proposed finite mixture model and
computing the ORs using a naive approach (i.e. assuming that there is no misclassification) and the LC model, which accounts for misclassification. As we can see, the results
are the same for gene 1, since no misclassification is observed (see Figure 4 and Table
4). For gene 2, however, CN status is not determined as easily as for gene 1. This is why we observe different results for the OR estimates and, more
interestingly, for the P value of association. For instance, the order of magnitude of the
association between the disease and gene 2 is better captured by the LC model than by
the NAIVE approach. As for the OR estimates, the analysis using the true CN status shows
that individuals with one copy of gene 2 have an OR of 0.46, that is, a 54% decrease in the odds of having the disease
with respect to individuals with 0 copies. As the 95% CI shows, this difference is
statistically significant. We reach the same conclusion when we compare individuals
with two copies with subjects with zero copies. Notice that in both cases
the naive approach underestimates the OR, as the simulation study
showed.
Conclusion
In this paper we have shown that assessing the association between CNVs and disease
with analysis methods that do not take into account the misclassification of copy number
status (threshold and naive methods) underestimates both p-values and parameter estimates. This runs contrary to the need to increase statistical power, which is reduced by the
multiple comparison correction required when several loci are tested simultaneously. False positives
are typically controlled by a dramatic reduction of the nominal p-value threshold, and therefore
very low p-values are required to reach statistical significance. A precise computation of
these values is essential in genetic association studies.
Here we have proposed a latent class model (LC) that accommodates both the uncertainty in assessing CNV status and possible confounding factors. The parameter estimation procedure also provides confidence intervals for the estimates. The LC model was
remarkably consistent with the simulated data. In particular, we found that the p-values obtained with the LC model were closer to the expected values than those obtained
with the threshold and naive methods.
CNVs are assayed quantitatively by a broad range of methods [22], such as array CGH
or the Illumina and Affymetrix platforms. These technologies can identify
thousands of CNVs simultaneously, which makes the analysis of such data computationally demanding. We have found that the LC method is very efficient and can therefore
serve this purpose. Specifically, we have observed that the Newton-Raphson optimization
converges in only a few steps (usually 4-6). It is important to note that the finite Gaussian
mixture model incorporated in the LC model assumes that cases and controls are comparable. While this is true for MLPA data, for other technologies such as CGH, Illumina
and Affymetrix, differential errors can be present between groups of subjects due to DNA
quality or handling. For those platforms it would therefore be necessary to assess CNV status for cases and
controls separately before the association analysis.
We maximize the likelihood function assuming fixed weights for each copy number
status, which account for possible misclassification. The main advantage of treating the
weights as known constants is that the Newton-Raphson procedure is much simpler and faster,
and the Hessian matrix can be obtained analytically. We therefore assume that copy
number status is independent of being a case or a control. This assumption was validated in
our simulation studies, where we confirmed that the proposed model captures very well
the nature of the synthetic data and the variance estimates. Interestingly, we observed that
the variance estimates obtained by MLE were also reproduced when a bootstrap procedure was
used (see Supplementary Table 1). In the interest of generality, one can consider
maximizing the likelihood function over both the model parameters and the weights. In that
case, an EM algorithm should be used instead. However, one should bear in mind
that EM does not directly provide the variance of the model parameters and is
computationally expensive, which is a challenge if the method is used in whole-genome
scan settings.
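A minimal Python sketch of the Newton-Raphson scheme with fixed weights is given below for the simplest case: a binary trait, no covariates and two latent classes. It is not the authors' R implementation; the function names are hypothetical, and the score and Hessian follow the general mixture-likelihood expressions with h_ic = w_ic p_c^y (1 - p_c)^(1 - y).

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def score_and_hessian(beta, y, w):
    """Score (length 2) and Hessian (2x2) of sum_i log sum_c w[i][c] * P(y_i | C_i = c)."""
    p = [expit(b) for b in beta]
    score = [0.0, 0.0]
    hess = [[0.0, 0.0], [0.0, 0.0]]
    for yi, wi in zip(y, w):
        # h_ic = w_ic * exp(y_i * beta_c) / (1 + exp(beta_c))
        h = [wi[c] * (p[c] if yi == 1 else 1.0 - p[c]) for c in range(2)]
        denom = h[0] + h[1]
        # g_k = (dh_ik / dbeta_k) / sum_c h_ic
        g = [h[c] * (yi - p[c]) / denom for c in range(2)]
        for k in range(2):
            score[k] += g[k]
            # second derivative of h_ik with respect to beta_k, divided by sum_c h_ic
            hess[k][k] += h[k] * ((yi - p[k]) ** 2 - p[k] * (1.0 - p[k])) / denom
            for l in range(2):
                hess[k][l] -= g[k] * g[l]
    return score, hess


def newton_raphson(y, w, tol=1e-10, max_iter=50):
    """Maximize the fixed-weight log-likelihood; returns (beta, iterations used)."""
    beta = [0.0, 0.0]
    for iteration in range(1, max_iter + 1):
        s, hm = score_and_hessian(beta, y, w)
        det = hm[0][0] * hm[1][1] - hm[0][1] * hm[1][0]
        # Newton step H^{-1} s, with the 2x2 inverse written out explicitly
        step = [(hm[1][1] * s[0] - hm[0][1] * s[1]) / det,
                (hm[0][0] * s[1] - hm[1][0] * s[0]) / det]
        beta = [beta[0] - step[0], beta[1] - step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            return beta, iteration
    raise RuntimeError("Newton-Raphson did not converge")


# Sanity check with certain (0/1) weights: each beta_c is the logit of the
# proportion of cases in class c (3/10 in class 0, 6/10 in class 1).
y = [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4
w = [(1.0, 0.0)] * 10 + [(0.0, 1.0)] * 10
beta_hat, n_iter = newton_raphson(y, w)
```

With posterior probabilities from the mixture model as weights, the same routine applies unchanged, and the inverse of the negative Hessian at the optimum gives the asymptotic variance estimate referred to as ASYM.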
In conclusion, we have shown how the LC model can accommodate bi-allelic and
multi-allelic CNVs as well as quantitative traits. We have also shown how it can
incorporate confounding variables. This is of particular importance in studies of complex diseases,
where environmental factors need to be taken into account. The formulation can
also be generalized to assess survival times or counts in longitudinal studies.
Authors contributions
JRG and IS developed the new statistical methods. JRG wrote the R functions and
the main text of the manuscript and performed the simulation studies. GE and AC
made numerous suggestions for developing the models. SP worked on the Gaussian
mixture approach to model quantitative CNV measurements. XE reviewed the paper and
revised its framework. LA and JRG identified the need for a statistical tool to measure the
biological differences in allele distribution in cohorts of cases and controls, and conceived
the study. All authors have read and approved the final manuscript.
Acknowledgments
The first author wants to thank Xavier Bassagaña for his comments and helpful conversations
about the proposed model. This work has been supported by the Spanish Ministry of
Science and Innovation [MTM2008-02457 to JRG and SAF2008-00357 to XE] and the
European Commission [AnEUploidy project; FP6-2005-LifeSciHealth contract #037627].
Appendix
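To obtain parameter estimates, we have to maximize the log-likelihood function
\[
\log P(Y \mid C, Z, \theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, Z, \theta),
\]
where $P(y_i \mid C_i = c, Z, \theta)$ is given by equations (9) and (10) for discrete and quantitative
traits, respectively. As we have previously mentioned, the k-th component of the score,
S, is given by
\[
S_k(y \mid C, \theta) = \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k}
= \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \partial h_{ic} / \partial \theta_k}{\sum_{c=1}^{C} h_{ic}}.
\]
The (k, k')-th element of the Hessian, H, is
\[
H_{kk'}(\theta) = \frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k\, \partial \theta_{k'}}
= \sum_{i=1}^{I} \frac{\left(\sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k\, \partial \theta_{k'}}\right)\left(\sum_{c=1}^{C} h_{ic}\right)
- \left(\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k}\right)\left(\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_{k'}}\right)}{\left(\sum_{c=1}^{C} h_{ic}\right)^2},
\]
where
\[
h_{ic} = w_{ic}\, P(y_i \mid C_i = c, Z, \theta).
\]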

Here we give formulas for the derivatives of $h_{ic}$ for all the cases discussed in this paper.
Although the following expressions may appear complicated, they are easy to program and they are
included in the R functions that are available at http://www.creal.cat/jrgonzalez/software.htm.
Binary Traits
Binary Traits without covariates
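In this case, the function $h_{ic}$ takes the form
\[
h_{ic} = w_{ic}\, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k}
= \frac{w_{ic} I_{\{k=c\}}\, y_i e^{y_i \beta_k} (1 + e^{\beta_k}) - w_{ic} I_{\{k=c\}}\, e^{y_i \beta_k} e^{\beta_k}}{(1 + e^{\beta_k})^2}
= I_{\{k=c\}}\, h_{ic} (y_i - p_{ic}),
\]
where
\[
p_{ic} = \frac{1}{1 + e^{-\beta_c}}.
\]
And,
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I_{\{k=c\}} \left( \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right),
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k'.
\]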
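The closed-form derivatives for binary traits without covariates can be checked numerically with central finite differences. The sketch below is illustrative only; the function names and the test values are arbitrary choices, and p_ic is taken as 1/(1 + e^(-β_c)).

```python
import math


def h(beta, y, w):
    """h_ic = w_ic * exp(y_i * beta_c) / (1 + exp(beta_c))."""
    return w * math.exp(y * beta) / (1.0 + math.exp(beta))


def dh(beta, y, w):
    """First derivative with respect to beta_c: h_ic * (y_i - p_ic)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return h(beta, y, w) * (y - p)


def d2h(beta, y, w):
    """Second derivative: dh * (y_i - p_ic) - h_ic * (p_ic - p_ic^2)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return dh(beta, y, w) * (y - p) - h(beta, y, w) * (p - p * p)


def central_difference(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Compare analytic and numerical derivatives for both values of the binary trait
for y in (0, 1):
    numeric_first = central_difference(lambda b: h(b, y, 0.7), 0.3)
    numeric_second = central_difference(lambda b: dh(b, y, 0.7), 0.3)
    assert abs(numeric_first - dh(0.3, y, 0.7)) < 1e-8
    assert abs(numeric_second - d2h(0.3, y, 0.7)) < 1e-8
```

The same pattern (an analytic function plus a finite-difference comparison) is a cheap safeguard when programming the remaining cases of this appendix.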
Binary Traits with covariates
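\[
h_{ic} = w_{ic}\, \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}},
\quad \text{where} \quad \eta_{ic} = \beta_c + \sum_{p=1}^{P} \gamma_p z_{ip}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k}
= \frac{w_{ic} I_{\{k=c\}}\, y_i e^{y_i \eta_{ic}} (1 + e^{\eta_{ic}}) - w_{ic} I_{\{k=c\}}\, e^{y_i \eta_{ic}} e^{\eta_{ic}}}{(1 + e^{\eta_{ic}})^2}
= I_{\{k=c\}}\, h_{ic} (y_i - p_{ic}),
\]
where
\[
p_{ic} = \frac{1}{1 + e^{-\eta_{ic}}}.
\]
And
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I_{\{k=c\}} \left( \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right),
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k'.
\]
For covariates, we have that
\[
\frac{\partial h_{ic}}{\partial \gamma_p} = z_{ip}\, h_{ic} (y_i - p_{ic}),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2}
= z_{ip}\, \frac{\partial h_{ic}}{\partial \gamma_p} (y_i - p_{ic}) - z_{ip}^2\, h_{ic} (p_{ic} - p_{ic}^2),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}}
= z_{ip}\, \frac{\partial h_{ic}}{\partial \gamma_{p'}} (y_i - p_{ic}) - z_{ip} z_{ip'}\, h_{ic} (p_{ic} - p_{ic}^2).
\]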
Quantitative traits
Quantitative traits without covariates and shared variance
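\[
h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k}
= I_{\{k=c\}}\, w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_k)^2}{2\sigma^2}}\, \frac{y_i - \beta_k}{\sigma^2}
= I_{\{k=c\}}\, h_{ic}\, \frac{y_i - \beta_k}{\sigma^2},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I_{\{k=c\}}\, \frac{1}{\sigma^2} \left( \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \beta_k) - h_{ic} \right),
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k',
\]
\[
\frac{\partial h_{ic}}{\partial \sigma}
= w_{ic} \left( -\frac{1}{\sigma^2}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}} + \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}\, \frac{(y_i - \beta_c)^2}{\sigma^3} \right)
= -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \beta_c)^2,
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2}
+ (y_i - \beta_c)^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic}\, \sigma^2}{\sigma^6},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= I_{\{k=c\}} \left( \frac{\partial h_{ic} / \partial \sigma}{\sigma^2} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \beta_c).
\]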
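As for the binary case, the derivatives for quantitative traits with a shared variance can be verified against finite differences. The following Python sketch is an illustration under the parametrization h_ic = w_ic σ^(-1) exp(-(y_i - β_c)² / (2σ²)); function names and numerical values are arbitrary.

```python
import math


def h(beta, sigma, y, w):
    """h_ic = w_ic / sigma * exp(-(y_i - beta_c)^2 / (2 sigma^2))."""
    return w / sigma * math.exp(-(y - beta) ** 2 / (2.0 * sigma ** 2))


def dh_dbeta(beta, sigma, y, w):
    return h(beta, sigma, y, w) * (y - beta) / sigma ** 2


def dh_dsigma(beta, sigma, y, w):
    hv = h(beta, sigma, y, w)
    return -hv / sigma + hv * (y - beta) ** 2 / sigma ** 3


def d2h_dsigma2(beta, sigma, y, w):
    hv = h(beta, sigma, y, w)
    hs = dh_dsigma(beta, sigma, y, w)
    return (-(hs * sigma - hv) / sigma ** 2
            + (y - beta) ** 2 * (hs * sigma ** 3 - 3.0 * hv * sigma ** 2) / sigma ** 6)


def d2h_dbeta_dsigma(beta, sigma, y, w):
    hv = h(beta, sigma, y, w)
    hs = dh_dsigma(beta, sigma, y, w)
    return (hs / sigma ** 2 - 2.0 * hv / sigma ** 3) * (y - beta)


def central_difference(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Arbitrary evaluation point for the check
b0, s0, y0, w0 = 1.0, 0.5, 1.4, 0.6
checks = [
    (central_difference(lambda b: h(b, s0, y0, w0), b0), dh_dbeta(b0, s0, y0, w0)),
    (central_difference(lambda s: h(b0, s, y0, w0), s0), dh_dsigma(b0, s0, y0, w0)),
    (central_difference(lambda s: dh_dsigma(b0, s, y0, w0), s0), d2h_dsigma2(b0, s0, y0, w0)),
    (central_difference(lambda s: dh_dbeta(b0, s, y0, w0), s0), d2h_dbeta_dsigma(b0, s0, y0, w0)),
]
max_error = max(abs(numeric - analytic) for numeric, analytic in checks)
```

A small `max_error` indicates that the programmed expressions agree with the definition of h_ic at the chosen point.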
Quantitative traits with covariates and shared variance

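\[
h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}},
\quad \text{where} \quad \eta_{ic} = \beta_c + \sum_{p=1}^{P} \gamma_p z_{ip}.
\]
Therefore,
\[
\frac{\partial h_{ic}}{\partial \beta_k} = I_{\{k=c\}}\, h_{ic}\, \frac{y_i - \eta_{ic}}{\sigma^2},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I_{\{k=c\}}\, \frac{1}{\sigma^2} \left( \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right),
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \ \text{for } k \neq k',
\]
\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2,
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2}
+ (y_i - \eta_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic}\, \sigma^2}{\sigma^6},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= I_{\{k=c\}} \left( \frac{\partial h_{ic} / \partial \sigma}{\sigma^2} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic}),
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \sigma}
= \left( \frac{\partial h_{ic} / \partial \sigma}{\sigma^2} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic})\, z_{ip},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \beta_k}
= I_{\{k=c\}}\, \frac{z_{ip}}{\sigma^2} \left( \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right),
\]
\[
\frac{\partial h_{ic}}{\partial \gamma_p} = \frac{h_{ic}}{\sigma^2} (y_i - \eta_{ic})\, z_{ip},
\]
\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2}
= \left( \frac{\partial h_{ic}}{\partial \gamma_p} \right)^2 \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip}^2}{\sigma^2},
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}}
= \frac{\partial h_{ic}}{\partial \gamma_p}\, \frac{\partial h_{ic}}{\partial \gamma_{p'}}\, \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip} z_{ip'}}{\sigma^2}
\ \text{for } p \neq p'.
\]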
Trend test
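In this situation we can write the linear predictor of equation (18) as
\[
\eta_{ic} = \beta_1 + \beta_2 (c - 1).
\]
In other words, $\beta_1$ plays the role of an intercept and $\beta_2$ is the slope. In this case we
consider that both $\beta_1$ and $\beta_2$ are shared by all latent classes. In this situation,
bearing in mind that $h_{ic} = w_{ic}\, e^{y_i \eta_{ic}} / (1 + e^{\eta_{ic}})$, we have that for discrete traits
\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, (y_i - p_{ic}), \qquad (19)
\]
and
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}}
= x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}} (y_i - p_{ic}) - x_{ikc}\, x_{ik'c}\, h_{ic} (p_{ic} - p_{ic}^2). \qquad (20)
\]
For quantitative traits, where $h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}}$, we have that
\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, \frac{y_i - \eta_{ic}}{\sigma^2}, \qquad (21)
\]
and
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}}
= x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}}\, \frac{y_i - \eta_{ic}}{\sigma^2} - x_{ikc}\, x_{ik'c}\, \frac{h_{ic}}{\sigma^2}. \qquad (22)
\]
And for the variance, we have that
\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2, \qquad (23)
\]
\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma - h_{ic}}{\sigma^2}
+ (y_i - \eta_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic}\, \sigma^2}{\sigma^6}, \qquad (24)
\]
and
\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= x_{ikc} \left( \frac{\partial h_{ic} / \partial \sigma}{\sigma^2} - \frac{2 h_{ic}}{\sigma^3} \right) (y_i - \eta_{ic}). \qquad (25)
\]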
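The trend-test derivatives for discrete traits, equations (19) and (20), can be sketched and checked as follows. The sketch assumes that x_i1c = 1 and x_i2c = c - 1, which is what differentiating the linear predictor η_ic = β_1 + β_2 (c - 1) gives; the function names and test values are hypothetical.

```python
import math


def linear_predictor(beta, c):
    """eta_ic = beta_1 + beta_2 * (c - 1) for copy number class c = 1, 2, ..."""
    return beta[0] + beta[1] * (c - 1)


def h(beta, c, y, w):
    eta = linear_predictor(beta, c)
    return w * math.exp(y * eta) / (1.0 + math.exp(eta))


def gradient(beta, c, y, w):
    """Equation (19): dh/dbeta_k = h * x_k * (y - p), with x = (1, c - 1) (assumption)."""
    x = (1.0, float(c - 1))
    p = 1.0 / (1.0 + math.exp(-linear_predictor(beta, c)))
    hv = h(beta, c, y, w)
    return [hv * xk * (y - p) for xk in x]


def hessian(beta, c, y, w):
    """Equation (20): x_k * dh/dbeta_l * (y - p) - x_k * x_l * h * (p - p^2)."""
    x = (1.0, float(c - 1))
    p = 1.0 / (1.0 + math.exp(-linear_predictor(beta, c)))
    hv = h(beta, c, y, w)
    g = gradient(beta, c, y, w)
    return [[x[k] * g[l] * (y - p) - x[k] * x[l] * hv * (p - p * p)
             for l in range(2)] for k in range(2)]


def numeric_gradient(f, beta, eps=1e-6):
    """Central finite differences of a scalar function f(beta) of two parameters."""
    out = []
    for k in range(2):
        up, down = list(beta), list(beta)
        up[k] += eps
        down[k] -= eps
        out.append((f(up) - f(down)) / (2.0 * eps))
    return out


beta0 = [-0.5, 0.4]
num_grad = numeric_gradient(lambda b: h(b, 3, 1, 0.8), beta0)
ana_grad = gradient(beta0, 3, 1, 0.8)
```

Differentiating each component of `gradient` numerically in the same way reproduces the rows of `hessian`.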
References
[1] Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz
S, Albertson DG, Pinkel D, Altshuler DM, Eichler EE: Linkage disequilibrium
and heritability of copy-number polymorphisms within duplicated regions
of the human genome. Am J Hum Genet 2006, 79(2):275-90.
[2] Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero
MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M,
Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei
R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J,
Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF,
Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW,
Hurles ME: Global variation in copy number in the human genome. Nature
2006, 444(7118):444-54.
[3] Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay
C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common
copy-number variations in the human genome. Am J Hum Genet 2007,
80:91-104.
[4] Feuk L, Carson AR, Scherer SW: Structural variation in the human genome.
Nat Rev Genet 2006, 7(2):85-97.
[5] Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird
CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P,
Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number
variation on gene expression phenotypes. Science 2007, 315(5813):848-53.
[6] Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ,
Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W,
Clark RA, Anderson SA, O'Connell RJ, Agan BK, Ahuja SS, Bologna R, Sen L,
Dolan MJ, Ahuja SK: The influence of CCL3L1 gene-containing segmental
duplications on HIV-1/AIDS susceptibility. Science 2005, 307(5714):1434-40.
[7] Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A,
Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D:
APP locus duplication causes autosomal dominant early-onset Alzheimer
disease with cerebral amyloid angiopathy. Nat Genet 2006, 38:24-6.
[8] Le Marechal C, Masson E, Chen JM, Morel F, Ruszniewski P, Levy P, Ferec C:
Hereditary pancreatitis caused by triplication of the trypsinogen locus.
Nat Genet 2006, 38(12):1372-4.
[9] Schouten JP, McElgunn CJ, Waaijer R, Zwijnenburg D, Diepvens F, Pals G:
Relative quantification of 40 nucleic acid sequences by multiplex
ligation-dependent probe amplification. Nucleic Acids Res 2002, 30(12):e57.
[10] Gonzalez J, Carrasco J, Armengol L, Villatoro S, Jover L, Yasui Y, Estivill X:
Probe-specific mixed-model approach to detect copy number differences
using multiplex ligation-dependent probe amplification (MLPA). BMC
Bioinformatics 2008, 9:261.
[11] Engert S, Wappenschmidt B, Betz B, Kast K, Kutsche M, Hellebrand H, Goecke
T, Kiechle M, Niederacher D, Schmutzler R, Meindl A: MLPA screening in the
BRCA1 gene from 1,506 German hereditary breast cancer cases: novel
deletions, frequent involvement of exon 17, and occurrence in single
early-onset cases. Hum Genet 2008, 29(7):948-58.
[12] Hansen T, Jonson L, Albrechtsen A, Andersen M, Ejlertsen B, Nielsen F: Large
BRCA1 and BRCA2 genomic rearrangements in Danish high risk
breast-ovarian cancer families.
[13] Aitman T, Dong R, Vyse T, Norsworthy P, Johnson M, Smith J, Mangion J,
Roberton-Lowe C, Marshall A, Petretto E, Hodges M, Bhangal G, Patel S,
Sheehan-Rooney K, Duda M, Cook P, Evans D, Domin J, Flint J, Boyle J, Pusey C, Cook H:
Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis
in rats and humans. Nature 2006, 439(7078):851-5.
[14] Fellermann K, DE S, E S, H S, J W, CL B, W R, A T, M S, P L, B R, EF S: A
chromosome 8 gene-cluster polymorphism with low human beta-defensin
2 gene copy number predisposes to Crohn disease of the colon. Am J Hum
Genet 2006, 79(3):439-48.
[15] Fraley C, Raftery AE: How many clusters? Which clustering method?
Answers via model-based cluster analysis. The Computer Journal 1998, 41:578-588.
[16] Leisch F: FlexMix: A general framework for finite mixture models and
latent class regression in R. Journal of Statistical Software 2004, 11(8):1-18,
[http://www.jstatsoft.org/v11/i08/].
[17] Du J: Combined Algorithms for Fitting Finite Mixture Distributions. PhD
thesis, McMaster University, Ontario, Canada 2002.
[18] Bashir S, Duffy S: The correction of risk estimates for measurement error.
Ann Epidem 1993, 7:156-164.
[19] Davidov O, Faraggi D, Reiser B: Misclassification in logistic regression with
discrete covariates. Biometrical Journal 2003, 5:541-553.
[20] Greenland S: Basic methods for sensitivity analysis of biases. Int J Epi 1996,
25:1107-1115.
[21] Spiegelman D, Rosner B, Logan R: Estimation and inference for logistic
regression with covariate misclassification and measurement error, in main
study/validation study designs. J Am Stat Assoc 2000, 95:51-61.
[22] Armour J, Barton D, Cockburn D, Taylor G: The detection of large deletions
or duplications in genomic DNA. Hum Mut 2002, 20:325-337.
Figure Legend
Figure 1: CNV quantitative measurements. Examples of CNV data showing different
clustering quality and copy number status.
[Figure: six histograms, panels CNV1 to CNV6; y-axis: Frequency.]
Figure 2: Empirical power for simulation studies. Empirical power for the three different
approaches analyzed, varying the quality of clustering of the underlying copy number status. The left
panel is for a fixed set of variances and varying means, while the right panel is for fixed
means and varying variances.
[Figure: two panels; y-axis: power (p<10e-3); curves: LC, NAIVE, THRESHOLD, TRUE.]
Figure 3: Empirical distribution of effect estimates (log OR) for each copy number
status. Results for 1000 simulated case-control data sets (300/300), for different degrees of
association (e.g. different OR) and different distributions of quantitative CNV measurements
(e.g. varying clustering quality).
[Figure: grid of density panels; x-axis: log(OR); y-axis: Density; curves: true, estimated, naive, corrected; panels labelled by proportions (0.5,0.5) and (0.8,0.2), OR 1.3 and 2.0, and (.15,.15), (.15,.2) and (.2,.2).]
Figure 4: Association between Gene 1 and disease. Graphical representation of peak
intensities (CNV quantitative measurement) of individuals for Gene 1 analyzed in the example.
Different colors indicate copy number status inferred using our proposed finite mixture model.
[Figure: peak intensity (CNV quantitative measurement) against individuals, by case-control status, with the density and the copy number estimation; goodness-of-fit (p value): 0.66153.]
Figure 5: Association between Gene 2 and disease. Graphical representation of peak
intensities (CNV quantitative measurement) of individuals for Gene 2 analyzed in the example.
Different colors indicate copy number status inferred using our proposed finite mixture model.
[Figure: peak intensity (CNV quantitative measurement) against individuals, by case-control status, with the density.]
              Copy number status
Disease       1     2     ...   C       Total
Cases         r1    r2    ...   rC      R
Controls      s1    s2    ...   sC      S
Table 1: Contingency table of disease status and copy number category
Table 2: Odds ratio (e^β) and mean square error obtained in 1,000 simulations using the
three different approaches: NAIVE, THRES and LC (see the text for a description of
each one). The results are given for different scenarios varying the number of individuals (I), the
proportion of individuals in each copy number status (π), the odds ratio (e^β) and the variance
of the CNV quantitative measurements (σ).

                                 Odds ratio (e^β)             Mean Square Error (×10³)
I     π     e^β   σ              SIM    NAIVE  THRES  LC      NAIVE  THRES  LC
50    0.8   1.3   (0.15,0.15)    1.23   1.17   1.15   1.20    57     87     42
50    0.8   1.3   (0.2,0.2)      1.24   1.14   1.09   1.21    107    131    114
50    0.8   1.3   (0.15,0.2)     1.28   1.18   1.15   1.24    134    148    112
50    0.8   2     (0.15,0.15)    1.60   1.40   1.28   1.48    54     85     44
50    0.8   2     (0.2,0.2)      1.82   1.36   1.29   1.52    152    158    126
50    0.8   2     (0.15,0.2)     1.89   1.42   1.33   1.57    180    253    162
50    0.5   1.3   (0.15,0.15)    1.26   1.24   1.21   1.26    39     51     32
50    0.5   1.3   (0.2,0.2)      1.32   1.28   1.25   1.35    82     79     97
50    0.5   1.3   (0.15,0.2)     1.26   1.23   1.20   1.26    66     72     60
50    0.5   2     (0.15,0.15)    2.04   1.94   1.83   2.05    40     67     34
50    0.5   2     (0.2,0.2)      2.04   1.76   1.68   2.05    107    128    92
50    0.5   2     (0.15,0.2)     2.06   1.78   1.72   1.99    87     107    71
300   0.8   1.3   (0.15,0.15)    1.30   1.25   1.18   1.30    13     32     10
300   0.8   1.3   (0.2,0.2)      1.32   1.25   1.15   1.34    27     50     29
300   0.8   1.3   (0.15,0.2)     1.30   1.22   1.16   1.29    24     42     21
300   0.8   2     (0.15,0.15)    2.01   1.87   1.49   2.01    21     120    13
300   0.8   2     (0.2,0.2)      2.03   1.70   1.36   1.99    69     203    43
300   0.8   2     (0.15,0.2)     2.03   1.62   1.38   1.86    78     189    38
300   0.5   1.3   (0.15,0.15)    1.31   1.27   1.26   1.30    7      9      5
300   0.5   1.3   (0.2,0.2)      1.30   1.23   1.22   1.30    15     17     12
300   0.5   1.3   (0.15,0.2)     1.30   1.24   1.23   1.29    12     14     9
300   0.5   2     (0.15,0.15)    2.00   1.87   1.77   2.00    11     23     5
300   0.5   2     (0.2,0.2)      2.00   1.72   1.66   2.02    36     51     15
300   0.5   2     (0.15,0.2)     2.00   1.76   1.71   1.97    26     37     10
Table 3: Empirical coverage and power obtained in 1,000 simulations using the three different approaches: NAIVE, THRES
and LC (see the text for a description of each one). The results are given for different scenarios varying the number of
individuals (I), the proportion of individuals in each copy number status (π), the odds ratio (e^β) and the variance of the CNV quantitative
measurements (σ). The table also shows the variance of the parameter estimates using the asymptotic (ASYM) variance compared
with the empirical (EMP) variance.

                                                   Coverage (%)                  Power (%)
I     π     e^β   σ              EMP      ASYM     SIM    NAIVE  THRES  LC      SIM    NAIVE  THRES  LC
50    0.8   1.3   (0.15,0.15)    0.5821   0.5898   94.2   96.2   95.8   96.8    6.6    5.4    6.4    4.6
50    0.8   1.3   (0.2,0.2)      0.5679   0.6605   93.0   94.0   93.0   96.2    5.2    4.8    4.2    3.6
50    0.8   1.3   (0.15,0.2)     0.5326   0.5846   96.6   96.2   95.4   97.4    6.8    4.8    4.8    3.0
50    0.8   2     (0.15,0.15)    0.6382   0.6512   94.2   92.6   89.0   94.0    22.0   16.8   11.2   15.4
50    0.8   2     (0.2,0.2)      0.6103   0.7057   92.8   92.2   82.8   95.2    16.8   9.4    7.4    7.0
50    0.8   2     (0.15,0.2)     0.6174   0.6407   95.6   87.0   79.4   93.0    19.4   10.6   9.8    9.6
50    0.5   1.3   (0.15,0.15)    0.4168   0.4367   94.0   94.2   95.2   93.8    11.6   10.0   9.2    10.0
50    0.5   1.3   (0.2,0.2)      0.4298   0.4838   94.6   93.8   94.0   95.4    12.6   7.0    7.0    7.2
50    0.5   1.3   (0.15,0.2)     0.3984   0.4578   95.2   95.2   95.2   95.6    11.4   8.6    9.4    8.2
50    0.5   2     (0.15,0.15)    0.4231   0.4495   95.6   94.6   93.8   94.6    39.4   32.4   32.4   32.6
50    0.5   2     (0.2,0.2)      0.4022   0.5020   97.0   95.0   94.6   98.2    42.2   23.8   23.2   25.2
50    0.5   2     (0.15,0.2)     0.4345   0.4696   94.4   93.4   94.4   95.6    47.4   30.8   29.8   33.4
300   0.8   1.3   (0.15,0.15)    0.2291   0.2341   94.0   94.0   89.2   93.2    20.4   15.2   17.0   17.8
300   0.8   1.3   (0.2,0.2)      0.2208   0.2667   94.6   94.4   88.6   96.4    23.0   17.0   11.0   16.2
300   0.8   1.3   (0.15,0.2)     0.2192   0.2373   94.2   93.6   89.2   96.0    23.4   15.8   13.2   18.0
300   0.8   2     (0.15,0.15)    0.2452   0.2610   94.2   93.6   66.0   94.6    85.4   78.4   58.6   79.2
300   0.8   2     (0.2,0.2)      0.2334   0.2996   95.8   89.8   43.2   96.0    84.2   60.8   42.6   66.6
300   0.8   2     (0.15,0.2)     0.2455   0.2591   93.8   83.0   43.8   94.6    85.8   62.8   44.8   67.4
300   0.5   1.3   (0.15,0.15)    0.1711   0.1775   93.6   93.8   94.0   93.8    37.0   30.8   31.2   32.4
300   0.5   1.3   (0.2,0.2)      0.1709   0.1970   94.4   93.8   92.8   93.6    36.6   24.4   25.0   28.2
300   0.5   1.3   (0.15,0.2)     0.1582   0.1866   96.8   95.2   94.4   95.2    34.6   22.8   24.6   25.2
300   0.5   2     (0.15,0.15)    0.1621   0.1823   95.8   95.2   90.4   95.8    98.4   96.8   93.0   97.2
300   0.5   2     (0.2,0.2)      0.1692   0.2030   96.2   84.0   82.4   96.0    99.2   89.4   90.0   94.2
300   0.5   2     (0.15,0.2)     0.1793   0.1904   95.4   88.2   83.0   95.2    98.2   92.4   88.2   94.4
                         True copy number status
          Estimated      0      1      2
Gene 1    0              426    0      0
          1              0      201    0
          2              0      0      24
Gene 2    0              85     0      0
          1              5      287    0
          2              0      73     204
Table 4: Contingency table of estimated and true copy number status for two genes
involved in complex disease example
                  True CN                         Estimated CN
                  Co    Ca    OR (CI95%)          Co    Ca    ORnaive (CI95%)     ORLC (CI95%)
Gene 1
  0               215   211   1                   211   215   1                   1
  1               80    121   1.54 (1.09,2.17)    121   80    1.54 (1.09,2.17)    1.54 (1.10,2.16)
  2               6     18    3.06 (1.19,7.85)    18    6     3.06 (1.19,7.85)    3.06 (1.19,7.87)
  P association               0.0027                          0.0027              0.0023
  P trend                     5.0 × 10^-4                     5.0 × 10^-4         5.0 × 10^-4
Gene 2
  0               24    66    1                   22    63    1                   1
  1               159   201   0.46 (0.27,0.77)    129   178   0.44 (0.26,0.75)    0.47 (0.28,0.88)
  2               108   93    0.31 (0.18,0.54)    140   119   0.33 (0.19,0.57)    0.31 (0.18,0.54)
  P association               7.2 × 10^-5                     2.3 × 10^-4         8.4 × 10^-5
  P trend                     2.1 × 10^-5                     1.0 × 10^-4         2.1 × 10^-5
Table 5: Association analysis of disease status and copy number category using the true
copy number status and the one estimated using the proposed finite mixture model.
Additional files
Additional file 1: latent model MLPA sup.pdf, 465.1K
Additional files provided with this submission:

Additional file 1: latent_model_mlpa_sup.pdf, 278K
http://www.biomedcentral.com/imedia/3102130692352773/supp1.pdf
BMC Bioinformatics
BioMed Central
Open Access
Methodology article
Accounting for uncertainty when assessing association between

copy number and disease: a latent class model
Juan R González*1,2,3, Isaac Subirana2,3, Geòrgia Escaramís2,4,
Solymar Peraza2,1, Alejandro Cáceres1,3, Xavier Estivill4,2
and Lluís Armengol4
Address: 1Center for research in environmental epidemiology (CREAL), Barcelona, Spain, 2CIBER en Epidemiología y Salud Pública
(CIBERESP), Barcelona, Spain, 3Institut Municipal d'Investigació Mèdica (IMIM), Barcelona, Spain and 4Genes and Disease Program,
Center for Genomic Regulation, Barcelona, Spain
E-mail: Juan R González* - jrgonzalez@creal.cat; Isaac Subirana - isubirana@imim.es; Geòrgia Escaramís - georgia.escaramis@crg.es;
Solymar Peraza - speraza@creal.cat; Alejandro Cáceres - acaceres@creal.cat; Xavier Estivill - xavier.estivill@crg.es;
Lluís Armengol - lluis.armengol@crg.es
*Corresponding author
Published: 06 June 2009

BMC Bioinformatics 2009, 10:172
Received: 12 November 2008

Accepted: 6 June 2009
doi: 10.1186/1471-2105-10-172
This article is available from: http://www.biomedcentral.com/1471-2105/10/172

© 2009 González et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Copy number variations (CNVs) may play an important role in disease risk by
altering dosage of genes and other regulatory elements, which may have functional and, ultimately,
phenotypic consequences. Therefore, determining whether a CNV is associated or not with a given
disease might be relevant in understanding the genesis and progression of human diseases. Current
stage technology give CNV probe signal from which copy number status is inferred. Incorporating
uncertainty of CNV calling in the statistical analysis is therefore a highly important aspect. In this
paper, we present a framework for assessing association between CNVs and disease in casecontrol studies where uncertainty is taken into account. We also indicate how to use the model to
analyze continuous traits and adjust for confounding covariates.
Results: Through simulation studies, we show that our method outperforms other simple methods
based on inferring the underlying CNV and assessing association using regular tests that do not
propagate call uncertainty. We apply the method to a real data set in a controlled MLPA experiment
showing good results. The methodology is also extended to illustrate how to analyze aCGH data.
Conclusion: We demonstrate that our method is robust and achieves maximal theoretical
power since it accommodates uncertainty when copy number status is inferred. We have made
R functions freely available.
Background
With the recent technological advances, various genome-wide studies have uncovered an unprecedented number of structural variants throughout the human genome [1-3], mainly in the form of copy number variations (CNVs). The considerable number of genes and other regulatory elements that fall within these variable regions make CNVs very likely to have functional and, ultimately, phenotypic consequences [4,5]. In fact, recent studies have reported a correlation between copy number of specific genes and degree of disease predisposition [6-8], indicating that identification of DNA copy number is important in understanding genesis and progression of human diseases.

Several techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH). The goal of this approach is to identify contiguous DNA segments where copy number changes are present. The ability of aCGH to distinguish between different numbers of copies is limited, so various quantitative techniques are required for more precise, targeted analysis of genomic regions. For known CNVs, real-time PCR assays can be used to compare the copy number status of particular loci in cases and controls. Individuals are typically binned into copy number categories using pre-defined thresholds of probe signal intensity. Recently, Multiplex Ligation-dependent Probe Amplification (MLPA) [9] has also been used to quantify copy number classes. This method allows the analysis of several loci at the same time in a single assay. MLPA is usually used to identify gains or losses in test samples with respect to controls [10], but it can also be used in the context of association studies in case-control or cohort settings [11,12].

The statistical methods used in CNV-disease association studies are currently very simple. Quantitative methods give CNV probe signal intensity measurements for each individual as a continuous variable, from which copy number status is inferred, generally using pre-defined thresholds. Differences in copy number distribution between cases and controls are then assessed using $\chi^2$, Fisher or Mann-Whitney tests [6,13,14]. However, the distribution of CNV probe measurements is continuous and multimodal, meaning that signal intensity should be considered as a mixture of curves. In many instances, the curves of the underlying distributions overlap, leading to uncertainty. Therefore, scoring copy number by binning and then assessing the association may lead to misclassification and unreliable results.

Ionita-Laza et al. (2009) pointed out that it is not immediately clear how this uncertainty of CNV calling should be incorporated in the statistical analysis [15]. To overcome this difficulty in assessing association between CNVs and disease, we propose a latent class (LC) model that incorporates the uncertainty that appears when CNV calling is performed. After inferring copy number using Gaussian finite mixture distributions, or any other calling algorithm, the model assesses the relationship between the trait and a CNV using a mixture of generalized linear models. Association is then tested using a likelihood ratio procedure. We validate and compare our method with existing methods through a simulation study. We then illustrate how to test association between CNVs and the trait by using two real examples. One of them corresponds to a case-control study using data from an MLPA experiment where the true copy number status is known. The second example belongs to a study where breast cancer cell lines are analyzed using aCGH.

Methods
Inference of copy number status
Let us assume that we observe I individuals from a given population, consisting of C mutually exclusive latent classes c = 1, ..., C (e.g. copy number status). Instead of observing these classes, we observe a surrogate variable, X, corresponding to a continuous variable arising from any quantitative method. For instance, in targeted studies using MLPA or real-time PCR, X corresponds to peak intensities for each CNV probe. In the context of a whole genome scan, one may have quantitative data from aCGH or any other platform such as Illumina or Affymetrix, where, for each probe, the variable X corresponds to a ratio of intensities. Figure 1 shows a number of possible distributions that signal intensities may have. Some variants clearly show different underlying copy number status with multimodal signal intensity distributions (CNV2, CNV4 and CNV6). In other cases, where the existence of different copy numbers is not clear, inferring copy number by binning the data may be difficult or unfeasible.

Figure 1
CNV quantitative measurements. Examples of CNV data showing different clustering quality and copy number status.
[Six histogram panels, CNV1 to CNV6; vertical axis: Frequency.]

For each CNV variant, we are interested in classifying the subjects into the C classes using the surrogate variable X. We propose to model the unobserved latent classes using a finite mixture model with C components of the form
\[
f(x \mid \Theta) = \sum_{c=1}^{C} \pi_c N(x \mid \eta_c, \sigma_c^2), \qquad (1)
\]
where $N(\cdot \mid \eta_c, \sigma_c^2)$ is the Gaussian distribution with mean $\eta_c$ and variance $\sigma_c^2$, $\Theta$ denoting all model parameters (e.g., $\Theta = (\eta_c, \sigma_c^2)$, c = 1, ..., C), and x is the surrogate variable that corresponds to the quantitative measure of copy number status. For the component weights $\pi_c$ it holds that
\[
\sum_{c=1}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \ge 0, \; c = 1, \ldots, C.
\]
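As a rough illustration of the mixture density in equation (1), the Python sketch below evaluates a Gaussian finite mixture at a few intensity values. The function names and the three-component parameter values are assumptions made for this example; they are not taken from the authors' R functions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Equation (1): sum over classes of pi_c * N(x | eta_c, sigma_c^2)."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


# Hypothetical three-class example (values chosen only for illustration).
WEIGHTS = [0.5, 0.4, 0.1]
MEANS = [0.2, 0.4, 0.6]
SDS = [0.05, 0.05, 0.05]

if __name__ == "__main__":
    for x in (0.2, 0.3, 0.4, 0.6):
        print(f"f({x}) = {mixture_pdf(x, WEIGHTS, MEANS, SDS):.4f}")
```

The weight check mirrors the constraint that the component weights are non-negative and sum to one; a density built this way integrates to one over the intensity axis.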
The value of C to be used is chosen by applying the Bayesian Information Criterion (BIC) [16]. This mixture model approach for calling is similar to some used for the analysis of aCGH data [17,18] where correlation among probes should be considered. When analyzing MLPA data, it should be pointed out that in some instances, especially when there are individuals with 0 copies, the intensity (see CNV2 and CNV4 in Figure 1) for a null allele is meant to be equal to 0. However, due to experimental noise, in some cases this ratio shows values that deviate slightly from this theoretical value. In our experience with hundreds of home-made MLPA probes, the value for null alleles is typically below 0.1; nevertheless, we recommend that this parameter be determined experimentally for each of the probes used in the MLPA experiments using the appropriate control samples. For these cases, the procedure used to estimate the parameters in (1) fails because the underlying distribution of individuals with 0 copies is not normal. In these situations we propose to fit the following mixture model to determine the latent classes
\[
f(x \mid \Theta) = \pi_1 I\{x \le \tau\} + \sum_{c=2}^{C} \pi_c N(x \mid \eta_c, \sigma_c^2)\, I\{x > \tau\}, \qquad (2)
\]
where $\tau$ is given by the user, as previously indicated, $\pi_1 = \sum_{i=1}^{I} I\{x_i \le \tau\} / I$, $I\{\cdot\}$ denotes an indicator function, and
\[
\pi_1 + \sum_{c=2}^{C} \pi_c = 1 \quad \text{and} \quad \pi_c \ge 0, \; c = 2, \ldots, C.
\]
Finite mixture parameters can be estimated using the EM algorithm [19,20] or Newton-type procedures [20]. Then, the posterior probability that individual i with an observed value x belongs to copy number class j is given by
\[
w_{ij} = P(j \mid x, \Theta) = \frac{\pi_j N(x \mid \eta_j, \sigma_j^2)}{\sum_{c} \pi_c N(x \mid \eta_c, \sigma_c^2)}. \qquad (3)
\]
The posterior probabilities are used to segment data by assigning each individual to a given copy number status corresponding to the class with maximum posterior probability (MAP). After fitting this finite mixture model, we can perform a goodness-of-fit test using a $\chi^2$ test statistic.

Latent class model
Discrete traits
Let us suppose that copy number status is associated with a binary phenotype (case-control). The association is typically assessed using a $\chi^2$ test for the contingency table (Table 1). Misclassification in the table (due to uncertainty when inferring CNVs) is incorporated when we assign each individual to a given class c using maximum a-posteriori probability (MAP). Thus, this problem can be seen as an association study with misclassification ("measurement error") [21]. It is well known that misclassification of covariates has important implications for parameter estimates and statistical inference [22]. Some approaches account for such error [23,24]. These are, however, based on performing validation studies in a subsample. In the present context, this is unfeasible because hundreds of genes are normally analyzed at a time, and the technology may have a different sensitivity and specificity for each of the inspected loci. Therefore, we propose to use the posterior probability of belonging to each latent class to model the degree of misclassification of copy number status. We then take this information into account in the association model.
Conditioning on cluster c, we have that
\[
P(y_i \mid C_i = c, \beta) = \mu_{ic}^{y_i} (1 - \mu_{ic})^{1 - y_i}, \qquad (4)
\]
where $\beta = (\beta_1, \ldots, \beta_C)$ is our vector of parameters, and
\[
\mathrm{logit}(\mu_{ic}) \equiv \beta_c.
\]
Then, equation (4) can be rewritten as
\[
P(y_i \mid C_i = c, \beta) = \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}.
\]
Now, we consider that copy number status is measured with error (i.e., the latent class is not known). Therefore, we are modeling the probability of being an affected individual as a mixture of C binomial variables, as follows:
\[
P(y_i \mid \beta) = \sum_{c=1}^{C} w_{ic} P(y_i \mid C_i = c, \beta),
\]
Table 1: Contingency table of disease status and copy number category

              Copy number status
Disease       1      2      ...    C      Total
Cases         r1     r2     ...    rC     R
Controls      s1     s2     ...    sC     S
where $w_{ic}$ is the posterior probability that individual i belongs to copy number class c, given in (3). Therefore, assuming conditional independence of case-control status, given latent class, the likelihood function for model parameters $\beta$ can be written as
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}. \qquad (5)
\]
We can then simply compute the odds ratio (OR) of belonging to class c with respect to a given reference r as
\[
OR_{c/r} = e^{\beta_c - \beta_r}. \qquad (6)
\]
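To make the discrete-trait likelihood concrete, the following Python sketch computes posterior class probabilities as in equation (3) and plugs them, as fixed weights, into the logarithm of likelihood (5). It is a minimal illustration under assumed values: the intensities, trait values, mixture parameters and function names are hypothetical, and the mixture parameters are treated as already estimated.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_weights(x, weights, means, sds):
    """Equation (3): posterior probability of each copy number class given x."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)  # assumes x is not so extreme that every term underflows to 0
    return [p / total for p in parts]


def prob_given_class(y, beta_c):
    """P(y | C = c, beta) = exp(y * beta_c) / (1 + exp(beta_c)) for y in {0, 1}."""
    return math.exp(y * beta_c) / (1.0 + math.exp(beta_c))


def log_likelihood(ys, post, betas):
    """Logarithm of likelihood (5): sum_i log sum_c w_ic * P(y_i | C_i = c, beta)."""
    total = 0.0
    for y, w_i in zip(ys, post):
        total += math.log(sum(w * prob_given_class(y, b) for w, b in zip(w_i, betas)))
    return total


def odds_ratio(betas, c, r):
    """OR of class c with respect to reference class r: exp(beta_c - beta_r)."""
    return math.exp(betas[c] - betas[r])


# Hypothetical data: intensities, case (1) / control (0) status, two classes.
XS = [0.95, 1.05, 1.25, 1.45, 1.55]
YS = [0, 0, 1, 1, 1]
PI, ETA, SIGMA = [0.5, 0.5], [1.0, 1.5], [0.15, 0.15]
POST = [posterior_weights(x, PI, ETA, SIGMA) for x in XS]

if __name__ == "__main__":
    betas = [-0.5, 0.8]  # arbitrary trial values, not estimates
    print(log_likelihood(YS, POST, betas), odds_ratio(betas, 1, 0))
```

When every individual has a posterior weight of 1 for a single class, this log-likelihood reduces to the ordinary logistic one, which is the situation the naive analysis implicitly assumes.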
Quantitative traits
We now consider the case where our phenotype, Y, is continuous. We assume that $Y \mid c \sim N(\mu_c, \sigma^2)$. In this case, conditioning on cluster c
\[
P(y_i \mid C_i = c, \beta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \mu_{ic})^2}{2\sigma^2} \right), \qquad (7)
\]
where
\[
\mu_{ic} \equiv \beta_c.
\]
Similar to the case of discrete traits, the likelihood function for model parameters $\beta$ is given by
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} P(y_i \mid C_i = c, \beta) = \prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \beta_c)^2}{2\sigma^2} \right). \qquad (8)
\]
In this case we are interested in evaluating the difference between the mean effect of individuals with c copies and r copies. This can simply be computed as
\[
\psi_{c/r} = \beta_c - \beta_r.
\]
Covariate Adjustment
In some instances researchers are interested in assessing the effect of CNVs after adjusting for other covariates, $Z_1, \ldots, Z_K$ (usually called confounding variables). In this case, the likelihood function can be written as
\[
\prod_{i=1}^{I} \sum_{c=1}^{C} w_{ic} P(y_i \mid C_i = c, Z, \beta_c, \gamma),
\]
where
\[
P(y_i \mid C_i = c, Z, \beta_c, \gamma) = \frac{e^{y_i \psi_{ic}}}{1 + e^{\psi_{ic}}} \qquad (9)
\]
for discrete traits, and
\[
P(y_i \mid C_i = c, Z, \beta_c, \gamma, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \psi_{ic})^2}{2\sigma^2} \right) \qquad (10)
\]
for quantitative traits. In both cases
\[
\psi_{ic} = \beta_c + \gamma_1 Z_{i1} + \cdots + \gamma_K Z_{iK}. \qquad (11)
\]
Parameter estimation
In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity, let $\theta \equiv (\beta, \gamma, \sigma)$ (notice that for discrete traits $\sigma = 1$). We consider that the weights, $w_{ic}$, are known and that they are given by the surrogate variable X from equation (3). Therefore, they can be used in the log-likelihood calculation, resulting in
\[
\log P(Y \mid C_i = c, Z, \theta) = \sum_{i=1}^{I} \log \sum_{c=1}^{C} w_{ic} P(y_i \mid C_i = c, Z, \theta). \qquad (12)
\]
Here $P(y_i \mid C_i = c, Z, \theta)$ is given by equations (9) and (10) for discrete and quantitative traits, respectively. The maximum likelihood estimators (MLE) of the model parameters maximize this log-likelihood function. We propose to use a Newton-Raphson procedure to find parameter estimates. The k-th component of the score, S, is given by
\[
S_k(y \mid C, \theta) \equiv \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k} = \sum_{i=1}^{I} \frac{\sum_{c=1}^{C} \partial h_{ic} / \partial \theta_k}{\sum_{c=1}^{C} h_{ic}}.
\]
The (k, k')-th element of the Hessian, H, is
\[
H_{kk'}(\theta) \equiv \frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k \, \partial \theta_{k'}} = \sum_{i=1}^{I} \frac{\left( \sum_{c=1}^{C} h_{ic} \right) \left( \sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k \partial \theta_{k'}} \right) - \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k} \right) \left( \sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_{k'}} \right)}{\left( \sum_{c=1}^{C} h_{ic} \right)^2},
\]
where
\[
h_{ic} \equiv w_{ic} P(y_i \mid C_i = c, Z, \theta).
\]
Formulae for the derivatives of $h_{ic}$ for covariates and for discrete and quantitative traits are given in the Appendix.
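The score expression can be checked numerically. The Python sketch below implements the terms h_ic for a binary trait with covariates, following equations (9) and (11), and evaluates the component of the score for one class effect beta_k. The derivative used in the code was worked out for this sketch from equation (9) and is not copied from the Appendix; all data values and function names are hypothetical.

```python
import math


def linear_predictor(beta_c, gamma, z):
    """Equation (11): psi_ic = beta_c + gamma_1 * Z_i1 + ... + gamma_K * Z_iK."""
    return beta_c + sum(g * zk for g, zk in zip(gamma, z))


def h_term(y, w, beta_c, gamma, z):
    """h_ic = w_ic * P(y_i | C_i = c, Z, theta) for a binary trait, equation (9)."""
    psi = linear_predictor(beta_c, gamma, z)
    return w * math.exp(y * psi) / (1.0 + math.exp(psi))


def log_likelihood(ys, post, covs, betas, gamma):
    """Equation (12): sum_i log sum_c h_ic."""
    total = 0.0
    for y, w_i, z in zip(ys, post, covs):
        total += math.log(sum(h_term(y, w, b, gamma, z) for w, b in zip(w_i, betas)))
    return total


def score_beta(ys, post, covs, betas, gamma, k):
    """Score component for beta_k: sum_i (d h_ik / d beta_k) / sum_c h_ic."""
    total = 0.0
    for y, w_i, z in zip(ys, post, covs):
        denom = sum(h_term(y, w, b, gamma, z) for w, b in zip(w_i, betas))
        mu_k = 1.0 / (1.0 + math.exp(-linear_predictor(betas[k], gamma, z)))
        # Only class k depends on beta_k, and d h_ik / d beta_k = h_ik * (y_i - mu_ik).
        total += h_term(y, w_i[k], betas[k], gamma, z) * (y - mu_k) / denom
    return total


# Hypothetical inputs: trait, posterior weights for two classes, one covariate.
YS = [0, 1, 1, 0, 1]
POST = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3], [0.1, 0.9]]
COVS = [[0.0], [1.0], [0.5], [1.0], [0.0]]
BETAS = [-0.3, 0.6]
GAMMA = [0.2]

if __name__ == "__main__":
    print([score_beta(YS, POST, COVS, BETAS, GAMMA, k) for k in range(2)])
```

A Newton-Raphson step would combine such score components with the Hessian; comparing the analytic score with a finite-difference derivative of the log-likelihood is a cheap way to catch algebra mistakes before doing so.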
MLE can be used to estimate, under the multiplicative model, the OR between individuals with copy number status c with respect to a reference category (e.g., individuals with copy number status r) as
\[
OR_{c/r} = e^{\beta_c - \beta_r}. \qquad (13)
\]
Similarly, when analyzing continuous traits, the estimated mean effect among individuals with c copies with respect to those with r copies is
\[
\psi_{c/r} = \beta_c - \beta_r. \qquad (14)
\]
The asymptotic variance-covariance matrix of maximum likelihood estimates of $\theta$ can be estimated using the observed information matrix, F, as
\[
Var(\theta) = F^{-1}(\theta) = -H^{-1}(\theta). \qquad (15)
\]
Therefore, we can compute a 95% confidence interval (CI95%) for $OR_{c/r}$ using the expression
\[
CI_{1-\alpha}(OR_{c/r}) \equiv \exp\left\{ (\beta_c - \beta_r) \pm z_{\alpha/2} \sqrt{Var(\theta)_{[c,c]} + Var(\theta)_{[r,r]} - 2\,Var(\theta)_{[c,r]}} \right\}, \qquad (16)
\]
and for $\psi_{c/r}$
\[
CI_{1-\alpha}(\psi_{c/r}) \equiv (\beta_c - \beta_r) \pm z_{\alpha/2} \sqrt{Var(\theta)_{[c,c]} + Var(\theta)_{[r,r]} - 2\,Var(\theta)_{[c,r]}}, \qquad (17)
\]
where $z_{\alpha/2}$ denotes the $(1 - \alpha/2)$-th quantile of a standard normal distribution, $\alpha$ is the desired type-I error, and subindex $[\cdot, \cdot]$ denotes the position in the inverse of Fisher's information matrix.
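Expression (16) is straightforward to evaluate once the estimates and the relevant entries of the variance-covariance matrix are available. The following Python sketch shows one way to do it; the function name and the numbers passed to it are illustrative assumptions, not output from the paper's software.

```python
import math
from statistics import NormalDist


def or_confidence_interval(beta_c, beta_r, var_cc, var_rr, cov_cr, alpha=0.05):
    """Confidence interval for OR_{c/r} = exp(beta_c - beta_r), as in expression (16).

    var_cc, var_rr and cov_cr are the [c,c], [r,r] and [c,r] entries of the
    estimated variance-covariance matrix of the parameter estimates.
    """
    var_diff = var_cc + var_rr - 2.0 * cov_cr
    if var_diff < 0:
        raise ValueError("variance of the difference must be non-negative")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2) quantile of N(0, 1)
    diff = beta_c - beta_r
    half_width = z * math.sqrt(var_diff)
    return math.exp(diff - half_width), math.exp(diff), math.exp(diff + half_width)


if __name__ == "__main__":
    # Hypothetical estimates: reference class fixed at 0, class effect log(1.6).
    print(or_confidence_interval(math.log(1.6), 0.0, 0.03, 0.0, 0.0))
```

The interval is symmetric around the estimate on the log scale and becomes asymmetric after exponentiation, which is why it is built on beta_c - beta_r and only then transformed.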
Hypothesis testing
We propose to use a likelihood ratio test to assess disease association, taking the model without the copy number variable as reference. Twice the increase in the log-likelihood provides the asymptotic $\chi^2$ statistic that tests $H_0: \beta_1 = \beta_2 = \ldots = \beta_C$. In many instances, we are interested in studying the trend in effect with respect to copy number status (e.g., additive model). This can be done by generalizing equation (11) in the form
\[
\psi_{ic} = \sum_{m=1}^{M} D_{icm} \zeta_{cm}, \qquad (18)
\]
where D is an I × M design matrix, and $\zeta$ is a vector of dimension M having the model parameters. M is the total number of variables included in the model, including copy number status and confounding variables (e.g., M = C + K). For example, for a trend test on copy number status without covariates, D would have the form
\[
D = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\ 0 & \cdots & 0 & 1 & \cdots & 1 & \cdots & C-1 & \cdots & C-1 \end{pmatrix}^{\top}
\]
and the trend hypothesis on copy number status is tested using a likelihood ratio test, comparing this model with the null model. Notice that this formulation allows us to accommodate different or common effects for each latent class. In this case, parameter estimates are obtained as shown above. Formulae for the derivatives obtained in the score and Hessian, where coefficients are not shared by each latent class, are shown in the Appendix. R language functions for the methods discussed in this paper are freely available at http://www.creal.cat/jrgonzalez/software.htm [25]
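For the trend test described above, the design matrix and the likelihood ratio statistic can be sketched in a few lines of Python. This is an illustration only: the helper names are made up for the example, the log-likelihood values are invented, and the p-value formula is restricted to one degree of freedom, which is the trend case.

```python
import math


def trend_design(copy_classes):
    """Rows (1, c - 1) for classes labelled 1..C: an intercept and a linear trend score."""
    return [[1, c - 1] for c in copy_classes]


def lrt_pvalue_1df(loglik_null, loglik_alt):
    """Likelihood ratio test with 1 degree of freedom.

    The statistic is twice the increase in log-likelihood. For a chi-square
    variable with 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    print(trend_design([1, 1, 2, 3]))
    # Invented log-likelihoods for a null model and a trend model.
    print(lrt_pvalue_1df(-412.7, -408.9))
```

Tests with more than one degree of freedom, such as the global test of equal class effects, need the general chi-square survival function rather than this one-degree-of-freedom shortcut.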
Results
Simulation study
We performed computer simulation studies to empirically examine the properties of the parameter estimators
developed in the previous sections. The specific goals of
these studies were: (i) to evaluate the performance of the
proposed likelihood ratio trend test based on the latent
class model for a number of CNV measurement
distributions; (ii) to examine the effect of sample size
(I) on the distributional properties of the estimators;
(iii) to examine the bias and mean square error (MSE) of
the estimators; (iv) to assess the accuracy of the variance and parameter estimates obtained using the observed information matrix. Simulations were performed as follows: To study (i), we simulated a binary trait using 300 cases and 300 controls. The unobserved copy number statuses (e.g. latent classes) were simulated depending on 3 different copy number statuses (C = 3), with the proportion of individuals in each category set as $\pi = (0.5, 0.4, 0.1)$. The trend OR was set equal to 1.5. The observed signal intensity ratio (X variable) was simulated as a finite mixture of C normal distributions using different means, $\eta$, and variances, $\sigma^2$, to assess whether the separation of clusters and their variance affects power.
To study (ii)-(iv) we simulated binary and quantitative traits. For the binary trait, simulation was performed as above but simulating various scenarios of sample size (I), OR and proportion of individuals with each copy number status, $\pi$. Again, we simulated different CNV distributions by varying $\eta$ and $\sigma^2$. For quantitative traits, we used the same simulation procedure but copy number status was simulated depending on a fixed mean trait level for the reference copy number status and a desired mean difference with respect to other copy number statuses. Next, we describe the settings for the different simulation parameters. Sample size: We chose the values of I: $I \in \{50, 300\}$. Although current studies are analyzing thousands of individuals, these values were chosen to evaluate the performance of our proposed method in moderately large samples. Copy number status: Since we were interested in evaluating the performance of the parameter estimates, we only simulated two different copy number statuses C = {1, 2}. Odds ratio: To assess the impact of the strength of association between the disease and CNV, we chose two values for OR: $OR \in \{1.3, 2\}$ in order to consider a moderate association and a strong one. Proportion of cases with normal copy number status: To evaluate the impact of classes with different number of individuals we set $\pi \in \{(0.8, 0.2), (0.5, 0.5)\}$. Finite mixture: To assess the impact of the distribution of intensity ratio, X, we simulated two normal distributions with the following parameters: $\eta = (1, 1.5)$, which correspond to having 2 (considered as normal copy number status) and 3 copies, respectively, and $\sigma \in \{(0.15, 0.15), (0.15, 0.2), (0.2, 0.2)\}$. In this case, these scenarios also helped us to model different situations regarding misclassification or how latent classes were separated.
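One scenario of this design can be mimicked with a short Python sketch. It is not the authors' simulation code: disease status is drawn prospectively from a logistic model with an assumed baseline risk, which is a simplification, and the function name, seed and default values are hypothetical.

```python
import math
import random


def simulate(n, pi_two_copies, odds_ratio, means=(1.0, 1.5), sds=(0.15, 0.15),
             baseline_risk=0.5, seed=2009):
    """Simulate (latent class, intensity ratio, disease status) for n individuals.

    Class 0 stands for 2 copies (normal status) and class 1 for 3 copies.
    baseline_risk is the assumed disease probability in class 0.
    """
    rng = random.Random(seed)
    beta0 = math.log(baseline_risk / (1.0 - baseline_risk))
    beta1 = math.log(odds_ratio)
    records = []
    for _ in range(n):
        cls = 0 if rng.random() < pi_two_copies else 1
        x = rng.gauss(means[cls], sds[cls])  # observed intensity ratio
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * cls)))
        y = 1 if rng.random() < p else 0  # disease status
        records.append((cls, x, y))
    return records


if __name__ == "__main__":
    data = simulate(300, 0.8, 2.0)
    print(sum(1 for cls, _, _ in data if cls == 1), "individuals with 3 copies")
```

Only the intensity ratio and the disease status would be visible to an analyst; the latent class is kept in the output so that calling methods can be scored against the truth.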
We compared three different approaches. The first (NAIVE) was based on assessing association between disease and copy number status obtained using MAP from the finite mixture model (2). That is, association was assessed using a $\chi^2$ test from Table 1. The second is the approach that has been used predominantly to date when analyzing this kind of data and is based on assigning CNV status using predefined thresholds (THRES). Association is then assessed using a $\chi^2$ test. As mentioned previously, we simulated data from a mixture of two normal distributions with means of 1 and 1.5. This is equivalent to simulating individuals with 2 and 3 copies, respectively. In this situation, it is considered that individuals with intensity (or intensity-ratio) greater than 1.33 correspond to individuals with 3 copies [10]. The third method is the one proposed in this paper, based on latent class (LC) using a $\chi^2$ test. In order to make the results comparable, the performance of LC based on likelihood ratio trend test was compared with that of the two other methods using a $\chi^2$ trend test (e.g. 1 degree of freedom). To evaluate bias and MSE of parameter estimates, $\chi^2$ of association was used for all three methods.
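The amount of misclassification that the threshold rule introduces in these scenarios can be worked out directly from the two normal distributions. The Python sketch below does this for the 1.33 cut-off; the function name is an assumption of this example, and the calculation only describes the calling step, not the association test.

```python
from statistics import NormalDist


def threshold_error_rates(threshold=1.33, means=(1.0, 1.5), sds=(0.15, 0.15)):
    """Theoretical misclassification rates of a fixed intensity threshold.

    Returns (false_gain, missed_gain): the probability that an individual with
    2 copies is called as 3 copies, and that one with 3 copies is called as 2.
    """
    two_copies = NormalDist(means[0], sds[0])
    three_copies = NormalDist(means[1], sds[1])
    false_gain = 1.0 - two_copies.cdf(threshold)
    missed_gain = three_copies.cdf(threshold)
    return false_gain, missed_gain


if __name__ == "__main__":
    for sds in ((0.15, 0.15), (0.15, 0.2), (0.2, 0.2)):
        print(sds, threshold_error_rates(sds=sds))
```

With a standard deviation of 0.15 in both classes, roughly 1% of two-copy individuals and about 13% of three-copy individuals fall on the wrong side of the cut-off, and both rates grow as the variances increase.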
Simulation results for evaluating the performance of the
likelihood ratio trend test in our proposed model are
shown in Figure 2. The top figures show the power for all
methods analyzed under two scenarios (other scenarios
are given in Additional file 1).
The left panel shows the power for each method, varying the CNV measurement distribution with regard to the mean of each latent class, $\eta$, while the right panel gives the same information but with fixed means and varying variances, $\sigma^2$. Figure 2 also depicts the distribution of CNV signal intensities for various scenarios. We observe that our proposed latent class model performs better in all cases, even when the distributions of copy number status are not very well separated (e.g. more uncertainty).
Simulation results to evaluate parameter estimates for
discrete traits are presented in Table 2 and in Table S1 and
Figures S3 and S4 (see Additional file 1). Similar results and
conclusions are obtained for a quantitative trait. Table 2 and
Figures S3 and S4 (see Additional file 1) summarize the OR
obtained by comparing individuals with 3 copies to those
with 2 copies (reference category) and give the MSE for two
different sample sizes, I, two different proportions of
individuals with 2 copies, $\pi$, and two different variances
for each component of the mixture, $\sigma$. Table S1 (see
Additional file 1) compares different methods to compute
Figure 2
Empirical power for simulation studies. Empirical power for the three different approaches analyzed, varying the quality
of clustering for underlying copy number status. Left panel is for fixed variance and varying means, while the right panel is for
fixed mean and varying variances.
Table 2: Simulation study

                                 Estimated e^β                    Mean Square Error (×10³)
I     π     e^β   σ              SIM    NAIVE   THRES   LC        NAIVE   THRES   LC
50    0.8   1.3   (0.15,0.15)    1.23   1.17    1.15    1.20      57      87      42
50    0.8   1.3   (0.2,0.2)      1.24   1.14    1.09    1.21      107     131     114
50    0.8   1.3   (0.15,0.2)     1.28   1.18    1.15    1.24      134     148     112
50    0.8   2     (0.15,0.15)    1.60   1.40    1.28    1.48      54      85      44
50    0.8   2     (0.2,0.2)      1.82   1.36    1.29    1.52      152     158     126
50    0.8   2     (0.15,0.2)     1.89   1.42    1.33    1.57      180     253     162
50    0.5   1.3   (0.15,0.15)    1.26   1.24    1.21    1.26      39      51      32
50    0.5   1.3   (0.2,0.2)      1.32   1.28    1.25    1.35      82      79      97
50    0.5   1.3   (0.15,0.2)     1.26   1.23    1.20    1.26      66      72      60
50    0.5   2     (0.15,0.15)    2.04   1.94    1.83    2.05      40      67      34
50    0.5   2     (0.2,0.2)      2.04   1.76    1.68    2.05      107     128     92
50    0.5   2     (0.15,0.2)     2.06   1.78    1.72    1.99      87      107     71
300   0.8   1.3   (0.15,0.15)    1.30   1.25    1.18    1.30      13      32      10
300   0.8   1.3   (0.2,0.2)      1.32   1.25    1.15    1.34      27      50      29
300   0.8   1.3   (0.15,0.2)     1.30   1.22    1.16    1.29      24      42      21
300   0.8   2     (0.15,0.15)    2.01   1.87    1.49    2.01      21      120     13
300   0.8   2     (0.2,0.2)      2.03   1.70    1.36    1.99      69      203     43
300   0.8   2     (0.15,0.2)     2.03   1.62    1.38    1.86      78      189     38
300   0.5   1.3   (0.15,0.15)    1.31   1.27    1.26    1.30      7       9       5
300   0.5   1.3   (0.2,0.2)      1.30   1.23    1.22    1.30      15      17      12
300   0.5   1.3   (0.15,0.2)     1.30   1.24    1.23    1.29      12      14      9
300   0.5   2     (0.15,0.15)    2.00   1.87    1.77    2.00      11      23      5
300   0.5   2     (0.2,0.2)      2.00   1.72    1.66    2.02      36      51      15
300   0.5   2     (0.15,0.2)     2.00   1.76    1.71    1.97      26      37      10

Odds ratio (e^β) and mean square error obtained in 1,000 simulations using the three different approaches, NAIVE, THRES and LC (see text for a description of each). Results are given for different scenarios, varying the number of individuals (I), the proportion of individuals with each copy number status (π), the odds ratio (e^β), and the variance for CNV quantitative measurements.
the standard error of the ORs for the various scenarios

described above. The results compare asymptotic variance
based on an observed information matrix (ASYM) with
respect to empirical variance (EMP). Supplementary Table
S1 also shows coverage and power of confidence intervals
based on the three methods analyzed. As expected, when the
sample size increased, the performance of the estimators of
the finite-dimensional parameters improved (Table 2). In
all cases, the LC method performs better than the others. LC
has less bias than NAIVE and THRES in all cases, and also
shows better MSE.
Regarding variance estimates, the estimation based on ASYM
showed good performance in all scenarios (see Additional
file 1, Table S1). Although ASYM slightly overestimated EMP, the
bias was less pronounced for I = 300, as expected.
Confidence intervals based on the LC method outperform
those obtained by other methods with regard to power.
Application to real data

MLPA example
The first data set used to analyze CNV and disease was
generated and kindly provided by one of the coauthors
of the current work. Although data is still unpublished, it
has been made available in a blinded format for

reproducing our findings using the approach presented
herein, and for other validation studies. Some candidate
genes were identified after performing a whole genome
scan analysis using aCGH, where a pool of controls and
cases were compared. In order to further investigate the
relationship between the disease and the altered genes, a
targeted study including several variants was designed
using the MLPA technique. We obtained signal intensities of MLPA assays for 360 cases and 291 controls.
Figures 3 and 4 show the intensities for cases and
controls for two selected genes. In both cases, we observe
3 latent classes, corresponding to 0, 1, and 2 copies of
the gene. We found that the finite mixture model fits
very well ($\chi^2$ goodness-of-fit test, P = 0.6615 and P =
0.4888). The main difference between these two cases is
that copy number status for gene 1 can be established
using a threshold method, while for the second gene this
classification seems more arbitrary. As a consequence,
misclassification should be taken into account when
analyzing gene 2. Table 3 shows the classification of
individuals as having 0, 1, 2 copies, estimated using
equation (2) and the true copy number obtained by
breakpoint cloning and assessing allele presence by PCR,
which unequivocally reports the exact number of copies.
Table 3: Contingency table of estimated and true copy number status for the two genes examined in the real data example

                            True copy number status
Estimated copy number       0       1       2
Gene 1
  0                         426     0       0
  1                         0       201     0
  2                         0       0       24
Gene 2
  0                         85      0       0
  1                         5       287     0
  2                         0       73      204
Figure 3
Association between Gene 1 and disease. Graphical
representation of peak intensities (CNV quantitative
measurement) of individuals for Gene 1 analyzed in the
example. The various colors indicate copy number status
inferred using our proposed finite mixture model.
model and computing the ORs using a naïve approach (e.g. assuming that there is no misclassification) and the LC model that accounts for misclassification. As we can see, the results are the same for gene 1, since no misclassification is observed (see Figure 3 and Table 3). However, for gene 2, copy number status could not be determined as easily as for gene 1. Thus, we observe a different OR estimation and, more importantly, a different P-value for association. For instance, the order of magnitude of the association between the disease and gene 2 is better captured by the LC model than by the NAIVE approach. Regarding the OR estimates, the analysis using the true copy number status shows that individuals with one copy of gene 2 have a 63% decrease in disease risk with respect to individuals with 0 copies. As the 95%CI shows, this difference is statistically significant. We arrive at the same conclusion when we compare individuals with 2 copies with respect to those with 0 copies. Note that in both cases we observe that the naïve approach underestimates the OR, as shown by the simulation study.
Figure 4
Association between Gene 2 and disease. Graphical
representation of peak intensities (CNV quantitative
measurement) of individuals for Gene 2 analyzed in the
example. The various colors indicate copy number status
inferred using our proposed finite mixture model.
From the table, we can see that the finite mixture model
gives a perfect classification for gene 1 and some
misclassification for gene 2. Goodness-of-fit test revealed
that the proposed mixture model to determine CNV
status was appropriate (p = 0.6615 and p = 0.1586).
Table 4 shows the ORs and their 95%CI for the two
genes analyzed. The first three columns show the results
obtained in the laboratory using PCR, while the other
columns show the results obtained after estimating the
copy number status using our proposed finite mixture
aCGH example
The analysis of aCGH data requires additional steps to
take into account the dependency across probes. Table 5
shows four steps we recommend for the analysis of this
kind of data. First, MAP should be obtained with an
algorithm that considers probe correlation. We use, in
particular, the CGHcall R program which includes a
mixture model to infer CNV status [18]. Second, we
build blocks/regions of consecutive clones with similar
signatures. To perform this step the CGHregions R
library was used [26]. Third, the association between
the CNV status of blocks and the trait is assessed by
incorporating the uncertainty probabilities in the LC
model. And fourth, corrections for multiple comparisons
must be performed. We use the Benjamini-Hochberg
(BH) correction [27]. This is a heuristic method that is
robust against positive dependence and increasingly
conservative as correlation increases [28].
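For the multiple-testing step, Benjamini-Hochberg adjusted p-values can be computed as in the Python sketch below. This is a generic implementation of the standard step-up procedure written for illustration, with an invented list of p-values; it is not the code used in the analysis reported here.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The i-th smallest p-value is multiplied by m / i, and the results are then
    made monotone by taking running minima from the largest p-value down.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Invented p-values for four CNV blocks.
    block_pvalues = [0.01, 0.04, 0.03, 0.005]
    adjusted = benjamini_hochberg(block_pvalues)
    print([i for i, p in enumerate(adjusted) if p <= 0.05])
```

A block is declared associated at a 5% false discovery rate when its adjusted p-value is at most 0.05.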
Table 4: Association analysis of disease status and copy number category using the true copy number status and the estimated status obtained using the finite mixture proposed

                  True CN                             Estimated CN
                  Co     Ca     OR (CI95%)            Co     Ca     ORnaïve (CI95%)     ORLC (CI95%)
Gene 1
  0               210    216    1                     210    216    1                   1
  1               75     126    1.63 (1.16,2.30)      75     126    1.63 (1.16,2.30)    1.63 (1.16,2.30)
  2               6      18     2.92 (1.14,7.49)      6      18     2.92 (1.14,7.49)    2.92 (1.14,7.50)
  P association                 0.0027                              0.0027              0.0023
  P trend                       5.0 × 10^-4                         5.0 × 10^-4         5.0 × 10^-4
Gene 2
  0               24     66     1                     22     63     1                   1
  1               159    201    0.46 (0.27,0.77)      129    178    0.44 (0.26,0.75)    0.47 (0.27,0.82)
  2               108    93     0.31 (0.18,0.54)      140    119    0.33 (0.19,0.57)    0.31 (0.18,0.54)
  P association                 7.2 × 10^-5                         2.3 × 10^-4         8.4 × 10^-5
  P trend                       2.1 × 10^-5                         1.0 × 10^-4         2.1 × 10^-5
Table 5: Steps used to assess association between CNVs and traits when aCGH is used
Step 1. Use any aCGH calling procedure that provides MAP (uncertainty)
Step 2. Build blocks/regions of consecutive probes with similar signatures
Step 3. Use the signature that occurs most in a block to perform association using LC model
Step 4. Correct for multiple testing considering dependency among signatures
We applied the methodology to the breast cancer data studied by Neve et al. [29], which is freely available from the Bioconductor website http://www.bioconductor.org/ [30]. The data consists of CGH arrays of 1 MB resolution [31]. The authors chose the 50 samples that could be matched to the name tokens of caArrayDB data (June 9th 2007).
In this example the association between estrogen receptor positivity (dichotomous variable; 0: negative, 1: positive) and CNVs was tested. We contrasted the association as given by the LC and the NAIVE models. The original data set contained 2621 probes which were reduced to 459 blocks after the application of CGHcall and CGHregions functions. Table 6 shows the number of CNV blocks associated with estrogen receptor positivity for different
Table 6: Number of CNV blocks (out of 459) associated with estrogen receptor positivity from 50 aCGH breast cancer cell lines

Significance level      10^-6    10^-5    10^-4    10^-3    10^-2
Latent class model      1        4        27       64       117
Chi-square test         0        2        10       41       93

Results are given for different levels of association and comparing our proposed model with the naïve approach that does not consider uncertainty.
significance levels. We observe that incorporating classification uncertainty with the LC model substantially increased the level of association, as compared to the NAIVE approach. The number of positive associations at the 5% significance level after applying the BH correction was 49 and 24 for the LC and NAIVE approaches, respectively.
Discussion
In this paper we have shown that assessing the
association between CNVs and disease with analysis
methods that do not take into account uncertainty when
inferring copy number status leads to larger p-values and
underestimates the model parameters. This works against the
need to increase statistical power, which is already reduced by
the multiple comparison correction for the simultaneous
testing of several loci. False positives are typically
controlled by a dramatic reduction in the nominal
p-value, such that very low values are required to reach
statistical significance. Thus, a precise computation of
these values is essential in genetic association studies.
Here we have proposed a latent class model (LC) that
accounts for the uncertainty of assessing CNV status and
also accommodates potential confounding factors. In the
case of analyzing quantitative traits, we also provide
formulae to further propagate call uncertainty, as other
authors have proposed in another context [32]. In
analyzing quantitative traits, we have assumed that the
response variable follows a normal distribution, although
this assumption does not hold in some instances. In this
situation, one possibility is to analyze the log-transformed
variable, although log transformation may not be
sufficient. The model could easily be extended to fit a
response variable that has any exponential family
distribution (e.g. normal, gamma, Poisson). However,
we have not yet implemented this option in the functions
reported here. The extension of our proposed latent-class
model to assess survival time, possibly with right-censored
data, is not trivial but could be a very interesting
avenue for future investigation. The parameter estimation
procedure proposed here allows the estimation of
confidence intervals. The LC model was remarkably
consistent with simulated data. In particular, we found
that the p-values obtained with the LC model were more
similar to the expected values than those obtained by the
threshold and naive methods.
We maximize the likelihood function, assuming fixed
weights for each copy number status, which accounts for
possible misclassification. The main advantage of considering
weights as known constants is that the Newton-Raphson
procedure is much simpler and faster, and that the Hessian
matrix can be obtained analytically. We
confirmed that the proposed model captures very well
the nature of the synthetic data and variance estimates.
Interestingly, we observed that the variance estimates
using MLE were also reproduced when a bootstrap
procedure was used (see Additional file 1, Table S2). In
the interest of generalization, one can consider maximizing the likelihood function for both model parameters and weights. In that case, an EM algorithm
should be used instead. However, one should bear in
mind that EM does not allow for estimation of the
variance of the model parameters and is computationally
expensive, which may be particularly costly if this
method is used in whole genome scan settings.
Conclusion
We have shown that the LC model can incorporate
uncertainty of CNV calling in the analysis. We have also
illustrated how to analyze quantitative traits as well as how
to accommodate confounding variables. This is of particular
importance in complex disease studies where other clinical
or biochemical factors need to be taken into account. The
formulation can also be generalized to assess survival times
or counts in longitudinal studies. The model has shown
good performance when analyzing both targeted (MLPA
data) and whole genome (aCGH data) studies.
Authors' contributions
JRG and IS developed the new statistical methods. JRG
wrote the R functions and the main text of the manuscript and performed the simulation studies. GE and AC
made abundant suggestions for developing the models.
SP worked on the Gaussian mixture approach to model
quantitative CNV measurements. XE reviewed the paper
and revised its framework. LA and JRG proposed the
need for a statistical tool to measure the biological
differences in allele distribution in cohorts of cases and
controls, and conceived the study. All authors have read
and approved the final manuscript.
Appendix
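To obtain parameter estimates, we maximize the log-likelihood function

\[
\log P(Y \mid \theta) = \sum_{i=1}^{N} \log \sum_{c=1}^{C} w_{ic}\, P(y_i \mid C_i = c, Z, \theta),
\]

where $P(y_i \mid C_i = c, Z, \theta)$ is given by equations (9) and
(10) for discrete and quantitative traits, respectively. As
previously mentioned, the k-th component of the score,
S, is given by

\[
S_k(y \mid C, \theta) = \frac{\partial \log P(Y \mid \theta)}{\partial \theta_k}
= \sum_{i=1}^{N} \frac{\sum_{c=1}^{C} \partial h_{ic} / \partial \theta_k}{\sum_{c=1}^{C} h_{ic}}.
\]

The (k, k')-th element of the Hessian, H, is

\[
H_{kk'}(\theta) = \frac{\partial^2 \log P(Y \mid \theta)}{\partial \theta_k\, \partial \theta_{k'}}
= \sum_{i=1}^{N}
\frac{\left(\sum_{c=1}^{C} \frac{\partial^2 h_{ic}}{\partial \theta_k\, \partial \theta_{k'}}\right) \sum_{c=1}^{C} h_{ic}
- \left(\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_k}\right) \left(\sum_{c=1}^{C} \frac{\partial h_{ic}}{\partial \theta_{k'}}\right)}
{\left(\sum_{c=1}^{C} h_{ic}\right)^2},
\]

where

\[
h_{ic} \equiv w_{ic}\, P(y_i \mid C_i = c, Z, \theta).
\]

Herein we provide formulae for the derivatives of $h_{ic}$ for all
cases discussed in this paper. Although the following
expressions may appear complicated, they are straightforward
to program and are included in the R functions
available at http://www.creal.cat/jrgonzalez/software.htm.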
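The way the score and Hessian are assembled from the $h_{ic}$ terms can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R implementation: the function name `loglik_score_hessian`, the nested-list layout of the inputs and the numbers in the usage example are all hypothetical.

```python
import math

def loglik_score_hessian(h, dh, d2h):
    """Assemble log-likelihood, score and Hessian from the h_ic terms.

    h[i][c]         : h_ic
    dh[i][c][k]     : d h_ic / d theta_k
    d2h[i][c][k][l] : d^2 h_ic / d theta_k d theta_l
    """
    n_par = len(dh[0][0])
    loglik = 0.0
    score = [0.0] * n_par
    hessian = [[0.0] * n_par for _ in range(n_par)]
    for h_i, dh_i, d2h_i in zip(h, dh, d2h):
        total = sum(h_i)  # sum over classes of h_ic
        grad = [sum(d[k] for d in dh_i) for k in range(n_par)]
        loglik += math.log(total)
        for k in range(n_par):
            score[k] += grad[k] / total
            for l in range(n_par):
                curv = sum(d2[k][l] for d2 in d2h_i)
                # Quotient rule applied to (sum_c dh_ic) / (sum_c h_ic).
                hessian[k][l] += (curv * total - grad[k] * grad[l]) / total ** 2
    return loglik, score, hessian

# Hypothetical values: one individual, two latent classes, one parameter.
h = [[0.2, 0.3]]
dh = [[[0.1], [-0.05]]]
d2h = [[[[0.0]], [[0.02]]]]
loglik, score, hessian = loglik_score_hessian(h, dh, d2h)
```

Because the weights enter only through the precomputed $h_{ic}$ values, the same routine serves every trait type; only the functions that fill `h`, `dh` and `d2h` change.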
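Binary Traits
Binary Traits without covariates
In this case, the $h_{ic}$ function takes the form

\[
h_{ic} = w_{ic}\, \frac{e^{y_i \beta_c}}{1 + e^{\beta_c}}.
\]

Therefore,

\[
\frac{\partial h_{ic}}{\partial \beta_k}
= \frac{w_{ic}\, I\{k=c\}\, y_i e^{y_i \beta_k} (1 + e^{\beta_k}) - w_{ic}\, I\{k=c\}\, e^{y_i \beta_k} e^{\beta_k}}{(1 + e^{\beta_k})^2}
= I\{k=c\}\, h_{ic}\, (y_i - p_{ic}),
\]

where

\[
p_{ic} = \frac{1}{1 + e^{-\beta_c}},
\]

and

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I\{k=c\} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right],
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \text{ for } k \neq k'.
\]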
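The binary-trait derivatives above can be checked numerically. The sketch below is an illustration only: it takes $p_{ic} = 1/(1 + e^{-\beta_c})$, which is the form that makes the first-derivative identity hold, and compares the analytic expressions with central finite differences at arbitrary, hypothetical parameter values.

```python
import math

def h_binary(y, beta_c, w):
    """h_ic = w_ic * exp(y_i * beta_c) / (1 + exp(beta_c)) for y_i in {0, 1}."""
    return w * math.exp(y * beta_c) / (1.0 + math.exp(beta_c))

def dh_binary(y, beta_c, w):
    """Analytic first derivative with respect to beta_c: h_ic * (y_i - p_ic)."""
    p = 1.0 / (1.0 + math.exp(-beta_c))
    return h_binary(y, beta_c, w) * (y - p)

def d2h_binary(y, beta_c, w):
    """Analytic second derivative: dh_ic * (y_i - p_ic) - h_ic * (p_ic - p_ic^2)."""
    p = 1.0 / (1.0 + math.exp(-beta_c))
    return dh_binary(y, beta_c, w) * (y - p) - h_binary(y, beta_c, w) * (p - p * p)

# Central finite differences at hypothetical values of y, beta_c and w_ic.
y, beta, w, eps = 1, 0.4, 0.7, 1e-5
numeric_first = (h_binary(y, beta + eps, w) - h_binary(y, beta - eps, w)) / (2 * eps)
numeric_second = (dh_binary(y, beta + eps, w) - dh_binary(y, beta - eps, w)) / (2 * eps)
```

Agreement between `numeric_first` and `dh_binary`, and between `numeric_second` and `d2h_binary`, is a cheap safeguard before plugging the analytic derivatives into a Newton-Raphson update.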
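Binary Traits with covariates

\[
h_{ic} = w_{ic}\, \frac{e^{y_i \eta_{ic}}}{1 + e^{\eta_{ic}}},
\quad \text{where } \eta_{ic} = \beta_c + \sum_{k=1}^{K} \gamma_k z_{ik}.
\]

Therefore,

\[
\frac{\partial h_{ic}}{\partial \beta_k}
= \frac{w_{ic}\, I\{k=c\}\, y_i e^{y_i \eta_{ic}} (1 + e^{\eta_{ic}}) - w_{ic}\, I\{k=c\}\, e^{y_i \eta_{ic}} e^{\eta_{ic}}}{(1 + e^{\eta_{ic}})^2}
= I\{k=c\}\, h_{ic}\, (y_i - p_{ic}),
\]

where

\[
p_{ic} = \frac{1}{1 + e^{-\eta_{ic}}},
\]

and

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I\{k=c\} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - p_{ic}) - h_{ic} (p_{ic} - p_{ic}^2) \right],
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \text{ for } k \neq k'.
\]

For covariates:

\[
\frac{\partial h_{ic}}{\partial \gamma_p} = z_{ip}\, h_{ic}\, (y_i - p_{ic}),
\]

\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2}
= z_{ip}\, \frac{\partial h_{ic}}{\partial \gamma_p} (y_i - p_{ic}) - z_{ip}^2\, h_{ic} (p_{ic} - p_{ic}^2),
\]

\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}}
= z_{ip}\, \frac{\partial h_{ic}}{\partial \gamma_{p'}} (y_i - p_{ic}) - z_{ip} z_{ip'}\, h_{ic} (p_{ic} - p_{ic}^2).
\]

Quantitative traits
Quantitative traits without covariates and shared variance

\[
h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}.
\]

Therefore,

\[
\frac{\partial h_{ic}}{\partial \beta_k}
= I\{k=c\}\, w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_k)^2}{2\sigma^2}}\, \frac{y_i - \beta_k}{\sigma^2}
= I\{k=c\}\, h_{ic}\, \frac{y_i - \beta_k}{\sigma^2},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I\{k=c\}\, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \beta_k) - h_{ic} \right],
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \text{ for } k \neq k',
\]

\[
\frac{\partial h_{ic}}{\partial \sigma}
= w_{ic} \left[ -\frac{1}{\sigma^2}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}
+ \frac{1}{\sigma}\, e^{-\frac{(y_i - \beta_c)^2}{2\sigma^2}}\, \frac{(y_i - \beta_c)^2}{\sigma^3} \right]
= -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \beta_c)^2,
\]

\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\sigma\, \frac{\partial h_{ic}}{\partial \sigma} - h_{ic}}{\sigma^2}
+ (y_i - \beta_c)^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= I\{k=c\} \left[ \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right] (y_i - \beta_c).
\]

Quantitative traits with covariates and shared variance

\[
h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-\frac{(y_i - \eta_{ic})^2}{2\sigma^2}},
\quad \text{where } \eta_{ic} = \beta_c + \sum_{p=1}^{K} \gamma_p z_{ip}.
\]

Therefore,

\[
\frac{\partial h_{ic}}{\partial \beta_k} = I\{k=c\}\, h_{ic}\, \frac{y_i - \eta_{ic}}{\sigma^2},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k^2}
= I\{k=c\}\, \frac{1}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right],
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}} = 0 \text{ for } k \neq k',
\]

\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2,
\]

\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\sigma\, \frac{\partial h_{ic}}{\partial \sigma} - h_{ic}}{\sigma^2}
+ (y_i - \eta_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= I\{k=c\} \left[ \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right] (y_i - \eta_{ic}),
\]

\[
\frac{\partial h_{ic}}{\partial \gamma_p} = \frac{h_{ic}}{\sigma^2} (y_i - \eta_{ic})\, z_{ip},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \sigma}
= \left[ \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right] (y_i - \eta_{ic})\, z_{ip},
\]

\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \beta_k}
= I\{k=c\}\, \frac{z_{ip}}{\sigma^2} \left[ \frac{\partial h_{ic}}{\partial \beta_k} (y_i - \eta_{ic}) - h_{ic} \right],
\]

\[
\frac{\partial^2 h_{ic}}{\partial \gamma_p^2}
= \left( \frac{\partial h_{ic}}{\partial \gamma_p} \right)^2 \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip}^2}{\sigma^2},
\quad \text{and} \quad
\frac{\partial^2 h_{ic}}{\partial \gamma_p\, \partial \gamma_{p'}}
= \frac{\partial h_{ic}}{\partial \gamma_p} \frac{\partial h_{ic}}{\partial \gamma_{p'}} \frac{1}{h_{ic}} - h_{ic}\, \frac{z_{ip} z_{ip'}}{\sigma^2}
\text{ for } p \neq p'.
\]

Trend test
In this situation we can write the linear predictor of
equation (18) as

\[
\eta_{ic} = \beta_1 + \beta_2 (c - 1).
\]

In other words, $\beta_1$ plays the role of an intercept and $\beta_2$ is
the slope. In this case, we consider that both $\beta_1$ and $\beta_2$
are shared for each latent class. In this situation, bearing
in mind that $h_{ic} = w_{ic}\, e^{y_i \eta_{ic}} / (1 + e^{\eta_{ic}})$ for the discrete traits, we
have that

\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, (y_i - p_{ic}), \tag{19}
\]

and

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}}
= x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}} (y_i - p_{ic}) - x_{ikc}\, x_{ik'c}\, h_{ic} (p_{ic} - p_{ic}^2). \tag{20}
\]

For quantitative traits, where
$h_{ic} = w_{ic}\, \frac{1}{\sigma}\, e^{-(y_i - \eta_{ic})^2 / (2\sigma^2)}$, we
have that

\[
\frac{\partial h_{ic}}{\partial \beta_k} = h_{ic}\, x_{ikc}\, \frac{y_i - \eta_{ic}}{\sigma^2}, \tag{21}
\]

and

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \beta_{k'}}
= x_{ikc}\, \frac{\partial h_{ic}}{\partial \beta_{k'}}\, \frac{y_i - \eta_{ic}}{\sigma^2} - x_{ikc}\, x_{ik'c}\, \frac{h_{ic}}{\sigma^2}. \tag{22}
\]

For the variance, we have that

\[
\frac{\partial h_{ic}}{\partial \sigma} = -\frac{h_{ic}}{\sigma} + \frac{h_{ic}}{\sigma^3} (y_i - \eta_{ic})^2, \tag{23}
\]

\[
\frac{\partial^2 h_{ic}}{\partial \sigma^2}
= -\frac{\sigma\, \frac{\partial h_{ic}}{\partial \sigma} - h_{ic}}{\sigma^2}
+ (y_i - \eta_{ic})^2\, \frac{\frac{\partial h_{ic}}{\partial \sigma}\, \sigma^3 - 3 h_{ic} \sigma^2}{\sigma^6}, \tag{24}
\]

and

\[
\frac{\partial^2 h_{ic}}{\partial \beta_k\, \partial \sigma}
= x_{ikc} \left[ \frac{1}{\sigma^2} \frac{\partial h_{ic}}{\partial \sigma} - \frac{2 h_{ic}}{\sigma^3} \right] (y_i - \eta_{ic}). \tag{25}
\]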
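The derivatives for quantitative traits can be verified in the same spirit. The following sketch is an assumption-laden illustration rather than the authors' code: it uses the Gaussian form $h_{ic} = w_{ic}\,\sigma^{-1} \exp\{-(y_i - \mu)^2 / (2\sigma^2)\}$ with a generic class mean `mu` (equal to $\beta_c$ without covariates), and all function names and numerical values are hypothetical.

```python
import math

def h_normal(y, mu, sigma, w):
    """h_ic = w_ic * (1 / sigma) * exp(-(y_i - mu)^2 / (2 sigma^2))."""
    return w / sigma * math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2))

def dh_dmu(y, mu, sigma, w):
    """First derivative with respect to the class mean."""
    return h_normal(y, mu, sigma, w) * (y - mu) / sigma ** 2

def dh_dsigma(y, mu, sigma, w):
    """First derivative with respect to sigma."""
    h = h_normal(y, mu, sigma, w)
    return -h / sigma + h * (y - mu) ** 2 / sigma ** 3

def d2h_dsigma2(y, mu, sigma, w):
    """Second derivative with respect to sigma."""
    h = h_normal(y, mu, sigma, w)
    hs = dh_dsigma(y, mu, sigma, w)
    return (-(sigma * hs - h) / sigma ** 2
            + (y - mu) ** 2 * (hs * sigma ** 3 - 3.0 * h * sigma ** 2) / sigma ** 6)

def d2h_dmu_dsigma(y, mu, sigma, w):
    """Mixed derivative with respect to the class mean and sigma."""
    h = h_normal(y, mu, sigma, w)
    hs = dh_dsigma(y, mu, sigma, w)
    return (hs / sigma ** 2 - 2.0 * h / sigma ** 3) * (y - mu)

def central_diff(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Hypothetical observation, class mean, standard deviation and weight.
y, mu, sigma, w = 1.3, 0.8, 0.9, 0.6
num_dmu = central_diff(lambda m: h_normal(y, m, sigma, w), mu)
num_dsigma = central_diff(lambda s: h_normal(y, mu, s, w), sigma)
num_d2sigma = central_diff(lambda s: dh_dsigma(y, mu, s, w), sigma)
num_dmu_dsigma = central_diff(lambda s: dh_dmu(y, mu, s, w), sigma)
```

With covariates or under the trend test, the derivative with respect to a regression coefficient follows from `dh_dmu` by the chain rule, that is, by multiplying by the corresponding covariate value.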
Additional material
Additional file 1
Tables and figures for more scenarios of simulation studies.
[http://www.biomedcentral.com/content/supplementary/14712105-10-172-S1.pdf]
Acknowledgements
The first author would like to thank Xavier Bassagaña for his comments
and helpful conversations about the proposed model. Gavin Lucas is also
acknowledged for his comments on the last version of the manuscript. The
authors also thank one of the reviewers for helpful comments on how to
analyze aCGH data. This work was supported by the
Spanish Ministry for Science and Innovation [MTM2008-02457 to JRG and
SAF2008-00357 to XE]; and the European Commission [AnEUploidy
project; FP6-2005-LifeSciHealth contract #037627].
References
1. Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM and Eichler EE: Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 2006, 79(2):275-290.
2. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW and Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
3. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE and Lam WL: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 2007, 80:91-104.
4. Feuk L, Carson AR and Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85-97.
5. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME and Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848-853.
6. Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W, Clark RA, Anderson SA, O'Connell RJ, Agan BK, Ahuja SS, Bologna R, Sen L, Dolan MJ and Ahuja SK: The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 2005, 307(5714):1434-1440.
7. Rovelet-Lecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T and Campion D: APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet 2006, 38:24-26.
8. Le Marechal C, Masson E, Chen JM, Morel F, Ruszniewski P, Levy P and Ferec C: Hereditary pancreatitis caused by triplication of the trypsinogen locus. Nat Genet 2006, 38(12):1372-1374.
9. Schouten JP, McElgunn CJ, Waaijer R, Zwijnenburg D, Diepvens F and Pals G: Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification. Nucleic Acids Res 2002, 30(12):e57.
10. González J, Carrasco J, Armengol L, Villatoro S, Jover L, Yasui Y and Estivill X: Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA). BMC Bioinformatics 2008, 9:261.
11. Engert S, Wappenschmidt B, Betz B, Kast K, Kutsche M, Hellebrand H, Goecke T, Kiechle M, Niederacher D, Schmutzler R and Meindl A: MLPA screening in the BRCA1 gene from 1,506 German hereditary breast cancer cases: novel deletions, frequent involvement of exon 17, and occurrence in single early-onset cases. Hum Genet 2008, 29(7):948-958.
12. Hansen T, Jonson L, Albrechtsen A, Andersen M, Ejlertsen B and Nielsen F: Large BRCA1 and BRCA2 genomic rearrangements in Danish high risk breast-ovarian cancer families. Breast Cancer Res Treat 2008 in press.
13. Aitman T, Dong R, Vyse T, Norsworthy P, Johnson M, Smith J, Mangion J, Roberton-Lowe C, Marshall A, Petretto M, Hodges E, Bhangal G, Patel S, Sheehan-Rooney K, Duda M, Cook P, Evans D, Domin J, Flint J, Boyle J, Pusey C and Cook H: Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 2006, 439(7078):851-855.
14. Fellermann K, Stange D, Schaeffeler E, Schmalzl H, Wehkamp J, Bevins C, Reinisch W, Teml A, Schwab M, Lichter P, Radlwimmer B and Stange E: A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet 2006, 79(3):439-48.
15. Ionita-Laza I, Rogers AJ, Lange C, Raby BA and Lee C: Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 2009, 93:22-26.
16. Fraley C and Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 1998, 41:578-588.
17. Picard F, Robin S, Lebarbier E and Daudin JJ: A segmentation/clustering model for the analysis of array CGH data. Biometrics 2007, 63(3):758-766.
18. Wiel van de MA, Kim KI, Vosse SJ, van Wieringen WN, Wilting SM and Ylstra B: CGHcall: calling aberrations for array CGH tumor profiles. Bioinformatics 2007, 23(7):892-894.
19. Leisch F: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 2004, 11(8):1-18.
20. Du J: Combined Algorithms for Fitting Finite Mixture Distributions. PhD thesis McMaster University, Ontario, Canada; 2002.
21. Bashir S and Duffy S: The correction of risk estimates for measurement error. Ann Epidem 1993, 7:156-164.
22. Davidov O, Faraggi D and Reiser B: Misclassification in logistic regression with discrete covariates. Biometrical Journal 2003, 5:541-553.
23. Greenland S: Basic methods for sensitivity analysis of biases. Int J Epi 1996, 25:1107-1115.
24. Spiegelman D, Rosner B and Logan R: Estimation and inference for logistic regression with covariate misclassification and measurement error, in main study/validation study designs. J Am Stat Assoc 2000, 95:51-61.
25. CREAL's web-page. http://www.creal.cat/jrgonzalez/software.htm.
26. Wiel van de M and van Wieringen W: CGHregions: dimension reduction for array CGH data with minimal information loss. Cancer Informatics 2007, 2:55-63.
27. Benjamini Y and Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 57:289-300.
28. Sarkar S: False discovery and false nondiscovery rates in single-step multiple testing procedures. The Annals of Statistics 2006, 34:394-415.
29. Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F, Speed T, Spellman PT, DeVries S, Lapuk A, Wang NJ, Kuo WL, Stilwell JL, Pinkel D, Albertson DG, Waldman FM, McCormick F, Dickson RB, Johnson MD, Lippman M, Ethier S, Gazdar A and Gray JW: A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 2006, 10(6):515-527.
30. Bioconductor's web-page. http://www.bioconductor.org/.
31. M Neve et al in Gray Lab at LBL: Neve2006: expression and CGH data on breast cancer cell lines. [R package version 0.1.6].
32. van Wieringen WN and Wiel van de MA: Nonparametric testing for DNA copy number induced differential mRNA gene expression. Biometrics 2009, 65:19-29.


4.1. First phase: HapMap Project and data pre-processing

CONCLUSIONS AND FUTURE STUDIES

specifically, in the future, with the diseases mentioned above.

Different methods are proposed to obtain the classification of the

Clustering (partitioning and hierarchical):

mclust, hclust, etc.

Latent class models (mixture fitting).

We investigate which methods these R functions use and how they are

implemented (this is important given the size of the database).

The methods are then implemented with the proposed data.

Once the copy number of each gene, or genetic probe, has been inferred.

A k-medoids analysis is performed to see whether or not these CNVs allow

A multivariate analysis is then performed with the classified and

unclassified data, using the a priori information on the membership of

The kappa index of agreement and a contingency table are

computed for both classifications.

The results of each method are evaluated.

The relationship between copy number variation

(CNVs) and the discrimination of populations is established.

In the future this could be extrapolated to identify the genes

responsible for certain diseases (case-control studies), and thus detect the

company. Likewise, the activities, laboratories and facilities are at the disposal

Fig. 1.1 PRBB headquarters building

1.2. Centres that make up the PRBB

1.2.1. Centre for Research in Environmental Epidemiology (CREAL)

Fig. 1.2 Hospital del Mar

The scientific output generated as a result of this research includes close to

1.2.4. Department of Experimental and Health Sciences of the

The basic mechanisms of early development and organogenesis.

Application of cell lines derived from stem cells to

diseases (regenerative medicine) in which there is a loss of cells

1.2.7. Institute of High Technology (IAT)

2.1. Genetic notions and concepts:

2.2. DNA (deoxyribonucleic acid)

Fig. 2.4 Structure of DNA

Dominant alleles: those that appear in the individual whether

heterozygous (having chromosomes whose alleles carry different information,

Recessive alleles: those that are masked in the phenotype of a

heterozygous individual and appear only in the homozygote, being homozygous

2.5. Genetic polymorphism:

stably in a population, and to be

considered a genetic polymorphism and not a mutation, it must present

RFLP (restriction-fragment-length polymorphisms): polymorphisms in the

length of restriction fragments.

SNPs (Single Nucleotide Polymorphism): polymorphism of a single

viruses, chemicals, drugs, etc.

Fig. 2.5 Single nucleotide polymorphism

diseases. It was previously thought that SNPs in the DNA were the variation most

A mapping of CNVs is currently under way that is expected to transform the

A surprising finding was that approximately 12% of the genome

60,000 bases; around 100 CNVs were detected in

each genome, with an average size of 250,000 bases.

Fig. 2.6 Types of structural variation

a) Nominal variables: they assign names to the different forms that may