Multiple Comparisons and Evaluating Significance
- In 1978 Restriction Fragment Length Polymorphisms (RFLPs) were used for linkage analysis.
- In 1987 the first human genetic map was created.
- In 1989 microsatellite markers made genome-wide linkage studies possible.
- 1990-2003 the Human Genome Project sequenced the human genome.
- 2002-2006 HapMap project collected sequences in populations to discover variation across the genome.
- 2006 onward, Genome-Wide Association Studies (GWAS)
- 2010 onward, large scale custom arrays
- 2010 onward, sequencing technology becomes affordable
- Even more WGS projects...
- ADSP 2012
- TOPMed 2014
- CCDG 2014
Prior to the GWAS era, genetic association studies were hypothesis driven: markers within or near a gene or region were tested for association. Hypothesis: "The trait X is caused/influenced by Gene A." The hypothesis (gene or genes) came from:
- Experiments in other species
- Known associations with a related trait in humans
- Linkage analysis localizing trait to a specific chromosomal region
Chip-based Genome-wide Association Scans
- Hypothesis generating
- Assumes only that there are genetic effects large enough to find
- Asks what genes/variants are associated with my trait
- 500k to 5 million variants across the genome
- Multiple genome-wide chips available
- Varying strategies for SNP selection
- Imputation allows testing of ungenotyped SNPs
- Typically GWAS chips have focused on common SNPs with frequency > 1%
Candidate
- Limits testing to locations of perceived high prior probability
- "If you look under the lamppost you can only see what it illuminates"
Genome-Wide
- Extreme multiple testing: requires large sample sizes and meta-analysis of multiple studies to overcome
- Gives an "unbiased" view of the genome
- Allows unexpected discoveries
Whole Genome or Exome Sequencing
- Identifies known SNPs (that would be on a chip) but also previously undiscovered variants.
- Attempts to assay all, or nearly all, variation in genome or exome
- Whole exome:
- ~1% of the genome
- ~30 million bp
- Number of variants observed depends on sample size and population
- Whole genome: 3 billion bp, > 30 million known variants in 1000G project
Statistical Significance
There are many things to test in genetic association studies:
- Multiple phenotypes
- Multiple SNPs
- Candidate gene or region association
- Genome-wide association
- Haplotype Analyses
- Gene-Gene or Gene-environmental Interactions
The multiple tests are often correlated.
Type I error: Null hypothesis of "no association" is rejected, when in fact the marker is NOT associated with the trait.
This implies researchers will spend considerable resources on a gene or chromosomal region that is not truly important for the trait.
Type II error: Null hypothesis of "no association" is NOT rejected, when in fact the trait and marker are associated.
The significance level alpha for a single statistical test is the type I error rate for that test. If we perform multiple tests within the same study at level alpha, that type I error rate applies to each individual test but not to the experiment as a whole (unless some adjustment is made).
Family-wise error rate (FWER) is the probability of at least one type I error.
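To make the gap between the per-test level and the experiment-wide error concrete, the sketch below computes the FWER for m tests each run at level alpha. It assumes the tests are independent and that every null hypothesis is true, which is an idealization given that the tests in a genetic study are often correlated. The function names and the values of m are hypothetical, chosen only for illustration. A Bonferroni-style per-test threshold (alpha divided by m) is included as one common adjustment, not necessarily the one a given study would use.

```python
import math


def fwer_independent(alpha, m):
    """P(at least one type I error) over m independent tests at level alpha,
    assuming all m null hypotheses are true."""
    # Equals 1 - (1 - alpha)**m; log1p/expm1 keep it accurate for tiny alpha.
    return -math.expm1(m * math.log1p(-alpha))


def bonferroni_threshold(alpha, m):
    """Per-test level that keeps the FWER at or below alpha (union bound)."""
    return alpha / m


if __name__ == "__main__":
    alpha = 0.05
    # m values are illustrative: one test, a small candidate study,
    # and a hypothetical one-million-variant scan.
    for m in (1, 20, 1_000_000):
        per_test = bonferroni_threshold(alpha, m)
        print(
            f"m={m:>9,}  unadjusted FWER={fwer_independent(alpha, m):.4f}  "
            f"Bonferroni per-test level={per_test:.2e}  "
            f"adjusted FWER={fwer_independent(per_test, m):.4f}"
        )
```

With 20 independent tests at alpha = 0.05 the chance of at least one false positive is already about 64%. The union bound behind the Bonferroni threshold does not require independence, so it still controls the FWER when tests are correlated, though it then tends to be conservative.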
False discovery rate (FDR) is the expected proportion of type I errors among the rejected hypotheses.
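The two error measures can be contrasted on a single set of test results. The sketch below is a minimal illustration, assuming we somehow know which null hypotheses are true (in a real study we do not); the function name and the toy data are hypothetical. It reports whether any type I error occurred, which is the event whose probability is the FWER, and the proportion of rejections that are type I errors, whose expectation over repeated experiments is the FDR.

```python
def error_summary(rejected, is_true_null):
    """Summarize type I errors for one experiment.

    rejected[i]     -- True if the null hypothesis for test i was rejected
    is_true_null[i] -- True if that null hypothesis is actually true
    """
    if len(rejected) != len(is_true_null):
        raise ValueError("inputs must have the same length")

    n_rejected = sum(1 for r in rejected if r)
    # A type I error is a rejection of a null hypothesis that is true.
    n_false = sum(1 for r, null in zip(rejected, is_true_null) if r and null)

    return {
        "rejections": n_rejected,
        "type_I_errors": n_false,
        # The FWER is the probability of this event.
        "any_type_I_error": n_false > 0,
        # The FDR is the expectation of this proportion (taken as 0 when
        # nothing is rejected).
        "false_discovery_proportion": n_false / n_rejected if n_rejected else 0.0,
    }


if __name__ == "__main__":
    # Toy data: 10 markers, 4 rejected, one of which is a true null.
    rejected = [True, True, True, True] + [False] * 6
    is_true_null = [False, False, False, True] + [True] * 6
    print(error_summary(rejected, is_true_null))
```

In the toy data a single false positive already counts as a family-wise error, while the false discovery proportion is only 1 in 4. This is why FDR-based criteria are more permissive than FWER-based ones when many hypotheses are rejected.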