Multiple Comparisons and Evaluating Significance

  • In 1978 Restriction Fragment Length Polymorphisms (RFLPs) were used for linkage analysis.
  • In 1987 the first human genetic map was created.
  • In 1989 microsatellite markers made genome-wide linkage studies possible.
  • 1990-2003 the Human Genome Project sequenced the human genome.
  • 2002-2006 HapMap project collected sequences in populations to discover variation across the genome.
  • 2006 onward, Genome-Wide Association Studies (GWAS)
  • 2010 onward, large scale custom arrays
  • 2010 onward, sequencing technology becomes affordable
  • Even more WGS projects...
    • ADSP 2012
    • TOPMed 2014
    • CCDG 2014

Prior to the GWAS era, genetic association studies were hypothesis driven: markers within or near a gene or region were tested for association. The working hypothesis took the form "Trait X is caused/influenced by Gene A." The hypothesis (gene or genes) came from:

  • Experiments in other species
  • Known associations with a related trait in humans
  • Linkage analysis localizing trait to a specific chromosomal region

Chip-based Genome-wide Association Scans

  • Hypothesis generating
    • Assumes only that there are genetic effects large enough to find
    • Asks what genes/variants are associated with my trait
  • 500k -> 5 million variants across the genome
    • Multiple genome-wide chips available
    • Varying strategies for SNP selection
    • Imputation allows testing of ungenotyped SNPs
    • Typically GWAS chips have focused on common SNPs with frequency > 1%
Candidate
  • Limits testing to locations of perceived high-prior-probability
  • "If you look under the lamppost you can only see what it illuminates"
Genome-Wide
  • Extreme multiple testing - requires large sample size, meta-analysis of multiple studies to overcome
  • Gives an "unbiased" view of the genome
  • Allows unexpected discoveries

Whole Genome or Exome Sequencing

  • Identifies known SNPs (that would be on a chip) but also previously undiscovered variants.
  • Attempts to assay all, or nearly all, variation in genome or exome
    • Whole exome:
      • ~1% of the genome
      • ~30 million bp
      • Number of variants observed depends on sample size and population
    • Whole genome: 3 billion bp,  > 30 million known variants in 1000G project

Statistical Significance

There are many things to test in genetic association studies:

  • Multiple phenotypes
  • Multiple SNPs
    • Candidate gene or region association
    • Genome-wide association
    • Haplotype Analyses
  • Gene-Gene or Gene-Environment Interactions

The multiple tests are often correlated.

Type I error: Null hypothesis of "no association" is rejected, when in fact the marker is NOT associated with the trait.
This implies researchers will spend a considerable amount of resources focusing on a gene or chromosomal region that is not truly important for the trait.

Type II error: Null hypothesis of "no association" is NOT rejected, when in fact the trait and marker are associated.
This implies the chromosomal region/gene is discarded; a piece of the genetic puzzle remains missing for now.

The significance level alpha for a single statistical test is the type-I error rate for that test. If we perform multiple tests within the same study at level alpha, the specified type-I error rate applies to each individual test but not to the experiment as a whole (unless some adjustment is made).
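To see how quickly the experiment-wide error rate departs from alpha, the short sketch below computes the chance of at least one false positive across m tests. It assumes that every null hypothesis is true and that the tests are independent; since tests in genetic studies are often correlated, the real value will generally differ and is often smaller. The function name is hypothetical and chosen for illustration only.

```python
def prob_any_false_positive(alpha: float, m: int) -> float:
    """Chance of at least one type I error across m tests at level alpha.

    Assumes all m null hypotheses are true and the tests are independent.
    """
    # Each test avoids a false positive with probability (1 - alpha);
    # under independence all m avoid one with probability (1 - alpha) ** m.
    return 1.0 - (1.0 - alpha) ** m


# Illustrative test counts: a single test, a small candidate-gene study,
# a larger panel, and a GWAS-scale scan.
for m in (1, 10, 100, 1_000_000):
    print(f"m = {m:>9,}: {prob_any_false_positive(0.05, m):.4f}")
```

With alpha = 0.05, ten independent tests already give roughly a 40% chance of at least one false positive, and at GWAS scale a false positive is essentially guaranteed without adjustment.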

Family-wise error rate (FWER) is the probability of at least one type I error.
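One common way to control the FWER is the Bonferroni correction, which tests each hypothesis at level alpha / m. The sketch below is a minimal illustration, not a method prescribed by these notes; the function names and the choice of one million tests are assumptions made for the example.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the FWER at or below alpha."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m


def bonferroni_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where the p-value passes Bonferroni."""
    if not pvalues:
        return []
    threshold = bonferroni_threshold(alpha, len(pvalues))
    return [p <= threshold for p in pvalues]


# Assumed scenario: about one million tests in a genome-wide scan.
print(bonferroni_threshold(0.05, 1_000_000))  # about 5e-8

# Small toy example with four hypothetical p-values (threshold 0.0125).
print(bonferroni_reject([0.001, 0.02, 0.04, 0.30]))
```

Bonferroni holds regardless of the correlation between tests, but it becomes conservative when tests are strongly correlated, which is one reason large sample sizes are needed at genome-wide scale.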

False discovery rate (FDR) is the expected proportion of type I errors among the rejected hypotheses.
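A widely used procedure for controlling the FDR is the Benjamini-Hochberg step-up method. The sketch below is a minimal plain-Python version under the usual assumption of independent (or positively dependent) tests; the function name and the toy p-values are hypothetical and not taken from these notes.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k such that p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis whose rank is at or below the cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Toy example with ten made-up p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))
```

In this toy example the two smallest p-values are rejected, whereas a per-test threshold of 0.05 / 10 = 0.005 would reject only the smallest, which illustrates that FDR control is typically less strict than FWER control.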