Statistical Genetics

BS858: Genetics, inheritance, complex traits, and testing for gene components

Genetics and Inheritance

Gregor Mendel was an Austrian monk who, in the mid-1800s, experimented with pea plants to find trait inheritance patterns in successive generations.

image-1663021566338.png

Mendel's Laws

  1. Principle of Segregation - Two alleles of a homologous pair of chromosomes separate (segregate) during gamete formation such that each gamete receives only one allele
  2. Principle of Independent Assortment - Alleles of a gene pair assort independently of other gene pairs. The segregation of one pair of alleles in no way alters the segregation of another pair of alleles*

*This law does not always hold when genes are linked on a chromosome, as explained later

Terminology

Homozygous - An individual who has two copies of the same allele at a locus

Heterozygous - An individual who has two different alleles at the same locus

A dominant allele needs only one copy to show the phenotype

A recessive allele needs two copies to show the phenotype

Co-dominance means both alleles are expressed in the heterozygote, so its phenotype differs from either homozygote

Example - Blood types A and B are co-dominant to each other, but dominant to type O:

image-1663022402037.png

Genes are inherited only during meiosis, not mitosis (although mutations can happen in both). Genes that are close to each other on a chromosome are more likely to be inherited together, which is why Mendel's second law does not always hold.

Pedigree Drawing

image-1663024032456.png

A diagonal line means the person is deceased. A filled-in shape represents an individual affected by the condition of interest; a half-filled shape represents a "carrier" of the condition who does not express it.

Mendelian Diseases

Mendelian diseases are disorders caused by a mutation in a single gene. They follow specific forms of inheritance; the most common are:

image-1663168885447.png

image-1663169001798.png

image-1663169031404.png

image-1663169228960.png

image-1663169262282.png

Any trait that does not follow a Mendelian pattern is a complex genetic trait

Storing Family Data

There is no real standard for storing genetic data; every group has its own way of representing it.

image-1663169926109.png

Here is a typical representation of a pedigree data set. 2 indicates something is present and 1 indicates it is not. Be prepared to draw a pedigree chart from something like the above.

a1 and a2 are the alleles being studied and aff indicates whether the individual is affected by the trait. We might represent the allele pair as 11, 12, 22 or 1/1, 1/2, 2/2, etc.

SNP (SNV) "single nucleotide polymorphism/variant" genotypes may be:

Microsatellites may be recorded as:

Sometimes the phenotype may be complex, i.e. more than affected/unaffected, in which case it is usually stored separately from the genetic data.

Useful Summary Information of Genetic Data

Linkage Analysis

The primary goal of linkage analysis is to determine the location (chromosome and region) of genes influencing a specific trait. We accomplish this by looking for evidence of co-inheritance of the trait with genes or markers whose locations are known; co-inheritance suggests the trait gene lies close to those markers.

image-1663625197774.png

Linked Genes

Recall Mendel's 2nd Law: The Principle of Independent Assortment - Alleles of a gene pair assort independently of other gene pairs. The segregation of one pair of alleles in no way alters the segregation of another pair of alleles EXCEPT when the genes are linked on a chromosome.

A haploid genotype (haplotype) is a combination of alleles at multiple loci that are transmitted together on the same chromosome. It contains:

  1.  The alleles present at each locus (multi-locus genotype) such as Aa Bb
  2. Which alleles are on the same chromosome, such as possible haplotypes Ab and aB or ab and AB

image-1663858116819.png

The above represents 2 different phenotypes of a pea plant: Y is color and R is whether it is round or wrinkled.

If the color and shape are unlinked then we would consider each box of a Punnett square to have equal likelihood:

image-1663858297910.png

image-1663858319328.png

If the color and shape are completely linked then all haploid gametes are RY or ry; that is, R is always inherited with Y and r is always inherited with y:

image-1663858511292.png

image-1663858531365.png

Genes can also be partially linked, making it more likely, but not guaranteed, that their alleles are transmitted together.

image-1663858673449.png

image-1663858693620.png

Recombination

Recombinants do not inherit the same chromosome arrangement as their parents, while non-recombinants do. The proportion of recombinants gives an indication of the distance along the chromosome between the two loci; the closer together they are, the less likely they are to recombine.

Consider Morgan's fruit fly experiment as an example:

image-1663858833734.png

image-1663859074730.png

Wing length and eye color are linked, so RN and rn should always be inherited together, but we can see the recombinants Rn and rN in the B1 generation that do not occur in the parents.

When loci are on different chromosomes, we observe recombination 50% of the time.

image-1663627283758.png

The line in-between the allele means they are on the same chromosome, so the alleles on the same side are inherited together.

Overview of Recombination

The recombination fraction (theta) is the probability of recombination (i.e. the probability of an odd number of crossovers) between two loci. The further apart they are, the more likely they are to recombine and vice versa.

Linkage Analysis

Linkage analysis estimates the genetic distance between genetic markers or between genetic markers and a trait locus with recombination. This allows us to know where a trait locus is in the genome if we know:

  1. The genetic distance between a disease locus and a marker 
  2. The location of the marker on the genome.

To perform a linkage analysis, estimate the recombination fraction between the disease locus and a marker locus and test:

H0: 𝜃 = 1/2 -> not linked, i.e. segregating independently

H1: 𝜃 < 1/2 -> linked, i.e. close together on the same chromosome

The most common test of linkage is Logarithm of the Odds (LOD):

LOD Score = log base 10 of the likelihood ratio = log10(L(𝜃=𝜃1)/L(𝜃=.5))

This is a transformation of the usual likelihood ratio test. The probability of observing R recombinants and N non-recombinants, given recombination fraction 𝜃, follows the binomial distribution:

image-1663627779340.png

So the likelihood is:

image-1663627804127.png

We can ignore the constant because we will be working with a ratio of likelihoods and the constant cancels.

When we know the exact number of recombinants and non-recombinants:

image-1663627923838.png

The MLE of 𝜃 is 𝜃_hat = R/(R+N) which is the value of 𝜃 that maximizes the LOD score and likelihood function.
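As a quick illustration, here is a minimal R sketch of this computation, assuming hypothetical counts of 15 recombinants and 85 non-recombinants (the binomial constant cancels in the likelihood ratio):

    R <- 15; N <- 85                               # hypothetical counts
    theta.hat <- R / (R + N)                       # MLE of the recombination fraction
    loglik10 <- function(theta) R * log10(theta) + N * log10(1 - theta)
    lod <- loglik10(theta.hat) - loglik10(0.5)     # LOD = log10 likelihood ratio vs. theta = 1/2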

Factors Influencing Linkage Analysis
Overview of Linkage Analysis

Parametric linkage analysis assesses linkage between a marker and a locus

Advantages

Disadvantages

Multi-point Linkage Analysis

Multi-point analysis incorporates multiple markers into the likelihood computation. It computes the likelihood that a disease is located at a certain position on a chromosome. The null hypothesis is that the disease locus is not on the chromosome. It has more information and thus is more powerful.

Affected Relative Pairs

Based on sets of affected pairs, no need to specify a model. This is a non-parametric tests of linkage analysis. Easier to collect but need a larger sample.

Alleles that are copies of the same allele from a common ancestor are called "identical-by-descent" or IBD for short. IBS is "identity-by-state". All alleles shared IBD are also shared IBS, but the opposite is not true.

image-1663631210060.png

The bottom rightmost "3" in the example definitely comes from the father, so it is shared IBD.

The logic behind this test is that if an allele is "causing" a disease, and the disease is not common, two affected relatives will most probably share the allele IBD. This means the surrounding region will most probably be IBD, and we can look for regions which are more likely to hold the disease locus.

Consider the below chart with the expected frequency of relative pairs showing sharing 0,1, or 2 alleles IBD under no linkage:

image-1663862221470.png

Monozygotic twins always share both alleles IBD. A parent and child always share exactly 1 allele IBD, making parent-child pairs uninformative for linkage analysis. Sibling pairs can give the most information.

The process of linkage analysis through affected relative pairs is as follows:

  1. Collect a sample of pairs of affected relatives
  2. Genotype some markers and estimate IBD
  3. At multiple locations across the genome test whether the pairs share more alleles IBD than would be expected by chance.

Tests for affected sibling pairs:

Goodness of Fit

image-1663863418555.png

Linkage Measured By LOD Score

LOD Score = log_10(Likelihood Ratio)

Likelihood Ratio Test (LRT) = 2 * ln(Likelihood Ratio) ~ chi2

Therefore: chi2 = 2 * ln(10) * log_10(likelihood ratio) ≈ 4.6 * LOD Score

LOD Score  = chi2 test value / 4.6, with 1 df

LOD score  ~ z2 / 4.6 or t2 / 4.6

Quantitative Trait Locus Mapping

Relatives who have similar trait values should have higher than expected levels of sharing genetic material near the genes that influence those traits (a greater IBD).

The Haseman-Elston (HE) regression was the first approach to QTL mapping in humans.

image-1663864197348.png

It can be proven that:

image-1663864215315.png

Where 𝜎2a is the genetic variance explained by the locus and 𝜎2e is the environmental variance of the trait.

image-1663864292968.png

A negative slope implies linkage, as relatives with similar trait values have small squared differences and high IBD sharing

Assessing the Genetic Component of a Phenotype

A phenotype is the appearance of an individual, which results from the interaction of the person's genetic makeup and their environment. Phenotypes can be categorical or numerical. If we are interested in the genetic component of a trait  there are different methods we can use for analysis.

We define a trait for analysis, determine what we are interested in studying, and then the methods that need to be used. For example, if we want to know how BMI is linked to gene effects we would need to consider other factors such as age, sex, smoking, etc. Thus, a multiple linear regression might be appropriate, where we can measure the variability explained by each factor.

Variance of Phenotypic Traits

𝜎2T = The observed (phenotypic) variability of a trait
𝜎2T = 𝜎2G + 𝜎2E = The phenotypic variability can be partitioned in to variability due to genetic and environmental effects
𝜎2G = 𝜎2A + 𝜎2D = The genetic component can be further partitioned into additive and dominance genetic variance

We can write a model for the trait as:  
T = (A + D) + E = G + E

Assumptions of this model:

Additive and Dominance Components

Consider a frequency distribution of trait values for two alleles B and b, where B creates a high trait value and b creates a low trait value on a continuous scale which is shifted so that the midpoint between the mean of BB (+a) and bb (-a) is 0:

image-1664374671516.png

d is the mean of the Bb group

In an additive model d = 0 (no dominance; dominance variance = 0)
In a recessive model: d = -a    (Bb would overlap the bb distribution)
In a dominant model d = +a   (Bb would overlap the BB distribution)

The degree of dominance can be defined as d/a

Heritability

The heritability of a trait is the proportion of total phenotypic variance that is due to genetic effects.

Heritability can be defined as:

h2    =    𝜎2G /  𝜎2T     =   ( 𝜎2A + 𝜎2D ) / ( 𝜎2A + 𝜎2D + 𝜎2E )

The above formula is also called broad-sense heritability. Narrow-sense heritability uses just the additive component:

hn2    =  𝜎2A  / 𝜎2T

We use the expected resemblance among relatives to estimate h2. It is a function of covariance between relatives and coefficient of relationship (AKA "Additive coefficient")

The additive coefficient of a relationship C is the expected proportion of alleles shared IBD by a relative pair, defined as:

C = 2^(-R), where R is the degree of relationship

image-1664378503599.png

Recall that sharing an allele identical-by-descent means the relatives share the exact same copy of an allele by inheritance.

image-1664378809549.png

The additive coefficient is the expected proportion of alleles shared IBD by the pair, so we can also define it as

image-1664379231292.png

p(x) = x/2 = the proportion of shared alleles

For example, the parent-child relationship would have an additive coefficient of 0*(0/2) + 1*(1/2) + 0*(2/2) = 1/2

The kinship coefficient is the probability that a pair of alleles sampled at random, one from each individual, is IBD. It is always half of the coefficient of relationship.

Estimating Heritability

Using Co-variance

Recall the properties of covariance:

image-1664379863104.png

A standardized measure of covariance is correlation, which ranges from -1 to 1

image-1664380038270.png

We can use the properties of covariance to determine the covariance between a quantitative measure on a parent and an offspring. This follows the same T = (A + D) + E = G + E model as above:

Cov(Parent, Offspring) = Cov(Ap + Dp + Ep, Ao + Do + Eo)

In this example, we can break this down into:

Note: This only works if the independence assumptions hold and the standard deviation is the same for all individuals.

Since we are assuming the standard deviation is the same across generations, we can write the heritability between parent and offspring in terms of the correlation:

hn2 =  𝜎2A  / 𝜎2T = 2 * Cov(parent, offspring) / SD(parent)*SD(child) = 2 * Cor(parent, offspring)

Thus, we can estimate narrow-sense heritability using 2 times the observed correlation between parents and offspring.
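A minimal R sketch of this estimator on simulated parent-offspring pairs (all data hypothetical):

    set.seed(1)
    parent <- rnorm(200)                           # parent trait values
    child  <- 0.4 * parent + rnorm(200, sd = 0.9)  # offspring trait values
    h2.hat <- 2 * cor(parent, child)               # narrow-sense heritability estimate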

Using Linear Regression

Alternatively, we can use the linear regression coefficient to estimate heritability. Where Y is offspring and X is parent:

𝑌 = 𝛼 + 𝛽𝑋 + 𝐸

 𝛽_hat ~ cor(X, Y) x SD(Y)/SD(X)

image-1664384066843.png

hn2 = 2*rPO = 2*𝛽_hatPO

Note that the estimate obtained from beta may differ from the estimate obtained from r if the standard deviations of parents and offspring are not equal.

If both parents are available we can regress the offspring value on the mean parental phenotype value:

image-1664384675608.png

h2 = 𝛽_hatMO  

Beta estimates heritability directly in the average-parent version. I'm not going to show the math here, but know this is in part because SD(average parent) = SD(parent) / sqrt(2)

Example with Sibling Pairs

Looking at the relationship table above in the siblings row: additive coefficient is 1/2 and dominance coefficient 1/4. If data is available on N sibling pairs only:

Cov(Sibling 1, Sibling 2) = (1 / 2) 𝜎2A + (1 / 4) 𝜎2D 

If 𝜎2D   = 0, 2 times the intraclass correlation (ICC) is an estimate of heritability:image-1664386082651.png
Where x and SD(x) are computed combining data on all siblings.

If 𝜎2D  does not equal 0, this estimate is between the narrow and broad heritability because:

image-1664386233273.png

Intraclass Correlation Vs. Pearson (Product-Moment) Correlation

When pairs consist of individuals of two different classes (grandparent-grandchild, parent-offspring) we call this pairwise correlation and we can use a simple Pearson correlation coefficient:

image-1664386482464.png

But when the pairs have no obvious order (siblings or cousins), the intraclass correlation is used:

image-1664386537312.png
where n is the number of pairs and x_bar is the mean value across all individuals

The main difference between the product-moment correlation and ICC:

ICC can be extended to sets of 3 or more. The formula for ICC is:

image-1664386880153.png

k = number of siblings per sibship
MSB = Mean Square Between (model)
MSW = Mean Square Within (error)
If the sibships are of unequal size, using the average sibship size will provide an approximate h2 estimate

h2 = 2 * ICC
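A sketch of this computation in R using one-way ANOVA on simulated sibships (hypothetical data; the ICC formula used here is the standard one-way ANOVA form, assuming equal sibship size k):

    set.seed(2)
    sib <- data.frame(sibship = factor(rep(1:100, each = 2)),  # 100 sibships, k = 2
                      trait   = rnorm(200))
    k   <- 2
    tab <- anova(aov(trait ~ sibship, data = sib))
    msb <- tab["sibship",   "Mean Sq"]             # Mean Square Between
    msw <- tab["Residuals", "Mean Sq"]             # Mean Square Within
    icc <- (msb - msw) / (msb + (k - 1) * msw)     # intraclass correlation
    h2  <- 2 * icc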

Twin Studies

Monozygotic twins are genetically identical and dizygotic twins are genetically equivalent to full siblings, sharing half their genes on average. A common estimate of heritability in studies with both monozygotic and dizygotic twins is:

h2 = 2 * (ICCMZ - ICCDZ)

If 𝜎2D  does not equal 0, this estimate is greater than both the narrow and broad heritability because:

image-1664387386880.png

This is in part because we cannot assume there is no shared environmental variance in siblings and especially twins.

Notes on Estimating Heritability

We can estimate the linear mixed effects (variance component model):

image-1664388948635.png
In which case we would estimate heritability as the proportion of total variance due to 𝜎2G and test H0: 𝜎2G = 0

Heritability Summary

Recurrence Risk Ratio

The recurrence risk ratio assesses familial aggregation of dichotomous traits (diseases). 𝜆𝑅 denotes relative risk, with a subscript indicating the specific type of relative examined.

A proband is a subject selected for a study due to disease status. In familial disease studies, it is the affected subject that brought the family into the study

𝜆𝑅 = 𝐾𝑅/𝐾
𝐾𝑅 = P(disease | relative of type R is affected) = the proportion (prevalence) of relatives of affected probands that are affected with the disease
K = the proportion of affected individuals in the general population

General rule:

Population Genetics

Genotype and allele frequency estimation is the first step in studying a polymorphism, whether using family data or independent individuals from a population. We can use a subset of individuals who are independent and count alleles, or use maximum likelihood methods to take all genotypes into account for pedigree data.

Consider the following example of allele frequency estimation:image-1664839258179.png
We can take the frequency of each allele by the observed proportions:
    pM = (2*298 + 489)/(2*1000)
    pN = 1 - pM

Hardy-Weinberg Law

Describes how we expect allele frequencies and phenotype frequencies to be related in a population.

  1. For a large, random-mating population, in the absence of forces that change allele frequencies, the allele and genotype frequencies remain constant from one generation to the next
  2. After one generation of random mating, for an autosomal locus with alleles 1 and 2 (frequencies p and q = 1 - p), the relative frequencies of the genotypes 11, 12, and 22 are:
         p2, 2pq, q2

image-1664839771200.png

Assumptions

Given two alleles 1 and 2, there are 6 possible parent mating types:image-1664840002929.png

So the frequency of allele 1 in the offspring would be:
    P(11) + ½P(12) = (u+v/2)2 + (u+v/2)(v/2+w) = (u+v/2)(u+v+w) = u+v/2

Similarly, the frequency of allele 2 is: w + v/2

Forces that change allele or genotype frequency (invalidate HW law)
Testing for HWE

Though several assumptions of the HW law are not met in any population, genotypes in a population usually conform reasonably well to expectations, due to the various forces cancelling each other out.

H0: The genotype frequencies match the HW expectations (p2, 2pq, q2)

  1. Estimate allele frequency (p_hat)
  2. Determine the expected genotype frequencies from the estimated allele frequency, assuming null is true

    image-1664840626013.png

  3. Compute the expected genotype counts

    image-1664840650695.png

  4. Compare observed genotype counts to expected
    image-1664840698879.png

Compare X2 to a chi-squared distribution with 1 degree of freedom. This is usually the number of categories minus 1, but we lose an additional degree of freedom since we estimate allelic frequencies from the data (3-1-1).

Conclusion: The observed genotype frequencies are [not] significantly different from the expectations of the HW equilibrium.

When we reject the HWE, we usually don't know why other than the assumptions being violated.
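Here is a minimal R sketch of the chi-squared HWE test, using the genotype counts from the MN example above (298 / 489 / 213):

    obs <- c(298, 489, 213)                        # observed counts: 11, 12, 22
    n   <- sum(obs)
    p   <- (2 * obs[1] + obs[2]) / (2 * n)         # estimated frequency of allele 1
    exp <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)  # expected counts under HWE
    X2  <- sum((obs - exp)^2 / exp)
    pchisq(X2, df = 1, lower.tail = FALSE)         # df = 3 - 1 - 1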

Exact HWE Test

There are (2N)!/nA!nB! possible arrangements for the alleles in the sample. Under HWE the probability of observing exactly nAB heterozygotes in N individuals with nA alleles is:image-1664841170550.png

Under many conditions, samples of affected individuals will not be in HWE for alleles associated with disease BUT controls should be close to HWE, as should population-based (unascertained) samples. Note also that genotypes among related individuals may not be in HWE since the individuals are not independent.

If the observed genotypes in unaffected controls (whole random sample) depart from HWE, this may indicate:

Often if the HWE test p-value is << .01 in controls, the marker is not used in association analyses.

Population Structure

"Random Mating" occurs within a population, but not within overall population.image-1664842754495.png

We can observe the combined population to test for structure in sub-populations:image-1664842847540.png

The heterozygote deficit F = 1 - (observed het freq)/(HWE expected het freq)image-1664842908493.png
F is the proportional decrease in observed heterozygotes relative to what would be expected under HWE.

Linkage Disequilibrium (LD)

Recall a haplotype is a set of markers on the same chromosome that are inherited together. The haplotype consists of two pieces of information: the genotypes and which alleles are inherited together.

Suppose we have two markers 1, with alleles A and a and freq pA, pa, and 2, with alleles B and b and freq pB, pb
We have 4 possible haplotypes: AB, Ab, aB, ab
If pAB is the probability A and B are on the same chromosome, then pAB = pA * pB if the markers are independent
Let Pt(AB) be the frequency of the haplotype AB after t generations of random mating, and theta is the recombination fraction
image-1664843774174.png

The idea being that we can measure the decay of Linkage Disequilibrium as a function of generations passed:image-1664843995619.png

You can also measure this in a pairwise fashion:image-1664844181567.png

There are problems with the pairwise measure however:

Min and Max of DAB:
    If D>0, D ≤ min(pApb,papB)
    If D<0, |D| ≤ min(pApB,papb)

D is usually scaled so that its range is (0, 1) or (-1, 1):image-1664844532725.png

D' = 1 -> No evidence for recombination between markers
D' = 1 -> Fewer than 4 haplotypes are observed between two biallelic variants
If allele frequencies are similar D' near 1 -> the markers are good surrogates

D' estimates can be inflated with small sample sizes
D' estimates can be inflated when one allele is rare
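A small R sketch of the pairwise calculation, with hypothetical haplotype and allele frequencies:

    pA <- 0.6; pB <- 0.3; pAB <- 0.25              # hypothetical frequencies
    D  <- pAB - pA * pB                            # D = 0 when the markers are independent
    Dmax <- if (D > 0) min(pA * (1 - pB), (1 - pA) * pB) else min(pA * pB, (1 - pA) * (1 - pB))
    Dprime <- D / Dmax                             # scaled so |D'| <= 1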

Association Testing in Unrelated Individuals

In association testing we are interested in the effect of a specific allele in the population. We ask the question: "Is allele X1 more common in affected individuals than unaffected individuals?" We do not need family data to answer this, but we can use it if we have it.

Recall that linkage disequilibrium is the tendency for alleles at two closely linked loci on a chromosome to be associated in a population. Linkage is assessed within families (alleles inherited together), while LD is assessed within a population (alleles at two loci found together on a haplotype more often than expected). Two loci are linked when the theta score is less than .5; this does not necessarily mean they are in LD.

After many generations, LD between a disease-causing mutation and surrounding markers decays due to recombination.

image-1665531223732.png
The above graphic represents a mutation on the blue chromosome; over time, recombination during meiosis slowly decreases the LD between the mutation and the surrounding chromosome.

Specific alleles may be functional or in LD with functional mutations.
    Functional = causal -> the mutation that actually causes the increased/decreased risk/phenotype.

Genetic Association Testing

In genetic association analysis, we are trying to identify association between genotype (SNP) and phenotypes. Some alleles may be functional or in LD with functional mutations, meaning it's not the "true cause" of a particular phenotype.

functional/causal - the mutation that actually causes the increased/decreased phenotype

H0: No association between marker genotypes and phenotype

When the null hypothesis is rejected, we conclude a marker is associated with a phenotype.

The marker may be a functional mutation or in LD with a functional mutation

Association Testing for Qualitative Traits

Logistic Regression

The linear logistic model has the form:image-1665532592248.png

Coding Genetic Variables

We are interested in evaluating whether the genotype at a specific marker is associated with being affected by a phenotype.

Consider a SNP with alleles 1 and 2. We have 3 genotypes, where het indicates heterozygote status and hom indicates homozygote status for allele 2:

image-1665532906635.png

Model: ln(odds for affected | geno) = 𝛽0 + 𝛽1Het + 𝛽2Hom
Our test has 2 degrees of freedom with H0: 𝛽1 = 𝛽2 = 0
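A sketch in R of fitting this 2-df model with glm on simulated data (all names and data hypothetical):

    set.seed(3)
    g   <- sample(0:2, 500, replace = TRUE, prob = c(0.49, 0.42, 0.09))  # 11, 12, 22
    het <- as.numeric(g == 1)                      # indicator for genotype 12
    hom <- as.numeric(g == 2)                      # indicator for genotype 22
    y   <- rbinom(500, 1, plogis(-1 + 0.4 * het + 0.8 * hom))
    full <- glm(y ~ het + hom, family = binomial)
    null <- glm(y ~ 1, family = binomial)
    anova(null, full, test = "LRT")                # 2-df test of H0: beta1 = beta2 = 0
    exp(coef(full))                                # odds ratios relative to genotype 11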

So the odds ratio comparing genotype 12 to 11:
image-1665607189363.png
evaluates to e raised to the coefficient of the heterozygote variable. Under an additive model we assume this is the same as the odds ratio comparing genotype 22 to 12, since we consider one allele at a time.

We can also test if there is a difference between 12 and 22 by creating a dominant and recessive model

Below we look at an additive model where ADD denotes the number of "1" alleles, which we'll assume is more rare than "2" alleles (the reference group):image-1665608024216.png

Interpretation

General model:

"The odds of disease for the 12 genotype is OR12,22 times the odds of the disease for the 22 genotype, and the odds of disease for the 11 genotype is OR11,22 times the odds of disease for the 22 genotype."

Additive model:

"The odds ratio increases multiplicatively by exp(𝛽1) for each additional 1 allele"
"The odds of disease increase by a factor of OR12,22 for each additional 1 allele"
"The odds of disease for the 12 genotype is OR12,22 times the odds of disease for the 22 genotype and the odds of disease for the 11 genotype is OR11,22 times the odds of the disease for the 12 genotype."

Determining the Type of Model

Based on Beta Estimates

Compare the beta coefficients from the general model to determine the best genetic model (general, additive, dominant, recessive).

image-1665609420936.png
1 is dominant because the estimates are nearly the same for the heterozygous and homozygous genotypes
2 is additive because the het estimate is roughly half the hom estimate
3 is recessive because there is an effect only for the homozygous (11) genotype
4 is general because it doesn't fit any of the above criteria

Case-Control Tests

Tables vs. Logistic Regression Tests

Cross-Sectional and Cohort Studies

Association Testing for Quantitative Traits

This is typically performed in a random sample from the population. The most common testing method is linear regression; genotypes can be coded exactly as shown for logistic regression. Parameters here are the mean difference in the outcome due to difference in genotype.

H0: There is no association between alleles (or genotypes) and this trait; 𝛽1 = 𝛽2 = 0

From there we pretty much use the same equation as above:
image-1665610465797.png

For example, in a table coded identically to the one above, the mean difference between 12 and 11 simplifies to the beta coefficient for the 12 genotype:
𝛽0 + 𝛽1 - 𝛽0 = 𝛽1
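A minimal R sketch with additive coding and a simulated quantitative trait (hypothetical data):

    set.seed(4)
    add <- sample(0:2, 500, replace = TRUE, prob = c(0.64, 0.32, 0.04))  # count of "1" alleles
    y   <- 0.3 * add + rnorm(500)                  # trait with per-allele mean shift of 0.3
    summary(lm(y ~ add))                           # beta1 = mean trait difference per allele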

Confounding

Recall that a confounder is a third factor which is related to both exposure and outcome, and which accounts for some or all of the observed relationship between the two.image-1665610998370.png

image-1665611079535.png

When testing for association between a SNP and a phenotype, population structure may be a confounder; otherwise SNPs are very unlikely to be confounded, as behavioral and environmental factors do not alter DNA and vice versa.

Mixing two populations can cause spurious association, or departure from HWE, causing alleles at 2 loci to appear associated even though they may be on different chromosomes.

While it's unlikely to confound in the strict sense, adjustment for behavioral and environmental factors may be helpful if they affect the phenotype independently of the genes of interest. (increased precision of estimate -> increased power)

How to Deal with Structure?

When the sub-populations are known:

When factors defining sub-populations are unknown or difficult to measure:

Useful Guidelines

Always explicitly distinguish between variables that derive from non-genetic, reported information vs. genetically inferred information.


Multiple Comparisons and Evaluating Significance

Prior to the GWAS era, genetic association studies were hypothesis driven, testing markers within/near a gene or region for association: "H0: The trait X is caused/influenced by Gene A." The hypothesis (gene or genes) came from:

Chip-based Genome-wide Association Scans

Candidate
Genome-Wide

Whole Genome or Exome Sequencing

Statistical Significance

There are many things to test in genetic association studies:

The multiple tests are often correlated.

Type I error: Null hypothesis of "no association" is rejected, when in fact the marker is NOT associated with that trait.
This implies researchers will spend a considerable amount of resources focusing on a gene or chromosomal region that is not truly important for the trait.

Type II error: Null hypothesis of "no association" is NOT rejected, when in fact the trait and marker are associated.
This implies the chromosomal region/gene is discarded; a piece of the genetic puzzle remains missing for now.

For a multiple testing problem with m tests:
image-1666377062984.png

Family-wise error rate (FWER) is the probability of at least one type I error; FWER = P(V > 0)

False discovery rate (FDR) is the expected proportion of type I errors among the rejected hypotheses; FDR = E(V/R)
    Assume V/R = 0 when R = 0

Procedures to Control FWER

The general strategy is to adjust p-value of each test for multiple testing; Then compare the adjusted p-values to alpha, so that FWER can be controlled at alpha.

Equivalently, determine the nominal p-value that is required to achieve FWER alpha.

Sidák

Sidák adjusted p-value is based on the binomial distribution:

Bonferroni

A simplification of Sidák:image-1666377583820.png

Bonferroni adjusted p-value: 

Below are the individual p-values needed to reject for family-wise significance level=.05
image-1666378611019.png
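Both corrections are one-liners in R; a sketch with hypothetical p-values:

    p <- c(1e-4, 0.003, 0.02, 0.15, 0.4)           # hypothetical p-values from m = 5 tests
    m <- length(p)
    p.sidak <- 1 - (1 - p)^m                       # Sidak adjusted p-values
    p.bonf  <- p.adjust(p, method = "bonferroni")  # Bonferroni: min(m * p, 1)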

minP

The probability that the minimum p-value from m tests is smaller than the observed p-value when ALL of the tests are NULL.
image-1666378838710.png

Equivalent to the Sidák adjustment if all tests are independent. But for dependent tests, we don't know the distribution of the p-values under the null hypothesis, so we use permutation to determine the distribution.

Adjusted p-value is the probability that the minimum p-value in a resampled data set is smaller than the observed p-value.

This is less conservative than the above two methods, and the results are equal to Sidák when tests are independent.

Permutation is done under the assumption that the phenotype is independent of the genotypes; phenotypes are permuted with respect to genotype.
Original
image-1666379458667.png
Permuted:
image-1666379478602.png
Genotypes from an individual are kept together to preserve LD
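A sketch of the permutation-based minP adjustment in R (hypothetical simulated data; only the phenotype vector is shuffled, so each individual's genotypes stay together and LD is preserved):

    set.seed(5)
    n <- 100; M <- 20
    geno  <- matrix(rbinom(n * M, 2, 0.3), n, M)   # n subjects x M SNPs
    pheno <- rnorm(n)
    minp <- function(y, X)                         # smallest single-SNP p-value
      min(apply(X, 2, function(g) summary(lm(y ~ g))$coefficients[2, 4]))
    obs  <- minp(pheno, geno)
    perm <- replicate(500, minp(sample(pheno), geno))
    mean(perm <= obs)                              # adjusted p-value for the minimum p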

Permutation Procedure

Permutation is computationally expensive, and in some situations it is not possible at all (related individuals, meta-analysis results).

Alternative

Use the Bonferroni or Sidak correction with the "effective number of independent tests" instead of total number of tests. This reduces the number of tests to account for dependence among test statistics. We must approximate the equivalent number of independent tests.

For a single study you can compute the effective number of independent tests based on the genotype data.

Once you have an estimate of meff use it in the Bonferroni or Sidak correction

Another alternative: Extreme Tail Theory Approximation (not covered here)

FWER Summary

Bonferroni, Sidák, and minP are all single-step adjustments; i.e. all p-values are adjusted in the same manner regardless of their values. This makes them very easy to understand and compute, but it sacrifices power.

Controlling FWER is very stringent (< 5% chance of a single false positive)

False Discovery Rate (FDR)

Controlling P(V > 0) is too stringent when m is large and you can expect multiple true positives. For better power, control E(V/R) instead.

E(V/R) = the expected proportion of Type I errors among rejected null hypotheses.

  1. Rank p-values pr1 <= pr2 <= ... <= prm
  2. Adjusted p-values
    1. p*rm = prm
    2. p*rk = min(p*r(k+1), prk * m/k)
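This step-wise adjustment is the Benjamini-Hochberg procedure, available directly in R; a sketch with hypothetical p-values:

    p <- c(1e-4, 0.003, 0.02, 0.15, 0.4)           # hypothetical p-values
    p.adjust(p, method = "BH")                     # FDR adjusted p-values as defined above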

FDR: Q-Value

For an individual hypothesis test, the minimum FDR at which the test may be called significant is the q-value. It indicates the expected proportion of false-positive discoveries among associations with equivalent statistical evidence.

FDR for a specific p-value threshold t:
image-1666382414483.pngwhere m0 = number of true null hypothesis

We can estimate m0 by m*pi_hat_0, where pi_hat_0 is an estimate of the proportion of true null hypotheses (could be ~1)

image-1666382536844.png

  1. Rank p-values: pr1 <= pr2 <= ... prm
  2. Estimate the proportion of true null hypotheses pi_hat_0
    1. Using pi_hat_0 = 1 leads to conservative q-values estimates equal to the FDR adjusted p-values
    2. See suggested approach in Storey and Tibshirani (2003)
  3. Compute q-values:

    image-1666382645804.png

  4. Reject null hypothesis if q-value <= alpha

Which to use?

Guidelines for Adjusting for Multiple Comparisons

Overall: we are unlikely to have sufficient power to achieve experiment-wide or genome-wide significance with a single study and a large number of SNPs

Best choice is meta-analysis and/or replication using independent studies

When combining studies is not an option, report the most promising results based on p-value and other factors

Make available results from all SNP association analyses so that other investigators can attempt to confirm or replicate your findings.

Choose a threshold BEFORE looking at the data.

Genomic Control

Genomic control was proposed to measure and adjust for modest population structure within a sample in the context of GWAS

 

Association Testing in Related Individuals

Family data are correlated, which can lead to inflation in test statistics if not accounted for. Many genetic studies contain related individuals. Family-based studies were developed to avoid bias due to population structure; biological family members are genetically matched.

Most behavioral and environmental factors do not alter DNA, so SNP associations are unlikely to be confounded EXCEPT confounding by ancestry.

image-1667394044217.png

Population Stratification Bias

Designs for Family-Based Studies

We'll mostly focus on case-parent trios and TDT tests.

Case-Parent Trios

In this design we collect samples of trios: two parents and one affected child (the affection status of the parents is not considered). The idea is that under Mendel's laws, heterozygous parents transmit each allele with equal probability (1/2) regardless of population structure. If there is preferential transmission of a specific allele from parents to affected offspring, it indicates that allele is associated with the disease.

Transmission Disequilibrium Test (TDT)

Tests whether a particular marker allele is transmitted to affected offspring more frequently than expected. Non-transmitted alleles act as matched controls.

H0: No association or no linkage
Ha: Variant is associated with disease/trait (and is not due to population structure)

Notation:

image-1667395369592.png

Under the null we would expect:
image-1667395321311.png
image-1667395351114.png

The test statistic follows a McNemar test:
image-1667395577062.png

We observe this is only a function of the heterozygous parents; homozygous parents provide no information.
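A minimal R sketch of the TDT statistic, assuming hypothetical transmission counts from heterozygous parents:

    b <- 45                                        # het parents transmitting allele 1
    c <- 28                                        # het parents transmitting allele 2
    tdt <- (b - c)^2 / (b + c)                     # McNemar chi-squared, 1 df
    pchisq(tdt, df = 1, lower.tail = FALSE)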

Discordant Siblings

General Family Structures

Unconditional Tests

Review: For unrelated (independent) subjects we can use standard linear or logistic regression. Including related individuals in the analysis violates the assumption of independence, causing inflation in the test statistics.

image-1667397203470.png
Where Xi1 is the genotype of the ith person and beta1 is the fixed effect of the SNP. Xi2 and Xi3 are covariates.

image-1667397383331.png
We can see the mixed effects model adds a G variable to account for genetic variance component.

For association analysis of quantitative traits, if X1 is the SNP then the test H0: Beta1 = 0 is a test of association just as for the linear model for unrelated subjects. The regression estimate beta is exactly the same as in simple linear regression.

In either case G is not tested for significance. If we want to test for heritability we estimate variance of G.

Common software: GCTA, SOLAR, R GMMAT package, Genesis R/Bioconductor package.

Dichotomous Traits

In this course we focus on quantitative traits to keep things simple.


Haplotypes and Imputation

When multiple markers/SNPs are genotyped in a gene or gene region, the SNPs may be in linkage disequilibrium (LD). Each individual test of association with a marker is correlated with all tests for other markers in LD with that marker.

So, instead of testing the individual markers for association, we may want to test haplotypes of markers for association.

Review: A haplotype is a combination of alleles at multiple loci that are transmitted together on the same chromosome. It provides the alleles present at each locus and which alleles are on the same chromosome. For example, if we have 2 SNPs with alleles (A, a) and (B, b) then we have 2x2 = 4 possible haplotypes: AB, Ab, aB, ab.

For linked SNPs within a gene or small chromosomal region there are typically far fewer haplotypes observed than are theoretically possible (as a consequence of LD).

Reasons to Care About Haplotypes

  1. We haven't typed the causal variant. The haplotype that the variant lies on is a better surrogate for the variant than any one SNP in the haplotype.

    image-1667409934425.png

  2. The "causal" genetic variant is actually the haplotype, not any one variant -- The haplotype confers risk rather than an individual SNP allele.

    image-1667409958708.png

  3. They allow us to impute additional ungenotyped variants

Often we don't have family data to help us. Haplotype phase is ambiguous only if there are 2 or more heterozygous genotypes. For example, Aa Bb genotype has two possible phases:image-1667411325955.png

All multi-locus genotypes that consist of 2 or more heterozygotes have ambiguous phase.

When family data is not available to determine haplotypes, multi-locus haplotype probabilities and frequencies can be estimated.

Haplotype Inference

The goal is to get the probability of a haplotype given the individual's genotype. There are two classes of algorithms: EM or MCMC

There are many options for software for phasing with and without family data.

The basic idea is to determine haplotype frequencies for sets of SNPs in close proximity on a chromosome, and compare frequencies in cases and controls or by quantitative phenotype.

Which SNPs should we use to test with Haplotypes?

Regression with Haplotypes/Likelihood-Based Tests

Take the following example with 3 SNPs and 4 observed Haplotypes, where each person i has 2 haplotypes:
image-1667414339979.png

We can use variables to count the number of each haplotype type per person. This coding works for any type of multi-allelic marker, not just haplotypes.

We can use a linear regression model with the haplotype counts as predictors:
image-1667415990954.png

Note we do not include H1 in the model. Since H1 + H2 + H3 + H4 = 2, we already know H1 once we know the remaining haplotypes. Think of H1 as a reference haplotype.

Each individual will have a 0, 1, or 2 for each of the 3 indicator variables for the haplotypes.

For 4 haplotypes, the general (Omnibus) test for association would be:
H0: Beta1 = Beta2 = Beta3 = 0; df = 3
If we reject the null, then do tests on individual haplotypes

Global (omnibus) tests of haplotype association provide some protection from the multiple testing problem, as they test whether the haplotype distribution as a whole is associated with phenotype.

In reality we rarely know haplotypes with certainty. Instead, for each individual haplotype estimation produces the probability of each possible haplotype pair. Can use the same model, but replace the observed haplotype counts with the expected counts (determined by probabilities).

SNP Imputation

The most recent application of haplotypes is imputation. The idea is to use a reference population that has denser genotyping or whole genome sequencing covering your subset of SNPs plus many additional SNPs, to impute the SNPs not typed on your chip.

A reference population is a set of individuals who have been genotyped or sequenced in a comprehensive manner. Typically these are whole-genome sequenced (all the SNPs in your sample are included, plus many additional).

The HapMap project was the first available reference population. HRC created a large reference panel of human haplotypes by combining sequencing data from multiple cohorts (the current release consists of 64,976 haplotypes at ~39 million SNPs). TOPMed has a set of haplotypes from 97,256 genome-sequenced individuals, more diverse than HRC.

image-1667589046111.png

The two commonly used methods for imputation are IMPUTE and Minimac, and some organizations have "imputation servers" so you can impute your GWAS data to large sequenced data sets without needing powerful on-site computers.

"Dosage"

Genotype imputation does not assign a genotype to each individual, instead for each individual it assigns a probability for each of the possible genotypes at each genetic variant.

Analysis is then performed on the "dosage" - the expected genotype. The most common model is additive, but one could also use a dominant or recessive dosage.

image-1667590451926.png
image-1667590441064.png

"Best Guess Genotype"

Another method is "best guess" genotype. This is the genotype with highest probability. It ignores uncertainty in genotype assignment. We usually impose a posterior probability threshold; If p < Q, genotype is set to missing (typically set to .8, .9 or .95)

Imputation Accuracy

The overall quality of imputed data depends on how well matched the reference samples are to the samples you are trying to impute.

The concordance of imputed genotypes with the true (unknown) genotypes also depends on how much information is available about the ungenotyped SNPs (how strong is the LD between ungenotyped and genotyped SNPs?)

For high information the concordance reaches nearly 100%, for low information regions it can be much lower.

Imputation R2 is an estimate of the squared correlation between imputed and true genotypes. It can be estimated by the obs/exp variance ratio: the ratio of the empirical genotype dosage variance to the expected (binomial) genotype variance.

image-1667591207876.png

Due to excess homozygosity, R2 for a well-imputed variant will occasionally be more than 1.

This imputation quality measure is also a measure of the correlation between the imputed genotypes and true genotypes.

Typically:

Good imputation:

The worst possible imputation is when we have no information about the genotypes other than the allele frequency (which can be estimated from the reference haplotypes). In this case, the imputation would assign everyone the genotype probabilities q2, 2pq, p2.  Thus everyone would have the same genotype dosage:
image-1667591673115.png
The dosage variance would be 0 because there is no LD information, and thus the correlation with the true genotypes would be 0.

Why Impute Genotypes?

We can also combine studies to increase sample size, and thus power. The problem is that studies use different genotyping platforms, so the set of SNPs shared across platforms is small. There are 2 ways to combine studies:

 

 

Power and Sample Size Calculations for Association Studies

Review of errors and difference in means/proportions:
image-1668094351617.png
Type II error is represented by beta and type I error as alpha
image-1668094463033.png
Where Z1-(alpha/2) is the Z-value that puts probability alpha in the tails of the distribution. Power decreases as alpha decreases.

Power also has a direct relationship with sample size. Sometimes we choose sample size based on the desired power for a specific alternative hypothesis of interest, and other times we use a fixed sample size and significance level.

A Z-test of differences in proportions with 2 groups with different sample sizes:
image-1668094942720.png
C is the mean of the distribution under the alternative hypothesis:
image-1668096033431.png
And the power can be calculated by taking the cdf of C minus the one sided Z-value
image-1668096022931.png

Likewise, we can use a similar z-test for differences in the mean of a quantitative trait:
image-1668096147319.png
And the power can be calculated the same way as proportions.

Applying Power to Genetic Associations

Dichotomous Traits

  1. Determine the expected genotype frequencies in cases and controls under the alternative hypothesis of interest
  2. Estimate power for a given sample size, or estimate the minimum sample size required for given power and significance level under the planned analysis model

Genetic Model

Genotype Relative Risk

Genotype relative risk is another way to parameterize penetrance, the probability that someone is affected given their genotype.

Penetrance fi = P(affected | i risk alleles) for i = 0,1, or 2 risk alleles at a locus

So if we know the relative risks (y1 and y2), the population prevalence K, and the risk allele frequency q, we can determine the penetrances (f0, f1, f2):
image-1668177944525.png
This is derived from the law of total probability, which I will not show here.

Models for Dichotomous Traits

image-1668178026751.png

Note: multiplicative GRR ~= multiplicative OR
additive log(OR) -> multiplicative OR
Thus, multiplicative GRR model is most similar to the additive log(OR) model that we use in logistic regression

H0: Genotype or allele frequencies are the same in cases and controls
Ha: genotype frequencies are not equal in cases and controls. They follow a [dominant, recessive, additive] model

To compute power we need the genotype frequencies for the cases and controls under the alternative hypothesis:
P(i risk alleles | person is a case)
P(i risk alleles | person is a control)
For i = 0,1,2

The expected genotype frequencies for cases and controls are obtained from f using Bayes' rule:
image-1668427102204.png

To compute power we must specify not only the alternative hypothesis model, but also the analysis model: the model you will use in the analysis for which you compute power.

Under a dominant model the genotype relative risks are equal, e.g. y1 = y2 = 2
image-1668428249256.png
Notice how when we analyze the recessive model as a dominant one (right), the power decreases significantly.

Power - General 2DF Test

We can also compute power for the 2df genotype-based chi-squared test. The underlying formulas will not be explained here, as there is software called the Genetic Power Calculator to compute this; you specify GRRs, prevalence, and risk allele frequency, and it produces the genotype frequencies in cases and controls and the power under dominant, recessive, additive and 2DF tests.

Summary - Power for Dichotomous Traits

  1. Determine the expected genotype frequencies in cases and controls under the alternative hypothesis
    • Need to specify genetic model/effect size
  2. Estimate power for given sample size or estimate minimum sample size required for given power using the specific analysis model and significance level
  3. Can also find the minimum effect size that yields specific (usually 80%) power for a given sample size and allele frequency

Quantitative Traits

We usually parameterize power for quantitative traits in terms of locus-specific heritability (h2), the proportion of variance explained by the genetic variant (R2). It is independent of underlying model and allele frequency.

For a linear regression F test, the power is determined by the non-centrality parameter 𝜆: the proportion of variance in the trait explained by the genotype (h2) times the sample size

𝜆 = h2 * N

To determine the critical value from the (central) F distribution:
    In R: qf(1-alpha, df1, df2)

Code to determine power using non-central F-distribution and non-centrality parameter "ncp":
    In R: pf(crit.val, df1, df2, ncp, lower.tail = F)

So for any proportion of variance explained by h2 and N we can compute power. However, a single h2 value corresponds to many alternative hypothesis models.
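Putting the two R calls together, a sketch of the full power computation (df2 = N - 2 is an assumption for a single-SNP model with an intercept):

    h2 <- 0.01; N <- 1000; alpha <- 5e-8
    ncp      <- h2 * N                             # non-centrality parameter
    crit.val <- qf(1 - alpha, df1 = 1, df2 = N - 2)
    power    <- pf(crit.val, df1 = 1, df2 = N - 2, ncp = ncp, lower.tail = FALSE)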

For dichotomous traits we can simply compute power for difference in proportions, but specifying an underlying genetic model (GRRs, prevalence, allele frequency) is more meaningful. Similarly, power for a quantitative trait is usually more meaningful if we understand how the heritability relates to the genotypic means for a trait -- the genetic model. To find this we need allele frequency for that allele that increases phenotype and the degree of dominance d (additive, dominant or recessive).

Genotypic Means

a is effect size

Recall the variance components in a single gene model:
image-1668429807414.png

Locus-specific heritability h2, degree of dominance d, and allele frequency p:
image-1668430044058.png

We can re-scale so that the total trait variance σ2P = 1, by dividing the trait values by the SD:
image-1668430015190.png

Next we can set R = (-1, 0, 1) for recessive, additive and dominant models respectively, then d = Ra. We substitute Ra for d in the above and solve for a:
image-1668430178148.png
When R = 0:
image-1668430206980.png

Recap on Effects on Power

Use the significance level you will use when performing the association tests, or the family-wise error rate for the entire experiment.

For the model, if there is no prior hypothesis, try to find examples from the literature of reasonable effect sizes for your alternative hypothesis. Usually under an additive or multiplicative model. Beware that differences in study design can lead to different expected effect sizes.

Winner's curse: significant regression estimates from small, underpowered samples are biased -- they tend to over-estimate the true effect.

Sequencing Data and Analysis of Rare Variants

Genotyping arrays measure pre-selected sites for each sample, e.g. sites known to be polymorphic based on prior sequencing.

Sequencing is obtaining "every" base in the exome or genome for each sample. Most of the sequence is identical across samples. This is used to find locations that are polymorphic, or differ across samples.

Whole Genome Sequencing

image-1668973412196.png

Genome: ~3GB per individual

Advantage - whole genome coverage
Disadvantage - cost ~$1000 for 30x, limited interpretability

Whole "Exome" Sequencing

image-1668973539321.png

Exome: ~33MB per individual

Advantage: Covers protein coding regions, interpretable variation, cost ~$500
Disadvantage: Missing 99% genome coverage

VCF Format

Variant Call Format (VCF) is a standard format for storing sequencing data. It includes genotypes for all individuals in the study at all sites where at least one individual had an allele different from the reference allele.

Every VCF file has three parts in the following order:

image-1668981799141.png

The first nine columns of the data record give information about the variants:
1. CHROM – the chromosome number/id
2. POS – the genome coordinate of the first base in the variant.
Within a chromosome, VCF records are sorted in order of increasing position.
3. ID – a semicolon-separated list of marker identifiers (often rsid)
4. REF – the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC")
5. ALT – the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field is a comma-separated list of all alternate alleles.
6. QUAL - The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. A value of 10 indicates a 1 in 10 chance of error, 50 indicates a 1 in 10^5 chance of error, etc.
image-1668982098663.png
7. FILTER - Either "PASS" or a semicolon-separated list of failed quality control filters
8. INFO - Values in INFO are defined in the header. Contains additional information about the variant represented as tag-value pairs, where the tag and value are separated by an equals sign, and pairs are separated by semicolons. Usually it is information summarized from the samples, but it can also include information from other sources such as population frequencies from a database.
9. FORMAT - The fields in FORMAT are defined in the header. It is a colon-separated list describing the format of the data reported for each sample in the file.

There are many tools that can be used with VCF and BCF (binary version of VCF), such as bcftools and plink.

"Missing Heritability"

GWA studies are great for identifying SNP associations, but usually loci identified have small effects on traits. Much of the phenotypic variation or risk due to genetics ("heritability") is unexplained.

Unexplained variance/risk may be due to:

Rare Variants

There are a lot of different rare variants; collectively they are very common! Rare variants are more likely to be functional/deleterious. The best way to find rare variants is to sequence.

Recent expansion of the human population supports the theory of causal rare variants (there are 8 billion people today). Multiple causal rare variants have been found within genes. Animal studies suggest larger effects for rarer variants.

MAF (Minor Allele Frequency) cutoff can be used to define a "rare" variant (this level makes a difference in analysis).

Association Analysis with Rare Variants

The problem with single variant tests such as regression methods is that there are too few observations to provide a stable test.

We can combine rare variants by grouping them by gene region or functional information (exomes, non-synonymous or nonsense, predicted function), which can improve power.

image-1668986962347.png

Burden Tests

CAST

Unweighted Sum / CMC

RVT1

Weighted Sum of the Variants

image-1668986062637.png
image-1668986557104.png
Risk Allele Frequency:
image-1668986571391.png
For case-control outcomes:
image-1668986646424.png

am: total number of minor (or risk) alleles for the SNP m
nm: total number of subjects for SNP m
aum: number of minor (or risk) alleles in the unaffected subjects for SNP m
num: total number of unaffected subjects for SNP m
Gi,m: is the number of risk variants for SNP m in individual i

Madsen and Browning proposed using a weight based on the inverse of the MAF in the unaffected sample. Lower-MAF SNPs get larger weights. This makes the assumption that rarer variants have larger phenotypic effects.
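A sketch of a Madsen-Browning-style weighted burden score in R (simulated hypothetical data; the +1 smoothing in the MAF estimate is an assumption following the weighted-sum idea above):

    set.seed(6)
    n <- 200; M <- 10
    G      <- matrix(rbinom(n * M, 2, 0.01), n, M)  # minor-allele counts for rare variants
    status <- rbinom(n, 1, 0.5)                     # 1 = case, 0 = control
    nu <- sum(status == 0)
    qu <- (colSums(G[status == 0, , drop = FALSE]) + 1) / (2 * nu + 2)  # MAF in controls
    w  <- 1 / sqrt(nu * qu * (1 - qu))              # rarer variants get larger weights
    score <- as.vector(G %*% w)                     # one burden score per person
    summary(glm(status ~ score, family = binomial))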

Variable Threshold (VT) Model

Variance Component Tests (SKAT)

Sequence Kernel Association Test

We want to test H0: Beta_1 = Beta_2 = ... = Beta_m = 0, or equivalently t = 0
Assume:
image-1668987802308.png
t is a variance component and wj is a pre-specified weight for variant j, usually chosen to be a function of the MAF.

Under the null hypothesis Q is a mixture of chi-squared distributions.

Variance component score test statistic:
image-1668988060578.png

image-1668988434224.png

For the weights the original paper recommends this weight for variant j:
wj = Beta(MAFj; alpha1, alpha2), with alpha1 = 1 and alpha2 = 25; Madsen-Browning-style weighting uses alpha1 = .5 and alpha2 = .5
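In R these weights are just the Beta density evaluated at each variant's MAF; a quick sketch with hypothetical MAFs:

    maf <- c(0.001, 0.005, 0.01, 0.05)             # hypothetical MAFs
    dbeta(maf, 1, 25)                              # default SKAT weights: strongly favor rare variants
    dbeta(maf, 0.5, 0.5)                           # Madsen-Browning-style weights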

SKAT-O

The goal is to combine the best features of SKAT and burden tests into one test with optimal power. Burden tests have good power when all variants have similar effect size & direction. SKAT has better power when many variants are null and/or effects in opposite directions.

Significance Thresholds for Rare Variant Tests

Depends on:

Bonferroni correction for the number of tests performed can also be applied, though it is likely to be conservative even if the tests are independent -- some tested genes/regions do not have sufficient variation. Unless the study is very large, p-values are likely to be less significant than expected under H0: no association.

Conditional Analysis

For regression-based tests, a conditional analysis amounts to including the associated common variant AND the rare variant score in the same model.

image-1668990020720.png

To test whether the rare variant score is associated with the trait conditional on the common variant:
H0: Beta_2 = 0

If the test is significant then the association between Yi and the common variant Zi does not fully explain the association observed between Yi and the rare variant score Xi; i.e. the rare variant is not just a proxy for the common variant or vice versa.

 

Interactions in Genetic Association Analysis

Statistical interaction occurs between two factors if their combined effect is different than what would be expected based on their individual separate effects. In genetics, differences in risk or mean phenotype between genotypes vary according to the exposure (interacting) variable.

Simple example: Phenylketonuria (PKU) is a genetic defect that causes severe intellectual disability only in the presence of dietary phenylalanine. Every baby is tested for this trait to see if they need a special diet.

Modeling Statistical Interaction

We need a model to form an expectation for the joint effect, which we can use to define the interaction. The definition of interaction relies on some specification of non-interactive effects.

Let Gi be the genotype for person i, and Ei be an environmental factor for a person i
image-1669735373721.png

H0: βGE = 0; No interaction, the measure of association βG between the distribution of the phenotype Y and the genotype G does not depend on E.

image-1669818948296.png

Example logistic model with interaction (CP = cleft palate):
image-1669740889941.png

So there are 2 interpretations of βGE:
1. The difference between the Environment logOR when G = 1 vs when G = 0
2. The difference between the Genotype logOR when E = 1 vs when E = 0

The odds ratio for exposure vs. non-exposure is exp(βE) when G = 0, and the per-allele odds ratio at the locus is exp(βGE) times larger in the exposed group than in the non-exposed group (or smaller, if βGE < 0).

If there is interaction, we should not use summary measures (OR & genotypic means) for the marker or the environmental factor alone. We may improve power to identify genes and SNPs by testing within subgroups of the interacting exposure, or just testing for interaction. 
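A sketch of fitting and testing the interaction term in R with simulated data (all effect sizes hypothetical):

    set.seed(7)
    G <- sample(0:2, 400, replace = TRUE)          # additive genotype coding
    E <- rbinom(400, 1, 0.4)                       # binary exposure
    y <- rbinom(400, 1, plogis(-1 + 0.2*G + 0.3*E + 0.5*G*E))
    fit <- glm(y ~ G * E, family = binomial)       # expands to G + E + G:E
    summary(fit)$coefficients["G:E", ]             # Wald test of H0: beta_GE = 0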

Detecting Interactions: Power

Even in the GWAS era, there are very few well-known examples of interaction of GxE that have been identified. This is in part due to the fact that tests for interaction are not as powerful as tests for main effect association.

We need much larger sample sizes to detect interaction than main effects, at least 4x the size is needed for same effect. Due to lack of power published reports of GxE interactions may be more prone to publication bias (although usually interaction is not the primary hypothesis in a study). It is better to publish all results to avoid publication bias, but this also leads to many published interactions that have not been replicated.

This power issue is a particular problem when a large number of markers are tested; as always, multiple testing adjustment is required to preserve the family-wise error rate. We'll focus on two strategies that have been proposed for maximizing power when conducting GxE GWAS: case-only designs and two-stage screening procedures

Case-Only Design

If G and E are independent (depending on design, in population or among controls) cases from a case-control study can be used to estimate the multiplicative GxE interaction.

When G and E are independent in the controls, the OR for the G-E association among cases equals the multiplicative interaction between E and G. Improved precision: the design acts like a case-control or cohort study with an infinite number of controls.

image-1669742387338.png
The interaction OR is the ratio of G-E OR in cases to G-E OR in controls. D_bar represents the controls, D represents cases.

If G and E are independent in the controls then the joint probability P(GE | D) = P(G | D)*P(E | D) and the ratio of ORs is just the E-G OR among cases.

We can test for G-E interaction by looking at the G-E association in cases.

Advantages
Limitations

Two-Stage Screening Procedure

As we've seen, GxE tests have poor power compared to main-effect tests, and the multiple testing problem is more serious. One way to improve power is to do fewer tests, which gives a smaller multiple testing correction and thus a larger significance level.

For example, a genome wide test for smoking and SNP interaction power will be very low, as we are usually accounting for > 2.5 million SNPs, so we would need to use a multiple testing correction like alpha = 5x10^-8

Screening procedures are one way to choose which GxE tests to do. A screening procedure gives us a way to select only a small subset of the SNPs, m << M, to test for G-E interaction. So, we adjust for m SNPs instead of M SNPs, making the significance level alpha/m instead of alpha/M.

The tricky part is that the screening procedure must be independent of the G-E interaction test in order to maintain alpha; the screening test also needs to be good at identifying variants that are likely to have a G-E interaction.

Kooperberg LeBlanc Stage 1 (Main G Effect Screening)

image-1669821247142.png

Perform test using all data (E = 0 and E = 1)

Murcray Stage 1 (SNP-E Screening)

image-1669821620508.png

This test is only performed with cases and controls combined.

Meta Analysis

GWAS is performed on millions of SNPs. Because of multiple testing, we use very stringent thresholds for statistical significance. This can greatly reduce power and may not be sufficient to detect associated SNPs. Combining information across studies will improve power.

The objective of meta-analysis is to combine results from multiple studies to improve power and get more precise effect estimates.

Limitations of Pooling Raw Data

Originally meta-analysis was used to combine results from published studies with small samples to improve power. This is subject to publication bias: only studies with significant p-values are published, so we are more likely to see a positive effect. This is not an issue in GWAS meta-analysis since ALL results are included regardless of significance.

Meta-Analysis of a Single SNP

We have a collection of K studies testing the same null hypothesis. Each study provides an estimate, beta_hat, of the effect size and a p-value for testing the null hypothesis.

The key assumption is independence of study results, no subjects should appear in more than one study.

The first step in a meta-analysis is to identify a consistent measure from each study to be summarized (OR, regression coefficient and SE, p-value from a chi-square association test).

We can use Fisher's method or Stouffer's method, and that's all the information I need for now.
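A sketch of both methods in R, with hypothetical p-values and sample sizes as Stouffer weights (one-sided p-values assumed for the Z conversion):

    p <- c(0.04, 0.01, 0.20); n <- c(500, 800, 300)
    # Fisher: -2 * sum(log p) ~ chi-squared with 2K df under H0
    pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
    # Stouffer: combine Z-scores, weighting each study by sqrt(n)
    z <- qnorm(p, lower.tail = FALSE)
    pnorm(sum(sqrt(n) * z) / sqrt(sum(n)), lower.tail = FALSE)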