

Expression quantitative trait loci (eQTLs), variations in the genome that impact gene expression, are identified through eQTL studies that test for a relationship between single nucleotide polymorphisms (SNPs) and gene expression levels. These studies typically assume an underlying additive model. Non-additive tests have been proposed, but they are limited by the increased multiple testing burden and are potentially biased by filtering criteria that rely on marginal association data. Here we propose using combinations of short haplotypes instead of SNPs as predictors for gene expression. Essentially, this method looks for genomic regions where haplotypes have different effect sizes. The differences in effect can be due to multiple genetic architectures, such as a single SNP, a burden of rare SNPs, multiple SNPs with independent effects, or multiple SNPs with an interaction effect occurring on the same haplotype.

A SNP represents variation at one specific location in the genome. Haplotypes represent variation across a set of successive SNPs on a single chromosome. Generally, when looking at a small region, a small number of haplotypes account for the majority of the haplotypes in a sample. Each of these haplotypes can have a different effect size depending on the genetic architecture occurring on it. The example image shows a compound heterozygous relationship between SNPs and haplotypes such that any haplotype containing an alternate allele at either of the SNP positions has complete loss of expression. Our proposed method seeks to maximize the difference between the frequency-weighted mean effect size on gene expression of haplotypes within a defined set (a HapSet) and those not in the set. Once a set is defined, haplotypes from individuals are identified as either in or out of the set, and a pseudo-genotype, defined as the number of haplotypes an individual carries that are within the set, is assigned. Gene expression can then be regressed on this pseudo-genotype to estimate the mean difference in effect size of haplotypes in and out of the set. In our example, our method would try to create a set of haplotypes where each haplotype contains at least one alternate allele at either of the two SNPs, and compare that set to the set of haplotypes that contain no alternate alleles at the two SNPs.
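The pseudo-genotype and regression steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the function names `pseudo_genotypes` and `regress`, the representation of haplotypes as strings of 0/1 alleles, and the toy data are all hypothetical, and the HapSet is given by hand rather than found by the search that maximizes the difference in effect size.

```python
import numpy as np

def pseudo_genotypes(haplotypes, hapset):
    """Count, per individual, how many of their two haplotypes are in the HapSet.

    haplotypes: list of (hap_a, hap_b) pairs; each haplotype is a string of
                0/1 alleles (a hypothetical encoding chosen for this sketch)
    hapset:     set of haplotype strings that make up the HapSet
    """
    return np.array([(a in hapset) + (b in hapset) for a, b in haplotypes])

def regress(expression, genotype):
    """Ordinary least squares of expression on genotype; returns (intercept, slope)."""
    X = np.column_stack([np.ones(len(genotype)), genotype])
    coef, *_ = np.linalg.lstsq(X, np.asarray(expression, dtype=float), rcond=None)
    return coef[0], coef[1]

# Two-SNP example: the HapSet holds every haplotype with at least one
# alternate allele; the only haplotype outside the set is "00".
hapset = {"01", "10", "11"}

# Toy individuals (made up for illustration), each a pair of haplotypes.
individuals = [("00", "00"), ("00", "01"), ("10", "00"),
               ("01", "10"), ("11", "00"), ("11", "10")]

g = pseudo_genotypes(individuals, hapset)

# Noise-free toy expression: each in-set haplotype removes one unit of
# expression, mimicking complete loss of expression on that haplotype.
expression = 2.0 - g

intercept, slope = regress(expression, g)
print(g, intercept, slope)
```

The slope estimates the mean difference in effect between haplotypes in and out of the set. In this noise-free toy case it recovers the value used to generate the data.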

Simulations show that when haplotypes, rather than SNPs, are assigned non-zero effect sizes, our method has increased power compared to the marginal SNP method. In the GEUVADIS gene expression data, our method finds 101 more eGenes than the marginal method (5,202 vs. 5,101). The two methods do not fully overlap in the eGenes that they find. Of the 5,202 eGenes found by our method, 707 are not found by the marginal method, even though it has a lower significance threshold. This indicates that many genes have regulatory architectures that are not well tagged by marginal SNPs and demonstrates the need to better model alternative architectures.
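As a rough illustration of why a haplotype-level effect can be poorly captured by a single SNP, the toy simulation below assigns the effect to haplotypes and compares the variance in expression explained by the pseudo-genotype with that explained by each SNP alone. It is not the simulation pipeline used for the results above: the sample size, allele frequency, independence of the two SNPs, effect size, and noise level are all assumptions, and it compares R² values rather than estimating power.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000      # number of individuals (arbitrary choice)
freq = 0.2    # alternate allele frequency at both SNPs (assumed)

# alleles[i, h, s] is True if individual i carries the alternate allele
# on haplotype copy h (0 or 1) at SNP s (0 or 1); SNPs are drawn independently.
alleles = rng.random((n, 2, 2)) < freq

# A haplotype is in the HapSet if it carries an alternate allele at either SNP.
in_set = alleles.any(axis=2)           # shape (n, 2)
pseudo = in_set.sum(axis=1)            # pseudo-genotype: 0, 1, or 2
snp_genotypes = alleles.sum(axis=1)    # shape (n, 2): alt allele count per SNP

# Each in-set haplotype loses one unit of expression; noise SD is assumed.
expression = (2 - pseudo) + rng.normal(0.0, 0.5, size=n)

def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple linear regression."""
    return np.corrcoef(x, y)[0, 1] ** 2

r2_hapset = r_squared(pseudo, expression)
r2_snps = [r_squared(snp_genotypes[:, s], expression) for s in range(2)]
print(r2_hapset, r2_snps)
```

Under these assumptions each SNP tags the HapSet only partially, so the pseudo-genotype explains more of the variance in expression than either marginal SNP does.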


Difficulty Level: medium

Difficulty Description: While most of the heavy statistical analysis will be done with pre-existing software, there will be a significant amount of data manipulation. This project will also require you to develop and run a disease simulation pipeline.

Recommended prior knowledge:

  1. Strong statistical background with knowledge of Bayesian statistics
  2. Proficiency in a programming language (e.g. R, Python, MATLAB)
  3. CS124
  4. Basic biology background (e.g. LS 3, 4, 7B)

Before meeting with me:

Please read the introduction and basic method description of the following:

  1. tbd

Please watch:

  1. tbd

If interested, contact Rob Brown (