Bioinformatics: Computational methods to assess and remedy mapping bias in allele-specific expression and genomic cis-element analysis

Assessing allele-specific expression (ASE) and allele-specific cis-element (ASC) binding or modification from massively parallel sequencing read data is a straightforward way to home in on transcriptional and cis-regulatory variation at the level of single individuals. However, when sequence reads are mapped to a reference genome, reads carrying the allele identical to that of the reference are preferentially mapped, which biases the mapping towards that allele (Figure 1). This bias can significantly affect the outcome of analyses of allele specificity in both RNA-seq and ChIP-seq experiments. Here, we plan to investigate the causes and effects of mapping bias in this setting, starting with a review and benchmark of the current methods to remedy it. We have so far identified six approaches to tackle mapping bias: (i) masking genomic variants (Degner et al. 2009); (ii) replacing the expected allelic ratio of 0.5 with the mapping ratio observed for simulated reads generated at an equal allelic ratio (Montgomery et al. 2010); (iii) mapping reads to an individual-specific transcriptome reference generated by phasing (i.e., assigning to either the maternal or the paternal allele) the RNA genotype calls of the individual (Turro et al. 2011); (iv) using an alignment program that supports ambiguous allele coding at genomic variants (Heap et al. 2010); (v) constructing all possible haplotypes within a read given an input set of SNPs (Satya et al. 2012); and (vi) permitting a relatively high number of mismatches while requiring that the second-best mapping have considerably more mismatches, and using several mapping programs (Bahn et al. 2012). We will benchmark these approaches and plan to incorporate what works best into a novel computational approach for remedying mapping bias.
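To make two of these approaches concrete, the following minimal Python sketch illustrates the general idea behind (i) and (ii). It is not the implementation used in the cited studies: the function names `mask_variants` and `binomial_two_sided_p`, the 0-based coordinates, and the choice of an exact two-sided binomial test are our own assumptions for illustration.

```python
from math import comb


def mask_variants(reference: str, snp_positions: list[int]) -> str:
    """Approach (i), sketched: replace known variant positions (0-based,
    an assumption of this example) with 'N' so that neither allele
    matches the reference better than the other."""
    seq = list(reference)
    for pos in snp_positions:
        seq[pos] = "N"
    return "".join(seq)


def binomial_two_sided_p(ref_count: int, alt_count: int,
                         expected_ref_ratio: float = 0.5) -> float:
    """Approach (ii), sketched: exact two-sided binomial test of the
    reference-allele count against an expected ratio that may differ
    from 0.5 (e.g., a ratio estimated from simulated reads)."""
    n = ref_count + alt_count
    pmf = [comb(n, k) * expected_ref_ratio**k * (1 - expected_ref_ratio)**(n - k)
           for k in range(n + 1)]
    observed = pmf[ref_count]
    # Sum the probabilities of all outcomes no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    print(mask_variants("ACGTACGT", [1, 5]))
    # Hypothetical counts: 60 reference reads and 40 alternative reads.
    print(binomial_two_sided_p(60, 40))        # null ratio 0.5
    print(binomial_two_sided_p(60, 40, 0.55))  # null ratio adjusted for bias
```

With a null ratio shifted towards the reference allele, the same read counts give weaker evidence for allelic imbalance, which is the intended effect of approach (ii).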

An important component of the expression part of this assessment is the development of a novel computational method for simulating RNA-seq read data. Unlike existing methods used to simulate data for ASE studies, it would model the occurrence of several single-nucleotide variants (SNVs) per read, which is a crucial feature at the predominant sequence read length (typically 2×100 bases). Furthermore, it would preferably model sequencing errors in a position-specific manner based on base-quality measures from real data (either existing in-house data sets or other RNA-seq data sets), and take splice isoforms into account.
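As a rough illustration of the intended simulator, the sketch below draws single-end reads from two haplotypes, so that a read spanning several SNVs carries all alleles of its haplotype, and applies a per-cycle substitution error rate. All names (`make_haplotypes`, `simulate_read`, `simulate_reads`) and parameters are hypothetical; in practice the per-cycle error rates would be derived from base qualities in real data, and paired-end reads and splice isoforms, which this sketch omits, would have to be modeled as well.

```python
import random


def make_haplotypes(reference, snvs):
    """Build two haplotype sequences from a reference.
    snvs: dict {0-based position: (allele on haplotype 0, allele on haplotype 1)}."""
    hap0, hap1 = list(reference), list(reference)
    for pos, (allele0, allele1) in snvs.items():
        hap0[pos], hap1[pos] = allele0, allele1
    return "".join(hap0), "".join(hap1)


def simulate_read(haplotype, start, read_len, error_rates, rng):
    """Copy a window of the haplotype and substitute each base with the
    error probability given for its sequencing cycle."""
    read = []
    for cycle, base in enumerate(haplotype[start:start + read_len]):
        if rng.random() < error_rates[cycle]:
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
    return "".join(read)


def simulate_reads(reference, snvs, n_reads, read_len, error_rates,
                   hap0_fraction=0.5, seed=0):
    """Return a list of (haplotype index, start, read sequence) tuples.
    hap0_fraction sets the true allelic ratio of the simulated data."""
    rng = random.Random(seed)
    haplotypes = make_haplotypes(reference, snvs)
    reads = []
    for _ in range(n_reads):
        hap = 0 if rng.random() < hap0_fraction else 1
        start = rng.randrange(len(reference) - read_len + 1)
        reads.append((hap, start,
                      simulate_read(haplotypes[hap], start, read_len, error_rates, rng)))
    return reads
```

Because the true haplotype of origin is recorded for every read, the reads can be mapped with any of the approaches under evaluation and the recovered allelic ratio compared with `hap0_fraction`.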

A third part is the development of an RNA-seq read-based phasing tool that uses no input other than the reads to phase the detected variants. This amounts to a haplotype-aware, reference-genome-guided transcriptome assembler. Here, we plan to combine a transcriptome assembler (e.g., Trinity (Grabherr et al. 2011)) with a variant caller (e.g., SAMtools mpileup) to provide a computational method that can phase variants with useful quality from RNA-seq data alone.
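The core idea of read-based phasing can be sketched as follows: reads that cover two heterozygous variants vote on whether the alleles lie on the same haplotype. This toy Python example is a simplification under our own assumptions, not the planned tool; `phase_variants` is a hypothetical name, reads are reduced to the alleles they show at variant positions, only adjacent variants are linked, and a new phase block is started whenever the evidence is absent or tied.

```python
from collections import Counter


def phase_variants(variants, reads):
    """Greedy phasing of adjacent heterozygous variants.

    variants: list of (position, allele_a, allele_b), sorted by position.
    reads:    list of dicts {position: observed base}, one dict per read.
    Returns a list of phase blocks; each block is a list of
    (position, allele on haplotype A, allele on haplotype B)."""
    blocks, current = [], []
    for pos, a, b in variants:
        if not current:
            current = [(pos, a, b)]
            continue
        prev_pos, prev_a, prev_b = current[-1]
        votes = Counter()
        for read in reads:
            pair = (read.get(prev_pos), read.get(pos))
            if pair in ((prev_a, a), (prev_b, b)):
                votes["cis"] += 1      # supports the orientation as given
            elif pair in ((prev_a, b), (prev_b, a)):
                votes["trans"] += 1    # supports the flipped orientation
        if votes["cis"] == votes["trans"]:
            # No linking reads, or a tie: close the block and start a new one.
            blocks.append(current)
            current = [(pos, a, b)]
        elif votes["cis"] > votes["trans"]:
            current.append((pos, a, b))
        else:
            current.append((pos, b, a))
    if current:
        blocks.append(current)
    return blocks
```

In RNA-seq data, linking evidence is limited to variants that co-occur within a read or read pair of the same transcript, so distant variants will generally end up in separate phase blocks unless assembled transcripts bridge them.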

References

Bahn et al. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Research (2012) vol. 22 (1) pp. 142-50

Degner et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics (2009) vol. 25 (24) pp. 3207-12

Grabherr et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol (2011) vol. 29 (7) pp. 644-52

Heap et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum Mol Genet (2010) vol. 19 (1) pp. 122-34

Montgomery et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature (2010) vol. 464 (7289) pp. 773-7

Satya et al. A new strategy to reduce allelic bias in RNA-Seq read mapping. Nucleic Acids Research (2012) vol. 40 (16) pp. e127

Turro et al. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol (2011) vol. 12 (2) pp. R13

Investigators