Bioinformatics: Inferring functional coupling of proteins using next gen sequencing data

Next-generation sequencing experiments have very high throughput and produce gigabytes of raw data. This poses a major challenge for processing and analysis, but it also makes it possible to obtain accurate and abundant data. The main goal of the project has been to use next-generation sequencing data to infer functional coupling in protein interaction networks. The data type used in this project was RNA-Seq.

We have been implementing a pipeline that goes from raw RNA-Seq reads to expression values for each gene in a suitable format. The pipeline filters reads based on quality, maps the raw reads to a reference genome, filters out artifacts in the data, and finally calculates expression values by summing the reads for each transcript and normalizing appropriately to obtain a Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value for each gene. FPKM measures how many reads have been recorded for each transcript, normalized by transcript length and the total number of mapped reads.
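The normalization described above can be sketched as follows. This is a minimal illustration of the textbook FPKM definition, not the statistical model that a dedicated tool applies; the function name, gene names, and all counts are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Return fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)
    The factor 1e9 combines "per kilobase" (1e3) and "per million" (1e6).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical input: fragments mapped to each transcript and its length in bp.
mapped = {
    "geneA": {"fragments": 500, "length_bp": 2000},
    "geneB": {"fragments": 100, "length_bp": 500},
}
total_fragments = 1_000_000  # hypothetical total of mapped fragments in the sample

fpkm_values = {
    gene: fpkm(info["fragments"], info["length_bp"], total_fragments)
    for gene, info in mapped.items()
}
```

In this made-up example geneA has five times as many fragments as geneB but is four times longer, so the two FPKM values end up close to each other. Correcting for transcript length in this way is the purpose of the normalization.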

Tools exist to handle different parts of the pipeline. For mapping reads to a reference genome, BWA or Bowtie can be used. Our current optimised pipeline uses the FASTX-Toolkit for filtering, TopHat/Bowtie for mapping, and Cufflinks for assembling transcripts and calculating FPKM values. The expression values for the genes were then used to calculate co-expression between gene pairs in order to infer functional coupling. This was done within the framework of FunCoup, a functional coupling database containing several different data types, to which RNA-Seq was to be added. Several data sets, mainly from the NCBI Short Read Archive, have been analyzed, including both single-end and paired-end read data sets. A few data sets that contain enough usable information have been selected for inclusion in the next release of FunCoup.
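As an illustration of the co-expression step, the sketch below scores every gene pair by the Pearson correlation of their expression values across samples. The text does not state which co-expression measure was used, so Pearson correlation is an assumption here, and the gene names and FPKM values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


# Hypothetical FPKM values: one list per gene, one entry per sample.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

# Score every unordered gene pair once.
coexpression = {
    (g1, g2): pearson(expression[g1], expression[g2])
    for g1, g2 in combinations(sorted(expression), 2)
}
```

Gene pairs whose profiles rise and fall together across samples score close to 1, while pairs with opposite trends score close to -1. How such scores are turned into evidence for functional coupling is not covered by this sketch.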