Bioinformatics: Improved scaffolding for genome assembly

A crucial part of genomics is the ability to accurately assemble reads (often short) into larger pieces, so-called contigs. This is still considered difficult, and recent results indicate that no single assembler performs well on all data sets. An important step in genome assembly is scaffolding, in which information about read pairing is used to connect contigs into larger units.

We have identified several weaknesses with common scaffolders:

  • The estimate of the distance between contigs has been biased in all available scaffolding tools.
  • There has been a lack of statistical evaluation of connections between contigs, making scaffolding of complex plant genomes, full of repeats and duplications, unnecessarily complicated.
  • Many available scaffolders are slow and contain bugs, making them impractical on large problem instances.
  • Evaluation of scaffolders has been lacking, and it has been difficult to decide which tool one can rely on for larger projects.
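
To make the first two points concrete, here is a minimal Python sketch of the kind of naive gap estimate and link filtering that a simple scaffolder might use. It is an illustration under stated assumptions, not BESST's method: the function name, the input layout, and the `min_links` threshold are all hypothetical, and both contigs are assumed to lie in forward orientation with the first preceding the second. The estimate simply subtracts the parts of the insert that fall on the two contigs from the library's mean insert size. Only pairs that actually span the gap are ever observed, which is one plausible source of the bias mentioned above, and the fixed link-count cutoff is a crude stand-in for a proper statistical evaluation of connections.

```python
from collections import defaultdict
from statistics import mean


def naive_gap_estimates(links, contig_lengths, mean_insert, min_links=2):
    """Estimate gaps between contig pairs from read-pair links.

    links: iterable of (contig_a, pos_a, contig_b, pos_b), where pos_a is the
        leftmost position of the forward read on contig_a and pos_b is the
        rightmost position of the reverse read on contig_b (hypothetical layout).
    contig_lengths: dict mapping contig name to its length.
    mean_insert: mean insert size of the paired-read library.
    min_links: ad hoc minimum number of supporting pairs (an assumption).
    """
    per_pair = defaultdict(list)
    for contig_a, pos_a, contig_b, pos_b in links:
        # Part of the insert that lies on each contig.
        covered_a = contig_lengths[contig_a] - pos_a
        covered_b = pos_b
        # Naive estimate: whatever remains of the mean insert is the gap.
        per_pair[(contig_a, contig_b)].append(mean_insert - covered_a - covered_b)

    # Keep only connections supported by enough read pairs.
    return {
        pair: mean(gaps)
        for pair, gaps in per_pair.items()
        if len(gaps) >= min_links
    }


contig_lengths = {"c1": 1000, "c2": 800, "c3": 600}
links = [
    ("c1", 800, "c2", 100),
    ("c1", 900, "c2", 200),
    ("c1", 850, "c3", 50),  # a single link, dropped by the min_links filter
]
estimates = naive_gap_estimates(links, contig_lengths, mean_insert=500)
```

In this toy input, the two links between c1 and c2 each yield a gap of 200, while the lone link to c3 is discarded.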

We are developing a new scaffolder, named BESST, which addresses these issues. The target datasets for BESST are plant genomes such as the spruce (Picea abies) and poplar (Populus tremula), for which our preliminary results are very promising.