Genomic Data Visualization in Python - From deep sequencing to insights

Data usually comes out of the sequencer as FASTQ files, which we align to the reference genome using an aligner (the software that maps the short reads to the reference). Once the alignment is done, two types of files can be generated: a SAM file (the alignment) and its binary version, the BAM file (used in this tutorial).

 

The data consist of a list of BAM files (one BAM file per subject studied, for example one tissue) and an interval file in BED format (listing the amplicon regions). Once these are generated, we need to visualize what we have in order to answer a number of questions, such as:

  • What is the coverage of the sequencing?
  • How are the short reads distributed across the genome?
  • Is the coverage equally distributed across all amplicon regions?
  • Which specific amplicon regions show much higher coverage than the others?
  • Can we cluster tissues by coverage?
  • etc.
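
Before any of these questions can be addressed, the amplicon intervals have to be loaded. The following is a minimal sketch of reading a BED-style interval file with the standard library only; it is not the tutorial's own code. The `Amplicon` type, the `read_bed` helper and the example coordinates are hypothetical names and values introduced here for illustration. It assumes the usual BED layout (chromosome, 0-based start, exclusive end, optional name as the fourth column).

```python
import io
from typing import NamedTuple


class Amplicon(NamedTuple):
    """One interval from a BED file (hypothetical helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def read_bed(handle):
    """Parse BED-style lines from an open text handle into Amplicon records."""
    amplicons = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Fall back to a coordinate-based label when no name column is present.
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        amplicons.append(Amplicon(chrom, start, end, name))
    return amplicons


# Made-up example content standing in for a real interval file.
example_bed = io.StringIO(
    "chr1\t100\t200\tAMP_1\n"
    "chr1\t500\t650\tAMP_2\n"
    "chr2\t50\t120\n"
)
amplicons = read_bed(example_bed)
for amp in amplicons:
    print(amp.name, amp.end - amp.start)
```

With a real file, the `io.StringIO` object would simply be replaced by `open(path)` for whatever interval file is at hand.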

The tutorial shows how to answer these questions and what can be done in Python.
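
As one illustration of the coverage questions listed above, the sketch below computes the mean read depth per amplicon for each tissue and collects the results into a tissue-by-amplicon table. It is not taken from the tutorial. It assumes that the aligned reads have already been reduced to simple `(chrom, start, end)` intervals, which in practice would be extracted from the BAM files with a library such as pysam, and it ignores gaps within alignments. The `mean_depth` function, the sample names and all coordinates are hypothetical.

```python
def mean_depth(reads, chrom, start, end):
    """Mean per-base depth over [start, end), given (chrom, start, end) read intervals."""
    covered_bases = 0
    for r_chrom, r_start, r_end in reads:
        if r_chrom != chrom:
            continue
        # Number of bases this read shares with the amplicon (0 if disjoint).
        overlap = min(r_end, end) - max(r_start, start)
        if overlap > 0:
            covered_bases += overlap
    return covered_bases / (end - start)


# Made-up amplicon regions (0-based, end-exclusive).
amplicons = [("chr1", 100, 200), ("chr1", 500, 650)]

# Made-up aligned reads, one list per tissue (i.e. per BAM file).
samples = {
    "tissue_A": [("chr1", 90, 140), ("chr1", 120, 170), ("chr1", 600, 650)],
    "tissue_B": [("chr1", 100, 200), ("chr1", 100, 200), ("chr1", 480, 530)],
}

# One row per tissue, one column per amplicon.
coverage = {
    name: [mean_depth(reads, *amp) for amp in amplicons]
    for name, reads in samples.items()
}

for name, row in coverage.items():
    print(name, [round(depth, 2) for depth in row])
```

A table of this shape is a natural input for a heat map or for hierarchical clustering of tissues by coverage profile, which is the kind of view the questions above call for.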