Depth Coverage Calculations for raw NGS data

Image result for coverage genome

There are two type of coverage:

  • Breadth coverage ( this is related with N50 and contig numbers)   
    • FASTA files run with QUAST to get the metrics
  • Depth of coverage 
    • With a reference genome ( works with alligned files;  bam or sam files)
    • Without a reference genome ( works with FASTQ and FASTA files explained below)

DEPTH COVERAGE WITHOUT REFERENCE GENOME

To calculate the depth coverage of a genome straight from FASTQ and FASTA files without a reference genome we need to have:

  1. Average sequence length
  2. Number of reads
  3. Genome size

Average sequence length

run Fast-QC analysis and Multi-QC analysis (http://multiqc.info/)  to get the average size.

module load FastQC/0.11.6-Java-1.8.0 
fastqc -t 2 -o fastqc path/test_R1.fastq.gz path/test_R2.fastq.gz

After run is complete, transfer the zip files from remote computer to a file in desktop .
This is where I located mine /user/gizemlevent/desktop/fastq/analysis
Multi-QC (should run on computer terminal or Anaconda, I used anaconda environment with manual installment)

git clone https://github.com/ewels/MultiQC.git 
cd MultiQC 
python setup.py install
multiqc /user/gizemlevent/desktop/fastq/analysis -o /user/gizemlevent/desktop/fastq

After this run is complete, go to the location and open the multiqc_fastqc.txt and copy it on excel. Now you have the average read size from both R1 and R2. Take the average of two. This will give you your average sequence lenght.

Number of reads

For those who use Illumina Miseq and who has an account of BaseSpace  can access total number of reads for each samples in the run report. Take that number.

Alternatively, you can run echo $(zcat your.fastq.gz | wc -l) / 4 | bc  to obtain number of reads for each FASTQ files. 

Genome Size 

Run Quast analysis with fasta files on http://quast.bioinf.spbau.ru/  or use command line 

module load QUAST/5.0.2-intel-2018b-Python-3.6.6

python /sw/eb/software/QUAST/5.0.2-intel-2018b-Python-3.6.6/bin/quast.py *.fasta -o quast

Report generated by Quast will provide the total length of genome. Take that number.

Now move to excel to perform the math below:

Formula for depth coverage = ( average sequence length* number of reads)/ genome size 

Below is an example of Escherichia coli draft genome:

total length av.read lengh number of reads coverage depth
5288173 284.9929871 482,180 25.98589693
5292551 288.0475368 516,596 28.11578109