There are two type of coverage:
- Breadth coverage ( this is related with N50 and contig numbers)
- FASTA files run with QUAST to get the metrics
- Depth of coverage
- With a reference genome ( works with alligned files; bam or sam files)
- Without a reference genome ( works with FASTQ and FASTA files explained below)
DEPTH COVERAGE WITHOUT REFERENCE GENOME
To calculate the depth coverage of a genome straight from FASTQ and FASTA files without a reference genome we need to have:
- Average sequence length
- Number of reads
- Genome size
Average sequence length
run Fast-QC analysis and Multi-QC analysis (http://multiqc.info/) to get the average size.
module load FastQC/0.11.6-Java-1.8.0 fastqc -t 2 -o fastqc path/test_R1.fastq.gz path/test_R2.fastq.gz
After run is complete, transfer the zip files from remote computer to a file in desktop .
This is where I located mine /user/gizemlevent/desktop/fastq/analysis
Multi-QC (should run on computer terminal or Anaconda, I used anaconda environment with manual installment)
git clone https://github.com/ewels/MultiQC.git cd MultiQC python setup.py install multiqc /user/gizemlevent/desktop/fastq/analysis -o /user/gizemlevent/desktop/fastq
After this run is complete, go to the location and open the multiqc_fastqc.txt and copy it on excel. Now you have the average read size from both R1 and R2. Take the average of two. This will give you your average sequence lenght.
Number of reads
For those who use Illumina Miseq and who has an account of BaseSpace can access total number of reads for each samples in the run report. Take that number.
Run Quast analysis with fasta files on http://quast.bioinf.spbau.ru/
Report generated by Quast will provide the total length of genome. Take that number.
Now move to excel to perform the math below:
Formula for depth coverage = ( average sequence length* number of reads)/ genome size
Below is an example of Escherichia coli draft genome:
|total length||av.read lengh||number of reads||coverage depth|