Basics of Illumina paired end metagenomics data analysis

I had a chance to take a metagenomics course  to learn bioinformatics analysis using NGS metagenomics data. I realized it has been a while since I haven’t use what I had learnt. Again I am here to remember.

Metagenomics analysis using QIIME 2 and command-line

Steps of Illumina paired end metagenomics:

1 .DNA is isolated from some kind of sample, and PCR is performed on the DNA using universal primers targeting one or more variable regions of the 16S rRNA gene.

2.DNA sequencing of the amplicon is performed on the Illumina MiSeq device

3.Usually multiple samples are barcoded and sequenced on the same MiSeq run (i.e. multiplexing).

4.16S sequence data from the Illumina DNA sequencer is produced in the FASTQ format

5.The 16S sequence data produced is what we refer to as paired-end data, that is, the DNA fragments are sequenced from both ends

6.For each DNA fragment sequenced we have two sequences; “R1” forward read and “R2” reverse read.

Merging reverse and forward reads


OTU picking


1.After the reads are merged the next step is to convert the FASTQ file to a FASTA file

2.With the FASTA file we perform OTU picking

3.We choose a database (e.g. Greengenes) and run a script that compares each sequence from our FASTA file to the database

4.If a sequence has >97% identity with one of the species in the database then it is assigned to that OTU

5. OTU picking produces a table (.biom file extension) that is not human readable

6.We can then calculate measures of diversity and do additional downstream analyses (e.g. comparing abundance of various taxonomic groups across samples)

To analyze metagenomics data

  1. Use supercomputer command lines ( I will explain this one)
  2. Sequence Hub Applications (BaseSpace)
  • ­16S Metagenomics
  • ­QIIME preprocessing
  • ­QIIME Visulization


  • There is a huge metagenomics literature out there, and almost all of the data was analyzed with QIIME or MOTHUR, which use a similar approach.
  • QIIME was not good at telling the difference between sequencing errors and different species/strains/OTUs, and the result was a gross overestimation of diversity in each sample.
  • QIIME might tell you that there are 2,000 types whereas QIIME-2 will tell you that there are 150 types of microbes in a pair of samples
  • QIIME2 uses a program called DADA2 to remove bad sequences and count how many different types of sequences are in a sample without assigning taxonomy. ( DADA2 paper)
  • QIIME2 tells you how many different microbes are in your sample without knowing what any of them are!
  • QIIME2 uses a naïve Bayesian classifier to assign taxonomy to the sequences; the classifier is trained on GreenGenes or SILVA
  • QIIME2 attempts to give only high-confidence result


QIIME says:

“By comparing all your sequencing data to this bacterial database, I think that you have 600 different kinds of bacteria in your samples.”

QIIME-2 says:

 “I aligned and grouped all the different types of sequences in your samples, and it looks like you have 200 distinct types of sequence in the samples. If you want to know what types of bacteria those sequences are from then you need to run another analysis!”



Starting with the supercomputer

1.Create a new file on your /SCRATCH drive (e.g. qiime2)

2.Transfer your FASTQ files into this folder

3.Take subsamples of each FASTQ files (run seqtk_subsample.job)

Grep –c command 274,820 sequences (but illumina says you just need 100,000)

4.Create manifest.txt file

5. Demultiplex files and summarize (Qiime2 tools)

6.Use to view quality of sequences

7.Run multi-dada2.job

8.Create metadata.txt file

9.Create future-table of rep. sequences (Qiime2 tools) and visualize

10. Naïve Bayesian classifier Classification of taxonomies (nb_class.job) and visualization

11.Create taxa-bar plots and visualize

12.Generate heat-map

13. Coremetrics and others


Running seqtk_subsample.job

module load Seqtk/1.2-intel-2015B

# TODO Edit these variables as needed:



subsample_fraction=‘0.4’        # keep 40%

random_seed=66                 # (remember to use the same random seed to keep pairing)


seqtk sample -s$random_seed $read1 $subsample_fraction > subsample_1_${subsample_fraction}.fq &

seqtk sample -s$random_seed $read2 $subsample_fraction > subsample_2_${subsample_fraction}.fq &

#this job script needs to be submitted on supercomputer


Create manifest file


#see the example below for reads of 250_T and 252_C


#Example of the first line of a manifest file for this data set–use this template to make your own!


250_T,/scratch/user/gizemlevent /qiime2/fastaq/250_S12_L001_R2_001.fastq.gz,reverse

252_C,/scratch/user/gizemlevent /qiime2/fastaq/252_S32_L001_R1_001.fastq.gz,forward

252_C,/scratch/user/ gizemlevent /qiime2/fastaq/252_S32_L001_R2_001.fastq.gz,reverse



2.Drag and drop inject_paired-end-demux.qzv file

3.Go to interactive quality plot and determine sequence base above 20 Qscore sequence base number

Multi_dada2. Job

module load Anaconda/3-

source activate qiime2-2017.12

qiime dada2 denoise-paired –i-demultiplexed-seqs /scratch/user/gizemlevent/qiime2/paired-end-demux.qza –o-table table o-representative-sequences rep-seqs –p-trim-left-f 0 –p-trim-left-r 5 p-trunclen-f 236 –p-trunclen-r 230 –p-n-threads 20

cp /work/$LSB_JOBID.tmpdir/qiime2*log ./

#this is a job script needs to be submitted to supercomputer server not login mode!!



4 columns are necessary to have (sampleID, BarcodeSequnce, LinkerPrimer, Description) we can identify factors, treatments etc based on our experiment. The one above is an example for qiime_Cont dataset. if the FASTQ file is not demultiplexed, we can also identify the barcode sequneces here. They are case sensitive.


#metadata files helps us to create bar plots, heatmap since it separates the samples with the treatment and other factors.

Creating feature tables

qiime feature-table summarize \

i-table table.qza \

o-visualization table.qzv \


qiime feature-table tabulate-seqs metadata_Cont.txt\

–i-data rep-seqs.qza \

o-visualization rep-seqs.qzv


module load Anaconda/3-

source activate qiime2-2017.12

qiime feature-classifier classify-sklearn –i-classifier /scratch/user/gizemlevent/qiime2/gg-13-8-99-515-806-nb-classifier.qza –i-reads /scratch/user/gizemlevent/qiime2/rep-seqs.qza –o-class

qiime metadata tabulate –m-input-file /scratch/user/gizemlevent/qiime2/taxonomy.qza —o-visualization taxonomy.qzv

#to start this job, transfer the database gg-13-8-99-515-806-nb-classifier.qza to your working folder.

#you can visualize taxonomies (qza file) on Qiime-2view.



module load Anaconda/3-

source activate qiime2-2017.12

qiime taxa barplot –i-table table.qza –i-taxonomy taxonomy.qza –m-metadata-file metadata_Cont.txt –o-visualization barplot.qzv


#you can visualize taxonomy bars (qzv file) on Qiime-2view.


Creating heat-maps

qiime feature-table heatmap –i-table table.qza  –m-metadata-file metadata_Cont.txt –m-metadata-category Treatment1 –output-dir heatmap



Core metrics (Bray Curtis AND Jaccard)

qiime diversity core-metrics \

–i-table table.qza \

–m-metadata-file metadata_Cont.txt \

–p-sampling-depth 223000 \

–output-dir core-metrics-results

(REMOVE \ and spacing if the above does not work–works without back slashes and spacing!!!)


qiime diversity core-metrics –i-table table.qza –m-metadata-file metadata_Cont.txt –p-sampling-depth 223000 –output-dir core-metrics-results


Alpha diversity

module load Anaconda/3-

source activate qiime2-2017.12

qiime diversity alpha-rarefaction –i-table table.qza –p-max-depth 400000 –p-min-depth 250000 –m-metadata-file bella_2_8_18_metadata.txt –output-dir alpha_r_curves

qiime diversity alpha-rarefaction –i-table table.qza –p-max-depth 100000 –p-min-depth 1000 –m-metadata-file bella_2_8_18_metadata.txt –output-dir low_minmax_alpha_r_curves

qiime diversity alpha-phylogenetic –i-table table.qza –i-phylogeny rooted-tree.qza –p-metric faith_pd –output-dir alpha_phylo

Leave a Reply