Masking bacteriophage regions for phylogenetic tree analysis

Last week my challenge was masking phage regions in my reference strains so that I could conduct phylogenetic analysis accurately. I will show the steps to find phage regions using an online tool and then mask those regions in my references from the command line.

I used the PHASTER online tool (http://phaster.ca/) to determine the phage regions in my reference FASTA files. PHASTER gives you the sequence ranges that belong to phage regions. I also used bedtools (https://bedtools.readthedocs.io/en/latest/) to mask those regions.

Bedtools works with a BED file. A BED file is a tab-delimited text file.

  1. Open a new Excel sheet.
  2. The first column needs to hold the ID, which must match your FASTA file header.
  3. The second column needs to hold the start of the phage region and the third column the end. Here is an example of the txt file, saved as tab-delimited text from Excel:
CP007216.1	336056	354207
CP007216.1	617727	649609
CP007216.1	1804531	1821407
CP007216.1	2633977	2658272
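
If you would rather skip Excel, the same file can be written straight from the command line. This is only a sketch using the four regions from my example. The last command is optional and rests on an assumption you should check against your own PHASTER output: that the reported positions are 1-based and inclusive. BED start positions are 0-based, so in that case the start needs to be shifted down by one. The file name phage_regions_0based.bed is made up for this example.

```shell
# Write the four example regions as a tab-separated BED file.
# printf reuses the format string for every group of three arguments.
printf '%s\t%s\t%s\n' \
  CP007216.1 336056 354207 \
  CP007216.1 617727 649609 \
  CP007216.1 1804531 1821407 \
  CP007216.1 2633977 2658272 > phage_regions.bed

# Sanity check: report any line that does not have exactly three tab-separated fields.
awk -F'\t' 'NF != 3 { print "bad line " NR ": " $0 }' phage_regions.bed

# Optional (assumption: the coordinates are 1-based and inclusive):
# shift each start down by one so the first base of every region is covered too.
awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 - 1; print }' \
  phage_regions.bed > phage_regions_0based.bed
```

If the sanity check prints nothing, every line has the three columns bedtools expects.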

4. After this step, change the file extension from .txt to .bed.

5. Change the header of your FASTA file to >CP007216.1 as well, so that it matches the first column of the BED file.
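
Here is a quick way to check and rewrite the header without opening the file in an editor. It is a sketch on a tiny made-up FASTA (toy_reference.fasta, with an invented description after the ID) so that it can be run as is. It assumes the file holds a single record, so the only header is on line 1.

```shell
# Toy single-record FASTA, invented for this example only.
printf '>CP007216.1 some strain description\nACGTACGTAC\nGGTTAACCGG\n' > toy_reference.fasta

# List the header lines so you can compare them with column 1 of the BED file.
grep '^>' toy_reference.fasta

# Rewrite the header on line 1 to the bare ID (written without a space after ">").
sed '1s/.*/>CP007216.1/' toy_reference.fasta > toy_reference.renamed.fasta

# Show the new header.
head -n 1 toy_reference.renamed.fasta
```

Writing to a new file instead of editing in place keeps the original reference untouched in case the header was not what you expected.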

6. After this step I ran into an error. It was caused by the difference between DOS (Windows) and UNIX line endings. Convert both files to UNIX format using dos2unix reference.fasta phage_regions.bed
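
If you are not sure whether a file has DOS line endings, you can look for carriage returns before converting anything. The sketch below uses a small made-up file (toy_crlf.bed) and strips the carriage returns with tr, which is a plain-shell alternative to dos2unix and not what I originally ran.

```shell
# Toy BED file with Windows (CRLF) line endings, invented for this example.
printf 'CP007216.1\t336056\t354207\r\nCP007216.1\t617727\t649609\r\n' > toy_crlf.bed

# Count the lines that contain a carriage return; anything above 0 means DOS line endings.
cr=$(printf '\r')
grep -c "$cr" toy_crlf.bed

# Strip the carriage returns into a new file.
tr -d '\r' < toy_crlf.bed > toy_unix.bed

# Count again; this should now report 0 (grep exits non-zero when nothing matches).
grep -c "$cr" toy_unix.bed || true
```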

7. Run this command (in current bedtools releases the same tool is also available as bedtools maskfasta):

maskFastaFromBed -fi reference.fasta -bed phage_regions.bed -fo reference_phagecleaned.fasta

8. After this step you will get a new reference with the phage regions masked with “N”.

9. To make sure that your masking was successful, check your new file with grep:

grep -o 'N' reference_phagecleaned.fasta

If you see Ns listed on your screen you are good to go!
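
For a stricter check than eyeballing the screen, you can count the Ns and compare that number with the total length of the regions in the BED file. This is a sketch on made-up toy files (toy_masked.fasta and toy_regions.bed) so that it runs as is. It assumes the BED intervals do not overlap and that the original reference had no Ns of its own.

```shell
# Toy inputs, invented for this example: a masked FASTA and the BED file used to mask it.
printf '>toy\nACGTNNNNAC\nGGNNNACCGG\n' > toy_masked.fasta
printf 'toy\t4\t8\ntoy\t12\t15\n' > toy_regions.bed

# Number of N characters in the sequence lines (header lines are skipped).
masked=$(grep -v '^>' toy_masked.fasta | tr -cd 'N' | wc -c | tr -d ' ')

# Total number of bases covered by the BED intervals
# (end minus start, because BED intervals exclude the end position).
expected=$(awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }' toy_regions.bed)

echo "masked bases: $masked, expected from BED: $expected"
[ "$masked" -eq "$expected" ] && echo "counts match"
```

On your own data, swap in reference_phagecleaned.fasta and phage_regions.bed. If the two numbers differ, overlapping regions or Ns already present in the reference are the first things to look at.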

Yaay for problem solving and learning new tools!

 
