Last week my challenge was masking the phage regions in my reference strains so that I could run an accurate phylogenetic analysis. I will show the steps to find phage regions with an online tool and then mask those regions in my references from the command line.
I used the PHASTER online tool (http://phaster.ca/) to find the phage regions in my reference FASTA files. PHASTER gives you the ranges of sequence that belong to phage regions. I also used bedtools (https://bedtools.readthedocs.io/en/latest/) to mask those regions.
Bedtools works with a BED file. A BED file is a tab-delimited text file.
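For reference, a BED file can also be written straight from the shell instead of being exported from a spreadsheet. The sketch below is only an illustration: the coordinates are made-up placeholders, not real PHASTER output, and the loop is just one way to do it. One thing to keep in mind is that BED starts are 0-based and BED ends are exclusive, so if your regions are reported as 1-based inclusive positions (which I am assuming is the case for PHASTER), the start needs 1 subtracted from it.

```shell
# Hypothetical phage regions written as start-end (1-based, inclusive).
# These numbers are placeholders, not real PHASTER results.
seq_id="CP007216.1"
regions="1001-2000 5001-5500"

# Start with an empty BED file, then append one tab-separated line per region.
: > phage_regions.bed
for r in $regions; do
  start=${r%-*}   # text before the dash
  end=${r#*-}     # text after the dash
  # BED start is 0-based, so subtract 1; the end stays as it is.
  printf '%s\t%s\t%s\n' "$seq_id" "$((start - 1))" "$end" >> phage_regions.bed
done

cat phage_regions.bed
```

A file written this way already has UNIX line endings and the .bed extension, so it should not need renaming or converting afterwards.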
1. Open a new Excel sheet.
2. The first column needs to have the ID, which must match your FASTA file header.
3. The second column needs to have the start of the phage region and the third column needs to have the end. Here is an example of the txt file, saved as tab-delimited from Excel:
4. After this step, change the file extension from .txt to .bed.
5. Change the header of your FASTA file to >CP007216.1 as well (no space after the >), so that it matches the ID in the first column of the BED file.
6. At this point I ran into an error. It was caused by the difference between DOS and UNIX file formats. Convert both files to UNIX format with dos2unix reference.fasta phage_regions.bed
7. Run this command:
maskFastaFromBed -fi reference.fasta -bed phage_regions.bed -fo reference_phagecleaned.fasta
8. After this step you will get a new reference with phage regions masked with “N”
9. To make sure that the masking was successful, check your new file with grep:
grep -o 'N' reference_phagecleaned.fasta
If you see Ns listed on your screen you are good to go!
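If you want a slightly stricter check than eyeballing the grep output, you can compare the number of Ns in the masked file with the total length of the intervals in the BED file. This is a rough sketch rather than part of my original workflow: the two helper functions (count_masked, bed_total) and the toy files are made up for illustration, and the numbers only line up if the original sequence had no Ns of its own and the BED intervals do not overlap.

```shell
# Count N characters in the sequence lines only (header lines start with ">").
count_masked() {
  grep -v '^>' "$1" | tr -cd 'N' | wc -c | tr -d ' '
}

# Sum the lengths of all BED intervals (end minus start, because the end is exclusive).
bed_total() {
  awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }' "$1"
}

# Toy demonstration with made-up files; swap in your own paths.
tmp=$(mktemp -d)
printf '>toy\nACGTNNNNAC\nGTNNACGTAC\n' > "$tmp/masked.fasta"
printf 'toy\t4\t8\ntoy\t12\t14\n' > "$tmp/regions.bed"

masked=$(count_masked "$tmp/masked.fasta")
expected=$(bed_total "$tmp/regions.bed")
echo "masked bases: $masked, bases covered by BED: $expected"
```

On real data this would be count_masked reference_phagecleaned.fasta and bed_total phage_regions.bed. If the two numbers match, every base listed in the BED file was masked.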
Yaay for problem solving and learning new tools!