How to trim small contigs from a FASTA file

Last week, I wanted to deposit FASTA files at NCBI. Every one of my files was rejected with an error about contigs shorter than 200 bp. Apparently every contig needs to be at least 200 bp long for the submission to go through. I was depositing around 400 FASTA files, so removing those contigs by hand was not a smart move. So I started googling things like:

  • how to delete small size contigs from fasta – Google Search
  • remove small contigs from fasta – Google Search

I decided to go with the command line. I am sure grep and sed could do the job, but I am not an expert with them yet (going to be at some point, though), so I needed to find a script that other people had shared.

My challenge was to find a Perl or Python script that would handle the problem. This website seems promising for a Perl script “” to “Filter the Sequence by Their Length”. In particular, the “” Perl script can measure the length of each sequence in bp and filter on it. Very promising.

While I was trying to get this script to run, I received an e-mail from our HPRC contact (Michael is the best; he helps us all the time with our bioinformatics questions) saying he had shared a Python script with me called “”. Since it is not my work, I can only share it upon request. This is the command that needs to be run for each FASTA file:

Path/  name.fasta 200 2000000000000 new_name.fasta

Here we specify the filter parameters: a minimum of 200 bp and a maximum of 2000000000000 bp, which is effectively no upper limit. That worked perfectly, and I was able to deposit all my FASTA files.
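For anyone who cannot get hold of a ready-made script, here is a minimal sketch of the same idea in plain Python. It is not the script I was given; the function names (read_fasta, filter_fasta, filter_folder), the 60-character line width, the "*.fasta" pattern, and the "_200" output suffix are all my own assumptions for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_path, out_path, min_len=200, max_len=None, width=60):
    """Keep records with min_len <= length <= max_len; return (kept, dropped)."""
    kept = dropped = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            too_short = len(seq) < min_len
            too_long = max_len is not None and len(seq) > max_len
            if too_short or too_long:
                dropped += 1
                continue
            out.write(header + "\n")
            # Re-wrap the sequence at a fixed width (assumed: 60 characters).
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
            kept += 1
    return kept, dropped


def filter_folder(folder, suffix="_200", min_len=200):
    """Filter every *.fasta file in a folder into <name><suffix>.fasta."""
    results = {}
    for fasta in sorted(Path(folder).glob("*.fasta")):
        if fasta.stem.endswith(suffix):
            continue  # skip outputs left over from an earlier run
        out_path = fasta.with_name(fasta.stem + suffix + fasta.suffix)
        results[fasta.name] = filter_fasta(fasta, out_path, min_len=min_len)
    return results
```

Calling filter_folder("path/to/fasta_folder") would go through all the files in one pass and return how many contigs were kept and dropped per file, which is handy for double-checking 400 files before uploading.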

P.S. Please remember to update the file names in your metadata on the NCBI submission portal to match the new names!

Or change the names of your files instead, here by removing “_200” from each name:

for filename in ./*; do mv "./$filename" "./$(echo "$filename" | sed -e 's/_200//g')"; done
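If you would rather preview the renames before touching anything, the same clean-up can be sketched in Python. The function name strip_marker_from_names and the dry_run switch are hypothetical, and unlike the mv loop this version refuses to overwrite a file that already has the target name.

```python
from pathlib import Path


def strip_marker_from_names(folder, marker="_200", dry_run=True):
    """Remove `marker` from file names in `folder`; return (old, new) pairs."""
    renames = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or marker not in path.name:
            continue
        target = path.with_name(path.name.replace(marker, ""))
        if target.exists():
            # Assumption: skipping is safer than silently overwriting.
            continue
        renames.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return renames
```

Run it once with the default dry_run=True to see the planned renames, then again with dry_run=False to apply them.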
