Renaming subfiles with parent folder name

Finally, I figured out how to rename the contigs.fasta files generated by SPAdes to parent_folder_name_contigs.fasta:

find . -type f -name "*contigs.fasta" -printf "/%P\n" | while read FILE ; do DIR=$(dirname "$FILE" ); mv ."$FILE" ."$DIR""$DIR"_contigs.fasta;done
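The one-liner above assumes every contigs.fasta sits exactly one folder below the current directory. Here is a minimal sketch of the same idea as a small function that takes the parent folder name with basename, so it should also cope with deeper nesting. The function name rename_contigs is my own, and it assumes the files are named exactly contigs.fasta.

```shell
# Rename every contigs.fasta under a root folder to <parent>_contigs.fasta,
# keeping each file inside its own folder.
# Usage: rename_contigs [root_folder]   (defaults to the current directory)
rename_contigs() {
    root="${1:-.}"
    find "$root" -type f -name 'contigs.fasta' | while IFS= read -r file; do
        dir=$(dirname "$file")        # folder that holds the file
        parent=$(basename "$dir")     # name of that folder only
        mv "$file" "$dir/${parent}_contigs.fasta"
    done
}
```

For example, rename_contigs . would turn ./sampleA/contigs.fasta into ./sampleA/sampleA_contigs.fasta.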

Intrinsic and acquired antibiotic resistance


Intrinsic resistance is inherent (genetically determined) insensitivity to a drug. Intrinsic mechanisms are those specified by naturally occurring genes found on the host chromosome, such as the beta-lactamases of gram-negative bacteria and many multidrug resistance (MDR) efflux systems.

Examples: gram-negative bacteria are impermeable to glycopeptides because of their outer membrane; microorganisms without a cell wall are intrinsically resistant to penicillin; Klebsiella produces a cephalosporinase; and Pseudomonas aeruginosa is resistant to tetracycline, chloramphenicol, and quinolones because of chromosomally encoded active exporters.


Spontaneous Mutation

  • Occurs spontaneously in any bacterial gene
    • Mutation is a rare event (10^-9 to 10^-8) compared with gene transfer (10^-5 to 10^-4)
  • The frequency of mutation differs between genes
  • Point mutations arise at a rate of around 10^-9 to 10^-10 per base pair per generation. Although this is extremely low, the large overall population size during infection and the short generation time provide the opportunity for diversity to arise during a single infection.
  • For example, mutations leading to streptomycin or nalidixic acid resistance are relatively common (10^-8 to 10^-10 per cell per generation), whereas mutations leading to vancomycin or polymyxin B resistance are very rare
  • Streptomycin resistance requires a single mutation, whereas fluoroquinolone resistance develops gradually because several different mutations are required.
  • The mutation frequency depends on the antibiotic, the bacterial species, the time of drug use (slow killing ability of the drug), stress, fitness cost, and other genetic factors.
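To get a feel for why such low rates still matter, here is a rough back-of-the-envelope sketch: expected new point mutations per generation = rate per base pair x genome size x number of cells. The genome size (4.8 Mb) and the population size (10^9 cells) are hypothetical values I picked for illustration, not measurements, and the function name is my own.

```shell
# Expected number of new point mutations per generation in a population.
# Usage: expected_mutations RATE_PER_BP GENOME_BP CELLS
expected_mutations() {
    awk -v rate="$1" -v genome="$2" -v cells="$3" \
        'BEGIN { printf "%.0f\n", rate * genome * cells }'
}

# Hypothetical inputs: 4.8 Mb genome, 10^9 cells in the infection
expected_mutations 1e-10 4.8e6 1e9   # prints 480000
expected_mutations 1e-9  4.8e6 1e9   # prints 4800000
```

So even at the lower rate, under these assumed numbers, hundreds of thousands of new point mutations would appear somewhere in the population every generation.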

Types of mutation:

  1. Point mutation
    1. Silent, missense, nonsense
  2. Deletion
  3. Insertion (frame-shift)
  4. Duplication

Transition: a purine (A or G) is replaced by the other purine, or a pyrimidine (C or T) is replaced by the other pyrimidine.

Transversion: a purine (A or G) is replaced by a pyrimidine (C or T), or vice versa.
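The two definitions can be written down as a tiny lookup. This is just a sketch to make the rule concrete; the function name classify_substitution is my own.

```shell
# Classify a single-base substitution as a transition or a transversion.
# Usage: classify_substitution REF ALT   (bases A, C, G or T, any case)
classify_substitution() {
    ref=$(printf '%s' "$1" | tr 'acgt' 'ACGT')
    alt=$(printf '%s' "$2" | tr 'acgt' 'ACGT')
    case "$ref$alt" in
        # purine <-> purine, or pyrimidine <-> pyrimidine
        AG|GA|CT|TC) echo "transition" ;;
        # purine <-> pyrimidine
        AC|CA|AT|TA|GC|CG|GT|TG) echo "transversion" ;;
        *) echo "not a substitution between A, C, G and T" >&2; return 1 ;;
    esac
}

classify_substitution A G   # prints: transition
classify_substitution C A   # prints: transversion
```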

Factors that increase the mutation rate of antibiotic resistance genes (figure: mu rate.png)

Spontaneous mutations may cause resistance by:

  • Altering the target protein to which the antibacterial agent binds by modifying or eliminating the binding site.
  • Upregulating the production of enzymes that inactivate the antimicrobial agent.
  • Downregulating or altering an outer membrane protein channel that the drug requires for cell entry.
  • Upregulating pumps that expel the drug from the cell.
  • The mutation rate of Salmonella enterica serotype Typhimurium for rifampin resistance increases under starvation conditions
  • These observations indicate that bacterial growth conditions have a dramatic effect on the mutation rate. Analysis of several model systems has demonstrated that stress-enhanced bacterial mutation is a regulated phenomenon
  • This increased mutation rate is typically conferred by alterations in the genes that constitute the mismatch repair (MMR) system (mutS, mutL, mutH, mutT, mutY, mutM, and uvrD)
    • Mutations in the MMR system also increase the prevalence of genetic recombination, providing diversity to antibiotic resistance mechanisms
  • Spontaneous chromosomal mutations are quite rare (one in a population of 10^6–10^8 microorganisms) and commonly determine resistance to structurally related compounds
  • Resistance to quinolones in E. coli is caused by changes in at least seven amino acids encoded by the gyrA gene or three amino acids encoded by the parC gene [1], whereas a single point mutation in the rpoB gene is associated with complete resistance to rifampin

The section above is adapted from a paper that I was very happy to run into!! I had one of those “aha” moments.


Horizontal gene transfer (HGT)

HGT may occur between strains of the same species or between different bacterial species or genera. Mechanisms of genetic exchange include conjugation, transduction, and transformation. HGT is the primary mechanism for evolution in prokaryotes and is synergised by complex networks of transfer involving the microbiome.

The dominance of HGT on Salmonella evolution is apparent from the observation that S. Typhi and S. Typhimurium share an average nucleotide identity of around 99%, yet around 15% of their genes are serovar specific. Genes affected include prophage, pathogenicity islands, ICEs, transposons, IS elements, and plasmids.


Conjugation

  • The most important way to spread bacterial resistance genes
  • Transfer between donor and recipient using conjugative genetic elements (transposons or plasmids) requires direct cell-to-cell contact (conjugation bridge) using the F pilus.
  • The process allows for the passage of more than one functional gene at a time, so that multiple resistance could occur within a single step.
  • Conjugation is thus an important and highly efficient process for transferring genes, and the acquisition of resistance by most pathogens is probably a result of this process.
  • Resistance is believed to be mediated either by resistance plasmids (R plasmids) or transposable elements (transposons)
  • F (fertility) factor: donor is F+ and recipient is F-
    • If an F+ cell divides and one daughter cell does not inherit the F factor, that daughter cell is killed
  • The double strand opens at oriT and becomes single-stranded, and the DNA is transferred (type IV secretion apparatus)
  • IS elements (insertion sequences): homologous with parts of the chromosome. OriT is one of them
  • Conjugation is the dominant mechanism, used by ICEs and plasmids, and likely also plays an important role in the transfer of other mobile genetic elements (MGEs), such as transposons.
  • Antibiotic-resistance genes are present on the chromosome or on plasmids in Salmonella, often associated with MGEs such as composite transposons or ICEs



Transduction

  • Does not require viability of the donor cell
  • DNA is transferred from a donor to a recipient by way of a bacteriophage.
  • Hershey and Chase (1952): the labelled protein coat remained outside of the cell
  • It is still unknown whether this process causes clinically observed resistance to antibiotics. Because this process is highly dependent on specific phages, it may occur only within certain bacterial species.
  • Only the limited amount of DNA that can be packed into the head of a phage can normally be transferred. Transduction therefore cannot be responsible for multiple drug resistance.


Transformation

  • Does not require viability of the donor cell
  • Involves the passage of DNA to a recipient through a specific medium.
  • Transfer of genetic material this way is mostly observed in vitro in the laboratory. It almost never occurs in nature.
  • One strand is degraded and the single strand enters the cell via an uptake system; the DNA aligns with the homologous part of the chromosome, and the heteroduplex DNA is repaired.
  • There are naturally transformable bacteria and chemically transformable bacteria (which is why we do not see our microorganisms of interest transforming in nature)
  • Limited transfer between species
  • Uptake of naked DNA; mediates exchange of any part of the DNA

Genetic materials (to be cont..)




IS (Insertion Sequences) 

Genomic Islands 

Antibiotic resistance mechanisms

Enzymatic drug interactions

Enzymes modify the active nucleus of the drug by cleaving the molecule or adding a chemical group (e.g., β-lactams, aminoglycosides, chloramphenicol).

Drug-inactivating enzymes are usually associated with mobile genetic elements; β-lactamases and aminoglycoside-modifying enzymes are the most prevalent.

β-lactamases hydrolyze the β-lactam ring, so the drug can no longer bind the active-site serine of the penicillin-binding proteins that build the peptidoglycan layer (four classes, A–D).

Aminoglycoside-modifying enzymes catalyze the transfer of a chemical group (for example an acetyl group) to the drug, so that it binds poorly to the ribosome.

Chloramphenicol is inactivated by chloramphenicol acetyltransferase.

Target modification

(penicillin, glycopeptides, MLS, quinolones)

  1. Mutation in penicillin-binding proteins on the bacterial cell membrane
  2. Replacement of alanine with lactate in peptidoglycan, changing the peptide component from D-alanine-D-alanine to D-alanine-D-lactate (resistance to vancomycin)
  3. Mutation in catalase-peroxidase (tuberculosis)
  4. Mutation in DNA gyrase (quinolone resistance, chromosomal mutation)
  5. Mutation in RNA polymerase: chromosomal mutation in the gene for the β subunit of bacterial RNA polymerase (resistance to rifampin)

This mechanism is particularly important for penicillins, glycopeptides and MLS in gram-positive bacteria.

MRSA (methicillin-resistant Staphylococcus aureus, gram-positive): mecA encodes a new PBP (penicillin-binding protein).

Glycopeptide-resistant Enterococcus (vancomycin-resistant Enterococcus) is in this class (vanA and vanB change the peptidoglycan ending).

Efflux pumps

(tetracycline, chloramphenicols, nalidixic acid, novobiocin)

  • The most relevant bacterial efflux system families involved in antibiotic resistance are: ABC, MFS, MATE, RND and SMR (Li et al., 2015).

Grouped by sequence similarity:

  1. Major facilitator (MF) family (EmrAB and MdfA)
  2. RND family (AcrAB, AcrD, AcrEF, MdtABC and MdsAB)
  3. Multidrug and toxic compound extrusion (MATE) family (MdtK)
  4. ATP-binding cassette (ABC) family

Grouped by specificity:

  • Drug-specific efflux pumps (tetracycline in gram-negative bacteria, MLS or phenicols)
  • Multidrug resistance efflux pumps

Or grouped by energy source:

  • ATP-binding cassette (ABC) transporters: use ATP as the energy source; MLS
  • Secondary drug transporters: use proton and sodium ions; multidrug resistance

Or grouped by size:

  • MFS (major facilitator superfamily)
  • SMR (small multidrug resistance)
  • RND (resistance-nodulation-cell division)




Reduction of permeability

(ompC, ompF mutations)

Mutations that lead to the loss, reduced size, or decreased expression of porins can reduce drug uptake. ompF and ompC are the genes that most commonly mutate and alter porins. The combined presence of blaCMY-2 and a lack of OmpF and OmpC has previously been associated with resistance to imipenem (a carbapenem) in E. coli.

This mechanism is most common for β-lactams and fluoroquinolones in gram-negative bacteria.

  • Compared with gram-positive species, gram-negative bacteria are intrinsically less permeable to many antibiotics because their outer membrane forms a permeability barrier
  • One mechanism that reduces drug permeability in bacteria is the cell wall's lipopolysaccharide (LPS), which consists of lipid A, a polysaccharide core, and the O-antigen. Gram-negative bacteria that harbor LPS moieties, such as strains of Pseudomonas aeruginosa and Salmonella enterica, show resistance to erythromycin, roxithromycin, clarithromycin and azithromycin; all of these are serious pathogens, especially in immunocompromised patients.

Mutations in porin proteins commonly cause penicillin and aminoglycoside resistance.


Target protection

Mostly seen with tetracycline (TetM and TetO), fluoroquinolones (Qnr) and fusidic acid (FusB and FusC).

Alternative pathways

Chromosomal mutations increase the number of PABA receptors (folic acid synthesis), which gives resistance to sulfonamides.



Mode of action of antibiotics


  1. Cell wall
  2. Protein synthesis (70S)
  3. DNA/RNA synthesis
  4. Other targets


a. Cytoplasmic  phase

Fosfomycin: PEP analog that inhibits MurA, blocks muramyl-pentapeptide synthesis

b. Membrane associated phase

Bacitracin– blocks lipid carrier

c. Extra-cytoplasmic phase:

β-lactams (penicillins, cephalosporins, carbapenems, monobactams): block transpeptidase (penicillin-binding protein), which affects cross-linkage

Glycopeptides (vancomycin, avoparcin): bind the D-Ala-D-Ala termini, which prevents transpeptidation and cross-linkage

Polymyxin (colistin): disrupts the outer membrane, increases permeability


a. 50S (31 protein+ 23S RNA+ 5S RNA)

Macrolides, linezolid, streptogramins (MLS), chloramphenicol, clindamycin: block peptide bond formation.

b. 30S (21 protein + 16S RNA)

Tetracycline: blocks tRNA binding

Aminoglycosides: impair proofreading



Quinolones: a. DNA gyrase (topoisomerase II), acting first, at the replication fork, in gram-negatives; GyrA* and GyrB / b. Topoisomerase IV, acting later, at the cleaving stage, in gram-positives; ParC* and ParE. (* where mutations usually occur)

Rifamycin: inhibits transcription (DNA -> mRNA)


Trimethoprim: a folic acid inhibitor. It binds to dihydrofolate reductase and inhibits the synthesis of tetrahydrofolic acid, which plays an essential role in thymidine synthesis, and consequently inhibits DNA synthesis.

Sulfonamides: compete for the enzyme dihydropteroate synthetase, which blocks PABA from joining the folic acid molecule and so prevents the synthesis of folic acid, which is essential for bacteria.


Note to myself: I will improve this section later.


Masking bacteriophage regions for phylogenetic tree analysis

Last week my challenge was masking the phage regions in my reference strains so that I could run the phylogenetic analysis accurately. I will show the steps to find phage regions using an online tool and to mask those regions in my references from the command line.

I used the PHASTER online tool to determine the phage regions in my reference FASTA files. PHASTER gives you the ranges of the sequence that belong to phage regions. I also used bedtools to mask those regions.

Bedtools works with a BED file. A BED file is a tab-delimited text file.

  1. Open a new Excel sheet
  2. The first column needs to hold the ID, which must match your FASTA file header.
  3. The second column needs to hold the start of the phage region and the third column the end. Here is an example of a txt file saved as tab-delimited from Excel:
CP007216.1 336056 354207
CP007216.1 617727 649609
CP007216.1 1804531 1821407
CP007216.1 2633977 2658272

4. After this step, change the file extension from .txt to .bed

5. Change the header of your FASTA file to >CP007216.1 as well.

6. After this step I ran into an error. It was related to the difference between DOS and UNIX line endings. Convert both files to UNIX format using dos2unix reference.fasta phage_regions.bed

7. Run this command:

maskFastaFromBed -fi reference.fasta -bed phage_regions.bed -fo 


8. After this step you will get a new reference with the phage regions masked with “N”

9. To make sure that your masking was successful, check your new file with grep

grep -o 'N' reference_phagecleaned.fasta

If you see Ns listed on your screen you are good to go!
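If you would rather skip Excel, the BED file can also be written from the command line. The sketch below makes two assumptions of mine: the region list is a plain text file with "ID start end" on each line, and the coordinates are 1-based and inclusive. BED itself uses a 0-based start with an exclusive end, so the start is shifted by one. The second function counts the masked bases instead of printing every N. Both function names and the file name phage_regions.txt are hypothetical.

```shell
# Write tab-delimited BED lines from 1-based, inclusive regions.
# Input lines: <sequence_id> <start> <end>, separated by spaces or tabs.
regions_to_bed() {
    awk 'NF >= 3 { printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }' "$1"
}

# Count masked (N or n) bases in a FASTA file, ignoring header lines.
count_masked_bases() {
    grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}

# Example usage:
#   regions_to_bed phage_regions.txt > phage_regions.bed
#   count_masked_bases reference_phagecleaned.fasta
```

If the masking worked, the count should equal the total length of the phage regions, provided the reference had no Ns to begin with.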

Yaay for problem solving and learning new tools!






Basics of Illumina paired end metagenomics data analysis

I had a chance to take a metagenomics course to learn bioinformatics analysis of NGS metagenomics data. I realized it has been a while since I last used what I learnt, so here I am again to refresh my memory.

Metagenomics analysis using QIIME 2 and command-line

Steps of Illumina paired end metagenomics:

1. DNA is isolated from some kind of sample, and PCR is performed on the DNA using universal primers targeting one or more variable regions of the 16S rRNA gene.

2. DNA sequencing of the amplicon is performed on the Illumina MiSeq device

3. Usually multiple samples are barcoded and sequenced on the same MiSeq run (i.e. multiplexing).

4. 16S sequence data from the Illumina DNA sequencer is produced in the FASTQ format

5. The 16S sequence data produced is what we refer to as paired-end data, that is, the DNA fragments are sequenced from both ends

6. For each DNA fragment sequenced we have two sequences: the “R1” forward read and the “R2” reverse read.

Merging reverse and forward reads


OTU picking


1. After the reads are merged, the next step is to convert the FASTQ file to a FASTA file

2. With the FASTA file we perform OTU picking

3. We choose a database (e.g. Greengenes) and run a script that compares each sequence from our FASTA file to the database

4. If a sequence has >97% identity with one of the species in the database, it is assigned to that OTU

5. OTU picking produces a table (.biom file extension) that is not human readable

6. We can then calculate measures of diversity and do additional downstream analyses (e.g. comparing abundance of various taxonomic groups across samples)

To analyze metagenomics data

  1. Use the supercomputer command line (I will explain this one)
  2. Sequence Hub Applications (BaseSpace)
  • 16S Metagenomics
  • QIIME preprocessing
  • QIIME Visualization


  • There is a huge metagenomics literature out there, and almost all of the data was analyzed with QIIME or MOTHUR, which use a similar approach.
  • QIIME was not good at telling the difference between sequencing errors and different species/strains/OTUs, and the result was a gross overestimation of diversity in each sample.
  • QIIME might tell you that there are 2,000 types, whereas QIIME2 will tell you that there are 150 types of microbes in a pair of samples
  • QIIME2 uses a program called DADA2 to remove bad sequences and count how many different types of sequences are in a sample without assigning taxonomy. (DADA2 paper)
  • QIIME2 tells you how many different microbes are in your sample without knowing what any of them are!
  • QIIME2 uses a naïve Bayesian classifier to assign taxonomy to the sequences; the classifier is trained on GreenGenes or SILVA
  • QIIME2 attempts to give only high-confidence results


QIIME says:

“By comparing all your sequencing data to this bacterial database, I think that you have 600 different kinds of bacteria in your samples.”

QIIME-2 says:

 “I aligned and grouped all the different types of sequences in your samples, and it looks like you have 200 distinct types of sequence in the samples. If you want to know what types of bacteria those sequences are from then you need to run another analysis!”



Starting with the supercomputer

1. Create a new folder on your /SCRATCH drive (e.g. qiime2)

2. Transfer your FASTQ files into this folder

3. Take subsamples of each FASTQ file (run seqtk_subsample.job)

The grep -c command counted 274,820 sequences (but Illumina says you only need 100,000)

4. Create the manifest.txt file

5. Demultiplex the files and summarize (Qiime2 tools)

6. Use to view quality of sequences

7. Run multi-dada2.job

8. Create the metadata.txt file

9. Create the feature table of rep. sequences (Qiime2 tools) and visualize

10. Naïve Bayesian classifier: classification of taxonomies (nb_class.job) and visualization

11. Create taxa bar plots and visualize

12. Generate the heat-map

13. Core metrics and others
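For the read count in step 3, here is a small sketch that counts reads as lines divided by four. It assumes a standard four-line-per-record FASTQ file. Counting this way avoids a pitfall of grep -c with a pattern like '^@': quality lines can also start with "@", so the count can come out too high. The function name is my own.

```shell
# Count the reads in a gzipped FASTQ file (4 lines per read assumed).
# Usage: count_fastq_reads sample_R1.fastq.gz
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

Running it on the R1 and R2 files of the same sample should give the same number; if it does not, the pairing is broken.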


Running seqtk_subsample.job

module load Seqtk/1.2-intel-2015B

# TODO Edit these variables as needed:
# ($read1 and $read2 must be set to your forward and reverse FASTQ files)

subsample_fraction='0.4'        # keep 40%

random_seed=66                 # (remember to use the same random seed to keep pairing)


seqtk sample -s$random_seed $read1 $subsample_fraction > subsample_1_${subsample_fraction}.fq &

seqtk sample -s$random_seed $read2 $subsample_fraction > subsample_2_${subsample_fraction}.fq &

wait    # let both background jobs finish before the job script exits

# this job script needs to be submitted on the supercomputer


Create manifest file


# see the example below for the reads of 250_T and 252_C


# Example of the first lines of a manifest file for this data set; use this template to make your own!


250_T,/scratch/user/gizemlevent/qiime2/fastaq/250_S12_L001_R2_001.fastq.gz,reverse

252_C,/scratch/user/gizemlevent/qiime2/fastaq/252_S32_L001_R1_001.fastq.gz,forward

252_C,/scratch/user/gizemlevent/qiime2/fastaq/252_S32_L001_R2_001.fastq.gz,reverse
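With hundreds of samples, typing the manifest by hand gets old quickly. Below is a rough sketch that builds it from a folder of paired-end files. It assumes Illumina-style names (*_R1_001.fastq.gz and *_R2_001.fastq.gz) and takes the sample ID as everything before the first "_S<number>" part of the file name, so for the files above it would give 250 and 252 rather than 250_T and 252_C; edit the IDs afterwards if yours carry extra labels. The header line is the one the QIIME 2 comma-separated manifest format expects. The function name is my own.

```shell
# Print a paired-end manifest (CSV) for all R1/R2 files in a folder.
# Usage: make_manifest /path/to/fastq_folder > manifest.txt
make_manifest() {
    fastq_dir=$(cd "$1" && pwd)      # the manifest needs absolute paths
    echo "sample-id,absolute-filepath,direction"
    for f in "$fastq_dir"/*_R[12]_001.fastq.gz; do
        [ -e "$f" ] || continue      # skip if the folder has no matches
        name=$(basename "$f")
        sample=${name%%_S[0-9]*}     # assumed: ID is the part before _S<number>
        case "$name" in
            *_R1_001.fastq.gz) direction=forward ;;
            *_R2_001.fastq.gz) direction=reverse ;;
        esac
        echo "$sample,$f,$direction"
    done
}
```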



2. Drag and drop the inject_paired-end-demux.qzv file

3. Go to the interactive quality plot and find the base position up to which the quality score stays above Q20

multi-dada2.job

module load Anaconda/3-

source activate qiime2-2017.12

qiime dada2 denoise-paired --i-demultiplexed-seqs /scratch/user/gizemlevent/qiime2/paired-end-demux.qza --o-table table --o-representative-sequences rep-seqs --p-trim-left-f 0 --p-trim-left-r 5 --p-trunc-len-f 236 --p-trunc-len-r 230 --p-n-threads 20

cp /work/$LSB_JOBID.tmpdir/qiime2*log ./

# this job script needs to be submitted to the supercomputer's job queue, not run in login mode!!



Four columns are required (SampleID, BarcodeSequence, LinkerPrimer, Description); beyond those, we can add factors, treatments, etc. based on our experiment. The one above is an example for the qiime_Cont dataset. If the FASTQ file is not demultiplexed, we can also specify the barcode sequences here. They are case sensitive.


# The metadata file helps us create bar plots and heat-maps, since it separates the samples by treatment and other factors.

Creating feature tables

qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file metadata_Cont.txt

qiime feature-table tabulate-seqs \
--i-data rep-seqs.qza \
--o-visualization rep-seqs.qzv


module load Anaconda/3-

source activate qiime2-2017.12

qiime feature-classifier classify-sklearn --i-classifier /scratch/user/gizemlevent/qiime2/gg-13-8-99-515-806-nb-classifier.qza --i-reads /scratch/user/gizemlevent/qiime2/rep-seqs.qza --o-class

qiime metadata tabulate --m-input-file /scratch/user/gizemlevent/qiime2/taxonomy.qza --o-visualization taxonomy.qzv

# to start this job, transfer the database gg-13-8-99-515-806-nb-classifier.qza to your working folder.

# you can visualize the taxonomy (qzv file) on Qiime-2view.



module load Anaconda/3-

source activate qiime2-2017.12

qiime taxa barplot --i-table table.qza --i-taxonomy taxonomy.qza --m-metadata-file metadata_Cont.txt --o-visualization barplot.qzv


#you can visualize taxonomy bars (qzv file) on Qiime-2view.


Creating heat-maps

qiime feature-table heatmap --i-table table.qza --m-metadata-file metadata_Cont.txt --m-metadata-category Treatment1 --output-dir heatmap



Core metrics (Bray Curtis AND Jaccard)

qiime diversity core-metrics \
--i-table table.qza \
--m-metadata-file metadata_Cont.txt \
--p-sampling-depth 223000 \
--output-dir core-metrics-results

(Remove the back slashes and line breaks if the above does not work; it works as a single line!)


qiime diversity core-metrics --i-table table.qza --m-metadata-file metadata_Cont.txt --p-sampling-depth 223000 --output-dir core-metrics-results


Alpha diversity

module load Anaconda/3-

source activate qiime2-2017.12

qiime diversity alpha-rarefaction --i-table table.qza --p-max-depth 400000 --p-min-depth 250000 --m-metadata-file bella_2_8_18_metadata.txt --output-dir alpha_r_curves

qiime diversity alpha-rarefaction --i-table table.qza --p-max-depth 100000 --p-min-depth 1000 --m-metadata-file bella_2_8_18_metadata.txt --output-dir low_minmax_alpha_r_curves

qiime diversity alpha-phylogenetic --i-table table.qza --i-phylogeny rooted-tree.qza --p-metric faith_pd --output-dir alpha_phylo

Transferring FASTQ files to NCBI using FTP

You can use a desktop app for FTP (File Transfer Protocol).

Your FTP address, user ID and password will be provided by NCBI. You can simply log in with the desktop app and transfer (deposit) the files from your local computer to the NCBI Dropbox.

Alternatively, and this was my challenge today, you can use the Linux or Unix command line to transfer files from your supercomputer server to the NCBI server. I am going to walk through the steps to log in and transfer the files from your supercomputer server to the NCBI Dropbox.

In my case, I was informed that 90 of my 400 FASTQ.GZ files had arrived on the NCBI server in a different format (binary?!). I suspected that the desktop FTP software had corrupted those 90 files, since they looked fine on the supercomputer server.

You can check those with

zcat example.fastq.gz | head -50

You can also check the file fingerprint (a unique ID) using

md5sum example.fastq.gz

and see whether your checksums match theirs.
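Checking hundreds of files one by one is tedious, so here is a minimal sketch that tests every .fastq.gz in a folder with gzip -t and prints the MD5 checksum of each file that passes. The function name and the output layout are my own choices; the checksums still have to be compared with the ones NCBI reports.

```shell
# Test every .fastq.gz in a folder and print its MD5 checksum.
# Prints "OK <md5> <file>" or "CORRUPT - <file>"; returns 1 if any file failed.
check_fastq_gz() {
    status=0
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue
        if gzip -t "$f" 2>/dev/null; then
            echo "OK $(md5sum "$f" | cut -d' ' -f1) $f"
        else
            echo "CORRUPT - $f"
            status=1
        fi
    done
    return $status
}
```

Running check_fastq_gz on the upload folder before and after the transfer gives two lists that can be compared with diff.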

Anyway, today's task: I had 180 fastq.gz files (they gave me the list of files that I needed to re-upload).

First I needed to copy those files out of the folder where I keep all my fastq.gz files (almost 800). Here is the loop for it:

for file in $(cat subset.txt); do cp "$file" /scratch/gizemlevent/FASTQfiles/subsetFASTQ ; done

What we do here is ask the computer to go through subset.txt (the list of 90 fastq.gz files) and copy each file into a new folder called “subsetFASTQ”.

After you have created this folder with your desired FASTQ files, you log in over FTP. In our case we use

module load lftp/4.8.4-GCCcore-6.4.0


After this command you will be asked to enter your password, and then you can reach the NCBI Dropbox. Before you log in to their server, make sure you are in the “subsetFASTQ” folder. Then type:
mput *.fastq.gz

This will transfer every fastq.gz file in that folder to their server.


May the luck be with you!




Transferring FASTQ files using Basemount Illumina

BaseMount is a tool provided by Illumina for transferring sequencing files to a local directory from the command line. Today's challenge for me was picking FASTQ files out of different sub-directories and transferring them to a new one (my_fastq_files).


First you need to install BaseMount.

I will not go over the earlier steps, so let's dive right into the topic.

Let's say you created a basespace_mount directory and can successfully see your Projects and Runs. In my case I have many different projects and runs in that folder, but I am only interested in Gizem_project1 and Gizem_project2, which are my runs. What I would like to do is write a command that goes through the sub-directories and copies the fastq.gz files into the local directory that I created.


Here is the command using “cp” command:

cp basespace_mount/Projects/Gizem_project1/Samples/*/Files/*_001.fastq.gz /user/gizemlevent/my_fastq_files/

cp basespace_mount/Projects/Gizem_project2/Samples/*/Files/*_001.fastq.gz /user/gizemlevent/my_fastq_files/


Alternatively, with the “rsync” command:

rsync -arv --include "*/" --include "*fastq.gz" --exclude "*" --prune-empty-dirs /scratch/user/gizemlevent/basespace_mount/Projects/Gizem_project1 /scratch/user/gizemlevent/my_fastq_Files2/


The rsync command is useful if several of your fastq.gz files share the same file name. rsync transfers all fastq.gz files together with their directory paths, whereas cp copies only the files, so files with the same name overwrite each other. I hope this was helpful.


If you add the --dry-run option (see below), rsync lists what would be copied without copying anything.

rsync -arv --include "*/" --include "*fastq.gz" --exclude "*" --prune-empty-dirs --dry-run /scratch/user/gizemlevent/basespace_mount/Projects/Gizem_project1 /scratch/user/gizemlevent/my_fastq_Files2/
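To decide between cp and rsync, it helps to know whether any file names actually repeat across the sample folders. A small sketch, assuming the mounted project can be walked with find like any other directory (the function name is my own):

```shell
# List fastq.gz file names that occur more than once under a directory tree.
# Usage: duplicate_fastq_names basespace_mount/Projects/Gizem_project1
duplicate_fastq_names() {
    find "$1" -type f -name '*fastq.gz' | awk -F/ '{ print $NF }' | sort | uniq -d
}
```

If this prints nothing, a flat copy with cp should not overwrite any file; otherwise rsync, which keeps the directory paths, is the safer choice.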


How to trim small contigs from a FASTA file

Last week, I wanted to deposit FASTA files at NCBI. All of my FASTA files gave an error about having contigs shorter than 200 bp. Apparently every contig needs to be at least 200 bp long for the submission to complete successfully. I was depositing around 400 FASTA files, so manually removing those contigs was not a smart move. Therefore, I started to google things like:

  • how to delete small size contigs from fasta – Google Search
  • remove small contigs from fasta – Google Search

I decided to go with the command line; I am sure grep and sed would help. But I am not an expert at this yet (I am going to be at some point, though), so I needed to find a script that other people had shared.

My challenge was to find a Perl or Python script that would handle the problem. One website seemed promising, with a Perl script to “Filter the Sequence by Their Length” that can measure the length of each sequence in bp and filter on it. Very promising.

While I was trying to make this script run, I received another e-mail from our HPRC person (Michael is the best; he helps us all the time with our bioinformatics questions) saying that he had shared a Python script with me. I can share this script upon request, since it is not my work. This is the command that needs to be run for each FASTA file:

Path/  name.fasta 200 2000000000000 new_name.fasta

Here we specify the filtering parameters: min. 200 and max. 2000000000000. That worked perfectly, and I was able to deposit all my FASTA files.
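For anyone without access to that script, here is a minimal awk-based sketch of the same kind of length filter. To be clear, this is not Michael's script, just an illustration I would expect to behave similarly for the minimum-length case. It joins wrapped sequence lines, so each kept contig is written on a single line. The function name is my own.

```shell
# Keep only FASTA records whose sequence is at least min_len bases long.
# Usage: filter_fasta_by_length input.fasta min_len > output.fasta
filter_fasta_by_length() {
    awk -v min="$2" '
        # print the stored record if it is long enough
        function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # append wrapped sequence lines
        END { emit() }                                 # do not forget the last record
    ' "$1"
}
```

For the NCBI case this would be run as filter_fasta_by_length name.fasta 200 > new_name.fasta for each file.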

P.S. Please remember to update the file names in the metadata on the NCBI submission portal to the new names!

Or change the names of your files back: for filename in ./*_200*; do mv "$filename" "$(echo "$filename" | sed -e 's/_200//g')"; done


Coursework

Completed graduate-level coursework

Texas A&M University

  • Applied epidemiology
  • Risk analysis
  • Disease detection and surveillance
  • Epidemiological data analysis
  • Statistics in research I
  • Epidemiological methods I
  • Epidemiological methods II & data analysis
  • Microbial genetics
  • Bioinformatics command line
  • Metagenomics data analysis
  • Foundations of biomedical science education
  • Scientific ethics

Istanbul University 

  • Meat and meat production hygiene techniques
  • Food legislation
  • Advanced food chemistry
  • Advanced food microbiology
  • Advanced food hygiene
  • Advanced food technology
  • Milk and dairy products hygiene and technology
  • Slaughtering science and meat inspection
  • Advanced food control and analysis techniques