  • awk script to print a set of target sequences to same file

    I have a folder with sample files (81 total .fasta) from a barcoded MiSeq run.

    Each sample file contains consensus sequences for up to 53 targets.

    Each .fasta is organized so that the name line (">") gives the locus (AT#G######), followed by the consensus sequence. I need to search all sample files (81 taxa in total) and create a new .fasta file for each locus that lists, for every taxon, the taxon name followed by that taxon's consensus sequence for the locus.

    With some help from Stack Exchange, I have a script that does this beautifully, with one hang-up: the new locus .fasta files are not merged across taxa. I get a .fasta for locus ATXGXXXXXX for Sample_1 only, a separate .fasta for Sample_2 for the same locus, and so on for all samples. I can't seem to find a command to merge all sample sequences for locus ATXGXXXXXX into the same .fasta.

    Here is the script:
    awk '
    FNR==1 { sample = FILENAME ; sub(/\.fasta/, "", sample )}
    /^>/ { target = substr($0,2)".fasta" ; next }
    { print items ">" sample > target ; print > target; close(target) }
    ' C_*.fasta
    Does anyone have any thoughts?

  • #2
    Got it sorted out. It's a nifty little script if anyone needs to batch sort multilocus target consensus files from Geneious Export to a new .fasta for per-locus alignment. Huge thanks to Janis at Stackexchange for that one!
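
The reply above does not say what the fix was, so the following is only a sketch of one plausible fix, not the poster's actual solution. In the script from the first post, each locus file is reopened with ">" after every close(target), and awk truncates a file whenever it opens it with ">", so earlier samples get overwritten. Appending with ">>" avoids that. The sketch also writes the sample header once per record rather than once per sequence line, and drops the unset variable "items". The function name split_by_locus is hypothetical, and the input layout (one ">LOCUS" header per record, files named <sample>.fasta) is assumed from the description in the first post.

```shell
# Sketch only: split_by_locus is a hypothetical helper name, not from the thread.
# Assumes each input is <sample>.fasta with header lines such as ">AT1G01010".
split_by_locus() {
    awk '
    # First line of each input file: derive the sample (taxon) name from the file name.
    FNR == 1 { sample = FILENAME; sub(/\.fasta$/, "", sample) }

    # Header line: switch to the file for that locus and write the sample as the header.
    /^>/ {
        if (target != "") close(target)   # keep only one output file open at a time
        target = substr($0, 2) ".fasta"
        print ">" sample >> target        # ">>" appends, so all samples share one file
        next
    }

    # Sequence line(s): append under the header written above.
    target != "" { print >> target }
    ' "$@"
}
```

Run from inside the folder that holds the sample files, the call would be split_by_locus C_*.fasta. Because ">>" appends, any locus files left over from an earlier run should be deleted first, otherwise every sample ends up in them twice.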
