  • N311V
    Member
    • Jul 2013
    • 34

    Combine FASTA files in a specific order based on sequence ID

    Hi all,

    I frequent these forums often but this is my first post.

    I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

    What I want to do is combine two multi fasta files in a specific order based on the sequence IDs.

    For example;

    file 1

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....

    file 2

    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    I need the combined file to look like this:

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    Note that only part of file 2's sequence IDs are common to file 1's.

    I'd prefer to use perl as that is the language I'm learning but any solution will suffice.

    Thanks for reading.
  • Heisman
    Senior Member
    • Dec 2010
    • 534

    #2
    I am on my phone and can't type anything elegant (and I don't know perl), but if you want to get the job done with basic linux tools you can look up how to print out every other line in a file with sed (google sed one liners if you can't find it easily), make these separate files, then you can use the paste command followed by the tr command to convert the tabs to new line characters and get what you want. It is ugly but you should be able to figure it out quickly. Use the lines you posted above as test files so you don't waste time practicing with large files.

    • Heisman
      Senior Member
      • Dec 2010
      • 534

      #3
      Here's what I had in mind. Save this in a script, give yourself permission to execute it, and then run it as: ./script file1 file2 output

      Code:
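      #! /bin/bash
      # Usage: ./script file1 file2 output
      # Assumes two-line FASTA records and exactly two probes per sequence in
      # file 2, listed in the same order as the sequences in file 1.
      
      file_1=$1
      file_2=$2
      output=$3
      
      # File 1: headers go to temp1, sequences to temp2.
      sed -n '1,${p;n}' "$file_1" > temp1
      sed -n '1,${n;p}' "$file_1" > temp2
      # File 2: probe 1 header/sequence go to temp3/temp4, probe 2 to temp5/temp6.
      sed -n '1,${p;n;n;n}' "$file_2" > temp3
      sed -n '1,${n;p;n;n}' "$file_2" > temp4
      sed -n '1,${n;n;p;n}' "$file_2" > temp5
      sed -n '1,${n;n;n;p}' "$file_2" > temp6
      # Lay the columns side by side, then turn every tab into a newline.
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6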
      This is quite inefficient with large files but should introduce some basic commands. You can make it a lot faster by running all of the sed commands together and then having it wait for them to complete prior to putting them together:

      Code:
      #! /bin/bash
      
      file_1=$1
      file_2=$2
      output=$3
      
      # Start every sed job in the background and remember its process ID.
      sed -n '1,${p;n}' "$file_1" > temp1 &
      pid1=$!
      sed -n '1,${n;p}' "$file_1" > temp2 &
      pid2=$!
      sed -n '1,${p;n;n;n}' "$file_2" > temp3 &
      pid3=$!
      sed -n '1,${n;p;n;n}' "$file_2" > temp4 &
      pid4=$!
      sed -n '1,${n;n;p;n}' "$file_2" > temp5 &
      pid5=$!
      sed -n '1,${n;n;n;p}' "$file_2" > temp6 &
      pid6=$!
      
      # Block until all six jobs have finished writing their temp files.
      wait $pid1 $pid2 $pid3 $pid4 $pid5 $pid6
      
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6
      But obviously with perl you can read in both files and just output the lines in the order you desire. So definitely figure that out too. But it is nice to be able to get stuff done with linux commands while learning how to do things in a much better fashion with a scripting language, so if you can understand how this works that would also be useful.
      Last edited by Heisman; 07-08-2013, 09:44 PM.
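
A rough sketch of the "read in both files and output the lines in the order you desire" idea from #3, written with awk inside a shell function rather than Perl so that it sits alongside the other commands in this thread. The function name combine_fasta is made up for illustration, and the sketch assumes every probe ID in file 2 has the form <parent>_probeN, as in the example from the first post. Probes whose parent is not found in file 1 are printed on their own.

```shell
#!/bin/bash
# Sketch only: combine_fasta is a hypothetical name, and the _probeN suffix
# convention is an assumption taken from the example IDs in the first post.
combine_fasta() {
    awk '
        # First file: store each sequence under its ID (header without ">").
        NR == FNR {
            if ($0 ~ /^>/) { id = substr($1, 2) } else { seq[id] = seq[id] $0 }
            next
        }
        # Second file, header line: strip the _probeN suffix to get the
        # parent ID and print the parent record ahead of the probe.
        /^>/ {
            parent = substr($1, 2)
            sub(/_probe[0-9]+$/, "", parent)
            if (parent in seq) { print ">" parent; print seq[parent] }
        }
        # Second file, every line: print the probe header or sequence as is.
        { print }
    ' "$1" "$2"
}
```

Run it as: combine_fasta file1 file2 > output. Unlike the sed and paste version, this does not depend on there being exactly two probes per sequence or on the two files being in the same order, but it does hold all of file 1 in memory.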

      • martinghunt
        Junior Member
        • Jul 2013
        • 5

        #4
        Assuming your files are called 1.fa and 2.fa, this hack will work:

        Code:
        samtools faidx 2.fa
        awk '{id=substr($1,2); getline; for (i=1;i<3;i++){print ">"id; print; system("samtools faidx 2.fa "id"_probe"i)}}' 1.fa
        awk is pretty powerful for this kind of thing.
        Last edited by martinghunt; 07-09-2013, 12:23 PM. Reason: didn't need the samtools faidx 1.fa command

        • HMorrison
          Senior Member
          • May 2009
          • 121

          #5
          as a one-off solution:

          sed -e '$!N;s/\n/\t/' file1 > col1
          sed -e '$!N;s/\n/\t/' file2 | sed -e '$!N;s/\n/\t/' > col2
          paste col1 col2 | fmt -5

          >seq1
          TTTGGATTACAAAGTTATTTAAATCACATGT....
          >seq1_probe1
          CTTTGTCCTTGTCCTTGGTGGCGG....
          >seq1_probe2
          ATTTCTTCTCATCCTCCTCCTCCTA....
          >seq2
          GCCGTGCCATTTCAATTACAAATACATAATA....
          >seq2_probe1
          ACTAAAAACTCGTTGAAGAAATCC....
          >seq2_probe2
          AGGATATAACACACAGCCATCACC....
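
The output in #5 lists each parent sequence once, followed by its probes. To get the layout from the first post, with the parent repeated before every probe, the same "one record per line" trick can be combined with a lookup. This is only a sketch under a few assumptions: strictly two-line FASTA records, headers that contain nothing but the ID, no tab characters in the data, and probe IDs of the form <parent>_probeN. It uses paste - - in place of the sed N idiom to join each header with its sequence, and pair_linearized is a made-up name.

```shell
#!/bin/bash
# Sketch only: pair_linearized is a hypothetical name; the two-line record
# layout and the _probeN suffix are assumptions based on the example data.
pair_linearized() {
    workdir=$(mktemp -d) || return 1
    # paste - - joins every two input lines with a tab: ">id<TAB>sequence".
    paste - - < "$1" > "$workdir/col1"
    paste - - < "$2" > "$workdir/col2"
    awk -F '\t' '
        # First file: remember each linearized record under its header.
        NR == FNR { rec[$1] = $0; next }
        # Second file: look up the parent header and print it before the probe.
        {
            parent = $1
            sub(/_probe[0-9]+$/, "", parent)
            if (parent in rec) print rec[parent]
            print
        }
    ' "$workdir/col1" "$workdir/col2" | tr '\t' '\n'   # back to two lines per record
    rm -r "$workdir"
}
```

Run it as: pair_linearized file1 file2 > output.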

          • huma Asif
            Member
            • Oct 2010
            • 53

            #6
            How can I convert
            >1...>2....>3...>10000 to >1

            and
            >1..>2..>3....>10000 for b.fasta to >2, and
            the same for all 5 of my samples?

            • HMorrison
              Senior Member
              • May 2009
              • 121

              #7
              Originally posted by huma Asif
              How can I convert
              >1...>2....>3...>10000 to >1

              and
              >1..>2..>3....>10000 for b.fasta to >2, and
              the same for all 5 of my samples?
              I do not understand the question. Can you explain further?

              • westerman
                Rick Westerman
                • Jun 2008
                • 1104

                #8
                I agree with @HMorrison: the question needs to be stated more clearly. That said, 'fastx_renamer' will rename FastA files.

                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  This is the parent thread with "some" additional information: http://seqanswers.com/forums/showthread.php?t=46474

                  • huma Asif
                    Member
                    • Oct 2010
                    • 53

                    #10
                    I created a FASTA file from a VCF file using target intervals, so the FASTA file now has the same number of headers as there are coordinates in the BED file.

                    What I want to do is concatenate all of these sequences.
