  • Combine FASTA files in a specific order based on sequence ID

    Hi all,

    I frequent these forums often but this is my first post.

    I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

    What I want to do is combine two multi-FASTA files in a specific order based on the sequence IDs.

    For example:

    file 1

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....

    file 2

    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    I need the combined file to look like this:

    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe1
    CTTTGTCCTTGTCCTTGGTGGCGG....
    >seq1
    TTTGGATTACAAAGTTATTTAAATCACATGT....
    >seq1_probe2
    ATTTCTTCTCATCCTCCTCCTCCTA....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe1
    ACTAAAAACTCGTTGAAGAAATCC....
    >seq2
    GCCGTGCCATTTCAATTACAAATACATAATA....
    >seq2_probe2
    AGGATATAACACACAGCCATCACC....

    Note that only part of file 2's sequence IDs are common to file 1's.

    I'd prefer to use perl as that is the language I'm learning but any solution will suffice.

    Thanks for reading.

  • #2
    I am on my phone and can't type anything elegant (and I don't know Perl), but you can get the job done with basic Linux tools. Look up how to print every other line of a file with sed (search for "sed one-liners" if you can't find it easily) and write those lines to separate files. Then use the paste command to join the files side by side, followed by the tr command to convert the tabs to newline characters, and you get what you want. It is ugly, but you should be able to figure it out quickly. Use the lines you posted above as test files so you don't waste time practicing with large files.

    • #3
      Here's what I had in mind. Save this in a script, give yourself permission to execute it, and then run it as: ./script file1 file2 output

      Code:
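      #!/bin/bash
      # Works only for two-line records (header + one sequence line)
      # with exactly two probes per sequence in file 2, as in the example.
      
      file_1=$1
      file_2=$2
      output=$3
      
      # File 1: headers (odd lines) and sequences (even lines)
      sed -n '1,${p;n}' "$file_1" > temp1
      sed -n '1,${n;p}' "$file_1" > temp2
      # File 2: lines 1-4 of each group of four (probe 1 header/sequence, probe 2 header/sequence)
      sed -n '1,${p;n;n;n}' "$file_2" > temp3
      sed -n '1,${n;p;n;n}' "$file_2" > temp4
      sed -n '1,${n;n;p;n}' "$file_2" > temp5
      sed -n '1,${n;n;n;p}' "$file_2" > temp6
      # Interleave the columns, then turn each tab into a newline
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6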
      This is quite inefficient with large files but should introduce some basic commands. You can speed it up by running all of the sed commands at the same time as background jobs and waiting for them to finish before pasting the results together:

      Code:
      #!/bin/bash
      
      file_1=$1
      file_2=$2
      output=$3
      
      # Start each sed as a background job and remember its process ID
      sed -n '1,${p;n}' "$file_1" > temp1 &
      pid1=$!
      sed -n '1,${n;p}' "$file_1" > temp2 &
      pid2=$!
      sed -n '1,${p;n;n;n}' "$file_2" > temp3 &
      pid3=$!
      sed -n '1,${n;p;n;n}' "$file_2" > temp4 &
      pid4=$!
      sed -n '1,${n;n;p;n}' "$file_2" > temp5 &
      pid5=$!
      sed -n '1,${n;n;n;p}' "$file_2" > temp6 &
      pid6=$!
      
      # Block until all six jobs have finished
      wait $pid1 $pid2 $pid3 $pid4 $pid5 $pid6
      
      paste temp1 temp2 temp3 temp4 temp1 temp2 temp5 temp6 | tr '\t' '\n' > "$output"
      
      rm temp1 temp2 temp3 temp4 temp5 temp6
      Of course, with Perl you can read in both files and simply output the lines in the order you want, so definitely figure that out too. Still, it is nice to be able to get things done with Linux commands while you learn the better way with a scripting language, so it is also worth understanding how this works.
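As a rough sketch of that "read both files, then print in the order you want" idea, here is a version that stays in bash and awk rather than Perl, to match the scripts above. The function name combine_fasta is made up for illustration, and it assumes that every probe ID is its parent ID followed by _probeN, as in the example data. Unlike the sed/paste scripts, it does not rely on there being exactly two probes per sequence or on one-line sequences.

```shell
#!/bin/bash
# Sketch: print each probe record from file 2 preceded by its parent record
# from file 1. Assumption: probe IDs look like <parent ID>_probe<N>.
combine_fasta() {
    local ref_fasta="$1" probe_fasta="$2"
    awk '
        # Any header line: remember the current ID (text after ">", up to the first space).
        /^>/ { id = substr($1, 2) }

        # First file: store every record (header plus sequence lines) under its ID.
        NR == FNR { ref[id] = ref[id] $0 "\n"; next }

        # Second file, header line: strip the _probeN suffix and print the parent record.
        /^>/ {
            parent = id
            sub(/_probe[0-9]+$/, "", parent)
            if (parent in ref) printf "%s", ref[parent]
            else print "warning: no parent record for " id > "/dev/stderr"
        }

        # Second file: print every line of the probe record itself.
        { print }
    ' "$ref_fasta" "$probe_fasta"
}
```

Source the function and run it as: combine_fasta file1 file2 > output. File 1 is held in memory, so this assumes it fits there; the output follows the order of the records in file 2.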
      Last edited by Heisman; 07-08-2013, 09:44 PM.

      • #4
        Assuming your files are called 1.fa and 2.fa, this hack will work:

        Code:
        samtools faidx 2.fa
        awk '{id=substr($1,2); getline; for (i=1;i<3;i++){print ">"id; print; system("samtools faidx 2.fa "id"_probe"i)}}' 1.fa
        awk is pretty powerful for this kind of thing.
        Last edited by martinghunt; 07-09-2013, 12:23 PM. Reason: didn't need the samtools faidx 1.fa command

        • #5
          as a one-off solution:

          sed -e '$!N;s/\n/\t/' file1 > col1
          sed -e '$!N;s/\n/\t/' file2 | sed -e '$!N;s/\n/\t/' > col2
          paste col1 col2 | fmt -5

          >seq1
          TTTGGATTACAAAGTTATTTAAATCACATGT....
          >seq1_probe1
          CTTTGTCCTTGTCCTTGGTGGCGG....
          >seq1_probe2
          ATTTCTTCTCATCCTCCTCCTCCTA....
          >seq2
          GCCGTGCCATTTCAATTACAAATACATAATA....
          >seq2_probe1
          ACTAAAAACTCGTTGAAGAAATCC....
          >seq2_probe2
          AGGATATAACACACAGCCATCACC....

          • #6
            how i can convert
            >1...>2....>3...>10000 to >1

            and
            >1..>2..>3....>10000 for b.fasta to >2 and
            same for for all my 5 samples

            • #7
              Originally posted by huma Asif
              how i can convert
              >1...>2....>3...>10000 to >1

              and
              >1..>2..>3....>10000 for b.fasta to >2 and
              same for for all my 5 samples
              I do not understand the question. Can you explain further?

              • #8
                I agree with @HMorrison -- the question needs to be stated better. That said, 'fastx_renamer' (part of the FASTX-Toolkit) will rename the sequence identifiers in a FASTA file.

                • #9
                  This is the parent thread with "some" additional information: http://seqanswers.com/forums/showthread.php?t=46474

                  • #10
                    I created a FASTA file from a VCF file using target intervals, so the FASTA file now has the same number of headers as there are coordinates in the BED file.

                    What I want to do is cat (concatenate) all of these sequences.
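If "cat all these sequences" means joining every record in one FASTA file into a single sequence under one new header (this is only a guess at what is being asked), a minimal awk sketch could look like the following. The function name merge_records and the new header ID are placeholders, not anything from the thread.

```shell
#!/bin/bash
# Sketch: join all sequences in a FASTA file into one record.
# Assumption: the old headers can be discarded and replaced by a single new ID.
merge_records() {
    local fasta="$1" new_id="$2"
    awk -v id="$new_id" '
        BEGIN { print ">" id }          # write the single new header first
        !/^>/ { printf "%s", $0 }       # append every sequence line, skipping old headers
        END   { printf "\n" }           # finish the joined sequence with a newline
    ' "$fasta"
}
```

For example: merge_records a.fasta 1 > a.merged.fasta, then the same with ID 2 for b.fasta, and so on for each sample. The joined sequence no longer records where each interval started, so keep the BED file if the boundaries matter.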
