  • compare two Fasta files

    Hi,
    I am new to NGS and know very little about the tools used to analyze NGS-generated sequences. I have a problem that someone with good Perl scripting skills may be able to help me with:
    I have two Fasta files like below:

    File1:
    >a
    GVKKDVKCTTTGGG
    .
    .
    >f
    AAATTTGGGCCCEEE
    >g
    SSSGGGYYYTTTGTFR
    .
    .
    >x
    DDDGGGYYYTTTGTFR
    .
    .
    .

    File2:
    >1
    GVKKDVKCTTTGGG
    .
    .
    >41
    FFFGGGYYYTTTGTFR
    .
    .
    >200
    AAATTTGGGCCCEEE
    .
    .
    >1000
    SSSGGGYYYTTTGTFR
    .
    .
    .

    Many, but not all, sequences are identical between these two files. I would like to compare each sequence in the first file against the second file and produce the following table:

    a GVKKDVKCTTTGGG 1
    f AAATTTGGGCCCEEE 200
    g SSSGGGYYYTTTGTFR 1000
    x DDDGGGYYYTTTGTFR 0
    .
    .
    .

    In the table, the first column is the header of each sequence in the first file, the second column is the sequence itself, and the third column is the header of the identical sequence in the second file. If the second file contains no identical sequence, the third column should be zero.
    I would appreciate it if someone could help me out.
    Thanks a lot.
    Acyrocks

  • #2
    Is this a homework question?



    • #3
      It sounds like one, but in any case it is pretty simple to hash this out in not much code :-)
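
Reading "hash" literally, the idea can be sketched as a lookup table keyed by sequence. The example below is a minimal illustration in Python rather than Perl, using the sequences from the original post typed in by hand. The names `file2_index`, `file1_records`, and `match_rows` are made up for this sketch and do not come from any script in the thread.

```python
# Hypothetical in-memory data copied from the example in the first post.
# file2_index maps each File2 sequence to its header.
file2_index = {
    "GVKKDVKCTTTGGG": "1",
    "FFFGGGYYYTTTGTFR": "41",
    "AAATTTGGGCCCEEE": "200",
    "SSSGGGYYYTTTGTFR": "1000",
}

# (header, sequence) pairs from File1.
file1_records = [
    ("a", "GVKKDVKCTTTGGG"),
    ("f", "AAATTTGGGCCCEEE"),
    ("g", "SSSGGGYYYTTTGTFR"),
    ("x", "DDDGGGYYYTTTGTFR"),
]


def match_rows(records, index, missing="0"):
    """Return (file1 header, sequence, file2 header or `missing`) rows."""
    return [(header, seq, index.get(seq, missing)) for header, seq in records]


rows = match_rows(file1_records, file2_index)
for row in rows:
    print(" ".join(row))
```

Each File1 sequence costs one dictionary lookup, so the comparison does not need to scan File2 again for every record.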



      • #4
        Agreed, I think you've given him the key idea.



        • #5
          Thanks for replying. No, it's not homework. I don't have much Perl knowledge; I have read a couple of chapters and know how to run a script. I may spend more time learning it in the future if I find the time. If you can help me out, I would appreciate it.



          • #6
            Do you know how to program in any particular language? Perhaps we could assist you to implement it in a language you have experience in. If not, someone will probably give you a solution to try.



            • #7
              Dear acyrocks,
              Please run the attached code using the following command (Unix):
              perl for_seqanswer_fasta_formatter.pl file1.fasta file2.fasta out.fasta
              It works fine here.
              Best wishes,
              Rahul
              Attached Files
              Rahul Sharma,
              Ph.D
              Frankfurt am Main, Germany



              • #8
                Thanks so much, Rahul.
                I tried your code and it works for the most part. However, it skips sequences in file1 that are not identical to any sequence in file2. I would like these sequences to still be included in the output file, with zero printed in the third column. I think this could be an easy fix for you. Thanks again.
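
Since the attached Perl script is not reproduced in the thread, the requested behaviour can only be illustrated independently. The following is a rough Python sketch, not a patch to Rahul's code: it reads FASTA records from file handles, joins sequences that span several lines, and emits a row for every file1 record, falling back to "0" when no identical sequence exists in file2. The function names `read_fasta` and `compare` are assumptions made for this example, as is the choice to keep the first header when file2 contains duplicate sequences.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def compare(file1_handle, file2_handle, missing="0"):
    """Yield (file1 header, sequence, file2 header or `missing`) rows."""
    index = {}
    for header, seq in read_fasta(file2_handle):
        # Assumption: if file2 repeats a sequence, keep the first header.
        index.setdefault(seq, header)
    for header, seq in read_fasta(file1_handle):
        # Unmatched sequences are kept and reported with `missing`.
        yield header, seq, index.get(seq, missing)


# Small demonstration on in-memory text standing in for the two files.
demo_file1 = io.StringIO(">a\nGVKKDVKCTTTGGG\n>x\nDDDGGGYYYTTTGTFR\n")
demo_file2 = io.StringIO(">1\nGVKKDVKCTTTGGG\n>41\nFFFGGGYYYTTTGTFR\n")
for row in compare(demo_file1, demo_file2):
    print("\t".join(row))
```

With real data, the `io.StringIO` objects would be replaced by handles from `open("file1.fasta")` and `open("file2.fasta")`, and each row written to the output file instead of printed. Matching here is exact and case-sensitive, so sequences that differ only in letter case would be reported as unmatched.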
