Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract a fasta sequence based on Id AND length

    Hi all !

    After several days looking and testing, I'm finally here, looking for help...

    I have a large multi fasta file from RNA-seq which contain isoforms.

    The file looks like this : (truncated for the example)

    Code:
    >comp32_c0_seq1 len=365 path=[18710:0-364]
    CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
    >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp36_c0_seq1 len=275 path=[22213:0-274]
    CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
    Here the sequence ">comp32_c0_seq1 len=365 path=[18710:0-364]" is unique while the sequence ">comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]" and ">comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]" are isoforms (note the "seq1" and "seq2").


    Tipically what I want is a small script that is able to
    1) check the #1 Id whit the #2, #3, #4...
    2) If the #1 Id is unique, pass to the #2 Id
    3) if the #1 Id is not unique, then compare its length (from the "len=XXX" commentary string in the header or from the sequence itself, it doesn't matters) whit the length of the other Ids, and finally keep the longest.

    Note that some sequence have more than 10 isoforms...

    Is it something feasible ?

    I would really appreciate if you guys could give me a hand on this..!

    Thanks !

    Gabriel.

  • #2
    That would require some fancy scripting, though it's feasible. It could miss some exons, though.

    Alternatively, you could run Dedupe (int the BBMap package) like this:

    dedupe.sh in=contigs.fasta out=deduped.fasta

    That will absorb all duplicate or contained sequences. The net effect will be to retain the longest transcript per gene (so if some transcript contains all the exons, it will keep that one). However, if there is alternative splicing such that some transcript contains a unique exon not found in other transcripts, it will keep that one, too. For most uses, this is probably a safer method.

    Comment


    • #3
      Hi, thanks for your quick reply.
      I tried it out but it does not seem to do exactly what I want it to... I think dedupe looks for exact duplicates which results in this output :

      Code:
      Input:                  	216203 reads 		178150781 bases.
      Duplicates:             	4 reads (0.00%) 	3003 bases (0.00%)     	0 collisions.
      Containments:           	29 reads (0.01%) 	15742 bases (0.01%)    	177580 collisions.
      Result:                 	216170 reads (99.98%) 	178132036 bases (99.99%)
      While I know that there is more than 4 "duplicates" (i.e. isoforms) in the whole dataset...
      For now, I don't really care if some isoforms are completely removed from my dataset... I really need to keep only the longest sequence for each Id.

      Thank you again,
      I'll try to find another way

      Comment


      • #4
        Hi Gabriel,

        It actually removes both exact duplicates and full containments. There were only 29 isoforms that were fully contained by other isoforms; there is no way to get a smaller subset of the data without losing unique sequence.

        That said, the results are pretty surprising; looks like that method is not very effective in this case.

        -Brian

        Comment


        • #5
          Use the following script at your own risk.
          There could be bugs left, but they should be easy to fix.
          I'll probably put the script on my GitHub account.

          You do need Python3 and BioPython installed to be able to run it, which may prove to be an obstacle to a non-programmer.

          I'll go back to writing code for which I actually get paid now.

          collapseIsoForms.py

          Code:
          #!/usr/bin/env python3
          
          import argparse
          
          from Bio import SeqIO
          from Bio.Seq import Seq
          from Bio.SeqRecord import SeqRecord
          
          # Read the command line arguments.
          parser = argparse.ArgumentParser(description="Collapses isoforms, keeping only the longest one.")
          parser.add_argument("-i", "--input_file", help="Input FASTA file.", required=True)
          parser.add_argument("-o", "--output_file", help="Output FASTA file with collapsed isoforms", required=True)
          args = parser.parse_args()
          
          # Process the command line arguments.
          input_file = args.input_file
          output_file = args.output_file
          
          # Get FASTA file handle
          fasta_sequences = SeqIO.parse(open(input_file),'fasta')
          
          # Get output file handle
          output_handle = open(output_file, "w")
          
          # Create a variable to store the longest record
          # Set it to the first record, to start
          longest_seq_record = next(fasta_sequences)
          
          #Process FASTA file
          with open(output_file) as out_file:
              for seq_record in fasta_sequences:
                  # Compare id of current seq_record to id of longest stored seq_record
                  if (seq_record.id == longest_seq_record.id):
                      # Compare lengths
                      if(len(seq_record) > len(longest_seq_record)):
                          # Store current record as the longest record to date.
                          longest_seq_record.id = seq_record
                  else:
                      # New id. Print previous longest_seq_record to date.
                      output_handle.write(longest_seq_record.format("fasta"), end="")
                      # Reset longest_seq record.
                      longest_seq_record = seq_record
          Code:
          [blancha@lg-1r17-n02 ~]$ collapseIsoforms.py -i=test.fa -o=test_collapsed.fa
          [blancha@lg-1r17-n02 ~]$ more test.fa
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp36_c0_seq1 len=275 path=[22213:0-274]
          CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
          [blancha@lg-1r17-n02 ~]$ more test_collapsed.fa 
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          Last edited by blancha; 11-10-2015, 01:44 PM.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Non-Coding RNA Research and Technologies
            by seqadmin




            Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

            Nobel Prize for MicroRNA Discovery
            This week,...
            10-07-2024, 08:07 AM
          • seqadmin
            Recent Developments in Metagenomics
            by seqadmin





            Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
            09-23-2024, 06:35 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 10-11-2024, 06:55 AM
          0 responses
          11 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 10-02-2024, 04:51 AM
          0 responses
          109 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 10-01-2024, 07:10 AM
          0 responses
          114 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-30-2024, 08:33 AM
          1 response
          119 views
          0 likes
          Last Post EmiTom
          by EmiTom
           
          Working...
          X