Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Gabriel_
    Junior Member
    • Oct 2015
    • 2

    Extract a fasta sequence based on Id AND length

    Hi all !

    After several days looking and testing, I'm finally here, looking for help...

    I have a large multi fasta file from RNA-seq which contain isoforms.

    The file looks like this : (truncated for the example)

    Code:
    >comp32_c0_seq1 len=365 path=[18710:0-364]
    CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
    >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
    ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
    AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
    >comp36_c0_seq1 len=275 path=[22213:0-274]
    CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
    Here the sequence ">comp32_c0_seq1 len=365 path=[18710:0-364]" is unique while the sequence ">comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]" and ">comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]" are isoforms (note the "seq1" and "seq2").


    Tipically what I want is a small script that is able to
    1) check the #1 Id whit the #2, #3, #4...
    2) If the #1 Id is unique, pass to the #2 Id
    3) if the #1 Id is not unique, then compare its length (from the "len=XXX" commentary string in the header or from the sequence itself, it doesn't matters) whit the length of the other Ids, and finally keep the longest.

    Note that some sequence have more than 10 isoforms...

    Is it something feasible ?

    I would really appreciate if you guys could give me a hand on this..!

    Thanks !

    Gabriel.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    That would require some fancy scripting, though it's feasible. It could miss some exons, though.

    Alternatively, you could run Dedupe (int the BBMap package) like this:

    dedupe.sh in=contigs.fasta out=deduped.fasta

    That will absorb all duplicate or contained sequences. The net effect will be to retain the longest transcript per gene (so if some transcript contains all the exons, it will keep that one). However, if there is alternative splicing such that some transcript contains a unique exon not found in other transcripts, it will keep that one, too. For most uses, this is probably a safer method.

    Comment

    • Gabriel_
      Junior Member
      • Oct 2015
      • 2

      #3
      Hi, thanks for your quick reply.
      I tried it out but it does not seem to do exactly what I want it to... I think dedupe looks for exact duplicates which results in this output :

      Code:
      Input:                  	216203 reads 		178150781 bases.
      Duplicates:             	4 reads (0.00%) 	3003 bases (0.00%)     	0 collisions.
      Containments:           	29 reads (0.01%) 	15742 bases (0.01%)    	177580 collisions.
      Result:                 	216170 reads (99.98%) 	178132036 bases (99.99%)
      While I know that there is more than 4 "duplicates" (i.e. isoforms) in the whole dataset...
      For now, I don't really care if some isoforms are completely removed from my dataset... I really need to keep only the longest sequence for each Id.

      Thank you again,
      I'll try to find another way

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Hi Gabriel,

        It actually removes both exact duplicates and full containments. There were only 29 isoforms that were fully contained by other isoforms; there is no way to get a smaller subset of the data without losing unique sequence.

        That said, the results are pretty surprising; looks like that method is not very effective in this case.

        -Brian

        Comment

        • blancha
          Senior Member
          • May 2013
          • 367

          #5
          Use the following script at your own risk.
          There could be bugs left, but they should be easy to fix.
          I'll probably put the script on my GitHub account.

          You do need Python3 and BioPython installed to be able to run it, which may prove to be an obstacle to a non-programmer.

          I'll go back to writing code for which I actually get paid now.

          collapseIsoForms.py

          Code:
          #!/usr/bin/env python3
          
          import argparse
          
          from Bio import SeqIO
          from Bio.Seq import Seq
          from Bio.SeqRecord import SeqRecord
          
          # Read the command line arguments.
          parser = argparse.ArgumentParser(description="Collapses isoforms, keeping only the longest one.")
          parser.add_argument("-i", "--input_file", help="Input FASTA file.", required=True)
          parser.add_argument("-o", "--output_file", help="Output FASTA file with collapsed isoforms", required=True)
          args = parser.parse_args()
          
          # Process the command line arguments.
          input_file = args.input_file
          output_file = args.output_file
          
          # Get FASTA file handle
          fasta_sequences = SeqIO.parse(open(input_file),'fasta')
          
          # Get output file handle
          output_handle = open(output_file, "w")
          
          # Create a variable to store the longest record
          # Set it to the first record, to start
          longest_seq_record = next(fasta_sequences)
          
          #Process FASTA file
          with open(output_file) as out_file:
              for seq_record in fasta_sequences:
                  # Compare id of current seq_record to id of longest stored seq_record
                  if (seq_record.id == longest_seq_record.id):
                      # Compare lengths
                      if(len(seq_record) > len(longest_seq_record)):
                          # Store current record as the longest record to date.
                          longest_seq_record.id = seq_record
                  else:
                      # New id. Print previous longest_seq_record to date.
                      output_handle.write(longest_seq_record.format("fasta"), end="")
                      # Reset longest_seq record.
                      longest_seq_record = seq_record
          Code:
          [blancha@lg-1r17-n02 ~]$ collapseIsoforms.py -i=test.fa -o=test_collapsed.fa
          [blancha@lg-1r17-n02 ~]$ more test.fa
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp36_c0_seq1 len=275 path=[22213:0-274]
          CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
          [blancha@lg-1r17-n02 ~]$ more test_collapsed.fa 
          >comp32_c0_seq1 len=365 path=[18710:0-364]
          CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
          >comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          >comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
          ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
          AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
          Last edited by blancha; 11-10-2015, 01:44 PM.

          Comment

          Latest Articles

          Collapse

          • SEQadmin2
            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
            by SEQadmin2


            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
            ...
            06-02-2026, 10:05 AM
          • SEQadmin2
            Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
            by SEQadmin2


            With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


            Introduction

            Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
            05-22-2026, 06:42 AM
          • SEQadmin2
            Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
            by SEQadmin2

            Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


            Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
            05-06-2026, 09:04 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, 06-02-2026, 12:03 PM
          0 responses
          21 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-02-2026, 11:40 AM
          0 responses
          14 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 05-28-2026, 11:40 AM
          0 responses
          29 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 05-26-2026, 10:12 AM
          0 responses
          31 views
          0 reactions
          Last Post SEQadmin2  
          Working...