
**Brian Bushnell** · 11-09-2015, 10:22 AM

That would require some fancy scripting, though it's feasible. It could miss some exons, though.

Alternatively, you could run Dedupe (in the BBMap package) like this:

dedupe.sh in=contigs.fasta out=deduped.fasta

That will absorb all duplicate or contained sequences. The net effect will be to retain the longest transcript per gene (so if some transcript contains all the exons, it will keep that one). However, if there is alternative splicing such that some transcript contains a unique exon not found in other transcripts, it will keep that one, too. For most uses, this is probably a safer method.
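To make "absorbing duplicate or contained sequences" concrete, here is a toy sketch of the idea in Python. It is not how Dedupe is implemented: it uses plain exact substring matching, ignores reverse complements, and is far too slow for a real transcriptome. The function name `remove_contained` and the example sequences are made up for illustration.

```python
def remove_contained(seqs):
    """Return the names of sequences that are not duplicated or contained elsewhere.

    seqs: dict mapping a sequence name to its sequence string.
    Toy illustration only: exact substring matching, forward strand only.
    """
    # Visit the longest sequences first, so a sequence can only be
    # contained in something that has already been kept.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    kept = []
    for name in order:
        seq = seqs[name].upper()
        # Drop the sequence if it is identical to, or a substring of, a kept one.
        if any(seq in seqs[k].upper() for k in kept):
            continue
        kept.append(name)
    return kept


example = {
    "t1": "AAACCCGGGTTT",  # longest transcript
    "t2": "CCCGGG",        # fully contained in t1, so it is absorbed
    "t3": "AAACCCTTT",     # skips the GGG "exon": not a substring of t1, so it is kept
    "t4": "AAACCCGGGTTT",  # exact duplicate of t1, so it is absorbed
}
print(remove_contained(example))
```

In this example `t3` survives even though it is shorter than `t1`, because the sequence across its splice junction does not occur in `t1`. The same holds for a transcript carrying a unique exon.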

**Gabriel_** · 11-10-2015, 12:38 AM

Hi, thanks for your quick reply.

I tried it out, but it does not seem to do exactly what I want. I think Dedupe looks only for exact duplicates, which results in this output:

Code:

Input:                  	216203 reads 		178150781 bases.
Duplicates:             	4 reads (0.00%) 	3003 bases (0.00%)     	0 collisions.
Containments:           	29 reads (0.01%) 	15742 bases (0.01%)    	177580 collisions.
Result:                 	216170 reads (99.98%) 	178132036 bases (99.99%)

However, I know that there are more than 4 "duplicates" (i.e. isoforms) in the whole dataset.
For now, I don't really care if some isoforms are completely removed from my dataset; I really need to keep only the longest sequence for each Id.

Thank you again,
I'll try to find another way

**Brian Bushnell** · 11-10-2015, 10:23 AM

Hi Gabriel,

It actually removes both exact duplicates and full containments. There were only 29 isoforms that were fully contained by other isoforms; there is no way to get a smaller subset of the data without losing unique sequence.

That said, the results are pretty surprising; looks like that method is not very effective in this case.

-Brian
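As a quick sanity check on the report quoted above, the "Result" line is just the input minus the duplicates and the containments. The snippet below redoes that arithmetic with the numbers from the posted output; the variable names are mine.

```python
# Numbers copied from the Dedupe report posted earlier in the thread.
input_reads, input_bases = 216203, 178150781
dup_reads, dup_bases = 4, 3003
contained_reads, contained_bases = 29, 15742

# Result = Input - Duplicates - Containments
result_reads = input_reads - dup_reads - contained_reads
result_bases = input_bases - dup_bases - contained_bases

print(result_reads, f"{100 * result_reads / input_reads:.2f}%")  # 216170 99.98%
print(result_bases, f"{100 * result_bases / input_bases:.2f}%")  # 178132036 99.99%
```

So only 33 of 216203 sequences were removed, which matches the point that very few isoforms are fully contained in another one.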

**blancha** · 11-10-2015, 01:20 PM

Use the following script at your own risk.
There could be bugs left, but they should be easy to fix.
I'll probably put the script on my GitHub account.

You do need Python 3 and Biopython installed to run it, which may prove to be an obstacle for a non-programmer.

I'll go back to writing code for which I actually get paid now.

collapseIsoforms.py

Code:

#!/usr/bin/env python3

import argparse
import re

from Bio import SeqIO

# Read the command line arguments.
parser = argparse.ArgumentParser(description="Collapses isoforms, keeping only the longest one.")
parser.add_argument("-i", "--input_file", help="Input FASTA file.", required=True)
parser.add_argument("-o", "--output_file", help="Output FASTA file with collapsed isoforms", required=True)
args = parser.parse_args()

# Process the command line arguments.
input_file = args.input_file
output_file = args.output_file


def gene_id(seq_record):
    # Assumption: Trinity-style ids such as comp34_c0_seq2, where the
    # trailing _seqN distinguishes isoforms of the same gene (comp34_c0).
    # Adjust the pattern if your ids are built differently.
    return re.sub(r"_seq\d+$", "", seq_record.id)


# Process FASTA file.
# Isoforms of the same gene must be adjacent in the input file.
with open(input_file) as in_file, open(output_file, "w") as out_file:
    fasta_sequences = SeqIO.parse(in_file, "fasta")

    # Create a variable to store the longest record.
    # Set it to the first record, to start (None if the input is empty).
    longest_seq_record = next(fasta_sequences, None)

    for seq_record in fasta_sequences:
        # Compare gene id of current seq_record to that of longest stored seq_record.
        if gene_id(seq_record) == gene_id(longest_seq_record):
            # Compare lengths.
            if len(seq_record) > len(longest_seq_record):
                # Store current record as the longest record to date.
                longest_seq_record = seq_record
        else:
            # New gene id. Write previous longest_seq_record to date.
            SeqIO.write(longest_seq_record, out_file, "fasta")
            # Reset longest_seq_record.
            longest_seq_record = seq_record

    # Write the longest record of the last gene, which the loop never reaches.
    if longest_seq_record is not None:
        SeqIO.write(longest_seq_record, out_file, "fasta")

Code:

[blancha@lg-1r17-n02 ~]$ collapseIsoforms.py -i=test.fa -o=test_collapsed.fa
[blancha@lg-1r17-n02 ~]$ more test.fa
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp36_c0_seq1 len=275 path=[22213:0-274]
CAGAGGCTGGCCGGCGGCTGGAGGCTGCAGAGGCTGGCCGCCGTGCGGGCGCCGCA
[blancha@lg-1r17-n02 ~]$ more test_collapsed.fa 
>comp32_c0_seq1 len=365 path=[18710:0-364]
CGGGCGCAAGCACTGCTGTTGCTCGAATCTGCGAATGCGACGGGGCAAACTGGCTGC
>comp34_c0_seq1 len=334 path=[22818:0-146 23907:147-246 24647:247-333]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
>comp34_c0_seq2 len=323 path=[22818:0-146 25393:147-235 24647:236-322]
ATTACTTCCTCTGCTTGCCTAGGACGTCCTGTTACTCCACAAAACTCCCTAGCATTTCCG
AAGACCAGCTGGCCACCCGGCCAAGACGGCTGGGCAAACCGCACGGCTGCCGGCGG
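For anyone who cannot install Biopython, or whose isoforms are not grouped together in the file, here is a minimal standard-library sketch of the same idea. It keeps the longest sequence seen so far for each gene in a dictionary, so the input order does not matter. The function names are hypothetical, and `gene_key` assumes Trinity-style ids like the ones in test.fa, where the `_seqN` suffix marks the isoform.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_key(header):
    # Assumption: the id is the first word of the header, and isoforms of
    # one gene differ only in a trailing _seqN (e.g. comp34_c0_seq2).
    seq_id = (header.split() or [""])[0]
    return re.sub(r"_seq\d+$", "", seq_id)


def longest_per_gene(records):
    """Return one (header, sequence) pair per gene: the longest one seen."""
    best = {}
    for header, seq in records:
        key = gene_key(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


# Made-up miniature input; sequences shortened for readability.
demo = """>comp32_c0_seq1 len=5
ACGTA
>comp34_c0_seq1 len=3
ACG
>comp36_c0_seq1 len=2
AC
>comp34_c0_seq2 len=5
ACG
TT
""".splitlines()

for header, seq in longest_per_gene(read_fasta(demo)):
    print(f">{header}\n{seq}")
```

With a real file you would pass an open file handle to `read_fasta` instead of the `demo` list. Everything is held in memory, which should be fine for a transcriptome of this size but is worth keeping in mind for much larger inputs.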


Extract a fasta sequence based on Id AND length
