**Dethecor** · 07-13-2010, 01:09 AM

Set intersection

Now, I'm not an expert in (Bio-)Perl, but what you are doing seems to have a time complexity of roughly O( |all reads| * |aligned reads| ), with a lot of file connections being opened.

Depending on how many reads there are, you might be better off just reading both sets of read IDs ('all' and 'aligned') into an appropriate data structure and doing a set intersection.
Then use these lists of IDs to parse your .fasta files and split them into 'unaligned' and 'aligned' (is there a Set with operations 'union', 'intersection', etc. defined in Perl?)
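
On the question of set operations: Perl has no built-in set type, but a hash keyed by read ID is the usual idiom for one. Below is a minimal sketch with made-up IDs; the variable names (`@all_ids`, `@aligned_ids`, `%aligned`) are illustrative and not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical read IDs, for illustration only.
my @all_ids     = qw(read1 read2 read3 read4 read5);
my @aligned_ids = qw(read2 read4);

# A hash whose keys are the IDs acts as a set.
my %aligned = map { $_ => 1 } @aligned_ids;

# Intersection: IDs present in both lists.
my @intersection = grep { exists $aligned{$_} } @all_ids;

# Difference: IDs in @all_ids that are absent from the aligned set.
my @difference = grep { !exists $aligned{$_} } @all_ids;

# Union: every distinct ID from either list, in first-seen order.
my %seen;
my @union = grep { !$seen{$_}++ } (@all_ids, @aligned_ids);

print "intersection: @intersection\n";
print "difference:   @difference\n";
print "union:        @union\n";
```

Each operation is a single pass over the list, with one hash lookup per ID.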

Cheers

**Bruins** · 07-13-2010, 04:13 AM

First, you read in both 'aligned reads' and 'unaligned reads'. This takes some time.
Within the while loop, you read in the unaligned reads (variable $inseq2) again. This means that for every sequence in 'aligned reads' the program reads in the unaligned reads. Find a way around this and save a LOT of time. Perhaps rename the first $seqin2 in

Code:

while (($seqin2 = $inseq2->next_seq) && $flag == 0) {

Chrz

**kmcarr** · 07-13-2010, 06:21 AM

Dethecor is correct: you are doing this in the most complex (time-wise) manner possible. He is also correct that this is formally a set operation; the operation you want to perform is a 'difference'. We can also make a simplifying assumption, namely that the aligned reads are a proper subset of the total reads.

Here is how I would do it:

- Read in the aligned.fasta, storing the display_ids in a hash.

- Read through the total.fasta, check if the current display_id is defined in your hash.

- If it is not defined, write the current seq object to the unaligned.fasta file

(I don't see the need to write out an aligned.fasta file since that file already exists; it was one of your input files after all.)

This method opens and reads each of the files (aligned.fasta, total.fasta) only once. A single comparison is done for each member of total.fasta.
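
The steps above could be sketched roughly as follows. This is not the original poster's script: it uses plain Perl instead of BioPerl so that it runs without extra modules, and it assumes the read ID is the first whitespace-delimited word after `>` on each header line (which I am assuming corresponds to the display_id). The function names `read_fasta_ids` and `write_unaligned` are made up for this example.

```perl
use strict;
use warnings;

# Collect the ID (first word after '>') of every record in a FASTA file.
sub read_fasta_ids {
    my ($path) = @_;
    my %ids;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        $ids{$1} = 1 if $line =~ /^>(\S+)/;
    }
    close $fh;
    return \%ids;
}

# Copy records from $total_path to $out_path unless their ID is in $skip.
# Returns the number of records written.
sub write_unaligned {
    my ($total_path, $out_path, $skip) = @_;
    open my $in,  '<', $total_path or die "Cannot open $total_path: $!";
    open my $out, '>', $out_path   or die "Cannot open $out_path: $!";
    my ($keep, $written) = (0, 0);
    while (my $line = <$in>) {
        if ($line =~ /^>(\S+)/) {
            # Header line: decide once per record whether to keep it.
            $keep = !exists $skip->{$1};
            $written++ if $keep;
        }
        print {$out} $line if $keep;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $written;
}

# Usage: perl script.pl aligned.fasta total.fasta unaligned.fasta
if (@ARGV == 3) {
    my ($aligned_path, $total_path, $out_path) = @ARGV;
    my $aligned = read_fasta_ids($aligned_path);
    my $count   = write_unaligned($total_path, $out_path, $aligned);
    print "Wrote $count unaligned reads to $out_path\n";
}
```

Each input file is opened and read exactly once, and every record in the total file costs one hash lookup. With BioPerl the same structure should carry over, with a sequence stream in place of the line loops and the seq object's display_id as the hash key.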

**Adamo** · 07-15-2010, 03:10 AM

Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".
And you are right, aligned.fasta isn't useful at all, don't know why I did that.
It is very much faster this way!

@Bruins: Thank you too for your suggestion, it's fine now.

**kmcarr** · 07-15-2010, 05:20 AM

Originally posted by Adamo:

Yes, I knew I'd not chosen the most efficient way to do the job... I'm not very comfortable with informatics!
Thanks kmcarr and Dethecor, I've done what you've suggested: the program now stores aligned reads' ids in an array, and reads the "total reads" file to check if it can find the ids in this array. If not, the id is written in "unaligned.fasta".

Adamo,

Are you using an array or a hash to store the ids of the aligned reads? If you are using an array, then for each id in the total reads file you would need to scan through the array looking for the id of interest; this is inefficient. You should store the aligned read ids as keys in a hash (the values associated with these keys are irrelevant; setting them equal to 1 will be fine). Then for each id in your total reads file you only have to perform a single boolean test: is this id defined as a key in my hash?
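
To make the difference concrete, here is a small sketch comparing the two lookups. The IDs are made up and the helper names `in_array` and `in_hash` are hypothetical. The hash version tests the key with `exists`, which works whatever value is stored.

```perl
use strict;
use warnings;

# Hypothetical aligned read IDs, for illustration only.
my @aligned_ids = map { "read$_" } 1 .. 1000;

# Array lookup: scans the list, so a query can cost up to
# scalar(@aligned_ids) string comparisons.
sub in_array {
    my ($id, $list) = @_;
    for my $item (@$list) {
        return 1 if $item eq $id;
    }
    return 0;
}

# Hash lookup: build the hash once, then each query is a single key test.
my %is_aligned = map { $_ => 1 } @aligned_ids;

sub in_hash {
    my ($id, $set) = @_;
    return exists $set->{$id} ? 1 : 0;
}

for my $id (qw(read1 read1000 read1001)) {
    printf "%-9s array=%d hash=%d\n", $id,
        in_array($id, \@aligned_ids), in_hash($id, \%is_aligned);
}
```

Both functions give the same answers. The array scan grows with the number of aligned reads for every id checked, while the hash test stays roughly constant.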

[Optimization] perl script for unaligned reads

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News