
**drio** · 03-09-2010, 02:14 PM

1. Align your reads against the plastid genome (I personally like bwa and bfast).
2. Once you have the alignments, it is trivial to separate the reads that come
from one organism or the other.
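As a rough illustration of step 2, here is a minimal Python sketch that labels each read by the reference it aligned to. It assumes the aligner wrote SAM output; the function name `classify_alignments` and the reference name `chrPt` are hypothetical, chosen only for this example.

```python
def classify_alignments(sam_lines, plastid_refs):
    """Yield (read_id, label) for each SAM record.

    label is 'plastid', 'other', or 'unmapped'. plastid_refs is the set of
    reference sequence names that belong to the plastid (an assumption:
    you know these names from your own reference FASTA).
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        read_id, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            yield read_id, "unmapped"  # 0x4 is the SAM "unmapped" flag bit
        elif rname in plastid_refs:
            yield read_id, "plastid"
        else:
            yield read_id, "other"
```

Because it is a generator, it streams through the alignments one record at a time and never holds all read IDs in memory.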

If you want to go the ABI way, use Bioscope instead of Corona.

**KevinLam** · 03-10-2010, 12:27 AM

Originally posted by drio

2. Once you have the alignments, it is trivial to separate the reads that come
from one organism or the other.

Hmm, I beg to differ that it's trivial to separate the reads.
Getting the IDs of the reads that map to the two different organisms is simple,

but working with that many reads isn't.
You will have to use disk-based hash tables or load the sequences into MySQL to sort/extract the reads effectively.

**drio** · 03-10-2010, 05:29 AM

Originally posted by KevinLam

Hmm, I beg to differ that it's trivial to separate the reads.
Getting the IDs of the reads that map to the two different organisms is simple,

but working with that many reads isn't.
You will have to use disk-based hash tables or load the sequences into MySQL to sort/extract the reads effectively.

Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.
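A minimal Python sketch of that idea, assuming both the csfasta and the list of wanted IDs are already sorted in the same order. The ID layout `panel_x_y_F3` and the helper names (`read_key`, `csfasta_records`, `keep_wanted`) are assumptions made for illustration; adjust the key function to whatever order your files actually use.

```python
def read_key(read_id):
    # Assumption: IDs look like "12_345_678_F3" and are ordered numerically
    # by (panel, x, y), not alphabetically.
    panel, x, y = read_id.split("_")[:3]
    return int(panel), int(x), int(y)


def csfasta_records(lines):
    """Yield (read_id, sequence) pairs, skipping '#' comment lines."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            yield header, line
            header = None


def keep_wanted(records, wanted_ids):
    """Yield the records whose ID appears in wanted_ids.

    Both inputs must be sorted by read_key. Only one record and one wanted
    ID are held at a time, so memory use does not grow with file size.
    """
    wanted = iter(wanted_ids)
    current = next(wanted, None)
    for read_id, seq in records:
        key = read_key(read_id)
        # Advance the wanted list until it catches up with this record.
        while current is not None and read_key(current) < key:
            current = next(wanted, None)
        if current is None:
            break  # no wanted IDs left
        if read_key(current) == key:
            yield read_id, seq
```

Each file is read once from start to end, so the whole filter is a single linear pass.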

**KevinLam** · 03-10-2010, 11:20 PM

Originally posted by drio

Sort the reads by read ID and iterate over the two sets, dropping reads that don't map to the organism.

I would love to look at your code if you got it working the way you mentioned.
As for me:

I needed to extract 40 million IDs from a 70-million-read csfasta.
Looping through the csfasta is simple,
but I ran into memory issues when I stored 40 million IDs in a normal hash.
So I split the IDs into batches of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

My next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

So if you got it working the way you said, I would really love to see where I went wrong.
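For reference, a disk-based lookup of the kind described here could be sketched in Python with the standard-library `sqlite3` module. This swaps SQLite in for the disk-based hash or MySQL mentioned above; the table name `wanted` and the function names are hypothetical.

```python
import sqlite3


def build_id_store(ids, db_path=":memory:"):
    """Load the wanted read IDs into an indexed SQLite table.

    Pass a file path as db_path to keep the table on disk; the default
    ":memory:" is only convenient for small tests.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS wanted (read_id TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO wanted VALUES (?)", ((i,) for i in ids))
    conn.commit()
    return conn


def is_wanted(conn, read_id):
    row = conn.execute(
        "SELECT 1 FROM wanted WHERE read_id = ?", (read_id,)
    ).fetchone()
    return row is not None


def filter_csfasta(lines, conn):
    """Single pass over the csfasta, yielding only lines of wanted reads."""
    keep = False
    for line in lines:
        if line.startswith("#"):
            continue  # csfasta comment line
        if line.startswith(">"):
            keep = is_wanted(conn, line[1:].strip())
        if keep:
            yield line
```

The csfasta is read only once, at the cost of one indexed lookup per read.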

**nilshomer** · 03-10-2010, 11:43 PM

Originally posted by KevinLam

I would love to look at your code if you got it working the way you mentioned.
As for me:

I needed to extract 40 million IDs from a 70-million-read csfasta.
Looping through the csfasta is simple,
but I ran into memory issues when I stored 40 million IDs in a normal hash.
So I split the IDs into batches of 1 million (I think I could get away with 10 million, but it failed intermittently) and iterated over the csfasta 40 times.

My next implementation will use a disk-based hash so that I only need to loop through the csfasta once.

So if you got it working the way you said, I would really love to see where I went wrong.

If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.

**KevinLam** · 03-11-2010, 12:02 AM

Originally posted by nilshomer

If the reads are sorted by read name, then why do you need such a complicated hash? You should be able to use constant memory and linear time.

I didn't actually try to sort the csfasta by read name. I just assumed that was a task doomed to failure (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
I am actually not sure whether they are already sorted coming off the machine.

**nilshomer** · 03-11-2010, 12:16 AM

Originally posted by KevinLam

I didn't actually try to sort the csfasta by read name. I just assumed that was a task doomed to failure (GNU sort might work for the IDs, but sorting the csfasta in BioPerl or Biopython would probably run out of memory) and went on to other options.
I am actually not sure whether they are already sorted coming off the machine.

They are sorted coming off the machine, so there is no need to re-sort.

**sci_guy** · 03-12-2010, 09:51 PM

I agree with drio. It's an old, classic computer science problem. Google "intersection of sorted lists". If your lists aren't sorted, run GNU sort on them beforehand. You only need to write a shell script; there is no need for huge hashes in RAM.
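For anyone unfamiliar with it, the intersection of two sorted lists is a two-pointer merge. The sketch below shows it in Python rather than shell, purely as an illustration; the name `intersect_sorted` is made up for this example, and it assumes both inputs are sorted in the same order and contain no `None` values.

```python
def intersect_sorted(a, b):
    """Yield the items present in both sorted iterables a and b."""
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x < y:
            x = next(it_a, None)  # a is behind, advance it
        elif y < x:
            y = next(it_b, None)  # b is behind, advance it
        else:
            yield x  # match: emit and advance both
            x, y = next(it_a, None), next(it_b, None)
```

It works on any iterables, including open file handles, so neither list has to fit in RAM.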


Filtering SOLiD reads
