**westerman** · 07-20-2011, 11:52 AM

There are two levels to your question.

The first level is, how do I summarize and count sequences that may vary by zero, one or a handful of bases? In other words, how does a person group together sequences that have a small edit distance from each other? While I am unaware of a program that does this, multiple sequence alignment programs or sequence comparison programs could potentially help out here. Or I suspect that a custom program could be written relatively quickly (I have done something similar and I don't recall it being too hard with the proper modules or helper program handy).

The need for an edit-distance-aware counting program exists no matter what sequencing platform you are using, unless you are willing to settle for an edit distance of zero (i.e., exact matches only), in which case 'cut', 'sort', 'uniq' and 'wc' are your friends.
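For the exact-match case, the same counting can be sketched in a few lines of Python. This is only an illustration: it assumes a csfasta-style input with one read per line, '>' header lines and optional '#' comment lines, and the record names and reads below are made up.

```python
from collections import Counter

def count_exact(lines):
    """Count identical reads in csfasta-style lines (one read per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines, '>' headers and '#' comments; the rest are reads.
        if not line or line[0] in ">#":
            continue
        counts[line] += 1
    return counts

# Made-up example records, for illustration only.
example = """>1_1_1_F3
T0120
>1_1_2_F3
T0120
>1_1_3_F3
T3321
""".splitlines()

# Print reads from most to least abundant, like 'sort | uniq -c | sort -rn'.
for read, n in count_exact(example).most_common():
    print(n, read)
```

For a real file, the same function can be fed an open file handle instead of the example list.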

The second level is, how can I do the above edit-distance-aware counting with SOLiD data? Here you are working with gold because any color-space sequences with an edit distance of 0 or 1 are almost certainly the same. An edit distance (or mismatch) of 1 in color-space means a machine error. That is it. (A real single-base difference changes two adjacent colors, not one.) An edit distance of 1 on any other platform (454, Illumina, 3730) could mean either a machine error or a SNP; no one can tell without further inquiry, which might include looking at quality values. However, and this is the big however, you must do all of your work within color-space (or its bastard cousin 'double-encoded' space if required by the counting program), because as soon as you convert from color-space to base-space you not only lose the advantage of edit distance but also potentially screw up the base calls.
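One way such mismatch-tolerant counting could look is sketched below in Python. It is a rough illustration, not an existing tool: it counts mismatches only (Hamming distance, no insertions or deletions), compares only reads of equal length, and uses a simple greedy rule in which each read is folded into the most abundant read within one mismatch. The function names and example reads are hypothetical, and the all-against-all comparison would be slow on a full run.

```python
from collections import Counter

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def group_reads(counts, max_mismatch=1):
    """Greedily merge each read into the most abundant read within max_mismatch."""
    groups = Counter()
    # Visit abundant reads first so that they become the group representatives.
    for read, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in groups:
            if len(rep) == len(read) and mismatches(rep, read) <= max_mismatch:
                groups[rep] += n
                break
        else:
            # No representative is close enough, so this read starts a new group.
            groups[read] = n
    return groups

# Made-up color-space reads with their exact-match counts.
counts = Counter({"T012012": 10, "T012112": 1, "T330021": 4})
groups = group_reads(counts)
print(dict(groups))
```

Here the single 'T012112' read differs from 'T012012' by one color and is absorbed into it, while 'T330021' stays a separate group.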

The cardinal rule of thumb when working in color-space is to not convert to base-space until the very last possible step.

Hope this helps a bit. Sorry I do not have a specific program to recommend.

**kumar** · 07-20-2011, 01:08 PM

@westerman That helps a lot. I hadn't considered the benefits of colorspace, only the difficulties. So I can do all my counting in colorspace and decode only at the end when I need to recover the base sequence. Very cool. Thanks!

**kumar** · 07-20-2011, 04:52 PM

@eacker Could you post your message publicly or set your preferences to allow reception of private messages? Thanks.

**kumar** · 07-21-2011, 08:19 AM

This is getting slightly OT, but if I expect a constant region of sequence at the end of my reads (ideally the same 4 nt), can I expect the last 3 colorspace calls to be identical across reads? Assuming everything went perfectly as planned.

**westerman** · 07-21-2011, 08:57 AM

Originally posted by kumar

This is getting slightly OT, but if I expect a constant region of sequence at the end of my reads (ideally the same 4 nt), can I expect the last 3 colorspace calls to be identical across reads? Assuming everything went perfectly as planned.

Yes. Example conversion of 7 sequences all with the same 4 ending bases.

>one
AAAAGTCA
>two
ACCCGTCA
>three
ACGTGTCA
>four
GGGGGTCA
>five
GTTGGTCA
>six
CCGGGTCA
>seven
TATAGTCA

>one
A0002121
>two
A1003121
>three
A1311121
>four
G0000121
>five
G1010121
>six
C0300121
>seven
T3332121

So you can see that '121' is always going to be there no matter what your start bases are.
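The conversion above can be reproduced with a short Python sketch. It assumes the usual SOLiD color code (A, C, G, T mapped to 0 to 3 so that the color of a base pair is the XOR of the two values) and the same notation as the example, where the first base is kept as the leading character. The function name is just for illustration.

```python
# 2-bit values chosen so that XOR of two adjacent bases gives the color.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(bases):
    """Convert a base-space string to color-space, keeping the first base."""
    colors = [str(BASE_BITS[a] ^ BASE_BITS[b]) for a, b in zip(bases, bases[1:])]
    return bases[0] + "".join(colors)

# The seven sequences from the example above.
reads = ["AAAAGTCA", "ACCCGTCA", "ACGTGTCA", "GGGGGTCA",
         "GTTGGTCA", "CCGGGTCA", "TATAGTCA"]
for r in reads:
    print(r, encode(r))
```

Every encoded read ends in '121' because the last three colors depend only on the four constant bases 'GTCA'.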

However, this does not mean that '121' is always going to stand for 'GTCA'. Converting color-space reads back to base-space takes us from

>one-rev
A3212121
>two-rev
A0000121
>three-rev
A1111121
>four-rev
A2123121

to

>one-rev
ATCAGTCA
>two-rev
AAAAACTG
>three-rev
ACACACTG
>four-rev
AGTCGTCA

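Different ending bases: 'GTCA' for one-rev and four-rev but 'ACTG' for two-rev and three-rev, because each color is decoded relative to the base before it. This is yet another example of why to do all of your work in color-space before, at the very end, converting into base-space.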
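The reverse conversion can be sketched the same way in Python, again assuming the standard SOLiD color code and a known leading base. Each base is recovered from the previous base and the current color, which is why the same three trailing colors can decode to different bases. The function name is illustrative only.

```python
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = "ACGT"

def decode(cs):
    """Convert a color-space string with a leading base back to base-space."""
    bases = [cs[0]]
    for color in cs[1:]:
        # The next base is the previous base XOR the color.
        bases.append(BITS_BASE[BASE_BITS[bases[-1]] ^ int(color)])
    return "".join(bases)

# The four color-space reads from the example above, all ending in '121'.
for cs in ["A3212121", "A0000121", "A1111121", "A2123121"]:
    print(cs, decode(cs))
```

Because every base depends on the one before it, a single wrong color call changes every base decoded after it, which is the practical reason for staying in color-space as long as possible.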

**kumar** · 07-21-2011, 09:38 AM

@westerman Thanks again. I'm learning how colorspace can be your friend (hopefully I don't have to eat those words). Any suggestions on places to look for sequence comparison algorithms using colorspace? Are there any libraries (Python preferred, but Perl and C acceptable) for working with sequences in colorspace?

**westerman** · 07-21-2011, 01:13 PM

There are a number of mapping programs that work with color-space. If required, you can always convert your 0123 CS into the dreaded (but sometimes useful) ACGT "double-encoded-color-space" and use a base-space-aware package to work in that pseudo-color-space.

As far as your project, no one has chimed in yet with a "yes, here is a good edit-distance aware comparative" program (which is what you need) so it may be time to write your own. I don't think that it would be difficult. The time I did something similar I used Perl's Bio::Grep and the agrep and vmatch options within it. That was for a base-space project so the tool should work with double-encoded-color-space.

If you are unaware of double-encoded space, basically each 0 in CS is replaced with an 'A', each 1 with a 'C', each 2 with a 'G' and each 3 with a 'T'. Telling the difference between a double-encoded file and a true base-space file is left up to the imagination. :-(
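As a small illustration, the substitution can be written in Python as below. This sketch assumes the read keeps its leading base unchanged and that only the digits 0 to 3 occur; tools differ on how they treat the leading base and missing calls, so check what your downstream program expects. The function names are made up for this example.

```python
# Translation tables between color digits and double-encoded letters.
TO_DOUBLE = str.maketrans("0123", "ACGT")
FROM_DOUBLE = str.maketrans("ACGT", "0123")

def double_encode(cs):
    """Replace the colors of a read like 'A1311121' with letters."""
    return cs[0] + cs[1:].translate(TO_DOUBLE)

def double_decode(de):
    """Turn a double-encoded read back into ordinary color-space."""
    return de[0] + de[1:].translate(FROM_DOUBLE)

encoded = double_encode("A1311121")
print(encoded)
print(double_decode(encoded))
```

The result looks like ordinary base-space sequence but is not, so it is worth marking double-encoded files clearly in their names.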

**kumar** · 07-21-2011, 03:56 PM

Originally posted by westerman

As far as your project, no one has chimed in yet with a "yes, here is a good edit-distance aware comparative" program (which is what you need) so it may be time to write your own. I don't think that it would be difficult. The time I did something similar I used Perl's Bio::Grep and the agrep and vmatch options within it. That was for a base-space project so the tool should work with double-encoded-color-space.

I was planning on writing something, which is why I asked about libraries/modules. Always better to check before rolling your own. If I come up with something useful, I'll post back. Thanks!

Counting distinct sequences in csfasta