**robs** · 07-02-2010, 08:05 PM

Do you need to count the barcodes or the unique sequences between the barcodes?
You could first count the unique (full) sequences (probably 2-3 mins) to reduce the number of sequences to process, and then check those sequences for the barcodes using a regex or an approximate string matching algorithm.
Do you know the original barcodes? Do your mismatches include indels?
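A minimal sketch of that two-pass idea, assuming one read per line and a caller-supplied regex with a single capture group around the region of interest. The names `count_unique_reads` and `count_inserts` are hypothetical, not from any existing tool:

```python
import re
from collections import Counter

def count_unique_reads(lines):
    """Collapse identical full-length reads, keeping a count for each."""
    return Counter(line.strip() for line in lines if line.strip())

def count_inserts(read_counts, pattern):
    """Run the regex once per unique read and sum the counts per captured region.

    `pattern` is an assumption: a compiled regex whose first group captures
    the sequence between the known flanks.
    """
    insert_counts = Counter()
    for read, n in read_counts.items():
        match = pattern.search(read)
        if match:
            insert_counts[match.group(1)] += n
    return insert_counts
```

The regex then runs once per distinct read rather than once per line, which is where the saving comes from when many reads are exact duplicates.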

**shainaporter** · 07-06-2010, 09:23 AM

Oops, sorry, we refer to the unknown 20bp sequence as a "barcode", but I realize that the term means something else to the rest of the world.
We do not know the original 20bp sequences, as they were created from randomized oligos. The mismatches will not include indels.
A line of our data looks like this:
GGCGCGCCNNNNNNNNNNNNNNNNNNNNGGCCAT
With the Ns being our unknown sequence, flanked by "known" sequence.
Basically we want to compare bases 9-28 (the 20 Ns) of each line of data, and count how many times each sequence is found among the ~30 million lines of data.
I hope that is clearer, thanks so much for your help!
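For exact matches only, the counting could look roughly like this. It is a sketch that assumes one read per line and that the 20 bp region sits directly after the 8 bp left flank, as in the example line above; `count_variable_region` is a made-up name:

```python
from collections import Counter

# 0-based slice bounds for the 20 bases that follow the 8 bp left flank
# (assumed from the example read GGCGCGCC + 20 N + GGCCAT).
START, END = 8, 28

def count_variable_region(lines):
    """Count how often each 20 bp variable region occurs, streaming line by line."""
    counts = Counter()
    for line in lines:
        read = line.strip()
        if len(read) >= END:  # skip reads too short to contain the full region
            counts[read[START:END]] += 1
    return counts
```

Passing an open file handle as `lines` keeps only the distinct 20-mers and their counts in memory, not all ~30 million lines.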

**robs** · 07-06-2010, 09:51 AM

One more thing to think about. Since you want to group the sequences with 2 allowed mismatches, you run into the problem of clustering. You basically have to calculate the distance between all the sequences and then group them. There are different approaches to clustering or classification, and each of them might give you a different number of groups.

I would suggest the following:
1) extract the "unknown" sequence
2) remove duplicates, but keep the counts
3) calculate distance between all sequences (I would suggest hamming distance, since no indels)
4) use cluster or classification method to get number of "groups" with max 2 mismatches
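Steps 3 and 4 could be sketched as below. This uses a simple greedy, abundance-ordered clustering, which is only one of the possible approaches and is not necessarily what any of the existing programs do. The input is assumed to be the deduplicated sequences with their counts from step 2, and `greedy_cluster` is a hypothetical name:

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(counts, max_mismatches=2):
    """Assign each unique sequence to the first centroid within max_mismatches.

    Sequences are visited from most to least abundant, so the most common
    sequence in a neighbourhood becomes its centroid. Returns a dict mapping
    centroid -> summed count of all sequences assigned to it.
    """
    clusters = {}
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, n in ordered:
        for centroid in clusters:
            if hamming(seq, centroid) <= max_mismatches:
                clusters[centroid] += n
                break
        else:
            # No existing centroid is close enough: start a new cluster.
            clusters[seq] = n
    return clusters
```

Each unique sequence is compared against the current centroids rather than against every other sequence, so the result depends on the visiting order, and the run time still grows with the number of unique sequences times the number of clusters.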

This is similar to finding OTUs, e.g. for 16S sequences. There are a bunch of programs already designed to do the work for you. You could do steps (1) and (2) and then input the data into one of those programs.

comparing seq data to itself for frequency
