  • shainaporter
    Junior Member
    • Aug 2009
    • 2

    comparing seq data to itself for frequency

    What we're trying to do:
    We have Illumina single-end reads: ~30 million text strings containing approximately 400,000 unique, unknown 20 bp sequences, each flanked by known sequence (a "tag", if you will).
    We need to count the frequency of each barcode sequence (allowing up to 2 bp of mismatch) across the entire data set.
    The problem is that we don't have a reference to match the data to, as the sequences are unknown. Has anyone done anything like this, or does anyone know of software that might be able to do this?

    We are currently using MATLAB and a brute-force technique, in which we compare each new sequence to all of the others before it, increment its count if it matches, or add it to the list if it is unique. This process is going to be exceedingly slow, so we are hoping there is a better way!

    Thanks in advance!
  • robs
    Senior Member
    • May 2010
    • 116

    #2
    Do you need to count the barcodes or the unique sequences between the barcodes?
    You could count the unique (full) sequences first (probably 2-3 mins) to reduce the number of sequences to process, and then check those sequences for the barcodes using a regex or an approximate string matching algorithm.
    Do you know the original barcodes? Do your mismatches include indels?
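A minimal sketch of the "count the unique full sequences first" idea, assuming the reads are available as plain text lines, one read per line. The function name `count_exact` is made up for illustration. A hash-based counter does this in a single pass, instead of comparing every read against all the reads before it.

```python
from collections import Counter

def count_exact(reads):
    """Collapse identical reads into a sequence -> count mapping in one pass.

    `reads` is any iterable of strings (for example an open file handle).
    Blank lines are skipped and case is normalised; both are assumptions.
    """
    return Counter(
        line.strip().upper() for line in reads if line.strip()
    )

if __name__ == "__main__":
    example = ["ACGT\n", "acgt\n", "TTTT\n", "\n", "ACGT\n"]
    for seq, n in count_exact(example).most_common():
        print(seq, n)
```

With ~30 million short reads the counter holds only the distinct sequences, so any later mismatch-tolerant step has far fewer strings to compare.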


    • shainaporter
      Junior Member
      • Aug 2009
      • 2

      #3
      Oops, sorry, we refer to the unknown 20bp sequence as a "barcode", but I realize that the term means something else to the rest of the world.
      We do not know the original 20bp sequence, as they were created from randomized oligos. The mismatches will not include indels.
      A line of our data looks like this:
      GGCGCGCCNNNNNNNNNNNNNNNNNNNNGGCCAT
      The Ns are our unknown sequence, flanked at both ends by "known" sequence.
      Basically we want to compare bases 9-28 of each line of data (the 20 Ns), and count how many times each sequence is found among the ~30 million lines of data.
      I hope that is clearer, thanks so much for your help!
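A rough sketch of pulling out the 20-base unknown region and counting it, assuming every usable read contains the two flanks exactly as in the example line. The names `PATTERN`, `extract_insert` and `count_inserts` are hypothetical, and allowing `N` (an uncalled base) inside the region is a choice, not something stated in the thread.

```python
import re
from collections import Counter

# Flanks copied from the example line; requiring them to match exactly
# is an assumption of this sketch.
PATTERN = re.compile(r"GGCGCGCC([ACGTN]{20})GGCCAT")

def extract_insert(read):
    """Return the 20-base region between the flanks, or None if absent."""
    match = PATTERN.search(read.strip().upper())
    return match.group(1) if match else None

def count_inserts(reads):
    """Count each extracted region; also report how many reads were skipped."""
    counts = Counter()
    skipped = 0
    for read in reads:
        insert = extract_insert(read)
        if insert is None:
            skipped += 1
        else:
            counts[insert] += 1
    return counts, skipped

if __name__ == "__main__":
    read = "GGCGCGCC" + "ACGT" * 5 + "GGCCAT"
    print(count_inserts([read, read, "GGCGCGCCACGT"]))
```

If the flanks themselves can contain sequencing errors, plain slicing such as `read[8:28]` (the 20 positions that follow the 8-base left flank) is a simpler alternative, provided every read starts at the same offset.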


      • robs
        Senior Member
        • May 2010
        • 116

        #4
        One more thing to think about. Since you want to group the sequences with 2 allowed mismatches, you run into the problem of clustering. You basically have to calculate the distance between all the sequences and then group them. There are different approaches to clustering or classification, and each of them might give you a different number.
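A small illustration of why grouping with "up to 2 mismatches" is a clustering problem rather than a simple lookup: being within 2 mismatches is not transitive. The three sequences below are invented for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

a = "A" * 20
b = "CC" + "A" * 18      # differs from a at positions 0 and 1
c = "CCGG" + "A" * 16    # differs from b at positions 2 and 3

print(hamming(a, b), hamming(b, c), hamming(a, c))  # prints: 2 2 4
```

Here `b` is close enough to both `a` and `c`, but `a` and `c` are 4 mismatches apart, so whether the three end up in one group or two depends on the clustering method chosen.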

        I would suggest the following:
        1) extract the "unknown" sequence
        2) remove duplicates, but keep the counts
        3) calculate the distance between all sequences (I would suggest Hamming distance, since there are no indels)
        4) use a clustering or classification method to get the number of "groups" with max 2 mismatches

        This is similar to finding OTUs, e.g. for 16S sequences. There are a bunch of programs already designed to do this work for you. You could do steps (1) and (2) yourself and then feed the data into one of those programs.
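Steps (3) and (4) can be sketched as follows. This is one possible approach under stated assumptions, not what any particular OTU program does: the input is a `Counter` of equal-length sequences (the result of steps 1 and 2), sequences are visited from most to least abundant, and each one either joins an existing cluster centre within 2 mismatches or starts a new cluster. To avoid comparing all pairs, each sequence is split into 3 pieces; two sequences with at most 2 mismatches must share at least one piece exactly, so only centres sharing a piece are checked. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Mismatch count; assumes a and b have the same length."""
    return sum(x != y for x, y in zip(a, b))

def chunks(seq, parts):
    """Yield (piece_index, piece) for `parts` contiguous pieces of seq."""
    size, extra = divmod(len(seq), parts)
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        yield i, seq[start:end]
        start = end

def greedy_cluster(counts, max_mismatches=2):
    """Greedy, abundance-ordered clustering; returns centre -> total count."""
    index = defaultdict(list)  # (piece_index, piece) -> centres with that piece
    clusters = {}
    for seq, n in counts.most_common():  # most abundant sequences first
        keys = list(chunks(seq, max_mismatches + 1))
        candidates = {c for k in keys for c in index.get(k, ())}
        hits = [c for c in candidates if hamming(seq, c) <= max_mismatches]
        if hits:
            # Join the nearest centre; ties go to the larger cluster.
            best = min(hits, key=lambda c: (hamming(seq, c), -clusters[c], c))
            clusters[best] += n
        else:
            clusters[seq] = n  # seq becomes a new centre
            for k in keys:
                index[k].append(seq)
    return clusters

if __name__ == "__main__":
    counts = Counter({"A" * 20: 10, "CC" + "A" * 18: 3, "T" * 20: 5})
    print(greedy_cluster(counts))
```

Because the result depends on visiting order and on comparing only against centres, a different method (for example single-linkage clustering) can report a different number of groups for the same data.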
