Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Lohman
    Junior Member
    • Jun 2015
    • 7

    Sub-setting fasta file by forward and reverse baracodes

    Hi All,

    I have a large number of sequences in a fasta file, and each read contains both a forward and reverse inline barcode. I have far more reads than necessary for my downstream application. I am looking for a way to subset the fasta file by both forward and reverse barcode (this is how sample ID is established). I only need 1000 reads per sample, but currently have many times that. I have a working perl script for demultiplexing them, but the script takes upwards of 5 days to run. Therefore, I am looking for a way to subset the reads before demultiplexing. Any suggestions would be welcome.

    Thanks in advance!
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    You can randomly subsample with reformat like this:

    reformat.sh in=reads.fq out=sampled.fq samplerate=0.1


    ...which will reduce the data to 10%, or 0.5 days worth If your paired reads are in 2 files use the in2 and out2 flags; pairs will be kept together. Otherwise, if they're interleaved, that will be autodetected.

    Alternatively, you may be able to filter for a specific barcode quickly using BBDuk, depending on where the barcode is. If, for example, you wanted the pairs where read1 starts with "ACGTTGCA":

    bbduk.sh in=reads.fq out=filtered.fq literal=ACGTTGCA mm=f rcomp=f k=8 skipr2 restrictleft=8

    Comment

    • Lohman
      Junior Member
      • Jun 2015
      • 7

      #3
      Hi Brian,

      Thanks for the great suggestion. I used your reformat.sh without any problems, and I'm rerunning the multiplexing now.

      I may further subset the data again after demultiplexing to even out the number of reads per sample, but this won't be nearly as big of an issue. It was the raw reads that were giving me trouble.

      Thanks!

      Comment

      • dschika
        Member
        • Mar 2010
        • 56

        #4
        Originally posted by Lohman View Post
        I have a working perl script for demultiplexing them, but the script takes upwards of 5 days to run.
        Even though I don't know how large your sample is, 5 days for demultiplexing sounds very long, even for very large data sets. Maybe you could speed up the script?

        Two questions just for curiosity:
        - This depends on the analysis of course, but in general: Is it not possible to introduce a bias if you sample randomly reads from your raw data?
        - Why do you need only 1000 reads per sample?

        Comment

        • Lohman
          Junior Member
          • Jun 2015
          • 7

          #5
          Hi dschika,

          Yes, the perl script could be faster, but it does a few other tasks while demultiplexing. It also generates an index of unique reads present in the sample, which is what really takes a long time. The longer the list of unique reads, the longer each check to see if a new read is already present in the index. I need this index for the next step.

          I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.

          Comment

          • SNPsaurus
            Registered Vendor
            • May 2013
            • 525

            #6
            Not to harp on the script, but 5 days does seem like a long time. I have a similar step as you describe when generating a pseudo-reference for nextRAD in large populations. In Perl-ish code, something like:

            $indexA = substr($seq,0,$indexA_length);
            $indexB = substr($seq,$indexB_length); # $indexB_length is negative to take end of string
            $read = substr($seq,$indexA_length,$indexB_length); #grab the middle
            $unique{$indexA}{$indexB}{$read}++;
            $unique_all{$read}++;

            The biggest issue on a small computer is that sequence error reads will grow the hash of unique reads quite large given 100 millions of reads, but at 1000 reads per sample this would be done in a few minutes and not take much memory.

            If you are already doing something like the above and it is taking 5 days, then you've got quite a large file indeed!
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

            Comment

            • dschika
              Member
              • Mar 2010
              • 56

              #7
              Originally posted by Lohman View Post
              Hi dschika,

              Yes, the perl script could be faster, but it does a few other tasks while demultiplexing. It also generates an index of unique reads present in the sample, which is what really takes a long time. The longer the list of unique reads, the longer each check to see if a new read is already present in the index. I need this index for the next step.

              I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.
              Indeed, if you are interested in only a handful of loci that sounds reasonable. Thanks for clarification.

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by SEQadmin2


                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                Here are nine questions we think about, in roughly the order they matter, before...
                06-18-2026, 07:11 AM
              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                06-02-2026, 10:05 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, 06-17-2026, 06:09 AM
              0 responses
              38 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-09-2026, 11:58 AM
              0 responses
              100 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-05-2026, 10:09 AM
              0 responses
              121 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-04-2026, 08:59 AM
              0 responses
              114 views
              0 reactions
              Last Post SEQadmin2  
              Working...