  • Subsetting a fasta file by forward and reverse barcodes

    Hi All,

    I have a large number of sequences in a fasta file, and each read contains both a forward and a reverse inline barcode. I have far more reads than necessary for my downstream application. I am looking for a way to subset the fasta file by both forward and reverse barcode (this is how sample ID is established). I only need 1000 reads per sample, but currently have many times that. I have a working Perl script for demultiplexing them, but the script takes upwards of 5 days to run, so I am looking for a way to subset the reads before demultiplexing. Any suggestions would be welcome.

    Thanks in advance!

  • #2
    You can randomly subsample with reformat like this:

    reformat.sh in=reads.fq out=sampled.fq samplerate=0.1


    ...which will reduce the data to 10%, or about 0.5 days' worth. If your paired reads are in 2 files, use the in2 and out2 flags; pairs will be kept together. Otherwise, if they're interleaved, that will be autodetected.
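    For example, a paired-file run would look like this (the file names here are just placeholders):

    reformat.sh in=reads_R1.fq in2=reads_R2.fq out=sampled_R1.fq out2=sampled_R2.fq samplerate=0.1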

    Alternatively, you may be able to filter for a specific barcode quickly using BBDuk, depending on where the barcode is. If, for example, you wanted the pairs where read1 starts with "ACGTTGCA":

    bbduk.sh in=reads.fq out=filtered.fq literal=ACGTTGCA mm=f rcomp=f k=8 skipr2 restrictleft=8
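    And since each of your reads also carries a reverse barcode at the other end, one possible extension is a second pass restricted to the rightmost bases with restrictright. This is just a sketch, with a made-up reverse barcode sequence:

    bbduk.sh in=filtered.fq out=both.fq literal=TTGCACGT mm=f rcomp=f k=8 skipr2 restrictright=8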



    • #3
      Hi Brian,

      Thanks for the great suggestion. I used your reformat.sh without any problems, and I'm rerunning the demultiplexing now.

      I may further subset the data again after demultiplexing to even out the number of reads per sample, but that won't be nearly as big an issue. It was the raw reads that were giving me trouble.

      Thanks!



      • #4
        Originally posted by Lohman View Post
        I have a working Perl script for demultiplexing them, but the script takes upwards of 5 days to run.
        Even though I don't know how large your sample is, 5 days for demultiplexing sounds very long, even for very large data sets. Maybe you could speed up the script?

        Two questions just for curiosity:
        - This depends on the analysis of course, but in general: is it not possible to introduce a bias if you randomly sample reads from your raw data?
        - Why do you need only 1000 reads per sample?



        • #5
          Hi dschika,

          Yes, the Perl script could be faster, but it does a few other tasks while demultiplexing. It also generates an index of unique reads present in the sample, which is what really takes a long time: the longer the list of unique reads, the longer it takes to check whether a new read is already present in the index. I need this index for the next step.

          I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.



          • #6
            Not to harp on the script, but 5 days does seem like a long time. I have a similar step to the one you describe when generating a pseudo-reference for nextRAD in large populations. In Perl-ish code, something like:

            $indexA = substr($seq, 0, $indexA_length);              # forward barcode at the start of the read
            $indexB = substr($seq, $indexB_length);                 # $indexB_length is negative, so this takes the end of the string
            $read   = substr($seq, $indexA_length, $indexB_length); # grab the middle; a negative LENGTH trims that many bases off the end
            $unique{$indexA}{$indexB}{$read}++;                     # unique reads per barcode pair
            $unique_all{$read}++;                                   # unique reads overall

            The biggest issue on a small computer is that reads containing sequencing errors will grow the hash of unique reads quite large given hundreds of millions of reads, but at 1000 reads per sample this would finish in a few minutes and not take much memory.

            If you are already doing something like the above and it is taking 5 days, then you've got quite a large file indeed!
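            For reference, here is a minimal self-contained sketch of that loop over a FASTA file. The file name and barcode lengths are placeholders, and it assumes fixed-length inline barcodes at both ends of each read:

            #!/usr/bin/env perl
            use strict;
            use warnings;

            # Tally unique inner sequences per barcode pair from a FASTA file.
            # File name and barcode lengths below are placeholders.
            my $fasta         = 'reads.fasta';
            my $indexA_length = 8;    # forward barcode length
            my $indexB_length = -8;   # negative: counts back from the end of the read

            my (%unique, %unique_all);

            sub tally {
                my ($seq) = @_;
                return if length($seq) <= $indexA_length + abs($indexB_length);
                my $indexA = substr($seq, 0, $indexA_length);              # forward barcode
                my $indexB = substr($seq, $indexB_length);                 # reverse barcode
                my $read   = substr($seq, $indexA_length, $indexB_length); # middle of the read
                $unique{$indexA}{$indexB}{$read}++;
                $unique_all{$read}++;
            }

            open my $fh, '<', $fasta or die "Cannot open $fasta: $!";
            my $seq = '';
            while (my $line = <$fh>) {
                chomp $line;
                if ($line =~ /^>/) {       # header line starts a new record
                    tally($seq) if length $seq;
                    $seq = '';
                } else {
                    $seq .= $line;         # sequence may span multiple lines
                }
            }
            tally($seq) if length $seq;    # don't forget the last record
            close $fh;

            # Report the number of distinct inner sequences per barcode pair.
            for my $a (sort keys %unique) {
                for my $b (sort keys %{ $unique{$a} }) {
                    printf "%s\t%s\t%d\n", $a, $b, scalar keys %{ $unique{$a}{$b} };
                }
            }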
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



            • #7
              Originally posted by Lohman View Post
              I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.
              Indeed, if you are interested in only a handful of loci, that sounds reasonable. Thanks for the clarification.

