No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Efficient way to split FASTQ files based on Illumina indexes in the ID

    I have received new NextSeq reads from our core facility in a semi-demultiplexed state. The P5 and P7 indices are placed in the sequence ID in an unsorted state. Here is an example of four reads in one of the pair's fastq files:
    @NS500551:36:H5VJNBGXX:1:11101:17033:1044 2:N:0:GGACTCCT+GCGATCTA
    @NS500551:36:H5VJNBGXX:1:11101:2211:1044 2:N:0:TAAGGCGA+TCTACTCT
    @NS500551:36:H5VJNBGXX:1:11101:24462:1044 2:N:0:TCCTGAGC+GCGATCTA
    @NS500551:36:H5VJNBGXX:1:11101:16844:1044 2:N:0:AGGCAGAA+TCTACTCT
    I would like to split the reads into separate fastq files based on the indices, but I cannot find any suitable tools to do it. It needs to be reasonably fast as well, as this sequencing run has 400 million reads ....

    All help is very appreciated.

  • #2
    Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).


    • #3
      The most flexible demultiplexing tool I am aware off is this:

      I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".


      • #4
        BBTools has a program, "demuxbyname", which will do this. Usage: in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

        "Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

        In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

        Oh, and it's extremely fast.
        Last edited by Brian Bushnell; 04-07-2015, 02:37 PM.


        • #5
          bbmap is the solution!

          Thank you all for your help! The "" approach works well and is fast.

          Could I ask you two more things Brian? What I cannot manage to find with this this script is if there is a function to save the unmatched reads to a separate file. Is there such a function? The reason I would like this is that the reads are from NextSeq v1 chemistry and thus, there is a significant amount of reads that have missmatches in the indices. In this software, I cannot manage to find any function for allowing 1 or 2 missmatches (The Illumina demultiplex normally allows for 1 missmatch per index, i.e., a total of two missmatches).

          For the other question why I need to do this. The core facility can and will demultiplex the file for me. They avoided it due to a misdirected kindness due to a miscommunication. It is just that I need the data really soon and they need some time to do it.

          Thank you again!


          • #6
            Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.


            • #7
              Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.

              We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.

              This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

              1) Concatenate all of the output files that did have correct barcodes into a single file:
              cat out_*_1.fq > combined.fq

              2) Run
     in=r#.fq out=nonmatching#.fq names=combined.fq include=f


              • #8
                Demuxbyname now supports an "outu" flag. Does not support substitutions yet, though.


                • #9
                  Does demuxbyname support wildcards by chance?
                  Thanks in advance!


                  • #10
                    It does not explicitly support wildcards, but you also don't necessarily need to supply a list of exact names. For example, with standard Illumina headers that have a barcode in them (at least, in the format we generate them), you can demux into multiple files, one per barcode, without supplying a list of barcodes. Or you can match just a prefix, suffix, or substring (to a list of names) so the rest is implicitly a wildcard... in other words, you can match patterns like "foo*" or "*foo" or "*foo*", but not "foo*bar".


                    • #11
                      Great tool. I was wondering, is there a way to get it to match on the first N nts of the name? My use case is the following: I have a big fastq file with unsplit indices. The indices were read as 9+9 but were in fact only indexed with 6mers. So they look like this:

                      @7001253F:517:CBKMUANXX:5:1107:5342:1998 1:N:0:GTGTGATCT+TCTTTCCCT

                      But only the first 6 nts of the suffix (i.e. GTGTGA) are actually part of the barcode.

                      All my barcodes are separated by hamming distance of 3. Ideally, I would like to separate the barcodes, allowing up to 2 mismatches in the barcode region only, ignoring mismatches in the non-barcode region, e.g.
                      GTGTGA => check for mismatches and sort read into bin
                      TCT+TCTTTCCCT => ignore mismatches in this region

                      So when I run this:
             in=S1_R1.fastq in2=S1_R2.fastq out=out_%_#.fq prefixmode=f names=barcode_names.txt hdist=2 outu=unmatched

                      I get reads in the unmatched file where some of the mismatches that led to read exclusion are outside the barcode region.

                      I tried the argument length=6, but it did not seem to solve the problem for me. I did not quite understand the documentation for the length argument, reproduced here:

                      "If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
                      For example, you could create files based on the first 8 characters of read names."

                      How do you specify if it is a suffix or prefix?
                      What does it mean "insted of or in addition to the list of names"?

                      I did see the argument substring, so perhaps I could use that and it would produce similar results to what I want for the most part, but it's not technically what I want, since I only want to consider matches in the first 6 nts as valid.

                      I realize there is a workaround in that I could write a script to make a duplicate fastq file where I've truncated the barcodes to 6, then run it, then recover the original by read ID, but I was wondering if there is already a built-in way to do this with your tool and I am missing something in my reading of the docs.

                      Thanks in advance!


                      Latest Articles


                      • seqadmin
                        Advanced Methods for the Detection of Infectious Disease
                        by seqadmin

                        The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
                        11-27-2023, 01:15 PM
                      • seqadmin
                        Strategies for Investigating the Microbiome
                        by seqadmin

                        Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...
                        11-09-2023, 07:02 AM





                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 08:23 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 12-01-2023, 09:55 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 11-30-2023, 10:48 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 11-29-2023, 08:26 AM
                      0 responses
                      Last Post seqadmin