Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Efficient way to split FASTQ files based on Illumina indexes in the ID

    I have received new NextSeq reads from our core facility in a semi-demultiplexed state. The P5 and P7 indices are placed in the sequence ID in an unsorted state. Here is an example of four reads in one of the pair's fastq files:
    @NS500551:36:H5VJNBGXX:1:11101:17033:1044 2:N:0:GGACTCCT+GCGATCTA
    @NS500551:36:H5VJNBGXX:1:11101:2211:1044 2:N:0:TAAGGCGA+TCTACTCT
    @NS500551:36:H5VJNBGXX:1:11101:24462:1044 2:N:0:TCCTGAGC+GCGATCTA
    @NS500551:36:H5VJNBGXX:1:11101:16844:1044 2:N:0:AGGCAGAA+TCTACTCT
    I would like to split the reads into separate fastq files based on the indices, but I cannot find any suitable tools to do it. It needs to be reasonably fast as well, as this sequencing run has 400 million reads ....

    All help is very appreciated.

  • #2
    Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).


    • #3
      The most flexible demultiplexing tool I am aware off is this:

      I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".


      • #4
        BBTools has a program, "demuxbyname", which will do this. Usage: in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

        "Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

        In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

        Oh, and it's extremely fast.
        Last edited by Brian Bushnell; 04-07-2015, 02:37 PM.


        • #5
          bbmap is the solution!

          Thank you all for your help! The "" approach works well and is fast.

          Could I ask you two more things Brian? What I cannot manage to find with this this script is if there is a function to save the unmatched reads to a separate file. Is there such a function? The reason I would like this is that the reads are from NextSeq v1 chemistry and thus, there is a significant amount of reads that have missmatches in the indices. In this software, I cannot manage to find any function for allowing 1 or 2 missmatches (The Illumina demultiplex normally allows for 1 missmatch per index, i.e., a total of two missmatches).

          For the other question why I need to do this. The core facility can and will demultiplex the file for me. They avoided it due to a misdirected kindness due to a miscommunication. It is just that I need the data really soon and they need some time to do it.

          Thank you again!


          • #6
            Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.


            • #7
              Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.

              We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.

              This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

              1) Concatenate all of the output files that did have correct barcodes into a single file:
              cat out_*_1.fq > combined.fq

              2) Run
     in=r#.fq out=nonmatching#.fq names=combined.fq include=f


              • #8
                Demuxbyname now supports an "outu" flag. Does not support substitutions yet, though.


                • #9
                  Does demuxbyname support wildcards by chance?
                  Thanks in advance!


                  • #10
                    It does not explicitly support wildcards, but you also don't necessarily need to supply a list of exact names. For example, with standard Illumina headers that have a barcode in them (at least, in the format we generate them), you can demux into multiple files, one per barcode, without supplying a list of barcodes. Or you can match just a prefix, suffix, or substring (to a list of names) so the rest is implicitly a wildcard... in other words, you can match patterns like "foo*" or "*foo" or "*foo*", but not "foo*bar".


                    • #11
                      Great tool. I was wondering, is there a way to get it to match on the first N nts of the name? My use case is the following: I have a big fastq file with unsplit indices. The indices were read as 9+9 but were in fact only indexed with 6mers. So they look like this:

                      @7001253F:517:CBKMUANXX:5:1107:5342:1998 1:N:0:GTGTGATCT+TCTTTCCCT

                      But only the first 6 nts of the suffix (i.e. GTGTGA) are actually part of the barcode.

                      All my barcodes are separated by hamming distance of 3. Ideally, I would like to separate the barcodes, allowing up to 2 mismatches in the barcode region only, ignoring mismatches in the non-barcode region, e.g.
                      GTGTGA => check for mismatches and sort read into bin
                      TCT+TCTTTCCCT => ignore mismatches in this region

                      So when I run this:
             in=S1_R1.fastq in2=S1_R2.fastq out=out_%_#.fq prefixmode=f names=barcode_names.txt hdist=2 outu=unmatched

                      I get reads in the unmatched file where some of the mismatches that led to read exclusion are outside the barcode region.

                      I tried the argument length=6, but it did not seem to solve the problem for me. I did not quite understand the documentation for the length argument, reproduced here:

                      "If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
                      For example, you could create files based on the first 8 characters of read names."

                      How do you specify if it is a suffix or prefix?
                      What does it mean "insted of or in addition to the list of names"?

                      I did see the argument substring, so perhaps I could use that and it would produce similar results to what I want for the most part, but it's not technically what I want, since I only want to consider matches in the first 6 nts as valid.

                      I realize there is a workaround in that I could write a script to make a duplicate fastq file where I've truncated the barcodes to 6, then run it, then recover the original by read ID, but I was wondering if there is already a built-in way to do this with your tool and I am missing something in my reading of the docs.

                      Thanks in advance!


                      Latest Articles


                      • seqadmin
                        Exploring the Dynamics of the Tumor Microenvironment
                        by seqadmin

                        The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                        07-08-2024, 03:19 PM
                      • seqadmin
                        Exploring Human Diversity Through Large-Scale Omics
                        by seqadmin

                        In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                        06-25-2024, 06:43 AM





                      Topics Statistics Last Post
                      Started by seqadmin, 07-16-2024, 05:49 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 07-15-2024, 06:53 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 07-10-2024, 07:30 AM
                      0 responses
                      Last Post seqadmin  
                      Started by seqadmin, 07-03-2024, 09:45 AM
                      0 responses
                      Last Post seqadmin