Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • @Brian: K-mer module in FastQC only tracks 2% of the data (1 in 50 sequences)


    • Originally posted by Brian Bushnell View Post
      Hi Gopo,

      Yes, I will add that (as an option). Is that common practice in other variant-callers? Note that does currently have a "PF" (pass filter) field per sample, but I want to make things as simple and compatible as possible.
      Hi Brian,
      GATK does output "./." for individuals where a genotype can't be reliably called at a particular SNP (can't remember if freebayes and/or ANGSD does this too). What is the flag for "PF" (pass filter) field per sample?

      Is there a flag for filtering --min-alleles 2 --max-alleles 2? I ask because I normally do this with VCFtools, but with big VCF files, this takes a while.

      Thank you,


      • aligning Illumina mate-pair library using BBMAP

        Code: rcomp=t rcs=f in=read1.fq in2=read2.fq out=mapped.sam

        Is this the correct way to align illumina mate pair(5.2Kb)/ long mate pair(10Kb) libraries.

        No matter what combination of the flags rcs=t/f rcomp=t/f I use the standard error file shows that "Processing reads in paired-ended mode.". I can't understand how to use the flags correctly because in the UsageGuide it is suggested that one should use requirecorrectstrand=f (rcs=f) and rcomp=t for long mate pair. However, since I am getting insert size of 2611.48 when I am expecting something around 5200bp (predicted 5.2Kb Illumina mate pair library from sequencing center) it is most likely the flags are overridden by the default values which are rcs=t and rcomp=f (which is why I presume bbmap is processing reads in the paired-end mode).

        Could you please help.

        Last edited by bio_d; 10-25-2017, 11:13 AM. Reason: Adding more information to the post


        • Hi Brian,

          Would it be possible to add to bbduk the capability of filtering also the Illumina Index file? Just to clarify what I mean, Illumina produces the following 3 output files for a typical pair-end single indexed sample:

          Read 1 : Sample_S1_L001_R1_001.fastq.gz
          Read 2 : Sample_S1_L001_R2_001.fastq.gz
          Index 1 : Sample_S1_L001_I1_001.fastq.gz

          When I'm filtering using bbduk, I specify in1=read1 in2=read2 out1=filtered.read1 out2=filtered.read2.

          What I would like is to be able to add "in3=index1 out3=filtered.index1". I don't think any quality filtering should be applied to this file except only outputting only the sequences that passed the quality filtering for reads 1/2.

          Currently, what I'm doing now is extracting the IDs of the filtered file and then filtering the index1 file by these IDs, which is very inefficient.

          Many new protocols (mainly for single-cell analysis) are using these 3 Illumina files as input in their pipelines. So I think it could be a good addition.

          Thank you very much in advance!



          • Can you give examples of programs you are referring to? I assume the index read here is not the standard Illumina index but some sort of UMI?

            BTW: Generation of index read in a separate file is an atypical application which requires a separate option for the bcl2fastq pipeline.
            Last edited by GenoMax; 11-03-2017, 07:52 AM.


            • Hi Genomax,

              I'm using 10Xs cellranger pipeline.

              In this particular configuration, they include both "cell barcode" and "UMI" in Read 1 while the current RNA data is encoded in Read 2. The Index 1 file is used as part of the demultiplexing process because 10X uses 4 Illumina indexes per sample instead of just one.

              One additional question: in this scenario, Read 1 is only 26 bp while Read 2 is 75 bp. I want to apply a basic filtering (only maxns) on Read 2, is there a way of doing that? I've tried skipr1=t but that only applies to kmer opperations.

              Thanks again.


              • If you are using the cellranger pipeline then you should not need to do anything to the data outside the pipeline. Cellranger pipeline takes in raw Illumina flowcell data folder and does the entire analysis (including demultiplexing/alignments/counting etc).
                Last edited by GenoMax; 11-03-2017, 08:17 AM.


                • cellranger pipeline is splitted in two steps: "cellranger mkfastq" which produces the three above mentioned files (R1, R2, I1) and "cellranger count" which uses this 3 files as input.

                  The reason why I need to do the filtering first is that some platforms fail sometimes when calling reads from the TOP surface of the flowcell for Read 2, resulting in reads which will be fully or partly composed by Ns. Then, "cellranger count" will use all of the reads (including those that are 100% Ns) to do their calculations, ending up with aberrant metrics (like "mean reads per cell", which is determined as the raw number of reads divided by the number of called cells).


                  • While I have not worked with cellranger (I have only used longranger for WGS data) this sounds like an odd behavior. 10x should be accounting for presence of N's in the reads in their software.

                    If you have not talked with their software tech support then I would suggest that you give that a try. They may have some specific suggestions or will need to implement a fix in their software. Support has been responsive to me in past.


                    • Regarding the issue that produces the Ns, it's not supported by them, because it occurs when sequencing on a HiSeq 4000 using 75,8,0,75 bp cycle pattern instead of their recommended 26,8,98 bp pattern.

                      Regarding the N filtering (or any other quality filtering) I agree with you that they should be doing it in advance and incorporate it into their pipeline. I know that so far it's not being done and I have just sent them an email about it.

                      I'll see what their answer is and be back to you with it. Let's see what their feedback is.

                      In the meantime, do you think this would be something complicated to implement? If not as a permanent thing on bbduk, could you walk me on how to make it work for me? Because I was unable to match your scripts performance (I tried coding things in perl, python and bash).

                      Thanks for this quick replies!


                      • Yikes! You are certainly doing something completely off-label here :-)

                        In the cellranger mkfastq run, you could use a base-mask such as --use-bases-mask=Y26n*,I8,Y*. That should produce reads in the format 10x wants. If this is a 2D, 4000 run then you may have to do --use-bases-mask=Y26n*,I8,n*,Y*, Your second read is going to be short by about 20 bases. I am not sure if the software will like that.

                        @Brian would have to comment on your original request. For him, being a programmer, anything would be possible to implement.


                        • Oh, sorry if I was not clear enough. Those are the sequencing cycles not how I'm doing the base calling.

                          I've been working on 10X projects for several months now and evaluating them in different platforms. That's why I also included the additional question above.


                          • So how are you doing the base calling then?

                            10x recommends that you run 26x8x98 but you are running this as a 75x8x75 run instead? If that is the case my recommendation above would cut the read 1 down to 26 as expected by 10x. The index read is correct length and the last read, which should be 98 bp is only 75 in your case. Is this how you are running cellranger mkfq?

                            With above base mask you should get reads that don't have any N's. Or so I would think.


                            • Yes, that's how I'm running mkfastq. The sequencing is currently producing 2x75 bp, but by running mkfastq with the above mention cycle pattern: Read 1 is being trimmed to 26 bp, and for Read 2, we are only getting 75 bp, which is still enough data for doing alignment. Even though this is not the recommended way of sequencing this 10X libraries, 10X support team tell us they had other customers doing it this way with good results.

                              Regarding the N thing, that's not the case. These high N proportion reads are not being filtered at all by the first step (not by the bcl2fastq nor by the mkfastq wrapper).

                              I have contacted 10X and they are saying that this reads are not being filtered, but they are not being mapped also thus downstream analysis is not being affected, which is something we already knew.

                              However, they told me that for v2 chemistry the Index read is not being used any more after the demultiplexing step. I replied them asking how are they calculating "Q30 Bases in Sample Index" metric then without that file. In the meanwhile, I'm running a test so I'll be back later with both the results and the reply. So lets see. If this works, then I won't be needing any change on bbduk. Lets hope for the best!



                              • So, I've spoken with 10X support. For now they are not doing or planning to do any filtering step on the rawdata. However, thanks to my email they've opened a software feature request. Running "cellranger count" without the Index file will just skip the metric on that file, nothing else.

                                In summary, I'm so sorry for making you waste your time! If appropriate, we could even delete all this posts because we have spoken more about 10X than bbduk.

                                Thanks GenoMax for your feedback! Cheers!


                                Latest Articles


                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin

                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin

                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM





                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                Last Post seqadmin