  • Sequence Quality for large number of fastqs

    Hi Members,

    Organism: S. aureus
    I have Illumina WGS data for 100 isolates, i.e. 2 x 100 fastq.gz files, read length 100.
    Unfortunately, the run failed before completing the last 6 cycles.
    The sequencing centre has demultiplexed the data, and the ball is in our court now.

    My challenge is to strike out the isolates with poor quality. I'm interested in R2 only.

    One approach I know of:
    Get a FastQC report for every R2 and visually decide (pass/fail) on each dataset one by one.
    That would be qualitative, though, so I don't want to go down that route (for the time being).
    Also, "poor" is a subjective term.

    Let's say I'd like to keep isolates with an average coverage of 25x. Is there a way to do that?

    I'm sure this situation is very common in other labs.

    Looking forward to your guidance!
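    To put a number on that target: coverage is just total bases divided by genome size, so 25x across the ~2.8 Mbp S. aureus genome (an assumed round figure) works out to a per-isolate base budget. A quick shell sketch:

```shell
# Rough per-isolate base budget for 25x average coverage.
# 2,800,000 bp is an assumed approximate S. aureus genome size.
genome_size=2800000
target_cov=25
min_bases=$((genome_size * target_cov))
echo "Need >= ${min_bases} surviving bases per isolate"
```

    Any isolate whose surviving bases fall below that budget would miss the 25x target.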
    Last edited by bio_informatics; 11-12-2015, 07:55 AM.
    Bioinformatically calm

  • #2
    I would quality-trim (discarding reads that fall below a certain length after trimming, say 70 bp) or quality-filter, then discard libraries that fall below your coverage target after the operation. For example, using BBDuk on files named "read1.fq.gz" and "read2.fq.gz", trimming to Q15:

    bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

    You'll get some result like this:

    Code:
    Input:                          200000 reads            60200000 bases.
    QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
    Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
    Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.

    Alternatively, if the only problem is the last 6 bases of read 2, then just trim those 6 bases of read 2 and keep all of the libraries.
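    To apply the coverage check across all 100 libraries, one option is to save each BBDuk summary to a per-isolate log and compare the surviving base count against your target. A sketch, using the sample output above (the log file name "trim.log" and the 70 Mbp threshold, i.e. 25x over a ~2.8 Mbp genome, are assumptions):

```shell
# BBDuk prints its summary to stderr; suppose we saved it to a per-isolate log.
# The log contents below are copied from the sample output above.
cat > trim.log <<'EOF'
Input:                          200000 reads            60200000 bases.
QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
EOF

min_bases=70000000                                    # 25x over a ~2.8 Mbp genome
result_bases=$(awk '/^Result:/ {print $5}' trim.log)  # 5th field = surviving bases
if [ "$result_bases" -ge "$min_bases" ]; then
    echo "PASS: $result_bases bases"
else
    echo "FAIL: $result_bases bases"
fi
```

    Looping that over all 200 log files gives you an objective pass/fail list instead of eyeballing FastQC reports.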



    • #3
      Remember to trim R1/R2 together even if you only suspect R2. @Brian's solution will ensure that is done.

      Basically, what you have in hand is a 100 x 94 run.



      • #4
        Originally posted by Brian Bushnell View Post

        bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70


        Code:
        Input:                          200000 reads            60200000 bases.
        QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
        Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
        Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.
        Hi Brian,
        Thanks much for your reply, and suggesting bbduk. This is the exact thing I was looking for.
        I'll give this a try.



        • #5
          Originally posted by GenoMax View Post
          Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

          Basically, what you have in hand is a 100 x 94 run.
          Hi GM,
          Oh, I see.
          I hadn't thought of it to that extent. I shall include both reads.

          Thank you.



          • #6
            Originally posted by Brian Bushnell View Post
            bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70
            Hi Brian,
            Thanks again for sharing the tool and command.
            I got the following output:

            [in=R2_001.fastq.gz, out=trimmed.fq.gz, qtrim=r, trimq=25, minlen=70]

            Input: 2301451 reads 195623335 bases.
            QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
            Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
            I have the following doubt:



            1) If I'm understanding the output properly, 88.79% of the input reads passed my filters (length 70 and quality 25).
            Kindly correct me if I'm wrong.
            Last edited by bio_informatics; 11-11-2015, 11:32 AM.



            • #7
              Here is what @Brian had to say about this question

              Why doesn't Input = QTrimmed + Result at times?

              There are two reasons for that, and they involve the "minlen" parameter and read pairing. For trimming, generally, the result should equal the input unless some reads are trimmed below the minimum length and discarded; you can, for example, have 100% trimmed and a result of 100% as long as they are only trimmed a little bit.

              With pairs, it's even more confusing as sometimes you have a pair discarded because one read got trimmed too short, so you can end up with, theoretically, 50% of reads trimmed and a result of 0%.

              If you set "minlen=0", then nothing will get discarded on the basis of being too short and result will equal input.
              I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.



              • #8
                Originally posted by GenoMax View Post
                Here is what @Brian had to say about this question


                I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.
                Hi Genomax,
                Sorry, I don't understand what that means.
                Let me take baby steps here. Input: R2 (reverse read) only.


                Input: 2301451 reads 195623335 bases.
                QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
                Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
                Qtrimmed (reads) % + Result (reads) % >100

                However,
                Qtrimmed (bases) % + Result (bases) % =100

                I'm confused here.
                Were 16.11% of my reads discarded?
                If so, then my output should have 83-ish% of result. Why there's a conflict?




                • #9
                  Originally posted by bio_informatics View Post
                  Hi Genomax,
                  Qtrimmed (reads) % + Result (reads) % >100

                  However,
                  Qtrimmed (bases) % + Result (bases) % =100

                  I'm confused here.
                  Were 16.11% of my reads discarded?
                  If so, then my output should have 83-ish% of result. Why there's a conflict?
                  ~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.
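                  To make the bookkeeping concrete, here is the arithmetic with the numbers you posted (a sketch; the counts come straight from your output):

```shell
# Read counts from the posted BBDuk output for R2 alone.
input=2301451      # reads in
qtrimmed=370763    # reads that were quality-trimmed (16.11%)
result=2043361     # reads surviving (88.79%)

# Reads trimmed below minlen are dropped entirely; the rest stay in the output.
discarded=$((input - result))            # trimmed too short and discarded
trimmed_kept=$((qtrimmed - discarded))   # trimmed, but still long enough to keep
echo "$discarded reads discarded, $trimmed_kept reads trimmed but kept"
```

                  Those 112,673 trimmed-but-kept reads are counted in both the QTrimmed and Result lines, which is why the read percentages can exceed 100%. Every trimmed base, by contrast, leaves the output, so the base percentages sum to exactly 100%.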

                  BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

                  BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?



                  • #10
                    Originally posted by GenoMax View Post
                    ~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

                    BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

                    BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?
                    Hi GM,
                    Thanks for your reply.
                    That helps.

                    But then why doesn't this apply to the bases? The base percentages do seem to add up to 100.
                    Yes, I had posted the output for Read 2 only.

                    I'm definitely going to use R1 for the analysis.

                    If I include both reads, with Q25 and length 70, the output looks like:

                    Processing time: 4.471 seconds.

                    Input: 2301450 reads 195623250 bases.
                    QTrimmed: 459677 reads (19.97%) 51656212 bases (26.41%)
                    Result: 1701048 reads (73.91%) 143967038 bases (73.59%)

                    Time: 4.911 seconds.
                    Reads Processed: 2301k 468.65k reads/sec
                    Bases Processed: 195m 39.84m bases/sec
                    The reason I was using R2 only is that if Read 2's result percentage (from BBDuk) is below 80%, those isolates will be discarded from the analysis.



                    • #11
                      Since I'm interested in seeing how many reads/bases are good, why would I trim them?
                      Can I use the maq (minimum average quality) argument in BBDuk? My goal is to count the reads whose quality is at least some threshold.
                      Sorry, I'm muddying the waters. I think it's important for me to understand what to use, when, and why.



                      • #12
                        This is a re-sequencing project, correct? Q25 seems pretty strict unless you have a specific reason. Do you have situations where regions in the middle of the read are bad quality (based on your FastQC scans)?

                        It is not logical to calculate an "average" quality for a read. Is that what you are referring to?

                        It looks like most of your R2 is OK (though slightly short, since the run stopped after 94 cycles?)



                        • #13
                          Originally posted by GenoMax View Post
                          This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

                          It is not logical to calculate "average" quality for a read. Is that what you are referring to?

                          Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)
                          Hi,

                          - Yes, average read quality is what I was referring to.
                          - Yes, the run stopped after 96 cycles.
                          - I pulled the QC for Read 2; it looks better than I had expected. Nothing below quality 36, and there are no regions where quality slacks in the middle.

                          Yes, I'm interested in finding the reads that are ugly, so those isolates can be sent for re-sequencing.

                          There's no particular reason I selected Q25.
                          Q1) Is quality 25 too high while working with Read 2 alone? Or
                          too high while working with both R1 and R2?

                          That is the issue: should I lower the bar for the result percentage, i.e. select isolates whose reads passed Q25 and length 70? I'm considering keeping those that give 80% in the result (while running Read 2 only).

                          Or

                          Lower the quality at which I'm obtaining my result, which in this case is 25.

                          Thanks again for your support, and for guiding me through these discussions.
                          Last edited by bio_informatics; 11-12-2015, 08:43 AM.



                          • #14
                            Based on what you are describing, this run seems fully acceptable. R2 is just 6 (or 4) bp shorter than you expected, which is no big deal (unless there are NNNNN's at the end of every read in R2, where the cycles failed).

                            I would say just scan/trim for the presence of adapters (and, if you must, trim <Q20 bases), and then move on to mapping.



                            • #15
                              For trimming, Q25 is very high. The more bases you trim, the less accurate your mapping. Even low-quality bases (under Q10) can increase the specificity when mapping. I don't trim above ~Q15 except in very rare circumstances; Q12 is more typical.

                              To clarify a couple of things:

                              "maq", or minimum average quality, translates the quality scores to probabilities, averages them, then transforms back to a PHRED-scaled Q-score. So, maq=20 would remove all reads with an average error rate of over 1%, or an average of 1 error per 100bp read. This is applied after trimming, if trimming is performed. Ns are considered to have a 75% error rate, so it's useful to at least trim trailing Ns (trimq=1) if you use maq; otherwise, any read with 4 trailing Ns would be discarded. And its mate would also be discarded.
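                              To see why this differs from naively averaging Q-scores, consider four bases at Q30, Q30, Q2, Q2: the naive average is Q16, but averaging the error probabilities gives roughly Q5, because the average is dominated by the worst bases. A sketch of the arithmetic in awk:

```shell
# Average quality the way maq does: convert each Q to an error probability,
# average the probabilities, then convert back to a PHRED score.
avg_q=$(echo "30 30 2 2" | awk '{
    sum = 0
    for (i = 1; i <= NF; i++) sum += 10^(-$i / 10)   # Q -> error probability
    printf "%.1f", -10 * log(sum / NF) / log(10)     # mean probability -> PHRED
}')
echo "probability-averaged quality: Q$avg_q"
```

                              So a read like this fails maq=20 even though its naive Q-score average might look acceptable.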

                              And - it's important to always trim (and otherwise process) R1 and R2 together, at the same time, in any process that might discard or reorder the reads. Otherwise you will end up with unmatched pairs that won't map correctly.

                              Lastly, note that you can trim the last 4 bases of read 2 only, and keep the result in the same order as before, like this:

                              bbduk.sh in=read2.fq out=trimmed2.fq ftr2=4 ordered

                              In this situation, it would be helpful if you could post a couple charts from one of the bad batches of raw data:

                              Either post the quality histogram according to Illumina:
                              bbduk.sh in=read1.fq in2=read2.fq qhist=qhist.txt bhist=bhist.txt

                              Or, even better, post the quality histograms as determined by mapping:
                              bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fasta qhist=qhist.txt bhist=bhist.txt mhist=mhist.txt

