Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Dinab
    Junior Member
    • Apr 2017
    • 3

    Quality distribution graph

    A bit late, but hope you can help me:
    I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
    Also, if I use reformat.sh with maq=XX, does it simply calculates the avvarge quality based on base quality and checks the error rate (2% max)?

    Thank you!

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      Originally posted by Dinab View Post
      A bit late, but hope you can help me:
      I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
      Yes, the histogram section of the help covers that. Depending on how you want to plot it, you can use qhist, qchist, aqhist, or bqhist. The most generic is aqhist, which plots the number of reads per average quality score.

      Also, if I use reformat.sh with maq=XX, does it simply calculates the average quality based on base quality and checks the error rate (2% max)?
      "maq" transforms the qualities into probability space, calculates the average error rate, and then transforms that back into the phred scale. So, a 2% maximum error rate would be approximately maq=17.

      Comment

      • Dinab
        Junior Member
        • Apr 2017
        • 3

        Perfect, Thank you!

        Comment

        • Dinab
          Junior Member
          • Apr 2017
          • 3

          Another question

          Hi Brian,
          Your previous response worked perfectly, thank you again.
          I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            Originally posted by Dinab View Post
            I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?
            No, I've never implemented something like that... the closest you could get with the existing package is to use a loop for each length with something like...

            Code:
            reformat.sh in=reads.fq minlen=147 maxlen=147 out=stdout.fq int=f | reformat.sh in=stdin.fq aqhist=aqhist_147.txt

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              Originally posted by Dinab View Post
              is it possible to get a plot of the mean quality of each sequence as variable of the length?
              FastQC gives you an average per sequence quality score.

              Comment

              • blsfoxfox
                Member
                • Jan 2013
                • 12

                fastq format error

                Hi Brian,

                I found there is some format error with my fastq. By using reformat.sh in=in.fastq out=out.fastq, I get following error:

                Input is being processed as unpaired
                java.lang.AssertionError:
                Error in in.fastq, line 4241767, with these 4 lines:
                >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868
                CTCTTTCCGGTAACACGCGCCTAGATACTAACAACCTACCTTATACTAATCAGCATGCCTAATACCAACA
                ACCTACTTTATACCTAATAACAATGCCTTGATAACTAACAACATACCTTAATACCTAATACACGCTTAAT
                ACCTAACAACAATACCTTAATACCTAACAACATACTTTAATACCTAACAACCTACCTTAATAACTTAATA

                at stream.FASTQ.quadToRead(FASTQ.java:779)
                at stream.FASTQ.toReadList(FASTQ.java:710)
                at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:110)
                at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:95)
                at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
                at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
                Input: 1060400 reads 9378503777 bases
                Output: 1060400 reads (100.00%) 9378503777 bases (100.00%)

                Time: 75.193 seconds.
                Reads Processed: 1060k 14.10k reads/sec
                Bases Processed: 9378m 124.73m bases/sec
                Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
                at jgi.ReformatReads.process(ReformatReads.java:1130)
                at jgi.ReformatReads.main(ReformatReads.java:45)

                Could you please provide any suggestions on which steps shoud I follow to find the exact error of that read?

                Thanks!

                Comment

                • Brian Bushnell
                  Super Moderator
                  • Jan 2014
                  • 2709

                  The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                  So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                  Last edited by Brian Bushnell; 05-07-2017, 10:35 PM.

                  Comment

                  • blsfoxfox
                    Member
                    • Jan 2013
                    • 12

                    Hi Brian,

                    Thanks for your quick response!

                    Actually, I am doing that on purpose. I found there is format error in my fastq file while using another software, but I don't know where is it. By using reformat.sh with in=in.fastq out=out.fastq, bbmap will report where is that error. In this case, read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong and the only solution I can think of now is using filterbyname tool to exclude that read from my fastq.
                    Thus, I really appreciate if you could provide any suggestion on how could I identify and fix that format error.

                    Thanks again for your time,

                    Comment

                    • blsfoxfox
                      Member
                      • Jan 2013
                      • 12

                      Originally posted by Brian Bushnell View Post
                      The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                      So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                      Hi Brian,

                      Sorry for that I should double checked my input.fastq. After grep multiple lines after that read, I find some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup for it. So my problem now is to find a tool which could extract fasta sequences from fastq.

                      Thanks,

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        Code:
                        reformat.sh in=seq.fq out=seq.fa
                        will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

                        Comment

                        • blsfoxfox
                          Member
                          • Jan 2013
                          • 12

                          Originally posted by GenoMax View Post
                          Code:
                          reformat.sh in=seq.fq out=seq.fa
                          will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.
                          Hi,

                          Thanks for your response!

                          My problem is that my file is mixed of fasta and fastq, and I don't know how to identify those fasta sequences and extract them.

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            How did that happen?

                            Unless you have a defined reason I would suggest not trusting that file.

                            Comment

                            • skatrinli
                              Junior Member
                              • May 2017
                              • 1

                              Hey Brian,

                              Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
                              $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
                              it comes this error:
                              Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
                              at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
                              at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

                              What did I do wrong?

                              Comment

                              • GenoMax
                                Senior Member
                                • Feb 2008
                                • 7142

                                There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
                                @Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

                                I get the following when I run stats.sh.

                                Code:
                                $ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
                                A       C       G       T       N       IUPAC   Other   GC      GC_stdev
                                0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000
                                
                                Main genome scaffold total:             1
                                Main genome contig total:               1
                                Main genome scaffold sequence total:    0.005 MB
                                Main genome contig sequence total:      0.005 MB        0.000% gap
                                Main genome scaffold N/L50:             1/5.386 KB
                                Main genome contig N/L50:               1/5.386 KB
                                Main genome scaffold N/L90:             1/5.386 KB
                                Main genome contig N/L90:               1/5.386 KB
                                Max scaffold length:                    5.386 KB
                                Max contig length:                      5.386 KB
                                Number of scaffolds > 50 KB:            0
                                % main genome in scaffolds > 50 KB:     0.00%
                                
                                
                                Minimum         Number          Number          Total           Total           Scaffold
                                Scaffold        of              of              Scaffold        Contig          Contig  
                                Length          Scaffolds       Contigs         Length          Length          Coverage
                                --------        --------------  --------------  --------------  --------------  --------
                                    All                      1               1           5,386           5,386   100.00%
                                   5 KB                      1               1           5,386           5,386   100.00%

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Today, 08:59 AM
                                0 responses
                                4 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                14 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...