Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quality distribution graph

    A bit late, but hope you can help me:
    I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
    Also, if I use reformat.sh with maq=XX, does it simply calculates the avvarge quality based on base quality and checks the error rate (2% max)?

    Thank you!

    Comment


    • Originally posted by Dinab View Post
      A bit late, but hope you can help me:
      I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
      Yes, the histogram section of the help covers that. Depending on how you want to plot it, you can use qhist, qchist, aqhist, or bqhist. The most generic is aqhist, which plots the number of reads per average quality score.

      Also, if I use reformat.sh with maq=XX, does it simply calculates the average quality based on base quality and checks the error rate (2% max)?
      "maq" transforms the qualities into probability space, calculates the average error rate, and then transforms that back into the phred scale. So, a 2% maximum error rate would be approximately maq=17.

      Comment


      • Perfect, Thank you!

        Comment


        • Another question

          Hi Brian,
          Your previous response worked perfectly, thank you again.
          I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?

          Comment


          • Originally posted by Dinab View Post
            I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?
            No, I've never implemented something like that... the closest you could get with the existing package is to use a loop for each length with something like...

            Code:
            reformat.sh in=reads.fq minlen=147 maxlen=147 out=stdout.fq int=f | reformat.sh in=stdin.fq aqhist=aqhist_147.txt

            Comment


            • Originally posted by Dinab View Post
              is it possible to get a plot of the mean quality of each sequence as variable of the length?
              FastQC gives you an average per sequence quality score.

              Comment


              • fastq format error

                Hi Brian,

                I found there is some format error with my fastq. By using reformat.sh in=in.fastq out=out.fastq, I get following error:

                Input is being processed as unpaired
                java.lang.AssertionError:
                Error in in.fastq, line 4241767, with these 4 lines:
                >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868
                CTCTTTCCGGTAACACGCGCCTAGATACTAACAACCTACCTTATACTAATCAGCATGCCTAATACCAACA
                ACCTACTTTATACCTAATAACAATGCCTTGATAACTAACAACATACCTTAATACCTAATACACGCTTAAT
                ACCTAACAACAATACCTTAATACCTAACAACATACTTTAATACCTAACAACCTACCTTAATAACTTAATA

                at stream.FASTQ.quadToRead(FASTQ.java:779)
                at stream.FASTQ.toReadList(FASTQ.java:710)
                at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:110)
                at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:95)
                at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
                at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
                Input: 1060400 reads 9378503777 bases
                Output: 1060400 reads (100.00%) 9378503777 bases (100.00%)

                Time: 75.193 seconds.
                Reads Processed: 1060k 14.10k reads/sec
                Bases Processed: 9378m 124.73m bases/sec
                Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
                at jgi.ReformatReads.process(ReformatReads.java:1130)
                at jgi.ReformatReads.main(ReformatReads.java:45)

                Could you please provide any suggestions on which steps shoud I follow to find the exact error of that read?

                Thanks!

                Comment


                • The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                  So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                  Last edited by Brian Bushnell; 05-07-2017, 10:35 PM.

                  Comment


                  • Hi Brian,

                    Thanks for your quick response!

                    Actually, I am doing that on purpose. I found there is format error in my fastq file while using another software, but I don't know where is it. By using reformat.sh with in=in.fastq out=out.fastq, bbmap will report where is that error. In this case, read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong and the only solution I can think of now is using filterbyname tool to exclude that read from my fastq.
                    Thus, I really appreciate if you could provide any suggestion on how could I identify and fix that format error.

                    Thanks again for your time,

                    Comment


                    • Originally posted by Brian Bushnell View Post
                      The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                      So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                      Hi Brian,

                      Sorry for that I should double checked my input.fastq. After grep multiple lines after that read, I find some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup for it. So my problem now is to find a tool which could extract fasta sequences from fastq.

                      Thanks,

                      Comment


                      • Code:
                        reformat.sh in=seq.fq out=seq.fa
                        will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

                        Comment


                        • Originally posted by GenoMax View Post
                          Code:
                          reformat.sh in=seq.fq out=seq.fa
                          will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.
                          Hi,

                          Thanks for your response!

                          My problem is that my file is mixed of fasta and fastq, and I don't know how to identify those fasta sequences and extract them.

                          Comment


                          • How did that happen?

                            Unless you have a defined reason I would suggest not trusting that file.

                            Comment


                            • Hey Brian,

                              Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
                              $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
                              it comes this error:
                              Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
                              at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
                              at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

                              What did I do wrong?

                              Comment


                              • There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
                                @Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

                                I get the following when I run stats.sh.

                                Code:
                                $ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
                                A       C       G       T       N       IUPAC   Other   GC      GC_stdev
                                0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000
                                
                                Main genome scaffold total:             1
                                Main genome contig total:               1
                                Main genome scaffold sequence total:    0.005 MB
                                Main genome contig sequence total:      0.005 MB        0.000% gap
                                Main genome scaffold N/L50:             1/5.386 KB
                                Main genome contig N/L50:               1/5.386 KB
                                Main genome scaffold N/L90:             1/5.386 KB
                                Main genome contig N/L90:               1/5.386 KB
                                Max scaffold length:                    5.386 KB
                                Max contig length:                      5.386 KB
                                Number of scaffolds > 50 KB:            0
                                % main genome in scaffolds > 50 KB:     0.00%
                                
                                
                                Minimum         Number          Number          Total           Total           Scaffold
                                Scaffold        of              of              Scaffold        Contig          Contig  
                                Length          Scaffolds       Contigs         Length          Length          Coverage
                                --------        --------------  --------------  --------------  --------------  --------
                                    All                      1               1           5,386           5,386   100.00%
                                   5 KB                      1               1           5,386           5,386   100.00%

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Choosing Between NGS and qPCR
                                  by seqadmin



                                  Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                  10-18-2024, 07:11 AM
                                • seqadmin
                                  Non-Coding RNA Research and Technologies
                                  by seqadmin




                                  Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                                  Nobel Prize for MicroRNA Discovery
                                  This week,...
                                  10-07-2024, 08:07 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 05:31 AM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-24-2024, 06:58 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-23-2024, 08:43 AM
                                0 responses
                                48 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-17-2024, 07:29 AM
                                0 responses
                                58 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X