Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quality distribution graph

    A bit late, but hope you can help me:
    I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
    Also, if I use reformat.sh with maq=XX, does it simply calculates the avvarge quality based on base quality and checks the error rate (2% max)?

    Thank you!

    Comment


    • Originally posted by Dinab View Post
      A bit late, but hope you can help me:
      I want to create an average quality distribution graph of multiple sequences from fastq file. Is it possible with reformat?
      Yes, the histogram section of the help covers that. Depending on how you want to plot it, you can use qhist, qchist, aqhist, or bqhist. The most generic is aqhist, which plots the number of reads per average quality score.

      Also, if I use reformat.sh with maq=XX, does it simply calculates the average quality based on base quality and checks the error rate (2% max)?
      "maq" transforms the qualities into probability space, calculates the average error rate, and then transforms that back into the phred scale. So, a 2% maximum error rate would be approximately maq=17.

      Comment


      • Perfect, Thank you!

        Comment


        • Another question

          Hi Brian,
          Your previous response worked perfectly, thank you again.
          I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?

          Comment


          • Originally posted by Dinab View Post
            I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as variable of the length?
            No, I've never implemented something like that... the closest you could get with the existing package is to use a loop for each length with something like...

            Code:
            reformat.sh in=reads.fq minlen=147 maxlen=147 out=stdout.fq int=f | reformat.sh in=stdin.fq aqhist=aqhist_147.txt

            Comment


            • Originally posted by Dinab View Post
              is it possible to get a plot of the mean quality of each sequence as variable of the length?
              FastQC gives you an average per sequence quality score.

              Comment


              • fastq format error

                Hi Brian,

                I found there is some format error with my fastq. By using reformat.sh in=in.fastq out=out.fastq, I get following error:

                Input is being processed as unpaired
                java.lang.AssertionError:
                Error in in.fastq, line 4241767, with these 4 lines:
                >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868
                CTCTTTCCGGTAACACGCGCCTAGATACTAACAACCTACCTTATACTAATCAGCATGCCTAATACCAACA
                ACCTACTTTATACCTAATAACAATGCCTTGATAACTAACAACATACCTTAATACCTAATACACGCTTAAT
                ACCTAACAACAATACCTTAATACCTAACAACATACTTTAATACCTAACAACCTACCTTAATAACTTAATA

                at stream.FASTQ.quadToRead(FASTQ.java:779)
                at stream.FASTQ.toReadList(FASTQ.java:710)
                at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:110)
                at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:95)
                at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
                at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
                Input: 1060400 reads 9378503777 bases
                Output: 1060400 reads (100.00%) 9378503777 bases (100.00%)

                Time: 75.193 seconds.
                Reads Processed: 1060k 14.10k reads/sec
                Bases Processed: 9378m 124.73m bases/sec
                Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
                at jgi.ReformatReads.process(ReformatReads.java:1130)
                at jgi.ReformatReads.main(ReformatReads.java:45)

                Could you please provide any suggestions on which steps shoud I follow to find the exact error of that read?

                Thanks!

                Comment


                • The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                  So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                  Last edited by Brian Bushnell; 05-07-2017, 10:35 PM.

                  Comment


                  • Hi Brian,

                    Thanks for your quick response!

                    Actually, I am doing that on purpose. I found there is format error in my fastq file while using another software, but I don't know where is it. By using reformat.sh with in=in.fastq out=out.fastq, bbmap will report where is that error. In this case, read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong and the only solution I can think of now is using filterbyname tool to exclude that read from my fastq.
                    Thus, I really appreciate if you could provide any suggestion on how could I identify and fix that format error.

                    Thanks again for your time,

                    Comment


                    • Originally posted by Brian Bushnell View Post
                      The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

                      So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
                      Hi Brian,

                      Sorry for that I should double checked my input.fastq. After grep multiple lines after that read, I find some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup for it. So my problem now is to find a tool which could extract fasta sequences from fastq.

                      Thanks,

                      Comment


                      • Code:
                        reformat.sh in=seq.fq out=seq.fa
                        will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

                        Comment


                        • Originally posted by GenoMax View Post
                          Code:
                          reformat.sh in=seq.fq out=seq.fa
                          will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.
                          Hi,

                          Thanks for your response!

                          My problem is that my file is mixed of fasta and fastq, and I don't know how to identify those fasta sequences and extract them.

                          Comment


                          • How did that happen?

                            Unless you have a defined reason I would suggest not trusting that file.

                            Comment


                            • Hey Brian,

                              Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
                              $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
                              it comes this error:
                              Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
                              at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
                              at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

                              What did I do wrong?

                              Comment


                              • There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
                                @Brian no longer tests against Java 1.6 (which is what you may be using) if I recall.

                                I get the following when I run stats.sh.

                                Code:
                                $ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz 
                                A       C       G       T       N       IUPAC   Other   GC      GC_stdev
                                0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000
                                
                                Main genome scaffold total:             1
                                Main genome contig total:               1
                                Main genome scaffold sequence total:    0.005 MB
                                Main genome contig sequence total:      0.005 MB        0.000% gap
                                Main genome scaffold N/L50:             1/5.386 KB
                                Main genome contig N/L50:               1/5.386 KB
                                Main genome scaffold N/L90:             1/5.386 KB
                                Main genome contig N/L90:               1/5.386 KB
                                Max scaffold length:                    5.386 KB
                                Max contig length:                      5.386 KB
                                Number of scaffolds > 50 KB:            0
                                % main genome in scaffolds > 50 KB:     0.00%
                                
                                
                                Minimum         Number          Number          Total           Total           Scaffold
                                Scaffold        of              of              Scaffold        Contig          Contig  
                                Length          Scaffolds       Contigs         Length          Length          Coverage
                                --------        --------------  --------------  --------------  --------------  --------
                                    All                      1               1           5,386           5,386   100.00%
                                   5 KB                      1               1           5,386           5,386   100.00%

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Exploring the Dynamics of the Tumor Microenvironment
                                  by seqadmin




                                  The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                  07-08-2024, 03:19 PM
                                • seqadmin
                                  Exploring Human Diversity Through Large-Scale Omics
                                  by seqadmin


                                  In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                  06-25-2024, 06:43 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 07-10-2024, 07:30 AM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-03-2024, 09:45 AM
                                0 responses
                                201 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-03-2024, 08:54 AM
                                0 responses
                                212 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-02-2024, 03:00 PM
                                0 responses
                                194 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X