  • brachysclereid
    Member
    • Feb 2011
    • 32

    Ideas on collecting quality scores per base in an illumina fastq file

    Hi,

    I am trying to make per-base quality plots like FastQC's because I would like to customize the reporting. The summary stats that FastQC reports in its text export are difficult to work with in R. It would be better to have a list of the quality scores and let R do the work/stats. Does anyone know of a program that will generate a raw list of quality scores per base from a fastq file? If not, I think it should be pretty easy to write a Perl script for this. I thought it would be worth asking...

    thanks
  • gringer
    David Eccles (gringer)
    • May 2011
    • 845

    #2
    Do you want something more than what fastx_quality_stats from fastx-tools can provide?
    Code:
    usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
    ...
       [-N]         = New output format (with more information per nucleotide/cycle).
    ...
    The *NEW* output format:
            cycle (previously called 'column') = cycle number
            max-count
            For each nucleotide in the cycle (ALL/A/C/G/T/N):
                    count   = number of bases found in this column.
                    min     = Lowest quality score value found in this column.
                    max     = Highest quality score value found in this column.
                    sum     = Sum of quality score values for this column.
                    mean    = Mean quality score value for this column.
                    Q1      = 1st quartile quality score.
                    med     = Median quality score.
                    Q3      = 3rd quartile quality score.
                    IQR     = Inter-Quartile range (Q3-Q1).
                    lW      = 'Left-Whisker' value (for boxplotting).
                    rW      = 'Right-Whisker' value (for boxplotting).
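
The per-cycle columns listed above can be reproduced from raw scores with a short sketch like the following. This is not the FASTX-Toolkit implementation: the function names `cycle_stats` and `per_cycle_stats` are made up for illustration, the quartiles use Python's "inclusive" interpolation, and the whiskers follow the common Tukey convention (most extreme score within 1.5 × IQR of the box). fastx_quality_stats may compute these values differently.

```python
from statistics import mean, quantiles


def cycle_stats(scores):
    """Summarise the quality scores observed at one cycle (read position)."""
    s = sorted(scores)
    if len(s) >= 2:
        # Assumption: "inclusive" quartile interpolation; other tools may differ.
        q1, med, q3 = quantiles(s, n=4, method="inclusive")
    else:
        q1 = med = q3 = s[0]
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(s), "min": s[0], "max": s[-1], "sum": sum(s),
        "mean": mean(s), "Q1": q1, "med": med, "Q3": q3, "IQR": iqr,
        # Assumption: Tukey whiskers, i.e. the most extreme scores inside the fences.
        "lW": min(x for x in s if x >= lo_fence),
        "rW": max(x for x in s if x <= hi_fence),
    }


def per_cycle_stats(reads):
    """Take an iterable of per-read score lists; return one stats dict per cycle."""
    columns = []
    for scores in reads:
        for i, q in enumerate(scores):
            if i == len(columns):
                columns.append([])  # first read long enough to reach this cycle
            columns[i].append(q)
    return [cycle_stats(col) for col in columns]
```

Reads of different lengths are handled by letting later cycles simply contain fewer scores.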


    • dgtnk
      Junior Member
      • Nov 2011
      • 4

      #3
      Agree with gringer.

      fastx_quality_stats from the FASTX-Toolkit works well. It will not give you the raw list of quality scores, but it will give you the quartile values of read quality at each read position, which you can use for boxplotting in R.


      • Dario1984
        Senior Member
        • Jun 2011
        • 166

        #4
        Try using QualityScore in ShortRead.


        • Blahah404
          Member
          • Dec 2011
          • 48

          #5
          You can easily extract a .qual file containing per-base quality scores from a fastq file, for example using biopython:
          Code:
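          #!/usr/bin/env python

          """Usage: fastq2qual.py filename
              where filename is the name of a .fastq file, given without the extension
              will produce: filename.qual
          """

          import sys
          from Bio import SeqIO

          # Base name of the input file, without the ".fastq" extension.
          file_name = sys.argv[1]

          # Biopython's "fastq" format assumes Sanger-style (Phred+33) qualities.
          SeqIO.convert(file_name + ".fastq", "fastq", file_name + ".qual", "qual")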
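
For readers without Biopython, the same per-base scores can be pulled out with the standard library alone. The sketch below is only an illustration under stated assumptions: records are exactly four lines each (no wrapped sequences), the quality offset is Phred+33 unless told otherwise, and the names `read_fastq_qualities` and `write_quality_table` are hypothetical.

```python
def read_fastq_qualities(handle, offset=33):
    """Yield (read_id, [phred scores]) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        handle.readline()  # sequence line (ignored)
        handle.readline()  # '+' separator line (ignored)
        qual = handle.readline().rstrip("\r\n")
        read_id = header[1:].split()[0]  # drop '@' and any description
        yield read_id, [ord(c) - offset for c in qual]


def write_quality_table(fastq_path, out_path, offset=33):
    """Write one tab-delimited line per read: the read id, then one score per base."""
    with open(fastq_path) as fq, open(out_path, "w") as out:
        for read_id, scores in read_fastq_qualities(fq, offset):
            out.write(read_id + "\t" + "\t".join(map(str, scores)) + "\n")
```

The output is a plain tab-delimited text file, so any downstream tool, R included, can take it from there.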


          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by Dario1984
            Try using QualityScore in ShortRead.
            +1

            If you want to use R for the plotting and analysis, why not use R to read the FASTQ files as well?


            • kwyattm
              Junior Member
              • Jul 2011
              • 7

              #7
              The way I handled this was to write a Perl script that 1) parses qseq to fastq, 2) trims adaptors, and 3) writes the quality score data to a text file. The text file is subsequently imported into R and simply graphed. I even get the graphs written to a PDF and e-mailed to me when everything is done!


              • gringer
                David Eccles (gringer)
                • May 2011
                • 845

                #8
                qseq -> fastq is already done in CASAVA, most likely including the removal of any adaptor sequences. CASAVA 1.8+ processes the intensity files directly into fastq:

                http://seqanswers.com/forums/showthread.php?t=13147


                • kwyattm
                  Junior Member
                  • Jul 2011
                  • 7

                  #9
                  Yep!

                  Originally posted by gringer
                  qseq -> fastq is already done in CASAVA, most likely including the removal of any adaptor sequences. CASAVA 1.8+ processes the intensity files directly into fastq:

                  http://seqanswers.com/forums/showthread.php?t=13147
                  Thanks, gringer! Yeah, I knew about this, it's just an old script. Just passing along the information I had!


                  • brachysclereid
                    Member
                    • Feb 2011
                    • 32

                    #10
                    Ideas on q scores

                    Thanks!

                    I used the biopython suggestion and now have the .qual files. This is what I wanted.

                    kwyattm,
                    Is there a tool that will take a random sample of the .qual file in R for the purpose of plotting? I am curious about what you are using to make the plots.

                    Thanks again!


                    • gringer
                      David Eccles (gringer)
                      • May 2011
                      • 845

                      #11
                      I used the biopython suggestion and now have the .qual files. This is what I wanted.
                      Just as a word of caution, you need to make sure the quality base (the ASCII offset) is correct. Different sequencers have in the past used different bases / ASCII values to represent the same qualities.
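
One rough way to check the quality base before converting is to look at the range of characters in the quality lines. The sketch below is a heuristic only, and `guess_phred_offset` is a hypothetical name. It relies on two widely documented conventions: Phred+64 encodings (including the older Solexa scale) never use characters below ASCII 59, and Illumina 1.8+ Phred+33 data normally tops out at 'J' (ASCII 74). Files containing only mid-range characters stay ambiguous, and Phred+33 data from other platforms can exceed 'J'.

```python
def guess_phred_offset(quality_lines):
    """Return 33, 64, or None (ambiguous) from the characters seen in quality lines."""
    codes = [ord(c) for line in quality_lines for c in line.strip()]
    if not codes:
        return None  # nothing to inspect
    if min(codes) < 59:
        return 33  # characters this low cannot occur in a +64 encoding
    if max(codes) > 74:
        return 64  # assumption: beyond 'J' is unusual for Illumina Phred+33
    return None  # every character fits both encodings


# Example: 'I' is Q40 under Phred+33, and '#' (ASCII 35) only exists there.
print(guess_phred_offset(["IIII#", "IIHG!"]))
```

It is best to scan a few thousand reads rather than a handful, since low-quality characters can be rare.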

                      Is there a tool that will take a random sample of the .qual file in R for the purpose of plotting?
                      You can randomly sample data in R by using the 'sample' function, but boxplot should be able to manage with the full dataset. There's also a FASTX-Toolkit script for plotting quality statistics (fastq_quality_boxplot_graph.sh), just in case you want something that's already been made by someone else.
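
If the file is too large to load before sampling, the subset can also be drawn while streaming through it. The following is a minimal sketch of reservoir sampling (Algorithm R), offered as an alternative to R's 'sample' and not something used in this thread; `reservoir_sample` and its parameters are illustrative names.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(items):
        if n < k:
            sample.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = item  # keep the new item with probability k / (n + 1)
    return sample


# Example: keep 5 of 100 records without holding all 100 in memory at once.
print(reservoir_sample(range(100), 5, seed=42))
```

Here `items` can be any iterable, for example the lines of a .qual file or parsed FASTQ records, so memory use stays proportional to `k` and not to the file size.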


                      • Dario1984
                        Senior Member
                        • Jun 2011
                        • 166

                        #12
                        Since he is working in R, it seems much more straightforward to read it in R.

                        e.g.

                        library(ShortRead)
                        fastqs <- readFastq("/path/to/fastqs")
                        qualities <- quality(fastqs)
