  • 454 amplicon quality filtering

    Can anyone point out software which can quality filter 454 data, and isn't Linux based i.e. easy for a non-Linux user to use?

    I have been using the Galaxy portal, but its 454 filtering function only retrieves high-quality segments. I want to retrieve all amplicons with an average quality score. Galaxy can only retrieve full amplicons with every single base above Q20.

    Any ideas?

    Cheers,

    J

  • #2
    Are you looking for an off the shelf solution, or are you willing to write some code? You could use the Biopython SFF support to do whatever quality filtering you have in mind, but you would have to write this yourself.

    Also what do you mean by "all amplicons with an average quality score"? Do you mean a mean/median over the whole of each (trimmed) read sequence? That doesn't seem so helpful to me.
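A minimal sketch of the Biopython route suggested in this reply, assuming Biopython is installed. The function names, the file paths passed in, and the default threshold of 20 are illustrative assumptions, not part of Biopython or of the original post. It keeps each whole read whose mean Phred score reaches the threshold and writes the survivors as FASTA.

```python
from statistics import mean


def passes_mean_quality(quals, threshold=20):
    """Return True if the mean Phred score of a read reaches the threshold."""
    return len(quals) > 0 and mean(quals) >= threshold


def filter_sff_by_mean_quality(sff_path, fasta_path, threshold=20):
    """Write reads from an SFF file to FASTA if their mean quality passes.

    Biopython is imported inside the function so that the helper above can
    be used without it. The "sff" format yields full-length reads; the
    "sff-trim" format would apply the clipping stored in the SFF file.
    """
    from Bio import SeqIO  # third-party dependency: Biopython

    kept = (
        record
        for record in SeqIO.parse(sff_path, "sff")
        if passes_mean_quality(record.letter_annotations["phred_quality"], threshold)
    )
    # SeqIO.write returns the number of records it wrote.
    return SeqIO.write(kept, fasta_path, "fasta")
```

Because the filter only decides whether to keep or drop a read, the sequence itself (barcodes included) is written out unchanged.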

    • #3
      I am not looking for either. I would like a user-friendly freeware package that does the trick. I can use the Roche off-instrument package, but with no experience in Linux, just installing the software is tedious.

      And yes, I mean pull out the sequences (untrimmed) which have a mean quality score of Q20. I would also want to pull sequences which have a minimum of Q20 for each base over 95% of the sequence.

      There seem to be options like these for FASTQ files, but I run into issues when combining FASTA and QUAL files into FASTQ and then converting back to FASTA.
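The two criteria described in this post can be sketched in plain Python, working directly on FASTA + QUAL text so that no round trip through FASTQ is needed. This is an illustrative sketch, not an existing tool: the function names and the defaults (Q20, 95%) are assumptions taken from the wording of the post, and read IDs are assumed to be the first word of each header line.

```python
def read_fasta_like(lines):
    """Parse FASTA or QUAL text into {read_id: [body lines]}, keeping order."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID is the first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return records


def mean_at_least(quals, threshold=20):
    """Criterion 1: mean Phred score over the whole (untrimmed) read."""
    return bool(quals) and sum(quals) / len(quals) >= threshold


def fraction_at_least(quals, threshold=20, min_fraction=0.95):
    """Criterion 2: at least min_fraction of bases score >= threshold."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= threshold)
    return good / len(quals) >= min_fraction


def filter_fasta_qual(fasta_lines, qual_lines, keep):
    """Yield (read_id, sequence) for reads whose quality list satisfies keep()."""
    seqs = read_fasta_like(fasta_lines)
    quals = read_fasta_like(qual_lines)
    for read_id, seq_lines in seqs.items():
        scores = [int(q) for row in quals.get(read_id, []) for q in row.split()]
        if keep(scores):
            yield read_id, "".join(seq_lines)
```

For example, `filter_fasta_qual(open("reads.fna"), open("reads.qual"), fraction_at_least)` (hypothetical file names) would yield the untouched sequences of the reads that pass, ready to be written back out as FASTA.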

      • #4
        Originally posted by JackieBadger
        There seems to be these options available for FASTQ formats...but it seems I run into issues when combining FASTA and Qual files into FASTQ and then converting back to FASTA.
        Problems in Galaxy converting between FASTA+QUAL and FASTQ? That should work fine.

        Personally I'd go SFF to FASTQ, and then stick with FASTQ for your filtering and trimming (rather than trying to use FASTA+QUAL).
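To illustrate why a FASTA+QUAL to FASTQ conversion should not alter the bases, here is a small sketch of the Sanger-style FASTQ encoding (Phred score plus an ASCII offset of 33). The function names are illustrative assumptions; the point is only that the sequence line is copied as-is and the scores are re-encoded one character per base, so a correct converter has no reason to drop barcodes.

```python
def to_fastq(read_id, sequence, quals, offset=33):
    """Format one read as a four-line FASTQ record (Sanger / Phred+33)."""
    if len(sequence) != len(quals):
        raise ValueError("sequence and quality lengths differ for " + read_id)
    encoded = "".join(chr(q + offset) for q in quals)
    return "@{}\n{}\n+\n{}\n".format(read_id, sequence, encoded)


def from_fastq(record, offset=33):
    """Recover (read_id, sequence, quals) from a four-line FASTQ record."""
    header, sequence, _plus, encoded = record.rstrip("\n").split("\n")
    return header[1:], sequence, [ord(c) - offset for c in encoded]
```

If bases do go missing after a round trip, a likely place to look is a trimming or clipping option applied during conversion rather than the format change itself.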

        • #5
          The issue with the Galaxy FASTA+QUAL -> FASTQ -> FASTA route is that my barcodes seem to get removed during these conversions.

          I require my quality-filtered files to be in FASTA format, pre-trimmed, to feed into the MHC-specific barcode sorting program, jMHC.

          I'm not sure that Galaxy allows upload of SFF files; it seems to require FASTA + QUAL.

          So the question still stands, is there an easy way (which doesn't require lots of code) to filter 454 pre-trimmed data?

          Cheers,

          J

          • #6
            The filtering/trimming options for 454 formatted data are a bit more limited than they are for FASTQ formatted data. However, you do have a few options. The first is to get access to a Mac. You only seem averse to Linux, and many of the FASTQ filtering tools work great and are easy to install on Mac OS. I definitely encourage this option if you are planning on doing more genomics research in the future. Many of the best tools for doing these types of things are readily available for Mac OS.

            If you decide to go this route you can easily convert your SFF files to the FASTQ format and process them using FASTX tools. I help run a Workshop on Genomics that developed a tutorial for doing this exact task along with some additional QC manipulations. The tutorial is at: http://www.molecularevolution.org/re..._data_activity and the FASTX tools are available at: http://hannonlab.cshl.edu/fastx_toolkit/download.html. I believe that the fastq_quality_filter command will be able to do what you need it to do. You should also take a look at the fastq_quality_trimmer command which is not as well documented, but may be useful in your situation as it only trims the low-quality sequence at the end of the read.
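As a rough illustration of what end trimming does (as opposed to discarding whole reads), the sketch below removes trailing bases whose score falls below a threshold. It is written for explanation only and is not the FASTX-Toolkit implementation; the function name and the `threshold` and `min_length` parameters are assumptions.

```python
def trim_low_quality_tail(sequence, quals, threshold=20, min_length=1):
    """Remove trailing bases scoring below threshold.

    Returns (sequence, quals) after trimming, or None if the trimmed read
    is shorter than min_length.
    """
    end = len(quals)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quals[:end]
```

Note that for amplicons carrying a barcode at the far end of the read, any trimming of this kind can remove that barcode, which may matter for downstream barcode sorting.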

            Galaxy does accept 454 SFF files. You can upload them and use the tool called "Select High Quality Segments", which is located under NGS TOOLBOX BETA -> ROCHE-454-DATA. I believe you will need to have your data as a separate "reads" file and "quality" file to get everything to work right. You will also need to change your data type to 'quality score' format by clicking on the pencil next to your quality data in order for Galaxy to correctly interpret the uploaded file. You are also limited in file size when using the web interface for Galaxy. If your file is over 2 GB you will need to use the Galaxy FTP service rather than the web upload, or roll your own Galaxy install locally.

            You should also check out FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) and PRINSEQ (http://edwards.sdsu.edu/prinseq_beta/). FastQC doesn't implement any filtering (last I checked at least), but it can be installed on any OS and will allow you to take a good close look at your data. It may be useful for setting your filtering parameters and for assessing the effect of any filtering or trimming you end up doing. PRINSEQ is a web-based tool (it can also be installed locally) that does QC and filtering on next-gen data. It doesn't take 454 SFF files directly, so you would still need to convert to the FASTQ format. But this is probably one of your best options if you are not able to install the other software packages. The PRINSEQ site has a lot of documentation and the paper was just published in Bioinformatics.

            SAH

            • #7
              Thanks a lot for this. Very helpful.

              The Galaxy method you outline is what I have been doing. But "filtering high quality segments" for 454 amplicon data is pretty useless because it invariably mashes up a target sequence and you can lose the barcodes on either end. I know my total amplicon size, so I can select a contiguous segment of this size, but this chomps away half of my data set because of the stringent quality filtering (i.e. every base must be of a certain fixed quality score).
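The effect described here can be seen with a small sketch of a "longest high-quality segment" search, where every base in the segment must reach the threshold. This is an assumption about the general idea, not Galaxy's actual code, and the function name is illustrative.

```python
def longest_high_quality_segment(quals, threshold=20):
    """Return (start, end) of the longest run where every score >= threshold.

    end is exclusive; (0, 0) means no base passed. Ties go to the first run.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= threshold:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a single low base ends the current run
    return best_start, best_end
```

With scores `[30, 30, 30, 10, 30, 30, 30, 30, 30, 30]`, one low base at index 3 makes the segment start at index 4, so the first four bases (where a barcode could sit) are discarded even though the read is otherwise good.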

              I spoke to the authors of FastQC a while back and this was their reply:

              "Whilst you might get some useful information from running 454 sequence through FastQC it wasn't really designed with sequences of that length in mind. The duplicate and overrepresented sequence plots will be pretty meaningless when you have the sorts of per sequence error rates which come from 454, and some of the other plots may be pretty wide!"

              Thanks for the input, and it seems I should attempt to grasp the Mac OS FASTQ route.
              Cheers,

              J

              • #8
                Wow, PRINSEQ is exactly what I'm looking for!

                Thanks a lot Shandley!

                • #9
                  No problem! One word of warning. I have never actually used PRINSEQ before. The publication is new, and I have ambitions to incorporate it into our metagenomics/pathogen discovery pipeline, but just haven't found the time.

                  Good luck. If you have any insights into PRINSEQ after using it I would love to hear about them.

                  Best,

                  SAH

                  • #10
                    Originally posted by JackieBadger
                    I have been using the Galaxy portal, yet their 454 filtering function only retrieves high quality segments. I want to retrieve all amplicons with an average quality score. Galaxy can only retrieve full amplicons with every single base above Q.20.
                    I realize I'm late to the conversation, but I think you can actually do this in Galaxy. Try

                    NGS: QC and manipulation -> Combine FASTA and QUAL into FASTQ

                    and then either
                    NGS: QC and manipulation -> Filter FASTQ reads by quality score and length

                    or
                    NGS: QC and manipulation -> Filter by quality

                    • #11
                      Originally posted by tnabtaf
                      I realize I'm late to the conversation, but I think you can actually do this in Galaxy. Try

                      NGS: QC and manipulation -> Combine FASTA and QUAL into FASTQ

                      and then either
                      NGS: QC and manipulation -> Filter FASTQ reads by quality score and length

                      or
                      NGS: QC and manipulation -> Filter by quality
                      Yep, Galaxy is what I have used in the past, but if you read the posts above you'll see that for some reason the conversions back and forth screw up my amplicon barcodes.

                      PRINSEQ seems to be a great preliminary assessment tool!
