FastQC: A quality control application for FastQ data

  • Fastqc didn't work on my Mac

    I downloaded FastQC v0.7.0 (Mac DMG image) from the website.
    I then installed it on my Mac, but whenever I try to open it, it quits automatically. Is this because my Java version is too old (java version "1.5.0_19")?
    How can I solve this problem? Thanks!



    • Originally posted by zhangpanda View Post
      I downloaded FastQC v0.7.0 (Mac DMG image) from the website.
      I then installed it on my Mac, but whenever I try to open it, it quits automatically. Is this because my Java version is too old (java version "1.5.0_19")?
      How can I solve this problem? Thanks!
      I'm not sure - I don't have a version of OSX that old to test here. Can you try running it manually in a shell to see what happens:

      Open /Applications/Utilities/Terminal.app

      In the terminal window, run:

      /Applications/FastQC.app/Contents/MacOS/JavaApplicationStub

      (replace /Applications with the correct path if you put the program somewhere other than your applications folder).

      You should see an error in your terminal which tells you why the program failed to launch.



      • Originally posted by simonandrews View Post
        I'm not sure - I don't have a version of OSX that old to test here. Can you try running it manually in a shell to see what happens:

        Open /Applications/Utilities/Terminal.app

        In the terminal window, run:

        /Applications/FastQC.app/Contents/MacOS/JavaApplicationStub

        (replace /Applications with the correct path if you put the program somewhere other than your applications folder).

        You should see an error in your terminal which tells you why the program failed to launch.
        Here it is:
        delld3ss0011:~ zhangz$ /Applications/FastQC.app/Contents/MacOS/JavaApplicationStub
        [JavaAppLauncher Error] CallStaticVoidMethod() threw an exception
        Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:242)
        at apple.launcher.LaunchRunner.loadMainMethod(LaunchRunner.java:55)
        at apple.launcher.LaunchRunner.run(LaunchRunner.java:84)
        at apple.launcher.LaunchRunner.callMain(LaunchRunner.java:50)
        at apple.launcher.JavaApplicationLauncher.launch(JavaApplicationLauncher.java:52)
        delld3ss0011:~ zhangz$



        • Originally posted by zhangpanda View Post
          Here it is:
          Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
          I've just rebuilt a new Mac snapshot using a lower compliance setting. Can you download this version and see if it runs on your older system:

          http://www.bioinformatics.bbsrc.ac.u....7.1_devel.dmg
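
For context on the error above: `UnsupportedClassVersionError` generally means the class files were compiled for a newer Java release than the one running them, which would fit a Java 1.5 runtime and a build made with a higher compliance setting. Below is a minimal Python sketch, not part of FastQC, that reads the version from a class-file header. The helper names are made up for illustration, and the version table only covers releases up to Java 8.

```python
import struct

# Class-file major version -> Java release that introduced it.
JAVA_BY_MAJOR = {45: "1.1", 46: "1.2", 47: "1.3", 48: "1.4",
                 49: "5", 50: "6", 51: "7", 52: "8"}


def class_file_version(data: bytes):
    """Return (major, minor) from the first 8 bytes of a .class file."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes of class-file header")
    # Header layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


def minimum_java(data: bytes) -> str:
    """Name the oldest Java release able to load this class file."""
    major, _minor = class_file_version(data)
    return JAVA_BY_MAJOR.get(major, f"unknown (major version {major})")


# Hypothetical header of a class compiled for Java 6 (major version 50).
header = struct.pack(">IHH", 0xCAFEBABE, 0, 50)
print(minimum_java(header))
```

A Java 1.5 runtime accepts major version 49 at most, so a class file reporting 50 would fail to load in the way the stack trace shows.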



          • Originally posted by simonandrews View Post
            I've just rebuilt a new Mac snapshot using a lower compliance setting. Can you download this version and see if it runs on your older system:

            http://www.bioinformatics.bbsrc.ac.u....7.1_devel.dmg
            Yes, it works! Thanks!



            • Originally posted by zhangpanda View Post
              Yes, it works! Thanks!
              Cool. I'll make sure that I use the lower settings for future releases then.



              • FastQC v0.7.1 has been released. This contains a much improved command line interface to the program which should make it easier to include it in analysis pipelines. It also adds a new command line option to manually define the format of an input sequence file rather than letting the program guess from the filename.

                You can get the new version from:

                http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

                [If you don't see the new version of any page hit shift+refresh to force our cache to update]



                • FastQC v0.7.2 is now out at the same address as above. I've fixed a bug which affected libraries where there weren't any unique sequences. I've also added a new command line option to allow a user to specify a custom contaminants file rather than using the default systemwide one.



                  • Fastq results

                    Hi,

                    I find this program really good, but I wish the help files were a bit more detailed. It is sometimes difficult to understand the results of each of the analyses.

                    I would also like to see some sort of DB of example results (good, bad, etc.), so that one would have something to compare against.

                    I have a problem with the duplication level results of my analysis. As is clearly visible in the attached image, I have a very high number of sequences duplicated more than 10 times.

                    As this is my first run with next-generation sequencing methods, I don't understand exactly what this means. In the report summary I have this list of percentages:

                    >>Sequence Duplication Levels fail
                    #Total Duplicate Percentage 67.8215
                    #Duplication Level Relative count
                    1 100.0
                    2 24.090619513028887
                    3 15.179389965349534
                    4 11.182932703513215
                    5 8.729431141911524
                    6 7.257951737961682
                    7 6.00867038550585
                    8 5.357614556303122
                    9 4.614882607952515
                    10+ 96.88767344655594
                    >>END_MODULE

                    If I add all the numbers together I get over 180% of duplicated reads.
                    Q: How can that be?

                    Q: What can be the reason for such a huge number of duplicated sequences?

                    Q: Does it mean that my library is not good? Is there a tool to extract these duplicated sequences?

                    I should mention that I am working with paired-end (PE) Illumina reads, 76 bp long.

                    Can anyone tell me of a way to visualize these duplicated reads?

                    If I understood correctly, I can find out with Picard how many duplicates I have, but is there a way of extracting them?

                    Thanks for any help

                    Assa
                    Attached Files
                    Last edited by frymor; 12-21-2010, 08:13 AM.



                    • Re: Fastq results

                      Hi --

                      High duplication levels typically result from low DNA in the sample (or the fraction size-selected in library preparation) masked by extra PCR cycles. Since FASTQC runs before alignment, it should actually under-estimate duplicates -- more will become apparent when fragments align together on the genome with allowance for sequencing errors. However, I don't know its algorithms. I agree with you about the terseness of the documentation: it wasn't clear to me, either, exactly how to interpret the proportions of sequences duplicated at each level.

                      In particular, I often see the same spike you did at 10 duplicates (using 0.7.0, BTW). Perhaps it's the effect of lumping together all sequences duplicated 10 or more times; perhaps a few sequences (like chrM -- see below) were greatly over-duplicated in the library, or perhaps it's a bug.

                      For a good visual overview of duplication impacts after alignment, I suggest posting BAM files in a web-accessible location and making UCSC custom tracks (bigDataUrl=myURL/myfile.bam visibility="squish"). Then look in your regions of interest for the following pattern: a sparse landscape dominated by "towers" of many reads at identical positions, without a lot of others nearby like you see in a true amplified region. (A confirmatory detail one biologist pointed out to me: are the sequences identical with just occasional random sequencing error differences, instead of expected proportions of multiple alleles at locations you know are heterozygous?)

                      You can also use de-duplication tools such as the one in SAMtools. For single-read data, the Java version Picard is supposed to be superior. However, I don't fully trust these. I'm not sure their algorithms are foolproof -- visually I've seen unexpected effects. Certainly for dense clusters of reads in ChIPped/exon enhanced loci, they remove some real data and skew the results. However undetected duplication will skew algorithms like MACS far worse. Where I suspect duplication, I give researchers both full and deduplicated data as well as custom tracks to help them assess what it means.

                      One more note: skewed RNA or DNA sources can simulate the effects of PCR clonal duplication. The biggest example I know is that whole-RNA preparation methods often yield a significant (20-30%) proportion of mitochondrial RNA. Millions of reads mapped to 16K "chrM" looks like duplication to FASTQC and other deduplication programs, but it says nothing about duplication in regions you care about.

                      Net, net: be aware of duplication and warn researchers if it might compromise their results -- a thankless task, but it's better they hear bad news immediately. Trying too hard to draw conclusions from inadequate data costs both time and credibility. Rather, work with the lab people to get better results next time.

                      Cheers!
                      Howie



                      • Originally posted by frymor View Post
                        If I add all the numbers together I get over 180% of duplicated reads.
                        Q: how can that be?
                        Because the percentages are relative to the number of unique sequences, so if you have more duplicated sequences than unique ones then you get totals >100%. We do it this way so that the plot still shows useful information even when you have a single sequence (say an adapter) which makes up a high proportion of your library. The overall figure for duplication levels in the library is given in the header of the plot.
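
To make the arithmetic concrete, here is a small Python sketch (not FastQC's own code) that sums the relative counts posted above. Level 1 is fixed at 100 because every other level is scaled against it, so the levels from 2 upwards can add up to more than 100. The function name is hypothetical.

```python
# Relative counts copied from the "Sequence Duplication Levels" module posted above.
MODULE_TEXT = """\
1 100.0
2 24.090619513028887
3 15.179389965349534
4 11.182932703513215
5 8.729431141911524
6 7.257951737961682
7 6.00867038550585
8 5.357614556303122
9 4.614882607952515
10+ 96.88767344655594
"""


def parse_levels(text):
    """Return {duplication level label: relative count} from module lines."""
    levels = {}
    for line in text.splitlines():
        # Skip blank lines, '#' column headers and '>>' module markers.
        if not line.strip() or line.startswith(("#", ">>")):
            continue
        label, value = line.split()
        levels[label] = float(value)
    return levels


levels = parse_levels(MODULE_TEXT)
# Everything except level 1 is a duplicated sequence, scaled to level 1 = 100.
duplicated_total = sum(v for label, v in levels.items() if label != "1")
print(f"levels 2 to 10+ sum to {duplicated_total:.1f} relative to level 1 = 100")
```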

                        Originally posted by frymor View Post
                        Q: What can be the reason for such a huge number of duplicated sequences?
                        Howie seems to have covered this pretty well. Basically the answer will depend on the type of library you have. For ChIP libraries the answer is usually technical (PCR overamplification). For some other libraries (eg 4C) duplication is expected. For yet others (small RNA?) there may be a higher than usual level of duplication due to overrepresentation of certain genomic regions. In each case the shape of the plot will be different and you should be able to figure out the basic cause for your library.

                        Originally posted by frymor View Post
                        Q: Does it mean that my library is not good? Is there a tool to extract these duplicated sequences?
                        It means that you're not making the best use of the sequencing capacity you have because nearly 70% of the sequences you've generated are simply duplicates of something which was already in the library. It's also a warning that if the duplication is technical and biased then you may get artefacts in your analysis.

                        Whether you remove duplicates will depend on the type of library you're working with. If your intention is to map this data to a reference then you don't want to deduplicate until after you've done that since (as Howie pointed out) there will be duplicates which are missed at the sequence level due to sequencing errors artificially increasing diversity.
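
A toy Python illustration of why sequence-level counting misses some duplicates: two reads from the same fragment that differ by one sequencing error count as distinct sequences, but they share a mapped position. The reads, positions and function names are invented for the example, and real deduplication tools use more elaborate criteria.

```python
from collections import Counter

# Hypothetical reads: (sequence, chromosome, mapped start, strand).
READS = [
    ("ACGTACGTAC", "chrX", 1000, "+"),
    ("ACGTACGTAC", "chrX", 1000, "+"),   # exact duplicate
    ("ACGTACGAAC", "chrX", 1000, "+"),   # same fragment, one sequencing error
    ("TTGACCGTAA", "chrX", 2500, "-"),
]


def duplicates_by_sequence(reads):
    """Count reads beyond the first copy of each identical sequence."""
    counts = Counter(seq for seq, _chrom, _start, _strand in reads)
    return sum(n - 1 for n in counts.values())


def duplicates_by_position(reads):
    """Count reads beyond the first at each (chromosome, start, strand)."""
    counts = Counter((chrom, start, strand) for _seq, chrom, start, strand in reads)
    return sum(n - 1 for n in counts.values())


print(duplicates_by_sequence(READS), duplicates_by_position(READS))
```

Before mapping, only the sequence-based count is available, which is why it tends to underestimate duplication.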

                        Originally posted by frymor View Post
                        Can anyone tell me of a way to visualize these duplicated reads?
                        If you load these into any data browser (after assembly or mapping) you'll see your duplicates as towers of reads with exactly the same position. In our downstream analysis package you can even quantitate the level of duplication and visualise it on a genome wide scale if you're really interested in it.



                        • Originally posted by Howie Goodell View Post
                          For a good visual overview of duplication impacts after alignment, I suggest posting BAM files in a web-accessible location and making UCSC custom tracks (bigDataUrl=myURL/myfile.bam visibility="squish"). Then look in your regions of interest for the following pattern: a sparse landscape dominated by "towers" of many reads at identical positions, without a lot of others nearby like you see in a true amplified region. (A confirmatory detail one biologist pointed out to me: are the sequences identical with just occasional random sequencing error differences, instead of expected proportions of multiple alleles at locations you know are heterozygous?)
                          I have a basic question, but I think it is quite important for my analysis.
                          I am using two FASTQ files, one for each genotype. After the QC I ran bowtie on each of these files.
                          When I try to load the data into the UCSC genome browser, it always times out before the upload finishes.
                          Q: Is it possible to run bowtie separately for each chromosome, or is it better to do one run for the complete file and then split the SAM/BAM files into single chromosomes?

                          Q: What about the subsequent analysis (tophat, cufflinks)? Is it preferable to run them on separate chromosomes or on the complete genome? I am asking not only because of the file size but also for the correctness of the analysis.

                          Originally posted by Howie Goodell View Post
                          You can also use de-duplication tools such as the one in SAMtools.
                          Do you mean the rmdup option?

                          Originally posted by Howie Goodell View Post
                          For single-read data, the Java version Picard is supposed to be superior.
                          Picard also supports paired-end data, and I tried to run it with my data. But I can only mark the duplicates, not mask them.
                          I am still looking for a way of extracting these duplicates, so that I can calculate the true coverage of my library and check its quality.
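
On extracting marked duplicates: the SAM format reserves flag bit 0x400 for "PCR or optical duplicate", and that is the bit a duplicate-marking step sets. Below is a minimal Python sketch, assuming the marked alignments are available as SAM text (for example converted from BAM beforehand). The function name and example records are hypothetical, and this does not reproduce how Picard or SAMtools work internally.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def split_duplicates(sam_lines):
    """Split SAM records into (header, unique, duplicates) lists of lines."""
    header, unique, duplicates = [], [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        (duplicates if flag & DUPLICATE_FLAG else unique).append(line)
    return header, unique, duplicates


# Hypothetical records: flag 1024 has only the duplicate bit set, flag 0 is a plain forward read.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t0\tchrX\t1000\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t1024\tchrX\t1000\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
header, unique, duplicates = split_duplicates(example)
print(len(unique), "unique,", len(duplicates), "marked as duplicates")
```

Writing the two record lists to separate files would give one set for coverage estimates and one for inspecting the duplicates themselves.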

                          Originally posted by simonandrews View Post
                          Basically the answer will depend on the type of library you have. For ChIP libraries the answer is usually technical (PCR overamplification). For some other libraries (eg 4C) duplication is expected. For yet others (small RNA?) there may be a higher than usual level of duplication due to overrepresentation of certain genomic regions. In each case the shape of the plot will be different and you should be able to figure out the basic cause for your library.
                          I am working with mRNA-Seq and am trying to find differentially regulated genes between a wild type and a mutant genotype. For that reason I expect genes with high expression to be found more than once. Genes with lower expression will be found less often.
                          My problem is how to identify PCR amplification and how to distinguish it from high expression of the genes.

                          Originally posted by simonandrews View Post
                          Whether you remove duplicates will depend on the type of library you're working with. If your intention is to map this data to a reference then you don't want to deduplicate until after you've done that since (as Howie pointed out) there will be duplicates which are missed at the sequence level due to sequencing errors artificially increasing diversity.
                          If you load these into any data browser (after assembly or mapping) you'll see your duplicates as towers of reads with exactly the same position. In our downstream analysis package you can even quantitate the level of duplication and visualise it on a genome wide scale if you're really interested in it.
                          I have downloaded SeqMonk and looked at my data. The image I posted gives an overview of part of chromosome X from the D. melanogaster genome. As you can see, I don't have the sparse landscape behaviour Howie spoke about.

                          Maybe it is a very naive question, maybe even a bit silly, but I would like to know how the number of PCR cycles influences the read duplication numbers.
                          My data set contains the expression profiles of Drosophila genes. I thought I should expect more reads for highly expressed genes and therefore find more duplicates at these positions.
                          Q: Does it make sense to extract the duplicated reads and then to look for differentially regulated expression?

                          I hope this is not too much, and I will be very grateful for your help.

                          Thanks
                          Assa
                          Attached Files



                          • Originally posted by frymor View Post
                            I have downloaded SeqMonk and looked at my data. The image I posted gives an overview of part of chromosome X from the D. melanogaster genome. As you can see, I don't have the sparse landscape behaviour Howie spoke about.
                            Actually from the screenshot you posted you can't really tell whether you're seeing sparse data. You won't get completely isolated peaks on a chromosome level, but rather when you look at an individual exon you won't see smooth coverage over the whole exon but will see a small number of positions where most reads sit.

                            If you want to look at this quantitatively in SeqMonk then do Data > Quantitation > Coverage Depth Quantitation and then find the Max depth for Exact overlaps and Express as % of all reads. This will tell you what proportion of all of the reads in a given exon are coming from potential PCR duplicates. For all exons where you have a reasonable number of reads (say >30) you should be seeing values of only a few percent. For heavily duplicated experiments you can see values going way higher than that.
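
The measure described above can be approximated outside SeqMonk. Below is a rough Python sketch, assuming each read is reduced to a (start, end, strand) tuple within one exon. It is one interpretation of "max depth for exact overlaps as % of all reads", not SeqMonk's implementation, and the names and numbers are made up.

```python
from collections import Counter


def max_exact_overlap_percent(reads):
    """Percentage of reads in a region that share the single most common
    exact position (start, end, strand). Returns 0.0 for an empty region."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    return 100.0 * max(counts.values()) / len(reads)


# Hypothetical exon with 40 reads spread evenly over 40 start positions ...
diverse = [(1000 + i, 1075 + i, "+") for i in range(40)]
# ... versus one where 30 of 40 reads sit at exactly the same position.
towered = [(1000, 1075, "+")] * 30 + [(1010 + i, 1085 + i, "+") for i in range(10)]

print(max_exact_overlap_percent(diverse), max_exact_overlap_percent(towered))
```

The first exon gives a value of a few percent, the second a value far above that, which is the contrast to look for.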


                            Originally posted by frymor View Post
                            Maybe it is a very naive question, maybe even a bit silly, but I would like to know how the number of PCR cycles influences the read duplication numbers.
                            We've tried to look at this in some of our data. Our conclusion is that it isn't just a simple case that the number of cycles determines duplication level. For some samples the problem seems to be the diversity in the starting material, ie that if you have too low an amount of starting material then the high duplication level is fixed almost immediately when you start your PCR, and reducing cycles won't help. For other samples you can increase duplication levels by adding more cycles but you need to go really over the top to completely bias an otherwise diverse sample. There may also be effects associated with the PCR conditions you use, but we haven't really gone into that.

                            Originally posted by frymor View Post
                            Q: Does it make sense to extract the duplicated reads and then to look for differentially regulated expression?
                            We've certainly done that in the cases where we had a ridiculously high level of duplication, and we got sensible results. It's not something we'd routinely do for a sample which looked to be diverse though.



                            • Thanks for the fast response
                              Originally posted by simonandrews View Post
                              Actually from the screenshot you posted you can't really tell whether you're seeing sparse data. You won't get completely isolated peaks on a chromosome level, but rather when you look at an individual exon you won't see smooth coverage over the whole exon but will see a small number of positions where most reads sit.

                              If you want to look at this quantitatively in SeqMonk then do Data > Quantitation > Coverage Depth Quantitation and then find the Max depth for Exact overlaps and Express as % of all reads. This will tell you what proportion of all of the reads in a given exon are coming from potential PCR duplicates. For all exons where you have a reasonable number of reads (say >30) you should be seeing values of only a few percent. For heavily duplicated experiments you can see values going way higher than that.
                              I have added two screenshots here, taken from smaller portions of chromosomes X and 3R respectively.
                              I also quantified the data as you described, after defining the probes using the feature probe generator and designing them around mRNA features. This gave me ~167K probes.

                              It is clearly visible that most of the reads with a high percentage have very low depth. The groups of reads with deeper coverage usually have a very low percentage.
                              According to your description, and as far as I understand it, it is highly likely that these reads are not PCR duplicates but true deep coverage of the mRNA.

                              Another question that comes up when looking at the data concerns the blocks I marked in yellow. What are these regions of reads? They are not genes, so they can't be mRNA or CDS.
                              Are these repeats which were mapped at random, according to the bowtie/bwa/etc. settings, and therefore presumably to the wrong place?

                              Are there any other suggestions as to what kind of reads these are?

                              I would like to mention again that I am working with paired-end RNA-seq obtained from polyA purification.
                              Q: Can these reads have something to do with polyA tail residues?
                              Q: Are these reads maybe ncRNA with a polyA tail, or rRNA which slipped through? Is there a way to test such a theory?
                              Attached Files
                              Last edited by frymor; 01-04-2011, 02:23 AM. Reason: another question



                              • Dear Simon,

                                Like many others, I want to thank you for your excellent program, which made its way right away into our NGS data analysis pipeline.

                                I would like to propose a slight improvement in the Per Base GC Content and Per Base Sequence Content plots. Would it be possible to add horizontal grid lines to those plots as well? It would make the visual interpretation of the plots easier.

                                Yilong
