  • fastqc read limit?

    I have a question about FastQC.

    I have 100 bp paired-end reads from a HiSeq 2000: 166,867,542 PE reads in total. When I opened the file in FastQC, it showed only Total Sequences: 16000000.
    I know CASAVA has a read limit like this for ELAND. If that is the case here, where does this read limit come from, and how can I bypass it? Thanks.

  • #2
    My guess is that CASAVA divided the reads into multiple fastq files, with the maximum number of reads per file set to 16 million. So your sample might be spread across multiple files. You can always count the lines in the file (wc -l) and divide by 4 to get the number of reads. There is a CASAVA mode in FastQC (added in version 0.10.0) that handles the multiple fastq files produced by CASAVA.
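The line-count check can also be scripted. This is a minimal sketch in Python, assuming a well-formed FASTQ file with exactly four lines per record (no wrapped sequence lines). The name `count_fastq_reads` is made up for this illustration and is not part of FastQC or CASAVA.

```python
import gzip


def count_fastq_reads(path):
    """Return the number of reads, computed as (number of lines) // 4.

    Hypothetical helper for illustration only.
    """
    # Assumption: a ".gz" suffix means gzip-compressed; anything else is plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Calling `count_fastq_reads("sample.fastq.gz")` (a placeholder file name) gives a number to compare against the Total Sequences value in the FastQC report.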

    Justin



    • #3
      Actually, I concatenated those 12 files into one big file and then opened it in FastQC, but it still showed only 16 million reads.
      michael



      • #4
        Maybe try counting the number of lines in the fastq file, using something like "wc -l", to see if the file has the number of reads you are expecting.



        • #5
          Did you ever figure this out? I have a file with > 24 million reads and the fastqc report is saying 4000000 exactly... It also appears to be bailing out early.

          I'll try upgrading to the latest version.
          Last edited by mgogol; 04-12-2012, 01:43 PM.



          • #6
            This will be because your original file was created by concatenating multiple gzipped files. That places gzip headers throughout the file rather than a single header for all of the data at the top. The core Java gzip decompressor doesn't account for multiple headers within a file, so it reports that the file has finished when it reaches the end of the first compressed block (i.e. the end of the first file in the set). This problem will affect all programs written in Java which use these classes to read gzipped data.
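The effect described above can be reproduced in a few lines. The sketch below is in Python, not the Java code that FastQC uses, and the two tiny FASTQ records are invented test data. It joins two gzip streams byte for byte, as `cat` would, then compares a reader that stops after the first stream with one that keeps going.

```python
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect a gzip header and trailer


def read_first_member(data):
    """Decompress only the first gzip stream and ignore whatever follows it."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data)


def read_all_members(data):
    """Decompress every gzip stream in turn until no bytes remain."""
    chunks = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        chunks.append(d.decompress(data))
        data = d.unused_data  # bytes left over after the end of this stream
    return b"".join(chunks)


# Two separately compressed "files", joined the way `cat a.gz b.gz` joins them.
part1 = gzip.compress(b"@r1\nACGT\n+\nIIII\n")
part2 = gzip.compress(b"@r2\nTTTT\n+\nIIII\n")
joined = part1 + part2

print(read_first_member(joined).count(b"\n") // 4)  # prints 1
print(read_all_members(joined).count(b"\n") // 4)   # prints 2
```

The first reader sees one read and the second sees two, which mirrors a report that stops at exactly the size of the first file in the set.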

            There are a few solutions:
            1. Instead of doing cat *fastq.gz > allfiles.fastq.gz to join your files, do zcat *fastq.gz | gzip -c > allfiles.fastq.gz. This will decompress and recompress the data, so you'll end up with a single compressed block.
            2. Don't join the files together; leave them separate, pass them all to fastqc, and add the --casava option when starting fastqc. This will recombine the files into a single report for you.
            3. Use the development version of fastqc, where I've added a workaround for this. The fix will be in the next release.
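For anyone who would rather script the first option than use the shell, here is a rough Python equivalent of `zcat *fastq.gz | gzip -c`. It is a sketch under the assumption that every input is a valid gzip file; `recompress_as_single_member` is a hypothetical helper name, not something shipped with FastQC.

```python
import gzip
import shutil


def recompress_as_single_member(input_paths, output_path):
    """Decompress each input and write all of the data as one gzip stream."""
    # A single gzip writer produces one header for the whole output file.
    with gzip.open(output_path, "wb") as out:
        for path in input_paths:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

A call might look like `recompress_as_single_member(sorted(glob.glob("*fastq.gz")), "merged/allfiles.fastq.gz")`, with `glob` imported first. Writing the output to a separate directory (here the placeholder `merged/`) keeps it from matching the input pattern on a second run.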


            The development version is here and the new release should be out very soon now.
