  • lynnco2008
    Junior Member
    • Feb 2012
    • 4

    Test for RNAseq data preprocessing step (with regards to adapter and hexamer)

    Hi, everyone. This is my first thread on this forum, so please forgive me if I ask some naive questions.

    My question is at the bottom of this thread.

    I am currently working on RNA-seq data, using the tophat + cufflinks pipeline from this paper. I ran FastQC on my RNA-seq data and pasted some pictures from the FastQC report here. The experiment is designed to compare gene expression and splicing isoforms.





    I ran some tests; each dataset was preprocessed individually. (I extracted 500,000 sequences from the forward fastq and 500,000 from the reverse fastq, so 500,000*2 reads in total.)

    1. Do nothing (1,000,000 sequences : 82.2%)
    822406 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    822406 + 0 mapped (100.00%:-nan%)
    822406 + 0 paired in sequencing
    413414 + 0 read1
    408992 + 0 read2
    718716 + 0 properly paired (87.39%:-nan%)
    776272 + 0 with itself and mate mapped
    46134 + 0 singletons (5.61%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    2. Only Trim first 15 bases (1,000,000 sequences : 82.6%)
    826128 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    826128 + 0 mapped (100.00%:-nan%)
    826128 + 0 paired in sequencing
    414760 + 0 read1
    411368 + 0 read2
    721394 + 0 properly paired (87.32%:-nan%)
    777296 + 0 with itself and mate mapped
    48832 + 0 singletons (5.91%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    3. Only Remove adapter (424612+470223=894,835 sequences : 29.6%)
    264949 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    264949 + 0 mapped (100.00%:-nan%)
    264949 + 0 paired in sequencing
    140743 + 0 read1
    124206 + 0 read2
    42 + 0 properly paired (0.02%:-nan%)
    50210 + 0 with itself and mate mapped
    214739 + 0 singletons (81.05%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    4. Trim first 15 bases and remove adapter (454648+468550=923,198 sequences : 29.4%)

    271326 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    271326 + 0 mapped (100.00%:-nan%)
    271326 + 0 paired in sequencing
    131398 + 0 read1
    139928 + 0 read2
    98 + 0 properly paired (0.04%:-nan%)
    48556 + 0 with itself and mate mapped
    222770 + 0 singletons (82.10%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    My question is: with an overall good quality score at each position, is it really necessary to remove adapters or to trim the first 15 bases (the bias caused by hexamer priming)? The samtools flagstat results show that if I remove adapters, I lose a large amount of data. If I trim the first 15 bases, I get a slightly better mapping rate (only a 0.4% improvement), but I lose 15 bases from each read!
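    For anyone checking the percentages, here is a minimal sketch of the arithmetic, assuming the percentage in each test heading is mapped reads divided by the number of sequences FastQC reported for the input, and that the properly paired percentage is relative to mapped reads. The dictionary keys and function names are my own labels, not anything produced by samtools.

```python
# Counts copied from the flagstat output and FastQC totals quoted in this post.
# The dictionary layout and names are illustrative only.
TESTS = {
    "do nothing": {"input": 1_000_000, "mapped": 822_406, "proper": 718_716},
    "trim 15": {"input": 1_000_000, "mapped": 826_128, "proper": 721_394},
    "remove adapter": {"input": 894_835, "mapped": 264_949, "proper": 42},
    "trim 15 + remove adapter": {"input": 923_198, "mapped": 271_326, "proper": 98},
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def summarize(tests):
    """Mapping rate relative to input reads; proper pairs relative to mapped reads."""
    return {
        name: {
            "mapped_pct": pct(c["mapped"], c["input"]),
            "proper_pct": pct(c["proper"], c["mapped"]),
        }
        for name, c in tests.items()
    }


if __name__ == "__main__":
    for name, row in summarize(TESTS).items():
        print(f"{name}: mapped {row['mapped_pct']}%, properly paired {row['proper_pct']}%")
```

    Under those assumptions the difference between test 2 and test 1 comes out at roughly 0.4 percentage points, which is the improvement mentioned above.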

    Several things to mention here.
    1. Adapter removal was performed with fastx_clipper. The adapter sequence was taken from the corresponding adapter entry in the FastQC contaminants.txt. I discarded trimmed sequences shorter than 20 bp after clipping.
    2. I used fastx_trimmer to trim the first 15 bases of each read.
    3. For the last test, "Trim first 15 bases and remove adapter", trimming was the first step and adapter removal was the second.
    4. The number of sequences was taken from the FastQC "Basic Statistics" table.
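    To make the order of operations in test 4 concrete, here is a rough Python sketch of the same steps: trim the first 15 bases, clip at the adapter, then drop reads shorter than 20 bp. It uses exact string matching for the adapter, which is a simplification and not how fastx_clipper itself matches adapters. The function names and the record layout (name, sequence, quality) are assumptions made for illustration.

```python
def trim_start(seq, qual, n=15):
    """Drop the first n bases and the matching quality characters."""
    return seq[n:], qual[n:]


def clip_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def preprocess(records, adapter, trim=15, min_len=20):
    """Trim first, clip second, then keep only reads of at least min_len bases.

    records is an iterable of (name, sequence, quality) tuples.
    """
    for name, seq, qual in records:
        seq, qual = trim_start(seq, qual, trim)
        seq, qual = clip_adapter(seq, qual, adapter)
        if len(seq) >= min_len:
            yield name, seq, qual


if __name__ == "__main__":
    # Made-up reads and a made-up adapter, for illustration only.
    demo_adapter = "GGGTTT"
    demo = [
        ("r1", "A" * 15 + "C" * 25 + demo_adapter + "A" * 4, "I" * 50),
        ("r2", "A" * 15 + "C" * 10 + demo_adapter + "A" * 19, "I" * 50),
    ]
    for name, seq, qual in preprocess(demo, demo_adapter):
        print(name, len(seq))
```

    In the demo, the second read is dropped because only 10 bases remain after trimming and clipping.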

    Regards

    Lynn
  • lynnco2008
    Junior Member
    • Feb 2012
    • 4

    #2
    I should have changed the icon of the title from ^_^ to ?.

    Comment

    • minoru_harvest
      Junior Member
      • Aug 2012
      • 5

      #3
      I'm new here. Is anyone here able to help?

      Comment

      • mceachin
        Junior Member
        • May 2010
        • 5

        #4
        I'm not sure I know the right answers and I'd like to hear other folks' ideas.

        For the question about removing the first 15 bases, my impression is that the bias introduced by non-random hexamer priming is not changed by trimming. Trimming the sequences makes the fastqc report look better but the sequences that are in the sample are still the product of the non-random hexamer priming.
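        As a rough illustration of what the per base sequence content plot tabulates, here is a minimal sketch that counts base fractions at each of the first few read positions. The function name and the default of 20 positions are my own choices, and this is not FastQC's code. Trimming only changes which positions get tabulated; it does not change which fragments were primed and sequenced.

```python
from collections import Counter


def per_position_composition(seqs, n_positions=20):
    """Return, for each of the first n_positions, the fraction of A/C/G/T.

    Fractions are relative to all bases observed at that position, so any
    N calls lower the four reported fractions slightly.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1

    composition = []
    for c in counts:
        total = sum(c.values())
        composition.append(
            {b: (c[b] / total if total else 0.0) for b in "ACGT"}
        )
    return composition


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    reads = ["ACGTAC", "AAGTAC", "ACGAAC"]
    for pos, fractions in enumerate(per_position_composition(reads, 6), start=1):
        print(pos, fractions)
```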

        Note that, if this is true, we have been incorporating biased read counts into the tuxedo analysis - though maybe the biases cancel out.

        Any other observations on this?

        With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis. In case you are not aligning to a reference transcriptome, you may still have to remove them separately.

        Comment

        • westerman
          Rick Westerman
          • Jun 2008
          • 1104

          #5
          Originally posted by mceachin
          With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis.
          That is my understanding and experience as well.

          In case you are not aligning to a reference transcriptome, you may still have to remove them separately.
          I would replace the word "may" with the word "must". But I suppose it does depend on which program you will be using for de novo analysis.

          Comment

          • amaurizio
            Junior Member
            • Sep 2012
            • 4

            #6
            Hello all, this is my first post. Can anybody tell me how the ligation of the random hexamers works in the priming for the 1st strand cDNA synthesis reaction during RNA-Seq (Illumina TruSeq kit)?
            In the FastQC report, the per base sequence content looks much better after the first 13 bp. I wonder whether the bias is caused by priming with random hexamers.
            We thought that, since the hexamers are short random sequences in a mix, they bind randomly to the fragment, maybe with some mismatches, at different positions. The ones that bind at the beginning of the sequence will produce the desired strand; the others will generate short sequences that are lost in the subsequent purification steps. It would not be unexpected to see biases in per base nucleotide content in the first 6 bases of the read, but what about the next 7 bases? The bias in the first 13 bp is probably generated by hexamer dimers! These are all hypotheses that need to be validated.
            Maybe part of the problem is due to the sequencer amplifying the signal too much when it encounters several bases one after another.
            I read something about this here "http://ethanomics.wordpress.com/2012/03/12/more-thoughts-on-the-truseq-rna-sample-prep-kit/" and there "http://nar.oxfordjournals.org/content/38/12/e131.short".
            Can you tell me something more?
            Thanks for any help anyone can provide!

            Comment
