
  • Test for RNAseq data preprocessing step (with regards to adapter and hexamer)

    Hi, everyone. This is my first thread on this forum, so please forgive me if I ask some naive questions.

    My question is at the bottom of this thread.

    I am currently working on RNA-seq data. I am using the tophat + cufflinks pipeline from this paper. I ran FastQC on my RNA-seq data and pasted some pictures from the FastQC report here. The experiment is designed to compare gene expression and splicing isoforms.





    I ran some tests, with the data for each test preprocessed separately. (I extracted 500,000 sequences from the forward fastq and 500,000 from the reverse fastq, so 500,000*2 reads in total.)

    1. Do nothing (1,000,000 sequences : 82.2%)
    822406 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    822406 + 0 mapped (100.00%:-nan%)
    822406 + 0 paired in sequencing
    413414 + 0 read1
    408992 + 0 read2
    718716 + 0 properly paired (87.39%:-nan%)
    776272 + 0 with itself and mate mapped
    46134 + 0 singletons (5.61%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    2. Only Trim first 15 bases (1,000,000 sequences : 82.6%)
    826128 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    826128 + 0 mapped (100.00%:-nan%)
    826128 + 0 paired in sequencing
    414760 + 0 read1
    411368 + 0 read2
    721394 + 0 properly paired (87.32%:-nan%)
    777296 + 0 with itself and mate mapped
    48832 + 0 singletons (5.91%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    3. Only Remove adapter (424612+470223=894,835 sequences : 29.6%)
    264949 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    264949 + 0 mapped (100.00%:-nan%)
    264949 + 0 paired in sequencing
    140743 + 0 read1
    124206 + 0 read2
    42 + 0 properly paired (0.02%:-nan%)
    50210 + 0 with itself and mate mapped
    214739 + 0 singletons (81.05%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    4. Trim first 15 bases and remove adapter (454648+468550=923,198 sequences : 29.4%)

    271326 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    271326 + 0 mapped (100.00%:-nan%)
    271326 + 0 paired in sequencing
    131398 + 0 read1
    139928 + 0 read2
    98 + 0 properly paired (0.04%:-nan%)
    48556 + 0 with itself and mate mapped
    222770 + 0 singletons (82.10%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    My question is: with an overall good quality score at each position, is it really necessary to remove the adapter or to remove the first 15 bases (bias caused by hexamer priming)? The results from samtools flagstat show that if I remove the adapter, I lose a large amount of data. If I do the hexamer trimming, I get a slightly better mapping result (actually only a 0.4 percentage point improvement), but I lose 15 bases from each read!
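
As a rough cross-check of the percentages quoted in the four test headings, the sketch below recomputes them from the counts in the reports above. The numbers are copied from this post; the dictionary layout and the helper name `rate` are only for illustration.

```python
# Counts copied from the four reports in this post.
# input_reads: number of sequences reported by FastQC for each test.
# mapped / properly_paired: taken from the flagstat-style output.
tests = {
    "do nothing":     {"input_reads": 1_000_000, "mapped": 822_406, "properly_paired": 718_716},
    "trim 15":        {"input_reads": 1_000_000, "mapped": 826_128, "properly_paired": 721_394},
    "clip adapter":   {"input_reads": 894_835,   "mapped": 264_949, "properly_paired": 42},
    "trim 15 + clip": {"input_reads": 923_198,   "mapped": 271_326, "properly_paired": 98},
}


def rate(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


summary = {}
for name, c in tests.items():
    summary[name] = {
        # mapped reads as a share of the reads that went into the aligner
        "mapped_pct": rate(c["mapped"], c["input_reads"]),
        # properly paired reads as a share of the mapped reads
        "paired_pct": rate(c["properly_paired"], c["mapped"]),
    }

for name, s in summary.items():
    print(f"{name:15s} mapped {s['mapped_pct']:6.2f}%   properly paired {s['paired_pct']:6.2f}%")
```

Laid out this way, the gain from trimming alone is a fraction of a percentage point, while both adapter-clipped runs lose most of their mapped reads and almost all of their proper pairs.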

    A few things to mention:
    1. Adapter removal was performed with fastx_clipper. The adapter sequences were taken from the corresponding entries in FastQC's contaminants.txt. I discarded clipped sequences shorter than 20 bp.
    2. I used fastx_trimmer to trim the first 15 bases of each read.
    3. For the last test, "Trim first 15 bases and remove adapter", trimming was the first step and adapter removal was the second.
    4. The number of sequences was taken from the FastQC "Basic Statistics" table.
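
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what they do to a single read. It is not how fastx_trimmer or fastx_clipper are implemented: the function names are hypothetical, the adapter is a placeholder, and exact string matching stands in for the mismatch-tolerant matching a real clipper would use.

```python
# Placeholder adapter; substitute the sequence reported for your own data.
ADAPTER = "AGATCGGAAGAGC"


def trim_first_bases(seq, qual, n=15):
    """Drop the first n bases and their quality characters from one read."""
    return seq[n:], qual[n:]


def clip_adapter(seq, qual, adapter=ADAPTER, min_len=20):
    """Cut the read at the first exact occurrence of the adapter.

    Returns None when the remaining read is shorter than min_len,
    mirroring the "discard if shorter than 20 bp" rule described above.
    """
    pos = seq.find(adapter)
    if pos != -1:
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_len:
        return None
    return seq, qual


# Same order as test 4: trim first, then clip.
read_seq = "ACGT" * 10 + ADAPTER
read_qual = "I" * len(read_seq)
trimmed_seq, trimmed_qual = trim_first_bases(read_seq, read_qual)
result = clip_adapter(trimmed_seq, trimmed_qual)
print(result)
```

In this toy example the 40 bp insert is reduced to 25 bp by the two steps, so it survives the 20 bp cutoff; a read with an insert shorter than 35 bp would be discarded.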

    Regards

    Lynn

  • #2
    I should have changed the icon of the title from ^_^ to ?.



    • #3
      I'm new here. Is anyone here able to help?



      • #4
        I'm not sure I know the right answers and I'd like to hear other folks' ideas.

        For the question about removing the first 15 bases, my impression is that the bias introduced by non-random hexamer priming is not changed by trimming. Trimming the sequences makes the fastqc report look better but the sequences that are in the sample are still the product of the non-random hexamer priming.

        Note that, if this is true, we have been incorporating biased read counts into the tuxedo analysis - though maybe the biases cancel out.

        Any other observations on this?

        With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis. In case you are not aligning to a reference transcriptome, you may still have to remove them separately.



        • #5
          Originally posted by mceachin:
          "With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis."
          That is my understanding and experience as well.

          "In case you are not aligning to a reference transcriptome, you may still have to remove them separately."
          I would replace the word "may" with the word "must". But I suppose it does depend on which program you will be using for de novo analysis.



          • #6
            Hello all, this is my first post. Can anybody tell me how the binding of the random hexamers works in the priming of the first-strand cDNA synthesis reaction during RNA-Seq library preparation (Illumina TruSeq kit)?
            In the FastQC report the per base sequence content looks much better after the first 13 bp. I wonder if the bias is caused by priming with random hexamers?
            We thought that, since the hexamers are short random sequences in a mix, they bind randomly to the fragment at different positions, maybe with some mismatches. The ones that bind at the beginning of the sequence will produce the desired strand; the others will generate short sequences that are lost in the subsequent purification steps. It would not be unexpected to see biases in per base nucleotide content in the first 6 bases of the read, but what about the next 7 bases? The bias in the first 13 bp is probably generated by hexamer dimers! These are all hypotheses that need to be validated.
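
For anyone who wants to look at this bias outside of FastQC, the per base sequence content can be recomputed directly from the read sequences. The sketch below is only an illustration of that calculation; the function name `per_base_content` and the toy reads are made up, and FastQC's own implementation may differ in detail.

```python
from collections import Counter


def per_base_content(reads, n_positions=20):
    """Fraction of A, C, G and T at each of the first n_positions of the reads.

    Other characters (for example N) count towards the total at a position
    but are not reported. Positions beyond every read report 0.0 for all bases.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1

    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return fractions


# Toy input; in practice the sequences would come from the fastq file.
toy_reads = ["ACGT", "ACGA", "TCGA"]
content = per_base_content(toy_reads, n_positions=5)
for pos, frac in enumerate(content, start=1):
    print(pos, frac)
```

Comparing the fractions at positions 1 to 13 with those further into the read shows how far the skew extends in a given library.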
            Maybe part of the problem is due to the sequencer amplifying the signal too much when it meets several bases one next to the other.
            I read something about this here "http://ethanomics.wordpress.com/2012/03/12/more-thoughts-on-the-truseq-rna-sample-prep-kit/" and here "http://nar.oxfordjournals.org/content/38/12/e131.short"
            Can you tell me something more?
            Thanks for any help anyone can provide!
