  • Test for RNAseq data preprocessing step (with regards to adapter and hexamer)

    Hi, everyone. This is my first thread on this forum, so please forgive me if I ask some naive questions.

    My question is at the bottom of this thread.

    I am currently working on RNA-seq data, using the tophat + cufflinks pipeline from this paper. I ran FastQC on my RNA-seq data and pasted some pictures from the FastQC report here. The experiment is designed to compare gene expression and splicing isoforms.





    I ran some tests, preprocessing the data separately for each one. (I extracted 500,000 sequences from the forward fastq and 500,000 from the reverse fastq, so 500,000*2 reads in total.)
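    The post does not say how the 500,000 sequences were extracted. A minimal Python sketch of one way to do it, assuming plain four-line FASTQ records and taking the first records of each file (the function name and file names are hypothetical):

```python
from itertools import islice


def head_fastq(lines, n_reads):
    """Yield the lines of the first n_reads FASTQ records.

    Assumes the common layout of exactly 4 lines per record
    (header, sequence, '+', qualities).
    """
    return islice(lines, 4 * n_reads)


# Hypothetical usage; the file names are placeholders:
# with open("sample_R1.fastq") as src, open("subset_R1.fastq", "w") as dst:
#     dst.writelines(head_fastq(src, 500_000))
```

    Using the same n_reads for the forward and the reverse file keeps the mates in the same order in both subsets.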

    1. Do nothing (1,000,000 sequences : 82.2%)
    822406 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    822406 + 0 mapped (100.00%:-nan%)
    822406 + 0 paired in sequencing
    413414 + 0 read1
    408992 + 0 read2
    718716 + 0 properly paired (87.39%:-nan%)
    776272 + 0 with itself and mate mapped
    46134 + 0 singletons (5.61%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    2. Only Trim first 15 bases (1,000,000 sequences : 82.6%)
    826128 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    826128 + 0 mapped (100.00%:-nan%)
    826128 + 0 paired in sequencing
    414760 + 0 read1
    411368 + 0 read2
    721394 + 0 properly paired (87.32%:-nan%)
    777296 + 0 with itself and mate mapped
    48832 + 0 singletons (5.91%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    3. Only Remove adapter (424612+470223=894,835 sequences : 29.6%)
    264949 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    264949 + 0 mapped (100.00%:-nan%)
    264949 + 0 paired in sequencing
    140743 + 0 read1
    124206 + 0 read2
    42 + 0 properly paired (0.02%:-nan%)
    50210 + 0 with itself and mate mapped
    214739 + 0 singletons (81.05%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    4. Trim first 15 bases and remove adapter (454648+468550=923,198 sequences : 29.4%)

    271326 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    271326 + 0 mapped (100.00%:-nan%)
    271326 + 0 paired in sequencing
    131398 + 0 read1
    139928 + 0 read2
    98 + 0 properly paired (0.04%:-nan%)
    48556 + 0 with itself and mate mapped
    222770 + 0 singletons (82.10%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    My question is: with an overall good quality score at each position, is it really necessary to remove the adapter or to remove the first 15 bases (the bias caused by hexamer priming)? The results from samtools flagstat show that if I remove adapters, I lose a large amount of data. If I trim the hexamer region, I get a slightly better mapping result (actually only a 0.4% improvement), but I lose 15 bases from each read!
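    For reference, the percentages quoted in the four test headings can be recomputed from the counts in the post. This is only a small Python sketch of the arithmetic; the dictionary layout and the names are my own:

```python
# Counts copied from the four tests above:
# (reads reported by FastQC, reads in the alignment output, properly paired reads)
tests = {
    "do nothing": (1_000_000, 822_406, 718_716),
    "trim 15": (1_000_000, 826_128, 721_394),
    "clip adapter": (894_835, 264_949, 42),
    "trim 15 + clip adapter": (923_198, 271_326, 98),
}


def percent(part, whole, digits):
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100.0 * part / whole, digits)


summary = {
    name: (percent(mapped, total, 1), percent(paired, mapped, 2))
    for name, (total, mapped, paired) in tests.items()
}

for name, (mapped_pct, paired_pct) in summary.items():
    print(f"{name}: {mapped_pct}% mapped, {paired_pct}% of mapped properly paired")
```

    The 0.4% improvement mentioned above is the difference between the mapped percentages of the first two tests.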

    A few things to mention:
    1. Adapter removal was performed using fastx_clipper. The adapter sequences were taken from the corresponding entries in the FastQC contaminants.txt. I discarded trimmed sequences that were shorter than 20 bp after clipping.
    2. I used fastx_trimmer to trim the first 15 bases of each read.
    3. For the last test, "Trim first 15 bases and remove adapter", trimming was the first step and adapter removal was the second.
    4. The number of sequences was taken from the FastQC "basic statistics" table.
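    The two preprocessing steps listed above can be sketched in Python on a single read. This is an illustration only: it uses exact string matching, which is an assumption and not a reproduction of how fastx_clipper or fastx_trimmer work internally, and the adapter string is a placeholder rather than the sequence actually used.

```python
ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter, not taken from the post


def trim_start(seq, qual, n=15):
    """Drop the first n bases and their quality values."""
    return seq[n:], qual[n:]


def clip_adapter(seq, qual, adapter, min_len=20):
    """Cut the read at the first exact adapter match.

    Returns None when the remaining read is shorter than min_len,
    mirroring the 20 bp cutoff described in the post.
    """
    pos = seq.find(adapter)
    if pos != -1:
        seq, qual = seq[:pos], qual[:pos]
    if len(seq) < min_len:
        return None
    return seq, qual


# Toy read: 15 leading bases, a 30-base insert, the adapter, then 5 more bases.
read = "T" * 15 + "C" * 30 + ADAPTER + "G" * 5
quals = "I" * len(read)

trimmed_seq, trimmed_qual = trim_start(read, quals)                # step 1: trim
clipped = clip_adapter(trimmed_seq, trimmed_qual, ADAPTER)         # step 2: clip
too_short = clip_adapter("C" * 10 + ADAPTER, "I" * 23, ADAPTER)    # discarded read
```

    The order matches test 4: trimming first, adapter removal second.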

    Regards

    Lynn

  • #2
    I should have changed the title icon from ^_^ to ?.



    • #3
      I'm new here. Can anyone here help?



      • #4
        I'm not sure I know the right answers and I'd like to hear other folks' ideas.

        For the question about removing the first 15 bases, my impression is that the bias introduced by non-random hexamer priming is not changed by trimming. Trimming the sequences makes the fastqc report look better but the sequences that are in the sample are still the product of the non-random hexamer priming.
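        To make concrete what a per-position base composition profile measures, here is a minimal Python sketch. It is not FastQC's implementation, and the function name and toy reads are made up. Trimming only removes the leading positions from such a profile; it does not change which fragments were sampled in the first place.

```python
from collections import Counter


def base_composition(reads, n_positions=15):
    """Return, for each of the first n_positions, the fraction of A/C/G/T.

    reads is an iterable of sequence strings. Positions that no read
    covers are reported as an empty dict.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        total = sum(position_counts.values())
        if total:
            fractions.append({b: position_counts[b] / total for b in "ACGT"})
        else:
            fractions.append({})
    return fractions


# Toy example with two made-up reads:
profile = base_composition(["ACGT", "ACGA"], n_positions=4)
```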

        Note that, if this is true, we have been incorporating biased read counts into the tuxedo analysis - though maybe the biases cancel out.

        Any other observations on this?

        With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis. In case you are not aligning to a reference transcriptome, you may still have to remove them separately.



        • #5
          Originally posted by mceachin
          With respect to removing adapter sequences, we have been ignoring them in the fastq files. With rare exceptions, they don't align to the reference transcriptome so they do not show up in the accepted_hits.bam files output by tophat. This effectively filters them out without requiring an extra step in the analysis.
          That is my understanding and experience as well.

          In case you are not aligning to a reference transcriptome, you may still have to remove them separately.
          I would replace the word "may" with the word "must". But I suppose it does depend on which program you will be using for de novo analysis.



          • #6
            Hello all, this is my first post. Can anybody tell me how the binding of the random hexamers works when priming the first-strand cDNA synthesis reaction during RNA-Seq library preparation (Illumina TruSeq kit)?
            In the FastQC report, the per base sequence content looks much better after the first 13 bp. I wonder whether the bias is caused by priming with random hexamers.
            We thought that, since the hexamers are short random sequences in a mix, they bind randomly to the fragment at different positions, perhaps with some mismatches. The ones that bind at the beginning of the sequence will produce the desired strand; the others will generate short sequences that are lost in the subsequent purification steps. It would not be unexpected to see biases in per base nucleotide content in the first 6 bases of the read, but what about the next 7 bases? The bias in the first 13 bp is probably generated by hexamer dimers! These are all hypotheses that need to be validated.
            Maybe part of the problem is due to the sequencer, which amplifies the signal too much when it encounters several bases one after the other.
            I read something about this here "http://ethanomics.wordpress.com/2012/03/12/more-thoughts-on-the-truseq-rna-sample-prep-kit/" and here "http://nar.oxfordjournals.org/content/38/12/e131.short".
            Can you tell me something more?
            Thanks for any help anyone can provide!

