  • TopHat output very large read lengths from RNA-Seq

    I have a problem with TopHat v2.1 outputting read lengths that are extremely large, which causes problems downstream. I extracted the read lengths from the SAM file and plotted a histogram: the majority of my reads are of normal size, but a few (< 1%) are 10,000-100,000 bp. As a result, my data contain long stretches of a chromosome with a constant signal over 100,000 bp (basically a straight line across the entire region).

    Does TopHat have a way to filter out these erroneous reads? I have written a Python script to filter them out, but I'd rather do it in one step than go through multiple steps.
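A post-hoc filter of this kind could look roughly like the sketch below. It is not the script mentioned above. It assumes that "read length" here means the number of reference bases an alignment spans, computed from the CIGAR string with skipped regions (N) included, and the names `reference_span`, `filter_sam` and the 1000 bp cutoff are hypothetical.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases (SAM specification).
REF_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference bases covered by an alignment, introns (N) included."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def filter_sam(sam_lines, max_span=1000):
    """Yield header lines and alignments that span at most max_span bp."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header untouched
            continue
        cigar = line.rstrip("\n").split("\t")[5]  # column 6 is CIGAR
        if reference_span(cigar) <= max_span:
            yield line
```

On a SAM stream this could be used as `sys.stdout.writelines(filter_sam(sys.stdin))`. If the long alignments turn out to be spliced reads, TopHat's `-I/--max-intron-length` option limits them at alignment time, which would avoid the separate step.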

    Thank you!

  • #2
    But are these reads present in your fastq file?

    • #3
      I don't believe so. TopHat reports the minimum and maximum read length from the FASTQ, and it always says minimum: 33 bp and maximum: 33 bp.

      Ex.

      [2016-07-15 00:07:29] Preparing reads
      left reads: min. length=33, max. length=33, 7668831 kept reads (338 discarded)
      right reads: min. length=33, max. length=33, 7668468 kept reads (701 discarded)

      I'm assuming TopHat is fusing two reads together that in reality are hundreds of kb apart, and that genomecov (bedtools) then fills in the gap, because the signal is always a straight line. It could be an issue with the reference genome (yeast R64-1-1 from Ensembl), but as far as I can tell Ensembl has the newest build.
      Last edited by dlbuz; 07-15-2016, 07:07 AM.

      • #4
        Sounds like splicing?

        • #5
          So, I've narrowed it down to one region on chrVII, from 560 kb to 790 kb. Using the --no-mixed and --no-discordant options caused even more regions to be affected (my reads are paired-end, so I figured these options would give better results).

          I attached a picture of what I'm experiencing. The blue signal comes from mapping to the reference genome with Bowtie alone, while the red signal comes from mapping with TopHat. Apparently TopHat also tends to lose reads (there is a huge signal with the Bowtie method compared to the TopHat method). Is there something specific in how TopHat uses Bowtie that causes this phenomenon? I've been using the --bowtie-n option in TopHat as well.

          • #6
            The problem seems to be the -pc option in genomeCoverageBed (bedtools v2.26.0). Unfortunately, I need this option to get proper strand-specific output, as -strand doesn't work on its own without specifying paired-end reads (i.e. -pc). Are there any other options I could try, or other coverage programs out there? The -fs option does not help either in limiting the large reads that cover 100,000 bp.
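One way to check whether pair-fragment coverage is behind the flat stretches is to list the pairs whose mates lie far apart, since a fragment-based count covers everything between the two mates. A minimal sketch, assuming the fragment length is taken from SAM column 9 (TLEN); the name `long_fragments` and the 1000 bp cutoff are hypothetical.

```python
def long_fragments(sam_lines, max_fragment=1000):
    """Yield (read name, chromosome, leftmost start, fragment length) for
    pairs whose template length (TLEN) exceeds max_fragment."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        tlen = int(fields[8])  # column 9 is TLEN
        # TLEN is positive for the leftmost mate and negative for the
        # rightmost one, so testing the positive value reports each pair once.
        if tlen > max_fragment:
            yield fields[0], fields[2], int(fields[3]), tlen
```

If the reported pairs cluster in the affected region (chrVII, 560 kb to 790 kb in this thread), that would point to a handful of distant mate pairs being filled in as fragments rather than to the individual 33 bp reads.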
            Last edited by dlbuz; 07-21-2016, 09:15 AM.

            • #7
              I was able to fix this issue by separating the .bam file into + and - strands using samtools and then running genomeCoverageBed on each. It seems bedtools has problems separating the strands of paired-end reads correctly if they are merged. The -pc option was only recently implemented, in bedtools v2.26, so it probably doesn't work as well as it should.
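The strand split can be sketched from the SAM FLAG bits alone. The exact samtools commands used are not given in this thread, so this is only an illustration: it assumes a library in which the fragment strand equals the strand of mate 1 (some protocols are the reverse, in which case the two outputs swap), and `fragment_strand` and `split_by_strand` are hypothetical names.

```python
REVERSE = 0x10      # read aligned to the reverse strand
SECOND_MATE = 0x80  # read is the second mate of its pair


def fragment_strand(flag):
    """Strand of the fragment, taken as the strand mate 1 aligns to."""
    reverse = bool(flag & REVERSE)
    if flag & SECOND_MATE:
        # Mate 2 points the opposite way to mate 1, so flip its orientation.
        reverse = not reverse
    return "-" if reverse else "+"


def split_by_strand(sam_lines):
    """Return {'+': [...], '-': [...]}; header lines are copied to both."""
    out = {"+": [], "-": []}
    for line in sam_lines:
        if line.startswith("@"):
            out["+"].append(line)
            out["-"].append(line)
            continue
        flag = int(line.split("\t")[1])  # column 2 is FLAG
        out[fragment_strand(flag)].append(line)
    return out
```

With samtools the same selection is usually made with the `-f` and `-F` flag filters of `samtools view`, and coverage is then computed on each file separately.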
