Tophat reads kept/discarded during initial conversion

  • Tophat reads kept/discarded during initial conversion

    I am using TopHat to analyze Illumina HiSeq 2000 paired-end read data. I have noticed that during the initial execution, TopHat 1 (and 2) "converts the reads" and then sorts the left reads into kept and discarded groups (e.g. 8,000,012 kept, 10,121 discarded) and does the same for the right reads (e.g. 7,804,000 kept, 206,133 discarded). Since the number of discarded reads differs between the two files, I'm assuming that "lone" mates are treated as single reads.

    My question is: how does TopHat decide which reads to keep and which to discard, and why? Are there some underlying QC filters?
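
    As a quick check on the example counts quoted above, the kept and discarded totals can be turned into discard rates. This is a minimal sketch that uses only those example numbers; the function name `discard_rate` is made up for illustration.

```python
def discard_rate(kept: int, discarded: int) -> float:
    """Return the fraction of input reads that were discarded."""
    total = kept + discarded
    return discarded / total if total else 0.0


# Example counts quoted in the question (left and right mates).
left_kept, left_discarded = 8_000_012, 10_121
right_kept, right_discarded = 7_804_000, 206_133

left_total = left_kept + left_discarded
right_total = right_kept + right_discarded

print(f"left:  {left_total:,} reads, "
      f"{discard_rate(left_kept, left_discarded):.3%} discarded")
print(f"right: {right_total:,} reads, "
      f"{discard_rate(right_kept, right_discarded):.3%} discarded")
```

    With these example numbers both files add up to the same total (8,010,133 reads), which is consistent with the filter being applied to each read individually rather than to whole pairs.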

  • #2
    I am also VERY interested in this question/answer as I do quite a bit of quality trimming prior to mapping my reads and I've noticed the discarded reads being about 1-2% of my total read library.

    • #3
      Has anyone found the answer? I am seeing the same thing.

      I am using TopHat v2.0.6. I previously used an older version, and that worked fine.

      • #4
        I'm also using TopHat v2.0.6 and I had the same question. I'm assuming it removes reads that don't meet some quality threshold, but I can't find any documentation of this in the manual.

        • #5
          I still haven't figured out why these reads are discarded. Since this step happens before alignment to the genome or GTF annotations, my guess is that it discards low-quality reads. I emailed the TopHat developers with this thread's link, so hopefully they will respond.

          • #6
            TopHat filters out some reads if they are of low complexity or include too many Ns.
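
            To make "low complexity" and "too many Ns" concrete, here is a hypothetical sketch of what such a pre-alignment filter could look like. The function name and both thresholds (`max_n_fraction`, `max_single_base_fraction`) are assumptions chosen for illustration; they are not TopHat's actual criteria, which this thread does not document.

```python
from collections import Counter


def looks_discardable(seq: str,
                      max_n_fraction: float = 0.1,
                      max_single_base_fraction: float = 0.9) -> bool:
    """Hypothetical filter: flag reads with many Ns or dominated by one base.

    Both thresholds are illustrative assumptions, not TopHat's real values.
    """
    seq = seq.upper()
    if not seq:
        return True  # an empty read carries no information
    counts = Counter(seq)
    n_fraction = counts.get("N", 0) / len(seq)
    # Fraction taken up by the single most common called base (A, C, G or T).
    top_base_fraction = max(counts.get(base, 0) for base in "ACGT") / len(seq)
    return (n_fraction > max_n_fraction
            or top_base_fraction > max_single_base_fraction)


print(looks_discardable("A" * 100))                         # poly-A read
print(looks_discardable("ACGTTGCAAGGCTTACGATCGATCGGATCC"))  # mixed sequence
print(looks_discardable("ACGTNNNNNNACGT"))                  # many Ns
```

            Under these assumed thresholds a poly-A read and a read with many Ns are flagged, while an ordinary mixed sequence passes.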

            • #7
              About how many might "too many" be?

              • #8
                Not the answer to your question but...

                I can tell you that the 'discarded' reads end up in unmapped.bam.

                Hopefully future versions of tophat will allow for more user control/better documentation of the quality filtering.

                • #9
                  I checked unmapped.bam from TopHat 2.0.9

                  samtools view -f 0x200 unmapped.bam | head

                  I got:
                  Code:
                  HWI-7001436:48:C2ET1ACXX:5:1108:2968:28222	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDD[email protected]DDDDBBBDDDD	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1203:5292:62817	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCTCGTTACA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBB5&)0((+()+((++	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1312:13946:40878	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1203:5920:62936	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBDDDDDD	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1312:14680:40864	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:2312:9415:35514	581	*	0	255	*	*	0	0	ATTAAAAAAAAAAAACTCCTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHIII<FHCHIIIIIIHDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1312:14593:40904	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCTCTCTTATAAAC	CCCFFFDFHGHHHIJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDBDD<9>&&+((((4(+(((((	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1108:4206:28028	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBBDDD<BDDDDDDDDDDDDDDDDDDDDDDD9	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1203:7475:62973	581	*	0	255	*	*	0	0	AGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJJHFDDD[email protected][email protected]&	ZT:A:L
                  HWI-7001436:48:C2ET1ACXX:5:1108:4708:28068	581	*	0	255	*	*	0	0	AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	CCCFFFFFHHHHHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDD	ZT:A:L
                  I think it makes sense to remove these reads before alignment, right?
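
                  The second column of each record above is the SAM FLAG, which is 581 in every case. A minimal sketch for decoding it follows; the bit meanings come from the SAM format specification, while the helper name `decode_flag` is just for illustration.

```python
# FLAG bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the names of all FLAG bits that are set in `flag`."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for name in decode_flag(581):
    print(name)
```

                  For 581 the set bits are 0x1, 0x4, 0x40 and 0x200 (paired, unmapped, first in pair, QC fail). That matches the `-f 0x200` filter in the samtools command: the discarded reads appear to be stored in unmapped.bam as unmapped reads with the QC-fail bit set.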

                  Another question:
                  what is the meaning of "ZT:A:L"?
                  Last edited by harryzs; 10-05-2013, 12:44 AM.
