
**pbluescript** · 11-05-2012, 10:21 PM

Originally posted by dvanic

More reads != better mapping, it may just mean more false assignments of reads to location + salvaging more reads that may not be mappable...

Since you quoted me, I guess I should add my comments.

My testing hasn't been as extensive as that in the recent paper, so I recommend you all just go read that.

It's true that more mapped reads does not equal better mapping. However, I do get more uniquely mapped reads and more reads with both mates mapped (for my PE runs). For much of my data, even if I throw out multi-mapped reads, just the unique, proper-pair reads from STAR yield more aligned reads than Tophat (up to 2.0.3, when I just gave up on testing Tophat). These are predominantly good-quality alignments too.
So far, the testing I have done to confirm the data produced by STAR has held up very well. Gene expression levels, novel splice junctions (when well supported), and alternative isoforms have confirmed well with PCR-based methods.

This comes with a couple caveats though. I don't use STAR for analysis of RNA editing. For that, I switch to BWA and custom built transcriptomes for alignment.
For most of my data, I am sequencing small amounts of fragmented RNA, so the read quality can be quite variable. Getting 50-60% of my reads mapped makes me happy. For my few good quality samples, the differences between Tophat and STAR aren't as pronounced.

As the paper mentions, Tophat is MUCH slower. STAR can be slow on some of my data sets with a small percentage of reads that map to the target genome, but it's still faster than Tophat.

On a side note, of the four or five emails I've sent to the Tophat team requesting advice or reporting a bug, only one was answered. Every email I've sent to the STAR developer has been answered and answered quickly.
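For anyone wanting to reproduce the "unique, proper-pair reads" comparison above, here is a minimal sketch of how such reads could be counted from SAM text. It assumes the aligner writes the optional NH:i tag (number of reported alignments for the read) and sets the standard proper-pair flag bit (0x2); the function names are made up for illustration and this is not the poster's actual pipeline.

```python
def is_unique_proper_pair(sam_line: str) -> bool:
    """True for a mapped, primary, properly paired record with NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:        # read unmapped
        return False
    if flag & 0x100:      # secondary alignment
        return False
    if not flag & 0x2:    # not flagged as a proper pair
        return False
    # Optional tags start at column 12; NH:i:<n> is assumed to be present.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False          # no NH tag: cannot call the read unique


def count_unique_proper_pairs(lines) -> int:
    """Count qualifying records (mates are counted separately), skipping headers."""
    return sum(
        1
        for line in lines
        if not line.startswith("@") and is_unique_proper_pair(line)
    )


example = [
    "@HD\tVN:1.4",
    "r1\t99\tchr1\t100\t255\t50M\t=\t300\t250\t*\t*\tNH:i:1",
    "r1\t147\tchr1\t300\t255\t50M\t=\t100\t-250\t*\t*\tNH:i:1",
    "r2\t99\tchr1\t500\t3\t50M\t=\t700\t250\t*\t*\tNH:i:2",   # multi-mapped
    "r3\t73\tchr1\t900\t255\t50M\t=\t900\t0\t*\t*\tNH:i:1",    # mate unmapped
]
print(count_unique_proper_pairs(example))
```

Running the same counter over the output of each aligner gives numbers that can be compared directly, although a higher count alone still says nothing about whether the placements are correct.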

**dvanic** · 11-05-2012, 11:19 PM

My testing hasn't been as extensive as that in the recent paper, so I recommend you all just go read that.

Oh, I've read the paper, but see several "problems" with their tests (not really problems, more like it's hard to test every version against every version, and every version adds some new feature and maps differently).

With Tophat, we've found that
1) There is a significant difference (for versions 1.4.1 and 2.0.0-2.0.4) between mapping with a reference transcriptome and mapping without one. I did some benchmarking and visualization with one of my datasets and found that without a reference more reads are mapped. Many of these reads are "weird": low quality, not "adding up" to transcripts, located in regions with no annotated transcripts, in "clusters" that don't appear to be a continuous transcript, etc.

For reads that are mapped differently with/without the annotation,
- lots mapped to pseudogenes without a reference
- reads mapping over splice junctions that have a small overlap with one of the junctions (~5 nucleotides) will be differently aligned with and without the annotation (the annotation has information on the structure of the isoforms at the given locus, while without the annotation the "tail" is just mapped to the nearest sequence of those nucleotides, which may not be part of the transcript at all)
- some reads are just mapped weirdly (as above for unmapped) when you don't use an annotation.

Hence, with these versions of Tophat we have always tried to use an annotation.

2) With Tophat 2.0.5/6:
They have introduced a double mapping:

Version 2.0.5 adds new options to better control the read alignment and to improve mapping accuracy, and the ability to resume partial TopHat runs:
along with -N/--read-mismatches, TopHat introduces new options for finer control of the read alignment process by limiting the number of mismatches, indels and indel length. Please check new options --read-gap-length and --read-edit-dist.
the new --read-realign-edit-dist option can be used to greatly improve spliced-mapping accuracy (especially in the absence of annotation data) by forcing the re-mapping of some or all reads regardless of them being already mapped in earlier steps of the pipeline.

This, in my experience, makes the mapped data "look better" (when visualized as a wiggle or bam=> bed), i.e. the read positions make sense and look to be part of reasonable transcripts. Also, theoretically, having the ability to accurately map to both genome and transcriptome independently and choose the best alignment seems like a very good idea to me, which is why I am using this version with these settings at the moment.

Getting back on topic, the STAR paper used Tophat with the default options, but with 10 mismatches (default is 2) for a 100 bp PE read (and, crucially, no reference annotation, which, as I've outlined above, makes a big difference for alignment "quality" in Tophat). Hence, in an indirect way, Tophat was at a disadvantage here in terms of how it would perform in a real-world scenario... I'm just wondering how much, and what is observed in a real-world scenario with different datasets by different people. (As in, should I convert???)

[As a side note, the newer versions of Tophat 2.0.5/6 have a much higher memory requirement, in my experience, than the older ones (for example, on 100 million 100bp PE reads it uses ~23 gb, and runs using 8 cores for four bloody days... - but this is a price I am willing to pay to be more "confident" in the higher accuracy of my mapping)]

Gene expression levels, novel splice junctions (when well supported), and alternative isoforms have confirmed well with PCR-based methods.

We've found this as well with the latest Tophats when using a reference.

I don't use STAR for analysis of RNA editing. For that, I switch to BWA and custom built transcriptomes for alignment.

I think editing is a separate issue altogether. I've seen some of the recent "better" papers (Kleinman, Bahn, Ju etc), but I am still not sure we can accurately estimate editing from a "normal" RNA-Seq dataset (and not one targeted at detecting editing) - the levels of editing for most transcripts are just so low, mapping accuracy needs to be better, SNPs specific to that particular individual need to be taken into account (i.e. in a perfect world you need a genome sequence) etc etc... After the Li paper fiasco I'm very much a skeptic.

For my few good quality samples, the differences between Tophat and STAR aren't as pronounced.

I am spoiled by decent quality data that others playing with it have gone green with envy at...

As the paper mentions, Tophat is MUCH slower. STAR can be slow on some of my data sets with a small percentage of reads that map to the target genome, but it's still faster than Tophat.

Undoubtedly. But for me (and some of my colleagues here in Oz as well), I'd rather wait four days for my data to map and then play with it for several weeks/months, confident in that what I see, no matter how biologically strange, is probably real, rather than getting my results in a day, spending two weeks looking at some interesting feature, and then discovering it's a mapping artefact. And I know I'm not insured against this, no matter how fancy I am with my pipelining, but I'd like to minimize the chances of this where I can.

On a side note, of the four or five emails I've sent to the Tophat team requesting advice or reporting a bug, only one was answered. Every email I've sent to the STAR developer has been answered and answered quickly.

That's actually an important selling point!

With Tophat we've been mostly lucky, but I have issues with another tool in the Tuxedo pipe - cufflinks. There is, at the moment, no actual paper that reports what cufflinks is now doing and how, whether using the smorgasbord of methods it's trying to use together is actually statistically valid, and only the rather complex, confusing and, frankly, unintelligible "how cufflinks works" web page as a reference, and the promise of a "manuscript in preparation". This annoys me, since this IS the most commonly used tool in the field, and the fact that most people who use it have no idea what it's doing reflects IMHO shoddy science... (sorry for the rant)

**pbluescript** · 11-06-2012, 01:57 PM

Originally posted by dvanic

Hence, with these versions of Tophat we have always tried to use an annotation.

When Tophat introduced the option of mapping to a transcriptome first, I did notice an overall improvement in mapping quality. It found a good number of additional splice junctions. However, for my data, STAR was still the winner.

Getting back on topic, the STAR paper used Tophat with the default options, but with 10 mismatches (default is 2) for a 100 bp PE read (and, crucially, no reference annotation, which, as I've outlined above, makes a big difference for alignment "quality" in Tophat).

Those are not the Tophat options used in the STAR paper. Perhaps you were looking at one of the different aligners? From the STAR paper:
tophat --solexa1.3-quals -p $1 -r172 --min-segment-intron 20 --max-segment-intron 500000 --min-intron-length 20 --max-intron-length 500000 <genome_name> Read1.fastq Read2.fastq

There isn't even a way to set 10 mismatches/read in Tophat.

Undoubtedly. But for me (and some of my colleagues here in Oz as well), I'd rather wait four days for my data to map and then play with it for several weeks/months, confident in that what I see, no matter how biologically strange, is probably real, rather than getting my results in a day, spending two weeks looking at some interesting feature, and then discovering it's a mapping artefact. And I know I'm not insured against this, no matter how fancy I am with my pipelining, but I'd like to minimize the chances of this where I can.

I totally agree. I spent months working on mapping methods. I have access to a cluster, so it was fairly easy to test numerous mapping methods on a large number of samples. I'm just happy that the method that gave me the best results is also the fastest.

**dvanic** · 11-06-2012, 02:23 PM

Those are not the Tophat options used in the STAR paper. Perhaps you were looking at one of the different aligners? From the STAR paper:
tophat --solexa1.3-quals -p $1 -r172 --min-segment-intron 20 --max-segment-intron 500000 --min-intron-length 20 --max-intron-length 500000 <genome_name> Read1.fastq Read2.fastq

Damn, should have checked the supplements. I assumed that this was what they were using based on the statement:

All aligners were run in the de novo mode, i.e. without using gene/transcript annotations. The maximum number of mismatches was set at 10 per paired-end read, the minimum/maximum intron sizes were set at 20b/500kb

from the main paper.

There isn't even a way to set 10 mismatches/read in Tophat.

I am assuming you would be able to by using the option:

-N/--read-mismatches Final read alignments having more than these many mismatches are discarded. The default is 2.

although I have never tried to use 10 in the real world (have gone up to 5 successfully, though, but generally stick to the default 2).

I totally agree. I spent months working on mapping methods.

It's just we ALL do this, and it would be so much nicer if there could be really relevant "real-world" comparison papers, as opposed to "this is my software - it is the best against every other software (if we run the tests in a particular way)".

But it's good to know STAR is working for someone! I'll give it a shot and see what I get out of it.

Have you used STAR-generated bams with cufflinks by any chance?

**pbluescript** · 11-06-2012, 04:19 PM

Oh cool. It looks like they added that -N option in the 2.0.5 release, after I stopped using it.

I have used STAR for Cufflinks. In fact, a fairly recent update added the appropriate tags for use with Cufflinks. Before that, I had to add them separately.
Whether or not Cufflinks works is another issue of hot debate on this board.

**dvanic** · 11-06-2012, 05:57 PM

Whether or not Cufflinks works is another issue of hot debate on this board.

I agree wholeheartedly. But I use it anyway because it is, with all of its flaws, the "best" thing out there. The only problem is that I am still not sure how it now works, especially for differential expression of isoforms, with or without replicates. [And I have read the "How cufflinks works" page. It doesn't really make it that much clearer, or convince me that the stats are valid.]

And I have had it trip and assemble weird transcripts, especially when run without a reference. And not assemble something in one library and assemble it in another, even though there are an approximately equal number of reads in both libraries...

Basically, I use it and then look very intently in the browser at anything I'm basing a hypothesis on.

**EGrassi** · 11-28-2012, 06:48 AM

Can I ask which options you used for your Tophat vs. STAR tests? Some default values seem very different (for example, for multi-mapped reads) and I would like to avoid losing something... thank you.

**NicoBxl** · 02-06-2013, 03:23 AM

Does STAR handle strand-specific data like Tophat?

**Torst** · 06-05-2013, 10:12 PM

Yes, according to the manual (PDF) it assumes strand-specific reads. If you don't have them, you need to enable an option.

**sdriscoll** · 06-05-2013, 11:36 PM

Are people still interested in this discussion? I've been benchmarking things like crazy, including Tophat and STAR as well as several other mapping approaches, and then quantifying things like alignment position and read counts at the gene level as well as at the isoform level. I'm seeing some interesting things that sort of validate what I've suspected in the past when using these tools on real data. STAR with a reference and Tophat with a reference perform VERY similarly in terms of alignment precision (or accuracy, or 1-FDR, depending on what you want to call it). Tophat with a reference is a significant improvement over Tophat without one, while STAR's improvement is smaller (so STAR without a reference seems to outperform Tophat in alignment precision). In terms of counting hits to genes they, again, perform very similarly; however, STAR beats Tophat in count value precision. What I mean is that if I compare the list of genes that received alignments to the list of genes that should have alignments, the two aligners are similar, but if I compare the count values to the control values, STAR has much higher precision than Tophat. Both of these pipelines are beaten pretty significantly by RSEM and eXpress, however, for gene-level counts.

Cufflinks is a different story. I figured out how to generate counts from cufflinks (not cuffdiff), compared those counts to my own naive counter, and got the same values, so I can see that it's counting hits at the gene level in a logical way.

The isoform-level counts aren't awesome, but the sensitivity (ratio of isoforms with counts to those that should have counts) is OK. The FDR is bad, though.

I've not finished yet, but I'm sticking together a pipeline to benchmark cufflinks' de novo isoform assembly abilities. I've only run a single simulation, which resulted in 26% false positives (that's isoforms it assembled that cuffcompare evaluated to be matches to annotated isoforms that weren't expressed at all in my simulation), so I wasn't too stoked about that. I even provided it with isoforms with massive expression to be sure there were enough reads for it to do its thing. I have a lot more to look at, however, before making any noise. There's more going on here than just cufflinks' ability to make isoforms: it's completely dependent upon the aligner. Also, its isoform expression assignment stage doesn't have the same flexibility as RSEM and eXpress, which get all mappings of the reads and can perform a detailed evaluation of which of those mappings is really "correct". That approach does yield an improvement over the aligner-chosen "primary" mapping. To me it seems that cufflinks is at a disadvantage since it could be strangled by the aligner's ability to select the correct mapping of reads.
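The two detection metrics mentioned above can be written down in a few lines. This is a minimal sketch assuming simulated data where the true count per gene (or isoform) is known; sensitivity follows the definition given in the post (features with counts divided by features that should have counts), FDR is the fraction of called features that were not expressed in the simulation, and precision is 1 - FDR. The function name and dictionary layout are hypothetical, not the poster's benchmarking code.

```python
def detection_metrics(observed: dict, truth: dict) -> tuple:
    """Return (sensitivity, fdr) for features with a nonzero count.

    observed: feature id -> count reported by the pipeline under test
    truth:    feature id -> count used to simulate the reads
    """
    called = {f for f, c in observed.items() if c > 0}
    expressed = {f for f, c in truth.items() if c > 0}

    true_pos = len(called & expressed)    # called and really expressed
    false_pos = len(called - expressed)   # called but not expressed

    sensitivity = true_pos / len(expressed) if expressed else float("nan")
    fdr = false_pos / len(called) if called else float("nan")
    return sensitivity, fdr


truth = {"A": 10, "B": 5, "C": 0, "D": 3}
observed = {"A": 9, "B": 0, "C": 2, "D": 3}
sens, fdr = detection_metrics(observed, truth)
print(f"sensitivity={sens:.2f} FDR={fdr:.2f} precision={1 - fdr:.2f}")
```

This only scores presence or absence of a feature. Comparing the count values themselves against the simulated values, as described for the gene-level comparison, needs a separate measure.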

**dpryan** · 06-06-2013, 03:13 AM

Originally posted by sdriscoll

Are people still interested in this discussion?

Yes, I'd be very interested in seeing more about that, it sounds quite useful.

**Jon_Keats** · 06-06-2013, 10:27 PM

We have done a decent bit of side by side comparisons of STAR 2.3.0 and TOPHAT 2.0.8b.

Short story is they are not very different. On most obvious counts STAR wins: it is faster and gives more unique alignments. We run both with the Ensembl 70 GTF. In fact, the biggest thing we noticed recently, after moving from Ensembl 64 to 70 and changing from TOPHAT 2.0.4, was that the run time dropped to under 24 hrs compared to the previous 2.5 days on ~70 million read pairs. STAR did a much better job picking up large known indels detected in matching exomes.

My only negative comment on STAR is that it is very aggressive at trying to find junctions and throws out some clear garbage. HOWEVER, this can easily be removed using the filter-by-SJout option (presumably --outFilterType BySJout).

Based on our testing I highly suspect we will be following pbluescript and moving to STAR from TOPHAT.

**sdriscoll** · 06-06-2013, 10:53 PM

You know, there is something that Heng posted once in the bio-bwa user group. He appears to be more of a DNA alignment dude than an RNA-seq alignment dude, but he was talking about adapting the BWA aligners to become RNA-seq aligners. He posted a quick simulation where he sampled 1M reads from the genome (so no spliced reads) and then aligned them to the genome with STAR, Tophat (with both bowtie1 and bowtie2) and also his bwa 'mem' aligner. So tools like STAR and Tophat should have reported no junctions. Tophat, with bowtie2, managed to do this pretty well. I think he said it reported 1 junction. The bwa 'mem' aligner did pretty well too, reporting only a few chimeric alignments that could all be filtered out by removing alignments with MAPQ < 5. Tophat with bowtie1 reported all kinds of fusion alignments and STAR reported several hundred spliced alignments.

His point was that with bwa mem he seems to have a good base aligner - one that isn't reporting junctions when it shouldn't report junctions. It seems that STAR IS probably over-eager to report junctions based on his test. It may be useful to try such simulations yourself and maybe convince yourself which aligner is controlling the false positives of reported junctions better.
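A minimal sketch of how the outcome of such a simulation could be tallied: since the reads were sampled from the genome, every alignment whose CIGAR contains an N (skipped region, i.e. a reported intron) is a false spliced alignment. The function name and the optional MAPQ cutoff are assumptions for illustration, not Heng's actual scripts, and the cutoff only makes sense for aligners whose MAPQ values are meant to be used that way.

```python
def count_spliced(lines, min_mapq: int = 0) -> int:
    """Count mapped SAM records whose CIGAR contains an N operation."""
    n_spliced = 0
    for line in lines:
        if line.startswith("@"):           # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        cigar = fields[5]
        if flag & 0x4:                     # unmapped read
            continue
        if mapq < min_mapq:                # optional MAPQ filter
            continue
        if "N" in cigar:                   # N = skipped reference region
            n_spliced += 1
    return n_spliced


example = [
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t3\t20M500N30M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t300\t255\t48M1000N2M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(count_spliced(example), count_spliced(example, min_mapq=5))
```

Dividing the count by the number of simulated reads gives a false spliced-alignment rate that can be compared across aligners on the same read set.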

**genomeHunter** · 06-29-2013, 09:15 PM

STAR's MAPQ values should NOT be used for filtering reads and judging their qualities. I saw Heng's ROC plot in which STAR's ROC was just a single dot. I tried Bioplanet's GCAT test set on STAR and it was very good:

http://www.bioplanet.com/gcat/reports/237-nggjskmpkk/alignment/100bp-pe-small-indel/star-gh-m3/compare-23-18

Note that (1) STAR is not a DNA mapper and (2) its MAPQ fields are not set the same way as in, say, BWA.

I have seen a lot of STAR spliced alignments where the overhang is just one or two bases. I look at the CIGAR and throw out anything with an overhang less than 8.

Cheers
GH
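The CIGAR-based overhang filter described above could look roughly like this. It is a minimal sketch, assuming "overhang" means the number of aligned bases (CIGAR M, = or X) in each block on either side of an N operation, and that soft clips, insertions and deletions do not count toward it; the function names are hypothetical and this is not GH's own script.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def min_overhang(cigar: str):
    """Shortest aligned block flanking a splice (N), or None if unspliced."""
    blocks = [0]
    for length, op in CIGAR_OP.findall(cigar):
        if op == "N":
            blocks.append(0)            # a junction starts a new block
        elif op in "M=X":
            blocks[-1] += int(length)   # aligned bases extend the block
    if len(blocks) == 1:
        return None                     # no N operation: not spliced
    return min(blocks)


def passes_overhang_filter(cigar: str, min_len: int = 8) -> bool:
    """Keep unspliced reads and spliced reads whose blocks are all >= min_len."""
    overhang = min_overhang(cigar)
    return overhang is None or overhang >= min_len


for cigar in ["50M", "48M1000N2M", "5S20M300N25M", "10M200N30M400N6M"]:
    print(cigar, min_overhang(cigar), passes_overhang_filter(cigar))
```

Taking the minimum over all blocks also rejects reads with a short internal block between two junctions, which may be too strict where genuine micro-exons are expected.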


STAR vs Tophat (2.0.5/6)
