Seqanswers Leaderboard Ad

**alexdobin** · 07-19-2013, 01:51 PM

Originally posted by lh3 View Post

I put the spliced alignments at: http://lh3lh3.users.sourceforge.net/tmp-aln.tgz. Among ~500 spliced alignments reported by star, >60% should be mapped with a small indel. This rate of gapped alignments is much higher than the average rate ~11% in the whole data set. Does star prefer ungapped alignments to avoid frameshifts? Is it possible to tune star to choose unspliced alignments for these reads, or ask it to report both spliced and unspliced alignments?

In addition, it should be noted that the spliced alignments are only 1/8 of all mismappings (<10% for paired-end data). Excluding spliced alignments would not affect the overall conclusion on genomic read alignment.

Thanks.

Heng,
thanks a lot for posting the data. I have looked at it and I think the results are quite educating.
First, I re-run STAR on this subset of 503 erroneously spliced alignments with the following parameters:
--scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --scoreGap -2 --scoreGapNoncan -100 --outFilterMatchNmin 95 --alignIntronMax 100000 --seedSearchStartLmax 25
Compared to the default parameters, this lowers the penalty for indels, introduces non-zero penalty for canonical junctions and very large penalty for non-canonical junctions. It also reduces the maximum junction gap to 100kb (from ~600kb by default), and requires at least 95 out of 101 bases matched - to avoid poor quality alignments.
--seedSearchStartLmax 25 option increases STAR sensitivity to alignments with short overhangs over indels or junctions.

With these parameters STAR produces just 29 spliced reads. I have compared these 29 splices alignments with bwa-mem alignments in the attached excel files. Interestingly, for 23 out of 29 alignments, STAR finds spliced reads with the same or smaller edit distance (marked with green in the Excel file). If these were truly RNA-seq data, these would be the cases where it's impossible to differentiate between spliced and unspliced alignments when reads are considered separately from each other. Possibly, they can be filtered by requiring that their junctions are supported by more than one read.

Attached Files

HengTest.zip (8.4 KB, 29 views)

**lh3** · 07-20-2013, 08:16 AM

For RNA-seq, it is hard to tell between an unspliced alignment with 2 mismatches and a spliced alignment with 1 mismatch. In that case, I would give either alignment a low mapping quality, and/or provide the unspliced alignment in an optional tag, informing users that the spliced alignment should not be trusted. This is how base callers, variant callers and genomic read mappers work: when there is ambiguity, tell the users, instead of merely making an arbitrary choice.

If a mismapping is caused by variants, not sequencing errors, requiring two or more supporting reads helps little as reads containing the same variants are all likely to be mismapped.

**sdriscoll** · 07-21-2013, 09:45 PM

@lh3 you've just spoken one of my main issues with some tools out there. Leaving ambiguity in the results is very important so that the scientists can make their own decisions. They need to know when something is computationally impossible to separate. I've always thought this when counting hits to genes from genomic aligned reads. Some throw out ambiguous hits (those that hit more than one gene_id feature at a single alignment position, for example) which in some cases throws out a lot of information - enough to severely throw off count values from controls in a simulated experiment. I've always felt it would be more informative to the researcher to see those hits but to know that they may not be able to separate those hits from the two (or more) gene id features due to ambiguous alignments. That's much better than the researcher thinking that those features are either not there or expressed very low...that can be very confusing.

**alexdobin** · 12-03-2013, 12:11 PM

Comment on TopHat2 paper

Dear All,

If you are interested in comparing STAR and TopHat2, please check out our comment on the TopHat2 paper on bioRxiv:
http://biorxiv.org/content/early/2013/11/22/000851.
The main conclusion is that in fair comparisons STAR's alignment accuracy exceeds that of TopHat2. As we showed before, this high accuracy is achieved with the mapping speed ~50-fold higher than TopHat2's.

As always, I will be happy to answer any questions.
Cheers,
Alex

Abstract:
In the recent paper by Kim et al. (Genome biology, 2013. 14(4): p. R36) the accuracy of TopHat2 was compared to other RNA-seq aligners. In this comment we re-examine most important analyses from this paper and identify several deficiencies that significantly diminished performance of some of the aligners, including incorrect choice of mapping parameters, unfair comparison metrics, and unrealistic simulated data. Using STAR (Dobin et al., Bioinformatics, 2013. 29(1): p. 15-21) as an exemplar, we demonstrate that correcting these deficiencies makes its accuracy equal or better than that of TopHat2. Furthermore, this exercise highlighted some serious issues with the TopHat2 algorithms, such as poor recall of alignments with a moderate (>3) number of mismatches, low sensitivity and high false discovery rate for splice junction detection, loss of precision for the realignment algorithm, and large number of false chimeric alignments.

**rmorey** · 02-03-2014, 12:26 PM

Appropriate settings for variant calling

I am new to using STAR and I would like to get opinions on what the appropriate settings would be if I would like to use STAR's mapping results for VarScan. I used the parameters specified in Alex's earlier post:

First, I re-run STAR on this subset of 503 erroneously spliced alignments with the following parameters:
--scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --scoreGap -2 --scoreGapNoncan -100 --outFilterMatchNmin 95 --alignIntronMax 100000 --seedSearchStartLmax 25
Compared to the default parameters, this lowers the penalty for indels, introduces non-zero penalty for canonical junctions and very large penalty for non-canonical junctions. It also reduces the maximum junction gap to 100kb (from ~600kb by default), and requires at least 95 out of 101 bases matched - to avoid poor quality alignments.
--seedSearchStartLmax 25 option increases STAR sensitivity to alignments with short overhangs over indels or junctions.

My settings are: --outSAMmode Full --outSAMattributes Standard --outReadsUnmapped Fastx --outFilterMultimapNmax 5 --outFilterMismatchNmax 6 --clip5pNbases 15 --clip3pAdapterSeq AGATCGGAAGAGC --clip3pAdapterMMp 1 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --scoreGap -2 --scoreGapNoncan -100 --alignIntronMax 100000 --seedSearchStartLmax 25 --outFilterMatchNmin 95

The above settings result in 30% uniquely mapped reads and removal of the --outFilterMatchNmin 95 results in 70%. I would like to be stringent so that I can avoid false positives as much as possible when calling variants. Can anyone recommend better settings for this sort analysis?

**sdriscoll** · 02-05-2014, 12:54 AM

I'm not sure what to recommend. If you're goal is to have robust alignments then using --outFilterMatchNmin 95 is a good idea. If you're only getting 30% alignment with that setting then I'd look at your data a little more closely. Why are so many reads failing to align within 5% of the read length? If those alignments are allowed I'd expect it would gum up the SNP calling with a lot of noise. However if they are included then the portions that do align well could still aid you by providing additional depth.

One way to see if it really matters is to run your variant caller on both versions of the alignments and then filter the results to only those that have high statistical significance. If the list from the less strict alignments has a lot of high statical calls that the other does not then go look at the alignments at those locations in the IGV viewer or something similar where you can actually see all of the bases of the alignments verses the reference. You might be able to make sense out of why so many more reads align with the stringency is lowered. You might also find that all of those additional reads provide no additional information in terms of variant calling.

**alexdobin** · 02-06-2014, 02:52 PM

Hi @rmorey,

as Shawn pointed out, 30% mapping rate is quite low, even with --outFilterMatchNmin 95. It may indicate poor quality tails, or high mismatch rate. When you mapped it with default --outFilterMatchNmin, what were the "Average mapped length" and "Mismatch rate per base, %"? These numbers could hint at the problem.
Are you aligning RNA-seq data? If so, I highly recommend using annotations at the genome index generation step, which improves mapping of reads with short spliced overhangs, and drastically reduces the number of false positive alignments with mismatches (mostly to pseudogenes).
If you are aligning DNA-seq, you would want to use --alignIntronMax 1 to prohibit splicing. Of course, in this case you should not include annotations.

Cheers
Alex

**sdriscoll** · 02-15-2014, 12:18 AM

This just reminded me of some SNP calling benchmarking I was doing a while back. I'd bet the lower-stringency alignments aren't going to be a big deal for SNP calling though their presence could mask out some real SNP calls by increasing mismatch noise if there are significant mismatched bases in the vicinity of a true SNP. Also I recommend what Alex recommended for RNA-seq alignment (using a annotation built index) because I've found it's highly likely with genomic alignments to find false SNP calls as a result of mis-aligned reads. STAR is very good at controlling this, however, and even better when an annotation is provided during genome indexing.

**ndaniel** · 05-10-2014, 08:57 PM

RNA-seq & transcriptome

Originally posted by lh3 View Post

For RNA-seq, it is hard to tell between an unspliced alignment with 2 mismatches and a spliced alignment with 1 mismatch...

Hi lh3,

indeed this is difficult if the mapping is done on genome.

In a RNA-seq experiment one is sequencing RNA (and not the genome)!!!! It is obvious that best choice is to map the reads first on the RNA/transcriptome (and not the genome) for the organisms where the reference transcriptome is known like for example human (afterwards the unmapped reads may be mapped on the genome)!!!! This is how eXpress is doing it!

Always mapping first reads from a RNA-seq experiments on a reference transcriptome will give better results than than mapping those on a reference genome!

@lh3, have you tried to map the reads on the reference transcriptome using BWA-MEM (and afterwards convert the mappings to genomic coordinates in order to make the comparison easier) and compare with the case when the mapping using BWA-MEM was done on the reference genome? I guess that BWA-MEM would perform much better!

**Brian Bushnell** · 05-11-2014, 12:09 AM

Originally posted by ndaniel View Post

In a RNA-seq experiment one is sequencing RNA (and not the genome)!!!! It is obvious that best choice is to map the reads first on the RNA/transcriptome (and not the genome) for the organisms where the reference transcriptome is known like for example human (afterwards the unmapped reads may be mapped on the genome)!!!! This is how eXpress is doing it!

No, that's not obvious at all; in fact, I'd say it's wrong. Mapping to a genome is objective, while mapping to a transcriptome is subjective. I would never approve of any results gathered from transcriptome-mapping.

Always mapping first reads from a RNA-seq experiments on a reference transcriptome will give better results than than mapping those on a reference genome!

That is absolutely correct, if by 'better' you mean 'the results you plan to get by carefully choosing your transcriptome'.

**ndaniel** · 05-12-2014, 08:30 PM

Originally posted by Brian Bushnell View Post

No, that's not obvious at all; in fact, I'd say it's wrong. Mapping to a genome is objective, while mapping to a transcriptome is subjective. I would never approve of any results gathered from transcriptome-mapping.

I would say that both are objective. Mapping first on RNAs the reads from RNA-seq experiment will always give better mappings than mapping them on genome when the transcriptome is known pretty well, like for example for human. It is even in the title of RNA-seq that one should map the reads on the RNA and not on the DNA!

This is like the difference between supervised clustering (transcriptome) and unsupervised clustering (genome). In majority of cases supervised clustering will perform better than unsupervised clustering. See http://en.wikipedia.org/wiki/Unsupervised_learning and http://stanford.edu/~rezab/papers/interactive.pdf

It is just that when mapping on genome without using the information from the transcriptome then one is on purpose not using all the information available and therefore is using a worse model. Not using the information from the transcriptome when mapping the reads from an RNA-seq experiments is like re-inventing/re-discovering the wheel/transcriptome every single time for every single RNA-seq sample.

For example, the following tools are mapping the reads from RNA-seq experiment on RNA/transcriptome first:
- TopHat (with a --GTF option)
- eXpress are doing this (that is mapping on transcriptome),
- BitSeq
- ...

This is from TopHat manual:
----------------------------------------------
-G/--GTF <GTF/GFF3 file>
Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat.
-----------------------------------------------

How does BBMap compare against TopHat when TopHat is run using --GTF option? Did you run TopHat using the GTF as input?

In the end of the day one wants to get the best mapping using the ALL the information that is available out there. The human transcriptome is pretty well known. In cases where the transcriptome is not known very well (e.g. some species of fishes/butterflies/reptiles/etc.) then indeed the best choice is to map the reads from RNA-seq experiment on the genome of the organism if the genome is available.

**Brian Bushnell** · 05-13-2014, 10:17 AM

Originally posted by ndaniel View Post

How does BBMap compare against TopHat when TopHat is run using --GTF option? Did you run TopHat using the GTF as input?

In the end of the day one wants to get the best mapping using the ALL the information that is available out there. The human transcriptome is pretty well known. In cases where the transcriptome is not known very well (e.g. some species of fishes/butterflies/reptiles/etc.) then indeed the best choice is to map the reads from RNA-seq experiment on the genome of the organism if the genome is available.

I have not directly compared the performance of Tophat with GTF files to BBMap; most organisms I deal with are novel and don't have any reliable annotation. But my reason for avoiding the use of annotation or transcriptomes when mapping is more philosophical - they change very frequently, and even if they give a higher mapping rate, they invariably incur bias, particularly in organisms with alternative splicing.

When I was dealing with human genetics a few years ago, I tried to keep our local copies of human gene annotations up-to-date. There were different versions like refgene, refseq, and knowngene, available from various websites like UCSC and NCBI, and I re-downloaded them every month because they changed at least that frequently. Despite a large overlap, they were all different - hence, you got different results from each of them. I didn't actually use them for mapping RNA-seq data, but rather for annotating the effects of mutations called from DNA mapping. Furthermore, I didn't use any of them individually; rather, I created a union of the three, which was itself a subjective operation, because in some places they had the same gene with just slightly different exon or UTR boundaries. They typically also disagreed on things like gene isoforms - how many there were, and which ones. And on whether a sequence was a real gene, or a pseudogene. And this was not some random organism with only machine-generated annotation from a draft genome... this was human, with a huge amount of research poured into individual genes for many years and manually-curated databases.

So, considering how volatile they were, and how different they were, and how full of errors they were (otherwise there would be no contradictions between them), it seems like a very bad idea to rely on them for mapping when you don't have to. You may need to use the annotations at some point, like when deciding what the name or function is of a gene that appears to be differentially expressed, but I think it's wise to avoid them whenever possible. I think it's unwise to use the annotation for mapping even if the annotation is perfect and complete - but in the real world, where it isn't, I think it's doubly unwise.

**Brian Bushnell** · 05-13-2014, 05:50 PM

On second thought, it might be useful to do the opposite of what Tophat does - first map to the genome, and then attempt to map the remaining unmapped reads to the transcriptome. That would avoid almost all of the bias that comes from transcriptome-mapping (since the vast majority of reads would already be mapped), yet allow more reads to map to hypothetical things like inter-chromosomal junctions, 10Mbp introns, and post-transcription-modified sequences.

**ndaniel** · 05-14-2014, 12:33 AM

Originally posted by Brian Bushnell View Post

I have not directly compared the performance of Tophat with GTF files to BBMap; most organisms I deal with are novel and don't have any reliable annotation. But my reason for avoiding the use of annotation or transcriptomes when mapping is more philosophical - they change very frequently, and even if they give a higher mapping rate, they invariably incur bias, particularly in organisms with alternative splicing.

One can argue easily that the reference human genome changes very frequently too and this incurs bias too!

Originally posted by Brian Bushnell View Post

When I was dealing with human genetics a few years ago, I tried to keep our local copies of human gene annotations up-to-date. There were different versions like refgene, refseq, and knowngene, available from various websites like UCSC and NCBI, and I re-downloaded them every month because they changed at least that frequently. Despite a large overlap, they were all different - hence, you got different results from each of them. I didn't actually use them for mapping RNA-seq data, but rather for annotating the effects of mutations called from DNA mapping. Furthermore, I didn't use any of them individually; rather, I created a union of the three, which was itself a subjective operation, because in some places they had the same gene with just slightly different exon or UTR boundaries. They typically also disagreed on things like gene isoforms - how many there were, and which ones. And on whether a sequence was a real gene, or a pseudogene. And this was not some random organism with only machine-generated annotation from a draft genome... this was human, with a huge amount of research poured into individual genes for many years and manually-curated databases.

So, considering how volatile they were, and how different they were, and how full of errors they were (otherwise there would be no contradictions between them), it seems like a very bad idea to rely on them for mapping when you don't have to. You may need to use the annotations at some point, like when deciding what the name or function is of a gene that appears to be deferentially expressed, but I think it's wise to avoid them whenever possible. I think it's unwise to use the annotation for mapping even if the annotation is perfect and complete - but in the real world, where it isn't, I think it's doubly unwise.

Again one could argue that the reference human genome is as volatile as the reference human transcriptome.
For example there is for the reference human genome the following versions released in the last few years (see: https://www.ncbi.nlm.nih.gov/project...bly/grc/human/ ):
- hg18
- GRCh37 (with over 13 patch releases during last 5 years) ==>> YES, 13 different patch releases!!!! (by the way, which patch release have you used for BBMap?)
- GRCh38

That means that most likely the reference human genome downloaded from Ensembl is not the same with the reference human genome downloaded from UCSC/RefSeq/... because they apply those release patches at different dates. Some of them are faster than the others.

Also the reference human genome is full with errors and there are still plenty of regions which are still missing.

The volatility of the reference human transcript comes mostly by adding new isoforms. Just because there is found a new exon border it is needed to add an entire isoform (which contains all the previously know exon borders plus one more that is the new one). So if one looks to the file size of the reference human transcriptome indeed would see that it variates quite a bit but this does not mean that using transcriptome to do the mappings is bad. If one converts the transcriptome to genomic coordinates will see that actually the reference human transcriptome is not as volatile as some might believe. Also for example if one really wants to use a very stable reference human transcriptome annotation I suggest to use the RefSeq one.

Also it is a little bit weird to use the reference transcriptome when doing DEGs analysis using RNA-seq data (one needs info from the reference human transcriptome) but not to use it while mapping the RNA-seq reads! If one uses the reference transcriptome annotation in a RNA-seq experiment then it is obvious then that the same transcriptome may be used also for mapping the same reads from the same experiment.

Originally posted by Brian Bushnell View Post

I think it's unwise to use the annotation for mapping even if the annotation is perfect and complete - but in the real world, where it isn't, I think it's doubly unwise.

I think that is smart if one uses all the information that is available out there to do the job. In this case the model for mapping RNA-seq reads will perform better obviously if the information about transcriptome is used. One could easily argue the same about the reference human genome which goes something like this. The reference human genome still has plenty of errors and missing regions and it a new version is released quite often and therefore one should not use it and instead should to de novo assembly of the transcriptome. Right?

Originally posted by Brian Bushnell View Post

On second thought, it might be useful to do the opposite of what Tophat does - first map to the genome, and then attempt to map the remaining unmapped reads to the transcriptome. That would avoid almost all of the bias that comes from transcriptome-mapping (since the vast majority of reads would already be mapped), yet allow more reads to map to hypothetical things like inter-chromosomal junctions, 10Mbp introns, and post-transcription-modified sequences.

I personally think that first one should map the reads on transcriptome (that is RNA and it is even in the name of the RNA-seq) because the reads are coming from RNA and not from DNA (DNA is a contamination in the RNA-seq world; here by DNA I mean the reference human genome and not cDNA). Afterwards the reads which map badly or do not map on the transcriptome should be mapped on the genome. In the end all the mapping results should be given as genomic coordinates (that is, convert the transcriptome mappings coordinates into genomic coordinates) in order to be able to use other bioinfo tools (e.g. HTSeq, featureCounts, etc.)

But, indeed it would be interesting to do comparisons for RNA-seq, like:
(a) map the reads first on genome and left overs reads on transcriptome using TopHat
(b) map the reads first on transcriptome and left overs reads on genome using TopHat
(c) map the reads on genome only using TopHat
(d) map the reads first on genome and left overs reads on transcriptome using BBMap
(e) map the reads first on transcriptome and left overs reads on genome using BBMap
(f) map the reads on genome only using BBMap
(g) map the reads first on genome and left overs reads on transcriptome using BWA-MEM
(h) map the reads first on transcriptome and left overs reads on genome using BWA-MEM
(i) map the reads on genome only using BWA-MEM
and see which of these gives the best mappings (number of reads mapped, mapping quality of the reads, number of unmapped reads, etc.).

One would expect that "map the reads first on transcriptome and left overs reads on genome" to perfom better than "map the reads first on genome and left overs reads on transcriptome" and the worst would be "map the reads on genome only".

**earonesty** · 05-21-2014, 07:08 AM

Originally posted by alexdobin View Post

Hi @rmorey,

It may indicate poor quality tails, or high mismatch rate.

Or you may need to remove adaptors at the tails of your reads, esp if you sequenced a long read length, while keeping the Illumina fragmentation default of only 165bp. Fragment to 165, sequence at and extended read lenght with new chemistry .... 200x200 say... and get all sorts of adaptor at the ends... and fail to clip them off. Check the output of this:

What's the output of this command?

samtools view output.bam | cut -f 9 | head -100000 | perl -ne 'if ($_>0) {++$c; $s+=$_} END{print $s/$c,"\n"}'

And this one:

samtools view output.bam | cut -f 10 | head -100000 | perl -ne 'if ($_) {++$c; $s+=length($_)} END{print $s/$c,"\n"}'

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News