**sdriscoll** · 06-07-2013, 10:34 AM

Question...

Wei mentioned that one difference between subread-align and subjunc is that subjunc has lower sensitivity. In other words, if you compare the two on the same set of RNA-seq-like reads, subread-align will align more of the data than subjunc. This is the opposite of what I'm used to. For example, if you align RNA-seq reads to a genome with bowtie and then with Tophat, you'll almost certainly get more reads aligned with Tophat thanks to the additional spliced alignments. So intuitively something doesn't make sense. Why wouldn't the output of subjunc basically be the same as that of subread-align, plus the alignments of reads that span junctions?

**sdriscoll** · 06-07-2013, 11:51 AM

Wei, everything you fixed yesterday seems to be working great. Thanks. Got another one for you, though this one may be simpler to deal with. When using the '-J' option with subread-align on paired-end reads, it looks like if one of the mates is soft-clipped on the left side of the alignment, the other mate's "mate position" field isn't updated to include the offset from the soft-clipping.

Here's an example:

Code:

ENST00000367469_153_348_0_1:0_3	83	1	4557898	121	34M66S	=	4557683	-315	CGATCTGGGACCGCAGCTGAAGTGACGTGGGGCTAGAATCGGGTTTCTCCACTTCCAGGTCCTGGGAAACCCGCCGTTTCCGCAGCTCCTCCATCCTCTC	????????????????????????????????????????????????????????????????????????????????????????????????????	AS:i:3	NM:i:0	NH:i:1
ENST00000367469_153_348_0_1:0_3	163	1	4557719	147	36S24M40S	=	4557898	315	ACCTTCTTGGAAGGTGGTCCTGGGCAGAGGGAGAAAGACTTACTTTCTTTCCACTTCTGGGGTTGACACGGCGCTACAGAAGCCAAGCGACTCTTCGATC	????????????????????????????????????????????????????????????????????????????????????????????????????	AS:i:1	NM:i:0	NH:i:1
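
To make the discrepancy concrete, here is a minimal Python sketch that compares each record's mate-position field (PNEXT) with the mate's own POS. The POS, CIGAR and PNEXT values are copied from the two records above; the helper names are hypothetical and not part of subread, and the CIGAR parsing assumes no hard clip precedes the soft clip.

```python
import re


def leading_soft_clip(cigar: str) -> int:
    """Number of soft-clipped bases at the start of a CIGAR string (0 if none)."""
    match = re.match(r"(\d+)S", cigar)
    return int(match.group(1)) if match else 0


def pnext_shortfall(record: dict, mate: dict) -> int:
    """How far the record's PNEXT falls short of the mate's actual POS."""
    return mate["pos"] - record["pnext"]


# Fields taken from the two SAM records in this post.
mate1 = {"pos": 4557898, "cigar": "34M66S", "pnext": 4557683}
mate2 = {"pos": 4557719, "cigar": "36S24M40S", "pnext": 4557898}

for name, record, mate in (("mate1", mate1, mate2), ("mate2", mate2, mate1)):
    shortfall = pnext_shortfall(record, mate)
    clip = leading_soft_clip(mate["cigar"])
    print(f"{name}: PNEXT is off by {shortfall}; mate's leading soft clip is {clip}")
```

The first record's PNEXT falls short of its mate's POS by exactly the mate's leading soft clip, while the second record's PNEXT agrees with its mate's POS.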

**sdriscoll** · 06-07-2013, 02:51 PM

No offense but I find the usage message in the terminal to be a mess. Allowing the argument descriptions to wrap and mix in with the arguments makes it very difficult to read and find information. I reformatted the usage function (only in aligner.c) to wrap text at about 80 characters. The difference is like night and day...
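
As a rough illustration of the idea (in Python rather than the C of aligner.c, and not the actual patch), a usage line can be wrapped at 80 columns with a hanging indent so that descriptions never run into the option column. The option name and description below are made up for the example.

```python
import textwrap


def format_option(flag: str, description: str, width: int = 80, indent: int = 24) -> str:
    """Wrap one usage entry so continuation lines align under the description."""
    first = f"  {flag} ".ljust(indent)  # option column, padded to a fixed width
    return textwrap.fill(
        description,
        width=width,
        initial_indent=first,
        subsequent_indent=" " * indent,
    )


# Hypothetical option and text, for illustration only.
entry = format_option(
    "--example-option <int>",
    "A deliberately long description that would otherwise run past the edge of "
    "the terminal and wrap back under the option names, making the usage "
    "message hard to scan.",
)
print(entry)
```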

**shi** · 06-07-2013, 03:55 PM

Dear sdriscoll,

Thanks for your helpful suggestion and for reporting the bug. We will fix it when we are back at work on Tuesday. This is a long weekend in Australia.

As for the comparison of subread vs subjunc: firstly, they can both map exon-spanning reads, so they are both splicing-aware aligners. The difference is that subread performs local alignments, meaning that you will not get full alignments for exon-spanning reads from it, while Subjunc can give you full alignments for such reads.

Subjunc applies more stringent criteria when mapping exon-spanning reads. This is mainly because the aim of Subjunc is to detect exon-exon junctions, and we found that using exon-spanning reads with higher mapping confidence to detect junctions significantly reduced its false discovery rate.

So my recommendation for choosing between subread and subjunc to align your RNA-seq data is this: if the purpose of your analysis is gene expression analysis (e.g. discovering differentially expressed genes), subread is the better choice (its slightly lower accuracy is outweighed by its higher sensitivity). Otherwise you should use subjunc.

I hope this makes sense to you.

Best wishes,

Wei

**Bernt.Popp** · 06-09-2013, 02:40 AM

Hey Wei,

Another question:
Is it possible to add read-group information to the SAM file from the command line (rather than doing it after the alignment, in order to save IO)?

**Bernt.Popp** · 06-09-2013, 03:53 AM

Hey Wei,

Just finished aligning one color-space exome in 25min on 16 cores, that is insanely fast!

Problem is the resulting SAM file is still in color-space encoding... Am I missing some parameter to output base-space, or would I have to do the conversion to base-space with a script? This also poses the same problem as my question above: if the color-space to base-space conversion is not done directly, one would have to write another file and thus increase disk usage, which slows down the process...

**sdriscoll** · 06-11-2013, 04:38 PM

Also during runtime this section of the output message should be updated to match the redefined meaning of the -d and -D options:

Performing paired-end alignment:
Maximum distance between reads=600
Minimum distance between reads=50
Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
Number of anchors=10
The directions of the two input files are: forward, reversed

**shi** · 06-11-2013, 04:54 PM

Thanks sdriscoll,

They will be changed in the next release. We are now testing the changes we have made. A new release should be available fairly soon.

Best wishes,

Wei

**shi** · 06-11-2013, 05:22 PM

We have just released a new version 1.3.5-p2 which mainly includes the following changes:

(1) Fixed a bug in reporting the mapping location of the mate read when it contains soft-clipped bases.
(2) Reformatted the program usage info and updated the program output info.
(3) Added a '-b' option to subread-align to output base-space reads when mapping color-space reads.

Please check it out from http://subread.sourceforge.net

Thanks again for your helpful comments.

Best wishes,

Wei

**sdriscoll** · 06-12-2013, 03:14 PM

awesome, thanks!

I was wondering what the expected change in the aligner's performance would be by tweaking the -n option. I have seen that the alignments are more strict when I increase the -m value but I don't understand what should happen when we increase or decrease -n. I do understand that these values are defaulted to relatively optimal settings.

**shi** · 06-12-2013, 04:22 PM

Hi sdriscoll,

From our evaluations based on 100bp reads, we found that -n should not be too small (<7) or too big (>20). My explanation for this is that if -n is too small, you will not have enough power to map the reads accurately. When -n is too big, quite a bit of noise seems to be introduced into the mapping locations, which may also decrease the mapping accuracy.

We found that using -n=10 (default setting) or a very close number yielded the best results in terms of sensitivity, accuracy and speed. However, the differences between -n values are quite minor if you always keep the ratio of -m/-n at ~30%.

You are correct that the alignments become more stringent when the -m value is increased. The false positive rate will be reduced with larger -m values, but you will get fewer mapped reads. So the -n and -m options let you strike the balance you want between sensitivity and accuracy. With the default setting, Subread leans a little towards the accuracy end of the spectrum.
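
As a small worked example of the ~30% rule of thumb, the Python sketch below picks an -m value for a few -n values in the 7 to 20 range mentioned above. The helper name and the round-to-nearest rule are assumptions made for illustration; they are not part of subread.

```python
def votes_threshold(n_subreads: int, ratio_percent: int = 30) -> int:
    """Hypothetical helper: an -m value that keeps -m/-n close to ratio_percent."""
    # Integer arithmetic with round-half-up, and never below one vote.
    return max(1, (n_subreads * ratio_percent + 50) // 100)


for n in (7, 10, 14, 20):
    m = votes_threshold(n)
    print(f"-n {n:2d}  -m {m}  (ratio {m / n:.0%})")
```

Under this rule, -n 10 pairs with -m 3, and larger -n values scale -m up so that the stringency stays roughly constant.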

Hope this makes sense to you.

Best wishes,

Wei

**sdriscoll** · 06-12-2013, 11:20 PM

Thank you for the explanation. Does it seem logical, then, to adjust these settings if I have 50bp or 75bp reads instead of 100?

**shi** · 06-13-2013, 03:46 AM

Hi sdriscoll,

I wouldn't recommend changing the subread settings for mapping shorter reads. We have run subread on quite a few datasets containing shorter reads and found the mapping performance to be quite good. The default setting, which is also our recommended setting, is largely insensitive to read length because subread determines read mapping locations primarily from the number of votes, rather than from other metrics such as the number of mismatched bases.

In the subread paper, we also ran subread on a 202bp dataset using its default setting and subread was found to perform very well. I suspect that the default setting may work well with even longer reads.

So I think the default setting should deliver the best mapping results for Subread in most cases.

Best regards,

Wei

**Bernt.Popp** · 10-18-2013, 07:04 AM

Hey Wei,

I am trying to align SOLiD colorspace reads with subread (1.4.0).
The commands used are:
1)
subread-buildindex -c -o human_g1k_v37_decoy human_g1k_v37_decoy.fasta
2)
subread-align -T 16 -I 16 -b -i $ref -r $myfilename".csfasta" -o $mydnaID.$myslide.subread.sam
3) adding readgroup information, sorting and converting to BAM with picard.

Unfortunately, either there is a bug in the conversion from colorspace to basespace (option -b) or I am doing something wrong, as the alignments are totally messy when viewed in IGV (although the reads seem to be at the right position).
Here is an example with a comparison to CUSHAW2 and novoalignCS alignments:
https://www.dropbox.com/s/4vgi0c7ev1...%20subread.jpg
Do you have any idea what could be wrong?

Also the new Indel feature does not emit any variants for the colorspace exomes analyzed...

Cheers,

Bernt

**gringer** · 10-19-2013, 12:16 PM

My guess is that this is due to the colour-space to basespace mapping being too strict.

It is possible that the "bad" alignments that you are seeing are the result of errors in the color-space sequence -- any error in the sequence will cause all the following bases to be incorrect. You're not showing the reference sequence, so I can't work out if this is the case in this situation.

Any colour-space to base-space conversion needs to take into account (and correct) errors so that the base-space sequences are correct. When there is a sequence difference, the conversion needs to make sure that only that position is changed in the base-space version.

Consider the following sequences that map to the same position:

Code:

reference: G101320112
sequence1: X101120112
sequence2: X101312011
sequence3: X201320312

[I chucked an INDEL in there to make things a bit harder]

A naive base-space conversion would convert these sequences as follows:

Code:

reference: GTTGCTTGTC
sequence1: GTTGTCCACT
sequence2: GTTGCAGGTG
sequence3: GAACGAATGA

[apologies if my conversion is incorrect. Fixes appreciated]

Very similar colour-space sequences, but very different base-space sequences.
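
The naive conversion can be sketched in a few lines of Python. This assumes the standard two-bit SOLiD colour code (each colour is the XOR of the codes of adjacent bases, with A=0, C=1, G=2, T=3) and that the unknown leading character X stands for the same primer base G as in the reference line; the function names are my own.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of each base


def decode_naive(colours: str, primer_base: str = "G") -> str:
    """Decode colour-space to base-space with no error correction."""
    code = BASES.index(primer_base)
    decoded = [primer_base]
    for colour in colours:
        code ^= int(colour)  # each colour flips the previous base's code
        decoded.append(BASES[code])
    return "".join(decoded)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


reference, read = "101320112", "101120112"  # reference and sequence1 colours
ref_bases, read_bases = decode_naive(reference), decode_naive(read)
print(ref_bases, read_bases)
print("colour differences:", mismatches(reference, read))
print("base differences:  ", mismatches(ref_bases, read_bases))
```

Because every decoded base depends on the one before it, a single colour difference changes every base from that point to the end of the read.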

A more correct conversion would notice where the errors were in the sequences relative to the index, and modify the next colour-space base as well to something that looks appropriate:

Code:

reference: G101320112
sequence1: X101100112
sequence2: X101313011
sequence3: X231320332

This would end up with these converted sequences:

Code:

reference: GTTGCTTGTC
sequence1: GTTGTTTGTC
sequence2: GTTGCATTGT
sequence3: GATGCTTATC

Which look considerably better.

I hate colour-space because the conversions are very unintuitive, and difficult to explain to other people. About the only nice thing is that reverse complement is just the reverse, but this also means that aligners need to be modified to account for that when working in colour-space (or double-encoded colour-space), and you can get weird unexpected chimeras (e.g. poly-A tails and poly-T heads merging). You can save a lot of pain and confusion by sticking with a base-space sequencer.
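
The reverse-complement property is easy to check with a short Python sketch. It uses the same assumed two-bit colour code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases) and looks only at the colours between adjacent bases, ignoring the primer base; the function names are my own.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_colours(seq: str) -> str:
    """Encode a base-space sequence as the colours between adjacent bases."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))


def reverse_complement(seq: str) -> str:
    """Reverse complement of a base-space sequence."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "GTTGCTTGTC"  # the reference sequence from the example above
forward = to_colours(seq)
reverse = to_colours(reverse_complement(seq))
print(forward, reverse)
print("reverse complement is just the reverse:", reverse == forward[::-1])
```

Complementing a base flips both bits of its code, so the XOR of two adjacent bases is unchanged and only the order of the colours reverses.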
