
**rskr** · 10-20-2013, 04:34 AM

Originally posted by gringer

One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

In a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

Accuracy and recall aren't interchangeable with sensitivity and specificity. Sensitivity is for binary classifiers and recall is for a database. Suppose you framed your search as a binary classifier, where every object in the database was classified as returned or not returned. Since there are so few returnable objects compared to the ones that are not returnable, sensitivity might as well be irrelevant. I.e., you could mark everything as not returnable and be 100.00% accurate, since there is only one correct answer in three billion. This is why it makes more sense to frame the evaluation in a relevance framework, e.g. accuracy and recall.
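To put rough numbers on that argument, here is a minimal sketch in Python. The counts are hypothetical (one correct location among three billion candidates, and a "classifier" that returns nothing at all); they are not taken from the paper.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all candidates that were labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of the truly relevant items that were returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical search space: 3 billion candidate positions, 1 correct answer.
total = 3_000_000_000

# A classifier that marks everything as "not returned".
tp, fp, fn = 0, 0, 1
tn = total - fn

acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)

print(f"accuracy = {acc:.10f}")
print(f"recall   = {rec:.1f}")
```

Accuracy comes out indistinguishable from 1 even though the single correct answer was missed, while recall drops to 0.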

**gringer** · 10-20-2013, 05:24 AM

Originally posted by rskr

Accuracy and recall aren't interchangeable with sensitivity and specificity. Sensitivity is for binary classifiers and recall is for a database. Suppose you framed your search as a binary classifier, where every object in the database was classified as returned or not returned. Since there are so few returnable objects compared to the ones that are not returnable, sensitivity might as well be irrelevant. I.e., you could mark everything as not returnable and be 100.00% accurate, since there is only one correct answer in three billion. This is why it makes more sense to frame the evaluation in a relevance framework, e.g. accuracy and recall.

Understood, thanks for the clarification. I don't deny that accuracy and recall work well for what has been done in the paper, it's just that they're not biology-friendly.

FWIW, medical science uses positive and negative predictive value to account for extreme chances of correct/incorrect classifications. Wikipedia tells me that PPV is equivalent to precision, while sensitivity is equivalent to recall.
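For reference, the usual definitions can be written out from the four confusion-matrix counts. This is only a sketch; the function name and the example counts are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard ratios computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # same formula as recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # same formula as precision
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = confusion_metrics(tp=90, fp=10, tn=9890, fn=10)

for name, value in m.items():
    print(f"{name:12s} {value:.4f}")
```

With these counts sensitivity (recall) and PPV (precision) are both 0.9, whereas specificity and NPV sit close to 1 because true negatives dominate.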

**rskr** · 10-20-2013, 06:00 AM

Originally posted by gringer

Understood, thanks for the clarification. I don't deny that accuracy and recall work well for what has been done in the paper, it's just that they're not biology-friendly.

FWIW, medical science uses positive and negative predictive value to account for extreme chances of correct/incorrect classifications. Wikipedia tells me that PPV is equivalent to precision, while sensitivity is equivalent to recall.

I'm just saying, I wouldn't dumb down the content just because you think doctors aren't smart enough to understand. Many people would consider that arrogant. Besides, many patients find it annoying when doctors treat them as objects, which is just one of the pitfalls of using statistics for medical trials outside of the proper domain, where random variables don't represent people.

**shi** · 10-21-2013, 06:47 PM

Originally posted by Bernt.Popp

Hey Wei,

I am trying to align SOLiD colorspace reads with subread (1.4.0).
The commands used are:
1)
subread-buildindex -c -o human_g1k_v37_decoy human_g1k_v37_decoy.fasta
2)
subread-align -T 16 -I 16 -b -i $ref -r $myfilename".csfasta" -o $mydnaID.$myslide.subread.sam
3) adding readgroup information, sorting and converting to BAM with picard.

Unfortunately either there is some bug in the conversion from colorspace to basespace (option -b) or I am doing something wrong as the alignments are totally messy when viewed in IGV (although the reads seem to be at the right position).
Here is an example with a comparison to CUSHAW2 and novoalignCS alignments:
https://www.dropbox.com/s/4vgi0c7ev1...%20subread.jpg
Do you have any idea what could be wrong?

Also the new Indel feature does not emit any variants for the colorspace exomes analyzed...

Cheers,

Bernt

Hi Bernt,

We found a problem with the color-to-base conversion for reads mapped to the negative strand. We are now investigating this and will fix it with a patch.

Thanks for reporting this.

Wei

**shi** · 10-25-2013, 04:25 AM

Originally posted by shi

Hi Bernt,

We found a problem with the color-to-base conversion for reads mapped to the negative strand. We are now investigating this and will fix it with a patch.

Thanks for reporting this.

Wei

We have fixed the bug. Please update Subread to the latest version (1.4.0-p1) and rerun your alignments.

Best,
Wei

**Bernt.Popp** · 10-25-2013, 06:57 AM

Originally posted by shi

We have fixed the bug. Please update Subread to the latest version (1.4.0-p1) and rerun your alignments.

Best,
Wei

Error persists for me, alignment with version 1.4.0-p1:
https://www.dropbox.com/s/zr0zhrtsqx...or_subread.jpg

I did not rebuild the index though, should I?

Maybe the dynamic programming approach described in Li H, Durbin R Bioinformatics (2009) could help in solving the conversion problem?

Cheers,

Bernt

**yangliao** · 10-25-2013, 01:42 PM

Dear Bernt,

I think the alignment results on SOLiD data have been largely improved in subread-1.4.0-p1. In your screenshot, most reads have their full length or a substantially long part mapped to the reference genome correctly. When I looked closely, I found that the reads with a mismatched part are very likely to have one wrong color in the middle, which ruins the remaining part in the color->base conversion.

There were also a few reads entirely mismatched because Subread on SOLiD data does not compare base by base, but color by color, and it trims off the first two characters from the read before mapping (as bowtie does). If the first base in the SOLiD read is wrong, the entire read has all its bases distorted.

If you convert those highly mismatched reads into colors, you may find that all these reads matched the genome very well in the color space.
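A small sketch of why this happens, assuming the standard SOLiD two-base encoding (A, C, G, T numbered 0 to 3, each color being the XOR of two adjacent base codes). The `decode` helper and the example reads are made up for illustration; this is not Subread's code.

```python
BASES = "ACGT"


def decode(first_base, colors):
    """Translate a color string into bases, starting from a known first base."""
    code = BASES.index(first_base)
    out = []
    for c in colors:
        code ^= int(c)  # each color tells us how to step to the next base
        out.append(BASES[code])
    return "".join(out)


good = decode("A", "31230")         # all colors read correctly
mid_error = decode("A", "31030")    # third color misread (2 -> 0)
wrong_start = decode("C", "31230")  # colors correct, first base wrong

print(good)         # TGATT
print(mid_error)    # TGGCC
print(wrong_start)  # GTCGG
```

One wrong color leaves the bases before it intact and changes every base after it, while a wrong first base changes the whole read, even though the color strings themselves differ from the correct one at a single position or not at all.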

By the way, if the data is from RNA-seq, it may contain junctions that our subjunc program can discover. Subjunc also works on SOLiD reads, so maybe it's worth a try.

Cheers,

Yang

Originally posted by Bernt.Popp

Error persists for me, alignment with version 1.4.0-p1:
https://www.dropbox.com/s/zr0zhrtsqx...or_subread.jpg

I did not rebuild the index though, should I?

Maybe the dynamic programming approach described in Li H, Durbin R Bioinformatics (2009) could help in solving the conversion problem?

Cheers,

Bernt

**gringer** · 10-25-2013, 02:01 PM

Originally posted by yangliao

... Subread on SOLiD data does not compare base by base, but color by color, and it trims off the first two characters from the read before mapping (as bowtie does). If the first base in the SOLiD read is wrong, the entire read has all its bases distorted.

Looks like my guess about not correcting colour-space to base-space conversions was correct (but there was an additional reverse-complement bug).

If you convert those highly mismatched reads into colors, you may find that all these reads matched the genome very well in the color space.

The problem with this "it's almost identical in colour-space" point of view is that people don't live in colour-space when they're looking at genome alignments -- it's just not intuitive when the sequence changes completely half-way through the alignment. Can you really tell me that the following sequences look the same to you?

Code:

.31230
ATGATT
CGTCGG
GCAGCC
TACTAA

Colour-space should only be used as an intermediate data format, and should not be treated as the most correct representation when showing sequences as base space.
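For anyone who wants to check the four sequences above, here is a minimal sketch of the encoding direction, assuming the standard two-base encoding (A, C, G, T numbered 0 to 3, colour = XOR of adjacent base codes). The `encode` helper is just for illustration.

```python
BASES = "ACGT"


def encode(seq):
    """Return the colour string for a base-space sequence (no primer base)."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


sequences = ["ATGATT", "CGTCGG", "GCAGCC", "TACTAA"]
colours = [encode(s) for s in sequences]

for seq, col in zip(sequences, colours):
    print(seq, col)
```

All four print the colour string 31230, although no two of them share a base at any position.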

**shi** · 10-25-2013, 03:07 PM

Yes, I agree the color to base conversion caused a lot of trouble for SNP calling although the reads seem to be mapped to the correct locations. I also agree that the color representations of the alignments are not intuitive and it is hard to see if they match with the reference or not.

One way to get around this issue is possibly to convert the color-space reads to base-space reads before carrying out alignments. This may reduce the number of mapped reads, but it should considerably reduce the number of mismatched bases due to the issue with color to base conversion.

Wei

**shi** · 10-25-2013, 03:23 PM

Alternatively, you may perform a more stringent alignment by using a larger -m value (e.g. -m=6). This will reduce the number of mismatched colors present in mapped reads, which should help alleviate the color to base conversion issue.

**gringer** · 10-25-2013, 09:51 PM

Originally posted by shi

One way to get around this issue is possibly to convert the color-space reads to base-space reads before carrying out alignments. This may reduce the number of mapped reads, but it should considerably reduce the number of mismatched bases due to the issue with color to base conversion

You need to align in colour-space for the reasons I've already mentioned. Basically the base space sequence changes too much. Any base-space alignments would have far too many misses due to small errors in the colour-space sequences.

However, when representing an alignment in base-space, you need to consider the base-space representation of the reference sequence, and modify the aligned colour-space sequence to fix any colour-shift errors.

edit: Note that it is always the case that a single colour-space difference between read and reference sequence is an instrument read error, and will cause a base-shift error in any base-space representation. A single SNP will modify two consecutive colours, and an INDEL will shift all subsequent colours (in the same fashion as in base-space) as well as (possibly) changing the colour at the site of the INDEL.
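A quick sketch of the SNP case, again assuming the standard two-base encoding (A, C, G, T numbered 0 to 3, colour = XOR of adjacent base codes). The sequences and helper names are made up for illustration, and the SNP is placed in the interior of the sequence.

```python
BASES = "ACGT"


def encode(seq):
    """Return the colour string for a base-space sequence."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


def colour_diffs(a, b):
    """Positions at which two equal-length colour strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "ATGATT"
with_snp = "ATGCTT"  # hypothetical SNP: A -> C at the fourth base

ref_colours = encode(reference)
snp_colours = encode(with_snp)
snp_positions = colour_diffs(ref_colours, snp_colours)

# A read whose colours differ from the reference at one isolated position.
error_positions = colour_diffs(ref_colours, "31030")

print(ref_colours, snp_colours, snp_positions)
print(error_positions)
```

The SNP shows up as two adjacent colour changes (31230 versus 31320, positions 2 and 3), whereas the isolated single-colour difference is the pattern that would be treated as a read error.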
