
**fishtank** · 07-29-2009, 03:33 PM

Originally posted by westerman

That is low, but it depends on your reference, your DNA, and your organism, which I do not think you have stated. Given this thread, I presume that your reference is genomic and your DNA is microRNA. In that case you have to ask yourself, "how much of the genome do I expect to be miRNA versus other RNA, genes, and structural?" If the answer is that you expect only 0.3% of your genome to be miRNA, then your mapping is fine.

microRNAs were discovered so recently (i.e., since I've been out of school) that I am not sure how much of a genome should be miRNA. I could tell you roughly how much of a genome should be genes, and thus how much coverage an mRNA experiment should have, but not for miRNAs.

But according to this ABI document, they are getting 50% of reads mapped to miRNA, so 0.3% is worrying. And it is already enriched for small RNA. I am skeptical of the 50% claim, though. What are other people getting?

Attached Files

cms_057560.pdf (617.4 KB, 82 views)

**nilshomer** · 07-29-2009, 04:25 PM

Originally posted by westerman

2 mismatches is great for SNP discovery since any given read is unlikely to have more than 1 SNP in it. Anything else can be discarded as error.

The fraction of possible 50bp reads with X SNPs (from hg18 and dbsnp) is:

0 84.08%
1 13.02%
2 2.30%
3 0.40%
4 0.10%
5 0.03%
...

so make your own judgment.
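To relate these figures to a mismatch cutoff, the cumulative fractions can be tallied directly. This is only a minimal sketch that uses the percentages listed above; the tail beyond 5 SNPs is elided in the list and is ignored here, and the function name is made up for illustration.

```python
# Fractions (in percent) of possible 50 bp reads carrying X SNPs,
# copied from the list above. The elided tail ("...") is not included.
snp_fraction = {0: 84.08, 1: 13.02, 2: 2.30, 3: 0.40, 4: 0.10, 5: 0.03}

def cumulative_fraction(max_snps):
    """Percent of reads with at most max_snps SNPs, per the listed values."""
    return sum(pct for n, pct in snp_fraction.items() if n <= max_snps)

for cutoff in range(6):
    print(f"<= {cutoff} SNPs: {cumulative_fraction(cutoff):.2f}%")
```

By these numbers, roughly 99.4% of possible reads carry two or fewer SNPs.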

On the other hand, some of us have to deal with DNA from species only partially related to our known (and often incomplete) reference sequence. We then use larger mismatch parameters and are thankful for what information we do get back.

I think that, if possible, you should align with the greatest sensitivity you can, since you will recover the most data. SOLiD color error rates are non-trivial, but color errors can be easily corrected (when done properly with dynamic programming, not valid-adjacent rules). I would recommend allowing somewhere around 10% color differences (in most cases SNPs count as two, color errors as one).

**OneManArmy** · 07-29-2009, 04:31 PM

Originally posted by Sheila

In the configuration file you can choose between "all" or "unique".
all = all mapping positions
unique= unique mapping positions

Thanks. Even with this, the pipeline still discards the reads that map to multiple places - even though a read may map to a reference with 0 mismatches and another one with 2 mismatches.

**Sheila** · 07-31-2009, 01:17 AM

Originally posted by fishtank

I am wondering where you came to the conclusion that last bases of the miRNA that are close to the adaptor have a high error rate. Could these be due to miRNA editing?

Hi,
It is known that the last bases close to the adaptor have a higher error rate, so I would not use 0 mismatches: first, because you would not detect any isomiR with a 1 nt difference (polymorphic or not), and second, because of the higher error rate at the end of the sequences.
I'm still playing with the parameters; it's hard to define what's best.

S.

**fishtank** · 07-31-2009, 12:21 PM

I am trying to figure out how the *.csfasta_extend.counts.35.6 gets generated from .csfasta_extend.ma.35.6. In the .csfasta_extend.ma.35.6, what does

>1_17_829_F3,220_-79.6.21
T13100202312110020020101102011303111

mean? I saw some documents that say it should be
>TAG_ID,LOCATION,MISMATCHES.

So 1_17_829_F3 is the TAG_ID.
Is 6 the number of mismatches? And how do I decode the location part?

Thanks.

**fishtank** · 07-31-2009, 11:14 PM

Using rna2map, it seems to me that the start/end chromosome coordinates in the *.csfasta_extend.counts.35.6 file are offset by 1 relative to the reads, i.e. to view the read sequence correctly, I have to enter chr:start-1 to end-1 in the UCSC genome browser.
But if I take the chromosome location specified in the generated mirBase.13.0.fasta, I don't need the offset to view the reference sequence. Why the difference?
Can someone confirm this?

**OneManArmy** · 08-03-2009, 02:08 PM

Originally posted by fishtank

I am trying to figure out how the *.csfasta_extend.counts.35.6 gets generated from .csfasta_extend.ma.35.6. In the .csfasta_extend.ma.35.6, what does

>1_17_829_F3,220_-79.6.21
T13100202312110020020101102011303111

PANEL_XCOORD_YCOORD_[F3/BC],FASTASEQNUMBER_LOCATION.MISMATCHES.LENGTH

where FASTASEQNUMBER is the 1-indexed sequence number in your multi-entry fasta file.
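For anyone scripting against these files, the layout above can be decoded with a few lines of Python. This is only a sketch based on the field order given here, not on rna2map's own code; the names `Hit` and `parse_ma_header` are made up, the handling of several comma-separated hits is an assumption, and the meaning of a negative LOCATION is not stated in this thread, so it is kept as a signed integer.

```python
import re
from typing import NamedTuple

class Hit(NamedTuple):
    seq_number: int  # 1-indexed entry in the multi-entry FASTA reference
    location: int    # signed position; the meaning of the sign is not stated in the thread
    mismatches: int
    length: int

# FASTASEQNUMBER_LOCATION.MISMATCHES.LENGTH
HIT_RE = re.compile(r"^(\d+)_(-?\d+)\.(\d+)\.(\d+)$")

def parse_ma_header(line):
    """Split a .ma header line into the tag ID and a list of Hit tuples."""
    fields = line.strip().lstrip(">").split(",")
    tag_id = fields[0]  # PANEL_XCOORD_YCOORD_[F3/BC]
    hits = []
    for field in fields[1:]:  # assumed: one comma-separated field per hit
        match = HIT_RE.match(field)
        if match is None:
            raise ValueError(f"unexpected hit field: {field!r}")
        hits.append(Hit(*(int(g) for g in match.groups())))
    return tag_id, hits

tag_id, hits = parse_ma_header(">1_17_829_F3,220_-79.6.21")
print(tag_id, hits)
```

For the example header this gives sequence number 220, location -79, 6 mismatches, and length 21.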

**fishtank** · 08-03-2009, 11:35 PM

Originally posted by OneManArmy

PANEL_XCOORD_YCOORD_[F3/BC],FASTASEQNUMBER_LOCATION.MISMATCHES.LENGTH

where FASTASEQNUMBER is the 1-indexed sequence number in your multi-entry fasta file.

Thanks. It took me a while to realize that the location part starts with the sequence number in the FASTA file. Is there any explanation for the chromosome coordinate "offset" in *.csfasta_extend.counts that I posted about earlier? Thanks again.

**kevleb** · 09-02-2009, 02:17 AM

I can provide some statistics concerning the small RNA matching pipeline from AB.
I used a purified small RNA human sample in a barcoding experiment with 7.3M reads.

I've run the pipeline many times with different parameters:
- SeedMM: 0, 1, 2, 3
- ExtendMM: 1, 3 or 6
- ReadType: random or unique

Run labels read as follows: R_0_6 = Random, 0 seed MM and 6 extend MM.
The columns are tag count, total beads, and uniquely placed beads.

| Run | Tags | Total | Unique |
|---|---|---|---|
| R_0_6 | 983,679 | 1,023,809 | 527,973 |
| R_1_6 | 1,377,737 | 1,433,096 | 752,479 |
| R_2_6 | 1,677,397 | 1,739,800 | 925,693 |
| R_3_6 | 1,762,540 | 1,834,924 | 981,906 |
| R_0_1 | 441,813 | 469,826 | 162,466 |

I do not perform genome mapping, but we get between 13% and 24% of usable reads
mapped to a miRNA reference (the more mismatches we allow, the more reads map to a miR).
Note that the proportion of uniquely placed beads does not really change (~55%),
whereas I would have thought that the more MM we allow, the more likely it is that a read matches
multiple reference miRs and so does not map uniquely... Any idea where I'm wrong?

Anyway, it seems in the later analysis that miR expression is not
clearly affected by the parameters we chose to run the pipeline (hopefully).
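The proportions mentioned above can be recomputed from the table. A minimal sketch: the counts are copied from the post, while using the 7.3M total reads as the denominator for the "usable reads" percentage is an assumption, and `summarize` is a made-up helper name.

```python
# Counts copied from the table above: run -> (tags, total beads, uniquely placed beads).
runs = {
    "R_0_6": (983_679, 1_023_809, 527_973),
    "R_1_6": (1_377_737, 1_433_096, 752_479),
    "R_2_6": (1_677_397, 1_739_800, 925_693),
    "R_3_6": (1_762_540, 1_834_924, 981_906),
    "R_0_1": (441_813, 469_826, 162_466),
}

# "7.3M reads" from the post; treating it as the denominator is an assumption.
TOTAL_READS = 7_300_000

def summarize(tags, total, unique):
    """Return (unique beads / total beads, tags / all reads) as fractions."""
    return unique / total, tags / TOTAL_READS

for label, counts in runs.items():
    unique_frac, mapped_frac = summarize(*counts)
    print(f"{label}: unique/total = {unique_frac:.1%}, tags/reads = {mapped_frac:.1%}")
```

Under that assumption, the four runs with 6 extend mismatches come out at about 52% to 54% uniquely placed, and the tag counts correspond to about 13% to 24% of the 7.3M reads.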

**das** · 10-21-2009, 06:35 AM

Hi:
I was wondering what people are doing with their miRNA data to quantitate
miR and miR* from their sequence reads, especially novel miR*.
(I wish they would change the miR* nomenclature to the more sensible 3p/5p one.)
Is anybody aware of any computational approaches to automate miR vs miR*
quantitation?

Also, I was wondering how people are addressing sense/antisense
mapping issues related to double-stranded regions in pre-miRs.
We are still not sure how the small RNA pipeline handles strand information, or how it counts reads when they map to both strands (it looks like it double-counts them).
And how do we summarize read counts efficiently in table form (not as a genome browser track) with strand information preserved?

Thanks

**aguffanti** · 12-15-2009, 07:07 AM

Realistic miRNA mapping from SREK

Hi. I have long experience in miRNA identification from 454 data, and since March of this year I have been grinding my teeth on SOLiD SREK results. I am using SHRiMP and custom-made scripts both for genome mapping and for mapping against the miRBase reference (both mature and hairpin).

Even biologically, the claim of 50% miRNAs in a sample is unbelievable. I think this number includes tRNAs (yes, there are many tRNA stem fragments very similar to miRNAs), snoRNAs, etc. I am very cautious and conservative in this classification. I would say that the mapping percentage of small RNAs from a SREK experiment against the human genome will be between 50% and 60% of the reads. Known (i.e. well-established) miRNAs will be from 5% to 15% of the total beads, i.e. from 10% to 30% of the mappable reads. You should be well aware of the danger of false positives in known miRNA identification too. More details on request.

Originally posted by fishtank

But according to this ABI document, they are getting 50% of reads mapped to miRNA, so 0.3% is worrying. And it is already enriched for small RNA. I am skeptical of the 50% claim, though. What are other people getting?

**mmuratet** · 02-10-2010, 07:47 AM

Is there any documentation of the algorithms inside rna2map?

**mmuratet** · 02-28-2010, 05:20 AM

Has anyone ever found a detailed description of the rna2map tool?

**patelhardip** · 05-26-2010, 06:02 PM

How mismatches are calculated

In an ideal world, I would expect the rna2map pipeline to report the number of mismatches between "the part of the read that aligns to the reference" and "the reference sequence". That is to say, in the old BLAST way of doing things, the mismatches within high-scoring pairs.
However, after digging into the code of the rna2map pipeline and analyzing mapping results, I have discovered that rna2map stupidly includes the mismatches found against the adaptor sequence in the count reported for the alignment.

Therefore, if the alignment reads as follows

>TagID1,1_1000.6.22
>TagID2,1_1000.6.22

it means that each tag has six errors in total.
Now consider this: in the first case, your miRNA aligns to the reference with 0 mismatches over 22 bp (which is great), but the adaptor aligns with 6 mismatches (who cares).
In the second case, your miRNA aligns to the reference with 6 mismatches over 22 bp (which is not so great), but the adaptor aligns with 0 mismatches (who cares).

Now, if we looked only at the alignment file and not at the reads that actually align, we would be tempted to use both reads with equal weight. However, in the real world it would not be such a great idea to use a read with six mismatches over 22 bp (~73% match).
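The ~73% figure, and the way the two cases collapse to the same reported number, can be checked with a short sketch. The split of the six mismatches between miRNA and adaptor below is the hypothetical one from the two cases above, and the function name is made up for illustration.

```python
def percent_identity(read_length, mismatches_in_read):
    """Percent of aligned read positions that match the reference."""
    return 100.0 * (read_length - mismatches_in_read) / read_length

# Two hypothetical tags that would both be reported as "...6.22".
# The miRNA/adaptor split is assumed for illustration only.
cases = {
    "TagID1": {"mirna_mm": 0, "adaptor_mm": 6},
    "TagID2": {"mirna_mm": 6, "adaptor_mm": 0},
}

for tag, mm in cases.items():
    reported = mm["mirna_mm"] + mm["adaptor_mm"]  # what the header would show
    identity = percent_identity(22, mm["mirna_mm"])
    print(f"{tag}: reported mismatches = {reported}, miRNA identity = {identity:.1f}%")
```

Both tags report 6 mismatches, yet the identity over the 22 bp miRNA part is 100% in one case and about 72.7% in the other.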

Has anybody looked into this kind of thing before, or accounted for it in their analysis?

Please share your views and opinions and we can discuss it further.

cheers

hardip
