Unconfigured Ad

**Ben Langmead** · 08-10-2010, 06:05 AM

Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben

**IrisZhu** · 08-10-2010, 06:30 AM

Originally posted by Ben Langmead View Post

Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben

Thanks Ben.
I see. I'll map the two mates separately.

Iris

**Zigster** · 08-10-2010, 11:33 AM

A more useful way to pose this question in the future is to find one example of a read or a pair of reads that you expected would map, but did not.

**Lee Sam** · 08-10-2010, 11:45 AM

Originally posted by Ben Langmead View Post

Hi Iris,

Paired-end alignment against the transcriptome is tricky, because the distance between the mates in genome space is affected both by fragment length, and on the "shape" of the transcriptome. E.g. when a fragment spans a long intron, the intron size is in some sense "added" to the fragment length. A better way of measuring sequencing quality is to break the mates apart and align both files in an unpaired manner. And an even better way is to use TopHat or another tool that attempts spliced mapping, so that alignments that span splice junctions can be found as well.

Hope that helps,
Ben

If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?

**IrisZhu** · 08-10-2010, 11:58 AM

Originally posted by Lee Sam View Post

If she's aligning against the set of RefSeq transcripts, wouldn't she automatically get the alignments for reads spanning splice junctions?

I am supposed to. That's why I don't use tophat.

**IrisZhu** · 08-10-2010, 12:02 PM

Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

===mate1=====
reads processed: 130653415
# reads with at least one reported alignment: 47875358 (36.64%)
# reads that failed to align: 82659307 (63.27%)
# reads with alignments suppressed due to -m: 118750 (0.09%)
Reported 47875358 alignments to 1 output stream(s)

====mate2=====
# reads processed: 130653415
# reads with at least one reported alignment: 35773578 (27.38%)
# reads that failed to align: 94798249 (72.56%)
# reads with alignments suppressed due to -m: 81588 (0.06%)
Reported 35773578 alignments to 1 output stream(s)

If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
Any comments are welcome

Iris

**Lee Sam** · 08-10-2010, 12:03 PM

Originally posted by IrisZhu View Post

I am supposed to. That's why I don't use tophat.

that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?

Originally posted by IrisZhu View Post

Following Ben's suggestion, so I mapped the two mates seperately. But still only a small percentage of reads get mapped.

===mate1=====
reads processed: 130653415
# reads with at least one reported alignment: 47875358 (36.64%)
# reads that failed to align: 82659307 (63.27%)
# reads with alignments suppressed due to -m: 118750 (0.09%)
Reported 47875358 alignments to 1 output stream(s)

====mate2=====
# reads processed: 130653415
# reads with at least one reported alignment: 35773578 (27.38%)
# reads that failed to align: 94798249 (72.56%)
# reads with alignments suppressed due to -m: 81588 (0.06%)
Reported 35773578 alignments to 1 output stream(s)

If bowtie works well, then there are two explanations: either the human refseqs are far from complete, or, the sample is contaminated with a lot of pre-mRNA/genomic DNA.
Any comments are welcome

Iris

Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?

**IrisZhu** · 08-10-2010, 12:13 PM

Anyone knows how to make bowtie report unmapped reads?
What's the option pamameter?
Thanks!
Iris

**IrisZhu** · 08-10-2010, 12:17 PM

Originally posted by Lee Sam View Post

that's why I'm a bit mystified by Ben's comment about insert sizes because that shouldn't be an issue if the fragment you're sequencing is basically represented by the RefSeq transcript. Off the top of my head (e.g. I'm not thinking about this very hard), my initial suspicions might be that you're sequencing some strange artifacts (maybe some contamination?). Maybe try to align a subset of your reads to the set of UCSC transcripts to see if the number improves substantially (as you might expect with more references?). Also, I noticed you aren't supplying Bowtie with arguments related to the inset size? Are your insert sizes in line with the bowtie defaults?

Looks like you beat me to posting. I guess the insert size issue is refuted by the low individual mapping rate... Might as well align to hg18/19 to see if you've got genomic contamination, right?

I mapped two mates seperately, i.e. in a unpaired manner. So no inset size here.

**Zigster** · 08-10-2010, 12:52 PM

Originally posted by IrisZhu View Post

Anyone knows how to make bowtie report unmapped reads?
What's the option pamameter?
Thanks!
Iris

no but that's easy enough to infer:

#get defline, strip >
grep '>' mate1.fa | sed -e 's/>//' > in.seqs

#get first col with seqnames
cut -f1 output.bowtie > out.seqs

#get symmetric difference
sort in.seqs out.seqs | uniq -u

**babati** · 08-10-2010, 12:59 PM

IrisZhu,

to make bowtie report the unmapped reads, use the command line option:
--un <output_file.fa>

Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!

**Zigster** · 08-10-2010, 01:12 PM

Originally posted by babati View Post

IrisZhu,
to make bowtie report the unmapped reads, use the command line option:
--unmapped <output_file.fa>

can you point me to where that feature is described? it is not working for me.

**IrisZhu** · 08-10-2010, 04:17 PM

Originally posted by babati View Post

IrisZhu,

to make bowtie report the unmapped reads, use the command line option:
--un <output_file.fa>

Do you know the protocol used to generate the RNAseq data? For example, the totalRNA protocol yields a lot of ribosomal and non-coding RNAs, most of which are not in RefSeq, while polyA-selected protocol should yield more coding RNAs.

Also, what is the source of your data? Some highly-expressed transcripts in diseased tissues may not be in RefSeq.

In either case, you should try Lee Sam's suggestion and map directly to the genome and see how many more reads map that didn't map to RefSeq. Let us know what you find!

My sample is supposed to be mRNA. I mapped them to the whole genome, ~30% was mapped. If i used bowtie properly , then it must be the problems of sequencing data itself.
And I just tried mapping another batch of data to human refseqs, the results are much more reasonble:
reads processed: 76269225
# reads with at least one reported alignment: 46230181 (60.61%)
# reads that failed to align: 29912348 (39.22%)
# reads with alignments suppressed due to -m: 126696 (0.17%)

BTW, is my command "bowtie -p 10 -m 10 -f hg19 myfile.fa (for unpaired reads)" right?
Thanks!

**IrisZhu** · 08-11-2010, 10:46 AM

Originally posted by Zigster View Post

no but that's easy enough to infer:

#get defline, strip >
grep '>' mate1.fa | sed -e 's/>//' > in.seqs

#get first col with seqnames
cut -f1 output.bowtie > out.seqs

#get symmetric difference
sort in.seqs out.seqs | uniq -u

Oh yes we can do it this way ....

Thank you!

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, Yesterday, 11:08 AM	0 responses 6 views 0 reactions	Last Post by SEQadmin2 Yesterday, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 11 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 53 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

transcriptome mapping with bowtie -- only 22% reads hit

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News