tophat + cufflinks: strand information

blanco replied

04-20-2012, 12:35 AM
I agree, a very useful thread indeed.
Still it has been some time since it started and now tophat has a -library--type option.

I have sequences from a dUTP strand specific library and I ran it through tophat with the -library--type option. By using this option nearly all (99.9%) lines in the .sam file have the XS:A tag.

By applying this -library--type option to a strand specific library can one safely use the XS:A tag to say from which strand the read came from?
Leave a comment:
sterding replied

04-04-2012, 06:43 AM
I just noticed that there are FLAG values of 256, 272 (not just 0, 16, 4 for single-strand RNAseq). Since 256=256+0, 272=256+16, so we may assign 256 to plus (+) strand, and minus (-) for 272 reads.

So, the above code for adding XS:A tag should be modified as:

samtools view accepted_hits.bam | awk '{if($0 ~ /XS:A:/) print $0; else {if($2%256==16) print $0"\tXS:A:-"; if($2%256==0) print $0"\tXS:A:+";}}' > accepted_hits.sam

UPDATE: This only works for single-end lib. For pair-end lib, the FLAG should be odd number, but in any case, reads on minus strand always have 1 on the 5th bit of binary code (e.g. 0x10 =10000). Thanks to Wei's suggestion on this. Here is the updated code:

samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x10)) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' accepted_hits.sam

Last edited by sterding; 05-09-2012, 06:11 AM.
Leave a comment:
sterding replied

04-03-2012, 07:46 PM
Very useful post!

I followed this post due to the same problem: For strand-specific RNAseq data, the accepted_hits.sam (converted from BAM) from Tophat does not contain XS:A:+/- tag for unspliced alignment (yes, for spliced one, there are XS:A tag). This will result in the loss of strand information if I run cufflinks directly using the Tophat output file.

To fix the issue, I found it's useful if I manually add XS:A tag for those non-spliced alignments. For example:

Code:

samtools view accepted_hits.bam | awk '{if($0 ~ /XS:A:/) print $0; else {if($2==16) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' > accepted_hits.sam

Again, as Thomas Doktor said, it's safe to do so only if the lib is from strand-specific RNAseq. For non-strand-specific one, it makes no sense to do so.

What I want to comment here is, there is still slightly problem for the solution -- For multiple mapped reads, the FLAG (e.g. 2nd column of SAM) is 256. In that case, we can't really tell which strand the reads map to. This is especially the case for output from aligner allowing a number of mismatches, or reads mapped in repeat region. To guarantee the assembly of de novo transcript, a dirty way to do that might be to duplicate the reads and assign both XS:A:- and XS:A:+. I've not tested this solution. If anyone else has better solution, I would like to know.
Leave a comment:
johnwu replied

07-30-2011, 08:42 PM
Hi! I'm new to the RNA-seq field, and now I'm trying to use tophat-cufflinks pipeline to analyze strand-specific libraries (illumina).

After reading above posts in this thread, I'm still confused by the existence of such alignment (in SAM format):

ERR030871.1444950 16 chr10 83984 255 71M689N29M * 0 0 CACTGACTCCATCAGCTCCGCGCCTTCGGTGTAGTGTCCCTTAGCCCAGTTGTTTCCGGCCCCACACTGACCGCTGGCCTCGTTGTAGTACACGTTGATG G;EHHEGDGCFGHAFGBHIHHHFHHDHHHHHHHHHHIHHHHHHHHHHHHHFHHEHHHHHHHHHAHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHH NM:i:1 XS:A:- NH:i:1

According to the read's FLAG (0x10 SEQ being reverse complemented) and the XS tag (XS:A:-), the complement sequence of the read (according to its FLAG: 0x10 SEQ being reverse complemented) could be aligned to negative strand of chromosome 10.

What I don't get is that since the reverse complement sequence of the read could be aligned to negative strand of chromosome, then the sequence of the read should be directly aligned to positive strand as well, why doesn't tophat just report so? like:

ERR030871.1444950 0 chr10 83984 255 71M689N29M * 0 0 CACTGACTCCATCAGCTCCGCGCCTTCGGTGTAGTGTCCCTTAGCCCAGTTGTTTCCGGCCCCACACTGACCGCTGGCCTCGTTGTAGTACACGTTGATG G;EHHEGDGCFGHAFGBHIHHHFHHDHHHHHHHHHHIHHHHHHHHHHHHHFHHEHHHHHHHHHAHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHH NM:i:1 XS:A:+ NH:i:1

By the way, I'm using Tophat version 1.30.

Thanks for your help
Leave a comment:
Thomas Doktor replied

05-17-2010, 05:03 PM
It looks at the genomic sequence of the splice site spanning reads to identify the strand of the transcript. XS tells you the strand of the transcript that produced the read, not the read itself.
Leave a comment:
joung replied

05-17-2010, 02:08 PM
Hi everyone,

I am still not sure about XS tag.
How does tophat generate XS:A:+/- ?
Leave a comment:
xguo replied

05-13-2010, 01:42 PM
Originally posted by xhuister View Post

Hi all,

I tested the strand-specific RNA-seq using TopHat and cufflinks in order to get enriched regions or transcript boundaries.

1) TopHat output accepted_hits.sam file, where the splice reads have "XS:A:+-" tag while other reads don't.

2) I seperated accepted_hits.sam into two files accepted_hits.plus.sam and accepted_hits.minus.sam. The XS:A:+ reads and reads with flag=0 are put into accepted_hits.plus.sam and XS:A:-... to minus.sam.
Here, I didn't add XS:A tag to exon-reads mannually.

3) Cufflinks to get regions:
cufflinks accepted_hits.minus.sam
cufflinks accepted_hits.plus.sam

In one output transcript.expr of cufflinks:
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length
CUFF.1.1 21679 chloroplast 165 1522 67673.5 1 1 66397.6 68949.5 414.607 1357
CUFF.3.1 21683 chloroplast 1788 1889 2424.11 1 1 1538.95 3309.27 14.8515 101

In this post http://seqanswers.com/forums/showthr...links+coverage, Cole said that "Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. "

But, when I multiplied the above coverage and length: 414.607*1357=562621.699, it just didn't meet the actual number of reads in that transcript which is 11341.
$ awk -F'\t' '$3=="chloroplast" && $4>=165 && $4<=1522' accepted_hits.minus.sam | wc -l
11341

Am I wrong in calculating the number of reads in one transcript?

Another question:
In the output accepted_hits.sam of Tophat, there are 408,360 rows. But, when I used Bowtie to do the read mapping with at most 2 mismatches and only one location, the number is much larger 2,288,016.
Does Tophat filter the mapping results and how? I thought that the number of mapped reads in tophat should larger than Bowtie's result, because Tophat contains the junction reads.

Thank you in advance.

The coverage in cufflink output seems to be base-level coverage. If the read length is 50, 11341*50 = 567050 is roughly the same as 562621 derived by the multiplification of coverage and transcript length. Does it make more sense to use FPKM to get expected reads for each transcript?
Leave a comment:
xhuister replied

05-12-2010, 11:12 AM
Hi all,

I tested the strand-specific RNA-seq using TopHat and cufflinks in order to get enriched regions or transcript boundaries.

1) TopHat output accepted_hits.sam file, where the splice reads have "XS:A:+-" tag while other reads don't.

2) I seperated accepted_hits.sam into two files accepted_hits.plus.sam and accepted_hits.minus.sam. The XS:A:+ reads and reads with flag=0 are put into accepted_hits.plus.sam and XS:A:-... to minus.sam.
Here, I didn't add XS:A tag to exon-reads mannually.

3) Cufflinks to get regions:
cufflinks accepted_hits.minus.sam
cufflinks accepted_hits.plus.sam

In one output transcript.expr of cufflinks:
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length
CUFF.1.1 21679 chloroplast 165 1522 67673.5 1 1 66397.6 68949.5 414.607 1357
CUFF.3.1 21683 chloroplast 1788 1889 2424.11 1 1 1538.95 3309.27 14.8515 101

In this post http://seqanswers.com/forums/showthr...links+coverage, Cole said that "Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. "

But, when I multiplied the above coverage and length: 414.607*1357=562621.699, it just didn't meet the actual number of reads in that transcript which is 11341.
$ awk -F'\t' '$3=="chloroplast" && $4>=165 && $4<=1522' accepted_hits.minus.sam | wc -l
11341

Am I wrong in calculating the number of reads in one transcript?

Another question:
In the output accepted_hits.sam of Tophat, there are 408,360 rows. But, when I used Bowtie to do the read mapping with at most 2 mismatches and only one location, the number is much larger 2,288,016.
Does Tophat filter the mapping results and how? I thought that the number of mapped reads in tophat should larger than Bowtie's result, because Tophat contains the junction reads.

Thank you in advance.

Last edited by xhuister; 05-12-2010, 11:23 AM.
Leave a comment:
damiankao replied

05-11-2010, 02:21 AM
Bioscope will only do split reads according to the .gtf you've given it. It will not to do de novo split reads.
Leave a comment:
Thomas Doktor replied

05-10-2010, 05:54 PM
Haneko,

Does BioScope report a strand for the spliced alignments which correspond to the strand of the "parent" transcript? If so you could insert the XS tag only in the spliced reads. I would not insert XS tag in the reads contained within exons since it would probably only confuse Cufflinks or give you faulty estimates of the expression of transcripts.

Alternatively you could ask Cole (the principal author of TopHat and Cufflinks) if there is any way to run TopHat on colorspace sequences. I'm pretty sure Bowtie supports it in any case, so TopHat might eventually as well (or it's already there and I just haven't seen the option for it).
Leave a comment:
Haneko replied

05-10-2010, 05:09 PM
Hi Thomas,

Yes, bioscope supports spliced alignments. Unfortunately, the rna-seq experiment was not strand specific.

If that's the case, will assigning the wrong strand to an unspliced read result in any difference in cufflinks output? I understand that the flag is compulsory for spliced reads, but not unspliced ones..
Leave a comment:
Thomas Doktor replied

05-10-2010, 06:17 AM
Haneko,

If your reads came from a strand specific RNA-seq experiment, it should be safe to do that. However, I have no experience with BioScope alignment, does it support alignment of reads spanning splice junctions?
Leave a comment:
Haneko replied

05-09-2010, 05:18 PM
Hi Thomas,

I'm using SOLiD BioScope SAM outputs for input into cufflinks, and it doesn't report the XS:A tag regardless of whether it is spliced or unspliced. My initial assumption was that the XS tag represents strand of the alignment of the read, but as you said, this is incorrect. However, if the read were to be uniquely mapped onto the reference genome, would it be ok to assume that the strand from which the read came from is the same as the strand to which it was aligned?
Leave a comment:
xhuister replied

05-09-2010, 11:32 AM
Ok, I'll have a try. Thank you Thomas.
Leave a comment:
Thomas Doktor replied

05-09-2010, 10:50 AM
No, I don't think TopHat "supports" strand-specific RNA-seq yet. Of course it will align the reads and all, it just won't report an XS tag other than for the splice junction spanning reads. If you're sure your reads came from a strand specific RNA-seq experiment it would probably be safe to go ahead and supply the SAM file with XS tags corresponding to the strand of the read itself. I'm not sure it will have a huge effect on the output of Cufflinks, it would be nice if you could report back on that.
Leave a comment:

Previous 1 2 template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News