Hi guys,
i know there are similar threads about that problem and yet it seems that is still quite confusing. Hell, is still confusing to me so i'd like to share my experience with it.
So i got two sets of PE RNA-seq as BAM files, reads were aligned with TopHat. After a while i've been told it was ssRNA-seq. Okay, but stranded how i haven't been told. Okay. Found about RSeQC and there i find 'infer_experiment.py' which tells me how the reads are stranded. Seems there are two ways: 1++,1–,2+-,2-+ and 1+-,1-+,2++,2–. From RSeQC site:
1. 1++,1–,2+-,2-+
read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
2. 1+-,1-+,2++,2–
read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
So far so good. One of my sets is case.1 the other case.2. So i figure out how to split the reads so i can analyze the data properly. But during analysis in IGV i see weird things so i continue reading on the web and figure out that my reads were not aligned with 'library-type' option but just with default setting aka fr-unstranded. So i thought,okay, maybe weirdness comes from the mapping. So i read tophat manual which makes it even more weird to understand what exactly 'library-type' one needs in each case of strandedness according to RSeQC. Anyhow, i figure out that case.1 should be 'fr-secondstrand' and case.2 should be 'fr-firststrand'. So i aligned myself the reads with the appropriate 'library-type' option of tophat and then reran 'infer_experiment.py' on original bam and the bam i created.
Output from infer_experiment.py:
--original bam--
Fraction of reads failed to determine: 0.0010
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9123
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0868
--my bam--
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9124
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0867
Seems to me there is no significant improvement. So my weird stuff which i see in IGV should be legit.
However this mapping experiment leads me to believe that there is no need for 'library-type' option in TopHat and one must not care much about it. Just map and then see how the library was stranded with RSeQC. Extract the reads properly and proceed with analysis.
Well that was my odyssey
Cheers
D.
PS: if i got it all wrong please correct me. it will be appreciated.
i know there are similar threads about that problem and yet it seems that is still quite confusing. Hell, is still confusing to me so i'd like to share my experience with it.
So i got two sets of PE RNA-seq as BAM files, reads were aligned with TopHat. After a while i've been told it was ssRNA-seq. Okay, but stranded how i haven't been told. Okay. Found about RSeQC and there i find 'infer_experiment.py' which tells me how the reads are stranded. Seems there are two ways: 1++,1–,2+-,2-+ and 1+-,1-+,2++,2–. From RSeQC site:
1. 1++,1–,2+-,2-+
read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
2. 1+-,1-+,2++,2–
read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
So far so good. One of my sets is case.1 the other case.2. So i figure out how to split the reads so i can analyze the data properly. But during analysis in IGV i see weird things so i continue reading on the web and figure out that my reads were not aligned with 'library-type' option but just with default setting aka fr-unstranded. So i thought,okay, maybe weirdness comes from the mapping. So i read tophat manual which makes it even more weird to understand what exactly 'library-type' one needs in each case of strandedness according to RSeQC. Anyhow, i figure out that case.1 should be 'fr-secondstrand' and case.2 should be 'fr-firststrand'. So i aligned myself the reads with the appropriate 'library-type' option of tophat and then reran 'infer_experiment.py' on original bam and the bam i created.
Output from infer_experiment.py:
--original bam--
Fraction of reads failed to determine: 0.0010
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9123
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0868
--my bam--
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9124
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0867
Seems to me there is no significant improvement. So my weird stuff which i see in IGV should be legit.
However this mapping experiment leads me to believe that there is no need for 'library-type' option in TopHat and one must not care much about it. Just map and then see how the library was stranded with RSeQC. Extract the reads properly and proceed with analysis.
Well that was my odyssey
Cheers
D.
PS: if i got it all wrong please correct me. it will be appreciated.
Comment