Hello all,
New to forum here, so please correct me if posted in wrong section.
I am working with ribosome profiling (see http://goo.gl/hvN8PP for a review) libraries generated on an Illumina HiSeq platform. For a lot of the analyses, I need to know precisely where the reads are aligning. Specifically, I need to know if the reads aligning to the coding sequence of a given mRNA are in the 0, +1, or +2 reading frame.
Right now, my workflow is:
1.) Use tophat (with a GFF file as a transcript guide) to map reads to the genome**
2.) Use samtools to generate a sam file of the mapped reads
3.) Read the allignment data into a R table, read the corresponding species GTF file into a table, then create a "local transcript coordinate" and a "CDS coordinate" for each read.
So all 48 million of my reads have a number assigned to them which indicates where the 5' end of the read alligns to in a given transcript relative to the 5' most nucleotide of a given transcript, and then protein-encoding mRNA reads have a number assigned to them that which indicates where the 5' end of a read alligns to in a given transcript relative to the 'A' of the annotated 'AUG' start codon for a given ORF in the transcript
Now, the issue is that the 'local transcript coordinate' and 'local CDS coordinate' for a given read can have different values depending on which isoform a read maps to. So if a given gene has 2 isoforms that happen to share some exons, a read may be alligning to the 50th position in a given annotated CDS and the 110th position in another annotated CDS.
My workflow currently creates an index of all transcripts and their corresponding exons, then makes a table for each transcript with all the reads that could map to that transcript as well as their 'local transcript' and 'local CDS coordinates'.
This seems vastly inefficient because instead of keeping track of 48 million reads, I have to keep track of ~500 million pseudo-reads - i.e. all of the reads that could align to multiple isoforms. I need advice for how to make some kind of efficient data structure that would allow me to keep track of 'local CDS' and 'local transcript' coordinates for all reads without having to duplicate the read objects.
Something like:
Gene ABC:
Transcript_isoform_1:
Transcript_isoform_2
Read_238: (local_position_112, local_position_213), Read_239: (local_position_45, local_position_105), Read_240: (local_position_431, NA)
Right now, I am thinking about making a custom class for the ribosome profiling reads, where each read contains a unique gene object, a unique genomic index, and then a list of objects corresponding to the transcript/'local CDS coordinate'/'local transcript coordinate' to which a read maps to.
Can you recommend a more memory efficient way of doing the aforementioned task (or potentially some work-around that negates the problem entirely)?
Please let me know if you require any clarification about my problem or my question.
*** We decided against mapping reads to the transcriptome because we were concerned that tophat would arbitrarily assign reads to only one specific isoform, when in reality they could be coming from multiple isoforms. My supervisor advised me to also avoid using CuffLinks for our ribosome profiling analyses.
Thanks in advance for your help!
New to forum here, so please correct me if posted in wrong section.
I am working with ribosome profiling (see http://goo.gl/hvN8PP for a review) libraries generated on an Illumina HiSeq platform. For a lot of the analyses, I need to know precisely where the reads are aligning. Specifically, I need to know if the reads aligning to the coding sequence of a given mRNA are in the 0, +1, or +2 reading frame.
Right now, my workflow is:
1.) Use tophat (with a GFF file as a transcript guide) to map reads to the genome**
2.) Use samtools to generate a sam file of the mapped reads
3.) Read the allignment data into a R table, read the corresponding species GTF file into a table, then create a "local transcript coordinate" and a "CDS coordinate" for each read.
So all 48 million of my reads have a number assigned to them which indicates where the 5' end of the read alligns to in a given transcript relative to the 5' most nucleotide of a given transcript, and then protein-encoding mRNA reads have a number assigned to them that which indicates where the 5' end of a read alligns to in a given transcript relative to the 'A' of the annotated 'AUG' start codon for a given ORF in the transcript
Now, the issue is that the 'local transcript coordinate' and 'local CDS coordinate' for a given read can have different values depending on which isoform a read maps to. So if a given gene has 2 isoforms that happen to share some exons, a read may be alligning to the 50th position in a given annotated CDS and the 110th position in another annotated CDS.
My workflow currently creates an index of all transcripts and their corresponding exons, then makes a table for each transcript with all the reads that could map to that transcript as well as their 'local transcript' and 'local CDS coordinates'.
This seems vastly inefficient because instead of keeping track of 48 million reads, I have to keep track of ~500 million pseudo-reads - i.e. all of the reads that could align to multiple isoforms. I need advice for how to make some kind of efficient data structure that would allow me to keep track of 'local CDS' and 'local transcript' coordinates for all reads without having to duplicate the read objects.
Something like:
Gene ABC:
Transcript_isoform_1:
Transcript_isoform_2
Read_238: (local_position_112, local_position_213), Read_239: (local_position_45, local_position_105), Read_240: (local_position_431, NA)
Right now, I am thinking about making a custom class for the ribosome profiling reads, where each read contains a unique gene object, a unique genomic index, and then a list of objects corresponding to the transcript/'local CDS coordinate'/'local transcript coordinate' to which a read maps to.
Can you recommend a more memory efficient way of doing the aforementioned task (or potentially some work-around that negates the problem entirely)?
Please let me know if you require any clarification about my problem or my question.
*** We decided against mapping reads to the transcriptome because we were concerned that tophat would arbitrarily assign reads to only one specific isoform, when in reality they could be coming from multiple isoforms. My supervisor advised me to also avoid using CuffLinks for our ribosome profiling analyses.
Thanks in advance for your help!
Comment