Hi all - I'm fairly new to RNA-Seq, have gotten lost along the way, and need some help. I have two (human-derived) cell lines that I'm trying to compare in order to identify a gene or set of genes involved in a specific disease.
Both cell lines were sequenced on a HiSeq, paired-end, with reads of length 101. They were sequenced at different times. One cell line was barcoded and sequenced with 1 other sample, and the other cell line was barcoded and sequenced with 4 other samples, so the total per-sample read counts are very different. As I understand it, normalizing the data *should* take care of this.
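To show what I mean by normalization taking care of the depth difference, here is a minimal sketch of my understanding, in Python. It only scales raw counts to counts per million; as far as I know cufflinks reports FPKM, which also accounts for transcript length, so this is not what the pipeline actually computes. The gene names and counts are made up for illustration.

```python
def counts_per_million(counts):
    """Scale raw per-gene read counts by the sample's total read count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


# Hypothetical raw counts: same relative expression, 5x difference in depth.
sample_a = {"GENE1": 500, "GENE2": 1500, "GENE3": 8000}  # deeper sample
sample_b = {"GENE1": 100, "GENE2": 300, "GENE3": 1600}   # shallower sample

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

for gene in sample_a:
    print(gene, cpm_a[gene], cpm_b[gene])
```

If this is right, the two samples should come out with the same normalized values even though one has five times as many reads.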
Following the Nature Protocols paper (Vol 7, No 3, http://www.nature.com/nprot/journal/....2012.016.html), I've used tophat/cufflinks to align the reads and assemble transcripts, and am now using cummeRbund to identify interesting genes that may be involved in this disease. A couple of questions at this point:
The paper picks out a gene (regucalcin) as if the authors knew beforehand that it was a gene of interest. I don't have a gene of interest and want to determine which genes are significant. Looking further in the paper, they do describe the variables sig_gene_data and sig_isoform_data for getting significant DE genes and isoforms. I'm assuming these are the lists I should start with. Everything I've read so far says I should look at isoform differences and not gene differences.
When comparing my two samples, I have:
Sig gene differences: 13479
Sig isoform differences: 417
Sig TSS differences: 525
If I look at the variable sig_isoform_data, the isoform_id is of the form TCONS_XXXXX. What is this TCONS_XXXXX identifier, and how do I get the genomic region it corresponds to?
My next (but not last) question is: am I going about this correctly? Once I have this list, what do I do with it? How do I get a short list of genes that may contribute to the disease state of this cell line? All the papers I've read stop short of this, so I'm not really sure how to proceed, and none of my colleagues are familiar enough with RNA-Seq to help. I'm hoping the community can help. Thanks,
Lost in RNA-Seq