I must say I'm new to bioinformatics...
- Since I'm using paired-end (Illumina 36bp) data, I need to provide the expected (mean) inner distance between mate pairs when running TopHat. For example, the median insert size of one lane in my data is 170 (low threshold: 149, high threshold: 3629), so I passed 170 as the -r value (I'm aware that the mean is larger than 170, as the data are positively skewed). However, filesl8B1D.log and fileCVzqGw.log (two TopHat log files) show that only about 11% of the reads have at least one reported alignment, which seems really low to me (although the FPKM values that Cufflinks reports from the generated SAM file look OK). Is there anything I can do to improve this? Also, is there a simple way to tell how well TopHat did the mapping, or can I only find out by examining accepted_hits.sam (output of TopHat)? And is there any way to tell whether a read is uniquely mapped to a gene?
- I'm aware that transcripts.gtf (output of Cufflinks) gives the estimated depth of read coverage across a transcript, but what I actually want is the depth of read coverage across a gene. Since Cufflinks only reports abundance in FPKM, I built a GFF file (containing only Ensembl genes) and used it together with accepted_hits.sam to get raw read counts with coverageBed. However, the number of genes with a raw count > 0 is larger than the number of genes with FPKM > 0 in genes.expr (output of Cufflinks). I've tried counting all the reads in accepted_hits.sam, and also only the reads mapped in a proper pair (i.e. with the 0x2 flag set), but the discrepancy remains, so Cufflinks seems to filter reads in a different way.
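To make both questions concrete, here is a minimal sketch of the kind of summary I have in mind for accepted_hits.sam. It tallies records by the standard SAM flag bits (0x4 for unmapped, 0x2 for proper pair) and counts records whose optional NH tag equals 1. The function name `summarize_sam` and the example records are made up for illustration, and I'm assuming the aligner writes an NH tag at all. Note that NH counts reported alignments on the genome, so at best it approximates "uniquely mapped", not "unique to a gene".

```python
from collections import Counter

def summarize_sam(lines):
    """Tally mapped, properly paired, and single-hit records from SAM text lines."""
    stats = Counter()
    for line in lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        stats["records"] += 1
        if flag & 0x4:                    # 0x4: segment unmapped
            stats["unmapped"] += 1
            continue
        stats["mapped"] += 1
        if flag & 0x2:                    # 0x2: mapped in a proper pair
            stats["proper_pair"] += 1
        # Assumption: an NH:i:<n> tag (number of reported alignments) is present.
        nh = next((int(tag[5:]) for tag in fields[11:] if tag.startswith("NH:i:")), None)
        if nh == 1:
            stats["single_hit"] += 1
    return stats

# Hypothetical records, only to show the expected input shape.
EXAMPLE = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t255\t36M\t=\t250\t186\t" + "A" * 36 + "\t" + "I" * 36 + "\tNH:i:1",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t" + "C" * 36 + "\t" + "I" * 36,
    "r3\t65\tchr2\t500\t1\t36M\tchr3\t900\t0\t" + "G" * 36 + "\t" + "I" * 36 + "\tNH:i:3",
]

if __name__ == "__main__":
    for key, value in sorted(summarize_sam(EXAMPLE).items()):
        print(key, value)
```

On a real file I would pass an open file handle instead of the list, e.g. `summarize_sam(open("accepted_hits.sam"))`. Each alignment is one record, so a read with several reported alignments is counted several times.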
Thanks very much for your time and help!