Hello,
I wanted to ask your advice about a pipeline to detect differences in allelic expression between RNAseq libraries.
I have two RNAseq libraries, say Control and Treated. The questions I would like to ask are:
1) For a given transcripts with heterozygous SNPs, are some alleles more expressed than others?
2) Are some alleles induced/repressed in the Treated library?
The pipeline I plan to use to go about this is:
- For each library, align reads against the reference genome using bowtie/tophat (output in SAM format)
- Use samtools to produce a pileup file including the consensus calling. E.g:
- Extract from the pileup files the heterozygous SNPs. I.e. those where the consensus column is one of R (A or G), Y (C or T), S (G or C), W (A or T), K (G or T), M (A or C) and SNP phred score is >=30 (or >=20?).
- For each selected SNP in each library: Do a binomial test to see if allele1:allele2 != 50:50. Correct the p-values to have FDR<0.1
- For each selected SNP: Compared the two libraries with Fisher's exact test to see if one allele is more expressed in one of the two library (i.e. if there is inducible allele specific expression). Correct the p-values to have FDR<0.1.
Some pitfalls I can think of:
- Reads with SNP alleles not in the reference have less chance to be aligned thus introducing a bias in favour of the reference alleles. How dramatic can this bias be?
- Is the SNP filtering above correct? (Stringent enough?)
- Some transcripts will have more than one SNP locus. However the pipeline above tests each SNP independently. Is there any (...manageable) way to account for this? (I guess a better way would be to test the allelic expression based on haplotypes rather than individual SNPs).
Any and all comments welcome. Hopefully these will be useful to others.
Dario
I wanted to ask your advice about a pipeline to detect differences in allelic expression between RNAseq libraries.
I have two RNAseq libraries, say Control and Treated. The questions I would like to ask are:
1) For a given transcripts with heterozygous SNPs, are some alleles more expressed than others?
2) Are some alleles induced/repressed in the Treated library?
The pipeline I plan to use to go about this is:
- For each library, align reads against the reference genome using bowtie/tophat (output in SAM format)
- Use samtools to produce a pileup file including the consensus calling. E.g:
Code:
samtools pileup -cvf reference.fa alignment.sorted.bam > ouput.pileup
- For each selected SNP in each library: Do a binomial test to see if allele1:allele2 != 50:50. Correct the p-values to have FDR<0.1
- For each selected SNP: Compared the two libraries with Fisher's exact test to see if one allele is more expressed in one of the two library (i.e. if there is inducible allele specific expression). Correct the p-values to have FDR<0.1.
Some pitfalls I can think of:
- Reads with SNP alleles not in the reference have less chance to be aligned thus introducing a bias in favour of the reference alleles. How dramatic can this bias be?
- Is the SNP filtering above correct? (Stringent enough?)
- Some transcripts will have more than one SNP locus. However the pipeline above tests each SNP independently. Is there any (...manageable) way to account for this? (I guess a better way would be to test the allelic expression based on haplotypes rather than individual SNPs).
Any and all comments welcome. Hopefully these will be useful to others.
Dario
Comment