Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNAseq: Pipeline to detect allele specific expression


    I wanted to ask your advice about a pipeline to detect differences in allelic expression between RNAseq libraries.

    I have two RNAseq libraries, say Control and Treated. The questions I would like to ask are:
    1) For a given transcripts with heterozygous SNPs, are some alleles more expressed than others?
    2) Are some alleles induced/repressed in the Treated library?

    The pipeline I plan to use to go about this is:

    - For each library, align reads against the reference genome using bowtie/tophat (output in SAM format)
    - Use samtools to produce a pileup file including the consensus calling. E.g:
    samtools pileup -cvf reference.fa alignment.sorted.bam > ouput.pileup
    - Extract from the pileup files the heterozygous SNPs. I.e. those where the consensus column is one of R (A or G), Y (C or T), S (G or C), W (A or T), K (G or T), M (A or C) and SNP phred score is >=30 (or >=20?).
    - For each selected SNP in each library: Do a binomial test to see if allele1:allele2 != 50:50. Correct the p-values to have FDR<0.1
    - For each selected SNP: Compared the two libraries with Fisher's exact test to see if one allele is more expressed in one of the two library (i.e. if there is inducible allele specific expression). Correct the p-values to have FDR<0.1.

    Some pitfalls I can think of:
    - Reads with SNP alleles not in the reference have less chance to be aligned thus introducing a bias in favour of the reference alleles. How dramatic can this bias be?
    - Is the SNP filtering above correct? (Stringent enough?)
    - Some transcripts will have more than one SNP locus. However the pipeline above tests each SNP independently. Is there any (...manageable) way to account for this? (I guess a better way would be to test the allelic expression based on haplotypes rather than individual SNPs).

    Any and all comments welcome. Hopefully these will be useful to others.


  • #2
    Your pipeline is almost the same as mine. I used different aligners (SOLiD data) and required Phred score >=20 and additionally 20<= number of reads<= 100 for the SNP position. Also excluded positions with indels and >1/4 at the end of a read.

    I read some papers that mentioned the bias for the reference sequence but ended up ignoring it because it seemed there is no easy and reliable way to get rid of it. How bad the bias is was left unclear to me.
    The papers are:
    Heap et al. 2010, Hum Mol Genet 19:122
    Tuch et al. 2010, PLoSOne 0009317
    Degner et al. 2009, Bioinformatics 25:3207
    Chepelev et al. 2009, Nucl Acids Res 37:e105

    Since was interested in alleles that are ~50:50 in the treated sample but biased for one allele in the control, I had rather few candidate SNPs and can't give advice on multiple SNPs per transcript. But I'd be very interested in learning more!



    • #3
      If I understand your pipeline correctly, you will not get the SNPs which are the most differentiated between the two samples.
      If you have only allele eg: 'T' in one sample and only allele 'A' in the other?


      • #4
        As I mentioned, we wanted heterozygous SNPs in the treated sample (e.g. A and T allele present at ~50:50) that were homozygous in the control sample (only A or T allele present). So a total switch from T to A was not interesting for me.


        • #5
          Hi, Novoalign can align reads against a reference genome with IUB ambiguous NA codes. This should remove any allelic bias, you just need to code SNPs into the genome using the IUB codes.


          • #6

            I hope it's not too late to comment on this post.

            Two things: one, I think there needs to be a threshold or number of reads that align to a particular site for it to be called 'potential ASE'. There is a paper by Fontanillas et al. (2010) in which they show that the number of reads affects the power to detect ASE.

            Secondly, I propose that, where possible, there should be reference sequences biased to the population such that there would not be a bias in the Degner- sense (see Degner et a. 2009 paper).

            Otherwise, I agree with the pipeline.

            P. Korir


            • #7
              Originally posted by polarise View Post
              I hope it's not too late to comment on this post.
              No, it's not! Many thanks to everyone for sharing their comments and expertise!



              • #8
                Since this thread has been revived not too long ago I thought I might ask a question about allele specific alignments as well.

                We were thinking of taking a slightly different approach to analyse sequencing data in an allele-specific manner and I would like to know if someone can spot any obvious problems with it.

                We are mainly working with mouse models from back-crossed strains for which we have a known set of genome-wide SNPs available, so the approach I'll describe here is probably only useful if detailed genome information is available. To avoid some of the pitfalls Dario mentioned above (especially biasing alignments towards the reference alleles without SNPs) we wrote a program which takes in sequencing file(s) and aligns the sequence (or sequence pair) to two reference genomes in parallel (using Bowtie). In order for this to work we made up individual versions of mouse (or yeast) genomes and indexed them separately (this should also work with e.g. the normal mouse genome and one chromosome of a different strain).

                We then analyse the alignment outputs on the fly and decide whether an alignment is specific for one or the other genome (i.e. has a uniquely best alignment for one of them) or if it aligns to both genomes equally well (e.g. if it doesn't contain a SNP which allowing us to say something about the origin). It will then print the alignments to different files (unique 1, unique 2, aligns to both) with a number of useful information (mapping positions, genomic sequences and mismatch information). All downstream analysis would then start from these files (such as percent sequence from which allele, basecall/alignment qualities at positions of SNPs and so on).

                In its initial version ASAP (for allele specific alignment pipeline) is intended to be used for ChIP-Seq data, but we were thinking of extending this to RNA-Seq data as well. This of course would have the disadvantage that - at least in its current form - it doesn't work for spliced mapping, however it should still work for the majority of reads which do not span splice junctions anyway.

                I would be happy to know if people think this is a useful tool or if you can see major flaws which did not occur to us so far. I should mention that ASAP is not fully ready yet, but single-end alignments are nearly working and the paired-end option is yet to be implemented (overall progress is I'd say around 87%. Roughly).



                • #9
                  candidates not allele-specific

                  Finally we got the results from the validation of our candidates - and a big disappointment: almost all turned out to be expressed perfectly 50:50 instead of allele-specific!
                  Since in some papers they only consider reads with different start sites, I next tried duplicate removal with Picard, which left me with less than half of the reads (~20 Mio remaining per sample), and used the new samtools pileup version to call SNPs. This resulted in only a handful of new candidates, which are currently being checked in the lab.
                  I guess we'll have to sequence more because with 20 Mio reads left, we can only detect highly expressed genes. The problem I see in duplicate removal is that true PCR duplicates can't be distinguished from reads that align to the same position by chance just because the gene/allele is so highly expressed. And duplicate removal will of course keep the read with the highest mapping score, which will be the one with the reference SNP.

                  Has any of you worked out a successful approach?

                  To me ASAP sounds like a clever idea. Unfortunately it's not applicable for humans (my data). Even if you sequence the individual's DNA first to find the SNPs, you never know which SNP belongs to which chromosome so you can't figure out the correct combinations ...


                  • #10
                    I'm also trying to measure allele-specific-expression by exome-seq and RNA-seq, with the former to find heteroSNPs and alleles, and the latter to calculate ASE on the identified positions. I found this post to be quite helpful and it's just what I was proposing to do.

                    I also did some method search and found some programs specifically designed for ASE (MMSeq, eXpress, AlleleSeq, etc), but they all require construction of haplotypes (two versions of reference genome/transcriptome), which makes a lot of sense and should be the most accurate way to avoid mapping bias and help integrating SNP-level ASE into gene-level ASE. However in most of the cases, people don't have sequencing data from trio. There are several tools for haplotyping from exome-seq data, like GATK's ReadBackedPhasing and HapCompass, but they can only phase some blocks of the variants which are close to each other and thus were covered by continuous reads.

                    I got to know an aligner called GSNAP that can align RNA-seq reads with tolerance of SNPs. I'm gonna try this to see how the output is different from Tophat alignment.


                    Latest Articles


                    • seqadmin
                      Recent Advances in Sequencing Analysis Tools
                      by seqadmin

                      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                      05-06-2024, 07:48 AM
                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin

                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM





                    Topics Statistics Last Post
                    Started by seqadmin, 05-14-2024, 07:03 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 05-10-2024, 06:35 AM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 05-09-2024, 02:46 PM
                    0 responses
                    Last Post seqadmin  
                    Started by seqadmin, 05-07-2024, 06:57 AM
                    0 responses
                    Last Post seqadmin