
Guide/tutorial for the analysis of RNA-seq data


  • Guide/tutorial for the analysis of RNA-seq data


    This guide is now available in wiki form on the seqanswers wiki. In addition, I would urge everyone to look at the other resources available on the wiki.
    It has been nearly a year since I first wrote this guide and it is already starting to show its age. The only way this will continue to be a useful resource is if we as a community take the time to keep it up to date. Already minor things like syntax changes introduced in software updates are causing some errors to creep in. I have received many emails from people wanting to know how to fix these problems; some I have been able to answer, and some others have worked out for themselves. If you are one of these people, I would strongly urge you to add the correction to the wiki (however minor it may be), so future readers can benefit. I will do my best to change things that people bring to my attention. However, I am no longer working in the field of RNA-seq analysis, so my knowledge of the topic will become less and less useful, and I will have less time to spend on it. I am glad this guide has been useful to so many people and hope that with your help it will continue to be useful in the future.

    Kind Regards,



    I've written a guide to the analysis of RNA-seq data, for the purpose of differential expression analysis. It currently lives on our internal wiki that can't be viewed outside of our division, although printouts have been used at workshops. It is by no means perfect and very much a work in progress, but a number of people have found it helpful, so I thought it would be useful to have it somewhere more publicly accessible.

    I've attached a pdf version of the guide, although really what I was hoping was that someone here could suggest somewhere where it could be publicly hosted as a wiki. This area is so multifaceted and fast-moving that the only way such a guide can remain useful is if it can be constantly extended and updated.

    If anyone has any suggestions about potential hosting, they can contact me at [email protected]



    Update: I've put a few extra things on our local wiki and, seeing as people here seem to be finding this useful, I thought I'd post an updated version. I'm also an author on a review paper on differential expression using RNA-seq, which people who find the guide useful might also find relevant:

    RNA-seq Review
    Last edited by MDY; 08-16-2011, 06:31 AM. Reason: Updated version

  • #2
    I think this is the right place, and the guide is very useful.


    • #3
      The guide is really useful, thanks. In it you use the data from Li et al 2008 as an example dataset. Can you point me to where I could download the fasta files you detail?


      • #4
        Dear Matt,
        It is a very good place for a document like this. Someone asked me to detail how to perform RNA-seq gene diff-ex analyses on short read data; this document is an excellent example. I think it will really help a lot of people and save a lot of time (I would have done a couple of things differently but that is just personal experience/preference).

        Thank you for the contribution.

        Actually, a Next Generation Sequencing wiki, if it does not exist already, is a great idea.



        • #5
          I have spent a lot of time looking for such a tutorial,
          but it seems that very little material is available.
          Thanks for your help.


          • #6
            Hi Matt,
            Thank you very much for sharing your guide. Would you please let me know the link to download the Li prostate cancer dataset you mentioned in the guide, i.e. the 7 fa files? I couldn't find them in the publication's supporting information. Thanks


            • #7
              Very useful document for beginners in deep-sequencing

              Hi Matt,

              I have been searching a long time for this kind of tutorial. The tutorial is very helpful.




              • #8
                Hi everyone,

                Sorry for the slow reply; I somehow managed to miss the replies. For those asking where to get the seven fasta files used in this guide: they come from the data used in the referenced paper, Li et al 2008. As far as I know, the files aren't stored on GEO, but the authors were happy to send the data when contacted by email. The 7 files are 3 treated and 4 untreated lanes of RNA-seq.




                • #9
                  Thanks Matt for this nice guide. I am now trying to analyze some soybean RNA-seq data following this article. However, I am very new to this work. Could anybody give me some suggestions to solve the following problems?

                  1. I tried to use makeTranscriptDbFromBiomart to get the soybean information from the phytozome database, but there seem to be many organisms in that database. How can I select G. max, which is the one I need?

                  2. The bowtie software maps the RNA-seq tags to the reference genes. What is the criterion for deciding whether a tag matches or does not match?

                  Thanks in advance!


                  • #10
                    Matt - Awesome super-polished resource for those with or without experience in NGS or RNA-seq! Please feel free to share any other resources you have created. Thank you.


                    • #11
                      Thanks a lot for the nice guide and sharing it with all of us, Matt!


                      • #12
                        flyyuan - I'm not sure of the answer to your first question about biomart. A detailed description of how bowtie decides on a valid match can be found on the bowtie webpage and in particular the manual. You might want to look at this

                        In brief, in the default mode bowtie will report a read as matching if it has no more than -n mismatches from the reference in the seed and the sum of the quality scores at ALL mismatching bases within the entire read is no more than -e.
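
To make the two conditions concrete, here is a minimal Python sketch of the rule as described above. It is not bowtie's code: the function name and arguments are made up for illustration, the default values (2 seed mismatches, a 28-base seed, a quality sum of 70) are what I recall the bowtie 1 manual giving for -n, -l and -e, and as far as I remember bowtie also rounds quality values before summing them, which this sketch ignores.

```python
def is_valid_alignment(mismatch_positions, qualities, n=2, seed_len=28, e=70):
    """Illustrative check of the -n mode rule for one candidate alignment.

    mismatch_positions: 0-based read offsets that differ from the reference
    qualities: Phred quality score for every base of the read
    n, seed_len, e: stand-ins for bowtie's -n, -l and -e options (assumed defaults)
    """
    # Condition 1: at most n mismatches inside the seed (the first seed_len bases).
    seed_mismatches = sum(1 for pos in mismatch_positions if pos < seed_len)
    if seed_mismatches > n:
        return False
    # Condition 2: the summed quality over ALL mismatched positions,
    # seed or not, must not exceed e.
    mismatch_quality = sum(qualities[pos] for pos in mismatch_positions)
    return mismatch_quality <= e


# A 36-base read where every base has quality 30.
read_qualities = [30] * 36
print(is_valid_alignment([5, 30], read_qualities))       # one seed mismatch, quality sum 60
print(is_valid_alignment([1, 2, 3], read_qualities))     # three seed mismatches
print(is_valid_alignment([30, 31, 32], read_qualities))  # no seed mismatches, quality sum 90
```

Only the first call passes both conditions; the second fails on the seed and the third fails on the quality sum even though all of its mismatches are outside the seed.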


                        • #13
                          Excellent. This should be a sticky!


                          • #14
                            Excellent introductory guide, thank you!


                            • #15
                              Hi Matt,

                              thanks for putting up this excellent tutorial.

                              I have one constructive criticism or discussion point though: as I understand it, when checking for differential expression (DE) you only consider reads "overlapping some annotation object, which is usually something like a collection of genes downloaded from the UCSC."

                              So you suggest checking DE only for something like RefSeq, and taking the number of reads within each RefSeq (or other object) as the expression level.
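
As a rough illustration of what counting reads within each annotation object means, here is a small Python sketch. It is not the guide's code: the function name, the tuple layout and the toy coordinates are all made up, coordinates are assumed to be half-open, and a read that overlaps two genes is simply counted for both.

```python
from collections import Counter


def count_reads_per_gene(reads, genes):
    """Count reads overlapping each annotated gene (illustrative only).

    reads: list of (chrom, start, end) tuples, half-open coordinates
    genes: dict mapping gene name -> (chrom, start, end)
    Returns (counts per gene, number of reads overlapping no gene).
    """
    counts = Counter({name: 0 for name in genes})
    unassigned = 0
    for chrom, start, end in reads:
        hit = False
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if chrom == g_chrom and start < g_end and g_start < end:
                counts[name] += 1
                hit = True
        if not hit:
            unassigned += 1
    return counts, unassigned


# Hypothetical annotation and reads, invented for this example.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 600)}
reads = [
    ("chr1", 90, 126),
    ("chr1", 150, 186),
    ("chr1", 300, 336),   # falls between the two genes
    ("chr2", 100, 136),   # chromosome with no annotation
    ("chr1", 590, 626),
]
counts, unassigned = count_reads_per_gene(reads, genes)
print(dict(counts), unassigned)
```

The reads counted as unassigned are the ones that never enter the expression table at all.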

                              I think this discards not only much of the information gained by RNA-seq, but also some of the most important information: the most interesting genes are often among the non-annotated genes. Consider for example two cellular states. A very interesting gene might be expressed, and very highly expressed, only in the very unusual state B, while in state A it is either not expressed or expressed at such a low level that it never made it into the annotation. With this approach a researcher would miss this gene and others like it entirely because they are not in the annotation, although these might be the very genes that explain the biological question at hand.

                              If I used RNA-seq just to identify DE genes which are already annotated in UCSC, I might almost as well have used a tiling array spanning the annotated genes only. (Sure, RNA-seq is "digital", but the point I'm trying to make is that with UCSC or similar annotation one would ignore 90%+ of the RNA-seq data elsewhere in the genome!)

                              So I think a better approach would be first to use the RNA-seq data to produce an ad hoc annotation, including information from all sequenced conditions, then check DE against this annotation.
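
Just to make the idea concrete, the crudest possible version of an ad hoc annotation would be to pool the mapped reads from all conditions and merge overlapping ones into candidate regions. The Python sketch below does only that. The function name, the `max_gap` parameter and the toy reads are invented for illustration, coordinates are assumed to be half-open, and it ignores splicing, strand and background noise entirely.

```python
def merge_into_regions(reads, max_gap=0):
    """Merge pooled read intervals into candidate expressed regions.

    reads: iterable of (chrom, start, end) tuples, half-open coordinates
    max_gap: reads separated by at most this many bases are joined (hypothetical knob)
    Returns a sorted list of merged (chrom, start, end) regions.
    """
    regions = []
    for chrom, start, end in sorted(reads):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and start <= last[2] + max_gap:
            # Read touches or overlaps the current region: extend it.
            regions[-1] = (chrom, last[1], max(last[2], end))
        else:
            # Otherwise start a new candidate region.
            regions.append((chrom, start, end))
    return regions


# Toy reads from two conditions, pooled before merging.
state_a = [("chr1", 100, 136), ("chr1", 120, 156)]
state_b = [("chr1", 150, 186), ("chr1", 400, 436), ("chr2", 10, 46)]
print(merge_into_regions(state_a + state_b))
```

The resulting regions could then be used in place of the downloaded annotation when counting reads per condition, so that a region seen only in state B still gets tested.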

                              Now the question is of course, what is a very good way to create an annotation, i.e. how to identify the regions spanned by genes, from RNA-seq?
                              Last edited by Azazel; 01-27-2011, 04:59 AM. Reason: typo