  • prussiap
    Junior Member
    • May 2012
    • 9

    Public RAW RNA-seq data Now What!!

    Hi guys,
    This forum is great for a beginner, with lots of how-tos for newbies and experts alike. I've seen snippets of what I want to do here and there, but I was hoping to ask questions, engage the community, and then at the end write a how-to/decision-tree post for others to benefit from. If there is no interest, let me know, and sorry if this is in the wrong place; I'm new to this. Help me, and others, understand the process.

    Goal:
    Take raw RNA-seq data, understand quality control, trim if needed, discuss the different methods of alignment (or pipeline software), and discuss how the process differs at each step when looking at splice variants, CNVs, exomes, or expression profiling (maybe miRNA). I'm sure I'm missing things too.

    I'll organize it all into a nice post later, but let's get to it.

    So the idea is to start with raw data (as it comes off the machine) from a public database, and to understand the data and what you can do with it. I decided on K562CellTotalFastqRep2_fastqc from ENCODE, as it was suggested to me as a good GAIIx read. Description and Link.

    To get the latter tier you need to select:
    RNA-extract: Total RNA
    View: FastqRd1,2
    Platform: Illumina HiSeq 2000
    Cell: HAoAF I chose fastqRd1
    Rd1,2 stands for read 1,2 for bio replicates: I chose

    I chose this data set because it's well known, there is some experimental quality control, it's already been analyzed, it's large and unadulterated, and it is in theory total RNA (though it seems there is nothing 200>).

    Steps:
    QC:
    1. QC: I ran it through FastQC and saw the attached file.

    Trimming/Alignment:
    1. TopHat/Bowtie/Cufflinks, or using Python/R?

    Expression Profile: (Looking for Counts)
    What tools and pathway?

    Building Contigs/Exome: (looking for mRNA)
    What tools and pathway?

    Spliceosomes:
    What tools and pathway?

    At the end comparing with other samples


    I've left many answers open (because of time, and because I'm also somewhat of a newbie). This is more of an exercise, so let's start with alignment and understanding what you have. I'll edit this as we go. Help me make this more useful and organized as well.
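As a starting point for "understanding what you have", here is a minimal Python sketch of the kind of summary that a per-base quality plot reports: the mean Phred score at each read position. It is not FastQC's implementation. It assumes plain four-line FASTQ records and Phred+33 encoding, and the two reads in `toy` are made up for illustration.

```python
import io

def per_base_mean_quality(handle, offset=33):
    """Return the mean Phred quality at each read position of a FASTQ stream.

    Assumes strict four-line records (header, sequence, '+', qualities).
    """
    totals = []  # running sum of quality scores per position
    counts = []  # number of reads that cover each position
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        handle.readline()              # sequence line (not needed here)
        handle.readline()              # '+' separator line
        qual = handle.readline().rstrip("\n")
        for i, ch in enumerate(qual):
            if i == len(totals):       # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Toy FASTQ with two made-up reads: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5I5\n")
means = per_base_mean_quality(toy)
print(means)
```

For a real file, the same function can be given an open file handle (or a `gzip.open(path, "rt")` handle for compressed data).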
    Last edited by prussiap; 05-25-2012, 10:06 AM. Reason: Updating Sample info
  • alexdobin
    Senior Member
    • Feb 2009
    • 161

    #2
    ENCODE data

    Hi prussiap,

    you are talking about ENCODE data, not ENSEMBL, right?

    The sample you have chosen is not a good example, it's one of the earliest samples we generated with an unusual library prep and sub-par sequencing quality. I would strongly recommend other samples such as whole cell poly-A+/- for K562 and other cell lines. ENCODE RNA-seq data was mapped with STAR: ftp://ftp2.cshl.edu/gingeraslab/trac...release/2.1.1/

    If you have questions about ENCODE data, please send me a message.

    Alex

    • wupengpro
      Junior Member
      • Jun 2012
      • 5

      #3
      Hi Alex,

      I have downloaded some ENCODE datasets from SRA at NCBI (http://www.ncbi.nlm.nih.gov/sra/SRX135162?&report=full). Are these ENCODE datasets raw data or clean data? Do I need further quality control? Which quality control method do you recommend?

      Thank you!

      • alexdobin
        Senior Member
        • Feb 2009
        • 161

        #4
        Hi @wupengpro,
        The ENCODE data deposited in SRA is raw, filtered only by the standard Illumina chastity filters. All of the data is clean and high quality, judged by high mapping rates (90-95%), high correlation of gene expression between biological replicates (>0.98), and correct clustering of the samples. I think you do not need any additional quality control or filtering of the .fastq files. However, it's always advisable to filter your alignments: for example, remove multi-mappers, non-concordant mates, and non-canonical junctions.
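One way to check replicate agreement of the kind mentioned above is a correlation of per-gene expression values between two replicates. The sketch below is only an illustration: the post does not say which correlation or transform was used, so Pearson correlation on log2(x + 1) values is an assumption, and the expression vectors are made-up toy numbers.

```python
import math

def log_pearson(x, y, pseudocount=1.0):
    """Pearson correlation of log2(value + pseudocount) between two replicates."""
    lx = [math.log2(v + pseudocount) for v in x]
    ly = [math.log2(v + pseudocount) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var_x = sum((a - mean_x) ** 2 for a in lx)
    var_y = sum((b - mean_y) ** 2 for b in ly)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-gene expression values for two replicates (same gene order).
rep_a = [0, 3, 15, 63, 255]
rep_b = [1, 3, 7, 63, 255]
r = log_pearson(rep_a, rep_b)
print(round(r, 3))
```

With real data, `rep_a` and `rep_b` would be the expression columns for the same genes in the same order.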

        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #5
          ... it's always advisable to filter your alignments, for example, remove multi-mappers, non-concordant mates, non-canonical junctions.

          This is interesting information to be throwing away.

          • alexdobin
            Senior Member
            • Feb 2009
            • 161

            #6
            That is indeed interesting information for some applications; however, it also contains a significantly larger percentage of mis-mappings (i.e. false positives). I guess I need to formulate my statement more carefully: if the study does not involve (i) highly similar loci (e.g. paralogs), (ii) fusion/chimeric transcripts, or (iii) non-canonical splicing, it is advisable to remove (i) multi-mappers, (ii) non-concordant mates, and (iii) non-canonical junctions, respectively.
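The first two filters can be sketched on plain SAM text as follows. This is a minimal illustration, not the ENCODE pipeline: it assumes the aligner writes the optional `NH:i:` tag (number of reported hits) and sets the standard "proper pair" flag bit (0x2) for concordant mates. Filtering non-canonical junctions is left out because it depends on aligner-specific output. `make_record` is a hypothetical helper that only builds toy records for the demo.

```python
def keep_alignment(sam_line):
    """Return True for a uniquely mapped, concordantly paired (or single-end) record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:                           # read is unmapped
        return False
    if (flag & 0x1) and not (flag & 0x2):    # paired, but not a proper pair
        return False
    for tag in fields[11:]:                  # optional fields start at column 12
        if tag.startswith("NH:i:") and int(tag[5:]) > 1:
            return False                     # multi-mapper
    return True

def make_record(flag, nh):
    # Hypothetical helper: builds a minimal SAM line with made-up values.
    return "\t".join(["r1", str(flag), "chr1", "100", "255", "4M",
                      "=", "200", "104", "ACGT", "IIII", f"NH:i:{nh}"])

kept = [
    keep_alignment(make_record(99, 1)),  # paired, proper pair, unique
    keep_alignment(make_record(65, 1)),  # paired, not a proper pair
    keep_alignment(make_record(99, 3)),  # proper pair, but three hits
]
print(kept)
```

Whether to apply these filters at all depends on the study, as described above: they discard exactly the reads that matter for paralogs and fusion transcripts.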

            • per_ngs
              Junior Member
              • Apr 2011
              • 8

              #7
              Quality scores in fastqc for ENCODE RNASeq data

              Hello,
              I just downloaded the LHCN RNA-seq data generated at Caltech. I merged the fastq files from the 3 runs into a single file and ran FastQC, and I am a little confused by the output. The per-base quality graph in FastQC shows quality scores going up to 70 (attached), and the per-sequence graph shows peaks at approximately 38 and 68 (also attached). According to the ENCODE documentation, the quality scores are Phred33, so why do the quality score graphs look like this?
              Apologies if my question is silly or if I am misunderstanding the way FastQC works.

              Thanks for help.
              NGSnewbie
              • Sujani
                Junior Member
                • Aug 2012
                • 5

                #8
                hello all,

                When I try to sequence bacterial 16S RNA using an ABI 3130, it gives me heterozygous peaks. Since microbes contain only a haploid set of chromosomes, I am puzzled how two peaks could appear.
                Can someone please explain?

                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  Sujani,

                  You should create a new post for this question rather than this current thread. Perhaps one of the moderators can do it for you.


                  • Sujani
                    Junior Member
                    • Aug 2012
                    • 5

                    #10
                    GenoMax,

                    I'm really sorry for the inconvenience. Unfortunately, I'm finding it hard to post a new thread.

                    • GenoMax
                      Senior Member
                      • Feb 2008
                      • 7142

                      #11
                      Once you log into SeqAnswers, click on the "Forum" link in the top left quadrant under "site navigation".

                      Select the appropriate forum to post in by clicking on the main title of the forum (e.g. core facilities).

                      On the page that opens next there should be a "new thread" button towards top left.

                      • Sujani
                        Junior Member
                        • Aug 2012
                        • 5

                        #12
                        Genomax,

                        Thanks a lot for the help! I was able to post my issue as a new thread.

                        • cwzkevin
                          Member
                          • Mar 2012
                          • 13

                          #13
                          per_ngs, remember that you combined the results of three runs into one file. It is very likely that the three runs do not share the same Phred offset. The mode at 38 comes from the Phred33 data and the mode at 68 from the Phred64 data.
                          You must have combined the three runs in this order:
                          Phred33, then Phred??, then Phred64
                          so that when FastQC tries to guess the offset, all it sees are codes from Phred33, and it concludes that the data is Phred33. FastQC doesn't use all the reads in your data to guess; it uses only 200,000 reads, if I am correct.
                          Having concluded Phred33, FastQC keeps reading your quality data and maps the ASCII codes using the Phred33 offset. That is why the second mode shows up at 68.
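The arithmetic behind the two modes can be checked directly: a quality character decodes to its ASCII code minus the offset, so Phred64 data read with offset 33 appears 31 points too high (64 - 33 = 31). The sketch below shows this, along with a simple offset guess based on the lowest character seen. The guess is a common heuristic and an assumption here, not FastQC's actual detection code.

```python
def decode(qual, offset):
    """Decode a FASTQ quality string into Phred scores for a given offset."""
    return [ord(c) - offset for c in qual]

def guess_offset(qual_strings):
    """Guess the offset from the lowest character seen.

    Heuristic: characters below ';' (ASCII 59) cannot occur in Phred+64 data,
    so their presence implies Phred+33. Otherwise Phred+64 is assumed, which
    can be wrong for uniformly high-quality Phred+33 reads.
    """
    lowest = min(ord(c) for q in qual_strings for c in q)
    return 33 if lowest < 59 else 64

# 'G' is Q38 in Phred+33; 'e' is Q37 in Phred+64.
print(decode("G", 33))   # a Phred+33 read, decoded correctly
print(decode("e", 64))   # a Phred+64 read, decoded correctly
print(decode("e", 33))   # the same Phred+64 read, misread as Phred+33
```

The last line is the situation described above: a true Q37 from a Phred64 run lands at 68 once the whole merged file is treated as Phred33.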


                          • per_ngs
                            Junior Member
                            • Apr 2011
                            • 8

                            #14
                            Hello Kevin,
                            Thanks for the response. I did find out from other sources on SEQanswers that the files I combined used different Phred offsets. I ran FastQC on each of the files individually and noticed this as well. So, for now, I am processing the data separately.
                            Regards,
                            NGSnewbie
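An alternative to processing the runs separately is to re-encode the Phred64 files as Phred33 before merging. The sketch below is one way to do that, under stated assumptions: strict four-line FASTQ records and true Phred+64 input (older Solexa-scaled scores, which can use characters below '@', are out of scope and rejected). The toy record is made up.

```python
import io

def phred64_to_phred33(src, dst):
    """Copy a FASTQ stream, shifting every quality character down by 31."""
    for i, line in enumerate(src):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            qual = line.rstrip("\n")
            if any(ord(c) < 64 for c in qual):
                raise ValueError("quality character below '@'; input is not Phred+64")
            line = "".join(chr(ord(c) - 31) for c in qual) + "\n"
        dst.write(line)

# Toy record: 'h' is Q40 and 'e' is Q37 in Phred+64.
src = io.StringIO("@r1\nACGT\n+\nhhee\n")
dst = io.StringIO()
phred64_to_phred33(src, dst)
converted = dst.getvalue()
print(converted)
```

The explicit check matters because applying the shift to a file that is already Phred33 would silently produce invalid quality characters.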
