  • alexdobin
    Senior Member
    • Feb 2009
    • 161

    ENCODE long RNA-seq remapped

    Dear All,

    there have been multiple questions about the ENCODE RNA-seq alignments on the UCSC portal. These alignments were generated by a 3-year-old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks).

    To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here:


    This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:


    To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

    The files are directly compatible with Cufflinks.
    CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand
    Caltech and HAIB data are unstranded and can be run with the default --library-type (fr-unstranded).
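
As a rough illustration of the library-type choice described above, here is a minimal Python sketch that picks the Cufflinks `--library-type` per producing lab and assembles the command line. The lab labels used as dictionary keys, the function name and the file paths are placeholders made up for this example; only the `--library-type` values and the `-o` output option come from Cufflinks itself.

```python
# Library types per lab, following the post: CSHL is stranded (dUTP),
# Caltech and HAIB are unstranded (fr-unstranded is the Cufflinks default).
# The dictionary keys are labels chosen for this sketch.
LIBRARY_TYPE_BY_LAB = {
    "CSHL": "fr-firststrand",
    "Caltech": "fr-unstranded",
    "HAIB": "fr-unstranded",
}


def cufflinks_command(bam_path, lab, out_dir):
    """Build the argument list for a Cufflinks run on one BAM file."""
    if lab not in LIBRARY_TYPE_BY_LAB:
        raise KeyError(f"unknown lab: {lab!r}")
    library_type = LIBRARY_TYPE_BY_LAB[lab]
    return ["cufflinks", "--library-type", library_type, "-o", out_dir, bam_path]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(cufflinks_command("sample.bam", "CSHL", "cufflinks_out")))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with unusual file names.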

    STAR version: STAR_2.3.1u (2013/11/24)
    Genome: hg19 + phiX + NIST ERCC spike-ins
    Annotations: Gencode18

    Please let me know if you have any issues or questions
    Cheers
    Alex
    Last edited by alexdobin; 06-04-2014, 06:58 PM. Reason: replaced with up-to-date URL
  • Liy
    Member
    • Feb 2012
    • 19

    #2
    Thanks Alex!

    Are these data normalized in some way?

    L
    Last edited by Liy; 12-06-2013, 02:51 AM.


    • alexdobin
      Senior Member
      • Feb 2009
      • 161

      #3
      Hi Liy,

      at the moment I have posted only the alignments - BAM files, so there is no normalization of any kind. I was contemplating also making the signal (wiggle) tracks - these can be made in many different ways (normalization, unique- vs multi-mappers, etc.).
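
To make the choices mentioned above a bit more concrete, here is a minimal Python sketch of two of them: separating unique from multi-mapping alignments, and a simple reads-per-million scale factor. It assumes the SAM records carry an `NH:i:` tag (number of reported alignments), which STAR normally writes but which has not been checked for these particular files. The function names and the example records are made up for illustration.

```python
def nh_value(sam_line):
    """Return the NH tag value of a SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def count_unique_and_multi(sam_lines):
    """Count mapped alignment records with NH == 1 and with NH > 1.

    Note: this counts alignment records, not reads, so a multi-mapper
    contributes one count per reported alignment.
    """
    unique = multi = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # read unmapped
            continue
        nh = nh_value(line)
        if nh == 1:
            unique += 1
        elif nh is not None and nh > 1:
            multi += 1
    return unique, multi


def rpm_scale_factor(n_alignments):
    """Scale factor that converts raw coverage to 'per million' units."""
    if n_alignments <= 0:
        raise ValueError("n_alignments must be positive")
    return 1_000_000 / n_alignments


if __name__ == "__main__":
    # Made-up records; '*' in the quality column mirrors files without qualities.
    records = [
        "@HD\tVN:1.4",
        "1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\t*\tNH:i:1\tHI:i:1",
        "2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\t*\tNH:i:2\tHI:i:1",
        "2\t256\tchr2\t300\t3\t4M\t*\t0\t0\tACGT\t*\tNH:i:2\tHI:i:2",
    ]
    n_unique, n_multi = count_unique_and_multi(records)
    print(n_unique, n_multi, rpm_scale_factor(n_unique))
```

In practice the lines would come from `samtools view` output rather than a hard-coded list, and whether to scale by unique alignments only or by all alignments is exactly the kind of decision that makes different signal tracks hard to compare.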

      Cheers
      Alex


      • Maike
        Junior Member
        • Dec 2013
        • 3

        #4
        mouse long RNASeq

        Originally posted by alexdobin View Post
        Dear All,

        there have been multiple questions about the ENCODE RNA-seq alignments on the UCSC portal. These alignments were generated by a 3-year-old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks).

        To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here:ftp://ftp2.cshl.edu/gingeraslab/trac...t/ENCODE2/BAM/

        This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:


        To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

        The files are directly compatible with Cufflinks.
        CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand
        Caltech and HAIB data are unstranded and can be run with default --library-type.

        STAR version: STAR_2.3.1u (2013/11/24)
        Genome: hg19 + phiX + NIST ERCC spike-ins
        Annotations: Gencode18

        Please let me know if you have any issues or questions
        Cheers
        Alex

        Alex, does anything like that exist also for the mouse RNASeq dataset?

        I'm trying to get the counts per gene using HTSeq on files I generated from the ENCODE (CSHL long RNA-seq) BAM files (sorted and converted to SAM using samtools). However, only 13% of the reads actually map to features, regardless of the GTF file I use. Could the reason be the same?

        I appreciate any hints!

        Maike


        • alexdobin
          Senior Member
          • Feb 2009
          • 161

          #5
          Hi Maike,

          it's very likely that HTSeq has trouble with the old BAM format. The main problem is that in this old format the mates were assigned the same strand (for better viewability on the UCSC browser); however, this is not the standard convention for Illumina reads.
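
One way to check a file for the same-strand issue described above is to look at the SAM FLAG field of paired records. The sketch below is only an illustration: it assumes the old files express the same-strand assignment through the standard FLAG bits (0x10 for read reverse strand, 0x20 for mate reverse strand), which has not been verified here, and the function name is made up.

```python
# Standard SAM FLAG bits.
READ_PAIRED = 0x1
READ_REVERSE = 0x10
MATE_REVERSE = 0x20


def mates_on_same_strand(flag):
    """Return True if a paired record reports both mates on the same strand.

    For conventional Illumina paired-end alignments the two mates map to
    opposite strands, so exactly one of the two strand bits is set.
    The check ignores whether the mate is actually mapped.
    """
    if not flag & READ_PAIRED:
        raise ValueError("record is not flagged as paired")
    return bool(flag & READ_REVERSE) == bool(flag & MATE_REVERSE)


if __name__ == "__main__":
    # 99 and 147 are the usual FLAG values of a properly paired mate pair;
    # 67 and 115 describe mates that both sit on the same strand.
    for flag in (99, 147, 67, 115):
        print(flag, mates_on_same_strand(flag))
```

Running this over the FLAG column of a few thousand records (for example from `samtools view`) should show quickly whether a given file follows the conventional opposite-strand layout.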

          I am remapping the ENCODE CSHL mouse data to mm10 and Gencode M2 annotations (just released!), and will post the BAMs early next week.

          Cheers
          Alex


          • Maike
            Junior Member
            • Dec 2013
            • 3

            #6
            Thank you Alex, for answering and doing the work!
            Maike


            • alexdobin
              Senior Member
              • Feb 2009
              • 161

              #7
              The re-mapped ENCODE2 mouse CSHL data is posted here:
              ftp://ftp2.cshl.edu/gingeraslab/trac...AM/Mouse_CSHL/


              • Maike
                Junior Member
                • Dec 2013
                • 3

                #8
                mouse encode rnaseq

                This is really helpful, thank you!


                • Auction
                  Member
                  • Jul 2009
                  • 24

                  #9
                  Originally posted by alexdobin View Post
                  The re-mapped ENCODE2 mouse CSHL data is posted here:
                  ftp://ftp2.cshl.edu/gingeraslab/trac...AM/Mouse_CSHL/
                  Dobin, regarding the mouse-remapping, are you using this reference ftp://ftp2.cshl.edu/gingeraslab/trac...GencodeM2.tgz?

                  Thanks.


                  • alexdobin
                    Senior Member
                    • Feb 2009
                    • 161

                    #10
                    Originally posted by Auction View Post
                    Dobin, regarding the mouse-remapping, are you using this reference ftp://ftp2.cshl.edu/gingeraslab/trac...GencodeM2.tgz?

                    Thanks.
                    Yes, this is correct.


                    • Auction
                      Member
                      • Jul 2009
                      • 24

                      #11
                      Dobin

                      The reference in ftp://ftp2.cshl.edu/gingeraslab/trac..._GencodeM2.tgz only provides the GTF file and the STAR-indexed reference. Where can we download the FASTA files for both mm10 and the ERCC spike-ins?

                      Thanks.


                      • rzhang
                        Junior Member
                        • Sep 2011
                        • 6

                        #12
                        Hi Alex,

                        I could not connect to the cshl ftp address and the link is broken. Could you please tell me where I can download the data now?

                        Many thanks,
                        Rui



                        • alexdobin
                          Senior Member
                          • Feb 2009
                          • 161

                          #13
                          Originally posted by rzhang View Post
                          Hi Alex,

                          I could not connect to the cshl ftp address and the link is broken. Could you please tell me where I can download the data now?

                          Many thanks,
                          Rui
                          Hi Rui,

                          this is the new location of the ENCODE2 RNA-seq BAMs:


                          Cheers
                          Alex


                          • apredeus
                            Senior Member
                            • Jul 2012
                            • 151

                            #14
                            Originally posted by alexdobin View Post
                            Hi Rui,

                            this is the new location of the ENCODE2 RNA-seq BAMs:


                            Cheers
                            Alex
                            this is so very useful, thank you very much, for mouse ENCODE CSHL data in particular! I was aligning these data: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE39524, but it's been pretty painful since it's ABI SOLID platform...

                            BTW, Alex: have the mouse ENCODE data (CSHL long RNA-seq, the ones you have shared) been published yet?
                            Last edited by apredeus; 06-11-2014, 10:32 AM.


                            • alexdobin
                              Senior Member
                              • Feb 2009
                              • 161

                              #15
                              Originally posted by apredeus View Post
                              this is so very useful, thank you very much, for mouse ENCODE CSHL data in particular! I was aligning these data: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE39524, but it's been pretty painful since it's ABI SOLID platform...

                              BTW, Alex: have the mouse ENCODE data (CSHL long RNA-seq, the ones you have shared) been published yet?
                              Hi @apredeus,
                              our mouse paper is under review. However, these mouse data were released by ENCODE in 2013 and are now free of any restrictions; you can check this in the last column of this table:
