  • alexdobin
    Senior Member
    • Feb 2009
    • 161

    ENCODE long RNA-seq remapped

    Dear All,

    there have been multiple questions about the ENCODE RNA-seq alignments on the UCSC portal. These alignments were generated by a 3-year-old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks).

    To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here:


    This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:


    To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

    The files are directly compatible with Cufflinks.
    CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand
    Caltech and HAIB data are unstranded and can be run with default --library-type.
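
As a rough illustration of the library-type rule above, here is a minimal sketch that assembles a Cufflinks command line per producing lab. The function name, the output directory and the BAM file name are hypothetical; only `--library-type`, `-o` and `-G` are actual Cufflinks options, and the lab-to-protocol mapping is taken from this post.

```python
# Library type per producing lab, as described in the post.
# None means "do not pass --library-type" (keep the Cufflinks default).
LIBRARY_TYPE_BY_LAB = {
    "CSHL": "fr-firststrand",  # stranded, dUTP protocol
    "Caltech": None,           # unstranded
    "HAIB": None,              # unstranded
}


def cufflinks_command(bam_path, lab, gtf_path=None, out_dir="cufflinks_out"):
    """Return the argv list for a Cufflinks run on one remapped BAM.

    bam_path, gtf_path and out_dir are placeholders chosen by the caller.
    """
    if lab not in LIBRARY_TYPE_BY_LAB:
        raise ValueError(f"unknown lab: {lab!r}")

    cmd = ["cufflinks", "-o", out_dir]
    library_type = LIBRARY_TYPE_BY_LAB[lab]
    if library_type is not None:
        cmd += ["--library-type", library_type]
    if gtf_path is not None:
        # -G restricts quantification to the supplied annotation.
        cmd += ["-G", gtf_path]
    cmd.append(bam_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(" ".join(cufflinks_command("example_CSHL.bam", "CSHL")))
```

The list can be passed to `subprocess.run` as is; keeping it as a list avoids shell quoting problems with unusual file names.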

    STAR version: STAR_2.3.1u (2013/11/24)
    Genome: hg19 + phiX + NIST ERCC spike-ins
    Annotations: Gencode18

    Please let me know if you have any issues or questions
    Cheers
    Alex
    Last edited by alexdobin; 06-04-2014, 06:58 PM. Reason: replaced with up-to-date URL
  • Liy
    Member
    • Feb 2012
    • 19

    #2
    Thanks Alex!

    Are these data normalized in some way?

    L
    Last edited by Liy; 12-06-2013, 02:51 AM.


    • alexdobin
      Senior Member
      • Feb 2009
      • 161

      #3
      Hi Liy,

      at the moment I have posted only the alignments - BAM files, so there is no normalization of any kind. I was contemplating also making the signal (wiggle) tracks - these can be made in many different ways (normalization, unique- vs multi-mappers, etc.).
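
To make the "many different ways" concrete, here is a small sketch of two of the choices mentioned: keeping only unique mappers, and scaling to reads per million. It assumes the alignments carry the standard SAM `NH:i:` tag (number of reported alignments for the read), and the helper names are hypothetical. It is one possible scheme, not the normalization used for any ENCODE track.

```python
def nh_value(sam_line):
    """Return the NH tag value of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def is_unique_mapper(sam_line):
    """True when the read has exactly one reported alignment (NH == 1)."""
    return nh_value(sam_line) == 1


def rpm_scale(n_reads):
    """Scale factor that converts raw coverage to reads per million."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 1_000_000 / n_reads


def unique_rpm_scale(sam_lines):
    """Count unique mappers in an iterable of SAM lines and return the scale."""
    n_unique = sum(
        1 for line in sam_lines
        if not line.startswith("@") and is_unique_mapper(line)
    )
    return rpm_scale(n_unique)
```

Whether the denominator counts unique mappers only, or all mapped reads, is exactly the kind of choice that makes two signal tracks from the same BAM differ.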

      Cheers
      Alex


      • Maike
        Junior Member
        • Dec 2013
        • 3

        #4
        mouse long RNASeq

        Originally posted by alexdobin:
        Dear All,

        there have been multiple questions about the ENCODE RNA-seq alignments on the UCSC portal. These alignments had been generated by a 3-year old version of STAR and use some non-conventional formatting (e.g. they are not compatible with Cufflinks).

        To bring this data up-to-date, I have remapped it using the latest version of STAR. The new alignments use conventional formatting and should be compatible with most downstream software. Importantly, annotations are used to improve the mapping accuracy. The BAMs for all of the ENCODE phase 2 (2008-2012) long RNA-seq data can be downloaded here: ftp://ftp2.cshl.edu/gingeraslab/trac...t/ENCODE2/BAM/

        This is NOT an official ENCODE release. For all the metadata, please refer to UCSC ENCODE portal:


        To reduce file sizes, the quality scores were not recorded, and the read names were replaced with numbers.

        The files are directly compatible with Cufflinks.
        CSHL data is stranded (dUTP protocol) and Cufflinks has to be run with --library-type fr-firststrand
        Caltech and HAIB data are unstranded and can be run with default --library-type.

        STAR version: STAR_2.3.1u (2013/11/24)
        Genome: hg19 + phiX + NIST ERCC spike-ins
        Annotations: Gencode18

        Please let me know if you have any issues or questions
        Cheers
        Alex

        Alex, does anything like that exist also for the mouse RNASeq dataset?

        I'm trying to get the counts per gene using HTSeq on files I generated from the ENCODE (CSHL long RNA-seq) BAM files (sorted and converted to SAM using samtools). However, only 13% of the reads actually map to features, regardless of the GTF file I use. Could the reason be the same?

        I appreciate any hints!

        Maike


        • alexdobin
          Senior Member
          • Feb 2009
          • 161

          #5
          Hi Maike,

          it's very likely that HTSeq has trouble with the old BAM format. The main problem is that in this old format the mates were assigned the same strand (for better viewability on the UCSC browser); however, this is not a standard convention for Illumina reads.
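
One way to check whether a file uses the old convention is to look at the SAM FLAG bits of paired reads. The sketch below assumes the same-strand assignment shows up in the FLAG field (bit 0x10 for the read, bit 0x20 for its mate); the function names are hypothetical. It takes SAM text lines, for example the output of `samtools view`.

```python
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped
READ_REVERSE = 0x10   # this read aligns to the reverse strand
MATE_REVERSE = 0x20   # the mate aligns to the reverse strand


def mates_on_same_strand(flag):
    """True when a paired read and its mate are flagged with the same strand."""
    if not flag & PAIRED:
        return False
    return bool(flag & READ_REVERSE) == bool(flag & MATE_REVERSE)


def fraction_same_strand(sam_lines):
    """Fraction of fully mapped paired records whose mates share a strand."""
    paired = 0
    same = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        if not flag & PAIRED or flag & (READ_UNMAPPED | MATE_UNMAPPED):
            continue
        paired += 1
        if mates_on_same_strand(flag):
            same += 1
    return same / paired if paired else 0.0
```

For conventional Illumina paired-end alignments the fraction should be close to zero, since mates normally map to opposite strands; a value near one would suggest the old same-strand formatting.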

          I am remapping the ENCODE CSHL mouse data to mm10 and Gencode M2 annotations (just released!), and will post the BAMs early next week.

          Cheers
          Alex


          • Maike
            Junior Member
            • Dec 2013
            • 3

            #6
            Thank you Alex, for answering and doing the work!
            Maike


            • alexdobin
              Senior Member
              • Feb 2009
              • 161

              #7
              The re-mapped ENCODE2 mouse CSHL data is posted here:
              ftp://ftp2.cshl.edu/gingeraslab/trac...AM/Mouse_CSHL/


              • Maike
                Junior Member
                • Dec 2013
                • 3

                #8
                mouse encode rnaseq

                This is really helpful, thank you!


                • Auction
                  Member
                  • Jul 2009
                  • 24

                  #9
                  Originally posted by alexdobin:
                  The re-mapped ENCODE2 mouse CSHL data is posted here:
                  ftp://ftp2.cshl.edu/gingeraslab/trac...AM/Mouse_CSHL/
                  Dobin, regarding the mouse-remapping, are you using this reference ftp://ftp2.cshl.edu/gingeraslab/trac...GencodeM2.tgz?

                  Thanks.


                  • alexdobin
                    Senior Member
                    • Feb 2009
                    • 161

                    #10
                    Originally posted by Auction:
                    Dobin, regarding the mouse-remapping, are you using this reference ftp://ftp2.cshl.edu/gingeraslab/trac...GencodeM2.tgz?

                    Thanks.
                    Yes, this is correct.


                    • Auction
                      Member
                      • Jul 2009
                      • 24

                      #11
                      Dobin

                      The reference in ftp://ftp2.cshl.edu/gingeraslab/trac..._GencodeM2.tgz only provides the GTF file and STAR indexed reference. Where can we download the fasta files for both mm10 and ERCC markers?

                      Thanks.


                      • rzhang
                        Junior Member
                        • Sep 2011
                        • 6

                        #12
                        Hi Alex,

                        I could not connect to the cshl ftp address and the link is broken. Could you please tell me where I can download the data now?

                        Many thanks,
                        Rui



                        • alexdobin
                          Senior Member
                          • Feb 2009
                          • 161

                          #13
                          Originally posted by rzhang:
                          Hi Alex,

                          I could not connect to the cshl ftp address and the link is broken. Could you please tell me where I can download the data now?

                          Many thanks,
                          Rui
                          Hi Rui,

                          this is the new location of the ENCODE2 RNA-seq BAMs:


                          Cheers
                          Alex


                          • apredeus
                            Senior Member
                            • Jul 2012
                            • 151

                            #14
                            Originally posted by alexdobin:
                            Hi Rui,

                            this is the new location of the ENCODE2 RNA-seq BAMs:


                            Cheers
                            Alex
                            this is so very useful, thank you very much, for the mouse ENCODE CSHL data in particular! I was aligning these data: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE39524, but it's been pretty painful since they come from the ABI SOLiD platform...

                            BTW, Alex: have the mouse ENCODE data (CSHL long RNA-seq, the ones you have shared) been published yet?
                            Last edited by apredeus; 06-11-2014, 10:32 AM.


                            • alexdobin
                              Senior Member
                              • Feb 2009
                              • 161

                              #15
                              Originally posted by apredeus:
                              this is so very useful, thank you very much, for mouse ENCODE CSHL data in particular! I was aligning these data: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE39524, but it's been pretty painful since it's ABI SOLID platform...

                              BTW, Alex: have the mouse ENCODE data (CSHL long RNA-seq, the ones you have shared) been published yet?
                              Hi @apredeus,
                              our mouse paper is under review. However, these mouse data were released by ENCODE in 2013 and are now free of any restrictions; you can check this in the last column of this table:
