  • iloveneworleans
    Member
    • Jun 2009
    • 12

    Where can I find the complete FASTA-format sequence (human and mouse)?

    On the EBI website (http://www.ebi.ac.uk/astd/download.html), only the FASTA sequences of all exons or transcripts are offered for download.
    Does anybody know where I can find the complete FASTA-format genome sequence (human and mouse) that matches "Feb 2008 Release 1.1"? I want to use the complete sequence as the reference genome for aligning RNA-seq data.
    Thanks in advance!
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    You can get all complete assemblies from Ensembl:



    ..or NCBI

    ftp://ftp.ncbi.nih.gov/genomes/

    You'll need to check exactly which assembly to use, though. The description you included doesn't obviously match any human or mouse assembly. Maybe you're looking at a description of an annotation set rather than an underlying assembly? Both of those sites will give you the latest assembly for each species by default.


    • Simon Anders
      Senior Member
      • Feb 2010
      • 995

      #3
      Simon Andrews pointed out the right places to look.

      Three remarks on Ensembl's human FASTA files, to save you the time of falling into these traps:

      - Do not use the repeat-masked sequences ("_rm" in the filenames). Judging which repeats are detrimental is better left to the aligner.

      - It seems convenient to download the file denoted "toplevel", as it contains all the other FASTA sequences in one big file. However, this means that all the MHC variants are included. If you feed this to the aligner, it will not realize that all these MHC sequences are variants of the _same_ region and will treat the region as repetitive. Better to kick out the variant sequences before using the toplevel file, or to download all the chromosome files individually and feed them all together to the aligner.

      - If you later use annotation, be sure to use the corresponding data, e.g., the GTF file from Ensembl. If you mix different assemblies, or maybe even NCBI's and Ensembl's representation of the same assembly build, the coordinates might not fit.
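The second remark above (kicking the variant sequences out of the toplevel file) could be sketched roughly as follows. This is only a minimal sketch: the marker strings in `looks_like_variant` are an assumption about how the variant records are named, and the record names in the example are illustrative, so check the actual headers in your file before relying on it.

```python
from typing import Callable, Iterable, Iterator


def drop_fasta_records(
    lines: Iterable[str], is_unwanted: Callable[[str], bool]
) -> Iterator[str]:
    """Yield FASTA lines, skipping every record whose name is unwanted."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after ">".
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = not is_unwanted(name)
        if keep:
            yield line


def looks_like_variant(name: str) -> bool:
    # ASSUMPTION: variant/haplotype records carry one of these markers in
    # their name. Inspect the headers of your own file and adjust.
    upper = name.upper()
    return "MHC" in upper or "HAP" in upper


# Tiny in-memory example; the record names are illustrative only.
example = [
    ">6 dna:chromosome\n",
    "ACGT\n",
    ">HSCHR6_MHC_COX dna:chromosome\n",
    "TTTT\n",
    ">7 dna:chromosome\n",
    "GGCC\n",
]
kept = list(drop_fasta_records(example, looks_like_variant))
```

For a real file, the same generator can be fed an open file handle and its output written line by line to a new file, so the whole genome never has to sit in memory.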

      Simon


      • iloveneworleans
        Member
        • Jun 2009
        • 12

        #4
        Thanks, Simon Andrews and Simon Anders!

        The complete assemblies are available from the Ensembl and NCBI FTP servers. But I think EBI might have its own complete assemblies to download. As Simon Anders said, if I align the RNA-seq data against the complete assemblies downloaded from Ensembl or NCBI and use the annotation file (GTF file) from EBI, then the coordinates might not fit.

        Although EBI provides FASTA sequence files and an annotation file (GTF file) for download, the FASTA files cover only exons or transcripts rather than the complete sequence. I assume these exon and transcript sequences were extracted from a complete sequence file, so why doesn't EBI provide it for download? Or does EBI also use the same complete assemblies as Ensembl or NCBI?


        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          First of all: I am quite confused about what you mean by EBI. Note that the European Bioinformatics Institute (EBI, in Hinxton, Cambs., England) hosts a lot of data, among them the whole EnsEMBL project (which they administer jointly with the Sanger Institute, also in Hinxton) and the ASTD project that you mentioned in the first post.

          That confusion aside, two points:

          - How deeply do you want to go into alternative splicing? Note that the GTF file from Ensembl also contains information about all well-documented transcripts, i.e., it is usually all you need. Making use of this information is actually not that easy, but the new 'cufflinks' tool might help a lot.

          - I'd suppose that you have very good chances that the GTF files from the ASTD project are compatible with the coordinates from the Ensembl FASTA files, as both come from Hinxton.

          I just had a look at one of the GTF files from ASTD. The features are annotated with Ensembl Gene IDs ("ENSG000..."), which looks promising. You can simply compare the coordinates of a few of the features from the file with the same genes on the Ensembl web site to make sure that the coordinates are consistent.
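The spot check described above can also be done in bulk if you have GTF files from both sources. The following is a rough sketch, assuming both files follow the usual nine-column, tab-separated GTF layout and store the gene ID in a `gene_id "..."` attribute. The function names and the toy records are made up for illustration and are not real coordinates.

```python
import re
from typing import Dict, Iterable, List, Tuple

Span = Tuple[str, int, int]  # (sequence name, lowest start, highest end)
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines: Iterable[str]) -> Dict[str, Span]:
    """Collect the overall extent of every gene found in GTF lines."""
    spans: Dict[str, Span] = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        seqname, start, end = fields[0], int(fields[3]), int(fields[4])
        gene = match.group(1)
        if gene in spans:
            _, lo, hi = spans[gene]
            start, end = min(lo, start), max(hi, end)
        spans[gene] = (seqname, start, end)
    return spans


def mismatched_genes(a: Dict[str, Span], b: Dict[str, Span]) -> List[str]:
    """Return the IDs of genes present in both sets whose extents differ."""
    return sorted(g for g in a.keys() & b.keys() if a[g] != b[g])


# Made-up toy records (hypothetical IDs and coordinates).
toy_a = [
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n',
    '1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";\n',
    '2\ttoy\texon\t50\t90\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B1";\n',
]
toy_b = [
    '1\ttoy\texon\t100\t400\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";\n',
    '2\ttoy\texon\t1050\t1090\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B2";\n',
]
differing = mismatched_genes(gene_spans(toy_a), gene_spans(toy_b))
```

Small differences can simply reflect different transcript sets, and sequence names may differ in spelling between sources (for example a "chr" prefix). If most shared genes are shifted by large amounts, though, a different assembly is the more likely explanation.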

          However, the file also states:

          # Datasources:
          # ASTD release 1.1(15/02/2008)
          # EnsEMBL homo_sapiens 41_36c

          This might indicate an old data version. The current Ensembl version is 56, using Homo sapiens build GRCh37. Maybe this is for the previous build, NCBI36? Note the small link "View in archive site" at the bottom of the Ensembl home page, which allows you to access old versions of the data.
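To check many annotation files for their data version, the header comments quoted above can be read programmatically. This is a small sketch: the pattern is modelled only on the three header lines shown here, and reading "41_36c" as release 41 plus an assembly tag "36c" is an assumption in line with the guess above, not something the file itself states.

```python
import re
from typing import Iterable, Optional, Tuple

# Modelled on the header line "# EnsEMBL homo_sapiens 41_36c" quoted above.
ENSEMBL_SOURCE = re.compile(r"#\s*EnsEMBL\s+(\w+)\s+(\d+)_(\w+)")


def ensembl_source(header_lines: Iterable[str]) -> Optional[Tuple[str, int, str]]:
    """Return (species, release, assembly tag) from the header, if present."""
    for line in header_lines:
        match = ENSEMBL_SOURCE.search(line)
        if match:
            # ASSUMPTION: the number before "_" is the Ensembl release and
            # the part after it identifies the assembly build.
            return match.group(1), int(match.group(2)), match.group(3)
    return None


header = [
    "# Datasources:",
    "# ASTD release 1.1(15/02/2008)",
    "# EnsEMBL homo_sapiens 41_36c",
]
source = ensembl_source(header)
```

With the release number in hand, the matching FASTA files can then be looked up on the Ensembl archive site mentioned above.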

          Simon


          • iloveneworleans
            Member
            • Jun 2009
            • 12

            #6
            Thanks very much, Simon!

            I thought Ensembl was also an institute, like the European Bioinformatics Institute (EBI) and NCBI, but Ensembl is actually a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. That's why I was confused.

            So the annotation file and FASTA-format sequence files provided on the EBI website (http://www.ebi.ac.uk/astd/download.html) are the same as the releases on the Ensembl website (http://uswest.ensembl.org/info/data/ftp/index.html).
            The only difference is that the current release on the EBI website is based on the old Ensembl data version (41_36c) instead of the latest version (56).
