  • Where to get aligned datasets?

    I usually download the data from GenBank,

    but it's tedious to align the sequences and filter out possible errors,
    incorrect insertions, or distant strains that don't match well.

    Others must have done the same thing ...

    It would be useful to provide the aligned data to others,
    so they needn't redo it.
    But I couldn't find any. GenBank doesn't seem interested
    in providing it, or in storing it and making it available from others' uploads.

  • #2
    What datasets are you referring to?

    If you are looking for gene level pre-compiled alignments then "Homologene" is the place you want to visit. Here is an example: http://www.ncbi.nlm.nih.gov/homologene/?term=brca2

    UCSC provides alignments. Look in the alignments section: http://hgdownload.soe.ucsc.edu/downloads.html#human

    Ensembl also has similar information available: http://www.ensembl.org/info/website/...s/compara.html

    Genome level alignments are also at Ensembl: http://www.ensembl.org/info/genome/c.../analyses.html


    • #3
      I'm mainly doing influenza sequencing.

      So I need aligned datasets of ~10000 sequences, 838-2280 nucleotides long,
      for avian influenza: one per segment for the 8 segments, with 15 different strains for HA
      and 9 for NA, each of these probably divided into a Eurasian and a North American lineage.

      Earlier, here, I had human mitochondrial DNA: 15000 sequences of length 16680.
      I also (occasionally) did dengue (the 4 groups), Ebola, etc.
      Today I was trying Helicobacter pylori ...

      It's always the same problem: it takes hours to generate suitable aligned datasets.


      • #4
        A search brought this up. You must have seen this already: http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

        Then there is http://www.fludb.org/brc/home.spg?decorator=influenza

        For Mitochondria: http://www.ncbi.nlm.nih.gov/genome/organelle/

        As you know firsthand, it takes time and effort to create meaningful MSAs. I am going to speculate that NCBI creates those for genes of model organisms/common genomes using the limited resources they have.

        You should consider making your own alignments available since that would save someone else some frustration.


        • #5
          For influenza, I think the best option is to download all the ~400000 unaligned GenBank sequences
          in FASTA format, which they provide in one file of ~650MB.
          But then you must filter by segment and group, align, sort, etc.
          I do this regularly, ~1-2 times per year, for the ~130000 avian sequences,
          producing 5+2+9+16 aligned files. It takes 10-20 hours.
          If even one other person in the world is doing the same ... it would save much time.
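
          The filter-by-segment step described above can be sketched in Python. This is only a minimal sketch, not the pipeline actually used in this thread: it assumes each FASTA header contains the text "segment N" (a hypothetical convention; real GenBank headers may look different), and the names `read_fasta` and `split_by_segment` are made up for illustration.

```python
import re
from collections import defaultdict

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Assumption: the header mentions "segment N" somewhere. Adjust to the real format.
SEGMENT_RE = re.compile(r"segment (\d)", re.IGNORECASE)

def split_by_segment(records, min_len=0, max_len=None):
    """Group records by segment number, dropping unlabelled or odd-length ones."""
    groups = defaultdict(list)
    for header, seq in records:
        match = SEGMENT_RE.search(header)
        if match is None:
            continue  # no segment label: set aside for manual inspection
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        groups[int(match.group(1))].append((header, seq))
    return dict(groups)
```

          Each group could then be written to its own file and passed to an alignment program; the length limits are a crude first filter for truncated or wrong-segment entries.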

          Ideally you would have ~100 files with aligned sequences for the strains, with an index for each,
          and the files sorted by best neighbor match. From these you can extract and filter whatever you want.
          flugenome.org did something like this, but is no longer being updated.
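
          One possible reading of "sorted by best neighbor match" is a greedy ordering in which each sequence is followed by the remaining sequence closest to it. The sketch below assumes that reading, assumes the sequences are already aligned to equal length, and uses plain Hamming distance; the function names are illustrative and the thread does not say which method was actually used.

```python
def hamming(a, b):
    """Count mismatching columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_order(records):
    """Greedily order (name, aligned_seq) pairs so neighbors are similar.

    Starts from the first record and repeatedly appends the remaining
    record with the smallest Hamming distance to the last one placed.
    The number of comparisons grows quadratically with the input size.
    """
    if not records:
        return []
    remaining = list(records)
    ordered = [remaining.pop(0)]
    while remaining:
        last_seq = ordered[-1][1]
        best = min(range(len(remaining)),
                   key=lambda k: hamming(last_seq, remaining[k][1]))
        ordered.append(remaining.pop(best))
    return ordered
```

          With an ordering like this, related sequences end up next to each other in the file, so a lineage can be extracted as a contiguous block.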


          Flu comes from birds. Whenever
          it jumps to new hosts, you want to know where it came from
          (the genome and each of the 8 segments separately), how it evolved,
          and whether/where there is pandemic danger.

          And then the human and swine sequences for special types, less regularly,
          when the flu season starts and there are new variants or such.

          I assume it's similar for other organisms: the data should be provided
          in filtered, sorted, aligned form.

          I could easily make my files available from my HD, but where should I put them so others will find them?
          Best to send them on micro-SD.



          What's MSA?
          Last edited by gsgs; 11-26-2015, 05:55 AM.


          • #6
            MSA = Multiple sequence alignment

            Doesn't NCBI allow you to do something similar to flugenome here (it is limited to 1000 genomes): http://www.ncbi.nlm.nih.gov/genomes/...i?go=alignment

            That said, I agree with you that the analysis you are doing would be a useful resource for the flu community. But since the number of people working on flu must be relatively small, can't you propose internally (at a relevant meeting/working group) that a resource such as this be created and then hosted by the group?

            Or you could write to NCBI and the group that manages the flu database and see if they would be interested in presenting the data the way you are proposing.


            • #7
              It's not just the MSA: you must remove/separate errors, non-matches,
              single-nucleotide insertions (==> probably errors), pseudo-recombinations, wrong segments,
              wrong or missing strain classifications, and such.
              And then sort the sequences. And these are typically 10000 sequences.
              It can be done, but it takes some time (or tedious automation...)
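
              As an illustration of one of these checks, the sketch below flags alignment columns in which only one sequence (or very few) has a base and all others have a gap, which is how a suspect single-nucleotide insertion shows up after alignment. It is a minimal sketch under assumptions not stated in the thread: gaps are written as "-", all sequences have equal aligned length, and `rare_insertion_columns` is an illustrative name.

```python
def rare_insertion_columns(aligned, max_carriers=1, gap="-"):
    """Map column index -> names of the few sequences with a base there.

    `aligned` is a list of (name, aligned_seq) pairs. A column is flagged
    when at least one but no more than `max_carriers` sequences have a
    non-gap character in it.
    """
    if not aligned:
        return {}
    length = len(aligned[0][1])
    if any(len(seq) != length for _, seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    flagged = {}
    for col in range(length):
        carriers = [name for name, seq in aligned if seq[col] != gap]
        if 0 < len(carriers) <= max_carriers:
            flagged[col] = carriers
    return flagged
```

              Columns at the ends of the alignment can also be flagged when one sequence is simply longer than the rest, so the output is a list of candidates to inspect rather than a verdict.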

              I've been exchanging emails with the GenBank flu expert since 2006.
              They are not interested. GenBank flu has improved since
              2006, though: more features, more uniform = more computer-friendly.

              I could upload it somewhere, but no one will find it.

              The flu community may be small (and I'm not a member who attends meetings, writes papers,
              or is a paid professional), but this problem in general should apply to all sequencing.
              It's just my amateur pandemic concern, which started with H5N1 in 2005.

              They may have somehow "solved" it in the human community (?)
