  • gsgs
    Senior Member
    • Oct 2009
    • 139

    Where to get aligned datasets?

    I usually download the data from GenBank,

    but it's tedious to align the sequences and filter out possible errors,
    incorrect insertions, or just distant, poorly matching strains.

    Others must have done the same thing ...

    It would be useful to provide the aligned data to others,
    so they needn't redo it.
    But I didn't find it. GenBank doesn't seem interested
    in providing it, or in storing it and making it available from others' uploads.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    What datasets are you referring to?

    If you are looking for gene-level precompiled alignments, then "HomoloGene" is the place you want to visit. Here is an example: http://www.ncbi.nlm.nih.gov/homologene/?term=brca2

    UCSC provides alignments. Look in the alignments section: http://hgdownload.soe.ucsc.edu/downloads.html#human

    Ensembl also has similar information available: http://www.ensembl.org/info/website/...s/compara.html

    Genome level alignments are also at Ensembl: http://www.ensembl.org/info/genome/c.../analyses.html

    • gsgs
      Senior Member
      • Oct 2009
      • 139

      #3
      I'm mainly doing influenza sequencing.

      So I need aligned datasets of ~10000 sequences of length 838-2280 nucleotides
      for avian influenza: the 8 segments, with 15 different strains for the HA and 9 for the
      NA, and each of these probably divided into a Eurasian and a North American lineage.

      Earlier I had human mitochondrial DNA here, 15000 sequences of length 16680.
      I also (occasionally) did Dengue (the 4 groups), Ebola, etc.
      Today I was trying Helicobacter pylori ...

      It's always the same problem: it takes hours to generate suitable aligned datasets.

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        A search brought this up. You must have seen this already: http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html

        Then there is http://www.fludb.org/brc/home.spg?decorator=influenza

        For Mitochondria: http://www.ncbi.nlm.nih.gov/genome/organelle/

        As you know firsthand, it takes time and effort to create meaningful MSAs. I am going to speculate that NCBI creates those for genes of model organisms and common genomes using the limited resources they have.

        You should consider making your own alignments available since that would save someone else some frustration.

        • gsgs
          Senior Member
          • Oct 2009
          • 139

          #5
          For influenza, I think the best approach is to download all the ~400000 unaligned GenBank sequences
          in FASTA format, which they provide in one file of ~650MB.
          But then you must filter for segments and groups, align, sort, etc.
          I'm doing this regularly, ~1-2 times per year, for the ~130000 avian sequences,
          splitting them into 5+2+9+16 aligned files. It takes 10-20 hours.
          If even one other person in the world is doing the same, sharing would save much time.
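
A minimal sketch of the "filter for segments, groups" step, in Python. It assumes each FASTA header contains a subtype in parentheses such as `(H5N1)` and the words `segment N`; that header layout is an assumption and should be checked against the actual download. The length bounds 838-2280 are the ones mentioned in post #3; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed header layout (hypothetical example, verify against your file):
#   >...|Influenza A virus (A/duck/Hong Kong/1/2005(H5N1)) segment 4
SEGMENT_RE = re.compile(r"segment (\d)")
SUBTYPE_RE = re.compile(r"\((H\d+N\d+)\)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bin_by_segment(records, min_len=838, max_len=2280):
    """Group records by (segment, subtype).

    Records with no recognizable segment or subtype, or with a length
    outside [min_len, max_len], are collected separately for manual review.
    """
    bins = defaultdict(list)
    rejected = []
    for header, seq in records:
        seg = SEGMENT_RE.search(header)
        sub = SUBTYPE_RE.search(header)
        if seg is None or sub is None or not (min_len <= len(seq) <= max_len):
            rejected.append(header)
            continue
        bins[(int(seg.group(1)), sub.group(1))].append((header, seq))
    return bins, rejected
```

Each bin can then be written to its own file and passed to an aligner. Keeping the rejected headers, rather than silently dropping them, makes it easier to spot wrong or missing classifications later.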

          Ideally you would have ~100 files of aligned sequences for the strains, each with an index,
          and the files sorted by best neighbor match. From these you can extract and filter whatever you want.
          flugenome.org did something like this, but it is no longer being updated.
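
One possible reading of "sorted by best neighbor match" is a greedy ordering in which each sequence is followed by its closest remaining neighbor. The Python sketch below assumes the sequences are already aligned to equal length and uses plain Hamming distance; both the interpretation and the function names are assumptions, not the poster's actual method.

```python
def hamming(a, b):
    """Count differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def neighbor_order(seqs):
    """Return indices of seqs in a greedy nearest-neighbor order.

    Starts from the first sequence, then repeatedly appends the remaining
    sequence closest (by Hamming distance) to the last one placed.
    """
    if not seqs:
        return []
    remaining = list(range(1, len(seqs)))
    order = [0]
    while remaining:
        last = seqs[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(last, seqs[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

This is quadratic in the number of sequences, so for ~10000 sequences per file a pure-Python version would be slow; it is meant only to show the idea.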


          Flu comes from birds. Whenever
          it jumps to new hosts, you want to know where it came from
          (for the genome and for each of the 8 segments separately), how it evolved,
          and whether/where there is pandemic danger.

          And then I do the human and swine sequences for special types less regularly,
          when the flu season starts and there are new variants or such.

          I assume it's similar for other organisms: the data should be provided
          in filtered, sorted, aligned form.

          I could easily make my files available from my HD, but where should I put them so others will find them?
          Best to send them on micro-SD.



          What's MSA?
          Last edited by gsgs; 11-26-2015, 05:55 AM.

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            MSA = Multiple sequence alignment

            Doesn't NCBI allow you to do something similar to flugenome here (it is limited to 1000 genomes)? http://www.ncbi.nlm.nih.gov/genomes/...i?go=alignment

            That said, I agree with you that the analysis you are doing would be a useful resource for the flu community. But since the number of people working on flu must be relatively small, can't you propose internally (at a relevant meeting or working group) that a resource such as this be created and then hosted by the group?

            Or you could write to NCBI and the group that manages the flu database and see if they would be interested in presenting the data the way you are proposing.

            • gsgs
              Senior Member
              • Oct 2009
              • 139

              #7
              It's not just the MSA: you must remove/separate errors, nonmatches,
              single-nucleotide insertions (==> probably errors), pseudo-recombinations, wrong segments,
              wrong or missing strain classifications, and such.
              And then sort the sequences. And these are typically 10000 sequences.
              It can be done, but it takes some time (or tedious automation...)
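
One of these checks, flagging single-nucleotide insertions carried by only one sequence, can be sketched in Python as follows. It assumes the alignment is a list of equal-length strings with `-` as the gap character; the function name is hypothetical, and flagged columns are candidates for review, not confirmed errors.

```python
def lone_insertion_columns(aligned, gap="-"):
    """Map column index -> index of the only sequence with a non-gap there.

    A column in which exactly one sequence carries a residue and all others
    have a gap is a candidate single-sequence insertion.
    """
    if not aligned:
        return {}
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    flagged = {}
    for col in range(width):
        carriers = [i for i, s in enumerate(aligned) if s[col] != gap]
        if len(carriers) == 1:
            flagged[col] = carriers[0]
    return flagged
```

Adjacent flagged columns that belong to the same sequence would indicate a longer insertion rather than a single-nucleotide one, so a real filter would probably treat those separately.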

              I've been emailing with the GenBank flu expert since 2006.
              They are not interested. GenBank's flu resource has improved since
              2006, though: more features, and more uniform (= computer-friendly).

              I could upload it somewhere, but no one will find it.

              The flu community may be small (and I'm not a member who goes to meetings, writes papers,
              or is paid as a professional), but this problem should apply to sequencing in general.
              It's just my amateur pandemic concern, which started with H5N1 in 2005.

              They may have somehow "solved" it in the human community (?)
