  • Issue with FASTA header in QIIME

    Dear all,

    I have to analyze a set of 26 samples of 16S amplicon data from 250 nt paired-end Illumina HiSeq reads. When I received the sequences they were already demultiplexed, merged and converted into FASTA format. I have no access to the barcode and primer sequences, since the commercial provider who performed the sequencing refuses to provide them (they say it is confidential information).

    After extensively reading the QIIME documentation and multiple forum questions about how to analyze this kind of sequence data, I'm afraid I'm one step beyond in the difficulty of this issue (or one step behind by not understanding the information I read... we will see).

    I face 2 main problems:

    1) The FASTA header of the sequences.

    The current header has this format:

    >Sample_Name tagX (Where X is the number of each consecutive tag from 1 to N)

    After reading the add_qiime_labels documentation (http://qiime.org/scripts/add_qiime_labels.html) I understand that my header is completely different from that in the examples:

    >Sample.1_0 FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_ AACAGATTAGACCAGATTAAGCCGAGATTTACCCGA

    And I have no means of obtaining all the information lacking in my headers.


    2) How to create a functional mapping file for QIIME, taking into account my current FASTA headers.

    I guess this second issue can be fixed easily if the first issue can be fixed.

    Thanks in advance.


    JL

  • #2
    JL,

    It appears that your service provider has already done all this work for you.

    - You do not need to have the barcode sequences because they have already demultiplexed the reads.

    - You probably do not need the primer sequences, because it is likely they already trimmed the primers as part of the merging process. If they did not state explicitly whether primer sequences were trimmed, ask them. This is essential for you to know.

    - The header format they provided you is nearly what you need; just change

    Code:
    >Sample_Name tagX
    to
    >Sample_Name_X
    [Honestly QIIME may be perfectly happy with the format of the FASTA deflines already in the file. I don't use QIIME so can't say for sure.]

    - All the other stuff on the example defline in the QIIME manual is irrelevant to your data. The example is from a Roche 454 GS-FLX read, which is a discontinued platform.
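
The defline change suggested above can be sketched in a few lines of Python. This is only a minimal illustration, assuming every header looks exactly like `>Sample_Name tagX` with no spaces inside the sample name; the function names `relabel_defline` and `relabel_fasta` are made up for this example and are not part of QIIME.

```python
import re

# Matches deflines of the form ">Sample_Name tag12" (the format described in
# this thread). Assumption: the sample name itself contains no whitespace.
DEFLINE = re.compile(r"^>(\S+)\s+tag(\d+)\s*$")


def relabel_defline(line):
    """Turn '>Sample_Name tagX' into '>Sample_Name_X'.

    Sequence lines and deflines that do not match are returned unchanged.
    """
    match = DEFLINE.match(line)
    if match is None:
        return line
    sample, number = match.groups()
    return ">{}_{}\n".format(sample, number)


def relabel_fasta(in_path, out_path):
    """Stream a FASTA file and rewrite its deflines into a new file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(relabel_defline(line))
```

Calling `relabel_fasta("merged.fasta", "relabeled.fasta")` (both file names are placeholders) would write a copy with the new deflines and leave the input untouched, so the result can be inspected before it is passed to any downstream tool.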

    • #3
      Dear kmcarr,

      Thank you very much for your answer!
      I'm currently on holidays, but I will try to test your solution as soon as I get back to work.

      Best

      JL

      • #4
        Here is how I'm handling demultiplexed data from a MiSeq (I think it should be very similar to HiSeq as far as headers go). Be aware that QIIME uses _ as a field delimiter, so you can't have any in your sample name.
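
The poster's own script is not reproduced in this thread. As a separate, minimal sketch of the underscore point, the Python below replaces underscores (and any other non-alphanumeric character) in a sample name with periods, builds a `SampleID_N` sequence label, and writes a bare-bones tab-separated mapping file. The helper names are hypothetical. The four column names are the ones I recall from the QIIME 1 mapping file format, and leaving the barcode and primer columns empty for already-demultiplexed data is an assumption, so the result should be checked with QIIME's own mapping file validation.

```python
import csv
import io


def sanitize_sample_id(name):
    """Replace every non-alphanumeric character (including '_') with '.'."""
    return "".join(ch if ch.isalnum() else "." for ch in name)


def qiime_label(sample_name, seq_number):
    """Build a 'SampleID_N' label whose only underscore is the delimiter."""
    return "{}_{}".format(sanitize_sample_id(sample_name), seq_number)


def write_mapping_file(sample_names, handle):
    """Write a minimal tab-separated mapping file to an open text handle.

    Assumption: barcode and primer columns are left empty because the reads
    are already demultiplexed and the sequences are unknown.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"]
    )
    for name in sample_names:
        sample_id = sanitize_sample_id(name)
        writer.writerow([sample_id, "", "", sample_id])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_mapping_file(["Gut_A", "Gut_B"], buffer)
    print(buffer.getvalue(), end="")
    print(qiime_label("Gut_A", 7))
```

Two different sample names can collapse to the same ID after sanitizing (for example `Gut_A` and `Gut.A`), so it is worth checking the sanitized IDs for duplicates before relabeling all 26 samples.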



        I'm not a fan of QIIME, so my script just gets you to the beginning of the clustering process. If you are just starting out with this kind of analysis, I think mothur is much better documented, which makes it easier to learn. Plus, mothur does fully de novo clustering, as opposed to QIIME's approach of closed-reference clustering followed by de novo clustering of the reads that don't match. Clustering your data by 2 methods based on an incomplete reference is sketchy.
        Microbial ecologist, running a sequencing core. I have lots of strong opinions on how to survey communities, pretty sure some are even correct.
