  • Help with FastQ/CASAVA format problems

    Hey all, a newbie here. I'm not sure if this is the appropriate place to post this, but I was wondering if I could get some help with an issue involving Illumina deepseq data. I'm trying to run a batch of deepseq data that we recently received through CASAVA v1.7 and align it to a genome. The file is formatted as .fastq and the reads look like this:



    @6:1:1410:944:N
    NNNNCAAACACAAAGTTACCTAAACTATAGAAGTCAAACA
    +
    ####&&()''@@@@@@8@@@31888@@@@@3885817775



    However, when I try to run it through the program, it gives the following error:



    Could not identify index of the following line:
    *********************************
    6:1:1410:944:N
    *********************************

    Please check your files, we expect the following syntax:
    <machine-id>_<run-number>(flow_cell-id):lane:tile:x:y#<index>:<pair>
    machine-id: all characters except '_'



    I realize this is a formatting issue, as CASAVA wants the file in the format of:



    @<machine_id>:<lane>:<tile>:<x_coord>:<y_coord>#<index>/<read_#>



    But I am unsure how to go about fixing it. I'm pretty sure the machine_id is missing, as well as any information dealing with the index and read. Any help would be much appreciated. Thanks!

  • #2
    You could make a small script to chug through the file and add the machine id field (either the real one if you can acquire it, or else a made-up placeholder).

    Regarding the "#<index>:<pair>" fields, some more info on the experiment might be needed. Is this single-end or paired-end, and how many data files are there? (Illumina paired-end data usually comes in paired files, with each read pair positioned on corresponding lines in the two files.) Any multiplexing?


    • #3
      It is not paired-end, and I'm almost certain there is no multiplexing at all in the sample. A sample input would be a great help. Thanks for the assistance!
      Last edited by Airwalker810; 01-12-2011, 06:57 AM.


      • #4
        If it's single-end and no multiplexing, then you have all the information you need and it should just be a matter of formatting the ID line to make your program happy. The program is expecting read ID lines to look like this:

        @ILxx_1234:1:1:1103:6172#1/1
        @ILxx_1234:1:1:1103:16929#7/1
        @ILxx_1234:1:1:1103:13497#2/2

        Here the first field is the ID/name of the machine that performed the experiment, followed by the run number. The number after the "#" is the sample ID (if there are multiple samples), and the number after the "/" is the pair info for paired-end experiments (so it's either 1 or 2). If the program really wants a machine name, I guess you could just make up a phony one (ILmymachine_0001 or something more clever) for the first field. And since you have only a single sample, if the program really wants an index, I guess you could just add "#1" after the y-coordinate (removing the ":N" part; I'm not sure what it signifies). For the pair info, my guess is that you can just leave it out (i.e. simply skip the "/1" part) and the program will treat the data as single-end.
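        To make the layout above concrete, here is a minimal Python sketch that splits an ID line into its fields. The pattern is only my reading of the examples in this thread, not an official CASAVA specification, and the names EXPECTED_ID and parse_read_id are made up for illustration.

```python
import re

# Assumed layout, taken from the examples in this thread:
#   @<machine>_<run>:<lane>:<tile>:<x>:<y>#<index>[/<pair>]
# This is NOT an official CASAVA grammar, just a sketch of it.
EXPECTED_ID = re.compile(
    r"@(?P<machine>[^_:]+)_(?P<run>[^:]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?"
)


def parse_read_id(line):
    """Return the ID fields as a dict, or None if the line does not fit."""
    match = EXPECTED_ID.fullmatch(line.strip())
    return match.groupdict() if match else None


good = parse_read_id("@ILxx_1234:1:1:1103:6172#1/1")
bad = parse_read_id("@6:1:1410:944:N")  # the short ID from the first post
```

        With this pattern, the short ID from the original file is rejected because it has no machine/run field and no "#<index>" part, which matches what the error message complains about.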

        (NOTE: I don't know anything about CASAVA - as I understand things it is Illumina's own program that can do a bunch of stuff. It's not inconceivable that CASAVA itself can generate the correct ID lines from lower level files - but again I don't know much about the pre-fastq pipeline.)

        If you know a little Perl or Python scripting, you should be able to make those changes to the ID lines so that CASAVA accepts them. However, this is just a quick-and-dirty practical fix; I don't know the underlying reason why your read ID lines look the way they do (maybe whoever generated the files does).
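        A rough Python sketch of that quick-and-dirty fix follows. It assumes the four numbers in the short ID are lane, tile, x and y (as guessed above), that every record is exactly four lines, and it uses the placeholder machine name ILmymachine_0001 and index 1; the function names are hypothetical. I have not tested the output against CASAVA itself.

```python
import re

# Short IDs as they appear in the file, e.g. "@6:1:1410:944:N".
# Assumption: the fields are lane:tile:x:y plus an optional trailing flag.
SHORT_ID = re.compile(r"@(\d+):(\d+):(\d+):(\d+)(?::\w+)?")


def fix_header(line, machine="ILmymachine_0001", index="1", pair=None):
    """Rewrite one short ID line into @<machine>:<lane>:<tile>:<x>:<y>#<index>."""
    match = SHORT_ID.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unexpected read ID: {line!r}")
    lane, tile, x, y = match.groups()
    header = f"@{machine}:{lane}:{tile}:{x}:{y}#{index}"
    # The pair suffix is left out by default (single-end data).
    return header if pair is None else f"{header}/{pair}"


def fix_fastq(lines, **kwargs):
    """Yield FASTQ lines with only the ID line of each 4-line record rewritten.

    Counting lines (rather than looking for a leading "@") matters because
    quality strings may also start with "@".
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield fix_header(line, **kwargs) + "\n"
        else:
            yield line


def convert_file(in_path, out_path, **kwargs):
    """Stream in_path to out_path, rewriting the ID lines on the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(fix_fastq(src, **kwargs))
```

        Something like convert_file("reads.fastq", "reads_fixed.fastq") would then write the converted file (both file names are placeholders). If the program turns out to need the pair suffix after all, pass pair=1.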


        • #5
          Thanks for the help, should make things a bit easier with a little scripting. Yeah, I'm not sure what the deal with this data is, as I said, it was outsourced, and it came back looking like this mess. No idea why specific lines are missing from the data. My lab just procured a DeepSeq machine and I'm trying to force the data through that pipeline to make everything from the past and future work on the same analysis program.
