  • Ender985
    Member
    • Mar 2009
    • 12

    Short Read Archive format problems

    Hello all,

    I've been downloading some Illumina/Solexa short read files from SRA such as this one, to test and get used to MAQ and BWA.

    It seems the format of the provided short reads is Solexa fastq, i.e.,
    Code:
    @SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
    TCTGTCTCAAAAACAAAACAAAACAAAACAAAAAAA
    +SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAII1
    However, whenever I try to either convert this format to Sanger format using
    $ maq sol2sanger SRR002322.fastq SRR002322.sang.fastq
    or try to convert this .fastq file to binary .bfq format, an extremely large warning list is shown on the terminal, spanning several thousands of errors like these,

    Code:
    [seq_read_fastq] Inconsistent sequence name: II9II<%IIIII6I. Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: +II'IIII(). Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: .IIIIII'IIIIIIIIIIIIIII3IIE. Continue anyway.
    (...)
    Well, I've investigated a little, and I think I've found the origin of all these errors. All the problems concern short reads whose quality string contains an '@' symbol. For example, the three short reads matching the three errors I've just shown are

    Code:
    @SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
    GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
    +SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
    IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I
    
    @SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
    TGTATTTTTAGTAGAGACGTGGTTTCACCATCTTGT
    +SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
    IIIIIIIII%III+IIIIIIIIIII@+II'IIII()
    
    @SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
    TAAAAATGCCCTAGCCTACTTCTTACCACAAGGCAC
    +SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
    IIIIIIII@.IIIIII'IIIIIIIIIIIIIII3IIE
    all the other sequences are converted just fine.

    My bet is that MAQ interprets everything after an @ as a sequence name and thus misinterprets the following lines as well. If I let the conversion run to the end of the file, the resulting .sang.fastq file contains some funny short reads. Apart from the normal reads like this one,

    Code:
    @SRR002322.11
    GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
    +
    !"!!!"@&.!,+&!-+7!!!3'1'%5@!!!!"!"!"
    I also get a ton of entries like this one:

    Code:
    @II9II<%IIIII6I
    SRR.CM-KID-LIV--REPEATlengthTTTTTGCATCAAAAAGCTTTATTTCCATTTGGTCCA
    +
    %&%%%&B)0%.-)%/-9%%%5*3*(7B%%%%&%&%&!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Note how the 'name' of this nonsensical short read is the end of the first problematic quality string I've shown before, II9II<%IIIII6I.


    So since I've searched this forum and haven't found anyone else with the same problems as me, I think I must be doing something wrong. Are the SRA files not in Solexa/Illumina fastq format? What am I missing?

    Lots of thanks!
  • jkbonfield
    Senior Member
    • Jul 2008
    • 146

    #2
    I have a totally hideous script, but FAST, to convert solexa fastq with log-odds +64 to phred +33 format.

    The horrid tr is basically just doing the quality mapping and was generated by a simple perl 1-liner.

    However that said, your data doesn't look to be in solexa format anyway. All those 'I's are quality 40 (ascii 73 => 33+40).

    Code:
    # Read the fastq file, with blind faith it's in the correct format.
    while (<>) {
        print;                      # name
        $_=<>; print;               # sequence
        $_=<>; print "+\n";         # quality header (was name)
        $_=<>;
        tr/\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175/\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\042\042\042\042\042\042\043\043\044\044\045\045\046\046\047\050\051\052\053\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136/;
        print;                      # quality
    }
    edit: that hideous auto-generated tr I think actually boils down to:

    tr/!-\175/!!!!!!!!!!!!!!!!!!!!!!""""""##$$%%&&-++,-\136/;

    It still looks like wonderful line noise though :-)
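
For readers who would rather see the arithmetic than the line noise, here is a minimal Python sketch of the same mapping. It assumes the usual relation between Solexa log-odds scores and Phred scores, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded to the nearest integer. The names solexa_to_phred, SOLEXA_TO_SANGER and convert_quality are made up for this illustration and are not part of MAQ or of the Perl script above.

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Convert a Solexa log-odds score to the nearest Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10) + 1)))


def build_table() -> dict:
    """Map Solexa+64 characters (ASCII 33..125) to Phred+33 characters."""
    table = {}
    for code in range(33, 126):
        q_phred = solexa_to_phred(code - 64)
        table[code] = chr(33 + q_phred)
    return table


# Translation table usable with str.translate (keys are code points).
SOLEXA_TO_SANGER = build_table()


def convert_quality(qual: str) -> str:
    """Re-encode one Solexa+64 quality string as Phred+33."""
    return qual.translate(SOLEXA_TO_SANGER)


if __name__ == "__main__":
    # 'h' is Solexa Q40 (ASCII 104) and becomes 'I' (Phred 40, ASCII 73).
    print(convert_quality("hhhh@;"))
```

Only apply something like this to data that really is Solexa +64; as noted above, the reads in this thread do not look like they are.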
    Last edited by jkbonfield; 04-09-2009, 05:13 AM.

    • kmcarr
      Senior Member
      • May 2008
      • 1181

      #3
      The file you downloaded appears to already be in standard Sanger FASTQ format, so there is no reason to convert. For Sanger FASTQ the conversion to a Phred score is ASCII(n)-33 (where 'n' is the character in the quality string). The majority of your quality values are 'I', which is ASCII 73, so 73-33 = 40, a reasonable Phred score. If the file were using Solexa scoring (ASCII(n)-64), the majority of scores would be 9! Further, the original file has one '3' in the quality string (ASCII 51); if this were a Solexa file this would translate to a Q score of 51-64 = -13, which I think is below the lower limit for Solexa Q scores.

      I think you can skip the sol2sanger step and proceed with the file as downloaded.
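
The arithmetic above can be checked with a short Python sketch. The helpers decode and looks_like_sanger are hypothetical names for this illustration, and the cutoff assumes that Solexa scores do not go below about -5 (character ';' under the +64 offset). It is a rough heuristic, not a definitive format detector.

```python
def decode(qual: str, offset: int) -> list:
    """Return the numeric scores for a quality string under a given ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def looks_like_sanger(qual: str) -> bool:
    """Heuristic: any character below ';' (ASCII 59) cannot be Solexa+64,
    assuming Solexa scores bottom out at about -5."""
    return min(ord(ch) for ch in qual) < 59


if __name__ == "__main__":
    # Quality string of the first read quoted in this thread.
    qual = "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAII1"
    as_sanger = decode(qual, 33)
    as_solexa = decode(qual, 64)
    print("Sanger (+33) range:", min(as_sanger), max(as_sanger))
    print("Solexa (+64) range:", min(as_solexa), max(as_solexa))
    print("Looks like Sanger:", looks_like_sanger(qual))
```

Read as +33 the scores stay within a sensible Phred range, while read as +64 the trailing '1' would give a score far below what Solexa scoring allows.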

      • aaronh
        Member
        • Sep 2008
        • 46

        #4
        I had this problem and solved it by removing the spaces from the sequence name. If you look at the fastq definition on the MAQ page, you will see that spaces are not allowed, <seqname>:=[A-Za-z0-9_.:-]. I'm not sure if this is the official definition of a fastq file but that is what MAQ uses. Get rid of the spaces and you should be fine.
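
One way to do this without touching quality lines that happen to start with '@' or '+' is to work by line position rather than by leading character. The Python sketch below assumes strict 4-line records (no wrapped sequences, no blank lines between records). The function names are made up for this example. It replaces every character outside the set quoted above with an underscore, so '=' is replaced as well as spaces.

```python
import re
from typing import Iterable, Iterator

# Anything outside the character set quoted from the MAQ page.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9_.:-]")


def sanitize_name(header: str) -> str:
    """Keep the leading '@' or '+' and clean up the rest of the line."""
    marker, name = header[0], header[1:].rstrip("\r\n")
    return marker + NOT_ALLOWED.sub("_", name) + "\n"


def sanitize_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite lines 1 and 3 of every 4-line record; pass the others through."""
    for i, line in enumerate(lines):
        if i % 4 in (0, 2):
            yield sanitize_name(line)
        else:
            yield line


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python sanitize.py < in.fastq > out.fastq
    sys.stdout.writelines(sanitize_fastq(sys.stdin))
```

Because only the first and third line of each record are edited, a quality string beginning with '@' is left alone.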

        • Ender985
          Member
          • Mar 2009
          • 12

          #5
          As I'm still fairly new to the world of DNA-seq I didn't realise the sequences were already in Sanger fastq format, so the sol2sanger step was indeed not needed at all. Nonetheless, the problem with maq persisted when I ran maq match on those sequences.

          So I tried aaronh's solution, and it worked perfectly! After replacing all of the blank spaces, maq is running smoothly and with no errors.
          I still don't get why only the sequences containing an @ in their quality string were failing, since all of them contained blank spaces in the name, but I guess it is just the way it is coded.

          Lots of thanks!

          • aaronh
            Member
            • Sep 2008
            • 46

            #6
            From what I recall, actually all of the reads are failing but it is only complaining about the ones with the @. If you take the bfq file and convert it back to fastq, I think it looks like junk.

            • polivares
              Member
              • Jan 2009
              • 29

              #7
              As the Wikipedia article states, SRA files already use Sanger quality scores. You only need to remove the spaces. Please tell me if I am wrong.

              • abelcable
                Junior Member
                • Jul 2010
                • 1

                #8
                Space in the name

                In case anyone stumbles across this problem again, I figured out how to solve it in the code. Open the file seq.c in the top directory of maq (this is for maq 0.7.1; it may work for other versions if the code in this file is the same) and look for the function called seq_read_fastq. Look for this while loop:
                Code:
                while (!feof(fp) && (c = fgetc(fp)) != ' ' && c != '\t' && c != '\n')
                    if (c != '\r' && *p++ != c) {
                        fprintf(stderr, "[seq_read_fastq] Inconsistent sequence name: %s. Continue anyway.\n", name);
                        return seq->l;
                    }
                Insert this code immediately after the loop and re-compile.
                Code:
                 if (c != '\n') while (!feof(fp) && fgetc(fp) != '\n');
                That should fix the space in the name problem. The @ symbols in the quality scores are not the cause: the code uses that symbol as an anchor to find the next record, and when there are spaces in the name the parsing gets out of step and everything is messed up. So you don't have to remove the @ symbols in the quality scores.

                This makes the code ignore anything after the first white space. If your name includes spaces, this will truncate the name to the part before the space. I guess that's not that great if you need the whole name, but this will at least give you a hint as to how to fix that too.
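
For comparison, the same behaviour (keep the name up to the first whitespace, ignore the rest of the line) can be sketched in Python. This is not MAQ's code. It is a minimal reader that assumes strict 4-line records and finds record boundaries by position, so an '@' inside or at the start of a quality string is harmless. The names read_fastq and first_token are invented for the example, and the second record in the demo is made-up data.

```python
from typing import Iterable, Iterator, Tuple


def first_token(text: str) -> str:
    """Return the text up to the first whitespace, or '' if there is none."""
    fields = text.split(None, 1)
    return fields[0] if fields else ""


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError("truncated FASTQ record")
        seq, plus, qual = (s.rstrip("\r\n") for s in rest)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the 4-line layout")
        name = first_token(header[1:])
        plus_name = first_token(plus[1:])
        # The '+' line may be bare or may repeat the name.
        if plus_name and plus_name != name:
            raise ValueError("inconsistent sequence name: " + plus_name)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for " + name)
        yield name, seq, qual


if __name__ == "__main__":
    demo = [
        "@SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n",
        "GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG\n",
        "+SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n",
        "IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I\n",
        "@read2 made-up example\n",  # hypothetical record, not from SRA
        "ACGT\n",
        "+\n",
        "@II+\n",  # quality string that starts with '@'
    ]
    for name, seq, qual in read_fastq(demo):
        print(name, len(seq), qual)
```

As with the C patch, anything after the first whitespace in the name is dropped, so this only suits cases where the first token is unique.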
                Last edited by abelcable; 07-28-2010, 11:25 AM.
