  • 1000genomes SRR107002.filt.fastq.gz bad format?

    The first few lines of ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz are:
    @SRR010375.1 VAB_S0083_BCM_20080731_1_Pilot_2_YRI_Fragment2199_1705_1499 length=35
    T01000333201121302130101020001110011
    +
    !5-,'324)%,72%91/<9'50*+&%$(%$6&(.)1

    I was expecting the second line to contain something like:
    TGGGGCGCTTGTCATTCATTGGTTCCCCACAAACA

    When fed into Bowtie it complains "Reads file contained a pattern with more than 1024 quality values."

    Does anyone recognise the T01000333201121302130101020001110011 format?
    Thank you
    Bill
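
A quick way to check whether a FASTQ file is in this digit-based format before handing it to an aligner is to look at the sequence lines of the first few records. The sketch below is only illustrative: the function names and the regular expression are assumptions of this example, not part of any aligner, and it assumes plain four-line FASTQ records.

```python
import re
from itertools import islice

# Assumed pattern: one primer base followed by colour calls 0-3
# ('.' is accepted as a missing call).
COLORSPACE_RE = re.compile(r"^[ACGT][0-3.]+$")

def looks_like_colorspace(seq):
    """Return True if a single sequence line looks like a color-space read."""
    return bool(COLORSPACE_RE.match(seq.strip()))

def fastq_is_colorspace(lines, n=100):
    """Check the sequence line (2nd of every 4) of the first n records."""
    seqs = list(islice(islice(lines, 1, None, 4), n))
    return bool(seqs) and all(looks_like_colorspace(s) for s in seqs)

# The record quoted in the post above.
record = [
    "@SRR010375.1 VAB_S0083_BCM_20080731_1_Pilot_2_YRI_Fragment2199_1705_1499 length=35",
    "T01000333201121302130101020001110011",
    "+",
    "!5-,'324)%,72%91/<9'50*+&%$(%$6&(.)1",
]
print(fastq_is_colorspace(record))
```

For a gzipped file, the same function can be given the handle returned by `gzip.open(path, "rt")`, which yields text lines.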

  • #2
    It's a FASTQ file from a SOLiD sequencer, so the reads are not in 'base space' but in 'color space'. It's 2-base encoding: the leading T is a primer base, and each digit (0-3) encodes a pair of adjacent bases.

    Take a look at:

    Researchers use Applied Biosystems integrated systems for sequencing, flow cytometry, and real-time, digital and end point PCR—from sample prep to data analysis.
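
To make the 2-base encoding concrete, here is a minimal sketch of the standard SOLiD colour code, in which each colour is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function name is made up for this example, and the handling of '.' as a missing call is an assumption. Decoding the read from the first post gives exactly the sequence the poster expected to see.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3

def decode_colorspace(read):
    """Translate a color-space read (primer base + digits 0-3) into bases.

    The primer base itself is not part of the sequenced fragment, so it is
    used only as the starting point and is not included in the output.
    """
    primer, colors = read[0], read[1:]
    prev = BASES.index(primer)
    out = []
    for c in colors:
        if c not in "0123":
            # Assumption: a missing call makes the rest of the read unknown.
            out.append("N" * (len(colors) - len(out)))
            break
        prev ^= int(c)  # next base code = previous base code XOR colour
        out.append(BASES[prev])
    return "".join(out)

print(decode_colorspace("T01000333201121302130101020001110011"))
# prints TGGGGCGCTTGTCATTCATTGGTTCCCCACAAACA
```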


    • #3
      Dear Bukowski,
      Many thanks for your rapid and helpful reply:-)
      (I must admit I did not follow CSHL_Fu.pdf, but I guess that's not necessary to use
      the data.)

      I am now using bowtie with --color. However, I guess it will take 4 or 5 hours for
      bowtie-build to create a colorspace index for me.

      BTW is there any reason why bowtie does not read-convert colorspace files
      and use them with its usual indexes?
      [I guess a more helpful error message would not go amiss either.]

      Alternatively, does anyone have a color-sequence-to-FASTA conversion tool?

      Many thanks
      Bill


      • #4
        I haven't done any work with SOLiD data for a couple of years, but the di-base encoding means that each colour encodes the transition between two adjacent bases. So if a single colour call is wrong when you convert from color space to base space (there are tools for this, but I never used them), every subsequent base in the read comes out wrong.

        There's a good (brief) overview of some of the considerations in the SHRiMP paper:

        Author Summary Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or “dibase”, reads generated by AB SOLiD sequencers.


        Consequently, it is better to do the alignment in a color-space-aware tool, and if I still worked with SOLiD data that is what I would do. However, some tools (such as BWA) have already dropped color space support.
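
The error propagation described above can be seen with a small sketch. It uses the standard XOR colour code (A=0, C=1, G=2, T=3) and the read from the first post; the helper names and the choice of which colour to corrupt are arbitrary assumptions of this example.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3

def decode(read):
    """Decode primer base + colours into bases (no handling of '.')."""
    prev = BASES.index(read[0])
    out = []
    for c in read[1:]:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)

def corrupt(read, i, new_color):
    """Replace the colour at 0-based index i (the primer is not counted)."""
    pos = i + 1
    return read[:pos] + new_color + read[pos + 1:]

read = "T01000333201121302130101020001110011"
bad = corrupt(read, 10, "2")  # the original colour at index 10 is '1'

good_bases = decode(read)
bad_bases = decode(bad)
mismatches = [i for i, (a, b) in enumerate(zip(good_bases, bad_bases)) if a != b]

print(good_bases)
print(bad_bases)
print(mismatches[0], len(mismatches))  # prints 10 25
```

One miscalled colour out of 35 turns into 25 wrong bases after conversion. In color space the same error is a single mismatch, whereas a genuine single-base difference from the reference changes two adjacent colours, which is what a color-space-aware aligner can exploit.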


        • #5
          Dear Bukowski,
          Once again thank you for your very helpful reply.

          It took just under 3 hours for bowtie-build to create a colorspace index
          for the human genome (NCBI 37.5 ASM). It seems to be working well.

          Thanks again
          Bill
