  • 1000genomes SRR107002.filt.fastq.gz bad format?

    The first few lines of ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz are:
    @SRR010375.1 VAB_S0083_BCM_20080731_1_Pilot_2_YRI_Fragment2199_1705_1499 length=35
    T01000333201121302130101020001110011
    +
    !5-,'324)%,72%91/<9'50*+&%$(%$6&(.)1

    I was expecting the second line to contain something like:
    TGGGGCGCTTGTCATTCATTGGTTCCCCACAAACA

    When fed into Bowtie it complains "Reads file contained a pattern with more than 1024 quality values."

    Does anyone recognise the format of T01000333201121302130101020001110011?
    Thank you
    Bill

  • #2
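    It's a FASTQ file from a SOLiD sequencer, so the reads are not in 'base space' but in 'color space'. It uses 2-base encoding: the leading T is the primer base, and each digit (0-3) encodes the transition between two adjacent bases rather than a base itself.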

    Take a look at the Applied Biosystems (now Thermo Fisher Scientific) material on 2-base encoding.
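As a rough illustration of how to spot this kind of file before handing it to an aligner, here is a minimal Python sketch. It assumes that a color-space read is a single primer base followed by color calls 0-3, with '.' standing for a missing call; the function name `looks_like_color_space` is made up for this example and is not part of any tool mentioned in the thread.

```python
import re

# Assumption: one primer base (A/C/G/T) followed only by color calls 0-3,
# with '.' used for a missing call. Real files may contain other variants.
COLOR_READ = re.compile(r"^[ACGT][0-3.]+$")


def looks_like_color_space(sequence_line: str) -> bool:
    """Return True if a FASTQ sequence line looks like a SOLiD color-space read."""
    return bool(COLOR_READ.match(sequence_line.strip()))


# The read from the original question versus the base-space read Bill expected.
print(looks_like_color_space("T01000333201121302130101020001110011"))  # True
print(looks_like_color_space("TGGGGCGCTTGTCATTCATTGGTTCCCCACAAACA"))   # False
```

Checking the second line of the first record this way is usually enough to decide whether the aligner needs its color-space mode.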

    • #3
      Dear Bukowski,
      Many thanks for your rapid and helpful reply :-)
      (I must admit I did not follow CSHL_Fu.pdf, but I guess that's not necessary to use
      the data.)

      I am now using bowtie with --color. However, I guess it will take 4 or 5 hours for
      bowtie-build to create a colorspace index for me.

      BTW, is there any reason why bowtie does not convert colorspace reads as it reads them
      and use them with its usual indexes?
      [I guess a more helpful error message would not go amiss either.]

      Alternatively, does anyone have a color-space to FASTA conversion tool?

      Many thanks
      Bill

      • #4
        I haven't done any work with SOLiD data for a couple of years, but the di-base encoding means that a single wrong color call corrupts the color space > base space conversion: because each color encodes the transition between two adjacent bases, every base after the error is decoded wrongly. (There are tools for this conversion, but I never used them.)
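To make the error propagation concrete, here is a minimal Python sketch of the color space > base space conversion. It assumes the standard SOLiD 2-base code (A=0, C=1, G=2, T=3, with each color being the XOR of two adjacent base codes) and a read with no missing calls; `decode_color_read` is a name invented for this example, not a function from any existing tool.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def decode_color_read(read: str) -> str:
    """Decode a color-space read (primer base + digits 0-3) into bases."""
    state = BASES.index(read[0])  # start from the primer base
    decoded = []
    for color in read[1:]:
        state ^= int(color)       # each color is the XOR of adjacent base codes
        decoded.append(BASES[state])
    return "".join(decoded)


read = "T01000333201121302130101020001110011"  # read from the original question
decoded = decode_color_read(read)
print(decoded)  # TGGGGCGCTTGTCATTCATTGGTTCCCCACAAACA

# Simulate a single miscalled color: change the 10th color from 0 to 1.
bad_read = read[:10] + "1" + read[11:]
bad_decoded = decode_color_read(bad_read)
mismatches = sum(a != b for a, b in zip(decoded, bad_decoded))
print(mismatches)  # 26: every base from the 10th onward is wrong
```

The clean decode happens to reproduce exactly the base-space sequence Bill expected in his first post, while one bad color call out of 35 leaves only the first 9 bases intact.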

        There's a good (brief) overview of some of the considerations in the SHRiMP paper:

        Author Summary Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or “dibase”, reads generated by AB SOLiD sequencers.


        Consequently, it is better to do the alignment with a color-space-aware tool, and if I still worked with SOLiD data that is what I would do. However, some tools (such as BWA) have already dropped color-space support.
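One way to see why staying in color space helps is to go the other way and encode the reference as colors, then compare. The Python sketch below assumes the same 2-base code as above (A=0, C=1, G=2, T=3, color = XOR of adjacent base codes); the helper names `encode_colors` and `color_diffs` and the short reference string are illustrative only and do not come from bowtie, SHRiMP, or any other tool.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def encode_colors(sequence: str) -> str:
    """Encode a base-space sequence as colors, one color per adjacent base pair."""
    codes = [BASES.index(base) for base in sequence]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def color_diffs(first: str, second: str) -> list:
    """Return the positions at which two color strings differ."""
    return [i for i, (x, y) in enumerate(zip(first, second)) if x != y]


reference = "TGGGGCGCTTGTCATT"                    # illustrative reference fragment
with_snp = reference[:7] + "A" + reference[8:]    # isolated C>A change at index 7

ref_colors = encode_colors(reference)
snp_colors = encode_colors(with_snp)
print(color_diffs(ref_colors, snp_colors))  # [6, 7]
```

An isolated base change alters the two colors that flank it, whereas a single miscalled color shows up as one lone mismatch, so a color-space-aware aligner can tell the two apart instead of letting one bad call corrupt the rest of the read.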

        • #5
          Dear Bukowski,
          Once again thank you for your very helpful reply.

          It took just under 3 hours for bowtie-build to create a colorspace index
          for the human genome (NCBI 37.5 ASM). It seems to be working well.

          Thanks again
          Bill
