  • Alessandro1976
    Junior Member
    • May 2014
    • 4

    Problem with SOLiD data

    Hi everyone,
    I'm trying to analyze small RNA data from SOLiD using both Lifescope and a different pipeline which uses miRDeep2.
    I converted the XSQ files into FASTQ files with the XSQ tools, then I also ran Lifescope and got the BAM file with the mapped sequences.
    However, when I try to compare the reads from the FASTQ file and the reads extracted from the BAM file, they are completely different.
    For example, the two following entries are from FASTQ and BAM, respectively:

    @Library43:559_548_933/1
    CGGTGCAGGGACGAAATACAGTTAGACATATCTC
    +
    @@@@@@@@@@@@@@@@@6@@@@6@@@@/@@;@/@

    559_548_933 0 chr9 23209018 1 21M14H * 0 0 CAGATCAAGAGGTCCCCGGTT JJJJJJJJJJJJJJJJJJJJJ RG:Z:Library43_11 NH:i:10 CM:i:0 NM:i:0 CQ:Z:@@@@@@@@@@@@@@@@@@6@@@@6@@@@/@@;@/@ CS:Z:T21223210222012000301023302010303131


    The IDs are the same (559_548_933), but the sequence in the FASTQ file (CGGTGCAGGGACGAAATACAGTTAGACATATCTC) is completely different from the one in the BAM file (CAGATCAAGAGGTCCCCGGTT). It's not just a matter of trimming the adaptor sequences; the sequences are different overall.
    Also, when I map the reads extracted from the BAM file with either miRDeep2 or TopHat, a high percentage of them map successfully, but when I try the same thing with the FASTQ file, 0% of the sequences map.

    Does anyone know why there is such a difference between reads with the same ID and what the FASTQ file reads actually are?
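
One possible reading of the two records above, offered as a sketch rather than a confirmed diagnosis: the letters in the FASTQ entry may not be bases at all, but "double-encoded" colour calls, where each colour digit is written as a letter (0 as A, 1 as C, 2 as G, 3 as T) and the first colour after the primer base is dropped. That the XSQ converter was run in such a mode is an assumption; the thread does not say. The names `double_encode` and `drop_first_colour` are made up for illustration, and the input strings are copied from the post.

```python
# Values copied from the FASTQ and BAM records quoted in the post.
FASTQ_SEQ = "CGGTGCAGGGACGAAATACAGTTAGACATATCTC"
CS_TAG = "T21223210222012000301023302010303131"

# Double encoding writes each colour call as a letter: 0->A, 1->C, 2->G, 3->T.
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")


def double_encode(cs_tag: str, drop_first_colour: bool = True) -> str:
    """Rewrite a colour-space string (primer base + digits) as pseudo-bases.

    Hypothetical helper: drop_first_colour mimics converters that discard
    the first colour because it depends on the primer base.
    """
    colours = cs_tag[1:]  # strip the leading primer base
    if drop_first_colour:
        colours = colours[1:]
    return colours.translate(DOUBLE_ENCODE)


pseudo_bases = double_encode(CS_TAG)
print(pseudo_bases)
print(pseudo_bases == FASTQ_SEQ)  # True for the record quoted above
```

If that is what happened, the FASTQ letters are colour calls in disguise, which would fit with a base-space aligner mapping none of them.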
  • ShaunMahony
    Member
    • Apr 2008
    • 27

    #2
    Hi Alessandro1976,

    You can't convert colorspace reads (e.g. from SOLiD) into sequence space with any degree of accuracy. Since each colorspace call is defined relative to the previous base, a single sequencing error is propagated through the rest of the read. It's explained well here:
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
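
To make the error propagation concrete, here is a minimal sketch of a naive colour-space decoder. It assumes the standard SOLiD two-base code, in which (with A=0, C=1, G=2, T=3) a colour is the XOR of the two adjacent bases, and it assumes the read contains no missing calls (`.`). The function name `decode_colourspace` is illustrative, not taken from any tool. The CS tag and BAM sequence are copied from the first post; that record has `CM:i:0`, so the naive decode happens to agree with the BAM sequence. A single miscalled colour is then simulated to show how much of the read it ruins.

```python
BASES = "ACGT"


def decode_colourspace(cs: str) -> str:
    """Naively decode a colour-space read (primer base + digits 0-3) to bases.

    With A=0, C=1, G=2, T=3 each colour is the XOR of two adjacent bases,
    so every base is recovered from the previous base and the current colour.
    """
    prev = BASES.index(cs[0])  # the primer base anchors the decoding
    decoded = []
    for colour in cs[1:]:
        prev ^= int(colour)
        decoded.append(BASES[prev])
    return "".join(decoded)


CS_TAG = "T21223210222012000301023302010303131"
BAM_SEQ = "CAGATCAAGAGGTCCCCGGTT"

clean = decode_colourspace(CS_TAG)
print(clean[:len(BAM_SEQ)] == BAM_SEQ)  # True

# Simulate one miscalled colour at colour index 5 (0-based, after the primer).
pos = 5
wrong_colour = str((int(CS_TAG[1 + pos]) + 1) % 4)
corrupted = CS_TAG[:1 + pos] + wrong_colour + CS_TAG[2 + pos:]

broken = decode_colourspace(corrupted)
mismatches = sum(a != b for a, b in zip(clean, broken))
print(mismatches, "of", len(clean), "bases differ")  # 30 of 35 bases differ
```

Every base from the miscalled colour onward comes out wrong, which is why a direct conversion without a reference to correct against is unreliable.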


    • gringer
      David Eccles (gringer)
      • May 2011
      • 845

      #3
      Programs that map colour-space are likely to also correct the reads when reporting the sequences in the BAM files (I know bowtie does this, and it makes sense for others to do this as well). The indexes that are mapped against must be in colour-space, and there are a few nice error-correction features in colour-space that mean it can be easier to distinguish between sequence changes and instrument error (e.g. a SNP needs two adjacent colour changes).
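
The "two adjacent colour changes" point can be sketched as follows, again assuming the standard two-base code (colour = XOR of adjacent bases with A=0, C=1, G=2, T=3). The helper names `encode_colourspace` and `changed_positions` are illustrative, and the reference sequence is simply the mapped sequence from the first post with an artificial SNP introduced at one position.

```python
BASES = "ACGT"


def encode_colourspace(seq: str, primer: str = "T") -> str:
    """Encode a base-space sequence as primer base + colour digits."""
    prev = BASES.index(primer)
    colours = []
    for base in seq:
        cur = BASES.index(base)
        colours.append(str(prev ^ cur))  # colour = XOR of adjacent bases
        prev = cur
    return primer + "".join(colours)


def changed_positions(a: str, b: str) -> list[int]:
    """Indices at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "CAGATCAAGAGGTCCCCGGTT"
snp_read = reference[:10] + "C" + reference[11:]  # artificial G->C change at index 10

ref_colours = encode_colourspace(reference)
snp_colours = encode_colourspace(snp_read)

print(ref_colours)
print(snp_colours)
print(changed_positions(ref_colours, snp_colours))  # [11, 12]
```

A real single-base change touches the colour leading into the base and the colour leading out of it, so an isolated single colour mismatch against the reference is more likely to be an instrument error than a variant.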

      However, you will always run into issues when trying to interpret or compare the results of a colour-space run (e.g. in a genome browser) because colour-space is a completely different beast to base-space and doesn't make sense to humans -- see the post ShaunMahony linked to for more details. Here's my recommended approach for carrying out such a comparison:
      1. Transfer all the colour-space files onto an external hard disk
      2. Delete all other copies of the colour-space files
      3. Remove the hard drive from the computer
      4. Use a sledgehammer or similar to squash the disk platters closer together
      5. Withdraw $500 from the bank
      6. Place the $500 on top of the hard drive
      7. Return the hard drive (with the money) back to the client
      8. Report to the client that there was insufficient data for a suitable analysis, and recommend that the experiment is repeated using a base-space sequencer


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Agreed.

        Colorspace was a terrible design decision, and the fact that colorspace data persists wastes a lot of people's time and energy. It will always give inferior results in anything other than purely quantitative analysis like ChIP-seq. But, because of SOLiD's high error rate, it will give inferior results there, too.

