
  • why should BAM be shuffled before extracting to FASTQ?

    I came across this apparent bias when recovering paired-read FASTQ data from BAM for remapping (for example, public data obtained from SRA, or when a new reference genome becomes available).

    It seems that if I extract reads from an existing BAM file, the order in which the reads appear in the BAM (and are therefore extracted from it) will affect the later remapping results!

    People with a partial answer told me that the bias affects the computation of the read insert distance during remapping and will 'tend' to reproduce the results obtained during the first alignment.

    This is second-hand information (thanks a lot to Geraldine who shared this in http://gatkforums.broadinstitute.org...o-fastq-format) but is not very satisfactory and I would like to read a more conclusive discussion.

    Thanks to anyone who knows the detailed answer and can make it clear for us all.

    http://www.bits.vib.be/index.php

  • #2
    The explanation given at the link is actually pretty good. What part of it wasn't satisfactory? I should note that if the input BAM file is name-sorted (or even unsorted), then you probably don't need to shuffle things (it'l...). The idea is that the insert sizes of pairs that map to a single region of the genome might not be representative of the entire experiment. If you don't shuffle the reads, then your aligner might estimate the insert size incorrectly, which will slightly bias the alignments downstream (obviously, bias is an issue if you're going to call SNPs).
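To make the idea concrete, here is a toy simulation, not a model of any real aligner. It assumes an aligner that estimates the insert size from the first batch of pairs it reads, and it assumes (purely for illustration) that the mean insert size drifts from region to region. The function names, region sizes, and all numbers are hypothetical.

```python
import random
import statistics


def simulate_pairs(n_regions=10, pairs_per_region=1000, seed=0):
    """Return (position, insert_size) tuples for a toy genome.

    Assumption for illustration only: each region has its own mean
    insert size, drifting from 300 to 480.
    """
    rng = random.Random(seed)
    pairs = []
    for region in range(n_regions):
        region_mean = 300 + 20 * region
        for _ in range(pairs_per_region):
            pos = region * 1_000_000 + rng.randrange(1_000_000)
            pairs.append((pos, rng.gauss(region_mean, 30)))
    return pairs


def first_batch_mean(pairs, batch_size=1000):
    """Mean insert size of the first batch, as a batch-based estimator would see it."""
    return statistics.mean(insert for _, insert in pairs[:batch_size])


pairs = simulate_pairs()
true_mean = statistics.mean(insert for _, insert in pairs)

# Coordinate-sorted input: the first batch comes entirely from the first region.
coordinate_sorted = sorted(pairs)

# Shuffled input: the first batch samples every region.
shuffled = pairs[:]
random.Random(1).shuffle(shuffled)

sorted_estimate = first_batch_mean(coordinate_sorted)
shuffled_estimate = first_batch_mean(shuffled)

print(f"true mean:               {true_mean:.1f}")
print(f"coordinate-sorted batch: {sorted_estimate:.1f}")
print(f"shuffled batch:          {shuffled_estimate:.1f}")
```

In this toy setup the coordinate-sorted batch only sees the first region, so its estimate sits near that region's mean rather than the experiment-wide mean, while the shuffled batch lands close to it. How large the effect is on real data depends on the library and the aligner.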



    • #3
      Thanks for this answer, it already sheds some light.

      No part is unsatisfactory, but the process leading to a wrong estimate is not really explained (or I did not understand it correctly). Is it only the initial step, where BWA collects a sample of reads and measures insert distances to fine-tune the remaining alignments?

      As I understood it, the bias was also present in name-sorted BAM files, which did not make sense to me, since the only things exported to FASTQ are the sequences and qualities.
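Besides sequences and qualities, the one other thing that carries over from BAM to FASTQ is the record order. As a minimal sketch of what shuffling means at that level, the snippet below randomizes the order of read pairs while keeping mates adjacent. The function name `shuffle_pairs` and the `(name, sequence, qualities)` record layout are hypothetical, and everything is held in memory, so this only illustrates the idea and is not suited to real BAM-sized data.

```python
import random


def shuffle_pairs(records, seed=None):
    """Shuffle read pairs while keeping mates next to each other.

    records: iterable of (read_name, sequence, qualities) tuples, where
    mates share the same read_name (hypothetical layout for illustration).
    """
    # Group records by read name so a pair is moved as one unit.
    groups = {}
    for record in records:
        groups.setdefault(record[0], []).append(record)

    # Shuffle the read names, not the individual records.
    names = list(groups)
    random.Random(seed).shuffle(names)

    return [record for name in names for record in groups[name]]


# Tiny made-up input: five pairs, listed in their original order.
records = []
for i in range(1, 6):
    records.append((f"read{i}", "ACGT", "IIII"))  # first mate
    records.append((f"read{i}", "TGCA", "IIII"))  # second mate

shuffled_records = shuffle_pairs(records, seed=42)
for name, seq, qual in shuffled_records:
    print(name, seq, qual)
```

The unit being shuffled is the pair, not the single read, so that paired-end FASTQ files stay in sync after extraction.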

      Considering that public BAM files often report only the nicely mapped reads, I guess that the bias is unavoidable, right?

      S
      Last edited by splaisan; 12-06-2013, 03:32 AM. Reason: need more
      http://www.bits.vib.be/index.php



      • #4
        To a certain extent yes, particularly if the original authors didn't do a good job with things. It's probably a good idea to be cautious if you can't get the original fastq files.
