Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • making SAM: too few good pairs

    Hi Everyone,

    I am new to the community - learning to analyze whole exome data using bwa and samtools. First I will tell you what I have been doing, then point out the red flags, and then my question is at the bottom.

    This is what I have done so far:

    1. I analyzed the fastq files using fastQC and they look good. The reads are from Otogenetics.
    2. I downloaded the most recent b37 folder from the GATK resource bundle.
    3. I indexed the genome using bwa-0.6.2, and aligned 2 illumina fastq files and then merged them into a SAM file.

    Here are the commands I used for steps 2 and 3:
    ./bwa index -a bwtsw human_g1k_v37.fasta
    ./bwa aln human_g1k_v37.fasta Ot6000_R1_001.fastq > Ot6000_R1.sai
    ./bwa aln human_g1k_v37.fasta Ot6000_R2_001.fastq > Ot6000_R2.sai
    ./bwa sampe -r "@RG\tID:6000\tSM:6000\tLB:ga\tPL:Illumina" human_g1k_v37.fasta Ot6000_R1.sai Ot6000_R2.sai Ot6000_R1_001.fastq Ot6000_R2_001.fastq > Ot6000.sam

    4. I also made bam files, sorted the bam files and indexed the bam files.
    5. I ran flagstat on the sorted bam files. Here is the output:

    There are these 3 red flags in the output:
    1. output during merging of 2 sai files into sam files:

    [bwa_sai2sam_pe_core] convert to sequence coordinate...
    [infer_isize] fail to infer insert size: too few good pairs
    [bwa_sai2sam_pe_core] time elapses: 1.00 sec
    [bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
    [bwa_sai2sam_pe_core] align unmapped mate...
    [bwa_sai2sam_pe_core] time elapses: 0.39 sec
    [bwa_sai2sam_pe_core] refine gapped alignments... 0.01 sec
    [bwa_sai2sam_pe_core] print alignments... 0.18 sec
    [bwa_sai2sam_pe_core] 28604456 sequences have been processed.

    2. bai files are empty

    3. output from flagstat:
    57208912 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    0 + 0 mapped (0.00%:nan%)
    57208912 + 0 paired in sequencing
    28604456 + 0 read1
    28604456 + 0 read2
    0 + 0 properly paired (0.00%:nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    Something is going on with the pairing of the reads, based on the output while making the sam file and from the flagstat output.

    I have re-downloaded the GATK resource bundle, re-indexed the genome, tried other fastq files, etc, and I'm getting the same result. I can't find the problem. Any ideas?

    Thanks!
    Allison

  • #2
    Taking the data as face value; nothing mapped. Have you spot-checked a few reads and confirmed that they BLAT to the human genome? Did you at least check to see if on all the indexing files that should be there are there?

    Have you looked at the .sam file? What do the entries look like? They all have the 4 flag? The quality strings look sane? The sequences look sane?

    I don't think pairing is the problem, but if you think it is, have you tried making .sam files based on read 1 alone? What do those look like?

    Comment


    • #3
      hi, thanks for your help.

      I confirmed that the fastq reads BLAT to the human genome. However, looking at the SAM file, the flags are 77 and 141, throughout. I looked up the flags and they mean this:

      "All Bad"
      At the other end of the spectrum we have "all bad" i.e. neither the read nor its mate mapped:
      77 - 0001001101 - First in pair, both reads in pair unmapped. "All bad"
      141 - 0010001101 - Second in pair and "all bad".

      Comment


      • #4
        You could try reindexing the reference. Or maybe a chromosome, to speed up troubleshooting.

        Have you eyeballed the fastq? What about the rest of the .sam entry, besides the flag? Does that look sane? Did you eyeball the reference fasta?

        Something is off, and likely, it's because something is wrong with one of your files

        Comment


        • #5
          Can you post examples of the fastq headers for both files? Maybe the pairs are not correctly recognised as pairs because the header is not, for example, similar, only with /1 and /2 at the end.

          Comment


          • #6
            Hi again,
            I gave the fastq files to a person who helps me and they worked fine for him. Something is going wrong (for me) with either indexing the genome, or aligning the fastq files to the reference genome. The fastq reads aren't aligning to the genome. I have downloaded fresh copies of both the human genome and the bwa software, and I'm still getting the same results. My sai files are 114 MB and they should be 800 MB (that's what the other person got with the same files).

            Comment


            • #7
              hi everyone,
              I guess the newer versions of bwa don't work on a Mac. I downloaded an older one - 0.5.10 - and already things are looking different. Hopefully that solved the problem.
              Thanks!

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Recent Advances in Sequencing Analysis Tools
                by seqadmin


                The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                05-06-2024, 07:48 AM
              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 05-14-2024, 07:03 AM
              0 responses
              15 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-10-2024, 06:35 AM
              0 responses
              37 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-09-2024, 02:46 PM
              0 responses
              46 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 05-07-2024, 06:57 AM
              0 responses
              39 views
              0 likes
              Last Post seqadmin  
              Working...
              X