  • BWA, mostly unmapped reads

    Hi all,
    Hoping someone can spot right away what I did wrong here, since we have a lot of BWA gurus on this forum. This is my first time using BWA. I previously used Novoalign on the same exome-seq data with great success, aligning the majority of the reads, so I was surprised that after running BWA less than 1% of the reads mapped.

    The data in question are single lanes of HiSeq human exome-seq data.

    I indexed the reference genome:
    Code:
    bwa index -a bwtsw human_g1k_v37.fasta
    That created (in the same folder):
    Code:
    human_g1k_v37.fasta.amb
    human_g1k_v37.fasta.ann
    human_g1k_v37.fasta.pac
    human_g1k_v37.fasta.rpac
    (I also indexed for colorspace in the same directory since I have SOLiD data I need to align in a few days.)

    Then I ran BWA as follows:
    Code:
    $bwa aln -t 8 $ref $f1 > $out.aln_sa1.sai
    $bwa aln -t 8 $ref $f2 > $out.aln_sa2.sai
    $bwa sampe -r "$rg" $ref $out.aln_sa1.sai $out.aln_sa2.sai $f1 $f2 > $out.sam
    It ran without errors, but it mapped almost nothing.

    Can anyone see any glaring problems in my commands that would lead to tons of unmapped reads? Any help appreciated!
    Last edited by Michael.James.Clark; 03-02-2011, 12:45 PM.
    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
    Projects: U87MG whole genome sequence [Website] [Paper]
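
A quick way to put a number on "almost nothing mapped" is to count SAM records whose FLAG has the unmapped bit (0x4) set. The following is a minimal sketch, not part of the original post; the function name `sam_mapped_fraction` is hypothetical, and it counts every alignment record in the file rather than distinct reads.

```shell
# Report how many SAM records are mapped vs. unmapped.
# FLAG (column 2) bit 0x4 (value 4) marks a record as unmapped.
sam_mapped_fraction() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        {
            total++
            if (int($2 / 4) % 2 == 1) unmapped++   # test bit 0x4 without gawk extensions
        }
        END {
            mapped = total - unmapped
            pct = (total > 0) ? 100 * mapped / total : 0
            printf "total=%d mapped=%d unmapped=%d mapped_pct=%.2f\n", total, mapped, unmapped, pct
        }
    ' "$1"
}
```

Usage would be along the lines of `sam_mapped_fraction "$out.sam"`, with `$out` as in the commands above.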

  • #2
    Looks like some of the index files are missing. This is an example of what I have in my bwa index directory.

    Code:
    -rw-r--r-- 1 jkeats domainuser 3142044949 Feb  3 16:23 hg18.fasta
    -rw-r--r-- 1 jkeats domainuser       6152 Feb  3 16:23 hg18.fasta.amb
    -rw-r--r-- 1 jkeats domainuser        946 Feb  3 16:23 hg18.fasta.ann
    -rw-r--r-- 1 jkeats domainuser 1155163564 Feb  3 16:23 hg18.fasta.bwt
    -rw-r--r-- 1 jkeats domainuser  770109014 Feb  3 16:23 hg18.fasta.pac
    -rw-r--r-- 1 jkeats domainuser 1155163564 Feb  3 16:23 hg18.fasta.rbwt
    -rw-r--r-- 1 jkeats domainuser  770109014 Feb  3 16:23 hg18.fasta.rpac
    -rw-r--r-- 1 jkeats domainuser  385054532 Feb  3 16:23 hg18.fasta.rsa
    -rw-r--r-- 1 jkeats domainuser  385054532 Feb  3 16:23 hg18.fasta.sa
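
The comparison against the listing above can be scripted. This is a minimal sketch, assuming the index consists of the eight suffixes shown in that listing (other BWA versions may write a different set); the function name `missing_bwa_index_files` is hypothetical.

```shell
# Print every expected BWA index file that is absent or empty.
# The suffix list mirrors the directory listing above; adjust it
# for the BWA version you actually run.
missing_bwa_index_files() {
    prefix=$1
    for ext in amb ann bwt pac rbwt rpac rsa sa; do
        [ -s "$prefix.$ext" ] || printf '%s\n' "$prefix.$ext"
    done
}
```

For example, `missing_bwa_index_files human_g1k_v37.fasta` prints nothing when all eight files are present. Note that this only checks for presence; it cannot tell whether a file was written by a nucleotide or a colorspace indexing run.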



    • #3
      Thanks Jon!

      I think I do have those. Here's my whole (top secret) reference folder:
      Code:
      lrwxrwxrwx 1 mjclark rpm   56 Jan 18 22:02 human_g1k_v37.dict -> ../GATK/human_g1k_v37.dict
      lrwxrwxrwx 1 mjclark rpm   57 Jan 18 22:02 human_g1k_v37.fasta -> ../GATK/human_g1k_v37.fasta
      -rw-r--r-- 1 mjclark rpm 6.5K Feb 28 20:15 human_g1k_v37.fasta.amb
      -rw-r--r-- 1 mjclark rpm 6.7K Feb 28 20:15 human_g1k_v37.fasta.ann
      -rw-r--r-- 1 mjclark rpm 1.1G Feb 28 21:04 human_g1k_v37.fasta.bwt
      lrwxrwxrwx 1 mjclark rpm   61 Jan 18 22:02 human_g1k_v37.fasta.fai -> ../GATK/human_g1k_v37.fasta.fai
      -rw-r--r-- 1 mjclark rpm 6.5K Feb 28 20:14 human_g1k_v37.fasta.nt.amb
      -rw-r--r-- 1 mjclark rpm 6.7K Feb 28 20:14 human_g1k_v37.fasta.nt.ann
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:14 human_g1k_v37.fasta.nt.pac
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:15 human_g1k_v37.fasta.pac
      -rw-r--r-- 1 mjclark rpm 1.1G Feb 28 21:05 human_g1k_v37.fasta.rbwt
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:15 human_g1k_v37.fasta.rpac
      -rw-r--r-- 1 mjclark rpm 370M Feb 28 21:22 human_g1k_v37.fasta.rsa
      -rw-r--r-- 1 mjclark rpm 370M Feb 28 21:13 human_g1k_v37.fasta.sa
      -rwxr--r-- 1 mjclark rpm 6.1G Oct 15 16:35 human_g1k_v37.nix
      Maybe it's that I did the colorspace indexing in the same folder. Unless someone has another idea, I'll try redoing the two indexes separately from one another.


      • #4
        Pretty straightforward answer, but I'll post it anyway in case anyone else encounters this in the future (not that there are that many people out there dealing with both Illumina and SOLiD, but here you go).

        It was indeed the indexes. The first time, I built the normal and colorspace indexes in the same folder, colorspace second, using the default output names. It seems some of the colorspace index files therefore overwrote the normal ones. Of course, colorspace indexes don't work with Illumina data.

        Second time around, I indexed them with different names (in different folders, actually), and now things are aligning beautifully.
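
The fix described above could look roughly like the following. This is a sketch, not the poster's actual commands: it assumes a BWA release from that era whose `bwa index` accepts `-p` (output prefix) and `-c` (build a colorspace index), which later releases may no longer support. The `nt`/`cs` directory names, the `ref` prefix, and the function name are hypothetical. By default it only prints the commands.

```shell
# Build nucleotide and colorspace indexes under separate prefixes so
# neither run can overwrite the other's files.
# Assumptions: "bwa index" supports -p and -c (older BWA releases);
# the nt/ and cs/ subdirectories and the "ref" prefix are made up here.
build_separate_indexes() {
    ref=$1           # reference FASTA
    outdir=$2        # parent directory for both indexes
    run=${3-echo}    # default: print commands; pass "" to execute them
    $run mkdir -p "$outdir/nt" "$outdir/cs"
    $run bwa index -a bwtsw -p "$outdir/nt/ref" "$ref"      # for Illumina reads
    $run bwa index -a bwtsw -c -p "$outdir/cs/ref" "$ref"   # for SOLiD reads
}
```

The matching prefix (for example `idx/nt/ref` for Illumina data) would then be passed as `$ref` to `bwa aln` and `bwa sampe`.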


        • #5
          Thanks for posting the answer on the thread.
