  • BWA, mostly unmapped reads

    Hi all,
    Hoping someone can spot what I did wrong here, since I think we have a lot of BWA gurus on the forum. This is my first time using BWA. Previously I've used Novoalign on the same exome-seq data to great success, aligning the majority of the reads. So I was surprised that after running BWA, less than 1% of the reads mapped.

    The data in question are single lanes of HiSeq human exome-seq data.

    I indexed the reference genome:
    Code:
    bwa index -a bwtsw human_g1k_v37.fasta
    That created (in the same folder):
    Code:
    human_g1k_v37.fasta.amb
    human_g1k_v37.fasta.ann
    human_g1k_v37.fasta.pac
    human_g1k_v37.fasta.rpac
    (I also indexed for colorspace in the same directory since I have SOLiD data I need to align in a few days.)

    Then I ran BWA as follows:
    Code:
    $bwa aln -t 8 $ref $f1 > $out.aln_sa1.sai
    $bwa aln -t 8 $ref $f2 > $out.aln_sa2.sai
    $bwa sampe -r "$rg" $ref $out.aln_sa1.sai $out.aln_sa2.sai $f1 $f2 > $out.sam
    It ran without errors, but it mapped almost nothing.

    Can anyone see any glaring problems in my commands that would lead to tons of unmapped reads? Any help appreciated!
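
For anyone wanting to put a number on "almost nothing mapped", here is a minimal sketch that counts mapped and unmapped records in a SAM file by testing FLAG bit 0x4 (read unmapped). The function name `sam_mapped_summary` is made up for illustration, and it counts records rather than read pairs, so treat the result as a rough check.

```shell
# Summarize mapped vs. unmapped records in a SAM file.
# FLAG is column 2; bit 0x4 set means the read is unmapped.
# Header lines (starting with '@') are skipped.
sam_mapped_summary() {
  awk -F '\t' '
    BEGIN { total = 0; unmapped = 0 }
    /^@/  { next }
    {
      total++
      # int(flag / 4) % 2 isolates bit 0x4 without gawk-only and()
      if (int($2 / 4) % 2 == 1) unmapped++
    }
    END {
      mapped = total - unmapped
      pct = (total > 0) ? 100 * mapped / total : 0
      printf "total=%d mapped=%d unmapped=%d mapped_pct=%.2f\n", total, mapped, unmapped, pct
    }
  ' "$1"
}
```

Usage would be something like `sam_mapped_summary "$out.sam"`. If samtools is installed, `samtools flagstat` reports a similar summary.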
    Last edited by Michael.James.Clark; 03-02-2011, 12:45 PM.
    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
    Projects: U87MG whole genome sequence [Website] [Paper]

  • #2
    Looks like some of the index files are missing. This is an example of what I have in my bwa index directory.

    Code:
    -rw-r--r-- 1 jkeats domainuser 3142044949 Feb  3 16:23 hg18.fasta
    -rw-r--r-- 1 jkeats domainuser       6152 Feb  3 16:23 hg18.fasta.amb
    -rw-r--r-- 1 jkeats domainuser        946 Feb  3 16:23 hg18.fasta.ann
    -rw-r--r-- 1 jkeats domainuser 1155163564 Feb  3 16:23 hg18.fasta.bwt
    -rw-r--r-- 1 jkeats domainuser  770109014 Feb  3 16:23 hg18.fasta.pac
    -rw-r--r-- 1 jkeats domainuser 1155163564 Feb  3 16:23 hg18.fasta.rbwt
    -rw-r--r-- 1 jkeats domainuser  770109014 Feb  3 16:23 hg18.fasta.rpac
    -rw-r--r-- 1 jkeats domainuser  385054532 Feb  3 16:23 hg18.fasta.rsa
    -rw-r--r-- 1 jkeats domainuser  385054532 Feb  3 16:23 hg18.fasta.sa
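
A quick way to compare an index directory against the listing above is to loop over the expected suffixes. This is only a sketch: the function name `missing_index_files` is hypothetical, and the suffix list is copied from the listing above, so it is an assumption that your BWA version writes the same set (newer releases produce fewer files).

```shell
# Print every expected index file that is missing or empty for a prefix.
# Suffixes mirror the directory listing above (older BWA releases);
# adjust the list if your version writes a different set.
missing_index_files() {
  prefix=$1
  status=0
  for ext in amb ann bwt pac rbwt rpac rsa sa; do
    if [ ! -s "$prefix.$ext" ]; then
      echo "missing or empty: $prefix.$ext"
      status=1
    fi
  done
  return $status
}
```

For example, `missing_index_files human_g1k_v37.fasta` prints nothing and returns 0 when all eight files are present and non-empty.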


    • #3
      Thanks Jon!

      I think I do have those. Here's my whole (top secret) reference folder:
      Code:
      lrwxrwxrwx 1 mjclark rpm   56 Jan 18 22:02 human_g1k_v37.dict -> ../GATK/human_g1k_v37.dict
      lrwxrwxrwx 1 mjclark rpm   57 Jan 18 22:02 human_g1k_v37.fasta -> ../GATK/human_g1k_v37.fasta
      -rw-r--r-- 1 mjclark rpm 6.5K Feb 28 20:15 human_g1k_v37.fasta.amb
      -rw-r--r-- 1 mjclark rpm 6.7K Feb 28 20:15 human_g1k_v37.fasta.ann
      -rw-r--r-- 1 mjclark rpm 1.1G Feb 28 21:04 human_g1k_v37.fasta.bwt
      lrwxrwxrwx 1 mjclark rpm   61 Jan 18 22:02 human_g1k_v37.fasta.fai -> ../GATK/human_g1k_v37.fasta.fai
      -rw-r--r-- 1 mjclark rpm 6.5K Feb 28 20:14 human_g1k_v37.fasta.nt.amb
      -rw-r--r-- 1 mjclark rpm 6.7K Feb 28 20:14 human_g1k_v37.fasta.nt.ann
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:14 human_g1k_v37.fasta.nt.pac
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:15 human_g1k_v37.fasta.pac
      -rw-r--r-- 1 mjclark rpm 1.1G Feb 28 21:05 human_g1k_v37.fasta.rbwt
      -rw-r--r-- 1 mjclark rpm 740M Feb 28 20:15 human_g1k_v37.fasta.rpac
      -rw-r--r-- 1 mjclark rpm 370M Feb 28 21:22 human_g1k_v37.fasta.rsa
      -rw-r--r-- 1 mjclark rpm 370M Feb 28 21:13 human_g1k_v37.fasta.sa
      -rwxr--r-- 1 mjclark rpm 6.1G Oct 15 16:35 human_g1k_v37.nix
      Maybe it's that I did the colorspace indexing in the same folder. I'll try redoing the two indexes separately from one another, for lack of a better idea.
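
The `.nt.*` files in the listing above are a hint that a colorspace index was written under the same prefix. A small sketch of that check follows; the function name `has_colorspace_files` is hypothetical, and it assumes the `.nt.*` files are written only by colorspace indexing, which is what the listing suggests for this BWA version.

```shell
# Return 0 if any ".nt.*" file sits next to the given index prefix,
# which suggests a colorspace index was built under the same name.
# Assumption: only colorspace indexing writes these files.
has_colorspace_files() {
  for ext in nt.amb nt.ann nt.pac; do
    [ -e "$1.$ext" ] && return 0
  done
  return 1
}
```

Something like `has_colorspace_files human_g1k_v37.fasta && echo "colorspace index shares this prefix"` would flag the folder shown above.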


      • #4
        Pretty straightforward answer, but I'll post it anyway in case anyone else encounters this in the future (not that there are that many people out there dealing with both Illumina and SOLiD, but here you go).

        It was indeed the indexes. The first time, I built the normal (base-space) and colorspace indexes in the same folder, colorspace second, using the default output names. It seems the colorspace run therefore overwrote some of the base-space index files. Of course the colorspace indexes don't work with Illumina data.

        Second time around, I indexed them with different names (in different folders, actually), and now things are aligning beautifully.
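
The fix described above can be sketched roughly as follows. The directory and prefix names are hypothetical, and the `-p` (output prefix) and `-c` (colorspace) options of `bwa index` are taken from older BWA releases, so check the usage message of your version first. By default the sketch only prints the commands.

```shell
# Sketch: index the same reference twice under different prefixes so the
# colorspace index cannot overwrite the base-space one.
# Directory/prefix names are hypothetical; -p and -c are "bwa index"
# options in older BWA releases (verify against your version).
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

build_indexes() {
  ref=$1
  outdir=$2
  run mkdir -p "$outdir/illumina" "$outdir/solid"
  run bwa index -a bwtsw -p "$outdir/illumina/ref" "$ref"
  run bwa index -a bwtsw -c -p "$outdir/solid/ref_cs" "$ref"
}
```

Calling `build_indexes human_g1k_v37.fasta idx` prints the three commands; setting `DRY_RUN=0` first runs them. The Illumina alignment would then point `bwa aln` and `bwa sampe` at `idx/illumina/ref` instead of the FASTA path.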


        • #5
          Thanks for posting the answer on the thread.
