I've used the Mosaik tools quite successfully in the past, but I've run into an issue that I'd like to ask for help with.
I have _sequence.txt files that were sent to me from another lab for analysis, so I don't have access to the raw data or QC data. The reads are from an enriched library (Nimblegen, I believe; exons from a region of interest).
For a given lane of Solexa data (36 nt, not paired-end), I have 11.6 M reads. I use MosaikBuild to create the dat file, and then I align these reads to an artificial sequence that represents all the exons from the enrichment region. My aligner parameters are:
MosaikAligner -in lane4.dat -ia chip.dat -out lane4.align -hs 15 -mm 3 -p 7 -a all -m all -mhp 100
The resulting output is the conundrum. Why would I be losing almost 60% of the reads to hash failure, and another 30% to filtering? I'm losing about 90% of my reads in this step. I've tried several samples from two different Solexa runs and gotten the same result.
All thoughts and comments are welcome!
Jim
*******************
- Using the following alignment algorithm: all positions
- Using the following alignment mode: aligning reads to all possible locations
- Using a maximum mismatch threshold of 3
- Using a hash size of 15
- Using 7 processors
- Setting hash position threshold to 100
Hashing reference sequence:
100%[==========================================================================================] 621,565.7 ref bases/s in 5 s
- loading reference sequence... finished.
Aligning read library (11573312):
100%[==============================================================================================] 12,524.9 reads/s in 15:24
Alignment statistics:
===================================
# failed hash: 6818036 (58.9 %)
# filtered out: 3537110 (30.6 %)
# unique: 343500 ( 3.0 %)
# non-unique: 874666 ( 7.6 %)
---------------------------------------------
total: 11573312
total aligned: 1218166 (10.5 %)
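As a sanity check on the figures above, here is a minimal Python sketch that recomputes the percentages from the raw counts in the log. The variable and function names are illustrative only and are not part of MOSAIK; the counts are copied from the statistics block.

```python
# Counts copied from the MosaikAligner "Alignment statistics" block above.
counts = {
    "failed hash": 6818036,
    "filtered out": 3537110,
    "unique": 343500,
    "non-unique": 874666,
}

total = sum(counts.values())                           # all reads in the lane
aligned = counts["unique"] + counts["non-unique"]      # reads that were placed
lost = counts["failed hash"] + counts["filtered out"]  # reads that were not


def pct(n, total):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100.0 * n / total, 1)


for name, n in counts.items():
    print(f"{name:>12}: {n:>9} ({pct(n, total)} %)")
print(f"{'aligned':>12}: {aligned:>9} ({pct(aligned, total)} %)")
print(f"{'lost':>12}: {lost:>9} ({pct(lost, total)} %)")
```

The four categories add up to the reported total of 11573312 reads, so the roughly 90% loss is the sum of the hash failures and the filtered reads rather than a reporting discrepancy.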