Some questions regarding blacklisting
Hello Brian,
Sorry for bothering you again, but I have a question regarding blacklisting and its (correct) usage.
As a favor for a collegue, I quickly mapped some genomic PE data (ChIP-seq) to the human genome. The mapping statistics looked good and I handed the data over to him didn't give it a moment's thought anymore.
However he assessed the coverage profiles and now suspects a contamination of his library with DNA fragments originating from certain BACs, which are used in his laboratory. He showed me some example chromosomal regions and indeed, there are are several ~150kb big regions with remarkable increase in coverage overlapping the annotated BAC.
I offered to exclude reads from those BAC areas of the genome and try to identify real ChIP-signal somewhere else in the data.
Thus I restarted the alignment from FASTQ and specified a Multi-FASTA file containing the genomic sequences of the BACs as blacklist. bbmap indeed now generated the specified outb sam file, but not a single read pair went there - for every sample the file was always empty except the header.
To cross-validate, I indexed the same Multi-FASTA file and used bbmap to align the reads with exactly the same parameter settings against it. This are the results from mapping the same sample as above against the BAC collection:
My questions now are:
Hello Brian,
Sorry for bothering you again, but I have a question regarding blacklisting and its (correct) usage.
As a favor for a collegue, I quickly mapped some genomic PE data (ChIP-seq) to the human genome. The mapping statistics looked good and I handed the data over to him didn't give it a moment's thought anymore.
However he assessed the coverage profiles and now suspects a contamination of his library with DNA fragments originating from certain BACs, which are used in his laboratory. He showed me some example chromosomal regions and indeed, there are are several ~150kb big regions with remarkable increase in coverage overlapping the annotated BAC.
I offered to exclude reads from those BAC areas of the genome and try to identify real ChIP-signal somewhere else in the data.
Thus I restarted the alignment from FASTQ and specified a Multi-FASTA file containing the genomic sequences of the BACs as blacklist. bbmap indeed now generated the specified outb sam file, but not a single read pair went there - for every sample the file was always empty except the header.
Code:
Genome: 3 Key Length: 13 Max Indel: 16000 Minimum Score Ratio: 0.56 Mapping Mode: normal Reads Used: 148059914 (14335498222 bases) Mapping: 51822.670 seconds. Reads/sec: 1428.52 kBases/sec: 276.63 Pairing data: pct reads num reads pct bases num bases mated pairs: 98.5211% 72935102 98.8062% 14164366056 bad pairs: 1.3022% 964043 1.2593% 180529886 insert size avg: 425.12 Read 1 data: pct reads num reads pct bases num bases mapped: 99.8531% 73921244 99.8812% 7174307831 unambiguous: 92.3162% 68341677 92.4104% 6637694291 ambiguous: 7.5369% 5579567 7.4708% 536613540 low-Q discards: 0.0000% 0 0.0000% 0 perfect best site: 78.4885% 58104980 78.7959% 5659785119 semiperfect site: 78.5457% 58147328 78.8410% 5663027631 rescued: 0.2712% 200743 Match Rate: NA NA 97.9450% 7152758588 Error Rate: 21.3317% 15768653 2.0538% 149983878 Sub Rate: 19.9830% 14771656 0.2769% 20218648 Del Rate: 1.1258% 832171 1.7599% 128524996 Ins Rate: 0.8954% 661856 0.0170% 1240234 N Rate: 0.0807% 59648 0.0012% 90361 Read 2 data: pct reads num reads pct bases num bases mapped: 99.8967% 73953500 99.9086% 7146116046 unambiguous: 92.1534% 68221148 92.2105% 6595499668 ambiguous: 7.7433% 5732352 7.6981% 550616378 low-Q discards: 0.0000% 13 0.0000% 470 perfect best site: 68.2412% 50518933 68.6293% 4908820958 semiperfect site: 68.3924% 50630855 68.7725% 4919060978 rescued: 0.3138% 232341 Match Rate: NA NA 96.7175% 7114752246 Error Rate: 20.8947% 15452343 3.1485% 231610349 Sub Rate: 19.4650% 14395026 0.2754% 20259503 Del Rate: 1.4361% 1062021 2.8562% 210107455 Ins Rate: 0.8087% 598072 0.0169% 1243391 N Rate: 13.1625% 9734162 0.1340% 9860906 Total time: 53846.780 seconds
Code:
Genome: 1 Key Length: 11 Max Indel: 16000 Minimum Score Ratio: 0.56 Mapping Mode: normal Reads Used: 148059914 (14335498222 bases) Mapping: 17929.866 seconds. Reads/sec: 4128.86 kBases/sec: 799.53 Pairing data: pct reads num reads pct bases num bases mated pairs: 0.2606% 192935 0.2584% 37041616 bad pairs: 0.2608% 193052 0.2578% 36960674 insert size avg: 1540.38 Read 1 data: pct reads num reads pct bases num bases mapped: 2.9434% 2179014 2.8388% 203905833 unambiguous: 1.6845% 1247069 1.6340% 117369776 ambiguous: 1.2589% 931945 1.2048% 86536057 low-Q discards: 0.0000% 0 0.0000% 0 perfect best site: 0.0508% 37607 0.0407% 2926546 semiperfect site: 0.0514% 38023 0.0409% 2939958 rescued: 0.0097% 7213 Match Rate: NA NA 6.7979% 181724888 Error Rate: 98.2727% 2141375 93.2020% 2491518976 Sub Rate: 98.2038% 2139875 0.7663% 20485726 Del Rate: 28.9776% 631426 92.3724% 2469340316 Ins Rate: 20.6261% 449445 0.0633% 1692934 N Rate: 0.0796% 1735 0.0001% 2285 Read 2 data: pct reads num reads pct bases num bases mapped: 4.8730% 3607461 4.6617% 333437379 unambiguous: 2.7381% 2026992 2.6422% 188984411 ambiguous: 2.1349% 1580469 2.0196% 144452968 :
- Shouldn't I have just by random some pairs going to the outb sam file ? (since this are BACs which contain fragments of the human genome, against which I align)
- Does the latter mapping results point into the direction of a BAC contamination at all? I am a bit puzzling, because of the low number of mated pairs.
Comment