Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

kevinrue replied

09-02-2014, 12:41 PM
Hi Brent,

It seems the CIGAR string was due to an old version of bwa (0.7.2). I have installed 0.7.10 today and found the same good CIGAR and alignment positions as you posted above.
(By the way, the reads I posted are the ones I aligned for this pair. I trimmed reads based on quality and adapter sequence priori to alignment).

Now, I am running your tabulate-methylation.py script (using Python 2.7 as per our separate email). I will let you know about the results. Anything you'd like me to check in particular?

Best,
Kevin

Last edited by kevinrue; 09-02-2014, 12:48 PM.
Leave a comment:

brentp replied

09-02-2014, 12:06 PM

here are the alignments that I get with bwa-meth for that read-pair that you extracted:

Code:

HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	97	1	147356549	55	78M1D66M	=	147356709	304	TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG	FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C	NM:i:4	MD:Z:15A62^T18A20A26	AS:i:82	XS:i:27	RG:Z:a_R	YC:Z:CT	YD:Z:f
HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	145	1	147356709	60	144M	=	147356549	-304	TTTATTTTGAGGATGCGTTATACGTTTTTCGGTATTTGGGATTTGTGTGGGAGATGTAGTGTTTGCGATTTTTAAGAGGAAGTTTATTTTTGGGATGGTTTTAGTAGTTTTTTAATAAAGTTTTTTTATTTGGAAAATATAAGT	CDCCDDDDDCC>BDDDCEDCBDDBDBDBDDCDDDBDDDDDDDDDBBBC?DDDCDDDDDDDBDDDDDCDDDDDDEDCDEEEFFFFFFHHHHJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJHHHHHFF	NM:i:2	MD:Z:30A23G89	AS:i:138	XS:i:22	RG:Z:a_R	YC:Z:GA	YD:Z:f

They are mapped to the same vicinity as the alignment you pasted, but there is no odd cigar string. Maybe you mapped trimmed reads?

Leave a comment:

kevinrue replied

09-02-2014, 09:27 AM
Careful about the header of the next read which got extracted in the "next four lines".

Forward mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 1:N:0:ATCACG
TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG
+
FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 1:N:0:ATCACG

Reverse mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_2P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 2:N:0:ATCACG
ACTTATATTTTCCAAATAAAAAAACTTTATTAAAAAACTACTAAAACCATCCCAAAAATAAACTTCCTCTTAAAAATCGCAAACACTACATCTCCCACACAAATCCCAAATACCGAAAAACGTATAACGCATCCTCAAAATAAA
+
FFHHHHHJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJHHHHFFFFFFEEEDCDEDDDDDDCDDDDDBDDDDDDDCDDD?CBBBDDDDDDDDDBDDDCDDBDBDBDDBCDECDDDB>CCDDDDDCCDC
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 2:N:0:ATCACG

Also:
$ samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
12807890
which is = 6,403,945 * 2
hence ~73% of 8,691,655 input filtered/trimmed read pairs.

Thanks for looking into this
Kévin
Leave a comment:
brentp replied

09-02-2014, 08:59 AM
kevinrue, can you send the output of

grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 $FQ

for both ends of your input fastqs?

Also, note that the percentage of output that you are looking at for bwa-meth is high because it will map some reads as secondary (-M flag to bwa mem) and you should also filter by quality score. Without that, samtools flagstat is only a summary overview of what happened. I filter by mapping quality score > 15 (at least ) in downstream analyses. You can get a more realistic mapping number by:

samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
Leave a comment:
kevinrue replied

09-02-2014, 08:11 AM
Hi ,

Felix, I agree: as I wrote earlier, this utopic alignment rate by bwa makes me a bit nervous too. I am also curious to get to the bottom of things here, and really appreciate the help from everyone involved in this conversation.

Brent, dpryan, as I said in the email, I was a few bwa versions late, so I just updated it on our server and I am aligning again to see if that solves the problem. I had a look at their changelog, and didn't spot recent changes regarding the CIGAR string,so I don't have high hopes on this side of things.

As soon as the updated bwa aligner has fnished realigning those reads, I will try both the "bwameth.py tabulate" and the separate "scripts/tabulate-methylation.py" script to see if I can get either working.

Please let me know if you need any additional piece of information from me.
I can also discuss with my suprvisor the possibility to share some data if you'd prefer having a look at certain aspects of the data yourselves.
For the record, we have two groups of 6 samples, paired between infected and control treatments. Since I spotted that low alignment rate with bismark, I have been doing all the testing on the sample which returned the lowestalignment rate in paired-end mode using bismark (~28%).

By the way,speaking of "someone experienced with programming", I didn't want to write me CV in the first post, but a little detail might help you understand my situation at the minute. I am a PhD student/bioinformatician starting 4th year and who stayed close to the lab side (= use tool to derive biological interpretation), rather than the software development side of aligners and statistical frameworks. I know my way around several programming languages - when I know what I want to program. Here I am stuck mainly because I am in that twilight zone where I don't know exactly what I want/need to program to do things right with this type of data.
Methylation has very different issues to the RNA-seq that I am used to!

Cheers,
Kevin
Leave a comment:
brentp replied

09-02-2014, 07:24 AM
as far as I can guess, dpryan is correct that what you're seeing is more likely a bwa bug than a bwa-meth bug since I don't mess with the cigar string.
As I told you, you can simply use the scripts/tabulate-methylation.py script rather than relying on BisSNP. But, yes, it'd be good to know what is going on.
I am downloading chromosome 1 of bos taurus to see.
Leave a comment:
fkrueger replied

09-02-2014, 07:24 AM
Originally posted by dpryan View Post

Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.

I can only second this, and let's not kid ourselves, you can't possibly achieve a 99.56% mapping rate while including only sensible reads.... (not even simulated data from well known parts of the genome would come close, and your library likely contains a lot of repeats and other unmappable things that are not in the assembly, in addition to possible crossovers and what-ever-else-you are-going-to-dig-up)
Leave a comment:
dpryan replied

09-02-2014, 06:51 AM
I don't think the SAM spec explicitly says it, but it's non-sensical to start or end an alignment with a deletion. By definition, there'd be no bases in the alignment actually supporting said deletion. The CIGAR for that should be 78M1D66M, for what it's worth. My guess is that this is actually a BWA bug, rather than one in bwa-meth itself.

BTW, if it's local alignment that's saving the day on this one, then there are other bowtie2-based aligners that support it (Bison and BS-Seeker2). You generally need to use the --very-sensitive-local preset with bowtie2 to get it to align as well as BWA (at least that was the case when Brent and I compared bwa-meth against Bison). Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.
Leave a comment:
kevinrue replied

09-02-2014, 06:39 AM
Hi Felix,

Thanks, I see what you mean about making the most of your data when more sequencing is not an option. I am not sure how much more sequencing we can get out of our samples either, as those were used for RNA-seq as well. I will discuss the different options with colleagues.

In the meantime, I think I will simultaneously try:
- to complete the "Dirty Harry" solution to salvage as much information using bismark, and see what I can get from downstream analyses this way.
- to see if we can work around the issue described above with bwa-meth to see if the higher alignment rate (for my particular dataset of long reads) can be used to obtain higher information content. (this utopic alignment rate makes me nervous about false positives).

Here is the strange read (thanks for the command, I am always concerned about reinventing the wheel when things like samtools exist!):
HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 99 1 147356530 60 19D78M1D66M = 147356709 323 TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C NM:i:23 AS:i:82 XS:i:55 RG:Z:C1_ATCACG_P YC:Z:CT YD:Z:f

Still reading the SAM format description to spot where the problem is, if any..

Thanks
Kevin
Leave a comment:
dpryan replied

09-02-2014, 06:36 AM
What Felix said regarding finding the read.

I don't know if there's anything prebuilt that will extract soft-clipped sequences. It'd be an easy* enough thing to write though.

Anything aligning to the end should get hard clipped off, though I recall there being an issue with at least earlier versions of bwa concatenating contigs and then allowing alignments to flow of the end of one onto the next. Having said that, even that shouldn't cause an alignment to start with a deletion.

*Only for someone experienced with programming.
Leave a comment:
fkrueger replied

09-02-2014, 06:13 AM
very quick and dirty would be something like:
samtools view mappedfile.bam | grep 19D78M1D66M > where_is_this_read.sam
Leave a comment:
kevinrue replied

09-02-2014, 06:08 AM
Hi Ryan,

I am in touch with Brent. Just waiting for his answer to that surprising issue.

I don't know if I could find and visualise that particular read/pair somehow (bash command looking for that CIGAR string?)

So far I could only imagine a read aligning to the start of a chromosome/contig maybe? In which case, shouldn't the start of the read be clipped off?
Or is this an expected bahavior of tools supporting local alignment such as bwa-meth? If it was soft-clipped (= kept in the sequence, but not aligned, right?), then I suppose the CIGAR string would need to describe this part of the sequence as deleted from the 5', right?

How would you suggest I "extract some of the longer soft-clipped sequences"? Do you define those as the ones which may start/end with a deletion? Is there a samtools command for that, or will that be a bash script? Then, I suppose you suggest I create a subset bam file containing those sequences and look at their locations?

Sorry for the many questions. I never had to dig into SAM/BAM files so far :/

Thanks!
K.
Leave a comment:
fkrueger replied

09-02-2014, 06:00 AM
Kevin,

We don't tend to play around with the mapping parameters very much normally (what you called over-fitting), but tend to run PE alignments first (with --unmapped) and then re-run the left-over single-end alignments. Having said that, we only do this if we've got very precious samples with limited availability and we really want to squeeze as much out of the data as possible (e.g. oocytes that have been collected by mouth-pipetting for half a year...). I guess if you've got enough material it would be a bit 'cleaner' to just run paired-end mode to make use of the --no_overlap feature. Cheers, Felix
Leave a comment:
dpryan replied

09-02-2014, 05:51 AM
You might send brentp a message and ask him about that error. It's really really strange for a read to start with a deletion (what would that even mean?). I don't recall bwa-meth messing with the CIGAR string, but it's been a while since I looked so my memory might be failing me.

The increased alignment rate by bwa-meth is presumably due to heavily soft-clipping things. It's be interesting to extract some of the longer soft-clipped sequences and try to see where they go. That might provide some insight into what's going on here.
Leave a comment:
kevinrue replied

09-02-2014, 05:43 AM
Hi again Felix,

(I am wondering whether to continue the conversation I started in this thread, or move to the better suited "paired end alignment rate issue" that you linked earlier.)

bismark
As promised, I have completed running the tests and here are the statistics of unique, multiple, and unaligned reads/pairs:

Can I do anything else to help you help me explain what is going on with long paired-end reads? Given that I am seeking biological inisight between infected and control samples, I'd rather avoid over-fitting parameters which give me my favorite answer, and instead follow your advice on how to correctly use your tool.

bwa-meth
I also report the alignment rate of bwa-meth in the screenshot above, which seem much better (although the numbers don't add up to the same number of input read pairs?!), but you can read below how I am stuck with that tool as well. Here are the detailed stats for bwa-meth:
$samtools flagstat bwa-meth.bam
17570633 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
17493812 + 0 mapped (99.56%:-nan%)
17570633 + 0 paired in sequencing
8787708 + 0 read1
8782925 + 0 read2
16950242 + 0 properly paired (96.47%:-nan%)
17470980 + 0 with itself and mate mapped
22832 + 0 singletons (0.13%:-nan%)
209437 + 0 with mate mapped to a different chr
66124 + 0 with mate mapped to a different chr (mapQ>=5)

However, I am blocked one step further with that one. The "tabulate" step using Bis-SNP to call methylation for CpG's crashes due to a read in the following configuration. I know I am going out of topic here, but here is the error message (I will contact the developer there to see about that):
ERROR MESSAGE: SAM/BAM file SAMFileReader{/workspace/scratch/krue/Methylation/2014-08-27_analysis/bismark/bwa_meth/bwa-meth.bam} is malformed: Read starting with deletion. Cigar: 19D78M1D66M

Any advice welcome. Really feeling lost about what to do now.
Thanks very much
Leave a comment:

Previous 1 7 14 15 16 17 18 19 20 27 34 template Next

Latest Developments in Precision Medicine

by seqadmin

Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

Somatic Genomics
“We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
- Channel: Articles
05-24-2024, 01:16 PM

Topics	Statistics	Last Post
SIX2 Protein Identified as a Key Player in Prostate Cancer Treatment Resistance by seqadmin Started by seqadmin, 06-03-2024, 06:55 AM	0 responses 12 views 0 likes	Last Post by seqadmin 06-03-2024, 06:55 AM
Genetic Mosaicism More Prevalent Than Previously Thought by seqadmin Started by seqadmin, 05-30-2024, 03:16 PM	0 responses 26 views 0 likes	Last Post by seqadmin 05-30-2024, 03:16 PM
Comprehensive Sequencing of Great Ape Sex Chromosomes Yields Insights into Evolution and Genetic Variability by seqadmin Started by seqadmin, 05-29-2024, 01:32 PM	0 responses 29 views 0 likes	Last Post by seqadmin 05-29-2024, 01:32 PM
New Toolkit Enhances Plant Mitochondrial Genome Research by seqadmin Started by seqadmin, 05-24-2024, 07:15 AM	0 responses 216 views 0 likes	Last Post by seqadmin 05-24-2024, 07:15 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News