Hi Brent,
It seems the CIGAR string was due to an old version of bwa (0.7.2). I have installed 0.7.10 today and found the same good CIGAR and alignment positions as you posted above.
(By the way, the reads I posted are the ones I aligned for this pair. I trimmed reads based on quality and adapter sequence priori to alignment).
Now, I am running your tabulate-methylation.py script (using Python 2.7 as per our separate email). I will let you know about the results. Anything you'd like me to check in particular?
Best,
Kevin
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
here are the alignments that I get with bwa-meth for that read-pair that you extracted:
Code:HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 97 1 147356549 55 78M1D66M = 147356709 304 TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C NM:i:4 MD:Z:15A62^T18A20A26 AS:i:82 XS:i:27 RG:Z:a_R YC:Z:CT YD:Z:f HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 145 1 147356709 60 144M = 147356549 -304 TTTATTTTGAGGATGCGTTATACGTTTTTCGGTATTTGGGATTTGTGTGGGAGATGTAGTGTTTGCGATTTTTAAGAGGAAGTTTATTTTTGGGATGGTTTTAGTAGTTTTTTAATAAAGTTTTTTTATTTGGAAAATATAAGT CDCCDDDDDCC>BDDDCEDCBDDBDBDBDDCDDDBDDDDDDDDDBBBC?DDDCDDDDDDDBDDDDDCDDDDDDEDCDEEEFFFFFFHHHHJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJHHHHHFF NM:i:2 MD:Z:30A23G89 AS:i:138 XS:i:22 RG:Z:a_R YC:Z:GA YD:Z:f
Leave a comment:
-
Careful about the header of the next read which got extracted in the "next four lines".
Forward mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 1:N:0:ATCACG
TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG
+
FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 1:N:0:ATCACG
Reverse mate
$grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_2P.fastq
@HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 2:N:0:ATCACG
ACTTATATTTTCCAAATAAAAAAACTTTATTAAAAAACTACTAAAACCATCCCAAAAATAAACTTCCTCTTAAAAATCGCAAACACTACATCTCCCACACAAATCCCAAATACCGAAAAACGTATAACGCATCCTCAAAATAAA
+
FFHHHHHJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJHHHHFFFFFFEEEDCDEDDDDDDCDDDDDBDDDDDDDCDDD?CBBBDDDDDDDDDBDDDCDDBDBDBDDBCDECDDDB>CCDDDDDCCDC
@HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 2:N:0:ATCACG
Also:
$ samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
12807890
which is = 6,403,945 * 2
hence ~73% of 8,691,655 input filtered/trimmed read pairs.
Thanks for looking into this
Kévin
Leave a comment:
-
kevinrue, can you send the output of
grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 $FQ
for both ends of your input fastqs?
Also, note that the percentage of output that you are looking at for bwa-meth is high because it will map some reads as secondary (-M flag to bwa mem) and you should also filter by quality score. Without that, samtools flagstat is only a summary overview of what happened. I filter by mapping quality score > 15 (at least ) in downstream analyses. You can get a more realistic mapping number by:
samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
Leave a comment:
-
Hi ,
Felix, I agree: as I wrote earlier, this utopic alignment rate by bwa makes me a bit nervous too. I am also curious to get to the bottom of things here, and really appreciate the help from everyone involved in this conversation.
Brent, dpryan, as I said in the email, I was a few bwa versions late, so I just updated it on our server and I am aligning again to see if that solves the problem. I had a look at their changelog, and didn't spot recent changes regarding the CIGAR string,so I don't have high hopes on this side of things.
As soon as the updated bwa aligner has fnished realigning those reads, I will try both the "bwameth.py tabulate" and the separate "scripts/tabulate-methylation.py" script to see if I can get either working.
Please let me know if you need any additional piece of information from me.
I can also discuss with my suprvisor the possibility to share some data if you'd prefer having a look at certain aspects of the data yourselves.
For the record, we have two groups of 6 samples, paired between infected and control treatments. Since I spotted that low alignment rate with bismark, I have been doing all the testing on the sample which returned the lowestalignment rate in paired-end mode using bismark (~28%).
By the way,speaking of "someone experienced with programming", I didn't want to write me CV in the first post, but a little detail might help you understand my situation at the minute. I am a PhD student/bioinformatician starting 4th year and who stayed close to the lab side (= use tool to derive biological interpretation), rather than the software development side of aligners and statistical frameworks. I know my way around several programming languages - when I know what I want to program. Here I am stuck mainly because I am in that twilight zone where I don't know exactly what I want/need to program to do things right with this type of data.
Methylation has very different issues to the RNA-seq that I am used to!
Cheers,
Kevin
Leave a comment:
-
as far as I can guess, dpryan is correct that what you're seeing is more likely a bwa bug than a bwa-meth bug since I don't mess with the cigar string.
As I told you, you can simply use the scripts/tabulate-methylation.py script rather than relying on BisSNP. But, yes, it'd be good to know what is going on.
I am downloading chromosome 1 of bos taurus to see.
Leave a comment:
-
Originally posted by dpryan View PostHaving said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.
Leave a comment:
-
I don't think the SAM spec explicitly says it, but it's non-sensical to start or end an alignment with a deletion. By definition, there'd be no bases in the alignment actually supporting said deletion. The CIGAR for that should be 78M1D66M, for what it's worth. My guess is that this is actually a BWA bug, rather than one in bwa-meth itself.
BTW, if it's local alignment that's saving the day on this one, then there are other bowtie2-based aligners that support it (Bison and BS-Seeker2). You generally need to use the --very-sensitive-local preset with bowtie2 to get it to align as well as BWA (at least that was the case when Brent and I compared bwa-meth against Bison). Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.
Leave a comment:
-
Hi Felix,
Thanks, I see what you mean about making the most of your data when more sequencing is not an option. I am not sure how much more sequencing we can get out of our samples either, as those were used for RNA-seq as well. I will discuss the different options with colleagues.
In the meantime, I think I will simultaneously try:
- to complete the "Dirty Harry" solution to salvage as much information using bismark, and see what I can get from downstream analyses this way.
- to see if we can work around the issue described above with bwa-meth to see if the higher alignment rate (for my particular dataset of long reads) can be used to obtain higher information content. (this utopic alignment rate makes me nervous about false positives).
Here is the strange read (thanks for the command, I am always concerned about reinventing the wheel when things like samtools exist!):
HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 99 1 147356530 60 19D78M1D66M = 147356709 323 TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C NM:i:23 AS:i:82 XS:i:55 RG:Z:C1_ATCACG_P YC:Z:CT YD:Z:f
Still reading the SAM format description to spot where the problem is, if any..
Thanks
Kevin
Leave a comment:
-
What Felix said regarding finding the read.
I don't know if there's anything prebuilt that will extract soft-clipped sequences. It'd be an easy* enough thing to write though.
Anything aligning to the end should get hard clipped off, though I recall there being an issue with at least earlier versions of bwa concatenating contigs and then allowing alignments to flow of the end of one onto the next. Having said that, even that shouldn't cause an alignment to start with a deletion.
*Only for someone experienced with programming.
Leave a comment:
-
very quick and dirty would be something like:
samtools view mappedfile.bam | grep 19D78M1D66M > where_is_this_read.sam
Leave a comment:
-
Hi Ryan,
I am in touch with Brent. Just waiting for his answer to that surprising issue.
I don't know if I could find and visualise that particular read/pair somehow (bash command looking for that CIGAR string?)
So far I could only imagine a read aligning to the start of a chromosome/contig maybe? In which case, shouldn't the start of the read be clipped off?
Or is this an expected bahavior of tools supporting local alignment such as bwa-meth? If it was soft-clipped (= kept in the sequence, but not aligned, right?), then I suppose the CIGAR string would need to describe this part of the sequence as deleted from the 5', right?
How would you suggest I "extract some of the longer soft-clipped sequences"? Do you define those as the ones which may start/end with a deletion? Is there a samtools command for that, or will that be a bash script? Then, I suppose you suggest I create a subset bam file containing those sequences and look at their locations?
Sorry for the many questions. I never had to dig into SAM/BAM files so far :/
Thanks!
K.
Leave a comment:
-
Kevin,
We don't tend to play around with the mapping parameters very much normally (what you called over-fitting), but tend to run PE alignments first (with --unmapped) and then re-run the left-over single-end alignments. Having said that, we only do this if we've got very precious samples with limited availability and we really want to squeeze as much out of the data as possible (e.g. oocytes that have been collected by mouth-pipetting for half a year...). I guess if you've got enough material it would be a bit 'cleaner' to just run paired-end mode to make use of the --no_overlap feature. Cheers, Felix
Leave a comment:
-
You might send brentp a message and ask him about that error. It's really really strange for a read to start with a deletion (what would that even mean?). I don't recall bwa-meth messing with the CIGAR string, but it's been a while since I looked so my memory might be failing me.
The increased alignment rate by bwa-meth is presumably due to heavily soft-clipping things. It's be interesting to extract some of the longer soft-clipped sequences and try to see where they go. That might provide some insight into what's going on here.
Leave a comment:
-
Hi again Felix,
(I am wondering whether to continue the conversation I started in this thread, or move to the better suited "paired end alignment rate issue" that you linked earlier.)
bismark
As promised, I have completed running the tests and here are the statistics of unique, multiple, and unaligned reads/pairs:
Can I do anything else to help you help me explain what is going on with long paired-end reads? Given that I am seeking biological inisight between infected and control samples, I'd rather avoid over-fitting parameters which give me my favorite answer, and instead follow your advice on how to correctly use your tool.
bwa-meth
I also report the alignment rate of bwa-meth in the screenshot above, which seem much better (although the numbers don't add up to the same number of input read pairs?!), but you can read below how I am stuck with that tool as well. Here are the detailed stats for bwa-meth:
$samtools flagstat bwa-meth.bam
17570633 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
17493812 + 0 mapped (99.56%:-nan%)
17570633 + 0 paired in sequencing
8787708 + 0 read1
8782925 + 0 read2
16950242 + 0 properly paired (96.47%:-nan%)
17470980 + 0 with itself and mate mapped
22832 + 0 singletons (0.13%:-nan%)
209437 + 0 with mate mapped to a different chr
66124 + 0 with mate mapped to a different chr (mapQ>=5)
However, I am blocked one step further with that one. The "tabulate" step using Bis-SNP to call methylation for CpG's crashes due to a read in the following configuration. I know I am going out of topic here, but here is the error message (I will contact the developer there to see about that):
ERROR MESSAGE: SAM/BAM file SAMFileReader{/workspace/scratch/krue/Methylation/2014-08-27_analysis/bismark/bwa_meth/bwa-meth.bam} is malformed: Read starting with deletion. Cigar: 19D78M1D66M
Any advice welcome. Really feeling lost about what to do now.
Thanks very much
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.
Somatic Genomics
“We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...-
Channel: Articles
05-24-2024, 01:16 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 06-03-2024, 06:55 AM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
06-03-2024, 06:55 AM
|
||
Started by seqadmin, 05-30-2024, 03:16 PM
|
0 responses
26 views
0 likes
|
Last Post
by seqadmin
05-30-2024, 03:16 PM
|
||
Comprehensive Sequencing of Great Ape Sex Chromosomes Yields Insights into Evolution and Genetic Variability
by seqadmin
Started by seqadmin, 05-29-2024, 01:32 PM
|
0 responses
29 views
0 likes
|
Last Post
by seqadmin
05-29-2024, 01:32 PM
|
||
Started by seqadmin, 05-24-2024, 07:15 AM
|
0 responses
216 views
0 likes
|
Last Post
by seqadmin
05-24-2024, 07:15 AM
|
Leave a comment: