Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • kevinrue
    replied
    Hi Brent,

    It seems the CIGAR string was due to an old version of bwa (0.7.2). I have installed 0.7.10 today and found the same good CIGAR and alignment positions as you posted above.
    (By the way, the reads I posted are the ones I aligned for this pair. I trimmed reads based on quality and adapter sequence priori to alignment).

    Now, I am running your tabulate-methylation.py script (using Python 2.7 as per our separate email). I will let you know about the results. Anything you'd like me to check in particular?

    Best,
    Kevin
    Last edited by kevinrue; 09-02-2014, 12:48 PM.

    Leave a comment:


  • brentp
    replied
    here are the alignments that I get with bwa-meth for that read-pair that you extracted:

    Code:
    HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	97	1	147356549	55	78M1D66M	=	147356709	304	TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG	FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C	NM:i:4	MD:Z:15A62^T18A20A26	AS:i:82	XS:i:27	RG:Z:a_R	YC:Z:CT	YD:Z:f
    HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545	145	1	147356709	60	144M	=	147356549	-304	TTTATTTTGAGGATGCGTTATACGTTTTTCGGTATTTGGGATTTGTGTGGGAGATGTAGTGTTTGCGATTTTTAAGAGGAAGTTTATTTTTGGGATGGTTTTAGTAGTTTTTTAATAAAGTTTTTTTATTTGGAAAATATAAGT	CDCCDDDDDCC>BDDDCEDCBDDBDBDBDDCDDDBDDDDDDDDDBBBC?DDDCDDDDDDDBDDDDDCDDDDDDEDCDEEEFFFFFFHHHHJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJHHHHHFF	NM:i:2	MD:Z:30A23G89	AS:i:138	XS:i:22	RG:Z:a_R	YC:Z:GA	YD:Z:f
    They are mapped to the same vicinity as the alignment you pasted, but there is no odd cigar string. Maybe you mapped trimmed reads?

    Leave a comment:


  • kevinrue
    replied
    Careful about the header of the next read which got extracted in the "next four lines".

    Forward mate
    $grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq
    @HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 1:N:0:ATCACG
    TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG
    +
    FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C
    @HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 1:N:0:ATCACG


    Reverse mate
    $grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_2P.fastq
    @HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 2:N:0:ATCACG
    ACTTATATTTTCCAAATAAAAAAACTTTATTAAAAAACTACTAAAACCATCCCAAAAATAAACTTCCTCTTAAAAATCGCAAACACTACATCTCCCACACAAATCCCAAATACCGAAAAACGTATAACGCATCCTCAAAATAAA
    +
    FFHHHHHJJJJJJJJJJJJJIJJJJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJHHHHFFFFFFEEEDCDEDDDDDDCDDDDDBDDDDDDDCDDD?CBBBDDDDDDDDDBDDDCDDBDBDBDDBCDECDDDB>CCDDDDDCCDC
    @HWI-D00238R:175:H8RA2ADXX:1:1216:1653:36566 2:N:0:ATCACG



    Also:
    $ samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l
    12807890

    which is = 6,403,945 * 2
    hence ~73% of 8,691,655 input filtered/trimmed read pairs.

    Thanks for looking into this
    Kévin

    Leave a comment:


  • brentp
    replied
    kevinrue, can you send the output of

    grep -A4 HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 $FQ

    for both ends of your input fastqs?

    Also, note that the percentage of output that you are looking at for bwa-meth is high because it will map some reads as secondary (-M flag to bwa mem) and you should also filter by quality score. Without that, samtools flagstat is only a summary overview of what happened. I filter by mapping quality score > 15 (at least ) in downstream analyses. You can get a more realistic mapping number by:


    samtools view -f2 -F0x200 $BAM | awk '$5 > 15' | wc -l

    Leave a comment:


  • kevinrue
    replied
    Hi ,

    Felix, I agree: as I wrote earlier, this utopic alignment rate by bwa makes me a bit nervous too. I am also curious to get to the bottom of things here, and really appreciate the help from everyone involved in this conversation.

    Brent, dpryan, as I said in the email, I was a few bwa versions late, so I just updated it on our server and I am aligning again to see if that solves the problem. I had a look at their changelog, and didn't spot recent changes regarding the CIGAR string,so I don't have high hopes on this side of things.

    As soon as the updated bwa aligner has fnished realigning those reads, I will try both the "bwameth.py tabulate" and the separate "scripts/tabulate-methylation.py" script to see if I can get either working.

    Please let me know if you need any additional piece of information from me.
    I can also discuss with my suprvisor the possibility to share some data if you'd prefer having a look at certain aspects of the data yourselves.
    For the record, we have two groups of 6 samples, paired between infected and control treatments. Since I spotted that low alignment rate with bismark, I have been doing all the testing on the sample which returned the lowestalignment rate in paired-end mode using bismark (~28%).


    By the way,speaking of "someone experienced with programming", I didn't want to write me CV in the first post, but a little detail might help you understand my situation at the minute. I am a PhD student/bioinformatician starting 4th year and who stayed close to the lab side (= use tool to derive biological interpretation), rather than the software development side of aligners and statistical frameworks. I know my way around several programming languages - when I know what I want to program. Here I am stuck mainly because I am in that twilight zone where I don't know exactly what I want/need to program to do things right with this type of data.
    Methylation has very different issues to the RNA-seq that I am used to!

    Cheers,
    Kevin

    Leave a comment:


  • brentp
    replied
    as far as I can guess, dpryan is correct that what you're seeing is more likely a bwa bug than a bwa-meth bug since I don't mess with the cigar string.
    As I told you, you can simply use the scripts/tabulate-methylation.py script rather than relying on BisSNP. But, yes, it'd be good to know what is going on.
    I am downloading chromosome 1 of bos taurus to see.

    Leave a comment:


  • fkrueger
    replied
    Originally posted by dpryan View Post
    Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.
    I can only second this, and let's not kid ourselves, you can't possibly achieve a 99.56% mapping rate while including only sensible reads.... (not even simulated data from well known parts of the genome would come close, and your library likely contains a lot of repeats and other unmappable things that are not in the assembly, in addition to possible crossovers and what-ever-else-you are-going-to-dig-up)

    Leave a comment:


  • dpryan
    replied
    I don't think the SAM spec explicitly says it, but it's non-sensical to start or end an alignment with a deletion. By definition, there'd be no bases in the alignment actually supporting said deletion. The CIGAR for that should be 78M1D66M, for what it's worth. My guess is that this is actually a BWA bug, rather than one in bwa-meth itself.

    BTW, if it's local alignment that's saving the day on this one, then there are other bowtie2-based aligners that support it (Bison and BS-Seeker2). You generally need to use the --very-sensitive-local preset with bowtie2 to get it to align as well as BWA (at least that was the case when Brent and I compared bwa-meth against Bison). Having said that, if this were my data I'd prefer to get to the bottom of this before resorting to local alignment.

    Leave a comment:


  • kevinrue
    replied
    Hi Felix,

    Thanks, I see what you mean about making the most of your data when more sequencing is not an option. I am not sure how much more sequencing we can get out of our samples either, as those were used for RNA-seq as well. I will discuss the different options with colleagues.

    In the meantime, I think I will simultaneously try:
    - to complete the "Dirty Harry" solution to salvage as much information using bismark, and see what I can get from downstream analyses this way.
    - to see if we can work around the issue described above with bwa-meth to see if the higher alignment rate (for my particular dataset of long reads) can be used to obtain higher information content. (this utopic alignment rate makes me nervous about false positives).

    Here is the strange read (thanks for the command, I am always concerned about reinventing the wheel when things like samtools exist!):
    HWI-D00238R:175:H8RA2ADXX:1:1216:1567:36545 99 1 147356530 60 19D78M1D66M = 147356709 323 TATGTATATATTTATGCGTATATTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATATTTTTTATGTATATGTATGTTTTTTATGTATATGTATATGTACGTGTTATGGTAGTAGGTTAGGAG FFHHHHHJJJJJJIJJJJIJJJJJJJJIJJJJIJJIGIJIJJJJJJIJJIIJJJJIJJJJJJJJJJJJIIJIJJIIGFHHHHHDCEDEEEEEEDECDBDEDDDDDDDEEDEECCDDEDEDED?CAACDDDCDDDEDDCCCDC>C NM:i:23 AS:i:82 XS:i:55 RG:Z:C1_ATCACG_P YC:Z:CT YD:Z:f

    Still reading the SAM format description to spot where the problem is, if any..

    Thanks
    Kevin

    Leave a comment:


  • dpryan
    replied
    What Felix said regarding finding the read.

    I don't know if there's anything prebuilt that will extract soft-clipped sequences. It'd be an easy* enough thing to write though.

    Anything aligning to the end should get hard clipped off, though I recall there being an issue with at least earlier versions of bwa concatenating contigs and then allowing alignments to flow of the end of one onto the next. Having said that, even that shouldn't cause an alignment to start with a deletion.

    *Only for someone experienced with programming.

    Leave a comment:


  • fkrueger
    replied
    very quick and dirty would be something like:
    samtools view mappedfile.bam | grep 19D78M1D66M > where_is_this_read.sam

    Leave a comment:


  • kevinrue
    replied
    Hi Ryan,

    I am in touch with Brent. Just waiting for his answer to that surprising issue.

    I don't know if I could find and visualise that particular read/pair somehow (bash command looking for that CIGAR string?)

    So far I could only imagine a read aligning to the start of a chromosome/contig maybe? In which case, shouldn't the start of the read be clipped off?
    Or is this an expected bahavior of tools supporting local alignment such as bwa-meth? If it was soft-clipped (= kept in the sequence, but not aligned, right?), then I suppose the CIGAR string would need to describe this part of the sequence as deleted from the 5', right?

    How would you suggest I "extract some of the longer soft-clipped sequences"? Do you define those as the ones which may start/end with a deletion? Is there a samtools command for that, or will that be a bash script? Then, I suppose you suggest I create a subset bam file containing those sequences and look at their locations?

    Sorry for the many questions. I never had to dig into SAM/BAM files so far :/

    Thanks!
    K.

    Leave a comment:


  • fkrueger
    replied
    Kevin,

    We don't tend to play around with the mapping parameters very much normally (what you called over-fitting), but tend to run PE alignments first (with --unmapped) and then re-run the left-over single-end alignments. Having said that, we only do this if we've got very precious samples with limited availability and we really want to squeeze as much out of the data as possible (e.g. oocytes that have been collected by mouth-pipetting for half a year...). I guess if you've got enough material it would be a bit 'cleaner' to just run paired-end mode to make use of the --no_overlap feature. Cheers, Felix

    Leave a comment:


  • dpryan
    replied
    You might send brentp a message and ask him about that error. It's really really strange for a read to start with a deletion (what would that even mean?). I don't recall bwa-meth messing with the CIGAR string, but it's been a while since I looked so my memory might be failing me.

    The increased alignment rate by bwa-meth is presumably due to heavily soft-clipping things. It's be interesting to extract some of the longer soft-clipped sequences and try to see where they go. That might provide some insight into what's going on here.

    Leave a comment:


  • kevinrue
    replied
    Hi again Felix,

    (I am wondering whether to continue the conversation I started in this thread, or move to the better suited "paired end alignment rate issue" that you linked earlier.)

    bismark
    As promised, I have completed running the tests and here are the statistics of unique, multiple, and unaligned reads/pairs:



    Can I do anything else to help you help me explain what is going on with long paired-end reads? Given that I am seeking biological inisight between infected and control samples, I'd rather avoid over-fitting parameters which give me my favorite answer, and instead follow your advice on how to correctly use your tool.

    bwa-meth
    I also report the alignment rate of bwa-meth in the screenshot above, which seem much better (although the numbers don't add up to the same number of input read pairs?!), but you can read below how I am stuck with that tool as well. Here are the detailed stats for bwa-meth:
    $samtools flagstat bwa-meth.bam
    17570633 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    17493812 + 0 mapped (99.56%:-nan%)
    17570633 + 0 paired in sequencing
    8787708 + 0 read1
    8782925 + 0 read2
    16950242 + 0 properly paired (96.47%:-nan%)
    17470980 + 0 with itself and mate mapped
    22832 + 0 singletons (0.13%:-nan%)
    209437 + 0 with mate mapped to a different chr
    66124 + 0 with mate mapped to a different chr (mapQ>=5)


    However, I am blocked one step further with that one. The "tabulate" step using Bis-SNP to call methylation for CpG's crashes due to a read in the following configuration. I know I am going out of topic here, but here is the error message (I will contact the developer there to see about that):
    ERROR MESSAGE: SAM/BAM file SAMFileReader{/workspace/scratch/krue/Methylation/2014-08-27_analysis/bismark/bwa_meth/bwa-meth.bam} is malformed: Read starting with deletion. Cigar: 19D78M1D66M


    Any advice welcome. Really feeling lost about what to do now.
    Thanks very much

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 06-03-2024, 06:55 AM
0 responses
12 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-30-2024, 03:16 PM
0 responses
26 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-29-2024, 01:32 PM
0 responses
29 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-24-2024, 07:15 AM
0 responses
216 views
0 likes
Last Post seqadmin  
Working...
X