
  • Offhand, I can't think of any application where this would cause a problem. With genome viewers, you need to coordinate-sort anyway, and the pairing isn't done at the read-name level (there's no fast index for querying the position of reads in BAM files by name).


    • Hi,

      I was wondering about using WGBS data for structural variant prediction. According to the Bismark manual, the Bowtie 2 paired-end options --no-mixed and --no-discordant are always switched on. Is there any way of disabling them apart from editing the source code? Perhaps these options could be changed to --allow-mixed and --allow-discordant so that the default behaviour does not change? It seems a bit odd to have options which are impossible to turn off!


      • Hi Aaron,
        I have to admit that I haven't spent any time thinking about whether it would be possible, or how difficult it would be, to allow these settings. I would imagine that just enabling these options in the code would probably cause some other part to fail in some way, even though it is difficult to predict how. This is something that sounds very straightforward to implement, but might turn out to be surprisingly difficult ...


        • Ah sure, that makes sense. I may test it out myself on the unaligned reads from a cancer cell line with some known translocations and see if anything falls apart. If I get around to testing it, I'll let you know how it goes.


          • Great, I'll be interested to hear about the outcome!
            Last edited by fkrueger; 09-11-2013, 12:54 AM. Reason: typo


            • I have a question about the mapping efficiency of bisulfite sequencing. I ran the Bismark test dataset (from http://www.bioinformatics.babraham.a...d.html#bismark) and its mapping efficiency was 47.6%. My own bisulfite-sequencing data has a mapping efficiency of only 0.1% (this may be caused by the lab staff using a wrong protocol).

              Is the mapping efficiency of bisulfite sequencing lower than that of normal sequencing? Can the C>T and G>A versions of every template from the OT strand and the OB strand map to Genome(C>T) and Genome(G>A)?


              • The mapping efficiency for very short bisulfite converted sequences is substantially lower than for 'normal' sequencing, but for read lengths of 40bp or longer the difference is only a few percent. Fig. 2a of this review compares the mapping efficiencies of BS-Seq vs. normal alignments as a function of read length.

                A mapping efficiency of 0.1% sounds very, very low; this is about what you would see if you aligned sequences to the wrong genome (e.g. human/mouse).


                • Thanks fkrueger, you are very kind. I would like to understand the reason in more depth.

                  Here is an example:
                  The genome sequence is: ACGCTGA
                  The real sample's sequence is: ACGCTGA, in which the first C (the one left unconverted in OT(+) below) is the methylated base.

                  Genome(C>T) is ATGTTGA
                  Genome(G>A) is ACACTAA
                  OT(+) is ACGTTGA
                  OB(-) is TGTGATT
                  OB(+) is AATCACA

                  In a directional library, both the OT and OB strands can be sequenced.
                  OT(C>T) ATGTTGA, which can be mapped to Genome(C>T)
                  OB(C>T) AATTATA, cannot be mapped to Genome(C>T) or Genome(G>A)
                  OB(G>A) AATCACA, cannot be mapped to Genome(C>T) or Genome(G>A)

                  So in this example only OT can be aligned and OB cannot. Is this the cause of the low mapping efficiency for BS-seq?
                  Last edited by litc; 09-11-2013, 06:07 PM. Reason: formatting for easy reading.


                  • Sorry fkrueger, I reached a wrong conclusion because I wrote the wrong sequences for OB(-) and OB(+).

                    In my post above, OB(-) should be TTAGTGT and OB(+) should be ACACTAA.
                    OB(C>T) ATATTAA
                    OB(G>A) ACACTAA, can be mapped to Genome(G>A)

                    So in this case, all the strands (OT, OB) can theoretically map to either Genome(C>T) or Genome(G>A), regardless of whether there were methylated bases in the original strands. The mapping efficiency of BS-seq should therefore not be too low.
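The corrected example can be checked with a few lines of Python. This is only a sketch of the in-silico conversion idea on the 7 bp toy sequence, not Bismark's own code; the helper names `convert` and `reverse_complement` are made up for illustration, and the OB read is assumed to carry no methylation.

```python
def convert(seq: str, src: str, dst: str) -> str:
    """Replace every base `src` with `dst` (full in-silico conversion)."""
    return seq.upper().replace(src, dst)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


genome = "ACGCTGA"

# The two converted versions of the reference used in the example.
genome_ct = convert(genome, "C", "T")  # ATGTTGA
genome_ga = convert(genome, "G", "A")  # ACACTAA

# OT read after bisulfite treatment: the first (methylated) C survives.
ot_read = "ACGTTGA"

# OB read as sequenced (5'->3' on the bottom strand). Assumption: no
# methylation on this strand, so every C becomes T.
ob_read = convert(reverse_complement(genome), "C", "T")  # TTAGTGT

# Reads are fully C>T converted before alignment, which removes any trace
# of the methylation state from the sequence that gets aligned.
ot_matches_ct = convert(ot_read, "C", "T") == genome_ct
# The converted OB read corresponds to the reverse strand of Genome(G>A).
ob_matches_ga = reverse_complement(convert(ob_read, "C", "T")) == genome_ga

print(ot_matches_ct, ob_matches_ga)
```

Both comparisons hold for this toy sequence, which agrees with the conclusion that methylation by itself does not stop OT or OB reads from finding their converted genome.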

                    I finally found out why my data's mapping efficiency was 0.1%: my data is 250PE, and adapter sequence in the last part of the reads caused the mapping to fail. After trimming the reads to 50bp, the mapping efficiency rose to 76%. But I don't know why the Bismark test dataset (http://www.bioinformatics.babraham.a...d.html#bismark) has a mapping efficiency of only 47.6%. It confused me and gave me the impression that the mapping efficiency of BS-seq is low.
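For anyone wanting to reproduce the "trim to 50bp" step, a minimal sketch of hard-trimming is shown here. It assumes plain four-line FASTQ records, and the function name `hard_trim_fastq` is hypothetical. It simply cuts every read to a fixed length and does not search for adapter sequence, so it is not a replacement for a dedicated adapter trimmer.

```python
from typing import Iterable, Iterator, List


def hard_trim_fastq(lines: Iterable[str], length: int = 50) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield seq[:length]   # keep the 5' end of the read
            yield plus
            yield qual[:length]  # qualities must stay the same length
            record = []


example = ["@read1\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n"]
trimmed = list(hard_trim_fastq(example, length=5))
print("\n".join(trimmed))
```

Reads shorter than the requested length pass through unchanged, because slicing past the end of a string is harmless in Python.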


                    • The data used as the test dataset is from the 2009 Lister et al. paper; the reads were not specifically trimmed for adapters but just shortened to 50bp. Still, a lot of the reads suffer from poor quality sequence (as was the norm back in those days) and possibly adapter contamination. I am sure that if you removed them you would also see an increased mapping efficiency. If you follow this QC and trimming guide you should see fairly good results for your application (very long reads might need some specific attention though, e.g. using Bowtie2 for mapping).

                      The test dataset is meant as a quick test that the program runs correctly after installation, and was not intended to showcase a staggeringly high mapping efficiency of Bisulfite-Seq in general.


                      • We have just released a new version of Bismark (v0.10.0) that adds a variety of convenient features and bug fixes. The changes in detail are:

                        Bismark: The option '--prefix' now also works for the C->T and G->A transcribed temporary files, which allows multiple instances of Bismark to be run on the same file(s) in the same folder (e.g. using Bowtie and Bowtie 2, or some stricter and laxer parameters, concurrently)

                        bismark2report: Changed the behavior of this module to automatically find all Bismark mapping reports in the current working directory, and to try and work out whether the optional reports are present as well (i.e. deduplication, splitting and M-bias reports). This uses the file basename and will fail if the files have been renamed at any stage
                        bismark2report: Added commas as separator for large numbers to improve readability

                        Bismark methylation extractor: will now delete unused methylation context files (e.g. CTOT and CTOB files for a directional library)

                        bismark2bedGraph: Dropped the option -k3,3 from the sort command to result in a dramatic speed increase while sorting. This option had been used previously to enable sorting by chromosome in addition to position, but should no longer be needed because the files are being read in sorted by chromosome already
                        bismark2bedGraph: This module does now produce these two output files:
                        (1) A bedGraph file, which now contains a header line: 'track type=bedGraph'. The genomic start coords are 0-based, the end coords are 1-based. These changes should make the file truly compatible with the UCSC genome browser.
                        (2) A coverage file ending in .cov. This file replaces the former 'bedGraph --counts' file and is required to proceed with the subsequent step to generate a genome-wide cytosine report (the module doing this has been renamed to coverage2cytosine to reflect this file name change)
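To make the coordinate convention in (1) concrete, here is a small sketch of how a single cytosine at a 1-based genomic position ends up on a bedGraph line with a 0-based start and a 1-based end. The input tuple layout and the function name `to_bedgraph` are assumptions for illustration, as is the use of a methylation percentage as the value column; this is not taken from the bismark2bedGraph source.

```python
from typing import Iterable, List, Tuple

# Hypothetical input: (chromosome, 1-based position, methylation percentage)
Call = Tuple[str, int, float]


def to_bedgraph(calls: Iterable[Call]) -> List[str]:
    """Format 1-based single-cytosine calls as bedGraph lines."""
    lines = ["track type=bedGraph"]  # header line mentioned in the notes
    for chrom, pos, percent in calls:
        start = pos - 1  # bedGraph start is 0-based
        end = pos        # bedGraph end is 1-based
        lines.append(f"{chrom}\t{start}\t{end}\t{percent:g}")
    return lines


bedgraph = to_bedgraph([("chr1", 100, 75.0), ("chr1", 205, 12.5)])
print("\n".join(bedgraph))
```

A cytosine at position 100 is therefore written as the interval 99 to 100, which is the half-open form the UCSC browser expects for a single base.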

                        coverage2cytosine: Changed the name of this module from 'bedGraph2cytosine' to 'coverage2cytosine' to reflect the change that this module now requires the methylation coverage file produced by the bismark2bedGraph module (this coverage file replaces the former "bedGraph --counts" output)
                        coverage2cytosine: Previously, the cytosine report would always report every C position in any context, even though the default should have reported CpG positions only. This has now been fixed

                        Bismark genome preparation: Made a couple of changes to make the genome preparation fully non-interactive. This means that the path to the genome folder and to Bowtie (1/2) have to be specified up front (for Bowtie (1/2) it is otherwise assumed that it is in the PATH). Furthermore, already existing bisulfite indices in the target folder will be overwritten, and the user is no longer prompted to confirm this. We got rid of the prompt because creating a second index (Bowtie 1 as well as 2) in the same folder in non-interactive mode got stuck in loops asking whether it is alright to proceed or not, generating terabyte-sized log files without ever starting to do anything useful...

                        deduplicate_bismark: Renamed the rather long deduplication script to this slightly shorter one. Also added some filehandle closing statements that might have caused buffering issues under certain circumstances

                        Bismark is available from https://www.bioinformatics.babraham....jects/bismark/


                        • We have just made available a new version of Bismark (v0.10.1) which fixes a few issues and adds some useful support for unfinished genomes with lots of scaffolds instead of just a handful of chromosomes. Here are all the changes in more detail:

                          Bismark methylation extractor: The methylation extractor does now detect automatically whether Bismark alignment file(s) were run in single-end or paired-end mode. The automatic detection can be overridden by manually specifying -s or -p and this option is only available for SAM/BAM files
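The notes do not say how the automatic detection is implemented. As a generic illustration only, one way to tell single-end from paired-end SAM records is the FLAG field, where bit 0x1 marks a read as part of a multi-segment (paired) template according to the SAM specification. The function name `looks_paired_end` is hypothetical, and this sketch should not be read as Bismark's own method.

```python
from typing import Iterable

PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments


def looks_paired_end(sam_lines: Iterable[str]) -> bool:
    """Return True if the first alignment record has the paired bit set."""
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        return bool(flag & PAIRED_FLAG)
    raise ValueError("no alignment records found")


paired_example = [
    "@HD\tVN:1.0\n",
    "r1\t99\tchr1\t100\t42\t4M\t=\t200\t104\tACGT\tIIII\n",
]
single_example = ["r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\n"]
print(looks_paired_end(paired_example), looks_paired_end(single_example))
```

Looking at only the first record is a shortcut that assumes the file was produced by a single alignment run in one mode.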

                          bismark2bedGraph: When run in stand-alone mode, the coverage file name is derived from the output filename by replacing the file ending 'bedGraph' with 'bismark.cov'. If the output filename ends in anything other than 'bedGraph', '.bismark.cov' will be appended to the filename
                          bismark2bedGraph: When run in stand-alone mode, '--counts' will be enabled by default for the coverage output
                          bismark2bedGraph: Added a new option '--scaffolds/--gazillion' for users working with unfinished genomes sporting tens or even hundreds of thousands of scaffolds/contigs/chromosomes. Such a large number of reference sequences frequently resulted in errors with pre-sorting reads to individual chromosome files because of the operating system's limitation of the number of filehandles that can be written to at any one time (typically this limit is anything between 128 and 1024 filehandles; to find out this limit on Linux, type: ulimit -a). To bypass the limitation of open filehandles, the option '--scaffolds' does not pre-sort methylation calls into individual chromosome files. Instead, all input files are temporarily merged into a single file (unless there is only a single file), and this file will then be sorted by both chromosome AND position using the UNIX sort command. Please be aware that this option might take a looooong time to complete, depending on the size of the input files, and the memory you allocate to this process (see '--buffer_size')
                          bismark2bedGraph: Added a new option '--ample_memory'. Using this option will not sort chromosomal positions using the UNIX sort command, but will instead use two arrays to sort methylated and unmethylated calls, respectively. This may result in a faster sorting process for very large files, but this comes at the cost of a larger memory footprint (as an estimate, two arrays of the length of the largest human chromosome, chromosome 1 (~250 million bp), consume around 16GB of RAM). Note, however, that due to the overhead of creating and looping through huge arrays this option might in fact be *slower* for small-ish files (up to a few million alignments). Note also that this option is not currently compatible with the options '--scaffolds/--gazillion'. This option still needs some efficiency testing as to when it actually makes sense to use it, but it produces identical results to the default sort option. Thanks to Yi-Shiou Chen for contributing this twist
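The two strategies described for bismark2bedGraph can be contrasted in a toy sketch: sorting all calls by chromosome and position (the idea behind '--scaffolds'), versus counting into two position-indexed arrays (the idea behind '--ample_memory'). Everything here, including the in-memory `Call` tuples, the function names and the `chrom_lengths` argument, is an assumption for illustration and is not the Perl implementation.

```python
from typing import Dict, Iterable, List, Tuple

Call = Tuple[str, int, bool]      # chromosome, 1-based position, methylated?
Row = Tuple[str, int, int, int]   # chromosome, position, meth, unmeth


def summarise_by_sorting(calls: Iterable[Call]) -> List[Row]:
    """Sort calls by chromosome AND position, then merge equal positions."""
    rows: List[Row] = []
    for chrom, pos, is_meth in sorted(calls, key=lambda c: (c[0], c[1])):
        if rows and rows[-1][0] == chrom and rows[-1][1] == pos:
            _, _, m, u = rows[-1]
            rows[-1] = (chrom, pos, m + int(is_meth), u + int(not is_meth))
        else:
            rows.append((chrom, pos, int(is_meth), int(not is_meth)))
    return rows


def summarise_with_arrays(calls: Iterable[Call],
                          chrom_lengths: Dict[str, int]) -> List[Row]:
    """Count into two arrays per chromosome; memory grows with its length."""
    calls = list(calls)
    rows: List[Row] = []
    for chrom in sorted(chrom_lengths):
        size = chrom_lengths[chrom] + 1  # index equals the 1-based position
        meth, unmeth = [0] * size, [0] * size
        for c, pos, is_meth in calls:
            if c == chrom:
                if is_meth:
                    meth[pos] += 1
                else:
                    unmeth[pos] += 1
        for pos in range(1, size):
            if meth[pos] or unmeth[pos]:
                rows.append((chrom, pos, meth[pos], unmeth[pos]))
    return rows


calls = [("chr2", 5, True), ("chr1", 10, False),
         ("chr1", 3, True), ("chr1", 10, True)]
by_sort = summarise_by_sorting(calls)
by_array = summarise_with_arrays(calls, {"chr1": 12, "chr2": 8})
print(by_sort == by_array)
```

The array version never sorts the calls but allocates two counters for every position of a chromosome, which is the memory-for-sorting trade-off the note describes; the sorting version keeps only one row per covered position.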

                          deduplicate_bismark: The deduplication script does now detect automatically whether a Bismark alignment file was run in single-end or paired-end mode (this happens separately for every file analysed). The automatic detection can be overridden by manually specifying -s or -p and this option is only available for SAM/BAM files

                          bismark2report: Specifying a single file for each of the optional reports will now work as intended, instead of being skipped

                          coverage2cytosine: Added some counting and statements to indicate when the run finished successfully (it proved to be difficult to follow the report process for a genome with nearly half a million scaffolds...)

                          Bismark is available from http://www.bioinformatics.babraham.a...jects/bismark/. Bug reports or comments most welcome.
                          Last edited by fkrueger; 11-27-2013, 04:03 AM. Reason: typo


                          • Originally posted by fkrueger
                            comments most welcome.
                            Hi, Bismark is creating a lot of plain-text output--e.g. 16GB of SAM from 4GB of fastq.gz. It would be nice if it output BAM format instead of SAM. Even if it could be forced to output .sam.gz, that would help those of us on clusters with limited disk space.


                            • ... and now I see that it already does write directly to bam. thanks.


                              • Hi Brent,

                                Using --bam will write a BAM file or, if Samtools is not available, a sam.gz file.


