Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

fkrueger replied

03-27-2015, 05:16 AM
We've just released a new version of Bismark (v0.14.1), which mainly addresses a few bugs/glitches:

Bismark: Fixed the cleaning up stage in a --multicore run when --gzip had been specified as well
Bismark: Fixed the handling of files in a --multicore run when the input files had been specified including file path information
Bismark: Please note that the option -B/--basename in conjunction with --multicore is currently not supported (as in: disabled), but we are aiming to address this soon

Methylation Extractor: Fixed a bug with the position adjustment of paired-end reads when the reads should have been trimmed from their 3' ends (option --ignore_3prime)

deduplicate_bismark: Now also removing newline characters from the read conversion tag in case other programs interfered with the tag ordering and put this tag into the very last column

Download is available from the Bismark project page: http://www.bioinformatics.babraham.a...jects/bismark/
barbarian replied

03-23-2015, 08:33 PM
Hello again,

So, I decided to try a single-end dataset because I think paired-end takes too much time. With the current dataset, I've successfully used Bismark for the alignment, with 66% mapping efficiency. But I still have a problem. The paper published together with the dataset states that they got a mapping efficiency of 72% and a higher percentage of CpG methylation. So, is it possible to increase the mapping efficiency by changing some parameters in Bismark or Trim Galore? In the paper, they didn't use Bismark or Trim Galore. They developed their own code to do the trimming and methylation calling, and used Bowtie as the aligner. Thank you.
dpryan replied

03-19-2015, 12:20 AM
Fastq-dump is painfully slow. Sometimes you can get lucky and either ENA or DDBJ (conveniently located in Japan) has the files in fastq format. That speeds things up quite a bit, but unfortunately these other sources sometimes only have the SRA file (e.g., for this experiment). Whenever you're dealing with large files, this can start taking a while. This is why many of us push jobs onto clusters where we use multiple machines at once (e.g., I used 13 nodes on ours to perform the alignment of this dataset in 30-40 minutes, which would have been less had someone else not decided to start 1000 jobs at the same time...).
barbarian replied

03-18-2015, 05:33 PM
Ok, thank you for your guidance. I will try the first million sequences to see which arguments are best suited for the analysis, so that I don't need to wait for hours to check whether I made a mistake or not.

Btw, does an analysis with Bismark really take this long? My computer specification is an i7 (8 cores) with 28 GB of RAM. The whole process (fastq-dump, Trim Galore and Bismark) seems to take too long: around 7 hours or maybe more to complete.

Last edited by barbarian; 03-18-2015, 07:21 PM.

dpryan replied

03-18-2015, 05:18 AM

For the sake of comparison, below are some metrics that I get using local alignment on that dataset. You won't get identical metrics (we're using different tools and trimming differently), but they won't be that terribly different (except CpG methylation). So >75% alignment is definitely possible with this dataset (at least after removing very low quality reads...of which there are many).

Code:

Alignment:
	76660262 total reads analysed
	59596692 paired-end reads mapped ( 77.74%).

	27577367 concordant pairs
	1545115 discordant pairs
	1351728 reads aligned as singletons

Number of hits aligning to each of the orientations:
	11086480	 14.46%	OT (original top strand)
	10666480	 13.91%	OB (original bottom strand)
	19288820	 25.16%	CTOT (complementary to the original top strand)
	18554912	 24.20%	CTOB (complementary to the original bottom strand)

Cytosine Methylation (N.B., statistics from overlapping mates are added together!):
	Number of C's in a CpG context: 188158298
	Percentage of methylated C's in a CpG context:  44.23%
	Number of C's in a CHG context: 173622589
	Percentage of methylated C's in a CHG context:   2.27%
	Number of C's in a CHH context: 348611484
	Percentage of methylated C's in a CHH context:   7.04%
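
As a small worked check on the numbers above (an editorial sketch, not part of the original output): the per-strand percentages appear to be relative to the total reads analysed, so the combined share of the two complementary strands can be recomputed from the raw counts. The function name `complementary_share` is a hypothetical helper introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: percentage of hits on the complementary strands
# (CTOT + CTOB) relative to a chosen denominator, here total reads analysed.
# Arguments: <CTOT count> <CTOB count> <total reads>
complementary_share() {
    awk -v ctot="$1" -v ctob="$2" -v total="$3" \
        'BEGIN { printf "%.2f\n", 100 * (ctot + ctob) / total }'
}

# Counts copied from the alignment report above.
complementary_share 19288820 18554912 76660262
```

For this report the result is close to half of all reads, which is why the dataset looks non-directional: a directional library would be expected to show few or no hits on CTOT and CTOB.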


barbarian replied

03-18-2015, 01:24 AM
It seems I made a mistake when running Trim Galore. Thank you so much. I will add the non-directional parameter tonight, and tomorrow I will look at the result again.
dpryan replied

03-18-2015, 01:04 AM
The simplest method is to just take a million or so reads and align them in a non-directional manner. If there's considerable alignment to the CTOB and CTOT strands, then it's non-directional.
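
A minimal sketch of that test, assuming an uncompressed, already trimmed single-end FASTQ file (the helper name `first_reads`, the file names and the genome folder path are placeholders, not taken from this thread):

```shell
#!/bin/sh
# Hypothetical helper: write the first N reads of an uncompressed FASTQ file
# to stdout. Each FASTQ record spans exactly 4 lines, so N reads = 4*N lines.
first_reads() {
    n_reads=$1
    fastq=$2
    head -n "$((n_reads * 4))" "$fastq"
}

# Usage sketch (placeholders; bismark must be installed and the genome
# folder must already have been prepared for Bowtie 2):
#   first_reads 1000000 sample_trimmed.fq > subset.fq
#   bismark --bowtie2 --non_directional /path/to/genome_folder subset.fq
```

The alignment report for the subset then lists the number of hits for OT, OB, CTOT and CTOB, which is what to compare.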
dpryan replied

03-18-2015, 01:04 AM
I'll also note that you can probably get >75% alignment rate with this dataset, at least I did with a subset of it and using local alignment. This would probably be 80-85% if I included all of the multimappers in the metrics.
barbarian replied

03-18-2015, 01:03 AM
Thank you for the info. I always thought it was directional. It seems I need to change the command again. Can you tell me how to check whether a dataset is directional or not?
dpryan replied

03-18-2015, 12:55 AM
FYI, this is a non-directional dataset, so make sure to use the appropriate options.
barbarian replied

03-17-2015, 11:53 PM
Thank you for your reply. I just realized that this afternoon. Now I'm waiting for the result. Maybe tomorrow I will have another question, because it usually does not finish on the same day. Good luck with your meeting.
fkrueger replied

03-17-2015, 11:50 PM
For paired-end files you need to run Trim Galore in paired-end mode like this:

trim_galore --rrbs --paired <fastq1> <fastq2>

If you run it twice in single-end mode, it will break the sequence-by-sequence order of the files, which then results in very low mapping efficiency.
I am in a meeting all day but can take a look myself at the file in question tonight or tomorrow.
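
A rough sketch of how to check for that problem before aligning (an editorial addition, not from the original post): the hypothetical helper `pairs_in_sync` below compares the read IDs of two uncompressed FASTQ files record by record. It assumes the read ID is the first whitespace-delimited field of each header line, optionally followed by /1 or /2; file names in the usage comments are placeholders, and the `_val_1.fq`/`_val_2.fq` names are what Trim Galore usually writes in paired-end mode, so check them against your actual output.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if both FASTQ files list the
# same read IDs in the same order; fail (exit 1) at the first mismatch.
pairs_in_sync() {
    # paste joins line i of both files with a tab; header lines are every
    # 4th line starting at line 1.
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            split($1, a, " "); split($2, b, " ")
            sub(/\/[12]$/, "", a[1]); sub(/\/[12]$/, "", b[1])
            if (a[1] != b[1]) { bad = 1; exit }
        }
        END { exit bad }'
}

# Usage sketch (placeholders; requires trim_galore and bismark):
#   trim_galore --rrbs --paired sample_1.fastq sample_2.fastq
#   pairs_in_sync sample_1_val_1.fq sample_2_val_2.fq || echo "files out of sync"
#   bismark --bowtie2 /path/to/genome_folder -1 sample_1_val_1.fq -2 sample_2_val_2.fq
```

Files trimmed separately in single-end mode would typically fail this check as soon as one mate of a pair has been removed from only one of the files.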
barbarian replied

03-17-2015, 05:51 PM
Ok, it's strange. I tried with another sample. The mapping efficiency is 0.1% when I use both files and 13.5% when I use only one file. Before this step, what I ran was:
fastq-dump --split-files <sra file>
trim_galore --rrbs <fastq1>
trim_galore --rrbs <fastq2>
For both files:
bismark --bowtie2 <ref> -1 <fastq1> -2 <fastq2>
For 1 file:
bismark --bowtie2 <ref> <fastq1>

As for the reference, I'm sure that I built it with Bowtie 2, and I have checked it with the Bismark sample data, where the result is similar to the documentation. I'm trying the next sample to see whether it's the sample's fault or my command's fault. Any suggestions? By the way, I downloaded the sample from NCBI GEO. Here is the link: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE61150
The sample that I checked is the first sample, here: http://www.ncbi.nlm.nih.gov/geo/quer...acc=GSM1498453

Thank you for your help.

Additional:
I checked it again with FastQC after trimming; the result for both FASTQ files is about 50-50, not all good. The failing modules are per tile sequence quality, per base sequence content, sequence duplication levels and Kmer content.

Last edited by barbarian; 03-17-2015, 06:06 PM.
fkrueger replied

03-17-2015, 01:31 AM
Thanks Devon for jumping in. Here is a protocol that is worth reading in order to achieve good mapping results in most cases: http://www.epigenesys.eu/en/protcols...q-data-prot-57
barbarian replied

03-17-2015, 12:35 AM
Ok. I will try now. Maybe I will have another question tomorrow after the result is out.
