The test data set comes from the 2009 Lister et al. paper. The reads were not specifically trimmed for adapters but just shortened to 50bp. Still, a lot of the reads suffer from poor-quality sequence (as was the norm back in those days) and possibly adapter contamination. I am sure that if you removed those you would also see an increased mapping efficiency. If you follow this QC and trimming guide you should see fairly good results for your application (very long reads might need some specific attention though, e.g. using Bowtie2 for mapping).
The test dataset is meant as a quick check that the program runs correctly after installation; it was not intended to showcase a staggeringly high mapping efficiency for Bisulfite-Seq in general.
-
Sorry fkrueger, I drew the wrong conclusion because I wrote the wrong sequences for OB(-) and OB(+) in my post above: OB(-) should be TTAGTGT and OB(+) should be ACACTAA.
OB(C>T) ATATTAA
OB(G>A) ACACTAA, which can map to Genome(G>A).
So in this case all the strands (OT, OB) can theoretically map to either Genome(C>T) or Genome(G>A), regardless of whether there were methylated bases in the original strands, so the mapping efficiency of BS-seq should not be too low.
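The corrected example can be checked with a small sketch. This is only an illustration of the in-silico conversion idea using the 7 bp toy sequence from this thread. The function names are hypothetical and this is not Bismark's actual code, and exact string equality stands in for a real aligner.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def bisulfite_convert(strand, methylated=()):
    """Convert every unmethylated C to T; positions are 0-based on `strand` (5'->3')."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(strand)
    )


def c_to_t(seq):
    """Fully C>T convert a sequence in silico."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """Fully G>A convert a sequence in silico."""
    return seq.replace("G", "A")


genome = "ACGCTGA"
genome_ct = c_to_t(genome)  # Genome(C>T)
genome_ga = g_to_a(genome)  # Genome(G>A)

# OT read: top strand after bisulfite treatment, first C methylated (as in the thread).
ot_read = bisulfite_convert(genome, methylated={1})

# OB read: bottom strand written 5'->3', here with no methylated C.
ob_minus = bisulfite_convert(reverse_complement(genome))
ob_plus = reverse_complement(ob_minus)  # same read in top-strand orientation

# Assumption for illustration: the bottom-strand C of the CpG is methylated instead.
ob_minus_meth = bisulfite_convert(reverse_complement(genome), methylated={4})
ob_plus_meth = reverse_complement(ob_minus_meth)

ot_maps = c_to_t(ot_read) == genome_ct
ob_maps = g_to_a(ob_plus) == genome_ga
ob_meth_maps = g_to_a(ob_plus_meth) == genome_ga
print(ot_maps, ob_maps, ob_meth_maps)
```

In this toy case both strands match their converted genome whether or not a C was methylated, because the full conversion of the read erases the methylation difference before matching.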
I finally found out why my data's mapping efficiency is 0.1%: my data is 250 bp paired-end (250PE), and the adapter sequence in the last part of the reads causes the mapping to fail. After trimming the reads to 50bp, the mapping efficiency rises to 76%. But I don't know why the Bismark test dataset (http://www.bioinformatics.babraham.a...d.html#bismark) has a low mapping efficiency of 47.6%; it confused me and gave me the impression that the mapping efficiency of BS-seq is low.
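For reference, hard-trimming reads to a fixed length can be sketched as below. This is a minimal illustration that assumes plain four-line FASTQ records. The function name `hard_trim_fastq` is hypothetical, and a dedicated adapter and quality trimming tool is usually the better choice, since cutting at a fixed length also discards good sequence.

```python
def hard_trim_fastq(lines, max_len=50):
    """Yield FASTQ lines with sequence and quality strings cut to max_len.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Lines 1 and 3 of each 4-line record hold the sequence and the qualities.
        if i % 4 in (1, 3):
            line = line[:max_len]
        yield line


# Tiny in-memory example: one 60 bp read trimmed to 50 bp.
record = ["@read1", "ACGT" * 15, "+", "I" * 60]
trimmed = list(hard_trim_fastq(record, max_len=50))
print(len(trimmed[1]), len(trimmed[3]))
```

The sequence and quality lines are cut to the same length so that each record stays valid.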
-
Thanks fkrueger, you are very kind. I would like to understand the reason in more depth.
Here is an example:
genome sequence is: ACGCTGA
the real sample's sequence is:
ACGCTGA
TGCGACT
The red "C" is the methylated base.
then:
Genome(C>T) is ATGTTGA
Genome(G>A) is ACACTAA
OT(+) is ACGTTGA
OB(-) is TGTGATT
OB(+) is AATCACA
In a directional library, both the OT and OB strands can be sequenced.
OT(C>T) ATGTTGA, which can be mapped to Genome(C>T)
OB(C>T) AATTATA, which cannot be mapped to Genome(C>T) or Genome(G>A)
OB(G>A) AATCACA, which cannot be mapped to Genome(C>T) or Genome(G>A)
So in this example only OT can be aligned and OB cannot. Is this the reason for the low mapping efficiency of BS-seq?
-
The mapping efficiency for very short bisulfite converted sequences is substantially lower than for 'normal' sequencing, but for read lengths of 40bp or longer the difference is only a few percent. Fig. 2a of this review compares the mapping efficiencies of BS-Seq vs. normal alignments as a function of read length.
A mapping efficiency of 0.1% sounds very, very low; this is something you would probably see if you aligned sequences to the wrong genome (e.g. human/mouse).
-
I would like to ask about the mapping efficiency of bisulfite sequencing. I have run the test data (the Bismark test dataset at http://www.bioinformatics.babraham.a...d.html#bismark), and its mapping efficiency is 47.6%. My own bisulfite-sequencing data has a mapping efficiency of 0.1% (this may mostly be caused by lab staff using the wrong protocol).
Is the mapping efficiency of bisulfite sequencing lower than that of normal sequencing? Can the C>T and G>A versions of every template's OT and OB strands map to Genome(C>T) and Genome(G>A)?
-
Originally posted by frozenlyse: "Ah sure, that makes sense - I may test it out for myself on the unaligned read from a cancer cell line with some known translocations and see if anything falls apart - if I get around to testing it I'll let you know how it goes."
-
Ah sure, that makes sense - I may test it out for myself on the unaligned read from a cancer cell line with some known translocations and see if anything falls apart - if I get around to testing it I'll let you know how it goes.
-
Hi Aaron,
I have to admit that I haven't spent any time thinking about whether it would be possible, or how difficult it would be, to allow these settings. I would imagine that just enabling these options in the code would probably lead to some other part failing in some way, even though it is difficult to predict how. This is something that sounds very straightforward to implement, but might turn out to be surprisingly difficult ...
-
Hi,
I was wondering about using WGBS data for structural variant prediction. According to the Bismark manual, the Bowtie2 paired-end options --no-mixed and --no-discordant are always switched on. Is there any way of disabling this apart from editing the source code? Perhaps these options could be changed to --allow-mixed and --allow-discordant so that the default behaviour does not change? It seems a bit odd to have options which are impossible to turn off!
Cheers,
Aaron
-
Offhand, I can't think of any application where this would cause a problem. With genome viewers, you need to coordinate sort anyway and the pairing isn't done at the read-name level (there's no fast index for querying the position of reads in BAM files by name).
-
Hi all,
Running the latest version of Bismark and looking at the read names (we discussed this a while ago here), I have found that the names of the two reads of a pair no longer end in /1 and /2. In my case both members of a pair are named "...../1" and "....../1".
I know it doesn't matter too much since Bismark does the methylation call properly, but I was wondering whether it could interfere with other downstream applications or genome viewers.
Left alignment
----------------------
Read name = FCD1LHLACXX:8:2308:5026:30317#ACCAGACT/1
Location = groupXXI:767
Alignment start = 756 (+)
Cigar = 99M
Mapped = yes
Mapping quality = 255
----------------------
Base = A
Base phred quality = 39
----------------------
Pair start = groupXXI:900 (-)
Pair is mapped = yes
Insert size = 242
Pair orientation = F2R1
----------------------
Second in pair
-------------------
XG = GA
NM = 16
XM = ...........x.............h..........x...xh......hh..xh..........x.h...x........h..x.....x.h........
XR = GA
XX = 11G13G10G3GG6GG2GG10G1G3G8G2G5G1G8
-------------------
Right alignment
----------------------
Read name = FCD1LHLACXX:8:2308:5026:30317#ACCAGACT/1
Location = groupXXI:767
Alignment start = 900 (-)
Cigar = 98M
Mapped = yes
Mapping quality = 255
----------------------
----------------------
Pair start = groupXXI:756 (+)
Pair is mapped = yes
Insert size = -242
Pair orientation = F2R1
----------------------
First in pair
-------------------
XG = GA
NM = 19
XM = ........x..hh..x.Z..xh....x.....xh.....h.........Z...xh......x............xh..xh.......x.....Z....
XR = CT
XX = 8G2GG2G4GG4G5GG5G13GG6G12GG1TGG7G10
-------------------
-
Seems like a point worth addressing... But now I shall focus on my holiday!
-
Just a follow-up: the same file took 1500 minutes to sort with the -k3,3 parameter.
gerald
-
Thank you Felix for your time, but don't forget: you are on holiday
Gérald
-
Sorting by chromosome in addition to the position might indeed be a relic of former versions of the script, from back when files weren't sorted into individual chromosome files. I'll take a look at this once I am back, but for the moment you should be fine just deleting the -k 3,3 from the sort command.