**fkrueger** · 09-21-2012, 10:53 PM

Originally posted by my_bio

To accurately calculate the methylation level of a cytosine, it would be necessary to add another option that filters out low-quality base calls. That is to say, if a base's sequencing quality is lower than 20, the methylation extractor would ignore it.

We strongly recommend adapter and quality trimming of sequencing files before the alignments are carried out in the first place, and indeed we run all our samples through Trim Galore to do this (a protocol is available here). If you adhere to this procedure there is no need to filter for good quality basecalls afterwards.

**my_bio** · 09-21-2012, 11:41 PM

Thank you for your prompt reply. Trim Galore is a powerful tool for quality control, and I have already run our data through it. However, a read may still contain a few low-quality bases after quality trimming. For example, the quality string of a trimmed read may look like this:
GEEEEEFFCFFEEEEEECEEEGGGGEFFFFDGDGFGGGGDGFGDFGGFDG#CCCCCBBBAA.
One base in this read has a low quality ("#"). If this base aligns uniquely to a cytosine in the reference genome, it may affect the accuracy of the methylation level, so we need to ignore this base.
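
To make the request concrete, here is a minimal Python sketch of such a per-base filter. It is not part of Bismark. It assumes the quality string above is Phred+33 encoded (so "#" decodes to Q2) and that the trailing period is sentence punctuation rather than part of the string. The `calls` list of (read position, call) pairs is a hypothetical structure used only for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.
    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding)."""
    return [ord(ch) - offset for ch in quality]


def low_quality_positions(quality, threshold=20, offset=33):
    """Return 0-based read positions whose Phred score is below threshold."""
    scores = phred_scores(quality, offset)
    return [i for i, q in enumerate(scores) if q < threshold]


def filter_calls(calls, quality, threshold=20, offset=33):
    """Drop methylation calls made at low-quality read positions.
    `calls` is a hypothetical list of (read_position, call) pairs."""
    bad = set(low_quality_positions(quality, threshold, offset))
    return [(pos, call) for pos, call in calls if pos not in bad]


quality = "GEEEEEFFCFFEEEEEECEEEGGGGEFFFFDGDGFGGGGDGFGDFGGFDG#CCCCCBBBAA"
flagged = low_quality_positions(quality)
# Show which quality characters were flagged as below Q20
print([quality[i] for i in flagged])
```

Under these assumptions only the "#" position is flagged; every other character in the string decodes to Q32 or higher.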

**fkrueger** · 09-22-2012, 02:47 AM

Some individual bases with a low base call quality can indeed make it through; this is due to the way Cutadapt performs the quality trimming. While I don't have an exact number, I would imagine that the proportion of these is tiny (my guess is well below 0.1%). In the example you posted, the '#' is probably the Illumina flag indicating that the pipeline had trouble determining the base signal, which is not the same as a generally poor quality call.

After all, it would only matter if the base in question was a cytosine position in the genome (so ~20% of the time). And even then, the call might have been correct (albeit with a poor quality score), or might be one of the other bases that are not involved in methylation calling at all anyway, i.e. A, G or N (N is quite likely if the score was '#'). So overall I agree that this *might* result in an incorrect methylation call very occasionally, however if there were 10 such calls in ~5 billion correct calls (which you may easily get from one lane of HiSeq), I believe this is something one could live with (especially since quality checking would slow down the methylation extraction process quite noticeably...). Don't you think?

**my_bio** · 09-22-2012, 06:26 AM

What you said is reasonable, thanks.

**ELoomis** · 09-27-2012, 11:13 AM

Filtering out poorly converted reads

I'm using the non-CpG context cytosines as a measure of conversion efficiency for my sample, and I'd like to filter out any reads with a particularly low efficiency. This might be a bigger problem for my application (different sequencing platform, longer reads and locus specific) than most users doing RRBS, but I imagine this type of filter would be good for any bisulfite mapping application...
Ideally, you would be able to adjust a threshold in the command line and select which context to use as the measure (non-CpG vs. only CHH?).
Does this sound like something that other users would find useful?
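
As a rough illustration of the idea, the sketch below scores a single read by its non-CpG methylation, using the per-read methylation call string that Bismark writes to the XM:Z tag of its SAM output (z/Z for CpG, x/X for CHG, h/H for CHH, upper case meaning methylated). The function names, the `max_fraction` and `min_calls` thresholds, and the rule for reads with too few calls are all assumptions made for this example, not existing Bismark options.

```python
def non_cpg_counts(xm, contexts="xh"):
    """Count (methylated, total) calls in the chosen contexts.
    contexts="xh" uses CHG and CHH; contexts="h" uses CHH only."""
    methylated = sum(xm.count(c.upper()) for c in contexts)
    unmethylated = sum(xm.count(c.lower()) for c in contexts)
    return methylated, methylated + unmethylated


def keep_read(xm, max_fraction=0.1, min_calls=3, contexts="xh"):
    """Return False if the read looks poorly converted.
    Reads with fewer than min_calls informative positions are kept,
    since there is too little evidence to judge them (an assumption)."""
    methylated, total = non_cpg_counts(xm, contexts)
    if total < min_calls:
        return True
    return methylated / total <= max_fraction


print(non_cpg_counts("..h..H.x..Z..h"))   # CpG call 'Z' is ignored
print(keep_read("hhhhxxZ"), keep_read("HHHhXZ"))
```

Apparent methylation in non-CpG context is only a proxy for incomplete conversion where genuine non-CpG methylation is expected to be low, so thresholds like these would need tuning per organism and protocol.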

**fkrueger** · 10-02-2012, 02:04 AM

We have just released a new version of Bismark (version 0.7.7), which mainly extends the functionality of the Bismark methylation extractor, as recently discussed here on SeqAnswers. The methylation extractor now includes the functionality of the two additional scripts genome_methylation_bismark2bedGraph and genome_wide_cytosine_report; this means that, in addition to the standard methylation extractor output, it can generate sorted bedGraph and/or genome-wide cytosine report output files directly using the options --bedGraph or --cytosine_report, respectively.

Here are all changes in more detail:

Bismark
• When reading in the genome file Bismark does now automatically remove \r line ending characters as well. This sometimes caused problems when genome files had been edited on Windows machines.
• Added support for the Bowtie 2 options '--rdg int1,int2' and '--rfg int1,int2' to adjust the gap open and extension penalties for both read and reference sequence. This might be useful for very special conditions (e.g. PacBio data...)

Bismark methylation extractor
• Renamed methylation_extractor to bismark_methylation_extractor
• Added new function '-o/--output' to specify an output directory. This became necessary for integration into Galaxy
• Added new function '--no_header' to suppress the Bismark version header in the output files if plain alignment data is more desirable
• Added option '--bedGraph' to produce a bedGraph output file once the methylation extraction has finished; this reports the genomic location of a cytosine and its methylation state (in %). By default, only cytosines in CpG context will be sorted/reported
• Implemented option '--cutoff threshold' to set the minimum number of times a methylation state has to be seen for that nucleotide before its methylation percentage is reported
• Implemented option '--counts' which adds two additional columns to the bedGraph output file to enable further calculations:
Column 5: count of methylated calls per position
Column 6: count of unmethylated calls per position
• Implemented option '--CX_context' so that the sorted bedGraph output file contains information on every single cytosine that was covered in the experiment irrespective of its sequence context
• Added option '--cytosine_report' which produces a genome-wide methylation report for all cytosines. By default, the output uses 1-based chromosome coordinates and reports CpG context only. The output considers all Cs on both forward and reverse strands and reports their position, strand, trinucleotide content and methylation state
• Option '--CX_context' applies to the cytosine report as well. The output file will contain information on every single cytosine in the genome irrespective of its context. This applies to both forward and reverse strands
• Implemented option '--zero_based' to use zero-based coordinates, as used in e.g. bed files, instead of 1-based coordinates
• Implemented option '--genome_folder PATH' to be used to extract sequences from. Accepted formats are FastA files ending with '.fa' or '.fasta'
• Added an option '--split_by_chromosome' which writes the cytosine report output to individual chromosome files instead of to one single very large file
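
As an example of the "further calculations" that the '--counts' columns make possible, here is a small Python sketch that recomputes a methylation percentage from the two count columns and applies a coverage threshold in the spirit of '--cutoff'. Two things are assumptions here: that columns 1 to 4 follow the usual bedGraph layout (chromosome, start, end, value), and that the cutoff is applied to the total number of calls at a position. The function names are made up for this example.

```python
def parse_counts_line(line):
    """Split one tab-separated bedGraph line written with --counts.
    Assumed layout: chrom, start, end, percentage, methylated, unmethylated."""
    chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
    return chrom, int(start), int(end), int(meth), int(unmeth)


def methylation_percentage(meth, unmeth, cutoff=1):
    """Return the methylation percentage, or None if the position has
    fewer than `cutoff` calls in total (or no calls at all)."""
    coverage = meth + unmeth
    if coverage == 0 or coverage < cutoff:
        return None
    return 100.0 * meth / coverage


chrom, start, end, meth, unmeth = parse_counts_line("chr1\t100\t101\t75\t3\t1\n")
print(chrom, start, methylation_percentage(meth, unmeth))
print(methylation_percentage(meth, unmeth, cutoff=5))  # below cutoff
```

Keeping the raw counts also allows replicates to be pooled by summing the methylated and unmethylated columns before dividing, which a percentage on its own does not permit.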

Bismark is available for download at www.bioinformatics.babraham.ac.uk/projects/

**shadow19c** · 10-04-2012, 07:02 AM

Hello everyone,
I'm new here and I have a lot of questions about BS-seq, and more precisely about Bismark.
I see that Bismark can be used to map paired-end reads, but do I have to remove adapters and short fragments beforehand?
If so, is there a program that does this, or should I simply remove them from my reads directly?

Thanks

**fkrueger** · 10-04-2012, 07:08 AM

It is highly recommended to remove adapters and poor quality portions from reads to increase the mapping efficiency and confidence in the methylation data.

A typical workflow would be:

Raw data --> FastQC (quality control) --> Trim Galore (adapter/quality trimming) --> Bismark (alignments) --> deduplication --> downstream analysis of your choice

Here is a guide-document explaining all these steps in more detail.

Best,
Felix

**shadow19c** · 10-05-2012, 04:23 AM

Hello,
thank you very much.
I have a question about deduplication: what does it mean?

If I am doing BS-seq in Arabidopsis thaliana, is deduplication needed?

**fkrueger** · 10-05-2012, 04:28 AM

Originally posted by shadow19c

I have a question about deduplication: what does it mean?

If your sample contains several reads that start and end at the very same position there is a good chance that you are not looking at genuinely unique sequences but that you are sequencing the same fragment that has been amplified by PCR multiple times over and over again. A deduplication step would reduce all alignments with the same genomic coordinates to a single read. Some more details should also be mentioned in the aforementioned guide document.
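
The principle can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the deduplication code we actually use: the dictionary layout of an alignment is hypothetical, and including the strand in the key is an assumption.

```python
def deduplicate(alignments):
    """Keep the first alignment seen for each set of genomic coordinates.
    Each alignment is a hypothetical dict with the keys
    'chrom', 'start', 'end' and 'strand'."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["end"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


alignments = [
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+", "id": "r1"},
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+", "id": "r2"},
    {"chrom": "chr1", "start": 100, "end": 260, "strand": "+", "id": "r3"},
]
print([aln["id"] for aln in deduplicate(alignments)])
```

Here r2 is dropped because it starts and ends at exactly the same position as r1, while r3 is kept because its end coordinate differs.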

**shadow19c** · 10-12-2012, 01:48 AM

Hello,
I have a question about the mapping parameters: what are the best settings if each read of a paired-end library is 90 bp long?
I ask because I have seen -I 150 and -X 300 being used.

**fkrueger** · 10-12-2012, 07:55 AM

I would personally use the defaults to start with (0-500 bp), since the size selection step often does not do quite what you would expect. Only come back and change them if you are trying to track down problems such as a low mapping efficiency.

**shadow19c** · 10-14-2012, 10:55 PM

Hello,
thank you for your answer. I ran the mapping with the default parameters:
Bismark report for: /data/a2e/kassam/BS-seq-WT/1.fq and /data/a2e/kassam/BS-seq-WT/2.fq (version: v0.7.7)
Bowtie was run against the bisulfite genome of /import/gr_a2e/TAIR9/ with the specified options: -q -n 1 -k 2 --best --maxins 500 --chunkmbs 512

1) Is it normal to get just one SAM file? I only have 1.fq_bismark_pe.sam.

-------------

Sorry, I found the answer: yes, it is.

------------------------------------------------------

2) I have a question about vertical coverage: how can I assess it after the mapping and the filtering?

Thanks

**fkrueger** · 10-15-2012, 12:47 AM

Originally posted by shadow19c

2) I have a question about vertical coverage: how can I assess it after the mapping and the filtering?

We primarily use SeqMonk for downstream analysis, which lets you identify regions with excessively high read coverage and exclude them from subsequent quantitations.

**shadow19c** · 10-15-2012, 01:53 AM

Does the --directional option make a difference for the methylation extractor?
