Hi Felix - thanks for the prompt reply, and sorry I forgot to include my library type/bowtie version. I was just making sure this was the intended behaviour before I updated my pipeline to ignore the /1 /2 tags in remove PCR duplicates - thanks
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Hi all
I am trying to re-align my reads again using a new assembly of the reference genome and I am finding many problems in the alignment step.
After running genome_preparation I am trying to run bismark for paired end reads trimmed with trim_galore (--phred64 -t --paired filename_2 filename_2). Right after the script creates the two version os the reads (c-t, a-g) it gives back a un ending number of errors. Many of them are related to the impossibility to determine the genomic context (due to Cs located at the end of the contigs) but others are quite new for me.
The resulting SAM file is fairly small and the number of Cs analyzed (from the report file) are ridiculously low; around 50 000.
I am using the last version of Bismark with the --bowtie2 option (also --botwtie2 option for the genome preparation step)
I am running in parallel an alignment with the same .fq files against a masked version of the genome and no errors so far (4 millions reads...)
Any thoughts?
Cheers
PS. some of the lines with multiple errors
Reading in the sequence files BPM24_FIS17_trim_galore_1.fq and BPM24_FIS17_trim_galore_2.fq
Processed 100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 203898.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 203898.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:2724:10559#ACCAGACT/1 scaffold_1495
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 223364.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 223364.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:7055:11458#ACCAGACT/1 scaffold_402
Processed 200000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 592170.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 592170.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:8410:27041#ACCAGACT/1 scaffold_1917
Processed 300000 sequences so far
Processed 400000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 957754.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 957754.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14266:27423#ACCAGACT/1 scaffold_1862
Processed 500000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1095048.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1095048.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14067:33341#ACCAGACT/1 scaffold_297
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1171846.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1171846.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:20587:36545#ACCAGACT/1 scaffold_391
Processed 600000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1323878.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1323878.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:10776:58374#ACCAGACT/1 scaffold_1028
Processed 700000 sequences so far
Processed 800000 sequences so far
Processed 900000 sequences so far
Processed 1000000 sequences so far
Processed 1100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 2259346.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 2259346.
Last edited by oria34; 07-05-2013, 10:53 AM.
Comment
-
Hi oria,
Interestingly I encountered the same error messages yesterday myself; they are caused by trying to print out the starting position of the read which aligns so close to the edge of a chromosome or contig/scaffold that additional 2bp sequence can't be extracted. These warnings are harmless though and can be ignored safely; I have still fixed this bug yesterday (if you want the new version just send me an email).
From the output below it seems that you are getting roughly 1 of these warning per 100000 analysed sequences, so this can hardly be the reason for poor overall mapping results. Are you by any chance running the other alignment to the repeat masked genome from the same files in the same folder? This would certainly have the potential to interfere with the results... In any case, if you continue to have problems can you email me further details of your problems? If necessary I could also create an FTP server for you to upload your files in question so I could take a look myself.
Comment
-
Thanks for your quick reply!
I am not really worried for the errors, I remember reading in this thread that they are due to the Cs located at the very end and, as you said, they are not relevant...Is the outcome what is killing me!
I have created two different folders for the two different alignments (unmasked and masked), each with copies of bismark´s scripts. Each folder has two subfolders, one per individual (two .fq files each)...I have even duplicated the bowtie2 folder just in case....
It is true that the two genomes are in different folders but each genome is used at the same time for the alignment of the different individuals (2x)...but this doesn´t seem to affect to the masked version of the genome.
I´ll post tomorrow how it is going.
Many thanks!
Comment
-
Hi again,
OK, I have some more news. I made a mistake checking online my runs ....the one that is running error-free is the alignment against the standard genome.
All the errors I´ve reported have been produced in the alignment run against the masked genome. These runs are now over (see report below) and they didn´t work at all.
To my understanding, Bowtie doesn´t have any type of restrictions in the use of masked genomes, does it? Do you think I could miss something during the genome preparation step?
Cheers
Bismark report for: M24_FIS17_trim_galore_1.fq and M24_FIS17_trim_galore_2.fq (version: v0.7.12)
Bowtie was run against the bisulfite genome of /Bisulfite/BISMARK_REALIGNMENT/MASKED_Version_Genome/ with the specified options: -q --phred64 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500
Option '--directional' specified: alignments to complementary strands will be ignored (i.e. not performed)
Final Alignment report
======================
Sequence pairs analysed in total: 62787165
Number of paired-end alignments with a unique best hit: 1587
Mapping efficiency: 0.0%
Sequence pairs with no alignments under any condition: 62785501
Sequence pairs did not map uniquely: 77
Sequence pairs which were discarded because genomic sequence could not be extracted: 0
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 557 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 1030 ((converted) bottom strand)
Number of alignments to (merely theoretical) complementary strands being rejected in total: 0
Final Cytosine Methylation Report
=================================
Total number of C's analysed: 50806
Total methylated C's in CpG context: 1118
Total methylated C's in CHG context: 99
Total methylated C's in CHH context: 172
Total C to T conversions in CpG context: 2166
Total C to T conversions in CHG context: 16576
Total C to T conversions in CHH context: 30675
C methylated in CpG context: 34.0%
C methylated in CHG context: 0.6%
C methylated in CHH context: 0.6%Last edited by oria34; 07-06-2013, 12:39 AM.
Comment
-
Good to hear that the alignments against the standard genome went well.
The errors from the masked genome are quite likely to be a result of the masking process itself, do you have an idea about how much sequence has been masked into Ns (presumably)?
As far as I am aware Bowtie2 supports indexing of Ns in there reference sequence (whereas Bowtie 1 does not), so in theory it should work in the same way. N's in the reads or reference do receive a mapping penalty as well though (albeit not as much as true mismatches), so if you simply have masked too much of your genome this might just exceed the fairly stringent penalty allowance of Bismark's default setting which is governed by the "--score_min" parameter. The longer your reads and the paired-end nature of your library could make it possible that effectively all reads would fall into a 'masked' region.
To see if it is the degree of masking you could just run a few sample reads (e.g. with the parameter --qupto 500000 for the firt 500K reads), run only Read 1 in single end mode and use --score_min L,0,-0.6. This should finish in a couple of minutes and tell you quickly if the mapping efficiency increases notably.
Just out of interest, why would you want to align reads to both a masked and unmasked genome in parallel (especially since the normal genome alignments were fine)?
Comment
-
Well, apparently errors started to appear also in the alignment run against the standard genome, but a bit later I left the office (see below). Not really concerning if I get a good enough mapping efficiency.... Still running
Well this is embarrassing…… No idea how "they" masked the genome for the interspersed repeats. I can't even ask till next Monday...Sorry for this also, but I have no idea how to extract the first 500K reads. I am quite newbie in all this
The reason for aligning to both versions of the genome is just to compare results...well, I am just curious about how the interspersed repeats can determine certain patterns of methylation in certain parts of a couple of particular chromosomes. I just thought it was worth the shot.
Processed 47700000 sequences so far
Processed 47800000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95642710.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95642710.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:2821:99210#TCTTATAT/1 scaffold_1878
Processed 47900000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95989234.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95989234.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:12271:44482#TCTTATAT/1 scaffold_232
Processed 48000000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96116284.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96116284.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2315:3124:23585#TCTTATAT/1 scaffold_297
Processed 48100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96230734.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96230734.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:15481:55756#TCTTATAT/1 scaffold_1105
Processed 48200000 sequences so far
Processed 48300000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96765254.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96765254.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2316:2905:27999#TCTTATAT/1 scaffold_1711
Processed 48400000 sequences so far
Processed 48500000 sequences so far
Processed 48600000 sequences so far
Processed 48700000 sequences so far
Processed 48800000 sequences so far
Processed 48900000 sequences so far
FYI - Updated info: The alignment to standard genome is now finished. It took around 17h and the mapping efficiency was fairly high....thanks again for this worderful tool! I copy/paste the report:
Final Alignment report
======================
Sequence pairs analysed in total: 48900263
Number of paired-end alignments with a unique best hit: 31991336
Mapping efficiency: 65.4%
Sequence pairs with no alignments under any condition: 13461118
Sequence pairs did not map uniquely: 3447809
Sequence pairs which were discarded because genomic sequence could not be extracted: 215
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 15981506 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 16009615 ((converted) bottom strand)
Number of alignments to (merely theoretical) complementary strands being rejected in total: 0
Final Cytosine Methylation Report
=================================
Total number of C's analysed: 1298097901
Total methylated C's in CpG context: 122944840
Total methylated C's in CHG context: 870088
Total methylated C's in CHH context: 2311989
Total C to T conversions in CpG context: 71226620
Total C to T conversions in CHG context: 301248948
Total C to T conversions in CHH context: 799495416
C methylated in CpG context: 63.3%
C methylated in CHG context: 0.3%
C methylated in CHH context: 0.3%Last edited by oria34; 07-06-2013, 05:19 AM.
Comment
-
Good news. So apparently the warning messages you saw amounted to a total of 215 (out of 49 million read pairs), indeed something you should be able to live with.
Bismark has an option '--qupto' to just use the first so many reads; just type bismark --help to see further explanations on it
So you could run single end alignments against your masked genome for the first 500000 reads with not-so-stringent criteria using the following command:
bismark --bowtie2 --qupto 500000 --score_min L,0,-0.6 /path/to/genome/ file_val_1.fq
Comment
-
M-Bias plot and alignment stats pie chart
We have just released a new version of Bismark (v0.8.0). The main purpose of this release is the introduction of an M-bias plot for the methylation extractor (thanks to K.D. Hansen for the suggestion, see reference below). The M-bias plot allows researchers to see whether there are fundamental technical biases in the methylation calling of their reads.
Attached are two examples:
a) the profile of a PBAT-Seq library (read1) which should in theory start with a 'random' 4-mer oligo that was used to bind to bisulfite converted reads. The M-bias (and also base composition profiles of these libraries as judged by FastQC (not shown here)) do show some serious biases primarily within, but not limited to, the first 4bp of the library.
b) Read 2 of any paired-end BS-Seq library. Due to the end-repair reaction and A-tailing procedure for Illumina libaries, the 3' ends of sonicated fragments are filled in by the Klenow enzyme with unmethylated cytosines. These will then be converted to Ts in the bisulfite treatment, resulting in a very high amount of apparently unmethylated cytosines at the first couple of positions of Read 2. (The problem is somewhat reminiscent of the fill-in problem in RRBS libraries).
In addition to just the methylation profile, the M-bias plot also shows the actual number of events for an entire read file. This should enable investigators to make an informed decision of whether to remove biased positions in the reads, e.g. by using the option --ignore <int> of the methylation extractor or by hard-trimming read 2 by 2bp prior to running the alignments, or whether it is OK to live with what they observe.
Other changes in the latest version include:
• Bismark: Added new option '--prefix' to add a prefix to the output filenames. For example, '--prefix test' with 'file.fq' would result in the output file 'test.file.fq_bismark.sam' etc.
• Bismark: Fixed a warning message that occurred when chromosomal sequences could not be extracted in paired-end Bowtie2 mode
• Bismark: will now generate a pie chart with the alignment statistics once a run has finished; this allows to get a quick overview of how many sequences aligned uniquely or sequences that did not align, either due to producing no alignment at all, multiple mapping or because it was impossible to extract the chromosomal sequence
• Methylation Extractor: upon completion, the methylation extractor will now produce an M-bias (methylation bias) plot, which shows the methylation proportion across each possible position in the reads (described in: Hansen et al., Genome Biology, 2012, 13:R83). The data for the M-bias plot will be written into a text file (to generate graphs by alternative means) and drawn into a .png file. The plot also contains the absolute number of methylation calls per position (methylated + unmethylated)
Bismark is avaialable for download here: https://www.bioinformatics.babraham....jects/bismark/
Comment
-
Hi Felix,
Sorry for the delay in replying. For some reason I couldn't use the --qupto option, Bismark gives back an error message:
Unknown option: qupto
Please respecify command line options
Any thoughts?
Thanks for the new release of Bismark! This new M-bias tool is really helpful!
Also, I am looking for some advice regarding the post processing of all the methylation calls. In order to have a statistically based tool to decide whether a C is actually a methylated C, I've decide to perform the well-known Binomial probability test.
When I started in my present position, all the sequencing experiments were already performed and the data ready to be analyzed so, to my knowledge, no lambda control sequence exists. The question is, Is there any straightforward way to use the methylation extractor output as a input file for any other soft performing this Binomial probability test?
I would be very grateful in receiving some advice on this
Thanks a lot
Comment
-
Hi oria,
Sorry the upto command is actually:
Code:-u/--upto <int> Only aligns the first <int> reads or read pairs from the input. Default: no limit.
We personally use SeqMonk for most downstream methylation analyses, so I am probably not the best person to advise you on other downstream applications. There are couple of packages though that devoted to such analyses, e.g. methylKit by A. Akalin or the Bioconductor package bsseq by K.D. Hansen.
Comment
-
We just released a new version of Bismark (v0.8.1) which addresses a few minor issues such as plot colours and legends, but notably also extends the '--ignore' option of the methylation extractor to now work independently for Read 1 and Read 2 of paired-end files. This allows more flexible control of removing biases that can now be visualised via the M-bias plot. Here are the current changes in more detail:
• Methylation Extractor: Changed the function of the option '--ignore <int>' to ignore the first <int> bp from the 5' end of single-end reads or Read 1 of paired-end files. In addition, added a new option '--ignore_r2 <int>' to ignore the first <int> bp from the 5' end of Read 2 of paired-end files. Since the first couple bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of the end-repair of sonicated fragments with unmethylated cytosines (see M-bias plot), it is recommended that the the first couple of bp of Read 2 are removed before starting downstream analysis. Please see the section on M-bias plots in the Bismark User Guide for more details
• Methylation Extractor: Changed colours, legends and background colour of the M-bias plot
• Bismark: Changed the way in which the alignment overview file is being named to actually work
Together with the options '--clip_r1' and '--clip_r2' of Trim Galore one can now choose to remove known biases, e.g. for PBAT-Seq, prior to running the alignments using Trim Galore, or ignoring biased methylation calls after the alignments have been carried out using the methylation extractor.
Bismark can be downloaded here: https://www.bioinformatics.babraham....jects/bismark/
Comment
-
Originally posted by frozenlyse View PostHi Felix,
Any chance read groups could be incorporated into the output bams in a future version of bismark?
Cheers!
Sorry but I am not too familiar with the concept of read groups in the output BAMs, would that be more or less just adding an @RG tag and ID into the header section? Or would you want to specify more details yourself?
Comment
Latest Articles
Collapse
-
by seqadmin
Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...-
Channel: Articles
10-18-2024, 07:11 AM -
-
by seqadmin
Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.
Nobel Prize for MicroRNA Discovery
This week,...-
Channel: Articles
10-07-2024, 08:07 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks
by seqadmin
Started by seqadmin, Yesterday, 05:31 AM
|
0 responses
10 views
0 likes
|
Last Post
by seqadmin
Yesterday, 05:31 AM
|
||
Started by seqadmin, 10-24-2024, 06:58 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
10-24-2024, 06:58 AM
|
||
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types
by seqadmin
Started by seqadmin, 10-23-2024, 08:43 AM
|
0 responses
48 views
0 likes
|
Last Post
by seqadmin
10-23-2024, 08:43 AM
|
||
Started by seqadmin, 10-17-2024, 07:29 AM
|
0 responses
58 views
0 likes
|
Last Post
by seqadmin
10-17-2024, 07:29 AM
|
Comment