Seqanswers Leaderboard Ad

**fkrueger** · 07-05-2016, 01:17 PM

Hola Diego,

The typical directional libraries are a result of the following procedure: the DNA is sheared and blunt-ended and the Illumina adapters are ligated onto the reads. This is done in a fashion so that a specific adapter sequence binds to the 5' end of the reads (from both the forward and reverse strand), and a different primer binds to the 3' ends. This works because the first part of the adapter is initially complementary but they then fork so that they are no longer complementary. It is exactly this forking part of the adapter that will only allow the former top and bottom strand reads that have the adapter on their 5' end to anneal to the primers on the flow cell - and these get sequenced as Read 1 (of paired end files) or as the only read of single-end sequencing.

Amplicon sequencing is indeed different because this normally works by bisulfite converting and then amplifying a specific sequence. Because this is usually done for the top strand only I would expect your sequences to align to either the OT or CTOT strands, but this depends a bit on the kit design. I am happy to advise on that if you could send me the mapping reports of your --non_directional run.

And to your last point: No, single-end and paired-end reads should not be combined but run individually. (Having said that the reduplication and methylation extraction steps should detect automatically which kind of file they are working with, but if I were you I would still keep them separate to be safe). If you have sequenced the same sample several times with SE and PE reads you could merge the results e.g. at the bedGraph conversion stage once you have deduplicated and extracted the files individually.

Hope this helps getting you started, Cheers, Felix

**dzavallo** · 07-06-2016, 07:10 AM

Originally posted by fkrueger View Post

Hola Diego,

The typical directional libraries are a result of the following procedure: the DNA is sheared and blunt-ended and the Illumina adapters are ligated onto the reads. This is done in a fashion so that a specific adapter sequence binds to the 5' end of the reads (from both the forward and reverse strand), and a different primer binds to the 3' ends. This works because the first part of the adapter is initially complementary but they then fork so that they are no longer complementary. It is exactly this forking part of the adapter that will only allow the former top and bottom strand reads that have the adapter on their 5' end to anneal to the primers on the flow cell - and these get sequenced as Read 1 (of paired end files) or as the only read of single-end sequencing.

Amplicon sequencing is indeed different because this normally works by bisulfite converting and then amplifying a specific sequence. Because this is usually done for the top strand only I would expect your sequences to align to either the OT or CTOT strands, but this depends a bit on the kit design. I am happy to advise on that if you could send me the mapping reports of your --non_directional run.

And to your last point: No, single-end and paired-end reads should not be combined but run individually. (Having said that the reduplication and methylation extraction steps should detect automatically which kind of file they are working with, but if I were you I would still keep them separate to be safe). If you have sequenced the same sample several times with SE and PE reads you could merge the results e.g. at the bedGraph conversion stage once you have deduplicated and extracted the files individually.

Hope this helps getting you started, Cheers, Felix

Felix, thanks for your promt reply!

Here is the mapping report with non_directional optional, and you were right! Most reads mapped to OT and CTOT. But I still dont undersand why is that. It is because de bisulfite sodium only treat the top strand of the DNA? Is that what you mean?

As you might notice, the mapping efficiency is near 0%. This library is a test with only two amplicons and I'm still optimazing the protocol. I dont know exactly why is that, but in the report from the fastqc analisis on the fastq reads the overpresetated reads were only a few. I'm attaching the report so you can see it.
Nevertheless, when I ran the methylation_extractor I did find my sequence of interest, but with very low coverage.
Could you give a hint on that?
Thank you very much for your help

Gracias!

Bismark report for: /home/zavallo.diego/Downloads/Methylation_analisis/prueba/OR1_paired.fastq and /home/zavallo.diego/Downloads/Methylation_analisis/prueba/OR2_paired.fastq (version: v0.16.1)
Bismark was run with Bowtie 2 against the bisulfite genome of /home/zavallo.diego/Downloads/Arabidopsis/Arabidopsis_Genome/Arabidopsis_genome/ with the specified options: -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --no-mixed --no-discordant --maxins 500
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)

Final Alignment report
======================
Sequence pairs analysed in total: 1238610
Number of paired-end alignments with a unique best hit: 252
Mapping efficiency: 0.0%
Sequence pairs with no alignments under any condition: 1238263
Sequence pairs did not map uniquely: 95
Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 120 ((converted) top strand)
GA/CT/CT: 117 (complementary to (converted) top strand)
GA/CT/GA: 7 (complementary to (converted) bottom strand)
CT/GA/GA: 8 ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 21576

Total methylated C's in CpG context: 2708
Total methylated C's in CHG context: 931
Total methylated C's in CHH context: 2266
Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 693
Total unmethylated C's in CHG context: 2050
Total unmethylated C's in CHH context: 12928
Total unmethylated C's in Unknown context: 0

C methylated in CpG context: 79.6%
C methylated in CHG context: 31.2%
C methylated in CHH context: 14.9%
Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

Attached Files

Methylation_analisis.tar.gz (349.0 KB, 20 views)

**fkrueger** · 07-06-2016, 02:53 PM

Hi Diego,

I suggest you really take one region of interest and draw it out on a sheet of paper. Once you take a DNA strand and bisulfite convert it it loses its complementarity to the former complementary strand (except for a few exceptions such as AT only sequences or the like). Now if you take that bisulfite converted forward strand sequence and design primers against it, the forward primer will probably have the same sequence as the converted forward strand, and the reverse oligo will be the reverse complement of that converted sequence. Due to the conversion you only amplify the top strand (OT) and not the bottom strand anymore (OB). If you then ligate adapters onto the PCR product you should see the OT strand in 50% of cases, and the sequence complementary to it also in 50% of cases (CTOT). Or just take my word for it ☺

Regarding your sequencing: 0% mapping efficiency is really very bad. This could have several reasons: Was the data adapter/quality trimmed with Trim Galore? There is a rather strong sequence bias in the first positions, is that because of the adapter? The sequences are very long, how long are your amplicons? You could try aligning only R1 in single-end mode to see if that achieves a higher mapping rate. You could try hard-trimming the reads to say 100bp to if they align better then. Or try to see if the sequences are contaminated with something else. If you wanted to you could send me R1 (as gzipped fastq) as email attachment and I could have a quick go at this if you like. Cheers, Felix

**dzavallo** · 07-07-2016, 07:15 AM

Hi Felix!

Now I'm get it! Sorry, I totally missed the primers part. Next I'll need puppets to undestand it.
Thanks!

Regarding the mapping: I trimmed the raw data with trimmomatic, using an adapter file that I create for NEBNext library (attached as .txt). When I ran the files wiht Fastqc pre and post runing the trimmomatic, the adapter is deleated. (I show you the analisis for R1, pre and post trmimming).
Theoretically, there are two sizes of amplicons, 300 and 600 bp, corresponding to two regions from one gen (an intron and the promoter). when I ran the methylation_extractor I could find these regions, with an average of 50-100 coverage, but I also found another regions that I didnt amplified, although with 1 or 2 read coverage only.
Once the protocol for the libraries construction are optimize, the idea is to have a coverage around 100X for each amplicon. This run was made in parallel with an RNA-Seq experiment of another group. They let me test my libraries to see if it works and to adjust the amounts for the coverage. This over 1 millon reads are way out of our proportion.

I cant attach the R1 fastq file because is to large (gzipped is over 200MB), maybe google drive?
I ran the R1 single-stranded and the percentage were very similar, only a few more reads; here are the stats:

Bismark report for: /home/zavallo.diego/Downloads/Methylation_analisis/OR1_paired.fastq (version: v0.16.1)
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)
Bismark was run with Bowtie 2 against the bisulfite genome of /home/zavallo.diego/Downloads/Arabidopsis/Arabidopsis_Genome/Arabidopsis_genome/ with the specified options: -q -N 1 --score-min L,0,-0.2 -p 4 --reorder --ignore-quals

Final Alignment report
======================
Sequences analysed in total: 1238610
Number of alignments with a unique best hit from the different alignments: 299
Mapping efficiency: 0.0%
Sequences with no alignments under any condition: 1237907
Sequences did not map uniquely: 404
Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 130 ((converted) top strand)
CT/GA: 29 ((converted) bottom strand)
GA/CT: 123 (complementary to (converted) top strand)
GA/GA: 17 (complementary to (converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 13481

Total methylated C's in CpG context: 1487
Total methylated C's in CHG context: 599
Total methylated C's in CHH context: 2173
Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 405
Total unmethylated C's in CHG context: 1087
Total unmethylated C's in CHH context: 7730
Total unmethylated C's in Unknown context: 0

C methylated in CpG context: 78.6%
C methylated in CHG context: 35.5%
C methylated in CHH context: 21.9%
Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

Thanks again for all your help
Diego

Attached Files

**fkrueger** · 07-07-2016, 07:41 AM

Hi Diego,

The adapters used look like standard Illumina sequencing adapters. I am not sure how you used Trimmomatic, but I would recommend using Trim Galore for this puropose because it is optimized exactly for that purpose. The command for your case would simply be:

Code:

trim_galore --paired file1.fq.gz file2.fq.gz

As a general thing I am not sure that it is a good idea to mix different sample types together (e.g. see here: https://sequencing.qcfail.com/articl...ion-artefacts/) but here you just wanted to check something. Depending on whether the other samples were from the same organism or not you might see cross mapping artefacts though.

I have just sent you details for an FTP server to upload your raw FastQ files. If you could upload them I would try to get back to you by tomorrow should I find something.

Cheers, Felix

**dzavallo** · 07-07-2016, 08:15 AM

Thanks Felix,

I'll try trim_galore for trimming

Regarding to the library construction, I'm not mixing different organisim or treatments. Only different amplicons from the same organism (Arabidpsis) in one particular condition (viral infection). The different conditions would be contructed in separated libraries.

I'm uploading the raw data to our server and send it to you soon.
Thanks a lot

Cheers, Diego

**dzavallo** · 07-07-2016, 12:55 PM

Felix, I have just send you a private message with information on the files. Please let me know if you got them.

Cheers, Diego

**fkrueger** · 07-09-2016, 12:05 AM

Hi Diego,

I downloaded your files and tried several things with them. I trimmed them with Trim Galore which removes all traces of adapter contamination, and the quality looks fine indeed. There is a weird spiky bias at the start, somewhat reminiscent of what we see in RNA-Seq. other than that the sequence composition profiles are pretty flat throughout. This could be an effect of the non-directional nature of your sample, or the data could also be non-bisulfite converted DNA but we simply don’t know from which organism.

I have used FastQ screen and some manual alignments to try to identify a potential contaminant, but I couldn’t find anything prominent. I checked human, mouse, rat, TAIR10, Lambda, PhiX, E. coli, C. elegans and a few others, so if there is a contaminant then it is none if these ‘usual suspects’.

Since the sequences are very long I also tried to hard-trim them to 50bp (from their 3’ end) but that didn’t much. I also removed the first 15 or 30bp do get rid of the spikey bias, and tried aligning the data in a more relaxed fashion (score_min L,0,-0.6) but even that did not increase the mapping efficiency to more than 0.2%.

To cut a long story short I really don’t know what’s the reason why you don’t see more alignments. I went back to look at the GC content of and saw that there are two prominent and distinct peaks in there, one at around 30% GC which I reckon would be the overall GC content of A. thaliana, and there is a second bigger peak around 50%. I bet that this is the main source of problem in the sample, and it would be good if you could identify what it actually is.

You said that your samples were infected with a virus, which virus was that and is there a genome for it? Alternatively it be worth going back to your sequencing facility to ask them which kind of other samples they sequence regularly, maybe another plant species of whatever else you sequence in South America :P

Let me know if I can try out anything else,
Best wishes, Felix

**dzavallo** · 07-10-2016, 08:57 AM

Hi Felix,

thanks for all the test you did with my samples. I'll talk to the people from my sequencing facility. But now that you mention it, my samples were ran along with a transcriptome RNA-Seq experiments from an insects that infects Maize (Delphacodes kuscheli). Could that be the contamination? Unfortunately its not sequenced yet, but I'll give the data to my coworker and see if that sequences are from that organism.
The virus in my experiments is the ORMV (oil seed mosaic virus) and it's sequenced.
Tomorrow I'll analyze another set of data (from another virus) that have much less sequences. I though that having less sequeces was bad, but maybe they are just not contaminated.

Thanks again for the help

Cheers, Diego

**fkrueger** · 07-10-2016, 10:18 AM

Hi Diego, I would bet a small amount of money that either the maize bug or your virus are the contaminants you are looking for, definitely something you can try to improve on for your follow-up experiments.... Good luck! Felix

**dzavallo** · 07-11-2016, 10:04 AM

Hi Felix, you were right! I hope you bet a lot of money!
I talk to my coworker and to the people at our sequencing facility and we realize that we use the same index for both libraries. Most of the sequences belong to his bug

Thanks again for all your help

Cheers!

Diego

**dzavallo** · 07-11-2016, 11:51 AM

Hi Felix, sorry to bother you again. I ran the bedGraph2coverage script in order to use the output as input for DMRcaller program from Bioconductor

Code:

coverage2cytosine --CX --split_by_chromosome --genome_folder /home/zavallo.diego/Downloads/Arabidopsis/Arabidopsis_Genome/Arabidopsis_genome/ CG1_paired_bismark_bt2_pe.bedGraph.gz -o bedGraph2Coverage_CG

It generates the files, but it looks like it doesnt make the methylation calling. Also during the run, these "errors" appears several times:

Use of uninitialized value $nonmeth in join or string at /tools/bin/coverage2cytosine line 495, <IN> line 315.
Use of uninitialized value $meth in join or string at /tools/bin/coverage2cytosine line 495, <IN> line 315.

and also:

Use of uninitialized value $nonmeth in join or string at /tools/bin/coverage2cytosine line 660, <IN> line 386.
Use of uninitialized value $meth in join or string at /tools/bin/coverage2cytosine line 660, <IN> line 386.

The program end well, but I can't "see" the coverage

take a look at this:

PHP Code:


Chr3    157845    +    0    0    CHH    CAC
Chr3    1157845    +    0    0    CHH    CCA
Chr3    1578453    -    0    0    CHH    CTT
Chr3    1578455    -    0    0    CHH    CAC
Chr3    1578456    +    0    0    CHH    CAT
Chr3    5157845    -    0    0    CHH    CAA
Chr3    7157845    +    0    0    CG    CGT
Chr3    10157845    -    0    0    CHH    CCA
Chr3    11157845    -    0    0    CHH    CAA
Chr3    11578450    -    0    0    CHH    CAT
Chr3    11578453    -    0    0    CHH    CTT
Chr3    11578459    +    0    0    CHH    CTT
Chr3    15784500    -    0    0    CG    CGG
Chr3    15784505    +    0    0    CHH    CTC
Chr3    15784507    +            CHH    CCC
Chr3    15784508    +            CHH    CCC
Chr3    15784509    +            CHH    CCT
Chr3    15784510    +    0    0    CHH    CTA
Chr3    15784513    -    0    0    CHH    CTA
Chr3    15784514    -            CHH    CCT
Chr3    15784515    +            CHH    CCC
Chr3    15784516    +            CHG    CCG
Chr3    15784517    +    0    0    CG    CGA
Chr3    15784518    -    0    0    CG    CGG
Chr3    15784522    +            CHH    CCA
Chr3    15784523    +    0    0    CHH    CAC
Chr3    15784525    +    0    0    CHH    CAT
Chr3    15784528    +    0    0    CHH    CAT
Chr3    15784531    -    0    0    CHH    CAT
Chr3    15784540    -    0    0    CHH    CTA
Chr3    15784544    +    0    0    CG    CGT
Chr3    15784545    -    0    0    CG    CGA
Chr3    15784547    +            CHH    CCC
Chr3    15784548    +            CHH    CCA
Chr3    15784549    +    0    0    CHH    CAA
Chr3    15784552    -            CHH    CTT
Chr3    15784553    +    0    0    CG    CGT
Chr3    15784554    -    0    0    CG    CGC
Chr3    15784558    +    0    0    CHH    CTC
Chr3    15784560    +    0    0    CG    CGA
Chr3    15784561    -    0    0    CG    CGA
Chr3    15784564    +    0    0    CHG    CTG
Chr3    15784566    -            CHG    CAG
Chr3    15784567    +            CHG    CCG
Chr3    15784568    +    0    0    CG    CGA
Chr3    15784569    -    0    0    CG    CGG
Chr3    15784571    +            CHH    CCT
Chr3    15784572    +    0    0    CHH    CTC
Chr3    15784574    +    0    0    CHH    CTA
Chr3    15784578    -            CHH    CTT
Chr3    15784579    +    0    0    CHH    CTT
Chr3    15784582    +    0    0    CG    CGC
Chr3    15784583    -            CG    CGA
Chr3    15784584    +            CHH    CCT
Chr3    15784585    +    0    0    CHH    CTA
Chr3    15784588    -    0    0    CHH    CTA
Chr3    15784589    -    0    0    CHH    CCT
Chr3    15784590    -    0    0    CHH    CCC
Chr3    15784592    +    0    0    CHH    CTT
Chr3    15784597    +            CHH    CCA
Chr3    15784598    +    0    0    CHH    CAA
Chr3    17157845    +    0    0    CHG    CTG
Chr3    19157845    +    0    0    CHH    CTT
Chr3    20157845    +    0    0    CG    CGA
Chr3    21157845    +    0    0    CHH    CCT
Chr3    21578454    +    0    0    CHH    CTC
Chr3    21578456    +    0    0    CHH    CAA
Chr3    21578459    -    0    0    CHH    CTT

Are these "gaps" (no 0) ok?
I think that my bedGraph file is ok.
What do you think?

Thanks

Diego

Attached Files

CG1_paired_bismark_bt2_pe.bedGraph.gz (2.9 KB, 21 views)

**fkrueger** · 07-11-2016, 01:09 PM

Sorry it is quite late now over here, I'll take a look tomorrow.

**fkrueger** · 07-12-2016, 01:34 AM

Hi Diego,
You used the bedGraph file as input but you need to feed it the coverage file because this is the one that contains the counts for methylated and unmethylated calls (the bedGraph file only contains a percentage methylation). Cheers, Felix

**fkrueger** · 07-26-2016, 02:53 AM

We have just posted a new version of Bismark (v0.16.3) which is available from https://github.com/FelixKrueger/Bismark or http://www.bioinformatics.babraham.a...jects/bismark/. It includes essential bug fixes regarding ambiguous alignments in Bowtie 2 mode that had arisen in version 0.16.0, and includes several new convenience options, details are listed below:

Bismark: Essential fixes (2 in total) to address a bug for Bowtie 2 alignments where reads that should be considered ambiguous were incorrectly assigned to the first alignment thread. These errors had crept in during releases 0.16.0 and 0.16.2). More info available on Github
Bismark: Added support for large Bowtie (1) index files ending in .ebwtl which had been added in Bowtie v1.1.0

Changed the Shebang in all scripts of the Bismark suite to #!/usr/bin/env perl instead of #!/usr/bin/perl
deduplicate_bismark: Does now bail with a useful error message when the input files are empty

bismark_genome_preparation: Added new option '--genomic_composition' so that the genomic composition can be calculated and written right at the genome preparation stage rather than by using bam2nuc

bam2nuc: Now also calculates a fold coverage for the various (di-)nucleotides. The changes in the nucleotide_stats text file are also picked up and plotted by bismark2report
bam2nuc: Added a new option '--genomic_composition_only' to just process the genomic sequence without requiring any data files

bismark2summary: Added option -o/--basename FILENAME to specify a certain filename. If not specified the name will remain bismark_summary_report.txt/html
bismark2summary: Added documentation and the options '--help' and '--version' to be consistent with the rest of Bismark
bismark2summary: Added option '--title STRING' to give the HTML report a different title

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News