Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • This is indeed a somewhat odd ratio of strands, it reminds me very much of the single-cell BS-Seq technique which undergoes several rounds of PBAT-type pull-down and strand-regeneration. I have not heard of an RRBS-method that would be doing this, do you happen to know how these libraries were generated (e.g. using the Zymo Pico Methyl-Seq kit?).
    I would probably just try to align only Read 1 (also in --non-directional mode) just to see if the mapping efficiency goes up a bit because 38% seems rather low).

    In principle it is fine to merge files at the BAM level and then feed the merged file to the methylation extractor. A word of caution is in order for paired-end end files though if you are using Samtools merge because this merging does not normally guarantee that Read 1 and Read 2 are kept in the same order in the following file unless you specify the option -n:
    Code:
    Options: -n       sort by read names
    Similarly, already merged files my be sorted by read name using
    Code:
    samtools sort -n
    to bring them back into the correct format.

    If the R1/R2 order is muddled with then the overlap detection and the strand assignments would go a little mad so this is to be avoided.

    I generally tend to merge files right from the start at the FastQ file level because this also reduces the number of files generated in the process considerably. Hope this helps, Felix

    Comment


    • Felix, thanks for the reply.

      Yes this library was prepared with the Zymo Meth-Seq kit as you have guessed. I have been thinking perhaps the library was not prepared that well. I ran a assessment on a raw sample in Fastqc and got the following base sequence content:



      The % of cytosine is awfully high compared to other reports I have seen and considering this is RRBS data.


      Actually, I have been simply concatenating the output files from the alignment step for input into the methylation extractor rather than using any sort of merging tool (i.e. cat sample1_1.sam sample1_2.sam > sample1). Would this cause the extractor to make incorrect methylation calls? I'm going to switch to combining fastq (with cat) instead like you have suggested to avoid this issue altogether.
      Attached Files

      Comment


      • Given that the majority of reads come from the CTOT and CTOB strands it is to be expected that C is high while G is low. The fraction aligning to the OT and OB strands would do the exact opposite, so I think overall the picture you are seeing is quite alright.

        If you were operating on SAM files then cat should have worked (because this works on a line by line basis and doesn't do any clever (or weird) things as samtools merge does without -n. You only might have extra header lines in the middle somewhere but they should be ignored by the methylation extractor. Merging right from the start is probably the cleanest solution though.

        Comment


        • Dear Felix

          I am getting really low mapping efficiencies nearly 37% when i am using Bismark ( bowtie1, n=3).
          when I am using BSmap with same sequences (pair end) on the same parameter, i am getting mapping efficiency nearly 80%.

          But the problem is with bsmap i am further struggling for data analysis, like how to get methylated and non methylated reads from the .txt file.

          So plz suggest me for better mapping efficiency. it will be really helpful for me.

          Thank you

          Comment


          • Originally posted by alim123 View Post
            Dear Felix

            I am getting really low mapping efficiencies nearly 37% when i am using Bismark ( bowtie1, n=3).
            when I am using BSmap with same sequences (pair end) on the same parameter, i am getting mapping efficiency nearly 80%.

            But the problem is with bsmap i am further struggling for data analysis, like how to get methylated and non methylated reads from the .txt file.

            So plz suggest me for better mapping efficiency. it will be really helpful for me.

            Thank you
            I am copying below the reply I have sent you via email already:

            Hi Alim,

            In order to give you a better diagnosis of the issues it would be good to have more information about your experiment, such as the sequencing length, the type of library, the genome in question and the version numbers of all programs involved etc. It would also be good to get access to the FastQC profiles of your samples.

            General rules you should adhere to are:

            1) Adapter and quality trimming, e.g. using Trim Galore
            2) Use Bowtie2 mode (which is the default mode anyway)
            3) Update to the latest version (v0.16.1)

            More detailed hands-on information may also be found here:

            or here:


            Hope this helps,
            Best, Felix

            Comment


            • Thanks for sending some QC plots via email. The images were fairly small but it looks like you are using 2x100bp reads generated with the EpiGnome kit (there is a hefty bias at the start of reads that needs to be removed before mapping.

              Also you are using a version of Bismark that is almost 3 years and 22 releases outdated. Just get the latest versions here: https://github.com/FelixKrueger/Bismark or here: http://www.bioinformatics.babraham.a...jects/bismark/.

              You then need to run Trim Galore with --clip_r1 6 --clip_r2 6 (potentially more), and then run the alignments using Bismark v0.16.1 in default mode.

              Also please make sure to read up on biases using paired-end PBAT at QC Fail. I am pretty sure that you will see nice mapping efficiencies after following the steps in the cookbooks attached. Cheers, Felix

              Comment


              • Sir, one more question, as you suggest to trim the sequences further..So would i used raw seq or already trimmed sequence for further trimming.

                Thank you

                Comment


                • Hi Felix,

                  I was sent some bismark outputs (+ reports) to analyze. I am looking at the mapping reports, and I wondering if the ratio of mapped reads between OT, OB, CTOT, CTOB is strange or not. Can I conclude that the library was constructed in a directionnal manner? If so, I would have expected to see less reads aligned to CTOT, CTOB.


                  Final Alignment report
                  ======================
                  Sequences analysed in total: 44844469
                  Number of alignments with a unique best hit from the different alignments: 30974987
                  Mapping efficiency: 69.1%
                  Sequences with no alignments under any condition: 4171455
                  Sequences did not map uniquely: 9698027
                  Sequences which were discarded because genomic sequence could not be extracted: 36

                  Number of sequences with unique best (first) alignment came from the bowtie output:
                  CT/CT: 15485820 ((converted) top strand)
                  CT/GA: 15393574 ((converted) bottom strand)
                  GA/CT: 47834 (complementary to (converted) top strand)
                  GA/GA: 47723 (complementary to (converted) bottom strand)

                  Thanks!

                  Amira

                  Comment


                  • Hi Amira,

                    The numbers do indeed suggest that the library was constructed in a directional manner. The alignments to the CTOT and CTOB strands are < 0.3% so this is not terrible (and it is quite expected).

                    To move on I would either suggest you re-align the data in the default mode, i.e. without --non_directional specified, or if you wanted to continue with the data as it is I would discard the CTO* files before continuing with the bismark2bedGraph step.

                    Cheers, Felix

                    Comment


                    • Hi Felix,

                      Thanks for the quick response!

                      I have another question: I recently tried to understand why my paired-end BS-seq data maps with a mapping efficiency of 40% (it seemed rather low to me), so I conducted some tests as suggested in previous messages dealing with low mappability efficiency. One thing I did is map a subset (-u 1000000) of read 1 and read 2 individually in single end mode to see if the mapping efficiency would go up. Plus, to be safe, I ran the mapping in --non_directional mode because I am not sure of the library type.

                      Here's the part I don't understand: I have different OT, OB, CTOT, CTOB ratios between read1 and read2.

                      Mapping report for read1
                      Final Alignment report
                      ======================
                      Sequences analysed in total: 1000000
                      Number of alignments with a unique best hit from the different alignments: 602427
                      Mapping efficiency: 60.2%
                      Sequences with no alignments under any condition: 177212
                      Sequences did not map uniquely: 220361
                      Sequences which were discarded because genomic sequence could not be extracted: 0

                      Number of sequences with unique best (first) alignment came from the bowtie output:
                      CT/CT: 299725 ((converted) top strand)
                      CT/GA: 302357 ((converted) bottom strand)
                      GA/CT: 150 (complementary to (converted) top strand)
                      GA/GA: 195 (complementary to (converted) bottom strand)

                      Mapping report for read2
                      Final Alignment report
                      ======================
                      Sequences analysed in total: 1000000
                      Number of alignments with a unique best hit from the different alignments: 457737
                      Mapping efficiency: 45.8%
                      Sequences with no alignments under any condition: 365712
                      Sequences did not map uniquely: 176551
                      Sequences which were discarded because genomic sequence could not be extracted: 1

                      Number of sequences with unique best (first) alignment came from the bowtie output:
                      CT/CT: 3950 ((converted) top strand)
                      CT/GA: 3937 ((converted) bottom strand)
                      GA/CT: 223725 (complementary to (converted) top strand)
                      GA/GA: 226124 (complementary to (converted) bottom strand)

                      Have you come across a similar situation? what could cause such discordance?

                      Thanks again,
                      Amira

                      Comment


                      • Hi Amira,

                        The information from which a sequence originated is determined by Read 1 (in single-end sequencing there is only a Read 1). Read 2 is simply a PCR amplification of the Read 1 fragments, and is this by definition the reverse complement of R1. So if Read 1 aligns to the OT strands its Read 2 will align to the CTOT strand, and the same for OB and CTOB. In a paired-end mapping scenario Bismark takes care of this for you, so it is all fine. Maybe with the exception that your Read 2 features a much lower mapping efficiency, which is most likely also the reason why your overall efficiency is rather low.

                        Comment


                        • Hi Felix,

                          I clearly misunderstood the origin of read pairs during bisulfite sequencing. It's clear, thank you !

                          So based on this test, if I try to explain the overall low mapping efficiency, I am basically losing reads (Read1), dispite them mapping correctly in signle mode, mainly because theirs mates (Read2) for some reason do not map.

                          I read that Bismark could find alignments for individual mates if no concordant or discordant alignment is found. However, the invariate option --no-mixed prevents it. What's the rational behind it? I think it could allow us to get more information.

                          Before my next steps (deduplication, methylation extraction), I think the best I could do is to add the extra reads I was able to extract from the left-over reads I get using --unmapped when runing the initial PE alignment.

                          Comment


                          • Yes it does indeed appear as if Read 2 is letting the read pairs as whole fail even though Read 1 might still align fine. You could indeed run the paired-end mode with --unmapped, and then run R1 in single end mode (default mode, directional) and Read 2 in single-end mode using the option --pbat (to just get the complementary reads). The rationale why Bismark is running in --no-mixed mode is mainly that we are expecting properly paired fragments and didn’t want any sort of weird stuff align to the genome as well.

                            There are some other things to consider as well though: maybe it would be worth looking again at the FastQC reports, especially of R2, if you can spot anything suspicious. Did you perform trimming with Trim Galore or something like it? Which kind of library preparation did you use (e.g. EpiGnome samples would require special trimming at the 5’ end prior to starting etc).

                            Comment


                            • The FastQC reports did not reveal any problems (apart from typical warning for BS-seq ), I don't see a decline in quality for R2 so I didn't preform any trimming, probably because it was already preformed.

                              It looks like the library was prepared using Qiagen EpiTect kit, it's a public data set that was published a while ago.

                              I was indeed able to get save some reads from the unmapped R1 (about 50% mapped in signe end mode). >1% of R2 reads mapped in signe end mode, I will try remapping them with the option --pbat.

                              Comment


                              • Hi Felix

                                I´m new with bisulfite data, so I have a few question about Bismark

                                First, let me notice that my reads came from "amplicon-sequencing", that is: DNA bisulfite treatment, PCR amplification with specific primers, amplicons cloning with NeBNext DNA library and then sequencing with MISeq Ilumina.

                                Having that in mind, I almost sure that I´should run the mapping with the --non_directional option, since I´would have all four strand combinations sequenced, right? But what I dont understand, is why the strand-specific (or directional) is the most common library tipe in a DNA methylation analisis? Dont you whant to sequences both strands? I think Iam missing something theoretically speaking.

                                Also, could I run both paired-end and single-end files at the same time? Or should I do it separately? And in any case, when I run the methylation_extractor, could I use all the .bam files together?

                                Thanks!
                                Best regards

                                DIego

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-25-2024, 11:49 AM
                                0 responses
                                19 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                20 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                62 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                61 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X