Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • fkrueger
    replied
    Hi Diego, I would bet a small amount of money that either the maize bug or your virus are the contaminants you are looking for, definitely something you can try to improve on for your follow-up experiments.... Good luck! Felix

    Leave a comment:


  • dzavallo
    replied
    Hi Felix,

    thanks for all the test you did with my samples. I'll talk to the people from my sequencing facility. But now that you mention it, my samples were ran along with a transcriptome RNA-Seq experiments from an insects that infects Maize (Delphacodes kuscheli). Could that be the contamination? Unfortunately its not sequenced yet, but I'll give the data to my coworker and see if that sequences are from that organism.
    The virus in my experiments is the ORMV (oil seed mosaic virus) and it's sequenced.
    Tomorrow I'll analyze another set of data (from another virus) that have much less sequences. I though that having less sequeces was bad, but maybe they are just not contaminated.

    Thanks again for the help

    Cheers, Diego

    Leave a comment:


  • fkrueger
    replied
    Hi Diego,

    I downloaded your files and tried several things with them. I trimmed them with Trim Galore which removes all traces of adapter contamination, and the quality looks fine indeed. There is a weird spiky bias at the start, somewhat reminiscent of what we see in RNA-Seq. other than that the sequence composition profiles are pretty flat throughout. This could be an effect of the non-directional nature of your sample, or the data could also be non-bisulfite converted DNA but we simply don’t know from which organism.

    I have used FastQ screen and some manual alignments to try to identify a potential contaminant, but I couldn’t find anything prominent. I checked human, mouse, rat, TAIR10, Lambda, PhiX, E. coli, C. elegans and a few others, so if there is a contaminant then it is none if these ‘usual suspects’.

    Since the sequences are very long I also tried to hard-trim them to 50bp (from their 3’ end) but that didn’t much. I also removed the first 15 or 30bp do get rid of the spikey bias, and tried aligning the data in a more relaxed fashion (score_min L,0,-0.6) but even that did not increase the mapping efficiency to more than 0.2%.

    To cut a long story short I really don’t know what’s the reason why you don’t see more alignments. I went back to look at the GC content of and saw that there are two prominent and distinct peaks in there, one at around 30% GC which I reckon would be the overall GC content of A. thaliana, and there is a second bigger peak around 50%. I bet that this is the main source of problem in the sample, and it would be good if you could identify what it actually is.

    You said that your samples were infected with a virus, which virus was that and is there a genome for it? Alternatively it be worth going back to your sequencing facility to ask them which kind of other samples they sequence regularly, maybe another plant species of whatever else you sequence in South America :P

    Let me know if I can try out anything else,
    Best wishes, Felix

    Leave a comment:


  • dzavallo
    replied
    Felix, I have just send you a private message with information on the files. Please let me know if you got them.

    Cheers, Diego

    Leave a comment:


  • dzavallo
    replied
    Thanks Felix,

    I'll try trim_galore for trimming

    Regarding to the library construction, I'm not mixing different organisim or treatments. Only different amplicons from the same organism (Arabidpsis) in one particular condition (viral infection). The different conditions would be contructed in separated libraries.

    I'm uploading the raw data to our server and send it to you soon.
    Thanks a lot

    Cheers, Diego

    Leave a comment:


  • fkrueger
    replied
    Hi Diego,

    The adapters used look like standard Illumina sequencing adapters. I am not sure how you used Trimmomatic, but I would recommend using Trim Galore for this puropose because it is optimized exactly for that purpose. The command for your case would simply be:

    Code:
    trim_galore --paired file1.fq.gz file2.fq.gz
    As a general thing I am not sure that it is a good idea to mix different sample types together (e.g. see here: https://sequencing.qcfail.com/articl...ion-artefacts/) but here you just wanted to check something. Depending on whether the other samples were from the same organism or not you might see cross mapping artefacts though.

    I have just sent you details for an FTP server to upload your raw FastQ files. If you could upload them I would try to get back to you by tomorrow should I find something.

    Cheers, Felix

    Leave a comment:


  • dzavallo
    replied
    Hi Felix!

    Now I'm get it! Sorry, I totally missed the primers part. Next I'll need puppets to undestand it.
    Thanks!

    Regarding the mapping: I trimmed the raw data with trimmomatic, using an adapter file that I create for NEBNext library (attached as .txt). When I ran the files wiht Fastqc pre and post runing the trimmomatic, the adapter is deleated. (I show you the analisis for R1, pre and post trmimming).
    Theoretically, there are two sizes of amplicons, 300 and 600 bp, corresponding to two regions from one gen (an intron and the promoter). when I ran the methylation_extractor I could find these regions, with an average of 50-100 coverage, but I also found another regions that I didnt amplified, although with 1 or 2 read coverage only.
    Once the protocol for the libraries construction are optimize, the idea is to have a coverage around 100X for each amplicon. This run was made in parallel with an RNA-Seq experiment of another group. They let me test my libraries to see if it works and to adjust the amounts for the coverage. This over 1 millon reads are way out of our proportion.

    I cant attach the R1 fastq file because is to large (gzipped is over 200MB), maybe google drive?
    I ran the R1 single-stranded and the percentage were very similar, only a few more reads; here are the stats:

    Bismark report for: /home/zavallo.diego/Downloads/Methylation_analisis/OR1_paired.fastq (version: v0.16.1)
    Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)
    Bismark was run with Bowtie 2 against the bisulfite genome of /home/zavallo.diego/Downloads/Arabidopsis/Arabidopsis_Genome/Arabidopsis_genome/ with the specified options: -q -N 1 --score-min L,0,-0.2 -p 4 --reorder --ignore-quals

    Final Alignment report
    ======================
    Sequences analysed in total: 1238610
    Number of alignments with a unique best hit from the different alignments: 299
    Mapping efficiency: 0.0%
    Sequences with no alignments under any condition: 1237907
    Sequences did not map uniquely: 404
    Sequences which were discarded because genomic sequence could not be extracted: 0

    Number of sequences with unique best (first) alignment came from the bowtie output:
    CT/CT: 130 ((converted) top strand)
    CT/GA: 29 ((converted) bottom strand)
    GA/CT: 123 (complementary to (converted) top strand)
    GA/GA: 17 (complementary to (converted) bottom strand)

    Final Cytosine Methylation Report
    =================================
    Total number of C's analysed: 13481

    Total methylated C's in CpG context: 1487
    Total methylated C's in CHG context: 599
    Total methylated C's in CHH context: 2173
    Total methylated C's in Unknown context: 0

    Total unmethylated C's in CpG context: 405
    Total unmethylated C's in CHG context: 1087
    Total unmethylated C's in CHH context: 7730
    Total unmethylated C's in Unknown context: 0

    C methylated in CpG context: 78.6%
    C methylated in CHG context: 35.5%
    C methylated in CHH context: 21.9%
    Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

    Thanks again for all your help
    Diego
    Attached Files

    Leave a comment:


  • fkrueger
    replied
    Hi Diego,

    I suggest you really take one region of interest and draw it out on a sheet of paper. Once you take a DNA strand and bisulfite convert it it loses its complementarity to the former complementary strand (except for a few exceptions such as AT only sequences or the like). Now if you take that bisulfite converted forward strand sequence and design primers against it, the forward primer will probably have the same sequence as the converted forward strand, and the reverse oligo will be the reverse complement of that converted sequence. Due to the conversion you only amplify the top strand (OT) and not the bottom strand anymore (OB). If you then ligate adapters onto the PCR product you should see the OT strand in 50% of cases, and the sequence complementary to it also in 50% of cases (CTOT). Or just take my word for it ☺

    Regarding your sequencing: 0% mapping efficiency is really very bad. This could have several reasons: Was the data adapter/quality trimmed with Trim Galore? There is a rather strong sequence bias in the first positions, is that because of the adapter? The sequences are very long, how long are your amplicons? You could try aligning only R1 in single-end mode to see if that achieves a higher mapping rate. You could try hard-trimming the reads to say 100bp to if they align better then. Or try to see if the sequences are contaminated with something else. If you wanted to you could send me R1 (as gzipped fastq) as email attachment and I could have a quick go at this if you like. Cheers, Felix

    Leave a comment:


  • dzavallo
    replied
    Originally posted by fkrueger View Post
    Hola Diego,

    The typical directional libraries are a result of the following procedure: the DNA is sheared and blunt-ended and the Illumina adapters are ligated onto the reads. This is done in a fashion so that a specific adapter sequence binds to the 5' end of the reads (from both the forward and reverse strand), and a different primer binds to the 3' ends. This works because the first part of the adapter is initially complementary but they then fork so that they are no longer complementary. It is exactly this forking part of the adapter that will only allow the former top and bottom strand reads that have the adapter on their 5' end to anneal to the primers on the flow cell - and these get sequenced as Read 1 (of paired end files) or as the only read of single-end sequencing.

    Amplicon sequencing is indeed different because this normally works by bisulfite converting and then amplifying a specific sequence. Because this is usually done for the top strand only I would expect your sequences to align to either the OT or CTOT strands, but this depends a bit on the kit design. I am happy to advise on that if you could send me the mapping reports of your --non_directional run.

    And to your last point: No, single-end and paired-end reads should not be combined but run individually. (Having said that the reduplication and methylation extraction steps should detect automatically which kind of file they are working with, but if I were you I would still keep them separate to be safe). If you have sequenced the same sample several times with SE and PE reads you could merge the results e.g. at the bedGraph conversion stage once you have deduplicated and extracted the files individually.

    Hope this helps getting you started, Cheers, Felix
    Felix, thanks for your promt reply!

    Here is the mapping report with non_directional optional, and you were right! Most reads mapped to OT and CTOT. But I still dont undersand why is that. It is because de bisulfite sodium only treat the top strand of the DNA? Is that what you mean?

    As you might notice, the mapping efficiency is near 0%. This library is a test with only two amplicons and I'm still optimazing the protocol. I dont know exactly why is that, but in the report from the fastqc analisis on the fastq reads the overpresetated reads were only a few. I'm attaching the report so you can see it.
    Nevertheless, when I ran the methylation_extractor I did find my sequence of interest, but with very low coverage.
    Could you give a hint on that?
    Thank you very much for your help

    Gracias!

    Bismark report for: /home/zavallo.diego/Downloads/Methylation_analisis/prueba/OR1_paired.fastq and /home/zavallo.diego/Downloads/Methylation_analisis/prueba/OR2_paired.fastq (version: v0.16.1)
    Bismark was run with Bowtie 2 against the bisulfite genome of /home/zavallo.diego/Downloads/Arabidopsis/Arabidopsis_Genome/Arabidopsis_genome/ with the specified options: -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --no-mixed --no-discordant --maxins 500
    Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)

    Final Alignment report
    ======================
    Sequence pairs analysed in total: 1238610
    Number of paired-end alignments with a unique best hit: 252
    Mapping efficiency: 0.0%
    Sequence pairs with no alignments under any condition: 1238263
    Sequence pairs did not map uniquely: 95
    Sequence pairs which were discarded because genomic sequence could not be extracted: 0

    Number of sequence pairs with unique best (first) alignment came from the bowtie output:
    CT/GA/CT: 120 ((converted) top strand)
    GA/CT/CT: 117 (complementary to (converted) top strand)
    GA/CT/GA: 7 (complementary to (converted) bottom strand)
    CT/GA/GA: 8 ((converted) bottom strand)

    Final Cytosine Methylation Report
    =================================
    Total number of C's analysed: 21576

    Total methylated C's in CpG context: 2708
    Total methylated C's in CHG context: 931
    Total methylated C's in CHH context: 2266
    Total methylated C's in Unknown context: 0


    Total unmethylated C's in CpG context: 693
    Total unmethylated C's in CHG context: 2050
    Total unmethylated C's in CHH context: 12928
    Total unmethylated C's in Unknown context: 0


    C methylated in CpG context: 79.6%
    C methylated in CHG context: 31.2%
    C methylated in CHH context: 14.9%
    Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0
    Attached Files

    Leave a comment:


  • fkrueger
    replied
    Hola Diego,

    The typical directional libraries are a result of the following procedure: the DNA is sheared and blunt-ended and the Illumina adapters are ligated onto the reads. This is done in a fashion so that a specific adapter sequence binds to the 5' end of the reads (from both the forward and reverse strand), and a different primer binds to the 3' ends. This works because the first part of the adapter is initially complementary but they then fork so that they are no longer complementary. It is exactly this forking part of the adapter that will only allow the former top and bottom strand reads that have the adapter on their 5' end to anneal to the primers on the flow cell - and these get sequenced as Read 1 (of paired end files) or as the only read of single-end sequencing.

    Amplicon sequencing is indeed different because this normally works by bisulfite converting and then amplifying a specific sequence. Because this is usually done for the top strand only I would expect your sequences to align to either the OT or CTOT strands, but this depends a bit on the kit design. I am happy to advise on that if you could send me the mapping reports of your --non_directional run.

    And to your last point: No, single-end and paired-end reads should not be combined but run individually. (Having said that the reduplication and methylation extraction steps should detect automatically which kind of file they are working with, but if I were you I would still keep them separate to be safe). If you have sequenced the same sample several times with SE and PE reads you could merge the results e.g. at the bedGraph conversion stage once you have deduplicated and extracted the files individually.

    Hope this helps getting you started, Cheers, Felix

    Leave a comment:


  • dzavallo
    replied
    Hi Felix

    I´m new with bisulfite data, so I have a few question about Bismark

    First, let me notice that my reads came from "amplicon-sequencing", that is: DNA bisulfite treatment, PCR amplification with specific primers, amplicons cloning with NeBNext DNA library and then sequencing with MISeq Ilumina.

    Having that in mind, I almost sure that I´should run the mapping with the --non_directional option, since I´would have all four strand combinations sequenced, right? But what I dont understand, is why the strand-specific (or directional) is the most common library tipe in a DNA methylation analisis? Dont you whant to sequences both strands? I think Iam missing something theoretically speaking.

    Also, could I run both paired-end and single-end files at the same time? Or should I do it separately? And in any case, when I run the methylation_extractor, could I use all the .bam files together?

    Thanks!
    Best regards

    DIego

    Leave a comment:


  • akramdi
    replied
    The FastQC reports did not reveal any problems (apart from typical warning for BS-seq ), I don't see a decline in quality for R2 so I didn't preform any trimming, probably because it was already preformed.

    It looks like the library was prepared using Qiagen EpiTect kit, it's a public data set that was published a while ago.

    I was indeed able to get save some reads from the unmapped R1 (about 50% mapped in signe end mode). >1% of R2 reads mapped in signe end mode, I will try remapping them with the option --pbat.

    Leave a comment:


  • fkrueger
    replied
    Yes it does indeed appear as if Read 2 is letting the read pairs as whole fail even though Read 1 might still align fine. You could indeed run the paired-end mode with --unmapped, and then run R1 in single end mode (default mode, directional) and Read 2 in single-end mode using the option --pbat (to just get the complementary reads). The rationale why Bismark is running in --no-mixed mode is mainly that we are expecting properly paired fragments and didn’t want any sort of weird stuff align to the genome as well.

    There are some other things to consider as well though: maybe it would be worth looking again at the FastQC reports, especially of R2, if you can spot anything suspicious. Did you perform trimming with Trim Galore or something like it? Which kind of library preparation did you use (e.g. EpiGnome samples would require special trimming at the 5’ end prior to starting etc).

    Leave a comment:


  • akramdi
    replied
    Hi Felix,

    I clearly misunderstood the origin of read pairs during bisulfite sequencing. It's clear, thank you !

    So based on this test, if I try to explain the overall low mapping efficiency, I am basically losing reads (Read1), dispite them mapping correctly in signle mode, mainly because theirs mates (Read2) for some reason do not map.

    I read that Bismark could find alignments for individual mates if no concordant or discordant alignment is found. However, the invariate option --no-mixed prevents it. What's the rational behind it? I think it could allow us to get more information.

    Before my next steps (deduplication, methylation extraction), I think the best I could do is to add the extra reads I was able to extract from the left-over reads I get using --unmapped when runing the initial PE alignment.

    Leave a comment:


  • fkrueger
    replied
    Hi Amira,

    The information from which a sequence originated is determined by Read 1 (in single-end sequencing there is only a Read 1). Read 2 is simply a PCR amplification of the Read 1 fragments, and is this by definition the reverse complement of R1. So if Read 1 aligns to the OT strands its Read 2 will align to the CTOT strand, and the same for OB and CTOB. In a paired-end mapping scenario Bismark takes care of this for you, so it is all fine. Maybe with the exception that your Read 2 features a much lower mapping efficiency, which is most likely also the reason why your overall efficiency is rather low.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-25-2024, 11:49 AM
0 responses
19 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-24-2024, 08:47 AM
0 responses
17 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
62 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
60 views
0 likes
Last Post seqadmin  
Working...
X