  • fkrueger
    replied
    We've just released a new version of Bismark (0.14.1), which mainly addresses a few bugs/glitches:

    Bismark: Fixed the cleaning up stage in a --multicore run when --gzip had been specified as well
    Bismark: Fixed the handling of files in a --multicore run when the input files had been specified including file path information
    Bismark: Please note that the option -B/--basename in conjunction with --multicore is currently not supported (as in: disabled), but we are aiming to address this soon

    Methylation Extractor: Fixed a bug with the position adjustment of paired-end reads when the reads should have been trimmed from their 3' ends (option --ignore_3prime)

    deduplicate_bismark: Now also removing newline characters from the read conversion tag in case other programs interfered with the tag ordering and put this tag into the very last column

    Download is available from the Bismark project page: http://www.bioinformatics.babraham.a...jects/bismark/


  • barbarian
    replied
    Hello again,

    So, I decided to try a single-end dataset because I think paired-end takes too much time. With the current dataset, I've successfully used Bismark to do the alignment with 66% mapping efficiency. But I still have a problem. I read the paper published together with the dataset; it states that they got a mapping efficiency of 72% and a higher percentage of CpG methylation. So, is it possible to increase the mapping efficiency by changing some parameters in Bismark or Trim Galore? In the paper, they didn't use Bismark or Trim Galore. They developed their own code to do the trimming and methylation calling and used Bowtie as the aligner. Thank you.


  • dpryan
    replied
    Fastq-dump is painfully slow. Sometimes you can get lucky and either ENA or DDBJ (conveniently located in Japan) has the files in fastq format. That speeds things up quite a bit, but unfortunately these other sources sometimes only have the SRA file (e.g., for this experiment). Whenever you're dealing with large files, this can start taking a while. This is why many of us push jobs onto clusters where we use multiple machines at once (e.g., I used 13 nodes on ours to perform the alignment of this dataset in 30-40 minutes, which would have been less had someone else not decided to start 1000 jobs at the same time...).


  • barbarian
    replied
    OK, thank you for your guidance. I will try the first million sequences to see which arguments are best suited for the analysis, so that I don't need to wait for hours to find out whether I made a mistake.
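
One way to cut out the first million reads for such a quick test is sketched below. This is only an illustration, assuming an uncompressed FASTQ file with exactly four lines per record; the function name `head_fastq` and the demo file names are made up for this example and are not part of Bismark or Trim Galore.

```python
import itertools
import os
import tempfile


def head_fastq(src_path, dst_path, n_reads=1_000_000):
    """Copy the first n_reads records of an uncompressed 4-line FASTQ file.

    Returns the number of complete records written.
    """
    lines_written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Each FASTQ record is assumed to span exactly 4 lines.
        for line in itertools.islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


# Tiny self-check on made-up reads (not taken from the dataset in this thread).
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "demo.fastq")
demo_out = os.path.join(demo_dir, "demo.head.fastq")
with open(demo_in, "w") as fh:
    for i in range(3):
        fh.write(f"@read{i}\nACGT\n+\nIIII\n")
demo_count = head_fastq(demo_in, demo_out, n_reads=2)
```

For paired-end data the same number of records would have to be taken from both files so that the mates stay in step.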

    By the way, does analysis with Bismark really take this long? My computer has an i7 (8 cores) and 28 GB of RAM. The whole process (fastq-dump, Trim Galore and Bismark) seems to take too long: around 7 hours or maybe more to complete.
    Last edited by barbarian; 03-18-2015, 07:21 PM.


  • dpryan
    replied
    For the sake of comparison, below are some metrics that I get using local alignment on that dataset. You won't get identical metrics (we're using different tools and trimming differently), but they won't be that terribly different (except CpG methylation). So >75% alignment is definitely possible with this dataset (at least after removing very low quality reads...of which there are many).

    Code:
    Alignment:
    	76660262 total reads analysed
    	59596692 paired-end reads mapped ( 77.74%).
    
    	27577367 concordant pairs
    	1545115 discordant pairs
    	1351728 reads aligned as singletons
    
    Number of hits aligning to each of the orientations:
    	11086480	 14.46%	OT (original top strand)
    	10666480	 13.91%	OB (original bottom strand)
    	19288820	 25.16%	CTOT (complementary to the original top strand)
    	18554912	 24.20%	CTOB (complementary to the original bottom strand)
    
    Cytosine Methylation (N.B., statistics from overlapping mates are added together!):
    	Number of C's in a CpG context: 188158298
    	Percentage of methylated C's in a CpG context:  44.23%
    	Number of C's in a CHG context: 173622589
    	Percentage of methylated C's in a CHG context:   2.27%
    	Number of C's in a CHH context: 348611484
    	Percentage of methylated C's in a CHH context:   7.04%
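
As a sanity check on how these figures fit together, the arithmetic can be redone in a few lines. This is a sketch under one assumption: that "reads mapped" counts individual reads, so each concordant or discordant pair contributes two. The variable names are made up for the example.

```python
# Counts copied from the alignment report above.
total_reads = 76_660_262
concordant_pairs = 27_577_367
discordant_pairs = 1_545_115
singletons = 1_351_728

# Assumption: every aligned pair contributes two reads, every singleton one.
mapped_reads = 2 * (concordant_pairs + discordant_pairs) + singletons
mapping_rate = 100 * mapped_reads / total_reads

print(f"{mapped_reads} reads mapped ({mapping_rate:.2f}%)")
```

Under that assumption the sum reproduces the reported number of mapped reads, and the four strand counts add up to the same total.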


  • barbarian
    replied
    It seems I made a mistake while running Trim Galore. Thank you so much. I will add the non-directional parameter tonight, and tomorrow I will look at the result again.


  • dpryan
    replied
    The simplest method is to just take a million or so reads and align them in a non-directional manner. If there's considerable alignment to the CTOB and CTOT strands, then it's non-directional.
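
The check described above can be expressed as a small calculation on the four strand counts from an alignment report. This is a rough sketch: the function names and the 10% cutoff are arbitrary choices for illustration, not values taken from Bismark or from this thread. The example counts are the ones from the alignment report posted elsewhere in this thread.

```python
def ctot_ctob_fraction(ot, ob, ctot, ctob):
    """Fraction of aligned hits that fall on the CTOT and CTOB strands."""
    total = ot + ob + ctot + ctob
    if total == 0:
        raise ValueError("no aligned reads to classify")
    return (ctot + ctob) / total


def looks_non_directional(ot, ob, ctot, ctob, threshold=0.10):
    """Heuristic only: 'threshold' is a hypothetical cutoff for 'considerable'."""
    return ctot_ctob_fraction(ot, ob, ctot, ctob) >= threshold


# Strand counts from the alignment report posted in this thread.
frac = ctot_ctob_fraction(11086480, 10666480, 19288820, 18554912)
print(f"CTOT + CTOB share of aligned hits: {frac:.1%}")
print("non-directional?", looks_non_directional(11086480, 10666480, 19288820, 18554912))
```

With well over half of the hits on the complementary strands, those counts clearly indicate a non-directional library; a directional one would be expected to show only a small share there.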


  • dpryan
    replied
    I'll also note that you can probably get >75% alignment rate with this dataset, at least I did with a subset of it and using local alignment. This would probably be 80-85% if I included all of the multimappers in the metrics.


  • barbarian
    replied
    Thank you for the info. I always thought it was directional. It seems I need to change the command again. Can you tell me how to check whether a dataset is directional or not?


  • dpryan
    replied
    FYI, this is a non-directional dataset, so make sure to use the appropriate options.


  • barbarian
    replied
    Thank you for your reply. I only realized that this afternoon. Now I'm waiting for the result. I may have another question tomorrow, because the run usually doesn't finish on the same day. Good luck with your meeting.


  • fkrueger
    replied
    For paired-end files you need to run Trim Galore in paired-end mode like this:

    trim_galore --rrbs --paired <fastq1> <fastq2>

    If you run it twice in single-end mode, it will break the sequence-by-sequence order of the files, which then results in very low mapping efficiency.
    I am in a meeting all day but can take a look myself at the file in question tonight or tomorrow.
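
Whether two trimmed files are still in step can be checked by comparing read names record by record. The sketch below assumes plain 4-line FASTQ records and read names that differ at most by a trailing /1 or /2; the helper names and the toy reads are invented for this example. It also assumes the order breaks because a read is dropped from one file but not the other.

```python
import itertools


def read_id(header):
    """Read name without the leading '@', any description, or a /1 or /2 suffix."""
    tokens = header.split()
    name = tokens[0] if tokens else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(lines1, lines2):
    """Index of the first record whose names differ, or None if the files are in step."""
    headers1 = itertools.islice(lines1, 0, None, 4)  # every 4th line is a header
    headers2 = itertools.islice(lines2, 0, None, 4)
    for i, (h1, h2) in enumerate(itertools.zip_longest(headers1, headers2)):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return i
    return None


# Toy data: mate 2 of read "r2" is missing, as if it had been discarded
# when the second file was trimmed on its own.
r1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "ACGT", "+", "IIII", "@r3/1", "ACGT", "+", "IIII"]
r2_ok = [line.replace("/1", "/2") for line in r1]
r2_broken = r2_ok[:4] + r2_ok[8:]

print(first_mismatch(r1, r2_ok))      # files in step
print(first_mismatch(r1, r2_broken))  # order broken from the second record on
```

On real data the two arguments would be open file handles for the trimmed read 1 and read 2 files.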


  • barbarian
    replied
    OK, it's strange. I tried with another sample. The mapping efficiency is 0.1% when I align both files and 13.5% when I align only one file. Before this step, I ran:
    fastq-dump --split-files <sra file>
    trim_galore --rrbs <fastq1>
    trim_galore --rrbs <fastq2>
    For both files:
    bismark --bowtie2 <ref> -1 <fastq1> -2 <fastq2>
    For 1 file:
    bismark --bowtie2 <ref> <fastq1>

    As for the reference, I'm sure that I built it with bowtie2; I checked it with the Bismark sample data and the result is similar to the documentation. I'm going to try the next sample to see whether the problem is the sample or my commands. Any suggestions? By the way, I downloaded the sample from NCBI. Here is the link: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE61150
    The sample that I checked is the first one: http://www.ncbi.nlm.nih.gov/geo/quer...acc=GSM1498453

    Thank you for your help.

    Additional:
    I checked it again with FastQC after trimming. The result for both FASTQ files is about 50-50, not all good. The failing modules are per tile sequence quality, per base sequence content, sequence duplication levels and Kmer content.
    Last edited by barbarian; 03-17-2015, 06:06 PM.


  • fkrueger
    replied
    Thanks Devon for jumping in. Here is a protocol that is worth reading in order to achieve good mapping results in most cases: http://www.epigenesys.eu/en/protcols...q-data-prot-57


  • barbarian
    replied
    OK, I will try that now. I may have another question tomorrow once the result is out.
