**chxu02** · 03-16-2015, 05:49 AM

I didn't, because the manual says --gzip is for zipping the temp files, not for unzipping the input fastq.gz files. And without --gzip, the run was successful until the end.

**fkrueger** · 03-16-2015, 06:20 AM

Originally posted by chxu02

Hi Felix,

I'm reporting a bug in v0.14.0. When I ran bismark --multicore on fastq files in gz format, Bismark failed at the end to assemble the separate files into one. The files were initially named *.fastq.gz_*, but at the end of the run Bismark clearly tried to assemble files named *.fastq_*, so the merge failed. Hope it helps.

Youyou

Hmm, if you didn't use --gzip then I don't think I quite understand the error you are reporting. Running files ending in either .fastq or .fastq.gz works fine for me here. Would you mind sending me the entire error message you are seeing by email?

Attached is the latest development version of Bismark which should also understand the option --gzip.

Attached Files

bismark_0.14.1_devel.zip (72.0 KB, 49 views)

**chxu02** · 03-16-2015, 10:15 AM

Sorry Felix, my bad. I checked my command history from yesterday and found that I did use --gzip. But why did it happen, if the purpose of --gzip is just to zip the temporary conversion files?

**fkrueger** · 03-16-2015, 10:44 AM

Ah good, that explains it. As I said a few posts back, --gzip was a corner case that wasn't handled properly, so the merging was never intended to go wrong. If you use the development version I attached in my last post, --gzip should work now.

**barbarian** · 03-16-2015, 05:38 PM

Hello Felix,

Thank you for your response. I have installed samtools, but I found another problem. From Biostars I learned that I should generate two separate fastq files for paired-end reads, so I used fastq-dump --split-files to extract them from my SRA file. I got two fastq files and have had no problems so far (before, I dumped everything into one file and Bismark reported a duplicate ID error). The only problem is that I got just one BAM file from Bismark. Its name is the same as the first fastq file with a .bam extension, so I assumed there should also be a BAM file named after the second file, but there is only the one for the first fastq. Is this normal or wrong? I ran bismark <options> -1 first_1.fq -2 second_2.fq, and the result is only first_1.bam.

This is my final alignment report:

Sequence pairs analysed in total: 29829521
Number of paired-end alignments with a unique best hit: 4156425
Mapping efficiency: 13.9%
Sequence pairs with no alignments under any condition: 23649277
Sequence pairs did not map uniquely: 2023819
Sequence pairs which were discarded because genomic sequence could not be extracted: 0

The mapping efficiency is really low. What do you think caused it?

For the Bismark example data, I got this result:
Final Alignment report
======================
Sequences analysed in total: 10000
Number of alignments with a unique best hit from the different alignments: 4732
Mapping efficiency: 47.3%
Sequences with no alignments under any condition: 4279
Sequences did not map uniquely: 989
Sequences which were discarded because genomic sequence could not be extracted: 0

So I think my human genome reference is not bad.
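As a quick sanity check on the two reports above, the mapping efficiency is simply the number of unique best hits divided by the number of sequences (or pairs) analysed. The following is a minimal sketch; the function and dictionary names are made up for illustration, and only the counts come from the reports.

```python
def mapping_efficiency(total, unique):
    """Percentage of analysed sequences (or pairs) with a unique best hit."""
    return 100.0 * unique / total

# Counts copied from the two reports above (discarded reads were 0 in both).
paired = {"total": 29829521, "unique": 4156425,
          "unaligned": 23649277, "ambiguous": 2023819}
example = {"total": 10000, "unique": 4732,
           "unaligned": 4279, "ambiguous": 989}

for name, report in (("paired-end run", paired), ("example data", example)):
    # The three outcome categories should account for every input sequence.
    accounted = report["unique"] + report["unaligned"] + report["ambiguous"]
    efficiency = mapping_efficiency(report["total"], report["unique"])
    print(f"{name}: {efficiency:.1f}% unique, "
          f"all sequences accounted for: {accounted == report['total']}")
```

Both reports are internally consistent (13.9% and 47.3%), so the low figure reflects how the reads aligned rather than a reporting glitch.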

**dpryan** · 03-17-2015, 12:13 AM

I just replied to you on Biostars, but producing 1 BAM file from paired-end reads is the appropriate result. The reads from each file are indicated appropriately in the BAM format.

The low mapping efficiency is a different question then. There are a number of likely causes of that, the most common being fastq files that are out of sync. Try mapping fastq_1.fq by itself and see if the mapping efficiency jumps up.

**barbarian** · 03-17-2015, 12:17 AM

If I map fastq_1.fq to itself, is there any biological meaning behind that? Will the result still represent the actual methylation state? Thank you.

**dpryan** · 03-17-2015, 12:19 AM

"by itself", not "to itself", big difference. This is purely to diagnose the cause of the low mapping efficiency.

**barbarian** · 03-17-2015, 12:20 AM

Oh, do you mean using only the first file, not together with the second file?

**dpryan** · 03-17-2015, 12:22 AM

That's correct. You essentially act as though you have a single-end dataset. If the mapping efficiency jumps to a more reasonable level when doing that, then either the fastq files are out of sync or there's something weird with fastq_2.fq.
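Whether two fastq files are out of sync can also be checked directly by comparing read IDs record by record. The sketch below is one possible way to do that, assuming standard 4-line FASTQ records and that mates share the same ID up to the first whitespace (optionally followed by /1 or /2). The function names are hypothetical and not part of Bismark or any other tool.

```python
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield the read ID of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the first line of each record is a header
            # Drop the leading '@' and anything after the first whitespace.
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            # Drop a trailing /1 or /2 mate suffix if present.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id

def first_mismatch(path_1, path_2):
    """Return (record_index, id_1, id_2) for the first out-of-sync record,
    or None if both files list the same IDs in the same order."""
    pairs = zip_longest(read_ids(path_1), read_ids(path_2))
    for index, (id_1, id_2) in enumerate(pairs):
        if id_1 != id_2:
            return index, id_1, id_2
    return None
```

Calling first_mismatch("fastq_1.fq", "fastq_2.fq") returns None when the files are in step. If one file is shorter, the missing ID is reported as None at the point where it runs out.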

**barbarian** · 03-17-2015, 12:35 AM

OK, I will try it now. I may have another question tomorrow once the result is out.

**fkrueger** · 03-17-2015, 01:31 AM

Thanks Devon for jumping in. Here is a protocol that is worth reading in order to achieve good mapping results in most cases: http://www.epigenesys.eu/en/protcols...q-data-prot-57

**barbarian** · 03-17-2015, 05:51 PM

OK, it's strange. I tried another sample. The mapping efficiency with both files is 0.1%, and with only one file it's 13.5%. Before this step, I ran:
fastq-dump --split-files <sra file>
trim_galore --rrbs <fastq1>
trim_galore --rrbs <fastq2>
For both files:
bismark --bowtie2 <ref> -1 <fastq1> -2 <fastq2>
For 1 file:
bismark --bowtie2 <ref> <fastq1>

As for the reference, I'm sure I built it with bowtie2; I checked it with the Bismark sample data and the result is similar to the documentation. I'm now trying the next sample to see whether the fault is in the sample or in my commands. Any suggestions? By the way, I downloaded the samples from NCBI GEO. Here is the link: http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE61150
The sample I checked is the first one, here: http://www.ncbi.nlm.nih.gov/geo/quer...acc=GSM1498453

Thank you for your help.

Additional:
I checked again with FastQC after trimming. The results for both fastq files are about 50-50, not all good. The failing modules are per tile sequence quality, per base sequence content, sequence duplication levels, and Kmer content.

**fkrueger** · 03-17-2015, 11:50 PM

For paired-end files you need to run Trim Galore in paired-end mode like this:

trim_galore --rrbs --paired <fastq1> <fastq2>

If you run it twice in single-end mode, it will break the sequence-by-sequence order of the files, which then results in very low mapping efficiency.
I am in a meeting all day but can take a look myself at the file in question tonight or tomorrow.
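To illustrate why two single-end trimming runs break the order, here is a toy model in Python. It is only a sketch: it assumes the trimmer drops any read shorter than a length cutoff, and that paired-end mode drops both mates when either one fails. The cutoff value, function names, and read data are all made up for the example.

```python
MIN_LENGTH = 20  # assumed length cutoff; shorter reads are discarded

def trim_single(reads):
    """Single-end model: each file is filtered with no knowledge of its mate."""
    return [(name, length) for name, length in reads if length >= MIN_LENGTH]

def trim_paired(reads_1, reads_2):
    """Paired-end model: a pair is kept only if both mates pass the cutoff."""
    kept = [(r1, r2) for r1, r2 in zip(reads_1, reads_2)
            if r1[1] >= MIN_LENGTH and r2[1] >= MIN_LENGTH]
    return [r1 for r1, _ in kept], [r2 for _, r2 in kept]

# Hypothetical read lengths after trimming; 'b' is too short only in file 2.
file_1 = [("a", 40), ("b", 35), ("c", 38)]
file_2 = [("a", 40), ("b", 12), ("c", 36)]

single_1, single_2 = trim_single(file_1), trim_single(file_2)
paired_1, paired_2 = trim_paired(file_1, file_2)

# Single-end: file 1 keeps a, b, c but file 2 keeps only a, c,
# so read 'b' in file 1 is now lined up against read 'c' in file 2.
print([name for name, _ in single_1], [name for name, _ in single_2])

# Paired-end: both files keep a, c and stay in step.
print([name for name, _ in paired_1], [name for name, _ in paired_2])
```

Once a single read is missing from one file, every later record is paired with the wrong mate, which is consistent with the mapping efficiency collapsing to 0.1% for the pair while one file alone still maps.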

**barbarian** · 03-17-2015, 11:53 PM

Thank you for your reply. I've just realized it this afternoon. Now I'm waiting for the result. Maybe tomorrow I will have another question because usually it will not finish today. Good luck with your meeting.
