Hello,
I wasn't sure where to post this but I was hoping to get some insight into a problem I am experiencing with my paired-end fastq files. I should also say I am a molecular microbiologist by training with very little hands-on training in bioinformatics and programming in general, so apologies if this is a stupid question!
I have had intermittent problems when downloading .fastq.gz files from an external sequencing provider (I have seen this when using both the FTP file server and direct download links on the company's website).
The download itself appears fine (I checked the md5 values and they match) and I use 7zip (as recommended by the sequence provider) to extract the fastq files, all done on my local computer. I transfer the files onto a Unix machine to carry out bwa and samtools analysis to identify SNPs.
The problem only becomes apparent when I try to map the paired-end fastq data with BWA: the bwa sampe step fails. When I check the number of lines in each file (using wc -l), they do not match, which explains why the step fails. My colleague wrote a script to pull out where the mismatch between the paired files occurs. The first time we noticed this, we identified an insertion of about 10-20 sequencing reads from a completely different sequencing run. When we showed the headers of these reads to the sequencing provider, they traced them back to a previous sequencing project of theirs, but said the problem must have occurred at our end. However, this has now happened at least twice.
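For anyone wanting to reproduce this kind of check, here is a rough sketch of the idea behind my colleague's script (the file names and the stray read are made up for illustration): extract just the header lines (every 4th line of a FASTQ file), strip the /1 or /2 mate suffix, and diff the two lists to see where the pairing breaks.

```shell
# Fake two tiny FASTQ files, with one stray read inserted into R2
# (mimicking the contamination we saw).
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nACGT\n+\nIIII\n' > R1.fastq
printf '@read1/2\nACGT\n+\nIIII\n@stray/2\nTTTT\n+\nIIII\n@read2/2\nACGT\n+\nIIII\n' > R2.fastq

# Pull out the header lines (lines 1, 5, 9, ... of each file) and
# drop the /1 or /2 mate suffix so the two lists should be identical.
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' R1.fastq > R1.ids
awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' R2.fastq > R2.ids

# diff exits non-zero when the lists differ; the output shows the
# first point where the pairing breaks (here, the @stray read).
diff R1.ids R2.ids || true
```

On real data you would point awk at the decompressed files (or pipe through `zcat` for the .gz versions); the first diverging header tells you which run the foreign reads came from.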
My question: is this “contamination” (sorry, that’s the microbiologist in me!) likely to occur just from unzipping the data? I haven’t done anything else to the data (e.g. quality trimming), and it appears to happen randomly. In the last batch of sequencing I received, 3 of the strains sequenced went through the bwa analysis absolutely fine, while others needed a 2nd or 3rd attempt at downloading before I got “clean” files. Is the problem more likely to be at the sequencing provider’s end? Either way, what can I do to stop this happening? From now on I will always check the number of lines in each file before I proceed with any analysis, but I would like to resolve the underlying problem too.
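For the record, the sanity check I plan to run before any analysis looks something like this (a minimal sketch; the file names are placeholders, and I generate dummy files here so the snippet is self-contained): both mate files must have the same number of lines, and that number must be a multiple of 4, since one FASTQ record is exactly 4 lines.

```shell
# Dummy single-read-pair files standing in for real downloads.
printf '@r1/1\nACGT\n+\nIIII\n' > sample_R1.fastq
printf '@r1/2\nTGCA\n+\nIIII\n' > sample_R2.fastq

n1=$(wc -l < sample_R1.fastq)
n2=$(wc -l < sample_R2.fastq)

# A valid pair of mate files: equal line counts, divisible by 4.
if [ "$n1" -eq "$n2" ] && [ $((n1 % 4)) -eq 0 ]; then
    echo "OK: $((n1 / 4)) read pairs"
else
    echo "MISMATCH: R1 has $n1 lines, R2 has $n2 lines" >&2
fi
```

It may also be worth running `gzip -t file.fastq.gz` on the downloaded archives before extracting, since that tests the internal gzip checksums rather than just the checksum of the downloaded bytes.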
Thanks!