  • Paired end fastq files contamination

    Hello,

    I wasn't sure where to post this but I was hoping to get some insight into a problem I am experiencing with my paired end fastq files. I should also say I am molecular microbiologist by training with very little hands on training in bioinformatics and programming in general so apologies if this is a stupid question!

    I have had intermittent problems when downloading .fastq.gz files from an external sequencing provider (I have seen this when using both FTP file server and direct download links on the company's website).

    The download itself appears fine (I checked the md5 values and they match) and I use 7zip (as recommended by the sequence provider) to extract the fastq files, all done on my local computer. I transfer the files onto a Unix machine to carry out bwa and samtools analysis to identify SNPs.

    The problem only becomes apparent when I try to merge the paired-end fastq data in BWA and the bwa sampe step fails. When I check the number of lines in each file (using wc -l), they do not match, which explains the failure. My colleague wrote a script to find where the mismatch between the paired files occurs. The first time we noticed this, we identified an insertion of about 10-20 sequencing reads from a completely different sequencing run. When we showed the headers of those reads to the sequencing provider, they traced them back to a previous sequencing project of theirs, and said the problem had occurred at our end. However, this has happened at least twice now.

    My question: is this “contamination” (sorry, that’s the microbiologist in me!) likely to occur just from unzipping the data? I haven’t done anything else to the data (e.g. quality trimming) and it appears to happen randomly: in the last batch of sequencing I received, 3 of the strains sequenced went through the bwa analysis absolutely fine, while others needed a 2nd or 3rd download attempt before I got “clean” files. Is the problem more likely to be at the sequencing provider's end? Either way, what can I do to stop this happening? From now on I will always check the number of lines in each file before I proceed with any analysis, but I would like to resolve the problem too.
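
    For illustration only, a check of the kind described above (compare the two files record by record and report where they first stop pairing up) might look like the following Python sketch. It is not the colleague's actual script. It assumes plain four-line FASTQ records and assumes that mates share the first whitespace-delimited token of the header, optionally followed by /1 or /2; the function names are hypothetical.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz files transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_name(header):
    """Return the part of a header that both mates are assumed to share."""
    fields = header.split()
    name = fields[0] if fields else ""
    # Older naming scheme: mates are distinguished by a /1 or /2 suffix.
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def headers(path):
    """Yield the header line of each record (every 4th line, starting at the first)."""
    with open_fastq(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 0:
                yield line.rstrip("\n")


def first_mismatch(path1, path2):
    """Return (record_number, header1, header2) for the first unpaired record, or None.

    A header of None means that file ran out of records before the other one.
    """
    pairs = zip_longest(headers(path1), headers(path2))
    for number, (h1, h2) in enumerate(pairs, start=1):
        if h1 is None or h2 is None or read_name(h1) != read_name(h2):
            return number, h1, h2
    return None
```

    Calling first_mismatch on the two files of a pair returns None when every record lines up, and otherwise the 1-based record number where the pairing first breaks, which is where inserted or missing reads would show up.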

    Thanks!

  • #2
    Let me say at the outset: this all sounds odd.

    If you must use a Windows machine, then keep what you do with the data files there to a minimum: just download each file and move it to the server. If your server has a direct internet connection, download the sequence files directly on the Unix side (using wget or curl). Nowadays most NGS tools, bwa included, understand compressed fastq files, so there is no need to uncompress them with 7-Zip, but if you must, use gunzip on the Unix end.

    As for the insertion of unrelated data (however small), that should never happen, and there is no way it can happen on your end.
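
    As a small illustration of working on the compressed file directly, the sketch below counts the reads in a .fastq.gz without ever writing an uncompressed copy to disk. It is only a sketch: it assumes four-line FASTQ records, and the function name is made up for this example.

```python
import gzip


def count_reads(path):
    """Count FASTQ records in a gzip-compressed file by streaming through it."""
    lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            lines += 1
    # A well-formed four-line FASTQ file has a line count divisible by 4.
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4
```

    Running count_reads on both files of a pair and comparing the two numbers is a quick sanity check before starting bwa.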
    Last edited by GenoMax; 12-02-2016, 11:17 AM.



    • #3
      I suggest you verify the pairing before doing any further processing. You can do that with the BBMap package like this:

      reformat.sh in1=file1.fastq.gz in2=file2.fastq.gz vpair


      If that shows a problem, then the problem is absolutely occurring on their end, since the check runs on the gzipped files as delivered. It's unlikely but theoretically possible that you caused the corruption during the unzipping process (say, if you were unzipping lots of things at once and outputting some of them to the same file, so that they were overwriting each other), but if the gzips pass the gzip integrity test, then they were not corrupted during transmission.
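
      On the command line the integrity test is normally just gzip -t on the file. For anyone who would rather do it from a script, the following is a rough Python equivalent, offered as a sketch: it reads the whole stream so that Python's gzip module can verify the CRC and length stored at the end of the file. The function name and chunk size are arbitrary choices for this example.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the file decompresses fully and passes gzip's CRC/length check."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end; the trailer is only verified once the stream is exhausted.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers truncation.
        return False
    return True
```

      A file that passes this test but still fails the pairing check was already wrong when it was compressed, which points away from the download.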
