  • Duplicate read names - BWA mem - paired reads have different names

    Hi,

    Running BWA mem on paired-end Illumina reads, I get the following error (read IDs replaced):



    [mem_sam_pe] paired reads have different names: "XXX:5:YYY:1:11102:4257:13510", "XXX:5:YYY:1:11102:15792:1058"

    I checked the FASTQ files and found that each read name occurs 7 times in each file (exact same name). However, the positions of those occurrences do not match between R1 and R2 (compare the line numbers of the third and fifth hits below).

    Example:

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R1.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    962773:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1164149:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R2.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1028309:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1229685:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA


    Is it ok for a fastq file to have multiple reads with the same read name?
    If not, could this be a problem of BCL conversion?
    How can I fix it?


    Thanks for your help,
    Stephan


    PS: bwa mem command:

    bwa mem -t 40 -v 1 hg19.fa R1.fastq R2.fastq > aln.sam
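
To check how widespread the duplication is beyond a single read, the names can be counted across the whole file. The following is a minimal Python sketch, not part of the original post. It assumes strict 4-line FASTQ records and takes the read name to be the header up to the first whitespace; the function names are made up for illustration.

```python
import collections
import io

def read_names(handle):
    """Yield read names (header up to first whitespace, without '@').

    Assumes well-formed 4-line FASTQ records (no wrapped sequences).
    """
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line[1:].split()[0]

def duplicated_names(handle):
    """Return {name: count} for every read name seen more than once."""
    counts = collections.Counter(read_names(handle))
    return {name: n for name, n in counts.items() if n > 1}

# Tiny in-memory example; for a real file use open("R1.fastq") instead.
example = io.StringIO(
    "@r1 1:N:0:AAA\nACGT\n+\nIIII\n"
    "@r2 1:N:0:AAA\nACGT\n+\nIIII\n"
    "@r1 1:N:0:AAA\nACGT\n+\nIIII\n"
)
dups = duplicated_names(example)
print(dups)
```

Note that this keeps every read name in memory, so it may need a lot of RAM on a full lane of data.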

  • #2
    FASTQ headers should always start with an "@", so what you have does not follow the standard. Have you asked the people who gave you this data whether it has been post-processed in some way? As far as the FASTQ header IDs are concerned, there should be no duplicates (let alone multiple copies) in raw sequence files.
    Last edited by GenoMax; 02-02-2016, 06:44 AM.


    • #3
      Hi,

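      That's not the problem: the headers do start with "@" (the leading numbers in my first post are line numbers printed by grep -n). See the "head" output (sequence and quality lines truncated) and the grep result I posted.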

      > head R1.fastq
      @XXX:5:YYY:1:11101:12923:1051 1:N:0:AGGCAGAA+NCGATCTA
      CTT...TTC
      +
      AAA...</<
      @XXX:5:YYY:1:11101:4797:1055 1:N:0:AGGCAGAA+NCGATCTA
      ACC...CTA
      +
      AAA...<A/


      Thanks,
      Stephan


      • #4
        My apologies.

        If the order of the reads in your files is messed up, you can "re-pair" the reads with the repair tool from the BBMap suite, as follows:

        Code:
        $ repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
        That said, each FASTQ sequence header should be unique within every sequence file. If that is not the case, then something is wrong with this data.
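
Before and after re-pairing, it can help to confirm whether the two files are actually in sync. Below is a rough Python sketch that walks both files in parallel and reports the first record whose names disagree. It only mirrors the spirit of the check behind the "paired reads have different names" error and is not BWA's code. It assumes 4-line FASTQ records, and stripping a trailing /1 or /2 is a choice made for this sketch.

```python
import io
from itertools import zip_longest

def read_name(header):
    """Name = header up to first whitespace, without '@' and without /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def first_mismatch(r1, r2):
    """Return (record_number, name1, name2) for the first out-of-sync pair.

    Returns None when both files list the same names in the same order.
    A name of None means that file ended before the other one.
    """
    headers1 = (line for i, line in enumerate(r1) if i % 4 == 0)
    headers2 = (line for i, line in enumerate(r2) if i % 4 == 0)
    pairs = zip_longest(headers1, headers2)
    for number, (h1, h2) in enumerate(pairs, start=1):
        n1 = read_name(h1) if h1 is not None else None
        n2 = read_name(h2) if h2 is not None else None
        if n1 != n2:
            return number, n1, n2
    return None

# Tiny in-memory example: records 2 and 3 are swapped in the second file.
r1 = io.StringIO("@a 1:N:0\nA\n+\nI\n@b 1:N:0\nC\n+\nI\n@c 1:N:0\nG\n+\nI\n")
r2 = io.StringIO("@a 2:N:0\nT\n+\nI\n@c 2:N:0\nC\n+\nI\n@b 2:N:0\nG\n+\nI\n")
result = first_mismatch(r1, r2)
print(result)
```

For real data, pass two open file handles (for example open("R1.fastq") and open("R2.fastq")) instead of the in-memory strings.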


        • #5
          Thanks for your reply.

          I also suspected that the raw file is not OK.

          Best regards,
          Stephan


          • #6
            If the sequence and Q-scores are identical for those 7 copies, then you could potentially keep just one and throw away the other 6.

            I am puzzled by how this could have happened though. No logical explanation comes to mind.
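
One way to try the "keep one copy" idea is sketched below in Python, as an illustration only. It keeps the first record for each read name and flags any name whose copies differ in sequence or quality, so those can be inspected by hand rather than silently dropped. It assumes 4-line FASTQ records and holds everything in memory, and the function names are hypothetical.

```python
import io

def records(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line FASTQ stream."""
    while True:
        rec = [handle.readline() for _ in range(4)]
        if not rec[0]:
            return
        yield tuple(line.rstrip("\n") for line in rec)

def drop_identical_copies(handle):
    """Keep the first record per read name.

    Returns (kept_records, conflicts), where conflicts is the set of names
    whose later copies differ in sequence or quality from the first copy.
    """
    seen = {}
    kept = []
    conflicts = set()
    for header, seq, plus, qual in records(handle):
        name = header[1:].split()[0]
        if name not in seen:
            seen[name] = (seq, qual)
            kept.append((header, seq, plus, qual))
        elif seen[name] != (seq, qual):
            conflicts.add(name)
    return kept, conflicts

# Tiny in-memory example: r1 has an identical copy, r3 has a differing copy.
example = io.StringIO(
    "@r1 1:N:0\nACGT\n+\nIIII\n"
    "@r2 1:N:0\nGGGG\n+\nIIII\n"
    "@r1 1:N:0\nACGT\n+\nIIII\n"
    "@r3 1:N:0\nTTTT\n+\nIIII\n"
    "@r3 1:N:0\nTTTA\n+\nIIII\n"
)
kept, conflicts = drop_identical_copies(example)
print(len(kept), conflicts)
```

For paired data, both files would need the same treatment, and the pairing should be verified afterwards (for example with the repair tool mentioned in #4), since the copies sit at different positions in R1 and R2.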


            • #7
              This happened to me twice, and re-running the demultiplexing fixed the problem. I suspect it has something to do with the number of threads used to write the FASTQ data. Best, Daniele
