  • spabinger
    Member
    • Jun 2011
    • 13

    Duplicate read names - BWA mem - paired reads have different names

    Hi,

    When running BWA mem on paired-end Illumina data, I get the following error (IDs replaced):



    [mem_sam_pe] paired reads have different names: "XXX:5:YYY:1:11102:4257:13510", "XXX:5:YYY:1:11102:15792:1058"

    I checked the fastq files and found that each read name appears 7 times in each file (exactly the same name). However, the order of the read names does not match between the pairs: in the example below, the third and fifth hits fall on different line numbers in R1 and R2.

    Example:

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R1.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    962773:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1164149:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R2.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1028309:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1229685:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
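
A minimal sketch of how the out-of-sync position could be located without grepping one name at a time. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and takes the read name to be the header text between "@" and the first space; the function names are made up for this illustration.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield the read name of each 4-line FASTQ record.

    Assumption: the name is the header text after '@' up to the first space.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            yield line.rstrip("\n")[1:].split(" ", 1)[0]


def first_name_mismatch(r1_lines, r2_lines):
    """Return (record_index, name_in_r1, name_in_r2) for the first record
    whose names differ, or None if both inputs agree record by record.
    A missing record (one file is shorter) is reported as None."""
    pairs = zip_longest(read_names(r1_lines), read_names(r2_lines))
    for idx, (n1, n2) in enumerate(pairs):
        if n1 != n2:
            return idx, n1, n2
    return None


# Usage (file names taken from the post above):
# with open("R1.fastq") as f1, open("R2.fastq") as f2:
#     print(first_name_mismatch(f1, f2))
```

The record index is 0-based; the corresponding header sits on line `4 * index + 1` of the file.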


    Is it ok for a fastq file to have multiple reads with the same read name?
    If not, could this be a problem of BCL conversion?
    How can I fix it?


    Thanks for your help,
    Stephan


    PS: bwa mem command:

    bwa mem -t 40 -v 1 hg19.fa R1.fastq R2.fastq > aln.sam
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Fastq headers should always start with an "@", so what you have is not following the standard. Have you asked the folks who gave you this data whether it has been post-processed in some way? And there should be no duplicates (let alone multiples) in raw sequence files, as far as the fastq header IDs are concerned.
    Last edited by GenoMax; 02-02-2016, 06:44 AM.


    • spabinger
      Member
      • Jun 2011
      • 13

      #3
      Hi,

      That's not the problem. See the "head" output below (sequence and quality strings truncated) and the grep result I posted.

      > head R1.fastq
      @XXX:5:YYY:1:11101:12923:1051 1:N:0:AGGCAGAA+NCGATCTA
      CTT...TTC
      +
      AAA...</<
      @XXX:5:YYY:1:11101:4797:1055 1:N:0:AGGCAGAA+NCGATCTA
      ACC...CTA
      +
      AAA...<A/


      Thanks,
      Stephan


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        My apologies.

        If the order of the reads in your files is messed up, you can "re-pair" the reads using the repair tool from the BBMap suite as follows:

        Code:
        $ repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq

        That said, each fastq sequence header should be unique within every sequence file. If that is not the case, then there is something wrong with this data.
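
To check the uniqueness point on a whole file, something like the following could be used. This is only a sketch: it assumes 4-line FASTQ records, treats the header text between "@" and the first space as the read name, and keeps every name in memory, which may be too much for very large files. `duplicated_names` is a made-up helper name.

```python
from collections import Counter


def duplicated_names(lines):
    """Return {read_name: count} for every name seen more than once.

    Assumptions: 4-line records; name = header text after '@' up to the
    first space. All names are held in memory.
    """
    counts = Counter(
        line.rstrip("\n")[1:].split(" ", 1)[0]
        for i, line in enumerate(lines)
        if i % 4 == 0  # header lines only
    )
    return {name: n for name, n in counts.items() if n > 1}


# Usage:
# with open("R1.fastq") as fh:
#     dups = duplicated_names(fh)
# print(len(dups), "read names occur more than once")
```

An empty result means every header ID in that file is unique.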


        • spabinger
          Member
          • Jun 2011
          • 13

          #5
          Thanks for your reply.

          I also suspected that the raw file is not OK.

          Best regards,
          Stephan


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            If the sequence/Q-scores are identical for those 7 copies, then you could potentially keep just one and throw away the other 6.

            I am puzzled by how this could have happened though. No logical explanation comes to mind.
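
A rough sketch of the "keep one copy" idea, in case it helps. It only drops a record when the name, the sequence and the quality string are all identical to a record already seen; records that share a name but differ in sequence or quality are left in place so they can be inspected. It assumes 4-line FASTQ records and holds the seen keys in memory. `drop_exact_duplicates` is a hypothetical helper, not part of BBMap or BWA.

```python
def drop_exact_duplicates(lines):
    """Yield FASTQ lines, keeping only the first copy of each record whose
    name, sequence and quality string are all identical.

    Assumptions: 4-line records; name = header text up to the first space.
    An incomplete trailing record (fewer than 4 lines) is not emitted.
    """
    seen = set()
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            header, seq, _plus, qual = record
            key = (header.split(" ", 1)[0], seq, qual)
            if key not in seen:
                seen.add(key)
                yield from record
            record = []


# Usage:
# with open("R1.fastq") as src, open("R1.dedup.fastq", "w") as dst:
#     dst.writelines(drop_exact_duplicates(src))
```

If R1 and R2 are filtered separately like this, the two outputs are not guaranteed to stay in the same order, so re-pairing the files afterwards would still be needed.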


            • danieleyumi
              Junior Member
              • Jun 2011
              • 1

              #7
              It happened to me twice, and rerunning the demultiplexing fixed the problem. I suspect it has something to do with the number of threads used to write the fastq data. Best, Daniele
