  • Duplicate read names - BWA mem - paired reads have different names

    Hi,

    running BWA mem on paired-end Illumina reads, I'm getting the following error (IDs replaced):



    [mem_sam_pe] paired reads have different names: "XXX:5:YYY:1:11102:4257:13510", "XXX:5:YYY:1:11102:15792:1058"

    I checked the fastq files and found that each read name occurs 7 times in each file (exact same name). However, the order of the read names does not match between the pairs (compare the line numbers in the two grep outputs below: the third and fifth hits differ).

    Example:

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R1.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    962773:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1164149:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA

    > grep -n "XXX:5:YYY:1:11102:4257:13510" R2.fastq
    761397:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    862085:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1028309:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1063461:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1229685:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1264837:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
    1365525:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA


    Is it ok for a fastq file to have multiple reads with the same read name?
    If not, could this be a problem of BCL conversion?
    How can I fix it?


    Thanks for your help,
    Stephan


    PS: bwa mem command:

    bwa mem -t 40 -v 1 hg19.fa R1.fastq R2.fastq > aln.sam
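
As the error message suggests, the two files are expected to list mates in the same order, so the record at a given position in R1.fastq and R2.fastq should carry the same read name. The following is a minimal sketch in Python for locating the first position where that breaks. It assumes plain (uncompressed) 4-line FASTQ records and Illumina-style headers where the name is everything before the first space; the function names `read_names` and `first_name_mismatch` are made up for this illustration and are not part of bwa.

```python
from itertools import zip_longest


def read_names(path):
    """Yield the read name of every 4-line FASTQ record in `path`.

    The name is the header text after '@' up to the first whitespace,
    so the trailing '1:N:0:...' / '2:N:0:...' part is ignored.
    """
    with open(path) as handle:
        for i, line in enumerate(handle):
            # Header lines are lines 0, 4, 8, ... of the file.
            if i % 4 == 0 and line.strip():
                yield line[1:].split()[0]


def first_name_mismatch(r1_path, r2_path):
    """Return (record_number, name1, name2) for the first pair of records
    whose names differ, or None if both files are in sync.

    record_number is 1-based. If one file is shorter, the missing name
    is reported as None.
    """
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for record_number, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return record_number, name1, name2
    return None
```

Calling `first_name_mismatch("R1.fastq", "R2.fastq")` returns `None` when the files are in sync; otherwise the record number times four gives the approximate line to inspect with `sed` or `head`.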

  • #2
    Fastq headers should always start with an "@", so what you have does not follow the standard. Have you asked the people who gave you this data whether it has been post-processed in some way? There should also be no duplicates (let alone multiple copies) of fastq header IDs in raw sequence files.
    Last edited by GenoMax; 02-02-2016, 06:44 AM.



    • #3
      Hi,

      that's not the problem. See "head" result (Sequence and quality trimmed) and also the grep result I posted.

      > head R1.fastq
      @XXX:5:YYY:1:11101:12923:1051 1:N:0:AGGCAGAA+NCGATCTA
      CTT...TTC
      +
      AAA...</<
      @XXX:5:YYY:1:11101:4797:1055 1:N:0:AGGCAGAA+NCGATCTA
      ACC...CTA
      +
      AAA...<A/


      Thanks,
      Stephan



      • #4
        My apologies.

        If the order of the reads in your files is messed up, you can "re-pair" the reads using the repair tool from the BBMap suite as follows:

        Code:
        $ repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
        That said, each fastq sequence header should be unique within every sequence file. If that is not the case, then there is something wrong with this data.
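
The uniqueness requirement above can be checked directly. This is a rough sketch in Python, assuming an uncompressed 4-line FASTQ file and names delimited by the first space in the header; `duplicated_read_names` is a hypothetical helper name, not a BBMap or bwa function. It keeps every name in memory, so for very large files a sort-based approach may be preferable.

```python
from collections import Counter


def duplicated_read_names(path):
    """Return {read_name: count} for every name that occurs more than once.

    Only header lines (every fourth line, starting with the first) are
    inspected; the name is the text after '@' up to the first whitespace.
    """
    counts = Counter()
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0 and line.strip():
                counts[line[1:].split()[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

An empty result means all names in the file are unique. For the file described in this thread, one would expect affected names to map to 7.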



        • #5
          Thanks for your reply.

          I was also suspecting that the raw file is not ok.

          Best regards,
          Stephan



          • #6
            If the sequences and Q-scores are identical for those 7 copies, then you could potentially keep just one and discard the other 6.
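
The "keep one copy" idea can be sketched as follows. This is an illustration under assumptions, not a tested pipeline: it assumes uncompressed 4-line FASTQ records, names delimited by the first space, and enough memory to hold one sequence/quality pair per read name. `drop_identical_duplicates` is a hypothetical helper. Records that share a name but differ in sequence or quality are deliberately kept and reported rather than dropped, since that case points to a different problem.

```python
def drop_identical_duplicates(in_path, out_path):
    """Copy in_path to out_path, skipping exact repeats of earlier records.

    A record is skipped only if an earlier record had the same name AND
    the same sequence and quality strings. Returns a tuple
    (written, dropped, conflicts), where conflicts is the set of names
    seen with differing sequence or quality.
    """
    seen = {}          # read name -> (sequence, quality) of the first copy
    written = dropped = 0
    conflicts = set()
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file (or trailing blank line)
            name = record[0][1:].split()[0]
            payload = (record[1].rstrip("\n"), record[3].rstrip("\n"))
            if name in seen and seen[name] == payload:
                dropped += 1
                continue
            if name in seen:
                conflicts.add(name)  # same name, different content: keep it
            else:
                seen[name] = payload
            dst.write("".join(record))
            written += 1
    return written, dropped, conflicts
```

Running this on R1 and R2 separately does not guarantee that the two outputs end up in the same order, so the pairs may still need to be re-synchronized afterwards, for example with the repair.sh command shown in #4.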

            I am puzzled by how this could have happened though. No logical explanation comes to mind.



            • #7
              This happened to me twice, and re-running the demultiplexing fixed the problem. I suspect it has something to do with the number of threads used to write the fastq data. Best, Daniele
