  • SAM/BAM sort by read names produces truncated read names

    Hi,

    I tried to sort an alignment file by read name, but the output contains truncated read names. This happens no matter which program I use: SAMtools sort (0.1.8), Picard SortSam (1.77), or Novosort (2.08).

    Here are the first few records of the original SAM file:
    Code:
    HWI-ST621:415:D197AACXX:8:1101:1        113     chr2    236798427       70      100M1S  chr8    3088040 0       ACCTCTGTTTCTAAGCAGTGGAATAGAATTGCTTATGGAATAGCCAGGTCATAGGATGTNATAANTTCCCTGGAAATCAGAGGGGAAAAGAAGCAAAACAN   C@?>?AC@:C@>CECDEE@ACFEBFFDEEHECDACADHFHFEHIJGJIGIHJJIHDB80#HF?1#GDJIHCIGGHHAIIIJJHEHJJIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:18 UQ:i:18 NM:i:2  MD:Z:59G4T35
    HWI-ST621:415:D197AACXX:8:1101:1        177     chr8    3088040 70      101M    chr2    236798427       0       AAATACATACATACACACAGACTGATTTTCTCTTCAGCAATATTTTAATGAAACCCCATACTGCAAATTACATAAACTAGTTAAAGTACACCAACCTCAAG   DEEDDDFDCEECEEDDBFFFDHHHFGHECJJIHFJJJIJJJIJHHGIHGDDGGJJJIIHGHIJJJIIJIGJJIIIFIIJJJJJJIIHFFAHHHDFFDFCCB        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        83      chr8    143208201       70      100M1S  =       143207998       -303    CGCTGAGAGCAAGGTGCCAGCAGGGTGGGCCCTTCTGGAGGCTCCGGCCGGGATCTGTTCCAGGCCACCCCCGCCTTCCGGCCATCCTCAGCTTGGCTCCN   >@CA>A:A>>>3(CA<AACDDDB<<?3?@9?CDCDCBCC?7<BBDBB@<93?DCCAA8<B?A<<DB7DCIGGBHGAHIIHFJJIEJIIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:47 UQ:i:47 NM:i:1  MD:Z:6C93       PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        163     chr8    143207998       70      92M     =       143208201       303     TTGTGGAGTCAGGTGTCCCTGGGGTCACGGTGACTGGCCAGGCGNGGGGAGCCAGGAGGCACACGGTCCTGGGCTCTNGCAGGGCTGGAGTG    @BBDFFADD?FHH@@EGGGGIIII@BCGHG8?DGHGB@FHHGAG#-<CC;@E?ACEE?B7?BCA?B;?BDDCB9??A#++28?B?B@B1<>A PG:Z:novoalign  RG:Z:LS148      AS:i:12 UQ:i:12 NM:i:2  MD:Z:44C32G14   PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:14       65      chr6    74783346        70      1S100M  chr1    1867309 0       NGATTAAGCAGCCAAGCTGTATCCTGAGGGAAACATGGGCAATGGAAAGCATCAGATTTCCTGGGTCAAAGCTATCCTGAGCTCAGGCACTGGGCTAACTG   #4=DFFFFGHHHHJJJJJJIJJJJJJJJJJGHIJIHIIJIGIIJJBFHIIIJJJJDIJJIHHIJJIGGHHHHHFFFFFFEDEEEEDDD@DDDDDDCDCDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       129     chr1    1867309 70      101M    chr6    74783346        0       ACACACACACACACACACGAACTGCAGGGGGCTCTGGAGCCATGGAGTTAGAAAAGCTCTCTGAGAGGCCAGGTGTAGTGGCTCATGCCTGTAATCCCAGC   CCCFDFFFHHHHGJJJIJJIJJJJJFHIJIJFHIJJJDHEHHHHG@D?BDACCEDCBDDDDDDCDDDDBDBDB@CCCCCCBDDCCC@ACAC@>AB>CCACD        PG:Z:novoalign  RG:Z:LS148      AS:i:30 UQ:i:30 NM:i:1  MD:Z:68T32
    HWI-ST621:415:D197AACXX:8:1101:14       97      chr2    62756955        70      1S100M  chr6    74783591        0       NGTGCTGTTTGGTTTGTGTGTATTATATGGGTTTGGATTACAATAATTCCTCCCTTTTGTATAATGTTTTGCAGTTTTTAAAGCACTTCATGCTCTAAATC   #1=DDFFDHHGGFHIIHHEHGFGIDHHIIIIFGIIICGGEHHHIIIII>GGGIIIIIIIICFGHHGGHIIIIDAAEHHHEBDDFCEEECCDCCCCCC>ACC        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       145     chr6    74783591        70      101M    chr2    62756955        0       ATTTTTGTAAGTCACCAATGGTTGGATGTTGGCAGTTTCATAAGGTTCATTCTAATAGTTCCTGGGACACAAATGACTCGAAGTAGGTCAAGACAGGTTCA   <DDDDDDDDDDDDEEECCFDFFGHEHGJJIIJJJJIGHIHCIIIJIGCGIIGDIHEIIHGIJJJJIHIIJJIIHGBHHJIJJJJJJJJHHHFHFFFFD?C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        81      chr1    155944063       70      101M    chr11   19838477        0       CAGCTGTACCTGGCAGCAGCCCCTTCCCCAAGATGGTGACACCTCTGTCCACACCCTCTGTAATAGTGACCGGAGAGCCTGTGGAGCATTCCACCAGGATT   DDDEDAA:BCAA:DD@BDDDDB?@=BDEDEEDFFFD@;??=HHIIIIGJIHF<JIHFGBIHIJIIIIIHJJJJJJJIJJJJIJJJJJIHHHHHFFFFFCC@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        161     chr11   19838477        70      101M    chr1    155944063       0       AGCCCCTTATGCAGAAAAAGGGACTCCACCTGGAGCCCTCTCTGGATCTACTTCTCCCAGATAAATCAGTCGGCTGTGTAATCTTTCAGGAAACCTGACCC   ??<DDFFFFHHDDDHIGDDAFE9FFGHGCHEGG9FGGHGGGGCFHBF*0BBCBGGE@GHGCHA@ECE@H;ADBFDCDDCCDD@CCC;33:32:595<9>3<        PG:Z:novoalign  RG:Z:LS148      AS:i:1  UQ:i:1  NM:i:0  MD:Z:101
    After sorting:
    Code:
    HWI-ST  81      chr7    83652142        70      82M     chr8    142160880       0       CTTTGTATTTACAGATACCACGGCCATTTTGCAATGTCCTCAGCACATAGTGGAAGCTGAACAAACAATCACATTTTCTAAT      @D<EA?7)==77@=7)('-'FF;FABB*0>EDB9DFDGDEBDEECC<FHHHBE@9HHEAB<;>FFDBBFA<DFA;A,B48;?   PG:Z:novoalign  RG:Z:LS148      AS:i:22 UQ:i:22 NM:i:1  MD:Z:76A5
    HWI-ST  65      chr9    120922414       70      101M    chr6    160312253       0       TCACTGAGTCTGATTGAAGCAACTGGCATTGGTGATCATACTTCAATATTTCTCTCATATTTGAAGTTAGAATTAGTTGATGTGAGATATTATATTAGCCT   @CCFFFFFHFHFAHHIDGHIJGIIJGHCGIJICFHIIIIIJIJJIIJGIIEIJHHGGIICGHIBGHFGHHGGHIDC@DHGIHGIGHHHHECBDFFFFFEDE        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  81      chr2    46872242        70      101M    chr17   79461315        0       CATGGATTAAAATATTAAGTAATTTGATCTAGATGATTGTTTACAGTTTAACGCAAATACACTTAGTCTGTTCTGATTATTTACTCAAGGATTATATTACT   >C>:EDDFCDDFFDFFHHHHHHJIHGG=GIGJJIIIIJGIIIHIJHDGGHHJFIIJIIGC:JHHAIIFJJJIHGH@IJJJHHCGB>HGGHGHHFFFFF@C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  65      chr8    103315908       70      93M     chr17   40205036        0       AGATATCTGAGAAACTGACCTAAATAAGCAATCTGAAAAGATTAAGGTTCCTTCAATTATTATACTACTTGTTCTCCAAATAACACACTAACT   <@@ADD>DDBA<FG?A43?@FFF:3AEB>DFECE91:C<CFCFCFFC::4?D>FCDDD<FC8DFEFDG88@.==C=4@D;7@:7?CCBDD@>@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:93
    HWI-ST  89      chr16   61016706        70      101M    =       61016706        0       TGTTGAGTCAATGTAAGACCTTGGTAAGAATTCTTCAATTTAGACATGGCTAATTTTTAATGTCAACCACAGCTATTGAGGTACTTATATTAATTAACCTT   C?CECACCFFFFDDDE?=CCGGIIIGGEGIIIGGIIEGIIHHDBFGIGFIIIHGIIIIIGGIHG@CHHHGHHHGHDEIFIIGIHBIIIHHDDHEDEDF@?@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  97      chr12   16510044        70      101M    chr9    75346048        0       TAATAAAAATTCAGTTTTAACTATAGATGCCTTCTTCTCCTCTTGTGTTTGATTTATTGCTCCAAATGGGCCAACCTGGATGTCTATATTTCTTCCACTAA   CCCFFFFFHHHHHJIIGIIJJIJIIJJJJJJIEIIJJJJJHJIJGFGFHJJJIIJJJJJJJJIGJJJJIIJJJIJHJHHFHHBBEDFFCFEFEEEEDDDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  73      chr5    22843028        70      97M     =       22843028        0       TAACTGTGTTTACTTTTCTCAGTTTCTACCAGAGAAAAGGCAGGTGCATTTTTTTGGTATGTTTGTGTAAAGTGAATTTGGCTTTACTTTTTCAAAT       =?<DD>=;FHDFFHGE@EFH?EA<B4AA@EBGCC1?91*:8CFG0?@?<D@@B;AFB=7=3?CHEEBE77B@6>;(6;.;;@;?>A>5(5:@CC5@>    PG:Z:novoalign  RG:Z:LS148      AS:i:3  UQ:i:3  NM:i:0  MD:Z:97
    HWI-ST  73      chr6    152150636       70      101M    =       152150636       0       CATTTGTCATCATTACACGGTCATGGGAGTGCTAAGAAGACTTAAATGCAGGGCTACCACCCCTTCCCAATTCATCTTTTATCCATTTTATTTCTCTAAGG   @CCDDDDEHHFHHFBHGGHHAFEFFHIGG:?CFGIGIGGHHEGIEHIGHGDE@;B=FA@F@FGGGEEHECCFFEFFCECDECCCDDDEDDCC@BCC>CCCC        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  113     chr7    63064316        30      101M    chr17   26080536        0       CCTGCTCATCTCAGGCCTGCCGGCTCCTCCACCTGCCTTTTCGAGTACCCTGGGAACCCCCCGAGGACAGGTGTCATCGGTTGCTTCATCTCACCATCCCT   A94+(:ACCC??@BB@@7DDBDB<2????@8;BDB@A@BCDBCCCA<-DCC>3?8DB=7@@IHCIIJIGIJIIIJGHHGGGGHGEIDIFFFFAFFFDF@@@        PG:Z:novoalign  RG:Z:LS148      AS:i:31 UQ:i:31 NM:i:1  MD:Z:42C58
    HWI-ST  89      chr4    96140737        70      101M    =       96140737        0       AACAACGAGCCTCACTAGGTGACGATTAGCTATGGTTTCCCTGGTCTATACTGGATTTGGGTTCATTGGTAAATCATTCTATTCATAGCAATACAAGATAT   <<A?8DDDDDDCCAEEEFFFFHHHHHFIJJJJIIIIIGIGHIFIGHIIGDGGIJIJJIIHIHIEHIIJJJJJJIIJJJIJJIIJJIJJHGHHHFFFFFB@@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    Does anyone have any idea of what's wrong with the programs or data?

    Thanks a lot!

    Allen

  • #2
    Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?



    • #3
      The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
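The check described above can be sketched in Python. This is only an illustration: the seven-field name layout (instrument:run:flowcell:lane:tile:x:y) is an assumption based on the names in this thread, and `audit_qnames` is a hypothetical helper, not part of SAMtools or Picard. It lists names that do not match the pattern and names that occur on more than two records, which should not happen for paired-end data without secondary alignments.

```python
import re
from collections import Counter

# Assumed layout of a read name: instrument:run:flowcell:lane:tile:x:y
ILLUMINA_QNAME = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")

def audit_qnames(sam_lines):
    """Return (names that fail the pattern, names seen on more than two records)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        counts[line.split("\t", 1)[0]] += 1  # QNAME is the first tab-separated field
    malformed = sorted(n for n in counts if not ILLUMINA_QNAME.match(n))
    overused = sorted(n for n, c in counts.items() if c > 2)
    return malformed, overused
```

Calling it as `audit_qnames(open("input.sam"))` (the file name is a placeholder) on the original SAM file would show whether the names were already truncated before sorting.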



      • #4
        Originally posted by maubp
        Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?
        You are right, that was a typo. Thanks for spotting it.



        • #5
          Originally posted by dpryan
          The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
          Yes, you are right. It seems the read names were corrupted by novoalign; the original read names in the FASTQ file were fine:

          Code:
          @HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:
          NCAGAATGAGCAATTAGAAATCCTCTGTNNTNNTAGNNNNCTGGAAATTAAACCAAGTGTATAATGCACCTAATGAAGTGTATGGTCTGANGTTTAANTAG
          +
          #1=DDFFFHHHHHJJJJJJJJJJJJJJI##2##1:C####00?DHGIJJJEHIHIEHCHFGIIJJJIGEEHHFEHFFFDDDFEEECDEDC#,5<@@C####
          @HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:
          TTTGAACATCCCCACTAGGTTCTTTTCCATTGNCAANNNGGAGCATCAGCCAGTGAATCTGTTTCAGGTTTCCATTCTGCAGAACTCCTCCAAAGCATGTG
          +
          CCCFDFFFHHHHHEHIJJJCHHIIJJIIGGIG#1:C###00?DHIJHGIIJJJGHIEHIIIGDHGIJI@DHFH>AEHFFFFFFECCCCEDCDCCDDDCDCC
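One way to confirm where the names go wrong is to compare the read names in the FASTQ input with the QNAMEs in the aligner output. The sketch below assumes plain 4-line FASTQ records and takes the read name to be the header text before the first space. The helper names are hypothetical.

```python
def fastq_names(lines):
    """Collect read names (header text before the first space) from 4-line FASTQ records."""
    return {ln[1:].split()[0] for i, ln in enumerate(lines) if i % 4 == 0 and ln.strip()}

def sam_names(lines):
    """Collect QNAMEs from SAM alignment lines, skipping '@' header lines."""
    return {ln.split("\t", 1)[0] for ln in lines if ln.strip() and not ln.startswith("@")}

def names_missing_from_fastq(fastq_lines, sam_lines):
    """QNAMEs that appear in the SAM output but not in the FASTQ input."""
    return sorted(sam_names(sam_lines) - fastq_names(fastq_lines))
```

Any name returned by `names_missing_from_fastq` was introduced after the FASTQ stage, which points at the alignment step rather than the sorter.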



          • #6
            Hi Allenyu

            Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
            Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
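To check whether read1 and read2 really are in the same order, something like the following Python sketch could be run on the two FASTQ files first. It assumes 4-line records and compares headers position by position. The `max_dist` parameter is a hypothetical tolerance that only loosely mirrors the idea of allowing a small header difference; it is not Novoalign's own check.

```python
from itertools import islice, zip_longest

def fastq_headers(lines):
    """Yield the header (without '@' and newline) of each 4-line FASTQ record."""
    return (h.rstrip("\n")[1:] for h in islice(lines, 0, None, 4))

def hamming(a, b):
    """Count mismatching positions; a length difference counts as mismatches too."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def first_unpaired(read1_lines, read2_lines, max_dist=1):
    """Return (record index, header1, header2) for the first pair whose headers
    differ by more than max_dist characters, or None if all pairs line up."""
    pairs = zip_longest(fastq_headers(read1_lines), fastq_headers(read2_lines), fillvalue="")
    for i, (h1, h2) in enumerate(pairs):
        if hamming(h1, h2) > max_dist:
            return i, h1, h2
    return None
```

A result of `None` means every read1 header sits next to a read2 header that differs by at most `max_dist` characters (typically just the 1 or 2 marking the mate).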



            • #7
              Hi Allen,

              Yes, you need to sort your FASTQ input before running Novoalign. No luck, man.
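A minimal in-memory sketch of sorting a FASTQ file by read name, assuming 4-line records and a file small enough to fit in RAM (for large files an external sorting tool would be the better choice). The function names are hypothetical. Applying the same sort to both the read1 and read2 files should leave mates at matching positions, provided their names are identical up to the first space.

```python
def read_fastq(lines):
    """Group a stream of lines into (header, sequence, plus, quality) tuples."""
    stripped = [ln.rstrip("\n") for ln in lines]
    if len(stripped) % 4:
        raise ValueError("FASTQ input is not a multiple of 4 lines")
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]

def sort_by_name(records):
    """Sort records by read name, i.e. the header text before the first space."""
    return sorted(records, key=lambda rec: rec[0][1:].split(" ", 1)[0])
```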


              Originally posted by zee
              Hi Allenyu

              Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
              Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
              Marco



              • #8
                Thanks! Now trying to use sorted reads first.
