
  • Weird SAM format chromosome column

    Dear all,

    In my previous experience, the third column of the SAM files output by Bismark is the chromosome name with a "chr" prefix. My new data, however, has a mixture of chr<number> and <number> in that column. Is that common?

    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/1 115 15 3399663 255 100M = 3399567 -196 CAAAATACCAAAAAATAAAACACAAAATAAAAAAACTATTCTTCCTACCTAAAAACATAATAACTTCCACATCAATAATTCTTTATTACATAAATTATAN #DDDC@</=BCEEEC>ECCD@B@?7EA=HIIIIGBIHGF>ACFFF??GGIIIHGCDD<<<B?19CIGHHIIIEHHEIHEIGIGEDEHHFHDHDDDBB=1# NM:i:23 XX:Z:2G3G3G1GGG1G1G5GG4G2GG12T2G1G1G1C2G1G31G4GT XM:Z:..x...h...x.hhh.h.h.....xh....h..hh...............x.h.h....h.h...............................h....h. XR:Z:CT XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/2 179 15 3399567 255 100M = 3399663 196 AACAAAAACAATTATAAAACTAAACTAAAAAAATCCCAAATCAAAATTTTAATATTAATTTATTCATTCACCTCACAAATAAATAAAAATATTTATCAAA B@@DD;DDFDDHDEBFHIIGGIJJIJJIJIIE>BD<?FDFGBGGID@CGICCGGIJJJJIJJJI=AECHHAE>BFFCCEE@ECCDAAC@=C>CCCDC;AC NM:i:24 XX:Z:1G2GGG3G2G1GGG3GG3GG1G8GG4G5G2G2G25G4G6G3G1 XM:Z:.h..xhh...x..h.hhh...xh...xh.h........xh....h.....h..h..h.........................h....h......h...x. XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/1 115 chr9 57387259 255 100M = 57387164 -195 AAACAAGCATCTTAAAATAACTATTAAAATTCAAAAAACTATATATCCTCAAAACTAAAAATAATATCAAATCCATAATCTTAAAATCCTCTTTCTAAGN ?A>5>;-.;(.6:EEDA>?@?=;DDDDACIIDECB=<??DBBEDDDDDD4DBIDIFCDD>EE?C9BDB<4<DEFEFEAEBDC+A9<B>DDBDB=A11# NM:i:21 XX:Z:1G3G8GG2GG2G2G7GG1G14G1G3G2G1G2G4G11GG15T XM:Z:.h...xH.......hh..hh..x..h.......xh.h..............x.h...h..h.h..h....h...........hh..............H. XR:Z:CT XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/2 179 chr9 57387164 255 100M = 57387259 -195 CAAAATATTTATACTACCTACTATATACACAACACAATACTAAATTCCACTTTACTCCCTAATCTTCCACTATTCCCTCTTCCCTAAACAAAAAAAAACA ?@@FF?D4=E?FF3:AA:C4C:FGG>3+AFHIJGDH@DHGBDHIGG?@<FGGGHJIDCDHCHBGGIIJIIBEE=EHE;A?E7;7?7;;AA@BDD@? NM:i:19 XX:Z:6G3G1G2G3G2G1G1G4G3GG1G3G10G7G25G4G1G1G3 XM:Z:......h...h.h..x...x..x.h.h....x...zx.h...h..........h.......h.........................h....h.h.h... XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1406:1986_1:N:0:CAAAAG/1 115 12 77985032 255 100M = 77984894 -238 AATATTAACATAAAACCAAAACAACAAAATATCTAAAACACTCCAATCCCCACTCATTCCAAACTTCAAACTACTAAATCAAAAATATTACATTTCATTN CDAEDEED@C>DDBB@EFFFD@HHCHHGGECCAEHD=.=8HFJHIHCHEGB9IIIHGGGIJIGIGDGIHHGEEJEJGIGIJJJJJIJHHFGHFFFDD=4# NM:i:30 XX:Z:3G2GG1G1GG1G3GG2GG3GG1G3GG8G16G6GG5GGG3GGGG1G2G9A XM:Z:...h..hh.z.hh.h...xh..zx...hh.h...xh........z................x......xh.....xhh...xhhh.h..h.......... XR:Z:CT XG:Z:GA

  • #2
    That's odd. You might run the following on your reference FASTA file to see whether this is expected:
    Code:
    grep ">" reference_genome.fa
    If ">15" shows up, then this is normal, though it would be odd to have that and chr9 in the same FASTA file. Bismark does play around a bit with contig names, but a bug in the code dealing with that should result in different behaviour.



    • #3
      Originally posted by dpryan (#2)
      Bismark takes whatever the FASTA files have in the header up to the first whitespace. If you get '15' and 'chr9' in the output, I would assume that these entries looked like '>15' and '>chr9' in the FASTA files you used for the genome indexing process. I think it does replace '|' characters with underscores, but it would certainly not add or remove 'chr'.
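
One way to check this on the reference itself is to list the name part of every header (the text after '>' up to the first whitespace) and count how many names carry a "chr" prefix. This is only a sketch: the function names `fasta_names` and `count_chr_prefix` are made up for illustration, and the '|' replacement mentioned above is not modelled.

```shell
# Print each sequence name from a FASTA file: the text after '>' up to
# the first whitespace. (Hypothetical helper name.)
fasta_names() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# Count names with and without a "chr" prefix to spot mixed naming.
# (Hypothetical helper name.)
count_chr_prefix() {
    fasta_names "$1" | awk '
        /^chr/  { n_with++ }
        !/^chr/ { n_without++ }
        END { printf "with_chr=%d without_chr=%d\n", n_with, n_without }'
}
```

Running `count_chr_prefix reference_genome.fa` and getting non-zero counts on both sides would confirm that the mixed names come from the FASTA headers.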



      • #4
        Originally posted by fkrueger (#3)
        Thanks fkrueger,

        You are right. This is what happened in my FASTA file (some names are chr<number> and some are <number>). Is there a convenient way to add "chr" before the chromosome number in the third column of the SAM file when it is missing? Thanks!



        • #5
          Originally posted by serenaliao (#4)
          Just to follow up, I used the following:
          Code:
          awk '{if($3!~/^chr/){$3="chr"$3} print($0)}' filename
          Does this sound reasonable?



          • #6
            I am no expert with awk, but it looks OK and should be easy enough to test (maybe on a few lines first). Any clues why your FASTA files have mixed chromosome names?
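
A few things are worth checking when testing an awk one-liner like this on SAM. Assigning to a field makes awk rebuild the line with its output separator, which defaults to a single space, so the tabs would be lost unless FS and OFS are set to tab. Header lines (starting with "@") and unmapped reads ("*" in column 3) would also get a "chr" added where it does not belong. Below is a minimal sketch that handles those cases, assuming tab-delimited SAM input; the function name `add_chr_prefix` is made up, and renaming the @SQ SN: entries and column 7 (RNEXT) to match is an assumption about what downstream tools expect.

```shell
# Add a "chr" prefix to reference names that lack one, keeping tabs intact.
# (Hypothetical helper; reads the named files, or stdin if none are given.)
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         # @SQ header lines: rename the SN: tag so it matches the records
         /^@SQ/ {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^SN:/ && $i !~ /^SN:chr/) sub(/^SN:/, "SN:chr", $i)
             print; next
         }
         # all other header lines pass through unchanged
         /^@/ { print; next }
         # lines without the 11 mandatory SAM fields are left alone
         NF < 11 { print; next }
         {
             # column 3 (RNAME): skip unmapped reads marked with "*"
             if ($3 != "*" && $3 !~ /^chr/) $3 = "chr" $3
             # column 7 (RNEXT): skip "*" and "=" (same reference as RNAME)
             if ($7 != "*" && $7 != "=" && $7 !~ /^chr/) $7 = "chr" $7
             print
         }' "$@"
}
```

To try it on a few lines first: `head -n 1000 filename | add_chr_prefix | less -S`. Note that if the reference already contains both "15" and "chr15" as separate sequences, renaming would merge them, so it is worth checking the FASTA names before relying on the output.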
