Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • splaisan
    senior molecular biologist
    • Jun 2009
    • 32

    compressing reference genome and indexing

    Hi all

    I report here a strange behavior I experience on both OSX10.6.8 and centos 5.7 with latest software installed (macport and yum respectively)

    If I download archived versions of the human reference genome from the Broad ftp (bundle 1.2)
    and run the command:
    Code:
    zgrep ">" human_b36_both.fasta.gz
    I get it all OK until the end and suddenly pages of binary garbage occur

    I thought this had to do with corruption of the archive and I did the following:
    expand the archive back to multifasta
    reformat the fasta content using bioperl (in and out as fasta)
    recompress with razip (samtools 1.18)
    repeat the zgrep command
    I get the garbage again!!

    Is this normal and due to some specificity of razip?

    ## below a series of commands and results
    Code:
    # sorry for the length
    $> razip -c human_b36_both.fa > human_b36_both.fa.gz
    $> zgrep ">" human_b36_both.fa.gz
    >1
    >2
    …
    >NT_113899
    >NT_113965
    >NT_113898
    >NC_007605
    ?f???,ixAp_?Òoё\G?C?Nm????R??[D?;?œ?)?pj5X??UL?`??mM?l%??ZºŐP?BI??W??d???HCoo??DS?ѷivfq(??X??U???w??? }?C"R???¿?.???????.\???,??7???bҳ*?k??F?b?l!??M??Ս??[???D?T?NfJ?8Ɉ?f?p??cGm
                                                 ?<?:vRv?Hd?ղ???C.??߉?ye???N?
                                                                             U?4CY???w??<??v!?@?o?w;??İ?xHD?b+????|????e9???D?
                          ??:??(fUY???m??jL??o?»B}??!;?X?c????` ?\Y4??)ß????<?/Þ?@@j!'?Y?B?2???"?$?	I???9_?5??=묦???H1?l??Q??|?{??6[????G?a;:&??gw?<e?u2???R,]?P?%?Vd')Y?_K?ae
                                                                          -Z ʂ@??g?YvF?By??q?'??m?Z&? (~5?ʈ???????3??8{?W?j?? 7_?L??-??r?kԊRb?׏?g?8?<?6??$???K??M?-
    ?$H?k?r??v%Jp˴lںxSJ
                       ?q????Khg?	db?>??b?q`E?RJ ?~lH?????%?m.???X?+??t?ߒ??%̽ @??ޫ?)`?it[?w??:?݃ݓY
                                                                                                 ?P3fg$j?????t?>??e?9?n?5?????y23?2WgT?f?*?=l??`ԊU?C??????P?TO
                                                          ??~?4dg?mq&z3??ZJ?qP-??j??r?*??????20?;vRe*?B??LD]
    #(… many many such pages of trailing binary garbage)
    
    # while the tail of the fasta file is clean
    $> tail human_b36_both.fa
    ATGGGGGGCCGCGCATTCCTGGAAAAAGTGGAGGGGGCGTGGCCTTCCCCCGCGGCCCCC
    CAGCCCCCCCGCACAGAGCGGCGCTACGGCGGGCGGGCGGCGGGGGGTCGGGGTCCGCGG
    GCTCCGGGGGCTGCGGGCGGTGGATGGCGGCGGACGTTCCGGGGATCGGGGGGGTCGGGG
    GGCGCCGCGCGGGCGCAGCCATGCGTGACCGTGATGAGGGGGCAGGGTCGCAGGGGGTGT
    GTCTGGTGGGGGCGGGAGCGGGGGGCGGCGCGGGAGCCTGCACGCCGTTGGAGGGTAGAA
    TGACAGGGGGCGGGGACAGAGAGGCGGTCGCGCCCCCGGCCGCGCCAGCCAAGCCCCCAA
    GGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCCCAGGGCTGACCCCGGCAAAC
    GTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGTGGGTGACACA
    GGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG
    CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT
    
    # indexing the archive works fine
    $>samtools faidx human_b36_both.fa.gz
    $>cat human_b36_both.fa.gz.fai
    1	247249719	3	60	61
    2	242951149	251370554	60	61
    ...
    NT_113898	1305230	3143346495	60	61
    NC_007605	171823	3144673490	60	61
    
    # extracting the last record also works and ends just like the tail above
    $>samtools faidx human_b36_both.fa.gz NC_007605
    >NC_007605
    AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
    GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
    CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
    ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
    …
    GGGGGGCGGGGAGCGGGCAATGGAGCGTGACGAAGGGCCCCAGGGCTGACCCCGGCAAAC
    GTGACCCGGGGCTCCGGGGTGACCCAGCCAAGCGTGACCAAGGGGCCCGTGGGTGACACA
    GGCAACCCTGACAAAGGCCCCCCAGGAAAGACCCCCGGGGGGCATCGGGGGGGGTGTTGG
    CGGGGGCATGGGGGGGTCGGATTTCGCCCTTATTGCCCTGTTT
    Last edited by splaisan; 01-29-2012, 05:50 AM.
    http://www.bits.vib.be/index.php
  • Richard Finney
    Senior Member
    • Feb 2009
    • 701

    #2
    It appears that razip uses gzip compression (or other kinds of compression) but is not gzip. The output contains chunks of (gzipped) compressed data but the entire file is not a gzipped file. razip output cannot be gUNzipped (but can be uncompressed using razip). zgrep tries to UNgzip the file, succeeds party, but then runs into the uncompressed part and fails.

    Comment

    • splaisan
      senior molecular biologist
      • Jun 2009
      • 32

      #3
      Thanks Richard

      I wrongly assumed this was a bonafide archive. Never mind, I will keep a second copy compressed with bgzip to reduce storage as compared to the plain 3GB fasta and use one or the other depending on the needs.

      Great help, thanks
      Stephane

      Originally posted by Richard Finney View Post
      It appears that razip uses gzip compression (or other kinds of compression) but is not gzip. The output contains chunks of (gzipped) compressed data but the entire file is not a gzipped file. razip output cannot be gUNzipped (but can be uncompressed using razip). zgrep tries to UNgzip the file, succeeds party, but then runs into the uncompressed part and fails.
      http://www.bits.vib.be/index.php

      Comment

      • lh3
        Senior Member
        • Feb 2008
        • 686

        #4
        It is part of the razip problem and part of gzip. If you run "gzip -dc | grep", you get normal output. But if you run "gzip -dcf | grep", which is what zgrep is actually calling, you get those rubbish.

        Comment

        • splaisan
          senior molecular biologist
          • Jun 2009
          • 32

          #5
          thanks Heng

          This helps a lot and I will alias it for regular use!
          You just saved me several gigabites of disk space.
          Cool
          Stephane

          Originally posted by lh3 View Post
          It is part of the razip problem and part of gzip. If you run:
          Code:
          gzip -dc | grep
          , you get normal output. But if you run "gzip -dcf | grep", which is what zgrep is actually calling, you get those rubbish.
          http://www.bits.vib.be/index.php

          Comment

          Latest Articles

          Collapse

          • GATTACAT
            Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by GATTACAT
            Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
            07-01-2026, 11:43 AM
          • SEQadmin2
            Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by SEQadmin2


            I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

            Here are nine questions we think about, in roughly the order they matter, before...
            06-18-2026, 07:11 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, 07-02-2026, 11:08 AM
          0 responses
          7 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-30-2026, 05:37 AM
          0 responses
          12 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-26-2026, 11:10 AM
          0 responses
          20 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-17-2026, 06:09 AM
          0 responses
          54 views
          0 reactions
          Last Post SEQadmin2  
          Working...