This topic is closed.
  • Illumina InterOp Parsers (perl and R)

    Hi All,

    Since starting at Illumina recently, one of my first projects was to write sample scripts to parse the InterOp files. We receive requests for how to parse the binary files quite often and having working code is a bit more helpful than the previous solution of sending a document describing the binary formatting.

    In any case, I've built a package in R that can import the InterOp files into an R session (for visualization) and can also write out the data to flat files. I also have perl scripts that just write out the data to flat files.

    I wrote these a few months ago, and we've tested them internally and found them quite useful. After I came across a couple of other InterOp parsing threads here, I asked my manager for approval to send them out to people who need accurate InterOp parsing outside of SAV. If you have this need, please PM me and I will send you my email address, or email me directly via SEQAnswers. With enough requests, I may even receive approval to post them to GitHub.

    Officially, these are *unsupported*, so please do not call tech support referring to these scripts. They will have no idea what you are talking about. Instead, email me with questions or bugs and I'll do my best to straighten you out.

    Cheers,
    mchen1
    Last edited by mchen1; 05-24-2013, 10:08 AM. Reason: enabled option to email me directly from seqanswers...

  • #2
    For those who have messaged me, thanks for the feedback! If anyone needs code in other languages, please let me know or put a request into this thread. I coded these scripts in R and perl because that is what I am familiar with, but with enough demand, we can develop a few more in other languages for release.


    • #3
      The CPAN module Bio::IlluminaSAV is a parser for Perl.


      • #4
        Hi earonesty,

        The CPAN module looks to be nice, but I think our scripts meet different needs. There is certainly room for both. The IlluminaSAV module appears to parse the InterOp data into perl arrays. This would seem ideal for someone working in perl who may want to access the InterOp data for further manipulation. On a separate note (perhaps a feature request?), there is a new InterOp file called IndexMetricsOut.bin that describes index metrics and does not appear to be in the IlluminaSAV perl module documentation. If you need the binary format for this file in order to update your perl module, shoot me a PM or email with your email address, and I can send you our latest documentation.

        The scripts I've written here are designed to simply convert InterOp data into flat files, and do so as efficiently as possible. I designed them for those wanting to parse the InterOp data for entry into a LIMS system, for example. Since most LIMS are custom-built, flat files seem to be the most universally accepted format for the data. The perl code is also sent without module packaging so that users can see how I parse the binary files in case they want to integrate the code into their own perl work. The perl code also comes without dependencies on other modules so it works out of the box with any modern perl installation (I personally dislike having to install perl modules, especially in the context of group IT policies). In any case, the goal of the two packages is the same, but it would seem our design parameters differ.
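
As an illustration of the binary-to-flat-file conversion described above, here is a minimal Python sketch. It is not the code discussed in this thread, and the record layout it decodes (a one-byte version, a one-byte record length, then fixed-size records of lane, tile, code, value) is an assumption made for the example; the real layouts should be taken from the official format documentation. The names `parse_records` and `write_flat_file` are hypothetical.

```python
import csv
import io
import struct

# ASSUMED layout, for illustration only (not taken from this thread):
#   byte 0: format version (uint8)
#   byte 1: record length in bytes (uint8)
#   then repeated records: lane (uint16), tile (uint16), code (uint16), value (float32)
HEADER = struct.Struct("<BB")
RECORD = struct.Struct("<HHHf")


def parse_records(blob):
    """Return (version, rows) decoded from a bytes object in the assumed layout."""
    version, record_len = HEADER.unpack_from(blob, 0)
    if record_len != RECORD.size:
        raise ValueError("unexpected record length: %d" % record_len)
    body = blob[HEADER.size:]
    if len(body) % record_len:
        raise ValueError("truncated input: trailing partial record")
    rows = [RECORD.unpack_from(body, offset)
            for offset in range(0, len(body), record_len)]
    return version, rows


def write_flat_file(rows, handle):
    """Write the decoded rows as comma-separated text with a header line."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["lane", "tile", "code", "value"])
    writer.writerows(rows)


# Synthetic input built in memory, so the sketch runs without a real run folder.
demo_blob = (HEADER.pack(2, RECORD.size)
             + RECORD.pack(1, 1101, 100, 750.5)
             + RECORD.pack(1, 1102, 100, 812.25))
demo_version, demo_rows = parse_records(demo_blob)
demo_out = io.StringIO()
write_flat_file(demo_rows, demo_out)
print(demo_out.getvalue())
```

Because the sketch uses only the standard library, it shares the "no extra modules to install" property mentioned above; swapping `io.StringIO()` for an opened file would produce a flat file on disk.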

        Hopefully this discussion can illuminate how the packages differ in case users are deciding between the two.

        M


        • #5
          Hi MChen,

          Thank you for your previous PM regarding your parsing scripts. The IndexMetricsOut.bin statistics sound interesting to me as well. Would you be so kind as to send the latest Theory of Operation (I assume) documentation to the same email address?

          Many thanks in advance,

          B


          • #6
            Highly Recommended

            I recently had the opportunity to use mchen1's script package and I highly recommend it. It is very straightforward, extremely easy to use, and well documented.

            I wanted to parse the InterOp folders for 50+ Illumina HiSeq runs and gather the statistics to report on our general performance over time, and I was able to do so quickly and accurately.
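
For the multi-run use case described above, the first step is usually locating the InterOp files across many run folders. The following is a rough Python sketch of that step only. It assumes each run folder contains an `InterOp` subfolder holding `*.bin` files; the function name `list_interop_files` and the run folder names are hypothetical, and the parsing itself is left to whichever parser is used.

```python
from pathlib import Path
import tempfile


def list_interop_files(runs_root):
    """Map each run folder name to the sorted .bin file names in its InterOp subfolder."""
    found = {}
    for run_dir in sorted(Path(runs_root).iterdir()):
        interop_dir = run_dir / "InterOp"
        if interop_dir.is_dir():
            found[run_dir.name] = sorted(p.name for p in interop_dir.glob("*.bin"))
    return found


# Build a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for run_name in ("run_A", "run_B"):  # hypothetical run folder names
        interop = Path(tmp) / run_name / "InterOp"
        interop.mkdir(parents=True)
        (interop / "IndexMetricsOut.bin").write_bytes(b"")
    (Path(tmp) / "not_a_run").mkdir()  # has no InterOp subfolder, so it is skipped
    demo_listing = list_interop_files(tmp)

print(demo_listing)
```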

            Thanks for sharing the code!


            • #7
              I would like to try the package for parsing InterOp files as well. Did I read correctly that there is an R version?


              • #8
                Reposting for a colleague of mine at InVitae who wrote an open source python parser for exactly this.

                #######
                Greetings all,

                I work at InVitae and we just publicly released a library called Illuminate.



                The purpose of Illuminate is to emulate the stats you see when you load a run data folder within Illumina SAV, providing programmatic access to these metrics for whatever purposes you may have -- data storage, analysis, automated machine monitoring, and so on.

                This is completely free, open source software (MIT License) written in Python with the intent to be used, tested, and improved upon by the bioinformatics community.

                Features:
                Simple command-line tool you can use to quickly inspect a run.
                Built to be easily integrated into other code.
                Easily extensible even if you think you are "not much of a programmer".
                Results standardized to pandas DataFrame objects (so if you know how to work in R, you can probably get up to speed quickly with this)

                Here's an example of the smallest python script you could get away with using this tool.

                Code:
                import illuminate
                myDataset = illuminate.InteropDataset('path/to/rundata/')
                print(myDataset.meta)
                print(myDataset.IndexMetrics())
                print(myDataset.TileMetrics())
                print(myDataset.QualityMetrics())

                And here's an example of how you would use the command-line reporter to do the same thing:

                Code:
                python illuminate --meta --index --tile --quality /path/to/rundata

                You can even have illuminate open up in an interactive IPython shell, where the dataset will be loaded up into an InteropDataset object for you:

                Code:
                python illuminate -i /path/to/rundata

                Not all of the metrics objects are fully fleshed out yet, although all of the binary parsers are "feature complete" in that you can produce a data dictionary and a DataFrame from them.

                I'm hoping that some of you fine folks can pipe up and let me know what might be useful to you -- or better, submit contributions, bug reports, and so on that will help Illuminate become as full-featured as it needs to be.

                This library has been in our production pipeline for several months now, reporting on cluster density, quality, and yield so we can keep tabs on sequencing run quality in an automated fashion.

                If you use it, or you have questions about it, please comment here and let me know!

                Cheers,
                Naomi


                • #9
                  Originally posted by mchen1:
                  Hi earonesty,

                  The CPAN module looks to be nice, but I think our scripts meet different needs. There is certainly room for both. The IlluminaSAV module appears to parse the InterOp data into perl arrays. This would seem ideal for someone working in perl who may want to access the InterOp data for further manipulation. On a separate note (perhaps a feature request?), there is a new InterOp file called IndexMetricsOut.bin that describes index metrics and does not appear to be in the IlluminaSAV perl module documentation. If you need the binary format for this file in order to update your perl module, shoot me a PM or email with your email address, and I can send you our latest documentation.

                  The scripts I've written here are designed to simply convert InterOp data into flat files, and do so as efficiently as possible. I designed them for those wanting to parse the InterOp data for entry into a LIMS system, for example. Since most LIMS are custom-built, flat files seem to be the most universally accepted format for the data. The perl code is also sent without module packaging so that users can see how I parse the binary files in case they want to integrate the code into their own perl work. The perl code also comes without dependencies on other modules so it works out of the box with any modern perl installation (I personally dislike having to install perl modules, especially in the context of group IT policies). In any case, the goal of the two packages is the same, but it would seem our design parameters differ.

                  Hopefully this discussion can illuminate how the packages differ in case users are deciding between the two.

                  M

                  1. I would be interested in the IndexMetrics file; sending it to erik at q32.com is fine.

                  2. I would also like to try out your code (same email)

                  3. The LibXML reader is for parsing the RunInfo.xml into a perl hash. Other than that the module is core. Somehow I thought it would be better just to do that right.

                  4. Extraction is fast because usually our apps don't need all the data... many programs are just looking for maximum values, etc. (Our LIMS only gets quantile scores per cycle for example.)
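
Point 4 above describes keeping only summary values (maximums, per-cycle quantiles) instead of every record. A minimal Python sketch of that kind of reduction follows. It is not how Bio::IlluminaSAV is implemented; the function name `summarize_by_cycle` and the sample numbers are hypothetical, and the input is assumed to be already-decoded `(cycle, value)` pairs.

```python
from collections import defaultdict
from statistics import median


def summarize_by_cycle(records):
    """Collapse (cycle, value) pairs into {cycle: {"max": ..., "median": ...}}."""
    by_cycle = defaultdict(list)
    for cycle, value in records:
        by_cycle[cycle].append(value)
    return {
        cycle: {"max": max(values), "median": median(values)}
        for cycle, values in sorted(by_cycle.items())
    }


# Hypothetical per-tile quality values for two cycles.
demo_records = [(1, 30), (1, 34), (1, 38), (2, 28), (2, 36)]
demo_summary = summarize_by_cycle(demo_records)
print(demo_summary)
```

The median stands in for "quantile scores" here; other quantiles could be computed from the same per-cycle lists.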


                  • #10
                    Erik, I've emailed the packages to your email address.

                    Regarding #4, it's nice that you are able to speed up data extraction by not parsing all the data. Many times this type of curation summarizes run quality well. The packages I send out have the goal of simply providing all of the data. This leaves it up to the user to decide on what numbers to input into their LIMS.

                    Thanks for your post.


                    • #11
                      Hi Mchen

                      Are you still there?
                      I'm new here, and it seems that new members cannot send PMs.
                      So I am trying to reach you by replying to this old thread.

                      Your post and its comments about direct usage of interop files are very interesting and promising.

                      I'd like to try your parsers in R and Perl.
                      I briefly tested Bio::IlluminaSAV and Illuminate, but I would prefer a global dump of the InterOp data that I can integrate into my QC pipeline.

                      I hope you can help me and contact me (by PM or in this thread).

                      Regards


                      • #12
                        "Global dump" is kind of ambiguous. What format do you want?

                        Bio::IlluminaSAV can be used to make a "dump" by using JSON or YAML or whatever, and then dumping each metric to a file.
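
The "one file per metric" dump suggested above can be sketched in Python as follows. This only illustrates the idea with JSON; it does not use Bio::IlluminaSAV, and the function name `dump_metrics`, the metric names, and the field names are all hypothetical placeholders for whatever a real parser returns.

```python
import json
import tempfile
from pathlib import Path


def dump_metrics(metrics, out_dir):
    """Write one JSON file per metric name and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in sorted(metrics.items()):
        path = out_dir / (name + ".json")
        with path.open("w") as handle:
            json.dump(rows, handle, indent=2, sort_keys=True)
        written.append(path)
    return written


# Hypothetical, already-parsed metrics; a real parser would supply these.
demo_metrics = {
    "tile": [{"lane": 1, "tile": 1101, "value": 750.5}],
    "quality": [{"lane": 1, "cycle": 1, "median_q": 34}],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = dump_metrics(demo_metrics, tmp)
    demo_names = [p.name for p in demo_paths]
    demo_roundtrip = json.loads(demo_paths[1].read_text())

print(demo_names)
```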


                        • #13
                          Yes, still here. I will PM you, Florent.


                          • #14
                            Originally posted by earonesty:
                            "Global dump" is kind of ambiguous. What format do you want?

                            Bio::IlluminaSAV can be used to make a "dump" by using JSON or YAML or whatever, and then dumping each metric to a file.

                            I agree with you; it's ambiguous.
                            When I wrote my post, I didn't know exactly what kind of data (and format) I could obtain from these parsers.
                            I wanted to convert the InterOp files into non-binary files to get direct access to the data.

                            After half a day of work, I understand InterOp files and Bio::IlluminaSAV better, and I have written some code.
                            I need to do more work, but I am getting data, and I plan to store it (maybe filtered) in XML files.
                            I'm still thinking about how to manage the data (keep all of it or only the useful part), depending on the next steps of my future quality control pipeline.
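
Storing parsed records, optionally filtered, in XML could look roughly like the Python sketch below. It is an illustration under assumptions, not the poster's code: the element names, the field names, the function name `records_to_xml`, and the filter threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET


def records_to_xml(metric_name, rows, keep=None):
    """Serialize row dicts as <record> elements under one <metric> root.

    `keep` is an optional predicate; rows for which it returns False are skipped.
    """
    root = ET.Element("metric", {"name": metric_name})
    for row in rows:
        if keep is None or keep(row):
            attributes = {key: str(value) for key, value in row.items()}
            ET.SubElement(root, "record", attributes)
    return ET.tostring(root, encoding="unicode")


# Hypothetical parsed rows; the field names are placeholders.
demo_rows = [
    {"lane": 1, "cycle": 1, "median_q": 34},
    {"lane": 1, "cycle": 2, "median_q": 18},
]
demo_xml = records_to_xml("quality", demo_rows,
                          keep=lambda row: row["median_q"] >= 20)
print(demo_xml)
```

Passing `keep=None` stores every record, which corresponds to the "keep them all" option; a predicate corresponds to keeping only the useful part.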

                            Thank you for your comment.


                            • #15
                              jwater, you sent me a PM asking for the InterOp parsers, but you have set your account to reject PMs, and you left no contact information or way for me to reply. I can't help you if I can't reach you.
