  • CLC workbench sucks ...

    I took advantage of CLC's offer last spring to get a free 6-month trial of the full version of their Genomics Workbench. Now it is coming to an end, and I need to decide whether it is worth paying for. So far it has been a mostly frustrating experience, but I wonder if anyone is more satisfied.

    The main problem that keeps me avoiding commercial, supposedly more user-oriented and conveniently integrated software (that is what you are paying for, right?) is that in fact it is the total opposite. Once the juggernaut is assembled and sold to you, you, the stupid end user, should not even try to question the programmers' logic behind its beautiful design, as you are simply not capable of grasping its beauty from your lowly end-user point of view. That has always been my experience when I tried to ask: why this way and not that way? Can I do this? Why not, since this makes little sense?

    For example, why do I have to waste storage space importing gigabytes of my existing databases to create yet another database in a program-specific format that cannot be read by other software? Convenience of associated metadata? OK, but why does this convenient format not allow a reference assembly against selected genome regions, while open-source software allows that using less convenient data formats? No, you have to use the entire genome - it will be slow, though. Nice.

    Illumina software very rarely, but occasionally, throws bizarre letters into reads, like @, F, or Q, which may be found in a few reads out of, say, 20 million. There is absolutely no justification for why user-friendly software cannot handle that: upon import into its wonderful, convenient, proprietary format, it labels the entire 20+ million read fasta dataset as protein, leaving the end user with no choice but to search the dataset manually and remove those 1-2 bizarre reads just to make it usable. No, you cannot question the logic behind that - a standard, however stupid, cannot be changed for an end user, but we do strive to meet expectations.
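
    If anyone else hits this, filtering out the offending reads before import is a one-screen script. A minimal Python sketch, assuming plain 4-line FASTQ records and treating anything outside ACGTN in the sequence line as bad; the file names are just placeholders:

        # filter_fastq.py - drop FASTQ records whose sequence line contains
        # characters outside the usual ACGTN alphabet (case-insensitive).
        VALID = set("ACGTNacgtn")

        with open("reads.fastq") as src, open("reads.clean.fastq", "w") as dst:
            while True:
                record = [src.readline() for _ in range(4)]  # header, seq, +, qual
                if not record[0]:
                    break  # end of file
                if set(record[1].strip()) <= VALID:
                    dst.writelines(record)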

    Well, all right, at least you pay for well-documented and well-maintained software. Yep, the manual is intimidating - hundreds of pages. Not very useful, though: the indices at the end are way off, the illustrations and examples provided do not correspond to what you see on the screen, and selections you are supposed to make are missing - you have got to be kidding me.

    The last thing I tried to do was simply import an annotated hg19. Not a big deal - you download hg19, then RefGene.txt, and then connect them to each other so you can see matches to specific genes in your reference assembly, the way it is done simply and conveniently in 454 GS RefMapper. No way. After downloading hg19 overnight (again!) from somewhere, I ended up with nothing but sequences separated by chromosomes. That is a big help. Why I cannot just use my existing hg19.fna and RefGene.txt the manual does not say - at least I could not find it there, if anything can ever be found in that messed-up volume.
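
    For what it is worth, outside CLC that connection is just a table join. A minimal Python sketch, assuming the standard UCSC refGene.txt column order (gene symbol, name2, at index 12), that turns the annotation into a BED file of gene intervals most open-source mappers and browsers can use:

        import csv

        # Convert UCSC refGene.txt into a BED file of transcript spans.
        # Standard refGene columns: bin, name, chrom, strand, txStart, txEnd, ...
        # with the gene symbol (name2) at index 12.
        with open("refGene.txt") as src, open("refGene.bed", "w") as dst:
            for row in csv.reader(src, delimiter="\t"):
                chrom, strand = row[2], row[3]
                tx_start, tx_end = row[4], row[5]
                gene = row[12]
                dst.write(f"{chrom}\t{tx_start}\t{tx_end}\t{gene}\t0\t{strand}\n")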

    Sorry for the long post - this has accumulated over six months to the point where I cannot hold it in anymore. But if someone has had a better experience, please give me a few good reasons why this juggernaut is worth paying for.

  • #2
    It isn't worth paying for. You can do everything and more with free software.



    • #3
      Commercial software has its place and applications (otherwise such companies would not exist). I know people who are extremely satisfied because CLC easily does what they need for their projects, but I also know others whose experiences are similar to yours.

      No one can give you a good reason to pay for any commercial software without a complete understanding of your exact requirements/expectations (which would be hard to do via a forum). You need to make the purchase (or not) decision based on the best data you have on hand.



      • #4
        As the posters before me said, you don't HAVE to buy it.

        For me and for some colleagues, CLC really does accelerate research. It's mostly the visualisation part that helps. As a bioinformatician, I usually use it for some quick-and-dirty testing, as well as for presenting data to other people within our organisation. With CLC we have (more than once) spotted problems in our data just because everything is visualized. If you don't know where to look, this is quite handy.

        That said, I also encounter a lot of annoying things/bugs, but I think because it's paid software, people tend to get angry about them more easily. To people (all researchers, non-bioinformaticians) who don't yet know about CLC, I always say give it a try, and you will see whether you like it or not.

        And yes, everything could also be done with free/open-source software, and personally I always choose between CLC and something else case by case. Sometimes CLC is handy, sometimes the command line.



        • #5
          One thing that I've learned from all commercial bioinformatics software (not just CLC - they are probably the best of the bunch):

          It is really hard to build software that makes difficult things easier.
          At some point a scientist is just going to have to learn how to code - there are just too many ways for an analysis to go off the beaten path and break a pre-packaged, canned, off-the-shelf suite.
          --
          Jeremy Leipzig
          Bioinformatics Programmer
          --
          My blog
          Twitter



          • #6
            Originally posted by yaximik View Post
            For example, why do I have to waste storage space importing gigabytes of my existing databases to create yet another database in a program-specific format that cannot be read by other software? Convenience of associated metadata? OK, but why does this convenient format not allow a reference assembly against selected genome regions, while open-source software allows that using less convenient data formats? No, you have to use the entire genome - it will be slow, though. Nice.
            Not that I want to defend CLC, and I've no idea precisely why they do this either, but I am also guilty of using my own internal format for Gap5, requiring a full import. Why? Because BAM isn't a good format for editing.

            Imagine a scenario where you are working on a de novo assembly and you need to join two contigs containing 20 million sequences each. With BAM, the resulting contig would require updating the position of 20 million sequences, and possibly reverse complementing 20 million too, if that's what the match indicated. Or... you could use a format that stores all data in a recursive R-Tree and requires only a handful of updates to achieve the same task. [1]
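
            To make the difference concrete, here is a minimal Python sketch - not Gap5's actual data structure, just an illustration of the idea - in which each subtree stores positions relative to its parent, so relocating a whole contig during a join is one node update instead of twenty million:

                class Node:
                    """Toy hierarchical container: children store offsets relative to the parent."""
                    def __init__(self, offset, reads=None, children=None):
                        self.offset = offset           # position relative to the parent node
                        self.reads = reads or []       # (name, position relative to this node)
                        self.children = children or []

                    def absolute_reads(self, base=0):
                        """Yield (name, absolute position) by accumulating offsets down the tree."""
                        pos = base + self.offset
                        for name, rel in self.reads:
                            yield name, pos + rel
                        for child in self.children:
                            yield from child.absolute_reads(pos)

                # Join contig B onto the end of contig A: a single offset write relocates
                # every read in B, where flat BAM would rewrite each stored position.
                contig_a = Node(0, reads=[("r1", 0), ("r2", 50)])
                contig_b = Node(0, reads=[("r3", 10), ("r4", 70)])
                contig_b.offset = 100_000  # the only update needed
                joined = Node(0, children=[contig_a, contig_b])
                print(dict(joined.absolute_reads()))
                # {'r1': 0, 'r2': 50, 'r3': 100010, 'r4': 100070}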

            This is for Gap5, which does de novo editing work (and is pretty rubbish at reference-based annotation tasks). I don't know how that relates to CLC, but sometimes there can be a good programmer's justification for doing something that seems daft.

            James

            [1] With hindsight, I should have invented an overlay system that allows BAM to be used as the backend, with only an index imported, to permit rapid movement and restructuring of data. But that's another level of complexity.



            • #7
              We've been using CLC heavily for a few years now, and while it's nowhere near perfect, it does have its merits.

              Firstly, having a GUI allows people who are new to bioinformatics to get started analyzing data quickly without having to wade into command line processing. I know that using command line programs isn't very difficult, but for most people it's very intimidating when they first start, and it can take quite a while before they fully understand some of the esoteric error messages or faults that can occur.

              Also, there are a number of people who don't want to get that deeply involved in bioinformatics; they just want to analyze their data quickly so they can move their experiments along. For these people, CLC offers a convenient package that lets them do nearly all standard processing methods without getting bogged down in details. It's a valid argument, which I've made before, that if you really want to do good work then you should have a good idea of how the program you use works, but the reality is that most people just want the end result and don't care how the sausage is made.

              Second, configuring software for your particular system is often not trivial, and CLC provides a multitude of tools in one complete package that will pretty much work without fail across all three major operating systems. For labs or research groups with a mix of computing systems, having one piece of software that looks and acts the same straight out of the box makes it easier to move data around and let people interact. Yes, there are non-commercial tools that can achieve the same thing, like Galaxy for instance, but generally they're not as powerful or are more complex to set up and use. Also, for labs that only use Windows, many command line programs are unavailable or require a lot more configuration than on Mac or Linux systems. Since Windows is still the dominant OS, particularly because of Office, CLC offers a solution for data analysis that may not be available otherwise.

              Third, because everything is handled in a single package, you can track how your data was manipulated and trace back from an analysis file to the original read data. This is something command line programs don't offer unless you take very good notes or keep your own processing logs of what files went into a program and what the outputs were. It's particularly useful when you process something multiple ways to see what effect different options have on the result. This tracking is also very useful when someone who has left the lab has their data in CLC and someone new has to take parts of that data and do something else with it, which happens quite frequently in a lot of labs.

              Now, having said all of that, I do have my fair share of complaints about CLC, and if it were just me I wouldn't consider it worth it. The only commercial software I've purchased with my own funds is Geneious, because it's much better than CLC at a lot of the simple sequence and genome editing that I prefer to do with a GUI-based program (it's also a heck of a lot cheaper). Outside of that, I mostly use command line programs, as I prefer the greater level of control, but then again I also have more experience doing such things than everyone else in my lab, so while that works for me it doesn't work for them.

              Bottom line: CLC has its merits, but based on your rant it seems like you'd rather stick with command line tools. If that's the case, then that's fine, but no one is forcing you to buy CLC or any other commercial software package.



              • #8
                I totally agree with jkbonfield and mcnelson.phd.

                As a side note, I have never found any weird characters like @, F, or Q in the sequence portions of my Illumina fastq files...



                • #9
                  Hi

                  I have been using CLC for some time and was wondering if anyone has compared metrics between CLC and other aligners.

                  I found something interesting and wanted to know if anybody else has observed it. We took some RNA-seq data, 100 bp paired-end reads, and aligned it using the latest CLC and TopHat.

                  We then took a 10,000 bp region from both BAM files and looked at the number of reads aligned and the accuracy of the alignments.

                  So far, CLC aligns more reads to the same region than TopHat (11,200 vs. 3,500). Now, coming to the big question of accuracy: we found twice as many pairs in CLC as in TopHat (3,346 vs. 1,640 pairs). So the question is, how is CLC doing it? Mind you, it's only one region...
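
                  For anyone who wants to reproduce this kind of count, here is a minimal pysam sketch. It assumes coordinate-sorted, indexed BAMs; the file names and region are placeholders:

                      import pysam  # assumes pysam is installed and both BAMs are indexed

                      REGION = ("chr1", 1_000_000, 1_010_000)  # hypothetical 10 kb window

                      def region_counts(path):
                          """Count mapped reads and properly paired reads overlapping REGION."""
                          with pysam.AlignmentFile(path, "rb") as bam:
                              reads = [r for r in bam.fetch(*REGION) if not r.is_unmapped]
                              proper = sum(1 for r in reads if r.is_proper_pair)
                          return len(reads), proper

                      for name, path in [("CLC", "clc.bam"), ("TopHat", "tophat.bam")]:
                          mapped, proper = region_counts(path)
                          print(f"{name}: {mapped} mapped reads, {proper} properly paired")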

                  Any tests people can suggest that would make this comparison more comprehensive would be very welcome.

                  cheers
                  newbie



                  • #10
                    Originally posted by newbietonextgen View Post
                    Hi
                    So far, CLC aligns more reads to the same region than TopHat (11,200 vs. 3,500). Now, coming to the big question of accuracy: we found twice as many pairs in CLC as in TopHat (3,346 vs. 1,640 pairs). So the question is, how is CLC doing it? Mind you, it's only one region...
                    CLC put out a white paper not too long ago (in the past year, around when version 6 was released, if I remember correctly) detailing how their read mapper was more accurate and able to map more reads than bowtie and bwa. I never delved into the details, but I can also attest that CLC does map more reads to a reference sequence than bowtie/bowtie2. In many cases, I find this is because the reference is circular, and bowtie doesn't seem to handle that case very well. They may also have a greedier algorithm, although that doesn't appear to be the whole story. Either way, your findings are correct in that CLC maps more reads... the question remains whether they're all mapped accurately.



                    • #11
                      Originally posted by mcnelson.phd
                      Also, there are a number of people who don't want to get that deeply involved in bioinformatics; they just want to analyze their data quickly so they can move their experiments along. For these people, CLC offers a convenient package that lets them do nearly all standard processing methods without getting bogged down in details. It's a valid argument, which I've made before, that if you really want to do good work then you should have a good idea of how the program you use works, but the reality is that most people just want the end result and don't care how the sausage is made.
                      There is a real danger, particularly when you combine a "just give me the end result" attitude with easy-to-use software, of doing it wrong. A lot of people think that simply because they can do something in a program and get a result, the result must therefore be right. When one is forced to learn something about the program, at least, they may be forced to think more critically about it, or to seek out advice from those who do.



                      • #12
                        Originally posted by mcnelson.phd View Post
                        CLC put out a white paper not too long ago (in the past year, around when version 6 was released, if I remember correctly) detailing how their read mapper was more accurate and able to map more reads than bowtie and bwa.

                        I will look into the white paper. Is there a way to assess the accuracy of an alignment, in terms of metrics, etc.? Any suite or workflow that can be used?

                        Thanks
                        newbie



                        • #13
                          Originally posted by newbietonextgen View Post
                          I will look into the white paper. Is there a way to assess the accuracy of an alignment, in terms of metrics, etc.? Any suite or workflow that can be used?
                          To find the white paper, just google "CLC read mapping white paper"; it should come up as the first result.

                          I don't know off the top of my head of any good single metric to assess accuracy, because that requires knowing where the reads should map. In most cases, looking at the number of multiply mapped reads and the number of differences between the reads and the consensus may give a good indicator of quality, but only if you know there are no repetitive elements and no sequence variants between the reads and the reference. Sequencing noise complicates things further, because in some cases you might rather have noisy reads unmapped than mapped, for example if you're trying to find low-frequency variants. It's a bit like trying to assess how good an assembly is: you can use the N50 value, but that really doesn't tell you much and may be misleading...
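
                          One common workaround is to simulate reads from the reference so the true origin of every read is known, map them with each tool, and count how many land where they came from. A minimal pysam sketch, assuming an indexed BAM of simulated reads whose names encode the true position (wgsim-style, e.g. "chr1_12345_..."); the file name is a placeholder:

                              import pysam  # assumes a coordinate-sorted, indexed BAM of simulated reads

                              def mapping_accuracy(path, tolerance=5):
                                  """Fraction of primary mapped reads placed within `tolerance` bp of
                                  the true origin parsed from read names like 'chr1_12345_...'."""
                                  correct = total = 0
                                  with pysam.AlignmentFile(path, "rb") as bam:
                                      for read in bam.fetch():
                                          if read.is_unmapped or read.is_secondary:
                                              continue
                                          true_chrom, true_pos = read.query_name.split("_")[:2]
                                          total += 1
                                          if (read.reference_name == true_chrom
                                                  and abs(read.reference_start - int(true_pos)) <= tolerance):
                                              correct += 1
                                  return correct / total if total else 0.0

                              print(f"accuracy: {mapping_accuracy('simulated_mapped.bam'):.3f}")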



                          • #14
                            To be honest, when it comes to bioinformatics, I think all GUI-driven programs suck in comparison to command line alternatives (think e.g. parallelization, piping output from one program into another, and handling of million row tables). I understand the value of e.g. Geneious for people who can't be bothered to learn how to function at the command line, but then, those people aren't very serious bioinformaticians to begin with.
                            savetherhino.org



                            • #15
                              Originally posted by rhinoceros View Post
                              To be honest, when it comes to bioinformatics, I think all GUI-driven programs suck in comparison to command line alternatives (think e.g. parallelization, piping output from one program into another, and handling of million row tables). I understand the value of e.g. Geneious for people who can't be bothered to learn how to function at the command line, but then, those people aren't very serious bioinformaticians to begin with.
                              That's a very ignorant position to take. Simply having a GUI front end to make working with and analyzing data easier doesn't make a program less complex or powerful. Do you use a GUI-based operating system? If so, your comments can't be taken seriously, because it's the same difference. Command line programs are great, but they're not better simply because they lack a GUI and are harder to use.

                              Further, would you say that something like IGV sucks because it provides a GUI for looking at mapping files? Where do you draw the line: if it's a commercial piece of software, then it must be bad? As I said earlier, programs like CLC and others can make it too easy for people to do bad analyses, but that's not the fault of the program, and there are a lot of good studies done using CLC. In fact, it's probably more likely for someone to do bad science with command line programs that aren't very user friendly and have incomplete or incomprehensible documentation. The fact is that high throughput sequencing has become a standard tool, like Sanger sequencing before it, which means many more labs and people will be working with such data in the future. It's incumbent upon those of us who are good bioinformaticians to help design and provide tools that let these newcomers analyze their data accurately and reliably, and that's what CLC tries to do. You don't blame a car manufacturer for people being bad drivers, so don't do the same with bioinformatics tools.

