Developing programming experience for bioinformatics

  • Developing programming experience for bioinformatics

    I have an extensive molecular biology background but am relatively new to bioinformatics. I would like to extend my computational/programming skills to get the most out of analyzing sequencing and other high-throughput data, as well as to improve my own marketability.

    Many job postings refer to some combination of Perl/Python/C++/Java experience. Any suggestions regarding where to focus effort, particularly in a forward-looking manner?

    Thanks for any suggestions.

  • #2
    I started teaching myself a year and a half ago (I'm a tech) and still consider myself a novice, so listen to others as well, but I have found that learning Linux really well has been very beneficial. Getting a good understanding of how to write bash scripts and of the basic Linux commands (sed/tr/cut/sort/cat/paste/grep), plus learning a bit of awk, has been tremendously useful for me.
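
As an illustration of the kind of task those commands handle, here is a minimal Python sketch (not from the original post) that does roughly what `cut -f1 file | sort | uniq -c` would do on tab-separated data. The function name `count_first_column` and the sample records are hypothetical and exist only for this example.

```python
from collections import Counter


def count_first_column(lines, sep="\t"):
    """Count how often each value appears in the first column."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        counts[line.split(sep)[0]] += 1
    return counts


# Hypothetical BED-like records: chromosome, start, end.
records = [
    "# chrom\tstart\tend",
    "chr1\t100\t200",
    "chr2\t150\t300",
    "chr1\t400\t500",
]

chrom_counts = count_first_column(records)
for chrom, n in sorted(chrom_counts.items()):
    print(chrom, n)
```

For a one-off question the shell pipeline is usually quicker to type; a script like this tends to pay off once the filtering logic grows beyond a single column.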

    • #3
      I am biased, and I would strongly encourage you to start learning Python first, and R as well. A lot of people find Python easy to learn. Getting the hang of the awesome Unix commands would also be very useful.

      Here are a few links to get started with R and Python
      http://cmdlinetips.com/2011/09/free-...-books-online/
      http://cmdlinetips.com/2011/11/free-...for-beginners/

      • #4
        I like Perl for scripting. It's powerful, very widely used, and has lots of online resources. I also think it's easy to learn to a level that will quickly make you productive. Many people seem to think Python is easy to learn, but the O'Reilly (a publisher of typically great computer books) book "Learning Python" is ~3 times longer than "Learning Perl" and doesn't even cover regular expressions. That's like a driving class that doesn't cover steering. I would steer away from this book if you choose Python. And of course a strong command of Unix/Linux is highly recommended, though I would choose Perl or Python over extensive shell scripting.
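
To show why regular expressions come up so often in sequence work, here is a minimal Python sketch (not from the original post) using the standard `re` module. The pattern `TATA[AT]A[AT]` is a simplified, illustrative TATA-box-like motif, and the function name `find_motif` and the test sequence are hypothetical.

```python
import re

# Simplified, illustrative TATA-box-like motif: the character class [AT]
# accepts either base at that position.
TATA_LIKE = re.compile(r"TATA[AT]A[AT]")


def find_motif(sequence, pattern=TATA_LIKE):
    """Return 0-based start positions of non-overlapping motif matches."""
    # Upper-case first so soft-masked (lower-case) bases still match.
    return [m.start() for m in pattern.finditer(sequence.upper())]


seq = "ccTATAAATgggTATATAAcc"  # made-up sequence for the example
sites = find_motif(seq)
print(sites)
```

The character classes are what a plain substring search cannot express; a motif with no ambiguity could be found with `str.find` alone.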

        • #5
          My 2 cents:
          • The Unix CLI: This includes common commands such as awk, sed, and cut. I would also include shell scripting in this. This bullet point is required to put together any sort of basic analysis pipeline.

          • R: Inevitably, you end up needing to crunch numbers in R, so go ahead and get at least a passing familiarity with it. This may include various Bioconductor packages, depending on what you're doing.

          • Python or Perl: It doesn't really matter which one. You can do pretty much anything in these languages, though they have their limitations.

          • C/C++/Java: If you get to the point of writing more "heavy duty" programs that require any significant performance then you'll need one of these. You would generally learn one of these last.


          It's probably best to learn things in that order, possibly swapping the order of R and Python/Perl.

          • #6
            In a forward-looking manner, I wouldn't bother with Perl/Python/Java. They are mostly just fads, and any place you might want to work is just as likely to use the one you don't know, for no other reason than that the CEO liked the Monty Python jokes or coffee. These scripting languages are easy enough to pick up if you know how to program in C, and most cool molecular dynamics simulators are in C for obvious performance reasons. Unix command-line utilities are very handy for getting things done, and Perl and Python both draw heavily on their conventions, so if you encounter a script written in either of these you should be able to figure out what it does (knowing Linux, that is).

            • #7
              Originally posted by dpryan
              My 2 cents:
              • The Unix CLI: This includes common commands such as awk, sed, and cut. I would also include shell scripting in this. This bullet point is required to put together any sort of basic analysis pipeline.

              • R: Inevitably, you end up needing to crunch numbers in R, so go ahead and get at least a passing familiarity with it. This may include various Bioconductor packages, depending on what you're doing.

              • Python or Perl: It doesn't really matter which one. You can do pretty much anything in these languages, though they have their limitations.

              • C/C++/Java: If you get to the point of writing more "heavy duty" programs that require any significant performance then you'll need one of these. You would generally learn one of these last.


              It's probably best to learn things in that order, possibly swapping the order of R and Python/Perl.
              +1
              Totally agree - and swapping Python/Perl before R

              • #8
                To rskr's point, yes, C is a good choice if you plan on doing a lot of fundamental algorithm development or are at a point in your life where you are interested in programming from a mostly academic perspective and wouldn't be hampered by its (or Java's) longer development times. If, on the other hand, you need to be quickly productive and are mostly interested in piecing together and interpreting NGS and other high-throughput data using the vast number of open-source analytic programs available, I believe (having worked in industrial bioinformatics for years) you would be much better off with Perl (my favorite) or Python, neither of which is a fad.

                • #9
                  rskr does not understand that there are a lot of biologists who are more interested in biological questions than in hard-core computer science. He likes to try to diminish anyone and any work that uses an interpreted language. He has a lot of work ahead of him. We use the tools that allow us to answer our questions with the least work.

                  This is the path I would suggest:
                  1. You will not learn anything unless you are actively and currently using it for something. So come up with a project you will use this stuff for.
                  2. Learn some Unix
                  3. Learn some Python and/or Perl (Python is structured more like R so it can help with the next step, but I know Perl better).
                  4. Learn some R.
                  5. Keep working on what interests you.

                  If after all this you decide you want to mostly give up molecular biology and become a hard-core computer scientist, then you can move on to C++.

                  Great place to start:
                  http://korflab.ucdavis.edu/Unix_and_Perl/
                  --------------
                  Ethan

                  • #10
                    Originally posted by ETHANol
                    rskr does not understand
                    Sorry, I thought he was interested in "forward-looking" learning, not in quick-and-dirty piecing together of a bunch of algorithms someone else wrote, without much justification or understanding. I assure you, you don't need to learn much to do the latter; wait until the time comes.

                    • #11
                      Here is some advice: anyone who says language xxxx is garbage and a waste of your time is blind (unless you are talking about some language of yesteryear).

                      Rskr, you ask, why piece together a bunch of algorithms that someone else wrote that are totally sufficient to answer the biological question you are addressing when you can make your own? Because it saves a lot of time, you'll publish your project sooner, which usually means in a higher impact journal, which means better career options.

                      We could blow our own pipets in the lab from glass, but that wouldn't make us better scientists.
                      --------------
                      Ethan

                      • #12
                        Originally posted by ETHANol
                        Here is some advice: anyone who says language xxxx is garbage and a waste of your time is blind (unless you are talking about some language of yesteryear).

                        Rskr, you ask, why piece together a bunch of algorithms that someone else wrote that are totally sufficient to answer the biological question you are addressing when you can make your own? Because it saves a lot of time, you'll publish your project sooner, which usually means in a higher impact journal, which means better career options.

                        We could blow our own pipets in the lab from glass, but that wouldn't make us better scientists.
                        Sometimes the pieced-together programs are sufficient to answer the question; other times people don't understand what is in the program well enough to say one way or another (but that doesn't stop them). To that end, even if one does not become very proficient in C, it will give them an advantage over those who don't know it. Sometimes there is a missing piece, or your Perl program isn't fast enough to analyze modern high-throughput experiments. What do you do then? It's been my experience that these experienced Perl programmers have already figured out everything you can do with canned programs, so good luck publishing something that hasn't been done by taking output from program A and putting it into program B.

                        • #13
                          Rskr, you are into pushing the state of the art on the computing side. Some people are much more interested in the biology and find the computing side really boring. There is limited time in life and limited brain space (maybe not yours). So a lot of us learn what we have to in order to answer the biological question we are interested in.

                          I could go to a hot spring and find a bug with a more efficient polymerase for PCR, or I could just use one that is currently available to do interesting research. Some people have made the former their life's work; many more have focused on the latter. Some people learn C++ and come up with better algorithms; some use the existing ones and write Perl scripts to do interesting research. Why do you have a problem with that? How many Cell, Science and Nature papers have been published using reused Perl scripts and Bioconductor packages? Has that all been a waste of time?
                          --------------
                          Ethan

                          • #14
                            Originally posted by ETHANol
                            How many Cell, Science and Nature papers have been published using reused Perl scripts and Bioconductor packages? Has that all been a waste of time?
                            I don't have a problem with it. I just wouldn't consider the Perl scripts "forward thinking" compared to the program or programs that were used by the Perl scripts. It takes quite a bit of forethought to write a program that many people can use for many different purposes, and not much forethought to use these programs.

                            • #15
                              That's it: you think it is all about the program and never about the biology. Wake up and realize that a lot of people have other interests than you, which can be better served by learning one of these languages you despise so much. I am a wet lab scientist; it would be a total waste of my time to learn C++, while some Unix, Perl and R are extremely useful.
                              --------------
                              Ethan
