  • sandhya
    Member
    • Sep 2010
    • 11

    C++ library for bioinformatics?

    Hi,

    This is a software-related question, and I am hoping someone in this forum can shed some light on it. We work on mathematical analyses with FASTA and FASTQ files. So far we have worked in R. Due to computational issues, we need to move to C++.
    Is there any C++ library available that handles reading these files, alignment, etc. (i.e. something similar to the R Bioconductor package)?
    Any pointers welcome.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Would C be acceptable instead?


    • NicoBxl
      not just another member
      • Aug 2010
      • 264

      #3
      Instead of C++, you can use Perl with BioPerl.


      • lh3
        Senior Member
        • Feb 2008
        • 686

        #4
        I believe mine is (arguably) the most sophisticated and standalone library for parsing fasta/q:



        It is definitely not the easiest to use, though.


        • sandhya
          Member
          • Sep 2010
          • 11

          #5
          The reason I asked for C++ was because I work with it. C should also be alright.
          I shall have a look at (http://lh3lh3.users.sourceforge.net/parsefastq.shtm) in detail, but at first glance I understand that this is a parser for the FASTA/FASTQ formats. Is there any functionality for alignments and further processing?

          Since we need to work with around 20 million reads, I am unsure which data structures one could use in C++. I have not used Perl yet, but can it handle such large datasets (R could not)? I am ready to give that a try too.


          • mrawlins
            Member
            • Apr 2010
            • 63

            #6
            Dealing with 20 million reads makes time complexity of your algorithms (particularly your loop over the reads) of particular importance. I did some post-processing of SAM files using Picard (Java equivalent of samtools C library) and found that I needed to do copious indexing of anything I intended to use inside the inner loop.

            One example: I wanted to count the number of reads associated with each gene. My first implementation looped through all the reads, then for each read looped through all possible genes, quitting when it found the "right" one. That took forever. A later iteration included a hash map of <genome location, gene ID> so that I could do an O(1) lookup of which gene a particular read belonged to. Setting up those maps was memory intensive and a bit complicated, but decreased runtime from days to minutes.
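The lookup idea described above could be sketched in C++ roughly as follows. This is only an illustration, not the original Picard/Java code: the `Gene` and `Read` structs, the per-base index layout, and the function names are all assumptions made for the example. It indexes every base of every gene, which is why it is memory hungry, and in exchange each read is resolved with two average-case O(1) hash lookups instead of a scan over all genes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record types, for illustration only.
struct Gene {
    std::string id;
    std::string chrom;
    std::uint64_t start;  // 1-based, inclusive
    std::uint64_t end;    // 1-based, inclusive
};

struct Read {
    std::string chrom;
    std::uint64_t pos;  // 1-based leftmost mapped position
};

// chromosome name -> (position -> index into the gene vector)
using PositionIndex =
    std::unordered_map<std::string,
                       std::unordered_map<std::uint64_t, std::size_t>>;

// Build the <genome location, gene> map once, up front.
// Memory grows with the total number of annotated bases.
PositionIndex build_index(const std::vector<Gene>& genes) {
    PositionIndex index;
    for (std::size_t g = 0; g < genes.size(); ++g) {
        assert(genes[g].start <= genes[g].end);
        auto& per_chrom = index[genes[g].chrom];
        for (std::uint64_t p = genes[g].start; p <= genes[g].end; ++p) {
            per_chrom.emplace(p, g);  // on overlaps, the first gene wins
        }
    }
    return index;
}

// One pass over the reads; each read costs two hash lookups.
std::vector<std::size_t> count_reads_per_gene(const std::vector<Read>& reads,
                                              const PositionIndex& index,
                                              std::size_t gene_count) {
    std::vector<std::size_t> counts(gene_count, 0);
    for (const Read& r : reads) {
        auto chrom_it = index.find(r.chrom);
        if (chrom_it == index.end()) continue;  // unannotated chromosome
        auto pos_it = chrom_it->second.find(r.pos);
        if (pos_it != chrom_it->second.end()) ++counts[pos_it->second];
    }
    return counts;
}
```

Build the index once with `build_index(genes)`, then pass it to `count_reads_per_gene` together with `genes.size()`. Reads that fall outside every gene are simply skipped. If memory becomes the bottleneck, a sorted interval list with binary search is a common alternative that trades the O(1) lookup for O(log n).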

            Perl, Java and C/C++ can all handle datasets as large as your memory allows. Perl, like R, doesn't offer much control over memory management (though R tends to be less efficient). Java offers limited control, and C gives you the greatest control over memory. I am much more comfortable programming in C and Perl than anything else, but I use Java for much of my next-gen sequencing bioinformatics because there are more libraries available and it fails gracefully when I run out of memory. The slight loss of efficiency is worth the benefits.
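Since memory is the limiting factor, one way to keep it under control in C++ is to stream the FASTQ file one record at a time rather than loading all 20 million reads. The sketch below is a minimal illustration under stated assumptions: it handles only the plain four-line FASTQ layout (no wrapped sequence lines), and `FastqRecord`, `read_fastq_record`, `summarize_fastq` and `summarize_fastq_text` are names invented for this example, not part of any library mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type, for illustration only.
struct FastqRecord {
    std::string name;
    std::string sequence;
    std::string quality;
};

// Read one four-line FASTQ record from the stream.
// Returns false at end of input or when the record is malformed.
bool read_fastq_record(std::istream& in, FastqRecord& rec) {
    std::string header, plus;
    if (!std::getline(in, header)) return false;
    if (header.empty() || header[0] != '@') return false;
    if (!std::getline(in, rec.sequence)) return false;
    if (!std::getline(in, plus) || plus.empty() || plus[0] != '+') return false;
    if (!std::getline(in, rec.quality)) return false;
    if (rec.quality.size() != rec.sequence.size()) return false;
    rec.name = header.substr(1);  // drop the leading '@'
    return true;
}

struct FastqSummary {
    std::size_t reads = 0;
    std::size_t bases = 0;
};

// Stream through the input; only one record is held in memory at a time.
FastqSummary summarize_fastq(std::istream& in) {
    FastqSummary summary;
    FastqRecord rec;
    while (read_fastq_record(in, rec)) {
        assert(rec.sequence.size() == rec.quality.size());
        ++summary.reads;
        summary.bases += rec.sequence.size();
    }
    return summary;
}

// Convenience wrapper for in-memory text, mainly useful for testing.
FastqSummary summarize_fastq_text(const std::string& text) {
    std::istringstream in(text);
    return summarize_fastq(in);
}
```

For a real file, open a `std::ifstream` and pass it to `summarize_fastq`. The per-record work inside the loop can be replaced with whatever analysis is needed, and memory use stays flat regardless of how many reads the file contains.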


            • krobison
              Senior Member
              • Nov 2007
              • 734

              #7
              There are a number of C++ libraries out in the world (though I haven't used any, other than one I wrote almost 20 years ago as a grad student & haven't used it in over 15).

              I found this with Google: Biostar. There are probably more out there.

              Large datasets are going to require some trickery in just about any language. With Perl (or Python and probably many other newer languages), you will have somewhat easier access to some of those tricks -- such as hashes but also using file stores or databases for some of this mess -- the higher level languages have rich libraries for serializing and deserializing objects (perhaps C++ does as well, but I haven't inhabited that world in a long time). Of course, you may need to do some sleuthing to figure out what exactly some libraries do if you are worried about performance.

              Personally, I find that for many problems the time to reduce your problem to an algorithm really ends up being more important than the time to run the algorithm for the bioinformatics I'm involved in -- but that is because folks like Heng Li have solved the really slow problems very elegantly. You might also look into languages & toolkits which assist with multiprocessing -- such as Hadoop, Map/Reduce, GATK (specifically for this space), Scala, etc. There are also some slick commercial C++ tools for assisting with multiprocessing (I have a family connection to Cilk & TBB, which is the only reason I know about them).


              • lskatz
                Junior Member
                • Sep 2010
                • 9

                #8
                Doesn't BLAST rest on a C library of some sort? And they just refactored it according to their 2009 BLAST+ paper.


                • mkeehan
                  Member
                  • Feb 2010
                  • 13

                  #9
                  Try SeqAn from http://www.seqan.de
                  It's really good.


                  • akundaje
                    Junior Member
                    • Sep 2008
                    • 5

                    #10
                    For manipulation of FASTA/FASTQ files, check out the FASTX toolkit: http://hannonlab.cshl.edu/fastx_toolkit/


                    • maubp
                      Peter (Biopython etc)
                      • Jul 2009
                      • 1544

                      #11
                      Originally posted by lskatz:
                      Doesn't BLAST rest on a C library of some sort? And they just refactored it according to their 2009 BLAST+ paper.
                      Yes, the NCBI moved BLAST from C to C++, but this has nothing to do with FASTQ files, does it?


                      • sandhya
                        Member
                        • Sep 2010
                        • 11

                        #12
                        Thank you all for the comments and replies. I do agree that one has to resort to trickery when it comes to processing huge files no matter which language one uses. I also checked 'seqan' but they do not seem to have support for FASTQ files. So perhaps I shall resort to Python or Perl for reading and aligning these files.

