  • Biopieces - bioinformatic Swiss army knife



    The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task: modifying or adding records to the data stream, creating plots, or uploading data to databases and web services. The Biopieces are executed in a command line environment where the data stream is initialized by specific Biopieces which read data from files, databases, or web services, and output records to the data stream that is passed to downstream Biopieces until the data stream is terminated at the end of the analysis as outlined below:

    read_data | calculate_something | write_results

    The following example demonstrates how a next generation sequencing experiment can be cleaned and analyzed – including plotting of scores and length distribution, removal of adaptor sequence, trimming and filtering using quality scores, mapping to a specified genome, and uploading the data to the UCSC genome browser for further analysis:

    Code:
    read_fastq -i data.fq |                               #  Initialize data stream from a FASTQ file.
    plot_scores -t png -o scores_unclean.png |            #  Plot scores before cleaning.
    find_adaptor -c 24 -a TCGTATGCCGTCTTC -p |            #  Locate adaptor - including partial adaptor.
    clip_adaptor |                                        #  Clip any located adaptor.
    trim_seq |                                            #  End trim sequences according to quality scores.
    grab -e 'SEQ_LEN > 18' |                              #  Filter short sequences.
    mean_scores -l |                                      #  Locate local quality score minima.
    grab -e 'SCORES_MEAN >= 15' |                         #  Filter low local quality score minima.
    write_fastq -o data_clean.fq |                        #  Write the cleaned data to a FASTQ file.
    plot_scores -t png -o scores_clean.png |              #  Plot scores after cleaning.
    plot_distribution -k SEQ_LEN -t png -o lengths.png |  #  Plot sequence length distribution.
    bowtie_seq -c 24 -g hg19 -m 2 |                       #  Map sequences to the human genome with Bowtie.
    upload_to_ucsc -d hg19 -t my_data -x                  #  Upload the results to the UCSC Genome Browser.
    The advantage of the Biopieces is that a user can easily solve simple and complex tasks without having any programming experience. Moreover, since the data format used to pass data between Biopieces is text based, different developers can quickly create new Biopieces in their favorite programming language - and all the Biopieces will maintain compatibility. Finally, templates exist for creating new Biopieces in Perl and Ruby.
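
As a rough illustration of that last point, here is a minimal Python sketch of a custom filter that could sit in the middle of such a pipeline. It assumes, purely for illustration, that each record on the stream is a block of `KEY: VALUE` lines terminated by a line containing only `---`. That layout, the `SEQ` key, and the helper names `parse_records`, `format_record`, and `run_filter` are assumptions and are not taken from this post, so check the Biopieces documentation for the actual record format before relying on it.

```python
import io
import sys
from typing import Dict, Iterable, Iterator, TextIO

RECORD_END = "---"  # assumed record terminator (hypothetical, see note above)


def parse_records(stream: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record from a stream of 'KEY: VALUE' lines."""
    record: Dict[str, str] = {}
    for line in stream:
        line = line.rstrip("\n")
        if line == RECORD_END:
            if record:
                yield record
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key] = value
    if record:  # tolerate a final record without a terminator
        yield record


def format_record(record: Dict[str, str]) -> str:
    """Serialize a record back to the assumed text layout."""
    lines = [f"{key}: {value}" for key, value in record.items()]
    return "\n".join(lines + [RECORD_END]) + "\n"


def run_filter(source: Iterable[str], sink: TextIO, min_len: int = 18) -> None:
    """Copy records to sink, dropping those whose SEQ has min_len bases or fewer.

    Records without a SEQ key (an assumed key name) are passed through untouched.
    """
    for record in parse_records(source):
        seq = record.get("SEQ")
        if seq is None or len(seq) > min_len:
            sink.write(format_record(record))


# Demo on an in-memory stream; in a real pipeline you would pass
# sys.stdin and sys.stdout instead.
demo = io.StringIO(
    "SEQ_NAME: short\nSEQ: ACGT\n---\n"
    "SEQ_NAME: long\nSEQ: ACGTACGTACGTACGTACGT\n---\n"
)
run_filter(demo, sys.stdout)
```

Saved as a script that calls `run_filter(sys.stdin, sys.stdout)`, a filter like this could in principle sit between two Biopieces in the same way `grab -e 'SEQ_LEN > 18'` does in the example above, provided the assumed record layout matches the real one.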

    There are currently ~175 Biopieces.


    EDIT

    To make Biopieces more accessible an installer has been released here.

    EDIT

    Updated the example
    Last edited by maasha; 10-17-2011, 01:14 AM. Reason: Updated example

  • #2
    Originally posted by maasha
    The advantage of the Biopieces is that a user can easily solve simple and complex tasks without having any programming experience.
    Instead they need to know shell commands and piping?

    On a serious note, you have a read_fastq for Sanger FASTQ files, and a read_solexa for Solexa FASTQ file, but no sign of a read_illumina for Illumina 1.3+ FASTQ files. See:



    • #3
      @maubp

      A bit of UNIX knowledge can be acquired with a 20-minute primer - enough to use Biopieces.

      Also, I shall add a read_illumina Biopiece (I'll do that tomorrow).

      Thanks for the heads-up.



      • #4
        This looks terrific. Add in BEDTools and a few basic shell scripts and quite a lot of additional glue can be dropped from workflows.



        • #5
          Neat!
          Curious question though.
          How do you run bwa and bowtie without the binaries?


          they are not listed in external tools here
          http://code.google.com/p/biopieces/wiki/Installation
          http://kevin-gattaca.blogspot.com/



          • #6
            @KevinLam

            Several of the Biopieces are simple wrappers around external binaries, such as BWA and Bowtie. Biopieces that have prerequisites state them in their "usage" information. E.g. http://code.google.com/p/biopieces/wiki/bowtie_seq



            • #7
              I've been using Biopieces off and on for about a year now, and just wanted to say thank you to Martin, it's fantastic! The recent additions of wrappers around other tools have been great. I have been using it more and more, alongside things like samtools, BEDtools, MUMmer, and the Kent source tree. There are some extremely useful utilities in there that just need a little massaging of data format. (As always.)

              I may need to re-read your FAQ about contributing, because I had a couple humble suggestions, if I can be so bold... :-)

              A write_sam or write_bam script would be great. I wrote a little wrapper script, psl2bam.pl, which converts BLAT results into a sequence-containing Bamfile for seeing the actual alignments. I suspect that read_fasta, read_psl, merge_records, and write_tab could be used to do something close; it then just needs to call samtools to sort it and make it "Bam!"

              What would you think about something to interface with the "UCSC table browser" downloads, so we could download a GTF or Bed file from the command line and manipulate it in Biopieces? E.g. feature intersects, merging annotations, sequence extraction, etc. I could contribute something here if you or others thought it'd be useful. (I saw your FAQ.)

              I saw the BGB tools and a snapshot of it on Flickr. I'll trust your judgement that you need it. :-) But I liked your blog post about wanting more of these types of tools to talk freely with each other. Would you consider interfacing with GBrowse 2? Using Bamfiles, I find it very rapid to go from analysis to visualization (and back.) It seems that the biopieces framework could benefit those people using GBrowse quite a bit if there were a couple more hooks.



              • #8
                Originally posted by maasha
                Also, I shall add a read_illumina Biopiece (I'll do that tomorrow).
                I see you have just recently updated the documentation. It looks like read_solexa now expects Illumina 1.3+ FASTQ files (with PHRED scores), and you don't support old Solexa 1.0 to Illumina 1.2 FASTQ files (with Solexa scores). Maybe I'm confused... but I fear you've just complicated things more.



                • #9
                  @jmw86069

                  Thanks for the kind words. I normally don't hear much from users, and therefore I simply develop Biopieces according to my own needs. I am willing to write new Biopieces if they will be of general use. And of course I am also open to suggestions for improving existing Biopieces. If anyone wants to contribute code, they are welcome to do so.

                  Now, a genome browser is a must for any genomic researcher! I have been working a fair bit with the UCSC genome browser, and a couple of Biopieces exist for uploading and downloading tracks, and for manipulating the configuration on a local UCSC installation. However, I am now working with prokaryotes, and for that the UCSC genome browser is a bit of an overkill. So I guess I will not be writing Biopieces for the UCSC genome browser anytime soon, since I need a working system to test stuff on. The Biopieces Genome Browser (BGB) was meant as a temporary system until Jbrowse matures. Jbrowse is going to be awesome (!!!), but it is rather nasty to install new genomes and custom tracks while at the same time keeping track of permissions on genomes and tracks. The same goes for Gbrowse2.

                  Now, a request for write_sam/write_bam is a bit tricky. I must admit that I don't use any tools that take these formats as input (I am probably missing out on important stuff). Also, I am mildly annoyed by the SAM format. There is a rant here:



                  But perhaps with a bit of assistance I could get something up and running.



                  • #10
                    @maubp

                    I had a brief look at read_fastq and read_solexa along with the links you sent me, and I got confused - yet again - over this pesky matter with quality scores and Phred/Sanger and Solexa and Illumina P. As far as I have understood, the scores stored as char strings have been calculated differently; however, converting the char score to a decimal is simply a matter of adjusting by 33 or 64 integer-wise, for Phred/Sanger and Solexa/Illumina respectively. So read_solexa (especially using the -c switch) should work equally well with any version of the Illumina pipeline. A read_illumina Biopiece should then be a copy of read_solexa. I may indeed have complicated things more.



                    • #11
                      Originally posted by maasha
                      @maubp

                      I had a brief look at read_fastq and read_solexa along with the links you sent me, and I got confused - yet again - over this pesky matter with quality scores and Phred/Sanger and Solexa and Illumina P. As far as I have understood, the scores stored as char strings have been calculated differently; however, converting the char score to a decimal is simply a matter of adjusting by 33 or 64 integer-wise, for Phred/Sanger and Solexa/Illumina respectively. So read_solexa (especially using the -c switch) should work equally well with any version of the Illumina pipeline. A read_illumina Biopiece should then be a copy of read_solexa. I may indeed have complicated things more.
                      There are (at least) THREE different FASTQ formats (see http://en.wikipedia.org/wiki/FASTQ_format or for more detail http://dx.doi.org/10.1093/nar/gkp1137). In summary:
                      • Sanger FASTQ - encodes PHRED scores (at most 0 to 93), offset 33
                      • Solexa FASTQ (and early Illumina) - encodes Solexa scores (at most -5 to 62), offset 64
                      • Illumina 1.3 (or later) FASTQ - encodes PHRED scores (at most 0 to 62), offset 64

                      It would be consistent with BioPerl, Biopython, EMBOSS, etc. to call these the Sanger, Solexa and Illumina (1.3+) variants of FASTQ.
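
To make the difference concrete, here is a minimal Python sketch of decoding a quality string under each of the three variants listed above. The offsets come from that list; the Solexa-to-PHRED conversion uses the standard relation Q_PHRED = 10 * log10(10^(Q_Solexa / 10) + 1), as described in the linked NAR paper. The table `FASTQ_VARIANTS` and the function names are made up for this illustration and are not part of Biopieces or any other library.

```python
import math
from typing import List

# Offset and score type for each FASTQ variant (names are illustrative only).
FASTQ_VARIANTS = {
    "sanger": {"offset": 33, "solexa_scores": False},
    "solexa": {"offset": 64, "solexa_scores": True},
    "illumina": {"offset": 64, "solexa_scores": False},  # Illumina 1.3+
}


def decode_quality(quality: str, variant: str) -> List[int]:
    """Return the raw integer scores (PHRED or Solexa, depending on variant)."""
    offset = FASTQ_VARIANTS[variant]["offset"]
    return [ord(char) - offset for char in quality]


def solexa_to_phred(q_solexa: int) -> float:
    """Convert a single Solexa score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def to_phred(quality: str, variant: str) -> List[int]:
    """Decode a quality string and return PHRED scores rounded to integers."""
    scores = decode_quality(quality, variant)
    if FASTQ_VARIANTS[variant]["solexa_scores"]:
        return [round(solexa_to_phred(score)) for score in scores]
    return scores


# The same character means different things depending on the variant.
for name in FASTQ_VARIANTS:
    print(name, decode_quality("h", name), to_phred("h", name))
```

The point of the sketch is that subtracting the offset is only half the job: Solexa and Illumina 1.3+ files share the offset of 64, but the integers mean different things, and the two scales diverge most at the low-quality end, where Solexa scores can be negative.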
                      Last edited by maubp; 03-16-2010, 08:14 AM.



                      • #12
                        Originally posted by maasha
                        So read_solexa (especially using the -c switch) should work equally well with any version of the Illumina pipeline.
                        The documentation for the -c switch says "Convert octal scores to decimal scores". What are octal scores? Do you mean the ASCII encoded representation (which is not base eight, i.e. not octal)?



                        • #13
                          You are right. I shall clean up the docs a bit.



                          • #14
                            Originally posted by ohofmann
                            This looks terrific. Add in BEDTools and a few basic shell scripts and quite a lot of additional glue can be dropped from workflows.
                            what an ideal world! :P
                            --
                            bioinfosm



                            • #15
                              Originally posted by bioinfosm
                              what an ideal world! :P
                              What can I say, after ten years of cobbling together parsers I'm not asking for much anymore
