  • samtools piping problem

    Hello SEQanswers,

    My ultimate goal is to pipe several different regions of a BAM file to different commands in a single command line.

    Basically, I am trying to generate some metrics from a BAM file that will first be flagged for optical duplicates; then, without storing anything to disk, I want to pipe each chromosome to a program.

    ex.

    java -jar picard/MarkDuplicates.jar I=test.bam O=/dev/stdout | tee >(samtools view 'chr1' -| ./generate_metrics) >(samtools view 'chr2' -| ./generate_metrics) ...

    Any advice, if my current implementation isn't the best?
    The issue right now is pulling a region from the BAM file after the initial pipe.

    Thanks,

    Marco

  • #2
    Well, that works, but the command is going to get really unwieldy. Why don't you either pipe to a Perl/Python/whatever script and have it pipe accordingly (the input to MarkDuplicates is already sorted, so your script would only need to pipe to one instance of ./generate_metrics at a time), or just rewrite ./generate_metrics to handle multiple chromosomes (assuming you have the source code)?
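A minimal sketch of the dispatcher script suggested above, assuming the stream arriving on the pipe is coordinate-sorted SAM text (for instance after converting the MarkDuplicates output with samtools view) and that ./generate_metrics reads alignment lines on stdin. The function names are hypothetical; only the SAM column layout (RNAME in column 3) is taken as given.

```python
import itertools
import subprocess


def chromosome_of(sam_line):
    """Return the reference name (RNAME, column 3) of a SAM alignment line."""
    return sam_line.split("\t", 3)[2]


def group_by_chromosome(sam_lines):
    """Yield (chromosome, lines) for each run of consecutive records.

    Header lines (starting with '@') are skipped. This relies on the input
    being coordinate-sorted, so each chromosome forms one contiguous run.
    """
    records = (line for line in sam_lines if not line.startswith("@"))
    return itertools.groupby(records, key=chromosome_of)


def dispatch(sam_lines, command=("./generate_metrics",)):
    """Feed each chromosome's records to its own run of `command`.

    Only one child process is alive at a time, as suggested in the post.
    `command` is an assumption: any program that reads SAM lines on stdin.
    """
    for chrom, lines in group_by_chromosome(sam_lines):
        proc = subprocess.Popen(list(command), stdin=subprocess.PIPE, text=True)
        for line in lines:
            proc.stdin.write(line)
        proc.stdin.close()
        proc.wait()
```

As a script, this would end with a call such as `dispatch(sys.stdin)`. Unmapped reads (RNAME `*`) end up in a group of their own, which may or may not be what the metrics program expects.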


    • #3
      The idea is to parallelize these operations. I will ultimately be submitting qsub jobs for each chromosome's operations; it's just a matter of splitting the BAM file into the appropriate segments.


      • #4
        I'd be a little surprised if piping like that into qsub would actually work. If you want to parcel this out to multiple nodes, why don't you just markDuplicates on a chromosome per node and generate_metrics on that?
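One way to picture the per-chromosome approach is a small helper that writes out the commands each node would run. This is only a sketch: it assumes the input BAM is indexed (region queries need an index), that intermediate per-chromosome BAM files on disk are acceptable, and that MarkDuplicates is invoked with the same I=/O=/M= style used elsewhere in this thread. The output file names and the helper itself are hypothetical, and the qsub submission is left out.

```python
import shlex


def per_chromosome_job(bam, chrom, picard_jar="picard/MarkDuplicates.jar"):
    """Return the shell commands for one chromosome, in the order to run them."""
    region_bam = f"{chrom}.bam"          # hypothetical intermediate file
    marked_bam = f"{chrom}.marked.bam"   # hypothetical MarkDuplicates output
    metrics = f"metrics_{chrom}.txt"
    q = shlex.quote
    return [
        # Extract one chromosome as BAM; requires an index for the input file.
        f"samtools view -b {q(bam)} {q(chrom)} > {q(region_bam)}",
        # Mark duplicates on that chromosome only.
        f"java -jar {q(picard_jar)} I={q(region_bam)} O={q(marked_bam)} M={q(metrics)}",
        # Stream the marked reads as SAM text into the metrics program.
        f"samtools view {q(marked_bam)} | ./generate_metrics",
    ]


def job_scripts(bam, chromosomes):
    """Map each chromosome to the text of a job script for it."""
    return {c: "\n".join(per_chromosome_job(bam, c)) for c in chromosomes}
```

Calling `job_scripts("test.bam", ["chr1", "chr2"])` gives one script body per chromosome that could be handed to the scheduler. Note that splitting this way separates pairs whose mates map to different chromosomes.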


        • #5
          That sounds like a great idea.

          Thanks! The duplicate algorithm (I'm assuming) only looks for reads of the same length and position, correct? (I.e., there won't be duplicates across different chromosomes.)


          • #6
            I think it just looks at the 5' position of both mates in a pair and marks duplicates if there are multiple pairs with the same 5' position(s). I could be misremembering that, though.
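The idea described above can be sketched as grouping pairs by the 5' positions of both mates. This is a simplification for illustration, not Picard's actual implementation: the record layout, the inclusion of strand in the key, and the `score` used to choose which pair to keep are all assumptions.

```python
from collections import defaultdict


def mark_duplicates(pairs):
    """Return the set of pair names flagged as duplicates.

    Each pair is a dict with hypothetical keys:
      'name'  - read pair name
      'ends'  - two (chrom, five_prime_pos, strand) tuples, one per mate
      'score' - any quality score; the best-scoring pair in a group is kept
    """
    groups = defaultdict(list)
    for pair in pairs:
        # Sort the two ends so that mate order does not affect the key.
        key = tuple(sorted(pair["ends"]))
        groups[key].append(pair)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: p["score"])
        duplicates.update(p["name"] for p in members if p is not best)
    return duplicates
```

Because the chromosome is part of each end's key, two pairs can only be grouped together when both mates land on the same chromosomes at the same 5' positions.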


            • #7
              So this is what I'm thinking, from what you have told me:

              samtools view sample.bam 'chr1' | java -jar $PICARDROOT/MarkDuplicates.jar I= - O=sample_chr1.bam M=metrics_chr1.txt

              Or is there something else I need to pass to I...


              • #8
                Yeah, but I'd forgotten that MarkDuplicates can't take input via a pipe, so that won't work. I can't come up with a way to do this that doesn't involve considerable programming with MPI. Do you have the code for generate_metrics? It would seem easier to just rewrite it to handle multiple contigs. Alternatively, presuming that actually marking the duplicates is the rate-limiting step, just run it on a single node.


                • #9
                  Oh, have a look at Biobambam, which can mark duplicates but writes its temp files to a local disk (it's geared more toward clusters). BTW, see also this thread on Biostars.


                  • #10
                    Maybe GNU parallel could solve your parallelization? This thread explains it comprehensively.
                    savetherhino.org
