Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Query bam file and assemble reads

    Hi,

    I am quite a newbie in Bioinformatic and maybe asking something stupid.

    I am trying to generate a four-column file for assembling overlapped sequencing reads into longer contigs from a sorted bam file. The file needs to contain the following information in this format:

    chr#: start position of alignment - stop position of alignment: strand (+/-)

    Currently I am trying to awk the bam files to obtain the information, my code is here:

    samtools mpileup input.sorted.bam |\
    cut -d '\t' -f 1,2 |\
    awk -F '\t' 'BEGIN {chr="";start=-1;end=-1} {if(chr!=$1 || int($2)!=end+1) { if(chr!="") {printf("%s:%d-%d\n",chr,start,end);} chr=$1;start=int($2);end=int($2);} else { end=end+1;}} END {if(chr!="") {printf("%s:%d-%d\n",chr,start,end); } }'>output.txt

    It can work except with strand info. Actually, it ignores the strand info. If a read from + strand overlapped with a read from - strand, it will form a contig and that's not what I want. I want to assemble contigs in the same strand.

    How can I improve my code to take in strand info and make the assembly according to strand?

    Please help. Thank you very much.

    Dadi Gao

  • #2
    I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.

    Comment


    • #3
      Originally posted by macrowave View Post
      I think Bio:B::Sam is the way to go, if you know a bit perl programming. You can use its methods to access sam/bam for any information, including map positions and strands.
      Unfortunately I know nothing about perl
      Is there another way to do it directly from linux shell?

      Comment


      • #4
        Originally posted by dustar1986 View Post
        Unfortunately I know nothing about perl
        Is there another way to do it directly from linux shell?
        This would be a great time to learn perl/python/a-scripting-language.

        Comment


        • #5
          Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
          http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html

          Comment


          • #6
            You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.

            Comment


            • #7
              Originally posted by nilshomer View Post
              This would be a great time to learn perl/python/a-scripting-language.
              I can use python actually...

              Comment


              • #8
                Originally posted by macrowave View Post
                Maybe you can dig something out from the sam bitwise flags, I'm sure the strand information would be encoded there. For details, you can see the sam specifications here:
                http://samtools.sourceforge.net/SAM1.pdf. And there is an easy tool to explain the flags here: http://picard.sourceforge.net/explain-flags.html
                Thanks. But the flag only appear in samtools -view not in samtools -mpileup.
                I want to use the latter one coz it's easier for assembly.

                Comment


                • #9
                  Originally posted by cbeck View Post
                  You might also try the bedtools suite - bamtobed and then mergebed. Not sure if that will deal with the strands,you might be able to filter the two strands at the bam output level.
                  Thanks, I'll have a look at that.

                  Comment


                  • #10
                    If you install bedtools, you could use the "mergeBed" program to merge overlapping reads on the same strand as follows:
                    Code:
                    bamToBed -i aln.bam | mergeBed -i stdin -s > merged.intervals.bed
                    Alternatively, you could use the "genomeCoverageBed" tool to create strand-specific BEDGRAPH files, one for each strand. You could then combine them into a single file.
                    Code:
                    bamToBed -i aln.bam | \
                              awk '$6=="+" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.fwd.bedg
                    
                    bamToBed -i aln.bam | \
                              awk '$6=="-" | \
                              genomeCoverageBed -i stdin -g chrom.sizes -bg \
                              > merged.intervals.rev.bedg
                    
                    cat merged.intervals.fwd.bedg merged.intervals.rev.bedg > merged.intervals.both.bedg

                    Comment


                    • #11
                      BAM file is compressed, you need do "samtools view -u -bh aln.bam > my.bam" first.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Understanding Genetic Influence on Infectious Disease
                        by seqadmin




                        During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                        Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                        09-09-2024, 10:59 AM
                      • seqadmin
                        Addressing Off-Target Effects in CRISPR Technologies
                        by seqadmin






                        The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                        08-27-2024, 04:44 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Today, 06:25 AM
                      0 responses
                      13 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 01:02 PM
                      0 responses
                      12 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 09-18-2024, 06:39 AM
                      0 responses
                      14 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 09-11-2024, 02:44 PM
                      0 responses
                      14 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X