Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Consensus sequence for a subset of sequences in BAM

    Hi everyone,

    I have a BAM file containing aligned reads to a reference. I need to calculate a consensus sequence but not from all the reads in the BAM file but only from a subset, and I know in advance which sequences I am interested in.

    Is there a tool that would allow me to do this task?

    I think I can extract the required sequences from the BAM file and map them to a reference myself. I would like to avoid that (re-mapping part) because the reads in the original BAM file were already aligned based on specific criteria and I want to keep that alignment.

    Thanks.

  • #2
    Just use samtools mpileup:

    Code:
    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    Comment


    • #3
      Doesn't this command generates a consensus sequence based on all reads in the BAM file? That's not what I need.

      Comment


      • #4
        The example command does, yes, but you don't have to use the exact example. You can either prefilter things or pipe (presumably, I've never tried with mpileup) in a filtered set or look at just one region or whatever else you want with that command as the basis.

        Comment


        • #5
          That's exactly my problem - I cannot figure out how to pre-filter BAM to extract the reads by their Id's.

          Comment


          • #6
            I assume you don't actually want to select reads based on their names, but something else. So just say what criterion you want to use.

            Comment


            • #7
              I actually do want to select reads based on their names. It happened that the original software encodes important information in the sequence title and among all the reads in the BAM file I only want a subset based on read Id's. I do not know if that's possible. Perhaps, not.

              There is a somewhat related question about samtools/vcf, etc. Let's say I have a BAM file. Is it possible to get a list of all nucleotides (from all reads) in a specific position? Something like:

              reference position 10
              read001 A
              read002 A
              read003 T
              ...

              Can it be done? Thanks.

              Comment


              • #8
                There's generally no great way to extract reads by name other than sorting by name and then either grepping the output or passing it through a script (pysam would work nicely). Whether or not to presort would depend on the length of the list. Alternatively, if you're basing things off of the tile and XY coordinate (i.e., these are illumina reads), then just use something like pysam and the parse the qname.

                Regarding the list of calls at a given location, that's similar to the output of mpileup, except the reads aren't labeled. If you really want the reads, you can access the underlying code in C (the samtools C API) and python (via pysam) and should then have access to this.

                I guess a better question would be what you're actually trying to achieve. There's likely a better way to go about things in general.

                Comment


                • #9
                  I am doing some bioinformatics analysis where I need to look at different subsets of reads. It is a very specific task. Currently, the sequencer aligns all the reads to a reference and put this info into a BAM file. For my task, I would like to be able (1) to extract a subset of reads from the BAM file, and (2) get read bases at a specific position. The volume of the data is relatively low so efficiency is not a big deal.

                  Grepping BAM/SAM file for the step 1 actually works. Thanks.

                  For the step 2, I do not really need to know from what read exactly a given base comes from. Thanks to your hint, I found that mpileup does almost what I want, for example: ref XX X XXXXX ,.,-1c,...,,,,.,,...,,.,,,. I will try to figure out what the symbols in the encoding mean.

                  Lastly, is it possible to export reads aligned to the reference from BAM/SAM to a FASTA file? (there are no indels in the reads).
                  Last edited by vladimirp; 05-08-2014, 03:45 PM.

                  Comment


                  • #10
                    I guess it depends on what exactly you mean by "aligned to the reference". Lines in a SAM/BAM file generally are of alignments, so it's unclear what you want. Do you perhaps want something like samtools tview?

                    Comment


                    • #11
                      There are many scripts that just export sequences in FASTA format but the placement of the read relative to the reference is lost. For example,

                      >ref
                      agtgtgagtg
                      >read1
                      gagtg

                      while I want:
                      >ref
                      agtgtgagtg
                      >read1
                      -----gagtg

                      Something like samtools tview would be perfect but can I dump this type of alignment to a text file?

                      Comment


                      • #12
                        If nothing else, I think you can script IGV, which will work if an image (rather than ascii) is sufficient for your needs. Otherwise, I don't personally know of anything off-hand.

                        Comment


                        • #13
                          I need any parseable text file. Thanks for all the help anyways!

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          31 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          33 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X