Header Leaderboard Ad

Collapse

To get the sequence from a given BAM file

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • To get the sequence from a given BAM file

    Hello, need help for samtools or other methods.

    You know that the Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. If a patient’s BAM file is given, how can you extract the sequence from the storing read alignments(not from the reference) within a region? For example, chr22: 1000000-1234000.

    Thanks for any hint or a piece of code.

  • #2
    samtools view alignment.bam chr22:1000000-1234000

    will give you the alignments for that region. You can then go on to extract the sequence by e.g. doing 'cut -f 10'.

    Comment


    • #3
      Follow up:
      The question is if you select column 10, there are many different segments with the same position. For example:

      Code:
      61AGUAAXX100114:8:70:11288:13082	165	chr22	14430092	0	*	=	14430092	0	NNNNCNAGCNGAGCGNNTCTGGG
      61AGUAAXX100114:8:70:11288:13082	1113	chr22	14430092	0	101M	=	14430092	0	NNNNNNNNNNNTNNNNNNNNANN
      61AJDAAXX100113:7:59:10707:12713	113	chr22	14430092	0	101M	chr14	18863090	0	CCTCGCGGGACTGGTATGGGGAC
      They have same start position 14430092 but different sub sequences.

      Hopefully more details.
      Last edited by ardmore; 11-11-2011, 01:35 PM.

      Comment


      • #4
        samtools view alignment.bam chr22:1000000-1234000 | cut -f 10

        or

        samtools view alignment.bam chr22:1000000-1234000 > subset.sam
        cut -f 10 subset.sam

        10 because the sequence is the tenth column (or "field", hence the -f flag in the SAM file.
        Last edited by kopi-o; 11-11-2011, 01:32 PM.

        Comment


        • #5
          Should we consider the flag such as 165,1113 etc.?
          Which one should be selected for the position 14430092?
          Code:
          NNNNCNAGCNGAGCGNNTCTGGG
          NNNNNNNNNNNTNNNNNNNNANN
          CCTCGCGGGACTGGTATGGGGAC
          Last edited by ardmore; 11-11-2011, 01:33 PM.

          Comment


          • #6
            Ah, this is the same question as on BioStar:

            http://biostar.stackexchange.com/que...s-view-command

            You got some good advice there from brentp.

            Comment


            • #7
              Well, I don't think he answered my question although I market it as an answer.
              So I posted it here to see if anybody can help me. He just extract the sequence from the reference rather than the read.

              Comment


              • #8
                OK, let me spell it out then.

                brentp told you in one of his replies that you should ignore all the reads with mapping quality 0. The reads you show here have mapping quality 0 (MAPQ column: column 5 according to the SAM format specification). Thus, they are completely unreliable.

                Regarding the SAM flags, it's really up to you if and how you want to filter based on those, for instance, if you want to require your reads to be paired or not. This page explains the flags: http://picard.sourceforge.net/explain-flags.html

                EDIT: I see now that brentp has replied that this region is all Ns in the reference sequence. Thus, all of your alignments probably have MAPQ = 0, and these are just spurious alignments that for some reason got assigned to these position, but can't be trusted.

                If it had been a "normal" genome region, you would likely get several variant sequences for it, because sequencing errors are not uncommon. You could then use some SNP calling/genotyping tool like samtools mpileup (or one of many others) to construct a consensus sequence based on your reads.
                Last edited by kopi-o; 11-11-2011, 01:53 PM.

                Comment


                • #9
                  You really need to read the documentation, becase a lot of your questions would be answered there.

                  For starters, that first read, the one with the flag of 165, didn't actually map to that position. sam format specs call for unmapped reads to be given the chromosome and position of their mapped mates.

                  165 = 128+32+4+1 That means the read was in the second fastq, its mate was in the reverse direction as compared to the reference, the read itself did not map, and it's from a paired-end read.

                  So that's why its sequence isn't what it should be.

                  Really, you need to read this:

                  Ten Simple Rules for Getting Help from Online Scientific Communities

                  http://www.ploscompbiol.org/article/...l.pcbi.1002202

                  You need to espeically read all of rule 6.

                  and note:

                  One of the most impolite behaviors toward an online community is asking a question in multiple places at the same time. “Cross-posting”, as this practice is called, can make two distinct online communities work through a solution for you when only one is needed; this is an abuse of forum members' time.

                  Comment


                  • #10
                    Thank you very much. Your statements are much clearer than his. Could you please give me a quick command sample to get a sequence based on my reads? I am not strong on it.
                    Code:
                    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
                    Last edited by ardmore; 11-11-2011, 02:07 PM.

                    Comment


                    • #11
                      FWIW, if you're working with bowtie and colour-space, note the following about the SAM file results:

                      http://bowtie-bio.sourceforge.net/ma...ace-alignments
                      Note that in -S/--sam mode, the decoded nucleotide sequence is printed for alignments, but the original color sequence (with A=blue, C=green, G=orange, T=red) is printed for unaligned reads without any reported alignments.
                      This means that unaligned sequences will be double-encoded colour-space, while aligned sequences will closely match the reference.

                      I'm currently investigating whether cleaning colour-space sequences using bowtie and a reference genome is actually useful for the purpose of doing an assembly based on the decoded bases (instead of, or in addition to, using SAET).
                      Last edited by gringer; 11-12-2011, 08:05 AM. Reason: added reason for odd sequences

                      Comment


                      • #12
                        I have a strong suspicion reading this and other posts of ardmore that we are helping him/her with his course-work!

                        Comment

                        Working...
                        X