Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • kenietz
    Member
    • Nov 2011
    • 86

    samtools, cufflinks, transcripts extraction from bam

    Hi guys,
    i am trying to create a script which can extract the transcripts assembled by cufflinks using perl and samtools.

    I tried several approaches:

    1. samtools mpileup - using the pipeline provided by the samtools man page. Well in a way it works but the extractions starts from the beginning of the chromosome which creates a huge file for a transcript of 100-200 bp for example. So discarded that option.

    2. vcf-tools's vcf-consensus - kind of works but again extracts the consensus for the whole chromosome. couldnt make it extract only a region. Discarded.

    3. samtools pileup -c BAM > pileup - well kind of works but again problems. pileup -c automatically discards reads which are marked as 0x704 in the BIT FIELD:
    0x0100 s the alignment is not primary
    0x0200 f the read fails platform/vendor quality checks
    0x0400 d the read is either a PCR or an optical duplicate

    That is because it uses -m option automatically. I tried to switch off the -m somehow by giving it a huge value like 10000000 which worked in some cases but not in paired-end reads. I couldnt understand why some reads are discarded when they are not marked as 0x704. All this automatically skipping of reads kind of ruin my task.

    4. My current approach is the following:
    a) run samtoools pileup BAM > pileup, no -c option given at all. This produces output like this:
    1 1010 N 5 CcCcc IDFJJ
    1 1011 N 5 AaAaa GHFJJ
    1 1012 N 5 GgGgg IIHJJ
    1 1013 N 5 AaAaa IIBJJ
    1 1014 N 5 CcCcc IGHJJ
    1 1015 N 5 AaAaa IIHJJ
    1 1016 N 5 GgGgg JIJJJ
    1 1017 N 5 TaAaa IGIJJ
    1 1018 N 5 TtTtt IFIJJ
    1 1019 N 5 AaAaa IIGJJ
    1 168457 N 10 <<<<<>><<< A@FHJ=DJJJ
    1 168458 N 10 <<<<<>><<< A@FHJ=DJJJ
    1 168459 N 10 <<<<<>><<< A@FHJ=DJJJ

    b) Then i take the coordinates from the transcript.gtf and search for them in the pileup which has its own tricks.

    c) when found the start and end i parse the entries in between. well here is a bit weird. the lines where column 5 contain A|C|G|T i count how many occurrences for each there are and take the one with max number. I could not understand what these '>' '<' mean in that column tho. Any hints are welcomed.

    d) concatenate all the letters found by the procedure in c) and produce a transcript.

    I have two questions:
    1. what that these '>' '<' mean in column 5?
    2. is my procedure more or less on the right path?

    Thank you for your time and help
  • rboettcher
    Member
    • Oct 2010
    • 71

    #2
    Taken from http://samtools.sourceforge.net/samtools.shtml

    In the pileup format (without -uor-g), each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a ’>’ or ’<’ for a reference skip, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand. A pattern ‘\+[0-9]+[ACGTNacgtn]+’ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern ‘-[0-9]+[ACGTNacgtn]+’ represents a deletion from the reference. The deleted bases will be presented as ‘*’ in the following lines. Also at the read base column, a symbol ‘^’ marks the start of a read. The ASCII of the character following ‘^’ minus 33 gives the mapping quality. A symbol ‘$’ marks the end of a read segment.
    This is a highly interesting topic, since I also intend to do transcript assembly and sequence prediction in the near future. A suggestion to speed up the process would be to use the -r or -l option and limit your pileup to the regions in which cufflinks assembled the transcript.
    Last edited by rboettcher; 10-01-2012, 11:32 PM.

    Comment

    • kenietz
      Member
      • Nov 2011
      • 86

      #3
      Hi rboettcher,
      thanks, i was reading the manual for pileup as well but in my version there was no such an explanation: http://samtools.sourceforge.net/pileup.shtml

      Btw, i tried using the -r but it didn't work as expected. And the problem with automatically skipped reads still remains.

      Comment

      • Emilie
        Member
        • Nov 2010
        • 21

        #4
        Hi kenietz,

        Usually I am using something like that to process all the reads:
        samtools mpileup -BQ0 -d10000000 -r 7:94167000-94167508 bam_file.bam

        From:

        Quote:
        "Use `-BQ0 -d10000000 -f ref.fa' if the purpose is to get
        the precise depth of coverage rather than call SNPs. Under this
        setting, mpileup will count low-quality bases, process all
        reads (by default the depth is capped at 8000), and skip the
        time-demanding BAQ calculation."

        Emilie

        Comment

        • rboettcher
          Member
          • Oct 2010
          • 71

          #5
          Originally posted by kenietz View Post
          Hi rboettcher,
          thanks, i was reading the manual for pileup as well but in my version there was no such an explanation: http://samtools.sourceforge.net/pileup.shtml

          Btw, i tried using the -r but it didn't work as expected. And the problem with automatically skipped reads still remains.
          Ah, the explanation I found relates to mpileup, not sure whether the output format differs significantly in this case.
          Unfortunately I have no idea on the read skipping, but maybe Heng Li can give a more detailed explanation if you send him a message.

          Comment

          • kenietz
            Member
            • Nov 2011
            • 86

            #6
            Ok, after a lot of digging around. I reached some conclusions.

            1. Samtools mpileup skips reads which are marked as:
            0x0100 s the alignment is not primary
            0x0200 f the read fails platform/vendor quality checks
            0x0400 d the read is either a PCR or an optical duplicate

            This may be useful for SNP INDEL analysis. But for RNA seq is giving me troubles. Cos even though a read is marked as not primary still maps to some region and then cufflinks is reporting it as a transcript. But then if i wish to extract the consensus from that region with mpileup, mpileup skips the reads and i get nothing. On the other hand that may not be a problem of samtools but of cufflinks. But i haven't figured out a way to make cufflinks disregard such reads from its analysis.

            2. I found a thread: http://seqanswers.com/forums/showthread.php?t=22752
            where i found out how to disable mpileup from skipping duplicate reads. So i found that:

            bam.h:#define BAM_DEF_MASK (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP)

            Then just modified accordingly to disable in mpileup skipping of duplicates and reads which are not primary alignment. Recompiled and renamed the binary to 'samtools_no_dup_seq' for example. Now i can use the line:

            samtools_no_dup_sec mpileup -ABQ0 -d 1000000 -r REGION file.BAM > PILEUP

            Now i can create pileup for every region reported by cufflinks. And get a sort of consensus seq directly from the bam and not from the reference seq.

            Btw, -A option in mpileup stands for anomalous reads. What exactly means anomalous reads? In the context of single and paired-end reads.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              New Genomics Tools and Methods Shared at AGBT 2025
              by seqadmin


              This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

              The Headliner
              The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
              03-03-2025, 01:39 PM
            • seqadmin
              Investigating the Gut Microbiome Through Diet and Spatial Biology
              by seqadmin




              The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
              02-24-2025, 06:31 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 03-20-2025, 05:03 AM
            0 responses
            17 views
            0 reactions
            Last Post seqadmin  
            Started by seqadmin, 03-19-2025, 07:27 AM
            0 responses
            18 views
            0 reactions
            Last Post seqadmin  
            Started by seqadmin, 03-18-2025, 12:50 PM
            0 responses
            19 views
            0 reactions
            Last Post seqadmin  
            Started by seqadmin, 03-03-2025, 01:15 PM
            0 responses
            186 views
            0 reactions
            Last Post seqadmin  
            Working...