Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • consensus seq, samtools pileup, transcripts.gtf issues

    Hi guys,
    i have some issues with extracting consensus seq for transcripts from cufflinks from the bam file.

    Progs i used:
    samtools-0.1.16
    tophat-2.0.5
    cufflinks-2.0.2
    samtools-0.1.18

    Firstly, i tried using the mpileup pipeline provided by the samtools 0.1.18 man page. One may say it works. But for example if i use that line:

    samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    It produces output in cns.fq of the following format:

    @An01
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    ...
    ...
    nnnnnnnnnnnnnnnnnnnnnnnMYSEQHERE

    Namely it starts from the beginning of the chromosome. That produces many Ns and my real seq is somewhere in the end in the best case. So if i want to get it i still have to parse the cns.fq file.

    Well then i moved to samtools 0.1.16 pileup -c option. It produces the pileup which i parse for every transcript from cufflinks' transrcipts.gtf file. This was working as i wish until i got a case where i have such a transcript:

    An01 Cufflinks transcript 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327"; full_read_support "yes";
    An01 Cufflinks exon 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; exon_number "1"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327";

    According to the FPKM it looks good. But when i create the pileup file from the accepted.hits.bam and try to get the transcript i got and error in my script. It turned out that there is no such a region in the pileup file. I then dug around and figured that the reads failed QC even tho they are mapped as i see at IGV. Its a bit confusing.

    Questions:

    1. If the reads fail quality check why are they mapped at all then?
    2. How to switch of the quality check of samtools pileup -c? Should i fiddle with the source code?

    Any thoughts and ideas are welcomed
    Thank you

    PS: Some additional info on the reads from the BAM for gene681(not found in pileup) and gene682 found in pileup. I give them cos it occurred to me they might be the same reads mapped on multiple regions but differently. It is so but still cant figure out what all this mess means

    ---- from genes.fpkm_tracking

    gene681 - - gene681 tRNA-Ser%28GCT%29 - An01:1964560-1964665 - - 40.5022 0 180.806 OK
    gene682 - - gene682 tRNA-Ser%28GCT%29 - An01:1965037-1965142 - - 40.5022 0 180.806 OK
    -----------------------------------------

    ---- from samtools view -F 4 regions bam

    GENE 681

    UHDPH:1295:769 272 An01 1964625 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
    UHDPH:1945:2723 272 An01 1964625 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1

    ---------------------------------------------------
    GENE 682

    UHDPH:1295:769 16 An01 1965102 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
    UHDPH:1945:2723 272 An01 1965102 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
    ---------------------------------------------------

  • #2
    1. Usually reads were filtered before processing. I do not found any information about cufflinks's quality filter.
    2. samtools pileup has a -Q parameter, where you can set the base quality limit.

    Comment


    • #3
      Neither did i find any quality filter in cufflinks.
      The -Q option is: min base quality (possibly capped by BAQ) [13], but it seems that is not that the problem. In IGV i see for gene682 which is found in the pileup that some base qualities are 7 but most are above 20.
      Last edited by kenietz; 09-26-2012, 07:01 PM.

      Comment


      • #4
        Ok now i got it:
        so genes 681 and 682 have the same reads mapped onto them. So when the read has FLAG field = 272 which is 100010000, is filtered out cos is nor primary alignment(bit number 9).

        well but the regions are all well separated and different genes.

        Btw, do you know how to get the following pipeline work for a region without getting the annoying number of 'n'? I mean to get directly the sequence from that region and not the sequence from the start of the chromosome. I tried using the -r option as well and the -l option but result is all the same. Resulting CNS.FQ starts from the beginning of the chromosome.

        samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

        for now i decided to use: samtools pileup -c -m 100000 BAM > pileup, and then parse the pileup. with -m 100000 i kind of disable the filtering by samtools. seems to be working even tho maybe not very correct to do
        Last edited by kenietz; 09-26-2012, 08:38 PM.

        Comment


        • #5
          Trimseq from Emboss can remove the unwanted characters, but be careful! You can not use your annotations that based on the original sequence.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Recent Developments in Metagenomics
            by seqadmin





            Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
            09-23-2024, 06:35 AM
          • seqadmin
            Understanding Genetic Influence on Infectious Disease
            by seqadmin




            During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

            Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
            09-09-2024, 10:59 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 10-02-2024, 04:51 AM
          0 responses
          13 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 10-01-2024, 07:10 AM
          0 responses
          21 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-30-2024, 08:33 AM
          0 responses
          25 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-26-2024, 12:57 PM
          0 responses
          18 views
          0 likes
          Last Post seqadmin  
          Working...
          X