Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • consensus seq, samtools pileup, transcripts.gtf issues

    Hi guys,
    i have some issues with extracting consensus seq for transcripts from cufflinks from the bam file.

    Progs i used:
    samtools-0.1.16
    tophat-2.0.5
    cufflinks-2.0.2
    samtools-0.1.18

    Firstly, i tried using the mpileup pipeline provided by the samtools 0.1.18 man page. One may say it works. But for example if i use that line:

    samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

    It produces output in cns.fq of the following format:

    @An01
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
    ...
    ...
    nnnnnnnnnnnnnnnnnnnnnnnMYSEQHERE

    Namely it starts from the beginning of the chromosome. That produces many Ns and my real seq is somewhere in the end in the best case. So if i want to get it i still have to parse the cns.fq file.

    Well then i moved to samtools 0.1.16 pileup -c option. It produces the pileup which i parse for every transcript from cufflinks' transrcipts.gtf file. This was working as i wish until i got a case where i have such a transcript:

    An01 Cufflinks transcript 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327"; full_read_support "yes";
    An01 Cufflinks exon 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; exon_number "1"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327";

    According to the FPKM it looks good. But when i create the pileup file from the accepted.hits.bam and try to get the transcript i got and error in my script. It turned out that there is no such a region in the pileup file. I then dug around and figured that the reads failed QC even tho they are mapped as i see at IGV. Its a bit confusing.

    Questions:

    1. If the reads fail quality check why are they mapped at all then?
    2. How to switch of the quality check of samtools pileup -c? Should i fiddle with the source code?

    Any thoughts and ideas are welcomed
    Thank you

    PS: Some additional info on the reads from the BAM for gene681(not found in pileup) and gene682 found in pileup. I give them cos it occurred to me they might be the same reads mapped on multiple regions but differently. It is so but still cant figure out what all this mess means

    ---- from genes.fpkm_tracking

    gene681 - - gene681 tRNA-Ser%28GCT%29 - An01:1964560-1964665 - - 40.5022 0 180.806 OK
    gene682 - - gene682 tRNA-Ser%28GCT%29 - An01:1965037-1965142 - - 40.5022 0 180.806 OK
    -----------------------------------------

    ---- from samtools view -F 4 regions bam

    GENE 681

    UHDPH:1295:769 272 An01 1964625 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
    UHDPH:1945:2723 272 An01 1964625 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1

    ---------------------------------------------------
    GENE 682

    UHDPH:1295:769 16 An01 1965102 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
    UHDPH:1945:2723 272 An01 1965102 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
    ---------------------------------------------------

  • #2
    1. Usually reads were filtered before processing. I do not found any information about cufflinks's quality filter.
    2. samtools pileup has a -Q parameter, where you can set the base quality limit.

    Comment


    • #3
      Neither did i find any quality filter in cufflinks.
      The -Q option is: min base quality (possibly capped by BAQ) [13], but it seems that is not that the problem. In IGV i see for gene682 which is found in the pileup that some base qualities are 7 but most are above 20.
      Last edited by kenietz; 09-26-2012, 07:01 PM.

      Comment


      • #4
        Ok now i got it:
        so genes 681 and 682 have the same reads mapped onto them. So when the read has FLAG field = 272 which is 100010000, is filtered out cos is nor primary alignment(bit number 9).

        well but the regions are all well separated and different genes.

        Btw, do you know how to get the following pipeline work for a region without getting the annoying number of 'n'? I mean to get directly the sequence from that region and not the sequence from the start of the chromosome. I tried using the -r option as well and the -l option but result is all the same. Resulting CNS.FQ starts from the beginning of the chromosome.

        samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

        for now i decided to use: samtools pileup -c -m 100000 BAM > pileup, and then parse the pileup. with -m 100000 i kind of disable the filtering by samtools. seems to be working even tho maybe not very correct to do
        Last edited by kenietz; 09-26-2012, 08:38 PM.

        Comment


        • #5
          Trimseq from Emboss can remove the unwanted characters, but be careful! You can not use your annotations that based on the original sequence.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Best Practices for Single-Cell Sequencing Analysis
            by seqadmin



            While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
            06-06-2024, 07:15 AM
          • seqadmin
            Latest Developments in Precision Medicine
            by seqadmin



            Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

            Somatic Genomics
            “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
            05-24-2024, 01:16 PM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 07:49 AM
          0 responses
          14 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-20-2024, 07:23 AM
          0 responses
          14 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-17-2024, 06:54 AM
          0 responses
          16 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 06-14-2024, 07:24 AM
          0 responses
          25 views
          0 likes
          Last Post seqadmin  
          Working...
          X