Hi guys,
i have some issues with extracting consensus seq for transcripts from cufflinks from the bam file.
Progs i used:
samtools-0.1.16
tophat-2.0.5
cufflinks-2.0.2
samtools-0.1.18
Firstly, i tried using the mpileup pipeline provided by the samtools 0.1.18 man page. One may say it works. But for example if i use that line:
samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
It produces output in cns.fq of the following format:
@An01
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
...
...
nnnnnnnnnnnnnnnnnnnnnnnMYSEQHERE
Namely it starts from the beginning of the chromosome. That produces many Ns and my real seq is somewhere in the end in the best case. So if i want to get it i still have to parse the cns.fq file.
Well then i moved to samtools 0.1.16 pileup -c option. It produces the pileup which i parse for every transcript from cufflinks' transrcipts.gtf file. This was working as i wish until i got a case where i have such a transcript:
An01 Cufflinks transcript 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327"; full_read_support "yes";
An01 Cufflinks exon 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; exon_number "1"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327";
According to the FPKM it looks good. But when i create the pileup file from the accepted.hits.bam and try to get the transcript i got and error in my script. It turned out that there is no such a region in the pileup file. I then dug around and figured that the reads failed QC even tho they are mapped as i see at IGV. Its a bit confusing.
Questions:
1. If the reads fail quality check why are they mapped at all then?
2. How to switch of the quality check of samtools pileup -c? Should i fiddle with the source code?
Any thoughts and ideas are welcomed
Thank you
PS: Some additional info on the reads from the BAM for gene681(not found in pileup) and gene682 found in pileup. I give them cos it occurred to me they might be the same reads mapped on multiple regions but differently. It is so but still cant figure out what all this mess means
---- from genes.fpkm_tracking
gene681 - - gene681 tRNA-Ser%28GCT%29 - An01:1964560-1964665 - - 40.5022 0 180.806 OK
gene682 - - gene682 tRNA-Ser%28GCT%29 - An01:1965037-1965142 - - 40.5022 0 180.806 OK
-----------------------------------------
---- from samtools view -F 4 regions bam
GENE 681
UHDPH:1295:769 272 An01 1964625 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
UHDPH:1945:2723 272 An01 1964625 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
---------------------------------------------------
GENE 682
UHDPH:1295:769 16 An01 1965102 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
UHDPH:1945:2723 272 An01 1965102 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
---------------------------------------------------
i have some issues with extracting consensus seq for transcripts from cufflinks from the bam file.
Progs i used:
samtools-0.1.16
tophat-2.0.5
cufflinks-2.0.2
samtools-0.1.18
Firstly, i tried using the mpileup pipeline provided by the samtools 0.1.18 man page. One may say it works. But for example if i use that line:
samtools mpileup -r An01:1965037-1965142 -uf A_niger_refgen_fin.fa tophat_out_2/accepted_hits.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
It produces output in cns.fq of the following format:
@An01
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
...
...
nnnnnnnnnnnnnnnnnnnnnnnMYSEQHERE
Namely it starts from the beginning of the chromosome. That produces many Ns and my real seq is somewhere in the end in the best case. So if i want to get it i still have to parse the cns.fq file.
Well then i moved to samtools 0.1.16 pileup -c option. It produces the pileup which i parse for every transcript from cufflinks' transrcipts.gtf file. This was working as i wish until i got a case where i have such a transcript:
An01 Cufflinks transcript 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327"; full_read_support "yes";
An01 Cufflinks exon 1964561 1964665 1000 - . gene_id "gene681"; transcript_id "gene681"; exon_number "1"; FPKM "40.5022201188"; frac "1.000000"; conf_lo "0.000000"; conf_hi "180.806026"; cov "4.015327";
According to the FPKM it looks good. But when i create the pileup file from the accepted.hits.bam and try to get the transcript i got and error in my script. It turned out that there is no such a region in the pileup file. I then dug around and figured that the reads failed QC even tho they are mapped as i see at IGV. Its a bit confusing.
Questions:
1. If the reads fail quality check why are they mapped at all then?
2. How to switch of the quality check of samtools pileup -c? Should i fiddle with the source code?
Any thoughts and ideas are welcomed
Thank you
PS: Some additional info on the reads from the BAM for gene681(not found in pileup) and gene682 found in pileup. I give them cos it occurred to me they might be the same reads mapped on multiple regions but differently. It is so but still cant figure out what all this mess means

---- from genes.fpkm_tracking
gene681 - - gene681 tRNA-Ser%28GCT%29 - An01:1964560-1964665 - - 40.5022 0 180.806 OK
gene682 - - gene682 tRNA-Ser%28GCT%29 - An01:1965037-1965142 - - 40.5022 0 180.806 OK
-----------------------------------------
---- from samtools view -F 4 regions bam
GENE 681
UHDPH:1295:769 272 An01 1964625 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
UHDPH:1945:2723 272 An01 1964625 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965102 HI:i:1
---------------------------------------------------
GENE 682
UHDPH:1295:769 16 An01 1965102 0 13M2D26M * 0 0 TAGTAGCAGGCCTGCCTTAAACCACTCGGCCAAAGTGTC AAAB@?=?6@6=<@8A6>1>:5<<CAB9C9@4@?;7883 AS:i:-11 XN:i:0 XM:i:0 XO:i:1 XG:i:2 NM:i:2 MD:Z:13^GC26 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
UHDPH:1945:2723 272 An01 1965102 0 37M1I4M * 0 0 TAGTAGCAGGCCTGCGCCTTAAACCACTCGGCCAAAGCTGTC 536<6??75:48-+-+(+(4,6;16454.+4(1*9=9<87B; AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:41 YT:Z:UU XS:A:- NH:i:6 CC:Z:= CP:i:1965675 HI:i:2
---------------------------------------------------
Comment