Hi guys,
i am trying to create a script which can extract the transcripts assembled by cufflinks using perl and samtools.
I tried several approaches:
1. samtools mpileup - using the pipeline provided by the samtools man page. Well in a way it works but the extractions starts from the beginning of the chromosome which creates a huge file for a transcript of 100-200 bp for example. So discarded that option.
2. vcf-tools's vcf-consensus - kind of works but again extracts the consensus for the whole chromosome. couldnt make it extract only a region. Discarded.
3. samtools pileup -c BAM > pileup - well kind of works but again problems. pileup -c automatically discards reads which are marked as 0x704 in the BIT FIELD:
0x0100 s the alignment is not primary
0x0200 f the read fails platform/vendor quality checks
0x0400 d the read is either a PCR or an optical duplicate
That is because it uses -m option automatically. I tried to switch off the -m somehow by giving it a huge value like 10000000 which worked in some cases but not in paired-end reads. I couldnt understand why some reads are discarded when they are not marked as 0x704. All this automatically skipping of reads kind of ruin my task.
4. My current approach is the following:
a) run samtoools pileup BAM > pileup, no -c option given at all. This produces output like this:
1 1010 N 5 CcCcc IDFJJ
1 1011 N 5 AaAaa GHFJJ
1 1012 N 5 GgGgg IIHJJ
1 1013 N 5 AaAaa IIBJJ
1 1014 N 5 CcCcc IGHJJ
1 1015 N 5 AaAaa IIHJJ
1 1016 N 5 GgGgg JIJJJ
1 1017 N 5 TaAaa IGIJJ
1 1018 N 5 TtTtt IFIJJ
1 1019 N 5 AaAaa IIGJJ
1 168457 N 10 <<<<<>><<< A@FHJ=DJJJ
1 168458 N 10 <<<<<>><<< A@FHJ=DJJJ
1 168459 N 10 <<<<<>><<< A@FHJ=DJJJ
b) Then i take the coordinates from the transcript.gtf and search for them in the pileup which has its own tricks.
c) when found the start and end i parse the entries in between. well here is a bit weird. the lines where column 5 contain A|C|G|T i count how many occurrences for each there are and take the one with max number. I could not understand what these '>' '<' mean in that column tho. Any hints are welcomed.
d) concatenate all the letters found by the procedure in c) and produce a transcript.
I have two questions:
1. what that these '>' '<' mean in column 5?
2. is my procedure more or less on the right path?
Thank you for your time and help
i am trying to create a script which can extract the transcripts assembled by cufflinks using perl and samtools.
I tried several approaches:
1. samtools mpileup - using the pipeline provided by the samtools man page. Well in a way it works but the extractions starts from the beginning of the chromosome which creates a huge file for a transcript of 100-200 bp for example. So discarded that option.
2. vcf-tools's vcf-consensus - kind of works but again extracts the consensus for the whole chromosome. couldnt make it extract only a region. Discarded.
3. samtools pileup -c BAM > pileup - well kind of works but again problems. pileup -c automatically discards reads which are marked as 0x704 in the BIT FIELD:
0x0100 s the alignment is not primary
0x0200 f the read fails platform/vendor quality checks
0x0400 d the read is either a PCR or an optical duplicate
That is because it uses -m option automatically. I tried to switch off the -m somehow by giving it a huge value like 10000000 which worked in some cases but not in paired-end reads. I couldnt understand why some reads are discarded when they are not marked as 0x704. All this automatically skipping of reads kind of ruin my task.
4. My current approach is the following:
a) run samtoools pileup BAM > pileup, no -c option given at all. This produces output like this:
1 1010 N 5 CcCcc IDFJJ
1 1011 N 5 AaAaa GHFJJ
1 1012 N 5 GgGgg IIHJJ
1 1013 N 5 AaAaa IIBJJ
1 1014 N 5 CcCcc IGHJJ
1 1015 N 5 AaAaa IIHJJ
1 1016 N 5 GgGgg JIJJJ
1 1017 N 5 TaAaa IGIJJ
1 1018 N 5 TtTtt IFIJJ
1 1019 N 5 AaAaa IIGJJ
1 168457 N 10 <<<<<>><<< A@FHJ=DJJJ
1 168458 N 10 <<<<<>><<< A@FHJ=DJJJ
1 168459 N 10 <<<<<>><<< A@FHJ=DJJJ
b) Then i take the coordinates from the transcript.gtf and search for them in the pileup which has its own tricks.
c) when found the start and end i parse the entries in between. well here is a bit weird. the lines where column 5 contain A|C|G|T i count how many occurrences for each there are and take the one with max number. I could not understand what these '>' '<' mean in that column tho. Any hints are welcomed.
d) concatenate all the letters found by the procedure in c) and produce a transcript.
I have two questions:
1. what that these '>' '<' mean in column 5?
2. is my procedure more or less on the right path?
Thank you for your time and help
Comment