Hello everybody,
I'm new to the forum. I have found it useful before, because for many of the problems I had, someone had already reported something similar, but this time I'm a bit more stuck than usual.
I'm trying to do a differential expression analysis with cufflinks, tophat and bowtie. At the moment we don't have a complete reference for the organism, so we are using a partial assembly as the reference. The software seems to complete all the bowtie alignments, but samtools view fails at some point in the pipeline. In run.log, the last executed command was:
samtools view -S -b /tmp/lane_1//tmp/accepted_hits.sam > /tmp/lane_1//tmp/fileZvx32H
So I ran that command again manually, and the output was the following:
[samopen] SAM header is present: 25828 sequences.
Parse error at line 1963402: CIGAR and sequence length are inconsistent
/usr/users/tgac/ramirezr/.lsbatch/1290771421.40646: line 8: 16759 Aborted (core dumped) samtools view -S -b lane_1/tmp/accepted_hits.sam > lane_1/tmp/fileZvx32H
So, I went to line 1963402 in accepted_hits.sam, and I found the following:
N73018:1:7:6056:18003#0 99 30063 392907 255 104M178N536870906M138N22M = 393405 0 TGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCCCGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC GFEEGGGGGGGGGGGGGADGEE-FEGFGGGGFGGFGAFGGEEFGGAGDGEFGE@GGFDEFGBGEGEEGGEEBEDAEBDDBEGEFFBEEFE=AD@AB=EEBDEDD@FABBBE=@AE:???? NM:i:255 XS:A:- NH:i:1
I can see that the CIGAR 104M178N536870906M138N22M is completely unexpected. Then I looked at the read files, and the reads look OK for both mates of the pair.
@N73018:1:7:6056:18003#0/1
TCCCGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC
+
GFEEGGGGGGGGGGGGGADGEE-FEGFGGGGFGGFGAFGGEEFGGAGDGEFGE@GGFDEFGBGEGEEGGEEBEDAEBDDBEGEFFBEEFE=AD@AB=EEBDEDD@FABBBE=@AE:????
@N73018:1:7:6056:18003#0/2
GAGCAGCTAGTGGAAGGGGACGTAAAAAGGCAGCTGGTGACGATGAAGAAGGTAATGTATCTGACAGAGGAGACGAAGATGAGGAAGAGGAGGCAGCAAGGAAGAATAGACTTGGAATCA
+
DDGGEFFFFFBEEAEEEEE=BEE?EFFFBFBBE::EEBBAEFAFF:=BB?CE?BCBC5:B@EBBEBEBEBBEBAE@@E:BB:BB+5>>>>=BBB?B??,=<,<<<5????5?=5?%%%%%
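Going back to the CIGAR in the SAM record, the mismatch can be made concrete with a small script. This is only a minimal sketch of what I understand the samtools consistency check to be: under the SAM specification, the operations M, I, S, = and X consume read bases, so their lengths should add up to the length of the sequence. The helper name `cigar_query_length` is my own and is not part of samtools or tophat.

```python
import re

# CIGAR operations that consume bases of the read (per the SAM specification).
QUERY_CONSUMING = set("MIS=X")

def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in QUERY_CONSUMING)

# CIGAR and sequence copied from the failing line of accepted_hits.sam.
cigar = "104M178N536870906M138N22M"
seq = ("TGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCCCGCTCTTCACGATCATTACCAACAGC"
       "TTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC")

# The two numbers should be equal for a valid record; here they are not.
print(cigar_query_length(cigar), len(seq))

# Observation only, not a confirmed diagnosis: the odd middle length
# equals 2**29 - 6, and 104 - 6 + 22 equals the read length.
print(536870906 == 2**29 - 6, 104 - 6 + 22 == len(seq))
```

If the second line prints two True values, one possible reading is that the middle segment length was meant to be -6 and wrapped around, but that is just a guess on my side.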
Then I went to the reference, to scaffold 30063, and found that the sequence covered by the CIGAR is there, not as a single hit but as three hits, all close to each other. At first I thought the alignment might be split across regions of Ns, but it isn't. This is the part of the scaffold where the alignment occurs. So far, it seems OK.
>30063
TGATTCATCAAACAATACGATTTACAGTCGAAGGACCAGGATTTTGATATTCCGTGCAAA
ATACTCCTTTAGTTCACAATAAACAGGCCCTTGAAACTTACAGTTTCTTTCTATTAGTGG
ATTAATGATCAATTAGATCGTGCTCCTTTTAATTGTTGAATCCTAATTTGGAAATAGTTG
CAATCTTATCCTATAAGTAAAGATTTTGAAAAAGAATATTATTAATCTAACCGAAGAAAA
ATATTAATATTAATTTTTTTTGTTATTTCTTTCCACCCCACCCTTTCTCAGCCCAACTCT
GATTAGTTACAAAATATCATTTTTAAAATCAACTTTAATAAAATCCTAATTTGGGATATA
ATTGATACATAGGCCTTTACTTTTCATCAAAGAAAGCAAAATGCACTATAAATACTTCTG
AATATTGAAACCAATTGACACCCAAATGTTAAGTTTCAATTTAACTCCTAAGTTTTGGAA
CCAGTTGATAAATGTCTTCAGATACAAAAGTGAAAAATGGCAGACATTAATAACTTGCAA
AAACAAATACAAACCTGCTTGATTTCTGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCC
CGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCC
CAATCATCACCTAAAATCAACAAATATACATTTCTAAAGTGAAAATCACTTGAAAAACTA
AATTAAAGATTACACTTTATACAATTTTTTTTTTAAAAAAAAAAAAGAAAAAGAAGTTTC
ATCCGTGTCAGCATCTGTCAAGCTTATTGAACCAGGAAAGATGATTGCAAAACCTTTTCC
ATGTACACCAAAAAGAAGAGAATGCCACCCCTTGAGCACATAAGCAGGATCAATATAAGA
ATGCATAAGAAATCAGTAGTTATTTGTAGATTCAAACTCCTAAAATAAAAAAGCATTAGG
ACATCATACTGAAGAGAAATGCTCACCCTTCTCAATATCATCATCATCAATATCGTGATC
TCCTCCCCTTGGACCTTCCTCATCATCATCACCACCACGCTTGTTGAGTCCAAGTCTATT
CTTCCTTGCTGCCTCCTCTTCCTCATCTTCGTCTCCTCTGTCAGATACATTACCTTCTTC
ATCGTCACCAGCTGCCTTTTTACGTCCCCTTCCACTAGCTGCTCCACTGTCCTTATCATC
AAGCTTCTCCACTTCACCAAATGCAGCGGGTCCATTATTGGCAGCTTTCATCATCCATCT
CTCATATCCATCTGCAGTCTTTCTCCTGTTCCTCATCTTCTCTTCCGCTTCCTCCAAAGT
GAGTTGCTTATACTGAGCAACTTTATTAAAATTATACCTATGAAACATGTCAAGCAATTT
ACAACTTCAATTCTAAAAAGACAGTTAACATTACATCATCAAATTATCTTGCAAATGGGT
GAGTAGACAATTTTAAATGAAAAAAAAAAAAAATCTATGTCCTTGTAGGACAATCATTTC
CAGAATAAAACAAACAAGTAGGTAGGTAAAATGGGACGCCAAGTCGAAAAGTATTCTGCG
CAGTATTTACAAGAGGGAAGAGCATATAGCCCGAATAAAAGAAGCATGGAGAGTACCAAG
AAGAAGCAGGAATAGCAACAAACTCTTTCCCCTGCAAGATCAATAGGTAGTAAGTTGCTG
ATTGTGAACCTTCGAGATGACCTTGGTACTGAAACTGGCCTGTTTCATCCTCCAGGAGCC
AAGGTCTGTTCTTGTATTTCTCCTATAAAATGCAATTACAAATAGCAGTAAGTAGAATGC
AGTGACATTCAGATTTTGATTAGTTGAGTGATTCTTTAGTCCAAGCCAAATAATCTTTAC
CAAAGGGAAAGAAATAGAACAGCACTGAGTACAGTAAGTAAAAGGTACCCGCAAGGCATC
AGTAAGTTGACGTCCTTGAAGACCTTCTTTTTGAAGAGACCATTTGTTCTCAGCATTTTT
TTTCTTAGAAAAACTCGGTAGACCCGTAACAAACCTACCAATAAAGTAGTTCTTGTCGCT
GCTAGAACTTGCTCTAACATTATATTCCTAACAAAACCAAAAAAAATTAAAAGAAATAGA
CATGCAAATTGAGAGCAGAAAAAGGGAGAGGGAGGCTGACCCGAATCAAGCGAGAGACAG
TAGCACCACAATCAAAGCATTTAGCATGATTTTCAGCCATGGTTTTGCCACAAGACAAGC
ACAATGTCATGTGCTTGCAGTTGCTTCCGTATAAGTCGGTGGTTGATCCACATCCACTAC
ATGACCCTTTCAGCATCAGATCCATAGTCGCCATTTCTCTTTTATTTATTTATTTATTTA
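The search I did by eye on the scaffold can be sketched roughly as follows. It is a minimal example using exact string matching only (no mismatches, no reverse complement), the function names are my own, and the offsets it reports are relative to the pasted excerpt, not to real scaffold coordinates.

```python
def longest_prefix_hit(read, ref):
    """Return (length, offset) of the longest read prefix found exactly in ref."""
    for length in range(len(read), 0, -1):
        pos = ref.find(read[:length])
        if pos != -1:
            return length, pos
    return 0, -1

def longest_suffix_hit(read, ref):
    """Return (length, offset) of the longest read suffix found exactly in ref."""
    for length in range(len(read), 0, -1):
        pos = ref.find(read[-length:])
        if pos != -1:
            return length, pos
    return 0, -1

# Sequence of the failing SAM record.
read = ("TGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCCCGCTCTTCACGATCATTACCAACAGC"
        "TTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC")

# Eight consecutive lines of the scaffold 30063 excerpt shown above.
excerpt = (
    "AAACAAATACAAACCTGCTTGATTTCTGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCC"
    "CGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCC"
    "CAATCATCACCTAAAATCAACAAATATACATTTCTAAAGTGAAAATCACTTGAAAAACTA"
    "AATTAAAGATTACACTTTATACAATTTTTTTTTTAAAAAAAAAAAAGAAAAAGAAGTTTC"
    "ATCCGTGTCAGCATCTGTCAAGCTTATTGAACCAGGAAAGATGATTGCAAAACCTTTTCC"
    "ATGTACACCAAAAAGAAGAGAATGCCACCCCTTGAGCACATAAGCAGGATCAATATAAGA"
    "ATGCATAAGAAATCAGTAGTTATTTGTAGATTCAAACTCCTAAAATAAAAAAGCATTAGG"
    "ACATCATACTGAAGAGAAATGCTCACCCTTCTCAATATCATCATCATCAATATCGTGATC"
)

print("prefix hit (length, offset):", longest_prefix_hit(read, excerpt))
print("suffix hit (length, offset):", longest_suffix_hit(read, excerpt))
```

If the prefix and suffix lengths add up to more than the read length, the two pieces overlap in the read, which may be worth comparing against the segment lengths in the CIGAR.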
Thanks for your help,
Ricardo.