  • Tophat not

    Hello everybody,

    I'm new to the forum. I have found it useful before, because for many of the problems I had, someone had already run into something similar, but this time I'm a bit more stuck than usual.

    I'm trying to do some differential expression analysis with cufflinks, tophat and bowtie. However, at the moment we don't have a complete reference for the organism, so we are trying to use a partial assembly as the reference. The software seems to complete all the bowtie alignments, but samtools view fails at some point in the pipeline. I found in run.log that the last executed command was

    samtools view -S -b /tmp/lane_1//tmp/accepted_hits.sam > /tmp/lane_1//tmp/fileZvx32H

    So I ran that command again manually, and the output was the following:

    [samopen] SAM header is present: 25828 sequences.
    Parse error at line 1963402: CIGAR and sequence length are inconsistent
    /usr/users/tgac/ramirezr/.lsbatch/1290771421.40646: line 8: 16759 Aborted (core dumped) samtools view -S -b lane_1/tmp/accepted_hits.sam > lane_1/tmp/fileZvx32H


    So, I went to line 1963402 in accepted_hits.sam, and I found the following:

    N73018:1:7:6056:18003#0 99 30063 392907 255 104M178N536870906M138N22M = 393405 0 TGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCCCGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC GFEEGGGGGGGGGGGGGADGEE-FEGFGGGGFGGFGAFGGEEFGGAGDGEFGE@GGFDEFGBGEGEEGGEEBEDAEBDDBEGEFFBEEFE=AD@AB=EEBDEDD@FABBBE=@AE:???? NM:i:255 XS:A:- NH:i:1

    I can see that the CIGAR 104M178N536870906M138N22M is completely unexpected. Then I looked in the read files, and both reads of the pair look OK.
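The check that samtools is complaining about can be reproduced outside the pipeline. The following is a minimal sketch, assuming a plain tab-separated SAM file; the function names `cigar_query_length` and `inconsistent_records` are made up for illustration and are not part of samtools or tophat. It adds up the CIGAR operations that consume read bases (M, I, S, = and X) and compares the total with the length of SEQ.

```python
import re

# CIGAR operations that consume bases of the read (SEQ) in the SAM format.
QUERY_CONSUMING_OPS = set("MIS=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar):
    """Return the number of read bases that the CIGAR string accounts for."""
    return sum(int(count) for count, op in CIGAR_TOKEN.findall(cigar)
               if op in QUERY_CONSUMING_OPS)


def inconsistent_records(sam_lines):
    """Yield (line_number, read_name, cigar_length, seq_length) for every
    alignment whose CIGAR does not match the length of its sequence."""
    for line_number, line in enumerate(sam_lines, start=1):
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, cigar, seq = fields[0], fields[5], fields[9]
        if cigar == "*" or seq == "*":  # nothing to compare
            continue
        cigar_length = cigar_query_length(cigar)
        if cigar_length != len(seq):
            yield line_number, name, cigar_length, len(seq)


# The CIGAR from the record above: 104 + 536870906 + 22 read bases.
print(cigar_query_length("104M178N536870906M138N22M"))  # 536871032
```

Passing an open file handle for accepted_hits.sam to `inconsistent_records` would list every offending line rather than stopping at the first one. As a side note on the arithmetic only: 536870906 equals 2^29 - 6, and 104 - 6 + 22 = 120, which is the length of the sequence in the record. That would be consistent with a negative segment length having wrapped around, but this is a guess from the numbers, not something the logs confirm.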


    @N73018:1:7:6056:18003#0/1
    TCCCGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCCCAATCATCACCCTTCTCAATATCATC
    +
    GFEEGGGGGGGGGGGGGADGEE-FEGFGGGGFGGFGAFGGEEFGGAGDGEFGE@GGFDEFGBGEGEEGGEEBEDAEBDDBEGEFFBEEFE=AD@AB=EEBDEDD@FABBBE=@AE:????

    @N73018:1:7:6056:18003#0/2
    GAGCAGCTAGTGGAAGGGGACGTAAAAAGGCAGCTGGTGACGATGAAGAAGGTAATGTATCTGACAGAGGAGACGAAGATGAGGAAGAGGAGGCAGCAAGGAAGAATAGACTTGGAATCA
    +
    DDGGEFFFFFBEEAEEEEE=BEE?EFFFBFBBE::EEBBAEFAFF:=BB?CE?BCBC5:B@EBBEBEBEBBEBAE@@E:BB:BB+5>>>>=BBB?B??,=<,<<<5????5?=5?%%%%%


    Then I went to the reference, to sequence 30063, and I found that the read sequence from the alignment is there, not as a single hit but as three, all of them close to each other. At first I thought the alignment might be split across regions of Ns, but it isn't. This is the part of the scaffold where the alignment occurs. So far, it seems OK.

    >30063
    TGATTCATCAAACAATACGATTTACAGTCGAAGGACCAGGATTTTGATATTCCGTGCAAA
    ATACTCCTTTAGTTCACAATAAACAGGCCCTTGAAACTTACAGTTTCTTTCTATTAGTGG
    ATTAATGATCAATTAGATCGTGCTCCTTTTAATTGTTGAATCCTAATTTGGAAATAGTTG
    CAATCTTATCCTATAAGTAAAGATTTTGAAAAAGAATATTATTAATCTAACCGAAGAAAA
    ATATTAATATTAATTTTTTTTGTTATTTCTTTCCACCCCACCCTTTCTCAGCCCAACTCT
    GATTAGTTACAAAATATCATTTTTAAAATCAACTTTAATAAAATCCTAATTTGGGATATA
    ATTGATACATAGGCCTTTACTTTTCATCAAAGAAAGCAAAATGCACTATAAATACTTCTG
    AATATTGAAACCAATTGACACCCAAATGTTAAGTTTCAATTTAACTCCTAAGTTTTGGAA
    CCAGTTGATAAATGTCTTCAGATACAAAAGTGAAAAATGGCAGACATTAATAACTTGCAA
    AAACAAATACAAACCTGCTTGATTTCTGGAGGAGCAGGAACCTCTGGGGCCAAATCTTCC
    CGCTCTTCACGATCATTACCAACAGCTTCATCATCATCGGTGAAAATTTCTTCATGCTCC
    CAATCATCACCTAAAATCAACAAATATACATTTCTAAAGTGAAAATCACTTGAAAAACTA
    AATTAAAGATTACACTTTATACAATTTTTTTTTTAAAAAAAAAAAAGAAAAAGAAGTTTC
    ATCCGTGTCAGCATCTGTCAAGCTTATTGAACCAGGAAAGATGATTGCAAAACCTTTTCC
    ATGTACACCAAAAAGAAGAGAATGCCACCCCTTGAGCACATAAGCAGGATCAATATAAGA
    ATGCATAAGAAATCAGTAGTTATTTGTAGATTCAAACTCCTAAAATAAAAAAGCATTAGG
    ACATCATACTGAAGAGAAATGCTCACCCTTCTCAATATCATCATCATCAATATCGTGATC
    TCCTCCCCTTGGACCTTCCTCATCATCATCACCACCACGCTTGTTGAGTCCAAGTCTATT
    CTTCCTTGCTGCCTCCTCTTCCTCATCTTCGTCTCCTCTGTCAGATACATTACCTTCTTC
    ATCGTCACCAGCTGCCTTTTTACGTCCCCTTCCACTAGCTGCTCCACTGTCCTTATCATC
    AAGCTTCTCCACTTCACCAAATGCAGCGGGTCCATTATTGGCAGCTTTCATCATCCATCT
    CTCATATCCATCTGCAGTCTTTCTCCTGTTCCTCATCTTCTCTTCCGCTTCCTCCAAAGT
    GAGTTGCTTATACTGAGCAACTTTATTAAAATTATACCTATGAAACATGTCAAGCAATTT
    ACAACTTCAATTCTAAAAAGACAGTTAACATTACATCATCAAATTATCTTGCAAATGGGT
    GAGTAGACAATTTTAAATGAAAAAAAAAAAAAATCTATGTCCTTGTAGGACAATCATTTC
    CAGAATAAAACAAACAAGTAGGTAGGTAAAATGGGACGCCAAGTCGAAAAGTATTCTGCG
    CAGTATTTACAAGAGGGAAGAGCATATAGCCCGAATAAAAGAAGCATGGAGAGTACCAAG
    AAGAAGCAGGAATAGCAACAAACTCTTTCCCCTGCAAGATCAATAGGTAGTAAGTTGCTG
    ATTGTGAACCTTCGAGATGACCTTGGTACTGAAACTGGCCTGTTTCATCCTCCAGGAGCC
    AAGGTCTGTTCTTGTATTTCTCCTATAAAATGCAATTACAAATAGCAGTAAGTAGAATGC
    AGTGACATTCAGATTTTGATTAGTTGAGTGATTCTTTAGTCCAAGCCAAATAATCTTTAC
    CAAAGGGAAAGAAATAGAACAGCACTGAGTACAGTAAGTAAAAGGTACCCGCAAGGCATC
    AGTAAGTTGACGTCCTTGAAGACCTTCTTTTTGAAGAGACCATTTGTTCTCAGCATTTTT
    TTTCTTAGAAAAACTCGGTAGACCCGTAACAAACCTACCAATAAAGTAGTTCTTGTCGCT
    GCTAGAACTTGCTCTAACATTATATTCCTAACAAAACCAAAAAAAATTAAAAGAAATAGA
    CATGCAAATTGAGAGCAGAAAAAGGGAGAGGGAGGCTGACCCGAATCAAGCGAGAGACAG
    TAGCACCACAATCAAAGCATTTAGCATGATTTTCAGCCATGGTTTTGCCACAAGACAAGC
    ACAATGTCATGTGCTTGCAGTTGCTTCCGTATAAGTCGGTGGTTGATCCACATCCACTAC
    ATGACCCTTTCAGCATCAGATCCATAGTCGCCATTTCTCTTTTATTTATTTATTTATTTA
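The manual lookup described above, finding where the pieces of a read sit in the scaffold, can be sketched in a few lines. This is only an illustration under simple assumptions: exact matches on the forward strand, first occurrence only, and a greedy split. `locate_segments` is a hypothetical helper name, and the sequences in the usage lines are toy data, not taken from scaffold 30063.

```python
def locate_segments(reference, read, min_len=8):
    """Greedily split `read` into maximal exact matches against `reference`.

    Returns a list of (read_start, reference_start, length) tuples with
    0-based coordinates. Only the first occurrence of each match is reported.
    """
    segments = []
    i = 0
    while i < len(read):
        best_pos, best_len = -1, 0
        # Extend the candidate one base at a time while it still occurs.
        for j in range(i + 1, len(read) + 1):
            pos = reference.find(read[i:j])
            if pos < 0:
                break
            best_pos, best_len = pos, j - i
        if best_len >= min_len:
            segments.append((i, best_pos, best_len))
            i += best_len
        else:
            i += 1  # this base starts no sufficiently long match
    return segments


# Toy data: two 8-base blocks separated by a gap in the reference.
toy_reference = "TTTTACGTACGGTTTTTTTTCCATGGCATTTT"
toy_read = "ACGTACGGCCATGGCA"
print(locate_segments(toy_reference, toy_read))  # [(0, 4, 8), (8, 20, 8)]
```

The gaps between consecutive reference positions in the result correspond to what an aligner would report as N operations in the CIGAR, so the output can be compared by eye against the segment lengths in the SAM record.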


    thanks for your help
    Ricardo.
