  • Understanding tophat intermediate logs

    Hi
    Only about 6M of the 35M reads in my sample mapped with tophat. During QC I found a problem in the first 8 bases, so I trimmed them and am rerunning tophat (default parameters except for a mismatch setting of 2). Looking at the intermediate logs, I still see a log in which 85% of my reads fail to align.
    I have two questions:
    1. Can someone tell us a bit more about the intermediate files and logs generated?

    2. How much does a problem in %GC content affect the alignment?

    I am attaching some of the log files below:
    -thanks
    -LAx
    ______________________________________________
    [liyer01@h01 logs]$ more file2vqTeB.log
    # reads processed: 38843979
    # reads with at least one reported alignment: 5520299 (14.21%)
    # reads that failed to align: 33230242 (85.55%)
    # reads with alignments suppressed due to -m: 93438 (0.24%)
    Reported 8463229 alignments to 1 output stream(s)
    [liyer01@h01 logs]$ more long_spanning_reads.log
    long_spanning_reads v1.1.0 (1606)
    --------------------------------------------
    Opening S6_tophat_out/left_kept_reads.fq for reading
    Opening /dev/null for reading
    Opening S6_tophat_out/tmp/left_kept_reads.bwtout for reading
    Loading spliced hits...done
    Loading junctions...done
    [liyer01@h01 logs]$ more file8W6m7g.log
    # reads processed: 33230242
    # reads with at least one reported alignment: 26854884 (80.81%)
    # reads that failed to align: 3283599 (9.88%)
    # reads with alignments suppressed due to -m: 3091759 (9.30%)
    Reported 77514776 alignments to 1 output stream(s)

  • #2
    LAx,
    1. Tophat uses several smaller programs to do its work. One of these programs is long_spanning_reads; another is bowtie. If you look in logs/run.log you will find the command lines that tophat issues to its subsidiary programs, and see where the intermediate data files in /tmp and the logs with the cryptic names come from.
    The first and third log files that you attached are from bowtie. The first is probably from mapping the whole reads. The third, judging from the number of reads processed (33,230,242, the same as the number that failed to align in the first log), is probably from mapping segments of the initially unmapped reads in order to find splice junctions. The good news is that >80% of the segments mapped; the bad news is that the 26.9 million segments that aligned generated 77.5 million alignments, so many of the read segments aligned in more than one place.
    The second log file is from long_spanning_reads (obviously) and just shows that the program ran without any problems.

    2. The question is unclear. What kind of problem in GC content?
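The multi-mapping estimate in point 1 can be checked directly from the bowtie summary. Below is a minimal Python sketch, assuming the log lines look exactly like the ones pasted in the first post; the function names `parse_bowtie_summary` and `alignments_per_aligned_read` are made up for illustration and are not part of tophat or bowtie.

```python
import re


def parse_bowtie_summary(text):
    """Pull the read and alignment counts out of a bowtie-style summary."""
    patterns = {
        "processed": r"# reads processed: (\d+)",
        "aligned": r"# reads with at least one reported alignment: (\d+)",
        "failed": r"# reads that failed to align: (\d+)",
        "suppressed": r"# reads with alignments suppressed due to -m: (\d+)",
        "alignments": r"Reported (\d+) alignments",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        # A missing line is recorded as 0 rather than raising an error.
        stats[key] = int(match.group(1)) if match else 0
    return stats


def alignments_per_aligned_read(stats):
    """Average number of reported alignments per read that aligned."""
    if stats["aligned"] == 0:
        return 0.0
    return stats["alignments"] / stats["aligned"]


# Contents of file8W6m7g.log as pasted in the first post.
segment_log = """\
# reads processed: 33230242
# reads with at least one reported alignment: 26854884 (80.81%)
# reads that failed to align: 3283599 (9.88%)
# reads with alignments suppressed due to -m: 3091759 (9.30%)
Reported 77514776 alignments to 1 output stream(s)
"""

stats = parse_bowtie_summary(segment_log)
print(f"alignments per aligned read: {alignments_per_aligned_read(stats):.2f}")
```

A ratio well above 1 means that a typical aligned segment was placed at more than one location, which is the concern raised above.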

    • #3
      Hi Ian
      Thanks. I am trying to figure out how to deal with the sequences that align at multiple locations, and also what caused them. Is it something to do with the sequencing, i.e. specific artifacts in the reads produced? I did find some issues in the "per base sequence content" plot, which shows %G, %T, %C and %A at each base position. There were fluctuations in the first 8 bases and divergence again after about 40 bases. One thing I thought about was to filter the reads with the FASTX tools so that only bases with a quality greater than 30 are kept and only reads with a minimum length of 25 remain. The hope is that the multiple hits in the alignment are caused by low-quality portions of the reads and would be remedied by this.
      Any suggestions?
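The filtering idea above can be sketched in plain Python. This is not the FASTX tools themselves, only an illustration under stated assumptions: qualities are Phred+33 encoded, "quality greater than 30" is read as trimming bases below Q30 from the 3' end, and the first 8 bases are dropped as described in the first post. The name `trim_read` and its parameters are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def trim_read(seq, qual, head=8, min_q=30, min_len=25, offset=33):
    """Drop the first `head` bases, trim low-quality bases from the 3' end,
    and return None if the remaining read is shorter than `min_len`."""
    seq, qual = seq[head:], qual[head:]
    scores = phred_scores(qual, offset)

    # Walk back from the 3' end until a base meets the quality threshold.
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1

    if end < min_len:
        return None
    return seq[:end], qual[:end]


# Toy read: 8 suspect leading bases, 32 good bases (Q40), 4 poor bases (Q2).
seq = "N" * 8 + "ACGT" * 8 + "AAAA"
qual = "#" * 8 + "I" * 32 + "#" * 4

print(trim_read(seq, qual))           # 32 bases survive
print(trim_read("A" * 30, "I" * 30))  # None: only 22 bases left after the head trim
```

Comparing the bowtie summaries before and after such a filter would show whether low-quality read ends account for the multiple hits.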
