  • polarise
    Member
    • Jan 2011
    • 13

    Tophat ignoring '--max-multihits' flag?

    Hi,

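    I tried to exclude reads that map to multiple locations by passing '--max-multihits 1' to TopHat. The run.log file shows the flag on the tophat command line, but the prep_reads command that TopHat then launches is called with '--max-multihits 40', so the setting appears to be ignored. Does anyone have an idea why this would occur?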

    Code:
    /home/paulk/software/bin/tophat -r 90 -p 10 --solexa1.3-quals --max-multihits 1 -o 5C /home/paulk/bowtie-0.12.7/scripts/hg19 ../fastq/5C_1_sequence.txt ../fastq/5C_2_sequence.txt
    /home/paulk/software/bin/prep_reads --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir 5C// --max-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --sam-header 5C//tmp/stub_header.sam --inner-dist-mean 90 --inner-dist-std-dev 20 --no-microexon-search --phred64-quals --fastq ../fastq/5C_1_sequence.txt
    Paul
  • fkrueger
    Senior Member
    • Sep 2009
    • 627

    #2
    This is a known bug in TopHat (see here).

    Using -g 1 instead should do the trick.


    • polarise
      Member
      • Jan 2011
      • 13

      #3
      Originally posted by fkrueger
      This is a known bug in TopHat (see here).

      Using -g 1 instead should do the trick.
      Thank you. Another five hours of alignment!


      • DavidMatthewsBristol
        Junior Member
        • Aug 2010
        • 7

        #4
        This might help....

        Hi,
        I've been using Galaxy to analyse RNA-seq data from mRNA isolated from HeLa cells. Like others, I have the problem of reads that are multihits. I have put up a workflow for analysis on Galaxy that involves the following steps:
        1. Run the data through tophat, allowing up to 40 mappings per read (the default).
        2. Use a samtools feature to get rid of all mappings that are not mate-paired and in a proper pair.
        3. Count how many times each individual read appears in the SAM file and remove all read pairs that are not mapped to a unique site, putting them in a separate "multihits" file.
        4. Keep all the uniquely mapped, proper mate-paired hits in a unique-hits SAM file.
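
Steps 2 to 4 above can be sketched in a few lines of Python. This is only an illustration, not the Galaxy workflow itself: the function name split_unique_multihits is hypothetical, and it assumes that both mates of a pair share the same read name (no /1 or /2 suffix) and that a uniquely mapped proper pair therefore contributes exactly two SAM records. Anything that does not appear exactly twice is sent to the "multihits" list.

```python
from collections import Counter

# SAM FLAG bits (from the SAM specification)
PAIRED = 0x1        # read is paired in sequencing
PROPER_PAIR = 0x2   # read is mapped in a proper pair


def split_unique_multihits(sam_lines):
    """Split SAM text lines into (header, unique, multihits).

    Assumption (not from the original workflow): a uniquely mapped
    proper pair appears as exactly two records with the same QNAME.
    """
    header, proper = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        flag = int(line.split("\t")[1])
        # Step 2: keep only mate-paired mappings in a proper pair.
        if flag & PAIRED and flag & PROPER_PAIR:
            proper.append(line)

    # Step 3: count how often each read name occurs.
    counts = Counter(line.split("\t", 1)[0] for line in proper)

    # Step 4: exactly two records means one location for the pair.
    unique = [l for l in proper if counts[l.split("\t", 1)[0]] == 2]
    multihits = [l for l in proper if counts[l.split("\t", 1)[0]] != 2]
    return header, unique, multihits
```

The three returned lists can then be written out as a unique-hits SAM file (header plus unique) and a separate multihits file (header plus multihits).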

        This approach generates more unique hits than asking tophat to throw out reads that do not map uniquely (this may have changed with the latest tophat release; I haven't checked yet). I think (and the tophat guys may correct me on this) this is because tophat may be removing reads where one end is not uniquely mapped but the other is (so the pair only makes sense with one of the mates).
        However, whatever tophat does (now or in the future), this approach has the advantage of telling you where the multihit problem is and how big it is. My dataset has, for example, 18 million unique proper paired reads, 1.3 million that map to two places, a few hundred thousand that map to three places, and so on down the line.
        One problem with multihits is that we may be overestimating some genes by including multihits or, conversely, underestimating some genes by excluding them. This "Bristol" workflow allows us to at least know whether a gene is prone to multihits.
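
The breakdown described above (so many pairs mapping to one place, so many to two, and so on) can be tallied directly from a SAM file. The sketch below is a rough illustration under stated assumptions: the function name multihit_histogram is hypothetical, and the number of mapping locations for a pair is taken to be the number of proper-pair records for its first mate (SAM FLAG bit 0x40), which assumes the aligner writes one record per location.

```python
from collections import Counter

PAIRED = 0x1         # read is paired in sequencing
PROPER_PAIR = 0x2    # read is mapped in a proper pair
FIRST_IN_PAIR = 0x40 # read is the first mate of the pair


def multihit_histogram(sam_lines):
    """Return a Counter mapping number of locations -> number of read pairs.

    Assumption: each mapping location of a pair yields one first-mate
    record flagged as a proper pair.
    """
    locations = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        wanted = PAIRED | PROPER_PAIR | FIRST_IN_PAIR
        if flag & wanted == wanted:
            locations[fields[0]] += 1
    # Turn per-read location counts into a histogram.
    return Counter(locations.values())
```

For a real file, something like `multihit_histogram(open("accepted_hits.sam"))` would give the tally; the file name here is only a placeholder.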

        I think this approach is useful but I may have missed something or be behind the curve!! Who knows but I thought it might be a useful workflow to start a discussion about what to do with multihit reads.

        Cheers
        David
