Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • frymor
    Senior Member
    • May 2010
    • 151

    STAR vs tophat2 - mapping reads to long exons

    Hi,

    I have some RNA-Seq from the fruit fly, where we test the differential gene expression of several knock-outs and developmental stages.
    As I heard a lot of good things about the STAR algorithm, I tested it in comparison to bbmap and tophat2.

    These are the commands I used in the three algorithms:
    Code:
    	~/software/STAR-STAR_2.4.1c/STAR --runThreadN 15 --genomeDir genomes/Drosophila_melanogaster/STARindex/Dmel/ --readFilesIn $file --readFilesCommand zcat  --sjdbGTFfile genes.gtf --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --outFileNamePrefix $NEW_FILE.STAR. --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
    
     	bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh minidentity=0.9 ambiguous=toss kfilter=85 bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat
    
     	tophat2 -p 15 -g 1 -G genes.gtf -o $NEW_FILE.tophat.out  genomes/Drosophila_melanogaster/Ensembl/BDGP6.80/bowtie2index/genome $file
    In all my runs ( I testes 12 different samples), STAR performed better than topohat2, while bbmap was way back with the number of mapped reads.
    When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

    What might be the reasons (if any), that STAR doesn't map these regions?

    I have here two examples of such exons.

    and


    another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Keep in mind that tophat2 is mapping to the transcriptome first, so if these exons contain repeat regions or are similar to pseudo genes then that's the reason (i.e., tophat2 might only be reporting alignments there due to not looking yet at the whole genome).

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k', like this:

      bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh ambiguous=toss maxindel=100k bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat

      Comment

      • frymor
        Senior Member
        • May 2010
        • 151

        #4
        Originally posted by Brian Bushnell View Post
        When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k'...
        Hi Brian,
        thanks for the remarks. I am trying now to run the suggested corrections. I was though wondering, how I can control the number of allowed mismatches in the mapping process or the quality of reads if using the kfilter and minidentity parameters is not recommended.

        thanks,
        Assa

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          If you are using the latest version, the "subfilter=X" flag will be useful. It removes alignments with more than X mismatches. kfilter and minidentity are nice for DNA mapping, but not for RNA to genome.

          Comment

          • frymor
            Senior Member
            • May 2010
            • 151

            #6
            yes I have the newest version. I'll give it a go with this parameter.

            what parameter can I use to control the overall read quality?

            and

            Is there a way to collect splice junction/ as tophat is doing?

            Comment

            • alexdobin
              Senior Member
              • Feb 2009
              • 161

              #7
              Originally posted by frymor View Post
              When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

              What might be the reasons (if any), that STAR doesn't map these regions?
              You are using --outFilterMultimapNmax 1, which only allows uniquely mapped reads to be output in the SAM/BAM. No alignments will be reported for multi-mapping reads.
              My guess would be that these reads in the long exons are multi-mappers, and thus STAR does not output them.
              Please try mapping with multi-mappers allowed and check these regions again.

              As far as I can tell, with -g 1 option TopHat will report one random alignment out of all multi-mappers with the same highest score. Also, as @dpryan pointed out, TopHat maps to the transcriptome first, and may consider the alignments unique because of that.

              Originally posted by frymor View Post
              another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic
              This caused by --alignIntronMax 1, I replied in more detail at the link above.

              In general, when making the first runs and comparisons, I would recommend running STAR with default parameters.

              Cheers
              Alex

              Comment

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                #8
                Originally posted by frymor View Post
                what parameter can I use to control the overall read quality?
                Well, you could use the "subfilter=X" flag to reject mapping sites with more than X substitutions, but normally for RNA-seq, I only set maxindel to an appropriate value for the organism's intron lengths, and leave everything else at default.

                Is there a way to collect splice junction/ as tophat is doing?
                BBMap does not do that, but you can find the introns by pileup with Samtools (they will be listed as deletions), and you could probably find the exon locations by using Bedtools to list the locations with coverage above a certain threshold, if you treat bases covered by deletion events as not covered. It's very obvious where the introns and exons are when you look at the bam file in IGV, but I've never actually tried to generate annotation from mapping; there's probably a better way than what I'm suggesting.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Pathogen Surveillance with Advanced Genomic Tools
                  by seqadmin




                  The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                  03-24-2025, 11:48 AM
                • seqadmin
                  New Genomics Tools and Methods Shared at AGBT 2025
                  by seqadmin


                  This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                  The Headliner
                  The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                  03-03-2025, 01:39 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 03-20-2025, 05:03 AM
                0 responses
                41 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-19-2025, 07:27 AM
                0 responses
                51 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-18-2025, 12:50 PM
                0 responses
                38 views
                0 reactions
                Last Post seqadmin  
                Started by seqadmin, 03-03-2025, 01:15 PM
                0 responses
                193 views
                0 reactions
                Last Post seqadmin  
                Working...