Header Leaderboard Ad

Collapse

STAR vs tophat2 - mapping reads to long exons

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • STAR vs tophat2 - mapping reads to long exons

    Hi,

    I have some RNA-Seq from the fruit fly, where we test the differential gene expression of several knock-outs and developmental stages.
    As I heard a lot of good things about the STAR algorithm, I tested it in comparison to bbmap and tophat2.

    These are the commands I used in the three algorithms:
    Code:
    	~/software/STAR-STAR_2.4.1c/STAR --runThreadN 15 --genomeDir genomes/Drosophila_melanogaster/STARindex/Dmel/ --readFilesIn $file --readFilesCommand zcat  --sjdbGTFfile genes.gtf --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --outFileNamePrefix $NEW_FILE.STAR. --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
    
     	bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh minidentity=0.9 ambiguous=toss kfilter=85 bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat
    
     	tophat2 -p 15 -g 1 -G genes.gtf -o $NEW_FILE.tophat.out  genomes/Drosophila_melanogaster/Ensembl/BDGP6.80/bowtie2index/genome $file
    In all my runs ( I testes 12 different samples), STAR performed better than topohat2, while bbmap was way back with the number of mapped reads.
    When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

    What might be the reasons (if any), that STAR doesn't map these regions?

    I have here two examples of such exons.

    and


    another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic

  • #2
    Keep in mind that tophat2 is mapping to the transcriptome first, so if these exons contain repeat regions or are similar to pseudo genes then that's the reason (i.e., tophat2 might only be reporting alignments there due to not looking yet at the whole genome).

    Comment


    • #3
      When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k', like this:

      bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh ambiguous=toss maxindel=100k bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat

      Comment


      • #4
        Originally posted by Brian Bushnell View Post
        When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k'...
        Hi Brian,
        thanks for the remarks. I am trying now to run the suggested corrections. I was though wondering, how I can control the number of allowed mismatches in the mapping process or the quality of reads if using the kfilter and minidentity parameters is not recommended.

        thanks,
        Assa

        Comment


        • #5
          If you are using the latest version, the "subfilter=X" flag will be useful. It removes alignments with more than X mismatches. kfilter and minidentity are nice for DNA mapping, but not for RNA to genome.

          Comment


          • #6
            yes I have the newest version. I'll give it a go with this parameter.

            what parameter can I use to control the overall read quality?

            and

            Is there a way to collect splice junction/ as tophat is doing?

            Comment


            • #7
              Originally posted by frymor View Post
              When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

              What might be the reasons (if any), that STAR doesn't map these regions?
              You are using --outFilterMultimapNmax 1, which only allows uniquely mapped reads to be output in the SAM/BAM. No alignments will be reported for multi-mapping reads.
              My guess would be that these reads in the long exons are multi-mappers, and thus STAR does not output them.
              Please try mapping with multi-mappers allowed and check these regions again.

              As far as I can tell, with -g 1 option TopHat will report one random alignment out of all multi-mappers with the same highest score. Also, as @dpryan pointed out, TopHat maps to the transcriptome first, and may consider the alignments unique because of that.

              Originally posted by frymor View Post
              another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic
              This caused by --alignIntronMax 1, I replied in more detail at the link above.

              In general, when making the first runs and comparisons, I would recommend running STAR with default parameters.

              Cheers
              Alex

              Comment


              • #8
                Originally posted by frymor View Post
                what parameter can I use to control the overall read quality?
                Well, you could use the "subfilter=X" flag to reject mapping sites with more than X substitutions, but normally for RNA-seq, I only set maxindel to an appropriate value for the organism's intron lengths, and leave everything else at default.

                Is there a way to collect splice junction/ as tophat is doing?
                BBMap does not do that, but you can find the introns by pileup with Samtools (they will be listed as deletions), and you could probably find the exon locations by using Bedtools to list the locations with coverage above a certain threshold, if you treat bases covered by deletion events as not covered. It's very obvious where the introns and exons are when you look at the bam file in IGV, but I've never actually tried to generate annotation from mapping; there's probably a better way than what I'm suggesting.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  A Brief Overview and Common Challenges in Single-cell Sequencing Analysis
                  by seqadmin


                  ​​​​​​The introduction of single-cell sequencing has advanced the ability to study cell-to-cell heterogeneity. Its use has improved our understanding of somatic mutations1, cell lineages2, cellular diversity and regulation3, and development in multicellular organisms4. Single-cell sequencing encompasses hundreds of techniques with different approaches to studying the genomes, transcriptomes, epigenomes, and other omics of individual cells. The analysis of single-cell sequencing data i...

                  01-24-2023, 01:19 PM
                • seqadmin
                  Introduction to Single-Cell Sequencing
                  by seqadmin
                  Single-cell sequencing is a technique used to investigate the genome, transcriptome, epigenome, and other omics of individual cells using high-throughput sequencing. This technology has provided many scientific breakthroughs and continues to be applied across many fields, including microbiology, oncology, immunology, neurobiology, precision medicine, and stem cell research.

                  The advancement of single-cell sequencing began in 2009 when Tang et al. investigated the single-cell transcriptomes
                  ...
                  01-09-2023, 03:10 PM

                ad_right_rmr

                Collapse
                Working...
                X