**sindrle** · 07-07-2014, 01:35 PM

Quick question: on a single FASTQ file, Tophat2 takes about 3-4 h, while STAR finishes in anywhere from 5 min to 1.5 h.

How can it take only 5 min? That's almost unbelievable. It reports success, but I'm very sceptical.

**dpryan** · 07-07-2014, 01:49 PM

That sounds about right. STAR is vastly faster than tophat, but requires much more memory.

**sindrle** · 07-07-2014, 01:52 PM

Holy moly.

**apredeus** · 07-07-2014, 02:01 PM

Yeah, the file Log.out has the actual mapping speed.

So check it to see; it's usually somewhere around hundreds of millions of reads per hour on 8 CPUs.

If it's substantially lower, something is going wrong (e.g. if you use UCSC mRNAs as a reference, there are a lot of junctions of "-1" length for whatever reason, which slows STAR down a lot).

**dxkorall** · 07-08-2014, 09:01 PM

Alex, is it possible to integrate your tool into the Tool Panel of the main public Galaxy server (usegalaxy.org) as a working tool?

**alexdobin** · 07-09-2014, 07:55 AM

Originally posted by dxkorall:

Alex, is it possible to integrate your tool into the Tool Panel of the main public Galaxy server (usegalaxy.org) as a working tool?

This is certainly doable; however, such decisions are in the hands of the Galaxy Team. Please make a request to Anton Nekrutenko & Co.

Cheers
Alex

**kjusto** · 07-11-2014, 06:02 AM

Hi guys,
1. Is there any package available for STAR for easy installation?
2. My machine reports: Architecture: i686; CPU op-mode(s): 32-bit, 64-bit; CPU(s): 2. Is it compatible with STAR, either from the binary or from source?
Thanks

**GenoMax** · 07-11-2014, 06:23 AM

Pre-compiled linux binary is available here: https://code.google.com/p/rna-star/d...4.tgz&can=2&q=

**emmanouela** · 07-11-2014, 09:18 AM

Reads with very long "deletions"

Hello,

I used STAR to map our single-end RNA-seq reads, which are 50 bp long (both with and without a GTF file). However, I get quite a few reads that supposedly have huge deletions/gaps of hundreds of kb, which look like mapping issues.
Sometimes the "deletions" even span entire genes.

Two examples are:
HISEQ2000-02:509:C4C7EACXX:4:1306:5718:46438 0 chr10 94874729 255 44M196485N7M * 0 0 TAACGGAACTCCTACTAGATACATCAGGATGCAAACTATAAAAGGGTCAGT @@@DDD?D@DDHB>?B<B<<CAC?BEDG?9*)1CF;<??BF*??B)?90?? NH:i:1 HI:i:1 AS:i:45 nM:i:1 jM:B:c,1 jI:B:i,94874773,95071257

HISEQ2000-02:509:C4C7EACXX:4:2303:5831:46194 0 chr10 95008269 255 23M125529N28M * 0 0 CAATAAAAACGTATACCGATTGGCAAAAAAAAAGAAAAAAAAAAAAAAAAA CBCFFFFFHHFHHJJJJHIIJHEHJJJJJJJJJ-5@GIJHFDDDDDDDDDD NH:i:1 HI:i:1 AS:i:39 nM:i:0 jM:B:c,5 jI:B:i,95008292,95133820

Has anyone else seen these? Is there any way to filter them out?

**Brian Bushnell** · 07-11-2014, 09:25 AM

Not sure about the second one, but the first one, with a 200 kbp deletion anchored by only 7 bp of read sequence, looks like a probable mapping error to me, considering that a 7 bp exact match would be expected purely by chance within about 16 kbp of any random location. However, if that 200 kbp corresponds exactly to a known intron in the GTF file, and only occurs when using the GTF file, it's probably OK. Does it?
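As a rough check of the 16 kbp figure, here is a minimal Python sketch. It assumes uniformly random, independent bases (real genomes are not), and the function names `expected_spacing` and `expected_hits` are made up for illustration:

```python
def expected_spacing(k, alphabet_size=4):
    """Average distance (bp) between chance exact matches of one specific k-mer,
    assuming uniformly random, independent bases (a simplification)."""
    return alphabet_size ** k


def expected_hits(k, window_bp, alphabet_size=4):
    """Expected number of chance exact matches of a k-mer inside a window."""
    positions = max(window_bp - k + 1, 0)  # possible start positions in the window
    return positions / expected_spacing(k, alphabet_size)


print(expected_spacing(7))                   # 16384, i.e. about 16 kbp
print(round(expected_hits(7, 196485), 1))    # 12.0 chance hits within the 196485 bp gap
```

Under these assumptions, a 7 bp anchor has roughly a dozen equally good chance placements inside a gap of that size, which is why such a short overhang alone is weak evidence for a junction.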

**kjusto** · 07-12-2014, 12:29 AM

Originally posted by GenoMax:

Pre-compiled linux binary is available here: https://code.google.com/p/rna-star/d...4.tgz&can=2&q=

Thanks for the link. I had to use proxies to get it, though (Google issues here). My question was about a 32-bit Linux OS: are there any binaries for it?

**GenoMax** · 07-12-2014, 02:35 AM

Don't think Alex provides 32-bit binaries. If you have a large genome (~ human) 32-bit may not work.

Build from source if you must have 32-bit: https://code.google.com/p/rna-star/d...e.tgz&can=2&q=

**emmanouela** · 07-14-2014, 01:56 AM

Originally posted by Brian Bushnell:

However, if that 200kbp corresponds exactly to a known intron in the GTF file, and only occurs when using the GTF file, it's probably OK. Does it?

Hi Brian,
No, I didn't use a GTF for the mapping in this case. Also, the mapped read matches a known intron boundary (of a short gene) on one side, but on the other side it lands in a random intergenic region well past the end of the gene it starts in (at least according to UCSC). The 200 kb gap overlaps 4 other known genes too. So to my eyes that's definitely a mapping error as well. The question now is how to filter those out (because there are quite a few of them).

**kjusto** · 07-17-2014, 07:45 AM

Hi,
I'm trying to generate a genome from the rice reference and I get the following error. I have tried several of the available STAR patches:

biostat1@biostat[STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
Jul 17 10:55:11 ..... Started STAR run
Jul 17 10:55:11 ... Starting to generate Genome files
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

Any ideas?
Thanks!

**alexdobin** · 07-18-2014, 02:08 PM

Originally posted by emmanouela:

Hi Brian,
No, I didn't use a GTF for the mapping in this case. Also, the mapped read matches a known intron boundary (of a short gene) on one side, but on the other side it lands in a random intergenic region well past the end of the gene it starts in (at least according to UCSC). The 200 kb gap overlaps 4 other known genes too. So to my eyes that's definitely a mapping error as well. The question now is how to filter those out (because there are quite a few of them).

Hi Emma,

These long-gap splices, often connecting adjacent genes, are fairly common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events or some kind of wet-lab or mapping artifact.
They would clearly be mapping artifacts if "better" alignments of these sequences could be found; however, BLATing or BLASTing them did not yield any better alignments.
One way to get rid of them is to prohibit long gaps altogether with --alignIntronMax N, which disallows any gap longer than N (by default the limit is ~600000). However, if you make N too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundreds of kilobases long.
A better approach is to filter out long-gap alignments supported by too few reads, e.g.:
--outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >=3 reads, and <=100kb by >=4 reads.
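
To make the thresholds concrete, here is a small Python sketch of the rule as described for the four values in the command above. It is only an illustration, not STAR's implementation: the names `MAX_GAP_BY_READS`, `max_gap_in_cigar` and `junction_passes` are made up, and applying the last threshold to junctions with more than 4 supporting reads is an assumption.

```python
import re

# Values from the --outSJfilterIntronMaxVsReadN example: the i-th entry is the
# largest gap allowed for a junction supported by (i + 1) spliced reads.
MAX_GAP_BY_READS = [10000, 20000, 50000, 100000]


def max_gap_in_cigar(cigar):
    """Return the longest N (skipped region) operation in a CIGAR string, or 0."""
    gaps = [int(n) for n in re.findall(r"(\d+)N", cigar)]
    return max(gaps, default=0)


def junction_passes(gap, n_reads, thresholds=MAX_GAP_BY_READS):
    """Illustrative rule: the gap must not exceed the limit for its read support.

    Assumption: support beyond the last listed count reuses the last limit.
    """
    if n_reads < 1:
        return False
    idx = min(n_reads, len(thresholds)) - 1
    return gap <= thresholds[idx]


# The first example read from earlier in the thread: 44M196485N7M
gap = max_gap_in_cigar("44M196485N7M")
print(gap)                      # 196485
print(junction_passes(gap, 1))  # False: far above the 10 kb limit for 1 read
```

With these settings, a junction like the first example would be dropped regardless of read support in this sketch, while a 15 kb unannotated junction would need at least 2 supporting reads.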

There is more discussion on this type of filtering in this post.

Cheers
Alex
