Unconfigured Ad

**alexdobin** · 07-18-2014, 02:12 PM

Originally posted by kjusto View Post

Hi,
Trying to generate genome from Rice reference and I get the following error,have tried several STAR patches available:

biostat1@biostat[STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
Jul 17 10:55:11 ..... Started STAR run
Jul 17 10:55:11 ... Starting to generate Genome files
terminate called after throwing an instance of 'std:

ut_of_range'
what(): vector::_M_range_check
zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

Any ideas,
Thanks!

Hi @kjusto,
Is it possible that your L1_1.fq L1_2.fq files are not fasta files - they seem to have FASTQ extension?
If they are valid fasta, please send me a link to them, and I will try to figure out the problem.
Cheers
Alex

**ndaniel** · 09-03-2014, 11:16 AM

Originally posted by alexdobin View Post

Hi Emma,

these long-gap splices, often connecting adjacent genes, are somewhat common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events, or some kind of wet-lab or mapping artifacts.
They would be clearly mapping artifacts if "better" alignments of these sequences can be found, however, BLATing or BLASTing them did not result in any better alignments.
One way to get rid of them is to completely prohibit long gaps with --alignIntronMax N, which would prohibit any gap longer than N (by default this is ~600000). However, if you make this too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundred of kilobases long.
A better approach is filter out long-gap alignments supported by too few reads, e.g. :
--outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >= 3 reads, <=100kb by >=4 reads.

This looks like a very bad idea. There are plenty of fusion genes which are known to be "adjacent" but still are genuine fusion genes.

For example, the famous fusion gene FGFR3-TACC3 (for more info about it please Google it) has the genes FGFR3 and TACC3 as adjacent and the distance between them is less than 50kb. If this fusion is expressed moderately there could be, for example, 2 reads supporting the fusion and it will be discarded according to the above criteria.

Out there are several fusion finders which can handle this kind of cases with very low false positives rate, e.g. https://code.google.com/p/fusioncatcher/wiki/comparison

**coryfunk** · 10-23-2014, 08:57 AM

In using STAR for my alignments, I'm subsequently using samtools for some basic evaluations (and GATK after that to generate a vcf file). An issue I'm having is that even though STAR is correctly including the mapping location for each read in the sam file, it is not setting the flag that says if a read is mapped or not (-f 4). As such, when I use samtools to count the number of mapped reads, it's always zero. Samtools accurately counts the number of total reads and unmapped reads.

Is this an issue with STAR or possibly something I'm doing incorrectly?

**dpryan** · 10-23-2014, 11:08 AM

STAR doesn't output unmapped reads in the SAM file (at least not by default), so bit 4 shouldn't be set there for any record.

**coryfunk** · 10-23-2014, 11:16 AM

That's a bit of an issue because the default of not having a flag set is to call it "unmapped".

Running:

samtools view -c -F 4 <file.name>

counts all the reads as unmapped. Downstream tools that rely on that flag are thus rendered incompatible.

**dpryan** · 10-23-2014, 11:25 AM

No, not setting that bit means the read is mapped.
Edit:I'll add that samtools view -c -F 4 foo.bam will count the mapped reads. Using -f 4 will count the unmapped reads, which will be 0.

**Brian Bushnell** · 10-23-2014, 11:26 AM

0x4 means 'unmapped'. If no flags are set, that means it's a single-ended read mapped to the plus strand.

**coryfunk** · 10-23-2014, 11:31 AM

My mistake. Thanks!

**serash** · 11-14-2014, 04:47 AM

Hi,
I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

Thanks!

**alexdobin** · 11-18-2014, 02:23 PM

Originally posted by serash View Post

Hi,
I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

Thanks!

Hi,

the --twopass1readsN cannot be used with --genomeLoad LoadAndKeep option, you will need to re-load the genome for every sample. For each of the samples the 2-nd pass genome indices will be different, since the junctions discovered in the 1st pass will change from sample to sample.

For multiple samples, one option is to map (2-pass with --twopass1readsN -1 which will map all reads in the first pass) all the samples together, and use the Read Group tags to mark different samples.

Another option is the manual 2-step operation: run the 1st pass on all samples, collect the detected junctions from all samples, re-generate the genome, and run 2nd pass with the new genome, which can now be loaded with --genomeLoad LoadAndKeep option.

Cheers
Alex

**serash** · 12-28-2014, 04:46 AM

Hi again,

It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

cheers,
Dries

**alexdobin** · 01-08-2015, 02:52 PM

Originally posted by serash View Post

Hi again,

It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

cheers,
Dries

This is a good suggestion, and it's already on my (short) TODO-list - adding annotated junctions on the fly, so that there no need to re-generate the whole genome index when you only want to change annotations.

Cheers
Alex

**serash** · 01-09-2015, 12:04 AM

That's great to know! Thanks

**dan** · 01-15-2015, 05:03 AM

Star is generating files like this in the directory I'm running it from:

SJ.out.tab
Log.std.out
Log.out
Log.final.out
Log.progress.out

How can I put those files into a specific directory so that I can run multiple jobs in parallel from a single directory?

Cheers,
Dan.

**dpryan** · 01-15-2015, 05:08 AM

Just use the --outFileNamePrefix option to give everything an appropriate prefix.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 18 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 19 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News