**GenoMax** · 08-12-2015, 10:28 AM

I also noticed that you are running STAR with 16 threads. There must be a slurm command to ensure that you are actually getting that many cores and that they are all on the same physical node.

**stu** · 08-12-2015, 10:34 AM

I've most recently been submitting jobs with this command:

[scripts]$ sbatch -p Long -N 1 -n 16 --no-requeue STAR_generate_genome_indices.sh

Which I believe should allocate a single node of 16 cores to the job. I've tried this new command:

[scripts]$ sbatch -p Long -N 1 -n 16 --mem 64000 --no-requeue STAR_generate_genome_indices.sh

To see if manually specifying the amount of memory to use makes a difference.
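
For a multithreaded program such as STAR, a common Slurm pattern is to request one task with several CPUs on a single node; `-N 1 -n 16` asks for 16 tasks instead, which usually (but not necessarily) also yields 16 cores on one node. A minimal sketch, assuming the partition name `Long` from the commands above. The script name and paths are placeholders, not taken from stu's cluster. The snippet only writes the batch script and syntax-checks it; submission is left commented out.

```shell
# Write a hypothetical batch script; sbatch reads the "#SBATCH" lines as options.
cat > star_index.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=Long
# Keep all cores on one physical node.
#SBATCH --nodes=1
# One process with 16 CPUs available for its threads.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
# Memory in MB, mirroring the --mem 64000 attempt above.
#SBATCH --mem=64000
#SBATCH --no-requeue

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with ${THREADS} threads"

# Placeholder paths: replace with real locations before submitting.
STAR --runMode genomeGenerate \
     --runThreadN "${THREADS}" \
     --genomeDir /path/to/genome_index \
     --genomeFastaFiles /path/to/genome.fa
EOF

# Syntax check only; nothing is submitted or executed here.
bash -n star_index.sbatch && echo "syntax OK"
# sbatch star_index.sbatch    # submit once the paths are filled in
```

Passing `$SLURM_CPUS_PER_TASK` to `--runThreadN` keeps the thread count in step with what the scheduler actually granted.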

**alexdobin** · 08-13-2015, 01:05 PM

Hi @stu,

since you have a GFF (not GTF) file, you need to use
--sjdbGTFtagExonParentTranscript Parent
It's actually explained in the manual (chapter 2.2.3)

If this does not work, please post a few "exon" lines of your GFF.

Cheers
Alex
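
To pull out a few "exon" lines and check that they carry a `Parent=` attribute, something like the following could be used. This is a sketch on a toy GFF3 file with made-up IDs; a real annotation would replace `toy.gff3`. The commented STAR command at the end uses placeholder paths.

```shell
# Toy GFF3 for illustration only (tab-separated, hypothetical IDs).
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n'                  > toy.gff3
printf 'chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n'       >> toy.gff3
printf 'chr1\tsrc\texon\t100\t300\t.\t+\t.\tID=t1.e1;Parent=t1\n'    >> toy.gff3
printf 'chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=t1.e2;Parent=t1\n'    >> toy.gff3

# Show the first few exon records (column 3 is the feature type).
awk -F'\t' '$3 == "exon"' toy.gff3 | head -n 5

# Count exon records whose attribute column contains a Parent= tag.
exons_with_parent=$(awk -F'\t' '$3 == "exon" && $9 ~ /(^|;)Parent=/ { n++ }
                                END { print n + 0 }' toy.gff3)
echo "exon lines with Parent=: ${exons_with_parent}"

# Index generation with the GFF option from the post (placeholder paths):
#   STAR --runMode genomeGenerate --genomeDir /path/to/index \
#        --genomeFastaFiles /path/to/genome.fa \
#        --sjdbGTFfile /path/to/annot.gff3 \
#        --sjdbGTFtagExonParentTranscript Parent
```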

**stu** · 08-14-2015, 04:04 AM

Alex -

Worked like a charm! Thank you for your help!

**pandamon** · 09-03-2015, 04:02 AM

Hello,

Say I would like to map my data to both human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed light on how this can be done effectively?

Thanks a lot in advance.

Cheers

**alexdobin** · 09-03-2015, 01:56 PM

Originally posted by pandamon

Hello,

Say I would like to map my data to both human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed light on how this can be done effectively?

Thanks a lot in advance.

Cheers

Hi,

the best way is to generate the genome index for a combined reference of mouse and human. This will require ~60GB of RAM. You would need to make the chromosome names distinct in the mouse and human genome FASTA files (say, add m to the mouse chromosome names). You also need to do the same renaming of the chromosome names in the annotation GTFs. The GTFs (say, GENCODE) typically have distinct transcript names for different species; if not, you would have to rename them as well. The GTF files from the two species have to be concatenated.

Cheers
Alex
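
The renaming and concatenation described above can be sketched with `sed` and `awk`. The toy files and the `m` prefix are illustrative; real inputs would be the full human and mouse FASTA and GTF files, and the output names are arbitrary.

```shell
# Toy inputs for illustration only; real files would be the full genomes.
printf '>chr1\nACGT\n' > human.fa
printf '>chr1\nTTGA\n' > mouse.fa
printf 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "HG1"; transcript_id "HT1";\n' > human.gtf
printf '#comment\nchr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "MG1"; transcript_id "MT1";\n' > mouse.gtf

# Prefix every mouse sequence name with "m" (">chr1" becomes ">mchr1").
sed 's/^>/>m/' mouse.fa > mouse.renamed.fa

# Apply the same prefix to column 1 of the mouse GTF, leaving "#" lines alone.
awk 'BEGIN { FS = OFS = "\t" } /^#/ { print; next } { $1 = "m" $1; print }' \
    mouse.gtf > mouse.renamed.gtf

# Concatenate the two species into one reference and one annotation.
cat human.fa  mouse.renamed.fa  > combined.fa
cat human.gtf mouse.renamed.gtf > combined.gtf

# List the sequence names that the combined index would contain.
grep '^>' combined.fa
```

The combined FASTA and GTF would then go to `--genomeFastaFiles` and `--sjdbGTFfile` in a normal `--runMode genomeGenerate` run.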

**rvann** · 11-30-2015, 04:54 AM

--quantMode GeneCounts short read error

I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.

STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500

Nov 28 17:54:22 ...... FATAL ERROR, exiting

Is it because of the trimming with Cutadapt? Is STAR failing to process 'zero length' reads? Thanks for your help. Log.out file attached

cutadapt 1.8.3:
cutadapt -q 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o inpath/sample_R1.fastq.gz -p inpath/sample_R2.fastq.gz pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz

STAR 2.4.2a:
STAR --genomeLoad NoSharedMemory --genomeSAsparseD 2 --outSAMstrandField intronMotif --genomeDir pathto/STARgenome --sjdbGTFfile pathto/STARgenome/gencode.v23.annotation.gtf --runThreadN 2 --quantMode GeneCounts --readFilesIn pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix sample --outSAMtype BAM Unsorted --outStd BAM_Unsorted

Attached Files

Log.out.zip (11.1 KB)

**gsgs** · 11-30-2015, 07:42 AM

What's the probability that I will be able to get a Windows executable that runs on the command line (cmd.exe) on my computer and also runs from batch files?

**alexdobin** · 11-30-2015, 03:16 PM

Originally posted by rvann

I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.

STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500

Nov 28 17:54:22 ...... FATAL ERROR, exiting

Hi @rvann,

the error is caused by the zero-length read sequence; STAR cannot process those.
Hopefully, cutadapt has an option to remove them. This has to be done simultaneously for the read1 and read2 files to preserve the order of the reads.

Cheers
Alex
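
One way to confirm that zero-length reads are present before rerunning STAR is to count empty sequence lines in each FASTQ file. A sketch on toy files follows; `count_empty` is a hypothetical helper name, and gzipped files would first be piped through `zcat`. cutadapt's `-m` (`--minimum-length`) option is the usual filter for this; the threshold of 20 in the comment is an arbitrary assumption, and paired-mode behaviour should be checked against the documentation for the cutadapt version in use.

```shell
# Toy paired FASTQ files for illustration; record 2 of R1 was trimmed to length 0.
printf '@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n@r3\nGGCA\n+\nIIII\n' > toy_R1.fastq
printf '@r1\nTTAA\n+\nIIII\n@r2\nCCGG\n+\nIIII\n@r3\nAATT\n+\nIIII\n' > toy_R2.fastq

# Count records whose sequence line (2nd line of each 4-line record) is empty.
count_empty() {
    awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }' "$1"
}

echo "R1 empty reads: $(count_empty toy_R1.fastq)"
echo "R2 empty reads: $(count_empty toy_R2.fastq)"

# Possible cutadapt call with a minimum-length filter (ADAPTER1/ADAPTER2 and
# the file names are placeholders; 20 is an assumed threshold):
#   cutadapt -q 10 -m 20 -a ADAPTER1 -A ADAPTER2 \
#       -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz in_R1.fastq.gz in_R2.fastq.gz
```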

**alexdobin** · 11-30-2015, 03:37 PM

Originally posted by gsgs

What's the probability that I will be able to get a Windows executable that runs on the command line (cmd.exe) on my computer and also runs from batch files?

Hi @gsgs,

the probability is >0, as I know people who are working on a Windows executable for STAR.

Cheers
Alex

**gsgs** · 11-30-2015, 10:51 PM

thanks.

In the meantime I found the manual and looked at it.
Whenever it mentioned "Windows" it meant the boxes
that may open ... so this is just not considered.

I'm not familiar with Linux/Unix, but I remember that often
I could compile similar programs with my
old GCC / DJGPP compiler on Windows.
I haven't done this in a long time, and there are many
programs here with .h and .cpp extensions to be included
(why is it so complicated?),
so a long list of potential problems.

I don't understand why this conversion is so difficult,
why they have no solution for this already.

I mean, it should be a small step as compared to
creating the programs and getting it to work in the first place ?!?

Currently I'm using MAFFT, the author had helped me to
get a Windows-executable and how to run it from batch.
They used fast fourier transform but it became clear to me,
that this is slow for large problems
and that there should be a faster solution by finding matching
subsequences.

**gsgs** · 11-30-2015, 11:07 PM

so, what can I do ?

---------------------------------------
Buy another computer with Linux on it, install "STAR" on it.
Whenever I have a big Windows/DOS fasta file to be aligned,
(delete the carriage returns since Linux doesn't like them ?)
copy it from the Windows/DOS HD to a micro-SD, insert it into the Linux computer,
run STAR on it, insert it into the Windows computer,
copy it back to Windows HD, insert the carriage returns
-----------------------------------------

do you sell such a Linux computer, with STAR suitably installed on micro-SD ?
Easy to use, boots from SD, aligns the fasta file fasta01.fa on it, writes
the result to fasta02.fa and shuts down.

No display, no keyboard needed, a raspberry computer ?!

**alexdobin** · 12-01-2015, 02:32 PM

Hi @gsgs

porting software designed for Linux to Windows is not an easy task.
As I mentioned, there is a serious effort to do that - I will ask if there is an ETA.
In the meantime, I could suggest the following workarounds (in order of increasing difficulty):

1. Use Amazon or Google computing clouds. It will cost you a few dollars per run.
2. Run a virtual Linux machine on your Windows server.
3. Make your server dual-boot Windows/Linux, with a shared FAT partition to transfer data.
4. Try to compile and run STAR under the Cygwin Linux-like environment. This should be easier than full porting; however, I am not sure if this will work.

Cheers
Alex
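
On the carriage-return question raised earlier: when files move between Windows and Linux (for example over a shared FAT partition), line endings can be converted with standard tools. A minimal sketch on a toy FASTA file; the file names are illustrative.

```shell
# Toy FASTA with Windows (CRLF) line endings, for illustration only.
printf '>seq1\r\nACGT\r\n' > fasta01_dos.fa

# Strip carriage returns for use on Linux.
tr -d '\r' < fasta01_dos.fa > fasta01.fa

# Count carriage returns left in the converted file.
remaining_cr=$(tr -cd '\r' < fasta01.fa | wc -c | tr -d ' ')
echo "carriage returns remaining: ${remaining_cr}"

# Going back to Windows line endings, if a Windows tool needs them.
awk '{ printf "%s\r\n", $0 }' fasta01.fa > fasta01_back_to_dos.fa
```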

**salamay** · 12-07-2015, 12:40 PM

Hi Alex,

Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:

N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0

My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:

##gff-version 3
##annot-version v1.0
scaffold_1 phytozomev10 gene 10215584 10239664 . + . ID=Thhalv10024176m.g.v1.0;Name=Thhalv10024176m.g
scaffold_1 phytozomev10 mRNA 10215584 10239664 . + . ID=Thhalv10024176m.v1.0;Name=Thhalv10024176m;pacid=20194900;longest=1;Parent=Thhalv10024176m.g.v1.0
scaffold_1 phytozomev10 exon 10215584 10215918 . + . ID=Thhalv10024176m.v1.0.exon.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 five_prime_UTR 10215584 10215821 . + . ID=Thhalv10024176m.v1.0.five_prime_UTR.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10215822 10215918 . + 0 ID=Thhalv10024176m.v1.0.CDS.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216476 10216579 . + . ID=Thhalv10024176m.v1.0.exon.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216476 10216579 . + 2 ID=Thhalv10024176m.v1.0.CDS.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216865 10216999 . + . ID=Thhalv10024176m.v1.0.exon.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216865 10216999 . + 0 ID=Thhalv10024176m.v1.0.CDS.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10217082 10217132 . + . ID=Thhalv10024176m.v1.0.exon.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10217082 10217132 . + 0 ID=Thhalv10024176m.v1.0.CDS.4;Parent=Thhalv10024176m.v1.0;pacid=20194900

I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.

Thanks!
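
For reference, the columns of ReadsPerGene.out.tab are, per the STAR manual: gene ID, unstranded counts, counts for read 1 on the RNA strand, and counts for read 2 on the RNA strand. A quick way to compare reads assigned to genes against N_noFeature in each column is sketched below on a toy file with made-up gene IDs and counts.

```shell
# Toy ReadsPerGene.out.tab for illustration (hypothetical gene IDs and counts).
printf 'N_unmapped\t10\t10\t10\n'      > toy_ReadsPerGene.out.tab
printf 'N_multimapping\t5\t5\t5\n'    >> toy_ReadsPerGene.out.tab
printf 'N_noFeature\t20\t60\t62\n'    >> toy_ReadsPerGene.out.tab
printf 'N_ambiguous\t0\t0\t0\n'       >> toy_ReadsPerGene.out.tab
printf 'geneA\t30\t2\t28\n'           >> toy_ReadsPerGene.out.tab
printf 'geneB\t50\t38\t10\n'          >> toy_ReadsPerGene.out.tab

# Sum the per-gene rows for each counting mode and print them next to
# the N_noFeature row for the same columns.
summary=$(awk -F'\t' '
    $1 !~ /^N_/         { u += $2; s1 += $3; s2 += $4 }
    $1 == "N_noFeature" { nu = $2; n1 = $3; n2 = $4 }
    END {
        printf "assigned\t%d\t%d\t%d\n", u, s1, s2
        printf "noFeature\t%d\t%d\t%d\n", nu, n1, n2
    }' toy_ReadsPerGene.out.tab)
printf '%s\n' "$summary"
```

If the assigned totals are close to zero in every column, the annotation was probably not parsed into genes as intended.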

**alexdobin** · 12-07-2015, 03:20 PM

Originally posted by salamay

Hi Alex,

Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:

N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0

My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:

I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.

Thanks!

Hi Yasser,

at the moment the best option is to convert the GFF3 file into a GTF file.
For instance, you can use the gffread tool from the Cufflinks package:
$ gffread -T annot.gff3 -o annot.gtf
It creates the gtf file with proper transcript_id and gene_id tags, which you can supply as --sjdbGTFfile without any Parent options.

Cheers
Alex
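
After the conversion, it may be worth checking that every exon line in the GTF carries both `transcript_id` and `gene_id` before regenerating the index. A sketch on a toy GTF standing in for the gffread output; the IDs are made up, and the real file would be `annot.gtf`.

```shell
# Toy GTF standing in for gffread output (hypothetical IDs).
printf 'scaffold_1\tsrc\texon\t100\t300\t.\t+\t.\ttranscript_id "t1"; gene_id "g1";\n'  > toy.gtf
printf 'scaffold_1\tsrc\texon\t500\t900\t.\t+\t.\ttranscript_id "t1"; gene_id "g1";\n' >> toy.gtf

# Count exon lines missing either tag; 0 means every exon carries both.
missing=$(awk -F'\t' '
    $3 == "exon" && ($9 !~ /transcript_id "/ || $9 !~ /gene_id "/) { n++ }
    END { print n + 0 }' toy.gtf)
echo "exon lines missing transcript_id or gene_id: ${missing}"
```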
