I also noticed that you are running STAR with 16 threads. There must be a slurm command to ensure that you are actually getting that many cores and that they are all on the same physical node.
-
I've most recently been submitting jobs with this command:
[scripts]$ sbatch -p Long -N 1 -n 16 --no-requeue STAR_generate_genome_indices.sh
which I believe should allocate 16 cores on a single node to the job. I've also tried this new command:
[scripts]$ sbatch -p Long -N 1 -n 16 --mem 64000 --no-requeue STAR_generate_genome_indices.sh
to see whether manually specifying the amount of memory makes a difference.
Last edited by stu; 08-12-2015, 11:49 AM.
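One way to check what the scheduler actually granted is to read Slurm's environment variables from inside the job before choosing a value for --runThreadN. The following is only a minimal sketch under stated assumptions: the helper name `allocated_threads` is hypothetical, and it assumes the job environment provides the standard `SLURM_JOB_NUM_NODES` and `SLURM_CPUS_ON_NODE` variables.

```python
import os


def allocated_threads(env=None, requested=16):
    """Return a thread count that does not exceed what Slurm reports.

    Falls back to `requested` when the Slurm variables are absent,
    for example when the script is run outside a job.
    """
    env = os.environ if env is None else env

    # Number of nodes in the allocation; a multithreaded program such as
    # STAR needs all of its cores on one node.
    nodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    if nodes != 1:
        raise RuntimeError(f"expected a single node, got {nodes}")

    # CPUs available to the job on this node.
    cpus = env.get("SLURM_CPUS_ON_NODE")
    if cpus is None:
        return requested
    return min(requested, int(cpus))
```

The returned number could then be passed to --runThreadN by whatever wrapper launches STAR. For a single multithreaded process, requesting `-N 1 -n 1 -c 16` (one task with 16 CPUs) is a commonly used alternative to `-n 16`; whether the two behave the same depends on how the cluster is configured.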
-
Originally posted by pandamon:
Hello,
Say I would like to map my data to both the human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed some light on how this can be done effectively?
Thanks a lot in advance.
Cheers
The best way is to generate the genome index for a combined reference of mouse and human. This will require ~60GB of RAM. You would need to make the chromosome names distinct between the mouse and human genome FASTA files (say, add m to the mouse chromosome names). You also need to apply the same renaming to the chromosome names in the annotation GTFs. The GTFs (say, GENCODE) typically have distinct transcript names for different species - if not, you would have to rename them as well. The GTF files from the two species have to be concatenated.
Cheers
Alex
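The renaming step described above can be sketched as follows. This is a hedged illustration rather than an official STAR utility: the function names are hypothetical, and it assumes uncompressed FASTA and GTF files in which the chromosome name is the first token of the FASTA header and the first column of each GTF record.

```python
def prefix_fasta_line(line, prefix):
    """Add `prefix` to the sequence name on FASTA header lines."""
    if line.startswith(">"):
        return ">" + prefix + line[1:]
    return line  # sequence lines are left untouched


def prefix_gtf_line(line, prefix):
    """Add `prefix` to the first (chromosome) column of a GTF record."""
    if line.startswith("#") or not line.strip():
        return line  # comment and blank lines are left untouched
    return prefix + line


def rename_file(src, dst, fix_line, prefix):
    """Copy `src` to `dst`, passing every line through `fix_line`."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(fix_line(line, prefix))
```

For example, `rename_file("mouse.fa", "mouse.m.fa", prefix_fasta_line, "m")` and `rename_file("mouse.gtf", "mouse.m.gtf", prefix_gtf_line, "m")` (file names are placeholders) would produce mouse files whose chromosome names all start with m. The renamed mouse files can then be concatenated with the unmodified human FASTA and GTF before genome generation.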
-
--quantMode GeneCounts short read error
I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.
STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500
Nov 28 17:54:22 ...... FATAL ERROR, exiting
Is it because of the trimming with Cutadapt? Is STAR failing to process 'zero length' reads? Thanks for your help. Log.out file attached
cutadapt 1.8.3:
cutadapt -q 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o inpath/sample_R1.fastq.gz -p inpath/sample_R2.fastq.gz pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz
STAR 2.4.2a:
STAR --genomeLoad NoSharedMemory --genomeSAsparseD 2 --outSAMstrandField intronMotif --genomeDir pathto/STARgenome --sjdbGTFfile pathto/STARgenome/gencode.v23.annotation.gtf --runThreadN 2 --quantMode GeneCounts --readFilesIn pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix sample --outSAMtype BAM Unsorted --outStd BAM_Unsorted
-
Originally posted by rvann:
I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.
STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500
Nov 28 17:54:22 ...... FATAL ERROR, exiting
The error is caused by a zero-length read sequence; STAR cannot process those.
Hopefully, cutadapt has an option to remove them - this has to be done simultaneously for the read1 and read2 files to preserve the order of the reads.
Cheers
Alex
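If the reads have already been trimmed, the paired removal described above can be sketched like this. It is a minimal illustration, not part of STAR or cutadapt: the function names are hypothetical, and it assumes four-line FASTQ records with both files listing the mates in the same order.

```python
import gzip
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as lists of four lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def filter_pairs(r1, r2, out1, out2, min_len=1):
    """Write only pairs in which both mates have at least `min_len` bases.

    A pair is dropped from both outputs together, so the two files stay
    in step. Returns (kept, dropped) pair counts. Note that zip() stops
    at the shorter input, so unequal files are not detected here.
    """
    kept = dropped = 0
    for rec1, rec2 in zip(read_records(r1), read_records(r2)):
        if len(rec1[1].strip()) >= min_len and len(rec2[1].strip()) >= min_len:
            out1.writelines(rec1)
            out2.writelines(rec2)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


def filter_pair_files(in1, in2, out1, out2, min_len=1):
    """Apply filter_pairs to gzip-compressed FASTQ files."""
    with gzip.open(in1, "rt") as r1, gzip.open(in2, "rt") as r2, \
         gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        return filter_pairs(r1, r2, o1, o2, min_len)
```

cutadapt also documents a --minimum-length (-m) option for discarding short reads at trimming time; how it treats the two mates of a pair in version 1.8.3 should be checked against that version's documentation.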
-
Originally posted by gsgs:
What's the probability that I will be able to get a Windows executable that runs on the command line via cmd.exe on my computer and also runs from batch files?
The probability is >0, as I know people who are working on a Windows executable for STAR.
Cheers
Alex
-
thanks.
In the meantime I found the manual and looked at it.
Whenever it mentioned "Windows" it meant the boxes
that may open ... so this is just not considered.
I'm not familiar with Linux/Unix, but I remember that often
I could compile similar programs with my
old GCC / DJGPP compiler on Windows.
I haven't done this in a long time, and there are many
files here with .h and .cpp extensions to be included
(why is it so complicated?),
so there is a long list of potential problems.
I don't understand why this conversion is so difficult,
why they have no solution for this already.
I mean, it should be a small step compared to
creating the programs and getting them to work in the first place ?!?
Currently I'm using MAFFT, the author had helped me to
get a Windows-executable and how to run it from batch.
They used fast fourier transform but it became clear to me,
that this is slow for large problems
and that there should be a faster solution by finding matching
subsequences.
-
so, what can I do ?
---------------------------------------
Buy another computer with Linux on it, install "STAR" on it.
Whenever I have a big Windows/DOS fasta file to be aligned,
(delete the carriage returns since Linux doesn't like them ?)
copy it from the Windows/DOS HD to a micro-SD, insert it into the Linux computer,
run STAR on it, insert it into the Windows computer,
copy it back to Windows HD, insert the carriage returns
-----------------------------------------
do you sell such a Linux computer, with STAR suitably installed on micro-SD ?
Easy to use, boots from SD, aligns the fasta file fasta01.fa on it, writes
the result to fasta02.fa and shuts down.
No display, no keyboard needed, a raspberry computer ?!
-
Hi @gsgs
Porting software designed for Linux to Windows is not an easy task.
As I mentioned, there is a serious effort to do that - I will ask if there is an ETA.
In the meantime, I can suggest the following workarounds (in order of increasing difficulty):
1. Use Amazon or Google cloud computing. It will cost you a few dollars per run.
2. Run a virtual Linux machine on your Windows server.
3. Make your server dual-boot Windows/Linux, with a shared FAT partition to transfer data.
4. Try to compile and run STAR under the Cygwin Linux-like environment. This should be easier than full porting; however, I am not sure if this will work.
Cheers
Alex
-
Hi Alex,
Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:
N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0
My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:
##gff-version 3
##annot-version v1.0
scaffold_1 phytozomev10 gene 10215584 10239664 . + . ID=Thhalv10024176m.g.v1.0;Name=Thhalv10024176m.g
scaffold_1 phytozomev10 mRNA 10215584 10239664 . + . ID=Thhalv10024176m.v1.0;Name=Thhalv10024176m;pacid=20194900;longest=1;Parent=Thhalv10024176m.g.v1.0
scaffold_1 phytozomev10 exon 10215584 10215918 . + . ID=Thhalv10024176m.v1.0.exon.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 five_prime_UTR 10215584 10215821 . + . ID=Thhalv10024176m.v1.0.five_prime_UTR.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10215822 10215918 . + 0 ID=Thhalv10024176m.v1.0.CDS.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216476 10216579 . + . ID=Thhalv10024176m.v1.0.exon.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216476 10216579 . + 2 ID=Thhalv10024176m.v1.0.CDS.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216865 10216999 . + . ID=Thhalv10024176m.v1.0.exon.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216865 10216999 . + 0 ID=Thhalv10024176m.v1.0.CDS.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10217082 10217132 . + . ID=Thhalv10024176m.v1.0.exon.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10217082 10217132 . + 0 ID=Thhalv10024176m.v1.0.CDS.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.
Thanks!
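To put the N_noFeature numbers in context, the summary rows can be compared with the per-gene totals in each count column. The sketch below is a hedged illustration with hypothetical function names; it assumes the ReadsPerGene.out.tab layout described in the STAR manual (an identifier followed by three count columns: unstranded, read 1 strand, read 2 strand) and that no gene identifier begins with N_.

```python
def summarize(lines):
    """Split ReadsPerGene.out.tab lines into N_* rows and per-column gene totals."""
    summary = {}
    gene_totals = [0, 0, 0]
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name = fields[0]
        counts = [int(x) for x in fields[1:]]
        if name.startswith("N_"):
            summary[name] = counts
        else:
            gene_totals = [t + c for t, c in zip(gene_totals, counts)]
    return summary, gene_totals


def no_feature_fraction(summary, gene_totals, column):
    """Share of noFeature reads among noFeature + ambiguous + counted reads.

    `column` is 0, 1 or 2 for the three count columns.
    """
    no_feature = summary["N_noFeature"][column]
    total = no_feature + summary["N_ambiguous"][column] + gene_totals[column]
    return no_feature / total if total else 0.0
```

Calling `summarize(open("ReadsPerGene.out.tab"))` and then `no_feature_fraction(...)` for each column shows how the fraction differs between the unstranded and stranded columns.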
-
Originally posted by salamay:
Hi Alex,
Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:
N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0
My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:
I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.
Thanks!
Hi Yasser,
at the moment, the best option is to convert the GFF3 file into a GTF file.
For instance, you can use the gffread tool from the Cufflinks package:
$ gffread -T annot.gff3 -o annot.gtf
It creates a GTF file with proper transcript_id and gene_id tags, which you can supply as --sjdbGTFfile without any Parent options.
Cheers
Alex
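After converting, it can be worth confirming that the exon lines really carry the transcript_id and gene_id tags before regenerating the genome. The following is a minimal sketch with hypothetical function names; it assumes a tab-separated, nine-column GTF whose attributes are written as `tag "value";` pairs.

```python
import re

# Matches GTF-style attributes such as: transcript_id "X";
TAG = re.compile(r'(\S+) "([^"]*)"')


def exon_attributes(line):
    """Return the attribute dict of an exon line, or None for other lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "exon":
        return None
    return dict(TAG.findall(fields[8]))


def count_missing_tags(lines, required=("transcript_id", "gene_id")):
    """Return (exon lines seen, exon lines lacking a required tag)."""
    exons = missing = 0
    for line in lines:
        attrs = exon_attributes(line)
        if attrs is None:
            continue
        exons += 1
        if any(tag not in attrs for tag in required):
            missing += 1
    return exons, missing
```

Running `count_missing_tags(open("annot.gtf"))` should report zero missing exon lines for a usable file; GFF3-style attributes such as `ID=...;Parent=...` are counted as missing because they do not use the quoted GTF form.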