I also noticed that you are running STAR with 16 threads. There must be a slurm command to ensure that you are actually getting that many cores and that they are all on the same physical node.
-
I've most recently been submitting jobs with this command:
[scripts]$ sbatch -p Long -N 1 -n 16 --no-requeue STAR_generate_genome_indices.sh
Which I believe should allocate a single node of 16 cores to the job. I've tried this new command:
[scripts]$ sbatch -p Long -N 1 -n 16 --mem 64000 --no-requeue STAR_generate_genome_indices.sh
to see if manually specifying the amount of memory to use makes a difference.
Last edited by stu; 08-12-2015, 11:49 AM.
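A note on the flags above: in Slurm, `-n` requests tasks, while `-c`/`--cpus-per-task` requests cores for a single task, so for a multithreaded program such as STAR a request along the lines of `-N 1 -n 1 -c 16` may be closer to the intent. How either form maps onto physical cores depends on the cluster's configuration. One way to see what the job actually received is to print the allocation from inside the job. The sketch below is a small Python helper, not part of STAR or Slurm; the function names are made up for illustration, and it assumes the usual Slurm environment variables are exported to the job.

```python
import os

# Standard Slurm environment variables; SLURM_CPUS_PER_TASK is only set
# when the job was submitted with -c/--cpus-per-task.
SLURM_VARS = {
    "nodes": "SLURM_JOB_NUM_NODES",
    "tasks": "SLURM_NTASKS",
    "cpus_per_task": "SLURM_CPUS_PER_TASK",
    "cpus_on_node": "SLURM_CPUS_ON_NODE",
}


def describe_allocation(env=None):
    """Return what Slurm reports for this job as a dict of ints (None if unset)."""
    env = os.environ if env is None else env
    summary = {}
    for key, name in SLURM_VARS.items():
        value = env.get(name, "")
        summary[key] = int(value) if value.isdigit() else None
    return summary


def usable_cpus():
    """Count the cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(describe_allocation())
    print("usable cores:", usable_cpus())
```

Running this at the top of the batch script, before STAR starts, shows whether the number of usable cores matches the value passed to `--runThreadN`. If it is lower, the STAR threads will end up sharing cores.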
Comment
-
Originally posted by pandamon
Hello,
Say I would like to map my data to both the human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed some light on how this can be done effectively?
Thanks a lot in advance.
Cheers
The best way is to generate the genome index for a combined mouse and human reference. This will require ~60GB of RAM. You would need to make the chromosome names distinct between the mouse and human genome FASTA files (say, add m to the mouse chromosome names). You also need to rename the chromosomes in the same way in the annotation GTFs. The GTFs (say, GENCODE) typically have distinct transcript names for different species; if not, you would have to rename them as well. The GTF files from the two species then have to be concatenated.
Cheers
Alex
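The renaming and concatenation steps described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the prefix `m` is the example from the post, and the sketch renames chromosomes only (it does not touch transcript or gene IDs, which the post says are usually already distinct).

```python
import io


def prefix_fasta_line(line, prefix):
    """Add the prefix to the sequence name on FASTA header lines only."""
    if line.startswith(">"):
        return ">" + prefix + line[1:]
    return line


def prefix_gtf_line(line, prefix):
    """Add the prefix to column 1 (chromosome) of a GTF record.

    Comment lines (starting with '#') and blank lines are left unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    return prefix + line


def combine_references(human, mouse, out, transform, mouse_prefix="m"):
    """Write the human lines unchanged, then the mouse lines with renamed chromosomes."""
    for line in human:
        # Guard against a missing final newline gluing two files together.
        out.write(line if line.endswith("\n") else line + "\n")
    for line in mouse:
        line = transform(line, mouse_prefix)
        out.write(line if line.endswith("\n") else line + "\n")


if __name__ == "__main__":
    # Tiny in-memory demonstration; real inputs would be opened FASTA/GTF files.
    human = io.StringIO(">chr1\nACGT\n")
    mouse = io.StringIO(">chr1\nTTGA\n")
    combined = io.StringIO()
    combine_references(human, mouse, combined, prefix_fasta_line)
    print(combined.getvalue(), end="")
```

The same `combine_references` call is used twice: once with `prefix_fasta_line` on the two FASTA files and once with `prefix_gtf_line` on the two GTF files, so that the chromosome names in the combined annotation match those in the combined genome.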
Comment
-
--quantMode GeneCounts short read error
I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.
STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500
Nov 28 17:54:22 ...... FATAL ERROR, exiting
Is it because of the trimming with Cutadapt? Is STAR failing to process 'zero length' reads? Thanks for your help. Log.out file attached
cutadapt 1.8.3:
cutadapt -q 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o inpath/sample_R1.fastq.gz -p inpath/sample_R2.fastq.gz pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz
STAR 2.4.2a:
STAR --genomeLoad NoSharedMemory --genomeSAsparseD 2 --outSAMstrandField intronMotif --genomeDir pathto/STARgenome --sjdbGTFfile pathto/STARgenome/gencode.v23.annotation.gtf --runThreadN 2 --quantMode GeneCounts --readFilesIn pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix sample --outSAMtype BAM Unsorted --outStd BAM_Unsorted
Comment
-
Originally posted by rvann
I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and reads with Phred score < 10 using Cutadapt with the arguments below.
Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.
STAR mapping/counting fails with the following error message:
EXITING because of FATAL ERROR in reads input: short read sequence line: 1
Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=500
Nov 28 17:54:22 ...... FATAL ERROR, exiting
The error is caused by the zero-length read sequence; STAR cannot process those.
Hopefully, cutadapt has an option to remove them. This has to be done simultaneously for the read1 and read2 files to preserve the order of the reads.
Cheers
Alex
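Cutadapt does have a `-m`/`--minimum-length` option that discards reads shorter than a given length; how it is applied to read pairs depends on the cutadapt version, so its documentation is worth checking. If the files have already been trimmed, the paired filtering described above can also be done afterwards. The following Python sketch is an illustration only (the function names are hypothetical): it assumes plain four-line FASTQ records and that both files list the mates in the same order, and it drops a pair whenever either mate is shorter than `min_len`.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (name, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def filter_pairs(r1_in, r2_in, r1_out, r2_out, min_len=1):
    """Copy only pairs in which both mates have at least min_len bases.

    Mates are dropped together, so the two outputs stay in the same order.
    Returns (pairs kept, pairs dropped).
    """
    kept = dropped = 0
    for rec1, rec2 in zip(fastq_records(r1_in), fastq_records(r2_in)):
        if len(rec1[1].strip()) >= min_len and len(rec2[1].strip()) >= min_len:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


if __name__ == "__main__":
    # In-memory demonstration: the second pair has an empty read 1.
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\n\n+\n\n")
    r2 = io.StringIO("@a/2\nTTTT\n+\nIIII\n@b/2\nGG\n+\nII\n")
    out1, out2 = io.StringIO(), io.StringIO()
    print(filter_pairs(r1, r2, out1, out2))
```

For gzipped input and output, the handles can be opened with `gzip.open(path, "rt")` and `gzip.open(path, "wt")` from the standard library.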
Comment
-
Originally posted by gsgs
What's the probability that I will be able to get a Windows executable that runs from the command line (cmd.exe) on my computer and also runs from batch files?
The probability is >0, as I know people who are working on a Windows executable for STAR.
Cheers
Alex
Comment
-
Thanks.
In the meantime I found the manual and looked at it. Whenever it mentioned "Windows" it meant the boxes that may open, so this is just not considered.
I'm not familiar with Linux/Unix, but I remember that I could often compile similar programs with my old GCC / DJGPP compiler on Windows. I haven't done this in a long time, and there are many files here with .h and .cpp extensions to be included (why is it so complicated?), so there is a long list of potential problems.
I don't understand why this conversion is so difficult, or why there is no solution for it already. It should be a small step compared to creating the programs and getting them to work in the first place.
Currently I'm using MAFFT; the author helped me to get a Windows executable and showed me how to run it from batch files. It uses the fast Fourier transform, but it became clear to me that this is slow for large problems and that there should be a faster solution based on finding matching subsequences.
Comment
-
So, what can I do?
---------------------------------------
Buy another computer with Linux on it and install STAR on it.
Whenever I have a big Windows/DOS FASTA file to be aligned (delete the carriage returns, since Linux doesn't like them?), copy it from the Windows/DOS HD to a micro-SD, insert the micro-SD into the Linux computer, run STAR on it, insert the micro-SD back into the Windows computer, copy the result to the Windows HD, and re-insert the carriage returns.
-----------------------------------------
Do you sell such a Linux computer, with STAR suitably installed on micro-SD? Easy to use: it boots from SD, aligns the FASTA file fasta01.fa on it, writes the result to fasta02.fa and shuts down. No display and no keyboard needed. A Raspberry computer, perhaps?
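On the carriage-return question: this thread does not say whether STAR tolerates Windows line endings, but converting them is cheap and avoids the question. A minimal Python sketch follows; the function names are made up for illustration, and the second helper reads the whole file into memory, which is an assumption that only suits files that fit in RAM.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CR LF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path with Unix line endings."""
    # Binary mode, so Python does not translate line endings on its own.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(crlf_to_lf(src.read()))
```

The reverse conversion is the same call with the two byte strings swapped, applied to a file that contains only LF endings.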
Comment
-
Hi @gsgs
Porting software designed for Linux to Windows is not an easy task.
As I mentioned, there is a serious effort to do that; I will ask if there is an ETA.
In the meantime, I could suggest the following workarounds (in order of increasing difficulty):
1. Use the Amazon or Google computing clouds. It will cost you a few dollars per run.
2. Run a virtual Linux machine on your Windows server.
3. Make your server dual-boot Windows/Linux, with a shared FAT partition to transfer data.
4. Try to compile and run STAR under the Cygwin Linux-like environment. This should be easier than full porting; however, I am not sure if this will work.
Cheers
Alex
Comment
-
Hi Alex,
Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:
N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0
My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:
##gff-version 3
##annot-version v1.0
scaffold_1 phytozomev10 gene 10215584 10239664 . + . ID=Thhalv10024176m.g.v1.0;Name=Thhalv10024176m.g
scaffold_1 phytozomev10 mRNA 10215584 10239664 . + . ID=Thhalv10024176m.v1.0;Name=Thhalv10024176m;pacid=20194900;longest=1;Parent=Thhalv10024176m.g.v1.0
scaffold_1 phytozomev10 exon 10215584 10215918 . + . ID=Thhalv10024176m.v1.0.exon.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 five_prime_UTR 10215584 10215821 . + . ID=Thhalv10024176m.v1.0.five_prime_UTR.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10215822 10215918 . + 0 ID=Thhalv10024176m.v1.0.CDS.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216476 10216579 . + . ID=Thhalv10024176m.v1.0.exon.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216476 10216579 . + 2 ID=Thhalv10024176m.v1.0.CDS.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10216865 10216999 . + . ID=Thhalv10024176m.v1.0.exon.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10216865 10216999 . + 0 ID=Thhalv10024176m.v1.0.CDS.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 exon 10217082 10217132 . + . ID=Thhalv10024176m.v1.0.exon.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
scaffold_1 phytozomev10 CDS 10217082 10217132 . + 0 ID=Thhalv10024176m.v1.0.CDS.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.
Thanks!
Comment
-
Originally posted by salamay
Hi Alex,
Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:
N_unmapped 3350825 3350825 3350825
N_multimapping 2233686 2233686 2233686
N_noFeature 4913585 40288551 40271442
N_ambiguous 0 0 0
My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:
I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.
Thanks!
Hi Yasser,
At the moment the best option is to convert the GFF3 file into a GTF file.
For instance, you can use the gffread tool from the Cufflinks package:
$ gffread -T annot.gff3 -o annot.gtf
It creates a GTF file with proper transcript_id and gene_id tags, which you can supply as --sjdbGTFfile without any Parent options.
Cheers
Alex
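To compare the counts before and after switching to the converted GTF, it can help to tally ReadsPerGene.out.tab rather than eyeball it. In that file the first column is the gene ID and the next three columns are counts for unstranded data, for reads on the same strand as the RNA, and for reads on the reverse strand. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that no gene ID starts with `N_`.

```python
def summarize_counts(lines):
    """Split ReadsPerGene.out.tab into summary rows and per-column gene totals.

    Returns (stats, assigned): stats maps each N_* row name to its three
    counts, and assigned holds the summed gene counts for the three columns.
    """
    stats = {}
    assigned = [0, 0, 0]
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, counts = fields[0], [int(value) for value in fields[1:]]
        if name.startswith("N_"):
            stats[name] = counts
        else:
            assigned = [total + count for total, count in zip(assigned, counts)]
    return stats, assigned


if __name__ == "__main__":
    # Summary rows from the post above; the gene row is made up for the demo.
    example = [
        "N_unmapped\t3350825\t3350825\t3350825",
        "N_multimapping\t2233686\t2233686\t2233686",
        "N_noFeature\t4913585\t40288551\t40271442",
        "N_ambiguous\t0\t0\t0",
        "hypothetical_gene\t10\t2\t8",
    ]
    stats, assigned = summarize_counts(example)
    print(stats["N_noFeature"], assigned)
```

Passing an open file handle works as well, since the function only iterates over lines. Comparing `assigned` with `stats["N_noFeature"]` for each column gives a quick view of how many reads were counted against genes.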
Comment