-
Originally posted by alexdobin:
We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR needs ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you need to reduce --genomeChrBinNbits, as I explained in the previous post.
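As a rough illustration of the RAM rule of thumb quoted above, the sketch below counts the sequence characters in one or more FASTA files and multiplies by 10. The function name and the file names in the usage line are hypothetical, and the result is only an estimate, not a figure reported by STAR.

```shell
# Rough RAM estimate (in bytes) for STAR genome generation, using the
# ~10 bytes per genome base rule of thumb. Hypothetical helper, not part of STAR.
estimate_star_ram_bytes() {
    # Sum the lengths of all non-header lines across the given FASTA files,
    # ignoring a trailing carriage return if present, then multiply by 10.
    awk '!/^>/ { sub(/\r$/, ""); n += length($0) }
         END   { printf "%.0f\n", n * 10 }' "$@"
}

# Example (hypothetical file names):
# estimate_star_ram_bytes human.fa mouse.fa
```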
Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
-
Originally posted by apredeus:
Alex, thank you for the great tool - STAR is indeed very impressive!
Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
To generate a combined mouse/human genome with STAR, you need to slightly modify your FASTA and GTF files:
1. Modify the chromosome names so that mouse and human chromosomes have distinct names, e.g. chr1h/chr1m etc. In the FASTA files you need to make these modifications in all sequence name lines (i.e. those starting with ">"). In the GTF files you need to modify all chromosome names in field 1.
2. Make sure that the transcript_id values in the GTF files are distinct for mouse and human. This is usually the case; for instance, Gencode has "ENSMUSTxxxxx" for mouse and "ENSTxxxxx" for human.
3. Concatenate the GTF files for mouse and human into a single GTF file.
4. Run genome generation with:
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir ./ --genomeFastaFiles /path/to/human.fa /path/to/mouse.fa --sjdbGTFfile /path/to/mouse_human.gtf --sjdbOverhang 100
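Steps 1 and 3 above could be sketched roughly as follows. This is a minimal sketch, assuming plain-text (uncompressed) FASTA and tab-separated GTF input; the function names, the "h"/"m" suffixes and all file names are hypothetical choices for illustration.

```shell
# Append a species suffix to every FASTA sequence name
# (the first word on each ">" line); other lines pass through unchanged.
suffix_fasta_names() {   # usage: suffix_fasta_names SUFFIX < in.fa > out.fa
    awk -v s="$1" '/^>/ { rest = substr($0, length($1) + 1); print $1 s rest; next }
                   { print }'
}

# Append the same suffix to the chromosome name in field 1 of a GTF file,
# leaving comment lines (starting with "#") and empty lines untouched.
suffix_gtf_chroms() {    # usage: suffix_gtf_chroms SUFFIX < in.gtf > out.gtf
    awk -v s="$1" 'BEGIN { FS = OFS = "\t" }
                   /^#/ || NF == 0 { print; next }
                   { $1 = $1 s; print }'
}

# Example (hypothetical file names):
# suffix_fasta_names h < human.fa  > human.renamed.fa
# suffix_fasta_names m < mouse.fa  > mouse.renamed.fa
# suffix_gtf_chroms  h < human.gtf > human.renamed.gtf
# suffix_gtf_chroms  m < mouse.gtf > mouse.renamed.gtf
# cat human.renamed.gtf mouse.renamed.gtf > mouse_human.gtf
```

The renamed FASTA files and the concatenated GTF would then be passed to the genome generation command in step 4.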
If you want to use mRNA GTF files instead of or in addition to standard annotations, I would recommend checking the splice junctions in this file for very short introns and filtering them out - please see this post.
Cheers
Alex
-
Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.
What would the "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collection for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hs19), so I'm not sure what else people use.
-
Also, I meant to ask - do you normally include random chromosomes and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but there are quite a few transcripts that map to these.
Thank you!
-
Originally posted by apredeus:
Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.
What would the "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collection for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hs19), so I'm not sure what else people use.
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
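The junction filtering described above might look something like the sketch below. It assumes the junction list written during genome generation (sjdbList.out.tab) is tab-separated with chromosome, intron start, intron end and strand in columns 1 to 4; the function name and the minimum intron length of 20 in the usage line are hypothetical, so pick a threshold that suits your data.

```shell
# Keep only junctions whose intron length (end - start + 1) is at least MIN_LEN.
# Assumed columns: chromosome, intron start, intron end, strand (tab-separated).
filter_short_introns() {   # usage: filter_short_introns MIN_LEN < in.tab > out.tab
    awk -v min="$1" 'BEGIN { FS = OFS = "\t" }
                     ($3 - $2 + 1) >= (min + 0)'
}

# Example (hypothetical paths and threshold):
# filter_short_introns 20 < /path/to/genomeDir/sjdbList.out.tab > sjdb.filtered.tab
# The filtered file would then be passed to a new genome generation run
# with --sjdbFileChrStartEnd sjdb.filtered.tab
```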
-
Originally posted by apredeus:
Also, I meant to ask - do you normally include random chromosomes and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but there are quite a few transcripts that map to these.
Thank you!
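I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.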
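One way to drop the *hap* scaffolds from a FASTA file while leaving every other sequence (including the unplaced GL scaffolds) in place is sketched below. It assumes that haplotype scaffolds can be recognised by "hap" appearing in the sequence name, as in the *hap* pattern above; the function and file names are hypothetical.

```shell
# Copy a FASTA stream, skipping every record whose name contains "hap".
drop_hap_scaffolds() {   # usage: drop_hap_scaffolds < genome.fa > genome.nohap.fa
    # On each header line decide whether to keep the record that follows;
    # sequence lines are printed only while "keep" is true.
    awk '/^>/ { keep = ($1 !~ /hap/) } keep'
}

# Example (hypothetical file names):
# drop_hap_scaffolds < human.fa > human.nohap.fa
```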
-
Originally posted by alexdobin:
I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
So, definitely including the extra scaffolds!
-
Originally posted by alexdobin:
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
-
failed to generate genome using STAR
Hi, I built the genome using the command:
STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles file.fa.gz --runThreadN 10
It failed with the message: "BUG: next index is smaller than previous, EXITING".
Also, does anyone have a more detailed manual for STAR? The manual I downloaded from the website shows:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
What are the other options in "…"? I also tried unzipping the fa.gz file to a fa file and then got this error message: "limitGenomeGenerateRAM=28is too small for your genome
SOLUTION: please specify limitGenomeGenerateRAM not less than114 GB and make that much RAM available".
For other aligners we can type -h or --help to find the details, but not for STAR...
Last edited by shangzhong0619; 06-19-2014, 02:38 PM.
-
Hi Shangzhong,
Please try one of the latest STAR patches.
Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.
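For picking a value, the STAR manual (as far as I recall) suggests scaling --genomeChrBinNbits as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). The sketch below computes that with awk; the function name is hypothetical and rounding the logarithm down is my own choice, so treat the output as a starting point rather than a definitive setting.

```shell
# Suggest a --genomeChrBinNbits value from the total genome length, the number
# of reference sequences (chromosomes/scaffolds/contigs) and the read length.
suggest_chr_bin_nbits() {   # usage: suggest_chr_bin_nbits GENOME_LENGTH N_REFERENCES READ_LENGTH
    awk -v g="$1" -v n="$2" -v r="$3" 'BEGIN {
        x = g / n                      # average reference length
        if (r + 0 > x) x = r + 0       # never go below the read length
        bits = 0
        for (v = 2; v <= x; v *= 2)    # integer floor(log2(x)) without float rounding
            bits++
        if (bits > 18) bits = 18       # 18 is the upper bound in the formula
        print bits
    }'
}

# Example: a 3 Gb assembly split into 100,000 scaffolds, 100-base reads
# suggest_chr_bin_nbits 3000000000 100000 100
```

For a heavily fragmented assembly like the example, the suggestion comes out at 14, in line with the "14 or smaller" advice above.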
You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:
sjdbGTFfile -
string: path to the GTF file with annotations
sjdbOverhang 0
int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
if =0, splice junction database is not used
Cheers
Alex
-
Thanks for your reply. Yes, my reference FASTA has many scaffolds. When I try to install the latest version, it shows the following error:
samtools/libbam.a(bgzf.o): In function `bgzf_compress':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:187: undefined reference to `deflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:188: undefined reference to `deflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:189: undefined reference to `deflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
samtools/libbam.a(bgzf.o): In function `bgzf_dopen':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:160: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `bgzf_open':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:142: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `inflate_block':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:224: undefined reference to `inflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:228: undefined reference to `inflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:229: undefined reference to `inflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:233: undefined reference to `inflateEnd'
samtools/libbam.a(bam_import.o): In function `ks_getuntil2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `__bam_get_lines':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:92: undefined reference to `gzclose'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_close':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:485: undefined reference to `gzclose'
samtools/libbam.a(bam_import.o): In function `sam_open':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `ks_getc':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:147: undefined reference to `gzclose'
collect2: ld returned 1 exit status
make: *** [STAR] Error 1
I have samtools-0.1.19 on my computer. What is this error about? Thank you.
-
Hi Shangzhong,
Please try this patch: http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz
If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.
Cheers
Alex
-
Originally posted by shangzhong0619:
Thanks. It works. I have another problem: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me and I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.
For genome generation, STAR needs an unzipped FASTA file. You do it once per genome, and you can delete the FASTA file after the genome is generated. The '--readFilesCommand zcat' option only applies to FASTQ/FASTA reads at the mapping stage.
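In practice this could be handled with a small wrapper like the sketch below, which writes an uncompressed copy next to a gzipped FASTA file and prints the path to hand to STAR. The function name is hypothetical; the STAR command in the usage lines is the one from the earlier post, with only the FASTA path swapped.

```shell
# Print the path of an uncompressed FASTA file, decompressing a .gz input
# into a file of the same name without the .gz extension if needed.
plain_fasta() {   # usage: fa=$(plain_fasta genome.fa.gz)
    case "$1" in
        *.gz) gunzip -c "$1" > "${1%.gz}" && printf '%s\n' "${1%.gz}" ;;
        *)    printf '%s\n' "$1" ;;
    esac
}

# Example:
# fa=$(plain_fasta file.fa.gz)
# STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles "$fa" --runThreadN 10
# rm "$fa"   # optional, once the genome has been generated
```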
Cheers
Alex