Originally posted by alexdobin:
We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR needs ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you need to reduce --genomeChrBinNbits, as I explained in the previous post.
Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
-
Originally posted by apredeus:
Alex, thank you for the great tool - STAR is indeed very impressive!
To generate a combined mouse/human genome with STAR, you need to slightly modify your FASTA and GTF files:
1. Modify the chromosome names so that mouse and human chromosomes have distinct names, e.g. chr1h/chr1m etc. In the FASTA files, you need to make these modifications in all sequence name lines (i.e. those starting with ">"). In the GTF files, you need to modify all chromosome names in field 1.
2. Make sure that the transcript_id values in the GTF files are distinct for mouse and human. This is usually the case; for instance, Gencode has "ENSMUSTxxxxx" for mouse and "ENSTxxxxx" for human.
3. Concatenate the GTF files for mouse and human into a single GTF file.
4. Run genome generation with:
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir ./ --genomeFastaFiles /path/to/human.fa /path/to/mouse.fa --sjdbGTFfile /path/to/mouse_human.gtf --sjdbOverhang 100
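A minimal sketch of steps 1-3, assuming uncompressed FASTA and tab-separated GTF input. The function names suffix_fasta and suffix_gtf, the suffixes h and m, and all file names are illustrative placeholders, not part of STAR.

```shell
#!/usr/bin/env bash
# Sketch: make chromosome names species-specific before merging annotations.
# The suffixes follow the chr1h/chr1m example; choose any suffix without "&" or "\".

# Append SUFFIX to the sequence name in every FASTA header line.
# Only the first whitespace-delimited token after ">" is changed.
suffix_fasta() {   # usage: suffix_fasta SUFFIX < in.fa > out.fa
    awk -v s="$1" '/^>/ { sub(/^>[^ \t]+/, "&" s) } { print }'
}

# Append SUFFIX to field 1 (the chromosome name) of every GTF record,
# leaving comment lines and empty lines untouched.
suffix_gtf() {     # usage: suffix_gtf SUFFIX < in.gtf > out.gtf
    awk -v s="$1" 'BEGIN { FS = OFS = "\t" }
        /^#/ || NF == 0 { print; next }
        { $1 = $1 s; print }'
}

# Example use (hypothetical file names):
#   suffix_fasta h < human.fa  > human.renamed.fa
#   suffix_fasta m < mouse.fa  > mouse.renamed.fa
#   suffix_gtf   h < human.gtf > human.renamed.gtf
#   suffix_gtf   m < mouse.gtf > mouse.renamed.gtf
#   cat human.renamed.gtf mouse.renamed.gtf > mouse_human.gtf
```

The renamed FASTA files and the merged GTF would then be the inputs to --genomeFastaFiles and --sjdbGTFfile in step 4.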
If you want to use mRNA GTF files instead of or in addition to standard annotations, I would recommend checking the splice junctions in this file for very short introns and filtering them out - please see this post.
Cheers
Alex
-
Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.
What would the "standard" annotation be in this case? RefSeq? It's just that we have always used the mRNA collection for both human and mouse (there are about 1.5 million for mm9 and 2.5 million or so for hg19), so I'm not sure what else people use.
-
Also, I meant to ask: do you normally include random chromosomes and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference; however, there are quite a few transcripts that are mapped to these.
Thank you!
-
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
You can simply add these annotation GTF files to the mRNA GTF files, making sure transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to generate the genome with the mRNA GTF file first, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
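A minimal sketch of the junction-filtering step, assuming the junction list in the genome directory is named sjdbList.out.tab and has the tab-separated columns chromosome, intron start, intron end, strand. The function name filter_short_introns and the 20-base cutoff are illustrative assumptions; the thread does not give a threshold.

```shell
#!/usr/bin/env bash
# Sketch: drop junctions with very short introns from a STAR junction list.

# Keep only junctions whose intron spans at least MIN_LEN bases.
# Intron length is computed as end - start + 1 (both coordinates inclusive).
filter_short_introns() {   # usage: filter_short_introns MIN_LEN < in.tab > out.tab
    awk -v min="$1" 'BEGIN { FS = OFS = "\t" } ($3 - $2 + 1) >= min'
}

# Example use (hypothetical paths and cutoff):
#   filter_short_introns 20 < genomeDir/sjdbList.out.tab > sjdb.filtered.tab
# then re-run genome generation with:
#   --sjdbFileChrStartEnd sjdb.filtered.tab
```

Other artifacts would need their own filters; this only covers intron length.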
-
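I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.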
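A minimal sketch of excluding the haplotype scaffolds from a FASTA file before genome generation. It assumes the alternate-haplotype sequences can be recognized by "_hap" in their header lines; check the headers of your own assembly first. The function name drop_fasta_records and the file names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: remove FASTA records whose header matches a pattern.

# Print only the records whose header line does NOT match PATTERN (an awk regex).
# The keep flag is set at each header and applies to the sequence lines below it.
drop_fasta_records() {   # usage: drop_fasta_records PATTERN < in.fa > out.fa
    awk -v pat="$1" '/^>/ { keep = ($0 !~ pat) } keep'
}

# Example use (hypothetical file names):
#   grep '^>' human.fa | less                      # inspect the names first
#   drop_fasta_records '_hap' < human.fa > human.nohap.fa
```

Scaffolds without "_hap" in the name, such as the *GL*/*gl* ones, pass through unchanged.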
-
Originally posted by alexdobin:
I recommend including the *GL* or *gl* marked "unplaced" scaffolds. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.
So, definitely including the extra scaffolds!
-
Originally posted by alexdobin:
There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
-
failed to generate genome using STAR
Hi, I built the genome using this command:
STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles file.fa.gz --runThreadN 10
It failed with the message: "BUG: next index is smaller than previous, EXITING".
Also, does anyone have a more detailed manual for STAR? The manual I downloaded from the website shows:
/pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
What are the other options in "…"? I also tried unzipping the fa.gz file to a fa file, and then got this error message: "limitGenomeGenerateRAM=28is too small for your genome
SOLUTION: please specify limitGenomeGenerateRAM not less than114 GB and make that much RAM available".
For other aligners we can type -h or --help to find the details, but not for STAR...
Last edited by shangzhong0619; 06-19-2014, 02:38 PM.
-
Hi Shangzhong,
Please try one of the latest STAR patches.
Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.
You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:
sjdbGTFfile -
    string: path to the GTF file with annotations
sjdbOverhang 0
    int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
    if =0, splice junction database is not used
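A rough sketch for sizing a genome with many scaffolds before running genomeGenerate. It applies the ~10 bytes of RAM per base rule of thumb quoted earlier in the thread, and the scaling that the STAR manual suggests for --genomeChrBinNbits, min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). Rounding the logarithm down is my assumption, and the function name genome_stats is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: estimate RAM and a --genomeChrBinNbits value from an uncompressed FASTA.

# Reads FASTA on stdin and prints four tab-separated "name<TAB>value" lines.
genome_stats() {   # usage: genome_stats READ_LENGTH < genome.fa
    awk -v readlen="$1" '
        /^>/ { nseq++; next }            # count sequences
        { total += length($0) }          # count bases on sequence lines
        END {
            if (nseq == 0) { print "no sequences found" > "/dev/stderr"; exit 1 }
            avg = total / nseq
            if (avg < readlen) avg = readlen
            # floor(log2(avg)) via repeated doubling, capped at 18
            bits = 0
            while (bits < 18 && 2 ^ (bits + 1) <= avg) bits++
            printf "total_bases\t%.0f\n", total
            printf "sequences\t%.0f\n", nseq
            printf "approx_ram_bytes\t%.0f\n", 10 * total
            printf "genomeChrBinNbits\t%d\n", bits
        }'
}

# Example use (hypothetical file name, 100-base reads):
#   genome_stats 100 < genome.fa
```

Treat both numbers as starting points only; the RAM that STAR actually requests is reported in its own error message, as in the limitGenomeGenerateRAM example above.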
Cheers
Alex
-
Thanks for your reply. Yes, my reference FASTA has many scaffolds. When I try to install the latest version, it shows the following error:
samtools/libbam.a(bgzf.o): In function `bgzf_compress':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:187: undefined reference to `deflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:188: undefined reference to `deflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:189: undefined reference to `deflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
samtools/libbam.a(bgzf.o): In function `bgzf_dopen':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:160: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `bgzf_open':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:142: undefined reference to `compressBound'
samtools/libbam.a(bgzf.o): In function `inflate_block':
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:224: undefined reference to `inflateInit2_'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:228: undefined reference to `inflate'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:229: undefined reference to `inflateEnd'
/home/dobin/STARcode/samtools-0.1.19/bgzf.c:233: undefined reference to `inflateEnd'
samtools/libbam.a(bam_import.o): In function `ks_getuntil2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `__bam_get_lines':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:92: undefined reference to `gzclose'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_close':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:485: undefined reference to `gzclose'
samtools/libbam.a(bam_import.o): In function `sam_open':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzdopen'
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzopen64'
samtools/libbam.a(bam_import.o): In function `ks_getc':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
samtools/libbam.a(bam_import.o): In function `sam_header_read2':
/home/dobin/STARcode/samtools-0.1.19/bam_import.c:147: undefined reference to `gzclose'
collect2: ld returned 1 exit status
make: *** [STAR] Error 1
I have samtools-0.1.19 on my computer. What is this error about? Thank you.
-
Hi Shangzhong,
Please try this patch: http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz
If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.
Cheers
Alex
-
Originally posted by shangzhong0619:
Thanks. It works. I have another problem: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me and I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.
For genome generation, STAR needs an unzipped FASTA. You do this once per genome, and can delete the FASTA after the genome is generated. The '--readFilesCommand zcat' option only applies to fastq/fasta reads at the mapping stage.
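A minimal sketch of decompressing the FASTA before genome generation. The helper name unzipped_fasta is illustrative; it writes the uncompressed copy next to the .gz file and reuses it if it already exists.

```shell
#!/usr/bin/env bash
# Sketch: make sure genome generation gets an uncompressed FASTA.

# Print the path of an uncompressed copy of the given FASTA, creating it if needed.
# Paths that do not end in .gz are printed unchanged.
unzipped_fasta() {   # usage: fa=$(unzipped_fasta genome.fa.gz)
    local out
    case "$1" in
        *.gz)
            out="${1%.gz}"
            # decompress only if the uncompressed file is not already there
            [ -e "$out" ] || gzip -dc -- "$1" > "$out" || return 1
            printf '%s\n' "$out"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# Example use with the command from the question:
#   STAR --runMode genomeGenerate --genomeDir STAR_pathway \
#        --genomeFastaFiles "$(unzipped_fasta file.fa.gz)" --runThreadN 10
```

If decompression is interrupted, delete the partial output file before retrying, since the helper would otherwise reuse it.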
Cheers
Alex