  • #76
    Originally posted by Nino
    Hey Devon,

    It turns out it is not difficult, since a group from Case Western Reserve University, Cleveland, OH published a paper on a program they developed called LoQuM, which does exactly what I wanted. I have not tried the program yet, but here is the title of the article if you would like to read it yourself:

    "Accurate estimation of short read mapping quality for next-generation genome sequencing"

    Thanks,
    Nino
    Interesting, I'll have to give that paper a read, thanks!



    • #77
      Originally posted by alexdobin
      We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post
      Alex, thank you for the great tool - STAR is indeed very impressive!

      Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?



      • #78
        Originally posted by apredeus
        Alex, thank you for the great tool - STAR is indeed very impressive!

        Would you post a command that one would use to generate a mixed genome for mouse and human? We normally align to mouse and human genomes using GTF files of annotated mRNAs. Can one incorporate both of those GTF files into a combined mouse+human index?
        Hi Alex,

        To generate a combined mouse/human genome with STAR, you would need to slightly modify your FASTA and GTF files:
        1. Modify the chromosome names so that mouse and human chromosomes have distinct names, e.g. chr1h/chr1m etc. In the FASTA files you need to make these modifications in all sequence name lines (i.e. those starting with ">"). In the GTF files you would need to modify all chromosome names in field 1.
        2. Make sure that the transcript_id values in the GTF files are distinct for mouse and human. This is usually the case; for instance, Gencode has "ENSMUSTxxxxx" for mouse and "ENSTxxxxx" for human.
        3. Concatenate the GTF files for mouse and human into a single GTF file.
        4. Run genome generation with
        STAR --runMode genomeGenerate --runThreadN 12 --genomeDir ./ --genomeFastaFiles /path/to/human.fa /path/to/mouse.fa --sjdbGTFfile /path/to/mouse_human.gtf --sjdbOverhang 100

        If you want to use mRNA GTF files instead of, or in addition to, standard annotations, I would recommend checking the splice junctions in this file for very short introns and filtering them out - please see this post.

        Cheers
        Alex
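
Steps 1 to 3 of the recipe above could be sketched in Python roughly as follows. This is not from the thread: the function names `tag_fasta` and `tag_gtf`, the `h`/`m` suffixes and the toy records are illustrative assumptions, and the sketch assumes the sequence name is everything on the header line up to the first space.

```python
import io

def tag_fasta(src, dst, suffix):
    """Append `suffix` to the sequence name on every FASTA header line."""
    for line in src:
        if line.startswith(">"):
            # Assumption: the name ends at the first space; the rest is a description.
            name, sep, desc = line[1:].rstrip("\n").partition(" ")
            dst.write(f">{name}{suffix}{sep}{desc}\n")
        else:
            dst.write(line)  # sequence lines are copied unchanged

def tag_gtf(src, dst, suffix):
    """Append `suffix` to the chromosome name (field 1) of every GTF record."""
    for line in src:
        if line.startswith("#") or "\t" not in line:
            dst.write(line)  # comments, blank or malformed lines pass through
            continue
        chrom, tab, rest = line.partition("\t")
        dst.write(f"{chrom}{suffix}{tab}{rest}")

# Toy records (made up for illustration, not real annotation entries).
fasta_out = io.StringIO()
tag_fasta(io.StringIO(">chr1 human chromosome 1\nACGT\n"), fasta_out, "h")

human_gtf = 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "ENST0001";\n'
mouse_gtf = 'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "G2"; transcript_id "ENSMUST0001";\n'

# Writing both tagged GTFs to the same handle performs the concatenation (step 3).
combined_gtf = io.StringIO()
tag_gtf(io.StringIO(human_gtf), combined_gtf, "h")
tag_gtf(io.StringIO(mouse_gtf), combined_gtf, "m")

print(fasta_out.getvalue(), combined_gtf.getvalue(), sep="")
```

With real data you would pass open file handles instead of `io.StringIO` objects, then give the two tagged FASTA files and the combined GTF to the genomeGenerate command from step 4.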



        • #79
          Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.

          Originally posted by alexdobin
          If you want to use mRNA GTF files instead of, or in addition to, standard annotations, I would recommend checking the splice junctions in this file for very short introns and filtering them out - please see this post.
          What would the "standard" annotation be in this case? RefSeq? We have always used the mRNA collections for both human and mouse (about 1.5 million for mm9 and 2.5 million or so for hg19), so I'm not sure what else people use.



          • #80
            Also, I meant to ask - do you normally include the "random" and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but quite a few transcripts map to these.

            Thank you!



            • #81
              Originally posted by apredeus
              Great, thank you very much for the informative answer. I'll make sure to filter out the ultra-short introns.

              What would the "standard" annotation be in this case? RefSeq? We have always used the mRNA collections for both human and mouse (about 1.5 million for mm9 and 2.5 million or so for hg19), so I'm not sure what else people use.
              There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend the Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
              You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
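
The junction-filtering step described above could look something like this in Python. This sketch is not from the thread and makes two assumptions: that each line of the junction list is tab-separated with chromosome, intron start and intron end in the first three columns, and that "very short" means fewer than `min_intron` bases, where the default of 20 is an arbitrary illustrative threshold rather than a recommended value.

```python
def filter_short_introns(lines, min_intron=20):
    """Yield junction lines whose intron spans at least `min_intron` bases."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end = int(fields[1]), int(fields[2])
        # Assumption: start and end are the first and last intron bases (inclusive).
        if end - start + 1 >= min_intron:
            yield line

# Toy junctions (made up): a 6-base intron and an 800-base intron.
junctions = ["chr1\t100\t105\t+\n", "chr1\t200\t999\t-\n"]
kept = list(filter_short_introns(junctions, min_intron=20))
print("".join(kept), end="")
```

The surviving lines can be written to a new file and passed to --sjdbFileChrStartEnd when re-generating the genome.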



              • #82
                Originally posted by apredeus
                Also, I meant to ask - do you normally include the "random" and "hap" chromosomes from the human genome in the overall index? I know it shouldn't make a huge difference, but quite a few transcripts map to these.

                Thank you!
                I recommend including the "unplaced" scaffolds marked *GL* or *gl*. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.

                On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
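
One way to apply this advice before genome generation is to drop the haplotype records from the FASTA while keeping everything else, including the GL/gl scaffolds. The Python sketch below is not from the thread; it assumes haplotype scaffolds can be recognised by the substring "hap" in the sequence name (as in UCSC-style names such as chr6_apd_hap1), which you should check against your own FASTA headers.

```python
import io

def drop_haplotype_records(src, dst, marker="hap"):
    """Copy FASTA records from src to dst, skipping names that contain `marker`."""
    keep = True
    for line in src:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = marker not in name  # decision applies to the whole record
        if keep:
            dst.write(line)

# Toy input: one chromosome, one haplotype scaffold, one unplaced scaffold.
toy_fasta = ">chr1\nAC\n>chr6_apd_hap1\nGG\n>chrUn_gl000211\nTT\n"
filtered = io.StringIO()
drop_haplotype_records(io.StringIO(toy_fasta), filtered)
print(filtered.getvalue(), end="")
```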



                • #83
                  Originally posted by alexdobin
                  I recommend including the "unplaced" scaffolds marked *GL* or *gl*. There are some rRNAs on these scaffolds from which a large number of reads may originate, especially if the ribo-depletion did not work well.

                  On the other hand, *hap* scaffolds represent haplotypes. Some reads will map equally well to multiple haplotypes, and thus will be marked as "multi-mappers", which is not the desired behavior in most cases. Unless you need this kind of haplotype-aware mapping, I do not recommend including them.
                  Yes, very logical. Funny, after I posted this question, this whole story came up:

                  Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                  So, definitely including the extra scaffolds!



                  • #84
                    Originally posted by alexdobin
                    There are many possible choices for annotations: RefSeq, UCSC genes, ENSEMBL. I would recommend the Gencode annotations, which are based on ENSEMBL, are very comprehensive, and are used by ENCODE.
                    You can simply add these annotation GTF files to the mRNA GTF files, making sure the transcript IDs are distinct. This will increase the number of junctions in your database, which is usually beneficial. Again, depending on the quality of the mRNA alignments, you may need to filter them for junctions with very short introns and, possibly, other artifacts. A simple way to do this is to first generate the genome with the mRNA GTF file, then filter suspicious junctions out of the sjdbList.out.tab file, and then re-generate the genome, feeding in the filtered junctions with --sjdbFileChrStartEnd (you can include the annotation GTF file at the same time).
                    Yes, and in order to avoid Gencode/UCSC scaffold naming differences, I will just use the Gencode GTF with no "random" annotations. Sounds great. Thanks again for the answers!



                    • #85
                      Failed to generate genome using STAR

                      Hi, I tried to build a genome using the command:
                      STAR --runMode genomeGenerate --genomeDir STAR_pathway --genomeFastaFiles file.fa.gz --runThreadN 10
                      It failed with the message: "BUG: next index is smaller than previous, EXITING".

                      Also, does anyone have a more detailed manual for STAR? The manual I downloaded from the website shows:
                      /pathToStarDir/STAR --runMode genomeGenerate --genomeDir /path/to/GenomeDir --genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 --runThreadN <n> …
                      What are the other options in "..."? I then unzipped the fa.gz file to a plain fa file and got a different error message: "limitGenomeGenerateRAM=28is too small for your genome
                      SOLUTION: please specify limitGenomeGenerateRAM not less than114 GB and make that much RAM available".

                      For other aligners we can type -h or --help to find the details, but not for STAR...
                      Last edited by shangzhong0619; 06-19-2014, 02:38 PM.



                      • #86
                        Hi Shangzhong,

                        please try one of the latest STAR patches.
                        Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.

                        You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:

                        sjdbGTFfile -
                        string: path to the GTF file with annotations

                        sjdbOverhang 0
                        int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
                        if =0, splice junction database is not used

                        Cheers
                        Alex
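
For picking a --genomeChrBinNbits value on an assembly with many scaffolds, the post above only says "14 or smaller". As a rough aid, the Python sketch below uses a rule of thumb that, to my knowledge, appears in the STAR manual: min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). Treat the formula, the function name `suggest_chr_bin_nbits`, the rounding down and the example genome sizes as assumptions to verify against the manual for your STAR version.

```python
import math

def suggest_chr_bin_nbits(genome_length, n_references, read_length=100):
    """Rule-of-thumb value for --genomeChrBinNbits, capped at 18."""
    bases_per_reference = max(genome_length / n_references, read_length)
    # Rounding down is a conservative choice made here, not something STAR requires.
    return min(18, int(math.log2(bases_per_reference)))

# Hypothetical assembly: 2.4 Gb spread over 100,000 scaffolds.
fragmented = suggest_chr_bin_nbits(2_400_000_000, 100_000)
# Hypothetical chromosome-level assembly: 3.1 Gb in 25 sequences.
chromosome_level = suggest_chr_bin_nbits(3_100_000_000, 25)
print(fragmented, chromosome_level)
```

For the fragmented example this gives 14, in line with the advice above, while a chromosome-level assembly stays at the cap of 18.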



                        • #87
                          Originally posted by alexdobin
                          Hi Shangzhong,

                          please try one of the latest STAR patches.
                          Do you have a large number of contigs/scaffolds in your genome assembly? This would explain the error message (see this post). If so, you need to use --genomeChrBinNbits 14 or smaller.

                          You can find a brief description of all parameters at the end of the manual, or in the parametersDefault file in the source directory. If you want to use annotations to improve mapping accuracy, you will need:

                          sjdbGTFfile -
                          string: path to the GTF file with annotations

                          sjdbOverhang 0
                          int>=0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
                          if =0, splice junction database is not used

                          Cheers
                          Alex
                          Hi Alex,
                          Thanks for your reply. Yes, my reference FASTA has many scaffolds. When I try to install the latest version, it shows the following error:


                          samtools/libbam.a(bgzf.o): In function `bgzf_compress':
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:187: undefined reference to `deflateInit2_'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:188: undefined reference to `deflate'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:189: undefined reference to `deflateEnd'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:195: undefined reference to `crc32'
                          samtools/libbam.a(bgzf.o): In function `bgzf_dopen':
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:160: undefined reference to `compressBound'
                          samtools/libbam.a(bgzf.o): In function `bgzf_open':
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:142: undefined reference to `compressBound'
                          samtools/libbam.a(bgzf.o): In function `inflate_block':
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:224: undefined reference to `inflateInit2_'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:228: undefined reference to `inflate'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:229: undefined reference to `inflateEnd'
                          /home/dobin/STARcode/samtools-0.1.19/bgzf.c:233: undefined reference to `inflateEnd'
                          samtools/libbam.a(bam_import.o): In function `ks_getuntil2':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
                          samtools/libbam.a(bam_import.o): In function `__bam_get_lines':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzdopen'
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:92: undefined reference to `gzclose'
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:76: undefined reference to `gzopen64'
                          samtools/libbam.a(bam_import.o): In function `sam_close':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:485: undefined reference to `gzclose'
                          samtools/libbam.a(bam_import.o): In function `sam_open':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzdopen'
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:472: undefined reference to `gzopen64'
                          samtools/libbam.a(bam_import.o): In function `sam_header_read2':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzdopen'
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:126: undefined reference to `gzopen64'
                          samtools/libbam.a(bam_import.o): In function `ks_getc':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:17: undefined reference to `gzread'
                          samtools/libbam.a(bam_import.o): In function `sam_header_read2':
                          /home/dobin/STARcode/samtools-0.1.19/bam_import.c:147: undefined reference to `gzclose'
                          collect2: ld returned 1 exit status
                          make: *** [STAR] Error 1

                          I have samtools-0.1.19 on my computer. What is this error about? Thank you.



                          • #88
                            Hi Shangzhong,

                            please try this patch http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz

                            If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.

                            Cheers
                            Alex



                            • #89
                              Originally posted by alexdobin
                              Hi Shangzhong,

                              please try this patch http://labshare.cshl.edu/shares/ging...R_2.3.1z10.tgz

                              If this does not help, please use the pre-compiled STAR or STARstatic executables in the source directory.

                              Cheers
                              Alex
                              Thanks, it works. I have another question: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me; I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.



                              • #90
                                Originally posted by shangzhong0619
                                Thanks, it works. I have another question: when indexing the genome, does STAR accept a gzipped FASTA file? It didn't work for me; I got "BUG: next index is smaller than previous". I also tried --readFilesCommand zcat, which still didn't work. But when I unzip the FASTA file, it works.
                                Hi Shangzhong,

                                For genome generation, STAR needs an unzipped FASTA file. You only do this once per genome, and you can delete the FASTA after the genome is generated. The '--readFilesCommand zcat' option only applies to FASTQ/FASTA reads at the mapping stage.

                                Cheers
                                Alex
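
If you would rather script the decompression than run gunzip by hand, a minimal Python sketch (not from the thread) is shown below. The function name `gunzip_fasta` and the file names are illustrative assumptions; the round trip at the end uses a tiny throwaway file only to show that the output is plain, uncompressed FASTA.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_fasta(gz_path, out_path):
    """Stream-decompress a gzipped FASTA so it never has to fit in memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

# Round trip on a throwaway file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "genome.fa.gz")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b">chr1\nACGT\n")
    fa_path = gunzip_fasta(gz_path, os.path.join(tmp, "genome.fa"))
    with open(fa_path, "rb") as handle:
        roundtrip = handle.read()

print(roundtrip.decode(), end="")
```

The decompressed file is what you would pass to --genomeFastaFiles; as noted above, it can be deleted once the genome has been generated.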
