Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by kjusto View Post
    Hi,
    Trying to generate genome from Rice reference and I get the following error,have tried several STAR patches available:

    biostat1@biostat[STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
    Jul 17 10:55:11 ..... Started STAR run
    Jul 17 10:55:11 ... Starting to generate Genome files
    terminate called after throwing an instance of 'std:ut_of_range'
    what(): vector::_M_range_check
    zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

    Any ideas,
    Thanks!
    Hi @kjusto,
    Is it possible that your L1_1.fq L1_2.fq files are not fasta files - they seem to have FASTQ extension?
    If they are valid fasta, please send me a link to them, and I will try to figure out the problem.
    Cheers
    Alex

    Comment


    • Originally posted by alexdobin View Post
      Hi Emma,

      these long-gap splices, often connecting adjacent genes, are somewhat common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events, or some kind of wet-lab or mapping artifacts.
      They would be clearly mapping artifacts if "better" alignments of these sequences can be found, however, BLATing or BLASTing them did not result in any better alignments.
      One way to get rid of them is to completely prohibit long gaps with --alignIntronMax N, which would prohibit any gap longer than N (by default this is ~600000). However, if you make this too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundred of kilobases long.
      A better approach is filter out long-gap alignments supported by too few reads, e.g. :
      --outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
      This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >= 3 reads, <=100kb by >=4 reads.
      This looks like a very bad idea. There are plenty of fusion genes which are known to be "adjacent" but still are genuine fusion genes.

      For example, the famous fusion gene FGFR3-TACC3 (for more info about it please Google it) has the genes FGFR3 and TACC3 as adjacent and the distance between them is less than 50kb. If this fusion is expressed moderately there could be, for example, 2 reads supporting the fusion and it will be discarded according to the above criteria.

      Out there are several fusion finders which can handle this kind of cases with very low false positives rate, e.g. https://code.google.com/p/fusioncatcher/wiki/comparison
      Last edited by ndaniel; 09-03-2014, 11:18 AM.

      Comment


      • In using STAR for my alignments, I'm subsequently using samtools for some basic evaluations (and GATK after that to generate a vcf file). An issue I'm having is that even though STAR is correctly including the mapping location for each read in the sam file, it is not setting the flag that says if a read is mapped or not (-f 4). As such, when I use samtools to count the number of mapped reads, it's always zero. Samtools accurately counts the number of total reads and unmapped reads.

        Is this an issue with STAR or possibly something I'm doing incorrectly?

        Comment


        • STAR doesn't output unmapped reads in the SAM file (at least not by default), so bit 4 shouldn't be set there for any record.

          Comment


          • That's a bit of an issue because the default of not having a flag set is to call it "unmapped".

            Running:

            samtools view -c -F 4 <file.name>

            counts all the reads as unmapped. Downstream tools that rely on that flag are thus rendered incompatible.

            Comment


            • No, not setting that bit means the read is mapped.
              Edit:I'll add that samtools view -c -F 4 foo.bam will count the mapped reads. Using -f 4 will count the unmapped reads, which will be 0.

              Comment


              • 0x4 means 'unmapped'. If no flags are set, that means it's a single-ended read mapped to the plus strand.

                Comment


                • My mistake. Thanks!

                  Comment


                  • Hi,
                    I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

                    Thanks!

                    Comment


                    • Originally posted by serash View Post
                      Hi,
                      I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

                      Thanks!
                      Hi,

                      the --twopass1readsN cannot be used with --genomeLoad LoadAndKeep option, you will need to re-load the genome for every sample. For each of the samples the 2-nd pass genome indices will be different, since the junctions discovered in the 1st pass will change from sample to sample.

                      For multiple samples, one option is to map (2-pass with --twopass1readsN -1 which will map all reads in the first pass) all the samples together, and use the Read Group tags to mark different samples.

                      Another option is the manual 2-step operation: run the 1st pass on all samples, collect the detected junctions from all samples, re-generate the genome, and run 2nd pass with the new genome, which can now be loaded with --genomeLoad LoadAndKeep option.

                      Cheers
                      Alex

                      Comment


                      • Hi again,

                        It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
                        This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

                        cheers,
                        Dries

                        Comment


                        • Originally posted by serash View Post
                          Hi again,

                          It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
                          This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

                          cheers,
                          Dries
                          This is a good suggestion, and it's already on my (short) TODO-list - adding annotated junctions on the fly, so that there no need to re-generate the whole genome index when you only want to change annotations.

                          Cheers
                          Alex

                          Comment


                          • That's great to know! Thanks

                            Comment


                            • Star is generating files like this in the directory I'm running it from:
                              • SJ.out.tab
                              • Log.std.out
                              • Log.out
                              • Log.final.out
                              • Log.progress.out


                              How can I put those files into a specific directory so that I can run multiple jobs in parallel from a single directory?


                              Cheers,
                              Dan.
                              Homepage: Dan Bolser
                              MetaBase the database of biological databases.

                              Comment


                              • Just use the --outFileNamePrefix option to give everything an appropriate prefix.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                66 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X