Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

STAR: ultrafast universal RNA-seq aligner

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by kjusto View Post
    Hi,
    Trying to generate genome from Rice reference and I get the following error,have tried several STAR patches available:

    [email protected][STAR_2.3.1z10] ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles L1_1.fq L1_2.fq
    Jul 17 10:55:11 ..... Started STAR run
    Jul 17 10:55:11 ... Starting to generate Genome files
    terminate called after throwing an instance of 'std:ut_of_range'
    what(): vector::_M_range_check
    zsh: abort ./STAR --runMode genomeGenerate --genomeDir IRGSP_genome --genomeFastaFiles

    Any ideas,
    Thanks!
    Hi @kjusto,
    Is it possible that your L1_1.fq L1_2.fq files are not fasta files - they seem to have FASTQ extension?
    If they are valid fasta, please send me a link to them, and I will try to figure out the problem.
    Cheers
    Alex

    Comment


    • Originally posted by alexdobin View Post
      Hi Emma,

      these long-gap splices, often connecting adjacent genes, are somewhat common in RNA-seq data. It's hard to say whether they are biochemically real "read-through transcription" events, or some kind of wet-lab or mapping artifacts.
      They would be clearly mapping artifacts if "better" alignments of these sequences can be found, however, BLATing or BLASTing them did not result in any better alignments.
      One way to get rid of them is to completely prohibit long gaps with --alignIntronMax N, which would prohibit any gap longer than N (by default this is ~600000). However, if you make this too small, say 100000, you may miss a number of valid junctions, as mammalian introns can be hundred of kilobases long.
      A better approach is filter out long-gap alignments supported by too few reads, e.g. :
      --outFilterType BySJout --outSJfilterIntronMaxVsReadN 10000 20000 50000 100000
      This would only allow unannotated junctions <=10kb supported by >=1 spliced read, <=20kb supported by >=2 reads, <=50kb by >= 3 reads, <=100kb by >=4 reads.
      This looks like a very bad idea. There are plenty of fusion genes which are known to be "adjacent" but still are genuine fusion genes.

      For example, the famous fusion gene FGFR3-TACC3 (for more info about it please Google it) has the genes FGFR3 and TACC3 as adjacent and the distance between them is less than 50kb. If this fusion is expressed moderately there could be, for example, 2 reads supporting the fusion and it will be discarded according to the above criteria.

      Out there are several fusion finders which can handle this kind of cases with very low false positives rate, e.g. https://code.google.com/p/fusioncatcher/wiki/comparison
      Last edited by ndaniel; 09-03-2014, 11:18 AM.

      Comment


      • In using STAR for my alignments, I'm subsequently using samtools for some basic evaluations (and GATK after that to generate a vcf file). An issue I'm having is that even though STAR is correctly including the mapping location for each read in the sam file, it is not setting the flag that says if a read is mapped or not (-f 4). As such, when I use samtools to count the number of mapped reads, it's always zero. Samtools accurately counts the number of total reads and unmapped reads.

        Is this an issue with STAR or possibly something I'm doing incorrectly?

        Comment


        • STAR doesn't output unmapped reads in the SAM file (at least not by default), so bit 4 shouldn't be set there for any record.

          Comment


          • That's a bit of an issue because the default of not having a flag set is to call it "unmapped".

            Running:

            samtools view -c -F 4 <file.name>

            counts all the reads as unmapped. Downstream tools that rely on that flag are thus rendered incompatible.

            Comment


            • No, not setting that bit means the read is mapped.
              Edit:I'll add that samtools view -c -F 4 foo.bam will count the mapped reads. Using -f 4 will count the unmapped reads, which will be 0.

              Comment


              • 0x4 means 'unmapped'. If no flags are set, that means it's a single-ended read mapped to the plus strand.

                Comment


                • My mistake. Thanks!

                  Comment


                  • Hi,
                    I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

                    Thanks!

                    Comment


                    • Originally posted by serash View Post
                      Hi,
                      I'm currently doing 2 pass alignment with the --twopass1readsN option, which adds the 1st pass junctions into genome in memory. I'm curious to know if this works when a genome in memory i used (--genomeLoad LoadAndKeep) from previous runs. And will this give the correct results if I run additional 2 pass alignments with this genome in memory or do I need to reload the genome for each run?

                      Thanks!
                      Hi,

                      the --twopass1readsN cannot be used with --genomeLoad LoadAndKeep option, you will need to re-load the genome for every sample. For each of the samples the 2-nd pass genome indices will be different, since the junctions discovered in the 1st pass will change from sample to sample.

                      For multiple samples, one option is to map (2-pass with --twopass1readsN -1 which will map all reads in the first pass) all the samples together, and use the Read Group tags to mark different samples.

                      Another option is the manual 2-step operation: run the 1st pass on all samples, collect the detected junctions from all samples, re-generate the genome, and run 2nd pass with the new genome, which can now be loaded with --genomeLoad LoadAndKeep option.

                      Cheers
                      Alex

                      Comment


                      • Hi again,

                        It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
                        This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

                        cheers,
                        Dries

                        Comment


                        • Originally posted by serash View Post
                          Hi again,

                          It makes sense I need to rebuild the genome first and can than use it again for all samples. However, I noticed that rebuilding the genome takes longer than the alignment itself. If I run the 2-pass alignment with --twopass1readsN -1 the outcome is achieved much faster by adding the SJ.out.tab in memory from the first run.
                          This said, I was wondering if its possible to do the same as the 2-pass run in shared memory, to add the data (give the file by --sjdbFileChrStartEnd?) to an existing genome in memory to avoid rebuilding the entire genome and thus saving time.

                          cheers,
                          Dries
                          This is a good suggestion, and it's already on my (short) TODO-list - adding annotated junctions on the fly, so that there no need to re-generate the whole genome index when you only want to change annotations.

                          Cheers
                          Alex

                          Comment


                          • That's great to know! Thanks

                            Comment


                            • Star is generating files like this in the directory I'm running it from:
                              • SJ.out.tab
                              • Log.std.out
                              • Log.out
                              • Log.final.out
                              • Log.progress.out


                              How can I put those files into a specific directory so that I can run multiple jobs in parallel from a single directory?


                              Cheers,
                              Dan.
                              Homepage: Dan Bolser
                              MetaBase the database of biological databases.

                              Comment


                              • Just use the --outFileNamePrefix option to give everything an appropriate prefix.

                                Comment

                                Working...
                                X