  • STAR: ultrafast universal RNA-seq aligner

    Dear All,

    I would like to formally introduce on SEQanswers our RNA-seq mapper STAR.
    Its advantages include:
    • Very high mapping speed:
      on a modest 12-core cluster STAR maps 400 Million pairs per hour for human 2x100 Illumina reads (>50 times faster than TopHat).
    • Accurate alignment of contiguous and spliced reads:
      in our tests on real and simulated data STAR showed better sensitivity and precision than TopHat.
    • Detection of polyA-tails, non-canonical splices and chimeric (fusion) junctions.
    • Mapping reads of any length:
      STAR can efficiently map reads of any length generated by current or emerging sequencing platforms, starting from ~15 bases (small RNA) and up to full length transcripts several kilobases long.
    • Thorough testing on large ENCODE datasets:
      STAR was used to map 64 Billion reads of long RNA-seq and 16 Billion reads of short RNA-seq, and will be used to map RNA-seq data in the next ENCODE phase.

    STAR requires ~30GB of RAM for mapping to the human genome (could be reduced to 16GB in the "sparse" mode with some speed loss).

    More information can be found in our recent paper.
    If you decide to try it out, please download one of the latest STAR releases.
    I will be happy to answer any questions via SEQanswers, STAR discussion forum, or by e-mail:[email protected]

    Cheers
    Alex
    Last edited by alexdobin; 06-20-2014, 01:43 PM. Reason: fixed URL

  • #2
    Your program is great. It finishes my alignments in a fraction of the time TopHat takes. Thank you!!



    • #3
      Yes, I am also happy with the results from STAR. It is incredibly fast and easy to use. I will test the shared memory option again now that I have my own workstation machine to play with.

      Compatibility with Cufflinks has also been great. Works with HT-seq as well.

      Thank you!!



      • #4
        i also like STAR a lot. so much simpler and faster than other tools around.

        could you give an example how to use gzipped input files (--readFilesCommand)? i cannot get it to work.

        thanks!



        • #5
          I've used it as follows, which works fine:

          STAR_2.3.0e.Linux_x86_64/STAR --genomeDir /dir/STAR/Genome --readFilesIn /dir/*.fastq.gz --readFilesCommand zcat --outFileNamePrefix SampleA --runThreadN 20

          I hope the implementation of the readcounts will follow soon, since I'm curious to compare this tool to the tools I'm using at this moment.
          Last edited by iris_aurelia; 02-18-2013, 11:51 PM.



          • #6
            @volks:
            Good example from @iris_aurelia.
            If you have multiple files you wish to map in one run, they should be separated by commas, while paired-end mates are separated by space:
            --readFilesIn Read1a.gz,Read1b.gz Read2a.gz,Read2b.gz

            'zcat' should be on your path, or you can use an absolute path such as '/bin/zcat'.
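The separator rule above (commas between files of the same mate, a space between mate 1 and mate 2) can be sketched as a small command builder. This is only an illustration: `read_files_in` and `star_command` are hypothetical helper names, the file names and paths are the placeholders used in this thread, and only options quoted in the posts above are used.

```python
def read_files_in(mate1_files, mate2_files=None):
    """Return the --readFilesIn values: commas join files, mates are separate arguments."""
    values = [",".join(mate1_files)]
    if mate2_files:
        if len(mate1_files) != len(mate2_files):
            raise ValueError("mate 1 and mate 2 file lists must have the same length")
        values.append(",".join(mate2_files))
    return values


def star_command(genome_dir, mate1_files, mate2_files=None,
                 prefix="SampleA", threads=20, star="STAR"):
    """Assemble a STAR argument list using only options quoted in this thread."""
    files = list(mate1_files) + list(mate2_files or [])
    cmd = [star, "--genomeDir", genome_dir, "--readFilesIn"]
    cmd += read_files_in(mate1_files, mate2_files)
    if files and all(name.endswith(".gz") for name in files):
        # zcat must be on the PATH, or replace it with an absolute path such as /bin/zcat
        cmd += ["--readFilesCommand", "zcat"]
    cmd += ["--outFileNamePrefix", prefix, "--runThreadN", str(threads)]
    return cmd


cmd = star_command("/dir/STAR/Genome",
                   ["Read1a.gz", "Read1b.gz"],
                   ["Read2a.gz", "Read2b.gz"])
print(" ".join(cmd))
```

Keeping the command as a list means it can be handed to `subprocess.run(cmd)` without shell quoting; each mate's comma-joined file list stays a single argument.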


            @iris_aurelia:
            Presently I am working on outputting sorted BAM, since 'samtools sort' is a bottleneck for many users (because samtools is not threaded). Then I will code "read counts per gene" (similar to HT-seq counts). Then you will be able to map the reads and get the counts in one run. Hopefully, it will be ready in 1-2 months.



            • #7
              Hi Alex,

              Yes it is a pity that samtools doesn't support multiple threads. We've tried Novosort and that seems to work fine, however you need a license in order to run it in a multithreaded way.

              HT-seq is what I usually use to get the readcounts. Are you gonna make this step multithreaded as well? It would indeed be nice if you could just run a sample and get the whole bunch of information all at once.



              • #8
                That is a great feature, Alex. I am looking forward to it.

                I currently convert cufflinks output into read counts for DESeq.

                You read in the length and coverage columns from the isoforms.fpkm_tracking file as well as the read length of your reads and then calculate:

                my $reads = $length * $coverage / $readlength;

                I have compared the values to htseq-count and found them to be nearly the same.

                The advantage over htseq is you can multi-thread cufflinks.
                Last edited by NGSfan; 02-19-2013, 02:25 AM.
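The conversion described in post #8 (reads = length * coverage / read length) can be sketched in Python as follows. This is a minimal sketch under assumptions: it assumes the tracking file is tab-separated with header columns named `tracking_id`, `length` and `coverage`, and it skips rows whose values are not numeric. The function name and the toy data are made up for illustration.

```python
import csv
import io


def estimated_read_counts(tracking_file, read_length):
    """Estimate reads per isoform as length * coverage / read_length."""
    counts = {}
    for row in csv.DictReader(tracking_file, delimiter="\t"):
        try:
            length = float(row["length"])
            coverage = float(row["coverage"])
        except ValueError:
            # Assumption: non-numeric placeholders mean "no estimate"; skip the row
            continue
        counts[row["tracking_id"]] = length * coverage / read_length
    return counts


# Toy input standing in for an isoforms.fpkm_tracking file (values are invented)
toy = io.StringIO(
    "tracking_id\tlength\tcoverage\n"
    "T1\t1000\t5.0\n"
    "T2\t500\t-\n"
)
print(estimated_read_counts(toy, read_length=100))
```

With a real file, pass an open file handle instead of the `io.StringIO` object.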



                • #9
                  I have problems generating a new genome.
                  My genome is more than 300.000 contigs and ~300MB in size.

                  Feb 18 11:25:11 ..... Started STAR run
                  Feb 18 11:25:11 ... Starting to generate Genome files
                  /var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed



                  • #10
                    Originally posted by NGSfan
                    That is a great feature, Alex. I am looking forward to it.

                    I currently convert cufflinks output into read counts for DESeq.

                    You read in the length and coverage columns from the isoforms.fpkm_tracking file as well as the read length of your reads and then calculate:

                    my $reads = $length * $coverage / $readlength;

                    I have compared the values to htseq-count and found them to be nearly the same.

                    The advantage over htseq is you can multi-thread cufflinks.


                    In my experience Cufflinks (using the multithreaded option) is even slower than HTseq-count.
                    We are doing some experimenting creating our own multithreaded count script which outputs exactly the same data as HT-seq count. In order to do this multithreaded we are running each chromosome separately, which might be an idea for STAR?
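The idea of running each chromosome separately can be sketched as below. This is not the script mentioned in the post and it does not reproduce htseq-count's counting modes: it uses a deliberately simplified rule (a read counts if its position falls inside exactly one gene interval), in-memory toy data, and hypothetical function names. A thread pool is used so the sketch runs anywhere; for real CPU-bound work a process pool would be the more likely choice.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def count_chromosome(genes, read_positions):
    """Count reads whose position lies inside exactly one (start, end, gene_id) interval."""
    counts = Counter()
    for pos in read_positions:
        hits = [gene_id for start, end, gene_id in genes if start <= pos <= end]
        if len(hits) == 1:  # reads hitting no gene or several genes are not counted
            counts[hits[0]] += 1
    return counts


def count_per_chromosome(genes_by_chrom, reads, workers=4):
    """Split reads by chromosome, count each chromosome independently, merge the results."""
    reads_by_chrom = defaultdict(list)
    for chrom, pos in reads:
        reads_by_chrom[chrom].append(pos)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_chromosome, genes_by_chrom.get(chrom, []), positions)
            for chrom, positions in reads_by_chrom.items()
        ]
        for future in futures:
            total.update(future.result())
    return total


genes = {"chr1": [(100, 200, "geneA"), (150, 300, "geneB")],
         "chr2": [(1, 50, "geneC")]}
reads = [("chr1", 120), ("chr1", 160), ("chr1", 250),
         ("chr2", 10), ("chr2", 60), ("chr3", 5)]
print(dict(count_per_chromosome(genes, reads)))
```

Because no read is shared between chromosomes, the per-chromosome counters can simply be added together at the end.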



                    • #11
                      Originally posted by JonB
                      I have problems generating a new genome.
                      My genome is more than 300.000 contigs and ~300MB in size.

                      Feb 18 11:25:11 ..... Started STAR run
                      Feb 18 11:25:11 ... Starting to generate Genome files
                      /var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed


                      When the genome has a lot of contigs, I noticed that STAR needs a lot more memory... how much do you have? I would recommend finding a machine with 256 GB of RAM if possible.



                      • #12
                        Originally posted by iris_aurelia
                        In my experience Cufflinks (using the multithreaded option) is even slower than HTseq-count.
                        We are doing some experimenting creating our own multithreaded count script which outputs exactly the same data as HT-seq count. In order to do this multithreaded we are running each chromosome separately, which might be an idea for STAR?

                        That's good to know - I have been using 24 threads, which is pretty fast, but I did try it on 8 threads and noticed it is not much faster at that level, especially if you include options for multi-read correction and fragment bias correction.

                        I would support any attempt to multi-thread HTseq-count, let us know when you have a script. You can share it on code.google.com, it's where I put my most useful scripts.
                        Last edited by NGSfan; 02-19-2013, 05:33 AM.



                        • #13
                          NGSfan:
                          Thanks. I think I had 60 GB of memory last time. I will try some more



                          • #14
                            This looks great, I think I'll try it out! Have you tested STAR on mixed RNA datasets, i.e. RNA-Seq on a sample containing both human and bacterial/viral RNA? That's what I'm working with at the moment and TopHat just isn't cutting it: there are millions of reads that are identified as being either human or bacterial, depending on which alignment step I run first.



                            • #15
                              read count

                              @iris_aurelia:
                              My plan is to count the reads per gene within the STAR algorithm as they are mapped, so it will be as multi-threaded as STAR itself.
                              Multi-threading on a per-chromosome basis is a good idea. I am going to implement a multi-threaded BAM sorting using this kind of parallelization.
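Per-chromosome parallel sorting can be illustrated with a toy sketch like the one below. It says nothing about how STAR actually implements BAM sorting: the records are plain `(chromosome, position, name)` tuples rather than BAM records, the function name is hypothetical, and a thread pool stands in for whatever parallelism a real implementation would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def sort_by_coordinate(records, chrom_order, workers=4):
    """Bucket records by chromosome, sort each bucket in parallel, concatenate in order."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[0]].append(record)
    unknown = set(buckets) - set(chrom_order)
    if unknown:
        # Assumption: chrom_order must list every chromosome present in the records
        raise ValueError(f"chromosomes missing from chrom_order: {sorted(unknown)}")

    def sort_bucket(chrom):
        # Each bucket is independent, so buckets can be sorted concurrently
        return sorted(buckets.get(chrom, []), key=lambda rec: rec[1])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(sort_bucket, chrom_order))
    return [rec for bucket in sorted_buckets for rec in bucket]


records = [("chr2", 5, "r1"), ("chr1", 300, "r2"),
           ("chr1", 10, "r3"), ("chr2", 1, "r4")]
for rec in sort_by_coordinate(records, ["chr1", "chr2"]):
    print(rec)
```

The final order is determined by `chrom_order` first and by position within each chromosome, so no merge step across buckets is needed.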

                              @NGSfan:
                              Using Cufflinks "coverage" for calculating gene counts is a nice trick. We might as well do it since we are typically running both Cufflinks and HT-seq on our datasets. In our experience Cufflinks' speed is acceptable for A+ datasets, however, it slows down considerably and often gets stuck on A- and Total RNA samples.

