  • Originally posted by alexdobin View Post
    Hi @rvann,

    the error is caused by zero-length read sequences; STAR cannot process those.
    Hopefully cutadapt has an option to remove them - this has to be done simultaneously on the read1 and read2 files to preserve the order of the reads.

    Cheers
    Alex
    Thanks, Alex. I tried running cutadapt with -m 1 so that any zero-length reads would be discarded (see below). STAR had no problems with the trimmed fastq files.

    cutadapt -q 10 -m 1 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o outpath/file_R1.fastq.gz -p outpath/file_R2.fastq.gz inpath/file_R1.fastq.gz inpath/file_R2.fastq.gz
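
    As a quick sanity check (a minimal sketch - the paths and the gzipped-FASTQ assumption come from the command above), you can confirm that no zero-length reads survive the trimming:

    Code:
    zcat outpath/file_R1.fastq.gz | awk 'NR%4==2 && length($0)==0 {n++} END {print n+0, "zero-length reads remain"}'   # repeat for file_R2.fastq.gz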



    • I'm trying to run STAR at the moment, but it's throwing me an error while generating the genome.

      I simply use the standard command:
      ./STAR --runMode genomeGenerate --genomeDir /2tb-internal/data/STAR_test/ --genomeFastaFiles /2tb-internal/data/contig-100.without_polyA.fixed_names.fa --runThreadN 5

      The output I get:

      Jan 13 15:53:51 ..... Started STAR run
      Jan 13 15:53:51 ... Starting to generate Genome files
      terminate called after throwing an instance of 'std::bad_alloc'
      what(): std::bad_alloc
      Aborted (core dumped)

      Not sure what the problem could be.
      I didn't compile STAR myself because I don't have GCC 4.7 installed, so compiling doesn't work (I'm on Ubuntu 12.04). Could that be a problem?
      The "genome" is actually a transcriptome, so it's not very big (the file is 115 MB), but it has a lot of entries (~200,000) - I guess that could also be a problem... or not?
      The hardware should otherwise be sufficient: I have 16 GB RAM and can switch to a server with 256 GB if necessary.
      Any hints as to what could be causing the problem here?

      Thanks.



      • Hi @bastianwur,

        quoting from the manual:

        "If you are using a genome with a large (>5,000) number of references (chrosomes/sca olds), you may need to reduce the --genomeChrBinNbits to reduce RAM consumption. The following scaling is recommended: --genomeChrBinNbits = min(18, log2(GenomeLength/NumberOfReferences))."

        In you case it would be --genomeChrBinNbits 9 .
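
        For example, with roughly 115 Mb of sequence spread over ~200,000 contigs (the ballpark figures from the post above; the FASTA file size is only a rough proxy for genome length), the recommended value works out like this:

        Code:
        awk 'BEGIN { v = log(115000000/200000)/log(2); if (v > 18) v = 18; print int(v) }'   # prints 9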

        Cheers
        Alex

        Originally posted by bastianwur View Post
        I'm trying to run STAR at the moment, but it's throwing me an error while generating the genome. [...]



        • Hi alexdobin,

          As in many posts that you have answered, I have a RAM problem.

          The machines I use have 24 GB of RAM, and I am working with human data.

          This is my command (I decreased the genomeChrBinNbits argument, as I saw in your previous answers):

          Code:
          STAR --runMode genomeGenerate \
            --genomeDir test_star/reference.Merged \
            --genomeFastaFiles Homo_sapiens.GRCh38.fa \
            --runThreadN 2 \
            --genomeChrBinNbits 8 \
            --limitGenomeGenerateRAM 10000000000 \
            --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab \
            --limitIObufferSize 4000000000 \
            --sjdbOverhang 99
          This is the log:

          Code:
          Feb 03 11:15:54 ..... Started STAR run
          Feb 03 11:15:54 ... Starting to generate Genome files
          Feb 03 11:17:11 ... starting to sort  Suffix Array. This may take a long time...
          Feb 03 11:17:35 ... sorting Suffix Array chunks and saving them to disk...
          Feb 03 12:37:51 ... loading chunks from disk, packing SA...
          terminate called after throwing an instance of 'std::bad_alloc'
            what():  std::bad_alloc
          /var/spool/torque/mom_priv/jobs/1759223.moab.host.SC: line 41: 27870 Aborted                 STAR --runMode genomeGenerate --genomeDir test_star/reference.Merged --genomeFastaFiles Homo_sapiens.GRCh38.fa --runThreadN 2 --genomeChrBinNbits 8 --limitGenomeGenerateRAM 10000000000 --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab --limitIObufferSize 4000000000 --sjdbOverhang 99
          The execution created 45 SA_* files of around 1GB.
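
          A quick way to total those chunk files (a minimal sketch; it assumes the SA_* files were written into the --genomeDir from the command above):

          Code:
          du -ch test_star/reference.Merged/SA_* | tail -n 1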

          What am I doing wrong with the parameters? I would like it to run within ~20 GB of RAM.

          Thanks in advance for any help you might give me.



          • Originally posted by emixaM View Post
            As in many posts that you have answered, I have a RAM problem. [...]
            Hi @emixaM,

            with 24 GB of RAM and the human genome you need to use --genomeSAsparseD 2.
            You do not need --genomeChrBinNbits 8 or --limitIObufferSize 4000000000, and you can increase
            --limitGenomeGenerateRAM to 20000000000:
            Code:
            STAR --runMode genomeGenerate \
              --genomeDir test_star/reference.Merged \
              --genomeFastaFiles Homo_sapiens.GRCh38.fa \
              --runThreadN 2 \
              --genomeSAsparseD 2 \
              --limitGenomeGenerateRAM 20000000000 \
              --sjdbFileChrStartEnd alignment_1stPass/AllSamples.SJ.out.tab \
              --sjdbOverhang 99
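
            Before launching, it can also help to confirm how much memory the node really has free (a minimal check; it assumes the free utility is available on the node):

            Code:
            free -b | awk '/^Mem:/ {print $4, "bytes free"}'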
            Cheers
            Alex



            • Thank you very much, Alex!

              For good measure, I tried launching STAR from the beginning (not directly on the 2nd pass) on my 24 GB machine, and I got this error.

              Here is the command:
              Code:
              STAR --runMode alignReads \
                --genomeDir Homo_sapiens.GRCh38/genome/star_index/Ensembl77.sjdbOverhang99 \
                --readFilesIn \
                  sample_L001.trim.pair1.fastq.gz \
                  sample_L001.trim.pair2.fastq.gz \
                --runThreadN 2 \
                --readFilesCommand zcat \
                --outStd Log \
                --outSAMunmapped Within \
                --outSAMtype BAM Unsorted \
                --outFileNamePrefix alignment_1stPass/sample/ \
                --outSAMattrRGline ID:"sample_L001" PL:"ILLUMINA" PU:"CXX_1" SM:"sample" CN:"seq" \
                --limitGenomeGenerateRAM 20000000000 \
                --limitIObufferSize 32000000 \
                --genomeSAsparseD 2
              And the error:
              Code:
              Feb 08 10:39:30 ..... Started STAR run
              Feb 08 10:39:31 ..... Loading genome
              
              EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
              Possible cause 1: not enough RAM. Check if you have enough RAM 31550092110 bytes
              Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 31550092110
              
              Feb 08 10:39:31 ...... FATAL ERROR, exiting
              
              gzip: stdout: Broken pipe
              
              gzip: stdout: Broken pipe
              Is it the same problem?
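
              For what it's worth, the error message itself suggests one thing to try first - raising the virtual-memory limit in the shell that launches STAR (a sketch, assuming the system allows it):

              Code:
              ulimit -v 31550092110   # the value suggested by the STAR error above
              # ...then rerun the same STAR command from this shell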



              • Originally posted by emixaM View Post
                Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 31550092110
                @emixaM: Can you post the output of this command?

                Code:
                $ ulimit -a



                • No problem:

                  Code:
                  $ ulimit -a
                  core file size          (blocks, -c) 0
                  data seg size           (kbytes, -d) unlimited
                  scheduling priority             (-e) 0
                  file size               (blocks, -f) unlimited
                  pending signals                 (-i) 95162
                  max locked memory       (kbytes, -l) 4096000
                  max memory size         (kbytes, -m) unlimited
                  open files                      (-n) 32768
                  pipe size            (512 bytes, -p) 8
                  POSIX message queues     (bytes, -q) 819200
                  real-time priority              (-r) 0
                  stack size              (kbytes, -s) unlimited
                  cpu time               (seconds, -t) unlimited
                  max user processes              (-u) 95162
                  virtual memory          (kbytes, -v) unlimited
                  file locks                      (-x) unlimited



                  • Are you the only user on this machine when you try to run STAR? Are there any other processes going on in the background? Have you looked at the output of, say, "top"? What does the memory section at the top say?



                    • Yes, I am the only one; it runs on a compute node of an HPC cluster.

                      Here is the top output from a compute node (no STAR running):

                      Code:
                      $ top -M
                      top - 14:33:48 up 3 days, 23:33,  0 users,  load average: 0.00, 1.98, 5.10
                      Tasks: 333 total,   1 running, 332 sleeping,   0 stopped,   0 zombie
                      Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
                      Mem:    23.584G total, 1948.043M used,   21.682G free,    0.000k buffers
                      Swap:    0.000k total,    0.000k used,    0.000k free, 1073.691M cached
                      
                        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
                      31824 emixaM   20   0 27620 1620 1072 R  0.3  0.0   0:00.06 top
                      ...



                      • Are you using a job scheduler then? Perhaps the scheduler requires an explicit request for more RAM when not using the default (e.g. with LSF it would be -M 20), as in the sketch below.
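
                        For a Torque/MOAB setup like this one, an explicit memory request in the job script might look like the following (hypothetical values; the exact syntax and policy depend on the cluster):

                        Code:
                        #PBS -l nodes=1:ppn=2
                        #PBS -l mem=24gb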



                        • Yes, I am using a scheduler (MOAB), but I have never had any problem accessing all the RAM of the nodes.



                          • Originally posted by umn_bist
                            I am attempting to cross over to STAR from TopHat for my RNA-seq workflow, but I had a question regarding its capabilities.

                            Specifically, TopHat had options for mate-inner-dist and mate-std-dev, but relied on RSeQC to provide these values for each unique sample.

                            I was wondering if STAR provides an option to streamline/automate this process. Thank you very much for the awesome tool!
                            Hi @umn_bist

                            STAR does not need or use the "expected" insert length information for mapping - so you do not have to worry about it.
                            When mapping DNA-seq data, we can decide whether an alignment's insert size is too big or too small by comparing it to the "expected" insert size distribution.
                            However, in RNA-seq the unsequenced portion of the insert between the mates can contain splice junctions.
                            This means that we cannot simply calculate the insert size from an alignment and compare it with the "expected" insert size.
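
                            To put toy numbers on that (hypothetical values, purely for illustration): if two mates span 5,264 bp of genome but a 5,000 bp intron lies between them, the true insert is only 264 bp.

                            Code:
                            awk 'BEGIN { span = 5264; intron = 5000; print "true insert:", span - intron, "bp" }'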

                            Cheers
                            Alex



                            • Hi Alex,

                              I have a simple question: how can I tell which strand a read mapped to (the data are strand-specific)? I viewed my output .bam file using "sambamba view". Is it possible to tell the strand from the example given below?

                              HWI-ST1395:97:d29b4acxx:5:2105:5630:92677 99 chr4 15003609 3 100M = 15003773 264 GCATAATTATGAAGATATTTTTTTATATGGTCCGACTATATAAAATAAATATCATAACGGAAATCAACTTGAACAATGCTGTACCAAAACTACAGTGGTT CCCFFFFFHHHHGJIJJIJJJIJJJJJJJJGIJJIIJJJIJJJIJJGGJJIJJJJIJJIJIJHGGGFFDFDFECEEECDDDDDDCDDDDDDCCDD@@C@B NH:i:2 HI:i:1 AS:i:198 nM:i:0

                              HWI-ST1395:97:d29b4acxx:7:2301:9293:56479 99 chr4 15003609 3 100M = 15003701 192 GCATAATTATGAAGATATTTTTTTATATGGTCCGACTATATAAAATAAATATCATAACGGAAATCAACTTGAACAATGCTGTACCAAAACTACAGTGGTT CCCFFFFFHHHHHJJJIIJJJJJJHJJJJJJJIIJJJJIIJJGJIIJJJJIJJJJGGIJIGJFEHHEFFFEFECEEECDDDACCCDCDDBDDCDD>@CCD NH:i:2 HI:i:1 AS:i:198 nM:i:0
                              If not, how can I get the strand information? I am also interested to know what the second column stands for in the above example; I am not able to figure it out.

                              Thanks,

                              Rui



                              • Originally posted by harrike View Post
                                I am also interested to know what the second column stands for in the above example; I am not able to figure it out.

                                Thanks,

                                Rui
                                The second column is the bitwise FLAG. Check section 1.4 of the full SAM spec: https://samtools.github.io/hts-specs/SAMv1.pdf

                                The bitwise FLAG also gives you the strand information (https://www.biostars.org/p/59388/). You can "translate" the FLAG using this tool: https://broadinstitute.github.io/pic...ain-flags.html
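
                                If samtools is available, it can also decode the flag on the command line. For the FLAG 99 in the records above, the 0x10 (read reverse strand) bit is not set, so the read itself mapped to the forward strand:

                                Code:
                                samtools flags 99   # prints: 0x63	99	PAIRED,PROPER_PAIR,MREVERSE,READ1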
