  • I also noticed that you are running STAR with 16 threads. There should be a Slurm option to ensure that you actually get that many cores and that they are all on the same physical node.


    • I've most recently been submitting jobs with this command:

      [scripts]$ sbatch -p Long -N 1 -n 16 --no-requeue STAR_generate_genome_indices.sh

      I believe this should allocate 16 cores on a single node to the job. I've also tried this new command:

      [scripts]$ sbatch -p Long -N 1 -n 16 --mem 64000 --no-requeue STAR_generate_genome_indices.sh

      to see whether manually specifying the amount of memory makes a difference.
      Last edited by stu; 08-12-2015, 11:49 AM.


      • Hi @stu,

        since you have a GFF (not GTF) file, you need to use
        --sjdbGTFtagExonParentTranscript Parent
        This is explained in the manual (chapter 2.2.3).

        If this does not work, please post a few "exon" lines of your GFF.

        Cheers
        Alex
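Before regenerating the genome index, it can help to confirm that the exon lines of the GFF really carry a Parent attribute, since that is the tag passed to --sjdbGTFtagExonParentTranscript. The Python sketch below is only an illustration: the function name exon_parent_summary is hypothetical, and it assumes a tab-separated GFF3 file with key=value attributes in the ninth column.

```python
def exon_parent_summary(gff_lines, feature="exon", tag="Parent"):
    """Count feature lines that do / do not carry the given attribute tag.

    Assumes tab-separated GFF3 lines with 'key=value;key=value' attributes
    in column 9. Returns a (with_tag, without_tag) tuple.
    """
    with_tag = 0
    without_tag = 0
    for line in gff_lines:
        # Skip comments, directives (## ...) and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        # Parse the attribute column into a dictionary.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if tag in attrs:
            with_tag += 1
        else:
            without_tag += 1
    return with_tag, without_tag
```

To run it on a real file, pass an open file handle, for example `exon_parent_summary(open("annot.gff3"))`, where annot.gff3 is a placeholder file name. A non-zero second value means some exon lines lack the Parent tag.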


        • Alex -

          Worked like a charm! Thank you for your help!


          • Hello,

            Say I would like to map my data to both the human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed some light on how this can be done effectively?

            Thanks a lot in advance.

            Cheers


            • Originally posted by pandamon
              Hello,

              Say I would like to map my data to both the human and mouse reference genomes. Can I do both mappings simultaneously with STAR? Can anyone shed some light on how this can be done effectively?

              Thanks a lot in advance.

              Cheers

              Hi,

              the best way is to generate the genome index for a combined mouse and human reference. This will require ~60GB of RAM. You would need to make the chromosome names distinct between the mouse and human genome FASTA files (say, add m to the mouse chromosome names). You also need to apply the same renaming to the chromosome names in the annotation GTFs. The GTFs (say, GENCODE) typically have distinct transcript names for different species; if not, you would have to rename them as well. The GTF files from the two species then have to be concatenated.

              Cheers
              Alex
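The renaming step described above can be sketched in Python as follows. This is a minimal illustration, not part of STAR: the function names prefix_fasta and prefix_gtf are hypothetical, and the sketch assumes that the chromosome name is the first token of each FASTA header and the first column of each GTF line.

```python
def prefix_fasta(lines, prefix):
    """Yield FASTA lines with `prefix` added to every sequence name."""
    for line in lines:
        if line.startswith(">"):
            # Header line: insert the prefix right after the '>' marker.
            yield ">" + prefix + line[1:]
        else:
            yield line


def prefix_gtf(lines, prefix):
    """Yield GTF lines with `prefix` added to the chromosome column."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            # Comment and blank lines are passed through unchanged.
            yield line
        else:
            # The chromosome name is column 1, so prefixing the line is enough.
            yield prefix + line


# Example with in-memory data; real files would be read and written line by line.
mouse_fasta = [">chr1 mouse chromosome 1\n", "ACGTACGT\n"]
mouse_gtf = ['chr1\tsrc\texon\t1\t8\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n']
renamed_fasta = list(prefix_fasta(mouse_fasta, "m"))
renamed_gtf = list(prefix_gtf(mouse_gtf, "m"))
```

The renamed mouse files can then be concatenated with the unmodified human FASTA and GTF before running genome generation. Using the same prefix in both functions keeps the FASTA and GTF chromosome names in agreement.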


              • --quantMode GeneCounts short read error

                I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and low-quality read ends (Phred score < 10) using Cutadapt with the arguments below.
                Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.

                STAR mapping/counting fails with the following error message:
                EXITING because of FATAL ERROR in reads input: short read sequence line: 1
                Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
                Read Sequence====
                DEF_readNameLengthMax=50000
                DEF_readSeqLengthMax=500

                Nov 28 17:54:22 ...... FATAL ERROR, exiting

                Is it because of the trimming with Cutadapt? Is STAR failing to process 'zero length' reads? Thanks for your help. The Log.out file is attached.

                cutadapt 1.8.3:
                cutadapt -q 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o inpath/sample_R1.fastq.gz -p inpath/sample_R2.fastq.gz pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz


                STAR 2.4.2a:
                STAR --genomeLoad NoSharedMemory --genomeSAsparseD 2 --outSAMstrandField intronMotif --genomeDir pathto/STARgenome --sjdbGTFfile pathto/STARgenome/gencode.v23.annotation.gtf --runThreadN 2 --quantMode GeneCounts --readFilesIn pathto/sample_R1.fastq.gz pathto/sample_R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix sample --outSAMtype BAM Unsorted --outStd BAM_Unsorted

                • What's the probability that I will be able to get a Windows executable that runs from the command line (cmd.exe) on my computer and also from batch files?


                  • Originally posted by rvann
                    I am trying to use STAR to perform simultaneous alignment and gene counting as described in the 2.4.2a manual. I have Illumina TruSeq Total RNA preps sequenced on a HiSeq 4000. I trimmed Illumina adapters and low-quality read ends (Phred score < 10) using Cutadapt with the arguments below.
                    Our cluster has a 30 GB RAM limit for jobs, so I first indexed the genome using --genomeSAsparseD 2.

                    STAR mapping/counting fails with the following error message:
                    EXITING because of FATAL ERROR in reads input: short read sequence line: 1
                    Read Name=@K00152:8:H3MF5BBXX:4:1101:30404:1773
                    Read Sequence====
                    DEF_readNameLengthMax=50000
                    DEF_readSeqLengthMax=500

                    Nov 28 17:54:22 ...... FATAL ERROR, exiting

                    Hi @rvann,

                    the error is caused by the zero-length read sequence; STAR cannot process such reads.
                    Hopefully, cutadapt has an option to remove them. This has to be done simultaneously in the read1 and read2 files to preserve the order of the reads.

                    Cheers
                    Alex
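Cutadapt documents a --minimum-length (-m) option for discarding short reads; how it treats the mate in paired-end mode should be checked in the documentation for the installed version. To illustrate what "simultaneously" means here, the Python sketch below drops a pair whenever either mate is shorter than a minimum length, so the two outputs stay in the same order. The function names fastq_records and filter_pairs are hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
from itertools import islice


def fastq_records(lines):
    """Group an iterable of FASTQ lines into 4-line records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def filter_pairs(r1_lines, r2_lines, min_len=1):
    """Yield (read1, read2) record pairs where both mates are long enough.

    The two inputs are assumed to list mates in the same order. A pair is
    dropped as a whole if either sequence is shorter than `min_len`, which
    keeps the two output files synchronised.
    """
    pairs = zip(fastq_records(r1_lines), fastq_records(r2_lines))
    for rec1, rec2 in pairs:
        len1 = len(rec1[1].strip())
        len2 = len(rec2[1].strip())
        if len1 >= min_len and len2 >= min_len:
            yield rec1, rec2
```

For gzipped input, the line iterables can come from `gzip.open(path, "rt")`, and each kept record can be written back with `writelines`. Note that `zip` stops at the shorter input, so files with unequal record counts should be checked separately.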


                    • Originally posted by gsgs
                      What's the probability that I will be able to get a Windows executable that runs from the command line (cmd.exe) on my computer and also from batch files?

                      Hi @gsgs,

                      the probability is >0, as I know people who are working on a Windows executable for STAR.

                      Cheers
                      Alex


                      • Thanks.

                        In the meantime I found the manual and looked through it.
                        Whenever it mentions "Windows" it means the boxes
                        that may open ... so Windows support is just not considered.

                        I'm not familiar with Linux/Unix, but I remember that
                        I could often compile similar programs with my
                        old GCC / DJGPP compiler on Windows.
                        I haven't done this in a long time, and there are many
                        files here with .h and .cpp extensions to be included
                        (why is it so complicated?),
                        so there is a long list of potential problems.

                        I don't understand why this conversion is so difficult
                        and why there is no solution for it already.

                        I mean, it should be a small step compared to
                        creating the programs and getting them to work in the first place?!

                        Currently I'm using MAFFT; the author helped me
                        get a Windows executable and showed me how to run it from batch files.
                        It uses the fast Fourier transform, but it became clear to me
                        that this is slow for large problems
                        and that there should be a faster solution based on finding matching
                        subsequences.


                        • So, what can I do?

                          ---------------------------------------
                          Buy another computer with Linux on it and install STAR on it.
                          Whenever I have a big Windows/DOS FASTA file to be aligned
                          (delete the carriage returns, since Linux doesn't like them?),
                          copy it from the Windows/DOS HD to a micro-SD card, insert the card into the Linux computer,
                          run STAR on it, insert the card into the Windows computer,
                          copy the result back to the Windows HD, and re-insert the carriage returns.
                          -----------------------------------------

                          Do you sell such a Linux computer, with STAR suitably installed on micro-SD?
                          Easy to use: it boots from SD, aligns the FASTA file fasta01.fa on it, writes
                          the result to fasta02.fa and shuts down.

                          No display or keyboard needed; a Raspberry Pi?!


                          • Hi @gsgs

                            porting software designed for Linux to Windows is not an easy task.
                            As I mentioned, there is a serious effort to do that, and I will ask whether there is an ETA.
                            In the meantime, I could suggest the following workarounds (in order of increasing difficulty):

                            1. Use the Amazon or Google computing clouds. It will cost you a few dollars per run.
                            2. Run a virtual Linux machine on your Windows server.
                            3. Make your server dual-boot Windows/Linux, with a shared FAT partition to transfer data.
                            4. Try to compile and run STAR under the Cygwin Linux-like environment. This should be easier than full porting; however, I am not sure whether it will work.

                            Cheers
                            Alex


                            • Hi Alex,

                              Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:

                              N_unmapped 3350825 3350825 3350825
                              N_multimapping 2233686 2233686 2233686
                              N_noFeature 4913585 40288551 40271442
                              N_ambiguous 0 0 0

                              My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:

                              ##gff-version 3
                              ##annot-version v1.0
                              scaffold_1 phytozomev10 gene 10215584 10239664 . + . ID=Thhalv10024176m.g.v1.0;Name=Thhalv10024176m.g
                              scaffold_1 phytozomev10 mRNA 10215584 10239664 . + . ID=Thhalv10024176m.v1.0;Name=Thhalv10024176m;pacid=20194900;longest=1;Parent=Thhalv10024176m.g.v1.0
                              scaffold_1 phytozomev10 exon 10215584 10215918 . + . ID=Thhalv10024176m.v1.0.exon.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 five_prime_UTR 10215584 10215821 . + . ID=Thhalv10024176m.v1.0.five_prime_UTR.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 CDS 10215822 10215918 . + 0 ID=Thhalv10024176m.v1.0.CDS.1;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 exon 10216476 10216579 . + . ID=Thhalv10024176m.v1.0.exon.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 CDS 10216476 10216579 . + 2 ID=Thhalv10024176m.v1.0.CDS.2;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 exon 10216865 10216999 . + . ID=Thhalv10024176m.v1.0.exon.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 CDS 10216865 10216999 . + 0 ID=Thhalv10024176m.v1.0.CDS.3;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 exon 10217082 10217132 . + . ID=Thhalv10024176m.v1.0.exon.4;Parent=Thhalv10024176m.v1.0;pacid=20194900
                              scaffold_1 phytozomev10 CDS 10217082 10217132 . + 0 ID=Thhalv10024176m.v1.0.CDS.4;Parent=Thhalv10024176m.v1.0;pacid=20194900

                              I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.

                              Thanks!


                              • Originally posted by salamay
                                Hi Alex,

                                Your software is really great! I had a quick question about the --quantMode GeneCounts function. I seem to be getting mostly noFeature hits. This is what is in my ReadsPerGene.out.tab file:

                                N_unmapped 3350825 3350825 3350825
                                N_multimapping 2233686 2233686 2233686
                                N_noFeature 4913585 40288551 40271442
                                N_ambiguous 0 0 0

                                My genome generation step included the annotation file, which is a gff3 file. Since it's a gff3 file, I added the parameters "--sjdbGTFfile xx.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon exon". This is a subset of the gff3 file:


                                I'm fairly certain that I should not be getting ~40 million no feature hits when counting. Do you have any idea as to what the problem might be? Please let me know if any more information is needed.

                                Thanks!

                                Hi Yasser,

                                at the moment the best option is to convert the GFF3 file into a GTF file.
                                For instance, you can use the gffread tool from the Cufflinks package:
                                $ gffread -T annot.gff3 -o annot.gtf
                                It creates a GTF file with proper transcript_id and gene_id tags, which you can supply as --sjdbGTFfile without any Parent options.

                                Cheers
                                Alex
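After regenerating the index with the converted GTF and remapping, one way to see whether the noFeature problem has gone away is to compute the noFeature fraction for each count column of ReadsPerGene.out.tab. The Python sketch below is an illustration only: the function name no_feature_fractions and the labels are hypothetical, and it assumes the layout described in the STAR manual (gene ID, then counts for unstranded data, for read 1 on the RNA strand, and for read 2 on the RNA strand, with the four N_ summary lines at the top).

```python
def no_feature_fractions(lines):
    """Return the N_noFeature fraction for each of the three count columns.

    The fraction is N_noFeature divided by all uniquely mapped reads in that
    column (reads assigned to genes + N_noFeature + N_ambiguous).
    """
    labels = ("unstranded", "read1_strand", "read2_strand")
    special = {}
    assigned = [0, 0, 0]
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        name = parts[0]
        counts = [int(value) for value in parts[1:]]
        if name.startswith("N_"):
            # Summary rows such as N_unmapped or N_noFeature.
            special[name] = counts
        else:
            # Regular gene rows: accumulate per-column totals.
            assigned = [a + c for a, c in zip(assigned, counts)]
    no_feature = special.get("N_noFeature", [0, 0, 0])
    ambiguous = special.get("N_ambiguous", [0, 0, 0])
    fractions = {}
    for i, label in enumerate(labels):
        total = assigned[i] + no_feature[i] + ambiguous[i]
        fractions[label] = no_feature[i] / total if total else 0.0
    return fractions
```

Calling `no_feature_fractions(open("ReadsPerGene.out.tab"))` returns one fraction per column; a value close to 1 in every column suggests that the annotation is still not being matched to the alignments.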
