Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jiwu2573
    Member
    • Jun 2009
    • 34

    #16
    Another check:

    My hg18.fa constructed from hg18 UCSC from TopHat website has the size of 3131776827 bytes, same as yours?

    Comment

    • Xi Wang
      Senior Member
      • Oct 2009
      • 317

      #17
      Originally posted by jiwu2573 View Post
      I only know my fastq file for 1 sample is around 3 GB after converting and joining all the 120 qseq.txt files, not sure how to find out how many reads in total? How do you know?
      I guess there are more 10 million reads. You can count the number of lines of FASEQ file, and the number of reads equals to 1/4 of the number of lines.

      The read length is 76 bp. I am running tophat with the default configuration without any argument except --solexa1.3-quals. I guess you are designating the number of mismatches, number of multi-aligned loci by the argument. If that's the case, what number do you use?
      I usually discard all the multi-reads, by specifying "-g 1". I think it is normal to take several hours to finish your job.

      PS. I am running TopHat through univ connection to the Linux server. Is it supposed to be faster than running on my local computer? How many processors do you have in your computer? Is a normal PC enough?
      I think so. The computational server will be more efficient than PC.
      Xi Wang

      Comment

      • Xi Wang
        Senior Member
        • Oct 2009
        • 317

        #18
        Originally posted by jiwu2573 View Post
        Another question:

        Is there any need to run Bowtie alone as TopHat will call Bowtie anyway?
        You can just run Tophat.
        Xi Wang

        Comment

        • Xi Wang
          Senior Member
          • Oct 2009
          • 317

          #19
          Originally posted by jiwu2573 View Post
          Another check:

          My hg18.fa constructed from hg18 UCSC from TopHat website has the size of 3131776827 bytes, same as yours?
          It should be ok. The size of mine is 3169831337. How about anybody else?
          Xi Wang

          Comment

          • jiwu2573
            Member
            • Jun 2009
            • 34

            #20
            Hi Xi,

            Thanks a lot for your help.

            Last night I managed to run 1 sample by tophat successfully (it took 17 hours).

            I tried to visualize the output, coverage.wig and junctions.bed in UCSC genome browser.

            When I load coverage.wig, it shows Error File 'coverage.wig' - Error line 3771557 of custom track: chromEnd less than 1 (0)

            When I load junctions.bed, it only shows chromosome 20?
            Name Description Type Doc Items Pos
            junctions TopHat junctions bed 80207 chr20:

            By the way, my junctions file is 6430K and my coverage file is around 173M, are these normal?

            How do you do your visualization?

            Do you use Cufflinks to quantify the expression after TopHat?

            Comment

            • Xi Wang
              Senior Member
              • Oct 2009
              • 317

              #21
              1. for 'coverage.wig', you can use a text editor to see what happened in line 3771557.
              2. for "junctions.bed", you can switch the chr by typing other chromosome ID in the blank following "position/search" . E.g: chr1
              3. the sizes of the file look fine.

              You can alternatively use some local viewers to visualize the results. But UCSC genome browser is recommended. And Cufflink is useful for identifying transcripts with expression.
              Xi Wang

              Comment

              • jiwu2573
                Member
                • Jun 2009
                • 34

                #22
                Hi Xi,

                After I run Cufflinks and Cuffcompare, any output files can be used directly to feed into DEGseq to get differential expression analysis? I have 2 groups with each group 2 replicates.

                If conversion of files are needed, could you please tell me the easiest and fastest way to get DEGseq running for my analysis done by TopHat and then Cufflinks and Cuffcompare?

                Comment

                • jiwu2573
                  Member
                  • Jun 2009
                  • 34

                  #23
                  Another question is about identification of novel transcripts.

                  I understand that junctions.bed gives out all the possible splices, but how can we dig out the novel form from all these data? Certainly not by eyeballing through UCSC genome browser?

                  Comment

                  • jiwu2573
                    Member
                    • Jun 2009
                    • 34

                    #24
                    I also get huge size of tophat_out folder (~22 GB for 1 human RNA seq).

                    Do you recommend to keep all these folders untouched or just keep sam, wig, and bed files instead? The inside folder log and tmp not useful at all?

                    Comment

                    • Xi Wang
                      Senior Member
                      • Oct 2009
                      • 317

                      #25
                      1. DEGseq: you can use my perl script for the conversion from SAM to BED (attached). I didn't test the script much, if any problem, let me know.
                      2. New transcript/junction: there should be a tool for doing such thing. Anyone knows such tools?
                      3. Tophat_out folder: the newest version of tophat will delete the tmp files automaticly, and I don't think the tmp files much help.
                      Attached Files
                      Xi Wang

                      Comment

                      • jiwu2573
                        Member
                        • Jun 2009
                        • 34

                        #26
                        Thanks!

                        I've converted all my samples. No error message!

                        I know there are paper and manual explaining how to use DEGSeq, I am just a bit lazy, well, it does take long for me to digest everything.

                        Could you please just tell me the command line I put in for my 4 bed files (2 groups with 2 samples in each group) for DE gene analysis?

                        Comment

                        • Xi Wang
                          Senior Member
                          • Oct 2009
                          • 317

                          #27
                          I put an example below. Before you run DEGseq, you should first download a gene annotation file in refFlat format, maybe from UCSC genome browser. Some parameters for DEGseq should also be tuned according to your analysis purpose.

                          Code:
                            mapResultBatch1 <- c("group1_file1", "group1_file2")  
                            mapResultBatch2 <- c("group2_file2", "group2_file2")
                            outputDir <- "DEGseq_out_dir"
                            refFlat <- "gene annotation in refFlat format"
                            DEGseq(mapResultBatch1, mapResultBatch2, fileFormat="bed", refFlat=refFlat,
                                   outputDir=outputDir, method="MARS")
                          Last edited by Xi Wang; 01-21-2010, 08:24 PM.
                          Xi Wang

                          Comment

                          • jiwu2573
                            Member
                            • Jun 2009
                            • 34

                            #28
                            Hi Xi,

                            I've run the DEGseq, it works!

                            But I doubt actually I should use samWrapper instead?
                            2 groups are case and control, 2 samples in each group are from different individuals.
                            DEGseq seems to deal with the same sample with technology replicates, but not biology replicates.

                            Is my understanding correct? If use samWrapper, could you please tell me the command line as I can hardly copy your examples in the manual? Thanks

                            Comment

                            • Xi Wang
                              Senior Member
                              • Oct 2009
                              • 317

                              #29
                              Hi,

                              Yes, for biological replicates, it is better to use samWrapper. But I am afraid that your sample size (2 vs 2) is a little bit small.
                              A piece of code example for samWrapper:
                              Code:
                              # get gene express
                                mapResultBatch <- c("group1_file1", "group1_file2", "group2_file2", "group2_file2")  
                                output <- "gene.exp"
                                refFlat <- "gene annotation in refFlat format"
                                exp <- getGeneExp(mapResultBatch, refFlat = refFlat, output = output)
                              
                              # apply samWrapper
                              geneExpFile <- "gene.exp"
                              set.seed(100)
                              geneExpFile1 <- geneExpFile
                              geneExpFile2 <- geneExpFile
                              output <- "samWrapperOut.txt"
                              expCol1 = c(2, 4)
                              expCol2 = c(6, 8)
                              measure1 = c(-1, -2)
                              measure2 = c(1, 2)
                              samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, 
                              geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, 
                              nperms = 100, min.foldchange = 2, max.qValue = 1e-04, output = output, paired = TRUE)
                              Xi Wang

                              Comment

                              • jiwu2573
                                Member
                                • Jun 2009
                                • 34

                                #30
                                Hi Xi,

                                The samWrapper also works well. But I am a bit confused about the column designation.

                                When I looked at the output "gene.exp", 1st col gives the gene ID, 2nd col for raw counts, 3rd col for RPKM, and 4th for all reads, then the pattern keeps on repeating for each sample. In this order, expCol1 should be col 2 and 5 (raw counts columns), but how comes 2 and 4 as you wrote?

                                Also I don't understand measure1 or measure2. What's the use of them? By the way, my samples are unpaired.

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                  by SEQadmin2


                                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                  Here are nine questions we think about, in roughly the order they matter, before...
                                  06-18-2026, 07:11 AM
                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, 06-17-2026, 06:09 AM
                                0 responses
                                24 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-09-2026, 11:58 AM
                                0 responses
                                42 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-05-2026, 10:09 AM
                                0 responses
                                48 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-04-2026, 08:59 AM
                                0 responses
                                49 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...