  • Help with Cufflinks

    Hi,

    I am quite new to next-gen sequence analysis.

    I used Bowtie to align our colorspace data to the reference genome, and it was quite fast. Cufflinks ran fine on a single .sam file, but the problem is that I have 4 experimental replicates. Does anyone have a suggestion for how I can combine these 4 Cufflinks outputs?

    Is there a feature in Cufflinks that can take more than one .sam file as technical replicates?

    Many thanks

  • #2
    What are you trying to do with those replicates? If you are looking for replicate support of novel transcripts, you can probably do it with cuffcompare and then set whatever prevalence thresholds you feel comfortable with. Version 0.8.3 isn't replicate-aware during the transcript reconstruction process though.

    P.S. Why are you using Bowtie alignments as input? You aren't going to see novel splices without TopHat. Also, what kind of reads do you have, and what is the purpose of the analysis?



    • #3
      GKM, Thanks for the response!

      We have colorspace data and, as far as I can see, TopHat is not compatible with colorspace reads. Maybe I have to convert them into MAQ format before I can use them with TopHat. Nevertheless, I have the SAM outputs from Bowtie, generated by aligning the raw reads to the genome.

      Now I want to cluster the transcripts against the genome so that they can aid in gene model correction. I am not sure how the replicates will fare in this process. I will certainly try 'cuffcompare' and see what it says about the differences.

      Any advice in this regard would be great.

      Thanks



      • #4
        In order for running Cufflinks to make sense, you will need spliced reads. From what you describe, I get the impression that you are aligning against the genome without the junctions, so you are missing those completely.

        What you should do is convert the reads to fastq (I haven't worked with SOLiD RNA-Seq data myself, so I am not familiar with the available options for doing this), then run TopHat (make sure you supply it with the correct parameters: insert length, junctions if you have short reads, etc.), then run Cufflinks.

        After you run Cufflinks, you can use cuffcompare to compare the result to the existing annotation.
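The order of steps described above can be sketched as a small Python helper that only assembles the three command lines and does not run anything. This is a rough sketch under stated assumptions: the index name, read file, and annotation file are hypothetical placeholders, the `-o` option of TopHat and the `-r`/`-o` options of cuffcompare should be checked against the manual of the installed release, and the output names `accepted_hits.sam` and `transcripts.gtf` are assumed and may differ between versions.

```python
import shlex


def build_pipeline(index_base, reads_fastq, reference_gtf, out_dir="tophat_out"):
    """Return the TopHat -> Cufflinks -> cuffcompare commands as argument lists.

    Nothing is executed here; the lists can be handed to subprocess.run later.
    All file names are illustrative assumptions, not values from the thread.
    """
    # Spliced alignment of base-space fastq reads against a Bowtie index.
    tophat = ["tophat", "-o", out_dir, index_base, reads_fastq]
    # Transcript assembly from the alignments TopHat wrote (assumed file name).
    cufflinks = ["cufflinks", out_dir + "/accepted_hits.sam"]
    # Comparison of the assembled transcripts with the existing annotation.
    cuffcompare = ["cuffcompare", "-r", reference_gtf, "-o", "cmp", "transcripts.gtf"]
    return [tophat, cufflinks, cuffcompare]


if __name__ == "__main__":
    for cmd in build_pipeline("genome_index", "reads.fastq", "annotation.gtf"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the commands as lists makes it easy to inspect them before running each step by hand or through `subprocess.run`.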



        • #5
          Thanks GKM I really appreciate it!!

          Is there any other software available for RNAseq clustering? And I still am just curious to know what people do when they have experimental replicates.

          Thanks



          • #6
            I am pretty sure replicate support (i.e. running Cufflinks on replicates) is being worked on by the Cufflinks developers, but I have no idea how soon it will be available to the wider community. I haven't used any other software.

            In the meantime, my advice is to run it on each replicate, then run cuffcompare and look in the tracking files for transcripts found in all replicates.
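Filtering the tracking file for transcripts seen in every replicate could look roughly like the sketch below. It assumes the tab-delimited layout described in the Cufflinks manual: four leading columns (transfrag id, locus id, reference id, class code) followed by one column per input file, with "-" marking a transcript that is absent from that file. The example rows are made up for illustration and the sample cells are shortened placeholders.

```python
def transcripts_in_all_replicates(tracking_text):
    """Return ids of transfrags that have an entry in every sample column."""
    shared = []
    for line in tracking_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        transfrag_id = fields[0]
        samples = fields[4:]  # one column per replicate (assumed layout)
        if samples and all(cell != "-" for cell in samples):
            shared.append(transfrag_id)
    return shared


# Hypothetical tracking rows for three replicates.
EXAMPLE = "\n".join([
    "TCONS_00000001\tXLOC_000001\t-\tu\tq1:CUFF.1|CUFF.1.1\tq2:CUFF.3|CUFF.3.1\tq3:CUFF.2|CUFF.2.1",
    "TCONS_00000002\tXLOC_000002\t-\tu\tq1:CUFF.5|CUFF.5.1\t-\tq3:CUFF.9|CUFF.9.1",
    "TCONS_00000003\tXLOC_000003\t-\tu\t-\t-\tq3:CUFF.7|CUFF.7.1",
])

if __name__ == "__main__":
    print(transcripts_in_all_replicates(EXAMPLE))
```

A looser prevalence threshold (for example, present in at least 3 of 4 replicates) would only need the `all(...)` test replaced by a count.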



            • #7
              Originally posted by tsucheta:
              Thanks GKM I really appreciate it!!

              Is there any other software available for RNAseq clustering? And I still am just curious to know what people do when they have experimental replicates.

              Thanks

              Hi tsucheta,

              you're right, TopHat isn't able to handle color-space reads, so you definitely have to stick with Bowtie, which shouldn't be a problem at all as long as your reads are not too long. Unless you are interested in splice-junction tracking, a splice mapper like TopHat only makes sense for longer reads, but not necessarily for reads up to, let's say, 50 bp. This is because short reads are not expected to span exon boundaries to such a large extent that you will miss information.

              Anyway, if you've aligned your reads already, there are at least three packages out there that properly handle biological and technical replicates:
              1. edgeR
              2. DESeq
              3. DEGSeq


              edgeR appears to be the most mature one. DESeq is very similar to edgeR and appears to be more powerful in calling differential expression. DEGSeq follows a different statistical approach (controversially discussed), produces nice pictures implicitly and is very easy to use (although the others are as well).

              Uwe



              • #8
                I would say a splice-aware mapper like TopHat always makes sense when aligning RNA-seq reads, since you always lose information if you don't use one, regardless of the length of the reads. It is not more difficult to run TopHat than Bowtie, and it only takes a little longer. After aligning the reads, you can always choose to extract only the reads that did not span a splice junction, if that is what you wish.
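Separating spliced from unspliced alignments after the fact can be sketched as follows. In SAM, the CIGAR string is the sixth tab-separated field, and the `N` operation marks a skipped region of the reference, which is how a spliced aligner reports an intron. The function name and the two example records are made up for illustration.

```python
def split_by_splicing(sam_lines):
    """Split SAM alignment records into (unspliced, spliced) lists.

    Header lines starting with '@' are ignored. Unmapped reads (CIGAR '*')
    contain no 'N' and therefore end up in the unspliced list.
    """
    unspliced, spliced = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        cigar = line.split("\t")[5]  # sixth SAM column
        (spliced if "N" in cigar else unspliced).append(line)
    return unspliced, spliced


# Two hypothetical 50 bp alignments: one contiguous, one spanning an intron.
SEQ, QUAL = "A" * 50, "I" * 50
EXAMPLE_SAM = [
    "@SQ\tSN:chr1\tLN:100000",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t" + SEQ + "\t" + QUAL,
    "read2\t0\tchr1\t200\t255\t20M300N30M\t*\t0\t0\t" + SEQ + "\t" + QUAL,
]

if __name__ == "__main__":
    unspliced, spliced = split_by_splicing(EXAMPLE_SAM)
    print(len(unspliced), "unspliced,", len(spliced), "spliced")
```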



                • #9
                  Thanks for your posts! They have been really useful. I have not yet tried the following software packages
                  1. edgeR
                  2. DESeq
                  3. DEGSeq
                  because I am still stuck with TopHat. I was able to convert the sequences to fastq, but running TopHat ends with the following errors:

                  --------
                  [Wed Sep 8 18:38:48 2010] Beginning TopHat run (v1.0.14)
                  -----------------------------------------------
                  [Wed Sep 8 18:38:48 2010] Preparing output location ./tophat_out/
                  [Wed Sep 8 18:38:48 2010] Checking for Bowtie index files
                  [Wed Sep 8 18:38:48 2010] Checking for reference FASTA file
                  Warning: Could not find FASTA file /home/data/bowtie-0.12.5/index/sojaeV1.fa
                  [Wed Sep 8 18:38:48 2010] Reconstituting reference FASTA file from Bowtie index
                  [Wed Sep 8 18:39:07 2010] Checking for Bowtie
                  Bowtie version: 0.12.5.0
                  [Wed Sep 8 18:39:07 2010] Checking reads
                  seed length: 50bp
                  format: fastq
                  quality scale: phred33 (default)
                  [Wed Sep 8 18:43:08 2010] Reading known junctions from GFF file
                  Warning: TopHat did not find any junctions in GFF file
                  [Wed Sep 8 18:44:05 2010] Mapping reads against sojaeV1 with Bowtie
                  [Wed Sep 8 18:44:05 2010] Joining segment hits
                  Traceback (most recent call last):
                    File "/home/data/tophat-1.0.14.Linux_x86_64/tophat", line 1854, in <module>
                      sys.exit(main())
                    File "/home/data/tophat-1.0.14.Linux_x86_64/tophat", line 1814, in main
                      user_supplied_juncs)
                    File "/home/data/tophat-1.0.14.Linux_x86_64/tophat", line 1562, in spliced_alignment
                      segment_len)
                    File "/home/data/tophat-1.0.14.Linux_x86_64/tophat", line 1229, in split_reads
                      reads_file = open(reads_filename)
                  IOError: [Errno 2] No such file or directory: './tophat_out/tmp//left_kept_reads_missing.fq'

                  --
                  I am running the binary tophat distribution.

                  Many thanks



                  • #10
                    Hi tsucheta,

                    I'm not really sure what you fed into TopHat. Fastq, but in which encoding? TopHat will not work with color-space reads, as I mentioned. So I assume you used a script to translate them into base-space fastq and ran TopHat with that?

                    But that's exactly what I was referring to. You will certainly not miss that much information by sticking with Bowtie instead of TopHat for 50 bp reads. However, translating color-space reads (whatever the format) to base-space reads (whatever the format) introduces a vast number of nucleotide misinterpretations unless they are properly decoded by a color-space aligner. Just imagine you have a color-space read that aligns perfectly to the reference genome: translating it to base-space and then aligning in base-space is no problem. Now imagine you have a color-space read that aligns but has a single SNP (or sequencing error) near its 5' end. This read could still be aligned well in color-space. But translating it into base-space without the knowledge that there is indeed a SNP leads to an almost completely different read in base-space and therefore results in no alignment at all! This has been discussed several times.
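The propagation effect described above can be reproduced with a naive decoder. This sketch assumes the usual two-base encoding in which A, C, G, T are numbered 0 to 3 and each color is the XOR of the two adjacent base codes, so the next base is the previous base XOR the color. The reads are made up, and the example covers the case of a single miscalled color near the 5' end.

```python
BASES = "ACGT"


def decode_colorspace(read):
    """Naively translate a color-space read (primer base + colors) to bases.

    Each decoded base depends on the previous one, so one wrong color
    changes every base downstream of it.
    """
    prev = BASES.index(read[0])  # known primer base, e.g. 'T'
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            raise ValueError("cannot decode color %r" % color)
        prev ^= int(color)
        decoded.append(BASES[prev])
    return "".join(decoded)


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


if __name__ == "__main__":
    good = decode_colorspace("T012301")
    bad = decode_colorspace("T212301")  # first color miscalled (0 -> 2)
    print(good, bad, count_mismatches(good, bad))
```

A color-space aligner avoids this by comparing colors directly against a color-space version of the reference, where the same error costs a single mismatch.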

                    After all, I cannot tell for sure what problem TopHat actually has. The error message means that not a single one of your "left reads" has been properly aligned, because the file that should contain that information doesn't even exist. So the question remains: what exactly did you feed into TopHat? And by the way, do you really have paired-end data? I thought it wouldn't be available before SOLiD4. And even if so, as far as I know paired-end sequencing with SOLiD4 produces asymmetric read lengths, which (according to the TopHat manual) is again not supported (all reads need to have the same length).

                    Could you please explain in more detail what exactly you did? And please also provide the command line used to invoke TopHat.

                    Best,
                    Uwe


                    ps: just noticed the warnings Tophat provided.
                    Code:
                    Warning: Could not find FASTA file /home/data/bowtie-0.12.5/index/sojaeV1.fa
                    I dimly remember that TopHat needs the FASTA file of the reference genome in addition to the Bowtie index. I don't know whether this is still the case, but given the warning and the fact that no reads were aligned, it's worth a try.
                    Last edited by Uwe Appelt; 09-08-2010, 11:56 PM.



                    • #11
                      Thanks Uwe for the responses! I have figured out why TopHat was exiting: I was trying to run it with a colorspace-indexed reference. With that fixed, it runs fine.
                      As for alignment sensitivity, I could not agree with you more about the alignments lost via the TopHat route.
                      While translating the colorspace data into fastq format, I lost a number of reads because they contained either a "." character or a number >3.
                      I have not investigated the alignment qualities or the number of reads that failed to align with TopHat, but judging by the SAM output files alone, TopHat produces SAM files about 1/4 the size of the Bowtie SAM output. So there may be some information loss there.

                      In the coming days, I will compare the Bowtie -> Cufflinks output with the TopHat -> Cufflinks output to see if there is a major difference.
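To get a quick count of how many reads are dropped during translation for the reason mentioned above (a "." or a digit above 3 in the color string), a check along these lines could be run on the raw reads first. It is only a sketch: the function name and the example reads are made up, and it assumes each read is a primer base followed by color digits.

```python
VALID_COLORS = set("0123")


def partition_colorspace_reads(reads):
    """Split reads into (kept, rejected) by whether every color is 0-3."""
    kept, rejected = [], []
    for read in reads:
        is_valid = (
            len(read) > 1
            and read[0] in "ACGT"              # leading primer base
            and set(read[1:]) <= VALID_COLORS  # no '.' and no digit above 3
        )
        (kept if is_valid else rejected).append(read)
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = partition_colorspace_reads(["T0120", "T01.0", "T0140", "G3321"])
    print(len(kept), "kept,", len(rejected), "rejected")
```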

