  • #16
    convert_color_to_bp in Tophat still problematic

    I have installed the patch posted by dcjones on , but when running Tophat on SOLiD colorspace paired-end read data I seem to have uncovered another problem with convert_color_to_bp:


    tophat -C -F 0.10 -p 12 --mate-inner-dist 125 --mate-std-dev 25 --microexon-search --GFF ucsc-genes-with-GRCh-IDs.gtf GRCh37-lite_c test_F3.csfasta test_F5-P2.csfasta

    [Fri Oct 8 15:15:19 2010] Beginning TopHat run (v1.1.0)
    -----------------------------------------------
    [Fri Oct 8 15:15:19 2010] Preparing output location ./tophat_out/
    [Fri Oct 8 15:15:19 2010] Checking for Bowtie index files
    [Fri Oct 8 15:15:19 2010] Checking for reference FASTA file
    [Fri Oct 8 15:15:19 2010] Checking for Bowtie
    Bowtie version: 0.12.7.0
    [Fri Oct 8 15:15:19 2010] Checking for Samtools
    Samtools version: 0.1.8.0
    [Fri Oct 8 15:15:35 2010] Checking reads
    min read length: 25bp, max read length: 50bp
    format: fasta
    [Fri Oct 8 17:05:10 2010] Reading known junctions from GFF file
    [Fri Oct 8 18:01:36 2010] Mapping reads against GRCh37-lite_c with Bowtie
    [Fri Oct 8 23:30:08 2010] Joining segment hits
    [Sat Oct 9 04:14:44 2010] Mapping reads against GRCh37-lite_c with Bowtie(1/2)
    [Sat Oct 9 08:28:21 2010] Mapping reads against GRCh37-lite_c with Bowtie(2/2)
    [Sat Oct 9 12:59:31 2010] Mapping reads against GRCh37-lite_c with Bowtie
    [Sun Oct 10 00:08:07 2010] Joining segment hits
    Traceback (most recent call last):
    File "/usr/local/bin/tophat", line 2174, in <module>
    sys.exit(main())
    File "/usr/local/bin/tophat", line 2133, in main
    user_supplied_juncs)
    File "/usr/local/bin/tophat", line 1848, in spliced_alignment
    segment_len)
    File "/usr/local/bin/tophat", line 1570, in split_reads
    split_record(read_name, read_seq, read_quals, output_files, offsets, color)
    File "/usr/local/bin/tophat", line 1503, in split_record
    read_seq_temp = convert_color_to_bp(read_seq)
    File "/usr/local/bin/tophat", line 1477, in convert_color_to_bp
    base = decode_dic[base+ch]
    KeyError: '+1'
    make: *** [tophat_out/accepted_hits.sam] Error 1
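For anyone hitting the same error: the key '+1' suggests that the decoder was looking at a '+' where it expected a base, followed by the color call '1'. Since the run takes more than a day to reach this point, it may be worth scanning the input for malformed reads first. Below is a minimal Python sketch, assuming the usual csfasta layout (one primer base, then color calls 0-3, with '.' for a missing call). The name `find_bad_reads` is hypothetical and not part of TopHat.

```python
import re

# Expected shape of a colorspace read: one primer base, then color calls.
# '.' marks a missing color call (assumption: no other symbols are valid).
CS_READ = re.compile(r"^[ACGT][0-3.]+$")

def find_bad_reads(lines):
    """Return (read_name, sequence) pairs whose sequence is not valid colorspace."""
    bad = []
    name = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and csfasta comment headers
        if line.startswith(">"):
            name = line[1:]
        elif not CS_READ.match(line):
            bad.append((name, line))
    return bad
```

Calling `find_bad_reads(open("test_F3.csfasta"))` would list any reads that contain characters such as '+' or 'N', which could then be inspected or filtered before starting TopHat.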

    Comment


    • #17
      Originally posted by dcjones
      I don't think there is a problem with the '.'s needing to be 'N's. It expects '.'s in colorspace reads. The problem is that tophat converts the '.'s to 'N's on exactly one read (the last read), and it should not.

      I don't know that you can modify your reads to work around that.
      I see, all runs failed with "N" reads. I just started another 4 samples today with "." in the .csfasta files. We will see what happens next.

      If this fails I will try with the patch next!

      Comment


      • #18
        Originally posted by dsidote
        I used the precompiled version and it worked. Our sysadmin is recompiling the code with dcjones patch, so as soon as that is done I will test it with unmodified data.

        DerSeb: Did you try removing the reads with the missed colorcalls instead of converting to 'N' to see if the mixed colorspace-basespace is the issue?
        I have not yet tried that, but I tried both "." and "N" files now. N crashed after a few hours and I'm just waiting for the "." mapping to finish!

        Comment


        • #19
          BUT, the accepted_hits.bam file is empty! What did I do wrong this time?

          Apparently I have gremlins; another run worked fine.

          Comment


          • #20
            Version 1.1.1 on the main page apparently includes fixes for these bugs...

            Comment


            • #21
              I just ran Tophat -> Cufflinks with and without GTF files on SOLiD colorspace data smoothly, thanks to the developers the new version works like a charm!

              Comment


              • #22
                I ran TopHat on paired-end SOLiD reads, and used the output for Cufflinks. Cufflinks identified the input as single-end, 25-bp reads (it was actually PE 50+25 bp). Does this mean Cufflinks is not working, or TopHat?

                Comment


                • #23
                  Is there somebody with some clue on how to tune TopHat parameters? I just made a new thread for it:

                  Comment


                  • #24
                    Originally posted by krobison
                    Does someone (such as the Tophat team) have a small colorspace dataset which works in Tophat that they'd be willing & able to make public? Having a positive control would be awfully handy.
                    I think you can download this


                    This data set was generated by sequencing SOLiD™ Total RNA-Seq prepared libraries using paired-end reads of 50bp (forward) and 25 bp (reverse) on the SOLiD™ 4 System. The data provided is the mapping output and whole transcriptome results from the SOLiD™ BioScope™ 1.2.1 WT analysis pipeline.

                    Just grab the first few thousand for a small test dataset.
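Grabbing the first few thousand reads can be sketched in Python as below. This is only an illustration: `head_records` is a hypothetical helper, and it assumes each record is a '>' name line followed by its sequence (or quality) line, with optional '#' comment headers at the top of the file.

```python
def head_records(lines, n):
    """Yield the lines of the first n records, dropping '#' comment headers."""
    seen = 0
    for line in lines:
        if line.startswith("#"):
            continue  # comment header, not part of any record
        if line.startswith(">"):
            seen += 1
            if seen > n:
                return  # stop before the (n+1)-th record
        yield line
```

The same function works on a .qual file, since it uses the same '>' record layout. For paired-end data, the subset should contain the same reads in both files, which a plain "first n records" cut does not guarantee if the two files list different reads.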

                    I am trying to run it on real-life human RNA-seq data (single-end, 50 bp),
                    and it is taking forever to search for junctions via segment mapping.

                    Has anyone completed a single-end 50 bp SOLiD alignment with TopHat?
                    http://kevin-gattaca.blogspot.com/

                    Comment


                    • #25
                      Thanks -- you do have to go thru the ABI folks to actually get access to the data, but I did succeed in getting TopHat to run on this.

                      I have gotten TopHat to run successfully on the SE datasets from the SRA -- it just requires trimming the first quality value out & rewriting the FASTQ as csfasta+qual file pairs.

                      Code:
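                      #!/usr/bin/perl
                      use strict;
                      use warnings;
                      
                      # reformat single-end SOLiD FASTQ data from Short Read Archive
                      # to work successfully with patched version of TopHat 1.1.0
                      
                      foreach my $arg (@ARGV)
                      {
                          # escape the dot so that only a literal ".fastq" suffix matches
                          my ($stem) = ($arg =~ /(.*)\.fastq$/);
                          die "Could not identify stem in $arg\n" unless (defined $stem);
                          
                          open(my $in,    '<', $arg)            or die "Cannot read $arg: $!\n";
                          open(my $fasta, '>', "$stem.csfasta") or die "Cannot write $stem.csfasta: $!\n";
                          open(my $qual,  '>', "$stem.qual")    or die "Cannot write $stem.qual: $!\n";
                          while (my $idLineA = <$in>)
                          {
                              chomp($idLineA);
                              # read name: skip the leading '@', keep everything up to the first space
                              my ($id) = ($idLineA =~ /^.([^ ]+)/);
                              my $seqLine  = <$in>;
                              my $idLineB  = <$in>;
                              my $qualLine = <$in>;
                              chomp($qualLine);
                              my @qualVals = ();
                              foreach my $qualChar (split(//, $qualLine))
                              {
                                  # qualities are ASCII characters offset by 33
                                  my $qualVal = ord($qualChar) - 33;
                                  if ($qualVal < 0)
                                  {
                                      $qualVal = 0;
                                      print STDERR ">$qualChar< for $idLineB\n";
                                  }
                                  push(@qualVals, $qualVal);
                              }
                              shift(@qualVals); # dump first qual val
                              print $fasta ">$id\n";
                              print $fasta $seqLine;
                              print $qual ">$id\n";
                              print $qual join(" ", @qualVals), "\n";
                          }
                          close($in);
                          close($fasta) or die "Error closing $stem.csfasta: $!\n";
                          close($qual)  or die "Error closing $stem.qual: $!\n";
                      }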
                      After reformatting, my command line looked like this (you may need to change the path to the .gtf file):
                      Code:
                      tophat --color -G $BOWTIE_INDEXES/hg18.ref-genes.gtf -o SRR040361-tophat -p 8 --quals hg18  SRR040361.csfasta  SRR040361.qual 1> tophat.2.out 2> tophat.2.err

                      Comment


                      • #26
                        Just to let you know, I have gotten SE 50bp and PE 50bp & 25bp SOLiD data to work. I use version 1.1 and trim the headers by hand, also replacing -1 values with 0.
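The hand edits described above (removing the headers and replacing -1 with 0) could be scripted along these lines. This is a sketch under assumptions: `clean_qual_lines` is a hypothetical name, header lines are taken to be those starting with '#', and every negative quality value is clamped to 0, which covers the -1 case.

```python
def clean_qual_lines(lines):
    """Drop '#' header lines and replace negative quality values with 0."""
    for line in lines:
        if line.startswith("#"):
            continue  # comment header at the top of the file
        if line.startswith(">"):
            yield line  # read name, passed through unchanged
            continue
        # quality line: space-separated integers, -1 marks a missing call
        vals = [max(0, int(v)) for v in line.split()]
        yield " ".join(str(v) for v in vals) + "\n"
```

For example, `out.writelines(clean_qual_lines(open("reads.qual")))` would write a cleaned copy to an already opened output file `out`; the file names here are placeholders.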

                        There was a problem with some files not running through properly with the error I posted above.

                        The solution: it didn't work to combine two csfasta and qual files using

                        Code:
                         cat 1.csfasta | cat 2.csfasta > 1and2.csfasta
                        so I used:

                        Code:
                         cat 2.csfasta | cat 1.csfasta > 2and1.csfasta
                        and it worked! I wonder what the reason was for this? (I also got the same error using only the single files!)
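A possible explanation: in a pipeline such as `cat 1.csfasta | cat 2.csfasta`, the second `cat` is given a file name, so it does not read the piped input at all. The output of the first command therefore contains only 2.csfasta, and the output of the second contains only 1.csfasta; neither file is a combination of the two. That would point to a problem in the content of one input file rather than in the order. To actually concatenate, both names go to a single command (`cat 1.csfasta 2.csfasta > 1and2.csfasta`), or, as a small Python sketch with the hypothetical helper `concat_files`:

```python
import shutil

def concat_files(paths, out_path):
    """Write the contents of each input file to out_path, in the given order."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

The .qual files need to be concatenated in the same order as the .csfasta files so that reads and qualities stay paired.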

                        Comment


                        • #27
                          Did you try feeding the output from the PE data into cufflinks? If so, did it report 50+25 bp reads?

                          Comment


                          • #28
                            I've just tried running .csfasta and .qual files I got straight off the SOLiD run cluster with the newest Tophat. I got this error:

                            Traceback (most recent call last):
                            File "./tophat", line 2166, in ?
                            sys.exit(main())
                            File "./tophat", line 2125, in main
                            user_supplied_juncs)
                            File "./tophat", line 1840, in spliced_alignment
                            segment_len)
                            File "./tophat", line 1562, in split_reads
                            split_record(read_name, read_seq, read_quals, output_files, offsets, color)
                            File "./tophat", line 1495, in split_record
                            read_seq_temp = convert_color_to_bp(read_seq)
                            File "./tophat", line 1469, in convert_color_to_bp
                            base = decode_dic[base+ch]
                            KeyError: 'TN'

                            My .csfasta files all have a 'T' as the first base from the adaptor sequence.

                            >1_32_272_F3
                            T32203022012022322331200020221000013202020302001020

                            Do I need to get rid of the first 'T'?

                            Comment


                            • #29
                              I just looked in the tophat code. There is no key for 'TN'. I am guessing 'T.' is the same as 'TN'?

                              I can just add in 'TN' : 'N' and also for the other bases?
                              Last edited by damiankao; 10-20-2010, 04:35 AM.
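To make the idea concrete, here is a minimal Python sketch of colorspace decoding that treats a missing call as 'N'. It is not TopHat's code: `TRANSITIONS` and `decode_colors` are names made up for this example, and the choice to report every base after a missing call as 'N' is an assumption, made because each decoded base depends on the one before it.

```python
# Color code: 0 keeps the base, 1 swaps A<->C and G<->T,
# 2 swaps A<->G and C<->T, 3 swaps A<->T and C<->G.
TRANSITIONS = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "G": "A", "C": "T", "T": "C"},
    "3": {"A": "T", "T": "A", "C": "G", "G": "C"},
}

def decode_colors(read):
    """Decode a colorspace read (primer base followed by colors) into bases.

    Any character that is not a color 0-3 (for example '.' or 'N') is
    treated as a missing call, and all later bases are reported as 'N'.
    """
    base = read[0]  # primer base, not part of the decoded sequence
    out = []
    for ch in read[1:]:
        if base == "N" or ch not in TRANSITIONS:
            base = "N"
        else:
            base = TRANSITIONS[ch][base]
        out.append(base)
    return "".join(out)
```

With this scheme, `decode_colors("T3220")` gives "AGAA", while `decode_colors("T3.20")` gives "ANNN". Whether adding 'TN' style keys to TopHat's own dictionary has the same effect depends on how the rest of its code handles 'N', which this sketch does not cover.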

                              Comment


                              • #30
                                @jamessmith01, Cufflinks just reports the shortest length read it finds.

                                Be on the lookout for the next version of Cufflinks (hopefully coming this week), which will include proper options to handle strand-specificity in the SOLiD protocol.

                                Comment
