
**cbarrett** · 10-10-2010, 06:37 AM

convert_color_to_bp in Tophat still problematic

I have installed the patch posted by dcjones on , but when running Tophat on SOLiD colorspace paired-end read data I seem to have uncovered another problem with convert_color_to_bp:

tophat -C -F 0.10 -p 12 --mate-inner-dist 125 --mate-std-dev 25 --microexon-search --GFF ucsc-genes-with-GRCh-IDs.gtf GRCh37-lite_c test_F3.csfasta test_F5-P2.csfasta

[Fri Oct 8 15:15:19 2010] Beginning TopHat run (v1.1.0)
-----------------------------------------------
[Fri Oct 8 15:15:19 2010] Preparing output location ./tophat_out/
[Fri Oct 8 15:15:19 2010] Checking for Bowtie index files
[Fri Oct 8 15:15:19 2010] Checking for reference FASTA file
[Fri Oct 8 15:15:19 2010] Checking for Bowtie
Bowtie version: 0.12.7.0
[Fri Oct 8 15:15:19 2010] Checking for Samtools
Samtools version: 0.1.8.0
[Fri Oct 8 15:15:35 2010] Checking reads
min read length: 25bp, max read length: 50bp
format: fasta
[Fri Oct 8 17:05:10 2010] Reading known junctions from GFF file
[Fri Oct 8 18:01:36 2010] Mapping reads against GRCh37-lite_c with Bowtie
[Fri Oct 8 23:30:08 2010] Joining segment hits
[Sat Oct 9 04:14:44 2010] Mapping reads against GRCh37-lite_c with Bowtie(1/2)
[Sat Oct 9 08:28:21 2010] Mapping reads against GRCh37-lite_c with Bowtie(2/2)
[Sat Oct 9 12:59:31 2010] Mapping reads against GRCh37-lite_c with Bowtie
[Sun Oct 10 00:08:07 2010] Joining segment hits
Traceback (most recent call last):
File "/usr/local/bin/tophat", line 2174, in <module>
sys.exit(main())
File "/usr/local/bin/tophat", line 2133, in main
user_supplied_juncs)
File "/usr/local/bin/tophat", line 1848, in spliced_alignment
segment_len)
File "/usr/local/bin/tophat", line 1570, in split_reads
split_record(read_name, read_seq, read_quals, output_files, offsets, color)
File "/usr/local/bin/tophat", line 1503, in split_record
read_seq_temp = convert_color_to_bp(read_seq)
File "/usr/local/bin/tophat", line 1477, in convert_color_to_bp
base = decode_dic[base+ch]
KeyError: '+1'
make: *** [tophat_out/accepted_hits.sam] Error 1

**DerSeb** · 10-11-2010, 11:50 AM

Originally posted by dcjones View Post

I don't think there is a problem with the '.'s needing to be 'N's. It expects '.'s in colorspace reads. The problem is that tophat converts the '.'s to 'N's on exactly one read (the last read), and it should not.

I don't know that you can modify your reads to work around that.

I see, all runs with "N" reads failed. I just started another 4 samples today with "." in the .csfasta files. We will see what happens next.

If this fails I will try with the patch next!

**DerSeb** · 10-11-2010, 11:52 AM

Originally posted by dsidote View Post

I used the precompiled version and it worked. Our sysadmin is recompiling the code with dcjones patch, so as soon as that is done I will test it with unmodified data.

DerSeb: Did you try removing the reads with the missed colorcalls instead of converting to 'N' to see if the mixed colorspace-basespace is the issue?

I have not yet tried that, but I tried both "." and "N" files now. N crashed after a few hours and I'm just waiting for the "." mapping to finish!
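For anyone who wants to try the suggestion of removing reads with missed color calls, here is a minimal sketch. It assumes plain csfasta input where each record is a `>` header line followed by one line holding the primer base and the color calls, with `.` marking a missed call. The function name `drop_reads_with_missing_calls` is made up for illustration, and the matching .qual records (and, for paired-end data, the mates) would need to be dropped as well.

```python
def drop_reads_with_missing_calls(lines):
    """Yield csfasta lines, skipping records whose color string contains '.'."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # drop blank lines and '#' comment headers
        if line.startswith(">"):
            header = line  # remember the header until its sequence arrives
        elif header is not None:
            # line[0] is the primer base; the color calls start at line[1]
            if "." not in line[1:]:
                yield header
                yield line
            header = None


if __name__ == "__main__":
    demo = [">r1\n", "T0123\n", ">r2\n", "T01.3\n", ">r3\n", "T3210\n"]
    for out_line in drop_reads_with_missing_calls(demo):
        print(out_line)
```

Only r1 and r3 survive in the demo, because r2 contains a missed call.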

**krobison** · 10-11-2010, 04:54 PM

BUT, the accepted_hits.bam file is empty! What did I do wrong this time?

Apparently I have gremlins; another run worked fine.

**AdamB** · 10-12-2010, 03:19 AM

Version 1.1.1 on the main page apparently includes fixes for these bugs...

**Pejman** · 10-14-2010, 10:28 AM

I just ran Tophat -> Cufflinks with and without GTF files on SOLiD colorspace data smoothly, thanks to the developers the new version works like a charm!

**AdamB** · 10-15-2010, 02:11 AM

I ran TopHat on paired-end SOLiD reads, and used the output for Cufflinks. Cufflinks identified the input as single-end, 25-bp reads (it was actually PE 50+25 bp). Does this mean Cufflinks is not working, or TopHat?

**Pejman** · 10-15-2010, 07:52 AM

Is there somebody with some clue on how to tune TopHat parameters? I just made a new thread for it:

Tuning TopHat parameters for SOLiD reads - SEQanswers

http://seqanswers.com/forums/showthread.php?p=27217#post27217


**KevinLam** · 10-19-2010, 01:09 AM

Originally posted by krobison View Post

Does someone (such as the Tophat team) have a small colorspace dataset which works in Tophat that they'd be willing & able to make public? Having a positive control would be awfully handy.

I think you can download this


http://solidsoftwaretools.com/gf/project/wtpe/

This data set was generated by sequencing SOLiD™ Total RNA-Seq prepared libraries using paired-end reads of 50bp (forward) and 25 bp (reverse) on the SOLiD™ 4 System. The data provided is the mapping output and whole transcriptome results from the SOLiD™ BioScope™ 1.2.1 WT analysis pipeline.

Just grab the first few thousand for a small test dataset.

I am trying to run it on real-life human single-end 50 bp RNA-seq data, and it is taking forever searching for junctions via segment mapping.

Has anyone completed a single-end 50 bp SOLiD data alignment with TopHat?

**krobison** · 10-19-2010, 05:10 AM

Thanks -- you do have to go thru the ABI folks to actually get access to the data, but I did succeed in getting TopHat to run on this.

I have gotten TopHat to run successfully on the SE datasets from the SRA -- it just requires trimming the first quality value out & rewriting the FASTQ as csfasta+qual file pairs.

Code:

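#!/usr/bin/perl
use strict;
use warnings;

# Reformat single-end SOLiD FASTQ data from the Short Read Archive
# to work successfully with the patched version of TopHat 1.1.0.
# For each input X.fastq, writes X.csfasta and X.qual.

foreach my $arg (@ARGV)
{
    # Escape the dot so that only a literal ".fastq" suffix matches
    my ($stem) = ($arg =~ /(.*)\.fastq$/);
    die "Could not identify stem in $arg\n" unless (defined $stem);

    open(my $in,    '<', $arg)            or die "Cannot read $arg: $!\n";
    open(my $fasta, '>', "$stem.csfasta") or die "Cannot write $stem.csfasta: $!\n";
    open(my $qual,  '>', "$stem.qual")    or die "Cannot write $stem.qual: $!\n";
    while (my $idLineA = <$in>)
    {
        chomp($idLineA);
        # Read ID: everything after the leading '@' up to the first space
        my ($id) = ($idLineA =~ /^.([^ ]+)/);
        my $seqLine  = <$in>;
        my $idLineB  = <$in>;
        my $qualLine = <$in>;
        last unless (defined $qualLine); # stop on a truncated final record
        chomp($qualLine);
        my @qualVals = ();
        foreach my $qualChar (split(//, $qualLine))
        {
            my $qualVal = ord($qualChar) - 33; # qualities are offset by 33
            if ($qualVal < 0)
            {
                $qualVal = 0;
                print STDERR ">$qualChar< for $idLineB\n";
            }
            push(@qualVals, $qualVal);
        }
        shift(@qualVals); # dump first qual val
        print $fasta ">$id\n";
        print $fasta $seqLine;
        print $qual ">$id\n";
        print $qual join(" ", @qualVals), "\n";
    }
    close($in);
    close($fasta);
    close($qual);
}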

After reformatting, my command line looked like this (you may need to change the path to the .gtf file):

Code:

tophat --color -G $BOWTIE_INDEXES/hg18.ref-genes.gtf -o SRR040361-tophat -p 8 --quals hg18  SRR040361.csfasta  SRR040361.qual 1> tophat.2.out 2> tophat.2.err

**DerSeb** · 10-20-2010, 12:00 AM

Just to let you know, I have gotten SE 50bp and PE 50bp & 25bp SOLiD data to work. I use version 1.1 and trim the headers by hand, also replacing -1 values with 0.
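The two manual steps mentioned here (trimming the headers and replacing -1 values with 0) can also be scripted. The following is only a sketch under assumptions: it treats lines starting with `#` as the header to trim, and it expects space-separated integer quality values on the lines after each `>` record name. The function name `clean_qual_lines` is hypothetical.

```python
def clean_qual_lines(lines):
    """Yield .qual lines without '#' headers and with -1 replaced by 0."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # trim blank lines and '#' header lines
        if line.startswith(">"):
            yield line  # record names pass through unchanged
        else:
            values = line.split()
            yield " ".join("0" if v == "-1" else v for v in values)


if __name__ == "__main__":
    demo = ["# Title: example\n", ">1_32_272_F3\n", "20 -1 15 -1\n"]
    for out_line in clean_qual_lines(demo):
        print(out_line)
```

The same header trimming would apply to the .csfasta file, where the sequence lines should be left untouched.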

There was a problem with some files not running through properly with the error I posted above.

The solution: It didn't work to combine two csfasta and qual files using

Code:

 cat 1.csfasta | cat 2.csfasta > 1and2.csfasta

so I used:

Code:

 cat 2.csfasta | cat 1.csfasta > 2and1.csfasta

and it worked!! I wonder what the reason was for this? (Also, I got the same error using only the single files!)
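One thing worth noting about the two commands: in `cat 1.csfasta | cat 2.csfasta`, the second `cat` is given a file argument, so it never reads its standard input, and the output contains only 2.csfasta. The two orderings therefore each produce a copy of a single file, not a merged file, which may be part of why they behaved differently. A true concatenation in the shell is `cat 1.csfasta 2.csfasta > 1and2.csfasta`. A minimal Python sketch of the same thing (the function name `concat_files` is made up for illustration):

```python
import shutil


def concat_files(paths, out_path):
    """Write the contents of each input file, in order, to out_path."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # stream bytes without loading the whole file


if __name__ == "__main__":
    import sys
    # usage: python concat.py OUT IN1 IN2 ...
    concat_files(sys.argv[2:], sys.argv[1])
```

If the inputs carry `#` header lines, the merged file will contain one header block per input, so those may still need trimming.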

**AdamB** · 10-20-2010, 02:10 AM

Did you try feeding the output from the PE data into cufflinks? If so, did it report 50+25 bp reads?

**damiankao** · 10-20-2010, 04:14 AM

I've just tried running .csfasta and .qual files I got straight off the SOLiD run cluster with the newest Tophat. I got this error:

Traceback (most recent call last):
File "./tophat", line 2166, in ?
sys.exit(main())
File "./tophat", line 2125, in main
user_supplied_juncs)
File "./tophat", line 1840, in spliced_alignment
segment_len)
File "./tophat", line 1562, in split_reads
split_record(read_name, read_seq, read_quals, output_files, offsets, color)
File "./tophat", line 1495, in split_record
read_seq_temp = convert_color_to_bp(read_seq)
File "./tophat", line 1469, in convert_color_to_bp
base = decode_dic[base+ch]
KeyError: 'TN'

My .csfasta files all have a 'T' as the first base from the adaptor sequence.

>1_32_272_F3
T32203022012022322331200020221000013202020302001020

Do I need to get rid of the first 'T'?

**damiankao** · 10-20-2010, 04:32 AM

I just looked in the tophat code. There is no key for 'TN'. I am guessing 'T.' is same as 'TN'?

I can just add in 'TN' : 'N' and also for the other bases?
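The traceback shows the lookup is keyed on the previous base plus the next character, so any character without an entry ('N' here) raises a KeyError. As a rough illustration of how a decoder can tolerate missing calls, here is a hypothetical re-implementation based on the standard SOLiD two-base color code. It is not TopHat's own `convert_color_to_bp`, and the names `COLOR_TO_BASE` and `color_to_bases` are invented for this sketch.

```python
# For each previous base, the string gives the next base for colors 0, 1, 2, 3.
# Color 0 keeps the base, 1 swaps A<->C and G<->T, 2 swaps A<->G and C<->T,
# and 3 swaps A<->T and C<->G.
COLOR_TO_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def color_to_bases(read):
    """Decode a csfasta read (primer base followed by color calls) into bases.

    Any character that is not a color 0-3 (for example '.' or 'N') yields 'N'.
    Every later base is then also 'N', because each base is decoded relative
    to the one before it.
    """
    prev = read[0].upper()
    decoded = []
    for ch in read[1:]:
        if prev in COLOR_TO_BASE and ch in "0123":
            prev = COLOR_TO_BASE[prev][int(ch)]
        else:
            prev = "N"
        decoded.append(prev)
    return "".join(decoded)


if __name__ == "__main__":
    print(color_to_bases("T320"))
    print(color_to_bases("T3.1"))
```

The leading 'T' is the primer base that anchors the decoding, so it is consumed and does not appear in the decoded sequence. Whether adding 'TN' : 'N' style entries alone is enough inside TopHat depends on how the rest of its code treats the result, which is not shown in this thread.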

**adarob** · 10-20-2010, 07:49 AM

@jamessmith01, Cufflinks just reports the shortest length read it finds.

Be on the lookout for the next version of Cufflinks (hopefully coming this week), which will include proper options to handle strand-specificity in the SOLiD protocol.
