Seqanswers Leaderboard Ad

**acervera** · 08-16-2013, 05:03 AM

@dpryan: thanks for the reply!
Yes, I used tophat-fusion, so this means that the counts I get from these alignments are unreliable?
Also, I am using dexseq_count.py on my 105 RNA-Seq samples, and only one of the files triggered an error:
ValueError: ("unsupported format character ''' (0x27) at index 34", 'line 433369370
And this is the line:
SRR391511.sra.ncbi_enc.52549224 163 chr3 49396810 50 50M = 49396966 205 GGGAAACCAATTCCTATTTACTTAGCCCAGCTCCATGGGGTACTGAGATA B*CBABA><7ABB6BACAA>;AB@B@?ABB@@A?=AB<<6-?>BB>A>6@ XA:i:1 MD:Z:1T48 N@BAA=?9:@4@ XA:i:0 MD:Z:50 NM:i:0 XS:A:- XP:Z:chr3 49397428 50M NH:i:1

Is there something I should fix in that line to make it work? Also, it would be good to know why it is only failing with that one file when all of them have been processed the same way. Thanks for any input in the matter.

**dpryan** · 08-16-2013, 05:41 AM

I wouldn't say that the counts are unreliable, so much as not necessarily representing exactly what you want...though that's just splitting semantic hairs.

In the example you posted above (in #14), if all the parts of that fusion gene mapped to, say, KCNJ12, then that fusion read would end up getting counted twice, which is not what you'll want. If the second part of the fusion read mapped to, say, KCNJ18 (random example), then the whole paired-read would end up getting counted once for each of the genes, which is also probably not ideal.

The conservative solution (short of just realigning) would be to simply remove fusion reads, which could be done with

Code:

samtools view file.bam | grep -v "XF:Z" | htseq-count [options] - GTF_file > counts_file

or something like that. This will still produce a bunch of error messages due to having missing mates, but you remove the chimeric-alignment-double-counting issue. In an ideal world, the counting program would treat chimeric alignments that map to the same feature as a single entity and those aligning to different features in a user-defined manner. If you have some budget to spare, perhaps Simon Anders could be "monetarily encouraged" to modify htseq-count to do that...

Regarding the dexseq_count.py error, it's not clear to me from that line why that would be occuring. It's complaining about there being an apostrophe somewhere, though I don't see one. I suspect there's something with the format of the chimeric read that it doesn't like, but I'm not familiar enough with the inner-workings of the script to say what (perhaps Simon will chime in).

**acervera** · 08-16-2013, 06:34 AM

Thanks for the quick reply @dpryan
I will try to get the counts again following your suggestion for removing the chimeric/fusion alignments.
It would be great if Simon Anders can implement a better solution for this, but unfortunately "monetary encouragement" is beyond my possibilities at this early stage of my PhD studies ....
So I will give it a try now and perhaps by Monday I will have the new counts.
thanks agan!

**adaigle** · 10-28-2013, 12:43 PM

Hi, I was wondering if you ever figured out the error message you were describing above? I am getting a similar thing with HTSeq, only it is with the first read of the SAM file.

Error occured when reading first line of sam file.
Error: ("unsupported format character ''' (0x27) at index 34", 'line 30 of file
CAL120_sr.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1168]

And this is the line:
Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZPBf,0.0155953,0.00769459,0.0049578 ZMBs,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZFi28 RGZY71VY.IonXpress_014 PGZtmap MDZ12 NMi0 ASi12 XAZmap4-1 XSi12

Any help would be GREATLY appreciated!

**dpryan** · 10-28-2013, 12:45 PM

That appears to be a pretty screwed up SAM line. Did your file get corrupted?

**adaigle** · 10-28-2013, 12:53 PM

Oh sorry, there should be a bunch of colons in there. That's because I was fooling around with it when I was getting errors earlier about an illegal colon in the optional fields. Here is the real line:

Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZP:B:f,0.0155953,0.00769459,0.0049578 ZM:B:s,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZF:i:28 RG:Z:Y71VY.IonXpress_014 PG:Z:tmap MD:Z:12 NM:i:0 AS:i:12 XA:Z:map4-1 XS:i:12

Does this still look screwed up?

**bbl** · 11-13-2013, 04:55 AM

How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?

**Eveleee** · 11-13-2013, 05:25 AM

Transcriptom annotation

Hello to everyone,
I would like to ask for advice wich tools to use to annotate assembled RNA seq. We have sequenced several species at Illumina HiSeq and want to annotate genes now. Then see the differences in expression, because we have different species and hybrids as well.
Does anyone can advice me the pipeline? I have heard about RAST and then to manually approved it in GMOD.
I see majority is working with blast2go, but it is not free

**Simon Anders** · 11-15-2013, 03:54 AM

Originally posted by bbl View Post

How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?

They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

**bbl** · 11-15-2013, 06:53 AM

Thanks Simon. Does it mean I can understand that I only need to filter multiple mapped reads in the accepted_hits.bam from tophat2? No other filtering step necessary prior to HTseq-count?

Originally posted by Simon Anders View Post

They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

**Simon Anders** · 11-15-2013, 07:17 AM

If TopHat marks the multimapping reads with the "NH" tag (and, IIRC, it does), htseq-count filters these out for you, too.

**kcm.eid** · 03-02-2014, 10:05 PM

Sorting SAM file properly can fix the warning

Hi,
I came across the same issue while using htseq-count.
I ve used

Code:

sort -n 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam

command to sort my accepted_hits.sam, but still getting the warning message. When I little analysed the warning message and my input sorted SAM file, found that my sam file was not sorted properly. One read which is showing warning is as follows:

ERR032234.10311751 147 chr12 100208754 50 76M = 100208593 -237 GGCAGCCCAGGGCTTGGTGTTGGCAGTTTGAGTTCGTGAGATAGAAAGAGTGGGGTGTCAAGGCAGTACCCCTGAG >>=@=<9;9?>A<=;A;7>5;;;;=8@=<7==@:;5AA;A@A19A5AAAA@<<<;<;=8=5;@;8;<?.55>7?7> XA:i:0 MD:Z:76 NM:i:0 NH:i:1

ERR032234.1031175 145 chr1 51757180 50 76M = 51755662 -1594 CTCGGACTTGATCTGCCCAGACTTTTGGTCAGCAAGGAGAAGGTTATTGTTTGTTAAGAGGAAAATCCGAGATGTA A3=>D<D??>D?@DD?9DDD>DDBBDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD XA:i:0 MD:Z:76 NM:i:0 NH:i:1

ERR032234.10311751 99 chr12 100208593 50 76M = 100208754 237 CTAAAAGCTTACCTCCAAAACAGGATTCTTGTGTAACTAGGAATCCTGCATGAGAACCAGAAACCCTAACCTCCGA CAA@?AC8>ACC>AA<C?8B=6B<98?=>BC<C9=AAA<CA>A=A:=BB=>>::<>@==@*2:;8:7><<;7<><; XA:i:0 MD:Z:76 NM:i:0 NH:i:1

So the reads were not sorted correctly. I used the instruction from here

Code:

sort -bn -k 1.11,1 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam

my reads have ids like this: ERR032234.1031175
this sort command would sort the file on field 1, from chars 11 to end of field, numerically.

now htseq-count working fine.

**raphael123** · 02-11-2015, 12:02 PM

A possible problem :

According to htseq documentation : http://www-huber.embl.de/users/ander.../counting.html

"The fact that the records describe the same fragment can be seen from the fact that they have the same read name"

So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

So simply do that:

(samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News