A possible problem :
According to htseq documentation : http://www-huber.embl.de/users/ander.../counting.html
"The fact that the records describe the same fragment can be seen from the fact that they have the same read name"
So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)
So simply do that:
(samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam
Seqanswers Leaderboard Ad
Collapse
X
-
Sorting SAM file properly can fix the warning
Hi,
I came across the same issue while using htseq-count.
I ve usedCode:sort -n 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
ERR032234.10311751 147 chr12 100208754 50 76M = 100208593 -237 GGCAGCCCAGGGCTTGGTGTTGGCAGTTTGAGTTCGTGAGATAGAAAGAGTGGGGTGTCAAGGCAGTACCCCTGAG >>=@=<9;9?>A<=;A;7>5;;;;=8@=<7==@:;5AA;A@A19A5AAAA@<<<;<;=8=5;@;8;<?.55>7?7> XA:i:0 MD:Z:76 NM:i:0 NH:i:1
ERR032234.1031175 145 chr1 51757180 50 76M = 51755662 -1594 CTCGGACTTGATCTGCCCAGACTTTTGGTCAGCAAGGAGAAGGTTATTGTTTGTTAAGAGGAAAATCCGAGATGTA A3=>D<D??>D?@DD?9DDD>DDBBDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD XA:i:0 MD:Z:76 NM:i:0 NH:i:1
ERR032234.10311751 99 chr12 100208593 50 76M = 100208754 237 CTAAAAGCTTACCTCCAAAACAGGATTCTTGTGTAACTAGGAATCCTGCATGAGAACCAGAAACCCTAACCTCCGA CAA@?AC8>ACC>AA<C?8B=6B<98?=>BC<C9=AAA<CA>A=A:=BB=>>::<>@==@*2:;8:7><<;7<><; XA:i:0 MD:Z:76 NM:i:0 NH:i:1
So the reads were not sorted correctly. I used the instruction from here
Code:sort -bn -k 1.11,1 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
this sort command would sort the file on field 1, from chars 11 to end of field, numerically.
now htseq-count working fine.
Leave a comment:
-
-
If TopHat marks the multimapping reads with the "NH" tag (and, IIRC, it does), htseq-count filters these out for you, too.
Leave a comment:
-
-
Thanks Simon. Does it mean I can understand that I only need to filter multiple mapped reads in the accepted_hits.bam from tophat2? No other filtering step necessary prior to HTseq-count?
Originally posted by Simon Anders View PostThey are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).
Leave a comment:
-
-
Originally posted by bbl View PostHow does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?
Leave a comment:
-
-
Transcriptom annotation
Hello to everyone,
I would like to ask for advice wich tools to use to annotate assembled RNA seq. We have sequenced several species at Illumina HiSeq and want to annotate genes now. Then see the differences in expression, because we have different species and hybrids as well.
Does anyone can advice me the pipeline? I have heard about RAST and then to manually approved it in GMOD.
I see majority is working with blast2go, but it is not free
Leave a comment:
-
-
How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?
Leave a comment:
-
-
Oh sorry, there should be a bunch of colons in there. That's because I was fooling around with it when I was getting errors earlier about an illegal colon in the optional fields. Here is the real line:
Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZP:B:f,0.0155953,0.00769459,0.0049578 ZM:B:s,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZF:i:28 RG:Z:Y71VY.IonXpress_014 PG:Z:tmap MD:Z:12 NM:i:0 AS:i:12 XA:Z:map4-1 XS:i:12
Does this still look screwed up?
Leave a comment:
-
-
That appears to be a pretty screwed up SAM line. Did your file get corrupted?
Leave a comment:
-
-
Hi, I was wondering if you ever figured out the error message you were describing above? I am getting a similar thing with HTSeq, only it is with the first read of the SAM file.
Error occured when reading first line of sam file.
Error: ("unsupported format character ''' (0x27) at index 34", 'line 30 of file
CAL120_sr.sam')
[Exception type: ValueError, raised in _HTSeq.pyx:1168]
And this is the line:
Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZPBf,0.0155953,0.00769459,0.0049578 ZMBs,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZFi28 RGZY71VY.IonXpress_014 PGZtmap MDZ12 NMi0 ASi12 XAZmap4-1 XSi12
Any help would be GREATLY appreciated!
Leave a comment:
-
-
Thanks for the quick reply @dpryan
I will try to get the counts again following your suggestion for removing the chimeric/fusion alignments.
It would be great if Simon Anders can implement a better solution for this, but unfortunately "monetary encouragement" is beyond my possibilities at this early stage of my PhD studies ....
So I will give it a try now and perhaps by Monday I will have the new counts.
thanks agan!
Leave a comment:
-
-
I wouldn't say that the counts are unreliable, so much as not necessarily representing exactly what you want...though that's just splitting semantic hairs.
In the example you posted above (in #14), if all the parts of that fusion gene mapped to, say, KCNJ12, then that fusion read would end up getting counted twice, which is not what you'll want. If the second part of the fusion read mapped to, say, KCNJ18 (random example), then the whole paired-read would end up getting counted once for each of the genes, which is also probably not ideal.
The conservative solution (short of just realigning) would be to simply remove fusion reads, which could be done with
Code:samtools view file.bam | grep -v "XF:Z" | htseq-count [options] - GTF_file > counts_file
Regarding the dexseq_count.py error, it's not clear to me from that line why that would be occuring. It's complaining about there being an apostrophe somewhere, though I don't see one. I suspect there's something with the format of the chimeric read that it doesn't like, but I'm not familiar enough with the inner-workings of the script to say what (perhaps Simon will chime in).
Leave a comment:
-
-
@dpryan: thanks for the reply!
Yes, I used tophat-fusion, so this means that the counts I get from these alignments are unreliable?
Also, I am using dexseq_count.py on my 105 RNA-Seq samples, and only one of the files triggered an error:
ValueError: ("unsupported format character ''' (0x27) at index 34", 'line 433369370
And this is the line:
SRR391511.sra.ncbi_enc.52549224 163 chr3 49396810 50 50M = 49396966 205 GGGAAACCAATTCCTATTTACTTAGCCCAGCTCCATGGGGTACTGAGATA B*CBABA><7ABB6BACAA>;AB@B@?ABB@@A?=AB<<6-?>BB>A>6@ XA:i:1 MD:Z:1T48 N@BAA=?9:@4@ XA:i:0 MD:Z:50 NM:i:0 XS:A:- XP:Z:chr3 49397428 50M NH:i:1
Is there something I should fix in that line to make it work? Also, it would be good to know why it is only failing with that one file when all of them have been processed the same way. Thanks for any input in the matter.
Leave a comment:
-
-
@acevera: It looks like you have chimeric reads, is this output from tophat-fusion (or tophat2 with --fusion-search)?
Edit: I should add a link to a short back and forth on an identical issue from sourceforge. Counting chimeric reads requires a good bit of forethought. In the case above, both sections probably map to the same gene, which has an inversion. In this case, you only want the read counted once (though it will get counted twice), while in other cases, where the read might map to a fusion of two genes, you might want it counted twice or not at all. I wouldn't expect htseq-count to know exactly how you want each of these cases handled.
BTW, the SAM spec. has only recently been updated to include chimeric alignments, through the addition of the 0x800 flag bit. Tophat doesn't follow that, though given that tophat-fusion predates the specification change, that's pretty unsurprising.Last edited by dpryan; 08-15-2013, 06:58 AM.
Leave a comment:
-
-
Hi,
I am having a similiar problem, i.e. getting lots of warning messages from htseq-count, but I believe my files are properly sorted and I did not mix paired-end with single-end reads.
I used the following commands:
samtools sort -n accepted_hits.bam sorted
samtools view sorted.bam | htseq-count -s no - genes.gtf > counts.txt
Then I get tons of warnings like
Warning: Read SRR316646.sra.ncbi_enc.1030 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR316646.sra.ncbi_enc.1140 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR316646.sra.ncbi_enc.1193 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
For the first warning the corresponding section of my sorted.bam file looks like this:
SRR316646.sra.ncbi_enc.1029 163 chr6 30036407 50 69M405N7M = 30036975 644 CANAGAAAAAGGTAGAATGGACAAGTGACACTGTGGACAATGAACACATGGGCCGCCGCTCATCCAAATG
CTGCTG C@#@BCCCC;>D@CDEEEDDGGFGGEEEEGDEFDFGFFAGGDGEEFGGFCBDEGGGGGEEBEE=EB?@BEB?C?@F XA:i:1 MD:Z:2G73 NM:i:1 XS:A:+ XP:Z:chr6 30036975 76M NH:i:1
SRR316646.sra.ncbi_enc.1029 83 chr6 30036975 50 76M = 30036407 -644 ACGTGGCCACCGCAAAGGACGGCGTCGTGCAACCCTAGGACCGACCCCCACCACACCTCCCCAGCCTCCTGACCCT :?==@?C?5ECBE?@E?B@EBCED?DECEEABDDD-EEEEC?CBDEDBEEBE;<EEFGDFFFAFAEDFFFFFDE XA:i:1 MD:Z:54C21 NM:i:1 XS:A:+ XP:Z:chr6 30036407 69M405N7M NH:i:1
SRR316646.sra.ncbi_enc.1030 163 chr3 101544467 50 76M = 101544619 196 GTNTTTTACATATAAATATTTTTCTATGTTAAATAGTCTTCATAAGTTAATTATAAAGATCTGCTTAATAGTTCTG @?#<A@C=B>DA?C?AECEEFFEFEEFFBFFDFFFFBFFFEEBEBFBBFDDB?EEDECBADEF?EEE@D?E=@EB? XA:i:1 MD:Z:2T73 NM:i:1 XS:A:+ XP:Z:chr3-chr3 101544619 36M101544702F40m NH:i:1
SRR316646.sra.ncbi_enc.1030 83 chr3 101544619 50 36M = 101544467 -196 TATATCAATTTCAGTGGTTTAAAATATGAATTTCTA EE=E@EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDD AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:1 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@EBEB@BDEEDEDEEFE
FEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
SRR316646.sra.ncbi_enc.1030 83 chr3 101544664 50 40M = 101544467 -196 GAAAGTATTAATTTAATGTAGTGAATCTCATATTCCAGAA DCCDD?EEEEDFFFEDEFDED@EBCCCCED
FFFFFBFFEB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:2 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@
EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
SRR316646.sra.ncbi_enc.1032 99 chr6 119499446 50 76M = 119499512 142 ATTAGGTGGAACCATATGAAACTGCTGACAGTTTTTAAACTACAAATGCAGCAATTTCATATGTTTCAGCCTAATC EEE:AD:?:CA>ADCD?DBBDDDA?CCBDDDDD545@CD=CBB:BBA?DBB5B?C=??CAC?=AB?BCAA5?5A XA:i:0 MD:Z:76 NM:i:0 XS:A:- XP:Z:chr6 119499512 76M NH:i:1
Is it ok to ignore the warnings or there is something wrong with my alignment files??
I appreciate any help on these.
Thanks!
Leave a comment:
-
Latest Articles
Collapse
-
by seqadmin
This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.
The Headliner
The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...-
Channel: Articles
03-03-2025, 01:39 PM -
-
by seqadmin
The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...-
Channel: Articles
02-24-2025, 06:31 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 05:03 AM
|
0 responses
16 views
0 reactions
|
Last Post
by seqadmin
Yesterday, 05:03 AM
|
||
Started by seqadmin, 03-19-2025, 07:27 AM
|
0 responses
17 views
0 reactions
|
Last Post
by seqadmin
03-19-2025, 07:27 AM
|
||
Started by seqadmin, 03-18-2025, 12:50 PM
|
0 responses
18 views
0 reactions
|
Last Post
by seqadmin
03-18-2025, 12:50 PM
|
||
Started by seqadmin, 03-03-2025, 01:15 PM
|
0 responses
185 views
0 reactions
|
Last Post
by seqadmin
03-03-2025, 01:15 PM
|
Leave a comment: