Seqanswers Leaderboard Ad

**fennan** · 02-10-2011, 05:10 AM

Hi Simon,

I can't believe I missed that. It was clearly explained in the documentation. Sorry about that! When I read the warning "Is the SAM file properly sorted?" I assumed it was coordinate-based sorted

Now it's working as nicely as usual.
Thanks again!

**poisson200** · 03-03-2011, 05:38 AM

htseq-count problem

Dear Simon, all,
Possibly my or a htseq-count problem (HTSeq-0.4.7p1). I have ran some paired-end RNA-seq SOLiD data with Tophat (version TopHat v1.2.0) and have generated a test SAM file as follows;

Situation 1) Sam input format 1, as it came from tophat.

$ htseq-count test.sam fred.gtf > f
100000 GFF lines processed.
Warning: Read 1131_1472_555_F5-P2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read 1131_1472_555_F3 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
0 reads processed.

There is only 1 mate pair in the file, please see the attached text file for formats.

So maybe I thought the names have to be equivalent, so I knocked off the extensions (see attached, situation 2)

Same result;
$ htseq-count test.sam fred.gtf > f
100000 GFF lines processed.
Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
0 reads processed.

Situation 3) So I thought maybe the “forward” mate should come first, so I flipped read positions.

$ htseq-count test.sam fred.gtf > f
100000 GFF lines processed.
Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
0 reads processed.

Turning off stranded did not help either;

$ htseq-count -s no test.sam fred.gtf > f
100000 GFF lines processed.
Warning: Read 1131_1472_555 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
0 reads processed.

It could be very well my user problem but I am not seeing why this error message would occur. Thank you for any advice,

Kind regards,

John.

Attached Files

HTSEQcount.txt (1.3 KB, 83 views)

**Simon Anders** · 03-15-2011, 08:16 AM

Hi John

according to the SAM format specification, the names of reads from the same fragment have to be identical, so it was by design that your situation 1 did not work.

Situations 2 and 3 should work, and I guess they did, only the counting was wrong. (It says "0 read processed" although the reads were processed.) They should be counted as "non_unique_alignment", because of the optional field "NH:i:4".

I've just fixed the issue that thew wrong number of processed reads is displayed, so please try again with v0.7.4p3 and check whether you get non-zero counts for any of the couters.

Simon

**poisson200** · 03-15-2011, 12:35 PM

Thank you

Hi Simon,
Thanks a lot for looking at that and fixing it; I will give it a whirl tomorrow. About read names; this is how they come of the SOLiD4 machine and I guess AB expect users to use Bioscope.

Kind regards,

John.

**poisson200** · 04-01-2011, 06:00 AM

htseq-count; a small consideration/nice to have

Dear Simon,
I have been doing some testing regards counts of reads to genes based on genome mapping.

My GTF file is from hg19 Genome Browser for Refseq gene (table browser).
Reads are FASTQ mapped with tophat (bam output sorted by name)

It is possibly just a "nice to have" but because of how Refseq is annotating gene-fusions, some counts are lost. E.g. for DDX39B

$ grep DDX39B htseq_output
ATP6V1G2-DDX39B 0
DDX39B 0

With a quick/dirty Perl program that does not care about ambiguity.

$ grep DDX39B dirty_perl_counts
ATP6V1G2-DDX39B 26
DDX39B 23

Since Refseq is annotating gene fusions along with "vanilla" genes in this way, total read counts for a gene or gene-fusion are lost whilst enforcing the ambiguity rule (which I don't think is optional, even with -m intersection-nonempty).

It seems a bit cumbersome to filter fusions from GTF files. Maybe it could be good to only enforce ambiguity if the overlapping features are on different strands or enable an option to turn it off if desired by the user?

As it is, it is possible an interesting differentially expressed gene could be missed.

Kind regards,

John.

**mackan** · 04-06-2011, 04:10 AM

Hi Simon and John,

I have exactly the problem you describe John, getting them warnings and no result. My sam files are sorted on read name and produced with Tophat 1.2.0. Data is Solid 50+35.

However when I ran the analysis on position-sorted sam files, 6 out of my 8 samples generated results. Still with the warnings though. I can't figure this one out...

One thing that I have realized is that surprisingly few of my read pairs have both mates properly mapped from Tophat. This is spotted in the sam file as '=' (mate is on the same chrom) or '*' (mate is not mapped). I wonder how this is handled by HTSeq-count - perhaps singles are ignored?

I have sam files from Bioscope too, and in them the majority of reads pairs mapped together ('=').

John, did you manage to get proper HTSeq-count results from your Solid-Tophat data, and did you figure out any problem with Solid read pairs in Tophat? I also tried chopping off the read suffixes with no effect, and I checked the csfasta files and made sure they match pair by pair.

Best regards
Markus

**Simon Anders** · 04-06-2011, 04:30 AM

To use paired-end data, SAM files must be sorted by read name. If you sort by position, htseq-count won't find the mate, complain about it with a warning and treat it as a single read.

Two reasons come to mind why you might not get any results when sorted by read name:

- Solid data seems to violate the convention that mated reads must have the same read name by adding some suffix to the reverse reads. If present, these need to be removed. (See posts #92, #93 further up.)

- htseq-count assumed that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand. This works fine for my Illumina data. However, maybe it is the other way round with SOLiD. I've promised an option '--stranded=reverse" to deal with this but haven't added it yet. (See this thread.)

**poisson200** · 04-06-2011, 06:29 AM

Hi Markus, Simon.
Markus, I had a slightly different point, htseq is working for me but some genes would have a count of 0 if the GTF file had a gene fusion annotation.

In the GTF there is annotation for both ATP6V1G2-DDX39B fusion and the
DDX39B alone. I make the assumption (possibly wrongly) that because these features overlap, the counts are not assigned as it looks like these reads are ambiguous. Simon, is that right?

If it helps Markus, read on;

My reads typically look like this, following a tophat run.

Code:

863_343_1081_F5-P2     0        chr1    14571  3 35M    *       0       0       TCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAG     ccc]]babba]RYRPaUO[cNHOUZXSFP]^K3GG     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516560
1009_1698_1545_F5-P2   0        chr1    14571  3 35M    *       0       0       TCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAG     [[__ac_]\[\YTU]RNMP_(!2BRSRMYZXW2::     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516560
1120_1709_1928_F5-P2   0        chr1    14572  0 35M    *       0       0       CCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGG     ^^WZaa^_^XAIWT[WIH])!]bTRL9FVXN&6^^     NM:i:0  NH:i:6  CC:Z:chr12      CP:i:91019
1070_1592_1892_F5-P2   163      chr1    14586  3 35M    =       14594   0       TTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGC     [[abcGG``cWR]ab``a`bRJY`_GI\\[@>W^^     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516545  XS:A:-
1267_389_283_F5-P2     0        chr1    14586  3 35M    *       0       0       TTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGC     SS[b`<;TR[PAL[_ML]UYN5E^^CBRSYK>DUU     NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516545
1070_1592_1892_F3      83       chr1    14594  3 50M    =       14586   0       CCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATG      TTWFCWYRQUVUX^]\_^^^WNQ\_``_a_XX]a^[__`cccbbca_acc      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516522  XS:A:-
1247_1145_1797_F3      83       chr1    14597  3 50M    =       14567   0       CCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCA      DDZZRHEOX[`Y6>a\XTW`][XX^_`FFa][^`a`bSScb`Y[ccc___      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516519  XS:A:-
1100_941_1049_F3       16       chr1    14597  3 50M    *       0       0       CCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCA      JJ\]KCEM^``_DFb]QPX[[]\[^ZZGD[_ZY_bccXXbbbZ[bbcbbb      NM:i:0  NH:i:2  CC:Z:chr15      CP:i:102516519

If you want to compare what I do/did;

Code:

#!/bin/bash


if [ $# -ne 1 ]
then
  echo "Usage: `basename $0` name_of_bam_from_tophat"
  exit 1
fi




# remove _F3 or _F5-P2 from sam output
samtools view $1 | perl -wpe 's/^(\S+)(_F3|_F5-P2)/$1/' > sam_file 


# USCS utilities has a program called fetchChromSizes; use to get Chromosome sizes, in turn used to make a bam file
fetchChromSizes hg19 > sizes

# make a new bam file
samtools view -bt sizes -o sam_file.bam sam_file

# sort data
samtools sort -n sam_file.bam sorted_sam_file

# run htseq
samtools view sorted_sam_file.bam | htseq-count -m intersection-nonempty - ../RefseqHG19_gtf_gene > gene_counts

I have found that in most cases my dirtyperl works the same as htseq but not all, I am currently investigating why that is.

Kind regards,

John.

**Simon Anders** · 04-06-2011, 11:11 AM

Originally posted by poisson200 View Post

In the GTF there is annotation for both ATP6V1G2-DDX39B fusion and the
DDX39B alone. I make the assumption (possibly wrongly) that because these features overlap, the counts are not assigned as it looks like these reads are ambiguous. Simon, is that right?

Yes, that's right. By the way, are you specially interested in these fusion genes? If not, maybe using the GTF file from Ensembl would make your life easier. It does not contain your fusion gene and probably does not contain other fusion genes either.

**bsuac6** · 04-06-2011, 11:21 PM

Originally posted by Simon Anders View Post

To use paired-end data, SAM files must be sorted by read name. If you sort by position, htseq-count won't find the mate, complain about it with a warning and treat it as a single read.

Two reasons come to mind why you might not get any results when sorted by read name:

- Solid data seems to violate the convention that mated reads must have the same read name by adding some suffix to the reverse reads. If present, these need to be removed. (See posts #92, #93 further up.)

- htseq-count assumed that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand. This works fine for my Illumina data. However, maybe it is the other way round with SOLiD. I've promised an option '--stranded=reverse" to deal with this but haven't added it yet. (See this thread.)

Hi Simon,

I have been having similar problems with SOLID data and HT-seq. I did remove the suffixes but some reads were still erroring (warning cant find aligned mate) despite being sorted correctly. Is it likely that this stranded issue is the problem? If htseq-count assumes that the forward mate has the orientation of the coding strand and the reverse mate that of the template strand and you think that solid data is the opposite way round, then why don't all the reads error?

Thanks

p.s. I personal messeged you the error and a excerpt of the sam file on the 04/04/2011 09:43

**adeonari** · 04-07-2011, 03:40 PM

Hello Dr. Anders,

I had a brief question for you concerning your great program htseq-counts. I can some of the cases you consider in your clear illustration here:

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

But does htseq-counts consider the (very infrequent) case in which a read may be longer than a gene?

**Simon Anders** · 04-07-2011, 10:39 PM

Originally posted by adeonari View Post

But does htseq-counts consider the (very infrequent) case in which a read may be longer than a gene?

This is treated the same way as a read extending over the gene's boundary (second row in the figure.)

**adeonari** · 04-15-2011, 06:51 AM

Another quick question: is it possible with htseq-counts to exclude the reads from the 3' UTR?

**Simon Anders** · 04-17-2011, 10:32 PM

The '-i' and '-t' options give you some flexibility what htseq-count is counted.

If this is insufficient, please remember that htseq-count was actually meant not so much as a tool in its own right but rather a demonstration of the possibilities of HTSeq, our Python framework for HTS analysis. As demonstrated in the documentation, it is fairly easy (if you know Python or are willing to learn it) to write your own counting script using HTSeq and then implement whatever custom logic you need for your use case.

**sshen** · 06-18-2011, 06:23 PM

Will HTSeq be able to count reads from a bed file

Dear Simon,

My case is somehow unusual. I have a mapped sequence reads file in bed format. I wonder if you could point me out how I can use HTSeq for reads counts on gene basis. I understand that the bed file contains only interval data and doesn't have quality scores for each base. But in my case, I only need read counts. I have limited knowledge about python.

Your help will be greatly appreciated!

Best
Steve

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News