Seqanswers Leaderboard Ad

**Simon Anders** · 08-18-2011, 04:57 AM

To clear up this confusion:

1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.

Furthermore, the format specifies that each read contains information about whether the read was aligned, whether its mate was aligned and where the read aligned to and where its mate aligned to. For the partner, another line has to be present that shows a "mirror" of the same alignment, i.e., the information on whether and where the read and its mate align should match if one exchanges read and mate information.

If you now perform any filtering that might remove reads but not their mates, this rule gets broken, and HTSeq issues a warning (but proceeds anyway, treating the missing partner as not aligned).

Some aligners manage to produce SAM files with missing mates even without filtering.

2. If a SAM line's flag field says that the read was not aligned then, in my opinion, the fields for where the read aligned should not contain any coordinates. HTSeq issues a warning in this case but treats the read as not aligned. BWA has a habit of sometimes producing such reads, and this is even documented somewhere.

3. Finally, a "proper pair" is, as I read the standard, a pair in which both reads align to opposite strands. If a SAM line's flag field indicates that a pair is proper but the alignment information disagrees with this, a warning is issued.

It is rather frustrating that the SAM format allows so many ways of storing self-contradictory information in a SAM file, without giving clear rules in the specification, because this forces software to expect all these cases.

**Simon Anders** · 08-18-2011, 05:25 AM

Regarding BWA and the FLAG fields, see also this thread: http://seqanswers.com/forums/showthread.php?t=8429

**ThePresident** · 08-24-2012, 04:02 PM

Originally posted by Simon Anders View Post

To clear up this confusion:

1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.

So, if I get it well, we don't need to have sorted SAM files if dealing with single-end RNAseq reads? Am-I right?

**Simon Anders** · 08-25-2012, 12:37 AM

Originally posted by ThePresident View Post

So, if I get it well, we don't need to have sorted SAM files if dealing with single-end RNAseq reads? Am-I right?

Yes.

------

**glados** · 10-30-2012, 04:57 AM

Dear all.

I'm trying to find information about how HTSeq counts reads. I understood that one pair (properly paired) is counted as 1 count.
What about pairs that are not flagged as 'properly paired'?
What about the reads that lost their mate and became single reads?
Are they counted as 1 count as well? Or not counted at all?

Additionally I'm loosing quite many reads that have multiple mappings. Anyone figured out a way to deal with this in HTSeq, instead of just throwing them all out?

**wanfahmi** · 01-10-2013, 04:05 AM

Hi Simon,

1. If you have paired-end data, htseq-count expects that the SAM file is sorted by read name. This is because the SAM format prescribes that paired reads have the same read ID, and hence, sorting causes the read pairs to appear in adjacent line.

How could I sorted the SAM file according to name? Thank you

**Simon Anders** · 01-10-2013, 04:18 AM

"samtools sort -n"

**DZhang** · 01-10-2013, 07:45 AM

I use picard (http://picard.sourceforge.net/)

Best regards,
Douglas

https://www.contigexpress.com

**wanfahmi** · 01-10-2013, 09:05 AM

Hi Simon,

Already sort according to name and run htseq-count and here is the output;

ENSG00000261838 0
ENSG00000261839 2
ENSG00000261840 6
ENSG00000261841 0
ENSG00000261842 0
no_feature 1288128
ambiguous 512283
too_low_aQual 0
not_aligned 0
alignment_not_unique 1033314

Actually, I can't understand well about the htseq-count output. Should it given the raw count of each read/genes? As I want to know the raw count, instead FPKM (from cuff diff output) value. I want to compare the raw count with my array data.

Thank you

**wanfahmi** · 01-11-2013, 02:48 AM

Sorry to add up,

I was using GTF file from ensembl (genes.gtf), thus the "ENSG00000261838" is a gene from human. How could I replace the number to name of gene? What should I do? Thanks

**zaki** · 03-04-2013, 06:41 PM

Originally posted by DZhang View Post

Hi vadim,

I am happy to report it worked! I guess "export LC_ALL=POSIX" helped. I had not known it was related to the sorting or htseq-count. But you pointed me to the right direction and offered a working solution.

Thank you very much and now I am off for a great July 4th!

Douglas

Originally posted by vadim View Post

You may also want to do this prior to sorting:
export LC_ALL=POSIX

Hi Vadim and Douglas,

I am wondering, what does the export quote do?

Code:

export LC_ALL=POSIX

And how would one use it? I am assuming LC_ALL is your .sam file which is unsorted? Would this be correct..

I sorted the .sam file using

Code:

sort -k1

Although there were occasional "Is the SAM file properly sorted?", it worked for most of the reads. So I was thinking if the "export: function would help. But I am unsure how to use it

Any insight would be great, Many thanks

**delasa** · 05-24-2013, 10:40 AM

Hi Simon,
I am using DEXSeq for analyzing Tophat results. I have some problems for preprocessing the BAM files produced by Tophat, existing in DEXSeq package.
1. dexseq_prepare_annotation.py only works with GTF files, while alot of annotation files are in GFF3.
2. dexseq_count.py needs sam files to be sorted. but I can not sort the sam file using commands mentioned in this thread. sam files are big anf it says no space left on the device.

**sdriscoll** · 05-24-2013, 11:35 AM

not very helpful but if you can't sort your alignment files on your computer then you need to find a computer with more memory/hard drive space. unless this is a one-off experiment and you don't plan to work with sequencing much afterwards.

it seems to me you should be able to find a GTF format for your annotation - i've never used a GFF3 format file except for a long while back when Tophat accepted that formation and not GTF. what species are you working with?

depending on what sorting you need you might be able to have Tophat produce the output the way you need it. For example you can use the --keep-fasta-order option or the --no-sort-bam option which makes it so the output is not coordinate sorted. I'm not sure if either of them will replicate name sorting though.

**Simon Anders** · 05-24-2013, 11:41 AM

Yes, we should add an option to deal with GFF3. I'll try to get to this soon. For now: It should be sufficient to reduce your GFF3 file to only the lines containing "exon" in the third column and then replacing in each line the word "Parent" by the word "gene_id" in the attributes.

As for sorting: Of course you need to have enough free space on your hard disk to save the sorted file. Actually, you need twice as much free space as your SAM file takes to accommodate both the resulting file and the intermediate ones. But come on, hard disks are really cheap now. Just buy a new one if your old one is full.

**delasa** · 05-24-2013, 11:45 AM

I am working on a powerful server with about 200 TB hard drive and 256 GB memory, but I can not sort the SAM files using sort -s -c -k1,1 !!! weird >_<

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News