Unconfigured Ad

**vadim** · 06-14-2011, 03:21 AM

wc -l <sam file>

**biofreak** · 06-14-2011, 06:13 AM

I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
T1a T1b T2 T3 N1 N2
Gene_00001 0 0 2 0 0 1
Gene_00002 20 8 12 5 19 26
Gene_00003 3 0 2 0 0 0
Gene_00004 75 84 241 149 271 257
Gene_00005 10 16 4 0 4 10
Gene_00006 129 126 451 223 243 149

**dariober** · 06-14-2011, 07:55 AM

Originally posted by biofreak View Post

I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
T1a T1b T2 T3 N1 N2
Gene_00001 0 0 2 0 0 1
Gene_00002 20 8 12 5 19 26
Gene_00003 3 0 2 0 0 0
Gene_00004 75 84 241 149 271 257
Gene_00005 10 16 4 0 4 10
Gene_00006 129 126 451 223 243 149

Hi,

Starting from the output of TopHat you can use this pipeline:

Code:

samtools sort -n accepted_hits.bam accepted_hits.nsorted  ## Sort by read name (necessary for htseq-count)
samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf  ## See the options of htseq that suite you

For each library, this will give you a table of read counts for each gene.

Hope it helps

Dario

**biofreak** · 06-21-2011, 07:59 AM

thanks Dario for the reply.
I did run htSeq-Count and obtained the following output for each of the SAM files.
.
.
NM_000019 246
NM_000020 0
NM_000021 0
NM_000022 489
NM_000023 2
NR_038131 14
NR_038133 0
NR_038135 0
NR_038146 0
NR_038150 0
no_feature 4796180
ambiguous 1464598
too_low_aQual 0
not_aligned

Is the number of ambiguous reads too high?

**kmcarr** · 06-24-2011, 04:55 PM

Originally posted by biofreak View Post

thanks Dario for the reply.
I did run htSeq-Count and obtained the following output for each of the SAM files.
.
.
NM_000019 246
NM_000020 0
NM_000021 0
NM_000022 489
NM_000023 2
NR_038131 14
NR_038133 0
NR_038135 0
NR_038146 0
NR_038150 0
no_feature 4796180
ambiguous 1464598
too_low_aQual 0
not_aligned

Is the number of ambiguous reads too high?

Well how many reads were inyour data set? Can't say if the number is too high if we don't know what fraction they represent.

**wanfahmi** · 12-20-2012, 04:51 AM

Starting from the output of TopHat you can use this pipeline:

samtools sort -n accepted_hits.bam accepted_hits.nsorted ## Sort by read name (necessary for htseq-count)
samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf ## See the options of htseq that suite you

Hi Dario,

I am trying to do exactly you mention above, the first step is to sort the accepted_hits.bam. But it looks like to generate a lots of file. Does it looks correct? As attached. Thanks.

Attached Files

20-12-2012 12-47-23.png (12.1 KB, 231 views)

**hanshart** · 12-20-2012, 05:12 AM

This is ok. It takes some time and after finishing the process the temporary files are deleted.

**wanfahmi** · 12-20-2012, 05:34 AM

Originally posted by hanshart View Post

This is ok. It takes some time and after finishing the process the temporary files are deleted.

HI Hanshart,

Is there any problem with my sort? It showed segmentation fault but this accepeted_hits.bam is from tophat output. Thanks

[v1wwanm-lx03 S1_R1_thout]$ ls
accepted_hits.bam accepted_hits.sorted.0004.bam accepted_hits.sorted.0012.bam accepted_hits.sorted.0020.bam pullout
accepted_hits.bed accepted_hits.sorted.0005.bam accepted_hits.sorted.0013.bam accepted_hits.sorted.0021.bam test_extracted_locus.out
accepted_hits.index accepted_hits.sorted.0006.bam accepted_hits.sorted.0014.bam deletions.bed tmp
accepted_hits.sam accepted_hits.sorted.0007.bam accepted_hits.sorted.0015.bam extracted.out unmapped.bam
accepted_hits.sorted.0000.bam accepted_hits.sorted.0008.bam accepted_hits.sorted.0016.bam insertions.bed
accepted_hits.sorted.0001.bam accepted_hits.sorted.0009.bam accepted_hits.sorted.0017.bam junctions.bed
accepted_hits.sorted.0002.bam accepted_hits.sorted.0010.bam accepted_hits.sorted.0018.bam logs
accepted_hits.sorted.0003.bam accepted_hits.sorted.0011.bam accepted_hits.sorted.0019.bam prep_reads.info
[v1wwanm-lx03 S1_R1_thout]$ [bam_sort_core] merging from 25 files...
[bam_header_read] bgzf_check_EOF: Invalid argument
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[bam_header_read] bgzf_check_EOF: Invalid argument
[bam_header_read] invalid BAM binary header (this is not a BAM file).

[1]+ Segmentation fault samtools sort accepted_hits.bam accepted_hits.sorted

**wanfahmi** · 01-15-2013, 07:26 AM

Hi,

How can we change the identifier to real gene name. The gene name should be taken from genes.gtf, isn't it? Thank you

ENSG00000261838 0
ENSG00000261839 2
ENSG00000261840 3
ENSG00000261841 0
ENSG00000261842 0
no_feature 985727
ambiguous 435068
too_low_aQual 0
not_aligned 0
alignment_not_unique 778451

**Simon Anders** · 01-16-2013, 12:20 AM

How does your genes.gtf file look like? Does it contain gene names?

**wanfahmi** · 01-16-2013, 01:59 AM

Hi Simon,

I took the gtf file from ENSEMBL (genes.gtf). Here is the file looks like. It is possible to change from identifier to gene name? Thank you.

1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-002"; tss_id "TSS26614";
1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-201"; tss_id "TSS165962";
1 transcribed_unprocessed_pseudogene exon 11874 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; tss_id "TSS165964";
1 transcribed_unprocessed_pseudogene exon 12010 12057 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-001"; tss_id "TSS72060";

**Simon Anders** · 01-16-2013, 02:39 AM

You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".

**wanfahmi** · 01-16-2013, 06:28 AM

Originally posted by Simon Anders View Post

You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".

It's work! Tq simon..

Topics	Statistics	Last Post
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, Today, 08:59 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 Today, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM
MetaBeeAI Helps Scientists Process Research Literature Faster by SEQadmin2 Started by SEQadmin2, 05-28-2026, 11:40 AM	0 responses 30 views 0 reactions	Last Post by SEQadmin2 05-28-2026, 11:40 AM

Unconfigured Ad

Raw read counts for RNAseq

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News