Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • biofreak
    Member
    • Jun 2011
    • 44

    Raw read counts for RNAseq

    hi All,
    I am very new to this NGS business.. Presently I am trying my hands on RNA-seq data. I have managed to obtain the SAM files by running topHat.
    I also ran cufflinks on those files. I now want to conduct differential expression analysis using DESeq package. (and eventually compare the results of cufflinks and DESeq). How do I obtain the raw read counts from the SAM file? I read that that is the input to DESeq. Can someone please help?
    thanks.
  • vadim
    Member
    • Sep 2009
    • 37

    #2
    wc -l <sam file>

    Comment

    • biofreak
      Member
      • Jun 2011
      • 44

      #3
      I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
      T1a T1b T2 T3 N1 N2
      Gene_00001 0 0 2 0 0 1
      Gene_00002 20 8 12 5 19 26
      Gene_00003 3 0 2 0 0 0
      Gene_00004 75 84 241 149 271 257
      Gene_00005 10 16 4 0 4 10
      Gene_00006 129 126 451 223 243 149

      Comment

      • dariober
        Senior Member
        • May 2010
        • 311

        #4
        Originally posted by biofreak View Post
        I am sorry but I meant to ask How can I obtain something as follows where columns I different samples and rows are genes/exons/transcripts and each of the value is the read count. Is this called a raw read count?
        T1a T1b T2 T3 N1 N2
        Gene_00001 0 0 2 0 0 1
        Gene_00002 20 8 12 5 19 26
        Gene_00003 3 0 2 0 0 0
        Gene_00004 75 84 241 149 271 257
        Gene_00005 10 16 4 0 4 10
        Gene_00006 129 126 451 223 243 149
        Hi,

        Starting from the output of TopHat you can use this pipeline:

        Code:
        samtools sort -n accepted_hits.bam accepted_hits.nsorted  ## Sort by read name (necessary for htseq-count)
        samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
        python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf  ## See the options of htseq that suite you
        For each library, this will give you a table of read counts for each gene.

        Hope it helps

        Dario

        Comment

        • biofreak
          Member
          • Jun 2011
          • 44

          #5
          thanks Dario for the reply.
          I did run htSeq-Count and obtained the following output for each of the SAM files.
          .
          .
          NM_000019 246
          NM_000020 0
          NM_000021 0
          NM_000022 489
          NM_000023 2
          NR_038131 14
          NR_038133 0
          NR_038135 0
          NR_038146 0
          NR_038150 0
          no_feature 4796180
          ambiguous 1464598
          too_low_aQual 0
          not_aligned

          Is the number of ambiguous reads too high?

          Comment

          • kmcarr
            Senior Member
            • May 2008
            • 1181

            #6
            Originally posted by biofreak View Post
            thanks Dario for the reply.
            I did run htSeq-Count and obtained the following output for each of the SAM files.
            .
            .
            NM_000019 246
            NM_000020 0
            NM_000021 0
            NM_000022 489
            NM_000023 2
            NR_038131 14
            NR_038133 0
            NR_038135 0
            NR_038146 0
            NR_038150 0
            no_feature 4796180
            ambiguous 1464598
            too_low_aQual 0
            not_aligned

            Is the number of ambiguous reads too high?
            Well how many reads were inyour data set? Can't say if the number is too high if we don't know what fraction they represent.

            Comment

            • wanfahmi
              Member
              • Apr 2008
              • 34

              #7
              Starting from the output of TopHat you can use this pipeline:

              samtools sort -n accepted_hits.bam accepted_hits.nsorted ## Sort by read name (necessary for htseq-count)
              samtools view accepted_hits.nsorted.bam > accepted_hits.nsorted.sam
              python -m HTSeq.scripts.count accepted_hits.nsorted.sam myfile.gtf ## See the options of htseq that suite you

              Hi Dario,

              I am trying to do exactly you mention above, the first step is to sort the accepted_hits.bam. But it looks like to generate a lots of file. Does it looks correct? As attached. Thanks.
              Attached Files

              Comment

              • hanshart
                Member
                • Nov 2011
                • 27

                #8
                This is ok. It takes some time and after finishing the process the temporary files are deleted.

                Comment

                • wanfahmi
                  Member
                  • Apr 2008
                  • 34

                  #9
                  Originally posted by hanshart View Post
                  This is ok. It takes some time and after finishing the process the temporary files are deleted.
                  HI Hanshart,

                  Is there any problem with my sort? It showed segmentation fault but this accepeted_hits.bam is from tophat output. Thanks

                  [v1wwanm-lx03 S1_R1_thout]$ ls
                  accepted_hits.bam accepted_hits.sorted.0004.bam accepted_hits.sorted.0012.bam accepted_hits.sorted.0020.bam pullout
                  accepted_hits.bed accepted_hits.sorted.0005.bam accepted_hits.sorted.0013.bam accepted_hits.sorted.0021.bam test_extracted_locus.out
                  accepted_hits.index accepted_hits.sorted.0006.bam accepted_hits.sorted.0014.bam deletions.bed tmp
                  accepted_hits.sam accepted_hits.sorted.0007.bam accepted_hits.sorted.0015.bam extracted.out unmapped.bam
                  accepted_hits.sorted.0000.bam accepted_hits.sorted.0008.bam accepted_hits.sorted.0016.bam insertions.bed
                  accepted_hits.sorted.0001.bam accepted_hits.sorted.0009.bam accepted_hits.sorted.0017.bam junctions.bed
                  accepted_hits.sorted.0002.bam accepted_hits.sorted.0010.bam accepted_hits.sorted.0018.bam logs
                  accepted_hits.sorted.0003.bam accepted_hits.sorted.0011.bam accepted_hits.sorted.0019.bam prep_reads.info
                  [v1wwanm-lx03 S1_R1_thout]$ [bam_sort_core] merging from 25 files...
                  [bam_header_read] bgzf_check_EOF: Invalid argument
                  [bam_header_read] invalid BAM binary header (this is not a BAM file).
                  [bam_header_read] bgzf_check_EOF: Invalid argument
                  [bam_header_read] invalid BAM binary header (this is not a BAM file).

                  [1]+ Segmentation fault samtools sort accepted_hits.bam accepted_hits.sorted

                  Comment

                  • wanfahmi
                    Member
                    • Apr 2008
                    • 34

                    #10
                    Hi,

                    How can we change the identifier to real gene name. The gene name should be taken from genes.gtf, isn't it? Thank you


                    ENSG00000261838 0
                    ENSG00000261839 2
                    ENSG00000261840 3
                    ENSG00000261841 0
                    ENSG00000261842 0
                    no_feature 985727
                    ambiguous 435068
                    too_low_aQual 0
                    not_aligned 0
                    alignment_not_unique 778451

                    Comment

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      How does your genes.gtf file look like? Does it contain gene names?

                      Comment

                      • wanfahmi
                        Member
                        • Apr 2008
                        • 34

                        #12
                        Hi Simon,

                        I took the gtf file from ENSEMBL (genes.gtf). Here is the file looks like. It is possible to change from identifier to gene name? Thank you.

                        1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-002"; tss_id "TSS26614";
                        1 transcribed_unprocessed_pseudogene exon 11872 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000515242"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-201"; tss_id "TSS165962";
                        1 transcribed_unprocessed_pseudogene exon 11874 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000518655"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-202"; tss_id "TSS165964";
                        1 transcribed_unprocessed_pseudogene exon 12010 12057 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; exon_number "1"; gene_biotype "pseudogene"; gene_name "DDX11L1"; transcript_name "DDX11L1-001"; tss_id "TSS72060";

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".

                          Comment

                          • wanfahmi
                            Member
                            • Apr 2008
                            • 34

                            #14
                            Originally posted by Simon Anders View Post
                            You can use the -'i' option to tell htseq-count to use the attribute "gene_name" instead of the default, which is "gene_id".
                            It's work! Tq simon..

                            Comment

                            Latest Articles

                            Collapse

                            • SEQadmin2
                              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                              by SEQadmin2


                              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                              ...
                              06-02-2026, 10:05 AM
                            • SEQadmin2
                              Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                              by SEQadmin2


                              With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                              Introduction

                              Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                              05-22-2026, 06:42 AM
                            • SEQadmin2
                              Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                              by SEQadmin2

                              Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                              Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                              05-06-2026, 09:04 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Today, 08:59 AM
                            0 responses
                            9 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 12:03 PM
                            0 responses
                            21 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 11:40 AM
                            0 responses
                            17 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 05-28-2026, 11:40 AM
                            0 responses
                            30 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...