Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • sunkorner
    Member
    • Jan 2011
    • 15

    #16
    Please find the the gff file I have used at this url ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_004556.gff

    otherwise also please check a sample of it, hope it will be readable in the post
    ##gff-version 3
    #!gff-spec-version 1.14
    #!source-version NCBI C++ formatter 0.2
    ##Type DNA NC_004556.1
    NC_004556.1 RefSeq source 1 2519802 . + . organism=Xylella fastidiosa Temecula1;mol_type=genomic DNA;strain=Temecula1;db_xref=taxon:183190;note=Pierce%27s disease strain
    NC_004556.1 RefSeq gene 146 1465 . + . ID=NC_004556.1:dnaA;locus_tag=PD0001;db_xref=GeneID:1144259
    NC_004556.1 RefSeq CDS 146 1462 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq start_codon 146 148 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq stop_codon 1463 1465 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
    NC_004556.1 RefSeq gene 1747 2847 . + . ID=NC_004556.1:dnaN;locus_tag=PD0002;db_xref=GeneID:1144260
    NC_004556.1 RefSeq CDS 1747 2844 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq start_codon 1747 1749 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq stop_codon 2845 2847 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
    NC_004556.1 RefSeq gene 3153 4247 . + . ID=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;db_xref=GeneID:1144261
    NC_004556.1 RefSeq CDS 3153 4244 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    NC_004556.1 RefSeq start_codon 3153 3155 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    NC_004556.1 RefSeq stop_codon 4245 4247 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
    I am looking for the count of all the reads mapped to each gene. Please let me know if it is correct way I am doing using locus_tag and CDs. It is also microbial genome.

    Thank you

    Comment

    • sunkorner
      Member
      • Jan 2011
      • 15

      #17
      On the other hand: Is this your complete output? Not a single feature? Or did you cut them off.
      I just gave the last few lines where no features is shown, I also get the count of reads mapped to each locus_tag. eg
      PD0003 23
      PD0001 5
      etc

      Thank you very much

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #18
        Again: Are the five lines you quote the whole output that you get from htseq-count? And have you looked at your data and the GTF file with a genome browser?

        And what is a "locus tag"? (Sorry, I'm not very familiar with prokaryotic data.)

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #19
          Then, you should really check things with a genome browser. Check especially whether the reads are really on the same strand as the gene, to make sure that '-s yes' is correct.

          Comment

          • sunkorner
            Member
            • Jan 2011
            • 15

            #20
            Locus tags are systematic, unique identifiers that are assigned to each gene in GenBank. Therefore I used them as identifiers in this context too. However I am not sure if the use of CDS is causing this problem of lack of identification.

            I will also look into the genome browser and get back to you with results. Thank you

            Comment

            • sunkorner
              Member
              • Jan 2011
              • 15

              #21
              Results from genome browser view,

              I could see many reads actually mapping to each gene which we call locus tag. I get lesser than the actual count for each gene. Any way to increase it. When I remove the -i option then I get error, tRNAs have no gene id feature, error on 72 line and the program exit. Please let me know how to proceed.

              Any input from guys working with bacterial genomes will be very helpful

              Thank you

              Comment

              • sunkorner
                Member
                • Jan 2011
                • 15

                #22
                There are also many regions in between the genes where the reads are mapping....but the ones mapping onto the gene too are not shown in the count

                Comment

                • sunkorner
                  Member
                  • Jan 2011
                  • 15

                  #23
                  I tried to give command,

                  htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out

                  I get error line 72 gene feature doesnt exist for tRNA .......then the progam exits. When I checked the line I see that the feature in this line has 'exon'. Such lines exists for many tRNAs in this gff file. but which are other transcripts we have different information.
                  NC_004556.1 RefSeq gene 17033 17950 . + . ID=NC_004556.1:hemF;locus_tag=PD0015;db_xref=GeneID:1144215
                  NC_004556.1 RefSeq CDS 17033 17947 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq start_codon 17033 17035 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq stop_codon 17948 17950 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
                  NC_004556.1 RefSeq gene 20728 20804 . - . locus_tag=PD0016;db_xref=GeneID:1144216
                  NC_004556.1 RefSeq exon 20728 20804 . - . gbkey=tRNA;locus_tag=PD0016;product=tRNA-Met;note=found by tRNAscan;db_xref=GeneID:1144216;exon_number=1
                  So I am exculsively giving the command htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out

                  Then in PD_out we see,
                  .........
                  ,..........
                  PD2111 4
                  PD2112 2
                  PD2113 7
                  PD2105 15
                  PD2106 11
                  PD2107 35
                  PD2108 10
                  PD2109 21
                  PD2110 17
                  .....
                  ...
                  ....
                  no_feature 14905057
                  ambiguous 13
                  too_low_aQual 0
                  not_aligned 1913205
                  alignment_not_unique 0

                  Something like this. Am I on the right path.

                  By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.

                  Any suggestions please.

                  Thank you

                  Comment

                  • Simon Anders
                    Senior Member
                    • Feb 2010
                    • 995

                    #24
                    Originally posted by sunkorner View Post
                    htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out
                    This won't work with your GFF file, because it does not have 'gene_id' attributes, and you don't have a '-i'.


                    This is why you needed this here:
                    htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out
                    By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.
                    If you had many read falling onto overlapping genes, they would be counted as ambiguous. "no_feature" means that they don't fall anywhere.

                    What you should do is visualize the 'out' SAM file along your GFF file and see if you find any read that is marked as "no_feature" and appears to be on a feature. Then, make a screenshot and post it here, and we can see what is wrong.

                    Comment

                    • sunkorner
                      Member
                      • Jan 2011
                      • 15

                      #25
                      I am using DESeq and I get the following error.

                      > exampleFile = "/home/sun/Desktop/xylella_normalized_rpm.tab"
                      > countsTable <- read.delim(exampleFile, header=TRUE, stringsAsFactors=TRUE)
                      > head(countsTable)
                      Locus_tag WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
                      1 PD0001 32.59 35.18 200.56 37.90 173.21 224.03
                      2 PD0002 26.55 11.80 149.89 18.16 130.71 162.92
                      3 PD0003 9.73 7.46 171.14 14.50 129.91 203.61
                      4 PD0004 7.36 6.46 20.83 12.31 16.76 29.63
                      5 PD0005 47.71 69.57 481.91 82.01 372.57 567.03
                      6 PD0006 5.13 4.90 175.65 13.04 154.31 202.34
                      > rownames( countsTable) <- countsTable$Locus_tag
                      > countsTable <- countsTable[ ,-1 ]
                      > conds <- c( "W", "W", "W", "M", "M", "M" )
                      > cds <- newCountDataSet( countsTable, conds )
                      Error in newCountDataSet(countsTable, conds) :
                      The countData is not integer.
                      > cds <- newCountDataSet( countsTable, conds)
                      Error in newCountDataSet(countsTable, conds) :
                      The countData is not integer.
                      > head(countsTable)
                      WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
                      PD0001 32.59 35.18 200.56 37.90 173.21 224.03
                      PD0002 26.55 11.80 149.89 18.16 130.71 162.92
                      PD0003 9.73 7.46 171.14 14.50 129.91 203.61
                      PD0004 7.36 6.46 20.83 12.31 16.76 29.63
                      PD0005 47.71 69.57 481.91 82.01 372.57 567.03
                      PD0006 5.13 4.90 175.65 13.04 154.31 202.34
                      > str(countsTable)
                      'data.frame': 2123 obs. of 6 variables:
                      $ WT.1 : num 32.59 26.55 9.73 7.36 47.71 ...
                      $ WT.2 : num 35.18 11.8 7.46 6.46 69.57 ...
                      $ WT.3 : num 200.6 149.9 171.1 20.8 481.9 ...
                      $ mutant.1: num 37.9 18.2 14.5 12.3 82 ...
                      $ mutant.2: num 173.2 130.7 129.9 16.8 372.6 ...
                      $ mutant.3: num 224 162.9 203.6 29.6 567 ...
                      I have to still visualize my out SAM file, so I will get back on that soon.

                      Thank you

                      Comment

                      • Simon Anders
                        Senior Member
                        • Feb 2010
                        • 995

                        #26
                        Please, read the error message. It says clearly:

                        "Error in newCountDataSet(countsTable, conds) :The countData is not integer."

                        Sorry but you are about the thousandth person to ask this, and I'm a bit desperate because I thought I wrote it nice and clearly in the manual:

                        DESeq expects as input a table, which tells, for each sample and each gene, how many reads fall onto the gene in the sample.

                        It seems that many users feel an irresistible urge to do strange things to their counts and then supply something like the number of read per million reads, the number per kilobase transcript length, or whatever.

                        So, please, people, stop normalizing, dividing, rounding or doing whatever with your data, and just hand over the raw unnormalized integer counts.

                        Comment

                        • sunkorner
                          Member
                          • Jan 2011
                          • 15

                          #27
                          Thank you very much and also sorry. I am new to RNAseq data analysis and so somehow must have overlooked the details.

                          Comment

                          • Camg
                            Member
                            • Jan 2011
                            • 21

                            #28
                            Error running HTSeq

                            Hi,

                            I'm having the same problem as the original post in this thread. When I run HTSeq I'm getting the error:

                            Warning: Skipping read 'HWI-ST765:7:2307:9917:51869#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.
                            Warning: Skipping read 'HWI-ST765:7:2307:9917:76673#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.
                            This is the command I used:

                            python ../../programs/HTSeq-0.5.3p3/HTSeq/scripts/count.py --mode=intersection-nonempty --stranded=no --idattr=transcript_id -o 24.1.htseq 24.1.sorted.sam ../ec_txn.gtf
                            However, I can see that my SAM file and GTF file are using the same chromosome names so I'm not sure why HTSeq isn't recognizing them.

                            Here are examples of a few lines from my SAM and GTF files:

                            HWI-ST765:7:1101:10001:176633#0 147 chr1 97536 255 101M = 97470 -167 ACAAAGATGTCGCACTTGGCATTGATGTACATGCAGACCTCCTGGCAGACCTTGCGCTCGATGTGCTTCTTGAGCTTGTGGAGCTGTCCTGGGCATCGGAG DDDDDDDDDDDDCCDDDDDDDDDDEEFEDDDDDDDCADDFEFFEHHHHHJIJJJIGGIJJIJIHEIJIJJJJJJJJHEJJJJIJJJJIHBHHHFFDB;CC@ NM:i:0 NH:i:1
                            HWI-ST765:7:1101:10001:176633#0 99 chr1 97470 255 101M = 97536 167 CGGAGGTTTCTGTATCTTCTCCTGTCGTTAACCACGACTCTGTAGAAGTCGCAGTCCCCGACCAGCACAAAGATGTCGCACTTGGCATTGATGTACATGCA BCCFFFBDHHHHHJJJJJJJJJJJJJIFIIJJJJJJIJJJGGIJIJJJHJJIJJIIIJJHHFFDEDDDDDDCDDDCCBDDDDDDDDCDDDDEDEEEEEDDD NM:i:1 NH:i:1
                            HWI-ST765:7:1101:10002:95570#0 147 chr1 107247 255 101M = 107184 -164 TCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACAGGAATGTTCTTCTTTTTTGCCAGCGCCATGACAAGCTTGGAAATTCTTGGCTCTGCATTCTCG BDDDCDDDDDDDDDDDDDDDDDDDDDEDEEEEFEFFFFFFFHHHHHIIGIIIJIJJJJJJJJJJIJIJIIJJJIJIJIJJJJJJJJJJHHHHHFFFFFCCC NM:i:0 NH:i:1
                            HWI-ST765:7:1101:10002:95570#0 99 chr1 107184 255 101M = 107247 164 CGGCAACACAGCACCCCTTGCTCCTCACCTTCCCGCTGGAGCTGACATTCTCCACACCGACAATCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIIIJFHIJIJIIIJGGJIIJIGHHBECDDDCDDDDDDDDDCDDDDDDDDDDDDDDDDDDDDDDDDD NM:i:0 NH:i:1

                            chr1 tophat CDS 4117 4593 . + 0 gene_id "ECU01_0010"; transcript_id "ECU01_0010";
                            chr1 tophat CDS 4391 5086 . - 0 gene_id "ECU01_0020"; transcript_id "ECU01_0020";
                            chr1 tophat CDS 5014 7486 . - 0 gene_id "ECU01_0023"; transcript_id "ECU01_0023";
                            chr1 tophat CDS 7536 8832 . - 0 gene_id "ECU01_0025"; transcript_id "ECU01_0025";
                            chr1 tophat CDS 11158 11529 . - 0 gene_id "ECU01_0030"; transcript_id "ECU01_0030";
                            Let me know if you have any ideas on how to solve this issue.
                            Thanks in advance

                            Comment

                            • Camg
                              Member
                              • Jan 2011
                              • 21

                              #29
                              So I was messing around with parameters and found that when I changed --type=CDS to match my GTF file the program ran perfectly. I guess I was assuming CDS was the standard feature in column 3 of a GTF file since that's what is shown in the example GTF (link in manual). Also, I'm not sure why not specifying the --type attribute resulted in the error message I was getting.

                              Anyways, it is working now. Thanks for this great tool.

                              Comment

                              • Simon Anders
                                Senior Member
                                • Feb 2010
                                • 995

                                #30
                                The default value for 'type' is 'exon', not 'CDS'. This is because a GTF files usually have two lines for each exon, one labelled 'exon' and the other 'CDS', the difference being that the former includes untranslated regions and the latter only the coding part of the exon. (For interior exons, both lines contain the same coordinates.) I felt that most users would want htrseq-count to count a read also if it maps to a gene's UTR and this is why the default is 'exon', not 'CDS'.

                                In your GTF file, all the 'exon' lines seem to be missing, so you hit on the right solution to tell htseq-count to use the 'CDS' lines instead.

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  Yesterday, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Yesterday, 12:03 PM
                                0 responses
                                17 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, Yesterday, 11:40 AM
                                0 responses
                                13 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                29 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-26-2026, 10:12 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...