Announcement

Collapse
No announcement yet.

htseq-count for sam and gff3

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by padmoo View Post
    Hi everyone,

    I have a similar problem with my gff file. I get the error:

    Error occured when processing GFF file (line 1 of file /gpfs/scratch/cbh12wsu/Thaps3_chromosomes_geneModels_FilteredModels2.gff):
    Feature fgenesh1_pg.C_chr_1000001 does not contain a 'gene_id' attribute
    [Exception type: ValueError, raised in count.py:53]

    My gff files looks like this:
    chr_1 JGI exon 300 1153 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
    chr_1 JGI CDS 300 1153 . - 0 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 10
    chr_1 JGI exon 1199 2425 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
    chr_1 JGI CDS 1199 2425 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber 9
    chr_1 JGI exon 2512 2935 . - . name "fgenesh1_pg.C_chr_1000001"; transcriptId 867
    chr_1 JGI CDS 2512 2935 . - 2 name "fgenesh1_pg.C_chr_1000001"; proteinId 867; exonNumber

    I'm guessing there is something missing? Can anyone help?

    Thanks!
    Padmoo
    In the gff format, the final column has a semi colon seperated list of attributes describing the feature. htseq-counts is looking for one called gene_id that lets it assign counts on the gene level, but your gff doesn't have that as one of the attributes. I'm not sure if "fgenesh1_pg.C_chr_1000001" is supposed to be a gene of if it's something else (it would make your life much easier if it is), but somehow you need to add something that looks like 'gene_id "myGene"' to the gff where it describes exons.

    Comment


    • #32
      Hi cmbetts,

      thanks for replying so quickly. I tried to find some other gff files and came across one that looks like this:
      ##gff-version 3
      ##sequence-region bd_19x4 1 66660
      bd_19x4 prediction gene 639 1694 0 - . ID=gene_1;Name=jgi.p|Thaps3_bd|833;portal_id=Thaps3_bd;proteinId=833;transcriptId=833
      bd_19x4 prediction mRNA 639 1694 . - . ID=mRNA_1;Name=jgi.p|Thaps3_bd|833;Parent=gene_1;proteinId=833;track=FilteredModels1;transcriptId=833
      bd_19x4 prediction exon 639 1694 . - . ID=exon_1_1;Parent=mRNA_1
      bd_19x4 prediction CDS 639 1694 . - 0 ID=CDS_1;Parent=mRNA_1
      bd_19x4 prediction gene 3895 5607 0 - . ID=gene_2;Name=jgi.p|Thaps3_bd|1845;portal_id=Thaps3_bd;proteinId=1845;transcriptId=1845
      bd_19x4 prediction mRNA 3895 5607 . - . ID=mRNA_2;Name=jgi.p|Thaps3_bd|1845;Parent=gene_2;proteinId=1845;track=FilteredModels1;transcriptId=1845
      bd_19x4 prediction exon 5551 5607 . - . ID=exon_2_1;Parent=mRNA_2
      bd_19x4 prediction five_prime_UTR 5551 5607 . - . ID=UTR5_2;Parent=mRNA_2
      bd_19x4 prediction exon 3895 5455 . - . ID=exon_2_2;Parent=mRNA_2
      bd_19x4 prediction five_prime_UTR 5097 5455 . - . ID=UTR5_2;Parent=mRNA_2

      This one has an ID=gene but it's only numbered. Maybe there are no gene models?
      I'm really confused... The files are from the JGI genome portal.

      Comment


      • #33
        Solved my problem. Didn't pay attention to the -i option...

        Comment


        • #34
          Be careful here. In this file, the "exon" lines only contain transcript IDs ("mRNA_1") but not gene IDs ("gene_1") in their "Parent" fields. This will not work, because if an exon appears in several transcripts of the same gene, it will be listed in several lines in the GFF file, each with a different "Parent" value, and if you use "-i Parent", htseq-count will think that these different exon lines correspond to different genes and consider the assignment "ambiguous" rather than counting it for the gene.

          You will have to tweak the GFF3 file such that you add to each "exon" line an additional attribute, say, "gene", which contains the gene ID rather than the transcript of mRNA ID, i.e. (in the logic of GFF3), the Parent ID of the Parent.

          Sorry that this is so complicated, but unfortunately, these file formats are all ill-specified and keep changing. (Maybe I am getting senile, but I am pretty sure that five years ago, when I wrote htseq-count, the GFF3 specification (or was it maybe only GFF2 then?) had "exon" and "gene" lines but no "mRNA" lines, and the "Parent" of an "exon" line was a "gene" line.

          Comment


          • #35
            Hi simon,

            Recently, I am trying to do DEG analysis. I use HTSeq-count to count reads. Here I want to know do I need use the .gtf file during reads mapping with Tophat (-G) before using the HTSeq-count? Because I found there was a little difference between the output.counts. Thank you very much.

            Comment


            • #36
              While you don't need to supply the GTF file to tophat, it will often improve the results slightly.

              Comment


              • #37
                Originally posted by dpryan View Post
                While you don't need to supply the GTF file to tophat, it will often improve the results slightly.
                Thank you for the reply. I have another question. When I used HTSeq-count to count the read pairs for paired-end data, why I just get the counts of reads but not the read pairs. e.g. for my .sam, it has 15241301 lines excluded headers. After HTSeq-count(-r name -s no) analysis, I got the output.counts:
                (sum of counts of genes: 14939019)
                __no_feature 129677
                __ambiguous 172605
                __too_low_aQual 0
                __not_aligned 0
                __alignment_not_unique 0

                So how can I get the read pairs? Thank you very much.

                ps: After tophat, I samtools rmdup the accepted_hits.bam, filtered the unique_mapping_reads and sorted the .bam.
                Last edited by populus; 07-21-2015, 08:09 AM. Reason: add information

                Comment


                • #38
                  1) Don't remove duplicates from RNAseq data without a very very good reason.
                  2) HTseq-count is giving you counts of read pairs, or at least it should be if you actually aligned things as pairs.

                  Comment


                  • #39
                    Originally posted by dpryan View Post
                    1) Don't remove duplicates from RNAseq data without a very very good reason.
                    2) HTseq-count is giving you counts of read pairs, or at least it should be if you actually aligned things as pairs.
                    I think I have figured out the problem. I found there were suffix .1 and .2 in the seq_name of paired reads. It must be counted as single end data. I removed the suffix and the htseq-count worked well.

                    Thank you very much.

                    Comment


                    • #40
                      Most of my reads are going to no_feature

                      Good evening SEQanswers community,

                      I am having issues with counting my reads using HTseq. I found this thread, but I was not able to use the same modifications to help my situation. First I sorted my unsorted mapped reads sam file (not sure if this was necessary), then I edited the sorted file, then I edited the gff3 file, and executed the HTSeq.scripts.count. Most of my reads went to no_feature. What am I doing wrong? Please see my input and sections of my output.

                      I would greatly appreciate your help!

                      #Sorting my mapped reads with:
                      samtools sort -n -O sam -T Project.sort -o Project.sort.sam Project_mapped.sam

                      Output:
                      @SQ SN:lcl|NC_017304.1_cds_WP_003516470.1_3012 LN:1380
                      @SQ SN:lcl|NC_017304.1_cds_WP_003516468.1_3013 LN:621
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513356.1_3014 LN:876
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513354.1_3015 LN:216
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513352.1_3016 LN:381
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513349.1_3017 LN:135
                      @PG ID:bowtie2 PN:bowtie2 VN:2.2.9 CL:"<removed> basic-0 --threads 12 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x Project_b2index -S Project
                      _mapped.sam -U /tmp/15217.unp"
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1236:29817 4 * 0 0 * * 0 0 AACAACAAAAGACAAAAAACTTAAAAACAAAAGACAAAGTCCAAAATTA BBBB/FFFFFFBBB
                      F/FFFB/BFFFFF/FFBFF<<<<<F/FBFFF<F<F YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1301:27869 4 * 0 0 * * 0 0 TAAGAGTTTTTCGAGGGAATAAATAACAACGTTCAAGGAAATCCT BBB<<B//<B//<<////</<B/FFFB////<FB<//</<BB/FB YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1302:11737 0 lcl|NC_017304.1_cds_WP_003514647.1_448 142 42 48M * 0 0 AAAGCCATAGATGCAGCGGTTAATGACATGGCTCTGATAAGTGGTCAG BBBBBFFBF////BFFF</F/FFF/FFFFFFFB/<<FFBFBF<FFFFB AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:11A36 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1318:93081 0 lcl|NC_017304.1_cds_WP_003514647.1_448 321 40 44M * 0 0 CCTGCCAAGAGTAAGTGACTTTTGAGGTGTTCCTGTAAATTCTT BBBBBF///</<B<///<</<</<<<//<<BFFF/<</FBBFFB AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:15G6A21 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1324:19877 0 lcl|NC_017304.1_cds_WP_014522594.1_623 2717 1 50M * 0 0 GCGGTATACCATCCAAGGGAATAGCAAACTGTGACTTTGTATACAGCTAT BBBBBBFFFFFFF/FB<//BFFFFFFFFFBFF<FF/FFFFFFFFFFBFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1325:83438 0 lcl|NC_017304.1_cds_WP_003517434.1_1009 2051 24 41M * 0 0 TCAACACTTCAATCAATGCGGTTATTGAGCTGGTTAATTCC B<BBBBBF/</<B<F/<//BB//7F/<F<F/</<FFFF<FB AS:i:-9 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:8G6G9G15 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1331:65339 0 lcl|NC_017304.1_cds_WP_003513122.1_105 10 40 43M * 0 0 GTTACTTTAAACCTAAGAGTTGAGGCAATAAAAAACAACGAAG <BBBFF/<//<F////</<<//<FBF7FFF//<FFB//</<FF AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:6A6A29 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1345:80429 0 lcl|NC_017304.1_cds_WP_003512596.1_1851 207 42 49M * 0 0 TGTTGAAGAGACGGGACTTCCTATCTGCCTGCATCTTG

                      #Edited the sorted file with sed
                      cp ./Project.sort.sam ./Project.sort.sed.sam
                      sed -i 's/lcl|NC............................_/gene/g' Project.sort.sed.sam

                      Pre-edit:
                      @SQ SN:lcl|NC_017304.1_cds_WP_003516470.1_3012 LN:1380
                      @SQ SN:lcl|NC_017304.1_cds_WP_003516468.1_3013 LN:621
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513356.1_3014 LN:876
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513354.1_3015 LN:216
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513352.1_3016 LN:381
                      @SQ SN:lcl|NC_017304.1_cds_WP_003513349.1_3017 LN:135
                      @PG ID:bowtie2 PN:bowtie2 VN:2.2.9 CL:<removed> basic-0 --threads 12 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x cipA_b2index -S cipA
                      _mapped_DS3_3-5.sam -U /tmp/15217.unp"
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1236:29817 4 * 0 0 * * 0 0 AACAACAAAAGACAAAAAACTTAAAAACAAAAGACAAAGTCCAAAATTA BBBB/FFFFFFBBB
                      F/FFFB/BFFFFF/FFBFF<<<<<F/FBFFF<F<F YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1301:27869 4 * 0 0 * * 0 0 TAAGAGTTTTTCGAGGGAATAAATAACAACGTTCAAGGAAATCCT BBB<<B//<B//<<////</<B
                      /FFFB////<FB<//</<BB/FB YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1302:11737 0 lcl|NC_017304.1_cds_WP_003514647.1_448 142 42 48M * 0 0 AAAGCCATAGATGCAGCGGTTAATGACATGGCTCTGAT
                      AAGTGGTCAG BBBBBFFBF////BFFF</F/FFF/FFFFFFFB/<<FFBFBF<FFFFB AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:11A36 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1318:93081 0 lcl|NC_017304.1_cds_WP_003514647.1_448 321 40 44M * 0 0 CCTGCCAAGAGTAAGTGACTTTTGAGGTGTTCCTGTAA
                      ATTCTT BBBBBF///</<B<///<</<</<<<//<<BFFF/<</FBBFFB AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:15G6A21 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1324:19877 0 lcl|NC_017304.1_cds_WP_014522594.1_623 2717 1 50M * 0 0 GCGGTATACCATCCAAGGGAATAGCAAACTGTGACTTTGTATACAGCTAT BBBBBBFFFFFFF/FB<//BFFFFFFFFFBFF<FF/FFFFFFFFFFBFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU

                      Post-edit:
                      @SQ SN:gene3013 LN:621
                      @SQ SN:gene3014 LN:876
                      @SQ SN:gene3015 LN:216
                      @SQ SN:gene3016 LN:381
                      @SQ SN:gene3017 LN:135
                      @PG ID:bowtie2 PN:bowtie2 VN:2.2.9 CL:<removed> basic-0 --threads 12 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -x cipA_b2index -S cipA
                      _mapped_DS3_3-5.sam -U /tmp/15217.unp"
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1236:29817 4 * 0 0 * * 0 0 AACAACAAAAGACAAAAAACTTAAAAACAAAAGACAAAGTCCAAAATTA BBBB/FFFFFFBBB
                      F/FFFB/BFFFFF/FFBFF<<<<<F/FBFFF<F<F YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1301:27869 4 * 0 0 * * 0 0 TAAGAGTTTTTCGAGGGAATAAATAACAACGTTCAAGGAAATCCT BBB<<B//<B//<<////</<B
                      /FFFB////<FB<//</<BB/FB YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1302:11737 0 gene448 142 42 48M * 0 0 AAAGCCATAGATGCAGCGGTTAATGACATGGCTCTGATAAGTGGTCAG BBBBBFFBF////B
                      FFF</F/FFF/FFFFFFFB/<<FFBFBF<FFFFB AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:11A36 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1318:93081 0 gene448 321 40 44M * 0 0 CCTGCCAAGAGTAAGTGACTTTTGAGGTGTTCCTGTAAATTCTT BBBBBF///</<B<///<</<<
                      /<<<//<<BFFF/<</FBBFFB AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:15G6A21 YT:Z:UU
                      HISEQ-WALDORF:257:C95T2ANXX:2:1101:1324:19877 0 gene623 2717 1 50M * 0 0 GCGGTATACCATCCAAGGGAATAGCAAACTGTGACTTTGTATACAGCTAT BBBBBBFFFFFFF/
                      FB<//BFFFFFFFFFBFF<FF/FFFFFFFFFFBFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU


                      #Edited the GFF3 file with Excel, then used ren *.txt *.gff to convert back.

                      Pre-edit:
                      ##gff-version 3
                      #!gff-spec-version 1.21
                      #!processor NCBI annotwriter
                      #!genome-build ASM18492v1
                      #!genome-build-accession NCBI_Assembly:GCF_000184925.1
                      ##sequence-region NC_017304.1 1 3561619
                      ##species https://www.ncbi.nlm.nih.gov/Taxonom....cgi?id=637887
                      NC_017304.1 RefSeq region 1 3561619 . + . ID=id0;Dbxref=taxon:637887;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=DSM 1313
                      NC_017304.1 RefSeq gene 212 1543 . + . ID=gene0;Name=CLO1313_RS00010;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00010;old_locus_tag=Clo1313_0001
                      NC_017304.1 Protein Homology CDS 212 1543 . + 0 ID=cds0;Parent=gene0;Dbxref=Genbank:WP_003516465.1;Name=WP_003516465.1;gbkey=CDS;product=chromosomal replication initiator protein DnaA;protein_id=WP_003516465.1;transl_table=11
                      NC_017304.1 RefSeq gene 1793 2893 . + . ID=gene1;Name=CLO1313_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00015;old_locus_tag=Clo1313_0002
                      NC_017304.1 Protein Homology CDS 1793 2893 . + 0 ID=cds1;Parent=gene1;Dbxref=Genbank:WP_003513339.1;Name=WP_003513339.1;gbkey=CDS;product=DNA polymerase III subunit beta;protein_id=WP_003513339.1;transl_table=11
                      NC_017304.1 RefSeq gene 2909 3115 . + . ID=gene2;Name=CLO1313_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00020;old_locus_tag=Clo1313_0003

                      Post-edit:
                      ##gff-version 3
                      #!gff-spec-version 1.21
                      #!processor NCBI annotwriter
                      #!genome-build ASM18492v1
                      #!genome-build-accession NCBI_Assembly:GCF_000184925.1
                      ##sequence-region NC_017304.1 1 3561619
                      ##species https://www.ncbi.nlm.nih.gov/Taxonom....cgi?id=637887
                      NC_017304.1 RefSeq region 1 3561619 . + . ID=id0
                      NC_017304.1 RefSeq gene 212 1543 . + . ID=gene0
                      NC_017304.1 Protein Homology CDS 212 1543 . + 0 ID=cds0
                      NC_017304.1 RefSeq gene 1793 2893 . + . ID=gene1
                      NC_017304.1 Protein Homology CDS 1793 2893 . + 0 ID=cds1
                      NC_017304.1 RefSeq gene 2909 3115 . + . ID=gene2
                      NC_017304.1 Protein Homology CDS 2909 3115 . + 0 ID=cds2
                      NC_017304.1 RefSeq gene 3133 4242 . + . ID=gene3
                      NC_017304.1 Protein Homology CDS 3133 4242 . + 0 ID=cds3
                      NC_017304.1 RefSeq gene 4410 4724 . + . ID=gene4
                      NC_017304.1 Protein Homology CDS 4410 4724 . + 0 ID=cds4


                      #Then did the count
                      #Count reads using htseq-count
                      python2.7 -m HTSeq.scripts.count -t gene -i ID Project.sort.sed.sam GCF_000184925.1_2.gff > Project_counts.txt

                      Output: Project_counts.txt

                      gene995^M 0
                      gene996^M 0
                      gene997^M 0
                      gene998^M 0
                      gene999^M 0
                      __no_feature 14821716
                      __ambiguous 0
                      __too_low_aQual 999116
                      __not_aligned 6400468
                      __alignment_not_unique 0
                      Last edited by jmwhitha; 07-20-2016, 06:05 PM. Reason: need to know

                      Comment


                      • #41
                        Does it have to do with my S? http://www-huber.embl.de/users/ander...doc/count.html says that "__no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above was empty)." I know there are start and stop positions for the GFF3. Are there corresponding start and stop positions in the sam file? If so, what column? I did look at http://genome.sph.umich.edu/wiki/SAM, but I still can't figure out if there is corresponding start and stop info. Thanks again for the help!

                        http://www-huber.embl.de/users/ander...doc/count.html

                        Comment


                        • #42
                          @SQ SN:gene3013 LN:621
                          So, each contig/chromosome in your SAM file is a gene. This looks like you aligned to a transcriptome but then tried to proceed with a work flow meant for SAM files aligned to a genome.

                          I think you got confused about what a GFF3 file is.

                          Better explain a bit more what you are trying to do.

                          Comment


                          • #43
                            Multi-fasta

                            Thank you very much for your fast reply! Yes, each contig/chromosome (aka @SQ) is actually a gene. The format of the multi-fasta file I used for my bowtie2 alignment is below. It's not a transcriptome (http://www.ncbi.nlm.nih.gov/nuccore/NC_017304). Why do you think that? My goal is to map count gene reads, not transcript reads. I thought by replacing "lcl|NC............................_" with gene, I could get HTseq count to count my reads. What should I do instead? Thank you again!

                            >lcl|NC_017304.1_cds_WP_003516465.1_1 [gene=CLO1313_RS00010] [protein=/inference=EXISTENCE: similar to AA sequence:SwissProt:A3DHZ4.1] [protein_id=WP_003516465.1] [location=212..1543
                            ]
                            ATGAATACTCAGTTGAATGAAATATGGCAAAAAACTTTAGGACTGCTTAAAAATGAGCTTACAGAAATCA
                            GTTTTAACACCTGGATCAAGACCATCGATCCATTGTCCTTGACAGGCAATACTATAAACCTGGCTGTCCC
                            GGCGGAATTCAACAAGGGAATTCTTGAGTCCAGGTATCAAACTCTGATTAAAAATGCCATTAAGCAAGTT
                            ACTTTTAAGGAATACGAGATTGCATTTATCGTGCCTTCACAGGAAAATTTAAACAAGCTGACGAAGCAGA
                            CCGAGTCCGCCGGCAATGAGGATTCTCCTTTGTCAGTATTAAACCCGAAGTACACGTTTGACACTTTTGT
                            CATAGGAAACAGCAACAGATTTGCACACGCAGCCGCACTGGCCGTGGCCGAGGCACCGGGAAAAGCATAC
                            AATCCCTTGTTCATATATGGCGGAGTGGGACTTGGGAAGACTCATCTTATGCATGCCATCGGGCACTACA
                            TTCTGGAACAGAATTCTTCCCAAAGGGTTTTGTATGTTTCATCTGAAAAATTTACCAACGAACTTATCAA
                            TGCCATTAAAGACAACAGAAATGAAGAATTCAGATCCAAATACAGAAATATTGACGTACTGCTTATAGAC
                            GACATACAATTCATTGCCGGAAAGGAAAGAACGGAGGAGGAGTTCTTCCATACCTTCAATGCTCTTTACG
                            AAGCAAACAAACAGATAATCCTGTCAAGCGACAAGCCTCCGAAAGAAATTTCTCTTGAGGACCGCCTGAG
                            ATCCAGGTTTGAATGGGGCTTGATTGCGGACATGCAGGCACCGGATCTGGAAACCAGGATAGCAATACTA
                            AGGAAAAAAGCCCAGCTTGAAAACCTTACTGTTCCAAATGAAGTAATTGTATTCATTGCAGACAAGATAG
                            CATCAAACATCAGAGAACTTGAAGGTGCCTTAAACAGAGTAATAGCATATTCATCGCTTACGGAAAACGA
                            AATTACCGTCGAACTCGCCAGCGAAGCATTAAAAGACATACTGTCAGCAAACAAGGCGAAAGTTTTAAAC
                            TGCACCACAATCCAGGAAGCAGTGGCCAGATACTTTGACATAAGACCGGAAGAATTTAAATCAAAGAAGA
                            GGACAAGGGACATCGCATTCCCAAGACAAATTGCAATGTACCTGTGCAGAGAACTTACCGAAATGTCCCT
                            CCCAAAAATCGGCGAGGAATTCGGCGGAAGAGATCATACTACTGTAATACATGCATGTGAAAAGATAAGT
                            GAAGAAATCGAAAGCAACTCCGAAACCAGGAGGGCCGTAAGTGAAATAAAGAGGAACCTGCTGGGAAAAT
                            AA
                            >lcl|NC_017304.1_cds_WP_003513339.1_2 [gene=CLO1313_RS00015] [protein=/inference=EXISTENCE: similar to AA sequence:RefSeq:WP_003513339.1] [protein_id=WP_003513339.1] [location=1793..
                            2893]
                            ATGAAAATAGTTTGTTCCAAAGAACAGCTAATGGAAGGAATCAACGTCGTGCAAAAAGCAGTGCCGACAA
                            AAGCCACTCTAACCATACTGGAAGGAATATTGCTGGAAGCATACGACAATTTTAAAATGACCGGAAATGA
                            TTTGGAACTGGGAATAGAATGCCTTATAGATGCAGACATTCTGGAAAAAGGATCTATAGTCTTAAATTCA
                            AAAATGTTCGGAGACATAGTAAGAAGACTTCCCGACTCAGAGGTACTTATTGAAGTTAAAGAGAACAATA
                            CAGTTATCATTGAATGTGACAACTCTCACTTTGAGTTAAGGGGTATGCCTTCTGACAGCTTTCCGTCACT
                            GCCTTCAATTGAAAAAGAGAACATGATCAAAGTCAGCCAAAAGGCAATCAGGGATATGATAAGACAAACA
                            CTTTTTGCCGTAAGTATGGAAGGAACCAGACCGATACTTACCGGTTCACTTATTGAATGTGCAGGAAACG
                            AAATTACCTTCGTTTCAATAGACGGATTCAGAATGGCTCTGAGAAAAAACTTTAACAACGAAGGATTTTC
                            CGAATTCAGTGTTGTCGTACCCGCAAAAACCCTCAGCGAGATAGGCAAAATCTTACAGCCGGTTGATGAA
                            GATATTTACATATACAGTTCTCAAAACCAGATACTGTTTGAAATTGGAAATTGCAAAGTTGTATCAAGAC
                            TTTTAGAGGGTGAATATCTAAACTATAAAAGTATTATACCACCGGAATATGAAACCAGCGTAAGACTTAG
                            AACCGAGGACCTTTTGTCCAGCCTTGAAAGGGCGTCATTGATTACTTCGGACGAAAAGAAATACCCGGTT
                            AAATTTAATATTATAGACGATAAAATCATAATTACCTCCAACACTGAAATAGGAGCAGTAAGGGAAGAAA
                            TCAGAGTCGAAGTAAACGGCAGCAACATGGAAGTGGGCTTCAACCCCAGATATTTTATCGAAGCGCTCAG
                            GGTCATAGATGACGAGCTGGTTGACATATACTTCAATTCAAGTGTCGGTCCGTGTACAATAAGACCTCTT
                            GAAGGCGACAGTTTTGCATACATGATACTTCCGGTAAGAATAAATAAATAA

                            Commands:

                            #Index
                            bowtie2-build -o 2 --threads 12 -f NC_017304.1.nucleotide.fa cipA_b2index

                            #Mapped 3 fastq files to the index
                            bowtie2 --threads 12 --very-sensitive-local -x cipA_b2index -U 1.fastq.gz,2.fastq.gz,3.fastq.gz -S Project_mapped.sam

                            Comment


                            • #44
                              If each gene is on its own pseudo-contig, you don't need htseq-count. You just count how often you see each gene ID in the third column of your SAM file which contains the chromosome to which the read was mapped. You may want to use a suitable script to skip over multi-mapping reads, though (in the easiest case, grep for all reads with NH:i:1 or something like that).

                              On the other hand, it is not a good idea to use a tool chain in a manner not intended by the developers of the tools unless you know enough about the tools\ internals to be sure that this is sound. Bowtie is not meant to map to a reference made up not of complete chromosomes of contigs but of individual genes, and htseq-count is neither meant to be used in this manner.

                              However, there are now tools designed for your manner of operation, most notably salmon and kallisto. So, the easiest and cleanest solution would be to use one of these two.
                              Last edited by Simon Anders; 07-23-2016, 09:57 AM.

                              Comment


                              • #45
                                Simon,

                                Thanks again for your quick replies. I just wanted to mention that I really appreciate your last response. While you may not be interested, I wanted to describe for others at least how I was not using the HtSeq-count tool/workflow properly. Instead of using the genome fasta file, I was using a multi-fasta file (cds only). Once I used the genome fasta file, almost everything worked properly. The one issue I ran into was that HtSeq count did not recognize the values in the first column of the GFF3 file. The reason was because the value in the first column was different from the sam file headers. The way I solved this was by changing the genome fasta file header to value in the first column of the GFF3 file before mapping.

                                Thank you and God bless,
                                Jason

                                Comment

                                Working...
                                X