Hello,
I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:
Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.
This occurs for each read and the job stops running.
Here is my command line for htseq-count:
module load python/2.7.3
htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out
I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5’ UTR and 3’ UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.
My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?
I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.
Thanks!
I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:
Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.
This occurs for each read and the job stops running.
Here is my command line for htseq-count:
module load python/2.7.3
htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out
I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5’ UTR and 3’ UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.
My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?
I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.
Thanks!
Comment