Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • ssharma
    replied
    Error while using htseq-count

    Hi,
    I am getting an error while using htseq-count.
    I used the following command:
    htseq-count CL1_forBWA_left_DSS3.sam NC_003911.gff

    This is the error i am getting. I got all the files from ncbi.
    Error occured in line 2169 of file NC_003911.gff.
    Error: Feature tRNA does not contain a 'gene_id' attribute
    [Exception type: SystemExit, raised in count.py:55]

    I would really appreciate if anyone can help me out.

    Thanks
    Shalabh

    Leave a comment:


  • Simon Anders
    replied
    For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

    Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

    Code:
    import sys, re
    import HTSeq
    
    insam = HTSeq.SAM_Reader( sys.stdin )
    
    # Go through all reads, with their alignments bundled up:
    for bundle in HTSeq.bundle_multiple_alignments( insam ):
       bestAlmt = None
       # Go through all alignments of a given read, looking
       # for the one with the best alignment score
       for almt in bundle:
          if bestAlmt is None:
             bestAlmt = almt
          elif almt.aQual > bestAlmt.aQual:
             bestAlmt = almt
          elif almt.aQual == bestAlmt:
             # If there are more than one best alignment, 
             # better skip the read
             bestAlmt = None
       if bestAlmt is not None:
          # Change the NH field to 1 and print the line
          print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),
    Call this with something like
    Code:
    sort samfile.sam | python chooseBest.py > filtered.sam
    .

    Be aware hat I haven't really tested this thoroughly.

    Leave a comment:


  • Tobias Kockmann
    replied
    That is easily doable - but this would imply that reads are counted several time as part of different alignments. To my knowledege such a way of counting is not compatible with the statistical model used by DESeq and this is actually what I would like to use the counts for.

    Leave a comment:


  • joro
    replied
    Hi Tobi, could you write a script to change the NH flag to NH:i:1 in the sam file so that htseq-count doesn't ignore them?

    Leave a comment:


  • Tobias Kockmann
    replied
    alignment_not_unique

    Hi HTSeq users,

    I realized that the htseq-count script is very conservative when counting reads matching to known gene models:

    I generated an alignment of my RNA-seq reads to the Drosophila genome using tophat in combination with known Ensembl gene models (provided by gtf file). The output in sam format looks like this:

    Code:
    [tobiasko@bs-dsvr09 tophat_out]$ head accepted_hits.sam
    -1350550920:7:9:328:542 0       chr2L   8880    3       35M     *       0       0       CTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACA     6IC@/IIIIIFABI2II1III3'E2>@I:,&'(31     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097273   XS:A:-
    -1350550920:7:77:1586:563       0       chr2L   8883    3       35M     *       0       0       GACTCGCAAAGTGGACTTGTTCAGCAAGGACATAA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097270   XS:A:-
    -1350550920:7:24:1131:1918      0       chr2L   9327    3       35M     *       0       0       TTTTACAAATCTTACGTAAACACTCCAAGCATGAA     IIIIIIIIIIIIIIII?III=ICEI;EI:IIDII;     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19096819   XS:A:-
    As indicated by the optional NH flag some of the reads are mapped to multiple locations (e.g. first line -> NH:i:2). My problem is, that htseq-counts ignores all reads/alignments that have been mapped to multiple locations by default, which in my case leads to a massive loss of coverage (>50%).

    I wonder, if there is a way to either tune tophat to only report the best alignment per read like one can do when using bowtie with -k1 --best, or to modify the couting scheme of htseq-count to be less stringent (e.g. count every read once at the best alignment position).

    Greetings,
    Tobi

    Leave a comment:


  • usad
    replied
    Install instructions for CentOS (RHEL)

    Hi,

    for what it's worth I detailed the additional steps necessary to install HTSeq on CentOS (should work on Red Head Enterprise as well) in RNA-Seq howto wiki. (Or rather in the first incomplete alpha version)




    Best Wishes,
    Björn

    Leave a comment:


  • fabrice
    replied
    I am runing DEXseq and htseq-count and have some problems. Maybe others are interesting to this or have the answer, so I post to here again.

    When I run dexseq_count.py in DEXseq on my sam file from bwa, it always has this warning:

    ~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:543:
    UserWarning: Malformed SAM line: RNAME != '*' although flag bit
    &0x0004 set
    algnt = SAM_Alignment.from_SAM_line( line )
    ~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:592:
    UserWarning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have
    an aligned mate which could not be found. (Is the SAM file properly
    sorted?)
    "which could not be found. (Is the SAM file properly sorted?)" )

    but I can found GA2-EAS337_0017_FC:1:8:11202:16260#0 mate in the data,
    as show in the attachments a.sam. And the a.sam is sorted by the name
    (default by bwa).

    In htseq-count, it also have this problem.

    htseq-count a.sam Homo_sapiens.GRCh37.63/Homo_sapiens.GRCh37.63.gtf
    100000 GFF lines processed.
    200000 GFF lines processed.
    ...
    2000000 GFF lines processed.
    2064769 GFF lines processed.
    Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
    Warning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have an
    aligned mate which could not be found. (Is the SAM file properly
    sorted?)
    4 sam line pairs processed.

    Why here report 4 line pairs? In fact, there are 3 sam line pairs in a.sam.

    BTW: Do this kind of count using BWA or tophat
    mapping, which one is better? (Illumina pair-end human RNA-seq data)

    Thank you.
    Attached Files

    Leave a comment:


  • joro
    replied
    Oops, I didn't realise that I had the quiet option on! Yes I'm now getting warnings about the chromosomes not appearing in the GFF file. I guess these reads are not included in the output SAM file? Thanks for your help Simon.

    Leave a comment:


  • joro
    replied
    Thanks for your reply. No I didn't get any warnings about reads being skipped. I'll email the files to you.

    Leave a comment:


  • Simon Anders
    replied
    Ho Joro,

    did you get any warnings about reads being skipped? Otherwise, could you send me some examples of SAM lines that are in the input but not the output file, togetehr with your GTF file? Thanks.

    Simon

    Leave a comment:


  • joro
    replied
    Hi,

    I have some reads in the original SAM file that don't seem to be accounted for in the output of htseq-count. To look in more detail I created a new SAM file using --samout. The reads aren't in this SAM file. Presumably they should be in the SAM file but assigned something like "no_feature"? Any suggestions are welcome, thanks.

    Leave a comment:


  • Simon Anders
    replied
    In principle, you could follow the 'Tour' in the HTSeq documentation to learn how to do this yourself.

    The htseq-count script falls into two parts: the main part doing the actual counting, and surrounding it a thin wrapper to get the intervals out of the SAM file. This is all that needs to be changed, and once I separate these into two modular parts, it will be even easier. Might take a while until I find the time to so, though.

    S

    Leave a comment:


  • sshen
    replied
    Will HTSeq be able to count reads from a bed file

    Dear Simon,

    My case is somehow unusual. I have a mapped sequence reads file in bed format. I wonder if you could point me out how I can use HTSeq for reads counts on gene basis. I understand that the bed file contains only interval data and doesn't have quality scores for each base. But in my case, I only need read counts. I have limited knowledge about python.

    Your help will be greatly appreciated!

    Best
    Steve

    Leave a comment:


  • Simon Anders
    replied
    The '-i' and '-t' options give you some flexibility what htseq-count is counted.

    If this is insufficient, please remember that htseq-count was actually meant not so much as a tool in its own right but rather a demonstration of the possibilities of HTSeq, our Python framework for HTS analysis. As demonstrated in the documentation, it is fairly easy (if you know Python or are willing to learn it) to write your own counting script using HTSeq and then implement whatever custom logic you need for your use case.

    Leave a comment:


  • adeonari
    replied
    Another quick question: is it possible with htseq-counts to exclude the reads from the 3' UTR?

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Essential Discoveries and Tools in Epitranscriptomics
    by seqadmin




    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
    04-22-2024, 07:01 AM
  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Today, 08:06 AM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-30-2024, 12:17 PM
0 responses
13 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-29-2024, 10:49 AM
0 responses
19 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-25-2024, 11:49 AM
0 responses
26 views
0 likes
Last Post seqadmin  
Working...
X