Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mdk308
    replied
    KHubbard - not sure but it looks like you may not have cygwin installed?

    Leave a comment:


  • KHubbard
    replied
    Trouble with HTSeq install

    Trying to install HTSeq onto a machine running Windows7 and I got the following error:

    C:\Users\ArrayStar>cd C:\HTSeq-0.5.3p9

    C:\HTSeq-0.5.3p9>PATH=%PATH%;C:\Python27;C:\MinGW\bin

    C:\HTSeq-0.5.3p9>python setup.py build --compiler=mingw32
    Could not import 'setuptools', falling back to 'distutils'.
    running build
    running build_py
    running build_ext
    building 'HTSeq._HTSeq' extension
    C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -IC:\Python27\lib\site-packages\
    numpy\core\include -IC:\Python27\include -IC:\Python27\PC -c src/_HTSeq.c -o bui
    ld\temp.win-amd64-2.7\Release\src\_htseq.o -w
    cc1.exe: error: unrecognized command line option '-mno-cygwin'
    error: command 'gcc' failed with exit status 1

    C:\HTSeq-0.5.3p9>

    Any help with what I'm doing wrong and what I can do to fix it? Thanks!

    Leave a comment:


  • Delphine
    replied
    Hi,

    I usely use a command line like this, if this can help:
    I use samtools to read the bam file and with the pipe, I use htseq-count. The "-" call the output of the samtools view and pass it to htseq-count

    samtools view file.bam | python2.6 htseq-count - file.gtf -q > htseq_output &

    Delphine

    Leave a comment:


  • dvanic
    replied
    For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam:
    Question: Is there an easy way to use a bam file as an input fot htseq-count now???

    Leave a comment:


  • gringer
    replied
    Originally posted by gringer View Post
    For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam
    After discovering that pysam doesn't work so well using python 2.4 (which is installed on the Illumina server I use for analysis), I had another look at Simon's code for reading BAM files using SAMtools. The resulting method I came up with uses yield to process the samtools view output, which makes it just a little bit faster (seems to be about an order of magnitude faster than the previous samtools view hack, and possibly a teensy bit faster than pysam). There's no reading of the advanced SAM features (like split reads), and I took out the region-based search, but it might be good enough for a few other people:

    Code:
    def BAM_Reader( bamfile, samtools_exec="samtools" ):
        # derived from Simon Anders' code from
        # http://seqanswers.com/forums/showpost.php?p=22114&postcount=46
    
       """Get Reader object for part of a BAM file
    
       bamfile -- a string, either a file name or an URL to a BAM file
       region -- a GenomicInterval
       samtools_exec -- file name (with path, if necessary)
          of the 'samtools' executable
    
       Returns a SAM_Reader object.
       """
    
       cmd_line = ( samtools_exec , "view", bamfile )
    
       samtools_view = subprocess.Popen( cmd_line,
          stdout = subprocess.PIPE )
    
       for line in samtools_view.stdout:
          fields = line.split('\t')
          #columns from http://samtools.sourceforge.net/SAM1.pdf
          read = HTSeq.SequenceWithQualities(name = fields[1-1], seq = fields[10-1],
                                              qualstr = fields[11-1])
          mapStrand = '+'
          if(int(fields[2-1]) & 0x10 == 0x10):
             mapStrand = '-'
          if(int(fields[2-1]) & 0x4 == 0x4):
             iv = None
          else:
             iv = HTSeq.GenomicInterval(
                chrom = fields[3-1], start = int(fields[4-1]) - 1,
                end = int(fields[4-1]) - 1 + len(fields[10-1]),
                strand = mapStrand)
          yield HTSeq.Alignment(read, iv)
    I've also attached a full script that uses this code for generating 3-column count files and coverage graphs with experiment labels.
    Attached Files

    Leave a comment:


  • gringer
    replied
    For those who weren't already aware of it, HTSeq seems to be able to read BAM files now, using pysam:



    edit: on that note, is there any way I can get the Query name from a BAM file line using the BAM_Reader interface? I need this because it contains an experiment ID, and I would like to split my output based on that ID. My current way to get around it is to use pysam directly. Looking at the HTSeq source code showed me how to convert the output into something that could be used for coverage analysis:

    Code:
    bamFile = pysam.Samfile(bamFileName, "rb")
    coverage = dict()
    totalCount = 0
    hitCount = 0
    for read in bamFile:
        totalCount += 1
        if(not(read.is_unmapped)):
            experiment = read.qname[:read.qname.find(".")]
            hitCount += 1
            ## Taken from HTSeq source code
            chrom = bamFile.getrname(read.tid)
            strand = "-" if read.is_reverse else "+"
            iv = HTSeq.GenomicInterval( chrom, read.pos, read.aend, strand )
            if(not(experiment in coverage)):
                coverage[experiment] = (
                    HTSeq.GenomicArray("auto", stranded = True, typecode = "i"))
            (coverage[experiment])[iv] += 1
    For what it's worth, my BAM file contains query names like '<experiment>.<id>'. The reads were mapped to the yeast genome using tophat, and I'm now trying to extract out coverages for each gene / exon (including n bases up or downstream) on a per-experiment basis.
    Last edited by gringer; 10-25-2011, 05:15 AM. Reason: added additional question

    Leave a comment:


  • songyj
    replied
    Prequisites
    To use HTSeq, you need at least version 2.5 of Python (Python 3 does not work yet),

    Python 3 does not work yet....ooh no!!

    Leave a comment:


  • ssharma
    replied
    No, there is no gene_id.
    I even tried -i ID option and nothing is working.
    Is there any way i can convert gff file to gtf or can directly get GTF file for my Genome?

    Thanks
    Shalabh

    Leave a comment:


  • Simon Anders
    replied
    Well, there is no attribute called "gene_id" in your GFF file, or is there? The attributes are the entries in the last column, and their name is the part before the "=" sign. Try '-i ID'.

    Leave a comment:


  • ssharma
    replied
    Thanks Simon,
    So if i want to use gff file i need to mention the feature type to be used (3rd column) with -t option.

    So this is what i did: (i want to use 'gene')

    htseq-count -t gene CL1_forBWA_left_DSS3.sam NC_003911.gff
    but now i am getting following error:


    Error occured in line 6 of file NC_003911.gff.
    Error: Feature NC_003911.11:gidA does not contain a 'gene_id' attribute
    [Exception type: SystemExit, raised in count.py:55]

    These are few lines from my gff:

    ##gff-version 3
    #!gff-spec-version 1.14
    #!source-version NCBI C++ formatter 0.2
    ##Type DNA NC_003911.11
    NC_003911.11 RefSeq source 1 4109442 . + . organism=Ruegeria pomeroyi DSS-3;mol_type=genomic DNA;strain=DSS-3;db_xref=taxon:246200
    NC_003911.11 RefSeq gene 1 1869 . + . ID=NC_003911.11:gidA;locus_tag=SPO0001;db_xref=GeneID:3194636
    NC_003911.11 RefSeq CDS 1 1866 . + 0 ID=NC_003911.11:gidA:unknown_transcript_1;Parent=NC_003911.11:gidA;locus_tag=SPO0001;note=GidA%3B glucose-inhibited cell division protein A%3B involved in the 5-carboxymethylaminomethyl modification %28mnm%285%29s%282%29U%29 of the wobble uridine base in some tRNAs;transl_table=11;product=tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA;protein_id=YP_165274.1;db_xref=GI:56694929;db_xref=GeneID:3194636;exon_number=1

    Leave a comment:


  • Simon Anders
    replied
    ssharma:

    This line does not have a 'gene_id' attribute, but neither do the others. Please note that htseq-counts expects by default a GTF file. Yours is not one. (See this post for details.) See also the htseq-count documentation, especially then options '-t' and '-i'.

    Leave a comment:


  • ssharma
    replied
    @joro

    Here are the 3 lines from gff file.
    Middle one is 2169


    NC_003911.11 RefSeq gene 176231 176306 . + . locus_tag=SPO_tRNA-Gln-1;db_xref=GeneID:3196480
    NC_003911.11 RefSeq exon 176231 176306 . + . gbkey=tRNA;locus_tag=SPO_tRNA-Gln-1;product=tRNA-Gln;db_xref=GeneID:3196480;exon_number=1
    NC_003911.11 RefSeq gene 176312 176389 . + . locus_tag=SPO_tRNA-Pro-2;db_xref=GeneID:3196280

    Leave a comment:


  • Tobias Kockmann
    replied
    Sorry, my mistake. The comma at the end of the print line escaped the c&p.

    Leave a comment:


  • Tobias Kockmann
    replied
    Originally posted by Simon Anders View Post
    For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

    Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

    Code:
    import sys, re
    import HTSeq
    
    insam = HTSeq.SAM_Reader( sys.stdin )
    
    # Go through all reads, with their alignments bundled up:
    for bundle in HTSeq.bundle_multiple_alignments( insam ):
       bestAlmt = None
       # Go through all alignments of a given read, looking
       # for the one with the best alignment score
       for almt in bundle:
          if bestAlmt is None:
             bestAlmt = almt
          elif almt.aQual > bestAlmt.aQual:
             bestAlmt = almt
          elif almt.aQual == bestAlmt:
             # If there are more than one best alignment, 
             # better skip the read
             bestAlmt = None
       if bestAlmt is not None:
          # Change the NH field to 1 and print the line
          print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),
    Call this with something like
    Code:
    sort samfile.sam | python chooseBest.py > filtered.sam
    .

    Be aware hat I haven't really tested this thoroughly.
    Thanks for the code! I tested it using a tophat alignment:

    Code:
    [tobiasko@bs-bsse01 tophat_output_n_mode]$ ls -l
    total 269918
    -rw-r--r-- 1 tobiasko bsse-paro 275526474 Sep 26 10:26 accepted_hits.bam
    -rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 deletions.bed
    -rw-r--r-- 1 tobiasko bsse-paro        54 Sep 26 10:24 insertions.bed
    -rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 junctions.bed
    -rw-r--r-- 1 tobiasko bsse-paro        68 Sep 26 10:09 left_kept_reads.info
    drwxr-xr-x 2 tobiasko bsse-paro        13 Sep 26 10:21 logs
    [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools sort -n accepted_hits.bam accepted_hits.sortedByName
    [bam_sort_core] merging from 2 files...
    [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | python /usr/local/paro/scripts/selectBest.py > accepted_hits.sortedByName.best.sam
    I think in principle it works, but there are blank lines in the sam output. Could this be fixed quickly???

    These are the sam file heads:

    Code:
    [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | head
    -1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1
    -1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1
    -1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1
    -1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1
    -1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1
    -1350550920:7:1:2:92    0       chrX    13614346        255     35M     *       0       0       ACGGTGGCAGCTGAGTCCGTGACATCGTGTTCGCT     ?2CE16?/+>9I-4:=/7A+-4'(8(88++1$($6     NM:i:0  NH:i:1
    -1350550920:7:1:2:195   0       chr3R   16964869        255     35M     *       0       0       GACACCACCAACGCGGGCTCCTACGTGGCCTGTTC     /7.1'36.+2I/<16:7/B.,-.+/)4*%++),*)     NM:i:0  NH:i:1
    -1350550920:7:1:2:198   0       chrX    9515569 255     35M     *       0       0       AAGCCAAATGCACTGCCCCGTGCCCACACATGCTT     A98''G:?05+;;>/30'/?05%$--,.,0+.-,-     NM:i:0  NH:i:1
    -1350550920:7:1:2:222   16      chr3R   10049214        255     35M     *       0       0       GAGCTCATAGAAGCGATTAAAGAAGCTATTGGGCG     )8+%6&?9C.9?5*7CI?III7EF<94:3C//6.5     NM:i:0  NH:i:1
    -1350550920:7:1:2:237   16      chr3R   26711769        255     35M     *       0       0       GTTGGCCACCAAGACGTCGCGAAGGTGCTTTACTA     -242-(<+,:=8:6596DCFB>D:?176GI@6A:A     NM:i:0  NH:i:1
    
    [tobiasko@bs-bsse01 tophat_output_n_mode]$ head  accepted_hits.sortedByName.best.sam
    -1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1
    
    -1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1
    
    -1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1
    
    -1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1
    
    -1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1

    Leave a comment:


  • joro
    replied
    Hi Shalabh, perhaps it would help if you showed us line 2169 of the *gff file.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Recent Innovations in Spatial Biology
    by seqadmin


    Spatial biology is an exciting field that encompasses a wide range of techniques and technologies aimed at mapping the organization and interactions of various biomolecules in their native environments. As this area of research progresses, new tools and methodologies are being introduced, accompanied by efforts to establish benchmarking standards and drive technological innovation.

    3D Genomics
    While spatial biology often involves studying proteins and RNAs in their...
    01-01-2025, 07:30 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 07:35 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-23-2025, 09:43 AM
0 responses
14 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-23-2025, 08:36 AM
0 responses
18 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-17-2025, 09:38 AM
0 responses
37 views
0 likes
Last Post seqadmin  
Working...
X