Ensembl gtf to gff3 for tophat

  • reut
    replied
    you can run cufflinks with the .bam file

    Originally posted by edge
    Thanks chadn737,
    To run Cufflinks with its default settings, must I include or exclude the "-h" option?
    eg.
    Code:
    samtools view input.bam > output.sam
    Thanks again.
    You don't have to convert accepted_hits.bam to .sam for Cufflinks; it works with the BAM file as well.
    (That is better, since the BAM file is compressed and therefore a lot smaller than the SAM file.)


  • edge
    replied
    Thanks chadn737,
    To run Cufflinks with its default settings, must I include or exclude the "-h" option?
    eg.
    Code:
    samtools view input.bam > output.sam
    Thanks again.


  • chadn737
    replied
    It's fairly simple to convert BAM to SAM using samtools.

    $ samtools view -h -o accepted_hits.sam accepted_hits.bam


  • edge
    replied
    Hi telos,

    Do you know how to make TopHat produce accepted_hits.sam?
    After I run TopHat, it only generates accepted_hits.bam.
    Thanks for the advice.


  • telos
    replied
    OK, fair enough.. I encountered the problem when comparing the SAM output with the GFF file from your script. Nothing a regexp can't solve, but it would be nice nevertheless if the file produced by your script were entirely consistent with the TopHat SAM output.


  • genec
    replied
    Yeah, the MT/M thing is always an issue. Both MT and M will work, so there's not one that's right, you just have to be consistent from the beginning.

    Gene
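
One way to catch an MT/M mismatch early is to compare sequence names between the alignment and the annotation before running anything downstream. What follows is only a rough Python sketch, not part of gtf_to_gff.pl. The function names are hypothetical, and it assumes the SAM input still carries its header (for example from `samtools view -h`), because the reference names are read from the @SQ lines.

```python
def sam_reference_names(sam_lines):
    """Collect reference names from @SQ header lines (the SN: field)."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def gff_seqids(gff_lines):
    """Collect column 1 of every non-blank, non-comment GFF line."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def unmatched_seqids(sam_lines, gff_lines):
    """Return GFF sequence names that never appear in the SAM header."""
    return sorted(gff_seqids(gff_lines) - sam_reference_names(sam_lines))


# Made-up example data: the annotation says chrMT, the alignment says chrM.
sam_header = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chrM\tLN:500\n",
]
gff = [
    "##gff-version 3\n",
    "chr1\tprotein_coding\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "chrMT\tprotein_coding\tgene\t1\t50\t.\t+\t.\tID=g2\n",
]
print(unmatched_seqids(sam_header, gff))  # ['chrMT']
```

An empty list means every sequence name in the annotation is also known to the aligner's reference.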


  • telos
    replied
    MT -> chrM

    One thing is missing from the script: MT in the Ensembl GTF should be changed to chrM, not chrMT, for compatibility with TopHat.


  • genec
    replied
    Bug fix

    That was a good catch, Michelle. I'm attaching a fixed gtf_to_gff.pl. The previous version dropped the very last gene in the gtf file.

    Gene
    Attached Files


  • mdimon
    replied
    thank you! (and a little bug?)

    Thanks for the script! The C. elegans version is great for other GTF files downloaded from UCSC also.

    I did notice what appears to be a little bug:
    push @trs, [@exons];
    should be added before the final
    process(@trs);

    (I am not a perl expert, I'm more of a python type, so I may be wrong, but until I added this line the last record from the GTF file didn't get printed to the GFF3 file.)

    -- Michelle
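
The bug described above is a common one in any loop that accumulates records and emits a group only when the key changes: the last group never sees a "next" key, so it has to be flushed explicitly after the loop. The sketch below shows the pattern in Python rather than the script's Perl. The attached script is not reproduced in this thread, so `group_exons` and its arguments are hypothetical stand-ins, not the script's actual code.

```python
def group_exons(records, flush_last=True):
    """Group (transcript_id, exon) pairs that arrive sorted by transcript."""
    transcripts = []
    current_id = None
    exons = []
    for transcript_id, exon in records:
        # A new transcript starts: emit the one collected so far.
        if transcript_id != current_id and exons:
            transcripts.append((current_id, exons))
            exons = []
        current_id = transcript_id
        exons.append(exon)
    # Final flush. Without it the last transcript is silently dropped,
    # which is the role "push @trs, [@exons];" plays in the fix above.
    if flush_last and exons:
        transcripts.append((current_id, exons))
    return transcripts


records = [("t1", (1, 10)), ("t1", (20, 30)), ("t2", (5, 15))]
print(len(group_exons(records, flush_last=False)))  # 1
print(len(group_exons(records)))                    # 2
```

Setting `flush_last=False` reproduces the symptom: everything is written except the final record.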


  • seqfast
    replied
    thank you!

    Thanks very much, this works well. I had something similar but was getting hung up in the details. I much appreciate people making these useful scripts available. Thanks, Gene.

    -sf


  • genec
    replied
    See the attached updated script. I modified it to work with your C. elegans file. I believe it works, but give the output a good look to make sure that everything is processed correctly.

    Gene
    Attached Files


  • seqfast
    replied
    script looks great, need help for c elegans

    Thanks for the script, it looks great and works well for the human GTF. I'm working on C. elegans GTF files (from Ensembl), and the ENSG* strings aren't there. I'm not a regex expert and figured I'd ask whether it would be an easy fix to support the C. elegans GTF files. I like this script for its simplicity, though I could use the other one mentioned in this thread if need be. Here is a snippet; I've also attached it in case of formatting issues. Thanks!

    -sf

    I snoRNA exon 3747 3909 . - . gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; gene_name "Y74C9A.6"; transcript_name "NR_001477.2";
    I protein_coding exon 10095 10232 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding CDS 10095 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
    I protein_coding start_codon 10146 10148 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "1"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding exon 9727 9846 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding CDS 9727 9846 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "2"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
    I protein_coding exon 6037 6327 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding CDS 6037 6327 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "3"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
    I protein_coding exon 5195 5296 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding CDS 5195 5296 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "4"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
    I protein_coding exon 4124 4358 . - . gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1";
    I protein_coding CDS 4224 4358 . - 0 gene_id "Y74C9A.3"; transcript_id "Y74C9A.3.1"; exon_number "5"; gene_name "Y74C9A.3"; transcript_name "Y74C9A.3.1"; protein_id "Y74C9A.3.1";
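
The lines above show why a pattern tied to the ENSG prefix fails here: the C. elegans gene_id values look like "Y74C9A.6". A way around this is to parse the attribute column as generic key "value"; pairs and not assume any ID format. The following is a minimal Python sketch of that idea, not the fix that went into the attached script. `parse_gtf_line` is a hypothetical helper, and it assumes the usual nine tab-separated GTF columns (the tabs appear as spaces in the post above).

```python
import re

# One attribute looks like: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into its fixed columns plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


# First line of the snippet above, rebuilt with tabs.
line = "\t".join([
    "I", "snoRNA", "exon", "3747", "3909", ".", "-", ".",
    'gene_id "Y74C9A.6"; transcript_id "Y74C9A.6"; exon_number "1"; '
    'gene_name "Y74C9A.6"; transcript_name "NR_001477.2";',
])
record = parse_gtf_line(line)
print(record["attributes"]["gene_id"])  # Y74C9A.6
```

Because the IDs are taken from the quoted values, the same code would read ENSG-style and C. elegans-style identifiers alike.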
    Attached Files


  • HTS
    replied
    I see. Thanks for the explanation! gtf2gff3 probably didn't work for you because the chromosome names were still in the Ensembl convention rather than the UCSC convention. I forgot that I also wrote a small script to do that conversion (among other things, to filter the downloaded GTF file to suit my needs) before running gtf2gff3 (with the default configuration). I guess the real difference is that gtf2gff3 doesn't assume any particular ordering of the lines, so it loads everything into memory and tries to figure out appropriate gene models from there. Since Ensembl GTF files do group things according to genes/transcripts, it is good to exploit that property.
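
For readers who want to try the chromosome-name conversion mentioned above, a rough Python sketch could look like the following. It is not the small script referred to in the post, which was never shared here. The function names are hypothetical, and it only covers the simple cases: add a "chr" prefix and map MT to chrM. Unplaced scaffolds and patches, which are named differently in the two conventions, are not handled.

```python
def ensembl_to_ucsc(seqname):
    """Map an Ensembl-style chromosome name to a UCSC-style one."""
    if seqname == "MT":
        return "chrM"          # UCSC calls the mitochondrion chrM, not chrMT
    if seqname.startswith("chr"):
        return seqname         # already converted, leave unchanged
    return "chr" + seqname


def rename_gtf_line(line):
    """Rewrite column 1 of a GTF line; pass comments and blanks through."""
    if line.startswith("#") or not line.strip():
        return line
    seqname, rest = line.split("\t", 1)
    return ensembl_to_ucsc(seqname) + "\t" + rest


print(ensembl_to_ucsc("1"))   # chr1
print(ensembl_to_ucsc("X"))   # chrX
print(ensembl_to_ucsc("MT"))  # chrM
```

Applying `rename_gtf_line` to each line of the downloaded GTF before conversion keeps the annotation consistent with a UCSC-named reference.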


  • genec
    replied
    Yes, I had tried that gtf2gff3 script, but it wasn't working right for me. Maybe I didn't configure it correctly.

    The script I posted has trivial memory requirements since it only holds one gene's worth of data in memory at once. All the exons for a gene are assumed to be located together in the gtf file, which seems to hold true for the Ensembl file. This script won't work for non-Ensembl gtf files without modification.

    Gene
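
The one-gene-at-a-time approach described above can be sketched in a few lines of Python using `itertools.groupby`, which groups consecutive items sharing a key and therefore holds only one group in memory. This is an illustration of the idea, not a port of the Perl script. `genes` and `gene_id_of` are hypothetical names, and the sketch relies on the same assumption as the script: all lines for a gene sit together in the file.

```python
import re
from itertools import groupby

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_id_of(line):
    """Extract the gene_id attribute from one GTF line."""
    match = GENE_ID.search(line)
    if match is None:
        raise ValueError("no gene_id attribute in line: %r" % line)
    return match.group(1)


def genes(lines):
    """Yield (gene_id, lines_for_that_gene), one gene at a time."""
    data = (l for l in lines if l.strip() and not l.startswith("#"))
    for gene_id, group in groupby(data, key=gene_id_of):
        yield gene_id, list(group)


# Made-up input: two exons for g1 followed by one exon for g2.
gtf = [
    '1\tprotein_coding\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '1\tprotein_coding\texon\t20\t30\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '1\tprotein_coding\texon\t50\t60\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
]
for gene_id, gene_lines in genes(gtf):
    print(gene_id, len(gene_lines))
# g1 2
# g2 1
```

If a gene's lines are not contiguous, `groupby` yields that gene more than once, which is the same limitation noted above for non-Ensembl files.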
