  • Annotating a genome based on CNV calls.

    I have a BED file of read locations that looks like this:

    chr1 193244 193246
    chr2 293244 293246

    I want to identify where they lie on hg19 (which gene, and whether intronic or exonic, etc.). How do I automate this? The data spans a number of chromosomes, but I would still like a visual, UCSC-like output, as well as a text one if possible.
    Last edited by pepsimax; 06-15-2012, 09:40 AM.

  • #2

    Did you find a solution to this problem? I am having the same issue at the moment and would like to know.




    • #3
      VarioWatch provides visual output for up to one thousand variants online in real time. It also provides text output for millions of variants. However, it does not accept the BED file format. You can try it if your BED file can be converted to a format it accepts.


      • #4
        I can recommend an easy way to find the overlap between CNVs and human genes.

        I - Create an initial genetrack.RefSeq.GRCh37.txt file
        Go to the UCSC genome table browser:

        There are many output options, here are the changes that you'll need to make:
        clade: Mammal
        genome: Human
        assembly: ''choose the appropriate assembly for the reference you're using''
        group: Genes and Gene Prediction Tracks
        track: RefSeq Genes
        table: refGene
        region: ''choose the genome option''

        Choose the output filename:

        Click the get output button.

        You now have your initial RefSeq file, which is not sorted and contains non-standard contigs (contigs other than the standard 1-22, X, Y, MT).

        II - Remove non-standard contigs and sort the file in karyotypic order:
        Create the extract.tcl script, which looks like this:

        #!/usr/bin/env tclsh

        # Remove contigs other than the standard 1-22,X,Y,MT
        # and sort the file in karyotypic order.

        # Return the whole content of a file (empty string if no name is given).
        proc ContentFromFile {{Fichier ""}} {
            if {[string equal $Fichier ""]} {return ""}
            set f [open $Fichier r]
            set Texte [read -nonewline $f]
            close $f
            return $Texte
        }

        # Return the lines of a file as a list.
        proc LinesFromFile {{Fichier ""}} {
            return [split [ContentFromFile $Fichier] "\n"]
        }

        # Append a text (followed by a newline) to a file.
        proc WriteTextInFile {texte fichier} {
            set fifi [open $fichier a]
            puts $fifi $texte
            close $fifi
            return 1
        }

        # lsort comparison command: order lines by tab-separated column N (4 = txStart).
        # Returns a negative, zero or positive integer, as lsort -command expects.
        proc IncreasingSortOnElement4 {X Y {N 4}} {
            return [expr {[lindex [split $X "\t"] $N] - [lindex [split $Y "\t"] $N]}]
        }

        ## Checking and displaying parameter
        set geneFile [lindex $argv 0]
        if {![file exists $geneFile]} {
            puts "$geneFile doesn't exist. Exit"
            exit 1
        }

        ## Defining output file
        regsub {\.txt$} $geneFile "" outputFile
        set outputFile "$outputFile.sorted.txt"
        file delete -force $outputFile
        puts "...creation of $outputFile"

        ## Grouping the lines by chromosome
        foreach L [LinesFromFile $geneFile] {
            set Ls [split $L "\t"]
            if {[regexp "^#" $L]} {
                # Header line: copy it and locate the chrom column
                WriteTextInFile $L $outputFile
                set i_chr [lsearch -exact $Ls "chrom"]
                if {$i_chr == -1} {puts "Bad header line syntax. chrom column not found - Exit"; exit 1}
                continue
            }
            if {![info exists i_chr]} {puts "Header line (starting with #) not found - Exit"; exit 1}
            regsub -all " " [lindex $Ls $i_chr] "" chrom
            lappend linelist($chrom) "$L"
        }

        ## Writing the standard contigs in karyotypic order
        foreach val {1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M MT} {
            if {![info exists linelist(chr$val)]} {continue}
            WriteTextInFile [join [lsort -command IncreasingSortOnElement4 $linelist(chr$val)] "\n"] $outputFile
        }

        Then run the extract.tcl script:
        tclsh extract.tcl genetrack.RefSeq.GRCh37.txt
        -> creates genetrack.RefSeq.GRCh37.sorted.txt

        III - Create the genetrack.RefSeq.GRCh37.sorted.bed file:
        awk -F"\t" -v OFS="\t" '!/^#/ {print $3, $5, $6, $13}' genetrack.RefSeq.GRCh37.sorted.txt > genetrack.RefSeq.GRCh37.sorted.bed
        (The !/^#/ test skips the header line, and no trailing tab is written after the gene name.)
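For readers who prefer a single script, steps II and III can be sketched in Python as below. This is only an illustration, not the script used in this post: the function name refgene_to_bed and the helper make_row are hypothetical, the column positions are taken from the awk command above (chrom, txStart, txEnd, name2), and the unused columns in the sample rows are filled with placeholder zeros.

```python
# Sketch only: mirrors steps II and III on refGene-style tab-separated lines.
# 0-based columns, matching the awk fields $3, $5, $6, $13:
# 2 = chrom, 4 = txStart, 5 = txEnd, 12 = name2.
STANDARD = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M", "MT"]]
RANK = {chrom: i for i, chrom in enumerate(STANDARD)}


def refgene_to_bed(lines):
    """Return (chrom, start, end, gene) tuples for standard contigs,
    sorted in karyotypic order and then by start position."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2].strip()
        if chrom not in RANK:
            continue  # drop non-standard contigs
        rows.append((chrom, int(fields[4]), int(fields[5]), fields[12]))
    rows.sort(key=lambda r: (RANK[r[0]], r[1]))
    return rows


def make_row(chrom, start, end, gene):
    # Hypothetical helper: builds a 16-column line, unused columns set to "0".
    fields = ["0"] * 16
    fields[2], fields[4], fields[5], fields[12] = chrom, str(start), str(end), gene
    return "\t".join(fields)


sample = [
    "#bin\tname\tchrom",                              # shortened header
    make_row("chr7", 5965776, 6010314, "RSPH10B"),
    make_row("chrUn_example", 10, 20, "EXAMPLE"),     # made-up non-standard contig
    make_row("chr7", 5938340, 5965603, "CCZ1"),
]
bed = refgene_to_bed(sample)
for row in bed:
    print("\t".join(str(x) for x in row))
```

The two chr7 rows come out ordered by start position, and the made-up non-standard contig is dropped.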

        IV - Annotate your CNV bed file:
        Sample CNV bed file = CNVsample.bed
        chr7 5952473 5978460

        Using Bedtools, run the intersection like so (intersectBed is the older name for "bedtools intersect"):
        intersectBed -a CNVsample.bed -b genetrack.RefSeq.GRCh37.sorted.bed -wb > CNVsample.annotated

        chr7 5952473 5965603 chr7 5938340 5965603 CCZ1
        chr7 5965776 5978460 chr7 5965776 6010314 RSPH10B
        chr7 5965776 5978460 chr7 5965776 6010314 RSPH10B2
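To make the output above easier to read, here is a minimal Python sketch of the overlap logic behind it. It is not how Bedtools is implemented; the function name intersect is hypothetical, intervals are assumed to be half-open as in BED, and the simple nested loop is only suitable for small inputs. The coordinates are the ones shown in this post.

```python
# Sketch of an overlap report with the same column layout as the -wb output:
# clipped CNV interval first, then the full gene record.
def intersect(cnvs, genes):
    """Return one row per overlapping (CNV, gene) pair."""
    rows = []
    for chrom, start, end in cnvs:
        for g_chrom, g_start, g_end, name in genes:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == g_chrom and start < g_end and g_start < end:
                rows.append((chrom, max(start, g_start), min(end, g_end),
                             g_chrom, g_start, g_end, name))
    return rows


cnvs = [("chr7", 5952473, 5978460)]
genes = [
    ("chr7", 5938340, 5965603, "CCZ1"),
    ("chr7", 5965776, 6010314, "RSPH10B"),
    ("chr7", 5965776, 6010314, "RSPH10B2"),
]
result = intersect(cnvs, genes)
for row in result:
    print("\t".join(str(x) for x in row))
```

The first three columns are the part of the CNV that falls inside the gene, which is why the CNV coordinates differ from row to row.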

        Of course, you can use Bedtools to run the intersection with other bed files!
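For the exonic versus intronic part of the original question, one possible approach is to keep the exonStarts and exonEnds columns of the refGene table (comma-separated lists of positions) and test each interval against them. The sketch below assumes that layout; the function name classify and the transcript coordinates are made up for illustration, and the label only describes the interval relative to the one transcript passed in.

```python
# Sketch: label a half-open interval relative to a single transcript.
def classify(start, end, tx_start, tx_end, exon_starts, exon_ends):
    """Return 'outside', 'exonic' or 'intronic' for [start, end)."""
    if end <= tx_start or start >= tx_end:
        return "outside"  # no overlap with this transcript at all
    # refGene stores exon bounds as comma-separated strings with a trailing comma.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    for e_start, e_end in zip(starts, ends):
        if start < e_end and e_start < end:
            return "exonic"  # touches at least one exon
    return "intronic"


# Hypothetical transcript spanning 100-1000 with three exons.
tx = (100, 1000, "100,400,800,", "200,500,1000,")
labels = [
    classify(150, 152, *tx),
    classify(250, 252, *tx),
    classify(2000, 2002, *tx),
]
print(labels)
```

A gene with several transcripts would need this check repeated per transcript, with a rule of your choice for combining the labels.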


        • #5
          AnnotSV: An integrated tool for Structural Variations annotation


          I'm annotating my CNV/SV human events with the AnnotSV tool.
          PMID: 29669011 DOI: 10.1093/bioinformatics/bty304

          It combines a broad panel of datasets to provide high-quality structural variation (SV) / CNV annotation:
          - Gene annotations
          - Promoter annotations
          - DGV Gold Standard annotations
          - DECIPHER gene annotations
          - 1000 genomes annotations
          - GC content annotations
          - Repeated sequence annotations
          - TAD annotations
          - OMIM annotations
          - Gene intolerance annotations
          - Haploinsufficiency annotations
          - Homozygous and heterozygous SNV/indel annotations
          - ...

          AnnotSV starts by detecting the genomic overlaps between the input and the annotation features.

          Also of interest: this tool builds one annotation for the full-length SV and a separate annotation for each gene within the SV.

          Really easy to install and to use!

          Input format: VCF or BED

          Otherwise, if you have CNV calls from different CNV callers, I advise you to identify and merge the CNVs common to your different callers before annotating. For that, I would treat two CNVs as the same event when they share a 70% reciprocal overlap measured by length and position (> 70% shared length), as done in DGV.
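As a rough illustration of the 70% reciprocal overlap rule, the sketch below computes, for two calls, the shared length as a fraction of each call and keeps the smaller fraction. The function names and the example coordinates are hypothetical, and intervals are assumed to be half-open as in BED.

```python
# Sketch: reciprocal overlap between two CNV calls given as (chrom, start, end).
def reciprocal_overlap(a, b):
    """Return the smaller of the two fractions (shared / length of each call)."""
    (chrom_a, start_a, end_a), (chrom_b, start_b, end_b) = a, b
    if chrom_a != chrom_b:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0  # the calls do not overlap
    return min(shared / (end_a - start_a), shared / (end_b - start_b))


def same_cnv(a, b, threshold=0.70):
    """True when both calls share more than `threshold` of their length."""
    return reciprocal_overlap(a, b) > threshold


call_1 = ("chr7", 1000, 2000)   # made-up coordinates
call_2 = ("chr7", 1100, 2100)   # shifted by 100 bp: 900 / 1000 shared on both sides
call_3 = ("chr7", 1000, 5000)   # contains call_1, but only 1000 / 4000 of itself

print(same_cnv(call_1, call_2))
print(same_cnv(call_1, call_3))
```

Taking the minimum of the two fractions is what makes the test reciprocal: a small call fully contained in a much larger one does not pass.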
          Last edited by lgmSeq; 07-09-2018, 10:21 PM.

