  • help me with dbsnp

    Hi, all,

    I am looking for somatic mutations in RNA-seq data from a cancer cell line. Could someone tell me in detail how to filter out known human polymorphisms using the dbSNP database?

    Thanks

  • #2
    What does your rnaseq "mutation" data look like?
    What does your dbsnp file look like?
    What do you mean by "filter out" ?



    • #3
      I ran VarScan on my RNA-seq data and it called millions of variants. To get at the somatic mutations, I want to remove any calls that are known polymorphisms in dbSNP, but I don't know how to do it. Which dbSNP file should I download, and how do I use it?
      Last edited by zjrouc; 07-17-2015, 12:25 PM.



      • #4
        Do you know any programming languages?
        Can you script in shell languages?
        What build is the varscan output (hg18,hg19,grch38) ?

        Many compressed copies of various dbSNP versions for hg19 (human) are here:
        ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

        For example: snp138.txt.gz or snp142Common.txt.gz

        Note: there are likely somatic mutations in dbSNP.
        Last edited by Richard Finney; 07-17-2015, 12:42 PM.



        • #5
          I am using the GRCh38 version of the reference genome. I have downloaded the snp142Common.txt.gz file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/. Could you please tell me which software I should use next to remove those polymorphisms?



          • #6
            What do the first few lines of your varscan output look like ?

            (not the header)



            • #7
              Hi, it looks like this:
              chr1 10443 . C T . PASS ADP=8;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDPP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:14:8:8:3:4:57.14%:3.4965E-2:32:32:0:3:0:4
              chr1 131628 . C A . PASS ADP=58;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDPP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:28:58:58:44:9:16.98%:1.3495E-3:37:40:8:36:6:3



              • #8
                Ok.

                You want to remove every line of the VarScan output whose chromosome and position match col2 (chrom) and col3 (chromStart) of snp142Common.

                What's your favorite programming language?

                What do you think the next step is?



                • #9
                  Well, the first idea that came to mind is to combine these two columns into an identifier, and then keep anything that is in my VarScan output but not in the dbSNP database. As for languages, I know some bash commands, but nothing very sophisticated. I think sed can do this, right?
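The "combine the two columns into an identifier" idea can be sketched with awk, which is a closer fit than sed for this kind of lookup. This is only a minimal sketch on made-up toy data: the file names (`toy_dbsnp.txt`, `toy_calls.vcf`), the rs names and the positions are all hypothetical. It assumes the UCSC table layout (chrom in column 2, 0-based chromStart in column 3) and a 1-based VCF POS, so chromStart is shifted by one before comparing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for the real files (hypothetical data, tab-separated).
# UCSC-style snp table: bin, chrom, chromStart (0-based), chromEnd, name
printf '585\tchr1\t10442\t10443\trsTOY1\n'  >  toy_dbsnp.txt
printf '585\tchr1\t20000\t20001\trsTOY2\n'  >> toy_dbsnp.txt

# VCF-style calls: chrom, POS (1-based), id, ref, alt
printf 'chr1\t10443\t.\tC\tT\n'   >  toy_calls.vcf
printf 'chr1\t131628\t.\tC\tA\n'  >> toy_calls.vcf

# Pass 1 (NR == FNR): remember every dbSNP site as "chrom:pos", shifting
# chromStart by +1 so it is comparable with the 1-based VCF POS.
# Pass 2: print header lines, plus calls whose key was never seen.
awk -F'\t' '
    NR == FNR { seen[$2 ":" ($3 + 1)] = 1; next }
    /^#/      { print; next }
    !(($1 ":" $2) in seen)
' toy_dbsnp.txt toy_calls.vcf > toy_filtered.vcf

cat toy_filtered.vcf
```

Because the whole key is compared, there is no risk of position 123 matching 1234. The trade-off is memory: every dbSNP key is held in an awk array, which may need several gigabytes for a full dbSNP table.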



                  • #10
                    METHOD1 ... (using sort uniq -d then filter based on dupes)

                    OK. dbSNP is a muy grande file, so a naive script will take a long time on a file that big.
                    A programming language really comes in handy ... but bash and the Unix utilities are up to the task.

                    Check out "sort" and "uniq".
                    Cut col1 and col2 from the VarScan output.
                    Cut col2 and col3 from dbSNP.
                    "cat" the two files together and pipe them to "sort | uniq -d" to get the lines that occur in both (call the result "dupes").
                    "sort" might need the "--buffer-size" param set to a few gigabytes so it sorts in RAM (not on disk).

                    You may need to slap a tab on the end of each line in "dupes". "sed" can do this for you.
                    The reason is that we don't want "chr\t123" to match "chr\t1234", so we make it "chr\t123\t"; this works because VarScan separates col2 and col3 with a tab ("\t").
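To see why the trailing tab matters, here is a small made-up demonstration. The file names and positions are hypothetical. It uses `grep -F`, which behaves like `fgrep`, and builds the tab with `printf` so that the `sed` command does not depend on `\t` support, which varies between sed implementations.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two toy calls: position 123 is "known", position 1234 is not.
printf 'chr1\t123\t.\tC\tT\n'   >  tab_demo_calls.txt
printf 'chr1\t1234\t.\tG\tA\n'  >> tab_demo_calls.txt

# Pattern file without a trailing tab: "chr1<TAB>123"
printf 'chr1\t123\n' > tab_demo_dupes_raw

# Append a literal tab to the end of every pattern line.
sed "s/\$/$(printf '\t')/" tab_demo_dupes_raw > tab_demo_dupes_tab

# Without the tab, the pattern is a prefix of "chr1<TAB>1234", so both
# calls are discarded. "|| true" because grep exits 1 when it prints nothing.
grep -F -v -f tab_demo_dupes_raw tab_demo_calls.txt > tab_demo_wrong.txt || true

# With the tab, only the exact position 123 is removed.
grep -F -v -f tab_demo_dupes_tab tab_demo_calls.txt > tab_demo_right.txt || true

echo "kept without tab: $(wc -l < tab_demo_wrong.txt)"
echo "kept with tab:    $(wc -l < tab_demo_right.txt)"
```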


                    Theoretically, this should then work. "-f" says "use this file for matches" and "-v" says "actually, do the opposite, don't match them". See "man grep" for details.

                    fgrep -v -f dupes varscan.output

                    Make sure VarScan and dbSNP are not "one off", that is, that their coordinates agree and aren't "off by one". UCSC tables store chromStart 0-based while VCF positions are 1-based, so for a single-base SNP the VCF position equals chromStart + 1 (the chromEnd column).
                    Make sure to get the tabs right.
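Put together, METHOD1 might look roughly like the sketch below, run here on made-up toy data so the result is easy to check by eye. The `m1_*` file names and the rs names are hypothetical. Two things are assumptions rather than part of the recipe above: the dbSNP key uses chromStart + 1 to line up with 1-based VCF positions (valid for single-base SNPs only), and each key list is deduplicated with `sort -u` first so that a position repeated within one file is not mistaken for a shared one.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # byte-wise ordering keeps sort and uniq consistent

# Toy inputs (hypothetical data; substitute the real files).
printf '585\tchr1\t10442\t10443\trsTOY1\n585\tchr1\t20000\t20001\trsTOY2\n' > m1_dbsnp.txt
printf 'chr1\t10443\t.\tC\tT\nchr1\t131628\t.\tC\tA\n' > m1_calls.vcf

# Keys from the calls: chrom and POS, one copy each, header lines skipped.
grep -v '^#' m1_calls.vcf | cut -f1,2 | sort -u > m1_keys_calls

# Keys from dbSNP: chrom and chromStart + 1 (UCSC starts are 0-based).
awk -F'\t' -v OFS='\t' '{ print $2, $3 + 1 }' m1_dbsnp.txt | sort -u > m1_keys_dbsnp

# A key present in both lists appears twice after sorting; "uniq -d"
# keeps one copy of each such key. Then append the trailing tab.
cat m1_keys_calls m1_keys_dbsnp | sort | uniq -d \
    | sed "s/\$/$(printf '\t')/" > m1_dupes

# Drop every call whose "chrom<TAB>pos<TAB>" is in the dupes list.
grep -F -v -f m1_dupes m1_calls.vcf > m1_filtered.vcf || true
cat m1_filtered.vcf
```

On the real table, the dbSNP step would read from `zcat` of the downloaded file instead of `m1_dbsnp.txt`, and `sort` would likely need the `--buffer-size` setting mentioned above.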

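                    METHOD2 ... (using "comm")
                    # chrom and position from the VarScan calls, sorted for comm
                    cut -f1,2 varscanoutput | sort > file1
                    # chrom and chromStart from dbSNP, sorted the same way
                    zcat hg38.snp142Common.txt.gz | cut -f2,3 | sort --buffer-size=20G > file2
                    # lines common to both files, with a tab appended to the whole line
                    comm -12 file1 file2 | awk '{print $0"\t"}' > dupes
                    # keep only the calls that match none of the patterns
                    fgrep -v -f dupes varscanoutput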



                    • #11
                      Thank you for your help, really appreciated.
                      Last edited by zjrouc; 07-20-2015, 01:35 PM.



                      • #12
                        Check out bedtools:
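As a rough illustration of how bedtools could be applied here (a sketch under assumptions, not necessarily what the poster had in mind): columns 2 to 5 of the UCSC snp table already form a BED record, and `bedtools intersect -v` reports the entries of `-a` that overlap nothing in `-b`. The `bt_*` file names and the toy records are hypothetical. The intersect step is skipped if bedtools is not installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy inputs (hypothetical data; substitute the real files).
printf '585\tchr1\t10442\t10443\trsTOY1\n585\tchr1\t20000\t20001\trsTOY2\n' > bt_dbsnp.txt
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > bt_calls.vcf
printf 'chr1\t10443\t.\tC\tT\t.\tPASS\t.\nchr1\t131628\t.\tC\tA\t.\tPASS\t.\n' >> bt_calls.vcf

# Columns 2-5 of the UCSC table (chrom, chromStart, chromEnd, name) are
# already a 0-based, half-open BED record.
cut -f2-5 bt_dbsnp.txt > bt_dbsnp.bed

# -v keeps calls that overlap no dbSNP interval; -header keeps the VCF
# header. bedtools accounts for the 1-based VCF POS itself.
if command -v bedtools >/dev/null 2>&1; then
    bedtools intersect -v -header -a bt_calls.vcf -b bt_dbsnp.bed > bt_filtered.vcf
    cat bt_filtered.vcf
else
    echo "bedtools not found on PATH; skipping the intersect step" >&2
fi
```

Like the grep-based methods, this filters by position only and does not compare alleles.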



                        ExAC may also be a better source of rare germline SNPs in coding regions:

                        ftp://ftp.broadinstitute.org/pub/ExA...se/release0.3/
