**Richard Finney** · 07-17-2015, 09:36 AM

What does your rnaseq "mutation" data look like?
What does your dbsnp file look like?
What do you mean by "filter out" ?

**zjrouc** · 07-17-2015, 12:21 PM

I have used VarScan to call millions of mutations in my RNA-seq data. To get the somatic mutations, I want to filter out the polymorphisms in the dbSNP database from the mutations VarScan predicted. I don't know how to do it. Which dbSNP database should I download, and how do I use it?

**Richard Finney** · 07-17-2015, 12:36 PM

Do you know any programming languages?
Can you script in shell languages?
What build is the varscan output (hg18,hg19,grch38) ?

Many compressed copies of various versions of dbSNP for hg19 (human) are here ...
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

For example : snp138.txt.gz or snp142Common.txt.gz

Note: there are likely somatic mutations in dbSNP.

**zjrouc** · 07-20-2015, 08:52 AM

I am using the GRCh38 version of the reference genome. I have downloaded the snp142Common.txt.gz file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/. Would you please tell me which software I should use next to remove those polymorphisms?

**Richard Finney** · 07-20-2015, 09:08 AM

What do the first few lines of your varscan output look like ?

(not the header)

**zjrouc** · 07-20-2015, 10:37 AM

Hi, it looks like this:
chr1 10443 . C T . PASS ADP=8;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:14:8:8:3:4:57.14%:3.4965E-2:32:32:0:3:0:4
chr1 131628 . C A . PASS ADP=58;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:28:58:58:44:9:16.98%:1.3495E-3:37:40:8:36:6:3

**Richard Finney** · 07-20-2015, 12:37 PM

Ok.

You want to remove anything from the VarScan output whose chromosome and position (col1 and col2) match col2 (chrom) and col3 (chromStart) in snp142Common.

What's your favorite programming language?

What do you think the next step is?

**zjrouc** · 07-20-2015, 12:42 PM

Well, the first idea that came to mind is to combine these two columns into an identifier and find anything that is in my VarScan output but not in the dbSNP database. As for languages, I know some bash commands, but nothing that sophisticated. I think sed can do this, right?

**Richard Finney** · 07-20-2015, 01:07 PM

METHOD1 ... (using sort uniq -d then filter based on dupes)

Ok. dbSNP is a muy grande file. Naive scripting will take a long time on a file that big.
A programming language really comes in handy ... but bash and the Unix utilities are up to the task.

Check out "sort" and "uniq".
Cut col1 and col2 from varscan.
Cut col2 and col3 from dbsnp.
"cat" the two files together, pipe to "sort", then to "uniq -d" to keep only the duplicated lines (call the result "dupes").
"sort" might need the "--buffer-size" param set to a few gig to sort in RAM (not on disk).

You may need to slap a tab on the end of each line in "dupes". "sed" can do this for you.
The reason is we don't want "chr\t123" to match "chr\t1234", so we make it "chr\t123\t", because VarScan separates col2 and col3 with a tab ("\t").

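Theoretically, this should then work ... "-f" says "use this file for matches" and "-v" says "actually, do the opposite, don't match them". See "man grep" for details.

fgrep -v -f dupes varscan.output

Make sure varscan and dbsnp are not "one off", that is, their coordinates agree and aren't "off by one". The UCSC tables use a 0-based chromStart while VCF positions are 1-based, so for single-base SNPs the VarScan position should line up with chromEnd (col4) rather than chromStart (col3).
Make sure you get the tabs right.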
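
A minimal sketch of METHOD1 as shell functions, under a few assumptions that are not in the post. The helper names (varscan_keys, dbsnp_keys, filter_known_snps) are made up for illustration. Each key list is de-duplicated with "sort -u" before merging, so that "uniq -d" only reports positions present in both files. The dbSNP position column defaults to col4 (chromEnd) on the assumption that chromStart is 0-based; pass 3 as the last argument to follow the post literally.

```shell
# Sketch of METHOD1 (bash). Helper names are hypothetical, not from any tool.

# Emit unique "chrom<TAB>pos" keys from VarScan VCF output (cols 1 and 2).
varscan_keys() {
    grep -v '^#' "$1" | cut -f1,2 | LC_ALL=C sort -u
}

# Emit unique "chrom<TAB>pos" keys from a UCSC snp table, gzipped or not.
# Assumption: col4 (chromEnd) matches the 1-based VCF position for
# single-base SNPs; pass 3 as the second argument to use col3 instead.
dbsnp_keys() {
    local poscol=${2:-4}
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | cut -f"2,${poscol}" | LC_ALL=C sort -u
}

# Print the VarScan file minus records whose chrom/pos also occurs in dbSNP.
filter_known_snps() {
    local vcf=$1 dbsnp=$2 poscol=${3:-4}
    local dupes tab
    dupes=$(mktemp)
    tab=$(printf '\t')
    # Keys seen in both inputs appear twice after the merge; "uniq -d" keeps
    # them, and sed appends the trailing tab so chr1<TAB>123 cannot match
    # chr1<TAB>1234.
    { varscan_keys "$vcf"; dbsnp_keys "$dbsnp" "$poscol"; } \
        | LC_ALL=C sort | uniq -d | sed "s/\$/${tab}/" > "$dupes"
    # -F: fixed strings, -f: read patterns from file, -v: invert the match.
    grep -v -F -f "$dupes" "$vcf"
    rm -f "$dupes"
}
```

Usage would look like: filter_known_snps varscan.output snp142Common.txt.gz > varscan.filtered. On the full dbSNP table the merge sort may still want a larger "--buffer-size", as noted above.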

METHOD2 ... (using "comm")
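# chrom and position from the VarScan output
cut -f1,2 varscanoutput | sort > file1
# chrom and chromStart from dbSNP; use -f2,4 if the positions turn out to be off by one
zcat hg38.snp142Common.txt.gz | cut -f2,3 | sort --buffer-size=20G > file2
# lines common to both files, each with a trailing tab appended
comm -12 file1 file2 | awk '{print $0"\t"}' > dupes
fgrep -v -f dupes varscanoutput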
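
As an alternative to the grep step, here is a hedged sketch that does the same filtering with an awk lookup table instead of fixed-string matching. This is a swapped-in technique, not what the post describes: it compares col1 and col2 exactly, so no trailing tab is needed, but it holds every dbSNP key in memory. The function name anti_join_by_position is hypothetical, and the key file is assumed to hold one "chrom<TAB>pos" pair per line, like file2 above.

```shell
# Keep VCF records whose "chrom<TAB>pos" is not listed in a key file.
# anti_join_by_position is a hypothetical helper name.
#   $1: key file with "chrom<TAB>pos" per line
#   $2: VarScan VCF output
anti_join_by_position() {
    awk -F'\t' '
        FILENAME == ARGV[1] { known[$1 FS $2] = 1; next }  # load dbSNP keys
        /^#/                { print; next }                # keep VCF header lines
        !(($1 FS $2) in known)                             # print unseen positions
    ' "$1" "$2"
}
```

Usage would look like: anti_join_by_position file2 varscanoutput > varscan.filtered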

**zjrouc** · 07-20-2015, 01:26 PM

Thank you for your help, really appreciated.

**m_two** · 07-27-2015, 02:47 PM

Check out bedtools:

intersect — bedtools 2.31.0 documentation

http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html

ExAC may also be a better source of rare germline SNPs in coding regions.

ftp://ftp.broadinstitute.org/pub/ExA...se/release0.3/

ExAC browser

http://exac.broadinstitute.org/
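
The bedtools suggestion could be sketched roughly as follows. The function names are hypothetical, and the column layout of the UCSC snp table (bin, chrom, chromStart, chromEnd, name, ...) is an assumption to check against the downloaded file. "bedtools intersect -v" reports entries in the -a file that have no overlap in the -b file, and "-header" keeps the header of the -a file. Since both BED and the UCSC table use 0-based starts, no manual coordinate shift should be needed here.

```shell
# Convert a UCSC snp table (plain or gzipped) to 4-column BED:
# chrom, chromStart, chromEnd, name (cols 2-5 of the table).
ucsc_snp_to_bed() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | cut -f2-5
}

# Print VCF records that do not overlap any interval in the BED file.
# Requires bedtools on the PATH.
#   $1: VarScan VCF output, $2: BED file made by ucsc_snp_to_bed
drop_dbsnp_overlaps() {
    bedtools intersect -v -header -a "$1" -b "$2"
}
```

Usage would look like: ucsc_snp_to_bed snp142Common.txt.gz > snp142Common.bed, then drop_dbsnp_overlaps varscan.vcf snp142Common.bed > varscan.filtered.vcf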

help me with dbsnp
