Hi all, I am processing RRBS data generated on an Illumina HiSeq 2000 (50 bp, single-end). I ran FastQC as a quality check and found that one sample has many more overrepresented sequences than the others. Its FastQC table is the first "Code:" listing below; the second listing is a relatively normal sample for comparison.
The sequence in red is one of the two adapters used. In total, two adapters and two PCR primers were used in the sequencing process: the PE Adapters, PE PCR Primer 1.0 and PE PCR Primer 2.0.
While it is easy to spot the adapter as contamination, I am having difficulty finding possible sources for the other overrepresented sequences. They do not seem to stem from the other adapter or the primers.
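One way to double-check whether an overrepresented read derives from any of the oligos is to look for the longest stretch it shares with each adapter or primer, in both orientations. The following is only a minimal Python sketch: the sequences are the ones listed in this post, while the names "PE Adapter 1" and "PE Adapter 2", the helper functions and the 12-base cutoff are arbitrary choices made for illustration.

```python
# Oligo sequences as listed in this post; the two adapter names are made up here.
OLIGOS = {
    "PE Adapter 1": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "PE Adapter 2": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR Primer 1.0": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR Primer 2.0": "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def longest_shared_substring(a, b):
    """Return the longest contiguous stretch present in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def oligo_hits(read, min_len=12):
    """List (oligo name, strand, shared stretch) for stretches of >= min_len bases."""
    hits = []
    for name, oligo in OLIGOS.items():
        for strand, seq in (("+", oligo), ("-", reverse_complement(oligo))):
            shared = longest_shared_substring(read.upper(), seq)
            if len(shared) >= min_len:
                hits.append((name, strand, shared))
    return hits


red = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT"
green = "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG"
for label, read in (("red", red), ("green", green)):
    print(label, oligo_hits(read))
```

With these inputs the red read is reported as fully contained in the reverse complement of PE PCR Primer 2.0 (consistent with the FastQC annotation), while the green read shares no stretch of 12 or more bases with any of the four oligos.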
My colleague suggested trying BLAST, so I ran UCSC BLAT on the first overrepresented sequence (in green), "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG", and the result shows it comes from chrUn:23414511-23414560 on the reverse strand. The BLAT output (cDNA YourSeq, genomic chrUn and the side-by-side alignment) is included with the listings below.
I am puzzled that a genomic sequence like this could make up 3.69% of the total reads.
I also ran BLAT on "CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG" and "TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT", but no matches were found. Therefore I have a couple of questions:
1) How can I find the possible origins of the overrepresented sequences?
2) How should I filter them?
2.1) Is it safe to filter all of them out (assuming, of course, that they are certainly pure contamination)?
2.2) FastQC only reports overrepresented sequences whose frequency is above 0.1%. Do I need to search for more such sequences? If so, how should the threshold be determined? (BSMAP has a parameter -k to filter the top overrepresented k-mers, with a default of 1e-6.)
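To make question 2.2 concrete, exact duplicate reads can be counted directly and listed down to any chosen fraction of the total. This is a rough Python sketch, not how FastQC or BSMAP work internally: the function names and the `min_fraction` parameter are invented here, it assumes plain 4-line FASTQ records, and it keeps every distinct read in memory, which may be impractical for a full HiSeq lane.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def overrepresented(handle, min_fraction=0.001):
    """Return (sequence, count, fraction) for reads at or above min_fraction."""
    counts = Counter(fastq_sequences(handle))
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Tiny in-memory example; a real file would be opened with open(path)
# or gzip.open(path, "rt") instead of io.StringIO.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
for seq, n, frac in overrepresented(demo, min_fraction=0.5):
    print(seq, n, round(frac, 3))
```

Lowering `min_fraction` below 0.001 would list sequences under the 0.1% level mentioned above; which cutoff is sensible remains the open question.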
Lastly, two less relevant questions. First, about raw reads that do not begin with C or T: since my data are MspI digested (cut at C-CGG), fragments are supposed to begin with C or T, so is it safe to discard the reads that do not? Second, as the methylation information is concentrated at the head of the reads, is it necessary or feasible to study methylation contexts other than CpG (e.g. CHG, CHH) from my data?
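Before discarding anything, it may help to see how many reads actually start as expected. Because MspI cuts C-CGG, a read from a digested fragment should begin with CGG (first C methylated) or TGG (first C converted by bisulfite). The Python sketch below only tallies these classes; the function names are made up for illustration and it assumes plain 4-line FASTQ records.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the upper-cased sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()


def start_class(seq):
    """Classify a read by its first three bases: 'CGG', 'TGG' or 'other'."""
    if seq.startswith("CGG"):
        return "CGG"
    if seq.startswith("TGG"):
        return "TGG"
    return "other"


def tally_starts(handle):
    """Count reads per start class."""
    return Counter(start_class(seq) for seq in fastq_sequences(handle))


# Tiny in-memory example; a real file would be opened with open(path)
# or gzip.open(path, "rt") instead of io.StringIO.
demo = io.StringIO(
    "@r1\nCGGTTTTAGAGG\n+\nIIIIIIIIIIII\n"
    "@r2\nTGGGAGTTTGAG\n+\nIIIIIIIIIIII\n"
    "@r3\nGATCGGAAGAGC\n+\nIIIIIIIIIIII\n"
)
print(tally_starts(demo))
```

A large "other" fraction in one sample compared with the rest would be a hint that many of its reads do not come from MspI fragments at all.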
Thanks for any advice.
PS. I forgot to mention the species is rat. Thanks.
Code (FastQC overrepresented sequences, problematic sample):
Sequence  Count  Percentage  Possible Source
[COLOR="Green"]CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG[/COLOR]  4168787  3.694709642846457  No Hit
CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG  2534233  2.2460382606066713  No Hit
[COLOR="Red"]GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT[/COLOR]  2493949  2.2103353851053744  Illumina Paired End PCR Primer 2 (100% over 50bp)
TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT  1344836  1.1919003147071456  No Hit
AAATTATGCAGTCGAGTTTCCCACATTTGGGGAAATCGCAGGGGTCAGCA  1281767  1.1360035652534837  No Hit
GATAAAGACTCATTCCTTGTAGAGCAATAAAATTTATCGTGGCTTAACTA  1028733  0.9117447677260468  No Hit
CACAGAGTGGAACGTCCCTTTAGACAGAGCAGATTTGAAACACTCTTTTT  1027822  0.9109373672796742  No Hit
GATCTCTTTCACTGTCATAATTTCCTCAGTTATAATTTTGCAAAGGCGGT  980751  0.8692193141389342  No Hit
CAGATCATGGGCACCAGACAGGCAAGACAGGTTTGTTAAGAGATGGGTGG  941876  0.8347652061776363  No Hit
CCCGCACCGTCCCTGGCCAGATGTGAGTCCTCCCACCCCTGTCGGGGCTC  896901  0.7949047944590668  No Hit
CTCCTATTTCCAAAAATCCATTTAATATATTGTCCTCGGATAGAGGACGT  857061  0.7595954269689545  No Hit
CCGAGTTTTGTGGAGGAACCCACATAACAAACACCCGCGAAGCCAAAGCA  841097  0.7454468641523844  No Hit
GGGGGGAATAAGGAATGACTGCAAATGGGTATGGAGTTTTTTAGGATGTT  764633  0.6776784034153376  No Hit
GTCAGCCTTGACTACATAGCAAAACTCAAGGCCAACCAGACCTAAAAACA  738731  0.6547219968709378  No Hit
AGTGAGAACCTGGTGCTATGGACAGCTAAGAGCTCACATCCCAAACTGCA  689985  0.6115194258952095  No Hit
AGAGATTAAGGCCGAGTACTGGGCGGTGTCCTCCCCCACTGCACTAACCT  677824  0.6007413832735414  No Hit
CCAATATAGGATGGCCCCCTACCAAAAGCTGAGTTTTAGACTACATCCCT  648286  0.5745624651780862  No Hit
GCTGGGCGTGGTGGTGGGCGCCTGTAGTCCCAGCTGCTCGGGAGGCTGAG  635748  0.5634502952586327  No Hit
CACTTTCTCAGGTATAGAGAGACTCACTTCCTCCTGTGGAGGAAAGCCTG  627786  0.556393739436437  No Hit
GGGGCTTATCACAGCAATAGAACAGCAATTATGACTGGAGTATGATAGTT  620923  0.5503112045698546  No Hit
TGAAAAAACATGAAGACCTCGGTCTGGATCTCTAACACCCGCACTTTCCA  617973  0.5476966806216661  No Hit
CTCGCCTCCGGTGCACCTCAGGTACACGACTTTGACCTCGTTGGGGTCGA  610104  0.5407225487747862  No Hit
GTGTGGAATGCCCAGGAGGCCCAGGCTGACTTTGCCAAGGTGTTGGAGCT  578537  0.51274536997056  No Hit
TGTTGTTCTGAGGGTCTACCCGAACTGCTCCTGAGGGGCCCAGGTTTGTA  573695  0.5084540055783129  No Hit
AGAGCACAGCAGACTTACGGCCTTAGGAAGAAAGCTGCTCACCACATACT  561256  0.4974295773100019  No Hit
CTGCCATCTTTGCGGGTGACTTTCCATCCCTTGAACCAAGGCATATTAGC  491273  0.43540509274522954  No Hit
GGCAAATGAAAGGATTCTCCAGGGGCAACACAAATCAGGTTTTCAATTAT  486016  0.4307459224416272  No Hit
AGACTTCATTGCTCATAGCTATATAGCCTTCATGCTGGGTTAGCTAGCTT  478411  0.4240057683311276  No Hit
TCTTTTGTGCAGGAAGCAGGGGAAGGACCAGGTGTCTACCACCTGTAGAA  458030  0.4059425098267104  No Hit
CTCTCCTTGCTTTCTCCTTGCTAGCTGCCCCCTCCTTGGCAGCCCACACC  448639  0.39761946087842615  No Hit
CATAGCAAATCCTGTACAACCTTCAAAATGGATGCAGAATGCCTCCACTC  440726  0.39060633274214956  No Hit
CTCCCCGGGGCTCCCGCCGGCTTCTCCGGGATCGGTCGCGTTACCGCACT  425394  0.3770178984460049  No Hit
GTGGACCAACCCATGTACTGTGGTACATCTACACAAAATCAGTGACTTCA  421541  0.3736030642858794  No Hit
GCTTCGCGCCCCAGCCCGACCGACCCAGCCCTTAGAGCCAATCCTTATCC  421527  0.3735906563756168  No Hit
TAGGGAATTGAAACACCACAAGTGGTAGGAAGTGCGGCCACAAGGTCGGT  414823  0.3676490399184453  No Hit
GGTGCCCTTCCGTCAATTCCTTTAAGTTTCAGCTTTGCAACCATACTCCC  403885  0.3579549168861449  No Hit
TACCTTCTTCTGGTGTGTCTGAAGACAGTACAGTGTACGTGCATACGTAC  349789  0.3100107516314984  No Hit
CCCACCATACATATTAACTGTAGTCACAATGTGACCGACTTCTTTTTGCT  344734  0.30553060974739904  No Hit
CCTCAGGTTCTGGTGACATTCCTGCTACTCCCACACTACTAGCTTATATT  342766  0.3037864120762008  No Hit
CGCTCTGGTCCGTCTTGCGCCGGTCCAAGAATTTCACCTCTAGCGGCGCA  325915  0.28885171951656513  No Hit
ATGTCTCTCAGACCAACAGAATGTGAAGACAATGGCTGTACATGGCGGCC  325544  0.2885229098946065  No Hit
CAATTCGATGGTGTTTCCATTCGATTCATTCGATGTTGATTCCATTAGCT  323500  0.28671135499626843  No Hit
ATCATTTGAGGTCAAGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACC  318014  0.2818492267319422  No Hit
CTGGTCAGCCACAGCAAAGACTGGGAAGAGCACCTGAGGGAAGGACGCAC  307797  0.27279411107816515  No Hit
ACCTGATCTAGCCTAGAGACCAGACCCTAGGTGACAGTACTGTTTCAAGC  290243  0.2572363641674867  No Hit
TGACTTTGTATGTTCATTGTAACTTCTTTGTTGATTCATCTAGCTTTCTC  289391  0.2564812542000776  No Hit
CCAACTGTTGCCTCGGTGCCACACTCCATCATCAATGGGTACAAGCGCGT  280591  0.24868199632073554  No Hit
CGCCTAGAAATTTTGATTCCATTCGTGAAAATTTTTCTATATCCCGAACA  278477  0.24680840187108452  No Hit
AAAGTCGAAATGCAGGATGGGATTTTAAAATGGTAGAAGAGTAGGAAGCT  276112  0.24471235131601132  No Hit
GCCCTTTTTCTTGTGCAGTTTGAGTTTGGAAATGTCTTAGAGCATGTCTT  264968  0.23483565474698995  No Hit
CCTGGTACAACTCCTGGTGGTGGGTCTGGGAGGGCTGACTGGGCAGGGAG  263449  0.23348939648349898  No Hit
CTATACAATTCTCTGTTATGTGGGTCTGTCATGTGCACTGTAGGACATTT  261867  0.23208730262382635  No Hit
GTCTGTGATGCCCTTAGATGTCCGGGGCTGCACGCGCGCTACACTGACTG  260688  0.2310423793238554  No Hit
CCAGTGTTGTGATTGAGCTATCCCACCAAAAGTATCGAGACCCACCTGTG  258187  0.22882579478337417  No Hit
TCTCTCTCAATTTGGTCTTCTAGGTGATTCTAGTTCCAGTCAGTTGACAA  254677  0.22571495442468206  No Hit
TGGGATTATAGGCGTGCGTCACCACGCCCAGCTAATTTTGTTGTATTTTT  242812  0.21519925047713734  No Hit
CTGGTCAAGTGAAGCAGTGGGAGCGGAGAAGGAACAAAGAAATCTGTAAC  230735  0.2044956553170448  No Hit
CTCCTATTCCATCTCCCTGCTCCAAAAATCCATTTAATATATTGTCCTCG  230627  0.20439993715216198  No Hit
GTTTGATATGGTTTGGCTGTATTCCCATCCAACTATCACCTTGAATTGTA  217060  0.1923757858284081  No Hit
CTGTGCCATCTATGAGGGACAGCCGCTGACGTGTCCTCATTGGCAGTGTG  211325  0.1872929740172687  No Hit
CTTTTCAGGAGCACCCCACTTGTGGTACCAATTTACTCTGTGAGTCCATT  210579  0.18663180965613355  No Hit
CCTGGTAGTATACTTTTCTGGTAGAGAGTAGTATATGTATTTTGTGGAAC  208492  0.18478214474770324  No Hit
GTCGCTTCTTGGAACCCAATTGCTTCTCATGGGTTGGGTGGAGAGCAAAC  202247  0.17924733049128375  No Hit
CCCCCCAAGCACCCCACCTTGTCCCCCAGGATGGTCAGGCATCTAGGGAT  200278  0.17750224654078098  No Hit
AGAAGCAGGGCTCTACCATAACTAGAGCTCTGAGGCGGGATGTCAGTTAG  198273  0.17572525653531723  No Hit
CAGTTAACACTATAATCAAATGTACTTATAAAATCTGGACCTAACAGCAT  198031  0.17551077694363532  No Hit
ATTATATAAGTGTTTGTTCATTTGCGGGTGAAGCTACCATTTCCCACAAA  197166  0.1747441453452682  No Hit
CCTTAAAGTATTTTTGAACTATGAAACAAAAACTAAACTGGCTTTATCCA  195457  0.1732294940139278  No Hit
GTGCACCGGCTGCTCCGCAAGGGCAACTACTCGGAGCGCGTGGGCGCCGG  194906  0.17274115411716442  No Hit
TGGACAATGACAGGAGGTAAAACCATGGGGAAAGAATGTTACCTACTGAG  192060  0.17021880321664082  No Hit
CATCCATATCAGAATCCTGTCAACAAGCACTCCTGTCTTCATTAAGTTTT  191966  0.17013549296202057  No Hit
AATCATCGAGTGGAATCGAATGGAATTATGATCAAATGGAATCGAATGTA  190779  0.16908347942761387  No Hit
AGTTACTTGGTGACTTCAGTTCATTCTCACTTGGACACGCTTGTATTTAG  186902  0.16564737456418102  No Hit
CTCAAGCATTATTACAGAGCAATAGTTAAAAAACTTATATGGTATTGGTA  185477  0.16438442655531027  No Hit
CTCCAGTACCTGTCTAGGCATACACAACTGCACCTGGTTTGTGTGGTGCT  183150  0.16232205461380697  No Hit
CTATAATTCCTTTATACCACACTTGAAATTATCCTGGTTGTAATTTTTTT  181595  0.16094389029535505  No Hit
TATCGAACAGATATCTGCCATGTTTATTGCAGCACTATTCACAATAGCCA  179886  0.15942923896401465  No Hit
GCGGCTGAGGCGGCCGTCGGCTGGGTGGGCAGGAGTGGTCGGGCGAACCC  178773  0.15844281009813876  No Hit
AATTTATTAGTATAAAGCAGGGACAGGAGAGATGGTTCTAGAAGTAAAAG  170043  0.15070559177010964  No Hit
CAAGCACAAGGCCCTGGGTTCAGTCCCCAGCTCCAAAAAAAAAAATTATT  167454  0.14841101465083503  No Hit
CTGATAAATGCACGCATCCCCCCCCGGGAAGGGGGGTCAGCGCCCGTCGG  162444  0.14397075533543685  No Hit
AAGAGCACACCGACAGGTACCAGCAAATGCTGACGGGCCATCAATGCGGG  161423  0.1430658641655723  No Hit
ATAATATTTTAGAGGCAGAAGATCATAAAGTCCACAGAGAAACTGAGAGC  158692  0.14064543538506283  No Hit
CCCAGGCTGGAGTGCAGTGGCACAATCTCGGCCCACTGCAACCTCCGCCT  145570  0.12901567835179842  No Hit
TCATTGAGATTAGCCAGACCCAAAGCTTGTACACCTCAATGAACTTAATA  143644  0.1273087044113879  No Hit
TGGTTAGGTGGAGGGAAAAAATAGTTAAATTTATGGATGTTTTAGTATGG  140787  0.12477660443851511  No Hit
AACGTATAAGGTCATCCACTATTAGACCACATGGGTATAAGGCTGTCCCT  139600  0.12372459090410841  No Hit
CTTCCACAACTTCCTTCTTCTCCTTTAAGTCCTTGGTGGTGATTTCGGAG  139514  0.12364837088392393  No Hit
GTTTGCTTCAGAGGCACTGTGTTCCACCCAGAAACATAGACTGCAAGACC  138876  0.12308292468767162  No Hit
CATTCATTCCTCCATGGCTTCTGCTTCAGTTCCTGCCTCCAGGTTCCTGC  136648  0.12110829439731094  No Hit
ATCTCATGGCAGAAGAGCATCACATGCTGAGCAGCTGCACAAGATAGAGC  134427  0.11913986806208153  No Hit
CTCTGCTGCTTAATTTCAGGAATGGCAAATTATCAGCATTACTGACATAT  133329  0.1181667333857727  No Hit
CTTCCTTCTACTGTTCAGTCTATGTCATATCAAATAAATTTACTCATTAG  128815  0.11416606860539201  No Hit
CTCTAGTAAACATGTCATCTCACTAGCACAAATGTCCTCGTTAGCCAGTC  127186  0.11272231961840926  No Hit
TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA  125995  0.11166676096678466  No Hit
TACACACACCTTTAAATTTACGAATTCCCAAAACTAAGTCAAGCAGGGTA  123011  0.10902210352224412  No Hit
ATCTGGACGTCCCTGAAGCAGGGGGACAGGTGTACAGACATGTTCTTGTG  122365  0.10844956709155605  No Hit
ATTAGCCTTGTCTTTGGAAGGAGACTTACTGTCTCTCTTCCTAAATTTAA  121134  0.10735855726775265  No Hit
CCTCCTCTCATTTTTGTTTTGCCTTTGAATATTGCTTTCACTAATTTTAG  119193  0.10563828913777502  No Hit
Code (FastQC overrepresented sequences, relatively normal sample):
Sequence  Count  Percentage  Possible Source
TGGGAGTTTGAGGAGATGTTAGTTGATGTGAGAGAGAATTGAGGTAGATG  270138  0.22391259681883607  No Hit
TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA  263551  0.2184527493510764  No Hit
CGGGAGTTTTAGTGTATTAGGGTTTTAGATGGTTTTTGGTTTTTTTTTTT  215863  0.17892501198315086  No Hit
CGGTTTTAGAGGAATTTTGTTTTTGTGTGTTTTGAGTTTATTAGGTAGGT  210263  0.17428327130917876  No Hit
TGGAGGGAGGAGTGGGGATGGTGATGGTGGGATGTGGGGAGGGGGGAGAG  200400  0.16610800554714536  No Hit
TGGGGAGTGGGGTTTTGTGAGTAGATTTTTAGTTGTGTGATGTGATTTTT  187398  0.15533087836089793  No Hit
TGGGGATATAGTATTTTTTTGGTTTTAGAGGTTTAGGTTTTTGTTATTAG  178946  0.14832516547225286  No Hit
TGGTTGTTGTGGTTGTGGTGGTGTGTTTTGTTTGGTTTTTTGGAGGTGTG  176769  0.14652068878524618  No Hit
GCTGTCCACACGTCGTTGAAAGGCACTGACTGCCCCTGAGCTACTTAGGG  170937  0.1416866474262095  No Hit
CGGTACGAGATCGATAGTTAATAAGTATCGTAAGGGAAAGTTGAAAAGAA  168924  0.14001810743036916  No Hit
CATTTCAGGCCTTGTGCCAACATCATTAAACTCCCAGTCATACCCAAAAC  158301  0.13121289114829668  No Hit
TAGGCAGTACCATTCAGGACATAGGCATGGGCAAGGACTTCATGTCTAAA  152310  0.12624705750940973  No Hit
[COLOR="Red"]GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT[/COLOR]  135347  0.11218672767859023  Illumina Paired End PCR Primer 2 (100% over 50bp)
TGGTTGTGGGAATGTTGTTGTGGAAGGGGGGGATGAGGTGGTAATTGTAG  124767  0.10341715333383573  No Hit
CGGGGGACGTTTTAATCGCGTAGGTTTTGGGATTCGTGAGAGACGTTTTA  124016  0.10279466275416556  No Hit
PE Adapters
Code:
5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT
PE PCR Primer 1.0
Code:
5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
PE PCR Primer 2.0
Code:
5' CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
BLAT result for the first overrepresented sequence (in green):
cDNA YourSeq
Code:
CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG 50
Genomic chrUn (reverse strand):
Code:
cagtgaaaaa acgatgagag tagtggtatt tcaccggcgg cccgcgaggc 23414611
cggcggaccc cgccccgacc cctcgcgggg aacggggggg cgccgggggc 23414561
CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG 23414511
tcaagctcaa cagggtcttc tttccccgct gattccgcca agcccgttcc 23414461
cttggctgtg gtttcgctgg atagtaggta gggacagtgg gaatctcgtt 23414411
Side by Side Alignment
Code:
00000001 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 00000050
<<<<<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<<<<<
23414560 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 23414511