A, C, G and T are the "normal" nucleotides, and N represents an unknown nucleotide, for example in repetitive sequences or at SNPs. When checking the reference genome from 1KGP, I found two odd symbols in the sequence: M and R. Does anybody know what they stand for?
Code:
>>> ref_f = open("human_g1k_v37.fasta", "r")
>>>
>>> cnt = 0
>>>
>>> for i in ref_f:
...     cnt += 1
...     if i[0] == ">":
...         print i,
...         continue
...
...     n = i.count("A") + i.count("C") + i.count("G") + i.count("T") + i.count("N")
...
...     if n < len(i) - 1:
...         print cnt, i
yields
Code:
...
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
9221347 CGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG
9221351 CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA
...
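For anyone who wants to repeat the check under Python 3, here is a minimal sketch of the same scan. It is not the exact session above: the function name `find_unexpected`, the `EXPECTED` set and the small in-memory demo are illustrative choices of mine. It also makes two assumptions the original loop does not: line endings are stripped instead of relying on `len(i) - 1`, and lines are upper-cased so that lowercase (soft-masked) bases are not reported as odd.

```python
import io

EXPECTED = frozenset("ACGTN")  # symbols the original scan treats as normal


def find_unexpected(handle, expected=EXPECTED):
    """Yield (line_number, header, symbols) for sequence lines with other symbols.

    line_number is 1-based and counts header lines too, like `cnt` above.
    """
    header = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            header = line[1:]  # remember which record we are in
            continue
        # Assumption: upper-casing treats soft-masked (lowercase) bases
        # as ordinary bases instead of reporting them.
        odd = set(line.upper()) - expected
        if odd:
            yield line_number, header, "".join(sorted(odd))


# Tiny in-memory FASTA standing in for human_g1k_v37.fasta (hypothetical data).
demo = io.StringIO(
    ">3 demo\n"
    "CGCTACATAGCTGMCTTATT\n"
    "CCRRGCTTGGTTCTAACAAT\n"
    "ACGTNNNNacgt\n"
)
hits = list(find_unexpected(demo))
for line_number, header, symbols in hits:
    print(line_number, header, symbols)
```

To run it on the real file, open `human_g1k_v37.fasta` with `open(...)` and pass the file object to `find_unexpected` in place of `demo`.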