Hi All,
I'm analysing some iCLIP data generated following Konig et al., JoVE 2011.
The protocol uses a 5-nucleotide unique molecular identifier (UMI), split around the library barcode at the start of the read thus:
UUUBBBBUU
where B is a library barcode base and U is a UMI base.
I extracted and recorded the UMI for each read and, after mapping, deduplicated by removing any read that mapped to the same location and had the same UMI sequence as another read.
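For concreteness, here is a minimal sketch of the extraction and deduplication I mean. It is not my actual pipeline: the `Alignment` record, the function names, and the inclusion of strand in the "same location" key are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Positions within the 9 nt read prefix UUUBBBBUU (0-based).
UMI_POSITIONS = (0, 1, 2, 7, 8)    # the five U bases
BARCODE_POSITIONS = (3, 4, 5, 6)   # the four B bases
PREFIX_LEN = 9


def split_prefix(seq: str) -> Tuple[str, str, str]:
    """Return (umi, barcode, insert) for a read that starts with UUUBBBBUU."""
    if len(seq) < PREFIX_LEN:
        raise ValueError("read is shorter than the 9 nt UMI/barcode prefix")
    umi = "".join(seq[i] for i in UMI_POSITIONS)
    barcode = "".join(seq[i] for i in BARCODE_POSITIONS)
    return umi, barcode, seq[PREFIX_LEN:]


class Alignment(NamedTuple):
    """Hypothetical minimal record of a mapped read (not a real library type)."""
    chrom: str
    pos: int
    strand: str   # assumption: strand is treated as part of the location
    umi: str


def deduplicate(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first read seen for each (chrom, pos, strand, UMI) combination."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand, aln.umi)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept
```

Note that this only collapses exact UMI matches, so a sequencing error in the UMI would leave an extra read behind.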
I think that if the incorporation of a UMI into a read were completely random, the number of reads carrying each UMI in a sample (after de-duplication) would be roughly equal, following a binomial distribution that, at such high counts, should approximate a normal distribution. But this is not what I see: the distributions of UMI usage look much more like log-normal distributions than normal ones.
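To make the expectation explicit: with 5 UMI bases there are 4^5 = 1024 possible UMIs, so under uniform usage each UMI's count out of n reads would be binomial with p = 1/1024. The sketch below computes that expected mean and standard deviation, plus a variance-to-mean ratio for the observed counts, which should sit near 1 under the binomial model. The function names are illustrative, and `umi_counts` is assumed to be a mapping from UMI to its deduplicated read count.

```python
import math
from typing import Mapping, Tuple

N_UMIS = 4 ** 5  # 1024 possible 5 nt UMIs


def expected_under_uniform(n_reads: int, n_umis: int = N_UMIS) -> Tuple[float, float]:
    """Mean and standard deviation of one UMI's count if every UMI is equally likely."""
    p = 1.0 / n_umis
    mean = n_reads * p
    sd = math.sqrt(n_reads * p * (1.0 - p))
    return mean, sd


def dispersion_index(umi_counts: Mapping, n_umis: int = N_UMIS) -> float:
    """Sample variance divided by mean of per-UMI counts.

    UMIs that were never observed are included as zero counts.
    Roughly 1 under the uniform binomial model; much larger if usage is skewed.
    """
    counts = list(umi_counts.values())
    if len(counts) > n_umis:
        raise ValueError("more distinct UMIs observed than are possible")
    counts += [0] * (n_umis - len(counts))
    mean = sum(counts) / n_umis
    if mean == 0:
        raise ValueError("no reads to summarise")
    var = sum((c - mean) ** 2 for c in counts) / (n_umis - 1)
    return var / mean
```

A log-normal-looking distribution of counts would show up here as a dispersion index far above 1.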
Have other people seen this? What potential biases could this introduce into downstream analysis? It feels to me that, as long as there is no interaction between fragment sequence and UMI sequence, this just means that it's effectively as if fewer independent UMIs had been used, but I'd love to hear what other people think.
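One way to put a number on "fewer independent UMIs" is sketched below. If UMI i is used with frequency p_i, two independent molecules share a UMI with probability sum(p_i^2), which equals 1/1024 for a uniform pool; the reciprocal, 1/sum(p_i^2), can then be read as an effective pool size. This is my own choice of summary rather than anything from the protocol, and estimating p_i directly from observed counts is only a rough plug-in estimate.

```python
from typing import Mapping


def effective_umi_count(umi_counts: Mapping) -> float:
    """Effective number of UMIs, 1 / sum(p_i^2), from observed per-UMI counts.

    Equals the number of distinct UMIs when all are used equally,
    and shrinks as usage becomes more skewed.
    """
    total = sum(umi_counts.values())
    if total == 0:
        raise ValueError("no reads to summarise")
    collision_prob = sum((c / total) ** 2 for c in umi_counts.values())
    return 1.0 / collision_prob
```

For example, two UMIs used 3:1 give an effective count of 1.6 rather than 2.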