You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
README.md
tdigest
Efficient percentile estimation of streaming or distributed data
This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest data structure is designed around computing accurate estimates from either streaming data, or distributed data. These estimates are percentiles, quantiles, trimmed means, etc. Two t-digests can be added, making the data structure ideal for map-reduce settings, and can be serialized into much less than 10kB (instead of storing the entire list of data).
See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest
Installation
tdigest is compatible with both Python 2 and Python 3.
pip install tdigest
Usage
Update the digest sequentially
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for x in range(5000):
digest.update(random())
print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
Update the digest in batches
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
Sum two digests to create a new digest
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
API
TDigest.
update(x, w=1): update the tdigest with valuexand weightw.batch_update(x, w=1): update the tdigest with values in arrayxand weightw.compress(): perform a compression on the underlying data structure that will shrink the memory footprint of it, without hurting accuracy. Good to perform after adding many values.percentile(p): return thepth percentile. Example:p=50is the median.quantile(q): return the CDF the valueqis at.trimmed_mean(p1, p2): return the mean of data set without the values below and above thep1andp2percentile respectively.
