tdigest

Efficient percentile estimation of streaming or distributed data

This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates from either streaming or distributed data. These estimates include percentiles, quantiles, and trimmed means. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest

Installation

tdigest is compatible with both Python 2 and Python 3.

pip install tdigest

Usage

Update the digest sequentially

from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

Update the digest in batches

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

Sum two digests to create a new digest

sum_digest = digest + another_digest
print(sum_digest.percentile(30))  # about 0.3
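
Why adding digests suits map-reduce can be illustrated with a toy summary built from (mean, count) centroids. The sketch below is not tdigest's algorithm: the names summarize, combine and estimate_percentile are hypothetical, and the fixed bucket size is a simplifying assumption (a real t-digest sizes its centroids adaptively). It only shows that summaries built on separate shards can be merged and then queried without revisiting the raw data.

```python
def summarize(values, bucket_size=4):
    """Collapse values into sorted (mean, count) centroids of a fixed size.

    Toy stand-in for a digest; the fixed bucket size is an assumption.
    """
    ordered = sorted(values)
    centroids = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        centroids.append((sum(chunk) / len(chunk), len(chunk)))
    return centroids


def combine(left, right):
    """Merge two summaries by pooling their centroids and re-sorting by mean."""
    return sorted(left + right)


def estimate_percentile(centroids, p):
    """Return the mean of the first centroid whose cumulative count reaches p percent."""
    total = sum(count for _, count in centroids)
    target = p * total / 100
    running = 0
    for mean, count in centroids:
        running += count
        if running >= target:
            return mean
    return centroids[-1][0]


# Each shard is summarized independently (the "map" step) ...
shard_a = summarize(range(1, 51), bucket_size=5)
shard_b = summarize(range(51, 101), bucket_size=5)
# ... and the summaries are merged afterwards (the "reduce" step).
merged = combine(shard_a, shard_b)
print(estimate_percentile(merged, 50))  # close to the exact median of 1..100, which is 50.5
```

The merged summary keeps the total count of both shards, so percentile queries on it reflect the whole data set.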

API

TDigest.

update(x, w=1): update the tdigest with value x and weight w.
batch_update(x, w=1): update the tdigest with the values in array x, each with weight w.
compress(): compress the underlying data structure to shrink its memory footprint without hurting accuracy. Worth calling after adding many values.
percentile(p): return the pth percentile. Example: p=50 is the median.
quantile(q): return the value of the CDF at q, that is, the estimated fraction of the data less than or equal to q.
trimmed_mean(p1, p2): return the mean of the data set, excluding values below the p1 percentile and above the p2 percentile.
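
For reference, the quantities that percentile, quantile and trimmed_mean estimate can be computed exactly on a small list. The helpers below (exact_percentile, exact_cdf, exact_trimmed_mean) are hypothetical names that use only the standard library; they are not part of tdigest. The nearest-rank percentile rule and the inclusive trimming bounds are assumptions, so the library's answers may differ slightly at the edges.

```python
import math
from bisect import bisect_right


def exact_percentile(values, p):
    """Nearest-rank p-th percentile (0 <= p <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def exact_cdf(values, x):
    """Fraction of values less than or equal to x."""
    ordered = sorted(values)
    return bisect_right(ordered, x) / len(ordered)


def exact_trimmed_mean(values, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles, inclusive."""
    low = exact_percentile(values, p1)
    high = exact_percentile(values, p2)
    kept = [v for v in values if low <= v <= high]
    return sum(kept) / len(kept)


data = list(range(1, 101))
print(exact_percentile(data, 50))        # 50
print(exact_cdf(data, 30))               # 0.3
print(exact_trimmed_mean(data, 10, 90))  # 50.0
```

Comparing a digest's estimates against exact values like these on a sample of the data is one way to sanity-check its accuracy.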