PyYAML-5.4.dist-info ITSI 1 year ago
Routes-2.5.1.dist-info ITSI 1 year ago
_yaml ITSI 1 year ago
apifilesave ITSI 1 year ago
apiiconcollection ITSI 1 year ago
argparse-1.4.0.dist-info ITSI 1 year ago
backports ITSI 1 year ago
backports.zoneinfo-0.2.1.dist-info ITSI 1 year ago
bin ITSI 1 year ago
bintrees ITSI 1 year ago
certifi ITSI 1 year ago
certifi-2025.1.31.dist-info ITSI 1 year ago
charset_normalizer ITSI 1 year ago
charset_normalizer-3.4.1.dist-info ITSI 1 year ago
croniter ITSI 1 year ago
croniter-1.3.1.dist-info ITSI 1 year ago
dateutil ITSI 1 year ago
defusedxml ITSI 1 year ago
defusedxml-0.7.1.dist-info ITSI 1 year ago
deprecation-2.1.0.dist-info ITSI 1 year ago
idna ITSI 1 year ago
idna-3.10.dist-info ITSI 1 year ago
itoamodels ITSI 1 year ago
itsicli ITSI 1 year ago
itsicli-0.0.42.dist-info ITSI 1 year ago
itsimodels ITSI 1 year ago
itsimodels-0.0.43.dist-info ITSI 1 year ago
nats ITSI 1 year ago
nats_py-2.6.0.dist-info ITSI 1 year ago
packaging ITSI 1 year ago
packaging-24.2.dist-info ITSI 1 year ago
pexpect ITSI 1 year ago
pexpect-4.9.0.dist-info ITSI 1 year ago
ptyprocess ITSI 1 year ago
ptyprocess-0.7.0.dist-info ITSI 1 year ago
python_dateutil-2.9.0.post0.dist-info ITSI 1 year ago
python_slugify-8.0.4.dist-info ITSI 1 year ago
python_statemachine-2.1.1.dist-info ITSI 1 year ago
pytz ITSI 1 year ago
pytz-2022.1.dist-info ITSI 1 year ago
pyudorandom ITSI 1 year ago
repoze/lru ITSI 1 year ago
repoze.lru-0.7.dist-info ITSI 1 year ago
requests ITSI 1 year ago
requests-2.31.0.dist-info ITSI 1 year ago
requests-2.32.3.dist-info ITSI 1 year ago
routes ITSI 1 year ago
six-1.17.0.dist-info ITSI 1 year ago
slugify ITSI 1 year ago
solnlib ITSI 1 year ago
solnlib-4.9.0.dist-info ITSI 1 year ago
sortedcontainers ITSI 1 year ago
sortedcontainers-2.4.0.dist-info ITSI 1 year ago
splunklib ITSI 1 year ago
statemachine ITSI 1 year ago
svgelements ITSI 1 year ago
svgelements-1.1.4.dist-info ITSI 1 year ago
tdigest ITSI 1 year ago
tests ITSI 1 year ago
text_unidecode ITSI 1 year ago
text_unidecode-1.3.dist-info ITSI 1 year ago
tzlocal ITSI 1 year ago
tzlocal-5.1.dist-info ITSI 1 year ago
unittest_xml_reporting-2.1.0.dist-info ITSI 1 year ago
urllib3 ITSI 1 year ago
urllib3-1.26.19.dist-info ITSI 1 year ago
urllib3-2.3.0.dist-info ITSI 1 year ago
xmlrunner ITSI 1 year ago
yaml ITSI 1 year ago
LICENSE.txt ITSI 1 year ago
MANIFEST ITSI 1 year ago
README.md ITSI 1 year ago
__init__.py ITSI 1 year ago
argparse.py ITSI 1 year ago
deprecation.py ITSI 1 year ago
repoze.lru-0.7-py3.6-nspkg.pth ITSI 1 year ago
six.py ITSI 1 year ago

README.md

tdigest

Efficient percentile estimation of streaming or distributed data


This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates of percentiles, quantiles, trimmed means, and similar statistics from streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

For background, see the blog post "Percentile and Quantile Estimation of Big Data: The t-Digest".

Installation

tdigest is compatible with both Python 2 and Python 3.

pip install tdigest

Usage

Update the digest sequentially

from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

Update the digest in batches

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

Sum two digests to create a new digest

sum_digest = digest + another_digest
print(sum_digest.percentile(30))  # about 0.3
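
In a map-reduce setting there are usually more than two digests to combine. The sketch below folds any number of them together using only the `+` operator shown above. The helper name `merge_digests` and the `chunks` variable in the usage comment are hypothetical and not part of tdigest.

```python
from functools import reduce
import operator


def merge_digests(digests):
    """Fold an iterable of digests into one using the `+` operator.

    Hypothetical helper, not part of tdigest: it only assumes that the
    objects support `a + b`, as TDigest does in the example above.
    """
    digests = list(digests)
    if not digests:
        raise ValueError("need at least one digest to merge")
    # reduce applies + pairwise: ((d0 + d1) + d2) + ...
    return reduce(operator.add, digests)


# Intended use (assumes tdigest is installed and `chunks` holds one
# array of values per worker):
#   from tdigest import TDigest
#   parts = [TDigest() for _ in chunks]
#   for part, chunk in zip(parts, chunks):
#       part.batch_update(chunk)
#   combined = merge_digests(parts)
#   print(combined.percentile(50))
```

Because the helper relies only on `+`, each worker can build its digest independently and the reduce step needs nothing but the per-worker digests.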

API

TDigest.

  • update(x, w=1): update the tdigest with value x and weight w.
  • batch_update(x, w=1): update the tdigest with values in array x and weight w.
  • compress(): compress the underlying data structure to shrink its memory footprint without hurting accuracy. Worth calling after adding many values.
  • percentile(p): return the pth percentile. Example: p=50 is the median.
  • quantile(q): return the value of the CDF at q, i.e. the estimated fraction of the data that is less than or equal to q.
  • trimmed_mean(p1, p2): return the mean of the data set, excluding the values below the p1 percentile and above the p2 percentile.
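
To make clear what the three query methods estimate, here is a minimal sketch of the exact quantities, computed on a small in-memory list with the standard library only. The `exact_*` functions are hypothetical reference helpers, not part of tdigest, and the conventions chosen here (nearest-rank percentiles, inclusive trimming bounds) are assumptions. The library returns approximations and may handle ties and boundaries differently.

```python
import math


def exact_percentile(data, p):
    """Value at the p-th percentile (0 <= p <= 100), nearest-rank method."""
    ordered = sorted(data)
    if not ordered:
        raise ValueError("data must not be empty")
    # Smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def exact_cdf(data, q):
    """Fraction of the data that is <= q (the empirical CDF at q)."""
    if not data:
        raise ValueError("data must not be empty")
    return sum(1 for x in data if x <= q) / len(data)


def exact_trimmed_mean(data, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles, inclusive."""
    low = exact_percentile(data, p1)
    high = exact_percentile(data, p2)
    kept = [x for x in data if low <= x <= high]
    return sum(kept) / len(kept)


data = list(range(1, 101))  # the integers 1..100

print(exact_percentile(data, 50))        # what percentile(50) estimates
print(exact_cdf(data, 30))               # what quantile(30) estimates
print(exact_trimmed_mean(data, 10, 90))  # what trimmed_mean(10, 90) estimates
```

Note that `percentile` and `quantile` are roughly inverse to each other: the first maps a percentage to a data value, the second maps a data value back to a fraction between 0 and 1.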