README.md

tdigest

Efficient percentile estimation of streaming or distributed data


This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates, such as percentiles, quantiles, and trimmed means, from streaming or distributed data. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: Percentile and Quantile Estimation of Big Data: The t-Digest

Installation

tdigest is compatible with both Python 2 and Python 3.

pip install tdigest

Usage

Update the digest sequentially

from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution

Update the digest in batches

another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))

Sum two digests to create a new digest

sum_digest = digest + another_digest
print(sum_digest.percentile(30))  # about 0.3

API

TDigest.

  • update(x, w=1): update the tdigest with value x and weight w.
  • batch_update(x, w=1): update the tdigest with values in array x and weight w.
  • compress(): compress the underlying data structure to shrink its memory footprint without hurting accuracy. Good to perform after adding many values.
  • percentile(p): return the pth percentile. Example: p=50 is the median.
  • quantile(q): return the value of the CDF at q, that is, the estimated fraction of the data at or below q.
  • trimmed_mean(p1, p2): return the mean of the data set excluding the values below the p1 percentile and above the p2 percentile.
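
To make the quantities in the list above concrete, here is a sketch of their exact counterparts computed on a small, fully stored list using only the standard library. This is not the library's implementation: the `exact_*` function names, the nearest-rank percentile convention, and the inclusive trimming boundaries are all assumptions chosen for illustration. A t-digest estimates quantities like these without keeping every value.

```python
import math
from bisect import bisect_right


def exact_percentile(data, p):
    """Nearest-rank p-th percentile, for 0 < p <= 100 (assumed convention)."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def exact_cdf(data, q):
    """Fraction of the values that are less than or equal to q."""
    ordered = sorted(data)
    return bisect_right(ordered, q) / len(ordered)


def exact_trimmed_mean(data, p1, p2):
    """Mean of the values between the p1 and p2 percentiles (inclusive here)."""
    low = exact_percentile(data, p1)
    high = exact_percentile(data, p2)
    kept = [x for x in data if low <= x <= high]
    return sum(kept) / len(kept)


data = list(range(1, 101))  # the integers 1..100

print(exact_percentile(data, 50))        # 50
print(exact_cdf(data, 50))               # 0.5
print(exact_trimmed_mean(data, 10, 90))  # 50.0
```

The first two calls show that the percentile and the CDF are inverses of each other: the 50th percentile is 50, and the CDF at 50 is 0.5.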