Commit
candidate 0.5.0
kno10 committed Dec 10, 2023
1 parent c6dd3c9 commit 9461399
Showing 9 changed files with 135 additions and 55 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/wheels.yml
@@ -51,7 +51,7 @@ jobs:
runs-on: macos-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v3
- uses: actions-rs/toolchain@v1
@@ -79,7 +79,7 @@ jobs:
runs-on: windows-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v3
- uses: actions-rs/toolchain@v1
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,11 @@

For changes to the main Rust package, please see <https://github.com/kno10/rust-kmedoids/blob/main/CHANGELOG.md>

## kmedoids 0.5.0 (2023-12-10)

- add DynMSC, Silhouette clustering with optimal number of clusters
- update dependency versions

## kmedoids 0.4.3 (2023-04-20)

- fix silhouette evaluation for k > 2 (in Rust)
22 changes: 18 additions & 4 deletions CITATION.cff
@@ -3,15 +3,15 @@ message: "If you use this software, please cite it as below."
authors:
- family-names: Schubert
given-names: Erich
orcid: 0000-0001-9143-4880
orcid: "https://orcid.org/0000-0001-9143-4880"
- family-names: Lenssen
given-names: Lars
orcid: 0000-0003-0037-0418
orcid: "https://orcid.org/0000-0003-0037-0418"
title: "Fast k-medoids Clustering in Rust and Python"
journal: "J. Open Source Softw."
doi: 10.21105/joss.04183
version: 0.4.3
date-released: 2023-04-20
version: 0.5.0
date-released: 2023-12-10
license: GPL-3.0
preferred-citation:
title: "Fast k-medoids Clustering in Rust and Python"
@@ -40,6 +40,8 @@ references:
year: "2021"
type: article
journal: "Inf. Syst."
volume: 101
start: 101804
authors:
- family-names: Schubert
given-names: Erich
@@ -55,3 +57,15 @@ references:
given-names: Lars
- family-names: Schubert
given-names: Erich
- title: "Medoid silhouette clustering with automatic cluster number selection"
doi: "10.1016/j.is.2023.102290"
year: "2024"
type: article
journal: "Inf. Syst."
volume: 120
start: 102290
authors:
- family-names: Lenssen
given-names: Lars
- family-names: Schubert
given-names: Erich
10 changes: 5 additions & 5 deletions Cargo.toml
@@ -1,7 +1,7 @@
[package]
edition = "2021"
name = "kmedoids"
version = "0.4.3"
version = "0.5.0"
authors = ["Erich Schubert <[email protected]>", "Lars Lenssen <[email protected]>"]
description = "k-Medoids clustering with the FasterPAM algorithm"
homepage = "https://github.com/kno10/python-kmedoids"
@@ -14,13 +14,13 @@ name = "kmedoids"
crate-type = ["cdylib"]

[dependencies]
rustkmedoids = { version = "0.4.3", package = "kmedoids", git = "https://github.com/kno10/rust-kmedoids" }
numpy = "0.18"
rustkmedoids = { version = "0.5.0", package = "kmedoids", git = "https://github.com/kno10/rust-kmedoids" }
numpy = "0.20"
ndarray = "0.15"
rand = "0.8"
rayon = "1.7"
rayon = "1.8"

[dependencies.pyo3]
version = "0.18"
version = "0.20"
features = ["extension-module"]

13 changes: 10 additions & 3 deletions README.md
@@ -38,9 +38,10 @@ For further details on medoid Silhouette clustering with automatic cluster numbe
> Lars Lenssen, Erich Schubert:
> **Medoid silhouette clustering with automatic cluster number selection**
> Information Systems (120), 2024, 102290
> <https://doi.org/10.1016/j.is.2023.102290>
> <https://doi.org/10.1016/j.is.2023.102290>
> Preprint: <https://arxiv.org/abs/2309.03751>
an earlier version was published as:
the basic FasterMSC method was first published as:

> Lars Lenssen, Erich Schubert:
> **Clustering by Direct Optimization of the Medoid Silhouette**
@@ -139,6 +140,12 @@ print("Loss with PAM:", pam.loss)

### Choose the optimal number of clusters

This package includes DynMSC, an algorithm that optimizes the Medoid Silhouette
and chooses the "optimal" number of clusters in a range of 2..kmax.
Beware that if kmax is too large, the optimum result will likely contain many
single-element clusters. Because an overly high kmax may mask more desirable results,
it is recommended to set kmax to only 2-3 times the number of clusters you expect.
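
As a rough illustration of why a large kmax favors tiny clusters, the following sketch computes the average Medoid Silhouette in plain Python, using the usual per-point definition 1 - d1/d2 (d1 and d2 being the distances to the nearest and second-nearest medoid). The helper `medoid_silhouette_sketch` and the toy data are hypothetical and are not part of this package's API; treating the ratio as 0 when d2 is 0 is an assumption of this sketch.

```python
def medoid_silhouette_sketch(diss, medoids):
    """Average Medoid Silhouette for medoid indices into a square matrix.

    Hypothetical helper for illustration only, not the package's implementation.
    """
    if len(medoids) < 2:
        raise ValueError("at least two medoids are required")
    total = 0.0
    for row in diss:
        # distances to the nearest and second-nearest medoid
        d1, d2 = sorted(row[m] for m in medoids)[:2]
        # assumption of this sketch: the ratio counts as 0 when d2 == 0
        ratio = d1 / d2 if d2 > 0 else 0.0
        total += 1.0 - ratio
    return total / len(diss)


# Toy data: four points on a line that form two obvious pairs.
points = [0.0, 1.0, 10.0, 11.0]
diss = [[abs(a - b) for b in points] for a in points]

two_clusters = medoid_silhouette_sketch(diss, [0, 2])         # approx. 0.949
all_singletons = medoid_silhouette_sketch(diss, [0, 1, 2, 3])  # 1.0
print(two_clusters, all_singletons)
```

On this toy data, four singleton clusters reach the maximum score of 1.0, while the more natural two-cluster solution scores slightly lower, which is the effect the warning above describes.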

```python
import kmedoids, numpy
from sklearn.datasets import fetch_openml
@@ -169,7 +176,7 @@ For larger data sets, it is recommended to only cluster a representative sample
* Silhouette index for evaluation (Rousseeuw, 1987)
* **FasterMSC** (Lenssen and Schubert, 2022)
* FastMSC (Lenssen and Schubert, 2022)
* DynMSC (Lenssen and Schubert, 2023)
* **DynMSC** (Lenssen and Schubert, 2023)
* PAMSIL (Van der Laan and Pollard, 2003)
* PAMMEDSIL (Van der Laan and Pollard, 2003)
* Medoid Silhouette index for evaluation (Van der Laan and Pollard, 2003)
59 changes: 37 additions & 22 deletions docs/index.rst
@@ -81,7 +81,7 @@ Example
print("Loss is:", c.loss)
Using the sklearn-compatible API
-------------------
--------------------------------

Note that KMedoids defaults to the `"precomputed"` metric, expecting a pairwise distance matrix.
If you have sklearn installed, you can use `metric="euclidean"`.
@@ -114,8 +114,14 @@ MNIST (10k samples)
print("PAM took: %.2f ms" % ((time.time() - start)*1000))
print("Loss with PAM:", pam.loss)
Choose the optimal number of clusters
-------------------
Choosing the optimal number of clusters
---------------------------------------

This package includes :ref:`DynMSC<dynmsc>`, an algorithm that optimizes the Medoid Silhouette
and chooses the "optimal" number of clusters in a range of 2..kmax.
Beware that if kmax is too large, the optimum result will likely contain many
single-element clusters. Because an overly high kmax may mask more desirable results,
it is recommended to set kmax to only 2-3 times the number of clusters you expect.

.. code-block:: python
@@ -142,18 +148,26 @@ For larger data sets, it is recommended to only cluster a representative sample
Implemented Algorithms
======================

K-Medoids Clustering:

* :ref:`FasterPAM<fasterpam>` (Schubert and Rousseeuw, 2020, 2021)
* :ref:`FastPAM1<fastpam1>` (Schubert and Rousseeuw, 2019, 2021)
* :ref:`PAM<pam>` (Kaufman and Rousseeuw, 1987) with BUILD and SWAP
* :ref:`Alternating<alternating>` (k-means-style approach)
* :ref:`BUILD<build>` (Kaufman and Rousseeuw, 1987)
* :ref:`Silhouette<silhouette>` (Kaufman and Rousseeuw, 1987)
* :ref:`Alternating<alternating>` (k-means-style approach)

Silhouette Clustering:

* :ref:`DynMSC<dynmsc>` (Lenssen and Schubert, 2023)
* :ref:`FasterMSC<fastermsc>` (Lenssen and Schubert, 2022)
* :ref:`FastMSC<fastmsc>` (Lenssen and Schubert, 2022)
* :ref:`DynMSC<dynmsc>` (Lenssen and Schubert, 2023)
* :ref:`PAMSIL<pamsil>` (Van der Laan and Pollard, 2003)
* :ref:`PAMMEDSIL<pammedsil>` (Van der Laan and Pollard, 2003)
* :ref:`MedoidSilhouette<medoid_silhouette>` (Van der Laan and Pollard, 2003)
* :ref:`PAMSIL<pamsil>` (Van der Laan and Pollard, 2003)

Evaluation:

* :ref:`Medoid Silhouette<medoid_silhouette>` (Van der Laan and Pollard, 2003)
* :ref:`Silhouette<silhouette>` (Kaufman and Rousseeuw, 1987)

Note that the k-means style "alternating" algorithm yields rather poor result quality
(see Schubert and Rousseeuw 2021 for an example and explanation).
@@ -193,6 +207,13 @@ PAM BUILD

.. autofunction:: pam_build

.. _DynMSC:

DynMSC
======

.. autofunction:: dynmsc

.. _FasterMSC:

FasterMSC
@@ -207,12 +228,12 @@ FastMSC

.. autofunction:: fastmsc

.. _DynMSC:
.. _PAMMEDSIL:

DynMSC
PAMMEDSIL
=========

.. autofunction:: dynmsc
.. autofunction:: pammedsil

.. _PAMSIL:

@@ -221,13 +242,6 @@ PAMSIL

.. autofunction:: pamsil

.. _PAMMEDSIL:

PAMMEDSIL
=========

.. autofunction:: pammedsil

.. _Silhouette:

Silhouette
@@ -288,10 +302,11 @@ an earlier (slower, and now obsolete) version was published as:
For further details on medoid Silhouette clustering with automatic cluster number selection (FasterMSC, DynMSC), see:

| Lars Lenssen, Erich Schubert:
| **Medoid silhouette clustering with automatic cluster number selection**
| Information Systems (120), 2024, 102290
| https://doi.org/10.1016/j.is.2023.102290
| Lars Lenssen, Erich Schubert:
| **Medoid silhouette clustering with automatic cluster number selection**
| Information Systems (120), 2024, 102290
| https://doi.org/10.1016/j.is.2023.102290
| Preprint: https://arxiv.org/abs/2309.03751
an earlier version was published as:

23 changes: 17 additions & 6 deletions kmedoids/__init__.py
@@ -8,18 +8,22 @@
- PAM (the original Partitioning Around Medoids algorithm)
- Alternating (k-means style algorithm, yields results of lower quality)
- BUILD (the initialization of PAM)
- Silhouette evaluation
Additionally, the package implements clustering algorithms
for direct optimization of the (Medoid) Silhouette,
in decreasing order of performance:
- FasterMSC
- FastMSC (same result as PAMMEDSIL; but faster)
- DynMSC
- DynMSC (automatic choice of k; faster than repeated FasterMSC)
- PAMMEDSIL
- PAMSIL
Evaluation measures:
- Silhouette evaluation
- Medoid Silhouette evaluation
References:
| Erich Schubert and Lars Lenssen:
@@ -43,6 +47,7 @@
| Medoid silhouette clustering with automatic cluster number selection
| Information Systems (120), 2024, 102290
| <https://doi.org/10.1016/j.is.2023.102290>
| Preprint: <https://arxiv.org/abs/2309.03751>
| Lars Lenssen, Erich Schubert:
| Clustering by Direct Optimization of the Medoid Silhouette
@@ -78,7 +83,9 @@
"alternating",
"pam_build",
"silhouette",
"KMedoidsResult"
"medoid_silhouette",
"KMedoidsResult",
"DynkResult",
]

class KMedoidsResult:
@@ -113,7 +120,7 @@ def __repr__(self):

class DynkResult:
"""
K-medoids clustering result with automatic number of clusters
K-medoids or Silhouette clustering result with automatic number of clusters
:param loss: Loss of this clustering (sum of deviations)
:type loss: float
@@ -519,6 +526,7 @@ def fastmsc(diss, medoids, max_iter=100, init="random", random_state=None):
| Medoid silhouette clustering with automatic cluster number selection
| Information Systems (120), 2024, 102290
| <https://doi.org/10.1016/j.is.2023.102290>
| Preprint: <https://arxiv.org/abs/2309.03751>
| Lars Lenssen, Erich Schubert:
| Clustering by Direct Optimization of the Medoid Silhouette
@@ -568,6 +576,7 @@ def fastermsc(diss, medoids, max_iter=100, init="random", random_state=None):
| Medoid silhouette clustering with automatic cluster number selection
| Information Systems (120), 2024, 102290
| <https://doi.org/10.1016/j.is.2023.102290>
| Preprint: <https://arxiv.org/abs/2309.03751>
| Lars Lenssen, Erich Schubert:
| Clustering by Direct Optimization of the Medoid Silhouette
@@ -617,10 +626,11 @@ def dynmsc(diss, medoids, max_iter=100, init="random", random_state=None):
| Medoid silhouette clustering with automatic cluster number selection
| Information Systems (120), 2024, 102290
| <https://doi.org/10.1016/j.is.2023.102290>
| Preprint: <https://arxiv.org/abs/2309.03751>
:param diss: square numpy array of dissimilarities
:type diss: ndarray
:param medoids: maximum number of clusters to find or existing medoids with length of maximum number of clusters to find
:param medoids: maximum number of clusters to find or existing medoids with length of maximum number of clusters to find
:type medoids: int or ndarray
:param max_iter: maximum number of iterations
:type max_iter: int
@@ -831,6 +841,7 @@ class KMedoids(SKLearnClusterer):
| Medoid silhouette clustering with automatic cluster number selection
| Information Systems (120), 2024, 102290
| <https://doi.org/10.1016/j.is.2023.102290>
| Preprint: <https://arxiv.org/abs/2309.03751>
| Lars Lenssen, Erich Schubert:
| Clustering by Direct Optimization of the Medoid Silhouette
@@ -850,7 +861,7 @@ class KMedoids(SKLearnClusterer):
| In: Journal of Statistical Computation and Simulation, pp 575-584, 2003
| https://doi.org/10.1080/0094965031000136012
:param n_clusters: The number of clusters to form
:param n_clusters: The number of clusters to form (maximum number of clusters if `method="dynmsc"`)
:type n_clusters: int
:param metric: It is recommended to use 'precomputed', in particular when experimenting with different `n_clusters`.
If you have sklearn installed, you may pass any metric supported by `sklearn.metrics.pairwise_distances`.
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -1,10 +1,10 @@
[build-system]
requires = ["maturin>=0.14,<0.15"]
requires = ["maturin>=1.4,<2"]
build-backend = "maturin"

[project]
name = "kmedoids"
version = "0.4.3"
version = "0.5.0"
description = "k-Medoids Clustering in Python with FasterPAM"
requires-dist = ["numpy"]
classifier = [