Hi! It has been several weeks since I last talked about my work. In the past several weeks I have mainly been working on the optimization of `cluster.hierarchy`.

## The SLINK Algorithm

The most important optimization is the SLINK algorithm^{1} for single linkage. The naive hierarchical agglomerative clustering (HAC) algorithm has \(O(n^3)\) time complexity, while SLINK is \(O(n^2)\) and very easy to implement (even easier than the naive algorithm).

### The Pointer Representation

SLINK performs very well but requires a special linkage representation – the pointer representation, which is different from the one `cluster.hierarchy` uses. The pointer representation can be described as follows.

- A cluster is represented by the member with the largest **index**
- \(\Pi[i]\ (i \in [0, n - 1])\) is the first cluster that cluster \(i\) joins, and \(\Lambda[i]\ (i \in [0, n - 1])\) is the distance between cluster \(i\) and cluster \(\Pi[i]\) when they join

For example, the pointer representation of the following dendrogram is

- \(\Pi = \{6, 3, 3, 5, 9, 9, 8, 9, 9, 9\}\)
- \(\Lambda = \{1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990, \infty\}\)
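To make the encoding concrete, here is a small Python sketch (not the `cluster.hierarchy` API, just an illustration) that decodes the example above: since \(\Lambda[i]\) is the height at which cluster \(i\) merges into cluster \(\Pi[i]\), sorting by \(\Lambda\) recovers the merge order.

```python
import numpy as np

# The example pointer representation from above.
Pi = np.array([6, 3, 3, 5, 9, 9, 8, 9, 9, 9])
Lam = np.array([1.394, 0.419, 0.831, 1.561, 3.123,
                10.967, 1.633, 1.198, 4.990, np.inf])

# The last entry is a sentinel (cluster n-1 never joins anything),
# so only the first n-1 entries describe actual merges.
order = np.argsort(Lam[:-1])
merges = [(int(i), int(Pi[i]), float(Lam[i])) for i in order]
print(merges[0])  # → (1, 3, 0.419): the lowest merge in the dendrogram
```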

### Implementation

The implementation of SLINK is very simple. There's pseudo-code in the original paper. The following Cython code needs two pre-defined functions: `condensed_index`, which calculates the index of element `(i, j)` of a square matrix in condensed form, and `from_pointer_representation`, which converts the pointer representation to what you need.
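For reference, the condensed index arithmetic can be sketched in plain Python (a hedged equivalent of the helper, assuming the usual row-major ordering of pairs in a condensed distance matrix):

```python
def condensed_index(n, i, j):
    """Index of pair (i, j), i != j, in a condensed distance matrix of n points."""
    if i > j:
        i, j = j, i
    # Pairs are stored row by row: (0,1), (0,2), ..., (0,n-1), (1,2), ...
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries in total.
    return n * i - i * (i + 1) // 2 + (j - i - 1)

print(condensed_index(5, 3, 4))  # → 9, the last of the 10 pairs for n = 5
```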

```
import numpy as np

cdef extern from "numpy/npy_math.h":
    cdef enum:
        NPY_INFINITYF


def slink(double[:] dists, double[:, :] Z, int n):
    cdef int i, j
    cdef double[:] M = np.ndarray(n, dtype=np.double)
    cdef double[:] Lambda = np.ndarray(n, dtype=np.double)
    cdef int[:] Pi = np.ndarray(n, dtype=np.int32)

    Pi[0] = 0
    Lambda[0] = NPY_INFINITYF
    for i in range(1, n):
        # Point i starts as a singleton cluster at infinite height.
        Pi[i] = i
        Lambda[i] = NPY_INFINITYF

        # M[j] is the distance between point i and point j.
        for j in range(i):
            M[j] = dists[condensed_index(n, i, j)]

        # Update the pointer representation to account for point i.
        for j in range(i):
            if Lambda[j] >= M[j]:
                M[Pi[j]] = min(M[Pi[j]], Lambda[j])
                Lambda[j] = M[j]
                Pi[j] = i
            else:
                M[Pi[j]] = min(M[Pi[j]], M[j])

        # Relabel clusters that now merge into i first.
        for j in range(i):
            if Lambda[j] >= Lambda[Pi[j]]:
                Pi[j] = i

    from_pointer_representation(Z, Lambda, Pi, n)
```
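The conversion step can also be sketched in plain Python. This is my assumption about what `from_pointer_representation` has to do, not the actual helper: sort clusters by \(\Lambda\), then relabel each merged cluster with a fresh id, producing the first three columns of a linkage matrix.

```python
import numpy as np

def from_pointer_representation(Lambda, Pi, n):
    """Convert (Lambda, Pi) to the first three linkage columns (a sketch)."""
    Z = np.zeros((n - 1, 3))
    node_ids = np.arange(n)        # current cluster id led by each point
    for k, i in enumerate(np.argsort(Lambda)[:n - 1]):
        Z[k, 0] = node_ids[i]
        Z[k, 1] = node_ids[Pi[i]]
        Z[k, 2] = Lambda[i]
        node_ids[Pi[i]] = n + k    # the merged cluster gets a fresh id
    return Z

# Three points: 0 and 1 merge at height 1, then the pair joins 2 at height 2.
Z = from_pointer_representation(np.array([1.0, 2.0, np.inf]),
                                np.array([1, 2, 2]), 3)
print(Z)
```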

### Performance

On an `N = 2000` dataset, the improvement is significant.

```
In [20]: %timeit _hierarchy.slink(dists, Z, N)
10 loops, best of 3: 29.7 ms per loop
In [21]: %timeit _hierarchy.linkage(dists, Z, N, 0)
1 loops, best of 3: 1.87 s per loop
```

## Other Attempts

I’ve also tried some other optimizations, some of which succeeded while others failed.

I used binary search in `cluster_maxclust_monocrit` and got a modest improvement (though it is not a time-consuming function in most cases).
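The idea can be illustrated with a hedged sketch (the helper names here are hypothetical stand-ins, not the SciPy internals): the number of clusters is monotone in the cut threshold, so instead of scanning candidate thresholds linearly, one can bisect over the sorted criterion values.

```python
def max_threshold(sorted_crit, num_clusters_at, maxclust):
    """Smallest candidate threshold giving at most `maxclust` clusters.

    `sorted_crit` is the sorted list of candidate thresholds and
    `num_clusters_at(t)` counts clusters when cutting at t (monotone
    non-increasing in t) -- both hypothetical stand-ins here.
    """
    lo, hi = 0, len(sorted_crit) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if num_clusters_at(sorted_crit[mid]) <= maxclust:
            hi = mid            # feasible: try a lower threshold
        else:
            lo = mid + 1        # too many clusters: raise the threshold
    return sorted_crit[lo]

# Toy example: cutting at threshold t yields 10 - t clusters.
print(max_threshold(list(range(10)), lambda t: 10 - t, 4))  # → 6
```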

Before (`N = 2000`):

```
In [14]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
10 loops, best of 3: 35.6 ms per loop
```

After (`N = 2000`):

```
In [11]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
100 loops, best of 3: 5.86 ms per loop
```

Besides, I tried an algorithm similar to SLINK but for complete linkage – the CLINK algorithm^{2}. However, it seems that CLINK computes something different from the complete linkage we use today: in some cases my implementation did not produce the correct linkage. Perhaps I have misunderstood something in that paper.

At the suggestion of Charles, I tried an optimized HAC algorithm using a priority queue. It has \(O(n^2 \log n)\) time complexity. However, it didn't work as well as expected: it was slower even when `N = 3000`. The algorithm needs to delete non-root nodes from the priority queue, so when a binary heap is used it must keep track of the index of every node, which increases the constant factor. Some other priority queue structures might perform better, but I haven't tried them.
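The bookkeeping cost can be seen in a minimal sketch of such an indexed heap (a generic illustration, not the implementation I tried): every swap during a sift must also update a position map so that an arbitrary entry can be located and deleted later.

```python
class IndexedMinHeap:
    """Min-heap of (key, item) pairs that supports deleting any item.

    The `pos` map must be updated on every swap -- this is the
    constant-factor overhead mentioned above.
    """
    def __init__(self):
        self.heap = []      # list of (key, item)
        self.pos = {}       # item -> index in self.heap

    def _swap(self, a, b):
        self.heap[a], self.heap[b] = self.heap[b], self.heap[a]
        self.pos[self.heap[a][1]] = a
        self.pos[self.heap[b][1]] = b

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        while True:
            c = 2 * i + 1
            if c >= len(self.heap):
                return
            if c + 1 < len(self.heap) and self.heap[c + 1][0] < self.heap[c][0]:
                c += 1
            if self.heap[i][0] <= self.heap[c][0]:
                return
            self._swap(i, c)
            i = c

    def push(self, key, item):
        self.heap.append((key, item))
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def delete(self, item):
        i = self.pos.pop(item)          # O(1) lookup thanks to the map
        last = self.heap.pop()
        if i < len(self.heap):          # restore the heap property
            self.heap[i] = last
            self.pos[last[1]] = i
            self._sift_down(i)
            self._sift_up(i)

    def min(self):
        return self.heap[0]

h = IndexedMinHeap()
for key, item in [(3.0, 'a'), (1.0, 'b'), (2.0, 'c')]:
    h.push(key, item)
h.delete('a')          # delete a non-root entry
print(h.min())         # → (1.0, 'b')
```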