GSoC2014: Recent progress Aug 11 2014 CST by Richard

Hi! It has been several weeks since I last talked about my work. In the past several weeks I have mainly worked on optimizing cluster.hierarchy.

The SLINK Algorithm

The most important optimization is the SLINK algorithm [1] for single linkage. The naive hierarchical agglomerative clustering (HAC) algorithm has \(O(n^3)\) time complexity, while SLINK is \(O(n^2)\) and very easy to implement (even easier than the naive algorithm).

The Pointer Representation

SLINK performs very well but requires a special linkage representation – the pointer representation – which is different from what cluster.hierarchy uses. The pointer representation can be described as follows.

  • A cluster is represented by the member with the largest index
  • \(\Pi[i]\) (\(i \in [0, n - 1]\)) is the first cluster that cluster \(i\) joins, and \(\Lambda[i]\) (\(i \in [0, n - 1]\)) is the distance between cluster \(i\) and cluster \(\Pi[i]\) when they join

For example, the pointer representation of the following dendrogram is

  • \(\Pi = \{6, 3, 3, 5, 9, 9, 8, 9, 9, 9\}\)
  • \(\Lambda = \{1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990, \infty\}\)

For instance, \(\Pi[1] = 3\) and \(\Lambda[1] = 0.419\) mean that cluster 1 first joins the cluster represented by point 3, at distance 0.419.

Implementation

The implementation of SLINK is very simple. There is pseudo-code in the original paper. The following Cython code needs two pre-defined functions: condensed_index, which computes the index of element (i, j) in a condensed distance matrix, and from_pointer_representation, which converts the pointer representation into the linkage format you need.

import numpy as np

# NPY_INFINITYF is the single-precision infinity constant from NumPy's C API
cdef extern from "numpy/npy_math.h":
    float NPY_INFINITYF


def slink(double[:] dists, double[:, :] Z, int n):
    cdef int i, j
    # M[j]: current distance between the new point i and the cluster
    # represented by point j
    cdef double[:] M = np.empty(n, dtype=np.double)
    cdef double[:] Lambda = np.empty(n, dtype=np.double)
    cdef int[:] Pi = np.empty(n, dtype=np.int32)

    Pi[0] = 0
    Lambda[0] = NPY_INFINITYF
    for i in range(1, n):
        # point i starts as a singleton cluster that has not joined anything yet
        Pi[i] = i
        Lambda[i] = NPY_INFINITYF

        # distances from point i to all previously inserted points
        for j in range(i):
            M[j] = dists[condensed_index(n, i, j)]

        # update the pointer representation of points 0..i-1
        for j in range(i):
            if Lambda[j] >= M[j]:
                M[Pi[j]] = min(M[Pi[j]], Lambda[j])
                Lambda[j] = M[j]
                Pi[j] = i
            else:
                M[Pi[j]] = min(M[Pi[j]], M[j])

        # repoint j to the new point i when its old target is no longer
        # the representative of its cluster
        for j in range(i):
            if Lambda[j] >= Lambda[Pi[j]]:
                Pi[j] = i

    from_pointer_representation(Z, Lambda, Pi, n)
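
The two helpers are defined elsewhere in the module. Roughly, they could look like the following simplified sketches; the exact signatures here are my own assumptions, and cluster sizes (the fourth column of Z) would be filled in by a separate pass such as calculate_cluster_sizes.

import numpy as np


cdef inline long long condensed_index(long long n, long long i, long long j):
    # index of the pair (i, j), i != j, in a condensed (upper-triangular,
    # row-major) distance matrix of n points
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


cdef void from_pointer_representation(double[:, :] Z, double[:] Lambda,
                                      int[:] Pi, int n):
    # Convert the pointer representation into n - 1 linkage rows by processing
    # the points in order of increasing merge distance Lambda.
    cdef int i, leader, pi
    cdef int[:] order = np.argsort(Lambda, kind='stable').astype(np.int32)
    cdef int[:] node_id = np.arange(n, dtype=np.int32)

    for i in range(n - 1):
        leader = order[i]          # point with the i-th smallest Lambda
        pi = Pi[leader]
        Z[i, 0] = node_id[leader]  # cluster currently represented by leader
        Z[i, 1] = node_id[pi]      # cluster currently represented by Pi[leader]
        Z[i, 2] = Lambda[leader]   # distance at which the two clusters merge
        node_id[pi] = n + i        # the merged cluster gets the next id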

Performance

On an N = 2000 dataset, the improvement is significant:

In [20]: %timeit _hierarchy.slink(dists, Z, N)
10 loops, best of 3: 29.7 ms per loop

In [21]: %timeit _hierarchy.linkage(dists, Z, N, 0)
1 loops, best of 3: 1.87 s per loop

Other Attempts

I’ve also tried some other optimizations; some succeeded while others failed.

I used binary search in cluster_maxclust_monocrit and got a modest improvement (though it is not a time-consuming function in most cases). A sketch of the binary-search idea follows the timings below.

Before (N = 2000):

In [14]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
10 loops, best of 3: 35.6 ms per loop

After (N = 2000):

In [11]: %timeit hierarchy.fcluster(Z, 10, 'maxclust')
100 loops, best of 3: 5.86 ms per loop
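
The idea can be illustrated at the Python level with a hypothetical helper built on fcluster (this is not the Cython code in the module, and it assumes the criterion values in monocrit are monotonic, so the number of flat clusters is non-increasing in the threshold):

import numpy as np
from scipy.cluster import hierarchy

def maxclust_by_bisection(Z, monocrit, t):
    # find the smallest threshold that yields at most t flat clusters
    thresholds = np.unique(monocrit)          # sorted ascending
    lo, hi = 0, len(thresholds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        labels = hierarchy.fcluster(Z, thresholds[mid], criterion='monocrit',
                                    monocrit=monocrit)
        if labels.max() <= t:
            hi = mid        # few enough clusters; try a smaller threshold
        else:
            lo = mid + 1    # too many clusters; need a larger threshold
    return hierarchy.fcluster(Z, thresholds[lo], criterion='monocrit',
                              monocrit=monocrit)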

Besides, I tried an algorithm similar to SLINK but for complete linkage – the CLINK algorithm [2]. However, it seems that CLINK does not compute the complete linkage we use today: in some cases my implementation did not produce the correct complete-linkage result. Perhaps I have misunderstood something in that paper.

At the suggestion of Charles, I tried an optimized HAC algorithm using a priority queue. It has \(O(n^2 \log n)\) time complexity. However, it did not work as well as expected: it was slower even when N = 3000. The algorithm needs to delete non-root nodes from the priority queue, so when a binary heap is used it must keep track of the position of every node, which increases the constant factor. Some other priority queue structures might perform better, but I haven't tried them.
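
For reference, here is a minimal sketch (not the code I benchmarked) of the kind of indexed binary heap such an algorithm needs. The pos dictionary that maps each key to its slot has to be updated on every swap, and that bookkeeping is where the extra constant factor comes from.

class IndexedHeap:
    def __init__(self):
        self.heap = []        # list of (priority, key)
        self.pos = {}         # key -> index of that entry in self.heap

    def push(self, key, priority):
        self.heap.append((priority, key))
        self.pos[key] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def pop(self):
        # remove and return the entry with the smallest priority
        return self.remove(self.heap[0][1])

    def remove(self, key):
        # delete an arbitrary entry by key in O(log n)
        i = self.pos.pop(key)
        entry = self.heap[i]
        last = self.heap.pop()
        if i < len(self.heap):
            self.heap[i] = last
            self.pos[last[1]] = i
            self._sift_down(i)
            self._sift_up(i)
        return entry

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][0] < self.heap[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest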


  1. Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34. 

  2. Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366. 

Rewrite scipy.cluster.hierarchy Jul 14 2014 CST by Richard

After rewriting cluster.vq, I am rewriting the underlying implementation of cluster.hierarchy in Cython now.

Some Concerns about Cython

The reason why I use Cython to rewrite these modules is that Cython code is more maintainable, especially when working with NumPy’s ndarray. Cython provides some easy-to-use and efficient mechanisms for accessing Python buffer memory, such as typed memoryviews. However, if you need to iterate through the whole array, it is a bit slow. This is because Cython translates A[i, j] into something like *(A.data + i * A.strides[0] + j * A.strides[1]), i.e. it needs to calculate the offset into the array data buffer on every array access. Consider the following C code.

int i, j;
double s;
double *current_row;

/* method 1: walk a pointer along each row */
s = 0;
current_row = (double *)A.data;
for(i = 0; i < A.shape[0]; ++i) {
    for(j = 0; j < A.shape[1]; ++j)
        s += current_row[j];
    current_row += A.shape[1];
}

/* method 2: recompute the offset on every access */
s = 0;
for(i = 0; i < A.shape[0]; ++i)
    for(j = 0; j < A.shape[1]; ++j)
        s += ((double *)A.data)[i * A.shape[1] + j];

The original C implementation uses method 1 shown above, which is much more efficient than method 2, and method 2 is similar to the C code that Cython generates for memoryview accesses. Of course method 1 can be adopted in Cython, but the neatness and maintainability of the Cython code would suffer. In fact, that is just how I implemented _vq last month. But _vq has only two public functions while _hierarchy has 14, and the algorithms in _hierarchy are more complicated than those in _vq. It would be unwise to just translate all the C code into Cython with loads of pointer operations. Fortunately, the time complexities of most functions in _hierarchy are \(O(n)\). I think the performance loss of these functions is not a big problem, and they can just use memoryviews to preserve maintainability.
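
To make the contrast concrete, here is a small Cython sketch (not taken from _hierarchy) that sums a 2-D array in both styles: sum_memoryview relies on plain memoryview indexing, while sum_pointer mirrors method 1 by walking a row pointer over a C-contiguous buffer.

def sum_memoryview(double[:, :] A):
    # plain memoryview indexing: the offset is computed from the strides
    # on every access
    cdef int i, j
    cdef double s = 0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            s += A[i, j]
    return s

def sum_pointer(double[:, ::1] A):
    # method-1 style: walk a raw pointer row by row
    # (requires a C-contiguous buffer)
    cdef int i, j
    cdef double s = 0
    cdef double *row = &A[0, 0]
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            s += row[j]
        row += A.shape[1]
    return s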

The New Implementation

The following table is a speed comparison of the original C implementation and the new Cython implementation. With the use of memoryviews, most functions have about a 30% performance loss. I used some optimization strategies in a few functions and they now run faster than the original versions. The most important function, linkage, has a 2.5x speedup on a dataset with 2000 points.

function                           old          new
calculate_cluster_sizes            4.273 us     6.996 us
cluster_dist                       68.224 us    108.234 us
cluster_in                         78.020 us    102.400 us
cluster_maxclust_dist              32.913 ms    1.174 ms
cluster_maxclust_monocrit          35.539 ms    1.454 ms
cluster_monocrit                   113.104 us   46.170 us
cophenetic_distances               13.052 ms    13.458 ms
get_max_Rfield_for_each_cluster    30.679 us    42.969 us
get_max_dist_for_each_cluster      27.729 us    42.977 us
inconsistent                       105.663 us   146.673 us
leaders                            33.941 us    47.849 us
linkage                            4.476 s      1.794 s
prelist                            29.833 us    41.334 us

The new implementation is not finished yet. All tests pass, but there may still be some bugs, and it lacks documentation for now.

GSoC2014: The First Month Jun 21 2014 CST by Richard

The first month of my GSoC has passed and it is about time to make a summary.

The PRs

I’ve made 4 Pull-Requests for my GSoC project, 3 of which have been merged.

  • #3636 MAINT: cluster: some clean up. Some cleanups, mainly in the docs.

  • #3683 ENH: cluster: rewrite and optimize vq in Cython. The most challenging part; already discussed in an earlier post.

  • #3702 BUG: cluster: _vq unable to handle large features. When I started to work on #3712, I noticed that the new version of _vq I wrote in the previous PR (#3683) could give wrong results when the values of the input features are very large. I thought it might be an overflow issue, since there is a matrix multiplication in the new algorithm, but I was wrong. It was caused by an initialization issue. I used the C macro NPY_INFINITY to initialize the output distances vector. However, ndarray.fill regarded NPY_INFINITY as the “integral infinity” and filled the float (or double) vector with \(2 ^ {31} - 1\). When the actual minimum distance was greater than such an “infinity”, the result would be an uninitialized value. The correct way to initialize the output distances vector is outdists.fill(np.inf).

  • #3712 ENH: cluster: reimplement the update-step of K-means in Cython This PR is my recent work and hasn’t been merged yet.

    An iteration of K-means requires two steps, the assignment step and the update step. The assignment step, implemented as vq.vq, has been optimized in the previous PRs. But the update step is still time-consuming, especially for low-dimensional data. The current implementation of the update step in SciPy looks like:

    for j in range(nc):
        mbs = np.where(label == j)
        code[j] = np.mean(data[mbs], axis=0)
    

    However, np.where is slow. A much more efficient algorithm would be:

    for j in range(nobs):
        code[labels[j], :] += data[j, :]
        counts[labels[j]] += 1
    for j in range(nc):
        code[j] /= counts[j]
    

    But of course looping through the observation matrix in pure Python would be extremely slow, so I implemented it in Cython and achieved decent speedups, especially on low-dimensional data. A rough sketch of the idea is shown after this list.
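The Cython version in the PR supports multiple dtypes via fused types; the following is only a rough, float64-only sketch of the update step, and the function name and signature here are my own, not the actual _vq API.

import numpy as np

def update_cluster_means(double[:, ::1] data, int[:] labels, int nc):
    cdef int nobs = data.shape[0]
    cdef int nfeat = data.shape[1]
    cdef double[:, ::1] code = np.zeros((nc, nfeat))
    cdef int[:] counts = np.zeros(nc, dtype=np.int32)
    cdef int i, j, c

    # one pass over the observations: accumulate per-cluster sums and counts
    for i in range(nobs):
        c = labels[i]
        counts[c] += 1
        for j in range(nfeat):
            code[c, j] += data[i, j]

    # turn the sums into means; empty clusters are simply left as zeros here
    for c in range(nc):
        if counts[c] > 0:
            for j in range(nfeat):
                code[c, j] /= counts[c]

    return np.asarray(code), np.asarray(counts)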

Achievement

First of all, I rewrote the underlying part of cluster.vq in Cython and it should be easier to maintain than the original C implementation.

Then, the performance. The new algorithms and optimization tricks work well. I built SciPy with ATLAS and tested vq.kmeans(obs, k) on an i5-3470 machine, and I got the following performance statistics.

dataset                         origin       #3683        #3712       speedup
30000 x 100, k = 10, float64    28636.2594   16950.3884   12687.425   2.26
30000 x 100, k = 10, float32    13575.0532   6483.4734    5272.2      2.57
10000 x 30, k = 10, float64     1923.3946    1146.7348    752.193     2.56
10000 x 30, k = 10, float32     1491.879     989.209      658.1064    2.27
100000 x 3, k = 4, float64      1663.143     1181.5912    529.0992    3.14
100000 x 3, k = 4, float32      1278.033     967.4396     431.0248    2.97
10000 x 1, k = 3, float64       117.6908     94.0622      41.6186     2.83
10000 x 1, k = 3, float32       110.2874     98.6826      41.9904     2.63
1000 x 3, k = 4, float64        29.23        28.697       9.8542      2.97
1000 x 3, k = 4, float32        27.5242      29.2488      12.4894     2.20

What’s Next

There are still about two months to go. First, I will continue to work on #3712. This PR still has some issues: there are some build warnings which seem to be caused by Cython fused types on some machines. Then I will start to work on cluster.hierarchy. The hierarchy module is more complicated and challenging than vq, and it may take me about a month to rewrite it in Cython. After that I’ll try to implement the SLINK and CLINK algorithms.

Finally, I want to thank my mentors Ralf, Charles and David, and of course everyone else in the SciPy community, who have helped me a lot. I’m very glad to be working with you!