@stdlib/ml

Machine learning algorithms.

{{alias}}( k[, ndims][, options] )
    Returns an accumulator function which incrementally partitions data into `k`
    clusters.

    If not provided initial centroids, the accumulator caches data vectors for
    subsequent centroid initialization. Until an accumulator computes initial
    centroids, an accumulator returns `null`.

    Once an accumulator has initial centroids (either provided or computed), if
    provided a data vector, the accumulator function returns updated cluster
    results. If not provided a data vector, the accumulator function returns the
    current cluster results.

    Cluster results comprise the following:

    - centroids: a `k x ndims` matrix containing centroid locations. Each
      centroid is the component-wise mean of the data points assigned to a
      centroid's corresponding cluster.

    - stats: a `k x 4` matrix containing cluster statistics.

    Cluster statistics consist of the following columns:

    - 0: number of data points assigned to a cluster.
    - 1: total within-cluster sum of squared distances.
    - 2: arithmetic mean of squared distances.
    - 3: corrected sample standard deviation of squared distances.

    Because an accumulator incrementally partitions data, one should *not*
    expect cluster statistics to match similar statistics had provided data been
    analyzed via a batch algorithm. In an incremental context, data points which
    would not be considered part of a particular cluster when analyzed via a
    batch algorithm may contribute to that cluster's statistics when analyzed
    incrementally.

    In general, the more data provided to an accumulator, the more reliable the
    cluster statistics.

    Parameters
    ----------
    k: integer|ndarray
        Number of clusters or a `k x ndims` matrix containing initial centroids.

    ndims: integer (optional)
        Number of dimensions. This argument must accompany an integer-valued
        first argument.

    options: Object (optional)
        Function options.

    options.metric: string (optional)
        Distance metric. Must be one of the following: 'euclidean', 'cosine', or
        'correlation'. Default: 'euclidean'.

    options.init: ArrayLike (optional)
        Centroid initialization method and associated (optional) parameters. The
        first array element specifies the initialization method and must be one
        of the following: 'kmeans++', 'sample', or 'forgy'. The second array
        element specifies the number of data points to use when calculating
        initial centroids. When performing kmeans++ initialization, the third
        array element specifies the number of trials to perform when randomly
        selecting candidate centroids. Typically, more trials are correlated
        with initial centroids which lead to better clustering; however, a
        greater number of trials increases computational overhead.
        Default: ['kmeans++', k, 2+⌊ln(k)⌋ ].

    options.normalize: boolean (optional)
        Boolean indicating whether to normalize incoming data. This option is
        only relevant for non-Euclidean distance metrics. If set to `true`, an
        accumulator partitioning data based on cosine distance normalizes
        provided data to unit Euclidean length. If set to `true`, an accumulator
        partitioning data based on correlation distance first centers provided
        data and then normalizes data dimensions to have zero mean and unit
        variance. If this option is set to `false` and the metric is either
        cosine or correlation distance, then incoming data *must* already be
        normalized. Default: true.

    options.copy: boolean (optional)
        Boolean indicating whether to copy incoming data to prevent mutation
        during normalization. Default: true.

    options.seed: any (optional)
        PRNG seed. Setting this option is useful when wanting reproducible
        initial centroids.

    Returns
    -------
    acc: Function
        Accumulator function.

    acc.predict: Function
        Predicts centroid assignment for each data point in a provided matrix.
        To specify an output vector, provide a 1-dimensional ndarray as the
        first argument. Each element in the returned vector corresponds to a
        predicted cluster index for a respective data point.

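
The four `stats` columns described above can be maintained one data point at a time. The following is a minimal plain JavaScript sketch of that idea for a single cluster, not the library's implementation. The names `sqEuclidean`, `update`, and `statsRow`, and the cluster object layout, are hypothetical. It assumes squared Euclidean distance, that each distance is measured to the centroid before the centroid moves, and that the standard deviation is tracked with Welford's method.

```javascript
// Hypothetical helper: squared Euclidean distance between two plain arrays.
function sqEuclidean( x, y ) {
    var s = 0.0;
    var d;
    var i;
    for ( i = 0; i < x.length; i++ ) {
        d = x[ i ] - y[ i ];
        s += d * d;
    }
    return s;
}

// Hypothetical helper: folds a data point `x` into one cluster.
// Assumed layout: { centroid, n, sum, mean, m2 }, where `m2` is the running
// sum of squared deviations of the squared distances (Welford's method).
function update( cluster, x ) {
    // Assumption: distance is measured before the centroid is moved.
    var d2 = sqEuclidean( cluster.centroid, x );
    var delta;
    var i;

    cluster.n += 1;

    // Move the centroid toward `x` so it stays the component-wise mean:
    for ( i = 0; i < x.length; i++ ) {
        cluster.centroid[ i ] += ( x[ i ] - cluster.centroid[ i ] ) / cluster.n;
    }
    // Column 1: total within-cluster sum of squared distances.
    cluster.sum += d2;

    // Columns 2 and 3: running mean and squared deviations.
    delta = d2 - cluster.mean;
    cluster.mean += delta / cluster.n;
    cluster.m2 += delta * ( d2 - cluster.mean );
    return cluster;
}

// Hypothetical helper: returns one row of a `k x 4` stats matrix.
function statsRow( cluster ) {
    var sd = 0.0;
    if ( cluster.n > 1 ) {
        // Corrected sample standard deviation divides by n-1.
        sd = Math.sqrt( cluster.m2 / ( cluster.n - 1 ) );
    }
    return [ cluster.n, cluster.sum, cluster.mean, sd ];
}

// Usage: one cluster with an initial centroid at the origin.
var cluster = { 'centroid': [ 0.0, 0.0 ], 'n': 0, 'sum': 0.0, 'mean': 0.0, 'm2': 0.0 };
update( cluster, [ 2.0, 0.0 ] );
update( cluster, [ 4.0, 0.0 ] );
update( cluster, [ 3.0, 3.0 ] );
var row = statsRow( cluster );
```

Because each distance is taken against the centroid as it stood when the point arrived, the resulting statistics depend on arrival order, which illustrates why incremental statistics need not match those from a batch algorithm.
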
    Examples
    --------
    > var accumulator = {{alias}}( 5, 2 );
    > var buf = new {{alias:@stdlib/array/float64}}( 2 );
    > var shape = [ 2 ];
    > var strides = [ 1 ];
    > var v = {{alias:@stdlib/ndarray/ctor}}( 'float64', buf, shape, strides, 0, 'row-major' );
    > v.set( 0, 2.0 );
    > v.set( 1, 1.0 );
    > var out = accumulator( v );
    > v.set( 0, -5.0 );
    > v.set( 1, 3.14 );
    > out = accumulator( v );

    See Also
    --------