# HAC

HAC stands for Hierarchical Agglomerative Clustering, a common technique for unsupervised document clustering.

> **NOTICE**:
> HAC depends on modules that are hosted on GitHub and not published on npm.
> It works fine with `npm install`,
> but fails on *Tonic* (the *Try it out* feature on the npm website),
> because Tonic requires every module to be published on npm.
> A future release will try to publish these required modules on npm.

# Installation

```bash
npm install hac --save
```

# Usage

## Instantiate

```javascript
var HAC = require("hac");
var hac = new HAC();
```

## Add documents

```javascript
hac.addDocument(doc, id, class);
```

Arguments:

* doc `String/Array`: the document to be added, either a string of text or an array of terms
* id `String/int` (optional): the id of the document. If omitted, a UUID is generated automatically
* class `String/int` (optional): the class (or label) of this document. You probably won't need this, but if it is specified, you can use `getMeasure()` to get the F measure or Rand index and see the clustering performance.

## Clustering

```javascript
hac.cluster(clusterMethod);
```

Arguments:

* clusterMethod `Class Method`: the clustering algorithm to be used. Available options are:
  + `HAC.GA`: group-average agglomerative clustering
  + `HAC.SingleLink`: single-link clustering
  + `HAC.CompleteLink`: complete-link clustering
  + `HAC.Centroid`: centroid clustering. *To Be Implemented*

## Get clustering result

```javascript
var clusters = hac.getClusters(k, fields);
```

Arguments:

* k `int`: the number of clusters
* fields `Array`: array of the document fields that you want in the final clustering result. Available fields are:
  + "id": the id of the document
  + "class": the class (label) of the document, if specified when calling `addDocument()`
  + "content": string of document content
  + "terms": document content represented as an array of terms
  + "tfs": array of term frequencies for this document
  + "vector": vector representation of this document

Alternatively, you can use the following method to get clusters with cluster labels:

```javascript
var clusters = hac.getClustersWithLabels(k, fields, featureCount, featureMethod);
```

The cluster labeling algorithm uses feature selection, provided by a module called [FeatureSelector](https://github.com/roackb2/feature-selector).

Arguments:

* k `int`: the number of clusters
* fields `Array`: array of fields. See the description of `getClusters()` above
* featureCount `int`: the number of feature terms that you want for each cluster
* featureMethod `Class Method`: the feature selection algorithm to be used. Available options are:
  + `FeatureSelector.MI`: Expected Mutual Information feature selection
  + `FeatureSelector.LLR`: Likelihood Ratio feature selection

## Get performance measurement

You can get the F measure or Rand index for the clustering result.

> **NOTE**: if you want to see performance measurements, you must specify the `class` argument when calling `addDocument()`. Also, when calling `getClusters()` or `getClustersWithLabels()`, you must include the field `"class"` in the argument `fields`.

```javascript
var measure = getMeasure(clusters, method, beta, showRawScore);
```

Arguments:

* clusters `Array`: the clustering result that you get by calling `getClusters()` or `getClustersWithLabels()`
* method `Class Method`: the measuring algorithm to be used. Available options are:
  + `HAC.F`: F measure
  + `HAC.RI`: Rand index
* beta `int` (optional): if you use `HAC.F`, you should give `hac` a beta value, which should be an integer greater than or equal to 1
* showRawScore `boolean` (optional): if set to true, prints the tp, fp, fn, tn, total negative and total positive counts to the console

# Complete example

```javascript
var HAC = require("hac");
// Assumption: `_` is lodash; the original example uses it without requiring it.
var _ = require("lodash");

var hac = new HAC();
var docs = [];
// Documents can be arrays of terms...
docs.push(["嗨", "你好"]);
docs.push(["嗨", "很", "高興", "認識", "你"]);
// ...or plain strings of text.
docs.push("hello, how's everything today? is everything ok today?");
docs.push("let's test one more document!");
docs.push("documents are always not large enough");
// Use the array index as the document id.
for (var i = 0; i < docs.length; i++) {
    hac.addDocument(docs[i], i);
}
hac.cluster(HAC.GA);
var clusters = hac.getClusters(2, ["id", "content"]);
_.forEach(clusters, function(cluster) {
    console.log("cluster id: " + cluster.id);
    _.forEach(cluster.docs, function(doc) {
        console.log("doc id: " + doc.id);
        console.log("doc content: " + doc.content);
    });
    console.log();
});
```

The result would be:

```
cluster id: 7
doc id: 0
doc content: 嗨,你好
doc id: 1
doc content: 嗨,很,高興,認識,你
doc id: 2
doc content: hello, how's everything today? is everything ok today?

cluster id: 6
doc id: 3
doc content: let's test one more document!
doc id: 4
doc content: documents are always not large enough
```

# Release Notes

* 1.0.7: update URLs of modules hosted on GitHub to a simpler form
* 1.0.6: correct the require path of the heap module
* 1.0.5: document the incompatibility with `Tonic` in the README
* 1.0.4: require es6-shim to support older node engines
* 1.0.3: change arrow functions to anonymous functions for backward compatibility
* 1.0.2: minor modification to README
* 1.0.1: first release
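
# Note on the performance measures

The `showRawScore` option mentions tp, fp, fn and tn counts. As a rough illustration of how such pair counts relate to the F measure and the Rand index, here is a minimal sketch using the standard pair-counting definitions. It is not taken from the module's source: the function names `pairCounts`, `randIndex` and `fMeasure` are hypothetical, the `"zh"`/`"en"` class labels are made-up sample data, and the input shape is assumed to match what `getClusters(k, ["id", "class"])` returns in the example above.

```javascript
// Illustrative sketch only; NOT the hac module's implementation.
// Assumed input shape: [{ id, docs: [{ id, class }] }, ...]

// Count pairs of documents:
//   tp: same cluster, same class     fp: same cluster, different class
//   fn: different cluster, same class tn: different cluster, different class
function pairCounts(clusters) {
    var docs = [];
    clusters.forEach(function(cluster, clusterIndex) {
        cluster.docs.forEach(function(doc) {
            docs.push({ cluster: clusterIndex, label: doc["class"] });
        });
    });
    var counts = { tp: 0, fp: 0, fn: 0, tn: 0 };
    for (var i = 0; i < docs.length; i++) {
        for (var j = i + 1; j < docs.length; j++) {
            var sameCluster = docs[i].cluster === docs[j].cluster;
            var sameClass = docs[i].label === docs[j].label;
            if (sameCluster && sameClass) counts.tp++;
            else if (sameCluster) counts.fp++;
            else if (sameClass) counts.fn++;
            else counts.tn++;
        }
    }
    return counts;
}

// Rand index: fraction of pairs on which clustering and classes agree.
function randIndex(c) {
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn);
}

// F measure: weighted harmonic mean of pairwise precision and recall.
// beta > 1 weights recall more heavily than precision.
function fMeasure(c, beta) {
    if (c.tp === 0) return 0; // avoid dividing by zero when no pair is a true positive
    var precision = c.tp / (c.tp + c.fp);
    var recall = c.tp / (c.tp + c.fn);
    var b2 = beta * beta;
    return ((b2 + 1) * precision * recall) / (b2 * precision + recall);
}

// Hypothetical labelled version of the clustering result shown above.
var sample = [
    { id: 7, docs: [{ id: 0, class: "zh" }, { id: 1, class: "zh" }, { id: 2, class: "en" }] },
    { id: 6, docs: [{ id: 3, class: "en" }, { id: 4, class: "en" }] }
];
var counts = pairCounts(sample);
console.log(counts);               // { tp: 2, fp: 2, fn: 2, tn: 4 }
console.log(randIndex(counts));    // 0.6
console.log(fMeasure(counts, 1));  // 0.5
```

With five documents there are ten pairs. Two same-class pairs share a cluster (tp), two mixed-class pairs share a cluster (fp), two same-class pairs are split (fn), and four mixed-class pairs are split (tn). The values `getMeasure()` actually reports may be computed differently.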