lumenize
Illuminating the forest AND the trees in your data.
Ext.data.JsonP.Lumenize_BayesianClassifier({"tagname":"class","name":"Lumenize.BayesianClassifier","autodetected":{},"files":[{"filename":"Classifier.coffee.js","href":"Classifier.coffee.html#Lumenize-BayesianClassifier"}],"members":[{"name":"features","tagname":"cfg","owner":"Lumenize.BayesianClassifier","id":"cfg-features","meta":{}},{"name":"outputField","tagname":"cfg","owner":"Lumenize.BayesianClassifier","id":"cfg-outputField","meta":{}},{"name":"constructor","tagname":"method","owner":"Lumenize.BayesianClassifier","id":"method-constructor","meta":{}},{"name":"getStateForSaving","tagname":"method","owner":"Lumenize.BayesianClassifier","id":"method-getStateForSaving","meta":{}},{"name":"predict","tagname":"method","owner":"Lumenize.BayesianClassifier","id":"method-predict","meta":{}},{"name":"train","tagname":"method","owner":"Lumenize.BayesianClassifier","id":"method-train","meta":{}},{"name":"newFromSavedState","tagname":"method","owner":"Lumenize.BayesianClassifier","id":"static-method-newFromSavedState","meta":{"static":true}}],"alternateClassNames":[],"aliases":{},"id":"class-Lumenize.BayesianClassifier","short_doc":"A Bayesian classifier with non-parametric modeling of distributions using v-optimal bucketing. ...","component":false,"superclasses":[],"subclasses":[],"mixedInto":[],"mixins":[],"parentMixins":[],"requires":[],"uses":[],"html":"<div><pre class=\"hierarchy\"><h4>Files</h4><div class='dependency'><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier' target='_blank'>Classifier.coffee.js</a></div></pre><div class='doc-contents'><p><strong>A Bayesian classifier with non-parametric modeling of distributions using v-optimal bucketing.</strong></p>\n\n<p>If you look for libraries for Bayesian classification, the primary use case is spam filtering and they assume that\nthe presence or absence of a word is the only feature you are interested in. 
This is a more general-purpose tool.</p>\n\n<h2>Features</h2>\n\n<ul>\n<li>Works even for bi-modal and other non-normal distributions</li>\n<li>No requirement that you identify the distribution</li>\n<li>Uses <a href=\"http://en.wikipedia.org/wiki/Non-parametric_statistics\">non-parametric modeling</a></li>\n<li>Uses v-optimal bucketing so it deals well with outliers and sharp cliffs</li>\n<li>Serialize (<code>getStateForSaving()</code>) and deserialize (<code>newFromSavedState()</code>) to preserve training between sessions</li>\n</ul>\n\n\n<h2>Why the assumption of a normal distribution is bad in some cases</h2>\n\n<p>The <a href=\"https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification\">Wikipedia example of using Bayes</a> tries\nto determine if someone is male or female based upon height, weight,\nand shoe size. The assumption is that men are generally larger, heavier, and have a larger shoe size than women. In the\nexample, they use the mean and variance of the male-only and female-only populations to characterize those\ndistributions. This works because these characteristics are generally normally distributed <strong>and the distribution for\nmen is generally to the right of the distribution for women</strong>.</p>\n\n<p>However, let's ask a group of folks who work together if they consider themselves a team and let's try to use the size\nof the group as a feature to predict what a new group would say. If the group is very small (1-2 people), they are\nless likely to consider themselves a team (partnership maybe), but if they are too large (say > 10), they are also\nunlikely to refer to themselves as a team. The non-team distribution is bimodal, so looking only at its mean and variance\ncompletely mischaracterizes it. 
Also, the distribution is zero bound so it's likely to be asymmetric, which also\nposes problems for a normal distribution assumption.</p>\n\n<h2>So what do we do instead?</h2>\n\n<p>This classifier uses the actual values (in buckets) rather than characterize the distribution as \"normal\", \"log-normal\", etc.\nThis approach is often referred to as \"building a non-parametric model\".</p>\n\n<p><strong>Pros/Cons</strong>. The use of a non-parametric approach will allow us to deal with non-normal distributions (asymmetric,\nbimodal, etc.) without ever having to identify which nominal distribution is the best fit or having to ask the user\n(who may not know) what distribution to use. The downside to this approach is that it generally requires a larger\ntraining set. You will need to experiment to determine how small is too small for your situation.</p>\n\n<p>This approach is hinted at in the <a href=\"https://en.wikipedia.org/wiki/Naive_Bayes_classifier\">wikipedia article on Bayesian classifiers</a>\nas \"binning to discretize the feature values, to obtain a new set of Bernoulli-distributed features\". However, this\nclassifier does not create new separate Bernoulli features for each bin. Rather, it creates a mapping function from a feature\nvalue to a probability indicating how often the feature value is coincident with a particular outputField value. This mapping\nfunction is different for each bin.</p>\n\n<h2>V-optimal bucketing</h2>\n\n<p>There are two common approaches to bucketing:</p>\n\n<ol>\n<li>Make each bucket be equal in width along the x-axis (like we would for a histogram) (equi-width)</li>\n<li>Make each bucket have roughly the same number of data points (equi-depth)</li>\n</ol>\n\n\n<p>It turns out neither of the above works out well unless the training set is relatively large. 
Rather, there is an\napproach called <a href=\"http://en.wikipedia.org/wiki/V-optimal_histograms\">v-optimal bucketing</a> which attempts to find the\noptimal boundaries in the data. The basic idea is to look for the splits that provide the minimum total error-squared\nwhere the \"error\" for each point is the distance of that point from the arithmetic mean of its bucket. This classifier uses v-optimal\nbucketing when the training set has 144 or fewer rows. Above that it switches to equi-depth bucketing. Note, I only\nevaluated a single scenario (Rally RealTeam), but 144 was the point where equi-depth started to provide results as good as\nv-optimal bucketing. Note, in my test, much larger sets had moderately <em>better</em> results with equi-depth bucketing.</p>\n\n<p>That said, the 144 cutoff was determined with an older version of the v-optimal bucketing. I've since fixed that old\nalgorithm's tendency to produce lopsided distributions. It may very well be possible for v-optimal to be better for\neven larger numbers of data points. I need to run a new experiment to see.</p>\n\n<p>The algorithm used here for v-optimal bucketing is slightly inspired by\n<a href=\"http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt3.html\">this</a>.\nHowever, I've made some different choices about when to terminate the splitting and deciding what portion to split again. To\nunderstand the essence of the algorithm used, you need only look at the 9 lines of code in the <code>findBucketSplits()</code> function.\nThe <code>optimalSplitFor2Buckets()</code> function will split the values into two buckets. 
It tries each possible split\nstarting with only one in the bucket on the left all the way down to a split with only one in the bucket on the right.\nIt then figures out which split has the highest error and splits that again until we have the target number of splits.</p>\n\n<h2>Simple example</h2>\n\n<p>First we need to require the classifier.</p>\n\n<pre><code>{BayesianClassifier} = require('../')\n</code></pre>\n\n<p>Before we start, let's take a look at our training set. The assumption is that we think TeamSize and HasChildProject\nwill be predictors for RealTeam.</p>\n\n<pre><code>trainingSet = [\n {TeamSize: 5, HasChildProject: 0, RealTeam: 1},\n {TeamSize: 3, HasChildProject: 1, RealTeam: 0},\n {TeamSize: 3, HasChildProject: 1, RealTeam: 1},\n {TeamSize: 1, HasChildProject: 0, RealTeam: 0},\n {TeamSize: 2, HasChildProject: 1, RealTeam: 0},\n {TeamSize: 2, HasChildProject: 0, RealTeam: 0},\n {TeamSize: 15, HasChildProject: 1, RealTeam: 0},\n {TeamSize: 27, HasChildProject: 1, RealTeam: 0},\n {TeamSize: 13, HasChildProject: 1, RealTeam: 1},\n {TeamSize: 7, HasChildProject: 0, RealTeam: 1},\n {TeamSize: 7, HasChildProject: 0, RealTeam: 0},\n {TeamSize: 9, HasChildProject: 1, RealTeam: 1},\n {TeamSize: 6, HasChildProject: 0, RealTeam: 1},\n {TeamSize: 5, HasChildProject: 0, RealTeam: 1},\n {TeamSize: 5, HasChildProject: 0, RealTeam: 0},\n]\n</code></pre>\n\n<p>Now, let's set up a simple config indicating our assumptions. Note how the type for TeamSize is 'continuous'\nwhereas the type for HasChildProject is 'discrete' even though a number is stored. 
Continuous types must be numbers\nbut discrete types can either be numbers or strings.</p>\n\n<pre><code>config =\n outputField: \"RealTeam\"\n features: [\n {field: 'TeamSize', type: 'continuous'},\n {field: 'HasChildProject', type: 'discrete'}\n ]\n</code></pre>\n\n<p>We can now instantiate the classifier with that config,</p>\n\n<pre><code>classifier = new BayesianClassifier(config)\n</code></pre>\n\n<p>and pass in our training set.</p>\n\n<pre><code>percentWins = classifier.train(trainingSet)\n</code></pre>\n\n<p>The call to <code>train()</code> returns the fraction of the time (a number between 0 and 1) that the trained classifier gets the right answer for the training\nset. This should usually be pretty high. Anything below, say, 70% means you probably don't have the right \"features\"\nin your training set or you don't have enough training set data. Our made-up example is a borderline case.</p>\n\n<pre><code>console.log(percentWins)\n# 0.7333333333333333\n</code></pre>\n\n<p>Now, let's see how the trained classifier is used to predict \"RealTeam\"-ness. We simply pass in an object with\nfields for each of our features. 
A very small team with child projects is definitely not a RealTeam.</p>\n\n<pre><code>console.log(classifier.predict({TeamSize: 1, HasChildProject: 1}))\n# 0\n</code></pre>\n\n<p>However, a mid-sized project with no child projects most certainly is a RealTeam.</p>\n\n<pre><code>console.log(classifier.predict({TeamSize: 7, HasChildProject: 0}))\n# 1\n</code></pre>\n\n<p>Here is a less obvious case, with one indicator going one way (the right size) and another going the other way (has child projects).</p>\n\n<pre><code>console.log(classifier.predict({TeamSize: 5, HasChildProject: 1}))\n# 1\n</code></pre>\n\n<p>If you want to know the strength of the prediction, you can pass in <code>true</code> as the second parameter to the <code>predict()</code> method.</p>\n\n<pre><code>console.log(classifier.predict({TeamSize: 5, HasChildProject: 1}, true))\n# { '0': 0.3786982248520709, '1': 0.6213017751479291 }\n</code></pre>\n\n<p>We're only 62.1% sure this is a RealTeam. Notice how the keys for the output are strings even though we passed in values\nof type Number for the RealTeam field in our training set. We had no choice in this case because keys of JavaScript\nObjects must be strings. 
However, the classifier is smart enough to convert it back to the correct type if you call\nit without passing in true for the second parameter.</p>\n\n<p>Like the Lumenize calculators, you can save and restore the state of a trained classifier.</p>\n\n<pre><code>savedState = classifier.getStateForSaving('some meta data')\nnewClassifier = BayesianClassifier.newFromSavedState(savedState)\nconsole.log(newClassifier.meta)\n# some meta data\n</code></pre>\n\n<p>It will make the same predictions.</p>\n\n<pre><code>console.log(newClassifier.predict({TeamSize: 5, HasChildProject: 1}, true))\n# { '0': 0.3786982248520709, '1': 0.6213017751479291 }\n</code></pre>\n</div><div class='members'><div class='members-section'><div class='definedBy'>Defined By</div><h3 class='members-title icon-cfg'>Config options</h3><div class='subsection'><div id='cfg-features' class='member first-child not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-cfg-features' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-cfg-features' class='name expandable'>features</a> : Object[]<span class=\"signature\"></span></div><div class='description'><div class='short'>Array of Maps which specifies the fields to use as features. ...</div><div class='long'><p>Array of Maps which specifies the fields to use as features. Each row in the array should\n be in the form of <code>{field: <fieldName>, type: <'continuous' | 'discrete'>}</code>. Note that you can even declare Number type\n fields as 'discrete'. 
It is preferable to do this if you know that it can only be one of a handful of values\n (0 vs 1 for example).</p>\n\n<p> <strong>WARNING: If you choose 'discrete' for the feature type, then ALL possible values for that feature must appear\n in the training set. If the classifier is asked to make a prediction with a value that it has never seen\n before, it will fail catastrophically.</strong></p>\n</div></div></div><div id='cfg-outputField' class='member not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-cfg-outputField' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-cfg-outputField' class='name expandable'>outputField</a> : String<span class=\"signature\"></span></div><div class='description'><div class='short'><p>String indicating which field in the training set is what we are trying to predict</p>\n</div><div class='long'><p>String indicating which field in the training set is what we are trying to predict</p>\n</div></div></div></div></div><div class='members-section'><h3 class='members-title icon-method'>Methods</h3><div class='subsection'><div class='definedBy'>Defined By</div><h4 class='members-subtitle'>Instance methods</h4><div id='method-constructor' class='member first-child not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-method-constructor' target='_blank' class='view-source'>view source</a></div><strong class='new-keyword'>new</strong><a href='#!/api/Lumenize.BayesianClassifier-method-constructor' class='name expandable'>Lumenize.BayesianClassifier</a>( <span 
class='pre'>userConfig</span> ) : <a href=\"#!/api/Lumenize.BayesianClassifier\" rel=\"Lumenize.BayesianClassifier\" class=\"docClass\">Lumenize.BayesianClassifier</a><span class=\"signature\"></span></div><div class='description'><div class='short'> ...</div><div class='long'>\n<h3 class=\"pa\">Parameters</h3><ul><li><span class='pre'>userConfig</span> : Object<div class='sub-desc'><p>See Config options for details.</p>\n</div></li></ul><h3 class='pa'>Returns</h3><ul><li><span class='pre'><a href=\"#!/api/Lumenize.BayesianClassifier\" rel=\"Lumenize.BayesianClassifier\" class=\"docClass\">Lumenize.BayesianClassifier</a></span><div class='sub-desc'>\n</div></li></ul></div></div></div><div id='method-getStateForSaving' class='member not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-method-getStateForSaving' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-method-getStateForSaving' class='name expandable'>getStateForSaving</a>( <span class='pre'>[meta]</span> ) : Object<span class=\"signature\"></span></div><div class='description'><div class='short'>Enables saving the state of a Classifier. ...</div><div class='long'><p>Enables saving the state of a Classifier.</p>\n\n<p> See the bottom of the \"Simple example\" for example code of using this\n saving and restoring functionality.</p>\n<h3 class=\"pa\">Parameters</h3><ul><li><span class='pre'>meta</span> : Object (optional)<div class='sub-desc'><p>An optional parameter that will be added to the serialized output and added to the meta field\n within the deserialized Classifier</p>\n</div></li></ul><h3 class='pa'>Returns</h3><ul><li><span class='pre'>Object</span><div class='sub-desc'><p>Returns an Object representing the state of the Classifier. 
This Object is suitable for saving to\n an object store. Use the static method <code>newFromSavedState()</code> with this Object as the parameter to reconstitute the Classifier.</p>\n</div></li></ul></div></div></div><div id='method-predict' class='member not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-method-predict' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-method-predict' class='name expandable'>predict</a>( <span class='pre'>row, [returnProbabilities]</span> ) : String|Number|Object<span class=\"signature\"></span></div><div class='description'><div class='short'>Use the trained classifier to make a prediction. ...</div><div class='long'><p>Use the trained classifier to make a prediction.</p>\n<h3 class=\"pa\">Parameters</h3><ul><li><span class='pre'>row</span> : Object<div class='sub-desc'><p>an Object containing a field for each of the features specified by the config.</p>\n</div></li><li><span class='pre'>returnProbabilities</span> : Boolean (optional)<div class='sub-desc'><p>If true, then the output will indicate the probabilities of each\n possible outputField value. 
Otherwise, the output of a call to <code>predict()</code> will return the predicted value with\n the highest probability.</p>\n<p>Defaults to: <code>false</code></p></div></li></ul><h3 class='pa'>Returns</h3><ul><li><span class='pre'>String|Number|Object</span><div class='sub-desc'><p>If returnProbabilities is false (the default), then it will return the prediction.\n If returnProbabilities is true, then it will return an Object indicating the probability for each possible\n outputField value.</p>\n</div></li></ul></div></div></div><div id='method-train' class='member not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-method-train' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-method-train' class='name expandable'>train</a>( <span class='pre'>userSuppliedTrainingSet</span> ) : Number<span class=\"signature\"></span></div><div class='description'><div class='short'>Train the classifier with a training set. ...</div><div class='long'><p>Train the classifier with a training set.</p>\n<h3 class=\"pa\">Parameters</h3><ul><li><span class='pre'>userSuppliedTrainingSet</span> : Object[]<div class='sub-desc'><p>an Array of Maps containing a field for the outputField as well as a field\n for each of the features specified in the config.</p>\n</div></li></ul><h3 class='pa'>Returns</h3><ul><li><span class='pre'>Number</span><div class='sub-desc'><p>The percentage of time that the trained classifier returns the expected outputField for the rows\n in the training set. 
If this is low (say below 70%), you need more predictive fields and/or more data in your\n training set.</p>\n</div></li></ul></div></div></div></div><div class='subsection'><div class='definedBy'>Defined By</div><h4 class='members-subtitle'>Static methods</h4><div id='static-method-newFromSavedState' class='member first-child not-inherited'><a href='#' class='side expandable'><span> </span></a><div class='title'><div class='meta'><span class='defined-in' rel='Lumenize.BayesianClassifier'>Lumenize.BayesianClassifier</span><br/><a href='source/Classifier.coffee.html#Lumenize-BayesianClassifier-static-method-newFromSavedState' target='_blank' class='view-source'>view source</a></div><a href='#!/api/Lumenize.BayesianClassifier-static-method-newFromSavedState' class='name expandable'>newFromSavedState</a>( <span class='pre'>p</span> ) : Classifier<span class=\"signature\"><span class='static' >static</span></span></div><div class='description'><div class='short'>Deserializes a previously stringified Classifier and returns a new Classifier. ...</div><div class='long'><p>Deserializes a previously stringified Classifier and returns a new Classifier.</p>\n\n<p> See the bottom of the \"Simple example\" for example code of using this\n saving and restoring functionality.</p>\n<h3 class=\"pa\">Parameters</h3><ul><li><span class='pre'>p</span> : String/Object<div class='sub-desc'><p>A String or Object from a previously saved Classifier state</p>\n</div></li></ul><h3 class='pa'>Returns</h3><ul><li><span class='pre'>Classifier</span><div class='sub-desc'>\n</div></li></ul></div></div></div></div></div></div></div>","meta":{}});