guv

Grid Utilization Virgilante

github.com/jonnor/gov


## Tools TODO

- Timing Heroku dyno boot time
- Calculating mean, stddev from processing time stats. Output as config
- Test whether processing times are normally distributed. Visual + [analytical test](https://en.wikipedia.org/wiki/Normality_test).
- Calculate clusters/bins into which multi-mode, non-normal data can be separated
- A runnable+introspectable model of the scaling algorithm. Ability to test it on real/historical data.
- For a given configuration, estimate what loads can (and cannot) be handled

## Smarter scaling

Right now, guv uses the simplest (stupidest) model that can possibly work:
Scale the number of workers proportionally to messages in the queue.
Scaling factor is based on processing time estimates versus deadline.

This model is completely reactive: it only actuates changes after a situation has occurred. No prediction.
The model does not take into account the (significant) time costs of dyno boot up.

Scaling function should receive all necessary state.
The state collected could be a window of (jobs, workers) measurements.
Measurements must be timestamped, window should be time-based, number of measurements/time as high as possible (without being disruptive).
Cannot assume measurements will be evenly spaced.

Key questions:

- When considering whether to scale down, what is the probability that we will go back up to or above N messages (and thus W workers) within the next 30-60 seconds?
- What is the cost of overestimating/overprovisioning?

## Initial design

Requirements

- Scale workers based on queue length
- Support _multiple_ workers/queues
- Failure to hit deadline when maxing out scale raises OPS error
- Store input metrics, settings and decisions somewhere - for analysis
- (optional) Scale web workers based on request/response metrics
- (optional) Allow to run as a periodic job instead of dedicated dyno
- (optional) Visualization of results
- Notify failure on New Relic
- Monitor queues in RabbitMQ
- Scale dynos on Heroku

Non-requirements

- Scaling non-compute resources (databases etc)

Settings

Should be able to have per-app defaults for each setting, and then override per role.
Also a global default ofc.

- minimum_dynos: Should allow 0. (N dynos, Integer)
- maximum_dynos: Based on a budget. Also to catch cases where scaling goes crazy. (N dynos, Integer)
- target_deadline: The response time we shall attempt to keep. (seconds)

Things to consider

- hysteresis. avoiding scaling up/down unnecessarily
- window length. how much history to consider
- rescale activation time. how long it takes to make a change (order of 1 minute)
- time spent in queue
- time spent performing job

Test cases

- Periodic fluctuations (sine wave, square, triangle)
- Within-target spike
- Out-of-bounds spike

Architecture

- Metric collectors: Gather metrics about the apps
- Converters: Convert metrics into resources
- Actuators: Activate new number of resources
- Notificators:

## Prior art

- Dynosaur. Ruby gem. Dedicated dyno. http://engineering.harrys.com/2014/01/02/dynosaur-a-heroku-autoscaler.html
- Workless. Ruby gem. http://symmetricinfinity.com/2013/04/19/autoscale-workers-on-heroku-with-workless.html
- Viki. Ruby script. Cronjob. https://github.com/viki-org/heroku-autoscale/blob/master/autoscale http://engineering.viki.com/blog/2011/autoscaling-heroku-dynos/
- Python. Schedule-based scaling. https://realpython.com/blog/python/automatically-scale-heroku-dynos/
- Adept Scale. Service, Heroku addon. https://addons.heroku.com/adept-scale
- Heroku Vector. Ruby, linear scaling. https://github.com/wpeterson/heroku-vector

## References

- https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes
- https://en.wikipedia.org/?title=Queueing_theory
- http://stats.stackexchange.com/questions/18821/why-is-the-poisson-distribution-chosen-to-model-arrival-processes-in-queueing-th
- http://www.math.uah.edu/stat/poisson/Poisson.html
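
As a rough illustration of the proportional model described under "Smarter scaling", the sketch below sizes the worker pool so that the queued work fits inside the deadline, then clamps the result to the configured bounds. The function name, argument names and default values are assumptions made for this example and are not taken from guv's code. Only `minimum_dynos`, `maximum_dynos` and `target_deadline` mirror the settings listed above. Dyno boot time is ignored, as in the model it illustrates.

```python
import math


def proportional_workers(queued_jobs, processing_time, target_deadline,
                         minimum_dynos=0, maximum_dynos=10):
    """Return (workers, deadline_at_risk) for the current queue length.

    Hypothetical helper: names and defaults are illustrative, not guv's API.
    processing_time and target_deadline are in seconds.
    """
    if processing_time <= 0 or target_deadline <= 0:
        raise ValueError("processing_time and target_deadline must be positive")
    # Total queued work (seconds) divided by the time budget (seconds)
    # gives the number of workers needed to finish before the deadline.
    wanted = math.ceil(queued_jobs * processing_time / target_deadline)
    # Clamp to the configured bounds.
    workers = max(minimum_dynos, min(maximum_dynos, wanted))
    # True when even the maximum number of workers cannot meet the deadline,
    # which is the situation that should raise an OPS error.
    return workers, wanted > maximum_dynos
```

For example, `proportional_workers(100, 2.0, 20.0)` asks for 10 workers, while 500 queued jobs with the same settings still returns 10 workers but sets the second value to `True`, signalling that the deadline cannot be met at maximum scale.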