Real-time A/B testing of models, even when combined with more advanced online selection and learning techniques, can still suffer from selection bias. Before running A/B tests, the KPI metric stat report can be used to monitor prior test performance, so you can respond or adapt if performance degrades for any reason.
Time is a very precious commodity when you are running A/B tests online. Make sure you go in with agreed upper bounds on how long you're going to spend testing a given experiment, informed by the prior KPI metric stat report. During the test, if you're not seeing trends in the p-value that look encouraging, it's time to pull the plug.
Users that are supposedly randomly assigned to either your control or your treatment group might not be randomly distributed after all, so audit the systems, using the KPI metric stat report, to make sure there is no selection bias in the actual assignment of users to the control or treatment groups.
Make sure the user assignment is sticky. If you are measuring the effect of a change over an entire session, make sure users are not switching groups before or during the testing period.
Look at the same exact group of users and check whether you observe different engagement values; if you do, there is an inherent bias or another problem, e.g., session leakage, that you need to address.
If you are running multiple tests simultaneously, make sure they don't conflict with one another across all user buckets; the sketch below shows one way to keep assignment both sticky and conflict-free.
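As a concrete illustration of the sticky-assignment and audit points above, here is a minimal sketch of deterministic bucketing. Hashing the user ID salted with the experiment name keeps a user's group stable across sessions and keeps concurrent experiments uncorrelated; the function name, the SHA-256 choice, and the 50/50 split are all illustrative assumptions, not any particular platform's API.

```python
import hashlib

def assign_group(user_id: str, experiment: str,
                 treatment_share: float = 0.5) -> str:
    """Sticky, deterministic assignment: the same (user, experiment)
    pair always maps to the same group, and salting with the experiment
    name decorrelates buckets across simultaneous tests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15   # uniform-ish value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# stickiness: repeated calls always agree
assert assign_group("user-123", "new-ranker") == assign_group("user-123", "new-ranker")
```

Because assignment is a pure function of user and experiment, auditing for selection bias reduces to checking that observed group sizes and user attributes match the configured split.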
https://eng.uber.com/xp-background-push/
In the background, our A/B testing platform, Morpheus, constructs the payload that needs to be sent to users based on the new configuration. This payload is sent to an internal service called GroupPusher, along with a key identifying the flag which needs to be rolled back. GroupPusher is a service which takes some key identifying a set of users in Cassandra and a payload, and proceeds to send the payload to all of the users in the identified set. GroupPusher then pulls the list of affected users from Cassandra based on this key.
GroupPusher next calls Pusher, an internal service that sends the payload to users at a rate of about 3,000 pushes per second. Pusher is responsible for sending push notifications to users, e.g., to let them know that their driver is approaching their pickup location: "Your Uber is arriving now." Pusher then sends the payload down to the mobile clients via APNs (for iOS) and GCM (for Android).
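The flow above can be sketched end to end. Everything in this snippet is a hypothetical stand-in for Uber's internal services (Morpheus builds the payload, Cassandra stores the user set, GroupPusher fans out, Pusher delivers via APNs/GCM); only the shape of the flow comes from the post.

```python
from typing import Callable, Dict, List

def build_payload(config: Dict) -> bytes:
    # Morpheus's role: turn the new flag configuration into a payload
    return repr(config).encode()

class UserStore:
    # Cassandra's role: map a flag key to the set of affected users
    def __init__(self, index: Dict[str, List[str]]):
        self.index = index
    def users_for_key(self, key: str) -> List[str]:
        return self.index.get(key, [])

def group_push(store: UserStore, key: str, payload: bytes,
               send: Callable[[str, bytes], None]) -> None:
    # GroupPusher's role: resolve the key to users, then fan the payload
    # out through the per-user sender (Pusher -> APNs/GCM downstream)
    for user in store.users_for_key(key):
        send(user, payload)

store = UserStore({"flag-rollback-42": ["alice", "bob"]})
group_push(store, "flag-rollback-42", build_payload({"flag": "off"}),
           send=lambda user, data: print(f"push to {user}: {data!r}"))
```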
Dipping into the stream
The twist is that we want to sample from an unknown population, considering only one data point at a time. This use case calls for a special corner of computer science: online algorithms, a subclass of streaming algorithms. “Online” implies only individual points are considered in a single pass. “Streaming” implies the program can only consider a subset of the data at a time, but can work in batches or run multiple passes. Fortunately, Donald Knuth helped popularize an elegant approach that enables random sampling over a stream: Reservoir sampling.
First we designate a counter, which will be incremented for every data point seen. The reservoir is generally a list or array of predefined size. Now we can begin adding data. Until we encounter size elements, elements are added directly to the reservoir. Once the reservoir is full, incoming data points have a size / counter chance to replace an existing sample point. We never look at the value of a data point; the random chance alone decides, so uniformity is guaranteed by construction. This way the reservoir is always representative of the dataset as a whole, and is just as likely to contain a data point from the beginning of the stream as from the end. All this with bounded memory requirements and very little computation. See the instrumentation section below for links to Python implementations.
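Here is a minimal sketch of the scheme just described, widely known as Algorithm R; treat it as an illustration rather than production code (the instrumentation section links to hardened implementations).

```python
import random

def reservoir_sample(stream, size):
    """Keep a uniform random sample of `size` points from a stream of
    unknown length, in one pass and O(size) memory."""
    reservoir = []
    for counter, point in enumerate(stream, start=1):
        if counter <= size:
            reservoir.append(point)        # fill phase
        else:
            # the new point has a size/counter chance of entering;
            # every resident sample is equally likely to be evicted
            j = random.randrange(counter)  # uniform over [0, counter)
            if j < size:
                reservoir[j] = point
    return reservoir

sample = reservoir_sample((x * x for x in range(100_000)), size=512)
```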
In simpler terms, the reservoir progressively renders a scaled-down version of the data, like a fixed-size thumbnail. Reservoir sampling’s ability to handle populations of unknown size fits perfectly with tracking response latency and other metrics of a long-lived server process.
Interpretation
Once you have a reservoir, what are the natural next steps?
- Look at the min, max, and other quantiles of interest (generally median, 95th, 99th, 99.9th percentiles).
- Visualize the CDF and histogram to get a sense for the shape of the data, usually by loading the data into a Jupyter notebook and using Pandas, matplotlib, and occasionally Bokeh; see the sketch below.
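A few lines of pandas and matplotlib cover both steps. The lognormal stand-in data below is purely illustrative; in practice you would load a real reservoir sample.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for a latency reservoir, in seconds (illustrative only)
reservoir = np.random.lognormal(mean=-3.0, sigma=0.5, size=500)
s = pd.Series(reservoir)

# quantiles of interest: min, median, 95th, 99th, 99.9th, max
print(s.quantile([0.0, 0.5, 0.95, 0.99, 0.999, 1.0]))

# empirical CDF and histogram, side by side
sorted_vals = np.sort(s.values)
cdf = np.arange(1, len(sorted_vals) + 1) / len(sorted_vals)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(sorted_vals, cdf)
ax1.set(title="CDF", xlabel="latency (s)", ylabel="P(X <= x)")
ax2.hist(s.values, bins=50)
ax2.set(title="Histogram", xlabel="latency (s)", ylabel="count")
plt.show()
```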
Transitions
Usually, reservoirs get us what we want and we can get on with non-statistical development. But sometimes, the situation calls for a more tailored approach.
There are q-digests, biased quantile estimators, and plenty of other advanced algorithms for handling performance data. After a lot of experimentation, two approaches remain our go-tos, both of which are much simpler than one might presume.
The first approach, histogram counters, establishes ranges of interest, called bins or buckets, based on statistics gathered from a particular reservoir. While reservoir counting is data agnostic, looking only at a random value to decide where to put the data, bucketed counting looks at the value, finds the bucket whose range includes that value, and increments the bucket’s associated counter. The value itself is not stored. The code is simple, and the memory consumption is even lower, but the key advantage is the execution speed. Bucketed counting is so low overhead that it allows statistics collection to permeate much deeper into our code than other algorithms would allow.
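A minimal sketch of such a counter, with hypothetical bucket boundaries standing in for ones derived from a real reservoir:

```python
import bisect

class HistogramCounter:
    """Bucketed counting: each observation increments exactly one
    counter; the observed value itself is never stored."""
    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # one counter per bucket, plus an overflow bucket at the end
        self.counts = [0] * (len(self.boundaries) + 1)

    def add(self, value):
        # find the bucket whose range includes the value
        self.counts[bisect.bisect_right(self.boundaries, value)] += 1

# boundaries one might derive from a latency reservoir (illustrative)
hist = HistogramCounter([0.001, 0.005, 0.01, 0.05, 0.1, 0.5])
hist.add(0.0042)   # lands in the (0.001, 0.005] bucket
```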
The second approach, Piecewise Parabolic Quantile Estimation (P2 for short), is an engineering classic. A product of the 1980s electronics industry, P2 is a pragmatic online algorithm originally designed for simple devices. When we look at a reservoir's distribution and decide we need more resolution for certain quantiles, P2 lets us specify those quantiles ahead of time, and maintains estimates of them on every single observation. The memory consumption is very low, but due to the math involved, P2 uses more CPU than reservoir sampling and bucketed counting. Furthermore, we've never seen anyone attempt to combine P2 estimators, but we assume it's nontrivial. The good news is that for most distributions we see, our P2 estimators are an order of magnitude more accurate than reservoir sampling.
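For reference, here is a compact sketch of the single-quantile P2 estimator as published by Jain and Chlamtac (1985). It illustrates the technique; it is not the implementation referenced later in this post.

```python
class P2Quantile:
    """Track one quantile `p` online with five markers and
    piecewise-parabolic adjustment (constant memory)."""
    def __init__(self, p):
        self.p = p
        self.q = []                                    # marker heights
        self.n = [1, 2, 3, 4, 5]                       # marker positions
        self.want = [1, 1 + 2*p, 1 + 4*p, 3 + 2*p, 5]  # desired positions
        self.dn = [0, p/2, p, (1 + p)/2, 1]

    def add(self, x):
        if len(self.q) < 5:                  # initialization phase
            self.q.append(x)
            self.q.sort()
            return
        if x < self.q[0]:                    # locate the cell holding x
            self.q[0] = x; k = 0
        elif x >= self.q[4]:
            self.q[4] = x; k = 3
        else:
            k = next(i for i in range(4) if self.q[i] <= x < self.q[i+1])
        for i in range(k + 1, 5):            # shift markers above the cell
            self.n[i] += 1
        for i in range(5):                   # advance desired positions
            self.want[i] += self.dn[i]
        for i in (1, 2, 3):                  # nudge interior markers
            d = self.want[i] - self.n[i]
            if (d >= 1 and self.n[i+1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i-1] - self.n[i] < -1):
                d = 1 if d > 0 else -1
                new_q = self._parabolic(i, d)
                if not (self.q[i-1] < new_q < self.q[i+1]):
                    new_q = self._linear(i, d)
                self.q[i] = new_q
                self.n[i] += d

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i+1] - n[i-1]) * (
            (n[i] - n[i-1] + d) * (q[i+1] - q[i]) / (n[i+1] - n[i])
            + (n[i+1] - n[i] - d) * (q[i] - q[i-1]) / (n[i] - n[i-1]))

    def _linear(self, i, d):
        return self.q[i] + d * (self.q[i+d] - self.q[i]) / (self.n[i+d] - self.n[i])

    @property
    def estimate(self):
        return self.q[2] if len(self.q) == 5 else sorted(self.q)[len(self.q) // 2]

p95 = P2Quantile(0.95)
for x in range(1, 1001):
    p95.add(x % 97)      # illustrative data
print(p95.estimate)
```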
These approaches both take something learned from the reservoir sample and apply it toward doing less work. Histograms provide answers in terms of preconfigured value ranges, while P2 provides answers at preconfigured quantile points of interest.
Instrumentation
We focused a lot on statistical fundamentals, but how do we generate relevant datasets in the first place? Our answer is through structured instrumentation of our components. With the right hooks in place, the data will be there when we need it, whether we’re staying late to debug an issue or when we have a spare cycle to improve performance.
Much of our services' robustness can be credited to a reliable remote logging infrastructure, similar to, but more powerful than, rsyslog. Still, before we can send data upstream, the process must collect internal metrics. We leverage two open-source projects, both fast approaching major releases:
Faststat – Optimized statistical accumulators
Faststat operates on a lower level. True to its name, Faststat is a compiled Python extension that implements accumulators for the measures described here and many more. This includes everything from geometric/harmonic means to Markov-like transition tracking to a metametric that tracks the time between stat updates. At just over half a microsecond per point, Faststat's low overhead allows it to permeate into some of the deepest depths of our framework code. Faststat lacks output mechanisms of its own, so our internal framework includes a simple web API and UI for browsing statistics, as well as a greenlet that constantly uploads Faststat data to a remote accumulation service for alerting and archiving.
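To make "statistical accumulator" concrete, here is the kind of constant-memory, per-point update involved, using Welford's method for the running mean and variance. This illustrates the pattern only; it is not Faststat's actual API.

```python
import math

class Accumulator:
    """Constant-memory accumulator: each add() is O(1) and no raw
    data points are retained (Welford's online method)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = math.inf, -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min, self.max = min(self.min, x), max(self.max, x)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```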
Lithoxyl – Next-generation logging and application instrumentation
Lithoxyl is a high-level library designed for application introspection. It’s intended to be the most Pythonic logging framework possible. This includes structured logging and various accumulators, including reservoir sampling, P2, and others. But more importantly, Lithoxyl creates a separate instrumentation aspect for applications, allowing output levels and formats to be managed separately from the instrumentation points themselves.
One of the many advantages to investing in instrumentation early is that you get a sense for the performance overhead of data collection. Reliability and features far outweigh performance in the enterprise space. Many critical services I’ve worked on could be multiple times faster without instrumentation, but removing this aspect would render them unmaintainable, which brings me to my next point.
Good work takes cycles. All the methods described here are performance-minded, but you have to spend cycles to regain cycles. An airplane could carry more passengers without all those heavy dials and displays up front. It's not hard to imagine why logging and metrics are, for most services, second only to the features themselves. Always remember and communicate that having to choose between features and instrumentation does not bode well for the reliability record of the application or organization.
For those who really need to move fast and would prefer to reuse or subscribe, there are several promising choices out there, including New Relic and Prometheus. Obviously we have our own systems, but those offerings do have percentiles and histograms.