Why I DON’T Care About Statistical Significance

You know the world has come a long way when someone has to espouse the heresy of not caring about statistical significance.

This is not an argument against A/B testing, but rather about how we use A/B test results to make business decisions.  Instead of statistical significance, let’s make decisions based on expected value, i.e. $benefit × probability − $cost.

A little background on statistical significance, or “p < 0.05”. Say you have just deployed an A/B test, comparing the existing red (control) vs. a new green (test) “BUY NOW!” button. Two weeks later you see that the green-button variant is making $0.02 more per visitor than the red-button variant. You run some stats and see that the p-value is less than 0.05, and are ready to declare the results “significant.” “Significant” here is commonly read as meaning that there’s an over 95% chance that the color made a difference; more true to the statistics, if the color made no difference, a gap of $0.02 or more would show up through random fluctuations less than 5% of the time.

That last sentence there is probably too long to fit in anyone’s attention span. Let me break it down a little. The problem here is that you need to prove, or disprove, that the difference between the two variants is real — “real” meaning generalizable to the larger audience outside of the test traffic. The philosophy of science (confirmation is indefinite while negation is absolute — a thousand white swans can’t prove that all swans are white, but one black swan can disprove that all swans are white) and practicality both require that people set out to prove that the difference is real by disproving the logical opposite, i.e. there is no real difference. Statistics allows us to figure out that if we assume there is no difference between the red- and green-button variants, the probability of observing a $0.02 or larger difference by random chance is less than 0.05, i.e. p < 0.05. That is pretty unlikely. So we accept the alternative assumption, that the difference is real.

What if you have a p-value of 0.4, i.e. a 40% chance of getting a $0.02 or larger difference simply by random fluctuations? Well, you may be asked to leave the test running for longer until it reaches “significance,” which may never happen if the two variants are really not that different, or you may be told to scrap the test.

Is that really the best decision for a business? If we start out with the alternative assumption that there is some difference between the variants, 60% of the time we will make more money with the test variant and 40% of the time we will lose money compared to the control. The net gain in extra-money-making probability is 20%. The expected size of the gain is $0.02. Say that we have 100K visitors each day, that’s $0.02 × 100,000 × 0.2 = $400 in expected extra revenue. It doesn’t cost me any extra to implement the green vs. red button. Of course I should go for the green button.
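The arithmetic above can be written out as a small function. This is a minimal sketch in JavaScript; the function and parameter names are illustrative, and it follows the article's simplification of treating 1 − p as the chance that the test variant wins.

```javascript
// Expected daily gain, following the article's simplification:
// the test variant "wins" with probability (1 - p) and "loses" with probability p.
function expectedDailyGain(liftPerVisitor, visitorsPerDay, pValue, dailyCost) {
  var netWinProbability = (1 - pValue) - pValue; // e.g. 0.6 - 0.4 = 0.2
  return liftPerVisitor * visitorsPerDay * netWinProbability - (dailyCost || 0);
}

// Green vs. red button: $0.02 lift, 100K visitors a day, p = 0.4, no extra cost.
var buttonGain = expectedDailyGain(0.02, 100000, 0.4, 0);
console.log(buttonGain.toFixed(2)); // "400.00"
```

Passing a non-zero dailyCost turns the same function into the full "benefit × probability − cost" form.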

If we go back to the option of letting the test run for longer before making a decision, the upside is that we will have a more accurate estimate of the impact of the test variant. The downside is that, if one variant has $400 expected extra revenue each day, that’s $400 × (1 − traffic_in_test_variant%) extra dollars we are not taking in each day.
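The cost of waiting can be sketched the same way. The 50/50 traffic split and the two-week extension below are assumptions chosen for illustration, not figures from the test described here.

```javascript
// Revenue forgone per day while the test keeps running: visitors who are
// not in the better variant do not get its expected gain.
function dailyOpportunityCost(gainPerDay, testTrafficShare) {
  return gainPerDay * (1 - testTrafficShare);
}

// Hypothetical 50/50 split: half the traffic already sees the green button.
console.log(dailyOpportunityCost(400, 0.5)); // 200

// Cost of waiting two more weeks under the same assumptions.
console.log(dailyOpportunityCost(400, 0.5) * 14); // 2800
```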

Now suppose you are so diligent that you keep rolling out A/B tests, this time testing a fancy search ranking algorithm. Two weeks later you see that there is a $0.10 increase in dollars spent per visitor for the test variant compared to the control (i.e. existing search ranking algorithm) variant. If the increase is real, with 100K visitors each day, that’s $0.10 × 100,000 = $10,000 in extra revenue each day. Now, let’s add a twist: you need five extra servers to support that fancy algorithm in production, and the servers cost $10,000 each to buy, and another $10,000 to run per year. You want to make sure it’s worth the investment. Your stats tell you that you currently have a p-value of 0.3, which most people would interpret as a “nonsignificant” result. But a p-value of 0.3 means that with the new ranking algorithm the net gain in extra-money-making probability is 0.7 − 0.3 = 0.4. With the expected size of the gain being $0.10 per visitor, the expected extra revenue per year is $0.10 × 100,000 × 0.4 × 365 = $1.46M. Even if each of the five servers costs $10,000 a year to run, the first-year cost is 5 × ($10,000 + $10,000) = $100,000, a small fraction of that. The rational thing to do is of course to release it.
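Here is the same calculation as a short JavaScript sketch, with the server cost subtracted. The variable names are illustrative, and the sketch assumes the $10,000 yearly running cost applies to each of the five servers, which is the more expensive reading.

```javascript
// Expected value = benefit x probability - cost, using the article's numbers.
var liftPerVisitor = 0.10;  // dollars per visitor
var visitorsPerDay = 100000;
var pValue = 0.3;
var netWinProbability = (1 - pValue) - pValue; // 0.7 - 0.3 = 0.4

var expectedAnnualRevenue = liftPerVisitor * visitorsPerDay * netWinProbability * 365;

// Assumption: "$10,000 to run per year" applies to each of the five servers.
var servers = 5;
var firstYearCost = servers * (10000 + 10000);

console.log(Math.round(expectedAnnualRevenue));                 // 1460000
console.log(firstYearCost);                                     // 100000
console.log(Math.round(expectedAnnualRevenue - firstYearCost)); // 1360000
```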

Now, the $0.10 increase is the expected amount of increase. There is risk associated with it. In addition, humans are not rational decision makers, so a better theory is to use expected utility and include risk aversion in the calculation, but that’s outside of the point of this article. This article is about using statistical significance vs. expected value for making decisions.

Statistical significance is that magical point on the probability curve beyond which we accept a difference as real and beneath which we treat the difference as negligible. The problem is, as the above examples have demonstrated, probabilities fall on a continuous curve. Even if you do have a statistically significant result, a significance level of 0.05 means that, when there is no real difference, 1 in 20 A/B comparisons will still give you a statistically significant result simply by random chance. If you have 20 test variants in the same test, on average one of them will produce a “statistically significant” result by chance alone (unless you adjust the significance level by the number of variants).
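To put numbers on the 20-variant case, here is a small sketch. It assumes the comparisons are independent and uses the Bonferroni correction (dividing the significance level by the number of comparisons), which is one common way to make the adjustment mentioned above and not necessarily the one any given team uses.

```javascript
// Chance of at least one false positive among independent comparisons
// when there is no real difference anywhere.
function familywiseErrorRate(alpha, comparisons) {
  return 1 - Math.pow(1 - alpha, comparisons);
}

// Bonferroni adjustment: divide the significance level by the number of comparisons.
function bonferroniAlpha(alpha, comparisons) {
  return alpha / comparisons;
}

console.log(familywiseErrorRate(0.05, 20).toFixed(2)); // "0.64"
console.log(bonferroniAlpha(0.05, 20).toFixed(4));     // "0.0025"
console.log(familywiseErrorRate(bonferroniAlpha(0.05, 20), 20).toFixed(3)); // "0.049"
```

With 20 unadjusted comparisons there is roughly a 64% chance of at least one spurious "win"; after the adjustment the overall rate is back under 5%.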

The normal distribution (or whatever distribution you use to get the probabilities) does not come with a marker of statistical significance, much like the earth does not come with latitudinal or longitudinal lines. Those lines are added essentially arbitrarily to help you navigate, but they are not the essence of the thing you are dealing with.

The essence of the thing you are dealing with in A/B tests is probability. So let’s go back to the basics and make use of probabilities. Talk about benefit and probability and cost, not statistical significance. It’s no more than a line in the sand.


Notes:

1. The above examples assumed that the A/B tests per se were sound and that the observed differences were stable. To estimate the point at which the data is stable, use power analysis to calculate sample size.

2. Typical hypothesis testing procedure: to investigate whether an observed difference is generalizable outside of the test, we set up two competing hypotheses. The null hypothesis assumes that there is no difference between the two means, i.e. the two samples (e.g. two A/B test variants) are drawn from the same population and their means fall on the same sampling distribution. The alternative hypothesis assumes that the two samples are drawn from different populations, i.e. the means fall on two different sampling distributions. We start out assuming the null hypothesis to be true, and that the mean of the control variant represents the true mean of the population. We calculate the probability of getting a test variant mean at least this far from it under this assumption. If it’s less than some small number, for example p < 0.05, we reject the null hypothesis and accept the alternative hypothesis.


3. Significance levels are very much a convention and vary across disciplines and situations. Sometimes people use 0.01 or 0.001 instead of 0.05 as the significance level. As we all learned from the Higgs boson discovery, particle physicists need 5 sigma (that translates to a p-value of about 0.0000003) for a result to be officially accepted as a “discovery.” Traditional significance levels are biased strongly against false positives (claiming an effect to be true when it’s actually false) because of the severe cost in championing a false new theory or investing in a false new drug.


Mustache vs Swig Templating Shootout!

At Shutterstock we recently went through the process of settling on a preferred templating language. We have lots of projects across different languages and platforms, and it was clear to the front end team that we would gain efficiency by investing in a single templating approach.

There are lots of templating languages out there, but two flavors stood out as obvious prospects: the Mustache family, and the Django family. There are strong pros and cons for each. Mustache has unmatched cross-platform support, while Django-inspired templating languages provide a more robust feature set.

We set out to compare implementations of slightly tricky tasks in both settings. Would we miss the extra features from the Django family? We used Node.js along with Swig (Django family) and Hogan (Mustache family) for our comparison.

Task #1: Position within a list

It happens from time to time that we’d like to know our index within a loop while in the context of templating. Let’s say we’re implementing some sort of thing with drag-and-drop, and we want to add a data attribute with the initial position of the element:

Position in a List with Swig

{% for name in names %}
  <li data-position="{{ loop.index }}">{{ name }}</li>
{% endfor %}

Swig makes this easy. We have access to the special loop property and its attributes. This situation was anticipated.

Position in a List with Mustache

With Mustache we have to massage the data beforehand. Somewhere in code before sending to the template, we populate an index attribute:

var names = [ 'Jacob', 'Sophia', ...];
names.forEach(function(name, index) {
  names[index] = { name: name, index: index };
});

Then finally in the template:

{{#names}}
  <li data-position="{{ index }}">{{ name }}</li>
{{/names}}

This is a little rough. The template itself looks great, but it’s regrettable to have to write the code that transforms names to add index alongside. It’s also a bit of a shame to have to mix formatting code in with business logic or controller logic.
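One way to keep that transformation out of the controller is a small view helper that returns a new list instead of rewriting names in place. This is only a sketch; withIndex is a hypothetical helper name, not part of Hogan or Mustache.

```javascript
// Hypothetical view helper: returns a new array of { name, index } objects
// without mutating the original list of names.
function withIndex(names) {
  return names.map(function(name, index) {
    return { name: name, index: index };
  });
}

var names = ['Jacob', 'Sophia'];
var view = { names: withIndex(names) };

console.log(view.names[1]); // { name: 'Sophia', index: 1 }
console.log(names);         // [ 'Jacob', 'Sophia' ] (unchanged)
```

The index here is zero-based because it comes straight from Array.prototype.map.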

Task #2: Currency Formatting

Often we want to show dollar amounts on the page. Let’s see how we do here with a very basic example where we want to show a list of items and a total price.

Formatting with Swig

Swig comes with some built-in filters, but not exactly any that would get us going here. That’s okay–we can define our own filter and register it with Swig:

var usd = function(value) {
  return '$' + value.toFixed(2) + ' USD';
};
swig.init({ filters: { usd: usd } });

Prepare our data and render:

var invoice = {
  items: [...],
  price: 100
};
res.render("checkout", invoice);

Apply filters by specifying them after a pipe for interpolated values:

{% for item in items %}
  <div class="item">{{ item }}</div>
{% endfor %}

<div class="price">{{ price|usd }}</div>

Formatting with Mustache

Mustache requires a similar approach. We define a lambda to pass along with the data to render. Since we want to act on the rendered value here, we need to render the input first and then return a transformation of that.

var usd = function() {
  return function(template) {
    var value = Hogan.compile(template).render(this);
    return '$' + Number(value).toFixed(2) + ' USD';
  };
};
var invoice = {
  items: [...],
  price: { value: 100, usd: usd }
};
res.render("checkout", invoice);

Later in the template:

{{#items}}
  <div class="item">{{ item }}</div>
{{/items}}

<div class="price">
  {{#price}}{{#usd}}{{ value }}{{/usd}}{{/price}}
</div>

This works, but it’s bulky and verbose having to set the price context, and then having to wrap the value with the formatter. One way or another you have to do some massaging before you pass the data to the template. We could also have just done the formatting right within the data structure too, but it would seem a shame to have to mix display formatting with data access.

Task #3: Place CSS at the top and JavaScript at the bottom

There’s a fairly common pattern where on a given page we may want to include a page-level CSS stylesheet, generate some markup, and then add event listeners at the end. From within a page template, ideally we’d like to put a <link> tag in the head and inject some JavaScript just before the closing </body> tag.

CSS at the top and JS at the bottom with Swig

Swig supports block-level inheritance. In our layout template we specify the document structure:

<!-- layout.swig -->
<!doctype html>
<html>
  <head>
    <link rel="stylesheet" type="text/css" href="/global.css">
    {% block css %}{% endblock %}
  </head>
  <body>
    {% include 'header.swig' %}
    {% block content %}{% endblock %}
    {% include 'footer.swig' %}
    {% block footer_js %}{% endblock %}
  </body>
</html>

Now from within a page template we can override those blocks to get our bits to be where we want them within the document:

<!-- page.swig -->
{% extends 'layout.swig' %}

{% block css %}
  <link rel="stylesheet" type="text/css" href="/page.css">
{% endblock %}

{% block content %}
  <div id="main">Wonderful content here...</div>
{% endblock %}

{% block footer_js %}
  <script>
    var main = document.querySelector('#main');
    main.addEventListener('click', function(e) {
      e.target.setAttribute('contenteditable', true);
    });
  </script>
{% endblock %}

CSS at the top and JS at the bottom with Mustache

Mustache doesn’t directly support inheritance so we have to fake it. There are different ways to do it, none of them especially elegant. One way is to just always piece together the whole document using partials.

<!-- css.mustache -->
<link rel="stylesheet" type="text/css" href="/global.css">

<!-- page.mustache -->
<!doctype html>
<html>
  <head>
    {{>css}}
    <link rel="stylesheet" type="text/css" href="/page.css">
  </head>
  <body>
    {{>header}}
    <div id="main">Wonderful content here...</div>
    {{>footer}}
    <script>
      var main = document.querySelector('#main');
      main.addEventListener('click', function(e) {
        e.target.setAttribute('contenteditable', true);
      });
    </script>
  </body>
</html>

This is nicely explicit, but has its obvious drawbacks too. There are other approaches to consider as well. Some implementations support the concept of standard layouts. There’s also a proposal to update the Mustache spec to support inheritance. Hogan actually has unadvertised support for the proposed implementation. Yet another way is to use carefully crafted helpers in Handlebars. But basically when it comes to inheritance, Mustache support is a bit of a mess at the moment.

Other Considerations

Cross-Platform Support

We want to avoid having our templates lock us into a backend language or framework. Ideally if we want to migrate a project from Ruby/Sinatra to Node/Express, the last thing in the way should be the templates.

On a more practical level, we also want the flexibility to be able to share templates between the server and the client. For example, we may want to serve the first page of search results as rendered markup from the server, but then serve subsequent pages through AJAX, sending back JSON for the browser to interpolate into the same template that the server used. Both Mustache and Swig have solid in-browser implementations, so either fits the bill there.

Mustache actually has a formal spec, so you can be fairly sure that mustache templates that render with one implementation will likely render the same way in another. In real life though, since Mustache is very minimal, different implementations often add extra improvised features. Hogan adds nested accessors; Handlebars adds alternative control-flow syntax and helpers.

The Django family has decent cross-platform support, but implementations take wider liberties deviating from the original Django version. There’s Swig for JavaScript, Jinja for Python, Liquid for Ruby, DTL for Perl, and Twig for PHP. These implementations are all very strongly Django inspired, but again they diverge here and there.

Learning Curve

At Shutterstock we like to encourage people to reach outside their core disciplines and think holistically about what we’re all trying to do. For example, front-end developers might weigh in heavily on the direction of the product; back-end developers might suggest design directions; ops folk might suggest new business strategies, etc, etc.

So when it comes to templating, we want anyone to be able to dive in and find their way quickly, whether or not they have a background in front end development. A quick qualitative survey around the office found that relatively technical people had a much easier time picking up the Django style than they did learning the terse Mustache semantics.

Conclusion

In the end we went with the Django family — Swig, Twig, Liquid, etc. We still use Mustache here and there, but for projects of any decent size we find we really enjoy having the extra features and robustness in the Django templating languages.


Introducing Lil Brother: Open Source Client-Side Event Tracking

We’re happy to share a project we’ve developed that helps us understand how our customers are interacting with our site.  It’s called Lil Brother — it tracks clicks and other events in the browser, and reports back in real time.

How does it work?

First, add some client code to pull in the library and initialize a Lil Brother observer:

<script src="http://server:8000/lilbro.js"></script>

<script>
var lilBro = new LilBro({
    element: document.body,
    server: 'server:8000'
});
</script>

Then start up the node listener on the server side:

$ node bin/lilbro --output-file events.log

Lil Brother attaches a single event handler to a top-level container and listens for events.  As users click on links or focus inputs or check boxes, etc, the client sends those events over the wire to a listener on the server where they get recorded.
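The single-handler approach described above is usually called event delegation. The sketch below illustrates the idea only; observe and report are hypothetical names, and this is not Lil Brother's actual source.

```javascript
// Sketch of event delegation: one set of listeners on a container sees
// events from every descendant. `report` is a hypothetical callback
// standing in for whatever sends the event to the server.
function observe(container, eventTypes, report) {
  eventTypes.forEach(function(type) {
    container.addEventListener(type, function(event) {
      report({ type: event.type, tag: event.target.tagName, time: Date.now() });
    }, true); // capture phase, so non-bubbling events like focus are seen too
  });
}

// In a browser (hypothetical usage):
// observe(document.body, ['click', 'focus', 'change'], console.log);
```

Listening in the capture phase is a design choice here: focus does not bubble, so a bubbling listener on the container would miss it.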

Can I have some context?

When a click happens, we gather what context we can and send that along too.  If the target element has an id and/or a class, we note that.  Otherwise, we traverse up the DOM until we find a parent’s id or class.  We also grab the element tag name, X and Y mouse coordinates relative to the element and to the page, scroll positions, and input values if the element happened to be some sort of input field.
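The walk up the DOM can be sketched in a few lines. The function name is an assumption for illustration; it works on anything shaped like a DOM node (id, className, parentNode), which is why plain objects stand in for elements below.

```javascript
// Walk up from the event target until an element with an id or class is found.
function nearestIdentifier(node) {
  while (node) {
    if (node.id || node.className) {
      return { id: node.id || null, className: node.className || null };
    }
    node = node.parentNode;
  }
  return null; // nothing identifiable up to the root
}

// Plain objects standing in for DOM nodes:
var list = { id: 'results', className: '', parentNode: null };
var item = { id: '', className: '', parentNode: list };
console.log(nearestIdentifier(item)); // { id: 'results', className: null }
```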

In addition to metadata around the event, we discover other attributes too: browser version, operating system, viewport width and height, request path, and some other bits.

Visits and visitors

Of course clicks are part of a larger hierarchy.  There are users behind these clicks, and users browse in sessions.  To tie events together, Lil Brother sets two cookies: a long-lived visitor cookie, and a short-lived visit cookie.  We send the values of these cookies along so that we can string events together and aggregate later.
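A minimal sketch of the two-cookie scheme follows. The cookie names, the ID format, and the lifetimes (two years and thirty minutes) are assumptions chosen for illustration, not Lil Brother's actual values.

```javascript
// Build the two cookie strings. Names and lifetimes are assumptions for
// illustration, not Lil Brother's actual values.
function randomId() {
  return Math.random().toString(36).slice(2) + Date.now().toString(36);
}

function cookieString(name, value, maxAgeSeconds) {
  return name + '=' + encodeURIComponent(value) +
    '; path=/; max-age=' + maxAgeSeconds;
}

var TWO_YEARS = 2 * 365 * 24 * 60 * 60; // long-lived visitor cookie
var THIRTY_MINUTES = 30 * 60;           // short-lived visit cookie

var visitorCookie = cookieString('visitor', randomId(), TWO_YEARS);
var visitCookie = cookieString('visit', randomId(), THIRTY_MINUTES);

// In a browser: document.cookie = visitorCookie; document.cookie = visitCookie;
console.log(visitCookie);
```

In practice the visitor cookie would be set only when it is missing, and the visit cookie would likely be refreshed on each event so that a visit ends after a period of inactivity.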

What next?

With the data captured on the client, sent over the wire, and recorded one way or another, Lil Brother’s job is done, but there’s much left to do yet.  From here, aggregation, visualization, and analysis are possible, and left to you.  Imagine what you can do with this data: build heat maps, visualize traffic patterns, discover customer friction points, measure behavior in A/B tests, and on and on. Have fun!


Overengineering and Overadoption

As coders, we usually begin our careers by throwing together commands sprinkled with single-letter variable names and wild contortions of logic. The code usually does the job inefficiently and fails in catastrophic ways in unexpected situations. As we progress, our code becomes more robust and correct, but for a while, we continue to put together code that isn’t very well-designed. Code like this is “underengineered.” Underengineering is bad, unless it’s the result of a conscious decision to get the job done faster by skipping most of the design phase and taking on some “technical debt” (AKA “design debt”).

After a couple years of unintentionally writing underengineered code and dealing with the fallout, we compensate by planning ahead and designing better systems which anticipate future needs. Our code becomes easier to read, understand, and work on. That might be the sweet spot, because the next step for most of us is overengineering.

Once we get good at designing systems in abstract ways, we often proceed to develop them in the most abstract ways possible. Overengineered projects often take months or years to complete. It’s difficult to get consensus when every last detail has to be perfect. When we overengineer, it’s likely that by the time we’re done, the business or technology environment will have changed in ways that render much of that excessively well-planned work irrelevant. And, we can’t see the future, no matter how hard we try. When we overengineer, despite all of our planning and abstractions, we often fail to anticipate the abstractions we’ll actually need. Voltaire is credited with the saying, “The perfect is the enemy of the good.”

Why do we overengineer? It seems smart at first. If some abstractions are good, then lots of abstractions are better! Programmers also get bored. We like to solve new problems and develop new skills. A coder friend both wise and tired beyond his years once told me, “There are only seven problems in software engineering, and we just solve them over and over again.” It took me years to understand what he was saying, because I’ve always said, “Software engineering is the best job in the world, because it’s never the same job twice. If it is, you’re doing it wrong!” In fact, we’re both right. His point was that although we may be solving new problems with new languages, abstractions, techniques, and technologies, at the bottom of it all will always lie just a few fundamental solutions. Loops. Conditionals. Functions. Reading from and writing to files and databases. Because of this, programmers can get bored even while engaged in one of the most interesting jobs in the world.

Overadoption is a similar problem that doesn’t get nearly as much attention as it deserves. Programmers love new technology. But the newest, coolest technologies are also the most buggy. Last year I found and fixed a bug in our payment system that could have caused us to approve all payments of a certain type without actually checking to see if those payments were authorized. The bug was not introduced directly through the fault of any of our coders. It had been made possible long ago by the inclusion of an open-source software package that we were using before it was quite ready for enterprise use. After we installed it, it went through so many changes so quickly that we couldn’t keep up. Updating the package would have broken our code, so we were stuck living with a bug that had been fixed years ago. We were early adopters. But it’s possible to adopt too early. We’ve also quickly adopted and trashed various wiki systems, file systems, project management software systems, and version control schemes. We’ve gotten to a pretty good place today, but several of the steps in between could have been skipped entirely if we’d just been a little more patient and reflective before jumping on those rickety bandwagons.

Engineers sometimes, usually unintentionally, make themselves indispensable by hoarding knowledge about arcane legacy systems. It’s quite possible to become indispensable for the opposite reason as well. If the latest, greatest technology is installed the moment it’s released, it will be difficult for everyone in the organization to keep up. We all enjoy learning new things, and building our skills helps the organization as a whole. But sometimes the best thing we can do for the company is to go to work and solve those seven boring problems all over again for a few hours. Sometimes we just have to sit down and bang out some boilerplate.

William F. Buckley, Jr. said, “A Conservative is a fellow who is standing athwart history yelling ‘Stop!’” I’m not a conservative. I’m just standing athwart technology calmly advising, “Slow down a little.”


Getting Passionate About Problems

Developing solutions to complex technical problems as a team can be fun and challenging, but also highly contentious as every great developer has their own Grand Plan for how things Should Work. For every developer in the room there will be just as many proposed solutions to any given problem. They may be based on competing philosophies, varieties of foresight, attention to edge cases, and other considerations that provide a fertile ground for a passionate discussion, but that discussion often struggles to find focus and consensus.

So how do you streamline this process? Discussions like these can go on endlessly, go off on tangents, and nitpick on trivial details. How do you avoid hashing out solutions that are over-engineered and overly complicated?

One way to do this is to focus more on the problem rather than the solution.

When you focus solely on the problem, you’re working with facts everyone can agree on. Aim your discussion at getting everyone to fully understand the problem, and all the issues surrounding it. Once everyone is on the same page, you’re better positioned to prioritize the issues and offer specific proposals to address them. At this point, you still need to work to keep the discussion focused, so always take every solution and go back to understand the problems they address, and their side effects.

As an example here’s a set of 5 questions I like to follow when attacking a new problem. In general, it keeps me focused, while preventing me from falling in love with any one solution prematurely.

1. What is the high level problem you want to solve?

2. What impediments are currently making this a problem? i.e. Why does this problem exist?

3. For each impediment, what are the issues that need to be addressed to resolve that impediment?

4. For each issue that needs to be addressed, what are all the possible solutions you can imagine?

5. For each solution, what are its advantages and disadvantages? What is the scope and magnitude of each of those advantages and disadvantages?

The main thing to take away from this is that in steps 1 through 3, you’re just decomposing the problem. Every complex problem is just a combination of simple problems. When you’re dealing with simple problems, you’re more likely to come up with simple solutions.

A second thing to note is that it isn’t until step 4 that you’re even thinking about solutions. Also, your goal at this point is to come up with multiple solutions to each issue. This is very useful for helping everyone keep an open mind. Each individual should make it their goal to see the advantages and disadvantages of each possible solution rather than picking one and arguing it to death.

Step 5 may still be the most contentious since it leaves the most room open for debate, however the fact that you’ve just laid out and decomposed the problem should lead to a more focused discussion on possible solutions.

So in your next planning meeting, give this a shot, and see if it leads to a faster successful outcome, or at the least, a more enjoyable meeting. The key thing is to use the forum of a group discussion to make sure ideas are added that any one person may have missed, not as a platform for everyone to argue their master plan. Ultimately every developer has the same goal of solving problems; executing that goal well as a team is the challenge. But instead of getting passionate about the solution, get passionate about understanding the problem.


SQL Shells, Rebooted

Like many other Linux/open-source software tech companies, Shutterstock makes extensive use of tried-and-true technologies like MySQL.  We are always exploring different database technologies such as Riak and MongoDB, but at the core of our business is a highly available and tightly managed MySQL infrastructure.  We started on MySQL with a loosely-designed schema and have been adding to it incrementally over the years.

Some of our developers who are less comfortable on the command line use GUIs to access the database, but the more bearded folk tend to use the standard mysql-client CLI tool that is the stock-in-trade of any LAMP stack.  vim, emacs, git and mysql are usually open in many terminals on our desktops.  But, unlike the others, the mysql shell is not the most up-to-date tool in the toolbox.  With poor pagination and output handling, no color highlighting and a somewhat irritating input prompt, the mysql client causes its fair share of frustrated “that’s not what I meant to happen” moments.

We need pagination.  With our eight-year-old schema, some of our tables span 800 columns of console output.  Simple queries like “select * from accounts limit 1” fill the screen with line after line of ASCII table rendering characters.  Even restricting the output to one line per column (“\G”) makes for an impressively difficult amount of data to parse.

In response to this, we recently undertook some improvements to the tool.  With our expertise in Perl and Moose OO programming, rewriting the mysql client in Perl seemed like a simple exercise in DBI programming (which we’re very comfortable with) and some straightforward CLI tooling.  By approaching the problem iteratively, we were able to very quickly come up with a drop-in replacement to the mysql client with the majority of features we use on a daily basis implemented.  From this as a starting point, we were free to explore what we wanted to fix.

We’re very pleased with the result.  We call it AltSQL, as it’s an alternative to and improvement over some of the standard command line SQL tools.

The first and simplest change to make was to add color.  We’re used to seeing our prompts full of color.  Our bash prompt highlights the hostname in red, ls shows directories in blue, and vim and emacs give our code full-color syntax highlighting.  Adding contextual coloring to tabular output makes sense, was a simple addition, and comes at no expense since the DBI statement handle contains a great deal of context about each result that’s delivered.

Implementing a better prompt was a simple matter of finding a suitable CPAN module, and Term::ReadLine::Zoid fit the bill quite well.  Because it offers out-of-the-box multiline editing and an extensible autocomplete and key binding interface, we were able to move quickly.

We finally had a mysql shell prompt that could abandon the statement when you typed Ctrl-C rather than exiting the program. Improving the table rendering was next.  By dropping in Text::ASCIITable we quickly had a better table renderer that properly wrapped output on newline characters, but why stop there?  All of our terminal emulators have full Unicode support, so we spent some time developing a simple but powerful low-level Unicode box formatter (Text::UnicodeBox) to make terminal table drawings more intuitive and less obtrusive.

Adding horizontal and vertical pagination was a simple change, but a powerful one.  By checking the output width and height of the table to be printed, we are able to conditionally use the less pager.  This feature finally made “select * from accounts limit 1” a command we could type without worry.  No matter what the terminal size, you’ll be able to see the data in a usable format.
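The width-and-height check can be sketched as follows. AltSQL itself is written in Perl; this JavaScript version only illustrates the decision, and needsPager is a hypothetical name. It measures width in string length, which is a simplification for Unicode output.

```javascript
// Decide whether rendered table output fits the terminal.
function needsPager(output, terminalColumns, terminalRows) {
  var lines = output.split('\n');
  var widest = lines.reduce(function(max, line) {
    return Math.max(max, line.length);
  }, 0);
  return widest > terminalColumns || lines.length > terminalRows;
}

var table = '+----+-------+\n| id | name  |\n+----+-------+\n|  1 | Jacob |\n+----+-------+';
// process.stdout.columns / rows are undefined when output is not a TTY,
// so fall back to a conventional 80x24.
var columns = process.stdout.columns || 80;
var rows = process.stdout.rows || 24;

console.log(needsPager(table, columns, rows));
console.log(needsPager(table, 10, 24)); // true: the table is 14 columns wide
```

When the check returns true, the output would be piped to a pager such as less; otherwise it is printed directly.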

This is just the beginning.  By choosing Moose, all the features of the tool are extendable by other modules.  We’ve written it from the ground up to be pluggable.  In fact, most of the features mentioned above aren’t part of the core code, but are instead written as modules that modify its behavior.

We hope that other people can benefit from this.  Whether or not you use Perl, we think this is a useful tool that could make your job easier.  Install it from CPAN or GitHub and try it out.


If It’s Not on Prod, It Doesn’t Count: The Value of Frequent Releases

At Shutterstock, we like to release code.  A lot.  We do it about 60 times per week.

Frequent code releases have become somewhat of a mantra among today’s fast-moving startups, but the value they bring isn’t always articulated well.  In fact, there are a lot of reasons not to push frequently: you could release shoddy or incomplete software, it might not be thoroughly tested, or you might not like the constant pressure of production deployments.

So it’s worth stepping back to look at all the benefits that frequent releases bring:

1) You deliver value to customers more quickly.

This is the first principle of the Agile Manifesto: “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.”  Features that are sitting in your development environment aren’t benefiting your customers.  Frequent releases get those features into the wild so that your customers can use them.

Be sure to relentlessly focus on delivering value to customers.  Too often, frequent releases are interpreted as breaking big, complicated projects into component parts: first you tackle database schema changes, then business logic, then graphic design.  That’s not the point.  The point is to deliver complete, valuable features to customers as quickly as possible.

This idea also isn’t about releasing half-baked or hacky code.  The art is in finding the smallest implementation the team can develop, test, and release within a short period of time.   It helps to ask yourself, “what is the smallest impactful change we can make to get to our goal?”  Then, challenge what you decided was “smallest” — can you really not get there with an even smaller implementation?  In the end, you want to do the minimal work to test your idea with customers, then learn and repeat.

2) You learn quickly.

The lean software movement has popularized a revolutionary business philosophy: We don’t know what the best thing to do is.  The only way we can know it is to put something in front of customers and get their reactions to it.

By releasing software frequently, you have many more opportunities to get customer feedback and pivot based on it.  You avoid going too far down a path that’s not valuable.

3) It forces you to break big ideas into manageable pieces.

Big projects are risky, complex, and interminable.  By breaking big projects into small pieces and releasing one piece at a time, we not only deliver value more quickly, but we avoid death marches that demoralize software teams.

This is far easier said than done, because everyone loves big, splashy projects — they generate attention, they get people excited, and they offer a fleeting sense of accomplishment.  But users rarely like big, splashy projects.  In fact, users generally don’t like any sort of change; it forces them to re-learn something that they don’t want to re-learn.  By delivering small pieces of functionality, you provide additional value to users without surprising them with radical change.

Some people will object that an incremental process ultimately takes longer than a monolithic one.  That’s okay — it’s a trade-off we’re very happy to make, for two reasons: first, although the final result may end up in customers’ hands later, we’ve been delivering small pieces of value the whole time.  Second, it lets us change direction along the way as we learn instead of committing to a big project that we’re not sure has value.

4) You avoid horrible merges.

Merging code has always been and always will be a pain in the ass.  The more we can avoid it, the happier and more productive we’ll all be.  Frequent releases mean that code merges are small and simple (if they’re necessary at all).  This means you can move more quickly, and developers stay happier.

5) With good automated testing and an A/B testing platform, you reduce risk.

One complication of releasing frequently is making sure that your software works well and is thoroughly tested.  That’s why automated tests are so important in an agile environment — they let you quickly and thoroughly ensure that your code works.  Shorter release cycles inherently produce smaller code pushes.  In general, smaller code pushes are less risky simply because fewer things can go wrong.  By coupling small code pushes with automated testing, you can move quickly with little risk.

A good A/B testing platform also lets you iterate rapidly with low risk.  If you’re able to test changes on 1% of your customers, you drastically reduce the risk associated with rolling out new features, and are able to learn and adapt more quickly.
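One common way to expose a change to a small slice of traffic is to hash each visitor into a stable bucket.  The sketch below illustrates that general technique and is not a description of our platform.  The helper names bucket_for and in_test, and the idea of keying on a test name plus a visitor id, are assumptions made for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: map a visitor to a stable bucket from 0 to 99.
# Including the test name keeps assignments independent across tests.
sub bucket_for {
    my ($test_name, $visitor_id) = @_;
    # The first 8 hex digits of the digest give a 32-bit integer
    my $n = hex(substr(md5_hex("$test_name:$visitor_id"), 0, 8));
    return $n % 100;
}

# Hypothetical helper: true when the visitor falls inside the rollout
# percentage, e.g. in_test("green_button", $id, 1) for a 1% test.
sub in_test {
    my ($test_name, $visitor_id, $percent) = @_;
    return bucket_for($test_name, $visitor_id) < $percent;
}
```

Because the bucket depends only on the inputs, a visitor sees the same variant on every request, and raising the percentage only adds visitors to the test.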

6) You reduce complexity.

Lots of developers like to over-engineer.  Given enough time, we’ll build dozens of layers of unnecessary abstraction (see Parkinson’s Law).  By requiring frequent releases, we push ourselves to choose the simplest path forward.

If not done well, it is possible to paint yourself into a corner with this approach.  It’s important to remember that frequent releases don’t mean short-sighted thinking.  You can still get to a distant goal by approaching it one step at a time.

7) It keeps people motivated.

Who wants to work on a project for months (or years) and never have the thrill of showing it off to their friends?  Or hearing what customers think of it?  Frequent releases motivate people by letting them see the results of their hard work.

We use the scrum/agile framework, with two-week sprints and a demo at the end of each iteration.  A few years ago we started enforcing a rule to drive this point home: you can only demo what’s on production.  If it’s not on prod, it doesn’t count. That’s our way of saying, “You can code all you want, but all that matters is what our customers can do with it.”

For all these reasons, we evangelize frequent releases.  That’s not meant to minimize their difficulty.  It’s often very challenging to figure out how to take a small step forward that delivers value to customers while working towards a more distant goal and letting you change direction if necessary.  We never said it was easy.  In fact, it’s probably one of the most difficult problems in modern software development, because it requires developers to not only be great architects but also appreciate customer needs and product development.  But it’s the best method we’ve found for moving our business forward quickly while minimizing risk.

Perl: When DWIM Doesn’t

We’ve written in the past of our love for Perl. We meant it. But in any loving relationship, there will also be hard parts and unpleasant surprises. These are some tales of unpleasant surprises.

Surprise One: Bonus Feature

Here is some code that sets up a global $config hash, setting a file path the application should read data from.

our $config;
$config->{file_paht} = "/opt/app/data_file";

And the code that reads the data file:

open (my $fh, "<", $config->{file_path})
    or  die "can't open $config->{file_path}: $!";

my $data;
{
    local $/ = undef;
    $data = <$fh>;
}

You probably spotted that file_paht typo before we did. A warning or error would have helped us spot it earlier, but instead we got a bonus feature.

Perl decided what we really wanted was an anonymous temporary file, and provided us one. A brand-new, anonymous tempfile, that could never have been written to, opened for reading.

This bonus feature is documented as a “special” case in the sixteenth or seventeenth paragraph of perldoc -f open. Special, indeed. So special that to debug it we ran an strace, thinking …

where the f^H Sam Hill did that open("/tmp/PerlIO_Z2sAqY", O_RDWR…) come from?

… and grepped the source to find the answer, and re-read perldoc -f open to try to find our sanity.

Avoiding this bug requires being more defensive, which is always a good idea when reading disk files in production code:

if (exists $config->{file_path}  and  -r $config->{file_path}) {
    ...
}
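The same check can be wrapped in a small helper that fails loudly and names the offending key.  This is only a sketch: readable_path is a hypothetical name, and the error messages are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return a config value that must name a readable file,
# and die with a message naming the key when it is missing.
sub readable_path {
    my ($config, $key) = @_;
    my $path = $config->{$key};
    die "config key '$key' is missing or undefined\n"
        unless defined $path;
    die "config path '$path' is not a readable file\n"
        unless -f $path && -r $path;
    return $path;
}

my $config = { file_paht => "/opt/app/data_file" };   # the typo from above
my $path   = eval { readable_path($config, 'file_path') };
print "refused to open: $@" unless defined $path;
```

With the typo in place, the helper refuses before open is ever called, so the undefined filename never reaches it.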

In writing this article we began to consider this case a bug in perl, and went to file one at rt.perl.org, only to find that the wonderful Perl 5 Porters had beaten us to it, and that there is a current thread on the mailing list concerning this bug. Thanks p5p!

Surprise Two: When DWIM Doesn’t

We are constantly A/B testing at Shutterstock. Sometimes we need to usurp a random test assignment to view specific variants. The overrides are cached in the session:

$session->{ab_variant_overrides} = [34, 29];

Code checks if the variants are being usurped, and builds the appropriate template data structure:

if (exists $session->{ab_variant_overrides}) {
    # template expects custom_overrides to be [int, ...]
    $template->{custom_overrides} = $session->{ab_variant_overrides};
}

At one point we needed a quick hack to do something special inside a usurper variant:

if (grep { $_ == 42 } @{ $session->{ab_variant_overrides} }) {
    # give them something special
}

You may have spotted a bug in that code. If we haven’t yet assigned to $session->{ab_variant_overrides}, we’ll be dereferencing an undefined value. What should happen in that case?

One might expect Perl’s fatal “Can’t use an undefined value as an ARRAY reference” under strictures. Instead, the presence of the grep springs an empty array reference into place and assigns it to $session->{ab_variant_overrides}. Oops.

This behavior is hinted at in item 6 of the “Making References” section in perlref.

References of the appropriate type can spring into existence if you dereference them in a context that assumes they exist. Because we haven’t talked about dereferencing yet, we can’t show you any examples yet.

A quick fix here is to be more defensive by changing the dereferencing:

grep { $_ == 42 } @{ $session->{ab_variant_overrides} || [] }
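A short script makes the difference between the two forms visible.  It is a sketch for experimenting, with an empty hash reference standing in for the real session object.

```perl
use strict;
use warnings;

my $session = {};

# Guarded form: the "|| []" fallback hands grep a throwaway empty array,
# so the session hash is left alone
my @hits = grep { $_ == 42 } @{ $session->{ab_variant_overrides} || [] };
my $after_guarded = exists $session->{ab_variant_overrides} ? 1 : 0;

# Bare form: grep aliases $_ to each element, so the dereference is
# treated as modifiable and the missing key springs into existence
@hits = grep { $_ == 42 } @{ $session->{ab_variant_overrides} };
my $after_bare = exists $session->{ab_variant_overrides} ? 1 : 0;

print "key exists after guarded grep: $after_guarded\n";
print "key exists after bare grep:    $after_bare\n";
```

Running both forms against the same empty session shows which one leaves the key behind for the exists check to find later.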

What are your tales of surprise?

Introducing Rickshaw: A JavaScript toolkit for creating interactive time series graphs

We’re happy to share a project we’ve been working on that helps us see into our data.  It’s a JavaScript toolkit for creating interactive time series graphs, called Rickshaw.

At Shutterstock we use Rickshaw to read A/B tests, to monitor application and site health in real time, to see into dense product metrics, and all sorts of other things.

It has been a primary goal during the development of Rickshaw to keep the API simple. It has also been a primary goal to not obscure what lies beneath.  We use Mike Bostock’s wonderful d3 library to manipulate SVG, and those layers stay accessible if you want to get fancy.

Finally, we’ve kept the scope of our problem domain small.  Getting started with a simple graph is a couple of lines of JavaScript and HTML.  From there we can add new functionality by consuming extensions that come with the library.  Here is an example that shows some of them off.

Rickshaw helps us visualize dense time series data.  We hope you have similar success if you give it a try.  Here’s a listing of examples, and a tutorial to get you started.

Feersum in the Wild: Perl’s Evented Web Server

We use open source software in just about every form it takes: programming languages, operating systems, web servers, databases… even firewalls.  We try to release some of our own software, too.  Open source software has all kinds of advantages, but one of my favorites is how easy it is to fix problems if any arise.

Earlier this year, we added autocomplete functionality to our search interface.  Autocomplete is a simple concept that has some tricky implementation details, especially if you have a big data set.  It requires a fast server-side lookup table, and the server response has to be lightning quick.
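To give a feel for what a server-side lookup table can look like, here is a minimal sketch of one common approach: a binary search over a sorted list of terms.  It is not our production implementation.  The function name complete, its arguments, and the sample terms are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit terms starting with $prefix.
# $terms must be an array reference sorted with Perl's default string sort.
sub complete {
    my ($terms, $prefix, $limit) = @_;

    # Binary search for the first term that sorts at or after the prefix
    my ($lo, $hi) = (0, scalar @$terms);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($terms->[$mid] lt $prefix) { $lo = $mid + 1 }
        else                           { $hi = $mid }
    }

    # Matching terms are contiguous from that position onward
    my @matches;
    while ($lo < scalar(@$terms) && scalar(@matches) < $limit
           && index($terms->[$lo], $prefix) == 0) {
        push @matches, $terms->[$lo++];
    }
    return @matches;
}

my @terms = sort qw(cat catalog caterpillar dog dolphin door);
print join(", ", complete(\@terms, "cat", 2)), "\n";   # prints "cat, catalog"
```

Each lookup costs a logarithmic search plus the handful of results returned, which helps keep response times flat as the term list grows.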

We’ve experimented with a lot of web servers at Shutterstock.  Our mainstay is Apache, but over the years we’ve toyed with lighttpd, nginx, node, and others.  But hooking logic into a full-featured webserver leaves you with a pretty bulky system, and we thought we’d be better off going with something lighter.

We looked around and found Feersum.  Feersum is an event-based webserver (like nginx and node) that’s written in Perl (or more accurately, a combination of Perl and C) and is based on EV/libev (the same event loop that node uses).  We whipped up a prototype with it and were impressed by its speed — 2,000 requests/sec with a 30ms mean response time with 100 concurrent connections on a lightweight box.  That’s quick!

So we wrote an implementation of autocomplete with it and launched it. And it was a great success — when it worked.  We noticed that sometimes it would simply fail on certain requests.  The host servers seemed fine.  The daemon was still listening and responding to requests.  But for some reason we’d sporadically get “400 Bad Request” errors.

At first we assumed this was a problem with the client: our AJAX code must have somehow been buggy and passed in bad data.  But we ruled that out pretty quickly, and soon isolated the problem to the daemon.  We were able to reproduce the issue by sending simple and innocuous HTTP requests that would nonetheless return “400 Bad Request” responses.  We scratched our heads a bit, and then did that glorious thing that open source software lets you do: we dove into the code.

Here, life got more interesting.  It turns out Feersum is based on another open source project, picohttpparser.  That presented a challenge, both because it was slightly harder to isolate the problem and also because picohttpparser is meant to be lightning fast and is therefore written with a bunch of effective but obscure optimizations.

So we spent a weekend hacking away at it, adding sprintf’s (debugger, bah!) to every line we could to understand the problem.  We got pretty close to figuring it out, but ultimately got tripped up by not knowing whether the problem was in how Feersum was calling picohttpparser, or in picohttpparser itself.

Happily, open source software gives you an easy next step: contact the author.  So we gathered all the information we could about the problem, tried to sum it up as succinctly as we could, and posted an issue on GitHub.  Within two days, the author had identified the error, patched it, and released a new version.  Thanks, stash!

Delighted with the quick response, we installed the new version, did some tests, and saw our daemon work flawlessly.

And check it out — we’re now humming along with Feersum serving a snappy response on every keypress of every image search!  That’s way cool.
