Interview with a CodeRage Finalist: Dave K and Projector

Every quarter, the tech team at Shutterstock holds CodeRage, a 24-hour hackathon where we’re encouraged to work on any project that can bring value to the company.

This quarter, one of the winning projects was Projector, a web app that lets you turn your webcam into a projector to show drawings and diagrams to other people.

Dave K, the lead engineer on our footage team, wrote the app. I interviewed him about the project.

What problem were you trying to solve?

I was onboarding one of our new developers in our Denver office, and I was in New York, and I wanted to show him how our different systems were set up and how they work together. I thought, “Wow, every time I onboard somebody I usually go to a whiteboard and sketch out how this works, because it’s so hard to talk about it without diagrams.” I really just wanted to give him a quick sketch of how our servers are set up. I couldn’t find any good online solutions for drawing with a touchpad, and if you point a webcam at a whiteboard it’s really hard for the person on the other side to see anything.

So a few days later I was thinking it’d be cool if we could use a webcam to show a facsimile of a piece of paper you have in front of you to someone across the country and be able to make changes in real time.

How did you approach the problem?

Every hackathon, I think of a problem beforehand — I take hackathons very seriously! — and I need to have a clear project that I’m going to be working on. A lot of times I’ll try to stay on top of JavaScript libraries and HTML5 features because that stuff really interests me. Then, when I approach a problem like this, I think about what technologies are available to use and sort out in my head, a little bit, how I’ll do it.

For this project, the main problem was how to detect where the piece of paper was. So I brainstormed a bit about that. But sometimes it doesn’t work out the way you expect. This was a perfect example — the initial plan I had didn’t work at all, and I had to re-formulate it and sleep on it to find a better solution.

What was the first approach you tried?

Well, I needed to detect the edges of the piece of paper. Originally, I was going to have a setup where you made four black dots on a piece of paper. Then, I was going to try converting the webcam image to black and white, and then detect every shape on the page. Any shape that was touching the corner of the image, I’d delete from the shapes that I’m looking at. And then I’d try to discover the shapes that were closest to the corners of the frame because those would probably be my black reference dots.

Part of trying to solve that problem was to write a fill algorithm, and so I created a structure of every black pixel on the page and then I’d loop through the pixels and try to determine if it was part of a bigger shape based on neighboring pixels. I wrote it as a recursive function, and although it worked on a small scale — like a 10×10 pixel image — on a bigger image I was getting a stack overflow — it was just using too much memory.

So when that didn’t work out, I looked online for different fill algorithms, and one of them was a flood fill algorithm which was supposed to be more performant. I was able to tweak some open source code that I found to get that working, but on a big image it would still crash from using too much memory. It was kind of upsetting because I spent a whole night getting that to work — going down this one rabbit hole. So I thought, “I should just go home and go to sleep.” It was about midnight at the time.

The last thought that entered my mind was, “Wait. Why make these dots black? What happens if we add a color in there?” Then you could do a simple color detection based on different quadrants of the image. And then I felt a little better going to sleep with that idea in my head. The next morning I woke up and just focused on that and it seemed to work pretty well.

What third-party libraries did you use for the project?

A lot of this is reliant on these new, awesome features available in HTML5. One of the things that was really crucial was the getUserMedia() HTML5 function. That lets the browser get access to your webcam. Then I used some of the canvas manipulation tools. HTML5 lets you draw an image to the canvas, and then you can get RGBA values for every pixel on that canvas so you can determine what color something is. You basically have an array you can loop through, and that’s how I’m able to find the green pixels.
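To make that concrete, here’s a minimal sketch (not Projector’s actual code) of grabbing a webcam frame onto a canvas and scanning its pixel data for green dots. The element IDs and the “green enough” threshold are illustrative assumptions, and it uses the modern navigator.mediaDevices form of getUserMedia:

var video = document.getElementById('webcam');   // hypothetical element IDs
var canvas = document.getElementById('frame');
var ctx = canvas.getContext('2d');

navigator.mediaDevices.getUserMedia({ video: true }).then(function (stream) {
  video.srcObject = stream;
  video.play();
});

function findGreenPixels() {
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  var data = ctx.getImageData(0, 0, canvas.width, canvas.height).data;
  var greens = [];
  // data is a flat RGBA array: four values per pixel
  for (var i = 0; i < data.length; i += 4) {
    var r = data[i], g = data[i + 1], b = data[i + 2];
    if (g > 100 && g > 1.5 * r && g > 1.5 * b) {   // crude "green enough" test
      var p = i / 4;
      greens.push({ x: p % canvas.width, y: Math.floor(p / canvas.width) });
    }
  }
  return greens;   // cluster these by quadrant to find the four control dots
}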

The other library I used is BinaryJS, which lets you send and receive streaming binary data over web sockets. It uses a compact serialization method to make that as efficient as possible. I also had to use a polyfill for the canvas toBlob() method, which turns an image into raw binary data so that BinaryJS can segment the packets. It isn’t implemented in all mainstream browsers yet, so the polyfill allowed me to use it in browsers without native support.
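As a rough sketch of how those pieces fit together (the server URL, image format, and metadata below are illustrative assumptions, not Projector’s actual code):

var canvas = document.getElementById('frame');
var client = new BinaryClient('ws://projector.example.com:9000');   // hypothetical server

client.on('open', function () {
  // toBlob() (native or polyfilled) turns the canvas into raw binary data
  canvas.toBlob(function (blob) {
    client.send(blob, { type: 'frame' });   // BinaryJS streams the blob in chunks
  }, 'image/jpeg');
});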

I used ImageMagick for server-side image processing and ran a threshold function on the image, so everything in the lighter 50% of a black-and-white image turned white and everything in the darker 50% turned black. That makes it easier to create a facsimile of the image. I got the idea from an app called JotNot Pro, which lets you scan documents using the camera on your smartphone. It uses a similar threshold approach to make the scanned text clearer.

The other thing I used ImageMagick for is perspective distortion. ImageMagick has a perspective distortion function that lets you take four points and map them to new coordinates, which is really neat because I can take those four control points (the green dots) and map them to the corners of the viewer to flatten the image.
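The ImageMagick side might look something like the sketch below when invoked from Node; the filenames, dimensions, and corner ordering are illustrative, and this shows the general approach rather than Projector’s actual command:

var execFile = require('child_process').execFile;

// corners: the four detected green dots, in webcam-frame coordinates
function flattenPage(corners, width, height, done) {
  var points = [
    corners.topLeft.x     + ',' + corners.topLeft.y     + ' 0,0',
    corners.topRight.x    + ',' + corners.topRight.y    + ' ' + width + ',0',
    corners.bottomRight.x + ',' + corners.bottomRight.y + ' ' + width + ',' + height,
    corners.bottomLeft.x  + ',' + corners.bottomLeft.y  + ' 0,' + height
  ].join('  ');

  execFile('convert', [
    'frame.jpg',
    '-distort', 'Perspective', points,   // map the dots to the viewer's corners
    '-threshold', '50%',                 // lighter half -> white, darker half -> black
    'page.png'
  ], done);
}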

What’s nice is that if I’m holding a piece of paper, no matter how I’m holding it, it keeps it in place so it doesn’t jump around. It also makes it so that it doesn’t look squashed.

Have you thought of open sourcing this project?

Yeah, I have to clean it up a bit and make it a little more practical to use, but then I think we could release it.



How We Built a Cutting-Edge Color Search App

Engineers love working at Shutterstock because they get to build cool things. We aim to solve problems that matter to customers, and we’re constantly trying out new ideas through rapid prototyping.  One of the great things about our culture at Shutterstock is that an idea can come from anywhere, from the newest engineer to the CEO — we’ll try them out equally and see what resonates with users. This is how one of those ideas, our Spectrum color search, came to life.

Finding a problem to solve

Shutterstock serves a very visual audience. Creative directors, designers, freelancers, and others come to us to find visually appealing content. On most stock sites, searching for images is a process of entering keywords and toggling filters to find those that best match an idea.

For our visual-centric audience, this often doesn’t provide an easy path to finding the right images. We realized that an important problem to solve was how we could give our customers a way to search using visual cues in the images themselves: whether an image is color or black-and-white; bright or dark; vibrant or muted; textured or flat.

Experimenting with Color Search

Under the hood, searching by color is an exercise in complex math. At the start, there were a few interesting problems that we needed to work out:

  • What color space do we use to represent our image data?

  • How do we build an algorithm with these values that helps us find interesting images?

  • What kind of interface and controls would make this method of searching intuitive?

Many users are familiar with RGB color palettes, but we needed more options to find the right algorithm, so we started playing around with HSV, HSL, and finally the Lab/LCH color spaces, which seemed more intuitive than anything else.

We began by indexing LCH (Lightness, Chroma, Hue) data for a few thousand images into an instance of MongoDB. Each histogram represented the number of pixels in an image for different ranges of lightness, chroma, and hue, from which we were able to compute various other statistics that we added to our index. We then threw together a simple interface where we could plug in numbers, try different equations, and see what images came out.
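A simplified sketch of that indexing step might look like the following; rgbToLch() stands in for a real color-space conversion, and the bin sizes and document shape are assumptions for illustration rather than the actual Spectrum pipeline:

// Build LCH histograms for one image from a flat RGBA pixel array.
function lchHistogram(pixels) {
  var hist = {
    lightness: new Array(10).fill(0),   // 10 bins over L = 0..100
    chroma:    new Array(10).fill(0),   // 10 bins over C = 0..100+
    hue:       new Array(12).fill(0)    // 12 bins over H = 0..360
  };
  for (var i = 0; i < pixels.length; i += 4) {
    var lch = rgbToLch(pixels[i], pixels[i + 1], pixels[i + 2]);   // hypothetical helper
    hist.lightness[Math.min(9, Math.floor(lch.l / 10))]++;
    hist.chroma[Math.min(9, Math.floor(lch.c / 10))]++;
    hist.hue[Math.floor(lch.h / 30) % 12]++;
  }
  return hist;
}

// The per-image document we'd index might then look like:
// { image_id: 12345, histogram: lchHistogram(pixels), mean_lightness: ..., ... }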


In order to understand how the mathematics of color spaces connected to a particular visual experience, we broke down all the numbers and put them up in charts, which helped inspire the right algorithms.


Manually plugging numbers into input fields was fine for a first phase of development, but it certainly wasn’t the interface we wanted to give our customers. A source of inspiration came when Wyatt Jenkins, our head of product — and a former DJ — proposed using sliders to give it the feel of a mixer. The next version of the prototype had more than 20 sliders to control all the visual attributes we had indexed.


One of the first prototypes had 24 sliders that you could move to find images with different visual attributes.

 

Closing in on the Final Product

As our engineers worked on refining the accuracy and performance of the technology behind our color-search prototype, product and user-experience design specialists joined in to help build an interface that was intuitive for customers.

This meant tackling many of the details that we avoided in our initial rough prototypes:

  • 20+ sliders were far too many to jam into an interface, so we tried a version with 6, and then simplified it down to 3. Eventually our designers and engineers tuned the experience to work with just one slider.

  • Our initial prototype only had about 100,000 images, and we wanted to run it on over 10 million. To speed up search queries on that much data, we switched the backend data store to Apache Solr. To process our image color data faster, we used the Perl PDL module PDL::Graphics::Colorspace, which was written by one of our algorithm experts, Maggie Xiong. To speed up the interface even further, we added some layers of cache, and primed it with a set of warming queries.

  • Our customer research team and product specialists found that there were some queries that didn’t produce appealing results for customers — if the query was too specific, there were too few results for some colors, and the results for other queries were sometimes less than stellar. A few of our engineers continued iterating on the sorting algorithms until they found the right equations to give us the most evenly colored, high-contrast, and vibrant images.

  • We wanted to provide a unique search experience, but didn’t want it cluttered with all the search options on the main Shutterstock site. We decided to put this in our Labs environment so that we could build it as a standalone app and get customer feedback on it.

 


One of the later iterations of Spectrum had three sliders to adjust lightness, hue, and chroma (saturation). Certain combinations of slider positions gave us some cool effects like the one above.

 

Where We Are Today

After a few rounds of iterating, both on the interface and the back-end implementation, we finally had the simple, intuitive, responsive color exploration experience we had been after.

Even after release, we continued to iterate behind the scenes, tweaking and tuning the index and search algorithms. In the latest iteration, we further optimized the Solr index so that we could sort on a single Solr field for each slider position rather than run complex spatial sorting functions at query time.  We also migrated the color data from the standalone Solr instance we had dedicated to this app into our main Solr pool so that we could use a wider range of data for future iterations of color search.


The final version of Spectrum as it appears today. Try it out at www.shutterstock.com/labs/spectrum

Spectrum is just one example of the cool things that a group of passionate people can build at Shutterstock. Along with apps like Shutterstock Instant, we have numerous other prototypes in various stages of development. Every week, engineers and product specialists come up with new ideas and throw together quick proofs of concept to assess their potential. As we work on them, we’ll continue to validate our ideas with customers, get better at solving real problems, and build valuable features that help our users around the world.


Our Guide to Building RESTful Services

A few years ago, we began a fun and challenging journey to break a large, monolithic codebase into a set of isolated, independent REST services.  This effort has already yielded a ton of value in simplifying our codebase and speeding up development.

Along the way, we wrote this guide to building services in our ecosystem.  We thought other folks embarking on this path might find it useful, so we’re sharing it here.

Approach

Use REST and JSON

REST is an approach to building web services that encourages speaking in terms of entities, and uses HTTP thoroughly to interact with data and carry out operations. See the HTTP spec for helpful insight into the details of REST.

The world has settled on JSON for now. That’s JavaScript Object Notation, really a subset of eval’able JS. The JSON spec is purposefully readable.

Treat interface as fundamentally important

In REST, the interface is key. If we’re doing it right, we shouldn’t find ourselves wanting to hide service calls behind layers of abstraction on the client side. Rather, the REST interface is the programming interface. This means we should think hard about how we name endpoints, and name them for the nouns they represent. Responses are a crucial part of the interface, too. They should consist solely of data that represents the resource.

Build features as composable building blocks

Whenever possible, resources should anticipate being used by multiple sites in multiple contexts. It’s often useful to ask, “Would this resource make sense if we were building a t-shirt company?” It’s the caller that should make the functionality specific to the application.

Aim for as few resources as are needed

We should see each resource (in this context, “resource” means URL) as somewhat precious. For each one we have, we introduce overhead for maintaining the code, the tests, and the documentation. For example, a resource per photo attribute is too granular — instead, set attributes by POSTing to the resource that represents the photo as a whole.
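For example, rather than exposing /photos/123/title, /photos/123/description, and so on, a caller might set several attributes in one request. The endpoint and fields below are hypothetical, and fetch is used just for brevity:

fetch('https://photo-service.example.com/photos/123', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ title: 'Lighthouse at dusk', categories: ['landscape'] })
});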

Encapsulation

Manage dependencies locally

Services should have localized, self-managed dependencies. In the case of Perl, this means specifying packages with Carton and a cpanfile and managing them with cpanm. Node.js, Ruby, and PHP all have similar systems, which Rock and our build system support. There’s overhead that comes with this approach, but the upside is that we get to upgrade dependencies granularly. In practice, in large, inextricably coupled systems, dependencies are almost never upgraded because of the chance of breaking something.

Avoid services calling other services

If you find yourself wanting to call a service from within another service, take that as a sign to step back and evaluate where lines of separation are falling. Very often, the client can call one service and then call the next one, rather than the first directly calling the second. This way we don’t have call stacks multiple services deep, and we can test our resources in isolation.

Services own their own caching

Services should cache their own data as appropriate. Clients may cache according to how the service specifies in response headers, but that should not be the expectation. A memcache pool specific to the service is often the way to go.

Services own their own data

Ideally, services should own their own data. That means instead of storing their data in the main application database, they’d store data in their own local data store, whether it’s a set of MariaDB boxes, or some other data store like Redis.

Services own their own security

Services should take it upon themselves to authenticate and validate incoming requests. In some cases that means integrating with OAuth, for actions taken on behalf of end users. In other cases it means managing an internal set of users, as the storage service does. Either way, services generally shouldn’t just trust that callers are doing the right thing. Ideally we’d like to be able to open services to the outside world someday. In practice, we use api.shutterstock.com as an additional line of defense for outside users.

Each service lives in its own repository

Repositories for services follow the naming convention shutterstock-service-[name].

Use middleware to share functionality across services

Middleware is packaged utility functionality that you may want across projects and that is likely to run on every request. For example, in a user-facing site this may include setting up a session, translating the page output, or assigning a visitor ID. In backend services it could include setting up logging or configuring a caching layer. Modern web frameworks implement some derivative of WSGI (Rack in Ruby, PSGI in Perl, Connect-style middleware in Node), a common interface that facilitates sharing across projects.
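Here’s a minimal sketch of what such middleware can look like in the Connect/Express style; the module name, cookie name, and the assumption that cookie-parsing middleware is already installed are all illustrative:

var crypto = require('crypto');

// A tiny visitor-ID middleware that any site or service could app.use()
module.exports = function visitorId() {
  return function (req, res, next) {
    var id = req.cookies && req.cookies.visitor_id;   // assumes cookie-parser is in place
    if (!id) {
      id = crypto.randomBytes(16).toString('hex');
      res.cookie('visitor_id', id);
    }
    req.visitorId = id;
    next();
  };
};

// In a consuming app:  app.use(require('visitor-id-middleware')());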

Code and Branching

Discuss your changes with project stakeholders

Each repository has an associated Google Groups list. Before you make any significant changes, please get feedback there. That list goes to watchers of the repository. You are encouraged to link to a diff or pull request.

Develop and test on your local instance

To get started, clone the repository to an environment where Rock is installed. Build and install dependencies with rock build, then run your instance with rock run and point callers to your local instance.

Add new features in branches and merge to master as late as possible

Push your feature branches up to origin to share with others, and use our build tools to deploy your branch to lower environments. Once you’re ready to push to production, merge into master and go for it. A merge to master should be treated as a deployment to production.

Testing

Write unit tests for all functionality

Before you add functionality, add a failing test that will succeed when your work is done. Aim for full code coverage. Before you commit, run all tests to make sure you didn’t break other tests. Fix any broken tests you find, even if you weren’t the one who broke them. Mock your test data rather than interacting with a test data store.

Write ntf acceptance tests for every resource

Write acceptance tests for your resources with ntf. ntf continually executes requests against our services in production and tests that responses match what’s expected. Performance data is trended as well. Tests in ntf should be able to return in a matter of seconds (say, less than 10), and be okay to run thousands of times a day.

Production

Make it easy to set up and monitor

Add useful messages at higher-verbosity log levels. Avoid lengthy start-up times (>30s). Don’t require pre-run setup scripts. Fail with descriptive error messages, not just representative HTTP status codes.

Consider dark-launching

Aim to be faster than any previous implementation in production. Make both calls for each request and measure the difference in performance. This is also a chance to prove correctness — that the results you get from the service match the results powering production.

Monitor after you deploy

Watch access logs and error logs.  Also watch real-time request volume graphs on Ground Control, and watch the status of ntf tests.

Documentation

Document every resource

For each resource, state what problem it solves from the perspective of the consumer. Document request parameters. Please use plain, straightforward language. Whenever possible, include complete copy-and-pasteable working examples of requests and responses.

Be descriptive, but concise

Don’t say more than you need to. For example, rather than “this parameter specifies the width of the image”, you can simply say “image width”. Challenge yourself to boil it down to the core meaning of the thing.

Provide a resources resource

By convention, services provide an introspective meta resource at /resources that lists resource endpoints along with supported request methods and documentation. Ground Control will consume /resources and present your docs to the humans. Test and see how things look there.
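A stripped-down /resources endpoint might look like the sketch below. The route list is hard-coded here for illustration, where a real service would generate it from its route definitions, and the response shape is an assumption rather than Ground Control’s actual contract:

var express = require('express');
var app = express();

var resources = [
  { path: '/items',     methods: ['GET', 'POST'], description: 'Items awaiting review' },
  { path: '/items/:id', methods: ['GET'],         description: 'A single item under review' }
];

// Introspective listing of the service's endpoints and their documentation
app.get('/resources', function (req, res) {
  res.json(resources);
});

app.listen(8080);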

Failing Gracefully

Send descriptive error messages

Along with the appropriate HTTP status code, send a verbose, human-readable error message when an error has occurred. This can make all the difference to the developer writing client code to consume your resources. It is important that the developer be able to figure out whether he or she has made a mistake, or whether (and exactly how) the service is broken.

Consumers should anticipate failure

If a particular resource is unavailable, often the client may still be able to recover and serve a useful (if degraded) response to the end user. For example, we show the number of approved video clips on the footage site home page. If that number is not available from the media service, we’d rather show the home page without the number than send a 500 response. So in this case we can degrade to a message which just doesn’t reference that number.
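In client code, that kind of graceful degradation can be as simple as the sketch below; the media-service URL, response shape, and renderHomePage() helper are made up for illustration:

function getClipCount() {
  return fetch('https://media-service.example.com/videos/count')
    .then(function (res) { return res.ok ? res.json() : null; })
    .then(function (body) { return body ? body.count : null; })
    .catch(function () { return null; });   // degrade: render the page without the number
}

getClipCount().then(function (count) {
  renderHomePage({ clipCount: count });   // hypothetical renderer; tolerates null
});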


Seven Keys to Keeping the Startup Spirit Alive in a Growing Company

One of the most common questions I get asked when talking about the history of Shutterstock is: “Did everything change after the IPO?” And my answer is always the same: “Hmm… Nope!”

Shutterstock has always been composed of a bunch of people who love startups. We arranged the company to act and feel like a collection of startups, rather than a monolithic behemoth. We were able to carry that attitude through the IPO, and we still actively maintain it today.

There are a few key elements to our approach that I think have let us maintain our startup feel:

1. Small Teams:

There are many advantages to small groups of people. Communication is more efficient, information flows freely and personalities are easier to navigate. As we’ve grown, we’ve tried to maintain the advantages of small groups by creating teams with focused, specific goals that they help create. We’ve grown the organization by establishing new teams to tackle a business initiative, or by increasing the size of existing teams until they’re ready to split into two.

2. Autonomy:

Small teams can’t get much done unless they can fulfill their goals autonomously. To do that, teams need two characteristics:

- They have to be cross-functional and independent. All our teams are staffed with this in mind, so they contain software engineers, Q.A. testers, user-experience developers and product owners.

- The rest of the organization has to refrain from interfering with them. This piece is usually the hardest, particularly at a growing company with lots of smart people who think they know the best path forward. That’s why a few of us make it our mission to nudge would-be meddlers in a different direction, letting the teams remain blissfully focused.

3. Architecture:

As a company grows, so do its business rules and codebases. If you don’t encapsulate that growing complexity, it becomes more and more overwhelming. When that started happening to us, we decided to silo functionality in RESTful services. We’re still hard at work on this, but it’s clearly a better world. Interacting with old, monolithic codebases is a hairy process — who knows what unseen systems might rely on the one line of code you’d like to modify? Service-based architectures, however, are a pleasure to hack on — they have well-defined interfaces, comprehensive test suites and total isolation from other systems.

4. Tools:

Just as a codebase grows, so too do its operational requirements. Shutterstock started as a single web server in a Texas data center, and is now made up of thousands of servers spread across multiple continents. But throughout that time, our goal has always been to make the development and deployment process as simple and fast as possible.

These days, we’re doing our best to match the feature set of services like Heroku in our own deployment process: we run our own cloud using OpenStack, our deployment system ties into GitHub and Jenkins, and everything is controlled via a nice web interface.

5. Hire the Right People:

It takes a special sort of person to thrive in a startup environment, so hiring the right people is an important ingredient to our success. We look for people who are active in the open-source community (where flexibility and good communication skills are great benefits) and who are naturally curious and excited about technology (and therefore open to learning new ideas). Creativity, independence and humility are also key, as they make it easier to navigate our small teams.

6. Stay Weird:

To me, one of the sadder realities of most growing companies is a constant, silent push towards a more boring, inoffensive culture. That contrasts sharply with most startups, which tend to be full of risk-taking, opinionated, and stubborn characters.

At Shutterstock, we’ve delayed the reversion to an uninspired mean by letting different teams and groups in the company develop their own culture. Sure, we have company events that project a certain overall attitude, but what people tend to identify more with is the culture of their team, whether it’s our whiskey tastings disguised as Front-End video nights, an unspoken beard rivalry among the architects or a refined appreciation for awful computer humor among a subset of our development team.

The Future:

Shutterstock has enjoyed tremendous growth over the past few years, but as we’ve grown, we’ve done all we can to keep the spirit of the startup alive. I’m happy to say that there are no signs of our strategy petering out, and I’m hopeful that by continuing the same approach, we’ll be able to keep growing, while still feeling like that small startup with a single server cranking away in Texas.

This post was also featured on the Huffington Post on Wednesday, December 11th.


Adventures in API Usability

Shutterstock developers pay a lot of attention to the user experience of our website. We have a fleet of User Experience experts who help make sure the error states our web application shows to customers are useful and actionable.

But when we’re building backend APIs instead of HTML forms, that experience doesn’t translate directly. What’s the equivalent of a helpful form-validation error message in an API?


The Shutterstock Contributor Team has been building our next-generation content-review system, so that we can scale our image-review operation. We’re building it in a service-oriented fashion, in Ruby, with DataMapper as an ORM.

As developers building backend APIs, we’re solely responsible for providing useful information to the developers who will use our services. A good validation framework preserves the integrity of our applications’ data and empowers developers to integrate with a new API.

Rather than write custom validation for each API endpoint, we took a systematic approach to adding validation to all of them. Now we avoid many application crashes while providing useful information to developers.


One of the first things the review system needs is to learn about new items needing review:

POST /items
{
  "domain": "shutterstock-photo",
  "owner": "81",
  "item": "3709",
  "item_type": "photo",
  "queue": "main"
}

This call puts the photo with item id 3709 and owner id 81 into the main review queue.
The expected result is HTTP 201 Created with a Location: header giving the URL of the created item.

There are several other Shutterstock teams that will eventually integrate with this review service. Sometimes, when developers are still writing the software, they will post invalid data:

POST /items
{
  "domain": "shutterstock-photo",
  "owner": "81",
  "item": "3709",
  "item_type": "photo"
  // "queue": "main"
}

Whoops! This POST left out the queue name, so the review system doesn’t know who’s supposed to review it. Without data validation, our application will throw a 500 error:

500 Internal Server Error TypeError: expected String, got NilClass 

It would be better if we told the programmer what he or she has done wrong. Also, we’d like to return HTTP 400 Bad Request instead of an internal server error.

Our team realized that there’s a tool to help us do this sort of thing: the json-schema Ruby gem, an implementation of the IETF JSON Schema spec. To use this, we’ll need to build up a schema. For the items route, it would look like this:

{
  "id": "http://review.shutterstock.com/items.schema",
  "type": "object",
  "required": ["domain", "item", "item_type", "owner", "queue"],
  "properties": {
    "create_time": {"type": "string"},
    "item":        {"type": "string"},
    "domain":      {"type": "string"},
    "item_type":   {"type": "string"},
    "owner":       {"type": "string"},
    "queue":       {"type": "string"}
  }
}

Now we will make our review service pass the incoming POST data through json-schema’s JSON::Validator before doing anything else:

rest_data = JSON.parse(request.body.read)
json_errors = JSON::Validator.fully_validate(
   schema, 
   rest_data, 
   :version => :draft4)
if json_errors.length > 0
   content_type 'application/json'
   halt 400, JSON[{:errors => json_errors}]
end

If there are any errors, the response looks like this instead:

400 Bad Request

{"errors"=> [
  "The property '#/' did not contain a required property of 'queue' 
   in schema http://review.shutterstock.com/items.schema#"
]}

This message tells us that there’s a property missing in the JSON document root (#/). If there’s more than one item missing, the validator will identify them all. The validator does more than check for the existence of the required fields; it also checks the types of each field. If someone passes in a Hash instead of a string, like so:

POST /items

{
  "owner": "81",
  // eek, I'm not a string, I'm a Hash:
  "item": {"domain": "shutterstock-photo", "id": "3709"},
  "item_type": "photo",
  "queue": "main"
}

then they’ll get an error message about item. Previously, the application would have returned another Internal Server Error about a TypeError as soon as it tried to treat item as a string.

There’s just one problem. We have a variety of resource types to manage. It would be really great if we didn’t have to write a custom schema for all of them. It’s a fair amount of text to write; it’s easy to get wrong; the hand-written schema can fall out of sync with the actual code; and above all, it’s redundant! Most of that validation information is already encoded in our ORM layer, where it looks like this:

class Item
    include DataMapper::Resource

    property :id, DataMapper::Property::Serial
    property :create_time, DateTime,
             :default => lambda {|_,_| DateTime.now }
    property :external_id, String, :required => true

    belongs_to :domain
    belongs_to :item_type
    belongs_to :owner

    validates_uniqueness_of :external_id, 
      :scope => :domain, 
      :message => "Item must be unique to a domain"

    has n, :reviews
    has n, :queues, :through => :queue_items
    ...

It turns out that we can use this class definition to build our schema:

  • figure out the class of the resource in question (we’ll call it resource_class)
  • ask the resource_class for a list of its properties (resource_class.properties)
  • ignore properties that our application can automatically populate (like the internal database id and create_time)
  • figure out the data type for the remaining properties (property.primitive)
  • ask each property whether it is required (property.required)

Once we’ve done that, we almost have enough information to build a schema. There are a few other wrinkles: our properties include things like domain_id as an integer instead of a string, and we want our consumers to specify shutterstock-photo instead of the internal database ID. So for those we add a little custom handling that maps the external string identifiers to the internal IDs.

Finally, we present all this data in the JSON Schema format.

That’s all the information we need to build schemas for all of our resource types. By computing and caching this at application load time, we can provide a basic schema for all POST and PUT requests.

We may need to customize a generated schema for certain routes that are special cases. For instance, we’ve decided that the POST /items route calls its logical ID field item in the POST and external_id in the database. Such customization is straightforward to accomplish.

Our final realization was that once we had all the information about how a schema ought to look, we could make the schema available to our users. So now they can issue a request against http://review.shutterstock.com/items.schema (or domains.schema and owners.schema) and see for themselves exactly what fields the system is expecting to create a new resource. By providing a URL to the schema in the error message, we end up with a self-documenting API!


Test All the (Network) Things

Our engineering team supports many different sites, including the Shutterstock photo site, the Shutterstock footage site, the Shutterstock contributor site, Bigstock, Offset, and Skillfeed.

All these sites rely on a core set of REST services for functionality like authentication, payment, and search. Since these core services are so critical, we need to know if they’re functioning properly at all times, and get alerted if they aren’t. There are plenty of solutions for server-level monitoring, but we couldn’t find a good, simple solution for service or API monitoring. So we built one. It’s called ntf, for network testing framework, and it’s part of a large collection of tools that we’ve open-sourced.

ntf Overview

ntf is based on Node.js and nodeunit. The framework provides a server that polls multiple endpoints on all our services once per minute and verifies that they’re responding correctly. We’ve connected it to Icinga and OpenPOM, the two monitoring tools we use, so that we can get alerted if any of our ntf tests fail.

We’ve set things up so that our developers just need to add some tests to a git repository and deploy their changes to production. From there, we get automated testing, reporting, and alerting for free.

Working With ntf

The ntf framework is broken into three pieces:

  • ntf is a command-line tool to run specific tests
  • ntfd is a library for creating a daemon that runs ntf tests at specified intervals in an infinite loop and sends the results to ntfserver
  • ntfserver is a server that stores events from ntfd in a MySQL database and provides a web interface to report the status of current and past tests

Let’s work through a full example, and start with ntf itself. Here’s a simple test to check if a particular service is reachable:

var accounts = require('ntf').http('https://accounts');
exports.accounts_reachable = accounts.get('/', function(test) {
    // test status code is 200
    test.statusCode(200);
    // finished
    test.done();
});

This test checks the root URL of our accounts service, and makes sure it’s returning a “200 OK” HTTP response.

We can also add a test to make sure the response to the login resource looks ok:

exports.accounts_login = accounts.get('/login', function(test) {
  test.body("Sign in");
  test.done();
})

If we put both of these tests in a file called accounts.js and run it, we get:

$ ntf accounts.js

accounts.js
✓ accounts_reachable
✓ accounts_login

OK: 2 assertions (512ms)

ntf supports a range of test types for more complicated interactions with all our services. For a full list, see the documentation.

To run the tests continually in a loop, we use ntfd. ntfd is a library that gets included in a Node project, similar to express. You can build a simple daemon with ntfd like so:

var ntfd = require('ntfd')

ntfd({
  path: __dirname + '/tests',
  agent: 'test',
  plugin: [
    new ntfd.plugin.ConsoleEmitter(),
    new ntfd.plugin.HttpEmitter('http://localhost:8000/store')
  ],
  test: {
    interval: 10
  }
})

The easiest way to get started with it is to copy the ‘example’ directory in the ntfd repository, put the test files (like accounts.js, above) in the ‘tests’ directory, and run:

$ node .

After a minute, you’ll see the output of the tests on the command-line. With an HttpEmitter defined, it will also send test data to http://localhost:8000/store, which is meant to be captured by ntfserver. ntfserver provides a web interface to the test results. To start ntfserver, run:

$ ./bin/ntfserver
   info  - socket.io started

Then, navigate to http://localhost:8000 to see the ntfserver dashboard.

These three components provide a complete framework to manage, run, and monitor tests.

How We Use ntf

As we create RESTful services, we write ntf tests for each new resource to guarantee that the resource is always functioning properly. We have a git repo dedicated to our ntf tests, which contains a directory for each of our services. Each directory holds the tests that get run for that service.

We always have the ntfserver dashboard up on a monitor, so we can tell at a glance when all is well and when it isn’t. The dashboard lets us drill down to find out the details of any problems that it reports.

Meanwhile, Icinga and OpenPOM hit a /status resource on the same server to know whether anyone needs to be alerted to a problem.

ntf has been a great help in letting us rapidly expand our services infrastructure with the confidence of knowing our systems are always functional. It’s open source and available on GitHub, and we’d love for you to check it out and let us know what you think.


The Psychology of Engineers: a Talk on Introversion/Extroversion and Flow

As an engineer, I have always been curious about why people, and especially other engineers, behave the way they do. How do engineers get “in the zone” when coding and why do they like it so much? Why are so many engineers (including myself) so averse to holding meetings?

This curiosity led me to research two topics, Introversion/Extroversion and Flow. I presented them at an internal tech talk at Shutterstock called “The Psychology of Engineers.” The talk generated enough interest and discussion that I recorded a 19-minute version to share it more widely.

The slides are available on SpeakerDeck.  Please share your thoughts and feedback on the talk in the comments or on Twitter via @shuttertech.


Simplifying the build, test, and run cycle with Rock

At Shutterstock we have over 50 sites and services running in production, across thousands of VMs, in a range of languages — mostly Perl, Ruby, Node, and PHP.  Supporting such a variety of languages across projects can be daunting.  Each project has a specific version of a language runtime it targets (e.g., our Accounts Service runs on Node 0.10.x, but our internal Data Visualization site is still on Node 0.8.x).  And every project has its own set of unit tests and dependencies that need to be managed.

Each language has its own tools for solving these problems.  Perl has Perlbrew for managing language versions and Carton/cpanm for managing dependencies; Ruby has rvm and bundler; Node has nvm (among others) and npm; PHP has Composer (finally!).  These are great, but across many projects it’s still a lot to know and keep track of.

Enter RockStack, the brainchild of Silas Sewell, a developer here at Shutterstock.  Rock provides two main components:

  1. Packaged language runtimes for Ruby, Node, PHP, Python, and Perl (.deb’s for Ubuntu, .rpm’s for CentOS, homebrew for OS X)
  2. A command-line tool to manage setting up environments to build and run projects

With the RockStack packages installed, we can check out any project and use Rock to install/build local dependencies (“rock build”), run tests (“rock test”), and then start up the service (“rock run”).  Rather than having each project contain a specific set of installation instructions in the README for a human to follow, we can put all of that in the .rock.yml configuration file for the project, and Rock takes it from there.

Rock supports a variety of languages and runtimes. To use it, just install the packages and run a command to initialize your project. For instance, a Ruby 2.0 project would use:

 rock --runtime=ruby20 init 

This will create standard dependency files (package.json for Node, Gemfile for Ruby, etc.) and directories for your project. Rock has sane defaults for building, testing, and running most code. These are easiest to see by browsing the YAML files that define the default behavior for each language. You can always override these defaults in the .rock.yml file. For a Node app using Bower and Grunt, your .rock.yml might look like this:

runtime: node010
build: |
    {{ parent }}
    bower install
    grunt
run: node app run 

Rock is a deceptively simple tool that has proved incredibly valuable.  It has enabled us to continue using multiple languages across many projects, letting us focus on productive hacking, not language logistics.


Solr as an Analytics Platform

Here at Shutterstock we love digging into data.  We collect large amounts of it, and want a simple, fast way to access it.  One of the tools we use to do this is Apache Solr.

Most users of Solr will know it for its power as a full-text search engine.  Its text analyzers, sorting, filtering, and faceting components provide an ample toolset for many search applications.  A single instance can scale to hundreds of millions of documents (depending on your hardware), and it can scale even further through sharding.  Modern web search applications also need to be fast, and Solr can deliver in this area as well.

The needs of a data analytics platform aren’t much different.  It too requires a platform that can scale to support large volumes of data.  It requires speed, and depends heavily on a system that can scale horizontally through sharding as well.  And some of the main operations of data analytics — counting, slicing, and grouping — can be implemented using Solr’s filtering and faceting options.

One example of how we’ve used Solr this way is for search analytics.  Instead of indexing things like website content, or image keywords, the index for this system consists of search events.  Let’s say we want to analyze our search data based on the language of the query and where the user is located.  A single document would contain fields for the search term, country, city, language, and the timestamp of the search (and an auto-generated uid to identify each unique search event).

<fields>
 <field name="uid"         type="string"/>
 <field name="search_term" type="string"/>
 <field name="country"     type="string"/>
 <field name="city"        type="string"/>
 <field name="language"    type="string"/>
 <field name="search_date" type="date"/>
</fields>
 

Our Solr schema for this example

{
 "search_term": "navidad",
 "country": "Spain",
 "city": "Madrid",
 "language": "Spanish",
 "uid": "123412341234",
 "search_date": "2012-12-04T10:30:45Z"
}

A document representing the search event that we’re storing in Solr

If we ran a query on this data and faceted on search_term, Solr would give us an ordered list of the most frequent searches and their counts.  We can then take a slice of this data and filter by country:Spain to get the top searches from users in Spain.

Taking this further, we can filter by a date range and look at, say, searches that occurred in December.  And now, unsurprisingly, we see the search_term navidad percolate to the top.

http://localhost:8983/solr/select?q=*:*&fq=country:Spain&
facet=true&facet.field=search_term&facet.mincount=1&
facet.limit=100

This Solr query will get us the top search terms used in Spain

We can take our analysis yet further by utilizing Solr’s powerful date range faceting.  This lets us group our results by some unit of time.  Let’s say we set our interval to a week (set our facet.range.gap to “+7DAYS”).  Also, instead of faceting on search_term, let’s filter by search_term:navidad.  Now our results give us the number of times this query was used each week.  Send these numbers to a graph and we can generate a trendline telling us when our Spanish users started getting interested in Christmas last year.

http://localhost:8983/solr/select?q=*:*&fq=country:Spain&
fq=search_term:navidad&facet=true&facet.range=search_date&
facet.range.gap=%2B7DAYS&facet.range.end=2013-01-01T00:00:00Z&
facet.range.start=2012-01-01T00:00:00Z

This query will tell us how many times per week navidad was searched for in Spain.
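Turning that response into trendline points is straightforward. The sketch below assumes fetch is available (a browser or recent Node), that wt=json is added to the query, and that Solr returns its default flat [value, count, value, count, ...] list for range facet counts:

var url = 'http://localhost:8983/solr/select?q=*:*&fq=country:Spain' +
          '&fq=search_term:navidad&facet=true&facet.range=search_date' +
          '&facet.range.gap=%2B7DAYS&facet.range.start=2012-01-01T00:00:00Z' +
          '&facet.range.end=2013-01-01T00:00:00Z&wt=json';

fetch(url)
  .then(function (res) { return res.json(); })
  .then(function (data) {
    var counts = data.facet_counts.facet_ranges.search_date.counts;
    var points = [];
    for (var i = 0; i < counts.length; i += 2) {
      points.push({ week: counts[i], searches: counts[i + 1] });
    }
    console.log(points);   // feed these points to your graphing tool
  });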

In essence, what we’ve built is a simple OLAP cube, except on commodity hardware, using open-source software.  And although cubes and the MDX query language provide a much richer set of features, some of the core pieces can be replicated in a Solr-based architecture:

Hierarchies: Tiers of attributes, such as year > month > date,  or Continent > Country > City, can be represented as multiple fields that are populated at index time.  Solr won’t know the relation between each tier, so you may want to index the values more verbosely – e.g. your “country” would be “Europe.Spain”, and your “city” field would be “Europe.Spain.Madrid” — so there’d be no mixing of Madrid, Alabama in your results when filtering or faceting by city.

Measures: This is your unit of measurement for whatever you’re counting.  In our example, the only thing we measured was the number of searches.  Some more complex units of measurement might be the 95th percentile response time of a given search, or the total number of dollars spent after a user performed a given search.  These types of calculations are, for now, beyond the realm of Solr, so we’ll stick to the basic counting we can get through faceting.

Rows and Columns: In our first example we simply faceted on search term.  This just gave us a single column of data where one dimension was search term, and the other was basically time – spanning the age of the entire data set.  If we want to explore retrieving multidimensional datasets in a single query, then the place to look is pivot facets.

If the performance of pivot facets is too much of an obstacle, you can also build your dataset by running multiple queries while filtering on each value in one of your axes.

Filtering: Solr may have OLAP beat in this one area at least.  Whether you use Solr’s standard filters, range filters, fuzzy matches, or wildcards, you should have all the tools you need to grab a decent slice of your data.

If you’re looking to get your feet wet in data analytics, Solr may be a good tool to start with.  It’s easy to get going, and it’s totally free.  And once you’ve invested your time in it, its strong community and track record of scaling to high volumes of search traffic and data make it a tool you can grow with.


Why I DON’T Care About Statistical Significance

You know the world has come a long way when someone has to espouse the heresy of not caring about statistical significance.

This is not an argument against A/B testing, but rather about how we use A/B test results to make business decisions.  Instead of statistical significance, let’s make decisions based on expected value, i.e. $benefit × probability − $cost.

A little background on statistical significance, or “p < 0.05”. Say you have just deployed an A/B test, comparing the existing red (control) vs. a new green (test) “BUY NOW!” button. Two weeks later you see that the green-button variant is making $0.02 more per visitor than the red-button variant. You run some stats and see that the p-value is less than 0.05, and are ready to declare the results “significant.” “Significant” here means that there’s an over 95% chance that the color made a difference, or, more true to the statistics, there’s less than 5% chance that the $0.02 difference is simply due to random fluctuations.

That last sentence there is probably too long to fit in anyone’s attention span. Let me break it down a little. The problem here is that you need to prove, or disprove, that the difference between the two variants is real — “real” meaning generalizable to the larger audience outside of the test traffic. The philosophy of science (confirmation is indefinite while negation is absolute — a thousand white swans can’t prove that all swans are white, but one black swan can disprove that all swans are white) and practicality both require that people set out to prove that the difference is real by disproving the logical opposite, i.e. there is no real difference. Statistics allows us to figure out that if we assume there is no difference between the red- and green-button variants, the probability of observing a $0.02 or larger difference by random chance is less than 0.05, i.e. p < 0.05. That is pretty unlikely. So we accept the alternative assumption, that the difference is real.

What if you have a p-value of 0.4, i.e. a 40% chance of getting a $0.02 or larger difference simply by random fluctuations? Well, you may be asked to leave the test running for longer until it reaches “significance,” which may never happen if the two variants are really not that different, or you may be told to scrap the test.

Is that really the best decision for a business? If we start out with the alternative assumption that there is some difference between the variants, 60% of the time we will make more money with the test variant and 40% of the time we will lose money compared to the control. The net gain in extra-money-making probability is 20%. The expected size of the gain is $0.02. Say that we have 100K visitors each day, that’s $0.02 × 100,000 × 0.2 = $400 in expected extra revenue. It doesn’t cost me any extra to implement the green vs. red button. Of course I should go for the green button.

If we go back to the option of letting the test run for longer before making a decision, the upside is that we will have a more accurate estimate of the impact of the test variant. The downside is that, if one variant has $400 expected extra revenue each day, that’s $400 × (1 − traffic_in_test_variant%) extra dollars we are not taking in each day.

Now suppose you are so diligent that you keep rolling out A/B tests, this time testing a fancy search ranking algorithm. Two weeks later you see that there is a $0.10 increase in dollar spent per visitor for the test variant compared to the control (i.e. existing search ranking algorithm) variant. If the increase is real, with 100K visitors each day, that’s $0.10 × 100,000 = $10,000 dollars extra revenue each day. Now, let’s add a twist: you need five extra servers to support that fancy algorithm in production, and the servers cost $10,000 each to buy, and another $10,000 to run per year. You want to make sure it’s worth the investment. Your stats tell you that you currently have a p-value of 0.3, which most people would interpret as a “nonsignificant” result. But a p-value of 0.3 means that with the new ranking algorithm the net gain in extra-money-making probability is 0.7 − 0.3 = 0.4. With the expected size of the gain being $0.10 per visitor, the expected extra revenue per year is $0.10 × 100,000 × 0.4 × 365 = $1.46M dollars. The rational thing to do is of course release it.
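As a back-of-the-envelope sketch, here’s that expected-value arithmetic as a small function. It uses the article’s reading of the p-value as the probability that the observed difference is just noise, and simplifies the cost to a single year-one figure:

// Expected extra revenue over a year, per the examples above.
function expectedAnnualGain(liftPerVisitor, visitorsPerDay, pValue, yearOneCost) {
  var netProbabilityGain = (1 - pValue) - pValue;   // e.g. 0.7 - 0.3 = 0.4
  return liftPerVisitor * visitorsPerDay * netProbabilityGain * 365 - yearOneCost;
}

// Green button: $0.02 lift, p = 0.4, no extra cost -> 146000 (about $400/day)
expectedAnnualGain(0.02, 100000, 0.4, 0);
// Ranking algorithm: $0.10 lift, p = 0.3, five servers at $20k each in year one
// -> 1360000 (the ~$1.46M above, minus the $100k server cost)
expectedAnnualGain(0.10, 100000, 0.3, 100000);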

Now, the $0.10 increase is the expected amount of increase. There is risk associated with it. In addition, humans are not rational decision makers, so a better theory is to use expected utility and include risk aversion in the calculation, but that’s outside of the point of this article. This article is about using statistical significance vs. expected value for making decisions.

Statistical significance is that magical point on the probability curve beyond which we accept a difference as real and beneath which we treat the difference as negligible. The problem is, as the above examples have demonstrated, probabilities fall on a continuous curve. Even if you do have a statistically significant result, a significance level of p = 0.05 means that, when there is no real difference, 1 in 20 A/B comparisons will still give you a statistically significant result simply by random chance. If you have 20 test variants in the same test, just by chance alone 1 in 20 of these variants will produce “statistically significant” results (unless you adjust the significance level for the number of variants).

The normal distribution (or whatever distribution you use to get the probabilities) does not come with a marker of statistical significance, much like the earth does not come with latitudinal or longitudinal lines. Those lines are added essentially arbitrarily to help you navigate, but they are not the essence of the thing you are dealing with.

The essence of the thing you are dealing with in A/B tests is probability. So let’s go back to the basics and make use of probabilities. Talk about benefit and probability and cost, not statistical significance. It’s no more than a line in the sand.


Notes:

1. The above examples assumed that the A/B tests per se were sound and that the observed differences were stable. To estimate the point at which the data is stable, use power analysis to calculate sample size.

2. Typical hypothesis testing procedure: to investigate whether an observed difference is generalizable outside of the test, we set up two competing hypotheses. The null hypothesis assumes that there is no difference between the two means, i.e. the two samples (e.g. two A/B test variants) are drawn from the same population, their means fall on the same sampling distribution. The alternative hypothesis assumes that the two samples are drawn from different populations, i.e. the means fall on two different sampling distributions. We start out assuming the null hypothesis to be true, and that the mean of the control variant represents the true mean of the population. We calculate the probability of getting the test variant mean under this assumption. If it’s less than some small number, for example p < 0.05, we reject the null hypothesis and accept the alternative hypothesis.


3. Significance levels are very much a convention and vary across disciplines and situations. Sometimes people use 0.01 or 0.001 instead of 0.05 as the significance level. As we all learned from the Higgs boson discovery, physicists needed 5 sigmas (that translates to a p-value of about 0.0000003) for it to be officially accepted as a “discovery.” Traditional significance levels are biased strongly against false positives (claiming an effect to be true when it’s actually false) because of the severe cost of championing a false new theory or investing in a false new drug.
