Taking Control of Your Own Curriculum

Throughout my career as a software engineer and technology leader, I've helped companies find and develop talent. The one constant: how unprepared most college grads are when they enter the industry. Earning a Bachelor's degree exposes engineers to new ideas and, most importantly, teaches them how to learn. To gain real-world skills, developers have to create and take control of their own curriculum.

Get real life software development experience

Internships are the best way to get real-life experience. Working with real-world technologies and real-world data for real-world customers creates a learning experience not found in a classroom. In the best internships, you work as part of a team to solve customer problems, an experience that is invaluable regardless of the type of company you ultimately choose to work for or to start.

An internship in a dynamic environment will quickly teach you many things that deepen your understanding: agile development methodologies, source code version control, and how to package, distribute, and deploy a project. You might learn best practices for writing unit, integration, and load tests. You'll get exposure to teamwork through a source management tool or a continuous deployment environment, and you'll learn how to work with non-technical professionals as well.

If you can retain the big strategic picture along with all of the steps needed to execute on it effectively, you'll have the kind of experience companies crave. Those experiences help your resume stand out and promise an easier transition from school to career, simply because you'll have less of a ramp-up period when you start your new job.

Be part of the community

Focus on a specific technology, language, or tech stack that has captured your interest and start attending local meetups, both to network and to learn how different technologies are progressing.

Another way to be part of the community and gain experience is to contribute to open source projects. You'll get experience working with teams, especially far-flung ones, and all of your work is public, which lets future employers evaluate your code. Two programs I'm happy to recommend for the structure and guidance they offer: Mozilla and Google Summer of Code.

Write tests and understand what coverage means

Get in the habit of writing tests every time you write code. When you have an assignment, first write unit tests that exercise the code you're about to write; they will prove whether your answers are correct before a line of the implementation exists. Then write the code. Build this habit early, and your life as an engineer will be much easier.
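
To make the habit concrete, here's a minimal sketch using Python's built-in unittest module and a hypothetical class assignment (a median function): the tests are written first, and the implementation is only filled in until they pass.

import unittest

def median(values):
    # The implementation comes second; start from a failing stub and
    # make the tests below pass.
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

class TestMedian(unittest.TestCase):
    # Written before the implementation: these define what "correct" means.
    def test_odd_number_of_values(self):
        self.assertEqual(median([3, 1, 2]), 2)

    def test_even_number_of_values(self):
        self.assertEqual(median([1, 2, 3, 4]), 2.5)

    def test_single_value(self):
        self.assertEqual(median([7]), 7)

if __name__ == '__main__':
    unittest.main()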

For whatever language you’re using, learn the test framework. During your interview, it’s likely the hiring manager will pose a problem and ask you to propose a solution. One way to make a big impression: Write out or at least talk about how you would test your solution. It’s a simple way to stand out in the interview process.

Learn to profile your code

In the real world, you will rarely work on code where you have full access to evaluate every method being called by all the relevant libraries. It’s important to learn how to find bottlenecks in your code because, as an engineer, you will spend a good amount of your time on this part of the job.

A code profiler runs your application and identifies "hot spots": the parts of your code that take disproportionately long to run relative to everything else. Many languages offer profiling tools. SQL queries can be analyzed using "explain" or a query analyzer, depending on your database, and there are also end-to-end load testing tools. Regardless of the tool, learn how to run a profiler (or some type of analyzer) and evaluate its results.
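
As a small, hedged example of what that looks like in practice, here is Python's built-in cProfile pointed at a deliberately slow, made-up function; the same workflow applies to whatever profiler your language offers.

import cProfile
import pstats

def slow_report(n):
    # Deliberately naive: recomputes a sum from scratch on every iteration.
    totals = []
    for i in range(n):
        totals.append(sum(range(i)))
    return totals

profiler = cProfile.Profile()
profiler.enable()
slow_report(5000)
profiler.disable()

# Sort by cumulative time and print the ten biggest "hot spots".
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)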

Get comfortable with ambiguity

There are no easy answers in business, no clear-cut solutions. The only solution is the one that you create. Sometimes, you try everything you know and you still can’t get something to work. That’s life. Being able to deal with uncertainty is one of the most important skills engineers can develop.

To understand whether candidates are okay with truly not knowing, I ask open-ended, ambiguous questions. One I often ask is, "Tell me how the browser works." My goal is to get more insight into their thought process. At other times, I'll focus on a customer complaint: "A call came in saying it's taking up to 30 seconds for our homepage to load. What do you do, and how do you investigate?"

Better yet, the question that really told us how candidates think was this: "You've just hooked up a computer to the network. It's not connecting to the internet. What do you do?" There is no right answer, and regardless of what the candidate proposed, we responded with "that solution didn't work, the computer isn't online." If the candidate asked clarifying questions, we provided the most common case and kept the focus on troubleshooting the main issue. Perseverance without frustration is a rare quality in a candidate.

It’s all good experience

Your career will be a lifelong learning process. There's no doubt a job is in your future; as always, it's what you make of the opportunities. With a strong foundation, an open mind, and a robust toolset, you can thrive in any environment and adapt as technologies, languages, and industries change.

This is an adapted version of a previous article by Augustina Ragwitz.


Docker Registry Design

First, a little background: Docker is an open source platform designed to ship applications in small-footprint containers that are easy to maintain. The idea is to speed up the development lifecycle and get to production more rapidly. Docker helps abstract the application from the underlying system environment, which also makes it easier to build application images that can run anywhere.

Those images need to be stored somewhere before they can be run, though. That is where a Docker registry comes in. The registry is analogous to source control for code (e.g. Subversion or Git). Since we have three major environments (development, QA, and production), we run a Docker registry for each environment.

Now onto the fun part! They say a picture is worth a thousand words, so let's try:



There are 40 words in our diagram; close enough. Let's explore it in more detail, along with the goals that led us to this design:


  • Minimizes Single Points of Failure (SPoF)

  • Reliable: assume that anything can fail at any time, and design accordingly.

  • Easily maintainable: Well known technologies and no fancy tricks to slow down troubleshooting or recovery

A Deeper Look:

  • Load balancing and fault recovery from the user to the registry servers is fronted by anycast using ExaBGP.

  • The registry code runs natively, not in a docker container.  That means fewer pieces, fewer things to go wrong and fewer levels of unnecessary abstraction.

  • The Search HA interface is a classic Heartbeat-managed virtual IP backed by two MySQL databases that replicate to each other.

  • The storage layer is Gluster with three replicas.

When I raised this design with colleagues at meetups and tech talks, there was skepticism about using the Gluster distributed filesystem as a storage backend. However, we went this route because Gluster is more easily administered, maintained, and recovered than object storage systems.

We looked into Amazon S3, but didn’t like the transit cost.  It also meant more components could go wrong (S3 itself, Internet connectivity or our border network equipment) and violate the SPoF and reliability specifications.  In practice, our Gluster storage clusters have been running well and within acceptable service level targets.

For companies such as ours where we have knowledge and resources to invest internally, this is a great solution.  It enables us to provide a reliable service to our peers and consumers within our company and ultimately to our customers.

Feel free to submit your questions or comments below about the Docker registry and how you can incorporate one in your own environment.



Building your own network automation

There's an important piece of infrastructure that usually lacks an appropriate level of automation, even though without it you are not connected to the Internet at all: the network hardware that moves packets between your backend servers and your customers.

The current state of the network industry is far behind that of the server industry, at least in terms of software customization and the ability to "Do What You Want™". This stems largely from the model of specific hardware vendors writing their own software; for all intents and purposes, that software is Cisco's IOS and Juniper's JUNOS.

At Shutterstock, we use Juniper for all our networking needs. Juniper has done a decent job of letting you do anything you can do from the command-line interface (CLI) via their XML RPC interface, which conforms to the NETCONF standard. Building on this, and understanding the various stages of deployment, it is possible to automate your network. It just requires writing a lot of your own software, which honestly gives you the greatest flexibility to mesh your customer-facing application with your lower-level network stack.

Since no application or network is designed the same, I’m going to describe some of the base components necessary to give you more control over your network.

Determine the protocol to interact with devices

A large part of the networking space has adopted the NETCONF protocol for running XML-RPC based commands. Thankfully, there is a decent collection of client libraries for this protocol written in many languages.

As a services-oriented company, we tend to break things up into their smallest component and add orchestration on top of those building blocks. As such, I’ve written a NETCONF proxy which can run commands concurrently across a provided list of devices.
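
As a hedged illustration of that idea (a sketch, not our actual proxy), here is how the fan-out might look in Python with the ncclient NETCONF library; the hostnames and credentials are hypothetical.

from concurrent.futures import ThreadPoolExecutor
from ncclient import manager

DEVICES = ['switch1.example.com', 'switch2.example.com']  # hypothetical hosts

def fetch_running_config(host):
    # hostkey_verify=False keeps the example short; verify host keys in production.
    with manager.connect(host=host, port=830, username='automation',
                         password='secret', hostkey_verify=False,
                         timeout=30) as conn:
        return host, conn.get_config(source='running').data_xml

# Fan the same NETCONF request out across devices concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    for host, config_xml in pool.map(fetch_running_config, DEVICES):
        print(host, len(config_xml), 'bytes of running config')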

Even if you’re not at the scale where high concurrency is necessary, having a simple API or library to interact with your network devices is key for taking control. For some great inspiration, I recommend taking a look at py-ez-junos which makes it easy to gather data about your devices.

Metadata storage for all devices

On the server software side of things, this is often referred to as the Configuration Management Database, or CMDB for short. If you're lucky enough to already have one, all you have to do is extend it to support networking devices. If you do not have one, it's worth the time and effort to get one in place.

At Shutterstock, we currently use a PostgreSQL database to store things like Facter facts, IP addresses, ethernet interfaces, and other custom properties about individual nodes in our environment. We add an Elasticsearch index on top to expose Lucene search syntax over our data, making it easy to query for nodes based on any of the data we collect.
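
For a sense of what that combination buys you, here is a hedged sketch using the elasticsearch-py client against a hypothetical "nodes" index; the index and field names are illustrative, not our actual schema.

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://metadata.example.com:9200'])  # hypothetical host

# Lucene query-string syntax: find every Juniper device in one datacenter.
result = es.search(index='nodes', q='facts.vendor:juniper AND datacenter:nyc', size=100)

for hit in result['hits']['hits']:
    node = hit['_source']
    print(node['hostname'], node.get('ip_address'))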

If you were to create a collection of YAML or even text files, that would be better than nothing. Having an authoritative list of network devices is critical for managing them in an automated fashion.

Discover new devices as they come online, then configure them

This one is a bit tougher, depending on your network vendor's default configuration. In Juniper land, the first thing a device does is blast out DHCP requests on the management interface and the default VLAN, looking for an IP. With the appropriate dhcpd setup, you can make the device pull a configuration from an HTTP server.

We follow the paradigm outlined by Jeremy Schulman to pull a base configuration and install a proper JUNOS version when a device comes online. Once at this stage, a separate process that runs periodically will look for IPs responding to SNMP on our management network. The process then connects and collects SNMP data and stores it in our metadata storage.
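
The discovery half of that process might look roughly like the following sketch, assuming the pysnmp library, SNMP v2c with a "public" community string, and a hypothetical management subnet; the real script (and how it writes to the metadata store) will differ.

import ipaddress

from pysnmp.hlapi import (CommunityData, ContextData, ObjectIdentity,
                          ObjectType, SnmpEngine, UdpTransportTarget, getCmd)

def probe(ip):
    # Return the device's sysDescr if it answers SNMP, otherwise None.
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData('public', mpModel=1),                 # SNMP v2c
        UdpTransportTarget((ip, 161), timeout=1, retries=0),
        ContextData(),
        ObjectType(ObjectIdentity('SNMPv2-MIB', 'sysDescr', 0)),
    ))
    if error_indication or error_status:
        return None
    return str(var_binds[0][1])

# Sweep a hypothetical management subnet for devices that respond.
for ip in ipaddress.ip_network('').hosts():
    descr = probe(str(ip))
    if descr:
        # In the real pipeline this would be upserted into the metadata store.
        print(ip, descr)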

You get some serious benefit here as you start to scale up. If you have to manually provision 20 new racks worth of network equipment, you are bound to have copypasta problems and even varying configurations if you have multiple engineers touching these things.

Tools to orchestrate the automation

Tools are why you build all the other components. If you have your devices easily searchable, and a library to interact with them, you can build useful CLIs or web interfaces for others to use.

This is where you can easily do a mass configuration change, write monitoring against devices for configuration drift, or simply use your metadata store to generate a nagios configuration. You can even schedule a configuration load for a specific time across multiple devices, and be on the beach while it happens.
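
To make one of those ideas concrete, here is a toy sketch (not our actual tooling) that renders Nagios host definitions from whatever a metadata store query returns; the field names, addresses, and the "network-device" template are all hypothetical.

# Nagios host stanza rendered per device; "network-device" is an assumed
# host template defined elsewhere in the Nagios configuration.
HOST_TEMPLATE = """define host {{
    use        network-device
    host_name  {name}
    address    {address}
}}
"""

def nagios_config(devices):
    # devices: a list of dicts as they might come back from the metadata store.
    return '\n'.join(HOST_TEMPLATE.format(**device) for device in devices)

devices = [
    {'name': 'leaf-sw01', 'address': ''},
    {'name': 'leaf-sw02', 'address': ''},
]
print(nagios_config(devices))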


These are the basics to learning how to control your own network. Feel free to comment below if you do something similar or know of any components that have been overlooked.


Swift: An Introduction to the Language

There was a huge amount of interest in Swift after it was introduced by Apple earlier this year. Not only were iOS engineers at Shutterstock learning everything they could about the new and evolving language, but there was a ton of interest in Swift from across the engineering organization.

Due to that interest, I prepared and presented a tech talk introducing Swift to curious Shutterstock engineers. The video below is a recorded version of that talk.

The talk introduces Swift to someone who already understands programming but is not familiar with Objective-C. High-level concepts are presented and backed up with simple code samples. Concepts which many engineers might have never encountered, such as protocols and generics, are discussed and illustrated with examples. The talk focuses only on the Swift language and does not discuss writing apps for iOS or OS X.

(View or download the slides here.)


12 Questions To Ask About PCI

Organizations either breeze through PCI certification or struggle with it. The struggle parallels a fight against zombies: you must stay on your toes, because once they start coming toward you they don't stop, and as your teammates deal with their own zombies, you realize you can't keep up. The challenge doesn't stop there. This poem by Sam Kassoumeh sums up PCI nicely. You must manage tech debt and legacy access control rules while competing for attention from developers and operations. How does anyone get through this?

Never stray from your main goal. Certification is the immediate point of the program; it is supposed to reassure customers and partners that due diligence is keeping their financial data safe. Remember that PCI is a means to an end, not a goal in itself. The PCI process is supposed to make you think about how you handle sensitive data in general. Nobody would argue that a single certification, piece of paper, or audit is enough to protect the organization. Build a process and a worldview you can work with.

The PCI DSS 3.0 document is 112 pages long, with 4 appendices and 12 requirement sections. That sounds daunting, and it can be if your approach to security is ad hoc. You will scramble to figure out what is covered, where covered assets are, who has access to those assets, and maybe even what the term "asset" really means. In the middle of an assessment you find yourself questioning the meaning of just about everything, including:

  • Security strategy  - Do I really have a coherent strategy?
  • Tools used
  • How to store data

So what can you do to make PCI compliance achievable on that big day? Start today and think about:

  1. What customer data do we need to hold onto? For how long?
  2. How do I dispose of storage and printouts that have this data?
  3. Does everyone who needs to access this information have 2-factor authentication?
  4. Is the pathway to this data secured by encrypted connections (e.g. HTTPS)?
  5. Is it possible for an insider or intruder to see sensitive data through some other segment of our network?
  6. Is sensitive data only available to people and apps that really need it?
  7. If someone’s access level changes, would I know about it?
  8. If a related network rule changes, would I know about it?
  9. Am I keeping up with patches on high priority servers?
  10. Am I monitoring for and/or alerting on suspicious traffic?
  11. Do I know what ciphers we use?
  12. What’s our process for offboarding employees with access to sensitive data?

If you store data about customer transactions unrelated to credit cards (which are the domain of PCI), is it really a stretch to treat that data with the same care? Why encrypt credit card information but not bank account numbers? Why mask part of the credit card number but not a customer's address? An address can be used for identity theft, too.

That's not to say you should encrypt or mask everything everywhere. The point is to consider it. Maybe you don't need to store so much data. Maybe you can build your network and application access rules earlier, paying particular attention to the areas of the network that hold personal information.

This is just the beginning but if you can ask yourself these questions early, you can construct a strong strategy which is the true end goal of the PCI compliance process.

Once you know what you're looking for, use any resource you find helpful. For example, IBM has posted a guide on the importance of complying with PCI DSS 3.0 Requirement 6. You can view the guide here.

What are your thoughts on PCI? Be sure to comment below.


Stop Using One Language

In any technology company, one of the fundamental aspects of its identity is the technology stack and programming language it's built on. This defines the types of tools that are fair game and, more importantly, the types of engineers who are hired and capable of succeeding there.

Back in the middle of the last decade, when Shutterstock had its beginnings, the tech team was made up primarily of die-hard Perl developers. The benefits of CPAN and the flexibility of the language were touted as the primary reasons why Perl was the right tool for anything we wanted to build. The only problem was that our hiring pool was limited to people eager to work with Perl, and although the Perl folks who joined us were indeed some of our most passionate and skilled engineers, there were countless engineers outside the Perl community whom we ended up ignoring entirely.

Fast forward to the last few years, and Shutterstock has become a much more "multilingual" place for software engineers to work. We have services written in Node.js, Ruby, and Java; data processing tools written in Python; a few of our sites written in PHP; and apps written in Objective-C.

Even though we have developers who specialize in each language, it’s become increasingly important that we remove the barriers to letting people work across multiple languages when they need to, whether it’s for debugging, writing new features, or building brand new apps and services.  At Shutterstock, there have been a few strategic decisions and technology choices that have facilitated our evolution to the more multilingual programming environment and culture we have today.

Service Oriented Architectures

One of the architectural decisions we made early on to support development in multiple languages was to build out all our core functionality into siloed services.  Each service could be written in any language while providing a language-agnostic interface through REST frameworks.   This has enabled us to write separate pieces of functionality in the language most suited to it.  For example, search makes use of Lucene & Solr, and so Java made sense there.  For our translation services, Unicode support is highly important, so Perl was the strongest option there for us.

Common Frameworks

Between languages there are numerous frameworks and standards that have been inspired or replicated by one another. When possible, we try to use one of those common technologies in our services. As mentioned above, all of our services provide RESTful interfaces, and internally we use Sinatra-inspired frameworks to implement them (Dancer for Perl, Slim for PHP, Express for Node, etc.). For templating we use Django-inspired frameworks such as Template::Swig for Perl, Twig for PHP, and Liquid for Ruby. Using these frameworks helps flatten the learning curve when a developer jumps between languages.
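
To show the shape these Sinatra-style frameworks share, here is a hedged sketch in Python using Flask, which follows the same pattern; the route and payload are hypothetical rather than one of our actual services.

from flask import Flask, jsonify

app = Flask(__name__)

# A route declaration maps an HTTP verb and path straight to a handler,
# which is why hopping between Dancer, Slim, and Express feels familiar.
@app.route('/images/<int:image_id>', methods=['GET'])
def get_image(image_id):
    return jsonify({'id': image_id, 'status': 'available'})

if __name__ == '__main__':
    app.run(port=5000)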

Runtime Management

When it comes down to the nuts and bolts of actually running code in a particular language, one of the obstacles that blocks new developers from getting into it is all the technical bureaucracy needed to manage each runtime — dependency management, environment paths, and all the command line settings and flags needed to do common tasks.

The tool we use at Shutterstock to simplify all this is Rockstack. Rockstack provides a standardized interface for building, running, and testing code in any of its supported runtimes (currently Perl, PHP, Python, Ruby, and Java). Have a Java app that you need to spin up? Run "rock build" and "rock run". Have a Perl service you want a Java developer to debug? "rock build", "rock run".

Another major benefit of using Rockstack is that not only do our developers get a standard interface for building, testing, and running code, but our build and deployment system only has to deal with one standard set of commands to run those operations for any language. Rockstack is used by our Jenkins cluster for running builds and tests, and our home-grown deployment system uses it to launch applications in dev, QA, and production.

One of the biggest obstacles for people jumping into a new language is the cognitive load of having to figure out all the details of setting up and working with the development environment for that language.  Once you remove that burden, people can actually focus their energy on the important engineering problems they need to solve.

Testing Frameworks

In order to create a standardized method for testing all the services we have running, we developed (and open sourced!) NTF (Network Testing Framework). NTF lets us write tests that hit special resources on our services' APIs to provide status information showing that the service is running in proper form. NTF supplements our collection of unit and integration tests by constantly running in production and telling us if any functionality has been impaired in any of our services.

Developer Meetups

In addition to tools and frameworks, we also support our developers in learning and evolving their skillsets as well.  On a regular basis, we’ll have internal meetups for Shutterstock’s Node developers, PHP Developers, or Ruby developers where they give each other feedback on code in progress, share successes or failures with third-party libraries, and polish up the style guide for their language.  These meetups are a great way for someone new to a language to ask questions and improve on their coding skills.


Part of what makes it easy to jump into another language is that all the code for every Shutterstock site and service is available for everyone to look at on our internal GitHub server. This means that anyone can review anyone else's code, or check out a copy and run it. If you have an idea for a feature, you can fork off a branch and submit a pull request to the shepherd of that service. Creating this sense of openness with our code helps prevent us from creating walled gardens, and encourages people to share ideas and try new things.


Even though language-agnostic engineering comes with some nice benefits, it’s crucial to bring a modicum of pragmatism to this vision.  A completely language agnostic environment may be idealistic and impractical.  Allowing developers to build services and tools in any language that interests them may lead to a huge amount of fragmentation.  Having 50 tools written in 50 different languages would be a nightmare to maintain, and would kill any opportunities for code reuse between them.  Additionally, with a greater breadth of technologies, it becomes much more difficult to have people on hand with the depth of knowledge needed to lead initiatives with them.

As a matter of practicality, we keep a list of Preferred Technologies which is broad enough to provide plenty of choice, but narrow enough so that we can trust we’ll have plenty of expertise on hand.  If a new technology is vetted and deemed valuable it will be considered for addition to this list.  However if one developer wants to go and write a new site in Haskell, they’ll probably be shot down*.

*we have nothing but respect for Haskell

Although we want to make it easy for all of our developers to work in any of our common languages, there’s always going to be a need for specialists.  Every language is going to have its nuances, buggy behaviors, and performance quirks that only someone with extensive language experience will be able to recognize.   For each of our preferred technologies, we have several people on hand with deep knowledge in it.

* * *

Since Shutterstock is built on a plethora of services, any one of our sites may be receiving data that came from something built in Perl, Java, Node, or Ruby. If one of those sites needs an extra tweak in an API resource, it's incredibly helpful when a developer can jump in and make the necessary change to any of those services regardless of the language it was written in. When developers can work this way, it eases dependencies between teams, which helps the organization move faster as a whole.

Many of our strategies and tools are designed to help give engineers more language agnostic skills to better work across multiple languages.  Whether it’s frameworks that share standards, build and runtime tools that work across languages, architecture strategies, or testing frameworks, having common approaches for all these things allows everyone in the organization to work together, instead of siloing themselves off based on language-specific skillsets.

As the world of programming languages becomes much more fragmented, it’s becoming more important than ever from a business perspective to develop multilingual-friendly approaches to building a tech company.  Some of the tools and processes we developed at Shutterstock have helped us move in that direction, but there’s a lot more that could be done to facilitate an environment where the tech stack of choice isn’t a barrier to bringing in talent.


Code snippets to calculate percentiles in databases

As a Datavis Engineer at Shutterstock, I dive into a lot of data every day and routinely answer questions about customer behavior, site security, and latency issues. I keep a list of SQL snippets to copy and paste into my work, and I've found that keeping a list of examples is easier than memorizing a bunch of similar-but-different SQL dialects.

Calculating percentiles comes up regularly in my work, and it's also a popular area of confusion. So I'm going to break down the application of calculating percentiles for you. I find a business context helps me understand the tech behind a topic, so I've organized the queries into four case studies. I'll state the situation, present the technology, give the query, then wrap it up.

Note: I am very specifically not comparing technologies. The internet is filthy with those posts. I am offering copiable code without the hype.

Also note: all data is fake. I show numbers to give a sense of the output to expect, not a scale for comparison.



Shutterstock is a two-sided marketplace: we license crowd-sourced media. Contributors do not receive money for their images until those images are published on our website. Tens of thousands of images are submitted daily, and they all need to be inspected carefully. Hence, an entire team is dedicated to inspecting the images, and they aim to maintain a thorough and speedy process.


We use Vertica for analytic duties. Like most columnar database technologies, Vertica’s row lookups are slow, but the aggregates are blazingly fast. It is the only pay-for-use database in this post.


Vertica's flavor of SQL is fairly similar to MySQL's, with analytic functions similar to other OLAP databases. Vertica has many built-in analytical functions; I use PERCENTILE_DISC() for this case study.

  SELECT DISTINCT
    added_date,
    PERCENTILE_DISC(0.9)
      WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
      OVER (PARTITION BY added_date)
      AS '90th',
    PERCENTILE_DISC(0.5)
      WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
      OVER (PARTITION BY added_date)
      AS 'median'
  FROM submissions  -- hypothetical table name; the original was omitted
  WHERE
    added_date >= current_date() - interval '4 day'

RESULTS (as a reminder, there are 1440 minutes in a day):

added_date median 90th
2014-01-01 2880 5000
2014-01-02 1440 6000
2014-01-03 2000 4800
2014-01-04 3000 5500

Half of the photos uploaded on January 1 took two days to show up on our website.  There is a big gap between the median and 90th percentile approval times on January 2. If this data was real, I would investigate why January 2 is different from other days.




We track the efficacy of our designs. Knowing how often an element is clicked gives us insight into what our customers actually see on a page. If customers click an element that is not a link, then we can make changes to our HTML.


We store raw click data in HDFS. Hive is a familiar SQL interface to data stored in HDFS, and it has a built-in PERCENTILE() UDF.


I begin by running an inner query to get customer click counts on the specific element per day. I wrap that inner query in the daily percentile main query to find percentiles of customer behavior. I need to build the inner query because PERCENTILE() does not accept COUNT() as its first argument.

select
  day,
  percentile(count, 0.25),
  percentile(count, 0.50),
  percentile(count, 0.75),
  percentile(count, 0.99)
from (
  -- inner query: clicks on the element per visitor per day
  select
      visitor_id,
      day,
      count(*) as count
    from clicks  -- hypothetical table name; the original was omitted
    where
      element = 'header_div'
      and page = '/search_results.html'
      and year = 2014 and month = 4
    group by
      visitor_id, day
  ) c
group by
  day

Sample data result of inner query:

   visitor_id | day | count
       1      |  1  |  5
       2      |  1  |  7
       2      |  2  |  9

All results:

day _c1 _c2 _c3 _c4
01 1.0 3.0 15.0 52.0
02 1.0 3.0 15.0 64.0
03 1.0 3.0 14.0 68.0

Judging by median click counts, _c2, customers click on a non-link element about three times in a session. Some click as many as fifteen times. Wow. The header_div element should be made clickable.




Shutterstock’s B.I. team does an excellent job of analyzing marketing spends and conversions. Sometimes it is easier for me to get the data than pull an analyst away from their work.


MySQL is a widely used transactional database. It’s fast and full-featured, but does not have built-in support for calculating percentiles.


I need to compare a rank, meaning a position in an ordered list, against a total count of rows. Position over total count gives me a percentile.

Complex queries are built inside out. This query starts with inner query t, which counts the total number of accounts per language. I join the per language counts to the accounts table, and pull out a customer’s signup date.

There are plenty of resources for calculating row level ranks in MySQL around the web. The techniques boil down to setting variables in an ordered result set. Here, I order the inner query r by language and keep track of the current language in @lang. When language changes, @rank resets to 0 in _resetRank column. Neat!

The outer query compares signup dates to milestone dates.

Given that we're only looking at signups from the past year, with a steady signup rate we would expect the 25th-percentile signup to land exactly nine months ago, and the median six months ago. If 25% of signups happened before the nine-month milestone, the first quarter of the year saw "fast" signups. This query returns "gut check" data; it's not rigorously tested or verified.

select
  language,
  datediff(
    date(max(case when percentile <= 25 then signup_datetime end)),
    current_date - interval 9 month
  ) AS '25th',
  datediff(
    date(min(case when percentile >= 50 then signup_datetime end)),
    current_date - interval 6 month
  ) as 'median',
  datediff(
    date(min(case when percentile >= 75 then signup_datetime end)),
    current_date - interval 3 month
  ) as '75th'
from (
    /* build ranks and percentiles */
    select
      @rank := if( @lang = a.language, @rank, 0 ) as _resetRank,
      @lang := a.language as language,
      @rank := @rank+1 as rank,
      a.signup_datetime,
      round( 100 * @rank / t.cnt, 3 ) as percentile
    from
      accounts a
    /* t counts rows per language */
    join (
         select
           language,
           count(*) as cnt
         from
           accounts
         where
           signup_datetime > current_date - interval 1 year
         group by
           language
       ) t on t.language = a.language
    /* initialize the user variables */
    join ( select @rank := 0, @lang := null ) vars
    where
      a.signup_datetime > current_date - interval 1 year
    order by
      a.language, a.signup_datetime
  ) r
group by
  language
order by
  min(case when percentile >= 25 then signup_datetime end),
  min(case when percentile >= 50 then signup_datetime end);


| language | 25th | median | 75th |
| de       |  -18 |     -9 |    2 |
| en       |    0 |      0 |    0 |
| vu       |   82 |     54 |   39 |

German-language signups hit the 25th percentile 18 days early and the median nine days early, but the third quartile was not reached until two days after expected: German-language signups are slowing down. Vulcan, which was a slow trickle at the beginning of the year, hit a boom in the last 3 months; guess that booth at the convention worked out.




As events happen on our site, domain experts post comments to our internal annotation service. Comments are tagged with multiple keywords, and those tags form the structure for our knowledge base. It is one way we can link homepage latency with conversion rates. In such a system, keyword breadth is highly important, so I want to know how many annotations keywords link together.


MongoDB is a document store, NoSQL technology. It does not have out-of-the-box percentile functionality, but it does have a well documented MapReduce framework. Up until now, all the percentile solutions, even MySQL, have been a DSL; MapReduce is full-on programming.


"tags" is an array field on an AnnotationCollection document. I emit() each tag and sum them up in the reducer: basic MapReduce word counting. I inline the output of the MapReduce job ({ inline : 1 }) to capture the results as an array of (key, value) tuples. I then sort the tuple array by value, smallest first. Finally, I use the percentile times the total number of records to get the index of the tuple.

mapper = function() {
  if ( this.tags ) {
    for ( var i = 0; i < this.tags.length; i++ ) {
      emit( this.tags[i], 1 );
    }
  }
};

reducer = function(key, values) {
  var count = 0;
  for (var i in values) {
    count += values[i];
  }
  return count;
};

out = db.runCommand({
   mapreduce  : 'AnnotationCollection',
   map        : mapper,
   reduce     : reducer,
   out        : { inline : 1 }
});

/* sort the (tag, count) tuples by count, smallest first */
out.results.sort( function(a, b) { return a.value - b.value; } );

/* these percentiles */
[ 50, 80, 90, 99 ].forEach(function(v) {
   print( v + 'th percentile of annotations linked by keyword tags is ',
          out.results[ Math.floor(v * out.counts.output / 100) ].value );
});


Sample results from out:

			"_id" : "german seo",
			"value" : 98
			"_id" : "glusterfs",
			"value" : 145
			"_id" : "googlebot",
			"value" : 2123
	"counts" : {
		"input" : 711475,
		"emit" : 1543510,
		"reduce" : 711475,
		"output" : 738

All results:

50th percentile of annotations linked by keyword tags is  4
80th percentile of annotations linked by keyword tags is  8
90th percentile of annotations linked by keyword tags is  27
99th percentile of annotations linked by keyword tags is  853

The median keyword tag links 4 annotations. Decent coverage, but it could be better. And the top 1% of our (fake-data) keywords link 800+ annotations. "deployment" is the top keyword; automated processes are good knowledge sharers.

This post is a reference for calculating percentiles in different data stores. I keep a list of working queries to copy from at work. And now I’m pleased to share some of this list with you. Try them out, and let me know what you think.



Increase Performance with Automatic Keyword Recommendation

For most large-scale image retrieval systems, performance depends upon accurate meta-data. While content-based image retrieval has progressed in recent years, typically image contributors must provide appropriate keywords or tags that describe the image. Tagging, however, is a difficult and time-consuming task, especially for non-native English speaking contributors.

At Shutterstock, we mitigate this problem for our contributors by providing automatic tag recommendations. In this talk, delivered as a Webinar for Bright Talk’s “Business Intelligence and Analytics” channel, I describe the machine learning system behind the keyword recommendation system which Shutterstock’s Search and Algorithm Teams developed and deployed to the site.

Tag co-occurrence forms the basis of the recommendation algorithm. Co-occurrence is also the basis for some previous systems of tag recommendation deployed in the context of popular photo sharing services such as Flickr. In the context of online stock photography, tag recommendation has several aspects which are different from the context of photo sharing sites. In online stock photography, contributors are highly motivated to provide high quality tags because they make images easier to find and consequently earn higher contributor revenue. In building the system, we explored several different recommendation strategies and found that significant improvements are possible as compared to a recommender that only uses tag co-occurrence.

The three principal points of the talk are as follows:

(1) we characterize tagging behavior in the stock photography setting and show it is demonstrably different from popular photo sharing services;
(2) we explore different tag co-occurrence measures and, in contrast to previous studies, find a linear combination of two different measures to be optimal; and
(3) we show that a novel strategy that incorporates similar images can expand contextual information and significantly improve the precision of recommended tags.


Monitoring High Scale Search at a Glance

One of our key missions on the search team at Shutterstock is to constantly improve the reliability and speed of our search system.  To do this well, we need to be able to measure many aspects of our system’s health.  In this post we’ll go into some of the key metrics that we use at Shutterstock to measure the overall health of our search system.


The image above shows our search team’s main health dashboard.  Anytime we get an alert, a single glance at this dashboard can usually point us toward which part of the system is failing.

On a high level, the health metrics for our search system focus on its ability to respond to search requests, and its ability to index new content.  Each of these capabilities is handled by several different systems working together, and requires a handful of core metrics to monitor its end-to-end functionality.

One of our key metrics is the rate of traffic that the search service is currently receiving.  Since our search service serves traffic from multiple sites, we also have other dashboards that break down those metrics further for each site.  In addition to the total number of requests we see, we also measure the rate of memcache hits and misses, the error rate, and the number of searches returning zero results.

One of the most critical metrics we focus on is our search service latency. This varies greatly depending on the type of query, the number of results, and the sort order being used, so this metric is also broken down in more detail on other dashboards. For the most part we aim to maintain response times of 300ms or less for 95% of our queries. Our search service runs a number of different processes before running a query on our Solr pool (language identification, spellcheck, translation, etc.), so this latency represents the sum total of all those processes.

In addition to search service latency, we also track latency on our Solr cluster itself.  Our Solr pool will only see queries that did not have a hit in memcache, so the queries that run there may be a little slower on average.

When something in the search service fails or times out, we also track the rate of each type of error that the search service may return. At any time there's a steady stream of garbage traffic from bots generating queries that may error out, so there's a small but consistent stream of failed queries. If a search service node is restarted we may also see a blip in HTTP 502 errors, although that's a problem we're trying to address by improving our load balancer's responsiveness in taking nodes out of the pool before they go down.

A big part of the overall health of our system also includes making sure that we're serving up new content in a timely manner. Another graph on our dashboard tracks the volume and burndown of items in our message queues, which serve as our pipeline for ingesting new images, videos, and other assets into our Solr index. This ensures that content is making it into our indexing pipeline, where all the data needed to make it searchable is processed. If the indexing system stops being able to process data, the burndown rate of each queue will usually come to a halt.

There’s other ways in which our indexing pipeline may fail too, so we also have another metric that measures the amount of content that is making it through our indexing system, getting into Solr, and showing up in the actual output of Solr queries.  Each document that goes into Solr receives a timestamp when it was indexed.  One of our monitoring scripts then polls Solr at regular intervals to see how many documents were added or modified in a recent window of time.  This helps us serve our contributors well by making sure that their new content is being made available to customers in a timely manner.
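
A check along those lines might look like this rough Python sketch, assuming a Solr core with an indexed-at timestamp field; the host, core, and field names are hypothetical, not our actual setup.

import requests

SOLR_SELECT = 'http://solr.example.com:8983/solr/media/select'  # hypothetical

params = {
    'q': 'indexed_at:[NOW-10MINUTES TO NOW]',  # documents added or modified recently
    'rows': 0,                                 # we only need the count
    'wt': 'json',
}
response = requests.get(SOLR_SELECT, params=params, timeout=10)
recently_indexed = response.json()['response']['numFound']

# Alert if the indexing pipeline appears to have stalled.
if recently_indexed == 0:
    print('WARNING: no documents indexed in the last 10 minutes')
else:
    print('OK: %d documents indexed in the last 10 minutes' % recently_indexed)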

Behind the scenes we also have a whole host of other dashboards that break out the health and performance of each system covered in this dashboard, as well as metrics for other services in our search ecosystem.  When we’re deploying new features or troubleshooting issues, having metrics like these helps us very quickly determine what the impact is and guides us to quickly resolving it.


Stop Buying Load Balancers and Start Controlling Your Traffic Flow with Software

When it comes to traditional load balancers, you can either splurge on expensive hardware or go the software route. Hardware load balancers typically have poor or outdated API designs and are, at least in my experience, slow. You can find a few software load balancing products with decent APIs, but using free alternatives like HAProxy leaves you with bolt-on software that generates the configuration file for you. Even then, if you need high throughput you have to rely on vertical scaling of your load balancer or round-robin DNS to distribute horizontally.

We were trying to figure out how to avoid buying a half million dollars' worth of load balancers every time we needed a new data center. What if you didn't want to use a regular layer 4/7 load balancer and, instead, relied exclusively on layer 3? This seems entirely possible, especially after reading about how CloudFlare uses Anycast to solve this problem. There are a few ways to accomplish it. You can go full-blown BGP and run it all the way down to your top-of-rack switches, but that's a commitment and likely requires a handful of full-time network engineers on your team. Running a BGP daemon on your servers is the easiest way to mix "Anycast for load balancing" into your network, and there are several daemons to choose from, including BIRD, Quagga, and ExaBGP.

After my own research, I decided that ExaBGP is the easiest way to manipulate routes. The entire application is written in Python, making it perfect for hacking on. ExaBGP has a decent API, and even supports JSON for parts of it. The API works by reading STDOUT from your process and sending your process information through STDIN. In the end, I’m looking for automated control over my network, rather than more configuration management.

At this point, I can create a basic “healthcheck” process that might look like:

#!/usr/bin/env bash

# The announced service prefix was elided in the original post; substitute your
# own route in the announce/withdraw lines below.
STATE="down"

while true; do
  # grep -q keeps the health check output off STDOUT, which ExaBGP reads as API input.
  if curl -s localhost:4000/healthcheck.html 2>/dev/null | grep -q OK; then
    if [[ "$STATE" != "up" ]]; then
      echo "announce route <service-prefix> next-hop self"
      STATE="up"
    fi
  else
    if [[ "$STATE" != "down" ]]; then
      echo "withdraw route <service-prefix> next-hop self"
      STATE="down"
    fi
  fi

  sleep 2
done

Then in your ExaBGP configuration file, you would add something like this:

group anycast-test {
  local-as 65001;
  peer-as 65002;

  process watch-application {
    run /usr/local/bin/healthcheck.sh;
  }

  # The neighbor (upstream router) and local addresses were stripped from the
  # original post; fill in your own peering details here.
  neighbor <router-address> {
    local-address <server-address>;
  }
}
Now, anytime your curl | grep check is passing, your BGP neighbor will have a route to your service IP. When it begins to fail, the route will be withdrawn from the neighbor. If you now deploy this on a handful of servers, your upstream BGP neighbor will have multiple routes. At this point, you have to configure your router to properly spread traffic between the multiple paths with equal cost. In JUNOS, this would look like:

set policy-options policy-statement load-balancing-policy then load-balance per-packet
set routing-options forwarding-table export load-balancing-policy

Even though the above says load-balance per-packet, it is actually more of a load-balance per-flow, since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the reasoning for this stems from legacy chipsets that did not support per-flow packet distribution. You can read more about this configuration on Juniper's website. Below is our new network topology for accessing a service:


There are some scale limitations, though. It comes down to how many ECMP next-hops your hardware router can handle. I know a Juniper MX240 can handle 16 next-hops, and I have heard rumors that a software update will bump this to 64, but it is something to keep in mind. A tiered approach may be appropriate if you need a high number of backend machines: a layer of route servers running BIRD or Quagga, with your backend services peering to that layer using ExaBGP. You could even use this approach to scale HAProxy horizontally.

In conclusion, replacing a traditional load balancer with layer 3 routing is entirely possible. In fact, it can even give you more control of where traffic is flowing in your datacenter if done right. I look forward to rolling this out with more backend services over the coming months and learning what problems may arise. The possibilities are endless, and I’d love to hear more about what others are doing.
