12 Questions To Ask About PCI

Organizations either breeze through PCI certification or struggle with it. The struggle is like a fight against zombies: you must stay on your toes, because once they start coming toward you they don’t stop, and as your team deals with its own zombies, you realize you can’t keep up. The challenge doesn’t stop there. A poem by Sam Kassoumeh sums PCI up in a nutshell: you must manage tech debt, legacy access control rules, and the fight for attention from developers and operations. How does anyone get through this?

Never stray from your main goal. Certification is the immediate point of the program: it reassures customers and partners that due diligence keeps their financial data safe. But remember that PCI is a means to an end, not a goal in itself. The PCI process is supposed to make you think about how you handle sensitive data in general. Nobody would argue that a single certification, piece of paper, or audit is enough to protect an organization. Build a process and a worldview you can work with.

The PCI Level 3 document is 112 pages long, with 4 appendices and 12 sections. It sounds daunting, and it can be if your approach to security is ad hoc. You will scramble to figure out what is covered, where covered assets are, who has access to those assets, and maybe even what the term asset really means. In the middle of an assessment you’ll find yourself questioning the meaning of just about everything, including:

  • Security strategy: do I really have a coherent strategy?
  • The tools we use
  • How we store data

So what can you do to make PCI compliance achievable on that big day? Start today and think about:

  1. What customer data do we need to hold onto? For how long?
  2. How do I dispose of storage and printouts that have this data?
  3. Does everyone who needs to access this information have 2-factor authentication?
  4. Is the pathway to this data secured by encrypted connections (e.g. HTTPS)?
  5. Is it possible for an insider or intruder to see sensitive data through some other segment of our network?
  6. Is sensitive data only available to people and apps that really need it?
  7. If someone’s access level changes, would I know about it?
  8. If a related network rule changes, would I know about it?
  9. Am I keeping up with patches on high priority servers?
  10. Am I monitoring for and/or alerting on suspicious traffic?
  11. Do I know what ciphers we use?
  12. What’s our process for offboarding employees with access to sensitive data?
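Some of these questions can be answered with a few lines of code. For question 11, for example, Python’s standard library can show which cipher suites a default TLS client would offer — illustrative only, since your services’ real cipher lists depend on their own TLS configuration:

```python
import ssl

# List the cipher suites a default Python TLS client would offer.
# Illustrative only: your services' actual cipher lists depend on
# their own TLS configuration, not on this client-side default.
ctx = ssl.create_default_context()
for cipher in ctx.get_ciphers():
    print(cipher["name"], cipher["protocol"])
```

A similar inventory of your servers’ actual configurations is what an assessor will ask for.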

If you store data about customer transactions unrelated to credit cards (credit cards being the domain of PCI), is it really a stretch to treat that data with the same care? Why encrypt credit card information but not bank account numbers? Why mask part of the credit card number but not a customer’s address? An address can be used for identity theft, too.
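Masking, at least, costs almost nothing to apply uniformly. A minimal sketch in Python (the helper is hypothetical):

```python
def mask(value: str, visible: int = 4, pad: str = "*") -> str:
    """Mask all but the last `visible` characters of a sensitive value."""
    if len(value) <= visible:
        return pad * len(value)
    return pad * (len(value) - visible) + value[-visible:]

# The same helper works for card numbers, bank accounts, or addresses:
print(mask("4111111111111111"))  # ************1111
```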

This isn’t to say you should encrypt or mask everything everywhere. The point is to consider it. Maybe you don’t need to store so much data. Maybe you can design your network and application access rules earlier, paying attention to the areas of the network that hold personal information.

This is just the beginning, but if you ask yourself these questions early, you can construct a strong strategy, which is the true end goal of the PCI compliance process.

Once you know what you’re looking for, use any resource you find helpful. For example, IBM has published a guide on the importance of complying with PCI DSS 3.0 Requirement 6.

What are your thoughts on PCI? Be sure to comment below.

Interested in working at Shutterstock? We're hiring! >>

Stop Using One Language

In any technology company, one of the fundamental aspects of its identity is the technology stack and the programming languages it’s built on.  These define the types of tools that are fair game and, more importantly, the types of engineers who are hired and capable of succeeding there.

Back in the middle of the last decade, when Shutterstock had its beginnings, the tech team was made up primarily of die-hard Perl developers.  The benefits of CPAN and the flexibility of the language were touted as the primary reasons Perl was the right tool for anything we wanted to build.  The only problem was that our hiring pool was limited to people eager to work with Perl; although the Perl folks who joined us were indeed some of our most passionate and skilled engineers, there were countless engineers outside the Perl community whom we ended up ignoring entirely.

Fast forward to the last few years, and Shutterstock has become a much more “multilingual” place for software engineers to work.  We have services written in Node.js, Ruby, and Java; data processing tools written in Python; a few of our sites written in PHP; and apps written in Objective-C.

Even though we have developers who specialize in each language, it’s become increasingly important that we remove the barriers to letting people work across multiple languages when they need to, whether it’s for debugging, writing new features, or building brand new apps and services.  At Shutterstock, there have been a few strategic decisions and technology choices that have facilitated our evolution to the more multilingual programming environment and culture we have today.

Service Oriented Architectures

One of the architectural decisions we made early on to support development in multiple languages was to build out all our core functionality into siloed services.  Each service could be written in any language while providing a language-agnostic interface through REST frameworks.   This has enabled us to write separate pieces of functionality in the language most suited to it.  For example, search makes use of Lucene & Solr, and so Java made sense there.  For our translation services, Unicode support is highly important, so Perl was the strongest option there for us.

Common Frameworks

Across languages there are numerous frameworks and standards that have inspired or replicated one another.  When possible, we try to use one of those common technologies in our services.  As mentioned above, all of our services provide RESTful interfaces, and internally we use Sinatra-inspired frameworks to implement them (Dancer for Perl, Slim for PHP, Express for Node, etc).  For templating we use Django-inspired frameworks such as Template::Swig for Perl, Twig for PHP, and Liquid for Ruby.  Using these frameworks eases the learning curve when a developer jumps between languages.

Runtime Management

When it comes down to the nuts and bolts of actually running code in a particular language, one of the obstacles that blocks new developers from getting into it is all the technical bureaucracy needed to manage each runtime — dependency management, environment paths, and all the command line settings and flags needed to do common tasks.

The tool we use at Shutterstock to simplify all this is Rockstack.  Rockstack provides a standardized interface for building, running, and testing code in any of its supported runtimes (currently Perl, PHP, Python, Ruby, and Java).  Have a Java app that you need to spin up?  Run “rock build” and “rock run”.  Have a Perl service you want a Java developer to debug?  “rock build”, “rock run”.

Another major benefit of Rockstack is that our developers get a standard interface for building, testing, and running code, and our build and deployment system only has to deal with one standard set of commands for those operations in any language.  Rockstack is used by our Jenkins cluster for running builds and tests, and our home-grown deployment system makes use of it for launching applications in dev, qa, and production.

One of the biggest obstacles for people jumping into a new language is the cognitive load of having to figure out all the details of setting up and working with the development environment for that language.  Once you remove that burden, people can actually focus their energy on the important engineering problems they need to solve.

Testing Frameworks

In order to create a standardized method for testing all the services we have running, we developed (and open sourced!) NTF (Network Testing Framework).  NTF lets us write tests that hit special resources on our services’ APIs to provide status information showing that the service is running in proper form.  NTF supplements our collection of unit and integration tests by constantly running in production and telling us if any functionality has been impaired in any of our services.

Developer Meetups

In addition to tools and frameworks, we support our developers in learning and evolving their skill sets.  On a regular basis, we’ll have internal meetups for Shutterstock’s Node developers, PHP developers, or Ruby developers where they give each other feedback on code in progress, share successes or failures with third-party libraries, and polish up the style guide for their language.  These meetups are a great way for someone new to a language to ask questions and improve their coding skills.


Part of what makes it easy to jump into another language is that all the code for every Shutterstock site and service is available for everyone to look at on our internal Github server.  This means that anyone can review anyone else’s code, or check out a copy and run it.  If you have an idea for a feature, you can fork off a branch and submit a pull request to the shepherd of that service.  Creating this sense of openness with our code helps prevent walled gardens and encourages people to share ideas and try new things.


Even though language-agnostic engineering comes with some nice benefits, it’s crucial to bring a modicum of pragmatism to this vision.  A completely language agnostic environment may be idealistic and impractical.  Allowing developers to build services and tools in any language that interests them may lead to a huge amount of fragmentation.  Having 50 tools written in 50 different languages would be a nightmare to maintain, and would kill any opportunities for code reuse between them.  Additionally, with a greater breadth of technologies, it becomes much more difficult to have people on hand with the depth of knowledge needed to lead initiatives with them.

As a matter of practicality, we keep a list of Preferred Technologies which is broad enough to provide plenty of choice, but narrow enough that we can trust we’ll have plenty of expertise on hand.  If a new technology is vetted and deemed valuable, it will be considered for addition to this list.  However, if one developer wants to go and write a new site in Haskell, they’ll probably be shot down*.

*we have nothing but respect for Haskell

Although we want to make it easy for all of our developers to work in any of our common languages, there’s always going to be a need for specialists.  Every language is going to have its nuances, buggy behaviors, and performance quirks that only someone with extensive language experience will be able to recognize.   For each of our preferred technologies, we have several people on hand with deep knowledge in it.

* * *

Since Shutterstock is built on a plethora of services, any one of our sites may be receiving data that came from something built in Perl, Java, Node, or Ruby.   If one of those sites needs an extra tweak in an api resource, it’s incredibly helpful when a developer can jump in and help make the necessary change to any of those services regardless of the language it was written in.  When developers can work in this way, it helps ease dependencies between teams, which helps the organization move faster as a whole.

Many of our strategies and tools are designed to help give engineers more language agnostic skills to better work across multiple languages.  Whether it’s frameworks that share standards, build and runtime tools that work across languages, architecture strategies, or testing frameworks, having common approaches for all these things allows everyone in the organization to work together, instead of siloing themselves off based on language-specific skillsets.

As the world of programming languages becomes much more fragmented, it’s becoming more important than ever from a business perspective to develop multilingual-friendly approaches to building a tech company.  Some of the tools and processes we developed at Shutterstock have helped us move in that direction, but there’s a lot more that could be done to facilitate an environment where the tech stack of choice isn’t a barrier to bringing in talent.


Code snippets to calculate percentiles in databases

As a Datavis Engineer at Shutterstock, I dive into a lot of data every day and routinely answer questions regarding customer behaviors, site security, and latency issues. I keep a list of SQL snippets to copy and paste into my work, and I’ve found that keeping a list of examples is easier than memorizing a bunch of similar-but-different SQL dialects.

Calculating percentiles comes up regularly in my work, and it’s also a popular area of confusion. So, I’m going to break down the calculation of percentiles for you. I find a business context helps me understand the tech behind the topic, so I organized the queries into four case studies. I’ll state the situation, present the technology, give the query, then wrap it up.

Note: I am very specifically not comparing technologies. The internet is filthy with those posts. I am offering copiable code without the hype.

Also note: all data is fake. I show numbers to give a sense of the output to expect, not a scale for comparison.
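Every query below is a variant of the same idea: sort the values, then index in proportionally to the percentile. As a neutral reference point, here is the nearest-rank method in plain Python (my own sketch; individual databases differ in how they interpolate):

```python
import math

def percentile_disc(values, p):
    """Nearest-rank (discrete) percentile: returns an actual member of
    `values` -- the smallest value whose cumulative share is >= p."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered))
    return ordered[max(rank - 1, 0)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_disc(data, 0.50))  # 5
print(percentile_disc(data, 0.90))  # 9
```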



Shutterstock is a two-sided marketplace: we license crowd-sourced media. Contributors do not receive money for their images until those images are published on our website. Tens of thousands of images are submitted daily, and they all need to be inspected carefully. Hence, an entire team is dedicated to inspecting the images, and they aim to maintain a thorough and speedy process.


We use Vertica for analytic duties. Like most columnar databases, Vertica is slow at row lookups, but its aggregates are blazingly fast. It is the only pay-for-use database in this post.


Vertica’s flavor of SQL is fairly similar to MySQL’s, with analytic functions similar to other OLAP databases. Vertica has many built-in analytical functions; I use PERCENTILE_DISC() for this case study.

  SELECT DISTINCT
    added_date,
    PERCENTILE_DISC(0.9)
      WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
      OVER (PARTITION BY added_date)
      AS '90th',
    PERCENTILE_DISC(0.5)
      WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
      OVER (PARTITION BY added_date)
      AS 'median'
  FROM photo_submissions  -- hypothetical table name
  WHERE
    added_date >= current_date() - interval '4 day';

RESULTS (as a reminder, there are 1440 minutes in a day):

added_date median 90th
2014-01-01 2880 5000
2014-01-02 1440 6000
2014-01-03 2000 4800
2014-01-04 3000 5500

Half of the photos uploaded on January 1 took two days to show up on our website.  There is a big gap between the median and 90th percentile approval times on January 2. If this data were real, I would investigate why January 2 is different from the other days.




We track the efficacy of our designs. Knowing how often an element is clicked gives us insight into what our customers actually see on a page. If customers click an element that is not a link, then we can make changes to our HTML.


We store raw click data in HDFS. Hive is a familiar SQL interface to data stored in HDFS. It has a built-in PERCENTILE() UDF.


I begin with an inner query that counts customer clicks on the specific element per day. I wrap that inner query in a main query that computes daily percentiles of customer behavior. I need the inner query because PERCENTILE() does not accept COUNT() as its first argument.

SELECT
  day,
  percentile(count, 0.25),
  percentile(count, 0.50),
  percentile(count, 0.75),
  percentile(count, 0.99)
FROM (
  -- inner query
  SELECT
    visitor_id,
    day,
    count(*) as count
  FROM clicks  -- hypothetical table name
  WHERE
    element = 'header_div'
    and page = '/search_results.html'
    and year = 2014 and month = 4
  group by
    visitor_id, day
) c
group by
  day;


Sample data result of inner query:

   visitor_id | day | count
       1      |  1  |  5
       2      |  1  |  7
       2      |  2  |  9

All results:

day _c1 _c2 _c3 _c4
01 1.0 3.0 15.0 52.0
02 1.0 3.0 15.0 64.0
03 1.0 3.0 14.0 68.0

Judging by median click counts, _c2, customers click on a non-link element about three times in a session. Some click as many as fifteen times. Wow. The header_div element should be made clickable.




Shutterstock’s B.I. team does an excellent job of analyzing marketing spend and conversions. Sometimes it is easier for me to get the data myself than to pull an analyst away from their work.


MySQL is a widely used transactional database. It’s fast and full-featured, but it does not have built-in support for calculating percentiles.


I need to compare a rank, meaning a position in an ordered list, against a total count of rows. Position over total count gives me a percentile.

Complex queries are built inside out. This query starts with inner query t, which counts the total number of accounts per language. I join the per language counts to the accounts table, and pull out a customer’s signup date.

There are plenty of resources around the web for calculating row-level ranks in MySQL. The techniques boil down to setting variables in an ordered result set. Here, I order the inner query r by language and keep track of the current language in @lang. When the language changes, @rank resets to 0 in the _resetRank column. Neat!
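The variable trick is easier to follow in procedural form. Here is the same rank-and-reset logic as a Python sketch (the tuple layout is illustrative, not our schema):

```python
def rank_percentiles(rows):
    """rows: (language, signup_datetime) tuples.  Returns
    (language, signup_datetime, rank, percentile) per row, with the
    rank resetting on each language change -- the same thing the
    @lang/@rank session variables do in the MySQL query."""
    totals = {}
    for lang, _ in rows:
        totals[lang] = totals.get(lang, 0) + 1
    out, prev_lang, rank = [], None, 0
    for lang, dt in sorted(rows):            # ordered by language, then date
        rank = rank + 1 if lang == prev_lang else 1
        prev_lang = lang
        out.append((lang, dt, rank, round(100 * rank / totals[lang], 3)))
    return out

rows = [("de", "2014-01-05"), ("de", "2014-03-01"), ("en", "2014-02-10")]
for r in rank_percentiles(rows):
    print(r)
```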

The outer query compares signup dates to milestone dates.

Given that we’re only looking at signups from the past year, a steady signup rate would put twenty-five percent of signups at exactly nine months ago and fifty percent at six months ago. If 25% of signups happened before the nine-month milestone, the first quarter of the year saw “fast” signups. This query returns “gut check” data; it’s not rigorously tested or verified.

SELECT
  language,
  datediff(
    date(max(case when percentile <= 25 then signup_datetime end)),
    current_date - interval 9 month
  ) AS '25th',
  datediff(
    date(min(case when percentile >= 50 then signup_datetime end)),
    current_date - interval 6 month
  ) as 'median',
  datediff(
    date(min(case when percentile >= 75 then signup_datetime end)),
    current_date - interval 3 month
  ) as '75th'
FROM (
    /* build ranks and percentiles */
    select
      @rank := if( @lang = a.language, @rank, 0 ) as _resetRank,
      @lang := a.language as language,
      @rank := @rank+1 as rank,
      a.signup_datetime,
      round( 100 * @rank / t.cnt, 3 ) as percentile
    from
      accounts a
      /* initialize the session variables */
      join ( select @rank := 0, @lang := '' ) v
      /* t counts rows per language */
      join ( select
               language,
               count(*) as cnt
             from accounts
             where
               signup_datetime > current_date - interval 1 year
             group by
               language
           ) t on t.language = a.language
    where
      a.signup_datetime > current_date - interval 1 year
    order by
      a.language, a.signup_datetime
  ) r
group by
  language
order by
  min(case when percentile >= 25 then signup_datetime end),
  min(case when percentile >= 50 then signup_datetime end);


| language | 25th | median | 75th |
| de       |  -18 |     -9 |    2 |
| en       |    0 |      0 |    0 |
| vu       |   82 |     54 |   39 |

German-language signups hit the 25th percentile 18 days early and the median nine days early, but the third quartile was not reached until two days later than expected. German-language signups are slowing down. Vulcan, which was a slow trickle at the beginning of the year, boomed in the last three months; guess that booth at the convention worked out.




As events happen on our site, domain experts post comments to our internal annotation service. Comments are tagged with multiple keywords, and those tags form the structure for our knowledge base. It is one way we can link homepage latency with conversion rates. In such a system, keyword breadth is highly important, so I want to know how many annotations keywords link together.


MongoDB is a NoSQL document store. It does not have out-of-the-box percentile functionality, but it does have a well-documented MapReduce framework. Up until now, all the percentile solutions, even MySQL’s, have been a DSL; MapReduce is full-on programming.


“tags” is an array field on an AnnotationCollection document. I emit() each tag and sum the counts in the reducer: basic MapReduce word counting. I inline the output of the MapReduce job, ‘{ inline : 1 }’, to capture the results in an array of key-value tuples. I then sort the tuple array in ascending order of value. Finally, I use the percentile times the total number of records as an index into the tuple array.

mapper = function() {
  if ( this.tags ) {
    for ( var i = 0; i < this.tags.length; i++ ) {
      emit( this.tags[i], 1 );
    }
  }
};

reducer = function(pre, curr) {
  var count = 0;
  for (var i in curr) {
    count += curr[i];
  }
  return count;
};

out = db.runCommand({
   mapreduce  : 'AnnotationCollection',
   map        : mapper,
   reduce     : reducer,
   out        : { inline : 1 }
});

/* sort them, smallest value first */
out.results.sort( function(a,b) { return a.value - b.value; } );

/* these percentiles */
[ 50, 80, 90, 99 ].forEach(function(v) {
   print( v + 'th percentile of annotations linked by keyword tags is ',
          out.results[ Math.floor(v * out.counts.output / 100) ].value );
});


Sample results from out:

{
	"results" : [
		...
		{
			"_id" : "german seo",
			"value" : 98
		},
		{
			"_id" : "glusterfs",
			"value" : 145
		},
		{
			"_id" : "googlebot",
			"value" : 2123
		},
		...
	],
	"counts" : {
		"input" : 711475,
		"emit" : 1543510,
		"reduce" : 711475,
		"output" : 738
	}
}
All results:

50th percentile of annotations linked by keyword tags is  4
80th percentile of annotations linked by keyword tags is  8
90th percentile of annotations linked by keyword tags is  27
99th percentile of annotations linked by keyword tags is  853

The median keyword tag links 4 annotations. Decent coverage, but it could be better. And the top 1% of our (fake-data) keywords link 800+ annotations? “deployment” is the top keyword; automated processes are good knowledge sharers.
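For what it’s worth, the whole pipeline (word count, ascending sort, percentile index) fits in a few lines of Python, which is a handy way to sanity-check MapReduce output on a small sample (the tags below are made up):

```python
from collections import Counter

# Toy documents standing in for AnnotationCollection (made-up tags).
docs = [
    {"tags": ["deployment", "jenkins"]},
    {"tags": ["deployment", "solr", "jenkins"]},
    {"tags": ["deployment"]},
    {},  # documents without tags are skipped, like the mapper's guard
]

counts = Counter(tag for d in docs for tag in d.get("tags", []))
values = sorted(counts.values())            # ascending, like out.results
for p in (50, 99):
    idx = min(p * len(values) // 100, len(values) - 1)
    print(f"{p}th percentile: {values[idx]}")
```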

This post is a reference for calculating percentiles in different data stores. I keep a list of working queries to copy from at work. And now I’m pleased to share some of this list with you. Try them out, and let me know what you think.



Increase Performance with Automatic Keyword Recommendation

For most large-scale image retrieval systems, performance depends upon accurate meta-data. While content-based image retrieval has progressed in recent years, typically image contributors must provide appropriate keywords or tags that describe the image. Tagging, however, is a difficult and time-consuming task, especially for non-native English speaking contributors.

At Shutterstock, we mitigate this problem for our contributors by providing automatic tag recommendations. In this talk, delivered as a Webinar for Bright Talk’s “Business Intelligence and Analytics” channel, I describe the machine learning system behind the keyword recommendation system which Shutterstock’s Search and Algorithm Teams developed and deployed to the site.

Tag co-occurrence forms the basis of the recommendation algorithm. Co-occurrence is also the basis for some previous systems of tag recommendation deployed in the context of popular photo sharing services such as Flickr. In the context of online stock photography, tag recommendation has several aspects which are different from the context of photo sharing sites. In online stock photography, contributors are highly motivated to provide high quality tags because they make images easier to find and consequently earn higher contributor revenue. In building the system, we explored several different recommendation strategies and found that significant improvements are possible as compared to a recommender that only uses tag co-occurrence.
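As a rough illustration of the co-occurrence baseline (this is a toy sketch, not the deployed recommender, which combines several measures and image similarity):

```python
from collections import Counter
from itertools import combinations

# Toy tagged-image corpus (made-up data).
images = [
    ["beach", "sand", "ocean"],
    ["beach", "ocean", "sunset"],
    ["beach", "sand"],
]

# Count how often each ordered pair of tags appears together.
cooc = Counter()
for tags in images:
    for a, b in combinations(sorted(set(tags)), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(tag, k=2):
    """Top-k tags that co-occur most often with `tag`."""
    scores = Counter({b: n for (a, b), n in cooc.items() if a == tag})
    return [t for t, _ in scores.most_common(k)]

print(recommend("beach"))  # e.g. ['ocean', 'sand'] (ties may order either way)
```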

The three principal points of the talk are as follows:

(1) we characterize tagging behavior in the stock photography setting and show it is demonstrably different from popular photo sharing services.
(2) we explore different tag co-occurrence measures and, in contrast to previous studies, find a linear combination of two different measures to be optimal, and
(3) we show that a novel strategy that incorporates similar images can expand contextual information and significantly improve the precision of recommended tags.


Monitoring High Scale Search at a Glance

One of our key missions on the search team at Shutterstock is to constantly improve the reliability and speed of our search system.  To do this well, we need to be able to measure many aspects of our system’s health.  In this post we’ll go into some of the key metrics that we use at Shutterstock to measure the overall health of our search system.


The image above shows our search team’s main health dashboard.  Anytime we get an alert, a single glance at this dashboard can usually point us toward which part of the system is failing.

On a high level, the health metrics for our search system focus on its ability to respond to search requests, and its ability to index new content.  Each of these capabilities is handled by several different systems working together, and requires a handful of core metrics to monitor its end-to-end functionality.

One of our key metrics is the rate of traffic that the search service is currently receiving.  Since our search service serves traffic from multiple sites, we also have other dashboards that break down those metrics further for each site.  In addition to the total number of requests we see, we also measure the rate of memcache hits and misses, the error rate, and the number of searches returning zero results.

One of the most critical metrics we focus on is our search service latency.  This varies greatly depending on the type of query, the number of results, and the sort order being used, so this metric is also broken down into more detail in other dashboards.  For the most part we aim to maintain response times of 300ms or less for 95% of our queries.  Our search service runs a number of different processes before running a query on our Solr pool (language identification, spellcheck, translation, etc.), so this latency represents the sum total of all those processes.
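Checking a window of response times against that target is simple; here is a sketch of the kind of check a dashboard alert might run (the thresholds are the ones stated above; the function name is hypothetical):

```python
def within_slo(latencies_ms, threshold_ms=300, target=0.95):
    """True if at least `target` share of sampled requests completed
    at or under `threshold_ms`."""
    ok = sum(1 for l in latencies_ms if l <= threshold_ms)
    return ok / len(latencies_ms) >= target

# 19 of 20 sampled queries under 300ms -> exactly at the 95% target
window = [120] * 19 + [450]
print(within_slo(window))  # True
```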

In addition to search service latency, we also track latency on our Solr cluster itself.  Our Solr pool will only see queries that did not have a hit in memcache, so the queries that run there may be a little slower on average.

When something in the search service fails or times out, we track the rate of each type of error that the search service may return.  There’s always a steady stream of garbage traffic from bots generating queries that may error out, so there’s a small but consistent stream of failed queries.  If a search service node is restarted, we may also see a blip in HTTP 502 errors, although that’s a problem we’re trying to address by improving our load balancer’s responsiveness in taking nodes out of the pool before they go down.

A big part of the overall health of our system is making sure that we’re serving up new content in a timely manner.  Another graph on our dashboard tracks the volume and burndown of items in our message queues, which serve as our pipeline for ingesting new images, videos, and other assets into our Solr index.  This ensures that content is making it into our indexing pipeline, where all the data needed to make it searchable is processed.  If the indexing system stops being able to process data, the burndown rate of each queue will usually come to a halt.

There are other ways our indexing pipeline may fail, so we also have a metric that measures the amount of content making it through our indexing system, getting into Solr, and showing up in the actual output of Solr queries.  Each document that goes into Solr receives a timestamp when it is indexed.  One of our monitoring scripts then polls Solr at regular intervals to see how many documents were added or modified in a recent window of time.  This helps us serve our contributors well by making sure that their new content is made available to customers in a timely manner.
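Such a poll can be as simple as a Solr query counting documents whose index timestamp falls in a recent window. A sketch of how that request could be built (the endpoint and field names are assumptions, not our actual schema):

```python
from urllib.parse import urlencode

def freshness_query_url(solr_base, window_minutes=5):
    """URL for a Solr query counting docs indexed in the last N minutes.
    Assumes a `timestamp` field populated at index time."""
    params = {
        "q": f"timestamp:[NOW-{window_minutes}MINUTES TO NOW]",
        "rows": 0,          # we only need numFound, not the documents
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"

print(freshness_query_url("http://solr.example.com:8983/solr/media"))
```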

Behind the scenes we also have a whole host of other dashboards that break out the health and performance of each system covered in this dashboard, as well as metrics for other services in our search ecosystem.  When we’re deploying new features or troubleshooting issues, having metrics like these helps us very quickly determine what the impact is and guides us to quickly resolving it.


Stop Buying Load Balancers and Start Controlling Your Traffic Flow with Software

When it comes to traditional load balancers, you can either splurge on expensive hardware or go the software route. Hardware load balancers typically have poor or outdated API designs and are, at least in my experience, slow. You can find a few software load balancing products with decent APIs, but trying to use free alternatives like HAProxy leaves you with bolt-on software that generates the configuration file for you. Even then, if you need high throughput you have to rely on vertical scaling of your load balancer or round-robin DNS to distribute horizontally.

We were trying to figure out how to avoid buying a half million dollars’ worth of load balancers every time we needed a new data center. What if you didn’t use a regular layer 4/7 load balancer and, instead, relied exclusively on layer 3? This seems entirely possible, especially after reading about how CloudFlare uses Anycast to solve this problem. There are a few ways to accomplish it. You can go full-blown BGP and run that all the way down to your top-of-rack switches, but that’s a commitment and likely requires a handful of full-time network engineers on your team. Running a BGP daemon on your servers is the easiest way to mix “Anycast for load balancing” into your network, and you have multiple options for the daemon itself.

After my own research, I decided that ExaBGP is the easiest way to manipulate routes. The entire application is written in Python, making it perfect for hacking on. ExaBGP has a decent API, and even supports JSON for parts of it. The API works by reading STDOUT from your process and sending your process information through STDIN. In the end, I’m looking for automated control over my network, rather than more configuration management.

At this point, I can create a basic “healthcheck” process that might look like:

#!/usr/bin/env bash

STATE="down"

while true; do
  curl localhost:4000/healthcheck.html 2>/dev/null | grep OK

  if [[ $? == 0 ]]; then
    if [[ "$STATE" != "up" ]]; then
      echo "announce next-hop self"
      STATE="up"
    fi
  else
    if [[ "$STATE" != "down" ]]; then
      echo "withdraw next-hop self"
      STATE="down"
    fi
  fi

  sleep 2
done

Then in your ExaBGP configuration file, you would add something like this:

group anycast-test {
  local-as 65001;
  peer-as 65002;

  process watch-application {
    run /usr/local/bin/healthcheck.sh;
  }

  neighbor {
    ...
  }
}
Now, any time your curl | grep check is passing, your BGP neighbor will have a route to your service IP. When the check begins to fail, the route will be withdrawn from the neighbor. If you deploy this on a handful of servers, your upstream BGP neighbor will have multiple routes. At that point, you have to configure your router to properly spread traffic between the multiple paths with equal cost. In JUNOS, this would look like:

set policy-options policy-statement load-balancing-policy then load-balance per-packet
set routing-options forwarding-table export load-balancing-policy

Even though the configuration above says load-balance per-packet, it is actually closer to load-balance per-flow, since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the naming stems from legacy chipsets that did not support per-flow packet distribution. You can read more about this configuration on Juniper’s website. Below is our new network topology for accessing a service:


There are some scale limitations, though. It comes down to how many ECMP next-hops your hardware router can handle. I know a Juniper MX240 can handle 16 next-hops, and I have heard rumors that a software update will bump this to 64, but it is something to keep in mind. If you need a large number of backend machines, a tiered approach may be appropriate: a layer of route servers running BIRD or Quagga, with your backend services peering to them using ExaBGP. You could even use this approach to scale HAProxy horizontally.

In conclusion, replacing a traditional load balancer with layer 3 routing is entirely possible. In fact, it can even give you more control of where traffic is flowing in your datacenter if done right. I look forward to rolling this out with more backend services over the coming months and learning what problems may arise. The possibilities are endless, and I’d love to hear more about what others are doing.

Interested in working at Shutterstock? We're hiring! >>

How we built interactive heatmaps using Solr and Heatmap.js

One of the things we obsess over at Shutterstock is the customer experience.  We’re always aiming to better understand how customers interact with our site in their day-to-day work.  One crucial piece of information we wanted to know was which elements of our site customers were engaging with the most.  Although we could get that by running a one-off report, we wanted to be able to dig into that data for different segments of customers based on their language, country, purchase decisions, or a/b test variations, across various periods of time.

To do this we built an interactive heatmap tool to easily show us where the “hot” and “cold” parts of our pages were — where customers clicked the most, and where they clicked the least.  The tool overlaid this heatmap on top of the live site, so we could see the site the way users saw it, and understand where most of our customers’ clicks took place.  Since customers view our site in many different screen resolutions, we wanted the heatmap tool to account for the dynamic nature of web layouts and show us heatmaps for any size viewport that our site is used in.


Shutterstock’s heatmap tool running on our home page

The main technologies used to build our interactive heatmap tool were our click tracking system, Lil Brother, Apache Solr, and Heatmap.js.  Lil Brother tracks every click a user makes on our site, along with the x,y coordinates of the cursor, the page element clicked, and some basic info about the customer (country, language, browser, and a/b test variations).

Solr provided the means to filter and aggregate our click data.  The way in which we were using Solr is described in more detail in our post Solr as an Analytics Platform.  In this case, we indexed each click event as a separate document in Solr along with all the customer metadata linked to it.

Our schema.xml file contained the following fields:

<field name="mouse_x_y" type="string" indexed="true" />
<field name="page_url" type="string" indexed="true"/>
<field name="country" type="string" indexed="true"/>
<field name="language" type="string" indexed="true"/>

Once we generated our Solr index, we needed to build a query to get the data for our heatmap.  To do this we ran a facet query on the mouse_x_y field.  This gave us a histogram of the number of clicks in each position on the page (we rounded the coordinates to the nearest 10 pixels in order to group clicks into reasonably sized buckets).  Once we had the number of clicks per bucket from Solr, we passed that data to Heatmap.js, which rendered the heatmap in the browser.
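As an illustration of that bucketing step (a sketch of the idea, not the toolkit’s actual code), rounding coordinates into 10-pixel buckets and counting clicks per bucket might look like:

```python
from collections import Counter

def bucket(x, y, size=10):
    # Round each coordinate to the nearest `size` pixels so nearby
    # clicks fall into the same heatmap cell, in the same "x_y" shape
    # as the mouse_x_y field above.
    return f"{round(x / size) * size}_{round(y / size) * size}"

clicks = [(103, 58), (97, 62), (340, 12)]
histogram = Counter(bucket(x, y) for x, y in clicks)
# The first two clicks land in the same "100_60" bucket.
```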

In order to run Heatmap.js on all the pages on our production site, we launched the app through a bookmarklet that loaded up the JavaScript and ran AJAX requests against Solr. The bookmarklet also exposed controls for us to select other parameters like country, language, and a/b variations so that we could drill down into specific groups of customers. As a bookmarklet, the tool was able to detect settings like the browser viewport size and display the heatmap based on those dimensions.

Since we developed the heatmap tool, our designers and product specialists have been using it to identify elements of our site that could be optimized – either by changing or removing some elements – to better serve customers’ needs.  Knowing that nearly all of our customers interact with the search bar helped to steer our design to make it the most prominent element on the page above the fold.  Knowing that many of the links lower down on the page were not used as often helped us make the decision to redesign that area and put more valuable discovery paths there for customers.

In order to help out folks who are interested in building an interactive heatmap tool for their own sites, we’ve open sourced the Shutterstock Heatmap Toolkit.  The toolkit lets you run a Solr instance and web server, and includes a batch of sample data to try it out on.

You can run the tool on your own data too by creating a JSON file of individual click events, where each event includes the mouse x/y coordinates and any other attributes such as the page element clicked and information about the user (the toolkit itself contains a sample set of data that you can base yours on).  A script is also included to start Solr, build an index, and run the web server that powers the heatmap app itself.  Follow the steps in the README on GitHub to try it out on the example data.
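For example, a single click event in that JSON file might look like the following sketch — the field names follow the schema.xml excerpt above, the values are purely illustrative, and the exact format the toolkit expects should be checked against its README:

```python
import json

# Hypothetical click event; field names mirror the schema.xml excerpt.
event = {
    "mouse_x_y": "100_60",
    "page_url": "/",
    "country": "US",
    "language": "en",
}

# One JSON object per click event.
line = json.dumps(event)
```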

Being able to visualize and dig into our customers’ interactions with our site has provided valuable insight for our designers, product specialists, and developers.  Having the ability to navigate and dig into different slices of this data in real time is highly valuable if you want your product team to be nimble and have answers to questions as quickly as they can ask them.


The Secret to Shutterstock Tech Teams

Being fast and nimble is important to us at Shutterstock, and one way we accomplish this is by working in small teams.  This approach has yielded tremendous benefits over the years, but it comes with its own challenges: Shutterstock now has over 300 people and dozens of teams.  How do we coordinate everything with so many different groups?

Here’s a bit of information about how our approach to small teams has evolved, and how we continue to change it as we grow.

The Early Days

About five years ago, we learned the value of small teams the hard way — by not having small teams.  Shutterstock started with a few developers who would work on a few different projects at any given time.  We followed that approach as we grew the team, until suddenly we had 10 developers working on 10 projects, and nothing was getting done.

We addressed this problem by breaking into smaller teams.  Each team has a product owner, a few back-end developers, a front-end developer, a designer, and a QA engineer.  The teams are  meant to be independent and autonomous, capable of taking any project from idea to completion without outside help.  This lets them move very quickly and stay focused on their goals.

We started with three teams: a customer team, a contributor team, and a “business value” team that was meant to focus on internal projects that bring value to the business.

Lessons Learned

The customer and contributor teams got off to a great start, and exist to this day.  But the “business value” team floundered, and we learned some important early lessons about teams:

  • Each team needs a clear customer (or more generally, a clear target user).  The team has to come to work every day excited to solve a problem for a particular audience.  “Business value” was just too vague; there were many audiences within Shutterstock that needed projects, but there was no good way to prioritize one project over another.  Consequently, the team was tossed from project to project, and often ended up doing things that the other teams simply didn’t want to do.  After a few quarters, we decided to dissolve the team.

  • Each team needs a clear goal.  Our customer team was living a schizophrenic existence: half its projects were focused on improving the search and download experience for our customers, the other half were about working on the signup and subscription flow to increase revenue.  We addressed this by splitting the team in two.  We decided that our “customer experience” team should stay focused on the primary customer experience on our site.  The revenue team took over the signup and payment flow.  After the split, this distinction came to feel more and more natural, and we look back on it as moving us in a better direction.

  • Teams need to be autonomous.   As Shutterstock grew over the years, we were able to expand our offerings by creating new teams.  Sometimes we’d assemble a team without making sure it had every role it needed — perhaps it would only have one developer or not have a QA engineer.  We always ended up regretting this.  The process only works well if the team can truly be independent and autonomous.  Now we know: if we don’t have enough people to form a team, we wait until we can hire all the roles before we launch it.

Core Services Teams

As we added more product-oriented teams, there was a growing need to build common architectural pieces that all the teams could use.  We decided several years ago to move towards a RESTful architecture, and soon many teams jointly used back-end services to support their product.  But ownership of the services was problematic.  If a service needed changes, it was unclear who was responsible for making that happen.

We solved that problem by introducing the latest evolution of our team strategy: core services teams.  Each of these teams owns one or more RESTful services and works with the product teams to prioritize its work.  The goal is to build core infrastructure that other teams can leverage to serve their customers.

The Challenge of Coordination

Today, Shutterstock has over 20 teams, all of which follow agile development practices of fast iteration and frequent customer feedback.  With so many teams moving so quickly, coordination has become a challenge.  This is partly addressed by returning to a core team principle: strive for autonomy and independence.  We encourage teams to pursue projects that are within their power to take from idea to completion without outside help, which eliminates the need to coordinate altogether.

However, there are inevitably projects that require multiple teams to work together.  In those cases, we promote four ways to improve coordination:

  • Each of our teams has a planning meeting every two weeks.  Anyone can attend these meetings, and we encourage teams that are working together to attend each other’s planning meetings.

  • Each of our teams also has a demo every two weeks, in which they show off the work they’ve done recently.  We also encourage teams that are working together to attend each other’s demos.

  • We have a weekly product backlog meeting, where all our product teams share upcoming projects and discuss metrics related to recently-launched features.

  • Finally, each team has a lead developer and a product owner, and we give them the specific responsibility of proactively reaching out to other teams to discuss upcoming work.

These approaches are intentionally lightweight and simple.  We rely on people’s own initiative to share their work, communicate actively with others, and work out the details themselves to address many challenges of coordination.  Having a non-prescriptive process makes it clear to people that it’s their responsibility to talk to whomever they need to.  So far, this approach has worked out well.

We’ll continue to evolve and adapt our team strategy as we grow.  Though we’ve had some minor challenges with our approach over the years, overall it has served us very well.  We’d love to hear from others about their team-building lessons.  What has worked well for you?  How have you changed your approach as your company grew?  Let us know in the comments below.



When a Space Is Not Just a Space

During a recent email exchange with our search team, Nick Patch, our resident Unicode expert, offered the following advice for a chunk of Java code used to detect Japanese characters:

> Pattern.compile("^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$");

We should use one of the following options instead:

Pattern.compile("(?U)^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$");

Pattern.compile("^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$", Pattern.UNICODE_CHARACTER_CLASS);

They all do exactly the same thing: match any Unicode whitespace instead of just ASCII whitespace. This is important so that the pattern also matches U+3000 IDEOGRAPHIC SPACE, which is commonly found in CJK text.

By default, the predefined character class \s matches only ASCII whitespace, while \p{IsWhite_Space} matches Unicode whitespace. When Unicode character class mode is enabled, \s works just like \p{IsWhite_Space}, and the same ASCII-to-Unicode upgrade applies to \d, \w, \b, and their negated versions. Unicode character class mode can be enabled with Pattern.UNICODE_CHARACTER_CLASS or by starting the regex with (?U). Predefined character classes that were defined with Unicode semantics in the first place, like Unicode property matching with \p{…}, behave the same in either mode.

Nick’s insightful reply left me full of questions, so I sat down with him to get some more details.

So there are different kinds of spaces in Unicode?  What’s up with that?

There are lots of different character encodings out there, and different ones have encoded characters for different types of spaces.  Some of these have been for traditional typographical use such as an “em space,” which is the width of an uppercase M in the font that you’re using.  Another one is the hairline space, which is extremely thin.  And then in CJK (Chinese, Japanese, and Korean) languages, there’s an ideographic space, which is a square space that is the same size as the CJK characters, whether it’s hanzi in Chinese, kanji in Japanese, etc.

If you were to create a character encoding from scratch—say you were going to invent Unicode—and not care about backward compatibility with any existing encoding, you would probably just have one space that’s the whitespace character.  But we do have to have compatibility with lots of historical encodings so that we can both take that encoding and transform it into Unicode and then back, or so we can represent the same characters that we formerly represented in our old encoding.

How many different kinds of spaces are there in Unicode?

Twenty-five different characters have the White_Space property in Unicode 6.3.  Any regular expression engine with proper Unicode support will match these and only these characters with \s.  It can also be more explicitly matched with \p{White_Space} or \p{IsWhite_Space}, depending on the regex engine (Perl and ICU4C use the former while Java uses the latter).

Do different spaces have different meanings?

Most of the spaces you’ll find are just for width or formatting.  Ideally, you don’t want to perform document layout on the character level.  Instead, it’s better to do that with your markup language or word processor—say, CSS if you’re using HTML—and you’d just stick with the standard space character within your text.

But there are a few space characters that have interesting rules to them, like the “non-breaking space,” which forces line breaking algorithms to not provide a break opportunity for line wrapping.

Alternately, newline characters are a form of whitespace that designate a mandatory line break.

How do CJK languages use spaces?

In most cases, CJK languages don’t use spaces between ideographs.  You’ll often see a long series of characters without any spaces.  If you’re able to read the language, you’re able to determine the word boundaries.  But there is no computer algorithm that can precisely detect CJK word boundaries.  We have to use a hybrid approach that’s based more on a dictionary than an algorithm, and it’s never going to be perfect.  The only perfect way is to sit a human down and have them read the text, which makes it difficult for us to figure out what the words are within a search query.  In CJK characters, one ideograph can be a word, but also a series of multiple ideographs can be a single word.  It’s a tricky problem to determine the boundaries.

How does Unicode define a space?

In Unicode, every character has a set of properties.  So it’s more than just an encoding scheme for characters; it has defined metadata for every character.  For example, “Is this character a symbol?  A number?  A separator?  Is it punctuation?  Or alphabetic?  Or numeric?”  It also has rules around the type of character — so if it’s a letter, “What’s the uppercase version?  What’s the lowercase version?  What’s the title case version?”

With whitespace, there’s a boolean property called “White_Space.”  Additionally, there’s a property called “General_Category,” and every character has a value for this property.  Examples of the values are “letter,” “number,” “punctuation,” “symbol,” “separator,” “mark,” and “other.”  But there are also subcategories, and one of the subcategories of “separator” is “space separator,” which is given to any character which is specifically used as a space between words, as opposed to lines or paragraphs.  So there are programmatic ways to determine not just, “What is whitespace?” but “How is it used?”

How do different regular expression engines handle different kinds of spaces?

Traditionally, regex engines only understood ASCII characters, where the whitespace characters include just one space character plus the tab and newline characters.  Then, regular expressions started to support Unicode.  Some of them started treating all matches with Unicode semantics, so that if you’re matching on whitespace, now you would match on any Unicode whitespace (which includes ASCII whitespace).

Other ones, for backward compatibility, continue to match only on ASCII whitespace and provide a “Unicode mode” that will allow you to match on any Unicode whitespace.  That’s what Java and many languages do, whereas some of the dynamic languages like Perl and Python 3 have upgraded to Unicode semantics by default and provide an optional “ASCII mode.”

Unfortunately, regex engines that default to ASCII semantics make it increasingly difficult to work with Unicode, because every time you want to execute a regular expression against a Unicode string, you have to put each regex in Unicode mode.  In ten years, this will seem very antiquated.
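Python 3 makes a handy illustration of the Unicode-by-default behavior Nick describes; its optional ASCII mode is enabled with the re.ASCII flag (or an inline (?a)):

```python
import re

# U+3000 IDEOGRAPHIC SPACE between two CJK words.
text = "日本\u3000語"

# Python 3 str patterns use Unicode semantics by default, so \s
# matches the ideographic space.
default_match = re.search(r"\s", text)

# re.ASCII opts back into ASCII-only \s, \d, \w, so no match here.
ascii_match = re.search(r"\s", text, re.ASCII)
```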

Fascinating!  Thanks, Nick!



Interview with a CodeRage Finalist: Dave K and Projector

Every quarter, the tech team at Shutterstock holds CodeRage, a 24-hour hackathon where we’re encouraged to work on any project that can bring value to the company.

This quarter, one of the winning projects was called Projector. It’s a web app that lets you turn your webcam into a projector to show drawings and diagrams to other people. Here’s a quick demo of how it works:

Dave K, the lead engineer on our footage team, wrote the app. I interviewed him about the project.

What problem were you trying to solve?

I was onboarding one of our new developers in our Denver office, and I was in New York, and I wanted to show him how our different systems were set up and how they work together. I thought, “Wow, every time I onboard somebody I usually go to a whiteboard and sketch out how this works, because it’s so hard to talk about it without diagrams.” I really just wanted to give him a quick sketch of how our servers are set up. I couldn’t find any good online solutions for drawing with a touchpad, and if you point a webcam at a whiteboard it’s really hard for the person on the other side to see anything.

So a few days later I was thinking it’d be cool if we could use a webcam to show a facsimile of a piece of paper you have in front of you to someone across the country and be able to make changes in real time.

How did you approach the problem?

Every hackathon, I think of a problem beforehand — I take hackathons very seriously! — and I need to have a clear project that I’m going to be working on. A lot of times I’ll try to stay on top of Javascript libraries and HTML5 features because that stuff really interests me. Then, if I try to approach a problem like this, I’ll try to think of what technologies are available to use, and sort out in my head a little bit how I’ll do it.

For this project, the main problem was how to detect where the piece of paper was. So I brainstormed a bit about that. But sometimes it doesn’t work out the way you expect. This was a perfect example — the initial plan I had didn’t work at all, and I had to re-formulate it and sleep on it to find a better solution.

What was the first approach you tried?

Well, I needed to detect the edges of the piece of paper. Originally, I was going to have a setup where you made four black dots on a piece of paper. Then, I was going to try converting the webcam image to black and white, and then detect every shape on the page. Any shape that was touching the corner of the image, I’d delete from the shapes that I’m looking at. And then I’d try to discover the shapes that were closest to the corners of the frame because those would probably be my black reference dots.

Part of trying to solve that problem was to write a fill algorithm, and so I created a structure of every black pixel on the page and then I’d loop through the pixels and try to determine if it was part of a bigger shape based on neighboring pixels. I wrote it as a recursive function, and although it worked on a small scale — like a 10×10 pixel image — on a bigger image I was getting a stack overflow — it was just using too much memory.

So when that didn’t work out, I looked online for different fill algorithms, and one of them was a flood fill algorithm which was supposed to be more performant. I was able to tweak some open source code that I found to get that working, but on a big image it would still crash from using too much memory. It was kind of upsetting because I spent a whole night getting that to work — going down this one rabbit hole. So I thought, “I should just go home and go to sleep.” It was about midnight at the time.
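As an aside, the usual way around that recursion limit is to manage an explicit stack of pixels yourself.  A minimal sketch in Python (not the project’s actual code, which ran in JavaScript) might look like:

```python
def flood_fill(grid, start, target, replacement):
    # Iterative flood fill: an explicit worklist stack replaces the
    # call stack, so large regions cannot overflow recursion depth.
    if target == replacement:
        return grid
    rows, cols = len(grid), len(grid[0])
    stack = [start]
    while stack:
        r, c = stack.pop()
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == target:
            grid[r][c] = replacement
            # Visit the four neighboring pixels.
            stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return grid

image = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
# Fill the connected region of 1s touching the top-left corner with 2s.
flood_fill(image, (0, 0), 1, 2)
```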

The last thought that entered my mind was, “Wait. Why make these dots black? What happens if we add a color in there?” Then you could do a simple color detection based on different quadrants of the image. And then I felt a little better going to sleep with that idea in my head. The next morning I woke up and just focused on that and it seemed to work pretty well.

What third-party libraries did you use for the project?

A lot of this is reliant on these new, awesome features available in HTML5. One of the things that was really crucial was the getUserMedia() HTML5 function. That lets the browser get access to your webcam. Then I used some of the canvas manipulation tools. HTML5 lets you draw an image to the canvas, and then you can get RGBA values for every pixel on that canvas so you can determine what color something is. You basically have an array you can loop through, and that’s how I’m able to find the green pixels.
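A rough sketch of that pixel loop — in Python for brevity; the real code runs in JavaScript against the flat RGBA array from getImageData(), and the threshold value here is an illustrative guess — might look like:

```python
def find_green_pixels(rgba, width, margin=80):
    # rgba is a flat [r, g, b, a, r, g, b, a, ...] array, the shape
    # canvas getImageData().data uses.  A pixel counts as "green" when
    # its green channel dominates red and blue by more than `margin`.
    hits = []
    for i in range(0, len(rgba), 4):
        r, g, b = rgba[i], rgba[i + 1], rgba[i + 2]
        if g - max(r, b) > margin:
            idx = i // 4
            hits.append((idx % width, idx // width))
    return hits

# Two pixels in a 2x1 image: solid red, then solid green.
pixels = [255, 0, 0, 255, 0, 200, 0, 255]
found = find_green_pixels(pixels, width=2)
```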

The other library I used is BinaryJS, which lets you send and receive streaming binary data over web sockets. It uses a compact serialization method to make that as efficient as possible. I also had to use a polyfill for Canvas’ toBlob() method, which turns an image into raw binary data so that BinaryJS can segment the packets. It’s not implemented in mainstream browsers yet, so the polyfill allowed me to use it in browsers that did not already have support.

I used ImageMagick for server-side image processing, and ran a threshold function on the image so everything that fell into the lighter 50% of a black and white image turned to white, and everything in the darker 50% turned to black. That makes it easier to create a facsimile of the image. The place I got the idea for that was from an app called JotNot Pro, which lets you scan documents by using the camera on your smartphone. It uses a similar approach to thresholds to make the scanned text clearer.
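The threshold step itself is simple to sketch.  Illustrative Python below — the project used ImageMagick’s threshold function rather than hand-rolled code:

```python
def threshold(gray_pixels, cutoff=128):
    # Map every grayscale value (0-255) to pure black or pure white:
    # the lighter half becomes white, the darker half black, similar
    # in spirit to a 50% threshold in ImageMagick.
    return [255 if p >= cutoff else 0 for p in gray_pixels]
```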

The other thing I used ImageMagick for is perspective distortion. ImageMagick has a perspective distortion function that lets you take four points and map them to new coordinates, which is really neat because I can take those four control points (the green dots) and map them to the corners of the viewer to flatten the image.

What’s nice is that if I’m holding a piece of paper, no matter how I’m holding it, it keeps it in place so it doesn’t jump around. It also makes it so that it doesn’t look squashed.

Have you thought of open sourcing this project?

Yeah, I have to clean it up a bit and make it a little more practical to use, but then I think we could release it.

