Critical Skills You Won’t Learn In School

Here at Shutterstock, we’ve increased our efforts to reach out to college students and new grads. After participating in outreach throughout my career as a software engineer, I’m amazed at how unprepared most college grads are when entering the software industry. Most companies care less about your degree, major, or GPA and more about your raw talent. The process of obtaining a Bachelor’s degree still has value: it exposes you to new ideas and, most importantly, teaches you how to learn.

To gain real development skills, you must either find time outside of class or take control of your curriculum. Here are some things to take advantage of that you won’t learn in school:  

Work on projects outside of class

When interviewing new grads for junior developer roles, we want to know how a potential candidate will function outside of a structured academic environment. While it’s great if you did a bunch of school projects and passed all of your exams, we’re more interested in what you did on your own, from start to finish. We’re less interested in projects that came with the class and more interested in projects where you came up with the idea yourself. If your school offers independent study, there’s a good chance you could wrangle some credit for working on your own project. That’s fine too; the point is that it should be something you owned from beginning to end.

Get real life software development experience

There are so many things you simply are not taught in school. You don’t learn how to package, distribute, and deploy a project. You don’t really learn best practices for writing unit tests, integration tests, or load tests, for that matter. You get limited exposure to teamwork with a source management tool or a continuous deployment environment.

Most internships are either too structured or not structured enough: either you’re told what to do at a very granular level and don’t really get to think for yourself, or you’re given no guidance at all and left completely to your own devices. Both situations happen in real jobs, and having enough experience to deal with them is very important. Also, most internship work is proprietary, so there is no way to show future employers the nature or quality of your experience.

The easiest way to get experience is to get involved with Open Source projects. The advantage is that you get experience working with teams, especially remote teams, and all of your work is public, so you can share code samples with future employers. You can keep contributing to your open source projects after you’ve found work, and they can also help fill any employment gaps you might experience.

This is a talk I’ve given about getting started with Open Source; while it’s targeted at the Perl community, many of the tips are universal. In addition, I do a short tutorial on making a pull request on Github.

 Here are some great projects to get started on that offer a bit more structure and guidance:

As a woman in this industry, when starting out, I found myself especially isolated and often wishing for more female peers. These groups offer the opportunity to connect with other women engineers and special programs focused on getting more women involved in Open Source.

Use Git and Github for EVERYTHING you do

Whether it’s a coding project for a college class or homework for a free online class, always use source control with your code. Find a place to host your repositories; Github happens to be one of the most popular and it’s pretty easy to use. I have plenty of colleagues who don’t even have a resume; they just send hiring managers a link to their Github profile. In the end, we really don’t care where you’ve worked; we care about what you’ve done and how you did it.

Write tests and understand what coverage means

Some college courses might introduce the notion of testing, but it’s up to you to apply it. Get in the habit of writing tests every time you write code. When you have a homework assignment, first write unit tests that exercise the code you’re about to write and prove whether your answers are correct. Then write the code. Get in the habit of doing this early and your life as an engineer will be much easier. For whatever language you’re using, learn the test framework. Want to impress the interviewer? When they propose a problem for you to whiteboard a solution to, first write out (or at least talk about) how you’d test it. I guarantee it’s one of the few times they’ve seen any candidate do that, and very likely the first time they’ve seen a new college grad do it.
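
To make that concrete, here’s a minimal sketch of the test-first habit using Python’s built-in unittest module; the Fibonacci function is just a stand-in for whatever your assignment actually asks for.

# A minimal test-first sketch using Python's built-in unittest module.
# fib() is a stand-in for whatever the assignment asks you to implement:
# write the assertions first, watch them fail, then write the code.
import unittest

def fib(n):
    """Return the nth Fibonacci number (0-indexed)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

class TestFib(unittest.TestCase):
    def test_base_cases(self):
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)

    def test_known_values(self):
        self.assertEqual([fib(i) for i in range(8)], [0, 1, 1, 2, 3, 5, 8, 13])

if __name__ == "__main__":
    unittest.main()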

Learn to profile your code

You’ve learned about big O notation in your classes, but in the real world you rarely have full access to evaluate every method being called by all the libraries you’re using. A code profiler is a tool that runs your application and identifies “hot spots”: areas that took proportionally longer to run than other parts of your program. These are relative to the total running time within the profile results, so something highlighted doesn’t necessarily mean it took a long time in absolute terms. Many languages offer tools for code profiling. SQL queries can be analyzed using “explain” or query analyzer tools, depending on your database. There are also end-to-end load testing tools. Whatever you are using, learn how to run a profiler or some type of analyzer and evaluate the results. Start learning how to find the bottlenecks; I guarantee that’s going to be what you spend a good amount of your time as an engineer doing.
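
As a starting point, here’s a minimal sketch using Python’s built-in cProfile and pstats modules; the functions are invented purely to give the profiler an obvious hot spot to surface.

# A minimal profiling sketch with Python's built-in cProfile module.
# slow_report() is a made-up function standing in for your real code;
# sorting the stats by cumulative time makes the hot spot (the nested
# loop in build_counts) stand out.
import cProfile
import pstats

def build_counts(n):
    counts = {}
    for i in range(n):
        for j in range(n):
            counts[(i + j) % 10] = counts.get((i + j) % 10, 0) + 1
    return counts

def slow_report(n=500):
    counts = build_counts(n)
    return sorted(counts.items())

profiler = cProfile.Profile()
profiler.enable()
slow_report()
profiler.disable()

# Print the ten most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)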

Explore engineering flavors

When someone asks what kind of role you’re looking for when you graduate, just saying “software engineer” or “programmer” isn’t good enough. All that shows is that you don’t really know the industry, and it doesn’t help the interviewer get to know you. The field of software engineering is huge and there are a lot of areas you can focus on. My advice is to explore the different roles that are out there so you can see what you like best. For starters, there are Front-End Engineers who work mainly with user interfaces, Back-End Engineers who work with data stores and the APIs that those user interfaces consume, and Full Stack Engineers who work with both. In addition, there are Big Data Engineers, Search Engineers, Algorithm Engineers, DevOps Engineers, Release Engineers, and even Software Engineers in Test (QA Engineers).

You don’t have to pick one necessarily, but at least talk about your knowledge of the different types of roles you think might be interesting and explain how they line up with your interests. You’ve spent the past however many years studying computer science; there’s got to be an area that particularly interests you. Don’t try to tell the interviewer something you think they want to hear. If you’re looking at a backend software engineer role for a financial institution, saying the class you took on Embedded Systems programming was your favorite won’t take you out of the running. It opens up a conversation about what you liked about that subject and lets the interviewer get a better idea of what your professional interests are.

Simply having interests is a good start. Let the interviewer decide if your interests are a good fit for things the company is doing. There might be a team doing work in your area of interest that simply hasn’t considered hiring a junior candidate. It also tells the company up front that if they start you in a certain role, they should also have a growth path for you in the area you’re interested in (and you should look for this as well).

Get comfortable with ambiguity

Answers aren’t always easy or straightforward. Most of the time, you’re faced with a problem and there isn’t a professor sitting at the end of the journey ready to grade you on whether your solution matched theirs. Most of the time, there is no solution except for the ones you can come up with.

One of my favorite categories of questions to ask in interviews is what I call “troubleshooting questions”. These are vague, open-ended questions that are meant to show me how a person really thinks on the spot. When I was in college, I worked for my university’s housing internet department. My job was to hook students’ computers up to the school’s network. This job didn’t have a lot of structure or supervision; we needed resourceful people who could solve problems on their own.

When we interviewed candidates, the question that really told us how the potential candidates would do was this: “You’ve just hooked a computer up to the network. It’s not connecting to the Internet. What do you do?”

There is no right answer, but with every answer the candidate gave, we responded with “that didn’t work; the computer still isn’t online.” If they asked clarifying questions, we made something up based on the most common case, keeping the focus on troubleshooting the main issue. Sometimes you try everything you know and you still can’t get it to work. That’s life. Being able to deal with uncertainty is one of the most important skills you need as an engineer.

Think in terms of the problem, not the answer

A common mistake, even among veteran engineers, is to decide on a solution and make the problem fit into that solution. Never do that. Always start with the problem. Always ask “What is the problem we’re trying to solve?” and then determine how you will consider the problem solved. This is called defining your criteria for success.

Even in class as you’re learning new computer science concepts, always ask yourself, “what problem is this solving?”. Being able to explain your solution in terms of the problem is a skill that will not only earn you the respect of your mentors but also endear you to your non-technical peers (and your professors!).

One way to do this is to start thinking in terms of metrics. I’ve had many an issue come my way that consisted of “The website is slow, we need to make it faster.” First we need details. What specifically is slow? What are the conditions to reproduce the exact scenario being reported? What are our current metrics (i.e., how long does page X take to load)? So our problem actually becomes: the user profile page takes 3 seconds to load on average. Our criteria for success should then be something like: the user profile page should take 1 second to load on average. Once you are thinking in terms of the problem and in terms of metrics, solving the problem (and preventing it in the future) is much more straightforward.
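
As a rough sketch of what that looks like in practice, here’s a small measurement script using the third-party requests library; the URL and the 1-second threshold are hypothetical stand-ins for whatever your problem statement names.

# A minimal sketch of turning "the site is slow" into a metric you can track.
# The URL is hypothetical; the idea is to measure the thing named in your
# criteria for success (average profile-page load time) rather than guessing.
import statistics
import time
import requests

URL = "https://example.com/user/profile"  # hypothetical page under investigation
TARGET_SECONDS = 1.0                      # our criteria for success

samples = []
for _ in range(20):
    start = time.monotonic()
    requests.get(URL, timeout=10)
    samples.append(time.monotonic() - start)

average = statistics.mean(samples)
print(f"average load time: {average:.2f}s (target: {TARGET_SECONDS}s)")
print("PASS" if average <= TARGET_SECONDS else "FAIL")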

Final Thoughts

Don’t think you need to learn all of these things in order to be marketable. As I said earlier, just knowing a little bit about one or two of them (or even knowing that they exist) is a good step in the right direction. A career in software engineering is a lifelong learning process. Make learning more about these areas in depth a goal for yourself as you progress in your career. Your goal isn’t to have a list of buzzwords on your resume that you can rate yourself a 10 on; it’s to have a strong foundation and a broad toolset to help you adapt as technologies, languages, and the industry change.

 


12 Questions To Ask About PCI

Organizations either breeze through or struggle with PCI certification. The struggle parallels a fight against zombies: you must stay on your toes, and once they start coming toward you they don’t stop; as your teammates deal with their own zombies, you realize you can’t keep up. The challenge doesn’t stop there. This poem by Sam Kassoumeh sums PCI up in a nutshell. You must manage tech debt and legacy access control rules while competing for attention from developers and operations. How does anyone get through this?

Never stray from your main goal. Certification is the immediate point of the program; it is supposed to reassure customers and partners that due diligence keeps their financial data safe. Remember that PCI is a means to an end, not a goal in itself. The PCI process is supposed to make you think about how to handle sensitive data in a general way. Nobody would argue that a single certification, piece of paper, or audit is enough to protect the organization. Build a process and worldview you can work with.

The PCI Level 3 document is 112 pages long with 4 appendices and 12 sections. It sounds daunting, and it can be if your approach to security is ad hoc. You will scramble to figure out what is covered, where covered assets are, who has access to those assets, and maybe even what the term “asset” really means. In the middle of an assessment you find yourself questioning the meaning of just about everything, including:

  • Security strategy  - Do I really have a coherent strategy?
  • Tools used
  • How to store data

So what can you do to make PCI compliance achievable on that big day? Start today and think about:

  1. What customer data do we need to hold onto? For how long?
  2. How do I dispose of storage and printouts that have this data?
  3. Does everyone who needs to access this information have 2-factor authentication?
  4. Is the pathway to this data secured by encrypted connections (e.g. HTTPS)?
  5. Is it possible for an insider or intruder to see sensitive data through some other segment of our network?
  6. Is sensitive data only available to people and apps that really need it?
  7. If someone’s access level changes, would I know about it?
  8. If a related network rule changes, would I know about it?
  9. Am I keeping up with patches on high priority servers?
  10. Am I monitoring for and/or alerting on suspicious traffic?
  11. Do I know what ciphers we use?
  12. What’s our process for offboarding employees with access to sensitive data?

If you store data about customer transactions unrelated to credit cards (which are the domain of PCI), is it really a stretch to treat that data with the same care? Why encrypt credit card information but not bank account numbers? Why mask part of the credit card number and not a customer’s address? An address can be used for identity theft, too.

This isn’t to say you should encrypt or mask everything everywhere. The point is to consider it. Maybe you don’t need to store so much data. Maybe you can define your network and application access rules earlier, paying attention to the areas of the network that hold personal information.

This is just the beginning, but if you ask yourself these questions early, you can construct a strong strategy, which is the true end goal of the PCI compliance process.

Once you know what you’re looking for, use any resource you find helpful. For example, IBM has posted a guide on the importance of complying with PCI DSS 3.0 Requirement 6; you can view the guide here.

What are your thoughts on PCI? Be sure to comment below.


Stop Using One Language

In any technology company, one of the fundamental aspects of its identity is the technology stack and programming language it’s built on.  This defines the types of tools that are fair game and, more importantly, the types of engineers who are hired and capable of succeeding there.

Back in the middle of the last decade, when Shutterstock had its beginnings, the tech team was made up primarily of die-hard Perl developers.  The benefits of CPAN and the flexibility of the language were touted as the primary reasons why Perl was the right tool for anything we wanted to build.  The only problem was that our hiring pool was limited to people eager to work with Perl, and although the Perl folks who joined us were indeed some of our most passionate and skilled engineers, there were countless engineers outside the Perl community whom we ended up ignoring entirely.

Fast forward to the last few years, and Shutterstock has become a much more “multilingual” place for software engineers to work.  We have services written in Node.js, Ruby, and Java; data processing tools written in Python; a few of our sites written in PHP; and apps written in Objective-C.

Even though we have developers who specialize in each language, it’s become increasingly important that we remove the barriers to letting people work across multiple languages when they need to, whether it’s for debugging, writing new features, or building brand new apps and services.  At Shutterstock, there have been a few strategic decisions and technology choices that have facilitated our evolution to the more multilingual programming environment and culture we have today.

Service Oriented Architectures

One of the architectural decisions we made early on to support development in multiple languages was to build out all our core functionality into siloed services.  Each service could be written in any language while providing a language-agnostic interface through REST frameworks.   This has enabled us to write each piece of functionality in the language most suited to it.  For example, search makes use of Lucene & Solr, so Java made sense there.  For our translation services, Unicode support is highly important, so Perl was the strongest option for us.

Common Frameworks

Between languages there are numerous frameworks and standards that have been inspired or replicated by one another.  When possible, we try to use one of those common technologies in our services.  As mentioned above, all of our services provide RESTful interfaces, and internally we use Sinatra-inspired frameworks to implement them (Dancer for Perl, Slim for PHP, Express for Node, etc.).  For templating we use Django-inspired frameworks such as Template::Swig for Perl, Twig for PHP, and Liquid for Ruby.  Using these frameworks helps flatten the learning curve when a developer jumps between languages.
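
As a rough illustration of the shared idiom, here it is in Python with Flask; Flask isn’t one of the frameworks named above, but it follows the same Sinatra-style pattern of declaring routes right next to their handlers, which is what makes hopping between these frameworks relatively painless.

# A minimal sketch of the Sinatra-style micro-framework idiom, shown in Python
# with Flask. The route and payload are illustrative only, not one of our
# actual services.
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical REST resource: fetch one image's metadata by id.
@app.route("/images/<int:image_id>", methods=["GET"])
def get_image(image_id):
    return jsonify({"id": image_id, "status": "ok"})

if __name__ == "__main__":
    app.run(port=8080)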

Runtime Management

When it comes down to the nuts and bolts of actually running code in a particular language, one of the obstacles that blocks new developers from getting into it is all the technical bureaucracy needed to manage each runtime — dependency management, environment paths, and all the command line settings and flags needed to do common tasks.

The tool we use at Shutterstock to simplify all this is Rockstack.  Rockstack provides a standardized interface for building, running, and testing code in any of its supported runtimes (currently Perl, PHP, Python, Ruby, and Java).   Have a Java app that you need to spin up? Run “rock build” and “rock run”.  Have a Perl service you want a Java developer to debug?  “rock build”, “rock run”.

Another major benefit of using Rockstack is that not only do our developers get a standard interface for building, testing, and running code, but our build and deployment system only has to deal with one standard set of commands for those operations in any language.  Rockstack is used by our Jenkins cluster for running builds and tests, and our home-grown deployment system makes use of it for launching applications in dev, QA, and production.

One of the biggest obstacles for people jumping into a new language is the cognitive load of having to figure out all the details of setting up and working with the development environment for that language.  Once you remove that burden, people can actually focus their energy on the important engineering problems they need to solve.

Testing Frameworks

In order to create a standardized method for testing all the services we have running, we developed (and open sourced!) NTF (Network Testing Framework).  NTF lets us write tests that hit special resources on our services’ APIs to provide status information showing that the service is running in proper form.  NTF supplements our collection of unit and integration tests by constantly running in production and telling us if any functionality has been impaired in any of our services.
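
NTF’s own test format isn’t reproduced here, but as a rough sketch of the kind of check it performs (the URL and response fields below are hypothetical), imagine each service exposing a status resource that reports whether it is healthy:

# A hedged sketch of a status-resource check, not NTF's actual API.
# The /status path and the "status" field are hypothetical.
import requests

def check_service(base_url):
    resp = requests.get(f"{base_url}/status", timeout=5)
    resp.raise_for_status()
    body = resp.json()
    # Fail loudly if the service reports anything other than healthy.
    assert body.get("status") == "ok", f"{base_url} reports {body}"
    return body

if __name__ == "__main__":
    print(check_service("http://localhost:8080"))  # hypothetical service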

Developer Meetups

In addition to tools and frameworks, we also support our developers in learning and evolving their skillsets as well.  On a regular basis, we’ll have internal meetups for Shutterstock’s Node developers, PHP Developers, or Ruby developers where they give each other feedback on code in progress, share successes or failures with third-party libraries, and polish up the style guide for their language.  These meetups are a great way for someone new to a language to ask questions and improve on their coding skills.

Openness

Part of what makes it easy to jump into another language is that all the code for every Shutterstock site and service is available for everyone to look at on our internal Github server. This means that anyone can review anyone else’s code, or check out a copy and run it.  If you have an idea for a feature, you can fork off a branch and submit a pull request to the shepherd of that service.  Creating this sense of openness with our code helps prevent us from creating walled gardens, and encourages people to share ideas and try new things.

Challenges

Even though language-agnostic engineering comes with some nice benefits, it’s crucial to bring a modicum of pragmatism to this vision.  A completely language-agnostic environment may be idealistic and impractical.  Allowing developers to build services and tools in any language that interests them can lead to a huge amount of fragmentation.  Having 50 tools written in 50 different languages would be a nightmare to maintain, and would kill any opportunity for code reuse between them.  Additionally, with a greater breadth of technologies, it becomes much more difficult to have people on hand with the depth of knowledge needed to lead initiatives with them.

As a matter of practicality, we keep a list of Preferred Technologies which is broad enough to provide plenty of choice, but narrow enough so that we can trust we’ll have plenty of expertise on hand.  If a new technology is vetted and deemed valuable it will be considered for addition to this list.  However if one developer wants to go and write a new site in Haskell, they’ll probably be shot down*.

*we have nothing but respect for Haskell

Although we want to make it easy for all of our developers to work in any of our common languages, there’s always going to be a need for specialists.  Every language is going to have its nuances, buggy behaviors, and performance quirks that only someone with extensive language experience will be able to recognize.   For each of our preferred technologies, we have several people on hand with deep knowledge in it.

* * *

Since Shutterstock is built on a plethora of services, any one of our sites may be receiving data that came from something built in Perl, Java, Node, or Ruby.   If one of those sites needs an extra tweak in an API resource, it’s incredibly helpful when a developer can jump in and make the necessary change to any of those services regardless of the language it was written in.  When developers can work this way, it eases dependencies between teams, which helps the organization move faster as a whole.

Many of our strategies and tools are designed to give engineers more language-agnostic skills so they can work across multiple languages.  Whether it’s frameworks that share standards, build and runtime tools that work across languages, architecture strategies, or testing frameworks, having common approaches for all of these allows everyone in the organization to work together, instead of siloing themselves off based on language-specific skillsets.

As the world of programming languages becomes much more fragmented, it’s becoming more important than ever from a business perspective to develop multilingual-friendly approaches to building a tech company.  Some of the tools and processes we developed at Shutterstock have helped us move in that direction, but there’s a lot more that could be done to facilitate an environment where the tech stack of choice isn’t a barrier to bringing in talent.


Code snippets to calculate percentiles in databases

As a Datavis Engineer at Shutterstock, I dive into a lot of data every day and routinely answer questions regarding customer behavior, site security, and latency issues. I keep a list of SQL snippets to copy and paste into my work, and I’ve found that keeping a list of examples is easier than memorizing a bunch of similar-but-different SQL dialects.

Calculating percentiles comes up regularly in my work, and it’s also a popular source of confusion. So I’m going to break down the application of calculating percentiles for you. I find a business context helps me understand the tech behind the topic, so I’ve organized the queries into four case studies. I’ll state the situation, present the technology, give the query, then wrap it up.

Note: I am very specifically not comparing technologies. The internet is filthy with those posts. I am offering copyable code without the hype.

Also note: all data is fake. I show numbers to give a sense of the output to expect, not a scale for comparison.
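
Before diving into the databases, here is a plain-Python cross-check of what these queries compute, assuming the nearest-rank (“discrete”) definition that Vertica’s PERCENTILE_DISC uses; interpolating implementations can return values that fall between observations.

# Nearest-rank percentile: the smallest observed value at or above the
# requested fraction of the sorted data. A reference implementation to
# sanity-check the database results below.
import math

def percentile_disc(values, fraction):
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]

# Fake approval waits in minutes, in the spirit of the first case study.
waits = [30, 45, 60, 90, 1440, 2880, 5000]
print(percentile_disc(waits, 0.5))   # median -> 90
print(percentile_disc(waits, 0.9))   # 90th percentile -> 5000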

CASE STUDY, USING HP VERTICA: HOW LONG DOES A CONTRIBUTOR WAIT FOR THEIR PHOTO TO GO PUBLIC ON THE SITE?

SITUATION:

Shutterstock is a two-sided marketplace: we license crowd-sourced media. Contributors do not receive money for their images until they are published on our website. Tens of thousands of images are submitted daily, and they all need to be inspected carefully. Hence, an entire team is dedicated to inspecting images, and they aim to maintain a thorough and speedy process.

TECHNOLOGY:

We use Vertica for analytic duties. Like most columnar database technologies, Vertica’s row lookups are slow, but the aggregates are blazingly fast. It is the only pay-for-use database in this post.

QUERY:

Vertica’s flavor of SQL is fairly similar to MySQL’s, and it has analytic functions like other OLAP databases. Vertica has many built-in analytical functions; I use PERCENTILE_DISC() for this case study.

SELECT
  DISTINCT
  added_date,
  PERCENTILE_DISC(.9)
    WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
    OVER (PARTITION BY added_date)
    AS '90th',
  PERCENTILE_DISC(.5)
    WITHIN GROUP(ORDER BY datediff(minute, added_datetime, approved_datetime))
    OVER (PARTITION BY added_date)
    AS 'median'
FROM
  photo_approval_table
WHERE
  added_date >= current_date() - interval '4 day'

RESULTS (as a reminder, there are 1440 minutes in a day):

added_date median 90th
2014-01-01 2880 5000
2014-01-02 1440 6000
2014-01-03 2000 4800
2014-01-04 3000 5500

Half of the photos uploaded on January 1 took two days to show up on our website.  There is a big gap between the median and 90th percentile approval times on January 2. If this data was real, I would investigate why January 2 is different from other days.

 

CASE STUDY, USING APACHE HIVE: HOW MANY TIMES IN A DAY DOES A SPECIFIC ELEMENT GET CLICKED?

SITUATION:

We track the efficacy of our designs. Knowing how often an element is clicked gives us insight into what our customers actually see on a page. If customers click an element that is not a link, then we can make changes to our HTML.

TECHNOLOGY:

We store raw click data in HDFS. Hive is a familiar SQL interface to data stored in HDFS. It has a built-in PERCENTILE() UDF.

QUERY:

I begin by running an inner query to get customer click counts on the specific element per day. I wrap that inner query in the daily percentile main query to find percentiles of customer behavior. I need to build the inner query because PERCENTILE() does not accept COUNT() as its first argument.

SELECT
  day,
  percentile(count, 0.25),
  percentile(count, 0.50),
  percentile(count, 0.75),
  percentile(count, 0.99)
from
  (
  -- inner query
    select
      day,
      visit_id,
      count(*) as count
    from
      click_tracking_table
    where
      element = 'header_div'
      and page = '/search_results.html'
      and year = 2014 and month = 4
    group by
      visit_id,
      day
  ) c
group by
  day

RESULTS:

Sample data result of inner query:

   visit_id   | day | count
       1      |  1  |  5
       2      |  1  |  7
       2      |  2  |  9

All results:

day _c1 _c2 _c3 _c4
01 1.0 3.0 15.0 52.0
02 1.0 3.0 15.0 64.0
03 1.0 3.0 14.0 68.0

Judging by median click counts, _c2, customers click on a non-link element about three times in a session. Some click as many as fifteen times. Wow. The header_div element should be made clickable.

 

CASE STUDY, USING MySQL: ARE CUSTOMERS SIGNING UP FOR ACCOUNTS FASTER NOW THAN THEY WERE A YEAR AGO?

SITUATION:

Shutterstock’s B.I. team does an excellent job of analyzing marketing spends and conversions. Sometimes it is easier for me to get the data than pull an analyst away from their work.

TECHNOLOGY:

MySQL is a widely used transactional database. It’s fast and full-featured, but does not have built-in support for calculating percentiles.

QUERY:

I need to compare a rank, meaning a position in an ordered list, against a total count of rows. Position over total count gives me a percentile.

Complex queries are built inside out. This query starts with inner query t, which counts the total number of accounts per language. I join the per-language counts to the accounts table and pull out each customer’s signup date.

There are plenty of resources for calculating row-level ranks in MySQL around the web. The techniques boil down to setting variables in an ordered result set. Here, I order inner query r by language and keep track of the current language in @lang. When the language changes, @rank resets to 0 in the _resetRank column. Neat!

The outer query compares signup dates to milestone dates.

Given that we’re only looking at signups from the past year, under a constant signup rate we would expect the 25th-percentile signup to fall exactly nine months ago and the median six months ago. If 25% of signups happened before the nine-month milestone, the first quarter of the year saw “fast” signups. This query returns “gut check” data; it’s not rigorously tested or verified.

select
  r.language,
  datediff(
    date(max(case when percentile <= 25 then signup_datetime end)),
    current_date - interval 9 month
  ) AS '25th',
  datediff(
    date(min(case when percentile >= 50 then signup_datetime end)),
    current_date - interval 6 month
  ) as 'median',
  datediff(
    date(min(case when percentile >= 75 then signup_datetime end)),
    current_date - interval 3 month
  ) as '75th'
from
  (
    /* build ranks and percentiles */
    select
      a.signup_datetime,
      @rank := if( @lang = a.language, @rank, 0 ) as _resetRank,
      @lang := a.language as language,
      @rank := @rank+1 as rank,
      round( 100 * @rank / t.cnt, 3 ) as percentile
    from
      accounts a
    /* t counts rows per language */
    join
      (
         select
           language,
           count(*) as cnt
         from
           accounts
         where
           signup_datetime > current_date - interval 1 year
         group by
           language
       ) t on t.language = a.language
    where
      a.signup_datetime > current_date - interval 1 year
    order by
      a.language,
      a.signup_datetime
  ) r
group by
  r.language
order by
  min(case when percentile >= 25 then signup_datetime end),
  min(case when percentile >= 50 then signup_datetime end);

RESULTS:

+----------+------+--------+------+
| language | 25th | median | 75th |
+----------+------+--------+------+
| de       |  -18 |     -9 |    2 |
| en       |    0 |      0 |    0 |
| vu       |   82 |     54 |   39 |
+----------+------+--------+------+

German-language signups hit the 25th percentile 18 days early and the median nine days early, but the third quartile wasn’t reached until two days later than expected: German-language signups are slowing down. Vulcan, which was a slow trickle at the beginning of the year, boomed in the last three months; guess that booth at the convention worked out.

 

CASE STUDY, USING MONGO: HOW MANY TAGS DO ANNOTATIONS HAVE?

SITUATION:

As events happen on our site, domain experts post comments to our internal annotation service. Comments are tagged with multiple keywords, and those tags form the structure for our knowledge base. It is one way we can link homepage latency with conversion rates. In such a system, keyword breadth is highly important, so I want to know how many annotations keywords link together.

TECHNOLOGY:

MongoDB is a document store, NoSQL technology. It does not have out-of-the-box percentile functionality, but it does have a well documented MapReduce framework. Up until now, all the percentile solutions, even MySQL, have been a DSL; MapReduce is full-on programming.

QUERY:

“tags” is an array field on an AnnotationCollection document. I emit() each tag and sum the counts up in the reducer: basic MapReduce word counting. I inline the output of the MapReduce job, ‘{ inline : 1 }‘, to capture the results in an array of key/value tuples. I then sort the tuple array, smallest value first. Finally, I use the percentile times the total number of output records to get the index of the tuple.

mapper = function() {
  if ( this.tags ) {
    for ( var i = 0; i < this.tags.length; i++ ) {
      emit( this.tags[i],1 )
    };
  }
}

reducer = function(pre,curr) {
  var count = 0;
  for (var i in curr) {
    count += curr[i]
  };
  return count;
}

out = db.runCommand({
   mapreduce  : 'AnnotationCollection',
   map        : mapper,
   reduce     : reducer,
   out        : { inline : 1 }
});

/* sort them */
out.results.sort( function(a,b ) { return a.value - b.value } )

/* these percentiles */
[ 50, 80, 90, 99 ].forEach(function(v) {
   print( v + 'th percentile of annotations linked by keyword tags is ', out.results[Math.floor(v * out.counts.output / 100) ].value );
})

RESULTS:

Sample results from out:

...
		{
			"_id" : "german seo",
			"value" : 98
		},
		{
			"_id" : "glusterfs",
			"value" : 145
		},
		{
			"_id" : "googlebot",
			"value" : 2123
		},
...
	"counts" : {
		"input" : 711475,
		"emit" : 1543510,
		"reduce" : 711475,
		"output" : 738
	},

All results:

50th percentile of annotations linked by keyword tags is  4
80th percentile of annotations linked by keyword tags is  8
90th percentile of annotations linked by keyword tags is  27
99th percentile of annotations linked by keyword tags is  853

The median keyword tag links 4 annotations. Decent coverage, but it could be better. And the top 1% of our (fake-data) keywords link 800+ annotations? “deployment” is the top keyword; automated processes are good knowledge sharers.

This post is a reference for calculating percentiles in different data stores. I keep a list of working queries to copy from at work. And now I’m pleased to share some of this list with you. Try them out, and let me know what you think.

 


Increase Performance with Automatic Keyword Recommendation

For most large-scale image retrieval systems, performance depends upon accurate meta-data. While content-based image retrieval has progressed in recent years, typically image contributors must provide appropriate keywords or tags that describe the image. Tagging, however, is a difficult and time-consuming task, especially for non-native English speaking contributors.

At Shutterstock, we mitigate this problem for our contributors by providing automatic tag recommendations. In this talk, delivered as a Webinar for Bright Talk’s “Business Intelligence and Analytics” channel, I describe the machine learning system behind the keyword recommendation system which Shutterstock’s Search and Algorithm Teams developed and deployed to the site.

Tag co-occurrence forms the basis of the recommendation algorithm. Co-occurrence is also the basis for some previous tag recommendation systems deployed in popular photo sharing services such as Flickr. Tag recommendation for online stock photography, however, differs from photo sharing sites in several ways. In online stock photography, contributors are highly motivated to provide high quality tags because tags make images easier to find and consequently earn higher contributor revenue. In building the system, we explored several different recommendation strategies and found that significant improvements are possible compared to a recommender that only uses tag co-occurrence.

The three principal points of the talk are as follows:

(1) we characterize tagging behavior in the stock photography setting and show it is demonstrably different from popular photo sharing services;
(2) we explore different tag co-occurrence measures and, in contrast to previous studies, find a linear combination of two different measures to be optimal; and
(3) we show that a novel strategy that incorporates similar images can expand contextual information and significantly improve the precision of recommended tags.


Monitoring High Scale Search at a Glance

One of our key missions on the search team at Shutterstock is to constantly improve the reliability and speed of our search system.  To do this well, we need to be able to measure many aspects of our system’s health.  In this post we’ll go into some of the key metrics that we use at Shutterstock to measure the overall health of our search system.

[Image: the search team’s main health dashboard]

The image above shows our search team’s main health dashboard.  Anytime we get an alert, a single glance at this dashboard can usually point us toward which part of the system is failing.

On a high level, the health metrics for our search system focus on its ability to respond to search requests, and its ability to index new content.  Each of these capabilities is handled by several different systems working together, and requires a handful of core metrics to monitor its end-to-end functionality.

One of our key metrics is the rate of traffic that the search service is currently receiving.  Since our search service serves traffic from multiple sites, we also have other dashboards that break down those metrics further for each site.  In addition to the total number of requests we see, we also measure the rate of memcache hits and misses, the error rate, and the number of searches returning zero results.

One of the most critical metrics we focus on is our search service latency.  This varies greatly depending on the type of query, the number of results, and the sort order being used, so this metric is also broken down into more detail in other dashboards.  For the most part we aim to maintain response times of 300ms or less for 95% of our queries.  Our search service runs a number of different processes before running a query on our Solr pool (language identification, spellcheck, translation, etc.), so this latency represents the sum total of all those processes.

In addition to search service latency, we also track latency on our Solr cluster itself.  Our Solr pool will only see queries that did not have a hit in memcache, so the queries that run there may be a little slower on average.

When something in the search service fails or times out, we also track the rate of each type of error the search service may return.  At any time there’s a steady stream of garbage traffic from bots generating queries that may error out, so there’s a small but consistent stream of failed queries.  If a search service node is restarted we may also see a blip in HTTP 502 errors, although that’s a problem we’re trying to address by improving our load balancer’s responsiveness in taking nodes out of the pool before they go down.

A big part of the overall health of our system also includes making sure that we’re serving up new content in a timely manner.  Another graph on our dashboard tracks the volume and burndown of items in our message queues, which serve as our pipeline for ingesting new images, videos, and other assets into our Solr index.  This ensures that content is making it into our indexing pipeline, where all the data needed to make it searchable is processed.  If the indexing system stops being able to process data, the burndown rate of each queue usually comes to a halt.

There are other ways in which our indexing pipeline may fail too, so we also have another metric that measures the amount of content that is making it through our indexing system, getting into Solr, and showing up in the actual output of Solr queries.  Each document that goes into Solr receives a timestamp when it was indexed.  One of our monitoring scripts then polls Solr at regular intervals to see how many documents were added or modified in a recent window of time.  This helps us serve our contributors well by making sure that their new content is being made available to customers in a timely manner.
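
As a rough sketch of that kind of freshness check (this is not our actual monitoring script; the Solr core name and timestamp field below are hypothetical), you can poll for documents indexed in a recent window and alert when the count drops to zero:

# Hedged sketch: count documents indexed recently by querying Solr's date math.
# The core name and the indexed_at field are hypothetical.
import requests

SOLR_URL = "http://localhost:8983/solr/media/select"  # hypothetical core

def recently_indexed_count(window="15MINUTES"):
    params = {
        "q": f"indexed_at:[NOW-{window} TO NOW]",  # hypothetical timestamp field
        "rows": 0,        # we only need the count, not the documents
        "wt": "json",
    }
    resp = requests.get(SOLR_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

if __name__ == "__main__":
    count = recently_indexed_count()
    print(f"documents indexed in the last window: {count}")
    if count == 0:
        print("ALERT: indexing pipeline may be stalled")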

Behind the scenes we also have a whole host of other dashboards that break out the health and performance of each system covered in this dashboard, as well as metrics for other services in our search ecosystem.  When we’re deploying new features or troubleshooting issues, having metrics like these helps us very quickly determine what the impact is and guides us to quickly resolving it.


Stop Buying Load Balancers and Start Controlling Your Traffic Flow with Software

When it comes to traditional load balancers, you can either splurge on expensive hardware or go the software route. Hardware load balancers typically have poor, outdated API designs and are, at least in my experience, slow. You can find a few software load balancing products with decent APIs, but trying to use free alternatives like HAProxy leaves you with bolt-on software that generates the configuration file for you. Even then, if you need high throughput you have to rely on vertical scaling of your load balancer or round-robin DNS to distribute horizontally.

We were trying to figure out how to avoid buying a half million dollars’ worth of load balancers every time we needed a new data center. What if you didn’t want to use a regular layer 4/7 load balancer and, instead, relied exclusively on layer 3? This seems entirely possible, especially after reading about how CloudFlare uses Anycast to solve this problem. There are a few ways to accomplish this. You can go full-blown BGP and run that all the way down to your top-of-rack switches, but that’s a commitment and likely requires a handful of full-time network engineers on your team. Running a BGP daemon on your servers is the easiest way to mix “Anycast for load balancing” into your network, and you have multiple options for doing it.

After my own research, I decided that ExaBGP is the easiest way to manipulate routes. The entire application is written in Python, making it perfect for hacking on. ExaBGP has a decent API, and even supports JSON for parts of it. The API works by reading STDOUT from your process and sending your process information through STDIN. In the end, I’m looking for automated control over my network, rather than more configuration management.

At this point, I can create a basic “healthcheck” process that might look like:

#!/usr/bin/env bash
# ExaBGP reads this script's stdout as its command channel, so the only lines
# we print are the announce/withdraw commands; grep -q keeps the health
# check's own output off stdout.
STATE="down"

while true; do
  if curl localhost:4000/healthcheck.html 2>/dev/null | grep -q OK; then
    if [[ "$STATE" != "up" ]]; then
      echo "announce 10.1.1.2/32 next-hop self"
      STATE="up"
    fi
  else
    if [[ "$STATE" != "down" ]]; then
      echo "withdraw 10.1.1.2/32 next-hop self"
      STATE="down"
    fi
  fi

  sleep 2
done

Then in your ExaBGP configuration file, you would add something like this:

group anycast-test {
  router-id 10.1.10.11;
  local-as 65001;
  peer-as 65002;

  process watch-application {
    run /usr/local/bin/healthcheck.sh
  }

  neighbor 10.1.10.1 {
    local-address 10.1.10.11;
  }
}

Now, anytime your curl | grep check is passing, your BGP neighbor (10.1.10.1) will have a route to your service IP (10.1.1.2). When it begins to fail, the route will be withdrawn from the neighbor. If you now deploy this on a handful of servers, your upstream BGP neighbor will have multiple routes. At this point, you have to configure your router to properly spread traffic between the multiple paths with equal cost. In JUNOS, this would look like:

set policy-options policy-statement load-balancing-policy then load-balance per-packet
set routing-options forwarding-table export load-balancing-policy
commit

Even though the above says load-balance per-packet, it is actually more of a load-balance per-flow, since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the reasoning for the name stems from legacy chipsets that did not support per-flow packet distribution. You can read more about this configuration on Juniper’s website. Below is our new network topology for accessing a service:

[Image: network topology]

There are some scale limitations, though. It comes down to how many ECMP next-hops your hardware router can handle. I know a Juniper MX240 can handle 16 next-hops, and I have heard rumors that a software update will bump this to 64, but again this is something to keep in mind. A tiered approach may be appropriate if you need a high number of backend machines: a layer of route servers running BIRD/Quagga, with your backend services peering to them using ExaBGP. You could even use this approach to scale HAProxy horizontally.

In conclusion, replacing a traditional load balancer with layer 3 routing is entirely possible. In fact, it can even give you more control of where traffic is flowing in your datacenter if done right. I look forward to rolling this out with more backend services over the coming months and learning what problems may arise. The possibilities are endless, and I’d love to hear more about what others are doing.


How we built interactive heatmaps using Solr and Heatmap.js

One of the things we obsess over at Shutterstock is the customer experience.  We’re always aiming to better understand how customers interact with our site in their day to day work.  One crucial piece of information we wanted to know was which elements of our site customers were engaging with the most.  Although we could get that by running a one-off report, we wanted to be able to dig into that data for different segments of customers based on their language, country, purchase decisions, or a/b test variations they were viewing in various periods of time.

To do this we built an interactive heatmap tool to easily show us where the “hot” and “cold” parts of our pages were — where customers clicked the most, and where they clicked the least.  The tool we built overlaid this heatmap on top of the live site,  so we could see the site the way users saw it, and understand where most of our customer’s clicks took place.  Since customers are viewing our site in many different screen resolutions we wanted the heatmap tool to also account for the dynamic nature of web layouts and show us heatmaps for any size viewport that our site is used in.

[Image: Shutterstock’s heatmap tool running on our home page]

The main technologies used to build our interactive heatmap tool were our click tracking system, Lil Brother, Apache Solr, and Heatmap.js.  Lil Brother is able to track every click a user makes on our site, along with the x,y coordinates of the cursor, the page element clicked, and some basic info about the customer (country, language, browser, and a/b test variations).

Solr provided the means to filter and aggregate our click data.  The way in which we were using Solr is described in more detail in our post Solr as an Analytics Platform.   In this case, we indexed each click event as a separate document in Solr along with all the customer metadata linked to it.

 Our schema.xml file contained the following fields:

<field name="mouse_x_y" type="string" indexed="true" />
<field name="page_url" type="string" indexed="true"/>
<field name="country" type="string" indexed="true"/>
<field name="language" type="string" indexed="true"/>
...

Once we generated our Solr index, we needed to build a query to get the data for our heatmap.  To do this we ran a facet query on the mouse_x_y field.  This gave us a histogram of the number of clicks at each position on the page (we rounded the coordinates to the nearest 10 pixels in order to group clicks into reasonably sized buckets). Once we had the number of clicks per bucket from Solr, we passed that data to heatmap.js, which rendered the heatmap in the browser.
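
As a rough sketch of that kind of facet query, fetched from Python with the requests library (the Solr URL and page filter are hypothetical; the field names come from the schema excerpt above):

# Hedged sketch of a Solr facet query over the mouse_x_y field.
# The core URL and page_url value are hypothetical.
import requests

params = {
    "q": 'page_url:"/index.html"',   # hypothetical page filter
    "rows": 0,                        # we only want facet counts, not documents
    "facet": "true",
    "facet.field": "mouse_x_y",
    "facet.limit": -1,                # return every x,y bucket
    "facet.mincount": 1,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/clicks/select", params=params)
counts = resp.json()["facet_counts"]["facet_fields"]["mouse_x_y"]

# Solr returns facets as a flat [value, count, value, count, ...] list;
# pair them up into {"x,y": count} buckets for heatmap.js.
buckets = dict(zip(counts[0::2], counts[1::2]))
print(list(buckets.items())[:5])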

In order to run Heatmap.js on all the pages of our production site, we launched the app through a bookmarklet that loaded up the JavaScript and ran AJAX requests against Solr. The bookmarklet also exposed controls for us to select other parameters like country, language, and a/b variations so that we could drill down into specific groups of customers. As a bookmarklet, the tool was able to detect settings like the browser viewport size and display the heatmap based on those dimensions.

Since we developed the heatmap tool, our designers and product specialists have been using it to identify elements of our site that could be optimized – either by changing or removing some elements – to better serve customers’ needs.  Knowing that nearly all of our customers interact with the search bar helped to steer our design to make it the most prominent element on the page above the fold.  Knowing that many of the links lower down on the page were not used as often helped us make the decision to redesign that area and put more valuable discovery paths there for customers.

In order to help out folks who are interested in building an interactive heatmap tool for their own sites, we’ve open sourced the Shutterstock Heatmap Toolkit.  The toolkit allows you to run a Solr instance and web server, and includes a batch of sample data to try it out on.

You can run the tool on your own data too by creating a JSON file with individual click events, where each event includes the mouse x/y coordinates and any other attributes such as the page element clicked and information about the user (the toolkit itself contains a sample set of data you can base it on).  A script is also included to start Solr, build an index, and run the web server that powers the heatmap app itself.  Follow the steps in the README on Github to try it out on the example data.

Being able to visualize and dig into our customers’ interactions with our site has provided valuable insight for our designers, product specialists, and developers.  Having the ability to navigate and dig into different slices of this data in real time is highly valuable if you want your product team to be nimble and have answers to questions as quickly as they can ask them.


The Secret to Shutterstock Tech Teams

Being fast and nimble is important to us at Shutterstock, and one way we accomplish this is by working in small teams.  This approach has yielded tremendous benefits over the years, but it comes with its own challenges: Shutterstock now has over 300 people and dozens of teams.  How do we coordinate everything with so many different groups?

Here’s a bit of information about how our approach to small teams has evolved, and how we continue to change it as we grow.

The Early Days

About five years ago, we learned the value of small teams the hard way — by not having small teams.  Shutterstock started with a few developers who would work on a few different projects at any given time.  We followed that approach as we grew the team, until suddenly we had 10 developers working on 10 projects, and nothing was getting done.

We addressed this problem by breaking into smaller teams.  Each team has a product owner, a few back-end developers, a front-end developer, a designer, and a QA engineer.  The teams are  meant to be independent and autonomous, capable of taking any project from idea to completion without outside help.  This lets them move very quickly and stay focused on their goals.

We started with three teams: a customer team, a contributor team, and a “business value” team that was meant to focus on internal projects that bring value to the business.

Lessons Learned

The customer and contributor teams got off to a great start, and exist to this day.  But the “business value” team floundered, and we learned some early important lessons about teams:

  • Each team needs a clear customer (or more generally, a clear target user).  The team has to come to work every day excited to solve a problem for a particular audience.  “Business value” was just too vague; there were many audiences within Shutterstock that needed projects, but there was no good way to prioritize one project over another.  Consequently, the team was tossed from project to project, and often ended up doing things that the other teams simply didn’t want to do.  After a few quarters, we decided to dissolve the team.

  • Each team needs a clear goal.  Our customer team was living a schizophrenic existence: half its projects were focused on improving the search and download experience for our customers; the other half were about working on the signup and subscription flow to increase revenue.  We addressed this by splitting the team in two.  We decided that our “customer experience” team should stay focused on the primary customer experience on our site.  The revenue team took over the signup and payment flow.  After the split, this distinction came to feel more and more natural, and we look back on it as moving us in a better direction.

  • Teams need to be autonomous.   As Shutterstock grew over the years, we were able to expand our offerings by creating new teams.  Sometimes we’d assemble a team without making sure it had every role it needed — perhaps it would only have one developer or not have a QA engineer.  We always ended up regretting this.  The process only works well if the team can truly be independent and autonomous.  Now we know: if we don’t have enough people to form a team, we wait until we can hire all the roles before we launch it.

Core Services Teams

As we added more product-oriented teams, there was a growing need to build common architectural pieces that all the teams could use.  We decided several years ago to move towards a RESTful architecture, and soon many teams jointly used back-end services to support their product.  But ownership of the services was problematic.  If a service needed changes, it was unclear who was responsible for making that happen.

We solved that problem by introducing the latest evolution of our team strategy: core services teams.  Each of these teams own one or more RESTful services, and work with the product teams to prioritize their work.  Their goal is to build core infrastructure that other teams can leverage to serve their customers.

The Challenge of Coordination

Today, Shutterstock has over 20 teams, all of which follow agile development practices of fast interaction and frequent customer feedback.  With so many teams moving so quickly, coordination has become a challenge.  This is partly addressed by returning to a core team principle: strive for autonomy and independence.  We encourage teams to pursue projects that are within their power to take from idea to completion without outside help, which eliminates the need to coordinate altogether.

However, there are inevitably projects that require multiple teams to work together.  In those cases, we promote four ways to improve coordination:

  • Each of our teams has a planning meeting every two weeks.  Anyone can attend these meetings, and we encourage teams that are working together to attend each other’s planning meetings.

  • Each of our teams also has a demo every two weeks, in which they show off the work they’ve done recently.  We also encourage teams that are working together to attend each other’s demos.

  • We have a weekly product backlog meeting, where all our product teams share upcoming projects and discuss metrics related to recently-launched features.

  • Finally, each team has a lead developer and product owner, and we give them the specific responsibility of pro-actively reaching out to other teams to discuss upcoming work.

These approaches are intentionally lightweight and simple.  We rely on people’s own initiative to share their work, communicate actively with others, and work out the details themselves to address many challenges of coordination.  Having a non-prescriptive process makes it clear to people that it’s their responsibility to talk to whomever they need to.  So far, this approach has worked out well.

We’ll continue to evolve and adapt our team strategy as we grow.  Though we’ve had some minor challenges with our approach over the years, overall it has served us very well.  We’d love to hear from others about their team-building lessons.  What has worked well for you?  How have you changed your approach as your company grew?  Let us know in the comments below.

 


When a Space Is Not Just a Space

During a recent email exchange with our search team, Nick Patch, our resident Unicode expert, offered the following advice for a chunk of Java code used to detect Japanese characters:

> Pattern.compile("^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$");

We should use one of the following options instead:

Pattern.compile("^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\p{IsWhite_Space}]+$");

Pattern.compile("(?U)^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$");

Pattern.compile("^[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsCommon}\\s]+$", Pattern.UNICODE_CHARACTER_CLASS);

All three do exactly the same thing: they match any Unicode whitespace instead of just ASCII whitespace.  This is important because the pattern will then also match U+3000 IDEOGRAPHIC SPACE, which is commonly found in CJK text.

By default, the predefined character class \s matches only ASCII whitespace, while \p{IsWhite_Space} matches Unicode whitespace.  Enabling Unicode character class mode makes \s work just like \p{IsWhite_Space}, and it applies the same ASCII-to-Unicode upgrade to \d, \w, \b, and their negated versions.  Unicode character class mode can be enabled with Pattern.UNICODE_CHARACTER_CLASS or by starting the regex with (?U).  Predefined character classes that are defined only with Unicode semantics, such as property matching with \p{…}, behave the same in either mode.
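
To see the point in isolation, here’s a quick sketch that checks just the whitespace classes against U+3000 IDEOGRAPHIC SPACE:

import java.util.regex.Pattern;

public class WhitespaceDemo {
    public static void main(String[] args) {
        String ideographicSpace = "\u3000"; // U+3000 IDEOGRAPHIC SPACE

        // Default \s is ASCII-only, so it misses U+3000
        System.out.println(Pattern.compile("\\s")
                .matcher(ideographicSpace).find());  // false

        // The Unicode-aware variants all match it
        System.out.println(Pattern.compile("\\p{IsWhite_Space}")
                .matcher(ideographicSpace).find());  // true
        System.out.println(Pattern.compile("(?U)\\s")
                .matcher(ideographicSpace).find());  // true
        System.out.println(Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(ideographicSpace).find());  // true
    }
}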

Nick’s insightful reply left me full of questions, so I sat down with him to get some more details.

So there are different kinds of spaces in Unicode?  What’s up with that?

There are lots of different character encodings out there, and different ones have encoded characters for different types of spaces.  Some of these have been for traditional typographical use such as an “em space,” which is the width of an uppercase M in the font that you’re using.  Another one is the hairline space, which is extremely thin.  And then in CJK (Chinese, Japanese, and Korean) languages, there’s an ideographic space, which is a square space that is the same size as the CJK characters, whether it’s hanzi in Chinese, kanji in Japanese, etc.

If you were to create a character encoding from scratch—say you were going to invent Unicode—and not care about backward compatibility with any existing encoding, you would probably have just one space character.  But we do have to maintain compatibility with lots of historical encodings, both so that we can transform text from those encodings into Unicode and back again, and so that we can represent the same characters that we formerly represented in our old encodings.

How many different kinds of spaces are there in Unicode?

Twenty-five different characters have the White_Space property in Unicode 6.3.  Any regular expression engine with proper Unicode support will match these, and only these, characters with \s.  They can also be matched more explicitly with \p{White_Space} or \p{IsWhite_Space}, depending on the regex engine (Perl and ICU4C use the former, while Java uses the latter).
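
If you’re curious, you can count them yourself with the regex engine.  Note that the exact number depends on the Unicode version your JDK implements: U+180E MONGOLIAN VOWEL SEPARATOR lost the White_Space property in Unicode 6.3, so older JDKs report 26.

import java.util.regex.Pattern;

public class CountWhiteSpace {
    public static void main(String[] args) {
        Pattern whiteSpace = Pattern.compile("\\p{IsWhite_Space}");
        int count = 0;

        // Walk every Unicode code point and test it against \p{IsWhite_Space}
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (whiteSpace.matcher(new String(Character.toChars(cp))).matches()) {
                count++;
            }
        }

        System.out.println(count); // 25 on a JDK implementing Unicode 6.3 or later
    }
}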

Do different spaces have different meanings?

Most of the spaces you’ll find are just for width or formatting.  Ideally, you don’t want to perform document layout on the character level.  Instead, it’s better to do that with your markup language or word processor—say, CSS if you’re using HTML—and you’d just stick with the standard space character within your text.

But there are a few space characters that have interesting rules to them, like the “non-breaking space,” which forces line breaking algorithms to not provide a break opportunity for line wrapping.

Newline characters, on the other hand, are a form of whitespace that designates a mandatory line break.

How do CJK languages use spaces?

In most cases, CJK languages don’t use spaces between ideographs.  You’ll often see a long series of characters without any spaces.  If you’re able to read the language, you can determine the word boundaries.  But there is no computer algorithm that can precisely detect CJK word boundaries.  We have to use a hybrid approach that’s based more on a dictionary than an algorithm, and it’s never going to be perfect.  The only perfect way is to sit a human down and have them read the text, which makes it difficult for us to figure out what the words are within a search query.  In CJK text, a single ideograph can be a word, but a series of ideographs can also form a single word.  It’s a tricky problem to determine the boundaries.
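
For a rough feel of what word segmentation looks like in Java, here’s a sketch using the JDK’s java.text.BreakIterator.  Its CJK handling is limited; in practice a library like ICU4J, which uses dictionary-based segmentation for Chinese and Japanese, gives better results, but the iteration pattern is the same.  The sample sentence is just an illustration.

import java.text.BreakIterator;
import java.util.Locale;

public class WordBoundaries {
    public static void main(String[] args) {
        // Sample Japanese phrase ("search for stock photos") with no spaces between words
        String text = "写真素材を検索する";

        BreakIterator words = BreakIterator.getWordInstance(Locale.JAPANESE);
        words.setText(text);

        // Walk the boundary positions and print each segment the iterator found
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            System.out.println(text.substring(start, end));
        }
    }
}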

How does Unicode define a space?

In Unicode, every character has a set of properties.  So it’s more than just an encoding scheme for characters; it has defined metadata for every character.  For example: “Is this character a symbol?  A number?  A separator?  Is it punctuation?  Or alphabetic?  Or numeric?”  It also has rules around the type of character: if it’s a letter, what’s the uppercase version?  What’s the lowercase version?  What’s the title case version?

With whitespace, there’s a boolean property called “White_Space.”  Additionally, there’s a property called “General_Category,” and every character has a value for it.  Examples of the values are “letter,” “number,” “punctuation,” “symbol,” “separator,” “mark,” and “other.”  There are also subcategories, and one of the subcategories of “separator” is “space separator,” which is assigned to any character that is specifically used as a space between words, as opposed to lines or paragraphs.  So there are programmatic ways to determine not just “What is whitespace?” but “How is it used?”
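
Java exposes a fair amount of this metadata directly, so a small sketch can show both checks side by side: the boolean White_Space property (via the regex engine) and the “space separator” general category plus script (via java.lang.Character):

import java.util.regex.Pattern;

public class CharProperties {
    public static void main(String[] args) {
        // Space, no-break space, ideographic space, and a letter for contrast
        int[] samples = { ' ', '\u00A0', '\u3000', 'A' };

        for (int cp : samples) {
            boolean whiteSpace = Pattern.matches("\\p{IsWhite_Space}",
                    new String(Character.toChars(cp)));
            boolean spaceSeparator = Character.getType(cp) == Character.SPACE_SEPARATOR;

            System.out.printf("U+%04X  White_Space=%b  space separator=%b  script=%s%n",
                    cp, whiteSpace, spaceSeparator, Character.UnicodeScript.of(cp));
        }
    }
}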

How do different regular expression engines handle different kinds of spaces?

Traditionally, regex engines only understood ASCII characters, where the whitespace characters include just one space character plus the tab and newline characters.  Then, regular expressions started to support Unicode.  Some of them started treating all matches with Unicode semantics, so that if you’re matching on whitespace, now you would match on any Unicode whitespace (which includes ASCII whitespace).

Other ones, for backward compatibility, continue to match only on ASCII whitespace and provide a “Unicode mode” that will allow you to match on any Unicode whitespace.  That’s what Java and many languages do, whereas some of the dynamic languages like Perl and Python 3 have upgraded to Unicode semantics by default and provide an optional “ASCII mode.”

Unfortunately, regex engines that default to ASCII semantics make it increasingly difficult to work with Unicode, because every time you want to execute a regular expression against a Unicode string, you have to put each regex in Unicode mode.  In ten years, this will seem very antiquated.

Fascinating!  Thanks, Nick!
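
One lightweight way to live with Java’s ASCII-by-default behavior in the meantime is to funnel pattern compilation through a project-wide helper that always turns on Unicode character classes.  This is just a hypothetical convention, not something from our codebase:

import java.util.regex.Pattern;

final class UnicodePatterns {
    private UnicodePatterns() {}

    // Hypothetical helper: compile every regex with Unicode character classes,
    // so \s, \w, \d, and \b get Unicode semantics without per-pattern flags.
    static Pattern compile(String regex) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
    }
}

With that in place, UnicodePatterns.compile("\\s+") matches U+3000 just as readily as an ASCII space.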

 
