Bad films are bad

which is why there's goodfil.ms

New Profile Pages & Netflix Giveaway

New Profile Pages

Everyone’s profile page just got a massive overhaul. Check out the team’s profiles at John, Glen and Charlie for an example.

You now have a personalised ratings graph to share with friends, a list of your favourite and worst films and a public queue of movies you’re itching to see.

As an extra sweetener we’re giving away three 6-month Netflix memberships throughout December. That should help get you through your queued backlog of films.

The Minimum Viable Rails Stack, Now Reddit Frontpage Approved.

This week we hit a bit of a StumbleUpon bonanza, landed on the Reddit front page, and got featured on Lifehacker. These three things threw quite a bit of traffic our way, and we got through it all with around 3 hours of degraded performance, and patches of intermittent bad gateway errors. Not too shabby.

With alarmingly serendipitous timing, two weeks ago I did a presentation at the Melbourne Ruby group on Scaling Rails. I posted the talk and my explanatory notes on my personal tumblr for two reasons. The first is that it includes a bit of my personal, subjective, and somewhat unsubstantiated ideas on developer types and ops, and I prefer to keep those kinds of musings off the team blog.

The second was that the “minimum viable stack” idea I’d promoted, while I thought it pretty sound (and had bet Goodfilms’ tech direction on it), was untested. This week we gave it a solid thrashing, so it’s time to promote it as “official Goodfilms policy”.

The TL;DR is that it worked, mostly, and I’m now even more comfortable recommending the stack as a good starting point.

The “Minimum Viable Stack” for your just-past “Minimum Viable Product”

In my presentation I suggested that the “golden stack” for a Rails app just out of its MVP stage is roughly:

  • Deployed to a cloud provider like Amazon or Rackspace
  • Uses MySQL or Postgres as the datastore, deployed to a single instance, with frequent data backups to cloud storage
  • Has two “app instances”, which host both your web processes and delayed job workers. Each instance should be capable of holding all your regular traffic
  • Load balances web requests between the two app servers (with health checks enabled) using whichever magic load balancer your cloud provider gives you
  • Performance monitored using either Scout or NewRelic, or both if you like
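
The “health checks enabled” part of the list above needs something on each app instance for the load balancer to poll. Here is a minimal sketch of that idea as Rack-style middleware. It is not Goodfilms’ actual code: the `HealthCheck` class, the `/health` path and the `db_check` hook are all assumptions made for illustration.

```ruby
# Hypothetical health check middleware, not Goodfilms' actual code.
# The load balancer polls HEALTH_PATH and should stop routing traffic
# to an instance that does not answer with a 200.
HEALTH_PATH = "/health" # assumed path; use whatever your load balancer polls

class HealthCheck
  # db_check is an assumed hook: any callable that returns true when the
  # instance can reach what it depends on (for example, the database).
  def initialize(app, db_check: -> { true })
    @app = app
    @db_check = db_check
  end

  def call(env)
    # Everything except the health path goes straight to the real app.
    return @app.call(env) unless env["PATH_INFO"] == HEALTH_PATH

    healthy = begin
      @db_check.call
    rescue StandardError
      false # a failing check means "unhealthy", not a crashed request
    end

    status = healthy ? 200 : 503
    [status, { "Content-Type" => "text/plain" }, [healthy ? "ok" : "unavailable"]]
  end
end
```

In a Rails app, middleware like this would typically be added with `config.middleware.use HealthCheck`. Keeping the check cheap matters, because the load balancer will call it on every instance every few seconds.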

There’s more detail over on my original post, but the key inputs into that design were:

  • Low chance of “off the air” downtime, despite cloud servers being notoriously “ephemeral”
  • Lowest operational cost, except where it grossly impacts the above point
  • Favouring simplicity, except where it grossly impacts the above two points
  • Favouring traditional SQL databases as it’s the storage paradigm Rails “grew up with”, so has the best tooling and knowledge in the community
  • Favouring SQL again because, if it becomes the bottleneck, you can vertically scale it just long enough to get you out of trouble, and it’s easy to hire experts to help
  • Avoiding vendor lock-in where possible

The Goodfilms Stack

Goodfilms’ setup is almost exactly what I listed above. We run Rails 3.2, use Postgres as our datastore, host on Rackspace cloud using both their load balancers and cloud storage/CDN on top of regular cloud servers, and monitor the whole setup using NewRelic.

The only departure from the stack listed above is that we have a third server set up as a general utility box. It has two main jobs: importing catalog data from Netflix, iTunes, and the Movie DB, and running our elaborate taste comparison engine, codenamed “Project Ingen”.

So, that’s the background. Let’s talk about what happened, what worked well for us with the stack, and what didn’t.

What Happened?

I’m just going to focus on the Reddit front page part of the story, because that’s where the interesting things happened.

For a long time, we’ve thought that there are a lot of people who want to pay for film content, and that one of the things Goodfilms can do is help those people find the best things to watch that are available legally. To test that assumption, we did a couple of MVP (minimum viable product) pages for both iTunes and Netflix.

If you look at the iTunes page, you can see how minimal our minimum is, as we’ve not yet upgraded that page. Seeing that the Netflix page was pulling in enough organic search, we tasked our talented new designer Charlie with taking the Netflix page to the next level and seeing how far we could take it.

Once Charlie was done with it, it looked like we had a winner. Realising that a lot of people might want to use the page, I took a full day to double check all the performance of the queries in the page and fix up what I could.

NewRelic had shown us that the MVP page was only just scraping through performance wise, which is fine for a proof of concept, but not fine for a page you want to load. Once everything was looking OK in the front and back end, I gave the thumbs up to Glen to “throw as many people at it as he can find”.

There are quite a few people on Reddit, and Glen found them, and got a good section of them to come and use it.

What might not be immediately obvious is that we’re based in Australia. The bulk of the Redditing happened in the middle of the night, and so Glen rang me up, pulled me out of bed, and we babysat the servers.

A few hours later we both went back to bed, confident that the site was going to stay up, and very pleased with the influx of new users. In those hours, we added a couple more app servers, and set up full page caching for signed out users.

What went well?

The first thing that went well was taking the time to look at the performance of our new features, doing some work, but not overdoing it.

Defining “good enough” performance for as yet unproven features is tricky, especially when taking into account the opportunity cost of the dev time that can go into other features.

In this case, we got the balance right. I spent about a day getting performance to OK, and it was good enough to degrade gracefully under load, rather than explode. That gave us the breathing room to do the rest of our work in the middle of the night, without having wasted too much time beforehand when we were unsure of its success.

The other thing that worked really well was a direct result of three interrelated stack decisions: host on the cloud, split your app servers early, and monitor performance using NewRelic.

When traffic started climbing, we quickly switched our NewRelic subscription up to the pro level. We can’t afford to run it all the time, but it’s easy to turn it up when you do need it.

Under heavy load, there are generally only two things I want to look at in NewRelic: is there any page that is ridiculously broken from a performance perspective, and do we have enough capacity. If performance is broken somewhere, fix it. If you don’t have capacity, add it.

This time, all the pages were OK (because of the small upfront investment of work), but we did have a capacity problem. This did not take us long to figure out, because this is what NewRelic is good at. First design decision validated.
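
The two-question triage described above (is any page badly broken, and is there enough capacity) can be sketched as a small function. This is only an illustration of the decision order. The thresholds, the `triage` name and the input shapes are assumptions, and the numbers would come from whatever your monitoring tool reports.

```ruby
# Hypothetical thresholds; tune them for your own app.
SLOW_PAGE_MS    = 2_000 # a page slower than this counts as "broken"
MAX_UTILISATION = 0.8   # fraction of app capacity in use before adding servers

# page_times_ms: { "controller#action" => average response time in ms }
# utilisation:   busy app workers / total app workers, between 0.0 and 1.0
def triage(page_times_ms, utilisation)
  # Question 1: is any single page ridiculously slow? If so, fix that first.
  slow_pages = page_times_ms.select { |_page, ms| ms > SLOW_PAGE_MS }.keys
  return [:fix_pages, slow_pages] unless slow_pages.empty?

  # Question 2: the pages are fine, so are we simply out of capacity?
  return [:add_capacity, []] if utilisation > MAX_UTILISATION

  [:ok, []]
end

# The situation in this post: pages were fine, capacity was not.
triage({ "films#netflix" => 450, "home#index" => 300 }, 0.95)
# => [:add_capacity, []]
```

Checking for broken pages first is deliberate: adding servers in front of a pathologically slow query mostly just moves the load onto the database.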

To add capacity, we just took a snapshot image of one of our app servers, and then spun up new servers directly from the image. This was pretty easy, but if we weren’t on commodity cloud hosting, it couldn’t have happened and we would have been boned. Second design decision validated.

Cloning the servers “just worked” because keeping the database away from the app servers made sure there wasn’t any hidden coupling there. Moving to two app servers early made sure we weren’t accidentally relying on shared state. It’s the old programming truism about there only being three numbers in computing: zero, one, and many. Once you have “many” servers, adding new ones is no stress.

The final thing that worked well was following the idea of stack simplicity. I had worked a long day polishing the Netflix page code, then had my final game of indoor soccer for the season (with a win), and then went to the pub to celebrate with my team.

The only things that made working with the stack tricky that night were personal, not technical: I was tired, and not 100% sober. If simplicity wasn’t a core tenet of the stack, we could have kissed that uptime goodbye.

What went poorly?

In the presentation to the Ruby group, I stressed that you really needed to understand what the workload for your app is, and find the best bang for buck scaling strategy and use that.

Until this week, I’d been treating Goodfilms purely as a social network, and picked all our strategies to match that. The realisation that we’ve had as a team lately is that Goodfilms really is two sites living side by side: a community site/social network for films, and a rich content site for browsing for films.

With content based sites, full HTML page caches are your most cost effective technique for dealing with load. We didn’t have any page caching in place at all.

Glen’s flatmate Ben, de-facto ops guy for theconversation.edu.au and expert in content based sites, chipped in and got us our first cut of page caching in place. This got load under control, but meant that signed in users weren’t getting the proper experience for that page.

The next step was to set things up so we could do signed in vs. signed out caches on demand. This was a little tricky half asleep, but we got there in the end. If you’re interested in how it works, you can check out the specific nginx config and capistrano task I wrote here and it might help you out of a jam.
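
To make the signed in vs. signed out split concrete, here is a rough sketch of the same idea as Rack-style middleware. This is not the nginx config and capistrano task mentioned above, just the technique restated in Ruby. The `SignedOutPageCache` class, the `signed_in` cookie name and the in-memory store are assumptions for illustration.

```ruby
# Hypothetical cookie that is set at sign-in and cleared at sign-out.
SIGNED_IN_COOKIE = "signed_in"

class SignedOutPageCache
  def initialize(app, ttl: 60, clock: -> { Time.now.to_f })
    @app = app
    @ttl = ttl     # seconds a cached page stays fresh
    @clock = clock # injectable, so expiry is easy to test
    @store = {}    # in-memory and per process; a sketch, not thread-safe
  end

  def call(env)
    # Signed in users (and non-GET requests) always hit the real app.
    return @app.call(env) unless cacheable?(env)

    key = env["PATH_INFO"].to_s + "?" + env["QUERY_STRING"].to_s
    entry = @store[key]
    return entry[:response] if entry && @clock.call - entry[:at] < @ttl

    status, headers, body = @app.call(env)
    chunks = []
    body.each { |chunk| chunks << chunk } # read the body once so it can be replayed
    body.close if body.respond_to?(:close)

    response = [status, headers, chunks]
    @store[key] = { response: response, at: @clock.call } if status == 200
    response
  end

  private

  # Only anonymous GET requests are served from, or written to, the cache.
  def cacheable?(env)
    env["REQUEST_METHOD"] == "GET" &&
      !env["HTTP_COOKIE"].to_s.include?("#{SIGNED_IN_COOKIE}=")
  end
end
```

Doing this in the web server, as we did, keeps cached requests away from the Ruby processes entirely. The sketch only shows the rule that matters: decide on a signed in marker up front, and never serve or store a cached page for a request that carries it.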

The “frequent database backups” policy caused us a few headaches. Doing a full dump of a growing database while it’s under load isn’t what I would call “ideal”. That said, I’d rather have the occasional couple of bad gateway errors than risk data loss, so I’d still do it the exact same way again.

If I come up with a better balance of low cost/good uptime/good performance I’ll refine my suggestion for the datastore. For the meantime I’m sticking with it, but flagging it as an explicit trade-off you’ll be making if you follow our stack suggestions.

Final thoughts

I can’t say we’re 100% web scale. Staying up half the night with your servers means you’re not there yet. I think this is OK for where we’re at as a business. We’re still feeling our way through to the final feature set of the product, and learning what our market is like. Too much engineering now would be premature.

I strongly believe that there are no “right” answers in scaling a web app, but after giving this stack a solid thrashing, I feel very comfortable putting it forward as a good starting point. It’s a solid base from which you can respond to growth, and evolve it to match your situation well.

Make sure you put the machinery in place for page caching early, even if you’re not using it. The code is easy to write when you’re relaxed in the middle of the day, and a royal pain in the ass in the middle of the night.

Updated: If you found this useful please discuss or upvote over on Hacker News

Goodfilms

Goodfilms is a way to share the movies you watch with your friends. We rate movies on two criteria - ‘quality’ and ‘rewatchability’, so you can admit to your guilty pleasures and properly capture the feeling you get when a film leaves you exhausted. Sign up now and keep track of the films you love, and find great, challenging or silly new ones to watch.

Why Ratings Systems Don’t Work

Today’s XKCD reminded us of a problem we’ve known about for a while:

XKCD's recent summary of online ratings

A little over a year ago, we started building a movie site with this exact problem in our sights - how could we make film ratings more useful? Bad ratings are all over the internet, and as someone who’s tried to make an improvement, I thought it was worth responding.

Goodfilms Goes Mobile

We’re excited to announce the beta version of the Goodfilms mobile site! We’d like to talk a bit about the goals for building this, and some of the design inspiration behind it. We’ll also talk a little about how it’s put together using the absolutely excellent AngularJS.

Goodfilms Mobile Feed

The site’s optimised to give existing users of our site access to their Queue whenever they need it and the ability to rate a film when they’ve just seen it. And, of course, to help them keep up with what movies their friends have been watching, or want to go see.

New Feature: Get Notified When Friends Rate & Queue Films Based on Your Activity

Here at Goodfilms, we’ve always believed a couple of things about film ratings. The first, the obvious one, is that ordinary rating scales are broken. The second, which is just as big an influence on us, but that we write about less often, is that how you hear about a film (and from whom) matters a lot.

We’ve been recording the way people interact with films after their friends do for a little while now, and it’s been amazing to watch. We can draw little “family trees” of who originally reviewed a film, then which of their friends rated it next, and then which of their friends rated it. It’s like watching our own little zombie outbreak inside the computers. Once we’ve collated a bit more data, we’ll write a blog post about some of the interesting ways different films move through social groups.

Netflix Quietly Smothers 3rd Party App Ecosystem (Updated)

Update: We’ve received word that the new API Terms of Use aren’t as sinister as on first glance. From VentureBeat:

We are not prohibiting sites from showing competing services, however we do not want anyone to use Netflix content such as titles and descriptions to advertise a competing service.

We’re not prohibiting developers from monetizing their applications by selling them directly to consumers. We will not, however, permit resale of our information in a business-to-business fashion.

This definitely goes a long way to remove the uncertainty around the new API Terms. We feel that the inability for users to retrieve their viewing history and ratings remains an issue, but we’re glad to learn that existing, mutually-beneficial uses of the API (such as ours) appear to be still valid.

Update 2: TechCrunch and TheNextWeb have now confirmed this. I wouldn’t be surprised if we see an updated Terms of Use from Netflix in the near future to put the issue beyond doubt. We’ll update here again if that occurs.


On Friday afternoon, Netflix published a blog post announcing a breaking change to their API, and dedicated a small paragraph to the fact that their API terms of use had been updated. On a technical level, these changes will cripple many apps currently integrating with Netflix, but the legal changes may be even more significant. Netflix customers should be aware of not only the upcoming changes to any 3rd party apps they might use, but what this says about Netflix as a business.

iPhone Sneak Peek

We’ve put together a short video with a sneak peek of how the iPhone-optimised site is shaping up. We’re really excited about the different possibilities a touchscreen gives us.

We love feedback, so please tell us what you think in the comments.

May Roundup

Wow, what an exciting month May was. Heaps of new users and plenty of activity on the site, and a lot of new features to talk about. Here’s a bit of a summary: