
Recent Articles

Extremely Large File Uploads with nginx, Passenger, Rails, and jQuery

We have to handle some really frackin’ huge uploads (approaching 2 TB) in our Rails-Passenger-nginx application at work. This results in some interesting requirements:

  1. Murphy’s Law guarantees that uploads this big will get interrupted, so we need to support resumable uploads.
  2. Even if the upload doesn’t get interrupted, we have to report progress to the user since it’s such a long feedback cycle.
  3. Luckily, we can restrict the browsers we support, so we can use some of the advanced W3C APIs (like File) and avoid Flash.
  4. Only one partition in our appliance is large enough to contain a file that size, and it’s not /tmp.

For the first three requirements, it seemed like the jQuery File Upload plugin was a perfect fit. For the last, we just needed to tweak Passenger to change the temporary location of uploaded files…

Many Googles later, I realized that option is only supported in Apache and my best bet was the third-party nginx upload module. But its documentation is fairly sparse, and getting it to work with the jQuery plugin was a lot more work than I anticipated.

Below is my solution.

nginx and the Upload Module

The first step was recompiling nginx with the upload module. In our case, this meant modifying an RPM spec and rebuilding it, but in general, you just need to extract the upload module’s tarball to your filesystem and reference it in the ./configure command when building nginx:

./configure --add-module=/path/to/nginx_upload_module ...

Once that was built and installed, I added the following section to our nginx.conf:

# See http://wiki.nginx.org/HttpUploadModule
location = /upload-restore-archive {

  # if resumable uploads are on, then the $upload_field_name variable
  # won't be set because the Content-Type isn't (and isn't allowed to be)
  # multipart/form-data, which is where the field name would normally be
  # defined, so this *must* correspond to the field name in the Rails view
  set $upload_field_name "archive";

  # location to forward to once the upload completes
  upload_pass /backups/archives/restore.json;

  # filesystem location where we store uploads
  #
  # The second argument is the level of "hashing" that nginx will perform
  # on the filenames before storing them to the filesystem. I can't find
  # any documentation online, so as an example, say we were using this
  # configuration:
  #
  #   upload_store /tmp/uploads 2 1;
  #
  # A file named '43829042' would be written to this path:
  #
  #   /tmp/uploads/42/0/43829042
  #
  # I hope that's clear enough. The argument is required and must be
  # greater than 0. You can see the implementation here:
  #
  #  http://lxr.evanmiller.org/http/source/core/ngx_file.c#L118
  upload_store /backup/upload 1;

  # whether uploads are resumable
  upload_resumable on;

  # access mode for storing uploads
  upload_store_access user:r;

  # maximum upload size (0 for unlimited)
  upload_max_file_size 0;

  # form fields to be passed to Rails
  upload_set_form_field $upload_field_name[filename] "$upload_file_name";
  upload_set_form_field $upload_field_name[path] "$upload_tmp_path";
  upload_set_form_field $upload_field_name[content_type] "$upload_content_type";
  upload_aggregate_form_field $upload_field_name[size] "$upload_file_size";

  # hashes are not supported for resumable uploads
  # https://github.com/vkholodkov/nginx-upload-module/issues/12
  #upload_aggregate_form_field $upload_field_name[signature] "$upload_file_sha1";
}

That’s a literal copy-and-paste from the config. I’m including the comments here because the documentation wasn’t as explicit as I apparently needed it to be.

Some important points:

  • Valery Kholodkov, the author of the upload module, has written a protocol defining how resumable uploads work. You should definitely read it and understand the Content-Range and Session-Id headers.
  • I can’t find any documentation on “nginx directory hashes”. That comment is the best I could do to explain it.
  • Once the upload is completely finished, the module sends a request to a given URL with a given set of parameters. That’s what upload_set_form_field and upload_aggregate_form_field are for, so you can make the request look like a multipart form submission to your application (see the sketch just after this list).
  • The module supports automatic calculation of a SHA1 (or MD5) hash of uploaded files, presumably implemented as a filter during the upload to save time. I would’ve liked to have that hash passed to Rails for verification of the file, but it’s unsupported for resumable uploads. I’m leaving that setting commented out for future developers’ sakes.
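
For example, with the form-field directives above, the request that finally reaches Rails carries parameters along these lines (a sketch only; the values are made up):

# Illustrative only: roughly what params[:archive] contains in the Rails
# action, given the upload_set_form_field directives above.
{
  "archive" => {
    "filename"     => "backup.tar",                # $upload_file_name
    "path"         => "/backup/upload/2/43829042", # $upload_tmp_path
    "content_type" => "application/octet-stream",  # $upload_content_type
    "size"         => "3892384590"                 # $upload_file_size
  }
}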

At this point, I was able to use curl to upload files and observe what was happening on the filesystem.
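
To make the mechanics concrete, here’s roughly what those test requests boil down to, sketched in Ruby instead of curl. The header names follow the protocol linked above; the URL, filename, chunk size, and session ID are all just examples.

require 'net/http'

uri        = URI('http://backup.example.com/upload-restore-archive')
chunk_size = 8 * 1024 * 1024
session_id = '43829042' # our app derives this by hashing the filename

File.open('archive.tar', 'rb') do |file|
  total  = file.size
  offset = 0

  Net::HTTP.start(uri.host, uri.port) do |http|
    while chunk = file.read(chunk_size)
      last = offset + chunk.bytesize - 1
      req = Net::HTTP::Post.new(uri.path)
      req['Session-Id']          = session_id
      req['Content-Disposition'] = 'attachment; filename="archive.tar"'
      req['Content-Type']        = 'application/octet-stream'
      req['Content-Range']       = "bytes #{offset}-#{last}/#{total}"
      req.body = chunk

      # nginx answers each chunk with the byte range it has so far,
      # e.g. "0-8388607/3892384590". To resume an interrupted upload,
      # re-send the first chunk and continue from the range it returns.
      puts http.request(req).body
      offset += chunk.bytesize
    end
  end
end

The next step was configuring the jQuery plugin.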

The jQuery File Upload Plugin

This plugin is extremely full-featured and comprehensively documented, which was exactly the problem I had with it. I needed something in between the basic example and the kitchen sink example, and the docs were spread over a series of wiki pages that I personally had trouble following. A curse of plenty.

Here’s the essence of what I came up with (in CoffeeScript):

# We need a simple hashing function to turn the filename into a
# numeric value for the nginx session ID. See:
#
#   http://pmav.eu/stuff/javascript-hashing-functions/index.html
hash = (s, tableSize) ->
  b = 27183
  h = 0
  a = 31415

  for i in [0...s.length]
    h = (a * h + s[i].charCodeAt()) % tableSize
    a = ((a % tableSize) * (b % tableSize)) % (tableSize)
  h

sessionId = (filename) ->
  hash(filename, 16384)

$('#restore-archive').fileupload

  # nginx's upload module responds to these requests with a simple
  # byte range value (like "0-2097152/3892384590"), so we shouldn't
  # try to parse that response as the default JSON dataType
  dataType: 'text',

  # upload 8 MB at a time
  maxChunkSize: 8 * 1024 * 1024,

  # very importantly, the nginx upload module *does not allow*
  # resumable uploads for a Content-Type of "multipart/form-data"
  multipart: false,

  # add the Session-Id header to the request when the user adds the
  # file and we know its filename
  add: (e, data) ->
    data.headers or= {}
    data.headers['Session-Id'] = sessionId(data.files[0].name)

  # update the progress bar on the page during upload
  progress: (e, data) ->
    updateProgress(data.loaded, data.total)

Unlike the nginx config above, this example leaves out a lot of application-specific settings that aren’t relevant to getting the plugin to work with nginx.

Some important points:

  • I decided to use a simple JavaScript hashing function to hash the filename for the Session-Id. It might not need to be numeric, but all the nginx examples I read used numeric filenames, and the Session-Id is used directly by nginx as the filename on disk (see the path sketch just after this list).
  • As noted in the comment, the response to an individual upload request is a plain-text byte range, which is also present in the Content-Range header. The plugin uses this value to determine the next chunk of the file to upload.
  • This means that in order to resume an upload, the first chunk of the file must be re-uploaded. Then nginx responds with the last successful byte range, and the plugin will start from there on the next request. This can be momentarily disconcerting, since it looks like the upload has started over. Set your chunk size accordingly.
  • You must set multipart: false for resumable uploads to work. I missed that note in the protocol, and I wasted a lot of time trying to figure out why my uploads weren’t resuming.
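
Putting the Session-Id and the upload_store “hashing” together, here’s a rough Ruby sketch of where a given upload lands on disk. The upload_path helper is mine, purely for illustration, and the session IDs are examples:

# Mirrors the directory "hashing" described in the nginx.conf comment
# above: each level peels characters off the end of the name to build
# subdirectories under the storage root.
def upload_path(root, session_id, levels = [1])
  dirs   = []
  offset = session_id.length
  levels.each do |len|
    offset -= len
    dirs << session_id[offset, len]
  end
  File.join(root, *dirs, session_id)
end

upload_path('/backup/upload', '43829042')       # => "/backup/upload/2/43829042"
upload_path('/tmp/uploads', '43829042', [2, 1]) # => "/tmp/uploads/42/0/43829042"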

At this point, I could interrupt an upload, resume it by simply uploading the same file again, and I had a lovely progress bar to boot. The last step was making sure Rails worked.

Rails

All the hard work has been done by the time Rails even realizes somebody’s uploading something. The controller action looks exactly like you’d expect it to:

class ArchivesController < ApplicationController
  def restore
    archive = RestoreArchive.new(params[:archive])

    if archive.valid? && archive.perform!
      head(:ok)
    else
      render json: { errors: archive.errors.full_messages }, status: :unprocessable_entity
    end
  end
end
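
RestoreArchive itself is application-specific, so here’s only a hypothetical sketch (every name below except the class name is invented). The interesting bit is that it works with the path nginx hands us, not a Rack tempfile in /tmp:

# A hypothetical sketch of RestoreArchive. params[:archive][:path] points
# at the file nginx already wrote under /backup/upload, so there's no
# multi-terabyte tempfile to copy out of /tmp.
class RestoreArchive
  include ActiveModel::Validations

  attr_reader :filename, :path, :size

  validate :uploaded_file_exists

  def initialize(attrs = {})
    @filename = attrs[:filename]
    @path     = attrs[:path]
    @size     = attrs[:size].to_i
  end

  def perform!
    # Hand the archive off to the actual (application-specific) restore:
    # move it out of the upload store, verify it, kick off the job, etc.
    true
  end

  private

  def uploaded_file_exists
    errors.add(:path, 'upload not found') unless path && File.exist?(path)
  end
end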

The view suffers a bit, since the jQuery plugin wants to own the form and nginx has its configuration hard-coded:

<!-- The fileupload plugin takes care of all the normal form options -->
<form>
  <input id="restore-archive" type="file" data-url="/upload-restore-archive">
  <%= button_tag 'Upload and Restore', id: 'restore-upload-button', type: 'button' %>
</form>

That’s about it.

Success!

It was pretty sweet once it worked, but the journey was arduous. Hope this helps some people.

Who Deserves My Money?

I recently backed App.net after years of wishing I understood Twitter. Much has been said about App.net’s pricing structure — $50 per year to be a member — and it got me thinking about which other services I pay for on either a monthly or yearly basis. Who am I happy to pay? Who do I pay because I have to?

These are the services I thought of in approximately the order I thought of them:

  • AT&T U-Verse ($48) — 12 Mbps↓, 1.5 Mbps↑. No television or landline service, although they snail mail me once a week with tales of the money I’d save by bundling said services and…paying them more money.
  • AT&T Wireless (~$70) — 200 MB of data, ∞ minutes of voice, no text messages (I send essentially all iMessages). Of the major carriers, this is among the cheapest iPhone post-paid plans that I’m aware of. I’ll be evaluating my options soon since my contract has expired.
  • GitHub ($7) — five private repos, one private collaborator. I don’t write much open source code anymore, but I do occasionally deploy a few private sites with Capistrano, and we use it at work all the time.
  • Instapaper ($1) — far and away the best return-on-investment of all these services. I read in Instapaper almost every day on all my devices. It’s indispensable.
  • Site5 (~$4) — basic shared hosting plan. It’s where this site lives. Shared hosting keeps getting cheaper, but I got tired of changing providers a few years ago, so I continue giving my money to these guys. They’re good.
  • Railscasts ($9) — Ryan Bates deserves my money. He deserves everyone’s money.
  • Typekit (~$5) — Portfolio plan. So far, I’m actually only using this on my resume. (How could I not take advantage of Brandon Grotesque?)
  • Hover (~$1) — one domain registered (you’re looking at it). I like Hover, at least as much as I’ve used it.
  • App.net (~$5) — standard user account. Not being a Twitter user, I don’t have much to say about it re: Twitter. I can follow here practically everyone I’d follow there, and I like that Dalton is making money without advertisers.

This clearly doesn’t include one-time digital content purchases like software and music, but it’s fascinating to see that I pay about three times as much money to AT&T (roughly $118 a month) for access to the content I’m interested in as I do to the service and content providers themselves (roughly $32). It’s also worth considering how reluctant we web users are to pay even a minimal cost for a service we might love (like Instapaper) while inundating the infrastructure companies with hundreds of our dollars a month.

Still a Happy Mac User

A storm took out my power for about two hours while I was working from home last Friday. It was a minor annoyance, since I was working over the VPN and couldn’t finish what I was doing, so I decided to buy a CyberPower UPS from Amazon. Between Friday and today, when I was finally able to unbox it and plug it in, I had lost power three more times for a total of fifteen hours and burned up my cable modem. But that’s not important.

The UPS came with a CD of software. With no intention of installing said software, I glanced at the instructions for Windows and OS X:

Instructions for using a CyberPower UPS on Windows and OS X

Here’s what they say, paraphrasing slightly:

Windows Users: Installing PowerPanel® Personal Edition

When you first get a new CyberPower UPS, you’ll need to install some software on your computer to control your UPS and begin using it.

  1. Place the CD in your CD drive and wait for the setup wizard to begin. If the wizard does not begin, go to your CD drive in “My Computer” and open the “PowerPanel® PE” folder and double click “Setup.exe”.
  2. Follow the instructions on your screen and complete the installation. The default settings offered by the installation wizard are acceptable for most users and can be changed at any time if necessary.
  3. After the setup is complete, plug the USB cord from your CyberPower UPS to an available USB port on your computer.
  4. You are now ready to begin using the PowerPanel® Personal Edition software.

Mac Users: Configuring the “Energy Saver” UPS Function

When you first get a new CyberPower UPS, you’ll need to configure the Mac UPS function to control your UPS and begin using it.

  1. Plug the USB cord from your CyberPower UPS to an available USB port on your computer.
  2. Go to “System Preferences” and open the “Energy Saver” control panel.
  3. Select settings for “UPS”. You are now ready to configure the settings for the UPS.

No third-party bloatware to install or plastic disc to lose; just a single setting in System Preferences. I’m not as crazy about Apple’s software as I used to be, but I’m still more than happy to stay away from Windows.

GoDaddy Scumbaggery

Many moons ago, I registered a domain at GoDaddy. I knew how to navigate their shit-tastic UI from my client work, and they were at least marginally less reprehensible back then. Since I planned to use the domain for a while, I stored my credit card information and turned on auto-renewal.

Times change. I no longer need the domain, and GoDaddy sucks. A few months ago, they started sending me auto-renewal reminders and warning me that the credit card on file had expired. I knew that, and I didn’t care about the domain, so I just ignored the e-mails figuring that the charge would fail and the domain would be released. I’m nothing if not passive-aggressive.

Today, while unsubscribing from unwanted e-blasts collected in my spam folder, I saw a notice for a successful auto-renewal of that domain. I never updated my card’s expiration date at GoDaddy. I thought they might have been lying or joking, but the charge showed up on my account. Shocking, no?

I have to assume that they guessed the new expiration date of my card. It seems like my expiration dates advance three years at a time, so maybe that’s the first thing they try with expired cards. I don’t know if this is generally accepted practice, but it disgusted me.

My time is worth more than the $12 charge, so I’m not going to dispute it. I did release the domain and remove my payment information, and if I could figure out how, I’d cancel my account altogether.

Fuck those guys.

“To iterate is human…”

“…to recurse, divine.”

That’s one of maybe five of these fifty programming quotes I can recall at will. I actually used it today. Super proud about this one!

At work, our thing uses a tree structure with a Rails model like this:

class Node < ActiveRecord::Base
  belongs_to :parent, :class_name => 'Node'
  has_many :children, :class_name => 'Node', :foreign_key => :parent_id
end

We needed the “path” from any given node back to the root of the tree. It was originally implemented as a named scope and an instance method that called that scope:

scope :path, lambda { |id| 
  {
    :select => 'parent.*',
    :joins => ', nodes AS parent',
    :conditions => ['nodes.left BETWEEN parent.left AND parent.right AND nodes.id = ?', id],
    :order => 'parent.left'
  }
}

def path
  self.class.path(id)
end

It was performing terribly on large data sets. Not only was the query performance bad, but we were only ever using the return value of the named scope as an array, so there was no need for all the overhead of a named scope. We determined that it would actually be faster to hit the database n times for a node at depth n than to join all the ancestors in one query.

My first thought was to collect the node’s parents in a loop:

def path
  node = self
  parents = [node]
  while parent = node.parent
    parents.unshift parent
    node = parent
  end
  parents
end

It worked. Tests passed. Performance was astronomically better (down to 1ms from 270,000ms). I almost checked it in, but I hesitated. I heard the faint voice of L. Peter Deutsch, snickering at me in that way he almost certainly must.1 “Loops?” he said. “Gfaw.”

In about two minutes, I came up with this:

def path
  parent.path << self rescue [self]
end

Shit! I mean, the syntax might be a little wonky if you don’t read Ruby, but that’s magic. From a named scope with a lambda doing some kind of SQL query that stumped three professional programmers2 to one beautiful recursive line. I’m patting myself on the back pretty hard!
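
If the inline rescue is too clever for your taste, here’s an equivalent version without it (same result when there’s no parent, minus the blanket rescue):

def path
  parent ? parent.path << self : [self]
end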

Footnotes

  1. I still don’t know why we were filtering by left and right values and an ID.
  2. If your quip about programming made it into a well-known list, you probably snicker at bad programmers.


Featured Project

HSLider: A Color Widget for Mac OS X Dashboard

Targeted at developers using CSS, this Dashboard widget uses HSL sliders and plain ol’ text fields to tweak colors. Copy your new color as valid CSS (hex, RGB, or HSL) with a single click.

HSLider screenshot
