Uploading GBs to Django

As some may remember, one of our products involves the user posting very large (multiple GB) files. Previously we've been using an nginx extension that does a direct-to-disk streaming decode during an upload, but that extension isn't available in newer nginx (which we'd like to have available in order to take advantage of HTTP Live Streaming and Flash Video support).

We also wanted all the nice features you get with jQuery-File-Upload: resumability, progress tracking, etc. I looked through the various back-end scripts for jQuery-File-Upload and was... underwhelmed: lots of "basic functionality only" back-ends and a few overly complex ones that did way more than we wanted (e.g. producing thumbnails), often at the expense of adding PHP to the servers.

However, I have a huge advantage over JQ-F-U: I can just declare that users must use version X or above of a browser, so I can use all the nice HTML5 goodness to make multi-GB uploads work cleanly with around 300 lines of JS and around 200 lines of Python (there are a few lines of HTML templates too, but whatever). It works in Firefox 24+, Chrome 28+, and IE 11+ (it likely works all the way down to Firefox 7, Chrome 13, and IE 10, but I haven't bothered to test those), doesn't rely on any plug-ins, and is all straightforward code that anyone can understand and debug. It does serialized uploads in 10MB chunks (the chunk size is parametrised, but 10MB seems to work for big video files), doesn't rely on any particular front-end server or configuration, and basically just works (in fact, the first time I pulled it up on IE11 and Firefox it just worked on both with no tweaking).

The key to making it work is to use the File API to iterate over the user-selected file in chunks, then post those chunks to a Django handler that appends to the file-in-process (I actually use jQuery.ajax() for the uploads, for that matter). The URL to which you post each chunk encodes the current position and the upload ID, so out-of-order posts don't mess you up. Each user gets their own upload space (obviously you only want trusted users uploading multi-GB files). It would be nice to have checksums as well, but so far I haven't bothered.
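A minimal sketch of the append step in plain Python, with no Django wiring. The names append_chunk and iter_chunks are hypothetical, and rejecting any chunk whose offset doesn't match the bytes already on disk is an assumption about how out-of-order posts might be handled, not a description of the production handler.

```python
import os

CHUNK_SIZE = 10 * 1024 * 1024  # 10MB, the chunk size mentioned above


def append_chunk(path, offset, data):
    """Append data to path only if offset equals the bytes already written.

    Returns the number of bytes on disk after the call, so the caller can
    tell the client where the next chunk should start.
    """
    written = os.path.getsize(path) if os.path.exists(path) else 0
    if offset != written:
        # Out-of-order or repeated post: leave the file untouched.
        return written
    with open(path, "ab") as handle:
        handle.write(data)
    return written + len(data)


def iter_chunks(payload, chunk_size=CHUNK_SIZE):
    """Yield (offset, chunk) pairs, roughly what the browser side does
    when it slices the selected file with the File API."""
    for offset in range(0, len(payload), chunk_size):
        yield offset, payload[offset:offset + chunk_size]
```

Because the offset travels with each chunk, a repeated or stale post simply gets the current size back and the file is left alone.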

Before you start uploading, you request an upload ID (eventually I'll likely pre-qualify the upload based on MIME type and file size). If the file "matches" an existing upload (same filename, same size, same user, same "purpose" (upload space)), you get the old ID back. The response includes the upload ID and a URL to which you can start pushing data. On each push you get an updated JSON structure telling you the next upload URL and how much data has been written (eventually the checksum too). There's cancel/delete functionality too, and multiple uploads are queued up such that there's only one upload from a given open page at a time.
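The ID request and the per-push response could be sketched roughly like this. Everything here is illustrative: the UploadRegistry class, the in-memory dictionaries, the field names, and the /uploads/<id>/<offset>/ URL layout are all assumptions, not the real implementation.

```python
import uuid


class UploadRegistry:
    """In-memory stand-in for whatever storage tracks uploads (an assumption)."""

    def __init__(self):
        self._ids = {}      # (user, purpose, filename, size) -> upload ID
        self._written = {}  # upload ID -> bytes written so far

    def request_upload(self, user, purpose, filename, size):
        """Return the status for a matching existing upload, or start a new one."""
        key = (user, purpose, filename, size)
        upload_id = self._ids.get(key)
        if upload_id is None:
            upload_id = uuid.uuid4().hex
            self._ids[key] = upload_id
            self._written[upload_id] = 0
        return self.status(upload_id)

    def record_chunk(self, upload_id, offset, length):
        """Count a chunk only if it starts where the previous one ended."""
        if offset == self._written[upload_id]:
            self._written[upload_id] += length
        return self.status(upload_id)

    def status(self, upload_id):
        """Build the structure that would be serialized to JSON for the client."""
        written = self._written[upload_id]
        return {
            "upload_id": upload_id,
            "written": written,
            # Hypothetical URL layout: upload ID and next position in the path.
            "next_url": "/uploads/%s/%d/" % (upload_id, written),
        }
```

Requesting an ID again for the same user, purpose, filename, and size hands back the same ID along with the bytes already written, which is what lets a client resume instead of starting over.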

What's somewhat annoying is that it's currently tied into our infrastructure a bit too much. There's no reason such a thing shouldn't be a generic component to plug into Django, but I wrote it to scratch an internal itch, rather than with an eye to releasing it. It depends on lots of little things (Date.js, fussy, our particular settings standards) that could be stripped out, but doing so likely isn't worth it, as I doubt there are all that many people wanting multi-GB uploads over HTTP anyway. Every upload plugin you look at seems to think that a 20MB image is a big deal and needs feedback and all sorts of special handling, and I suppose that really is the state of the world's network connections for the most part.

