On Christmas
There’s a shameful relief in admitting you feel a little sad around Christmas. You think about the people who aren’t around anymore, and the memories that get a little dimmer each year, and you watch as everything around you changes without regard to how it fits into your bulleted Holiday Plan. Man. It can be rough.
Revisiting a post that meant a lot to me last year from the great Merlin Mann. The embedded video has, alas, given up the ghost, but can be seen elsewhere on the Youtubes.
Source: merlin
Brian Wilson - Heroes and Villains
An honor to have gotten to see Brian perform with his band this week. He might not have the same range as he did in the old days, but he clearly still loves his music and has finally gotten over the worst of his demons. For someone who has endured all that he has, we’re fortunate to have him performing his greatest songs and still producing amazing new music. You can see the triumph in this performance, nearly forty years after the unfinished, yet remarkable, SMiLE sessions.
Source: youtube.com
“Kentucky Route Zero” is a magic realist adventure game about a secret highway in Kentucky and the mysterious folks who travel it.
This game is being developed by Cardboard Computer and will be released around the Fall of 2011 (on PC, Mac and possibly other platforms).
This sleepy trailer speaks to me. Maybe it’s just because the game is set in some form of Kentucky, maybe it’s the Monroesque soundtrack or maybe it’s the robotic Loretta Lynn lookalike hitching a ride on a giant bald eagle. I can’t pin down any single reason, but I’m smitten.
Source: kck.st
jQuery Timeago and Django DateTimeFields
Here’s a (relatively) quick one, kids.
I love Django and its humble ORM. We’ve been casually dancing for the past five or so years, and it seems like things could be getting serious between us.
I also love timeago, a jQuery plugin that replaces ugly date/time strings with fuzzy approximations that humans (like myself) can understand.
I’ve always found date/time programming to be a real pain in the foot. There exists ISO 8601, which attempts to standardize the exchange of dates and times, but the myriad of variations that count as “standard” can result in systems adopting some subset of them to work with instead of fretting over each iteration.
Django’s built-in template filter for datetime formatting (appropriately called date) has a new parameter coming in Django 1.2 (example usage: somedatetime|date:"c") that gives you a full ISO 8601 representation without much work. Example output for that (directly lifted from Django’s documentation) is provided below:
2008-01-02 10:30:00.000123
So what’s the catch, man?
So all is well so far, but there’s a hitch. You see, using timeago on dates generated by somedatetime|date:"c" simply doesn’t work.
I spent some time digging through timeago’s source to find the issue and it seems that Django’s ISO 8601-compliant output is slightly too specific for timeago. It turns out that timeago hates microseconds, which date:"c" includes. When timeago encounters a string containing microseconds, it just quits.
Taming the Tiger
So how do we work around this? As soon as I get a few spare minutes, I’ll be submitting a patch to timeago that allows it to work with ISO 8601 strings containing microseconds. For now (and as a backup plan in case I never find those spare minutes), you can use the following format string to generate a timeago compatible representation in your Django templates.
{{ somedatetime|date:'Y-m-d\TH:i:s' }}
If you’re running Django 1.1.x or earlier, the above snippet will get the job done for you as well.
I’m likely violating best practices here by not specifying timezone information, but my current project is an internal tool, so I just needed something working quickly.
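If you would rather trim the value in Python before it ever reaches the template, the same idea can be sketched with the standard library alone. This is a Python 3 illustration, not something from the project; timeago_stamp is a name made up for the example, and it assumes a naive datetime just like the format string above.

```python
from datetime import datetime


def timeago_stamp(dt):
    """Return an ISO 8601 string with the microseconds dropped."""
    # replace() returns a new datetime; the original is left untouched.
    return dt.replace(microsecond=0).isoformat()


stamp = timeago_stamp(datetime(2008, 1, 2, 10, 30, 0, 123))
print(stamp)  # 2008-01-02T10:30:00
```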
Hope that helps.
Special Note: Django provides a template filter for datetimes called timesince that outputs a relative string, but I much prefer the casual fuzziness of timeago.
Update 4/17/2010: I’ve forked timeago to add a patch that discards any microsecond data and allows it to work out of the box with Django. Timeago’s tests all seem to be passing, so I’ll be submitting this upstream after I get my head around its test suite and write one covering this use case.
Store Binary Data in Twitter with Tootfiles
Sometimes a man’s worst ideas lead to his finest moments: Ben Franklin decides to fly a kite in a storm and we get electricity (I’m paraphrasing that story), Alexander Graham Bell shacks up with his 15-year-old apprentice (who was deaf) and we get the telephone (paraphrasing again).
Now, I don’t fly kites and my wife is neither 15 years old nor deaf, but I feel you and I might be on the cusp of solving the world’s data storage needs as I, too, have a terrible idea: let’s put everything in Twitter. Everything. Music, photos, tax forms—you name it, we should be able to store the data in a Twitter stream.
Brilliance begets ‘tootfiles’
So yes, you probably have realized that storing all of your binary data in Twitter is one of the top five or six ideas of this decade, so you’re as eager as I am to begin implementation. Luckily, I’ve already gone to the trouble and written a Python script that does just this: I call it tootfiles.
Note: What follows is a detailed look at how this project was built and the various choices I made along the way. If the details of character encodings and data compression bore you and you’re only interested in seeing the final code or using the script yourself, you can check out the tootfiles project page over at my Github account. I promise I won’t judge you.
A Quick (and faulty) Look at the Numbers
Twitter allows us to post individual text messages to its service with up to 140 bytes per message. Assuming for a moment that each binary byte can be represented as one character byte in a toot, it is fairly simple to gauge how many toots it will take to represent a file given its size in bytes. For example, the standard sized Google search logo weighs in at 8,558 bytes. Divided up into 140 byte segments, you might determine that we could represent this file with 62 individual toots. That sounds like a lot of bit-sized messaging, but consider that internet celebrity Robert Scoble has over 20,000 toots to his name and still has over 90,000 followers fawning over whatever it is that he actually does.
A more amusing case study might be trying to store a music file in your Twitter stream. Doing a quick search for some legal music to test with, I’ve found a punchy little number from the year 1909 called “John, Go and Put Your Trousers On.” With this short piece weighing in at 3,586,134 bytes, it would only take 25,616 individual toots to represent it as a tootfile. I’ll admit that this is a disturbingly high number of messages to store a small music file, but life is about trudging through dead end projects and never letting it break your spirit, so we’re just going to pretend it’s not a problem. Did you hear that? It’s not a problem.
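For anyone who wants to check the arithmetic, here is a small Python 3 sketch of the (still faulty) one-byte-per-character estimate. The helper name toots_needed is made up for this example. The one thing to watch is rounding up: the last 34 bytes of the song still need a toot of their own.

```python
TOOT_SIZE = 140  # bytes per toot, per the assumption above


def toots_needed(size_in_bytes, toot_size=TOOT_SIZE):
    """Round up: a partly filled final toot still counts as a toot."""
    return -(-size_in_bytes // toot_size)


logo_toots = toots_needed(8558)
song_toots = toots_needed(3586134)
print(logo_toots, song_toots)  # 62 25616
```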
Reality Bytes
So the truth is we can’t just slice the files up and stick the pieces into a toot. Binary files are a different beast from the short, nearly plaintext messages of Twitter. To fit any possible binary byte into a string that we can store in Twitter, we’ll need to store these bytes using some type of standard encoding system.
Choosing an encoding system
The two candidate encoding systems I evaluated for this project were Base64 and Base85, with the trailing number indicating the size of the usable character set. To understand this a little better, remember that binary is base 2 (it can represent any number using only 1 or 0) and the standard numerical representation is base 10 (representing any number using 0-9). You could theoretically encode anything using any base, but the higher the base the tighter you can pack the data in.
Size Matters…
I said before that the higher the base, the more efficiently you can pack data into your string. To examine this a little closer, let us look at the real world difference between Base64 and Base85 encodings.
Base64 uses the character ranges A–Z, a–z, and 0–9 for its first 62 character slots. Depending on the implementation (and there are a few), the remaining two characters differ. In mapping binary to this encoding scheme, Base64 uses four characters to represent three bytes of data. This leads to a roughly 33% increase in size when representing a binary file in Base64.
Base85 (as defined in RFC1924) uses the character ranges 0–9, A–Z, a–z, and the 23 characters !#$%&()*+-;<=>?@^_`{|}~ to represent data. By using five characters to represent four bytes of data, Base85 typically can represent a binary file with a 25% increase in size.
To compare these encodings using our two example files from above, I wrote a script that encodes the files with Base64 (as it’s included in the standard lib) and used this pure Python implementation which does the same with Base85. You can compare the file sizes in the chart below.

There seem to be no surprises there, as the numbers back up our earlier claims with regard to encoding efficiency.
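If you want to reproduce the comparison without hunting down a third-party module, a rough Python 3 sketch follows. It is not the script I used for the chart: it assumes Python 3.4 or later, where the standard library gained base64.b85encode (whose alphabet matches the character set listed above), and it uses random bytes rather than the two example files.

```python
import base64
import os

# 1,200 random bytes: divisible by both 3 and 4, so neither encoding pads.
payload = os.urandom(1200)

b64 = base64.b64encode(payload)  # 4 characters per 3 bytes
b85 = base64.b85encode(payload)  # 5 characters per 4 bytes

b64_overhead = len(b64) / len(payload) - 1
b85_overhead = len(b85) / len(payload) - 1
print(len(b64), len(b85))  # 1600 1500
```

That works out to roughly 33% extra for Base64 and 25% extra for Base85, in line with the figures quoted earlier.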
… But Character Sets Matter More
So yeah—Base85 is much more efficient; that’s the pick, right? I’ll now ask that you hold on for just one moment, as there is one more issue that we must consider: the characters used to represent our data in the encodings.
You see, Twitter has a little known ‘feature’ that does a bit of post-processing to your messages to prevent bad guys from inserting scripts into a Twitter stream. This sanitization of Twitter messages takes certain special characters and converts them into HTML entities, meaning that characters like <, >, and & get expanded into safer representations of themselves.
Sanitizing messages is great because it keeps you and me safe when we’re tooting away, but this has serious implications for our data storage scheme. You see, converting these single characters into HTML entities means each one now takes up four or five bytes instead of one (&lt; and &gt; are four bytes, &amp; is five). Base85 includes a number of characters that Twitter translates into HTML entities, so our 140 byte messages might actually end up taking up much more than that and cause problems in the decode process.
For the above reason, it’s smarter for us to stick with good old Base64 encoding. Even without Base85’s character issues (get it?), there’s something to be said for having a standard Base64 implementation baked right in to most standard libraries as well.
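To see which characters are at risk, here is a Python 3 sketch that runs both alphabets through html.escape. Treat html.escape as a stand-in only: I am assuming Twitter's sanitizer behaves similarly for these three characters, and the names B64_ALPHABET, B85_ALPHABET, and expanded are invented for the example.

```python
import html
import string

B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/=")
B85_ALPHABET = (string.digits + string.ascii_uppercase
                + string.ascii_lowercase + "!#$%&()*+-;<=>?@^_`{|}~")


def expanded(alphabet):
    """Map each character that grows when escaped to its HTML entity."""
    result = {}
    for ch in alphabet:
        escaped = html.escape(ch, quote=False)
        if escaped != ch:
            result[ch] = escaped
    return result


print(expanded(B64_ALPHABET))  # {}
print(expanded(B85_ALPHABET))  # {'&': '&amp;', '<': '&lt;', '>': '&gt;'}
```

In the worst case, a 140 character Base85 toot made entirely of ampersands would balloon to 700 bytes, while a Base64 toot never changes size.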
Implementation Details
Instead of walking through every line of the code, I want to just highlight some of the other choices I made and what it means for the finished product. This is broken down into two sections—one for posting to and one for decoding files from Twitter—as the situations require different approaches. The code snippets are shown as Python functions for brevity, but you’ll notice that the final code handles things with an object-oriented design.
Posting Data to Twitter
Compression
Before encoding our data to be posted to Twitter, we should really employ some type of simple data compression. Using the standard zlib package that ships with Python, I ran our two example files from before through a quick script to see how much compression really helps us. The snippet below (shortened at the expense of PEP 8) should give you an indication of how to do this on your own.
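import zlib, base64
# Read the file as raw bytes; the with block closes it for us
with open('somefile', 'rb') as f:
    fcompressed = zlib.compress(f.read())
# Base64 encode the compressed bytes so they survive as text
fencoded = base64.b64encode(fcompressed)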
The results from the run are shown in the following chart.

This chart actually surprised me a bit, as I had only hoped to shrink the file down enough to save some of the extra cruft we saw after encoding. While the compressed Google logo is only slightly smaller than the Base64 encoded file, our rousing hit song about trousers from 1909 has been reduced to a startling 82% of the original file’s size. Your mileage will vary on this, as some file formats compress better than others while many formats are compressed by default, but it’s clearly a huge win to include a bit of fast compression in our code.
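A quick way to run the same comparison yourself is sketched below in Python 3. The data here is a stand-in I made up (highly repetitive text), so it compresses far better than an image or audio file would; swap in your own file's bytes to get honest numbers.

```python
import base64
import zlib

# Stand-in data (an assumption): repetitive text compresses very well.
raw = b"John, go and put your trousers on. " * 200

encoded_only = base64.b64encode(raw)
compressed_then_encoded = base64.b64encode(zlib.compress(raw))

print(len(raw), len(encoded_only), len(compressed_then_encoded))

# Undoing both layers gives back the original bytes.
restored = zlib.decompress(base64.b64decode(compressed_then_encoded))
```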
Get Your toot in the Door
So now that we have a decent handle on our compressed and encoded data, it’s time to think about getting it tooted. Before posting the messages to Twitter, we’ll first need to split the large string of data into 140 byte chunks. The below function does just this, given a data string and defaulting the slice size to 140.
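import math

def segment(data, n=140):
    ''' Given the encoded string, slice it into twitter ready array elements '''
    # Number of toots needed, rounding up for a partial final slice
    tootcount = int(math.ceil(len(data)/float(n)))
    slices = range(tootcount)
    # Reverse so the tail of the data is posted first
    slices.reverse()
    return [data[i*n:(i+1)*n] for i in slices]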
You’ll notice that we reverse the range in the segment() function, as we want to build the list from tail to head. This reverse order makes reading the individual toots easy later on in the decode section, as we can simply iterate through the stream grabbing the toots in order.
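The ordering trick is easier to believe when you watch it work. Below is a Python 3 variation on the same idea (not the project's code) with a simulated newest-first timeline; the dummy data and the names posted and timeline are made up for the example.

```python
def segment(data, n=140):
    """Slice data into n-sized chunks, ordered tail first."""
    chunks = [data[i:i + n] for i in range(0, len(data), n)]
    chunks.reverse()
    return chunks


encoded = "".join(str(i % 10) for i in range(300))  # 300 dummy characters
posted = segment(encoded)          # posting order: tail first
timeline = list(reversed(posted))  # a newest-first timeline shows the head first
print([len(chunk) for chunk in posted])  # [20, 140, 140]
print("".join(timeline) == encoded)      # True
```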
Including some type of header helps signify the start of a file, along with a bit of metadata. Header information is appended to the list (to be posted last) using the following format:
tootlist = segment(data)
# Header Information
header = ("|Tootfile:'%s' MD5:'%s' Count:'%s'|"
          % (filename, md5hash, tootcount))
tootlist.append(header) # Insert the header
With this header, we’re helping future decode runs by including three important pieces of information: the name of the file, the MD5 hash of the file data (for integrity checking), and the number of data segments that will follow the header.
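On the decode side, pulling those three values back out is a job for a regular expression. The Python 3 sketch below is one way it could look. HEADER_RE, parse_header, and the sample header are all hypothetical, and it assumes the MD5 value is a lowercase hex digest.

```python
import re

HEADER_RE = re.compile(
    r"^\|Tootfile:'(?P<filename>.*)' "
    r"MD5:'(?P<md5>[0-9a-f]{32})' "
    r"Count:'(?P<count>\d+)'\|$"
)


def parse_header(text):
    """Return (filename, md5hash, tootcount), or None if text is not a header."""
    match = HEADER_RE.match(text)
    if match is None:
        return None
    return (match.group("filename"), match.group("md5"),
            int(match.group("count")))


sample = "|Tootfile:'logo.gif' MD5:'d41d8cd98f00b204e9800998ecf8427e' Count:'62'|"
print(parse_header(sample))
# ('logo.gif', 'd41d8cd98f00b204e9800998ecf8427e', 62)
```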
Publishing the Segments
Publishing the toots to Twitter is accomplished using the python-twitter library. It’s a fairly simple library to use (despite the uppercase method names), so I won’t get into the details of this other than to provide a quick look at the publish() method:
import time
import twitter

def publish(data, username, password):
    api = twitter.Api(username, password)
    for toot in data:
        # Give each toot up to five attempts before giving up
        for retries in range(5):
            try:
                status = api.PostUpdate(toot)
                break
            except Exception:
                if retries == 4:
                    raise Exception('Unable to post a segment. Quitting.')
                else:
                    # Wait a second before trying again
                    time.sleep(1)
    print "Finished."
There’s some rudimentary fault tolerance built in, as I encountered some timeout issues when tooting a larger file. The run will fail if it cannot post a single toot within five attempts.
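The retry pattern can be tried out without touching Twitter at all. Here is a Python 3 sketch with the posting call swapped for a fake; post_with_retries and flaky_post are names invented for the example, and nothing here calls a real API.

```python
import time


def post_with_retries(post, toot, attempts=5, delay=1.0):
    """Call post(toot) until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return post(toot)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error surface
            time.sleep(delay)


calls = []


def flaky_post(toot):
    """Fake poster (not a real Twitter call): fails twice, then succeeds."""
    calls.append(toot)
    if len(calls) < 3:
        raise IOError("simulated timeout")
    return "posted: " + toot


result = post_with_retries(flaky_post, "abc", delay=0)
print(result, len(calls))  # posted: abc 3
```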
Retrieving Data from Twitter
This part of the project was a bit tricky, as the Twitter API does not seem to provide a good way to access a single user’s stream of toots for more than a single page. My preferred method was for a user to only need to know the ID of a tootfile header when retrieving files.
The first issue I ran into was not knowing which username to grab all of the toots from. I ended up using simplejson with a call to http://twitter.com/statuses/show/<TOOTID>.json, which returns enough info from a single Twitter message ID to do the rest of our business. The key piece of info I needed was the Twitter username that owns the toot. I could have settled and required a full URL path to the header (which includes the username), but this was a choice I made, right or wrong, as to how I wanted this implemented.
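Once the JSON is in hand, digging the username out is a one-liner. The Python 3 sketch below skips the network call and parses a trimmed, made-up response body instead. The user and screen_name field names are my assumption about the response shape, and owner_of is a hypothetical helper. It uses the standard json module, which exposes the same loads() call as simplejson.

```python
import json

# A trimmed, made-up response body; real responses carry many more fields.
sample_body = (
    '{"id": 1234567890, "text": "header toot goes here", '
    '"user": {"screen_name": "tootfiles"}}'
)


def owner_of(status_json):
    """Return the screen name of the account that posted the status."""
    status = json.loads(status_json)
    return status["user"]["screen_name"]


print(owner_of(sample_body))  # tootfiles
```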
After reading a blog post from Scott Carpenter, I decided to just use the fantastic Python HTML scraping library BeautifulSoup to grab the tweets out of the given user’s stream. While Scott’s goal was to archive all of his Twitter messages, I was able to simplify the script to grab just the info we needed. The modified section is below.
import math, re, time, urllib
from BeautifulSoup import BeautifulSoup

def walk(username, headerid, tootcount):
    tootlist = []
    grabbedtoots = 0
    url = 'http://twitter.com/%s?page=%s'
    re_status_id = re.compile(r'.*/status/([0-9]*).*')
    # find the max number of pages, based on 20 per page
    maxpages = int(math.ceil(int(tootcount)/20.0))
    for page in range(1, maxpages+1):
        f = urllib.urlopen(url % (username, page))
        soup = BeautifulSoup(f.read())
        f.close()
        toots = soup.findAll('li', {'class': re.compile(r'.*\bstatus\b.*')})
        if len(toots) == 0:
            break
        for toot in toots:
            # Do we need more toots? If so, keep going
            if grabbedtoots < int(tootcount):
                m = re_status_id.search(toot.find('a', 'entry-date')['href'])
                status_id = m.groups()[0]
                # Look for the message directly after our header
                if int(status_id) < int(headerid):
                    data = str(toot.find('span', 'entry-content').renderContents())
                    tootlist.append(data)
                    grabbedtoots += 1
            else:
                break
        # one second delay between pages
        time.sleep(1)
    # hand the collected segments back to the caller
    return tootlist
There’s nothing too fancy going on here—just some pulling down of the Twitter pages and scraping out the messages. This implementation of walk() is naive and assumes that you only have one tootfile in your stream and that it is located at the front of your stream. I consider this a bug, so it will be fixed eventually in the actual project.
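Stripped of the scraping, the selection logic boils down to "take the messages older than the header, newest first, until you have enough." A Python 3 sketch of just that step is below. select_segments and the sample timeline are made up for illustration, and like walk() it assumes status IDs grow over time.

```python
def select_segments(timeline, headerid, tootcount):
    """Pick the tootcount messages posted immediately before the header.

    timeline is a newest-first list of (status_id, text) pairs.
    """
    older = [text for status_id, text in timeline if status_id < headerid]
    return older[:tootcount]


timeline = [
    (105, "unrelated chatter"),
    (104, "header"),
    (103, "head"),
    (102, "middle"),
    (101, "tail"),
    (100, "older unrelated toot"),
]
print(select_segments(timeline, headerid=104, tootcount=3))
# ['head', 'middle', 'tail']
```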
Reassembling the data is fairly trivial once you have the data in a list. The following snippet works in the reverse order of the encoding step, peeling back the layers of encoding and compression before leaving you with a string representing the file data.
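import base64, hashlib, zlib

def decode(tootlist):
    """Decodes the raw data given a list of toots"""
    data = "".join(tootlist)
    compressed_data = base64.b64decode(data)
    rawdata = zlib.decompress(compressed_data)
    # Hash the decoded bytes so the caller can compare against the header
    md5hash = hashlib.md5(rawdata).hexdigest()
    return rawdata, md5hash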
With that, we’re done. The decode process can now check the md5 hash of the assembled data against what’s in the header and write out the rawdata to either a file or standard out.
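To tie the whole pipeline together, here is a self-contained Python 3 round trip: compress, encode, slice, rejoin, decode, and verify. It is a sketch rather than the project's code. The encode and decode helpers here are simplified stand-ins, the sample data is made up, and no Twitter calls are involved.

```python
import base64
import hashlib
import zlib


def encode(rawdata):
    """Compress, then Base64 encode, returning (text, md5hash)."""
    text = base64.b64encode(zlib.compress(rawdata)).decode("ascii")
    return text, hashlib.md5(rawdata).hexdigest()


def decode(tootlist, expected_md5):
    """Join the toots, undo both layers, and verify the checksum."""
    rawdata = zlib.decompress(base64.b64decode("".join(tootlist)))
    if hashlib.md5(rawdata).hexdigest() != expected_md5:
        raise ValueError("MD5 mismatch: the tootfile is damaged or incomplete")
    return rawdata


original = b"trousers " * 100
text, md5hash = encode(original)
toots = [text[i:i + 140] for i in range(0, len(text), 140)]
restored = decode(toots, md5hash)
print(restored == original)  # True
```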
Grab the source and fork it for your own needs
If you’re interested in the full (and slightly more proper) Python implementation of tootfiles, you can grab the source code and documentation over at its Github project. The script can either be used as a standalone command line encoder, or you can use it as a Python library.
I’m releasing the script with an MIT license, which basically means you can do whatever you want with it.
This is mostly a proof of concept release, as the whole idea of storing meaningful amounts of data in 140 byte segments is absolutely silly. Even as I write this, I’m aware of a few bugs that would occur if someone were to try and use it heavily. With that said, I’ll continue to iron things out and I will gladly accept any patches as well.
If you’re interested in seeing some actual postings from the script, check out tootfiles on Twitter. Have fun, and please don’t spam your followers with a tootfile—especially if you’re someone that I’m following. ;)