Hashing Flickr Photos

I used to host my photos with a simple set of CGI scripts that worked well enough for my modest requirements.  Such web applications are easy and fun to write, but in the end I decided that maintaining my own wasn’t worth it because:

  • Hosting large amounts of data on a generic shell account is typically quite expensive.  Flickr’s “pro” account subscription is a very good deal in comparison: as long as each photo is under 20 megabytes in size, you can upload as many as you like for $24.95 a year.
  • The community aspect of sites like Flickr is very encouraging – it’s lovely to have random people say nice things about your photographs, and occasionally have people use them in articles, etc.

(Some people are put off from using Flickr by the appearance of the site, but its API means that there are plenty of alternative front-ends for viewing or presenting your photos, such as flickriver.)

The slight problem with switching to hosting on Flickr was that previously I’d indexed all my photos by the MD5sum of the original image, so several of my pages had links or inline images that pointed to an MD5sum-based URL on the old site.  It occurred to me that it might be useful in general to have “machine tags” on each photo with a hash or checksum of the image, so that, for example:

  • You can simply check which photos have already been uploaded.
  • You can find URLs for all the different image sizes, etc. based on the content of the file.

Unfortunately, I hadn’t done this when uploading the files in the first place, so I had to write a script (flickr-checksum-tags.py) which takes the slightly extraordinary step of downloading the original version of every photo that doesn’t have the checksum tags to a temporary file, hashing each file, adding the tags and deleting the temporary file.  This adds tags for the MD5sum and the SHA1sum, using a namespace and keys suggested in this discussion, where someone proposes the same approach.  These tags are of the form:

  checksum:md5=c629c63f8508cfd1a5e6ba6b4b3253a8
  checksum:sha1=df44fc771660fbe7a2d6b2e284ae61e9ed3e377c
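
For illustration, tags in this form could be generated from a local file with something like the following.  This is only a minimal sketch using Python’s standard hashlib module, not the code from flickr-checksum-tags.py; the function name checksum_tags and the chunk size are my own choices here.

```python
import hashlib


def checksum_tags(path, chunk_size=65536):
    """Return the two machine tags for the file at `path`.

    The function name and chunk size are illustrative, not taken
    from the original script.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read in chunks so that large originals are never held in memory whole.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return [
        "checksum:md5=" + md5.hexdigest(),
        "checksum:sha1=" + sha1.hexdigest(),
    ]
```

Both digests are updated in a single pass over the file, so each downloaded original only has to be read once.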

The same script can return URLs for a given checksum:

  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --short
  > http://flic.kr/p/7oQxqK
  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 -p
  > [... the Flickr photo page URL, which WordPress insists on turning into an image ...]
  # ./flickr-checksum-tags.py -m c629c63f8508cfd1a5e6ba6b4b3253a8 --size=b
  > http://farm3.static.flickr.com/2552/4196574615_491c6387f8_b.jpg
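
To show how the pieces of those URLs fit together, here is a rough sketch that rebuilds the two example URLs above from the photo’s numeric ID.  It is not taken from the script, which may well get these values from the API in some other way.  The short URL is the photo ID written in Flickr’s base-58 alphabet; the names farm, server and secret for the other URL components, and both function names, are assumptions made for this example.

```python
# Flickr's base-58 alphabet: digits and letters without 0, l, I and O.
FLICKR_BASE58 = "123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ"


def short_url(photo_id):
    """Build a flic.kr short URL from a numeric photo ID (illustrative helper)."""
    n = int(photo_id)
    if n <= 0:
        raise ValueError("photo_id must be a positive integer")
    digits = []
    # Repeatedly divide by 58, collecting the least significant digit first.
    while n > 0:
        n, remainder = divmod(n, 58)
        digits.append(FLICKR_BASE58[remainder])
    return "http://flic.kr/p/" + "".join(reversed(digits))


def static_url(farm, server, photo_id, secret, size):
    """Assemble a static image URL in the same shape as the example output."""
    return "http://farm{}.static.flickr.com/{}/{}_{}_{}.jpg".format(
        farm, server, photo_id, secret, size
    )


print(short_url(4196574615))
print(static_url(3, 2552, 4196574615, "491c6387f8", "b"))
```

Run as written, this prints the same two URLs as the --short and --size=b examples above.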

The repository also has a script to pick out files that haven’t been uploaded, and a simple uploader script which will upload an image and add the checksum tags.  The scripts are based on the very useful Python flickrapi module, and you’ll need to put your Flickr API key and secret in ~/.flickr-api.

Anyway, these have been useful for me, so they may be of some interest to someone out there…

2 thoughts on “Hashing Flickr Photos”

  1. I was just about to write something very similar. Thank you for saving me the trouble! I’ll let you know how I get on.

  2. Wow, thanks for shipping this product.

    My end-to-end scenario here is:
    (1) Dedupe my images locally, then upload all my photos to flickr with https://github.com/ept/uploadr.py
    (2) Organize all my pictures into “photosets” named after the folders they were originally stored in. (some glue written, but not published)
    (3) Tag all my images with a checksum with this script.
    (4) Search for all dupes and delete the dupe copies that aren’t in a “photoset” (some glue not yet written)
