o< cuisine migration

For once I’m blogging in English here, because I’d expect most people who’d actually have any interest in this entry to also read English 🙂

I’ve been wanting to move o< cuisine to another place for a while now. The main issue is that I’m a very lazy sysadmin, and consequently maintaining a WordPress on a personal server was not necessarily the bestest of ideas, especially since it’s a blog I don’t write on that often, and so I don’t see the blinking update notifications as often as I should. So moving to the hosted wordpress.com platform made sense (I had happily moved this one a few months ago, and I could definitely live with that).

Now the issue is that my WordPress install is somewhat peculiar. In the past, I’ve been using a plugin to display photos from Gallery (hosted on the same server): thankfully, this one had already been taken care of during the dotclear-to-wordpress migration, so I had reasonably clean URLs to work with. More problematically, I was also using the shashin plugin to display pictures from Picasa/Google+/Google Photos/the current iteration of that thing. And there, the shashin plugin was actually keeping its own data and not replacing its tags with proper links (which completely makes sense given the use case) – so instead of pictures, I had stuff like [simage=], which was transformed when rendering the corresponding blog entry, and not statically. So I knew from the start that this would not work with wordpress.com, where plugins are essentially a no-no. This is the main reason why I hadn’t migrated yet. The other reason was that I had a neat little thing to add recettes.de tags at the end of my entries, and that I’d have to say goodbye to it as well. Anyway, it was not simply a matter of exporting/re-importing: I knew I’d have to process the exported result quite a bit before being able to import it as I wanted on the new platform.

Important preliminary caveat

I’m going to link a few scripts here. I consider these scripts to be very much throw-away-execute-once-code, they are VERY ugly (I’m not a Python coder, and yes, I’m parsing XML with regexps, deal with it, and my dealing with text encoding is very much on the YOLO side – it works until it doesn’t). I hesitated a while before actually adding them to this article, especially with the whole « throw-away » concept; on the other hand, maybe, just maybe, it may be of some use as an example to someone, and if I don’t provide them now then I might get some requests for them at a less convenient time. I’m also fairly sure that if I provide them, no-one will have a look at them, and if I don’t, someone will ask. So there. Note: I’m not looking for code reviews here, there’s a LOT of things I’d improve on these things if I cared, but I really don’t, so… 🙂 Also, I’m not guaranteeing that they won’t have catastrophic results if you decide to execute them for the lulz. So don’t do that, probably.

Step 0: evaluate the damage

So I knew it wouldn’t be enough to export/re-import. First thing I did was to actually do that, though. I exported the blog from my own WordPress, I re-imported it into a dummy WordPress blog. As expected, I lost most of the images. The good surprise, though, was that the images that were actually stored on WordPress were exported and re-imported as well. Other than that, nothing much to say, but most importantly, no bad surprise.

First things first: the Gallery URLs

The easy part of the migration had to do with the fact that I had a bunch of image URLs that were pointing to the gallery with a relative URL. So that wouldn’t fly.

There, the approach was very much brute-force, and pretty much something like

sed "s+/gallery/+http://cuisine.palats.com/gallery/+g" export.xml > export-with-proper-gallery-links.xml
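One thing to watch with a blanket substitution like this is that it also rewrites any link that is already absolute. As a hedged illustration only (not what was actually run), here is a Python sketch that touches only links starting right after a quote character. That anchoring rule is an assumption about how the links appear in the export, and the function and constant names are made up for the example.

```python
import re

# Assumption: relative Gallery links start right after a quote character,
# e.g. src="/gallery/..." or href='/gallery/...'.
RELATIVE_GALLERY = re.compile(r"""(?<=["'])/gallery/""")
GALLERY_BASE = "http://cuisine.palats.com/gallery/"


def absolutize_gallery_links(xml_text):
    """Prefix relative /gallery/ links with the site's base URL.

    Links that are already absolute are left alone, because in those the
    "/gallery/" part does not directly follow a quote.
    """
    return RELATIVE_GALLERY.sub(GALLERY_BASE, xml_text)


if __name__ == "__main__":
    sample = '<img src="/gallery/a.jpg"> <a href="http://cuisine.palats.com/gallery/b.jpg">'
    print(absolutize_gallery_links(sample))
```

Either way, checking the result by eye afterwards remains the real safety net.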

After checking that this did everything I wanted it to do and only what I wanted it to do, time for the next step.

Next: removing shashin tags

So as I was mentioning, the main issue that I had with the images was that my images were simply encoded with an ID, and that that ID was associated to the picture data themselves in the database. So time to brush up on the use of the mysql console. I started with a show tables; to have an idea of where that information would actually be stored; then I checked that my understanding of the storage was correct with a few queries like select * from wp_shashin_photo where id=2203;. While it technically doesn’t prove anything, it was good enough for me 😉

Time for a little bit of Python. Fetch IDs and corresponding URLs, search what I want to replace, replace it, be happy. I put the script in question here; please re-read the above caveat if it’s not burnt into your brain yet.

After I ran that script, I still had a few tags that weren’t taken into account by my splendid processing engine; since there were 3 or 4 of them, I edited them by hand, re-ran the script, and voilà.
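For the curious, the general shape of such a replacement can be sketched as below. This is not the linked script. The tag syntax is an assumption (a numeric ID right after "[simage=", then optional extra options before the closing bracket), the dictionary stands in for whatever the database query returned, and the example URL is a placeholder.

```python
import re

# Assumption: a tag looks like [simage=<numeric id>,<other options>];
# only the leading numeric ID is used here.
SIMAGE_TAG = re.compile(r"\[simage=(\d+)[^\]]*\]")


def replace_simage_tags(text, url_by_id):
    """Replace every tag whose ID is known; report the ones that are not.

    Returns (new_text, leftovers), where leftovers lists the tags that were
    left untouched so they can be fixed by hand.
    """
    leftovers = []

    def substitute(match):
        url = url_by_id.get(int(match.group(1)))
        if url is None:
            leftovers.append(match.group(0))
            return match.group(0)  # keep the tag as-is for manual editing
        return f'<img src="{url}" />'

    return SIMAGE_TAG.sub(substitute, text), leftovers


if __name__ == "__main__":
    # Placeholder data: the real mapping would come from the database.
    urls = {2203: "http://example.com/photo.jpg"}
    new_text, todo = replace_simage_tags("a [simage=2203,400] b [simage=9] c", urls)
    print(new_text)
    print("still to fix by hand:", todo)
```

Returning the leftovers explicitly is what makes the "edit the last few by hand, then re-run" loop painless.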

While I’m at it: getting rid of Gallery

At that point, I felt pretty much ready to flip the switch and migrate the whole thing. And then, while discussing it with my husband, it was mentioned that it also would be reaaaally nice if I could actually get rid of the Gallery dependency as well. This could clearly not be done by hand: there were a bit more than 2000 pictures on that gallery, so it was a non-trivial task.

I looked at my options to host pictures somewhere else: I essentially needed the new host to allow me to upload the pictures, keep track of which URL they ended up at, and replace the old Gallery links with the new ones. Easy-peasy. I first had a look at Google Photos – while it’s all fancy and nice, the complete lack of API made the thing not an option. I vaguely looked at options via Drive, and I ended up deciding against it – too complicated, too random. On the other hand, Flickr has a reasonable API, a terabyte of space, and everything’s nice and shiny. So I did a few tests with flickr_api, decided it would work well enough, and started the whole thing.

I went with a three-step approach:

  • grab everything that looks like a gallery URL, fetch the corresponding full-res image from the gallery, and store it on disk along with a file matching URLs to filenames,
  • upload everything to Flickr, making a note of where Flickr actually stores my stuff,
  • replace the gallery URLs with the Flickr ones.

Again, easy-peasy.

The first step was the easiest one: I just needed to grep the Gallery URLs, generate the correct URL that would fetch the right picture, download, store in non-stupid place on disk, update log of operations. Of course, it wasn’t exactly as simple as that – I had a few URLs deviating from the usual pattern and I had to tweak the script a bit to take them into account. Said script is here; again, beware, here be dragons.
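A stripped-down sketch of that first step might look like the following. It is not the linked script: it skips the Gallery-specific part (turning a page URL into the full-res image URL), the URL pattern is an assumption, and the numbered filenames are just one way of picking a non-stupid place on disk.

```python
import re
import urllib.request
from pathlib import Path

# Assumption: a Gallery URL runs until the next whitespace, quote or angle bracket.
GALLERY_URL = re.compile(r"http://cuisine\.palats\.com/gallery/[^\s\"'<>]+")


def fetch_bytes(url):
    """Download one URL and return its content as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_gallery_images(xml_text, out_dir, fetch=fetch_bytes):
    """Save each distinct Gallery URL found in xml_text under out_dir.

    Returns a dict mapping each URL to the filename it was saved as,
    which plays the role of the log of operations.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for index, url in enumerate(sorted(set(GALLERY_URL.findall(xml_text)))):
        target = out / f"{index:05d}{Path(url).suffix}"
        target.write_bytes(fetch(url))
        mapping[url] = target.name
    return mapping
```

Passing the download function in as a parameter makes it possible to try the whole thing on a fake fetcher before hitting the real server.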

So at the end of this operation, I had a directory full of pictures and a file matching the gallery html code to said files. As a sanity check, I looked for files less than 10K, which would more probably be server error messages than my images, and found none. So far, so good.
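That sanity check is a one-liner with find, but for consistency with the rest, here is a small Python sketch of it. The function name is made up, and the 10K threshold is simply the one mentioned above.

```python
from pathlib import Path


def suspiciously_small(directory, threshold_bytes=10 * 1024):
    """Return the files in `directory` that are smaller than the threshold.

    A downloaded "image" of a few hundred bytes is more likely an error
    page from the server than an actual picture.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_size < threshold_bytes
    )


if __name__ == "__main__":
    for path in suspiciously_small("."):
        print(path, path.stat().st_size)
```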

Then came the upload to Flickr. No rocket surgery there either – take the pictures, upload to Flickr via Magic API, log where they end up, and done. Except that when I started the script, I realized that uploading a picture would actually take several seconds, possibly up to ten. The very annoying thing there was that 2000 pictures, 10 seconds apiece, is more than 5 hours. The whole process was almost trivially parallelizable (instead of uploading one picture at a time, start 15 threads that do the same thing), but I knew nothing about Python parallelization primitives, and even then, my concurrent programming skills are somewhat rusty. (There was a hilarious moment before I resigned myself to the fact that yes, I needed to synchronize the access to stdout, and that it would NOT generally be okay.) All in all, I decided that I’d probably put these 5 hours to better use learning how to do that thing instead of waiting for it to be done, so I did that. And in the end, after a bit of fiddling, I ended up with a working solution – which is here. It didn’t upload everything on the first run, because sometimes Flickr’s API is, well, flickery, but since I was logging everything and not re-uploading stuff that had already been uploaded, well, I just ran it a few times and in the end everything was uploaded. Flickery API also meant that I had to actually handle the exceptions that would happen, because otherwise my 15-thread script would actually end up quite quickly as a three, two, one, zero active thread script. The more you know.
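To make the pattern concrete, here is a minimal sketch of a resumable, multi-threaded upload loop. It is not the linked script and it does not call the Flickr API: upload_one is a hypothetical callable standing in for the real upload, and the thread pool from concurrent.futures is just one of several ways to get 15 workers. For scale: 2000 pictures at 10 seconds each is 20,000 seconds, about five and a half hours, and in the ideal case 15 workers would bring that down to roughly 22 minutes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def upload_all(paths, upload_one, already_done, workers=15, attempts=3):
    """Upload paths in parallel, skipping those already in `already_done`.

    `upload_one(path)` is a hypothetical callable returning a photo ID.
    Returns (done, failed): done maps path -> ID, failed maps path -> last error.
    """
    done = dict(already_done)
    failed = {}
    lock = threading.Lock()  # guards done, failed and stdout

    def worker(path):
        last_error = None
        for _ in range(attempts):
            try:
                photo_id = upload_one(path)
            except Exception as exc:  # a flaky call must not kill the worker
                last_error = exc
                continue
            with lock:
                done[path] = photo_id
                print(f"uploaded {path} -> {photo_id}")
            return
        with lock:
            failed[path] = last_error
            print(f"giving up on {path}: {last_error}")

    todo = [path for path in paths if path not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, todo))  # consume the iterator to wait for all
    return done, failed
```

Feeding the returned done mapping back in as already_done on the next run is what makes "just run it a few times" work, and catching exceptions inside the worker is what keeps a failed upload from taking a thread down with it.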

Because of previous mishaps, I was a bit wary of the fact that maybe I’d logged stuff that wasn’t actually properly uploaded. So I wrote a small verification script, which I also happened to multi-thread, because now that I knew how it went, it was easy 🙂 – here it is.

And finally, last but not least, I needed to replace my Gallery URLs with Flickr URLs. I actually made more calls to the API in this script: when uploading the images, I was getting and logging the ID, but I needed another query to get the image in the correct size. That could probably have been done in a previous step, but oh well. Last script for today here.
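The replacement itself boils down to a lookup table applied to the export file. Here is a hedged sketch of that last bit only, leaving out the API calls: the mapping from old to new URLs is assumed to be already built, and the function name is made up.

```python
import re


def replace_urls(text, new_url_by_old):
    """Replace every known old URL in `text` with its new counterpart."""
    if not new_url_by_old:
        return text
    # Longest first, so that a URL which is a prefix of another one
    # cannot match in its place.
    alternatives = sorted(new_url_by_old, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in alternatives))
    return pattern.sub(lambda match: new_url_by_old[match.group(0)], text)
```

Doing all the replacements in a single pass also avoids the situation where the output of one substitution gets rewritten by the next.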

And at the end of that operation, I finally had a full import XML for WordPress, with all my images where I wanted them, so yay. I imported it on a fresh WordPress.com URL, and there, the new o< cuisine is alive!

Tying up loose ends: DNS, Apache and all these internets

Now the thing is, o< cuisine on its previous address was, well, not exactly known, but you could find it on the internets. So I kind of wanted to keep the old links working. Except I couldn’t really, because the URL had a subfolder in it, which is not supported by wordpress.com as a valid redirection. So I set up a new subdomain on my domain, set it up as primary domain for my WordPress blog, added a Redirect 301 "/coinblog/" "http://coincuisine.pasithee.fr/" in my current Apache configuration on the old domain, and voilà.

Conclusion: what I (re-)learned

  • I actually enjoyed the whole thing a lot, to my own surprise. It was fairly nice to hack something into submission and in the end to have a working result. I should do that sort of thing more often 😉
  • I learnt a bit about Flickr API – always a nice-to-know.
  • I learnt a bit about Python concurrency primitives, that’s actually very useful.
  • I have a new favorite grep option, « -o », as in grep -o '.\{0,10\}gallery.\{0,10\}' oltcuisine.wordpress.xml – it prints only the matched part of each line, so with this pattern I get every match with up to 10 characters of context before and after it. I actually used that thing 10 times today at work (yes, really.)
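For the record, the same trick translates directly to Python's re module. A small sketch, with a made-up function name:

```python
import re


def grep_o_with_context(pattern, text, width=10):
    """Return each match of `pattern` with up to `width` characters around it.

    Works line by line, like grep: "." never matches across a newline.
    """
    regex = re.compile(r".{0,%d}(?:%s).{0,%d}" % (width, pattern, width))
    return [
        match.group(0)
        for line in text.splitlines()
        for match in regex.finditer(line)
    ]


if __name__ == "__main__":
    print(grep_o_with_context("gallery", 'see <img src="/gallery/clafoutis.jpg"> here'))
```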

Oh, and the image illustrating this blog entry obviously comes from o< cuisine: Clafoutis.