My efforts to get SIFF schedule data into a format useful to me continue.
Earlier, I had collected the movies that I had manually tagged on SIFF's website as ones I might be interested in. That original pass was missing some data, like the running time of each movie, so I proceeded to spider the full 2015 festival schedule:
Before I go on, let me encourage anybody who is inspired to do any sort of crawling of a website to be considerate, for example by rate-limiting your requests - it's easy to imagine a script getting out of hand and inadvertently becoming a "denial of service" bot. And presumably, you don't want that.
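To make "rate-limiting" concrete, here is a minimal sketch in Python 3. It is an illustration only, not the script that produced the spreadsheet; the names `RateLimiter` and `crawl`, and the idea of passing in your own `fetch` function, are assumptions made for the example.

```python
import time


class RateLimiter:
    """Make sure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, limiter):
    """Fetch each URL in turn, pausing between requests as needed."""
    pages = {}
    for url in urls:
        limiter.wait()
        pages[url] = fetch(url)  # `fetch` is whatever download function you use
    return pages
```

Something like `crawl(urls, my_fetch, RateLimiter(2.0))` would then space requests at least two seconds apart. The right interval depends on the site, and checking its robots.txt first is a good habit.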
So, you can see above that I've pulled the different showing data for each movie, and each showing is its own line in my new spreadsheet. That's handy.
You will also see that the movie "Décor" got mangled in the process. I fought with Unicode and encodings and decodings and Python 2 vs. Python 3, and I gave up. What's worse is that you don't see "Paco de Lucía: A Journey", because the URL for the movie is non-ASCII, which I guess is fine, except that it broke my stuff. So I skipped that movie altogether. Maybe that's a tip for webmasters who want to discourage lazy hackers: throw in some accented characters, and hope that Unicode is too much work to bother with.
I'm assured that Python 3 gets the Unicode stuff right, and if I were to start all over again right now, I might use Python 3, but how many of the libraries I depend upon currently support Python 3? (Some, I imagine. Probably not all.)
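For what it's worth, here is a rough Python 3 sketch of the two Unicode problems above, using only the standard library. The helper names `decode_page` and `ascii_url` and the example.com URL are made up for illustration; I'm also assuming the page is served as UTF-8, and I don't know that a wrong decoding is what actually mangled "Décor" in my spreadsheet.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def decode_page(raw_bytes, encoding="utf-8"):
    """Turn raw response bytes into text using an explicit encoding."""
    return raw_bytes.decode(encoding)


def ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    '%' is kept as a safe character so already-escaped URLs are not
    escaped twice. Non-ASCII host names are not handled here.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Decoding UTF-8 bytes with the wrong codec is one common way to get mangling:
raw = "Décor".encode("utf-8")
print(decode_page(raw))            # Décor
print(raw.decode("latin-1"))       # DÃ©cor

# A hypothetical URL with an accented character in the path:
print(ascii_url("http://example.com/films/Paco-de-Lucía"))
# http://example.com/films/Paco-de-Luc%C3%ADa
```

The point is only that bytes get decoded once, with a named encoding, and that URLs get escaped before they are requested.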
Also, while I'm here, I'll mention that I appreciate that the midnight showings are listed as 11:55pm showings. That's unambiguous and easy to understand.
Next up: Hm, I don't know - I was thinking of jamming all of this information into a Google Calendar, which could be pretty useful. A whole new set of APIs to wrestle with, which isn't entirely a bad thing. I've got the start time, duration, and location - all of which would make for useful values in a schedule.
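As a sketch of how start time, duration, and location could turn into a calendar entry: the end time is just the start plus the running time. The dictionary below is my understanding of the general shape of a Google Calendar event (`summary`, `location`, `start`, `end`), so treat the field layout as an assumption to check against the API documentation. The function name, film title, venue, and time zone default are all made up for the example.

```python
from datetime import datetime, timedelta


def showing_to_event(title, start, minutes, location, tz="America/Los_Angeles"):
    """Build a calendar-style event dict from one showing."""
    end = start + timedelta(minutes=minutes)  # handles running past midnight
    return {
        "summary": title,
        "location": location,
        "start": {"dateTime": start.isoformat(), "timeZone": tz},
        "end": {"dateTime": end.isoformat(), "timeZone": tz},
    }


# A hypothetical 90-minute film at an 11:55pm "midnight" showing:
event = showing_to_event(
    "Example Film", datetime(2015, 5, 15, 23, 55), 90, "Example Theater"
)
print(event["end"]["dateTime"])  # 2015-05-16T01:25:00
```

Actually pushing events to a calendar would need authentication and an API client, which this leaves out.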