Scraping a jazz events calendar

June 12, 2006 at 12:48 am

As mentioned in my last post building a live music calendar, I’m disappointed that the websites that list jazz events in Boston don’t offer the data as an RSS or iCal feed. One example of this is the WGBH Jazz Calendar, which has probably the most comprehensive listing of jazz events in the Boston area.

In my talk about Plone4Artists at EuroPython 2005, I mentioned a tool called Scrape ‘n’ Feed, which will scrape a website and generate an RSS feed. Well, it’s been a year since I first discovered this tool, and now I’m revisiting it to see if I can make it work. Here is my first foray into this scraping business.

ScrapeNFeed depends on Beautiful Soup and PyRSS2Gen which are easily installable on Ubuntu Linux with:

Once I installed these two packages, I downloaded the ScrapeNFeed.py script and created the following file ‘getwgbhfeeds.py’:

Run the script with ./getwgbhfeeds.py and it will output a file wgbh.xml , which is in the RSS 2.0 format. You can then open this file using your RSS reader of choice, and view all the Boston jazz events.

Once thing that I noticed is that some of the items in the list have an extra <br /> which means the title doesn’t get read in correctly. I’ll have to find a way to ignore the <br /> which I sure will be fairly simple with BeautifulSoup.

What’s next

At the OPMLCamp a few weeks ago, I met Mike Kowalchik, the creator of grazr. After seeing this tool, I immediately thought about how useful it would be for generating a browseable directory of event listings. You simply supply grazr with an OPML file, and it will then display all the RSS feeds and their entries. After I get a couple more event listing sites scraped, I’ll generate the OPML file and try them out with grazr.

Mike also mentions on his blog about Tom Morris’ idea about using grazr to ‘kill myspace’ by creating a better way for independent bands and artists to self promote using OPML. Note to self: follow up with Tom to discuss this idea further. I love the integrated MP3 player in his grazr box. Update: left him an Odeo message.

Technorati Tags: , , , , , ,

Building a live music calendar

June 11, 2006 at 3:04 pm

While reading from Derek Siver’s O’Reilly blog, I came across Mark Hedlund’s talk Entrepreneuring for Geeks which described how the more technically minded can move into making companies of our own. He started out the talk with a set of proverbs.

The three proverbs that struck close to home for me were:

  • pay attention to the idea that won’t leave you alone.
  • build what you know
  • momentum builds on itself

Pay attention to the idea that won’t leave you alone

Several events have occurred in the past two weeks which have echoed these words in my mind.

During the BarCampBoston I spoke with other geek entrepreneurs about the problem of finding live music, and the guys from tourb.us told me about how they are scraping venue’s sites to get concert listings. They are providing a service that answers a particular need – when is my favorite band coming to town?

This triggered a memory of an exchange I had more than a year ago with trombonist Phil Wilson at the Jazz Journalists Association panel at Schullers Jazz Club. Jon Hammond organized a panel discussion on the topic of Boston as a Launching Pad for a Jazz Career. I asked the panel what kind of online tools or services could be provided to re-ignite the jazz scene in Boston. And Phil said that he would like to see a service that would notify him when a musician was going to be performing.

Then at the last Python meetup, Dan Milstein raved about the python scraping library BeautifulSoup and described how capable it was at scraping baseball scores off a website. I played around with BeautifulSoup awhile ago, but never actually built anything using it.

Scratch an itch

“Build what you know” affirms that the most basic advice of idea generation is to scratch an itch you have yourself. Now I have an itch to scratch. I love going out to hear live music, especially jazz – but there is no single site that aggregates the concert listings. There are several sites I must visit:

  • MyRootdown Improv Music Calendar is a great site built by graphic designer and improv enthusiast Shawn dos Santo. Shawn is doing a great job of posting events he hears about, but there’s no way for people to post their own gigs
  • The WGBH Jazz Calendar is good but again, it doesn’t have an RSS/iCal feed so I have to manually visit the site everytime I want to see who’s playing.
  • Each and every venue has their own concert listing page (Scullers, Regattabar, Wallys, Berklee, Reel Bar, etc.) and of course, none of them have RSS or iCal feeds.
  • I’m sure there are others that I don’t know about

The basic problem here is that there is a fragmentation of information. Since none of the sites publish their event listings in any sort of structured way (RSS, iCal, hCalendar), it’s tedious to monitor these listings and thus hard to stay on top of what’s going on in the Boston jazz scene.

The “Pull” method

Immediately after hearing Phil’s suggestion, my technical mind started churning as I thought about generating dynamic RSS feeds based on artist or band name, and then using something like Feedblitz to turn those RSS feeds into email notifcations. As much as us geeks would like to think it’s true, the average person still has no idea what an RSS feed is or how to use it. Email is still the lowest common denominator.

But the question still remains how to get the data into a system in the first place. It is not likely one can expect musicians to enter their gig listings themselves. And here is where Beautiful Soup comes in – if I scrape the event listing sites, I can put the data into a system, extract the metadata (band, location, date/time, cost, etc.) and syndicate these concert listings as RSS feeds and subsequently email notifications.

There is even a python script called Scrape ‘n’ Feed which will automatically turn a page scraped with BeautifulSoup into an RSS feed. This is why I love python – there is almost always a library that does exactly what you want. And there is also a python script to convert iCal into RSS.

The “Push” method

Now suppose for a moment that one could get musicians to enter their gigs into some sort of system, and what if you could offer a service, let’s call it GigBlast, which would push their gig information out to a bunch of event listing services: WGBH, eventful.com, upcoming.org, boston.craigslist.org, meetup.com, etc. using the API provided by those services or in the case of WGBH which has no API, use python libraries such as clientform to submit the form.

This would make it easier for musicians to get the word out about their gigs, give fans a tool to be informed when these musicians are performing, and ultimately get more people to go out to hear music which would create more demand for live music. Maybe I’m an idealist to think that it will have such far reaching effects, but even if no one else uses this service, at least I’ll be scratching my itch!

Momentum builds

Stay tuned for more thoughts on publishing events to the web using Apple’s iCal. This will simplify the data entry process even more as musicians can simply add their event info to iCal, and in the background it’s it’s transparently uploaded to their website and automatically pushed out to the event listing services.

I also want to explore the use of microformats, such as hCalendar, which I think have a better chance of being adopted among musicians, venues and bloggers since it is fairly easy to implement – just a few changes to the HTML template. Pages formatted with hCalendar are a breeze to scrape using Technorati’s events feed service and can be searched using Technorati’s experimental Event Search tool.

Well, after many days of sideways rain, the sun has finally come out in Boston, so I’m going for a jog in the Fens.

Technorati Tags: , , , , , ,