Monday, 21 March 2011

Pyrkon 2011 Calendar



This code helped me automate most of the work. Last year I used gdata directly, but this year I got so frustrated that I decided I needed something simpler. The tradeoff is that the events are not described in as much detail as they used to be. This year I'm not making the mistake of throwing the code away, so here it is. There are basically three steps:
  1. Scrape the website to gather information
  2. Build a shell command that calls googlecl, the Google command-line interface, which gives some access to Google services from the command line.
  3. Run the command in the terminal. Keep retrying if it failed for any reason.
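Steps 2 and 3 can be sketched roughly as follows. This is an illustration in Python 3, not the script from this post (which is Python 2 and appears further down). The helper names `build_command` and `run_with_retries`, the retry cap, and the injectable `run` parameter are all assumptions made for the example. Passing the command as an argument list instead of a shell string avoids having to quote titles by hand.

```python
import subprocess


def build_command(calendar, title, lecturer, day, hour, minute, duration_min):
    """Build the googlecl call as an argument list (no shell quoting needed)."""
    # Same "Quick Add" sentence that the post's script assembles.
    quick_add = '%s - %s on %d/03/2011 %d:%02d for %d minutes in %s' % (
        title, lecturer, day, hour, minute, duration_min, calendar)
    return ['google', 'calendar', 'add', '--cal=' + calendar, quick_add]


def run_with_retries(command, max_attempts=5, run=subprocess.call):
    """Run `command` until it exits with 0; give up after max_attempts.

    Returns the number of attempts that were needed.
    `run` is injectable so the loop can be tried without googlecl installed.
    """
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
    raise RuntimeError('command failed %d times: %r' % (max_attempts, command))
```

Unlike the loop in the script, this variant stops after a fixed number of failures, which is a design choice for the sketch: an endless retry is convenient for flaky redirects but never returns if the command is simply wrong.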
There were some issues:
  • Duplicate entries: some events from the "Naukowa 2" calendar also ended up in "Naukowa", while still showing up in their proper place.
  • Sometimes the start times were off by an hour or two: for example, 1 hour on Friday, 1 hour on Saturday, 2 hours on Sunday. WTF? I noticed it happened on calendars that have numbers in their names, but most of them do, so I might just be imagining it.
  • Some problems with encoding. I could never understand when I need to decode/encode a string.
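One possible explanation for the 1 h / 1 h / 2 h pattern (a guess, not something verified against the calendars in question): if the times were taken as UTC and then displayed in Polish local time, the shift would equal Poland's UTC offset. That offset is +1 h in winter and +2 h once summer time starts, which under the EU rule happens at 01:00 UTC on the last Sunday of March. In 2011 that Sunday was 27 March, the last day of the convention. The Python 3 sketch below computes the offset from that rule by hand; `last_sunday` and `warsaw_utc_offset` are hypothetical helper names.

```python
import calendar
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Return the day of the month of the last Sunday in the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    for day in range(days_in_month, 0, -1):
        if calendar.weekday(year, month, day) == calendar.SUNDAY:
            return day


def warsaw_utc_offset(utc_moment):
    """UTC offset of Polish local time under the EU summer time rule:
    summer time runs from 01:00 UTC on the last Sunday of March
    to 01:00 UTC on the last Sunday of October."""
    year = utc_moment.year
    start = datetime(year, 3, last_sunday(year, 3), 1, 0)
    end = datetime(year, 10, last_sunday(year, 10), 1, 0)
    return timedelta(hours=2 if start <= utc_moment < end else 1)


for day in (25, 26, 27):
    # 10:00 UTC on each day of the convention
    print(day, warsaw_utc_offset(datetime(2011, 3, day, 10, 0)))
```

The offset comes out as one hour for the 25th and 26th and two hours for the 27th, which matches the pattern described in the list. If that is the cause, the calendar names are a coincidence.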
The script (Python 2):

# -*- coding:utf8 -*-

from BeautifulSoup import BeautifulSoup
import urllib2
import re
import subprocess

# Download the page
# You may want to save the page in the browser and use a local copy
# for example: 'file:///home/daniel/Pobrane/pyrkon.html'
page = urllib2.urlopen('http://www.pyrkon.pl/2011/index.php?go2=program')
soup = BeautifulSoup(page)

# Find div with the content
content = soup.find('div', id='content')
# Get all the divs nested inside it (findAll searches all descendants)
divs = content.findAll('div')

# Set starting index in case you wanted to start in the middle after some interruption
start_from = 0
i = 0
l = len(divs) - start_from

for div in divs[start_from:]:
 # Name and lecturer are easy
 tytul = div.contents[1].b.string
 prowadzacy = div.contents[1].i.string
 # I can never understand when I need to decode/encode from/to utf-8.
 # This was done by trial and error.
 # Madafaking new lines are contents too, so
 #  div.contents[2] == u'\n'
 # Place
 miejsce = re.search('^<b>miejsce: </b>(?P<miejsce>.+?)<br />', div.contents[3].renderContents(), re.M).group('miejsce')
 miejsce = miejsce.decode('utf-8')

 # Show some progress information
 i += 1
 print '[%d/%d] %s: %s' % (i, l, miejsce, tytul)

 # Event start time
 czas = re.search('^<b>termin: </b> (?P<dzien>pią|sob|nd)(\s*)(?P<godzina>\d{2}):(?P<minuta>\d{2})', div.contents[3].renderContents(), re.M)
 dzien, godzina, minuta = czas.group('dzien', 'godzina', 'minuta')
 godzina = int(godzina)
 minuta = int(minuta)
 # Conversion from name of the day to number of the day
 if dzien == 'pią':
  dzien = 25
 elif dzien == 'sob':
  dzien = 26
 elif dzien == 'nd':
  dzien = 27
 else:
  # 'Błędny dzień' means 'Invalid day'
  raise ValueError('Błędny dzień')
 # How long it lasts
 dlugosc = re.search('^<b>czas trwania: </b>(?P<godzin>\d+):(?P<minut>\d{2}) h<br />', div.contents[3].renderContents(), re.M)
 godzin, minut = dlugosc.group('godzin', 'minut')
 godzin = int(godzin)
 minut = int(minut)
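 # Build the shell command for googlecl, the Google command-line interface (available at code.google.com).
 # The quoted argument uses the "Quick Add" syntax.
 # Caveat: a title or name containing a single quote would break the shell quoting.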
 polecenie = '''google calendar add --cal='%s' '%s - %s on %d/03/2011 %d:%02d for %d minutes in %s' ''' % (miejsce, tytul, prowadzacy, dzien, godzina, minuta, godzin * 60 + minut, miejsce)

 # Keep calling shell command until it succeeds
 # Sometimes it throws gdata.service.RequestError with status 302 and reason 'Redirect received, but redirects_remaining <= 0'
 return_code = 1
 while return_code != 0:
  print polecenie
  return_code = subprocess.call(polecenie, shell=True)
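
On the decode/encode trial and error mentioned in the comments above: a common rule of thumb is to decode bytes into text as soon as they enter the program, work with text everywhere in between, and encode only when writing out. The sketch below shows that rule in Python 3, where the two types are `bytes` and `str` (in the Python 2 script above they are `str` and `unicode`). The HTML fragment and the venue name in it are made up for illustration.

```python
import re

# Bytes as they might arrive from the network (made-up sample, UTF-8 encoded).
raw = '<b>miejsce: </b>Sala Główna<br />'.encode('utf-8')

# 1. Decode once, at the input boundary: bytes -> text.
html = raw.decode('utf-8')

# 2. Work with text only: regexes and formatting now operate on characters.
match = re.search(r'<b>miejsce: </b>(?P<miejsce>.+?)<br />', html)
place = match.group('miejsce')
message = 'Place: %s' % place

# 3. Encode once, at the output boundary: text -> bytes (file, socket, pipe).
out = message.encode('utf-8')

# Characters and bytes are counted differently for non-ASCII text.
print(len(place), len(place.encode('utf-8')))
print(out)
```

Mixing the two types in the middle of the program (for example, formatting decoded text together with still-encoded bytes) is what typically triggers the confusing errors, so keeping the conversions at the edges tends to remove the guesswork.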
