Categories: Programming
Tags: html, python
Background
Should you ever need to find postal codes (or ZIP codes, as Americans call them) for Finnish cities and/or street addresses, the Finnish postal service (Posti) offers a manual search interface.
On quick inspection I was unable to find a Posti-provided English user interface (UI) for the search, but it seems someone has implemented the same functionality elsewhere.
Posti’s search UI is very useful for making a small number of searches now and then, but if you need to run a larger number of searches, it gets pretty tedious pretty darn quick.
In search of an API
(In case you’re thinking “What the heck is an API?”, it stands for Application Programming Interface. APIs can take varied forms, but the main idea is that they are an alternative interface for using some system. Using an API takes more skill, as it’s not point-n-click like a browser interface, for example. On the flip side, you gain possibilities not easily achievable in point-n-click UIs, such as easier automation.)
So roughly a month ago I needed to get postal codes for a low four-digit number of Finnish cities/communities (including duplicates). I was fine with simply getting the postal code for the city center.
The obvious approach was to embark on a mission to find an API to help me get this done. It didn’t take more than one zip of coffee and the same number of Google searches to find {API:Suomi}’s entry describing an API for Finnish postal code searches created by a company called Flo Apps.
“The API seems to be able to provide data in both XML and JSON, cool, seems all good!”, I thought after a quick skim.
The thing that put me off using the API, though, was the requirement to register by email; or actually not so much that as the statement that followed: “Tunnuksia aktivoidaan noin kerran kuussa.” In case you don’t read Finnish, it translates to: “New accounts will be activated approximately once a month.”
Unfortunately, I didn’t have a month to wait. It was a weekend and I wanted to be done with the whole thing and postal-codes-in-hand within a few hours if possible.
Other options?
Posti’s address data dump files
While writing this article I found another approach, one I didn’t use a month ago. It turns out Posti does offer Finnish address data in machine-readable format! It appears this data has been publicly available only since the beginning of 2015. Cool :) See also the Terms of Service and FAQ.
The approach I opted for
As I didn’t come across the aforementioned address data back then, I resorted to writing a Python script that simply queries the Postinumerohaku browser UI and scrapes data from the result pages' HTML output.
Let’s get coding
Examining HTML
Having chosen my approach, I whipped out my code editor and started looking at the HTML markup on a random address search’s result page¹. The essential part of the HTML looks like this (after a bit of cleaning up):
[HTML listing not preserved in this copy]
The HTML doesn’t contain much in the way of useful class or id names. There is the hidden-xs class on the table tag, as seen on line 2, which does help later in identifying the correct portion of HTML in our Python code.
As you can see, the result (postal code) we want to be able to extract is located on line 15 above. (It’s obviously not line 15 of the actual output though as the above is only a segment thereof. Ctrl+F is your friend :)
Let’s give our pet Python some HTML to eat
I’ll just present the code first and then explain parts of it.
Please notice that there is one small deliberate error in the code (between lines 11-29) so you can’t abuse Posti’s servers without first correcting the error and hence knowing at least partially what you’re doing:
[Code listing not preserved in this copy]
So I saved this file as postinumerot.py, which is Finnish for postalcodes.py.
A bit of explaining
On lines 4-7 we first import some libraries we’ll be needing:
import sys
from bs4 import BeautifulSoup
import requests as req
from urllib2 import quote
We’ll use sys.argv for accessing command-line parameters, BeautifulSoup is an HTML parser, Requests is an HTTP library for human beings :D and from urllib2 we’ll be using the quote function for some URL escaping.
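If you are on Python 3, a small sketch of what quote does may help. Note the assumption: the import below uses urllib.parse, which is where quote lives in Python 3, instead of the urllib2 module used in this article’s Python 2 code.

```python
# Python 3 sketch: quote is imported from urllib.parse here, not urllib2.
from urllib.parse import quote

# Spaces and non-ASCII characters are percent-encoded (as UTF-8 by default).
street = quote("Julkulanniementie 2")
city = quote("Jyväskylä")

print(street)  # Julkulanniementie%202
print(city)    # Jyv%C3%A4skyl%C3%A4
```

The escaped values are safe to place in a URL query string, which is exactly what the script needs them for.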
Line 9 defines the URL that we’ll be sending our queries to. The %s symbols are placeholders for character sequences (also called strings, hence the s) to be inserted into them upon query execution:
API_URL = 'http://www.verkkoposti.com/e3/postinumeroluettelo?po_commune_radio=zip&streetname=%s&po_commune=%s&zipcode='
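To make the placeholder mechanics concrete, here is a minimal sketch of filling the two %s slots. The helper name build_url is hypothetical (it is not part of the original script) and the import assumes Python 3. The sketch only builds a string; it sends nothing to Posti.

```python
from urllib.parse import quote

API_URL = ('http://www.verkkoposti.com/e3/postinumeroluettelo'
           '?po_commune_radio=zip&streetname=%s&po_commune=%s&zipcode=')


def build_url(street, community):
    """Hypothetical helper: escape both values and insert them into API_URL."""
    # The first %s receives the street, the second one the community.
    return API_URL % (quote(street), quote(community))


print(build_url('Julkulanniementie 2', 'Kuopio'))
print(build_url('', 'Kuopio'))  # an empty street simply leaves streetname= blank
```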
Lines 11-29 define a function called get_zipcode that takes street and community strings as parameters. The first parameter can be an empty string:
[Code listing not preserved in this copy]
Someone might argue that a better order of parameters would’ve been community followed by street and they would be correct. I’m presenting the code “raw/unpolished” :)
Anyway, on line 5 of the above listing is an important part: that’s where we actually make a request to the Posti servers. On line 6 we use the BeautifulSoup HTML parser to create an object-based representation of the textual HTML data we obtained using the call on line 5. On line 8 we’re referring to HTTP status codes, in which 200 indicates the preceding request succeeded.
Lines 12 and 15 traverse the object tree derived from raw HTML and find the data we want, the postal code that is.
Please notice that the essential-from-our-point-of-view part of the result page markup received from Posti’s servers differs significantly depending on whether you search on community/city name only or if you search based on both street address and city/community name. Try address and city and city only searches to see for yourself.
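The idea of walking the parsed HTML to reach the postal code can be sketched offline. This is a rough illustration and not the script’s actual code: the sample markup is invented (only the hidden-xs class comes from the real page), the names ZipcodeFinder and extract_zipcode are hypothetical, and Python 3’s standard-library HTMLParser stands in for BeautifulSoup so that the sketch runs without third-party packages.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified markup: NOT Posti's real result page.
SAMPLE_HTML = """
<table class="visible-xs"><tr><td>99999</td></tr></table>
<table class="hidden-xs table">
  <tr><th>Street</th><th>Postal code</th></tr>
  <tr><td>Julkulanniementie</td><td>70260</td></tr>
</table>
"""


class ZipcodeFinder(HTMLParser):
    """Collect the text of <td> cells inside a table with class hidden-xs."""

    def __init__(self):
        super().__init__()
        self.in_table = False  # True while inside the table we care about
        self.in_cell = False   # True while inside a <td> of that table
        self.cells = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'table' and 'hidden-xs' in classes:
            self.in_table = True
        elif tag == 'td' and self.in_table:
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_table = False
        elif tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_zipcode(html):
    """Return the first five-digit cell in the hidden-xs table, or None."""
    finder = ZipcodeFinder()
    finder.feed(html)
    for cell in finder.cells:
        if re.fullmatch(r'\d{5}', cell.strip()):
            return cell.strip()
    return None


print(extract_zipcode(SAMPLE_HTML))  # 70260
```

Since the real markup differs between address searches and city-only searches, a robust version would need to be checked against both kinds of result page.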
Done, mostly
That’s most of the work done right there. The lines at the end define what happens when the file is run from the command line:
# MAIN
if __name__ == '__main__':
    if len(sys.argv) == 3:
        street = sys.argv[1]
        community = sys.argv[2]
    elif len(sys.argv) == 2:
        street = ''
        community = sys.argv[1]
    else:
        print 'Usage: python %s [<street>] <community/city>' % sys.argv[0]
        exit(0)
    try:
        print get_zipcode(street, community)
    except Exception, e:
        raise e
So basically we can now get postal code data for an address as follows on the command line:
python postinumerot.py "Julkulanniementie 2" Kuopio
Which should simply print: 70260
You can also search only for the postal code of a city/community, for example:
python postinumerot.py Jyväskylä
which will print the postal code of the city center, in this case: 40100
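For Python 3 users, the command-line handling from this section could look roughly like the sketch below. The function names parse_args and main are my own for illustration, and returning a non-zero status on wrong usage is a design choice that differs from the exit(0) in the original script. The call to get_zipcode is left as a comment so the sketch stays offline.

```python
import sys


def parse_args(argv):
    """Return (street, community) from an argv-style list, or None if invalid."""
    if len(argv) == 3:
        return argv[1], argv[2]
    if len(argv) == 2:
        return '', argv[1]
    return None


def main(argv):
    args = parse_args(argv)
    if args is None:
        print('Usage: python %s [<street>] <community/city>' % argv[0])
        return 1  # a non-zero status signals incorrect usage
    street, community = args
    # The real script would call get_zipcode(street, community) here.
    print('street=%r community=%r' % (street, community))
    return 0


print(parse_args(['postinumerot.py', 'Julkulanniementie 2', 'Kuopio']))
print(parse_args(['postinumerot.py', 'Jyväskylä']))
```

In a real script you would finish with sys.exit(main(sys.argv)) under the usual __main__ guard.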
Making the script executable
To make your script executable (on Linux/Mac OS X), and hence avoid having to type python at the beginning of the command every time, you can run this command (this assumes the file starts with a shebang line such as #!/usr/bin/env python):
chmod u+x postinumerot.py
Now you could run the command in a slightly shorter form, namely:
./postinumerot.py "Julkulanniementie 2" Kuopio
What’s next?
This code will only run a single query per call. The next step would be to take whatever data source you have (for example a CSV file) and run a query for each of its records. That’s going to be the topic of the next article though.
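As a small teaser, here is a rough Python 3 sketch of that loop. Everything in it is an assumption on my part: the CSV column names street and community, the helper name lookup_all, the pause between lookups, and the fake_lookup stand-in that returns canned answers from this article instead of querying Posti.

```python
import csv
import io
import time


def lookup_all(csv_file, lookup, delay=1.0):
    """Return a list of (street, community, zipcode) tuples, one per CSV row."""
    cache = {}  # remembers answers so duplicate rows cost no extra query
    results = []
    for row in csv.DictReader(csv_file):
        key = (row['street'], row['community'])
        if key not in cache:
            if cache:
                time.sleep(delay)  # pause between consecutive lookups
            cache[key] = lookup(*key)
        results.append((key[0], key[1], cache[key]))
    return results


# Canned answers taken from the examples in this article.
CANNED = {
    ('Julkulanniementie 2', 'Kuopio'): '70260',
    ('', 'Jyväskylä'): '40100',
}
calls = []


def fake_lookup(street, community):
    """Stand-in for get_zipcode, so that nothing is sent over the network."""
    calls.append((street, community))
    return CANNED.get((street, community))


SAMPLE_CSV = 'street,community\nJulkulanniementie 2,Kuopio\n,Jyväskylä\n,Jyväskylä\n'

results = lookup_all(io.StringIO(SAMPLE_CSV), fake_lookup, delay=0)
for street, community, zipcode in results:
    print(street, community, zipcode)
```

The cache matters when the data contains duplicates, as mine did: each distinct street/community pair is looked up only once.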
Disclaimer
Code in this article is provided the way it ended up when I was done with it and succeeded in getting the data I needed.
There is a deliberate error in the code presented to prevent the reader from running it without understanding what they’re doing. This is to prevent the reader from possibly getting into trouble with Posti. I won’t be putting the code on GitHub or another code hosting platform before making sure it’s ok with Posti.
You’re using the code provided here at your own risk. This article is provided for educational purposes. Respect Posti’s terms of service.
Footnotes
- 1 This is the address where I grew up and lived between the ages of 6 and 20.