Survey Mapping
Published 14 years, 8 months past
An anonymized copy of the data collected in the 2008 Survey has been turned over to some professional statisticians, as we did last year, and we’re waiting to hear back from them before moving into writing the full report. But there’s no reason we can’t have a little fun while we wait, right?
So, calling all mapping ninjas: here’s a 136KB zip archive containing two tab-separated text files listing the countries and postcodes supplied by takers of the survey. Before anyone has a privacy-related aneurysm, though, let me explain how they’re structured.
One of the two files is sorted alphabetically by country, with the postcodes as the second “column of data” (it’s country name, tab, postcode). The second is the reverse: it’s sorted alphabetically by postcode, with the country names following each postcode. This sorting should break any association they might have with the released data set, given that we won’t be including the postcodes in the released set. (More on that in a moment.)
A word of warning: though I cleaned out some of the more obvious cases of people heaping abuse on us for even daring to ask the question, I can’t guarantee that the data set is perfectly clean. There may be drops of bile here and there along with the usual collection of mistyped postcodes. I know there’s at least one bit of obvious humor that I chose to leave in, so enjoy that when you find it.
We have two reasons to release this data this way at this point. The first is to see what people do with it—heatmaps, perhaps, or one of those proportion-distortion maps, or a list of top-ten global postcodes or cities (or both). Hey, go crazy! I’d love to see a number of Google Maps/Yahoo! Maps/OpenMap/whatever mashups with this data. That would be awesome.
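For the top-ten idea, here is a minimal Python sketch, assuming the files really are plain "country, tab, postcode" lines as described above. The function name and the sample rows are made up for illustration and are not taken from the archive.

```python
import csv
import io
from collections import Counter


def top_postcodes(tsv_text, n=10):
    """Count (country, postcode) pairs in tab-separated text and
    return the n most common as ((country, postcode), count) tuples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        country, postcode = row[0].strip(), row[1].strip()
        if postcode:
            counts[(country, postcode)] += 1
    return counts.most_common(n)


# Illustrative rows only; the real file would be read from disk.
sample = "United States\t44118\nUnited States\t44118\nCanada\tV6B 1A1\n"
print(top_postcodes(sample, n=2))
```

Counting on the (country, postcode) pair rather than the postcode alone keeps identical codes from different countries apart.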
The second reason is to ask for help with an API challenge. Like I said, we’re not including the postcodes in the released data set. What I would like to do instead is translate the postcodes into administrative regions (states, provinces, etc.) and put those in the data set. That way, we can include things like “Ohio” and “British Columbia” and “Oaxaca”, thus providing slightly better geographic granularity, which was an area of weakness in the 2007 survey.
Thanks to reading a couple of articles, I know how to do this for a single postcode. But how does one do it for 26,457 postcode-and-country combinations without having to submit every single postcode as a separate request? I’ve yet to see an explanation, and maybe there isn’t one, but I’d like to know either way. And if someone does come up with a way, please show the work instead of just spitting out the result! I’m hoping to learn a few things from the solution, but I obviously can’t do that without seeing the code.
One note: in cases where a postcode isn’t recognized or some kind of an error is returned, I’d like to have a little dash or “ERR” or something put in the result file. That way we can get a handle on what percentage of the responses were resolvable. Thanks.
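To make the error-marker idea concrete, here is a rough Python sketch. It does not solve the batching question. It only shows how failed lookups could be written as “ERR” and how the resolvable percentage could be computed afterwards. `fake_lookup` is a hypothetical stand-in for whatever geocoding call ends up being used, and the sample rows are illustrative.

```python
def resolve_with_marker(pairs, lookup, marker="ERR"):
    """Return (country, postcode, region) rows, writing `marker`
    whenever the lookup fails or returns nothing."""
    rows = []
    for country, postcode in pairs:
        try:
            region = lookup(country, postcode)
        except Exception:
            region = None  # any lookup error counts as unresolved
        rows.append((country, postcode, region if region else marker))
    return rows


def resolved_fraction(rows, marker="ERR"):
    """Fraction of rows whose region is not the error marker."""
    if not rows:
        return 0.0
    ok = sum(1 for _, _, region in rows if region != marker)
    return ok / len(rows)


# Hypothetical stand-in for a real geocoding request.
def fake_lookup(country, postcode):
    known = {("United States", "44118"): "Ohio"}
    return known[(country, postcode)]  # raises KeyError when unknown


pairs = [("United States", "44118"),
         ("United States", "44118"),
         ("Netherlands, Kingdom of the", "secret")]
results = resolve_with_marker(pairs, fake_lookup)
print(results)
print(resolved_fraction(results))
```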
Anyway, map and enjoy!
You might need to build a lookup table and/or some sort of heuristic/mapping ruleset. Many countries will allow you to break down the region by a partial piece of the code… EG.
* In Australia, the first digit will correspond to the state
* In the UK, the first 1 (or 2) alphabetic characters will give you regions; you can take those as-is or combine them as per http://en.wikipedia.org/wiki/Mailsort#Residue_selection_codes
* Not sure how USPS works, but the first digit or two seems largely consistent across states from the small amount of info I’ve seen
* Some countries don’t have postcodes at all, or they do have them but they are not in common usage (eg. New Zealand) because they’ve only just been introduced, so data broken out on that basis may be unreliable.
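As a rough illustration of the prefix heuristics in the list above, here is a small Python sketch for the Australian and UK cases. The Australian table is a simplification that I am adding as an assumption: as far as I know, ACT postcodes sit inside the 2xxx range and would be labelled NSW here, and the 0xxx codes are left out entirely. The function names are made up.

```python
import re

# Simplified assumption: first digit -> state. ACT codes inside the
# 2xxx range are not separated out, and 0xxx codes are not covered.
AU_STATE_BY_FIRST_DIGIT = {
    "2": "NSW", "3": "VIC", "4": "QLD",
    "5": "SA", "6": "WA", "7": "TAS",
}

# A UK postcode starts with one or two letters (the postcode area)
# followed by a digit.
UK_AREA = re.compile(r"^([A-Z]{1,2})\d")


def au_state(postcode):
    """Guess the Australian state from the first digit, or None."""
    code = postcode.strip()
    if len(code) == 4 and code.isdigit():
        return AU_STATE_BY_FIRST_DIGIT.get(code[0])
    return None


def uk_area(postcode):
    """Return the leading letters of a UK postcode, or None."""
    match = UK_AREA.match(postcode.strip().upper())
    return match.group(1) if match else None


print(au_state("3000"), uk_area("TS1 1AA"), uk_area("secret"))
```

Both functions return `None` for anything that does not look like a postcode, so the junk entries in the files fall through instead of being guessed at.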
I extracted some US zip code information from the Wikipedia list at http://en.wikipedia.org/wiki/ZIP_Code_prefixes and put it into a tab-delimited text file that gives the locations of zip codes based on the leading 3 digits. I may have time later to see about lists for other countries, and about automating the process of matching this data up with the ALA list.
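A three-digit prefix file like that could be used along these lines. This is only a sketch: it assumes a two-column layout of prefix, tab, state, which may not match the actual file, and the three sample rows are just illustrative entries.

```python
import csv
import io


def load_prefix_table(tsv_text):
    """Build a {three-digit prefix: state} dict from tab-separated text."""
    table = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            table[row[0].strip()] = row[1].strip()
    return table


def state_for_zip(zip_code, table):
    """Look up a state by the leading three digits of a five-digit ZIP.
    Assumes leading zeros are already in place; returns None otherwise."""
    digits = zip_code.strip()[:5]  # drops any ZIP+4 suffix
    if len(digits) != 5 or not digits.isdigit():
        return None
    return table.get(digits[:3])


# Assumed layout: prefix<TAB>state. Sample rows for illustration.
prefixes = load_prefix_table("441\tOH\n100\tNY\n021\tMA\n")
print(state_for_zip("44118", prefixes))
print(state_for_zip("02138-1234", prefixes))
print(state_for_zip("secret", prefixes))
```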
I looked into this a while back for a project I was working on, and it turns out that postal code data is surprisingly costly, even u.s. zip codes. There’s a bunch of companies selling what should be public domain data for thousands of dollars. I did find one site selling it for about 5 bucks though, so that might be worth a look.
On an only vaguely survey-related note, I’m the guy from the Cleveland Web Standards group who was leaving CLE for Google. My wife and I are having a baby, and we’ve opened a vote on the web for the baby’s name. Drop on by and vote!
You might want to remove the last entry sorted by country for
Germany, Federal Republic of
I’m sure that’s not nice. And next year you should definitely add a very clear “choose not to say” option; a fair number of people are x-ing the field out, putting in ‘?’s, or something similar.
Oh, and look at the end of Netherlands, Kingdom of the
Netherlands, Kingdom of the not relevant
Netherlands, Kingdom of the private?
Netherlands, Kingdom of the secret
Here’s a file for UK postcodes. I used the link from Sam as well as a couple of other Wikipedia pages to put this list together. Somebody may want to check my accuracy, though, since I don’t know anything about dividing up regions in England.
You could process some of the data with the 1999 U.S. Postal Service ZIP Codes and Federal Information Processing Standards (FIPS) Codes data from the US Census Bureau. You can import the dBase file into MS Access, and then export as a tab delimited text file. The result includes zip code, latitude, longitude, state code, county code, and a few other codes. The state and county codes match the FIPS codes, which will then give you the full state and county.
@Brian – I wonder whether the 1999 file is as up to date as the Wikipedia page. Could be the Wikipedia is based on the ’99 document…
If you’re just looking to get the zip codes linked with a state, I did that with a PHP function in TextMate. That accounts for over half of the records in the file:
Here’s the document: Zip code file (U.S. only) with states listed
and the code: TextMate custom command
Eric – I think that somewhere along the line a lot of these US zip codes got their leading zeros stripped out. I added them back in for this file.
It looks to me like it will take some attention to each specific country to get the geographic breakdowns for the different postal codes. I’ve gathered postal data on about half of the countries, but it seems to me that getting specific geographic data on a lot of countries will be a lot of work with very little return (i.e. I’m not sure it will make much difference to users whether we break France down into its 100+ geographical regions or not). It might be most useful to break down the largest geographic areas (Canada, Australia, Russia, etc.) and just list the others as countries.
Anyway, here’s the U.S. portion of the file. I’ll see if I have time to do other countries and post them here as well.
Here’s the UK portion of the list. I produced this one similar to the last, with a simple PHP function specific to the format of the postal code.
Postcode file (UK only) with region and city listed
This list includes the regions as well as the cities where that postal district is based.
Here’s a very simple mapping of the results, using Google Charts:
Sorry for so many posts; this is a really interesting problem to me. As for getting longitude and latitude data for each zip, I haven’t been able to find a way to do it without separate http requests either. The interface for batchgeocode.com was convenient for this dataset, but judging from the slowness of processing, they’re doing separate requests as well. The thing I do like about that site is the map view, so that you can easily see whether there have been errors in the processing.
Given these constraints, I might suggest a hybrid approach to the mapping: One thing we can definitely get is the US data, with state only if necessary, but also with geospatial coordinates if we use the 1999 data that Sam referred to. That accounts for over half of the records. We can also get that type of data for other countries (UK, Canada, Australia) that account for most of the responses. If we use local data for these, and then supplement with requested data from Yahoo or Google for the other countries, we should be able to get a pretty complete set of coordinates. Also, for our http lookups, if we send requests for only the unique records that we have (eliminating all duplicates) that should further reduce the number of requests we have to make.
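The hybrid idea might look something like this in Python. It is a sketch under assumptions, not a finished tool: `local_tables` maps a country name to a local lookup function, and `fake_remote` is a hypothetical stand-in for a single Yahoo or Google request. The sample data is illustrative.

```python
def hybrid_resolve(pairs, local_tables, remote_lookup):
    """Resolve each unique (country, postcode) pair once: try the local
    table for that country first, then fall back to remote_lookup."""
    resolved = {}
    for country, postcode in set(pairs):  # duplicates collapse here
        local = local_tables.get(country)
        region = local(postcode) if local else None
        if region is None:
            region = remote_lookup(country, postcode)
        resolved[(country, postcode)] = region
    return resolved


# Hypothetical local data: a tiny US prefix table.
us_prefixes = {"441": "OH"}
local_tables = {"United States": lambda z: us_prefixes.get(z[:3])}

# Hypothetical remote call; records each request so they can be counted.
remote_calls = []


def fake_remote(country, postcode):
    remote_calls.append((country, postcode))
    return None  # a real geocoder would return a region here


pairs = [("United States", "44118"),
         ("United States", "44118"),
         ("France", "75001")]
regions = hybrid_resolve(pairs, local_tables, fake_remote)
print(regions)
print(len(remote_calls), "remote request(s)")
```

In this toy run only the French postcode reaches the remote lookup. The duplicate US record is collapsed and answered from the local table.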
Nice work, Stephen. I didn’t see your post when I wrote my last one. It looks like you’ve got it solved very nicely.
Here is an example using Google Spreadsheet and Maps. Includes files.
While the UK data from Nate is really interesting (I have been reading through it for about an hour) there are a couple of errors I noticed.
1) Plymouth is in Devon and not Cornwall. Thus Devon is underrepresented and Cornwall over-represented in his data.
2) This one is more of a nitpick but may be important to anyone using the data. The TS postcodes are incorrectly labelled as Cleveland, Teesside, which no longer existed as of 1996. TS now refers to the Tees Valley region, with postcodes north of the River Tees being in County Durham and those south of the river being in North Yorkshire.
I think a really interesting mapping would be to correlate location against earnings, and also location against the data on how respondents felt about the business and the economy (the questions on job security and whether business was growing or declining).
Weekly Links #20 | GrantPalin.com
[…] Survey Mapping Eric Meyer provides an initial glance at the results of the 2008 Web Developer Survey. […]
Nate, no problem with multiple posts!
You’re probably right about US zip codes– I’ll go back into the original CSV export to see what’s there. I did my sorting in Excel, which is annoyingly aggressive about converting text numerics to numbers, complete with leading-zero stripping, so it’s easily possible that’s what happened.
Interesting that you found a batch geocoder that seems to be single-submit under the hood. I’m beginning to think that single-submit is the only option, and I’ll just have to test with a reduced dataset until I work out a program that can do the whole conversion for me.
scragar, thanks for the pointers. I may pull some of those but will probably leave most or all of them since they’re good examples of invalid data, and any translator from postcode to region will need to be able to handle invalid data.
Jon, that sort of data would indeed be quite interesting, but we have to be very careful about how much data we associate with these postcodes. It’s probably no big deal if you’re one of a thousand people in NYC, but it could be a big deal if you’re the only respondent from Rhyl or Ottawa or Des Moines or wherever. I hope to gain enough experience that I can create those kinds of maps (or the data needed to create said maps) without being too privacy-insensitive.
If anyone is interested in using the Zip Code data as-is, the Perl script available from my previous post reformats most of the US Zip Codes using the sprintf function. With only a few exceptions, this fixes leading zeros and shortens the ZIP+4 Codes:
$line = sprintf("%05d", $line);
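For anyone not working in Perl, a rough Python counterpart might look like this. It is my own sketch, not a translation of the script mentioned above, and it behaves differently on junk: it returns `None` for anything that is not one to five digits with an optional ZIP+4 suffix, whereas the sprintf one-liner would, as far as I know, turn non-numeric text into 00000.

```python
import re

# One to five digits, optionally followed by a -NNNN ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"\s*(\d{1,5})(?:-\d{4})?\s*$")


def normalize_zip(raw):
    """Restore stripped leading zeros and drop any ZIP+4 suffix.
    Returns None when the input does not look like a US ZIP code."""
    match = ZIP_PATTERN.match(raw)
    if not match:
        return None
    return match.group(1).zfill(5)


print(normalize_zip("2138"))
print(normalize_zip("44118-1234"))
print(normalize_zip("secret"))
```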
@Nate – Thanks for the extended UK postal code info, I’ve been searching around for that info for a little while now! Cheers :)
Just to let you know, the Australian Post Office provides a free list of all Aussie postcodes with their suburb and state at:
If you need geocoding information by postcode (as it appears you do):
The latter would be very easy to visualise through Google Maps, since you wouldn’t need any geocoding, just marker placement.
How unfortunate it is to see that fewer than 30,000 people have taken the time to answer the survey. It’s 10% less than last year. (I assume each file contains as many lines as there were answers, otherwise there would be no duplicates.) Eric, what do you think is the reason behind this?
Just FYI, the Dutch post code system goes like this:
– Post codes are always made up of 4 digits, an optional space, and 2 letters.
– You can therefore strip out the space and the post code will be quite alright.
– There’s never a leading zero
– The four numbers give you the city and neighbourhood, sometimes the street (depends on how big the city is and how long the street is)
– The letters give you the final bit about the street, what range the house number is in, and whether it’s even or odd.
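Those rules translate fairly directly into a pattern. Here is a minimal Python sketch, with a made-up function name, that accepts four digits with no leading zero, an optional space, and two letters, and returns the code with the space stripped:

```python
import re

# Four digits (no leading zero), optional space, two letters.
NL_POSTCODE = re.compile(r"^([1-9]\d{3})\s?([A-Za-z]{2})$")


def normalize_nl(raw):
    """Return a Dutch postcode as 'NNNNLL', or None if it does not
    match the format described above."""
    match = NL_POSTCODE.match(raw.strip())
    if not match:
        return None
    return match.group(1) + match.group(2).upper()


print(normalize_nl("1012 ab"))
print(normalize_nl("0123AB"))
print(normalize_nl("secret"))
```

This only checks the format. Whether a given code actually exists would still need a lookup against a real list.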
Getting the official list (in all kinds of formats) costs you thousands of euros if you go to TPG (the postal service in the Netherlands). There’s also a site that tries to build a free, open catalog (Dutch site). So far they’re close to being 72% complete.
Geonames.org has CC-BY postal codes for dozens of countries, US included:
A longer list of free open Geodata sources: Get.Theinfo Google Groups
When infochimps launches finding this stuff’ll be easy. /selfpimp