
Getting Geodata From Google's API

The apps I'm going to be analyzing are part of Dr. Charles Severance's MOOC on Python and Databases and work together according to the following structure (which applies both in this specific case and more generally to any application that creates and interprets a database using online data).
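
In outline, the whole pipeline looks like this (my own sketch of the structure):

where.data  ->  geoload.py  ->  geodata.sqlite  ->  geodump.py  ->  where.js  ->  where.html (map)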


The data source, in this case, is Google's Geocoding API.  The "package" has two components: geoload.py and geodump.py.  geoload.py reads a list of locations from a file -- addresses for which we would like geographical information -- requests information about them from Google, and stores the results in a database (geodata.sqlite).  geodump.py reads the JSON data back out of the database, parses it, and loads what we need into a JavaScript file.  The JavaScript is then used to create a web page on which the data is visualized as a series of points on the world map.  Dr. Severance's course focuses on Python, so I'm only going to work my way through the Python portion of the package.

Getting the Data: "Geoload.py"

We start by importing the libraries that allow us to retrieve information from the internet (urllib, http), store it in a database (sqlite3), and interpret what we've stored (json).  We also want to ignore SSL certificate errors (ssl) and pause between loads (time) -- so that the user can Control-C out of the application and stay under their API limits.  Honestly, I'm not sure what sys is doing here; as far as I can tell, it's never actually used in this script.

import urllib.request, urllib.parse, urllib.error
import http
import sqlite3
import json
import time
import ssl
import sys

Because Google limits API requests, Chuck put in the option of setting api_key to False -- in which case the app reads data from his own server rather than Google's.  But I got a key, so my run used the Google API.  I only needed it to look up one address near where I lived, which I added to the list of places to be searched (see below).

api_key = "MINDYOUROWNBUSINESS"

if api_key is False:
    serviceurl = "http://py4e-data.dr-chuck.net/geojson?"
else :
    serviceurl = "https://maps.googleapis.com/maps/api/place/textsearch/json?"

# Additional detail for urllib
# http.client.HTTPConnection.debuglevel = 1

Of course, we're going to put the data into a database, geodata.sqlite; we initialize a cursor to write to and read from the file.  We create the table "Locations" (if it doesn't already exist), with two columns -- the address we're looking for and the data associated with it.  The address is just a normal street address; the data is the block of JSON Google gives us.  Finally, we configure the ssl context to skip certificate verification.

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()

cur.execute('''
CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)''')

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

We create a handle (fh) to read our list of addresses (where.data).  This is nothing fancy -- just a list of places or addresses, one per line.  So we have things like University of Chicago or 6035 South Lima Way, Englewood, CO.  It's just a list of addresses you could put into a Google Maps search -- but I assume you won't get anything good if you enter an ambiguous address.  I made sure that the address I entered was only associated with one point on the map.
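
For concreteness, a where.data file might contain lines like these (illustrative -- I'm just reusing the examples above):

University of Chicago
6035 South Lima Way, Englewood, CO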

We're going to read through the address list, line by line, get the associated data, and store it in the database.  We only retrieve data for 200 lines in a single run, so we increment a count variable on each request and break out of the loop -- with a message telling the user to restart -- once the count gets too high.

fh = open("where.data")
count = 0
for line in fh:
    if count > 200 :
        print('Retrieved 200 locations, restart to retrieve more')
        break

We strip the whitespace from the line, store it as an address (which it is), print a blank line (for aesthetics), and look for the address in the database.  (The memoryview() call wraps the bytes produced by address.encode() -- which encodes the string as UTF-8 -- and, as far as I can tell, sqlite3 stores bytes-like objects like this as BLOBs rather than TEXT; that's why geodump.py has to decode() the data when it reads it back out.)

    address = line.strip()
    print('')
    cur.execute("SELECT geodata FROM Locations WHERE address= ?",
        (memoryview(address.encode()), ))
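
As a quick aside, here's roughly what that encode/memoryview combination does at the REPL (a sketch; the memory address will differ on your machine):

>>> address = 'University of Chicago'
>>> address.encode()                 # str -> UTF-8 bytes
b'University of Chicago'
>>> memoryview(address.encode())     # a bytes-like view; sqlite3 stores it as a BLOB
<memory at 0x7f...>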

We check to see whether we got anything from the database. If we did, we tell the user as much and go to the next address. If we didn't, we keep going -- now we're going to look up the address in Google and add the result to our database.

    try:
        data = cur.fetchone()[0]
        print("Found in database ",address)
        continue
    except:
        pass

We prepare a dictionary of the PARaMeterS we're going to pass to Google, in which we put the address we're looking for and (if we have one) the API key.  We then encode the information into our url, which is a combination of the base API url and the url-encoded dictionary.  We read data from the url and keep our user abreast of what we've done and what we've got.  Then we're done with that address, so we increment count (notice that we don't bother counting addresses that are already in the database, only addresses we had to request).  We print a little bit of the data we've received -- with any line breaks replaced by spaces.

    parms = dict()
    parms["query"] = address
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters', data[:20].replace('\n', ' '))
    count = count + 1
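
To make the encoding step concrete, urlencode turns the dictionary into a query string like this (the key is hypothetical):

>>> import urllib.parse
>>> urllib.parse.urlencode({'query': 'University of Chicago', 'key': 'abc123'})
'query=University+of+Chicago&key=abc123'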

We parse the data as JSON -- printing the raw data if parsing fails (a Unicode problem, say), so we can at least see what we got -- then we check js to make sure that it has the right status (if it doesn't, once again, we print out the data, then stop looping).  The app only stops running if something is seriously wrong: in my experience, even a bad address still gives us back data, just with a status of ZERO_RESULTS.  Another interesting point is that we're only checking that the JSON is well-formed and the status is acceptable.  This application, unlike geodump, stores what we read from the url as-is, without pulling out any particular fields.

    try:
        js = json.loads(data)
    except:
        print(data)  # We print in case unicode causes an error
        continue

    if 'status' not in js or (js['status'] != 'OK' and js['status'] != 'ZERO_RESULTS') :
        print('==== Failure To Retrieve ====')
        print(data)
        break
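
For reference, the slice of Google's response that both scripts care about looks roughly like this (abridged, with illustrative coordinates -- the real response has more fields):

{
  "results": [
    {
      "formatted_address": "University of Chicago, Chicago, IL, USA",
      "geometry": {
        "location": { "lat": 41.78, "lng": -87.59 }
      }
    }
  ],
  "status": "OK"
}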

Finally, we put the data associated with the address we looked up into our database, again encoding the strings to UTF-8 and wrapping them in memoryview(), and commit.  Every ten requests, we pause for five seconds -- the polite thing to do with an API.

    cur.execute('''INSERT INTO Locations (address, geodata)
            VALUES ( ?, ? )''', (memoryview(address.encode()), memoryview(data.encode()) ) )
    conn.commit()
    if count % 10 == 0 :
        print('Pausing for a bit...')
        time.sleep(5)

We're done curating the database.  If the user wants to start interpreting all that JSON, they will have to run the next app.

print("Run geodump.py to read the data from the database so you can vizualize it on a map.")

Processing The Data: "Geodump.py"

The next program in the package is actually pretty simple.  It just loads the JSON from the database of Google results and writes the bits we need into a JavaScript file, which has entries that look like this:

[40.8075355,-73.9625727, '116th St & Broadway, New York, NY 10027, United States']

That's a list with latitude, longitude (negative latitude meaning south of the equator, negative longitude meaning west of the prime meridian), and the full address.  The web page Chuck designed will then visualize the data.

So the first step is the imports.  We need sqlite3, json, and codecs -- the last to handle the UTF-8 encoding as we write to the JavaScript file.  We connect to the database and create our cursor.


import sqlite3
import json
import codecs

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()

We put everything in the "Locations" table (which is all our data) into the cursor.  Then we open the JavaScript file, where.js, in write mode using UTF-8 encoding, and write the beginning of a list, myData, into the file, followed by a newline.

cur.execute('SELECT * FROM Locations')
fhand = codecs.open('where.js', 'w', "utf-8")
fhand.write("myData = [\n")

We go through the cursor row by row and put the second column of each row (where our geodata lives) into data, decoding the BLOB back into a string.  We parse the data as JSON if we can -- or skip to the next row.  Assuming the status of the JSON is OK, we pull out the latitude and longitude it contains, as well as the formatted address.  We strip single quotes out of the address (so that something like Trader Joe's can't close the JavaScript string literal early) and then print everything out.  We keep a running count of the entries we've written, prefixing each entry after the first with a comma and a newline.  We concatenate the latitude, longitude, and location to create the JavaScript list entry.  But if there's a problem putting all of this into a string -- we just skip to the next row in the database.

count = 0
for row in cur :
    data = str(row[1].decode())
    try: js = json.loads(str(data))
    except: continue

    if not('status' in js and js['status'] == 'OK') : continue

    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    if lat == 0 or lng == 0 : continue
    where = js['results'][0]['formatted_address']
    where = where.replace("'", "")
    try :
        print(where, lat, lng)

        count = count + 1
        if count > 1 : fhand.write(",\n")
        output = "["+str(lat)+","+str(lng)+", '"+where+"']"
        fhand.write(output)
    except:
        continue

We're done.  All the locations have been entered into the JavaScript file, so we close the list and add a line break.  We close the cursor, we close the file, and we tell the user how many records were written.  Now the user can open the .html file and see the results visualized.

fhand.write("\n];\n")
cur.close()
fhand.close()
print(count, "records written to where.js")
print("Open where.html to view the data in a browser")
