python4oceanographers

Turning ripples into waves

NLTK, maps, and 007

The Natural Language Toolkit (NLTK) is one of those amazing pieces of software that surprise us just by the fact that they work. Just Google it and be amazed!

Recently I found the module geograpy, which extracts geographical information (countries, regions, and cities) from text using NLTK.

So why not make a post with something I always wanted to try (NLTK), with something I like (maps)? First let's take a look at how geograpy works.

In [3]:
import geograpy

text = "Paris is the city of love!"

places = geograpy.get_place_context(text=text)

places
Out[3]:
<geograpy.places.PlaceContext at 0x7f26bd1d53d0>

Cool! We get a PlaceContext object. Let's explore this object.

In [4]:
places.cities
Out[4]:
[u'Paris']
In [5]:
places.countries
Out[5]:
[u'France', u'United States', u'Canada']

Anyone reading the phrase above knows I meant the mud-flat city of Lutetia and not the other two cities named Paris, in the USA and in Canada. That is a limitation of automated text parsers: they cannot interpret context the way humans do. Still... that is pretty amazing!
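One toy way to disambiguate is to score candidate countries by how many of the cities mentioned in the text they contain. The sketch below is my own illustration of that idea with a made-up mini-gazetteer, not geograpy's actual API:

```python
from collections import Counter

# Hypothetical mini-gazetteer: city -> countries that have a city by that name.
GAZETTEER = {"Paris": ["France", "United States", "Canada"],
             "Lyon": ["France"],
             "Toronto": ["Canada"]}

def rank_countries(cities):
    """Score candidate countries by how many mentioned cities they contain."""
    scores = Counter()
    for city in cities:
        for country in GAZETTEER.get(city, []):
            scores[country] += 1
    return [country for country, _ in scores.most_common()]

# "Paris" alone is ambiguous, but "Lyon" in the same text tips it to France.
print(rank_countries(["Paris", "Lyon"]))  # France ranks first
```

A real resolver would also weigh population and distance between candidates, but even this crude co-occurrence count beats returning every match.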

This gets even more complicated when the city name is British*. You'll get hits from all over the Empire!

* It would probably be worse with Spanish names, if geograpy could handle languages other than English. Portuguese would be OK though: the Portuguese made a habit of naming things after local features, so there is no New Lisbon anywhere in Brazil.

In [6]:
city = "Victoria"
text = "How many countries with a city named {} can you find?".format(city)

places = geograpy.get_place_context(text=text)
countries = places.countries

print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
Found 12 countries for the city Victoria:
United States, Canada, Seychelles, United Kingdom, Malta, Romania, Malaysia, Mexico, Chile, Argentina, Trinidad and Tobago, Panama

OK. I cheated by choosing Victoria. Let's try a more British-sounding name.

In [7]:
city = "Richmond"
text = "How many countries with a city named {} can you find?".format(city)

places = geograpy.get_place_context(text=text)
countries = places.countries

print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
Found 4 countries for the city Richmond:
United Kingdom, United States, Australia, Canada

I guess this is enough to make the point that there will be plenty of false positives. With that in mind, let's try something more challenging.

The new Bond movie is coming up and, as a fan, I am excited to see it. While I wait for the movie, let's parse all of Ian Fleming's books and find out in how many places around the world 007 has used his license to kill.

I will explain the code in the cells below using bad Bond puns.

Chunked, not stirred.

In [8]:
def utf8toascii(text):
    # Drop any characters that cannot be represented as ASCII.
    return text.decode("utf-8").encode("ascii", "ignore")


def read_in_chunks(file_object, chunk_size=2048):
    # Lazily yield fixed-size chunks so a whole book never sits in memory.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

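These helpers are Python 2 idioms (on Python 3, `str` has no `.decode`). A rough Python 3 equivalent, plus a quick sanity check that the chunk reader reassembles its input, might look like this:

```python
import io

def utf8toascii(text):
    # Python 3: bytes in, ASCII str out, silently dropping non-ASCII characters.
    return text.decode("utf-8", "ignore").encode("ascii", "ignore").decode("ascii")

def read_in_chunks(file_object, chunk_size=2048):
    # Lazily yield fixed-size chunks so a whole book never sits in memory.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

book = io.StringIO("From Russia, with Love. " * 500)
text = "".join(read_in_chunks(book, chunk_size=64))
print(len(text))  # 24 characters * 500 repetitions = 12000
```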
I've been expecting you, Mr Loop.

In [9]:
import os
import cPickle as pickle

if not os.path.exists('./data/books.pickle'):
    import geograpy
    from glob import glob

    books = dict()
    for book in glob('*.txt'):
        countries = []
        with open(book) as f:
            for chunk in read_in_chunks(f):
                try:
                    chunk = utf8toascii(chunk)
                    p = geograpy.get_place_context(text=chunk)
                except UnicodeDecodeError:
                    continue  # Skip chunks that fail to decode.
                countries.extend(p.countries)
        book_name = book.split('.txt')[0]
        books.update({book_name: countries})

    with open('./data/books.pickle', 'wb') as f:
        pickle.dump(books, f)
else:
    with open('./data/books.pickle', 'rb') as f:
        books = pickle.load(f)
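The cache-or-compute pattern above (run the expensive parse once, pickle the result, reload it on every rerun) can be factored into a small helper. This is a generic sketch of the pattern, not code from the original notebook:

```python
import os
import pickle

def cached(path, compute):
    """Load a pickled result from `path`, or call `compute()` and pickle it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Usage (hypothetical): books = cached('./data/books.pickle', parse_all_books)
```

Deleting the pickle file is then all it takes to force a fresh parse.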

So we meet again.

In [10]:
%matplotlib inline
import pandas as pd
import numpy as np
from collections import Counter


dfs = []
for book, countries in books.items():
    book = book.split('-ian_fleming')[0]
    labels, values = zip(*Counter(countries).items())
    dfs.append(pd.DataFrame(np.array(values), index=labels, columns=[book]))

df = pd.concat(dfs, axis=1)

all_books = df.T.sum().sort_index()
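The `zip(*Counter(...).items())` idiom above splits a tally into parallel label/value sequences, ready to become a DataFrame column. With toy data (my own example, not from the books):

```python
from collections import Counter

countries = ["Russia", "Russia", "Jamaica", "Russia", "Jamaica", "Turkey"]
labels, values = zip(*Counter(countries).items())
print(labels)  # ('Russia', 'Jamaica', 'Turkey')
print(values)  # (3, 2, 1)
```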

World domination. The same old dream.

In [11]:
if not os.path.exists('./data/positions.pickle'):
    import time
    from geopy import GeoNames
    from geopy.geocoders.base import GeocoderTimedOut

    positions = dict()
    geolocator = GeoNames(username=username)  # `username` is a GeoNames account login.
    for country in df.index:
        while True:
            try:
                position = geolocator.geocode(country)
            except GeocoderTimedOut:  # Back off and retry when the service times out.
                time.sleep(5)
                continue
            break
        if position:
            location = [position.latitude, position.longitude]
            positions.update({country: location})
            del position
        else:
            print("Could not get position for {}".format(country))
    with open('./data/positions.pickle', 'wb') as f:
        pickle.dump(positions, f)
else:
    with open('./data/positions.pickle', 'rb') as f:
        positions = pickle.load(f)
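The `while True` / `try` / `sleep` / `continue` loop above is a bare-bones retry; a slightly more careful version caps the number of attempts and backs off between them. A generic sketch (the exception types and delays are placeholders, not geopy specifics):

```python
import time

def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call `func` until it succeeds or `attempts` runs out."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # Out of attempts: re-raise the last error.
            time.sleep(delay * attempt)  # Linear backoff between tries.

# A fake flaky service that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("timed out")
    return "ok"

print(retry(flaky, attempts=5))  # 'ok' on the third call
```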
In [12]:
import folium

mapa = folium.Map(tiles="Cartodb dark_matter", location=[0, 0], zoom_start=2)


for country, location in positions.items():
    times = int(all_books[country])
    popup = "{} was mentioned {} times.".format(country, times)
    mapa.simple_marker(location=location, popup=popup,
                       marker_icon="ok",
                       marker_color="orange",
                       clustered_marker=True)
mapa
Out[12]:

I am pretty sure Uruguay was not mentioned 52 times in Ian Fleming's novels. On the other hand, 139 mentions of Russia sounds about right ;-)
