The Natural Language Toolkit (NLTK) is one of those amazing pieces of software that surprise us just by the fact that they work. Just Google it and be amazed!
Recently I found the module geograpy, which extracts geographical information (countries, regions and cities) from text using NLTK.
So why not write a post combining something I always wanted to try (NLTK) with something I like (maps)? First let's take a look at how geograpy works.
import geograpy
text = "Paris is the city of love!"
places = geograpy.get_place_context(text=text)
places
Cool! We get a PlaceContext object. Let's explore it.
places.cities
places.countries
Anyone reading the phrase above knows I meant the muddy tidal-flat city of Lutetia and not the two other cities named Paris in the US and Canada. That is a limitation of automated text parsers: they cannot interpret context like humans do. Still... that is pretty amazing!
This gets even more complicated when the city name is British*. You'll get hits from all over the Empire!
* Or maybe worse for Spanish names, if geograpy could handle languages other than English. Portuguese names would be OK though. The Portuguese made a habit of naming things after local features. There is no New Lisbon anywhere in Brazil.
city = "Victoria"
text = "How many countries with a city named {} can you find?".format(city)
places = geograpy.get_place_context(text=text)
countries = places.countries
print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
OK. I cheated by choosing Victoria. Let's try a more British-sounding name.
city = "Richmond"
text = "How many countries with a city named {} can you find?".format(city)
places = geograpy.get_place_context(text=text)
countries = places.countries
print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
I guess this is enough to make the point that there will be plenty of false positives. With that in mind, let's try something more challenging.
The new Bond movie is coming out and, as a fan, I am excited to see it. While I wait for the movie, let's parse all of Ian Fleming's books and find out in how many places around the world 007 has used his license to kill.
I will explain the code in the cells below using bad Bond puns.
Chunked, not stirred.
def utf8toascii(text):
    # Drop any bytes that cannot be represented in ASCII (Python 2 strings).
    return text.decode("utf-8").encode("ascii", "ignore")

def read_in_chunks(file_object, chunk_size=2048):
    # Yield the file in fixed-size chunks so we never load a whole book at once.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
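To see what `read_in_chunks` does, here is a quick sanity check (Python 3 syntax, with `io.StringIO` standing in for a book file; the `utf8toascii` helper only makes sense on Python 2 strings, so it is left out):

```python
import io

def read_in_chunks(file_object, chunk_size=2048):
    # Yield the file in fixed-size chunks so we never hold a whole book in memory.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# Stand-in for a book file: 4800 characters of vintage Bond.
fake_book = io.StringIO("From Russia, with Love. " * 200)
chunks = list(read_in_chunks(fake_book, chunk_size=512))
# Every chunk except the last is exactly chunk_size characters long.
print(len(chunks), len(chunks[-1]))
```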
I've been expecting you, Mr Loop.
import os
import cPickle as pickle

if not os.path.exists('./data/books.pickle'):
    import geograpy
    from glob import glob

    books = dict()
    for book in glob('*.txt'):
        countries = []
        with open(book) as f:
            for chunk in read_in_chunks(f):
                try:
                    chunk = utf8toascii(chunk)
                    p = geograpy.get_place_context(text=chunk)
                except UnicodeDecodeError:
                    continue  # Skip chunks that cannot be decoded.
                countries.extend(p.countries)
        book_name = book.split('.txt')[0]
        books.update({book_name: countries})
    with open('./data/books.pickle', 'wb') as f:
        pickle.dump(books, f)
else:
    with open('./data/books.pickle', 'rb') as f:
        books = pickle.load(f)
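The load-or-compute-and-pickle pattern above shows up again later for the geocoded positions, so it can be factored into a small helper. This is just a sketch under my own names (`cached` is not part of the notebook), written with plain `pickle` so it also runs on Python 3:

```python
import os
import pickle

def cached(path, compute):
    # Reuse the pickled result if it exists; otherwise compute and cache it.
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

With a hypothetical `parse_all_books` wrapping the loop above, the whole `if`/`else` would collapse to `books = cached('./data/books.pickle', parse_all_books)`.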
So we meet again.
%matplotlib inline
import pandas as pd
import numpy as np
from collections import Counter

dfs = []
for book, countries in books.items():
    book = book.split('-ian_fleming')[0]
    labels, values = zip(*Counter(countries).items())
    dfs.append(pd.DataFrame(np.array(values), index=labels, columns=[book]))

df = pd.concat(dfs, axis=1)
all_books = df.T.sum().sort_index()
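The reshaping step is easy to miss, so here is the same `Counter` → `DataFrame` → `concat` pipeline on two made-up mini "books" (the titles and country lists are invented for illustration):

```python
import numpy as np
import pandas as pd
from collections import Counter

# Toy stand-ins for the countries geograpy extracted from each book.
books = {
    'casino_royale': ['France', 'France', 'Russia'],
    'dr_no': ['Jamaica', 'Russia'],
}

dfs = []
for book, countries in books.items():
    labels, values = zip(*Counter(countries).items())
    dfs.append(pd.DataFrame(np.array(values), index=labels, columns=[book]))

# One column per book, one row per country; NaN where a book never mentions it.
df = pd.concat(dfs, axis=1)
# Total mentions per country across all books (NaNs are skipped by sum).
all_books = df.T.sum().sort_index()
print(all_books)
```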
World domination. The same old dream.
if not os.path.exists('./data/positions.pickle'):
    import time
    from geopy import GeoNames
    from geopy.geocoders.base import GeocoderTimedOut

    positions = dict()
    geolocator = GeoNames(username=username)  # `username` is your GeoNames account name.
    for country in df.index:
        while True:
            try:
                position = geolocator.geocode(country)
            except GeocoderTimedOut:
                time.sleep(5)  # Back off and retry.
                continue
            break
        if position:
            location = [position.latitude, position.longitude]
            positions.update({country: location})
            del position
        else:
            print("Could not get position for {}".format(country))
    with open('./data/positions.pickle', 'wb') as f:
        pickle.dump(positions, f)
else:
    with open('./data/positions.pickle', 'rb') as f:
        positions = pickle.load(f)
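The `while True` retry dance above can be pulled out into a small helper. A sketch, not geopy API: `geocode_with_retry` is my own name, and `geocode` can be any callable, e.g. `geolocator.geocode`:

```python
import time

def geocode_with_retry(geocode, query, retries=5, delay=5):
    # Call a possibly flaky geocoder, sleeping and retrying on failure.
    for _ in range(retries):
        try:
            return geocode(query)
        except Exception:
            time.sleep(delay)
    return None  # Give up after `retries` failed attempts.
```

In the loop above it would be used as `position = geocode_with_retry(geolocator.geocode, country)`.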
import folium

mapa = folium.Map(tiles="Cartodb dark_matter", location=[0, 0], zoom_start=2)
for country, location in positions.items():
    times = int(all_books[country])
    popup = "{} was mentioned {} times.".format(country, times)
    mapa.simple_marker(location=location, popup=popup,
                       marker_icon="ok",
                       marker_color="orange",
                       clustered_marker=True)
mapa
I am pretty sure Uruguay was not mentioned 52 times in Ian Fleming's novels. On the other hand 139 mentions of Russia sounds about right ;-)