
I have a list of 500,000 latitude/longitude coordinates like below:

Latitude   Longitude  
42.022506  -88.168156  
41.877445  -87.723846  
29.986801  -90.166314  

I am looking to use Python to get the city, state, and country for each coordinate, in new columns, like below:

Latitude   Longitude   City        State   Country
42.022506  -88.168156  Streamwood  IL      United States
41.877445  -87.723846  Chicago     IL      United States
29.986801  -90.166314  Metairie    LA      United States

With a dataset this large, how can this be achieved in Python? I have heard of Google's API, Nominatim's API, and the Geopy package.

How do I run all of the rows through this code? Right now I have to manually input the latitude and longitude in the last line.

import csv 
import pandas as pd
import numpy as np
import math
from geopy.geocoders import Nominatim

input_file = "Lat-Log.csv" # file contains ID, Latitude, Longitude
output_file = "output.csv"
df = pd.read_csv(input_file) 

geolocator = Nominatim(user_agent="geoapiExercises")
def city_state_country(coord):
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    return city, state, country
print(city_state_country("47.470706, -99.704723"))

The output gives me ('Bowdon', 'North Dakota', 'USA'). I am looking to replace the coordinates with my columns (latitude and longitude) to run through my list. How do I input my columns into the code to run through the whole document?

  • Is there something preventing you from doing the research on this yourself? Usually, StackOverflow questions which are just asking how to do something while having no code or attempts shown tend to get downvoted and closed. I think more research/effort on your part needs to be shown. Otherwise this is basically a "I want some code, gimme" post. See: How much research effort is expected of Stack Overflow users? Edit: Okay, looks like you did provide some code, but please explain what part you're stuck in regarding the columns. Commented Oct 1, 2021 at 16:35
  • It looks like you have code that does exactly what you are asking about. What do you want the output to be instead? Rather than talking about "replacing columns", or other code-related details, it is often more constructive to talk about the output. Commented Oct 1, 2021 at 16:43
  • If you don't want to pre-sort the coordinates to eliminate duplicates, at least cache the retrieved values so you don't call your api multiple times with the same coordinates.
    – RufusVS
    Commented Oct 1, 2021 at 16:46
The output gives me the city, state, and country for that exact location. I had to manually input the latitude and longitude into the print(city_state_country("47.470706, -99.704723")). I was looking to replace the coordinates with my latitude and longitude columns, get the city, state, and country for every single coordinate, and put them in new columns.
    – Nick
    Commented Oct 1, 2021 at 16:46
  • 1
    Did you write the sample code all by yourself ?
    – user2668284
    Commented Oct 1, 2021 at 17:22
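
RufusVS's caching suggestion can be sketched with functools.lru_cache. The fake_reverse function below is a hypothetical stand-in for the real geolocator.reverse() call, used only to demonstrate that repeated coordinates never trigger a second lookup:

```python
from functools import lru_cache

call_log = []  # records lookups that actually run, to show the cache working

# Hypothetical stand-in for geolocator.reverse(); swap in the real call.
def fake_reverse(coord):
    call_log.append(coord)
    return {"city": "Chicago", "state": "Illinois", "country": "United States"}

@lru_cache(maxsize=None)
def reverse_cached(coord):
    # coord is the "lat, lon" string, which is hashable and therefore cacheable
    return fake_reverse(coord)

reverse_cached("41.877445, -87.723846")
reverse_cached("41.877445, -87.723846")  # repeat: served from the cache
print(len(call_log))  # 1
```

With 500,000 rows, even a modest rate of duplicate coordinates can save hours of API time.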

1 Answer


You want to run a function on each row, which can be done using apply().

There are two complications: you want to 1) provide multiple arguments to the function, and 2) get back multiple results.

Here's how to adapt your code to handle both:

import pandas as pd
import io
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geoapiExercises")

s = """Latitude   Longitude  
42.022506  -88.168156  
41.877445  -87.723846  
29.986801  -90.166314"""

df = pd.read_csv(io.StringIO(s), sep=r"\s+")  # delim_whitespace= is deprecated

def city_state_country(row):
    coord = f"{row['Latitude']}, {row['Longitude']}"
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    row['city'] = city
    row['state'] = state
    row['country'] = country
    return row

df = df.apply(city_state_country, axis=1)
print(df)

(I replaced your read_csv() call with an inline definition of the dataframe to make the example self-contained; keep your read_csv() in practice.)

The city_state_country() function gets called with every row of the dataframe. (The axis=1 argument makes apply() run using rows rather than columns.) The function gets the lat and lon, and does a query. Then, it modifies the row to include the information from the query.

This gets the following result:

    Latitude  Longitude     city      state        country
0  42.022506 -88.168156            Illinois  United States
1  41.877445 -87.723846  Chicago   Illinois  United States
2  29.986801 -90.166314           Louisiana  United States

Not quite the same as your example, but Nominatim doesn't seem to return a city for two of your coordinates. (It calls them towns, not cities.)
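
Since your original script defines an output_file, the augmented frame can be written back out with a single to_csv() call once the apply finishes. A minimal self-contained sketch (the sample data below is hypothetical):

```python
import io

import pandas as pd

# Sample frame standing in for the result of the apply() step above.
df = pd.DataFrame({
    "Latitude": [42.022506], "Longitude": [-88.168156],
    "city": ["Streamwood"], "state": ["Illinois"], "country": ["United States"],
})

# Write to an in-memory buffer here; pass "output.csv" in practice.
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False keeps the row index out of the file
print(buf.getvalue().splitlines()[0])  # Latitude,Longitude,city,state,country
```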

  • You present a perfectly good pattern for this kind of problem. However, it needs to be borne in mind (for this particular case) that the geolocator services (unless paid for) are likely to stop working if you try to hit them with half a million queries in quick succession.
    – user2668284
    Commented Oct 1, 2021 at 17:31
  • @BrutusForcus True. There are options for doing hundreds of thousands of geocoding queries, which are compatible with Nominatim. If you want to use a different provider, the domain of the provider can be specified in the constructor, as documented here
    – Nick ODell
    Commented Oct 1, 2021 at 18:23
  • What's the best way for me to run through my whole data frame? Do I replace "geoapiExercises" with my email address?
    – Nick
    Commented Oct 1, 2021 at 18:51
  • 1
    @Nick It's not clear to me that you ought to run through the whole dataframe at once. The Nominatim API guidelines say to only do 1 request per second. At that rate, it would take more than five days to do half a million requests. If your internet goes down or your computer crashes, you will have to restart from scratch. You should come up with a solution which can do a handful of requests, save its work, and continue, as well as pick up after a crash.
    – Nick ODell
    Commented Oct 1, 2021 at 18:54
  • 2
    @Nick I would create a separate table of positions and locations, and only request UNIQUE positions so you don't repeat requests. In fact you should probably filter locations that are within some hundreds of feet of each other, assuming they are actually within the same city and therefore needn't be requested. Then fill in your full database from that lookup database.
    – RufusVS
    Commented Oct 1, 2021 at 19:13
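
The unique-positions idea from the last comment can be sketched in pandas with drop_duplicates() and merge(). The city values below are hypothetical placeholders for real Nominatim results:

```python
import pandas as pd

# Sample data with one duplicated coordinate pair.
df = pd.DataFrame({
    "Latitude":  [42.022506, 41.877445, 41.877445, 29.986801],
    "Longitude": [-88.168156, -87.723846, -87.723846, -90.166314],
})

# Geocode each unique coordinate pair only once...
unique = df[["Latitude", "Longitude"]].drop_duplicates().copy()
# ...here filled with hypothetical results; replace with real Nominatim lookups.
unique["city"] = ["Streamwood", "Chicago", "Metairie"]

# ...then join the results back onto every matching row of the full frame.
df = df.merge(unique, on=["Latitude", "Longitude"], how="left")
print(len(unique), len(df))  # 3 4
```

Three lookups serve four rows; on real data the savings scale with how clustered the coordinates are.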
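
Nick ODell's save-and-resume advice, combined with the 1-request-per-second guideline, can be sketched as a loop that appends each result to the output file as it arrives and skips rows already present on restart. The resumable_run and already_done names are hypothetical helpers, and geocode stands in for a real reverse-geocoding call:

```python
import csv
import os
import time

FIELDS = ["Latitude", "Longitude", "City", "State", "Country"]

def already_done(path):
    """Return the coordinate pairs already written to path, as strings."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {(r["Latitude"], r["Longitude"]) for r in csv.DictReader(f)}

def resumable_run(rows, geocode, path, min_interval=1.0):
    """Append one geocoded row at a time so a crash loses at most one result."""
    done = already_done(path)
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for lat, lon in rows:
            if (str(lat), str(lon)) in done:
                continue  # finished on a previous run; skip
            city, state, country = geocode(f"{lat}, {lon}")
            writer.writerow({"Latitude": lat, "Longitude": lon,
                             "City": city, "State": state, "Country": country})
            f.flush()  # persist immediately, before the next request
            time.sleep(min_interval)  # respect Nominatim's 1 req/s guideline
```

If the script dies on day three, rerunning it picks up where it left off instead of re-spending three days of requests.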
