How to Convert a Shapefile to a DataFrame in Python

An overview of the GeoPandas Python library, with a step-by-step example

Data science applications often require working with data in geographic space. A shapefile stores geospatial data as a set of related files that together act as a simple file-based database. Shapefiles are used by GIS professionals, local government agencies, and businesses for mapping and analysis.

In this blog post I will describe an elegant way of working with geospatial data in Python through a practical example. I will be using GeoPandas, a Python library for plotting, analysing and mapping geospatial data. GeoPandas extends the popular Pandas library to deal with geographical data. I will also take a look at how to plot results using matplotlib.

GeoPandas can be installed through the following command:

pip3 install geopandas

The tutorial is organized as follows:

  • Load Dataset
  • Plot Data
  • Operations on the Geometry

Load Dataset

To load a geographical dataset, I can use the read_file() function, which automatically detects the format of the dataset. If the file is a shapefile, I should make sure that the folder containing the .shp file also includes the companion .prj, .dbf, and .shx files.
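Before calling read_file(), it can help to check that the companion files are actually present. The following is a minimal sketch using only the standard library; the helper name missing_sidecars and the SIDECAR_EXTENSIONS tuple are hypothetical names introduced for this illustration, and the check assumes lowercase extensions.

```python
from pathlib import Path

# Companion extensions named in the text (assumed to be lowercase on disk).
SIDECAR_EXTENSIONS = ('.shx', '.dbf', '.prj')


def missing_sidecars(shp_path):
    """Return the companion extensions that are absent next to shp_path."""
    shp = Path(shp_path)
    return [ext for ext in SIDECAR_EXTENSIONS
            if not shp.with_suffix(ext).exists()]
```

Calling missing_sidecars('../../Datasets/italy-points-shape/points.shp') should return an empty list when the shapefile is complete, and the list of missing extensions otherwise.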

In this tutorial, I use a dataset containing Italian points of interest, provided by Map Cruzin. This shapefile is derived from OpenStreetMap.org and is licensed under the Open Data Commons Open Database License (ODbL).

import geopandas as gpd
df = gpd.read_file('../../Datasets/italy-points-shape/points.shp')
df.head()

The geometry column may contain points, lines, polygons and so on. The dataset may contain more than one geometry column, but only one of them can be active at a time. The active geometry can be chosen through the set_geometry() method:

df = df.set_geometry('geometry')
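To see what switching the active geometry does, here is a small sketch on toy data rather than the Italian dataset. The column name area_around, the buffer size of 0.1 and the approximate coordinates are all assumptions made for this example.

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy data, not the Italian dataset: two places with approximate lon/lat.
toy = gpd.GeoDataFrame(
    {'name': ['Rome', 'Milan']},
    geometry=[Point(12.496365, 41.902782), Point(9.19, 45.4642)],
)

# A second geometry column: a 0.1-unit buffer (a polygon) around each point.
toy['area_around'] = toy.geometry.buffer(0.1)

# The active geometry is still the original point column.
print(toy.geometry.name)

# set_geometry() returns a new GeoDataFrame with the other column active.
toy_areas = toy.set_geometry('area_around')
print(toy_areas.geometry.name)
print(toy_areas.geom_type.tolist())
```

Geometry-based methods such as plot() or distance() then operate on whichever column is active.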

The file is loaded as a GeoPandas GeoDataFrame. Since the GeoDataFrame is a subclass of the Pandas DataFrame, I can use all the Pandas DataFrame methods on it. For example, I can show the number of records and columns through the shape attribute:

df.shape

The dataset contains 47,427 records.
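The subclass relationship can be checked directly. The sketch below uses a made-up three-row GeoDataFrame as a stand-in for the real dataset, so the values are illustrative only.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Toy stand-in for the points dataset (values are made up).
toy = gpd.GeoDataFrame(
    {'type': ['cafe', 'bank', 'cafe']},
    geometry=[Point(12.5, 41.9), Point(9.2, 45.5), Point(11.3, 43.8)],
)

print(isinstance(toy, pd.DataFrame))  # a GeoDataFrame is also a DataFrame
print(toy.shape)                      # (rows, columns)

# Boolean filtering is plain Pandas, yet the result keeps its geometry.
cafes = toy[toy['type'] == 'cafe']
print(type(cafes).__name__, len(cafes))
```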

Plot Data

I can plot a first map through the plot() method provided by GeoPandas. If a file contains more than one geometry column, the active one is plotted.

df.plot()

The previous map is too small, so I can improve it with matplotlib. First, I increase the figure size: I create a figure and axes with plt.subplots() at the desired size, and then I pass the ax variable to the GeoDataFrame plot() method:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
df.plot(ax=ax)

I can also change the color of the dots according to the type column. This type of plot is called a choropleth map. First, I calculate the number of different types:

len(df['type'].value_counts())

There are 301 different types. To make the map more readable, I keep only the types with more than 300 points and drop the others.

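# Flag each type that occurs more than 300 times
target_types = df['type'].value_counts() > 300
tc = target_types[target_types].index  # types to keep
def myfilter(x):
    return x in tc
# Despite its name, the 'delete' column is True for the rows to keep
df['delete'] = df['type'].apply(myfilter)
df = df[df['delete']]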
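The same filter can also be written without a helper function or an extra column. This is an alternative sketch on toy data, not the code used in the tutorial: min_points is a hypothetical threshold chosen so that the small example has something to drop, whereas the tutorial uses 300 on the real dataset.

```python
import pandas as pd

# Toy data and a toy threshold (the tutorial uses 300 on the real dataset).
toy = pd.DataFrame({'type': ['cafe'] * 4 + ['bank'] * 2 + ['zoo']})
min_points = 2  # hypothetical threshold for this small example

# Count the rows per type and keep the labels above the threshold.
counts = toy['type'].value_counts()
frequent_types = counts[counts > min_points].index

# isin() builds the boolean mask in one vectorised step.
filtered = toy[toy['type'].isin(frequent_types)]
print(filtered['type'].unique().tolist())
```

Because a GeoDataFrame supports the same Pandas methods, the pattern should carry over to df unchanged.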

Now I check the number of remaining types:

len(df['type'].value_counts())

There are 26 types.

Now I plot the choropleth map, simply by passing the column parameter to the plot() method. I can show the legend by setting legend=True.

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
df.plot(ax=ax, column='type', legend=True, cmap='viridis')
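With many categories, the legend can cover part of the map. One possible remedy is to move it outside the axes through the legend_kwds parameter, whose entries are forwarded to the matplotlib legend. The sketch below uses made-up points; the legend position, marker size and title are assumptions for this example.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point

# Toy points with a categorical column (values are made up).
toy = gpd.GeoDataFrame(
    {'type': ['cafe', 'bank', 'cafe', 'pharmacy']},
    geometry=[Point(12.5, 41.9), Point(9.2, 45.5),
              Point(11.3, 43.8), Point(14.3, 40.9)],
)

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
toy.plot(
    ax=ax, column='type', legend=True, cmap='viridis', markersize=20,
    # Place the legend outside the axes, to the right of the map.
    legend_kwds={'loc': 'center left', 'bbox_to_anchor': (1, 0.5)},
)
ax.set_title('Points of interest by type')
```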

It is interesting to note that the majority of points of interest are located in Northern Italy.

Operations on the Geometry

GeoPandas supports many operations directly on the geometry column. For example, I can calculate the distance of each point from a given point, such as Rome, the Italian capital. I convert the coordinates to a geometry through the points_from_xy() function:

rome_longitude = [12.496365]
rome_latitude = [41.902782]
rome_point = gpd.points_from_xy(rome_longitude,rome_latitude)

Then, I calculate the distance of each point in df from rome_point. I use the distance() method, which is applied to the active geometry:

df['distance'] = df['geometry'].distance(rome_point[0])
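Note that distance() works in the units of the coordinate reference system, so with longitude/latitude coordinates the result is in degrees. If distances in kilometres are needed, one option is to reproject to a metric CRS first. The sketch below is an illustration on toy data, not part of the original workflow; it assumes that UTM zone 33N (EPSG:32633) is an acceptable projection for Italy and that the input coordinates are in EPSG:4326.

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy points in longitude/latitude (EPSG:4326); coordinates are approximate.
toy = gpd.GeoDataFrame(
    {'name': ['Rome', 'Milan']},
    geometry=[Point(12.496365, 41.902782), Point(9.19, 45.4642)],
    crs='EPSG:4326',
)

# Assumption: UTM zone 33N (EPSG:32633) is a suitable metric CRS for Italy.
toy_m = toy.to_crs(epsg=32633)

# Distances are now in metres, so divide by 1000 to get kilometres.
rome_m = toy_m.geometry.iloc[0]
toy_m['distance_km'] = toy_m.geometry.distance(rome_m) / 1000
print(toy_m[['name', 'distance_km']])
```

A projected distance is still an approximation, and it degrades for points far from the zone the projection was designed for.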

I sort the dataset by increasing distance:

df = df.sort_values(by='distance', ascending=True)

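Finally, I select only the points of interest near Rome, i.e. those at a distance of less than 0.2. Because the coordinates are longitude and latitude, this distance is expressed in degrees: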

df_rome = df[df['distance'] < 0.2]
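To get a feel for what a threshold of 0.2 means on the ground, assuming the coordinates are longitude and latitude in degrees, the following back-of-the-envelope sketch converts it to kilometres. The function name degrees_to_km and the constant of 111.32 km per degree of latitude are approximations introduced for this example.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def degrees_to_km(delta_deg, latitude_deg):
    """Approximate ground size of delta_deg at a given latitude.

    Returns (north_south_km, east_west_km).
    """
    north_south = delta_deg * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_km, ew_km = degrees_to_km(0.2, 41.902782)
print(f'0.2 degrees near Rome: about {ns_km:.0f} km north-south, '
      f'{ew_km:.0f} km east-west')
```

At the latitude of Rome this comes to roughly 22 km north-south and 17 km east-west, so the selected area is stretched rather than circular.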

Then, I plot the resulting dataframe:

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
df_rome.plot(ax=ax, column='type', legend=True, cmap='viridis')

Summary

Congratulations! You have just learned how to represent geographical data in Python through GeoPandas!

You have learned how GeoPandas can be used to perform efficient operations on geodata. Although Pandas is excellent at many tasks, it is not ideal for working with geospatial data in location-aware applications. GeoPandas solves this problem by adding functionality well suited for geospatial data to Pandas.

You can download the code of this tutorial from my GitHub repository.

If you have read this far, that is already a lot for me today. Thanks! You can read more about me in this article.
