Principal Coordinates Analysis
In this article, you will discover Principal Coordinate Analysis (PCoA), also known as Metric Multidimensional Scaling (metric MDS). You’ll learn what Principal Coordinates Analysis is, when to use it, and how to implement it on a real example using Python and/or R.
What is Principal Coordinates Analysis?
Principal Coordinates Analysis is a statistical method that converts data on distances between items into a map-based visualization of those items.
The generated mappings can be used for better understanding which items are close to each other, and which are different. It can also allow you to identify groups or clusters.
Principal Coordinates Analysis vs Other Methods
Before we get into the details, let’s first discuss how Principal Coordinates Analysis relates to a few other closely related statistical methods.
Principal Coordinate Analysis vs Multidimensional Scaling
Multidimensional Scaling is a family of statistical methods that focus on creating mappings of items based on distance. Within Multidimensional Scaling, there are methods for different types of data. Principal Coordinate Analysis is a subtype of Multidimensional Scaling that deals with numerical distances in which there is no measurement error (you have exactly one distance measure for each pair of items).
Principal Coordinate Analysis vs Clustering
Clustering is also closely related to Principal Coordinate Analysis. In clustering, you try to create clusters of observations in your data set, based on similarities. The similarities are based on the variables that you have measured.
This is relatively similar to Principal Coordinate Analysis because it also makes representations of data based on (dis)similarity. Yet a big difference is that Principal Coordinate Analysis tries to extract two dimensions to make a 2D map, whereas clustering merely tries to group data points.
At the end of a clustering analysis, it is relatively common to plot your clusters and data points onto a 2D map as well. Yet you will need an additional analysis (for example a PCA) to compute those dimensions.
With a Principal Coordinate Analysis, the primary goal is creating the best possible mapping, and one thing you can look at in the graph are groups/clusters. With clustering, the principal goal is identifying clusters and one thing you can do with those clusters is to try to plot them on a map.
Principal Coordinate Analysis vs PCA
Principal Coordinate Analysis can also be easily confused with Principal Components Analysis (PCA). Firstly, their names and abbreviations (PCoA and PCA) are nearly identical. Secondly, they both perform dimensionality reduction.
The difference is that PCA focuses on shared variance: it tries to summarize multiple variables in the minimum number of components so that each component explains the most variance. PCoA on the other hand focuses on distances: it tries to extract a small number of dimensions in which the distances between items reproduce the original distances as closely as possible.
An example of Principal Coordinate Analysis
In the next part of the article, we’ll be working with two examples to make things clearer. The two examples are the two standard use cases for Principal Coordinate Analysis. Let’s introduce the examples before continuing.
The first example of Principal Coordinate Analysis that we’re going to see is a Perceptual Mapping use case. Perceptual Mapping means that you make a geographic map, but you use an unusual distance measure.
Of course, there are all types of distance measures that you can use for a map. Yet the idea of Perceptual Mapping is to create a visualization that gives you a great insight into other dimensions than distance.
For this example, I have created a small data set with travel times by train for cities in France. Ever since I’ve been living in France, I’ve been surprised that traveling between some nearby cities takes a long time by train, whereas some long distances can be covered very quickly. This can be due to conditions like the type of train that can run between two cities or geographical barriers like mountains.
I calculated train itineraries between the 10 largest cities in France to obtain the data. The goal of the analysis is to redraw a map of those French cities that is not based on geographical proximity, but that is based on travel-time-based proximity.
The second example that we’ll look at is an example from product branding. We will use a simulated data set of distances between products. Imagine that you’re a company, and you want to introduce a new product. You could use this technique to map the product among existing products, to find out whether it is different enough to be introduced.
Product Mapping can be done using metric data, but it is often done using non-metric data as well. You can only use Principal Coordinate Analysis if you have data on a metric scale. For ordinal numeric data, you need to use a method called Non-Metric Multidimensional Scaling.
Torgerson method for Principal Coordinate Analysis
Before deep-diving into the code for the example, let’s zoom in on the mathematics behind the Principal Coordinate Analysis.
When starting with this method, you need data that comes down to having a so-called dissimilarity matrix. This means that you have a distance or dissimilarity measurement for each pair of items in your data set.
Be aware that this method can be used only for real measurements of distance. For example, if your value is twice as high, your distance must be twice as large. Therefore, you cannot use it for ordinal data like consumers filling in a measurement scale from one to five. You need to use a non-metric multidimensional scaling for that.
Once you have the metric distance matrix, you can compute your solution using the Torgerson method (when distances are Euclidean) or else by the iterative method.
The Torgerson method
The Torgerson method has two steps. You start with the distance matrix (we call it D) to which you apply a double centering. You will obtain a double-centered matrix that we call B. This is done so that the center of your newly generated mapping will be in the middle of your graph.
The Torgerson formula for the double centering starts by computing the element-wise squares of the distances: $D^{(2)} = [d_{ij}^2]$
Then you compute the double-centered matrix B as follows: $B = -\frac{1}{2} C D^{(2)} C$
The matrix C is a centering matrix computed from an identity matrix (I) and a matrix of all ones (J), where n is the number of observations: $C = I - \frac{1}{n} J$
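Then, you apply a Singular Value Decomposition on the matrix B. Because B is symmetric, this amounts to an eigendecomposition. Once you do that, you take the first two dimensions (the ones with the largest eigenvalues) and you use them as axes to plot your mapping. The scores of your items on the first two dimensions, that is, each eigenvector scaled by the square root of its eigenvalue, will be used as coordinates for your map.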
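The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used later in the article: the function name `torgerson_pcoa` and the rectangle used as a check are assumptions made for the example.

```python
import numpy as np

def torgerson_pcoa(D, n_components=2):
    """Classical (Torgerson) scaling of a square, symmetric distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Centering matrix C = I - J / n
    C = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix B = -1/2 * C * D^2 * C (D is squared element-wise)
    B = -0.5 * C @ (D ** 2) @ C
    # B is symmetric, so an eigendecomposition plays the role of the SVD
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Scale each retained eigenvector by the square root of its eigenvalue
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# Hypothetical check: the four corners of a 3 x 4 rectangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = torgerson_pcoa(D)
D_rebuilt = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_rebuilt))
```

Because the rectangle's distances are Euclidean and two-dimensional, the recovered coordinates reproduce the distance matrix, although the map may come out rotated or mirrored relative to the original points.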
The iterative method
The iterative method is more general and can be applied when distances (dissimilarities) are not Euclidean. The iterative method consists of minimizing a cost function, which is defined as follows:
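To give an idea of what such an iterative fit looks like, here is a small sketch. It assumes one common cost function, the raw stress (the sum over all pairs of the squared difference between the given distance and the distance on the map), and minimizes it with plain gradient descent. The function names, step size, and number of iterations are illustrative choices, not the article's.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, D):
    """Sum over pairs i < j of (d_ij - ||x_i - x_j||)^2."""
    iu = np.triu_indices(len(D), k=1)
    return ((D[iu] - pairwise_distances(X)[iu]) ** 2).sum()

def minimize_stress(D, X0, n_iter=500, lr=0.01):
    """Plain gradient descent on the raw stress, starting from X0."""
    X = X0.copy()
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)          # avoid dividing by zero on the diagonal
        coef = (dist - D) / dist
        np.fill_diagonal(coef, 0.0)          # an item does not interact with itself
        grad = 2 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# Hypothetical target distances: the corners of a unit square
s = np.sqrt(2.0)
D = np.array([[0, 1, s, 1], [1, 0, 1, s], [s, 1, 0, 1], [1, s, 1, 0]], dtype=float)

X0 = np.random.default_rng(0).normal(size=(4, 2))   # random starting map
X = minimize_stress(D, X0)
print(raw_stress(X0, D), raw_stress(X, D))
```

Unlike the Torgerson method, this approach depends on the starting configuration and may stop in a local minimum, which is why implementations typically try several random starts.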
Principal Coordinates Analysis in R
Let’s now move to the implementation. We’ll start with the data on train travel times of French cities and then move to the branding study.
Principal Coordinates Analysis in R — Example 1 — Perceptual Mapping
Let’s start with a general overview of France’s 10 largest cities for those who are not familiar:
For this analysis, I have used an itinerary planner to obtain the travel time from each city to the other by train. I have put those travel times (in minutes) in a distance matrix. You can obtain the data directly from an S3 bucket using the following line of code:
The distance data will look as follows:
Note that this data set is already a distance matrix. Only, it is in a data frame format. We need to convert it into a distance matrix format as follows:
The distance matrix format will look like this:
Then, we can apply the Principal Coordinates Analysis function cmdscale from the stats package that will do all the mathematics of the Torgerson method for us, and we will obtain the coordinates for our mapping.
You will obtain the new coordinates for each of the cities and that looks as follows:
This coordinate matrix is the output of the model. Of course, a logical next step is to plot those coordinates to obtain the visual version of the mapping. We can then create the mapping as follows:
In the following graph, you see the mapping that you obtain. The cities are now relocated based on travel times by train rather than by kilometers:
Those who have looked at the original map of France will notice that there is something weird going on: cities in the east are all shown in the west and vice versa. Since the Principal Coordinates Analysis is based only on distances, it does not preserve the original directions: a mirrored map reproduces the distances equally well. We can easily mirror the map along the x-axis (reversing the sign of the x-coordinates) to get east and west back in place. You can do this with the following code:
You will now obtain the final mapping of the top 10 French cities, reorganized based on travel times by train:
Conclusions from the Perceptual Mapping study
This mapping allows us to see several interesting things. Firstly, we see that a lot of the cities are projected much closer to each other except for three outliers: Nice, Toulouse, and Bordeaux are being pulled away from each other to represent longer travel times in the southern cities of France.
The cities Marseille and Montpellier are also in the south of France, yet those are moved much closer to Paris and further away from the other southern cities. This can be explained by the quick train line from Paris to Marseille.
In the Northern part of France, we see the distance from east to west being made much smaller. The distance from Nantes to Strasbourg is shown as being very small, although they are on opposite sides of the country.
Now if you really want to move forward with this mapping, you could use mapping or GIS packages like the cartography package to make this map look stunning. Be aware that the current mapping is based on a 0-centered map. Yet when you want to project onto a map of the country, you will need to make some additional decisions, including how to scale the map and where to place the center.
Principal Coordinates Analysis in R — Example 2 — Branding Study
Now let’s look at the Product Mapping example. Imagine you are a company, and you have an idea for a new product. You want to do a preliminary analysis to find out whether your product would be different enough from your existing products.
I have created a simulated data set with 10 candy products.
The data contains 10 types of candy with measurements of sweetness, sourness, saltiness, and bitterness. The data looks as follows:
Unlike in the previous example, we do not yet have a distance matrix. Since the input for a Principal Coordinates Analysis is a distance matrix, we need to compute that distance matrix first, based on the data. The dist function in R computes the Euclidean distances between observations, as follows:
The distance matrix looks like this:
Now we need to fit the Principal Coordinates Analysis using cmdscale. The code is shown below:
You will obtain the coordinates of each of the 10 candies in a matrix:
You can generate the plot of the 10 candies on the two dimensions of the Principal Coordinates Analysis as follows:
You should obtain the following plot:
You can get some interesting insights from this graph. Most of the candies are grouped at the bottom of the graph. There is one very different candy: candy 5. Then for the other candies, we might distinguish two groups: one group on the bottom right (candies 2, 3, 4, and 1) and a group on the bottom left (candies 10, 9, 6, 7, and 8).
To find out a bit more about what the dimensions actually mean, it can be interesting to analyze correlations between the original variables and the two dimensions. This can be done as follows:
This gives us the following correlation matrix:
This tells us that the first dimension is strongly correlated with Sweetness and Sourness. The second dimension is mainly representing Saltiness.
Conclusions from the Candy study
We can conclude two things for the question of our candy company:
- Firstly, the company does not yet have candy in the top right of the graph. It may be interesting for them to study whether this would have any added value. This would be a candy that scores high on both dimensions. This candy could for example be a Sweet/Salty combination.
- Secondly, we can conclude that Bitterness is not really represented in the existing dimensions. This means that Bitterness is not something that highly varies inside the candies: if it did, it would probably have had a stronger presence in one of the dimensions. It may be interesting for the company to look into Bitter candies to expand their product range.
Principal Coordinates Analysis in Python
Now let’s see an alternative implementation in Python for the two short examples that we have just covered with R.
Principal Coordinates Analysis in Python— Example Perceptual Mapping
The first step is to import the data. You can use the following code to obtain the city distances directly from my S3 bucket:
The data are a distance matrix with the travel times in minutes from each city to the other. It looks as follows:
The Python function that we’re going to use for the Principal Coordinates Analysis can only take a symmetrical distance matrix. This means that we have to fill in the NAs with the corresponding values. This is easy to do by replacing the NAs by 0 and doing a sum of the original matrix and the transposed matrix:
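As a minimal sketch of that symmetrization step, the snippet below uses a hypothetical three-city matrix; the city labels and travel times are made up for illustration and are not the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical lower-triangular travel times (minutes) for three cities;
# the labels and values are made up for illustration only.
cities = ["A", "B", "C"]
lower = pd.DataFrame(
    [[0.0, np.nan, np.nan],
     [60.0, 0.0, np.nan],
     [150.0, 90.0, 0.0]],
    index=cities, columns=cities,
)

# Replace the NAs with 0, then add the transpose to fill the upper triangle
filled = lower.fillna(0)
symmetric = filled + filled.T
print(symmetric)
```

This works because every off-diagonal cell is filled in exactly one of the two triangles, so the sum never double-counts a value. The diagonal stays at 0 as long as it was 0 or NA in the input.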
The outcome looks as follows:
Now we get to the modeling. You can use the scikit-bio package for your Principal Coordinates Analysis. You can use the code below to install and import the package and to fit the model. Finally, you print the coordinates in the first two dimensions.
The new coordinates for the 10 cities look like this:
Finally, we want to create a plot of those 10 coordinates. You can use the following code to plot the cities:
The resulting map is shown below. It is the same output as the one obtained by R, except that it is mirrored. I will not repeat the conclusions, as they will be exactly the same conclusions as we’ve seen in the R analysis above.
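The mirroring is harmless because the method only looks at distances. A short sketch, using made-up coordinates as a stand-in for a PCoA output, shows that reversing the sign of one axis leaves every pairwise distance unchanged:

```python
import numpy as np

def pairwise(X):
    """Euclidean distances between all rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Hypothetical 2D coordinates standing in for a PCoA output
coords = np.array([[1.0, 2.0], [-3.0, 0.5], [2.0, -2.5]])

# Mirror the map by reversing the sign of the first axis
mirrored = coords * np.array([-1.0, 1.0])

# The pairwise distances are identical, so both maps fit the data equally well
print(np.allclose(pairwise(coords), pairwise(mirrored)))
```

The same multiplication can be used to flip a map back if you want it to match the orientation of another plot.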
Principal Coordinates Analysis in Python— Example Branding Study
Now, for completeness, let’s also do a Python implementation for the candy branding study. As previously, you need to start by importing the data. You can get them with the following code:
The data will look like this:
An additional step that is needed here is the computation of the distance matrix. In Python, you can compute pairwise distances (between each pair of rows) using pdist. However, this function returns a condensed, one-dimensional vector of distances rather than a square symmetric matrix. You have to add the function squareform to convert it into a symmetric matrix:
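The difference between the two formats is easiest to see on a tiny example. The sketch below uses three hypothetical candies with made-up scores, not the article's data set:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical scores for three candies on four taste attributes
# (sweetness, sourness, saltiness, bitterness); the values are made up.
X = np.array([
    [5.0, 1.0, 0.0, 1.0],
    [4.0, 2.0, 1.0, 1.0],
    [1.0, 5.0, 0.0, 2.0],
])

condensed = pdist(X, metric="euclidean")  # one value per pair of rows
square = squareform(condensed)            # full symmetric matrix, zeros on the diagonal

print(condensed.shape)  # (3,)
print(square.shape)     # (3, 3)
```

With n rows, the condensed vector holds n(n-1)/2 values, one for each pair, whereas the square form repeats every distance on both sides of the diagonal.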
You will obtain the following NumPy array:
Now that you have this matrix, you can move on to fitting the model. We’ll use the skbio package again and plot the results.
You will obtain the same graph as the one outputted by the equivalent R code. The Python graph is shown below:
The conclusions are the same as listed in the R analysis above.
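The correlation check between the original variables and the two dimensions, described in the R analysis, can also be sketched in Python. The helper name `variable_axis_correlations` and the random data below are assumptions made for illustration; in practice you would pass the candy measurements and the coordinates returned by the model.

```python
import numpy as np

def variable_axis_correlations(X, coords):
    """Pearson correlation of every column of X with every column of coords."""
    n_vars = X.shape[1]
    # corrcoef stacks both sets of columns; keep the variables-vs-axes block
    full = np.corrcoef(X, coords, rowvar=False)
    return full[:n_vars, n_vars:]

# Hypothetical data: 6 items, 3 variables, and 2 made-up map coordinates
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
# Axes deliberately built from variables 0 and 1 so the pattern is easy to read
coords = np.column_stack([X[:, 0] * 2.0, -X[:, 1]])

corr = variable_axis_correlations(X, coords)
print(corr.round(2))
```

Each row of the result corresponds to an original variable and each column to a dimension of the map, so a value close to 1 or -1 indicates that the dimension largely reflects that variable.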
Using Principal Coordinates Analysis, we have visualized the 10 largest cities of France and created an alternative map of France based on travel times by train.
We have also used Principal Coordinates Analysis to analyze product branding for a company that has 10 candy products on the market. We mapped their existing products and this allowed us to identify potential niches for new products to put onto the market.
In short, Principal Coordinates Analysis is a great method for exploring data and it allows you to make data visualizations that are closely linked to specific questions. Principal Coordinates Analysis can be a very useful tool if you know how and when to use it. I hope that this is the case after reading this article!
I hope that this article was useful for you. Don’t hesitate to stay tuned for more maths, stats, and data content!