Principal Coordinates Analysis




In this article, you will discover Principal Coordinates Analysis (PCoA), also known as Metric Multidimensional Scaling (metric MDS). You’ll learn what Principal Coordinates Analysis is, when to use it, and how to implement it on two worked examples using R and Python.

What is Principal Coordinates Analysis?

Principal Coordinates Analysis is a statistical method that converts data on distances between items into a map-based visualization of those items.

The generated mappings help you understand which items are close to each other and which are far apart. They can also allow you to identify groups or clusters.

Principal Coordinates Analysis vs Other Methods

Before we get into the details, let’s first discuss how Principal Coordinates Analysis relates to a few other closely related statistical methods.

Principal Coordinates Analysis vs Multidimensional Scaling

Multidimensional Scaling is a family of statistical methods that focus on creating mappings of items based on distance. Within Multidimensional Scaling, there are methods for different types of data. Principal Coordinates Analysis is the subtype that deals with numerical distances that are treated as free of measurement error: you have exactly one distance measure for each pair of items.

Principal Coordinates Analysis vs Clustering

Clustering is also closely related to Principal Coordinates Analysis. In clustering, you try to create clusters of observations in your data set, based on similarities. The similarities are based on the variables that you have measured.

This is relatively similar to Principal Coordinates Analysis, which also builds a representation of the data based on (dis)similarity. Yet a big difference is that Principal Coordinates Analysis extracts dimensions (typically two, to make a 2D map), whereas clustering merely tries to group data points.

At the end of a clustering analysis, it is relatively common to plot your clusters and data points onto a 2D map as well. Yet you will need an additional analysis (for example a PCA) to compute those dimensions.

With Principal Coordinates Analysis, the primary goal is creating the best possible mapping, and one thing you can look for in the graph is groups or clusters. With clustering, the principal goal is identifying clusters, and one thing you can do with those clusters is to plot them on a map.

Principal Coordinates Analysis vs PCA

Principal Coordinates Analysis can also be easily confused with Principal Components Analysis (PCA). First, they have almost the same name and initials. Second, they are both dimensionality reduction methods.

The difference is that PCA starts from the variables and focuses on shared variance: it summarizes multiple variables in a small number of components, each of which explains as much of the remaining variance as possible. PCoA, on the other hand, starts from the distances between items and extracts the dimensions that reproduce those distances as faithfully as possible. When the distances are Euclidean distances computed from the original variables, the two methods produce the same map, up to mirroring of the axes.
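The relationship between the two methods can be checked numerically. The sketch below is a minimal illustration on made-up random data (not the data used in this article): it computes PCA scores from a data matrix and PCoA coordinates from the Euclidean distances between the same rows, using the double centering described in the Torgerson section further down. The function names pca_scores and pcoa_coords are placeholders chosen for this example.

```python
import numpy as np

def pca_scores(X, k=2):
    """PCA scores: project the centered data onto the first k principal axes."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

def pcoa_coords(D, k=2):
    """PCoA coordinates from a distance matrix via double centering."""
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * C @ (D ** 2) @ C
    eigvals, eigvecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvecs[:, order] * np.sqrt(eigvals[order])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                     # made-up data: 8 items, 4 variables
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distances

pca = pca_scores(X)
pcoa = pcoa_coords(D)

# Each axis may come out mirrored, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(pca), np.abs(pcoa))
print(same_up_to_sign)
```

The comparison uses absolute values because the sign of each axis is arbitrary in both methods. With non-Euclidean dissimilarities (such as the travel times used later) there is no underlying data matrix, and only PCoA applies.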

Two examples of Principal Coordinates Analysis

In the next part of the article, we’ll be working with two examples to make things clearer. The two examples are the two standard use cases for Principal Coordinates Analysis. Let’s introduce the examples before continuing.

Perceptual Mapping

The first example of Principal Coordinates Analysis that we’re going to see is a Perceptual Mapping use case. Perceptual Mapping means that you make a geographic map, but you use an unusual distance measure.

Of course, there are all types of distance measures that you can use for a map. Yet the idea of Perceptual Mapping is to create a visualization that gives you insight into dimensions other than geographical distance.

For this example, I have created a small data set with travel times by train between cities in France. Ever since I’ve been living in France, I’ve been surprised how long it takes to travel by train between some cities that are close together, whereas some long distances are very fast to travel. This can be due to conditions like the type of train that can run between two cities or geographical barriers like mountains.

I calculated train itineraries between the 10 largest cities in France to obtain the data. The goal of the analysis is to redraw a map of those French cities that is not based on geographical proximity, but that is based on travel-time-based proximity.

Product Mapping

The second example that we’ll look at is an example from product branding. We will use a simulated data set of distances between products. Imagine that you’re a company, and you want to introduce a new product. You could use this technique to map the product among existing products, to find out whether it is different enough to be introduced.

Product Mapping can be done using metric data, but it is often done using non-metric data as well. You can only use Principal Coordinates Analysis if you have data on a metric scale. For ordinal numeric data, you need to use a method called Non-Metric Multidimensional Scaling.

Torgerson method for Principal Coordinates Analysis

Before deep-diving into the code for the examples, let’s zoom in on the mathematics behind Principal Coordinates Analysis.

When starting with this method, you need data that comes down to having a so-called dissimilarity matrix. This means that you have a distance or dissimilarity measurement for each pair of items in your data set.
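Before fitting anything, it can be worth checking that your input really is a dissimilarity matrix. The helper below is a minimal sketch under the usual assumptions (square, no missing pairs, non-negative, zero diagonal, symmetric); the function name check_dissimilarity_matrix and the example values are made up for illustration.

```python
import numpy as np

def check_dissimilarity_matrix(D, tol=1e-9):
    """Return D as a float array, or raise ValueError if it is not a valid dissimilarity matrix."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError("matrix must be square")
    if np.isnan(D).any():
        raise ValueError("every pair of items needs a value")
    if (D < 0).any():
        raise ValueError("dissimilarities must be non-negative")
    if not np.allclose(np.diag(D), 0.0, atol=tol):
        raise ValueError("the distance from an item to itself must be zero")
    if not np.allclose(D, D.T, atol=tol):
        raise ValueError("matrix must be symmetric")
    return D

# Made-up example: three items with pairwise distances 3, 4 and 5.
D = check_dissimilarity_matrix([[0, 3, 5],
                                [3, 0, 4],
                                [5, 4, 0]])
```

Note that this only checks the shape of the input. Whether the values are measured on a truly metric scale is something only you can judge from how the data was collected.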

Be aware that this method can be used only for real measurements of distance. For example, if your value is twice as high, your distance must be twice as large. Therefore, you cannot use it for ordinal data like consumers filling in a measurement scale from one to five. You need to use a non-metric multidimensional scaling for that.

Once you have the metric distance matrix, you can compute your solution using the Torgerson method (when the distances are Euclidean) or otherwise using the iterative method.

The Torgerson method

The Torgerson method has two steps. You start with the distance matrix (we call it D) to which you apply a double centering. You will obtain a double-centered matrix that we call B. This is done so that the center of your newly generated mapping will be in the middle of your graph.

The Torgerson formula for the double centering starts by computing the squares of the distances:

Principal Coordinates Analysis — Torgerson method part 1

Then you compute the double-centered matrix B as follows:

Principal Coordinates Analysis — Torgerson method part 2

The matrix C is a centering matrix computed from an identity matrix (I) and a matrix of all ones (J), where n is the number of observations:

Principal Coordinates Analysis — Torgerson method part 3
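In standard notation, the three formulas of the Torgerson double centering are usually written as below. The symbol D^{(2)} for the matrix of squared distances is a naming choice made for this sketch; B, C, I, J and n are as defined in the text.

```latex
D^{(2)} = \left[ d_{ij}^{2} \right], \qquad
B = -\tfrac{1}{2}\, C \, D^{(2)} \, C, \qquad
C = I - \tfrac{1}{n}\, J
```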

Then, you apply a Singular Value Decomposition to the matrix B. Because B is symmetric, this amounts to an eigendecomposition. You take the first two dimensions and use them as axes to plot your mapping. The scores of your items on those two dimensions (the first two eigenvectors, each scaled by the square root of its eigenvalue) are the coordinates for your map.
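The two steps can be sketched in a few lines of NumPy. This is a minimal illustration and not the code behind cmdscale or scikit-bio: the function name torgerson_pcoa is made up, the example distances are invented (the corners of a 3 by 4 rectangle), and I use np.linalg.eigh instead of a general SVD, which gives the same dimensions for a symmetric matrix with non-negative eigenvalues.

```python
import numpy as np

def torgerson_pcoa(D, n_dims=2):
    """Classical (Torgerson) scaling of a distance matrix D into n_dims coordinates."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]

    # Step 1: square the distances and double-center them.
    D2 = D ** 2
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix C = I - J/n
    B = -0.5 * C @ D2 @ C

    # Step 2: decompose B and keep the largest dimensions.
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)

# Made-up distances between the four corners of a 3 x 4 rectangle.
D = np.array([[0, 3, 5, 4],
              [3, 0, 4, 5],
              [5, 4, 0, 3],
              [4, 5, 3, 0]], dtype=float)

coords = torgerson_pcoa(D)

# Distances measured on the new map, to compare with the input.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because these toy distances are exactly Euclidean in two dimensions, the distances on the map match the input, and the coordinates are centered on zero. With real data such as travel times, the map is only an approximation.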

The iterative method

The iterative method is more general and can be applied when the distances (dissimilarities) are not Euclidean. The iterative method consists of minimizing a cost function, which is defined as follows:

Principal Coordinates Analysis — the cost function for the iterative method
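To give an idea of what minimizing such a cost function looks like, here is a minimal sketch. It assumes the cost is the raw stress, meaning the sum over all pairs of the squared difference between the given dissimilarity and the distance on the map; the cost function used in practice may be normalized differently. Plain gradient descent is my own choice for illustration (dedicated algorithms exist), and all function names and values are made up.

```python
import numpy as np

def raw_stress(X, D):
    """Sum over pairs i < j of (D_ij - ||x_i - x_j||)^2."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(D.shape[0], k=1)
    return float(((D[iu] - dist[iu]) ** 2).sum())

def stress_gradient(X, D):
    """Gradient of the raw stress with respect to every coordinate in X."""
    diff = X[:, None, :] - X[None, :, :]             # x_i - x_j, shape (n, n, k)
    dist = np.maximum(np.sqrt((diff ** 2).sum(axis=-1)), 1e-12)  # avoid division by zero
    coef = 2.0 * (dist - D) / dist
    np.fill_diagonal(coef, 0.0)                      # an item has no distance to itself
    return (coef[:, :, None] * diff).sum(axis=1)

def iterative_mds(D, X0, n_iter=500, learning_rate=0.05):
    """Move the points downhill on the stress surface, starting from X0."""
    X = X0.copy()
    for _ in range(n_iter):
        X = X - learning_rate * stress_gradient(X, D)
    return X

# Made-up dissimilarities between four items.
D = np.array([[0, 3, 5, 4],
              [3, 0, 4, 5],
              [5, 4, 0, 3],
              [4, 5, 3, 0]], dtype=float)

X0 = np.random.default_rng(0).normal(size=(4, 2))    # random starting map
X = iterative_mds(D, X0)

stress_before = raw_stress(X0, D)
stress_after = raw_stress(X, D)
```

Comparing stress_before and stress_after shows how much the fit improved. An iterative solution can end in a local minimum, so it is common to try several starting configurations.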

Principal Coordinates Analysis in R

Let’s now move to the implementation. We’ll start with the data on train travel times of French cities and then move to the branding study.

Principal Coordinates Analysis in R — Example 1 — Perceptual Mapping

Let’s start with a general overview of France’s 10 largest cities for those who are not familiar with them:

Principal Coordinates Analysis — Locations of the 10 largest cities in France

For this analysis, I have used an itinerary planner to obtain the travel time from each city to the other by train. I have put those travel times (in minutes) in a distance matrix. You can obtain the data directly from an S3 bucket using the following line of code:

Principal Coordinates Analysis — obtain the distance matrix

The distance data will look as follows:

Principal Coordinates Analysis — obtain the distance matrix

Note that this data set already contains the distances. However, it is stored as a data frame, so we need to convert it into a distance matrix object as follows:

Principal Coordinates Analysis — convert the data into a real distance matrix object

The distance matrix format will look like this:

Principal Coordinates Analysis — the distance matrix object

Then, we can apply the Principal Coordinates Analysis function cmdscale from the stats package that will do all the mathematics of the Torgerson method for us, and we will obtain the coordinates for our mapping.

Principal Coordinates Analysis — fitting the model using cmdscale

You will obtain the new coordinates for each of the cities, which look as follows:

Principal Coordinates Analysis — the output coordinates

This coordinate matrix is the output of the model. Of course, a logical next step is to plot those coordinates to obtain the visual version of the mapping. We can then create the mapping as follows:

Principal Coordinates Analysis — plotting the new coordinates

In the following graph, you see the mapping that you obtain. The cities are now relocated based on travel times by train rather than by kilometers:

Principal Coordinates Analysis — The top 10 cities of France reorganized based on train travel time

Those who have looked at the original map of France will notice that there is something weird going on: cities in the east are all shown in the west and vice versa. Since Principal Coordinates Analysis is based only on distances, it does not preserve the original directions: a mirrored map reproduces exactly the same distances. We can easily flip the x-axis of the map to get east and west back in place. You can do this with the following code:

Flipping the x-axis of the plot

You will now obtain the final mapping of the top 10 French cities, reorganized based on travel times by train:

Principal Coordinates Analysis — The top 10 cities of France reorganized based on train travel time

Conclusions from the Perceptual Mapping study

This mapping allows us to see several interesting things. Firstly, most of the cities are projected much closer to each other, except for three outliers: Nice, Toulouse, and Bordeaux are pulled away from each other, which represents the longer travel times between the southern cities of France.

The cities Marseille and Montpellier are also in the south of France, yet those are moved much closer to Paris and further away from the other southern cities. This can be explained by the quick train line from Paris to Marseille.

In the Northern part of France, we see the distance from east to west being made much smaller. The distance from Nantes to Strasbourg is shown as being very small, although they are on opposite sides of the country.

Now if you really want to move forward with this mapping, you could use mapping or GIS packages like the cartography package to make this map look stunning. Be aware that the current mapping is based on a 0-centered map. Yet when you want to project onto a map of the country, you will need to make some additional decisions, including how to scale the map and where to place the center.

Principal Coordinates Analysis in R — Example 2 — Branding Study

Now let’s look at the Product Mapping example. Imagine you are a company, and you have an idea for a new product. You want to do a preliminary analysis to find out whether your product would be different enough from your existing products.

I have created a simulated data set with 10 candy products:

Principal Coordinates Analysis — Importing the data

The data contains 10 types of candy with measurements of sweetness, sourness, saltiness, and bitterness. The data looks as follows:

Principal Coordinates Analysis — the candy data

Unlike in the previous example, we do not yet have a distance matrix. Since the input for a Principal Coordinates Analysis is a distance matrix, we need to compute that distance matrix first, based on the data. The dist function in R computes the Euclidean distances between observations, as follows:

Principal Coordinates Analysis — computing a distance matrix

The distance matrix looks like this:

Principal Coordinates Analysis — the computed distance matrix

Now we need to fit the Principal Coordinates Analysis using cmdscale. The code is shown below:

Principal Coordinates Analysis — fitting the model

You will obtain the coordinates of each of the 10 candies in a matrix:

Principal Coordinates Analysis — the new coordinates for each candy

You can generate the plot of the 10 candies on the two dimensions of the Principal Coordinates Analysis as follows:

Principal Coordinates Analysis — plotting the candy analysis

You should obtain the following plot:

Principal Coordinates Analysis — plotting the candy mapping

You can get some interesting insights from this graph. Most of the candies are grouped at the bottom of the graph. There is one very different candy: candy 5. Then for the other candies, we might distinguish two groups: one group on the bottom right (candies 2, 3, 4, and 1) and a group on the bottom left (candies 10, 9, 6, 7, and 8).

To find out a bit more about what the dimensions actually mean, it can be interesting to analyze correlations between the original variables and the two dimensions. This can be done as follows:

Principal Coordinates Analysis — analyzing the dimensions

This gives us the following correlation matrix:

Principal Coordinates Analysis — correlation matrix

This tells us that the first dimension is strongly correlated with Sweetness and Sourness. The second dimension is mainly representing Saltiness.

Conclusions from the Candy study

We can conclude two things for the question of our candy company:

  • Firstly, the company does not yet have candy in the top right of the graph. It may be interesting for them to study whether this would have any added value. This would be a candy that scores high on both dimensions. This candy could for example be a Sweet/Salty combination.
  • Secondly, we can conclude that Bitterness is not really represented in the existing dimensions. This suggests that Bitterness does not vary much across the existing candies: if it did, it would probably have had a stronger presence in one of the dimensions. It may be interesting for the company to look into Bitter candies to expand their product range.

Principal Coordinates Analysis in Python

Now let’s see an alternative implementation in Python for the two short examples that we have just covered with R.

Principal Coordinates Analysis in Python — Example 1 — Perceptual Mapping

The first step is to import the data. You can use the following code to obtain the city distances directly from my S3 bucket:

Principal Coordinates Analysis — importing the data

The data are a distance matrix with the travel times in minutes from each city to the other. It looks as follows:

Principal Coordinates Analysis — the distance matrix

The Python function that we’re going to use for the Principal Coordinates Analysis can only take a symmetrical distance matrix. This means that we have to fill in the NAs with the corresponding values. This is easy to do by replacing the NAs by 0 and doing a sum of the original matrix and the transposed matrix:

Principal Coordinates Analysis — converting the half distance matrix to the symmetric distance matrix
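As a minimal sketch of this step, here is the same trick on a made-up 3 by 3 half matrix (these are not the travel times from the article). I use plain NumPy here; on a pandas data frame, fillna(0) plays the same role as np.nan_to_num.

```python
import numpy as np

# Made-up half matrix: each pair is stored once, the other half is missing.
half = np.array([
    [0.0, np.nan, np.nan],
    [3.0, 0.0,    np.nan],
    [5.0, 4.0,    0.0],
])

filled = np.nan_to_num(half, nan=0.0)   # replace the missing half by zeros
full = filled + filled.T                # mirror the known half onto the empty half
```

This only works because the diagonal is zero and each pair appears in exactly one half. If both halves were already filled in, the sum would double every distance.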

The outcome looks as follows:

Principal Coordinates Analysis — the symmetric distance matrix

Now we get to the modeling. You can use the scikit-bio package for your Principal Coordinates Analysis. You can use the code below to install and import the package and to fit the model. Finally, you print the coordinates in the first two dimensions using the .samples attribute:

Principal Coordinates Analysis — fitting the model using scikit-bio

The new coordinates for the 10 cities look like this:

Principal Coordinates Analysis — showing the coordinates of the 10 cities

Finally, we want to create a plot of those 10 coordinates. You can use the following code to plot the cities with matplotlib:

Principal Coordinates Analysis — plotting the result
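A plotting helper along these lines is one way to draw such a map. It is a sketch with made-up coordinates and labels; the function name plot_mapping is hypothetical. The one deliberate design choice is the equal aspect ratio, so that one unit means the same distance on both axes.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mapping(coords, labels, ax=None):
    """Scatter the items on the first two dimensions and label each point."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))
    ax.set_xlabel("Dimension 1")
    ax.set_ylabel("Dimension 2")
    ax.set_aspect("equal")               # distances are only readable on equal scales
    return ax

# Made-up coordinates for four hypothetical items.
coords = np.array([[-1.5, -2.0], [1.5, -2.0], [1.5, 2.0], [-1.5, 2.0]])
ax = plot_mapping(coords, ["A", "B", "C", "D"])
```

Call plt.show() afterwards to display the figure, or ax.figure.savefig with a file name to write it to disk.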

The resulting map is shown below. It is the same output as the one obtained by R, except that it is mirrored. I will not repeat the conclusions, as they will be exactly the same conclusions as we’ve seen in the R analysis above.

Principal Coordinates Analysis — mapping the cities based on travel times by train
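The mirroring is harmless because the method only knows about distances, and flipping an axis does not change any of them. A quick check on made-up coordinates illustrates this (the helper name pairwise_distances is chosen for this sketch):

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Made-up map coordinates for four hypothetical items.
coords = np.array([[-1.5, -2.0], [1.5, -2.0], [1.5, 2.0], [-1.5, 2.0]])

mirrored = coords * np.array([-1.0, 1.0])   # negate the first axis only

unchanged = np.allclose(pairwise_distances(coords), pairwise_distances(mirrored))
```

So if you want the Python map to match the R map, you can multiply the relevant column of coordinates by -1 before plotting.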

Principal Coordinates Analysis in Python — Example 2 — Branding Study

Now, for completeness, let’s also do a Python implementation for the candy branding study. As previously, you need to start by importing the data. You can get them with the following code:

Principal Coordinates Analysis — importing the candy data

The data will look like this:

Principal Coordinates Analysis — the candy data in Python

An additional step that is needed here is the computation of the distance matrix. In Python, you can compute pairwise distances (between each pair of rows) using pdist from scipy.spatial.distance. However, this function returns the distances as a condensed one-dimensional vector rather than as a symmetric matrix. You have to apply the function squareform to convert it into a symmetric matrix:

Principal Coordinates Analysis — compute the symmetrical distance matrix in Python
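The difference between the two formats is easiest to see on a tiny example. The sketch below uses three made-up observations with two variables each, not the candy data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data: three observations (rows) measured on two variables (columns).
X = np.array([[0.0, 0.0],
              [3.0, 0.0],
              [3.0, 4.0]])

# Condensed form: one value per pair, in the order (0,1), (0,2), (1,2).
condensed = pdist(X, metric="euclidean")

# Square form: full symmetric matrix with zeros on the diagonal.
square = squareform(condensed)
```

For n observations the condensed vector has n(n-1)/2 entries, while the square matrix has n rows and n columns.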

You will obtain the following NumPy array:

Principal Coordinates Analysis — compute the symmetrical distance matrix

Now that you have this matrix, you can move on to fitting the model. We’ll use the skbio package again and plot the results with matplotlib:

Principal Coordinates Analysis — fitting the model with skbio

You will obtain the same graph as the one produced by the equivalent R code. The Python graph is shown below:

Principal Coordinates Analysis — the resulting mapping of the 10 candies

The conclusions are the same as listed in the R analysis above.

Conclusion

Using Principal Coordinates Analysis, we have visualized the 10 largest cities of France and created an alternative map of France based on travel times by train.

We have also used Principal Coordinates Analysis to analyze product branding for a company that has 10 candy products on the market. We mapped their existing products and this allowed us to identify potential niches for new products to put onto the market.

In short, Principal Coordinates Analysis is a great method for exploring data and it allows you to make data visualizations that are closely linked to specific questions. Principal Coordinates Analysis can be a very useful tool if you know how and when to use it. I hope that this is the case after reading this article!

I hope that this article was useful for you. Don’t hesitate to stay tuned for more maths, stats, and data content!
