To obtain the same non-random split of train, validation, and test data as the one in R, you can use the code below. You can also go for a random sample using scikit-learn’s train_test_split function if you don’t mind having slightly different outcomes.
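As a minimal sketch of both options, the snippet below uses a synthetic stand-in for the meats data. The sample count and the split boundaries (120/40/40) are assumptions for illustration only, not the sizes used in the R example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical sizes): 200 samples, 100 predictors, 3 targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
Y = rng.normal(size=(200, 3))

# Option 1: non-random split, slicing rows in their original order
n_train, n_val = 120, 40  # illustrative boundaries
X_train, Y_train = X[:n_train], Y[:n_train]
X_val, Y_val = X[n_train:n_train + n_val], Y[n_train:n_train + n_val]
X_test, Y_test = X[n_train + n_val:], Y[n_train + n_val:]

# Option 2: random split, calling train_test_split twice
X_rest, X_test_r, Y_rest, Y_test_r = train_test_split(
    X, Y, test_size=40, random_state=42
)
X_train_r, X_val_r, Y_train_r, Y_val_r = train_test_split(
    X_rest, Y_rest, test_size=40, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

The ordered slices reproduce a fixed split exactly, whereas the random variant only reproduces for a given random_state.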
Now, the complete model can be estimated as follows:
Just like we did in R, you need to tune the number of components. The plot of predictive error as a function of the number of components can be created as follows:
It will give you the following graphs that show an optimal number of components somewhere in the range of 15 to 20.
Now, it can be nice to visualize how the results differ for different numbers of components. As there are 100 coefficients in the model, the raw numbers are hard to read. The following plot shows the coefficients instead:
The plot looks as follows for the three dependent variables from left to right (water, fat, protein) and the number of components (1 = blue, 2 = orange, 3 = green).
You can see that the model becomes more complex with more components involved, just like we saw with the R code.
Now, let’s move on to the Grid Search to find the best value for n_comp, the number of components. We compute an R2 score on the validation data for each of the possible values of n_comp as follows:
The best R2 score that we can obtain is 0.9431952353094432, with n_comp at 15.
As a final validation of our model, let’s verify if we obtain a comparable score on the test data set:
The obtained R2 score is 0.95628, which is even slightly higher than the validation score. We can be confident that the model does not overfit and that we have found the right number of components to make a performant model.
If this error is acceptable for meat testing purposes, we could confidently replace hand-made measurements of water, fat, and protein with the automated chemometrics measurements in combination with this PLS Regression. The PLS Regression would then serve to convert chemometrics measurements into estimations of water, fat, and protein contents.
Partial Least Squares Discriminant Analysis Example
For this second example, we’ll be doing an explanatory model: a model that focuses on understanding and interpreting the components rather than obtaining predictive performance.
The data is a data set on olive oil. The goal is to see if we can predict the country of origin based on chemical measurements and sensory measurements. This model will allow us to understand how to differentiate olive oils from different countries based on chemical and sensory measurements.
The dependent variable (country of origin) is categorical, which makes it a great case for Discriminant Analysis because this is a method in the family of classification models.
Partial Least Squares Discriminant Analysis R
In R, you can obtain the Olive Oil data set as soon as you import the pls library. You can do this as follows:
The data looks as follows. It contains two matrices: a chemical matrix with 5 variables of chemical measurements and a sensory matrix with 6 variables of sensory measurements:
Of course, this data format is not ideal for our use case, so we need to make the data frame of two matrices into just one matrix. You can do this as follows:
The resulting matrix looks like this. As you can see it automatically contains the column names:
The country is the first letter of the rownames (G for Greece, I for Italy, and S for Spain). Here is an easy way to create the Y data as a factor. Factors are categorical data variables in R.
Now, we get to the model. We will use the caret library for fitting the model. Caret is a great library that contains lots of Machine Learning models and also a lot of tools for model development. If you’re interested in caret, you could check out this article that compares R’s caret against Python’s scikit-learn.
You can use the plsda function in caret to fit the model, as shown below:
The next step is to obtain the biplot so that we can interpret the meanings of the components and analyze the placement of the individuals at the same time.
To interpret the biplot, you need to look at directions. To understand what this means, try to draw an imaginary line from the middle of the graph to each label. The smaller the angle between two imaginary lines, the more closely related the two items are. The distance from the middle tells you whether a weight is strong or weak.
Tip: You could quickly scroll down to the Python biplot to understand the idea of the imaginary lines better!
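The idea of angles and distances can be made concrete with a few lines of NumPy. The coordinates below are made up for illustration and are not taken from the olive oil model; angle_deg is a hypothetical helper.

```python
import numpy as np

# Hypothetical biplot coordinates (component 1, component 2) for three labels
points = {
    "green": np.array([-0.8, 0.1]),
    "yellow": np.array([0.7, 0.2]),
    "brown": np.array([0.05, -0.6]),
}


def angle_deg(a, b):
    """Angle in degrees between the imaginary lines from the origin to a and b."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


# Distance from the middle: how strongly a label weighs on the two components
for name, vec in points.items():
    print(name, "distance from origin:", round(float(np.linalg.norm(vec)), 2))

# A wide angle means the two labels point in opposing directions
print("green vs yellow:", round(float(angle_deg(points["green"], points["yellow"])), 1))
print("green vs brown:", round(float(angle_deg(points["green"], points["brown"])), 1))
```

An angle close to 0 degrees means two labels go together, close to 180 degrees means they oppose each other, and close to 90 degrees means they are largely unrelated on these two components.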
Interpreting the Partial Least Squares Biplot
When using Partial Least Squares for explanatory use cases, a lot can be concluded from the biplot. We sometimes look at more than two dimensions, but things can quickly get complicated from the third dimension onward. To keep things simple, let’s stick to two dimensions here.
1. Interpreting the first dimension
The first question that we generally ask ourselves is about the meaning and interpretation of the first dimension. We can define this by looking at the variables (red labels on the plot) that are strongly related to the first dimension. To see those, we can take a variable far on the left (low score on comp 1) and a variable far on the right (high score on comp 1).
In our case, the first component goes from green on the left to yellow on the right. Apparently, the split between green and yellow is important in olive oil!
To confirm this, let’s now look at whether there is a trend in individuals (black labels on the plot). We see a lot of Greek oils on the left, whereas we see a lot of Spanish oils on the right. This means that the split between yellow and green olive oils allows us to distinguish between Greek and Spanish olive oils!
An interesting insight in terms of variables is that there seems to be a very prominent color gradient in olive oils from yellow to green, whereas brown is not represented by the first dimension.
2. Interpreting the second dimension
Now, let’s see what we can learn from the second dimension. First, let’s find some representative variables. To do this, we need to find variables that score very high or very low on dimension 2.
We can see that dimension 2 goes from brown on the bottom to glossy and transparent on the top. Apparently, there is an important gradient in olive oils with brown oils on one end and glossy and transparent ones on the other side.
To learn something about the countries, let’s see how the individuals are distributed along the axis of dimension 2. The split is less obvious than the one from dimension 1. Yet when looking carefully, we can see that Italian olive oils are generally browner than other oils, while non-Italian olive oils are generally more glossy and transparent.
An interesting insight in terms of variables is that there is no gradient from brown to another color, but rather from brown on one side to glossy and transparent on the other side. Of course, in reality, an olive oil expert would collaborate on such a study to help interpret the findings.
3. Interpretation of the types of variables in the main components
What we have seen here are the dimensions that are defined as most important by the model. We can note that the most important components mostly consist of sensory variables. This means that sensory characteristics seem to work well for detecting the source country of an olive oil. This is also a very interesting finding!
Partial Least Squares Discriminant Analysis Python
The Olive Oil data set is built into the R pls library. I have put a copy on my S3 bucket to make it easy to import with Python as well. (Please see the notice for this data set higher up in case you want to distribute it somewhere else.)
You can import the data into Python using the following code:
The data looks like this. I have added the countries as a variable, which is not the case in the original R dataset.
PLS Discriminant analysis in Python is actually done by doing a PLS Regression on a categorical variable that is transformed into a dummy. Dummies transform a categorical variable into a variable for each category with 1 and 0 values: 1 if the row belongs to this category and 0 otherwise.
Dummy encoding is done because 0 and 1’s are much easier to use in many Machine Learning models and a set of dummy variables contains the exact same information as the original variable.
In modern Machine Learning jargon, creating dummies is also called one-hot-encoding.
You can create dummies using Pandas as follows:
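A minimal sketch with pandas.get_dummies is shown here. The tiny data frame is a made-up stand-in with illustrative column names and values, not the actual olive oil data.

```python
import pandas as pd

# Made-up stand-in for the olive oil data (values are illustrative only)
data = pd.DataFrame({
    "country": ["G", "G", "I", "I", "S", "S"],
    "yellow": [20.0, 25.0, 30.0, 35.0, 50.0, 45.0],
    "green": [70.0, 65.0, 55.0, 60.0, 30.0, 40.0],
})

# One 0/1 column per country; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(data["country"], prefix="country", dtype=int)

# Replace the categorical column with its dummy columns
data = pd.concat([data.drop(columns="country"), dummies], axis=1)
print(data)
```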
The data will now contain three variables for the countries: one for each country. The values are 1 if the row belongs to this country and 0 otherwise:
The next step is to split the data in an X data frame and a Y data frame, because this is required by the PLS model in Python:
Now we get to the model. We’ll use the PLSRegression from the scikit-learn package. We can fit it directly with 2 components to give the same interpretation as we did in the R example.
The difficulty with multivariate statistics in Python is often the plot creation. In the following code block, a biplot is created. It is coded step by step:
- First we obtain the scores. Scores represent how high each individual olive oil scores on each dimension. The scores will allow us to plot the individuals in the biplot.
- We need to standardize the scores to make them fit on the same plot as the loadings.
- Then we obtain the loadings. The loadings contain the weights of each variable on each component. They will allow us to plot the variables on the biplot.
- We then loop through each individual and each variable and we plot an arrow and a label for them. The dimension 1 score or loading will become the x-coordinate on the plot and the dimension 2 score or loading will become the y-coordinate on the plot.
The code is shown here:
The resulting biplot is shown below:
The R interpretation of the biplot will give you the same findings as the Python biplot. To quickly recap the findings:
- Dimension 1 (x-axis) has green on one side and yellow on the other. The split between yellow and green oil is therefore important.
- On the green side, there are a lot of Greek oils and on the yellow side, there are a lot of Spanish oils. The green/yellow split allows differentiating Greek and Spanish olive oils.
- Dimension 2 (y-axis) has brown on one side and glossy and transparent on the other side. Apparently, oils are either brown or glossy and transparent, but not both at the same time.
- Italian oils tend to be on the brown side, whereas Spanish and Greek oils tend to be on the glossy and transparent side.
In this article, you have first seen an overview of the (many) variants of Partial Least Squares that exist. Furthermore, you have seen in-depth explanations and implementations for two ways to use Partial Least Squares:
- Partial Least Squares as a Machine Learning algorithm for predictive performance in the Meats example
- Partial Least Squares for interpretation in the olive oils example
You have also seen how to use Partial Least Squares with different types of dependent variables:
- Partial Least Squares Regression for the numeric dependent variables in the meats use case
- Partial Least Squares Discriminant Analysis for the categorical dependent variables in the olive oil use case
By using both R and Python implementations for both examples, you now have the needed resources to apply Partial Least Squares on your own use cases!
For now, thanks for reading and don’t hesitate to stay tuned for more math, stats and data science content!