Machine Learning


Photo by Dmitriy Ermakov on Unsplash

How To Predict Lung Cancer with Weka and Python

Table of Contents:

I. Dataset Information
II. Downloading the Data
III. Python Data Prep
IV. Weka Implementation
V. Accuracy
VI. Confusion Matrix
VII. References

I. Dataset Information

The following information is listed on the source website.


Francesca Grisoni, Claudia S. Neuhaus, Miyabi Hishinuma, Gisela Gabernet, Jan A. Hiss, Masaaki Kotera, Gisbert Schneider
contact: Francesca Grisoni, ETH Zurich, francesca.grisoni ‘@’

Data Set Information:

Membranolytic anticancer peptides (ACPs) are drawing increasing attention as potential future therapeutics against cancer, due to their ability to hinder the development of cellular resistance and their potential to overcome common hurdles of chemotherapy, e.g., side effects and cytotoxicity.
This dataset contains information on peptides (annotated for their one-letter amino acid code) and their anticancer activity on breast and lung cancer cell lines.

Two peptide datasets targeting breast and lung cancer cells were assembled and curated manually from CancerPPD. EC50, IC50, LD50 and LC50 annotations on breast and lung cancer cells were retained (breast cell lines: MCF7 = 57%, MDA-MB-361 = 11%, MT-1 = 9%; lung cell lines: H-1299 = 45%, A-549 = 17.7%); mg ml−1 values were converted to μM units. Linear and l-chiral peptides were retained, while cyclic, mixed or d-chiral peptides were discarded. In the presence of both amidated and non-amidated data for the same sequence, only the value referred to the amidated peptide was retained.

Peptides were split into three classes for model training: (1) very active (EC/IC/LD/LC50 ≤ 5 μM), (2) moderately active (EC/IC/LD/LC50 values up to 50 μM) and (3) inactive (EC/IC/LD/LC50 > 50 μM) peptides.
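The three thresholds above can be written as a small labeling function (a sketch; the value is assumed to be in μM, and the class names are informal labels matching the dataset description):

```python
def activity_class(value_um: float) -> str:
    """Map an EC/IC/LD/LC50 value (in micromolar) to an activity class."""
    if value_um <= 5:
        return "very active"        # class (1): EC/IC/LD/LC50 <= 5 uM
    elif value_um <= 50:
        return "moderately active"  # class (2): values up to 50 uM
    else:
        return "inactive"           # class (3): EC/IC/LD/LC50 > 50 uM
```

For example, activity_class(3.2) returns "very active".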

Duplicates with conflicting class annotations were compared manually to the original sources, and, if necessary, corrected. If multiple class annotations were present for the same sequence, the most frequently represented class was chosen; in case of ties, the less active class was chosen. Since the CancerPPD is biased towards the annotation of active peptides, we built a set of presumably inactive peptides by randomly extracting 750 alpha-helical sequences from crystal structures deposited in the Protein Data Bank (7–30 amino acids).

The final training sets contained 949 peptides for Breast cancer and 901 peptides for Lung cancer.
The datasets were used to develop neural network models for anticancer peptide design and are provided as .csv files in a .zip folder.

Additional details can be found in: Grisoni, F., Neuhaus, C.S., Hishinuma, M., Gabernet, G., Hiss, J.A., Kotera, M. and Schneider, G., 2019. De novo design of anticancer peptides by ensemble artificial neural networks. Journal of Molecular Modeling, 25(5), 112.

II. Downloading the Data

Download the zip file from the link below:

Unzip the downloaded folder and place the lung cancer file in a location of your choice to work with:

III. Python Data Prep

Open up Spyder or any Python IDE of your choice.

Go ahead and start by importing. In this case, you will need pandas:

import pandas as pd

Next, you can import the dataset (the path below is a placeholder; replace it with wherever you placed the file):

df = pd.read_csv('path/to/lung_cancer_data.csv')

You can now check it out in the variable explorer to see how the columns look:

If you are not in the Spyder IDE, you can also use:

df.head()

This will give you the first five rows to look at:

As you can see, we have the One-letter amino-acid sequence in the column labeled sequence.

The objective is to predict the class. What we are going to do is go ahead and split the sequence. In order to do this, let us quickly walk through another example in python.

Imagine a program that starts with the variable name; for my example, I will use my own name:

name = 'Ashutosh'

name is a string variable whose content is the word Ashutosh. What I want is a list of each character in the word Ashutosh separated out. To do this, we can simply use the built-in list function in Python like so:

list(name)

In this case, the output will be a list with each character like so:

['A', 's', 'h', 'u', 't', 'o', 's', 'h']

Alright, so now in the case of our pandas dataframe, we have to apply this to every row. Luckily, pandas has a function for this called apply. Let’s use it:

df2=df['sequence'].apply(lambda x: pd.Series(list(x)))

This will create a pandas dataframe, but only out of the peptide sequence column.
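Since peptides vary in length, pd.Series(list(x)) pads the shorter rows with NaN out to the length of the longest sequence. A toy run with made-up sequences shows this:

```python
import pandas as pd

# Toy frame with made-up sequences of different lengths.
df = pd.DataFrame({"sequence": ["ALK", "KKLAA"]})

# Same split as above: one column per residue position.
df2 = df["sequence"].apply(lambda x: pd.Series(list(x)))

# The shorter sequence is padded with NaN out to the longest length (5 here).
print(df2)
```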

Here is how it will look if we type df2.head().

As you can see, we don’t have the class column. Let’s go ahead and insert the class column back in and then call df2.head() again like so:
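One simple way to put it back (a sketch using toy sequences and labels; in the real frame the label column is named class, as in this dataset):

```python
import pandas as pd

# Toy stand-in for the earlier steps (made-up sequences and labels).
df = pd.DataFrame({"sequence": ["ALK", "KKLAA"],
                   "class": ["inactive", "very active"]})
df2 = df["sequence"].apply(lambda x: pd.Series(list(x)))

# Copy the label column from the original frame onto the split frame;
# the row index lines up, so this is a straight column assignment.
df2["class"] = df["class"]
print(df2.head())
```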


This time your output will look like this:

39 columns with the last column being class.

Alright, we have finished our data prep. Now there is a final step: in order for Weka to ingest it, we have to export this prepped file. Make sure you include the index=False argument, since the index plays no role in predicting the class and cannot be properly imported into Weka unless you rename it:

df2.to_csv('path/to/lung_cancer_cleaned.csv', index=False)

Now, you will have a cleaned csv file in that location for weka to pick up.

IV. Weka Implementation

If you don’t have Weka, you will need to download and install it. A quick Google search for “Weka download” will surface the official source.

Open up Weka after you have installed it:

Click on the “Explorer” button above and then as shown in the picture below, click on the Open File button:

Once you have clicked on Open File, a browse window will open, browse over to the cleaned csv file we just made and open it:

Once the file is in, you will see all columns. You can also see the distribution statistics for each column. Here is an example for the class column:

This class distribution tells us something. Note that 750 out of 901 instances belong to the Class “Inactive — virtual”. The data here is unbalanced in terms of class distribution.

750/901 = 83.2%

This means that an algorithm with predictive accuracy of around 83% has done nothing better than always guessing the majority class. For an algorithm to be considered to have good predictive capability, it has to exceed the accuracy implied by the class distribution in the data.
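The baseline arithmetic above can be double-checked in a couple of lines (counts taken from the Weka class distribution):

```python
# 750 of 901 instances are "Inactive - virtual" (from the Weka distribution).
majority, total = 750, 901

# Accuracy of a classifier that always predicts the majority class.
baseline = majority / total
print(f"Majority-class baseline: {baseline:.1%}")  # -> 83.2%
```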

Go ahead and click the Classify tab at the top:

Next, click on the “Choose” button:

Under trees, pick the RandomForest algorithm.

In your test options, pick Percentage split. 66% for training and 34% for testing is a decent split. This just means the algorithm will train on a random 66% of the data and then evaluate its predictive capability on the remaining 34%, which serves as the test data.

Also, make sure the drop-down above the Start button is set to the class variable. What you select in this drop-down is what the algorithm will predict, i.e., this is where you choose the target variable to classify into. Since the variable we are predicting is named “class”, pick that one.
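If you would rather stay in Python end to end, the same experiment can be approximated with scikit-learn’s RandomForestClassifier. This is a sketch, not the article’s Weka workflow: it uses a toy dataframe here, but pointing pd.read_csv at the cleaned CSV exported earlier would reproduce the real run.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned file; in practice, load it with pd.read_csv.
data = pd.DataFrame({
    0: list("AAKKLLAAKKLLAAKK"),
    1: list("LKLKLKLKLKLKLKLK"),
    "class": ["very active", "inactive"] * 8,
})

# One-hot encode the amino-acid letters so the model gets numeric features.
X = pd.get_dummies(data.drop(columns=["class"]).astype(str))
y = data["class"]

# 66/34 split, mirroring Weka's percentage-split test option.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"Test accuracy: {acc:.1%}")
```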

V. Accuracy

Now hit the “Start” button. It will run for a couple of seconds and then give you an output such as below:

Note our accuracy using the random forest is 94.1176%, far higher than an algorithm classifying everything into the lopsided majority class, which would get around 83% as we saw earlier in the class distribution. This margin over the baseline means the random forest performed far better than simply guessing.

VI. Confusion Matrix

Here is the confusion matrix given by Weka in the same output window. This matrix tells us, for each class, what went right and what went wrong: which wrong class was picked and how many instances fell into it.

Accuracy by class:

A --> 20 / (20+5+1+3) = 20/29 = 69.0%
B --> 7 / (3+7+2+2) = 7/14 = 50.0%
C --> 5 / (2+0+5+0) = 5/7 = 71.4%
D --> 256 / (0+0+0+256) = 256/256 = 100%
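These per-class numbers come straight from the matrix: divide each diagonal entry by its row total. With NumPy:

```python
import numpy as np

# Weka's confusion matrix: rows are actual classes A-D, columns are predicted.
cm = np.array([
    [20, 5, 1,   3],
    [ 3, 7, 2,   2],
    [ 2, 0, 5,   0],
    [ 0, 0, 0, 256],
])

# Per-class accuracy (recall) = correct predictions / actual instances.
recall = np.diag(cm) / cm.sum(axis=1)
for label, r in zip("ABCD", recall):
    print(f"{label}: {r:.1%}")
```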

As you can see, the best accuracy was in predicting the inactive-virtual class. This is primarily because the original class distribution is lopsided toward it, which is common in medical datasets. With that said, note that the model predicted moderately active peptides at a rate of 69.0% and very active peptides at an even better rate of 71.4%.

VII. References


2- Grisoni, F., Neuhaus, C.S., Hishinuma, M., Gabernet, G., Hiss, J.A., Kotera, M. and Schneider, G., 2019. De novo design of anticancer peptides by ensemble artificial neural networks. Journal of Molecular Modeling, 25(5), 112.

Thank you for reading, and I hope you will follow me and check out my other work. Posting some links below:




