Machine Learning for Stock Trading: Unsupervised Learning Techniques


This example employs several unsupervised learning techniques in scikit-learn to extract the stock market structure from variations in historical close prices. The quantity that we use is the daily variation in close prices because prices that are linked tend to cofluctuate during a day.

Jupyter Notebooks are available on Google Colab and GitHub.

For this project, we use several Python-based scientific computing technologies listed below.

import requests
import numpy as np
import pandas as pd
import pymc3 as pm
import theano as th
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from sklearn.preprocessing import MinMaxScaler
from sklearn import cluster, covariance, manifold
%matplotlib notebook
import warnings

1. Define the Stock Universe

We start by constraining our search for pairs to a large, liquid single-stock universe. To achieve this, we create a function that scrapes the tickers of the S&P 500, and we then clean the tickers by replacing any . with a - so we can easily use them in AlphaWave Data’s APIs.

# Scrape the S&P 500 tickers from Wikipedia
def get_tickers():
    # The page URL is empty here; supply the Wikipedia page that lists
    # the S&P 500 constituents
    wiki_page = requests.get('').text
    sp_data = pd.read_html(wiki_page)
    ticker_df = sp_data[0]
    ticker_options = ticker_df['Symbol']
    return ticker_options

# Run the ticker scrape function
# Let's convert the get_tickers() output to a list and
# replace tickers that have '.' with '-' so we can use AlphaWave Data APIs
stock_tickers = get_tickers()
stock_tickers = stock_tickers.to_list()
for ticker in range(len(stock_tickers)):
    stock_tickers[ticker] = stock_tickers[ticker].upper().replace(".", "-")

After scraping all the tickers in the S&P 500, let’s reduce the ticker list with the code below in order to present a simplified stock market visualization. This step is optional and is included only as an example.

symbol_dict = {
'TOT': 'Total',
'XOM': 'Exxon',
'CVX': 'Chevron',
'COP': 'ConocoPhillips',
'VLO': 'Valero Energy',
'MSFT': 'Microsoft',
'IBM': 'IBM',
'TWX': 'Time Warner',
'CMCSA': 'Comcast',
'CVC': 'Cablevision',
'YHOO': 'Yahoo',
'DELL': 'Dell',
'HPQ': 'HP',
'AMZN': 'Amazon',
'TM': 'Toyota',
'CAJ': 'Canon',
'MTU': 'Mitsubishi',
'SNE': 'Sony',
'F': 'Ford',
'HMC': 'Honda',
'NAV': 'Navistar',
'NOC': 'Northrop Grumman',
'BA': 'Boeing',
'KO': 'Coca Cola',
'MMM': '3M',
'MCD': "McDonald's",
'PEP': 'Pepsi',
'MDLZ': 'Kraft Foods',
'K': 'Kellogg',
'UN': 'Unilever',
'MAR': 'Marriott',
'PG': 'Procter & Gamble',
'CL': 'Colgate-Palmolive',
'GE': 'General Electric',
'WFC': 'Wells Fargo',
'JPM': 'JPMorgan Chase',
'AIG': 'AIG',
'AXP': 'American Express',
'BAC': 'Bank of America',
'GS': 'Goldman Sachs',
'AAPL': 'Apple',
'SAP': 'SAP',
'CSCO': 'Cisco',
'TXN': 'Texas Instruments',
'XRX': 'Xerox',
'LMT': 'Lockheed Martin',
'WMT': 'Wal-Mart',
'WBA': 'Walgreen',
'HD': 'Home Depot',
'GSK': 'GlaxoSmithKline',
'PFE': 'Pfizer',
'SNY': 'Sanofi-Aventis',
'NVS': 'Novartis',
'KMB': 'Kimberly-Clark',
'R': 'Ryder',
'GD': 'General Dynamics',
'RTN': 'Raytheon',
'CVS': 'CVS',
'CAT': 'Caterpillar',
'DD': 'DuPont de Nemours'}
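stock_dict_list = []
for key in symbol_dict.keys():
    # Collect the tickers we want to keep
    stock_dict_list.append(key)

# Keep only the S&P 500 tickers that also appear in symbol_dict
stock_tickers = list(set(stock_tickers) & set(stock_dict_list))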

2. Retrieve Stock Price Data

We can use the 5 Year Historical Daily Prices endpoint from the AlphaWave Data Stock Prices API to pull in five years of historical prices. From these, we calculate the daily returns for each selected stock. With these returns, we can produce a 2D graph of the stock market.

To call this API with Python, you can choose one of the supported Python code snippets provided in the API console. The following is an example of how to invoke the API with Python Requests. You will need to insert your own x-rapidapi-host and x-rapidapi-key information in the code block below.

# Fetch 5 years of daily price data
# The endpoint URL is empty here; insert the 5 Year Historical Daily Prices endpoint
url = ""
# Insert your own x-rapidapi-host and x-rapidapi-key values (placeholders below)
headers = {
    'x-rapidapi-host': "YOUR_X-RAPIDAPI-HOST",
    'x-rapidapi-key': "YOUR_X-RAPIDAPI-KEY"
}

stock_frames = []
for ticker in stock_tickers:
    querystring = {"ticker":ticker}
    stock_daily_price_response = requests.request("GET", url, headers=headers, params=querystring)
    # Create Stock Prices DataFrame
    stock_daily_price_df = pd.DataFrame.from_dict(stock_daily_price_response.json())
    stock_daily_price_df = stock_daily_price_df.transpose()
    stock_daily_price_df = stock_daily_price_df.rename(columns={'Close':ticker})
    stock_daily_price_df = stock_daily_price_df[[ticker]]
    # Collect each single-column frame so they can be combined below
    stock_frames.append(stock_daily_price_df)

combined_stock_price_df = pd.concat(stock_frames, axis=1, sort=True)
combined_stock_price_df = combined_stock_price_df.dropna(how='all')
# Forward-fill missing prices so every column stays numeric for pct_change
combined_stock_price_df = combined_stock_price_df.ffill()
pct_change_combined_stock_df = combined_stock_price_df.pct_change()
pct_change_combined_stock_df = pct_change_combined_stock_df.dropna()
variation = pct_change_combined_stock_df
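As a small illustration of the return calculation, the sketch below applies pct_change to a made-up price table. The tickers AAA and BBB and their prices are hypothetical and are not data from the API.

```python
import pandas as pd

# Hypothetical close prices for two made-up tickers (not real market data)
toy_prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 99.96], "BBB": [50.0, 49.0, 50.47]},
    index=["2020-01-02", "2020-01-03", "2020-01-06"],
)

# Daily variation: (today's close - yesterday's close) / yesterday's close
toy_returns = toy_prices.pct_change().dropna()
print(toy_returns.round(4))
```

The first row is dropped because there is no previous close to compare against, which is why the article calls dropna() after pct_change().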

3. Unsupervised Learning Techniques

3.a Learning a Graph Structure

We use sparse inverse covariance estimation to find which close prices are correlated conditionally on the others. Specifically, sparse inverse covariance gives us a graph, that is a list of connections. For each symbol, the symbols that it is connected to are those useful to explain its fluctuations.

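# GraphLassoCV was renamed GraphicalLassoCV in scikit-learn 0.20
edge_model = covariance.GraphicalLassoCV(verbose=True)
X = variation.copy()
# Standardize each series so that all stocks are on the same scale
X /= X.std(axis=0)
# Fit the sparse inverse covariance model on the standardized returns
edge_model.fit(X)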
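To make "correlated conditionally on the others" more concrete, here is a minimal sketch on synthetic data. It uses a plain matrix inverse of the sample covariance rather than the graphical lasso, so it only illustrates the idea of partial correlation; the sparse estimator additionally shrinks weak connections to exactly zero. The three series a, b, and c are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical chain of dependence: a drives b, and b drives c
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
X_toy = np.column_stack([a, b, c])

# Marginal correlation: a and c look related
marginal = np.corrcoef(X_toy, rowvar=False)

# Partial correlation from the precision (inverse covariance) matrix:
# rho_ij = -P_ij / sqrt(P_ii * P_jj)
precision = np.linalg.inv(np.cov(X_toy, rowvar=False))
scale = 1 / np.sqrt(np.diag(precision))
partial = -precision * scale * scale[:, np.newaxis]
np.fill_diagonal(partial, 1.0)

print("corr(a, c)         = %.2f" % marginal[0, 2])
print("partial corr(a, c) = %.2f" % partial[0, 2])
```

In this toy setup a and c are clearly correlated, yet their partial correlation is close to zero once b is accounted for, so a graph built from the precision matrix would link a to b and b to c but not a to c.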

3.b Clustering

We use clustering to group together close prices that behave similarly. Here, among the various clustering techniques available in scikit-learn, we use Affinity Propagation because it does not enforce equal-size clusters and can choose the number of clusters automatically from the data.

Note that this gives us a different indication than the graph because the graph reflects conditional relations between variables while the clustering reflects marginal properties. Variables clustered together can be considered as having a similar impact at the level of the full stock market.

Let’s take a look to see which stocks are in the same clusters.

_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

names = []
for stock in pct_change_combined_stock_df.columns.tolist():
    # Assumption: label each ticker with its company name from symbol_dict
    names.append(symbol_dict[stock])
names = np.array(names)

for i in range(n_labels + 1):
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))
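The following is a minimal sketch of Affinity Propagation on synthetic data, assuming a recent scikit-learn release that accepts the random_state argument. It uses the AffinityPropagation estimator with a precomputed similarity matrix instead of the affinity_propagation function used in the article; the two hidden factors and the six series are hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
n_days = 500

# Two hypothetical hidden factors, each driving three made-up series
factor_1 = rng.normal(size=n_days)
factor_2 = rng.normal(size=n_days)
toy = np.column_stack(
    [factor_1 + 0.3 * rng.normal(size=n_days) for _ in range(3)]
    + [factor_2 + 0.3 * rng.normal(size=n_days) for _ in range(3)]
)

# Use the covariance matrix as a precomputed similarity, as in the article
similarity = np.cov(toy, rowvar=False)
model = AffinityPropagation(affinity="precomputed", random_state=0)
toy_labels = model.fit_predict(similarity)

# The number of clusters is not passed in; it is inferred from the data
print(toy_labels)
```

Note that no cluster count is supplied anywhere; the algorithm selects exemplars itself, which is the property the article relies on.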

3.c Embedding in 2D space

For visualization purposes, we need to lay out the different symbols on a 2D canvas. For this we use manifold learning techniques to retrieve a 2D embedding.

node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)
# Transpose so that each stock is one sample, then transpose back to (2, n_stocks)
embedding = node_position_model.fit_transform(X.T).T
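The double transpose is easy to get wrong, so here is a small sketch of the shapes involved. The random matrix stands in for standardized returns and is purely hypothetical; only the array shapes matter here.

```python
import numpy as np
from sklearn import manifold

rng = np.random.default_rng(0)

# Hypothetical data: 250 days of standardized returns for 12 made-up symbols
toy_X = rng.normal(size=(250, 12))

lle = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver="dense", n_neighbors=6)

# Transpose so each symbol is one sample; transpose back to get (2, n_symbols)
toy_embedding = lle.fit_transform(toy_X.T).T
print(toy_embedding.shape)  # (2, 12)
```

Row 0 of the result then holds the x coordinates and row 1 the y coordinates, which is the layout the plotting code expects when it indexes embedding[0] and embedding[1].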

4. Visualization

The outputs of the three models (sparse inverse covariance estimation with a cross-validated graphical lasso, clustering with Affinity Propagation, and 2D embedding with manifold learning) are combined in a 2D graph where nodes represent the stocks and edges represent the connections between stocks:

  • the sparse covariance model is used to display the strength of the edges
  • cluster labels are used to define the color of the nodes
  • the 2D embedding is used to position the nodes in the graph

This example has a fair amount of visualization-related code, as visualization is crucial here to display the graph. One of the challenges is to position the labels minimizing overlap. For this we use a heuristic based on the direction of the nearest neighbor along each axis.

# Visualization
plt.figure(1, facecolor='w', figsize=(10, 8))
ax = plt.axes([0., 0., 1., 1.])

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)

# Plot the nodes using the coordinates of our embedding
# Assumption: nipy_spectral colormap for the cluster colors
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.nipy_spectral)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
# a sequence of (*line0*, *line1*, *line2*), where::
#            linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
# Assumption: hot_r colormap for edge strength, drawn beneath the nodes
lc = LineCollection(segments,
                    zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))
lc.set_array(values)
lc.set_linewidths(15 * values)
ax.add_collection(lc)

# Add a label to each node. The challenge here is that we want to
# position the labels to avoid overlap with other labels
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):
    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    # Assumption: the label box edge is colored by cluster, with alpha=.6
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.nipy_spectral(label / float(n_labels)),
                       alpha=.6))

# np.ptp is used because ndarray.ptp() was removed in NumPy 2.0
plt.xlim(embedding[0].min() - .15 * np.ptp(embedding[0]),
         embedding[0].max() + .10 * np.ptp(embedding[0]))
plt.ylim(embedding[1].min() - .03 * np.ptp(embedding[1]),
         embedding[1].max() + .03 * np.ptp(embedding[1]))
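The label-placement heuristic can be isolated as a small function to see what it decides for a given node. This is a sketch only: the function name label_alignment and the three node positions are hypothetical, while the logic mirrors the nearest-neighbor rule described above.

```python
import numpy as np

def label_alignment(embedding, index):
    """Return (horizontal, vertical) text alignment for one node.

    embedding has shape (2, n_nodes): row 0 holds x, row 1 holds y.
    """
    x, y = embedding[0, index], embedding[1, index]
    dx = x - embedding[0]
    dy = y - embedding[1]
    # Exclude the node itself from the nearest-neighbor search
    dx[index] = 1
    dy[index] = 1
    # Horizontal side: look at the node closest in y; vertical: closest in x
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    horizontal = "left" if this_dx > 0 else "right"
    vertical = "bottom" if this_dy > 0 else "top"
    return horizontal, vertical

# Three hypothetical node positions: (0, 0), (1, 0.05), (0.1, 1)
toy_embedding = np.array([[0.0, 1.0, 0.1],
                          [0.0, 0.05, 1.0]])
print(label_alignment(toy_embedding, 0))  # ('right', 'top')
print(label_alignment(toy_embedding, 1))  # ('left', 'top')
```

For the node at (0, 0), the neighbor at nearly the same height lies to its right, so the text is right-aligned and extends to the left, away from that neighbor.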

As can be seen in the 2D graph, MSFT (Microsoft), AAPL (Apple), AMZN (Amazon), and TXN (Texas Instruments) share the same node color, are connected by bold lines, and are positioned closely together. This shows that all three models identify these stocks as having a close relationship based on variations in historical close prices. Intuitively, this makes sense, as this group of stocks has similar economic exposures and regulatory burdens.

5. Pairs Trading Analysis

5.a Historical Stock Prices

To examine how well our identified pairs trade algorithmically, we first reload the historical stock prices.

stock_data = combined_stock_price_df

Since MSFT (Microsoft) and AAPL (Apple) were identified in the 2D graph as being a good pairs trading candidate, we define them as symbol_one and symbol_two in our trading algorithm below:

symbol_one = 'MSFT'
symbol_two = 'AAPL'
stock_data = stock_data[[symbol_one, symbol_two]].copy()
# Assumption: name the index 'Date'
stock_data.index.name = 'Date'

We focus on price data since January 1, 2020 in order to capture the coronavirus sell-off in March 2020 and subsequent stock market recovery.

stock1_name, stock2_name = symbol_one,symbol_two
orig_data = stock_data.loc['2020-01-01':,]
data = orig_data.diff().cumsum()
data1 = data[stock1_name].ffill().fillna(0).values
data2 = data[stock2_name].ffill().fillna(0).values
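The diff().cumsum() step can look opaque, so here is a small sketch with hypothetical prices. It suggests that the transformation amounts to measuring each price relative to the first observation in the window.

```python
import numpy as np
import pandas as pd

# Hypothetical closes for one made-up ticker
toy_close = pd.Series([10.0, 12.0, 11.0, 15.0])

# Same chain of calls as above: cumulative sum of daily price changes
transformed = toy_close.diff().cumsum().ffill().fillna(0).values

# Equivalent to subtracting the first price from every observation
relative = (toy_close - toy_close.iloc[0]).values

print(transformed.tolist())  # [0.0, 2.0, 1.0, 5.0]
```

One reading of this choice is that both series start at zero, so the regression of one stock on the other needs no intercept term.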

Let’s now plot the historical stock prices for MSFT and AAPL.

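plt.figure(figsize = (18,8))
ax = plt.gca()
plt.title("Potentially Cointegrated Stocks")
# Assumption: plot both close-price series on the shared axis
orig_data[stock1_name].plot(ax=ax, color=sns.color_palette()[1], linewidth=2)
orig_data[stock2_name].plot(ax=ax, color=sns.color_palette()[2], linewidth=2)
plt.ylabel("Price (USD)")
plt.legend()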

These companies do indeed seem to have related price series.

5.b Bayesian Modeling

We take a Bayesian approach to pairs trading using probabilistic programming, which is a form of Bayesian machine learning. Unlike simpler frequentist cointegration tests, our Bayesian approach allows us to monitor the relationship between a pair of equities over time, which allows us to follow pairs whose cointegration parameters change steadily or abruptly. When combined with a simple mean-reversion trading algorithm, we demonstrate this to be a viable theoretical trading strategy, ready for further evaluation and risk management.

To learn more about this Bayesian approach to pairs trading, you can read AlphaWave Data’s article titled Bayesian Pairs Trading using Corporate Supply Chain Data.

We will use a Bayesian probabilistic programming package called PyMC3. Its simple syntax is excellent for prototyping as seen with the model description in the code below.

with pm.Model() as model:

    # inject external stock data
    stock1 = th.shared(data1)
    stock2 = th.shared(data2)

    # define our cointegration variables
    beta_sigma = pm.Exponential('beta_sigma', 50.)
    # Assumption: one beta per time step, so the walk has the length of the data
    beta = pm.GaussianRandomWalk('beta', sd=beta_sigma,
                                 shape=data1.shape[0])

    # with our assumptions, cointegration can be reframed as a regression problem
    stock2_regression = beta * stock1

    # Assume prices are Normally distributed, the mean comes from the regression.
    sd = pm.HalfNormal('sd', sd=.25)
    # Assumption: the second stock is the observed variable
    likelihood = pm.Normal('y',
                           mu=stock2_regression,
                           sd=sd,
                           observed=stock2)

with model:
    trace = pm.sample(2000,tune=1000,cores=4)
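Written out, the model can be read as the following time-varying regression, where x_t and y_t denote the transformed prices of the first and second stock (data1 and data2). The symbols are notation introduced here for illustration; the rate 50 and scale 0.25 are the values passed to pm.Exponential and pm.HalfNormal in the code.

```latex
\begin{aligned}
\sigma_\beta &\sim \mathrm{Exponential}(50) \\
\beta_t &\sim \mathcal{N}\!\left(\beta_{t-1},\ \sigma_\beta^{2}\right) \\
\sigma &\sim \mathrm{HalfNormal}(0.25) \\
y_t &\sim \mathcal{N}\!\left(\beta_t\, x_t,\ \sigma^{2}\right)
\end{aligned}
```

The random walk on β_t is what lets the hedge ratio drift over time instead of being a single fixed coefficient.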

Let’s plot the 𝛽 distribution from the model over time.

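# Posterior mean of beta at each time step
rolling_beta = trace[beta].T.mean(axis=1)

plt.figure(figsize = (18,8))
ax = plt.gca()
plt.title("Beta Distribution over Time")
# Assumption: plot the posterior mean in red on top of the sampled paths
pd.Series(rolling_beta, index=orig_data.index).plot(ax=ax, color='r', zorder=1e6, linewidth=2)
for orbit in trace[beta][:500]:
    # Assumption: draw each sampled beta path ("orbit") as a faint line
    pd.Series(orbit, index=orig_data.index).plot(ax=ax, color=sns.color_palette()[0], alpha=0.05)
plt.legend(['Beta Mean','Beta Orbit'])
#plt.savefig("beta distrib.png")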

Notice that 𝛽 appears to shift between somewhat fixed regimes, and often does so abruptly.

5.c Trading Strategy

Knowing that two stocks may or may not be cointegrated does not explicitly define a trading strategy. For that we present the following simple mean-reversion style trading algorithm, which capitalizes on the assumed mean-reverting behavior of a cointegrated portfolio of stocks. We trade whenever our portfolio is moving back toward its mean value. When the algorithm is not trading, we dynamically update 𝛽 and its other parameters, to adapt to potentially changing cointegration conditions. Once a trade begins, we are forced to trade the two stocks at a fixed rate, and so our 𝛽 becomes locked for the duration of the trade. The algorithm’s exact implementation is as follows:

Define a “signal”, which should mean-revert to zero if 𝛽 remains relatively stationary.

Define a “smoothed signal”, a 15-day moving average of the “signal”.

If we are not trading…

  • Update 𝛽 so that it does not remain fixed while we aren’t trading.
  • If the smoothed signal is above zero and moving downward, short our portfolio.
  • If the smoothed signal is below zero and moving upward, go long on our portfolio.

If we are trading long…

  • If the smoothed signal goes below its start value, close the trade; we may be diverging from the mean.
  • If the smoothed signal rises through the zero line, we’ve reached the mean. Close the trade.

If we are trading short…

  • If the smoothed signal goes above its start value, close the trade; we may be diverging from the mean.
  • If the smoothed signal falls through the zero line, we’ve reached the mean. Close the trade.

def getStrategyPortfolioWeights(rolling_beta,stock_name1,stock_name2,data,smoothing_window=15):
    data1 = data[stock_name1].ffill().fillna(0).values
    data2 = data[stock_name2].ffill().fillna(0).values

    # initial signal rebalance
    fixed_beta = rolling_beta[smoothing_window]
    signal = fixed_beta*data1 - data2
    smoothed_signal = pd.Series(signal).rolling(smoothing_window).mean()
    d_smoothed_signal = smoothed_signal.diff()
    trading = "not"
    trading_start = 0

    leverage = 0*data.copy()
    for i in range(smoothing_window,data1.shape[0]):
        # carry yesterday's weights forward by default
        leverage.iloc[i,:] = leverage.iloc[i-1,:]

        if trading=="not":
            # dynamically rebalance the signal when not trading
            fixed_beta = rolling_beta[i]
            signal = fixed_beta*data1 - data2
            smoothed_signal = pd.Series(signal).rolling(smoothing_window).mean()
            d_smoothed_signal = smoothed_signal.diff()

            if smoothed_signal[i]>0 and d_smoothed_signal[i]<0:
                # above zero and falling: short the portfolio
                leverage.iloc[i,0] = -fixed_beta / (abs(fixed_beta)+1)
                leverage.iloc[i,1] = 1 / (abs(fixed_beta)+1)
                trading = "short"
                trading_start = smoothed_signal[i]
            elif smoothed_signal[i]<0 and d_smoothed_signal[i]>0:
                # below zero and rising: go long on the portfolio
                fixed_beta = rolling_beta[i]
                leverage.iloc[i,0] = fixed_beta / (abs(fixed_beta)+1)
                leverage.iloc[i,1] = -1 / (abs(fixed_beta)+1)
                trading = "long"
                trading_start = smoothed_signal[i]
            else:
                leverage.iloc[i,0] = 0
                leverage.iloc[i,1] = 0

        elif trading=="long":
            # a failed trade
            if smoothed_signal[i] < trading_start:
                leverage.iloc[i,0] = 0
                leverage.iloc[i,1] = 0
                trading = "not"
            # a successful trade
            if smoothed_signal[i]>0:
                leverage.iloc[i,0] = 0
                leverage.iloc[i,1] = 0
                trading = "not"

        elif trading=="short":
            # a failed trade
            if smoothed_signal[i] > trading_start:
                leverage.iloc[i,0] = 0
                leverage.iloc[i,1] = 0
                trading = "not"
            # a successful trade
            if smoothed_signal[i]<0:
                leverage.iloc[i,0] = 0
                leverage.iloc[i,1] = 0
                trading = "not"

    return leverage
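The entry rules can be checked by hand on a short series. The sketch below uses a hypothetical hand-picked signal and a window of 3 instead of 15 so that every number is easy to verify; it covers only the entry conditions, not the full state machine.

```python
import pandas as pd

# Hypothetical signal values (beta * stock1 - stock2), chosen by hand
signal = pd.Series([1, 2, 3, 2, 1, -1, -2, -1, 0], dtype=float)

window = 3  # the article uses 15; 3 keeps the example readable
smoothed = signal.rolling(window).mean()
slope = smoothed.diff()

# Entry rules from the "not trading" state
short_entry = (smoothed > 0) & (slope < 0)   # above zero, moving down
long_entry = (smoothed < 0) & (slope > 0)    # below zero, moving up

print(list(short_entry[short_entry].index))  # [4, 5]
print(list(long_entry[long_entry].index))    # [8]
```

The flags mark every day on which a condition holds. The trading function acts only on the first such day, because once a trade is open it stops looking for new entries until the trade is closed.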

5.d Backtesting & Performance in Market Drops

As a long-short algorithm, the expectation is that this algorithm would perform well during market drops. The backtest here includes the coronavirus sell-off in March 2020.

portfolioWeights = getStrategyPortfolioWeights(rolling_beta,stock1_name, stock2_name,data).fillna(0)

def backtest(pricingDF,leverageDF,start_cash):
    """Backtests pricing based on some given set of leverage. Leverage works such that it happens "overnight",
    so leverage for "today" is applied to yesterday's close price. This algo can handle NaNs in pricing data
    before a stock exists, but ffill() should be used for NaNs that occur after the stock has existed, even
    if that stock ceases to exist later."""

    pricing = pricingDF.values
    leverage = leverageDF.values

    shares = np.zeros_like(pricing)
    cash = np.zeros(pricing.shape[0])
    cash[0] = start_cash
    curr_price = np.zeros(pricing.shape[1])
    curr_price_div = np.zeros(pricing.shape[1])

    for t in range(1,pricing.shape[0]):

        if np.any(leverage[t]!=leverage[t-1]):
            # handle non-existent stock values
            curr_price[:] = pricing[t-1] # you can multiply with this one
            curr_price[np.isnan(curr_price)] = 0
            trading_allowed = (curr_price!=0)
            curr_price_div[:] = curr_price # you can divide with this one
            curr_price_div[~trading_allowed] = 1

            # determine new positions (warning: leverage to non-trading_allowed stocks is just lost)
            portfolio_value = (shares[t-1]*curr_price).sum()+cash[t-1]
            target_shares = trading_allowed * (portfolio_value*leverage[t]) // curr_price_div

            # rebalance
            shares[t] = target_shares
            cash[t] = cash[t-1] - ((shares[t]-shares[t-1])*curr_price).sum()

        else:
            # maintain positions
            shares[t] = shares[t-1]
            cash[t] = cash[t-1]

    returns = (shares*np.nan_to_num(pricing)).sum(axis=1)+cash
    pct_returns = (returns-start_cash)/start_cash
    return (
        pd.DataFrame( shares, index=pricingDF.index, columns=pricingDF.columns ),
        pd.Series( cash, index=pricingDF.index ),
        pd.Series( pct_returns, index=pricingDF.index)
    )
shares, cash, returns = backtest( orig_data, portfolioWeights, 1e6 )

plt.figure(figsize = (18,8))
ax = plt.gca()
plt.title("Return Profile of Algorithm")
plt.ylabel("Percent Returns")
# Assumption: plot the percent-return series returned by backtest()
returns.plot(ax=ax, linewidth=3)
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0%}'.format(x) for x in vals])

As we might have hoped, performance through market drops is strong. Returns are somewhat outsized due to our portfolio only being two stocks. For a finalized version of this algorithm, we might trade a hundred pairs or more to reduce volatility.
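The rebalancing arithmetic inside backtest can be checked on a single step. The sketch below uses hypothetical prices and target weights and reproduces only the share and cash update, not the full loop.

```python
import numpy as np

# Hypothetical single rebalance step: $1,000,000 in cash, no open positions
cash_before = 1_000_000.0
shares_before = np.array([0.0, 0.0])
prices = np.array([70.0, 30.0])          # yesterday's closes (made up)
target_leverage = np.array([0.6, -0.4])  # long first stock, short second

portfolio_value = (shares_before * prices).sum() + cash_before

# Whole shares only; floor division rounds a short position away from zero
target_shares = (portfolio_value * target_leverage) // prices
cash_after = cash_before - ((target_shares - shares_before) * prices).sum()

print(target_shares.tolist())  # [8571.0, -13334.0]
print(cash_after)              # 800050.0
```

Two details stand out in this toy step: total portfolio value is unchanged by the rebalance (positions plus cash still sum to $1,000,000), and the short leg ends up slightly larger than its target because floor division rounds negative share counts downward.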

6. Conclusions & Potential Future Directions

Using the outputs of the three models (sparse inverse covariance estimation with a cross-validated graphical lasso, clustering with Affinity Propagation, and 2D embedding with manifold learning) to identify stock pairs, we demonstrated a robust prototype for what could be built into a more sophisticated pairs trading algorithm. There are many places where this algorithm and approach could be improved, including expanding the portfolio, creating criteria for when 𝛽 is suitable to trade over, backtesting over more periods, using a Bayesian model with fewer simplifying assumptions, and investigating potential nonlinear relationships between stocks.

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by AlphaWave Data, Inc. (“AlphaWave Data”). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, AlphaWave Data, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to AlphaWave Data, Inc. at the time of publication. AlphaWave Data makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.

