
# Machine Learning for Stock Trading: Unsupervised Learning Techniques

# Introduction

This example employs several unsupervised learning techniques from `scikit-learn` to extract the stock market structure from variations in historical close prices. The quantity we use is the daily variation in close prices, because prices that are linked tend to co-fluctuate during a day.
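As a minimal illustration of the quantity we work with (using made-up prices, not this project's data), the daily variation can be computed with pandas' `pct_change`:

```python
import pandas as pd

# Hypothetical close prices for two tickers over three days
prices = pd.DataFrame({'AAA': [100.0, 102.0, 101.0],
                       'BBB': [50.0, 51.0, 50.5]})

# Daily percent variation in close prices; the first row is NaN and is dropped
variation = prices.pct_change().dropna()
print(variation)
```

Day 2 of `AAA`, for example, is (102 − 100) / 100 = 0.02.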

Jupyter Notebooks are available on Google Colab and Github.

For this project, we use several Python-based scientific computing technologies listed below.

```python
import requests
import numpy as np
import pandas as pd
import pymc3 as pm
import theano as th
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from sklearn.preprocessing import MinMaxScaler
from sklearn import cluster, covariance, manifold

%matplotlib notebook

import warnings
warnings.filterwarnings('ignore')
```

# 1. Define the Stock Universe

We start by constraining our search for pairs to a large and liquid single-stock universe. To achieve this, we create a function that scrapes the tickers of the S&P 500 and then cleans the tickers by replacing those containing a `.` with a `-` so we can easily use them in AlphaWave Data's APIs.

```python
# Scrape the S&P 500 tickers from Wikipedia
def get_tickers():
    wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies').text
    sp_data = pd.read_html(wiki_page)
    ticker_df = sp_data[0]
    ticker_options = ticker_df['Symbol']
    return ticker_options

# Run the ticker scrape function.
# Convert the get_tickers() output to a list and replace tickers that
# have '.' with '-' so we can use AlphaWave Data APIs.
stock_tickers = get_tickers()
stock_tickers = stock_tickers.to_list()
for ticker in range(len(stock_tickers)):
    stock_tickers[ticker] = stock_tickers[ticker].upper().replace(".", "-")
```

After scraping all the tickers in the S&P 500, let’s reduce the ticker list with the code below in order to present a simplified stock market visualization. This step is not required and is implemented only to provide an example.

```python
symbol_dict = {
    'TOT': 'Total',
    'XOM': 'Exxon',
    'CVX': 'Chevron',
    'COP': 'ConocoPhillips',
    'VLO': 'Valero Energy',
    'MSFT': 'Microsoft',
    'IBM': 'IBM',
    'TWX': 'Time Warner',
    'CMCSA': 'Comcast',
    'CVC': 'Cablevision',
    'YHOO': 'Yahoo',
    'DELL': 'Dell',
    'HPQ': 'HP',
    'AMZN': 'Amazon',
    'TM': 'Toyota',
    'CAJ': 'Canon',
    'MTU': 'Mitsubishi',
    'SNE': 'Sony',
    'F': 'Ford',
    'HMC': 'Honda',
    'NAV': 'Navistar',
    'NOC': 'Northrop Grumman',
    'BA': 'Boeing',
    'KO': 'Coca-Cola',
    'MMM': '3M',
    'MCD': "McDonald's",
    'PEP': 'Pepsi',
    'MDLZ': 'Kraft Foods',
    'K': 'Kellogg',
    'UN': 'Unilever',
    'MAR': 'Marriott',
    'PG': 'Procter & Gamble',
    'CL': 'Colgate-Palmolive',
    'GE': 'General Electric',
    'WFC': 'Wells Fargo',
    'JPM': 'JPMorgan Chase',
    'AIG': 'AIG',
    'AXP': 'American Express',
    'BAC': 'Bank of America',
    'GS': 'Goldman Sachs',
    'AAPL': 'Apple',
    'SAP': 'SAP',
    'CSCO': 'Cisco',
    'TXN': 'Texas Instruments',
    'XRX': 'Xerox',
    'LMT': 'Lockheed Martin',
    'WMT': 'Wal-Mart',
    'WBA': 'Walgreen',
    'HD': 'Home Depot',
    'GSK': 'GlaxoSmithKline',
    'PFE': 'Pfizer',
    'SNY': 'Sanofi-Aventis',
    'NVS': 'Novartis',
    'KMB': 'Kimberly-Clark',
    'R': 'Ryder',
    'GD': 'General Dynamics',
    'RTN': 'Raytheon',
    'CVS': 'CVS',
    'CAT': 'Caterpillar',
    'DD': 'DuPont de Nemours'}

stock_dict_list = []
for key in symbol_dict.keys():
    stock_dict_list.append(key)

stock_tickers = list(set(stock_tickers) & set(stock_dict_list))
stock_tickers
```

# 2. Retrieve Stock Price Data

We can use the 5 Year Historical Daily Prices endpoint from the AlphaWave Data Stock Prices API to pull in five years of historical prices. From these, we calculate the daily returns for each selected stock. With these returns, we can produce a 2D graph of the stock market.

To call this API with Python, you can choose one of the supported Python code snippets provided in the API console. The following is an example of how to invoke the API with Python Requests. You will need to insert your own **x-rapidapi-host** and **x-rapidapi-key** information in the code block below.

```python
# Fetch 5 years of daily price data
url = "https://stock-prices2.p.rapidapi.com/api/v1/resources/stock-prices/5y"
headers = {
    'x-rapidapi-host': "YOUR_X-RAPIDAPI-HOST_WILL_COPY_DIRECTLY_FROM_RAPIDAPI_PYTHON_CODE_SNIPPETS",
    'x-rapidapi-key': "YOUR_X-RAPIDAPI-KEY_WILL_COPY_DIRECTLY_FROM_RAPIDAPI_PYTHON_CODE_SNIPPETS"
}

stock_frames = []
for ticker in stock_tickers:
    querystring = {"ticker": ticker}
    stock_daily_price_response = requests.request("GET", url, headers=headers, params=querystring)

    # Create Stock Prices DataFrame
    stock_daily_price_df = pd.DataFrame.from_dict(stock_daily_price_response.json())
    stock_daily_price_df = stock_daily_price_df.transpose()
    stock_daily_price_df = stock_daily_price_df.rename(columns={'Close': ticker})
    stock_daily_price_df = stock_daily_price_df[[ticker]]  # keep only the Close column
    stock_frames.append(stock_daily_price_df)

combined_stock_price_df = pd.concat(stock_frames, axis=1, sort=True)
combined_stock_price_df = combined_stock_price_df.dropna(how='all')
combined_stock_price_df = combined_stock_price_df.ffill()  # keep columns numeric for pct_change()

pct_change_combined_stock_df = combined_stock_price_df.pct_change()
pct_change_combined_stock_df = pct_change_combined_stock_df.dropna()
pct_change_combined_stock_df
```

```python
variation = pct_change_combined_stock_df
variation.shape
```

# 3. Unsupervised Learning Techniques

# 3.a Learning a Graph Structure

We use sparse inverse covariance estimation to find which close prices are correlated conditionally on the others. Specifically, sparse inverse covariance gives us a graph, that is, a list of connections. For each symbol, the symbols it is connected to are those useful for explaining its fluctuations.

```python
# GraphLassoCV was renamed GraphicalLassoCV in scikit-learn 0.20
edge_model = covariance.GraphicalLassoCV(verbose=True)

X = variation.copy()
X /= X.std(axis=0)
edge_model.fit(X)
```

# 3.b Clustering

We use clustering to group together close prices that behave similarly. Here, among the various clustering techniques available in `scikit-learn`, we use Affinity Propagation because it does not enforce equal-size clusters and can automatically choose the number of clusters from the data.

Note that this gives us a different indication than the graph because the graph reflects conditional relations between variables while the clustering reflects marginal properties. Variables clustered together can be considered as having a similar impact at the level of the full stock market.
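To make the conditional-vs-marginal distinction concrete, here is a small synthetic sketch (made-up data, not this project's stocks): in a chain where A drives B and B drives C, A and C are strongly correlated marginally, yet their partial correlation, computed from the precision (inverse covariance) matrix P as −P_ij / √(P_ii · P_jj), is near zero because B explains their co-movement:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.standard_normal(n)
b = a + 0.5 * rng.standard_normal(n)   # B is driven by A
c = b + 0.5 * rng.standard_normal(n)   # C is driven by B only

X = np.vstack([a, b, c])
precision = np.linalg.inv(np.cov(X))

# Partial correlations: -P_ij / sqrt(P_ii * P_jj)
d = 1 / np.sqrt(np.diag(precision))
partial = -precision * d * d[:, None]

marginal_ac = np.corrcoef(a, c)[0, 1]   # large: A and C co-fluctuate
partial_ac = partial[0, 2]              # ~0: no direct A-C link given B
print(marginal_ac, partial_ac)
```

This is the same normalization that the visualization code later applies to `edge_model.precision_` to weight the graph's edges.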

Let’s take a look to see which stocks are in the same clusters.

```python
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

names = []
for stock in pct_change_combined_stock_df.columns.tolist():
    names.append(stock)
names = np.array(names)

for i in range(n_labels + 1):
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))
```

# 3.c Embedding in 2D space

For visualization purposes, we need to lay out the different symbols on a 2D canvas. For this, we use manifold learning techniques to retrieve a 2D embedding.

```python
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)
embedding = node_position_model.fit_transform(X.T).T
```

# 4. Visualization

The outputs of the three models (sparse inverse covariance estimation via graphical lasso cross-validation, clustering using Affinity Propagation, and 2D embedding with manifold learning) are combined in a 2D graph where nodes represent the stocks and edges represent the connections between stocks:

- the sparse covariance model is used to display the strength of the edges
- cluster labels are used to define the color of the nodes
- the 2D embedding is used to position the nodes in the graph

This example has a fair amount of visualization-related code, as visualization is crucial here to display the graph. One of the challenges is to position the labels in a way that minimizes overlap. For this, we use a heuristic based on the direction of the nearest neighbor along each axis.

```python
# Visualization
plt.figure(1, facecolor='w', figsize=(10, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])
plt.axis('off')

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)

# Plot the nodes using the coordinates of our embedding
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.nipy_spectral)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
# a sequence of (*line0*, *line1*, *line2*), where::
#     linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
lc = LineCollection(segments,
                    zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))
lc.set_array(values)
lc.set_linewidths(15 * values)
ax.add_collection(lc)

# Add a label to each node. The challenge here is that we want to
# position the labels to avoid overlap with other labels
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):
    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.nipy_spectral(label / float(n_labels)),
                       alpha=.6))

plt.xlim(embedding[0].min() - .15 * embedding[0].ptp(),
         embedding[0].max() + .10 * embedding[0].ptp())
plt.ylim(embedding[1].min() - .03 * embedding[1].ptp(),
         embedding[1].max() + .03 * embedding[1].ptp())

plt.show()
```

As can be seen in the 2D graph, `MSFT` (Microsoft), `AAPL` (Apple), `AMZN` (Amazon), and `TXN` (Texas Instruments) all have the same node color, bold lines connecting them, and are positioned close together on the graph. This shows that all three models identify these four stocks as having a close relationship based on variations in historical close prices. Intuitively, this makes sense, as this group of stocks has similar economic exposures and regulatory burdens.

# 5. Pairs Trading Analysis

# 5.a Historical Stock Prices

To examine how well our identified pairs trade algorithmically, we first reload the historical stock prices.

```python
stock_data = combined_stock_price_df
stock_data
```

Since `MSFT` (Microsoft) and `AAPL` (Apple) were identified in the 2D graph as a good pairs trading candidate, we define them as `symbol_one` and `symbol_two` in our trading algorithm below:

```python
symbol_one = 'MSFT'
symbol_two = 'AAPL'

stock_data = stock_data[[symbol_one, symbol_two]]
stock_data.index.name = 'Date'
stock_data
```

We focus on price data since January 1, 2020 in order to capture the coronavirus sell-off in March 2020 and subsequent stock market recovery.

```python
stock1_name, stock2_name = symbol_one, symbol_two
orig_data = stock_data.loc['2020-01-01':]
data = orig_data.diff().cumsum()
data1 = data[stock1_name].ffill().fillna(0).values
data2 = data[stock2_name].ffill().fillna(0).values
```

Let’s now plot the historical stock prices for `MSFT` and `AAPL`.

```python
plt.figure(figsize=(18, 8))
ax = plt.gca()
plt.title("Potentially Cointegrated Stocks")
orig_data[stock1_name].plot(ax=ax, color=sns.color_palette()[1], linewidth=2)
orig_data[stock2_name].plot(ax=ax, color=sns.color_palette()[2], linewidth=2)
plt.ylabel("Price (USD)")
plt.legend()
plt.show()
```

These companies do indeed seem to have related price series.

# 5.b Bayesian Modeling

We take a Bayesian approach to pairs trading using probabilistic programming, which is a form of Bayesian machine learning. Unlike simpler frequentist cointegration tests, our Bayesian approach allows us to monitor the relationship between a pair of equities over time, which allows us to follow pairs whose cointegration parameters change steadily or abruptly. When combined with a simple mean-reversion trading algorithm, we demonstrate this to be a viable theoretical trading strategy, ready for further evaluation and risk management.

To learn more about this Bayesian approach to pairs trading, you can read AlphaWave Data’s article titled Bayesian Pairs Trading using Corporate Supply Chain Data.

We will use a Bayesian probabilistic programming package called PyMC3. Its simple syntax is excellent for prototyping, as seen in the model description in the code below.

```python
with pm.Model() as model:
    # inject external stock data
    stock1 = th.shared(data1)
    stock2 = th.shared(data2)

    # define our cointegration variables
    beta_sigma = pm.Exponential('beta_sigma', 50.)
    beta = pm.GaussianRandomWalk('beta', sd=beta_sigma,
                                 shape=data1.shape[0])

    # with our assumptions, cointegration can be reframed as a regression problem
    stock2_regression = beta * stock1

    # Assume prices are Normally distributed; the mean comes from the regression.
    sd = pm.HalfNormal('sd', sd=.25)
    likelihood = pm.Normal('y',
                           mu=stock2_regression,
                           sd=sd,
                           observed=stock2)

with model:
    stock1.set_value(data1)
    stock2.set_value(data2)
    trace = pm.sample(2000, tune=1000, cores=4)
```

Let’s plot the 𝛽 distribution from the model over time.

```python
rolling_beta = trace[beta].T.mean(axis=1)

plt.figure(figsize=(18, 8))
ax = plt.gca()
plt.title("Beta Distribution over Time")
pd.Series(rolling_beta, index=orig_data.index).plot(ax=ax, color='r', zorder=1e6, linewidth=2)
for orbit in trace[beta][:500]:
    pd.Series(orbit, index=orig_data.index).plot(ax=ax, color=sns.color_palette()[0], alpha=0.05)
plt.legend(['Beta Mean', 'Beta Orbit'])
# plt.savefig("beta distrib.png")
plt.show()
```

Notice that 𝛽 appears to shift between somewhat fixed regimes, and often does so abruptly.

# 5.c Trading Strategy

Knowing that two stocks may or may not be cointegrated does not explicitly define a trading strategy. For that we present the following simple mean-reversion style trading algorithm, which capitalizes on the assumed mean-reverting behavior of a cointegrated portfolio of stocks. We trade whenever our portfolio is moving back toward its mean value. When the algorithm is not trading, we dynamically update 𝛽 and its other parameters, to adapt to potentially changing cointegration conditions. Once a trade begins, we are forced to trade the two stocks at a fixed rate, and so our 𝛽 becomes locked for the duration of the trade. The algorithm’s exact implementation is as follows:

- Define a “signal”, which should mean-revert to zero if 𝛽 remains relatively stationary.
- Define a “smoothed signal”, a 15-day moving average of the “signal”.

If we are not trading…

- Update 𝛽 so that it does not remain fixed while we aren’t trading.
- If the smoothed signal is above zero *and* moving downward, short our portfolio.
- If the smoothed signal is below zero *and* moving upward, go long on our portfolio.

If we are trading long…

- If the smoothed signal goes below its start value, close the trade; we may be diverging from the mean.
- If the smoothed signal rises through the zero line, we’ve reached the mean. Close the trade.

If we are trading short…

- If the smoothed signal goes above its start value, close the trade; we may be diverging from the mean.
- If the smoothed signal falls through the zero line, we’ve reached the mean. Close the trade.

```python
def getStrategyPortfolioWeights(rolling_beta, stock_name1, stock_name2, data, smoothing_window=15):
    data1 = data[stock_name1].ffill().fillna(0).values
    data2 = data[stock_name2].ffill().fillna(0).values

    # initial signal rebalance
    fixed_beta = rolling_beta[smoothing_window]
    signal = fixed_beta * data1 - data2
    smoothed_signal = pd.Series(signal).rolling(smoothing_window).mean()
    d_smoothed_signal = smoothed_signal.diff()

    trading = "not"
    trading_start = 0
    leverage = 0 * data.copy()
    for i in range(smoothing_window, data1.shape[0]):
        leverage.iloc[i, :] = leverage.iloc[i - 1, :]
        if trading == "not":
            # dynamically rebalance the signal when not trading
            fixed_beta = rolling_beta[i]
            signal = fixed_beta * data1 - data2
            smoothed_signal = pd.Series(signal).rolling(smoothing_window).mean()
            d_smoothed_signal = smoothed_signal.diff()

            if smoothed_signal[i] > 0 and d_smoothed_signal[i] < 0:
                leverage.iloc[i, 0] = -fixed_beta / (abs(fixed_beta) + 1)
                leverage.iloc[i, 1] = 1 / (abs(fixed_beta) + 1)
                trading = "short"
                trading_start = smoothed_signal[i]
            elif smoothed_signal[i] < 0 and d_smoothed_signal[i] > 0:
                fixed_beta = rolling_beta[i]
                leverage.iloc[i, 0] = fixed_beta / (abs(fixed_beta) + 1)
                leverage.iloc[i, 1] = -1 / (abs(fixed_beta) + 1)
                trading = "long"
                trading_start = smoothed_signal[i]
            else:
                leverage.iloc[i, 0] = 0
                leverage.iloc[i, 1] = 0
        elif trading == "long":
            # a failed trade
            if smoothed_signal[i] < trading_start:
                leverage.iloc[i, 0] = 0
                leverage.iloc[i, 1] = 0
                trading = "not"
            # a successful trade
            if smoothed_signal[i] > 0:
                leverage.iloc[i, 0] = 0
                leverage.iloc[i, 1] = 0
                trading = "not"
        elif trading == "short":
            # a failed trade
            if smoothed_signal[i] > trading_start:
                leverage.iloc[i, 0] = 0
                leverage.iloc[i, 1] = 0
                trading = "not"
            # a successful trade
            if smoothed_signal[i] < 0:
                leverage.iloc[i, 0] = 0
                leverage.iloc[i, 1] = 0
                trading = "not"

    return leverage
```

# 5.d Backtesting & Performance in Market Drops

As a long-short algorithm, the expectation is that this algorithm would perform well during market drops. The backtest here includes the coronavirus sell-off in March 2020.

```python
portfolioWeights = getStrategyPortfolioWeights(rolling_beta, stock1_name, stock2_name, data).fillna(0)

def backtest(pricingDF, leverageDF, start_cash):
    """Backtests pricing based on some given set of leverage. Leverage works such that it happens
    "overnight", so leverage for "today" is applied to yesterday's close price. This algo can handle
    NaNs in pricing data before a stock exists, but ffill() should be used for NaNs that occur after
    the stock has existed, even if that stock ceases to exist later."""
    pricing = pricingDF.values
    leverage = leverageDF.values
    shares = np.zeros_like(pricing)
    cash = np.zeros(pricing.shape[0])
    cash[0] = start_cash
    curr_price = np.zeros(pricing.shape[1])
    curr_price_div = np.zeros(pricing.shape[1])
    for t in range(1, pricing.shape[0]):
        if np.any(leverage[t] != leverage[t - 1]):
            # handle non-existent stock values
            curr_price[:] = pricing[t - 1]  # you can multiply with this one
            curr_price[np.isnan(curr_price)] = 0
            trading_allowed = (curr_price != 0)
            curr_price_div[:] = curr_price  # you can divide with this one
            curr_price_div[~trading_allowed] = 1

            # determine new positions (warning: leverage to non-trading_allowed stocks is just lost)
            portfolio_value = (shares[t - 1] * curr_price).sum() + cash[t - 1]
            target_shares = trading_allowed * (portfolio_value * leverage[t]) // curr_price_div

            # rebalance
            shares[t] = target_shares
            cash[t] = cash[t - 1] - ((shares[t] - shares[t - 1]) * curr_price).sum()
        else:
            # maintain positions
            shares[t] = shares[t - 1]
            cash[t] = cash[t - 1]
    returns = (shares * np.nan_to_num(pricing)).sum(axis=1) + cash
    pct_returns = (returns - start_cash) / start_cash
    return (
        pd.DataFrame(shares, index=pricingDF.index, columns=pricingDF.columns),
        pd.Series(cash, index=pricingDF.index),
        pd.Series(pct_returns, index=pricingDF.index)
    )

shares, cash, returns = backtest(orig_data, portfolioWeights, 1e6)

plt.figure(figsize=(18, 8))
ax = plt.gca()
plt.title("Return Profile of Algorithm")
plt.ylabel("Percent Returns")
returns.plot(ax=ax, linewidth=3)
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.0%}'.format(x) for x in vals])
plt.show()
```

As we might have hoped, performance through market drops is strong. Returns are somewhat outsized due to our portfolio only being two stocks. For a finalized version of this algorithm, we might trade a hundred pairs or more to reduce volatility.
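The diversification claim can be sketched numerically: averaging N (roughly) uncorrelated pair strategies shrinks portfolio volatility by about √N. This toy example uses simulated returns, not real pair P&L:

```python
import numpy as np

rng = np.random.default_rng(7)
n_days, n_pairs = 1_000, 100

# Simulated daily returns of 100 independent pair strategies (1% daily vol each)
pair_returns = 0.01 * rng.standard_normal((n_days, n_pairs))

single_vol = pair_returns[:, 0].std()            # volatility of one pair
portfolio_vol = pair_returns.mean(axis=1).std()  # volatility of the equal-weight portfolio

# portfolio_vol is roughly single_vol / sqrt(100) = single_vol / 10
print(single_vol, portfolio_vol)
```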

# 6. Conclusions & Potential Future Directions

Using the outputs of the three models (sparse inverse covariance estimation via graphical lasso cross-validation, clustering using Affinity Propagation, and 2D embedding with manifold learning) to identify stock pairs, we demonstrated a robust prototype of what could be built into a more sophisticated pairs trading algorithm. There are many places where this algorithm and approach could be improved, including expanding the portfolio, creating criteria for when 𝛽 is suitable to trade over, backtesting over more periods, using a Bayesian model with fewer simplifying assumptions, and investigating potential nonlinear relationships between stocks.

*This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by AlphaWave Data, Inc. (“AlphaWave Data”). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, AlphaWave Data, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to AlphaWave Data, Inc. at the time of publication. AlphaWave Data makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.*
