Automated Stock Trading Using Machine Learning Algorithms

CS229 Project Report

Introduction
The use of algorithms to make trading decisions has become a prevalent practice in the major stock exchanges of the world. Algorithmic trading, often associated with high-frequency trading, is the use of automated systems to identify true signals among massive amounts of data that capture the underlying stock market dynamics. Machine learning is central to algorithmic trading because it provides powerful tools for extracting patterns from seemingly chaotic market trends. This project, in particular, learns models from Bloomberg stock data to predict stock price changes, with the aim of making a profit over time.
In this project, we examine two separate algorithms and methodologies for investigating stock market trends, and then iteratively improve the models to achieve higher profitability as well as higher prediction accuracy.
Methods
2.1. Stock Selection
Stock ticker data relating to prices, volumes, and quotes are available to academic institutions through the Bloomberg terminal, and Stanford has an easily accessible one in its engineering library.
When collecting stock data for this project, we selected a conservative universe: we wanted to mine a good universe a priori and to avoid stocks that were likely to be outliers for our algorithms and would confuse the results. We shortlisted stocks by the following criteria:
- price between 10 and 30 dollars
- membership in the last 300 of the S&P 500
- average daily volume (ADV) in the middle 33 percentile
- variety of stock sectors

Arpan Shah Hongxia Zhong
ashah29@stanford.edu hongxia.zhong@stanford.edu

According to the listed criteria, we obtained a universe of 23 stocks for this project.¹
The data we focused on were the price and volume movements of each stock throughout the day on a tick-by-tick basis. These data were then preprocessed so that they could be loaded into Matlab and fed to the machine learning algorithms.
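The shortlisting criteria of this section could be applied programmatically along the following lines. This is only a sketch under stated assumptions: the `Stock` record, its field names, the rank convention (1 = first index member, 500 = last), and the reading of "middle 33 percentile" as the middle third of the candidates' ADV values are all hypothetical and not taken from the report. Sector variety is left to manual review.

```python
from dataclasses import dataclass


@dataclass
class Stock:
    ticker: str
    price: float      # share price in dollars
    sp500_rank: int   # assumed convention: 1 = first member, 500 = last
    adv: float        # average daily volume in shares
    sector: str


def select_universe(stocks):
    """Keep stocks that satisfy the price, index-rank, and ADV criteria."""
    if not stocks:
        return []
    # Bounds of the middle third of the sorted ADV values (assumed reading).
    advs = sorted(s.adv for s in stocks)
    n = len(advs)
    adv_lo = advs[n // 3]
    adv_hi = advs[(2 * n) // 3 - 1]
    return [
        s for s in stocks
        if 10.0 <= s.price <= 30.0      # price between 10 and 30 dollars
        and s.sp500_rank > 200          # last 300 of 500 index members
        and adv_lo <= s.adv <= adv_hi   # ADV in the middle third
    ]
```

Diversity across sectors can then be checked by inspecting the `sector` field of the returned records.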
2.2. Preprocessing
Before using the data in the learning algorithms, the following preprocessing steps were taken.
2.2.1 Discretization
Since the tick-by-tick entries retrieved from Bloomberg arrive at irregular timestamps, we standardized the stock data by discretizing the continuous time domain, from 9:00 am to 5:00 pm when the market closes. Specifically, the time domain was divided into 1-minute buckets; we discarded all finer granularity within each bucket and treated the buckets as the basic units in our learning algorithms.
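The discretization step can be illustrated with a short sketch. The function name `bucket_ticks` and the tick layout `(timestamp, price, volume)` are assumptions made for illustration; the report itself performed this step as part of its Matlab preprocessing.

```python
from collections import defaultdict
from datetime import datetime


def bucket_ticks(ticks, session_start):
    """Group (timestamp, price, volume) ticks into 1-minute buckets.

    Returns a dict mapping the bucket index (minutes elapsed since
    session_start) to the list of ticks that fall into that minute.
    """
    buckets = defaultdict(list)
    for ts, price, volume in ticks:
        offset = (ts - session_start).total_seconds()
        if offset < 0:
            continue  # tick arrived before the session started
        buckets[int(offset // 60)].append((ts, price, volume))
    return dict(buckets)
```

Minutes without any tick simply have no entry in the result, so a caller has to decide how to treat empty buckets.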
2.2.2 Bucket Description
For each 1-minute bucket, we heuristically extracted 8 identifiers to describe the price and volume changes within that minute. We discussed the identifier selection with an experienced veteran of the algorithmic trading industry (Keith Siilats). Based on his suggestions, we chose the following 4 identifiers to describe the price change:

1. open price: price at the beginning of each 1-minute bucket
2. close price: price at the end of each 1-minute bucket
3. high price: highest price within each 1-minute bucket
4. low price: lowest price within each 1-minute bucket

¹ See Appendix.

Similarly, we chose open volume, close volume, high volume, and low volume to describe the volume change.
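The eight identifiers can be computed from the ticks of a single bucket as sketched below. The function name `describe_bucket`, the dictionary keys, and the input layout (a time-ordered list of `(price, volume)` pairs) are illustrative assumptions.

```python
def describe_bucket(ticks):
    """Summarize one 1-minute bucket by its eight identifiers.

    ticks: non-empty, time-ordered list of (price, volume) pairs.
    """
    prices = [price for price, _ in ticks]
    volumes = [volume for _, volume in ticks]
    return {
        "open_price": prices[0],      # first tick of the minute
        "close_price": prices[-1],    # last tick of the minute
        "high_price": max(prices),
        "low_price": min(prices),
        "open_volume": volumes[0],
        "close_volume": volumes[-1],
        "high_volume": max(volumes),
        "low_volume": min(volumes),
    }
```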
With this set of identifiers, we can formulate the algorithms to predict the change in the closing price of each 1-minute bucket given the remaining seven identifiers (volume and price) prior to that minute.² The identifiers help capture the trend of the data within a given minute.
2.3. Metrics
To evaluate the learning algorithms, we simulate a real-time trading process on a single day, using the models obtained from each algorithm. Again, we discretize the continuous time domain into 1-minute buckets. For each bucket at time $t$, each model invests 1 share in each stock if it predicts an uptrend in price, i.e. $P_{\text{close}}^{(t)} > P_{\text{open}}^{(t)}$. If a model invests in a stock at time $t$, it always sells that stock at the end of that minute. To estimate profit, we add the price difference $P_{\text{close}}^{(t)} - P_{\text{open}}^{(t)}$ to the rolling profit. If, on the other hand, the model predicts a downtrend, it does nothing. This rolling profit, denoted simply as "profit" in this report, is one of our metrics for evaluating an algorithm's performance.

In addition to profit, we also use the standard evaluation metrics accuracy, precision, and recall to judge the performance of our models. Specifically,

$$\text{accuracy} = \frac{\#\ \text{correct predictions}}{\#\ \text{total predictions}}$$

$$\text{precision} = \frac{\#\ \text{accurate uptick predictions}}{\#\ \text{uptick predictions}}$$

$$\text{recall} = \frac{\#\ \text{accurate uptick predictions}}{\#\ \text{actual upticks}}$$

To conclude, each time we evaluate a specific model or algorithm, we take the average precision, average recall, average accuracy, and average profit over all 23 stocks in our universe. These are the performance metrics used in this report.

Models & Results
3.1. Logistic Regression
3.1.1 Feature Optimization and Dimensionality Constraint
To predict the stock-price trends, our goal was to predict $\mathbf{1}\{P_{\text{close}}^{(t)} > P_{\text{open}}^{(t)}\}$ based on the discussion above.
The first model we tried was logistic regression.³ Initially, we attempted to fit logistic regression with the following six features: 1) percentage change in open price, 2) percentage change in high price, 3) percentage change in low price, 4) percentage change in open volume, 5) percentage change in high volume, and 6) percentage change in low volume.
Note that the change in the "open" variables is taken between the current and the previous 1-minute bucket. The high and low variables of the current bucket, however, are not yet observed, so for them we can only use the change between the previous two buckets as an indicator of the trend. Formally, these features are given by the formulas below:⁴

$$\frac{P_{\text{open}}^{(t)} - P_{\text{open}}^{(t-1)}}{P_{\text{open}}^{(t-1)}} \tag{1}$$

$$\frac{P_{\text{high}}^{(t-1)} - P_{\text{high}}^{(t-2)}}{P_{\text{high}}^{(t-2)}} \tag{2}$$

$$\frac{P_{\text{low}}^{(t-1)} - P_{\text{low}}^{(t-2)}}{P_{\text{low}}^{(t-2)}} \tag{3}$$

$$\frac{V_{\text{open}}^{(t)} - V_{\text{open}}^{(t-1)}}{V_{\text{open}}^{(t-1)}} \tag{4}$$

$$\frac{V_{\text{high}}^{(t-1)} - V_{\text{high}}^{(t-2)}}{V_{\text{high}}^{(t-2)}} \tag{5}$$

$$\frac{V_{\text{low}}^{(t-1)} - V_{\text{low}}^{(t-2)}}{V_{\text{low}}^{(t-2)}} \tag{6}$$

The results, however, showed that a logistic regression model could not be applied well to this set of high-dimensional features. Intuitively, this behavior can be explained by the significant noise introduced by the high-dimensional features, which makes it difficult to fit weights for our model. More specifically, certain features may obscure patterns captured by other features.
In an attempt to reduce the dimensionality of our feature space, we use cross-validation to eliminate less effective features. We found that a logistic regression model on our stock data can reliably fit at most a two-dimensional feature space. The results of the cross-validation suggested that feature (1) and feature (4) provide optimal results.
In addition to optimizing the feature set, we also use cross-validation to obtain an optimal training set, which in our application is defined by the training duration. Figure 1 plots the variation of the metrics over training durations from a 30-minute period to a 120-minute period (the heuristic assumption is that training begins at 9:30 AM and testing lasts for 30 minutes right after training finishes). We observe that the logistic regression model achieves maximal performance when the training duration is set to 60 minutes.
Figure 1: Performance over different training durations

² Open price/volume, high price/volume, low price/volume, and close volume.
³ Our implementation uses the mnrfit function in Matlab.
⁴ We will denote features using the numbering of equations for the rest of this report, e.g. feature (1) is $(P_{\text{open}}^{(t)} - P_{\text{open}}^{(t-1)}) / P_{\text{open}}^{(t-1)}$.

Hence, we train the logistic regression model with feature (1) and feature (4) from 9:30 AM to 10:30 AM. The resulting model achieves precision 55.07%, recall 30.05%, accuracy 38.39%, and profit 0.0123 when tested on the rest of the day.
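The four reported numbers follow the metric definitions of Section 2.3. A minimal sketch of how they could be computed for one stock is given below; the function name `evaluate` and its argument layout are assumptions made for illustration, and returning 0.0 when a denominator is zero is a choice made here, not something the report specifies.

```python
def evaluate(predicted_up, open_prices, close_prices):
    """Simulate the one-share-per-minute strategy for a single stock.

    predicted_up[t] is True when the model predicts close > open for
    bucket t. Returns rolling profit, accuracy, precision, and recall.
    """
    profit = 0.0
    correct = uptick_predictions = accurate_upticks = actual_upticks = 0
    for pred, p_open, p_close in zip(predicted_up, open_prices, close_prices):
        pred = bool(pred)
        went_up = p_close > p_open
        if pred:
            # Buy 1 share at the open and sell it at the close of the minute.
            profit += p_close - p_open
            uptick_predictions += 1
            if went_up:
                accurate_upticks += 1
        if went_up:
            actual_upticks += 1
        if pred == went_up:
            correct += 1
    total = len(predicted_up)
    return {
        "profit": profit,
        "accuracy": correct / total if total else 0.0,
        "precision": (accurate_upticks / uptick_predictions
                      if uptick_predictions else 0.0),
        "recall": (accurate_upticks / actual_upticks
                   if actual_upticks else 0.0),
    }
```

Averaging the returned values over all stocks in the universe gives figures of the kind quoted in this report.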
3.1.2 Improvements based on Time Locality
While logistic regression achieved reasonable performance with the two-dimensional feature set consisting of (1) and (4), making a profit of 0.0123, we attempted to further improve our results. As discussed earlier, our logistic regression model is constrained to a low-dimensional feature space. As a result, we must either select more descriptive features in a low-dimensional feature space or use a different model that can learn from a higher-dimensional feature space.
We started by constructing more descriptive features. We hypothesized that the stock market exhibits significant time-locality of price trends, based on the fact that it is often influenced by group decision making and other time-bound events that occur in the marketplace. The signals of these events are usually visible over a time frame longer than a minute, since in the very short term these trends are masked by the inherent volatility of stock prices. For example, if the market enters a mode of general rise with high fluctuation at a certain time, then large 1-minute percentage changes in price or volume become less significant in comparison to the general trend.
We attempted to address these concerns by formulating new features based on the $\lambda$-minute high-low model [1].⁵ Professionals in the algorithmic trading field recommended the heuristic choice of $\lambda = 5$.⁶ The $\lambda$-minute high-low model tracks the high price, low price, high volume, and low volume across all the ticks in any $\lambda$-minute span. For the most recent $\lambda$-minute span w.r.t. any 1-minute bucket at time $t$, we define $P_H^{(t)}$, $P_L^{(t)}$, $V_H^{(t)}$, $V_L^{(t)}$ as follows:

$$P_H^{(t)} = \max_{t-\lambda \le i \le t-1} P_{\text{high}}^{(i)} \tag{7}$$

$$P_L^{(t)} = \min_{t-\lambda \le i \le t-1} P_{\text{low}}^{(i)} \tag{8}$$

$$V_H^{(t)} = \max_{t-\lambda \le i \le t-1} V_{\text{high}}^{(i)} \tag{9}$$

$$V_L^{(t)} = \min_{t-\lambda \le i \le t-1} V_{\text{low}}^{(i)} \tag{10}$$

Under the $\lambda$-minute high-low model, we choose our features to be the following:

$$\frac{P_{\text{open}}^{(t)} - P_{\text{open}}^{(t-1)}}{P_H^{(t)} - P_L^{(t)}} \tag{11}$$

$$\frac{V_{\text{open}}^{(t)} - V_{\text{open}}^{(t-1)}}{V_H^{(t)} - V_L^{(t)}} \tag{12}$$

Specifically, they are the ratios of the open price change and the open volume change, respectively, to the most recent "$\lambda$-minute high-low spread".
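Features (11) and (12) can be computed from per-bucket identifiers as in the following sketch. The function name `high_low_features`, the dictionary keys, and the choice to return `None` when there is not enough history or a spread is zero are assumptions made for illustration; the report does not say how such cases were handled.

```python
def high_low_features(buckets, t, lam=5):
    """Return (feature 11, feature 12) for bucket t, or None if undefined.

    buckets: list of dicts, one per 1-minute bucket, with the keys
    open_price, high_price, low_price, open_volume, high_volume, low_volume.
    """
    if t < lam or t >= len(buckets):
        return None  # fewer than lam earlier buckets, or t out of range
    window = buckets[t - lam:t]  # buckets t-lam, ..., t-1
    p_high = max(b["high_price"] for b in window)
    p_low = min(b["low_price"] for b in window)
    v_high = max(b["high_volume"] for b in window)
    v_low = min(b["low_volume"] for b in window)
    if p_high == p_low or v_high == v_low:
        return None  # zero spread: the ratio is undefined
    price_change = buckets[t]["open_price"] - buckets[t - 1]["open_price"]
    volume_change = buckets[t]["open_volume"] - buckets[t - 1]["open_volume"]
    return (price_change / (p_high - p_low),
            volume_change / (v_high - v_low))
```

Only the open values of bucket `t` are used, so the features rely solely on information available at the start of the minute being predicted.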
Since our stock universe may differ from the setting behind that recommendation, we use cross-validation to determine the optimal value of $\lambda$. Figure 2 suggests that $\lambda = 5$ leads to maximal precision while $\lambda = 10$ yields maximal profit and recall. For the purpose of this project, we chose $\lambda = 5$ because higher precision leads to a more conservative strategy.
Figure 2: Performance over different $\lambda$
⁵ Inspired by CS 246 (2011-2012 Winter) HW4, Problem 1.
⁶ Keith Siilats, a former CS 246 TA.

Also, we set the training duration to 60 minutes based on another cross-validation analysis with $\lambda = 5$. Our $\lambda$-minute high-low logistic regression model finally achieves precision 59.39%, recall 27.43%, accuracy 41.58%, and profit 0.0186.
Table 1: Comparison between two logistic regression models

| Model | Profit | Precision | Recall | Accuracy |
|---|---|---|---|---|
| Baseline | 0.0123 | 55.07% | 30.05% | 38.39% |
| $\lambda$-HL | 0.0186 | 59.39% | 27.43% | 41.58% |

Comparing the performance of the two logistic regression models in Table 1, we see that the $\lambda$-minute high-low model outperforms the baseline model in profit, precision, and accuracy, with slightly lower recall. This result validates our hypothesis on the time-locality characteristic of stock data and suggests that time-locality lasts around 5 minutes.
3.2. Support Vector Machine
As we discussed earlier, further improvement may still be possible by exploring a new machine learning model. The previous model constrained us to a low-dimensional feature space; to overcome this constraint, we experimented with an SVM using $\ell_1$ regularization with C = 1.
3.2.1 Feature & Parameter Selection
We tried different combinations of the 8 features defined by equations (1) to (6), (11), and (12). Since there is a large number of feature combinations to consider, we used forward search to incrementally add features to our existing feature set and chose the best set based on our 4 metrics.
Table 2: Performance over different feature sets

| Features | Profit | Precision | Recall | Accuracy |
|---|---|---|---|---|
| (1), (4) | 0.3066 | 44.72% | 52.11% | 42.85% |
| (11), (12) | 0.3706 | 42.81% | 57.64% | 40.34% |
| (1), (4), (11), (12) | 0.3029 | 42.48% | 47.54% | 39.42% |
| (1), (4), (11), (12), (2), (5) | 0.3627 | 45.22% | 56.25% | 42.60% |
| (1), (4), (11), (12), (2), (5), (3), (6) | 0.3484 | 46.43% | 55.66% | 42.91% |

We chose the last feature set since it leads to the highest precision and also very high profit, recall, and accuracy. In addition, we set the training duration to 60 minutes using cross-validation. Similarly, we chose the optimal $\lambda = 10$ and C = 0.1 using cross-validation. We also compared the linear kernel with the Gaussian kernel; the linear kernel tends to give better results.
The SVM model trained with the chosen training duration, $\lambda$, and C finally achieves precision 47.13%, recall 53.96%, accuracy 42.30%, and profit 0.3066. Comparing the $\lambda$-minute high-low logistic regression model with the SVM model, we see that the SVM model improves recall significantly, by almost 100%, while sacrificing only around 20% of precision (both in relative terms).
3.2.2 Time-Locality Revisited
Recall that the $\lambda$-minute high-low model is based on our hypothesis that there exists a $\lambda$-minute rolling correlation between trades within a certain period of time, and that by cross-validation we chose $\lambda = 10$ for the SVM model. To further substantiate this hypothesis, we conducted an experiment in which we trained an SVM using the optimal parameters from the previous section and then evaluated the model by testing it on different periods of time.
Specifically, the performance statistics of an SVM model trained from 9:30 AM to 10:30 AM are listed in Table 3. A close inspection shows a downtrend in performance as the delay between the training period and the testing period grows. In fact, it would not be surprising to see even better performance from this model within 10 minutes after training completes, since we chose $\lambda = 10$.⁷
Table 3: Performance over periods of time

| Period | Profit | Precision | Recall | Accuracy |
|---|---|---|---|---|
| 10:30-11:00 AM | 0.0926 | 56.45% | 38.10% | 43.92% |
| 10:45-11:15 AM | 0.0684 | 42.49% | 38.32% | 42.15% |
| 11:00-11:30 AM | 0.0775 | 54.29% | 41.09% | 43.07% |
| 11:15-11:45 AM | 0.0726 | 48.68% | 36.68% | 38.68% |
| 11:30 AM-12:00 PM | 0.0632 | 32.74% | 29.77% | 40.44% |

⁷ The result is precision: 68.84%, recall: 36.88%, accuracy: 44.84%; its precision and accuracy top all rows of Table 3.

Conclusion and Further Work

Predicting stock market trends using machine learning algorithms is a challenging task because the trends are masked by various factors such as noise and volatility. In addition, the market operates in various local modes that change from time to time, making it necessary to capture those changes in order to trade profitably.
Although our algorithms and models were simplified, we met our expectation of reaching modest profitability. Our sequential analysis made clear that factoring in time-locality and capturing features after smoothing, which reduces volatility, improves profitability and precision substantially.
Factoring in higher-dimensional features after careful selection can also significantly improve the results, as our comparison of the SVM with logistic regression shows. We expect that this is because higher dimensionality increases the likelihood that the dataset is linearly separable.
Finally, iterative improvements achieved through sequential optimizations in the form of discretization, exploitation of time-locality, and smoothing improved the results significantly. Cross-validation and forward search were also powerful tools for making the algorithms perform better.
In conclusion, our experience in this project suggests that machine learning has great potential in this field, and we hope to continue working on this project to explore further improvements in performance through better algorithms as well as optimizations.
We think a few questions would be worth investigating. One is to explore other international stock markets to find locations where algorithmic trading performs better. In addition, it would be interesting to investigate other algorithms, such as reinforcement learning, and compare them with the models discussed in this report. Feature selection has been key, and more work on discovering more descriptive features promises to improve the results further.

Acknowledgements
We would like to thank Professor Andrew Ng and the TAs of the class for their feedback and input on the project. We would also like to thank Keith Siilats for his generous help in the form of advice as well as valuable personal experience in the field that helped inform our decisions.
References
[1] Jure Leskovec; TA: Keith Siilats. CS 246, HW4.

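A. Appendix

| Stock Ticker | Origin |
|---|---|
| APOL | US Equity |
| CBG | US Equity |
| CMA | US Equity |
| CMS | US Equity |
| CVS | US Equity |
| GCI | US Equity |
| GME | US Equity |
| JBL | US Equity |
| KIM | US Equity |
| LNC | US Equity |
| NFX | US Equity |
| NWL | US Equity |
| NYX | US Equity |
| PWR | US Equity |
| QEP | US Equity |
| SEE | US Equity |
| TER | US Equity |
| THC | US Equity |
| TIE | US Equity |
| TXT | US Equity |
| ZION | US Equity |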
