Bachelor's Thesis

My Bachelor’s thesis, titled “Inventory Forecasting in the Crude Oil Market with Machine Learning”, forecasts future values of weekly U.S. crude oil inventories using deep neural networks. To improve the forecasts, the models incorporate inventory driver features alongside novel natural language features derived from breaking financial news headlines.

Crude oil is the most traded commodity in the world, with $1.45T traded in 2022, so the market has an enormous amount of liquidity and provides many trading opportunities. The motivation for this thesis was to study the relationship between inventory movement and market movement: by forecasting inventory ahead of time using news headlines, the aim was to achieve returns from this non-linear relationship.

Our research used the Energy Information Administration’s (EIA) weekly reports on crude oil inventories, which offer insight into U.S. oil supply levels. We also leveraged the weekly preliminary report from the American Petroleum Institute (API). These weekly reports provide an early indication of inventory changes that can influence market expectations and help investors gauge the balance between supply and demand in the crude oil market.

EIA Release Numbers & Market Movement on October 18th 2023.

The dataset consists of 159 weekly economic events, sampled together with 25,530 headlines from FinancialJuice. A pre-trained RoBERTa model was used to generate aggregate sentiment values and word embeddings, and the embeddings were then reduced in dimension so that our models would not fall victim to the curse of dimensionality. Extensive hyperparameter tuning was carried out using Bayesian optimisation, and an expanding window technique was used to check the consistency of the results and to show how well the best models perform over time.
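
The expanding window idea can be illustrated with a minimal sketch. This is not the thesis code: the function name `expanding_window_splits` and the fold sizes (`initial_train=99`, `test_size=12`) are assumptions chosen for illustration, and only the total of 159 events comes from the dataset described here. The point it shows is that the training set always starts at the first event and grows each fold, so a model is only ever tested on events that come after everything it was trained on.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training set always starts at index 0 and grows by `test_size`
    each fold; the test set is the block of events immediately after it.
    """
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        train_end += test_size


# 159 weekly release events; the fold sizes below are illustrative assumptions.
folds = list(expanding_window_splits(n_samples=159, initial_train=99, test_size=12))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Scoring the same model configuration on every fold gives a sequence of out-of-sample results over time rather than a single number from one split.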

Transformer model results on test data, using max pooled embeddings passed through a linear layer.

Results from the models are largely inconclusive, likely for three reasons. Firstly, financial news headlines are short and lack context. Secondly, the pre-trained model draws its context from the Financial PhraseBank dataset rather than from a text corpus that better represents the supply and demand dynamics of the crude oil market. Thirdly, the sample size of 159 release events is small.

However, the approach still shows a lot of promise: a fine-tuned time series deep neural network trained on an augmented dataset would be expected to handle the task more effectively. Such a model would cover the weaknesses of a pre-trained model while combining the strengths of models built for time series data with the natural language expertise of transformer neural networks.

Trading Copier

Trading Copier Thumbnail

The Telegram Trading Copier is a Python project that extracts trading signals using the Telegram API and translates them into real trades with a take profit and stop loss. Each trade is sent to a broker through an API call as a market or limit order.

It then keeps track of every trade in an SQL database and decides the next correct move using streams of tick data, processing every tick for all instruments that currently have a pending order or open position via REST and streaming APIs. The project includes over 10,000 lines of code and follows upwards of 10 active Telegram channels. I’ve also built a framework for this project to backtest signals using 5-second historical candle data.
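
The signal-to-order step can be sketched roughly as follows. This is an illustration, not the project's actual parser: the message layout (`BUY EURUSD @ 1.0850 SL 1.0800 TP 1.0950`), the `Order` dataclass and the `parse_signal` function are all hypothetical, and real channels each use their own formats. The sketch only shows the general shape: extract side, instrument, stop loss and take profit from free text, and choose a limit order when an entry price is quoted and a market order otherwise.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical message layout, e.g. "BUY EURUSD @ 1.0850 SL 1.0800 TP 1.0950".
# The "@ price" part is optional; without it the signal is treated as a market order.
SIGNAL_PATTERN = re.compile(
    r"(?P<side>BUY|SELL)\s+(?P<symbol>[A-Z]{6})"
    r"(?:\s*@\s*(?P<entry>\d+(?:\.\d+)?))?"
    r"\s+SL\s+(?P<sl>\d+(?:\.\d+)?)"
    r"\s+TP\s+(?P<tp>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Order:
    side: str              # "BUY" or "SELL"
    symbol: str            # e.g. "EURUSD"
    order_type: str        # "limit" when an entry price is quoted, else "market"
    entry: Optional[float]
    stop_loss: float
    take_profit: float


def parse_signal(text: str) -> Optional[Order]:
    """Return an Order for a recognised signal, or None if the text is not one."""
    match = SIGNAL_PATTERN.search(text)
    if match is None:
        return None
    side = match["side"].upper()
    stop_loss = float(match["sl"])
    take_profit = float(match["tp"])
    # Sanity check: a buy needs its stop below its target, a sell the reverse.
    if side == "BUY":
        valid = stop_loss < take_profit
    else:
        valid = stop_loss > take_profit
    if not valid:
        return None
    entry = float(match["entry"]) if match["entry"] else None
    return Order(
        side=side,
        symbol=match["symbol"].upper(),
        order_type="limit" if entry is not None else "market",
        entry=entry,
        stop_loss=stop_loss,
        take_profit=take_profit,
    )


print(parse_signal("BUY EURUSD @ 1.0850 SL 1.0800 TP 1.0950"))
print(parse_signal("Good morning traders!"))
```

Returning `None` for anything that does not parse cleanly or fails the sanity check keeps chatter and malformed signals from ever reaching the broker call.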

Time is Money

Every day, breaking financial news headlines are reported by news platforms such as FinancialJuice. This project aimed to take all tagged historical headlines and capture the relationship between the time a headline is released and the market movement afterwards. The methodology is to gather over 100,000 breaking news headlines and analyse their semantic structure in relation to three years of market time-series data, generating volatility metrics and visualising the results with topic modelling tools.

Time is Money: A Visualisation

To view the interactive map, head over to the build

The interactive map can be split into two core components: the time series data and the unstructured headline data. For the time series data, we use the open, high, low and close (OHLC) values surrounding the headline release to generate volatility values. These were produced with the Garman-Klass estimator, as it uses the full OHLC data and is better suited to intraday price extremes. A sliding window method was used to take the average Garman-Klass volatility over specific time windows (2m, 5m, 15m, 30m and 60m). The headlines were then embedded using OpenAI’s embedding tool to work out the similarity of headlines. The embeddings are 768-dimensional vectors that can be visualised in a 2D plane, where the closer two headlines are in this space, the more similar OpenAI deems them to be.
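
As a rough illustration of the volatility step, the sketch below applies the standard per-bar Garman-Klass variance, 0.5·ln(H/L)² − (2·ln 2 − 1)·ln(C/O)², and averages it over the bars that follow a headline. It assumes 1-minute bars, so the 2m to 60m windows become the first 2 to 60 bars. The function names and the sample bars are hypothetical, and the project's exact windowing may differ.

```python
import math
from typing import Dict, Sequence, Tuple

Bar = Tuple[float, float, float, float]  # (open, high, low, close)


def garman_klass_variance(bar: Bar) -> float:
    """Per-bar Garman-Klass variance estimate from one OHLC bar."""
    o, h, l, c = bar
    return 0.5 * math.log(h / l) ** 2 - (2 * math.log(2) - 1) * math.log(c / o) ** 2


def garman_klass_volatility(bars: Sequence[Bar]) -> float:
    """Square root of the mean per-bar variance over the given bars."""
    mean_var = sum(garman_klass_variance(b) for b in bars) / len(bars)
    return math.sqrt(max(mean_var, 0.0))  # guard against tiny negative rounding


def post_headline_volatility(
    bars_after: Sequence[Bar], windows: Sequence[int] = (2, 5, 15, 30, 60)
) -> Dict[int, float]:
    """Volatility over the first `w` bars after a headline, for each window `w`.

    Assumes 1-minute bars; windows longer than the available data are skipped.
    """
    return {
        w: garman_klass_volatility(bars_after[:w])
        for w in windows
        if len(bars_after) >= w
    }


# Five made-up 1-minute bars following a headline release.
sample = [
    (100.0, 100.6, 99.8, 100.4),
    (100.4, 101.0, 100.2, 100.9),
    (100.9, 101.1, 100.5, 100.6),
    (100.6, 100.8, 100.4, 100.7),
    (100.7, 100.9, 100.6, 100.8),
]
print(post_headline_volatility(sample))
```

Because the estimator uses the high-low range as well as the open-to-close move, a bar that spikes and then reverts still registers as volatile, which a close-to-close measure would miss.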

In addition, we had over 300,000 headlines, the majority of which were untagged. We therefore trained a multi-label classifier on the tagged headlines to predict the tags of the untagged ones. On a test set of tagged headlines, this tagging tool identified the relevant tags with 90% precision.
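
For readers unfamiliar with how precision works when a headline can carry several tags at once, here is a small sketch of micro-averaged precision. The averaging scheme behind the 90% figure is not stated here, so micro-averaging is an assumption, and the tags in the example are made up.

```python
from typing import List, Set


def micro_precision(true_tags: List[Set[str]], predicted_tags: List[Set[str]]) -> float:
    """Share of all predicted tags that are correct, pooled across headlines."""
    true_positives = sum(len(t & p) for t, p in zip(true_tags, predicted_tags))
    total_predicted = sum(len(p) for p in predicted_tags)
    return true_positives / total_predicted if total_predicted else 0.0


# Made-up example: three headlines with their true and predicted tag sets.
true_tags = [{"oil", "opec"}, {"fed"}, {"usd", "fed"}]
predicted_tags = [{"oil"}, {"fed", "usd"}, {"usd", "fed"}]
print(micro_precision(true_tags, predicted_tags))  # 4 correct out of 5 predicted
```

Precision penalises wrong tags but not missed ones (the first headline's missing `opec` tag does not lower the score), which is why it is usually reported alongside recall.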

IC Hack 2023

Front-end User Interface live demo snapshot

During Imperial’s Hackathon, I teamed up with 5 others to take on Terra API’s wearable challenge, where the goal was to build the most innovative hack using health data. After some brainstorming, we settled on the objective of monitoring recovery after physical surgery. My contribution to the project was simulating physiological responses with Monte Carlo methods in Python. Since only heart rate data could be streamed from the wearable device we were given, I decided to simulate additional data to mimic real-world events. Each vital’s state changes according to probabilistic rules based on its current state, a stochastic process similar to a Markov chain. Essentially, by using logic zones with core health metrics, we were able to generate realistic examples of what a recovering patient’s vitals may look like, and to warn medical professionals such as doctors and nurses if the patient’s vitals reached critical logic zones. The maths for the transition equation from state to state can be defined as below.

Monte Carlo Simulation akin to Markov Chain
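
A minimal sketch of this kind of state-based simulation is given here. It is not the hackathon code: the three zones, the transition probabilities and the function names are all hypothetical, chosen only to show how the next state is sampled from probabilities that depend on the current state alone.

```python
import random
from typing import Dict, List, Optional

# Hypothetical logic zones and transition probabilities; each row sums to 1.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "normal":   {"normal": 0.90, "elevated": 0.09, "critical": 0.01},
    "elevated": {"normal": 0.30, "elevated": 0.60, "critical": 0.10},
    "critical": {"normal": 0.05, "elevated": 0.35, "critical": 0.60},
}


def next_state(state: str, rng: random.Random) -> str:
    """Sample the next zone using only the current zone's transition row."""
    row = TRANSITIONS[state]
    return rng.choices(list(row), weights=list(row.values()), k=1)[0]


def simulate(steps: int, start: str = "normal", seed: Optional[int] = None) -> List[str]:
    """Simulate a path of `steps` transitions, returning steps + 1 states."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate(steps=60, seed=42)
# Time steps at which a medical professional would be warned.
alerts = [t for t, state in enumerate(path) if state == "critical"]
print(f"{len(alerts)} critical readings out of {len(path)}")
```

Running many such paths with different seeds is what makes this a Monte Carlo approach: each run is one plausible recovery trajectory, and the alerting logic can be exercised against all of them.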

This was then integrated into the system, where a score of the patient’s health was calculated and a notification was sent to the medical professional to prompt an appropriate response. A key takeaway from the hackathon was the importance of scaling scope through effective communication; great achievements are possible only when everyone is aligned. Not all members of our team had a technical background, so building a bridge of communication between the technical and non-technical sides was crucial to getting our project over the line. The full system overview can be seen below.

High-Level System Design Diagram
