Building a complex reinforcement learning crypto-trading environment in Python

Mohammad Abdin · Published in Level Up Coding · 7 min read · Jan 19, 2021

I’ve only recently started learning about reinforcement learning in detail, and I was fascinated after understanding how feats in the field were achieved by projects like DeepMind’s AlphaGo and AlphaZero. This made me eager to understand deep RL with all its complexities, so I decided to dissect its components and use them in an interesting application.

Building a consistently profitable trading bot is no easy task, and to some degree probably impossible. A perfect model would basically know the state of the market with all its variables, along with some variables we don’t even know exist. With a project like AlphaZero, I was amazed by the fact that the algorithm was able to develop strategies in the game of Go that were still undiscovered by high-level Go players. Based on this notion, I’m interested in exploring the strategies an RL agent could develop in a complex trading environment, and by complex, I just mean across multiple markets.

Most of the articles I’ve read on this topic stick to one market, for example a “Bitcoin trading bot”. My reasoning for adding this ‘complexity’ factor is that, first, we are giving our agent more data from different markets, reducing the chance that it overfits to a single one, and second, there could be meaningful patterns in observing multiple markets within the same space, which for this project is the digital currency space.

Building our environment

I’ll be using TensorFlow’s Agents library (TF-Agents) to interact with the environment later on, so I will be building a custom environment class that extends the PyEnvironment class provided by the TF-Agents environments module.

The two most important specs to note are the actions and observations. With each step our agent will choose a coin to buy, sell, or hold for a percentage of what it currently owns, hence we have a numpy array of shape (3,) of type int, where the first value selects which of the pairs to trade, the second is buy, sell, or hold, and the last is the amount, where 1 means 10%, 2 means 20%, and so on.

As for the observation space, I want my agent to observe the market as a whole and not just the coin that it just traded in. So for now it’s (4, 5, 40), where 4 is the number of pairs, 5 is the volume, open, high, low, and close values for each step, and 40 is the look-back window, which is basically how far back our agent should look at the data. This might not be the optimal observation space, but my focus for this article is just to have a fully fledged environment that we can start experimenting with, and we can engineer every component later on.
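Since the original code isn’t reproduced inline here, the following is a minimal sketch of how those two specs might be declared in a TF-Agents custom environment; the exact bounds, dtypes, and class name are my own assumptions:

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec


class CryptoTradingEnv(py_environment.PyEnvironment):
    def __init__(self):
        # Action: [which pair (0-3), 0=buy / 1=sell / 2=hold, amount 1-10 (x10%)]
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(3,), dtype=np.int32,
            minimum=[0, 0, 1], maximum=[3, 2, 10], name='action')
        # Observation: 4 pairs x 5 OHLCV values x 40-step look-back window
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(4, 5, 40), dtype=np.float64, minimum=0.0, name='observation')

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec
```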

We also define our wallet as a list where the first value is the initial USD balance. Note that I’m using locally stored data in the form of an Excel spreadsheet; I used Binance’s API to download market data for each pair, separated by sheets:

Excel file for market data

I won’t be going over the script I used to extract data into this format, but you can find it here.
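For reference, reading that multi-sheet spreadsheet back into one dataframe per pair is straightforward with pandas; the file name and the wallet layout below are just illustrative assumptions:

```python
import pandas as pd

# sheet_name=None returns a dict of {sheet_name: DataFrame}, one per pair
price_data = pd.read_excel('market_data.xlsx', sheet_name=None)

# Wallet: first value is the initial USD balance, then one balance per coin
wallet = [1000.0] + [0.0] * len(price_data)
```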

Next, we define our reset() method to reset our environment with each episode:

Similar to the constructor, in the reset() method we reinitialize everything except constants such as our price data.
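As a rough sketch (the attribute and helper names such as _get_observation() are hypothetical), _reset() could look like this:

```python
from tf_agents.trajectories import time_step as ts

def _reset(self):
    # Reinitialize everything except constants such as the price data
    self._current_step = self._lookback_window          # e.g. 40
    self._wallet = [self._initial_balance] + [0.0] * self._num_pairs
    self._episode_ended = False
    return ts.restart(self._get_observation())
```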

Now, moving on to our step function, which the agent passes an ‘action’ to and receives a ‘TimeStep’ object from, containing the reward, state, and discount at each step.

First, I initialized a list ‘data’ which will contain the dataframes necessary for the state at that step (i.e. taking the rows in the range between the current step and 40 steps behind it, which is our look-back window). Then I added the condition for when our episode ends, which is when our agent loses all its money.

We then either buy, sell, or hold, updating the wallet accordingly, and lastly we increment our step counter and return the TimeStep object as a transition to the next state.
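Putting those pieces together, a simplified _step() might look like the sketch below; the trade and portfolio helpers are hypothetical stand-ins for the actual logic in the repository:

```python
from tf_agents.trajectories import time_step as ts

def _step(self, action):
    if self._episode_ended:
        # The previous action ended the episode, so start a new one
        return self.reset()

    pair, trade_type, amount = action   # amount of 1 means 10%, 2 means 20%, ...

    # State for this step: the last 40 rows (look-back window) for every pair,
    # which is what _get_observation() would be built from
    data = [df.iloc[self._current_step - self._lookback_window:self._current_step]
            for df in self._price_data.values()]

    # Episode ends when the agent loses all its money
    if self._portfolio_value() <= 0:
        self._episode_ended = True
        return ts.termination(self._get_observation(), reward=0.0)

    # Buy, sell, or hold and update the wallet accordingly
    self._apply_trade(pair, trade_type, amount / 10.0)

    self._current_step += 1
    # Reward is left as a placeholder for now
    return ts.transition(self._get_observation(), reward=0.0, discount=1.0)
```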

Again, I’m leaving all the actual reinforcement learning engineering for a separate article, so our reward currently does nothing at all. The end goal of this part is just to have an environment that an agent can interact with and that can be visually rendered.

Rendering our environment

For this, I defined a new class to graph our price data and the agent’s moves as candlesticks on multiple subplots that can be iteratively updated with each step.

Here, simply enough, I initialized the subplots so that we have 4 axes using plt.subplots(2, 2), giving us 2 rows and 2 columns:

initial plot
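A bare-bones version of that constructor might look like this (the figure size and the pair_names argument are additions of mine):

```python
import matplotlib.pyplot as plt

class TradingGraph:
    def __init__(self, pair_names):
        plt.ion()   # interactive mode: we decide when the figure gets redrawn
        self.fig, self.axs = plt.subplots(2, 2, figsize=(12, 8))   # 2 rows x 2 columns
        self.pair_names = pair_names
```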

A common library used for candlestick graphs is ‘mpl_finance’, but for some reason I was having difficulty getting it to work, so I decided to write my own candlestick plotting function within our ‘TradingGraph’ class:

The function takes in an axis, the OHLC values, and the index values, then goes over the OHLC rows to plot two lines: the first is a thin line that represents our high and low values, and the second is for the open and close. This iteration results in a candlestick graph.
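My own approximation of such a method inside TradingGraph, assuming ohlc is an iterable of (open, high, low, close) rows and that candles are colored by direction:

```python
def plot_candles(self, ax, ohlc, index):
    for x, (o, h, l, c) in zip(index, ohlc):
        color = 'green' if c >= o else 'red'
        ax.plot([x, x], [l, h], color=color, linewidth=1)   # thin line: high-low wick
        ax.plot([x, x], [o, c], color=color, linewidth=4)   # thick line: open-close body
```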

Next we need to define the function that will take in all the data and plot it on the 4 subplots:

The ‘axs’ variable in our constructor holds our subplots in a 2-d numpy array so that they can be accessed by index, for example [0, 0] is the first subplot, [0, 1] is the second, and so on. So I used the flatten() function to turn it into a 1-d array of size 4, and now we can iterate over the dataframes, where each dataframe here contains the price data for a coin, and edit its plot.
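A possible version of that method along those lines (the column names and the dict of dataframes are assumptions):

```python
def render_prices(self, dataframes):
    # dataframes: {pair_name: OHLC DataFrame}, one per subplot
    for ax, (name, df) in zip(self.axs.flatten(), dataframes.items()):
        ax.clear()
        ax.set_title(name)
        self.plot_candles(ax, df[['open', 'high', 'low', 'close']].values, df.index)
```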

Next I defined a function to render the trades taken by our agent:

Similar to the previous function, we iterate over each plot to add a horizontal line at the price at which the trade was made, with a dot to indicate when that action was taken, using the color red to indicate a buy and green to indicate a sell.
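A sketch of that trade-rendering step, assuming trades are stored per pair as (step, price, side) tuples:

```python
def render_trades(self, trades):
    for ax, pair_trades in zip(self.axs.flatten(), trades):
        for step, price, side in pair_trades:
            color = 'red' if side == 'buy' else 'green'
            ax.axhline(price, color=color, linewidth=0.5)         # line at the trade price
            ax.plot(step, price, 'o', color=color, markersize=4)  # dot at the trade time
```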

Lastly we need a method to put everything together that will be used by our environment’s render function:

Here we simply call render_prices() and render_trades(), followed by matplotlib’s draw(). Normally matplotlib displays the figure and blocks when show() is called, but we used plt.ion() in our constructor to turn on interactive mode and tell it that we’ll be handling when to update the figure. So with each step we do all the necessary plotting, then call draw() to redraw the figure with the changes. Lastly, plt.pause(0.1) indicates the time in seconds that we wait between frames and is necessary for our live plot updating mechanism to work.
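So the top-level render method of TradingGraph boils down to something like:

```python
def render(self, dataframes, trades):
    self.render_prices(dataframes)
    self.render_trades(trades)
    plt.draw()        # redraw the figure with this step's changes
    plt.pause(0.1)    # wait 0.1 s between frames so the live plot can update
```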

We can go back to our environment class, import this visualization class as ‘tg’ and define our render function to use what we just built.
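Roughly, the environment side of that wiring might look like this (the module name, lazy construction, and helper names are my assumptions):

```python
import trading_graph as tg   # the visualization module built above

def render(self, mode='human'):
    if self._graph is None:
        self._graph = tg.TradingGraph(self._pair_names)
    self._graph.render(self._current_window(), self._trades)
```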

Testing the environment

To test our environment I made a new python file to import the environment and start interacting with it:

This runs the environment for 100 steps using a random buy, sell, or hold action with the probability distribution 0.1, 0.1, 0.8, respectively, then chooses a random coin. I’ve kept the amount at 2 because it doesn’t affect our current visual model, so there’s no need to randomize it for now.
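That test loop can be approximated as follows; the module and class names are placeholders for the actual code in the repository:

```python
import numpy as np
from crypto_trading_env import CryptoTradingEnv   # module/class names assumed

env = CryptoTradingEnv()
time_step = env.reset()

for _ in range(100):
    trade_type = np.random.choice([0, 1, 2], p=[0.1, 0.1, 0.8])   # buy, sell, hold
    pair = np.random.randint(0, 4)                                # random coin
    amount = 2                                                    # fixed at 20% for now
    action = np.array([pair, trade_type, amount], dtype=np.int32)
    time_step = env.step(action)
    env.render()
```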

Our final result:

There is definitely room for adjustment: adding more annotations, labels, maybe a grid, etc. But for now, I think this is a solid prototype that we can work with and start applying reinforcement learning techniques to train an agent on.

The next step would be to wrap our environment in TF-Agents’ TFPyEnvironment so that TF agents can interact with it properly, but we’ll keep that for the next article.
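For what it’s worth, with TF-Agents that wrapping step is a one-liner, and the library also ships a validator that is handy for sanity-checking custom environments:

```python
from tf_agents.environments import tf_py_environment, utils

env = CryptoTradingEnv()
utils.validate_py_environment(env, episodes=2)   # check specs and time steps agree
tf_env = tf_py_environment.TFPyEnvironment(env)  # wrapper usable by TF-Agents agents
```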

Summary and Final Thoughts

In this article, I discussed the process I went through to design a fully fledged TF-Agents reinforcement learning environment for crypto-trading in Python. You can find the full code for this project here. There’s still some work to be done before we can start actually applying reinforcement learning. Any updates I make will be reflected in this article and in the repository, so there’s no need to go over these minor changes here. Hopefully in the next article we can start with the more interesting aspects and begin playing around with the state-of-the-art reinforcement learning algorithms provided by TensorFlow’s Agents library.

This is one of the most fun yet frustrating projects I’ve worked on, and it’s only going to get more interesting from here, so I hope you found value in this and stay tuned :)
