Q-Learning implementation for Tic-Tac-Toe

Developed a Python module consisting of an agent and a programmed bot, using the Q-Learning technique to train the agent to play Tic-Tac-Toe.
There are two scripts in this module. The first, 'Q_Learning_Tic_Tac_Toe.py', uses a programmed bot to train the agent and outputs a model file, which 'Tic-Tac-Toe-Player.py' then uses to play against a human player.
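As a rough sketch of how the two scripts could fit together, the trainer could serialize the learned Q-table to the model file and the player could load it back. The function names and the 'q_table.pkl' path below are illustrative assumptions, not necessarily the module's actual code:

    import pickle

    def save_model(q_table, path="q_table.pkl"):
        # Trainer side: persist the learned Q-table as the model file.
        with open(path, "wb") as f:
            pickle.dump(q_table, f)

    def load_model(path="q_table.pkl"):
        # Player side: read the model file back before playing a human.
        with open(path, "rb") as f:
            return pickle.load(f)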

One interesting thing we noticed was at line 142 of 'Q_Learning_Tic_Tac_Toe.py':

Line 142:      q_table[x, a] = instant_reward(x, a, computer_move) + learning_gamma * discounted_reward
Here, by design, the value returned by instant_reward is 0 for most moves (all the non-terminating moves, in fact). We hypothesized that replacing this function with q_lookup could make learning faster: q_lookup returns the previously learned reward if one is present and the instant_reward otherwise, i.e.
Line 143:      q_table[x, a] = q_lookup(x, a) + learning_gamma * discounted_reward 
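For illustration, here is a minimal sketch of the q_lookup idea, assuming the Q-table is a dictionary keyed by (state, action) pairs. The explicit q_table and instant parameters are added only to make the sketch self-contained; the real helper is called with two arguments, as shown above:

    def q_lookup(x, a, q_table, instant):
        # Use the previously learned value for (x, a) when one exists;
        # otherwise fall back to the instant reward, which is 0 for
        # non-terminating moves.
        return q_table.get((x, a), instant)

    # An unseen pair falls back to the instant reward ...
    q_table = {}
    assert q_lookup("some_state", 4, q_table, 0) == 0
    # ... while a previously visited pair returns the learned value.
    q_table[("some_state", 4)] = 0.8
    assert q_lookup("some_state", 4, q_table, 0) == 0.8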
However, since our bot plays random moves (unless there is a winning or blocking move), we also need to be able to reduce a learned value. With this change, the values in the Q-table kept accumulating, which made it difficult to bring a learned value back down.
To prevent the values from accumulating, we used the following form:
Line 143:      q_table[x, a] = (1-learning_gamma) * q_lookup(x, a) + learning_gamma * discounted_reward  
Through this change, we were able to get better results.
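The difference between the two updates can be seen in isolation. The helper functions and the learning_gamma value below are stand-ins written for this note, not the module's actual code; the point is that the blended form is a weighted average, so a low discounted reward can pull the stored value back down:

    def accumulating_update(old_value, discounted_reward, learning_gamma):
        # Intermediate version (first Line 143 above): the learned value
        # keeps growing whenever the discounted reward is positive.
        return old_value + learning_gamma * discounted_reward

    def blended_update(old_value, discounted_reward, learning_gamma):
        # Final version: a weighted average of the old value and the new
        # estimate, so the stored value can decrease as well as increase.
        return (1 - learning_gamma) * old_value + learning_gamma * discounted_reward

    v_acc = v_blend = 0.0
    for r in [1.0, 1.0, 1.0, -1.0]:   # noisy rewards from a random opponent
        v_acc = accumulating_update(v_acc, r, 0.9)
        v_blend = blended_update(v_blend, r, 0.9)
    # After three favourable outcomes and one bad one, the accumulated value
    # is still about 1.8, while the blended value has dropped below zero.
    print(v_acc, v_blend)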

    Download: Deliverable 1.zip