Q-Learning implementation for Tic-Tac-Toe

Developed a Python module with an agent and a programmed bot that uses the Q-Learning technique to train the agent to play Tic-Tac-Toe.
There are two scripts in this module. The first, 'Q_Learning_Tic_Tac_Toe.py', uses the programmed bot to train the agent and outputs a model file; 'Tic-Tac-Toe-Player.py' then uses that model file to play against a human player.
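As a rough sketch of that pipeline (not the actual deliverable code), the training script could persist the learned Q-table with pickle and the player script could reload it and pick the greedy move for the current board. The file name 'q_model.pkl' and the helpers save_model, load_model, and pick_move below are hypothetical; the board state is assumed to be a hashable value such as a tuple.

    import pickle

    def save_model(q_table, path="q_model.pkl"):
        # Training side: persist the learned Q-table to disk (hypothetical file name).
        with open(path, "wb") as f:
            pickle.dump(q_table, f)

    def load_model(path="q_model.pkl"):
        # Player side: reload the trained Q-table.
        with open(path, "rb") as f:
            return pickle.load(f)

    def pick_move(q_table, state, legal_moves):
        # Player side: choose the legal move with the highest learned value,
        # defaulting to 0.0 for (state, action) pairs never seen during training.
        return max(legal_moves, key=lambda a: q_table.get((state, a), 0.0))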

One interesting thing we noticed was on line 142 of 'Q_Learning_Tic_Tac_Toe.py':

Line 142:      q_table[x, a] = instant_reward(x, a, computer_move) + learning_gamma * discounted_reward
Here, by design, the value returned by instant_reward is 0 for most moves (all non-terminating moves, in fact). We hypothesized that replacing this call with q_lookup could make learning faster; q_lookup returns the previously learned value if one is present and the instant reward otherwise, i.e.
Line 143:      q_table[x, a] = q_lookup(x, a) + learning_gamma * discounted_reward 
However, since our bot plays random moves (unless there is a winning or blocking move), we also need to be able to reduce a learned value. With this change, the values in the Q-table kept accumulating, which made reducing them difficult.
To fix this, we used the following update instead, which prevents the values from accumulating:
Line 143:      q_table[x, a] = (1-learning_gamma) * q_lookup(x, a) + learning_gamma * discounted_reward  
With this change, we obtained better results.
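For illustration, here is a minimal, self-contained sketch of the three update rules discussed above, assuming a dictionary-based Q-table keyed by (state, action). The names q_table, instant_reward, q_lookup, learning_gamma, and discounted_reward follow the deliverable, but the surrounding structure and the placeholder reward function are hypothetical.

    learning_gamma = 0.9   # discount / blending factor

    q_table = {}           # maps (state, action) -> learned value

    def instant_reward(x, a):
        # Placeholder: by design the immediate reward is 0 for all
        # non-terminating moves; the actual deliverable scores wins,
        # losses, and draws here.
        return 0.0

    def q_lookup(x, a):
        # Previously learned value if present, otherwise the instant reward.
        return q_table.get((x, a), instant_reward(x, a))

    def update_original(x, a, discounted_reward):
        # Line 142: instant reward plus discounted future reward.
        q_table[x, a] = instant_reward(x, a) + learning_gamma * discounted_reward

    def update_accumulating(x, a, discounted_reward):
        # First attempt at line 143: values can only grow, so the agent can
        # never lower an inflated estimate; the table keeps accumulating.
        q_table[x, a] = q_lookup(x, a) + learning_gamma * discounted_reward

    def update_blended(x, a, discounted_reward):
        # Final fix: a convex combination of the old and new estimates, so a
        # learned value can decrease as well as increase.
        q_table[x, a] = (1 - learning_gamma) * q_lookup(x, a) + learning_gamma * discounted_reward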

    Download: Deliverable 1.zip