Football Table RL
Reinforcement Learning Basics
The football table employs on-line value iteration, namely Greedy-GQ[math]\displaystyle{ (\lambda) }[/math] and Approximate-Q[math]\displaystyle{ (\lambda) }[/math]. This page does not explain reinforcement learning theory; it only touches on the usage and implementation of the provided library (libvfa, located on the SVN). To get a basic understanding of reinforcement learning, I suggest reading the book by Sutton & Barto [1]. For a slightly more in-depth, hands-on book on using RL with function approximation, I suggest the book by Lucian Busoniu (TU Delft) et al. [2], which is freely available as an e-book from within the TU/e network. Unfortunately, the two books use very different notations; here we use the notation of the former.
From here on out this section assumes familiarity with reinforcement learning and discusses how the provided libvfa library works and is used. Note that the library can be used in any project, provided the user supplies the transition samples correctly.
Value function approximation
For this project a Gaussian Radial Basis Function network (RBFs) was chosen to approximate the Q-function; a short explanation of the method can be found in [3]. Currently, this is the only method of function approximation available in the library, but other forms of linear parametric function approximation should be easy to incorporate (only the membership function needs to change).
RBFs are a linear parametric function approximation, where the Q-function is estimated using:
[math]\displaystyle{ Q = \theta\cdot\phi(s,a)^{T} }[/math]
here [math]\displaystyle{ \phi(s,a)~(1 \times n) }[/math] denotes the so-called feature vector, [math]\displaystyle{ \theta~(1 \times n) }[/math] denotes the parameter (or weight) vector and [math]\displaystyle{ n }[/math] denotes the number of nodes. The value of Q hence becomes a scalar, dependent on the action and the state. During reinforcement learning we are trying to estimate the vector [math]\displaystyle{ \theta }[/math].
Formally, the feature vector [math]\displaystyle{ \phi(s,a) }[/math] depends on both the state and the action, creating one long vector that represents Q in its entirety (all combinations of actions and states). However, to keep things simple and comprehensible, libvfa is implemented slightly differently, yet still correctly of course. Instead of creating one long vector, we store [math]\displaystyle{ \theta }[/math] as a matrix of size [math]\displaystyle{ (n_a \times n) }[/math] ([math]\displaystyle{ n_a }[/math] denotes the number of actions). This holds the same information, but the parameters are now stored in separate rows, making indexing easier. On top of that, we use the same feature vector for each action (same node distribution and width across the state-space), yielding [math]\displaystyle{ \phi(s) }[/math] instead of [math]\displaystyle{ \phi(s,a) }[/math]. This is not strictly necessary, but it keeps things simple and compact (we no longer need to store [math]\displaystyle{ \phi(s) }[/math] at all, just [math]\displaystyle{ \theta }[/math]). It could be better, however, to tailor the feature vector to the subset of the state-space in which each action can occur (which could also allow higher precision with a lower number of nodes).
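As an illustration of this scheme (a standalone sketch, not libvfa's actual implementation; all names here are hypothetical), evaluating [math]\displaystyle{ Q(s,a) }[/math] for a one-dimensional state with Gaussian RBFs boils down to a dot product between one row of [math]\displaystyle{ \theta }[/math] and the feature vector:
#include <cmath>
#include <vector>
// Sketch: Q(s,a) = theta(a,:) * phi(s)^T with Gaussian RBF features.
// theta is the (n_a x n) parameter matrix, centers holds the n node
// positions and width is the (shared) RBF width.
double evaluateQ(const std::vector<std::vector<double>>& theta,
                 const std::vector<double>& centers,
                 double width, double s, int action)
{
    double q = 0.0;
    for (std::size_t i = 0; i < centers.size(); ++i) {
        double d = s - centers[i];
        double phi_i = std::exp(-(d * d) / (2.0 * width * width)); // Gaussian node
        q += theta[action][i] * phi_i;                             // dot product term
    }
    return q;
}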
Sparse Calculations
The "Fast" library functions exploit the locality of the Gaussian RBFs: the int width argument that appears in the calls below determines how many neighbouring nodes are taken into account, so only a small part of [math]\displaystyle{ \phi(s) }[/math] has to be evaluated per call.
Library functions
In this section we explain how to use some of the important functions of libvfa. The best way to familiarize yourself is to read primitives_select.cpp, but here I present some basic explanation.
Initialize
Initialization is done using either the constructor or vfa::init.
int numberOfActions = 2;
int numberOfStates = 2;
double numberOfNodesPerState[2] = {3, 3};
double saturationPerState[2] = {1, 1};
double nodeWidthPerState[2] = {1, 1};
bool normalize = false;
bool varRate = false; // note: this option does not work well
vfa EXAMPLE(numberOfActions, numberOfStates, numberOfNodesPerState, saturationPerState, nodeWidthPerState, normalize, varRate);
Or you can do this:
vfa EXAMPLE;
EXAMPLE.init(numberOfActions, numberOfStates, numberOfNodesPerState, saturationPerState, nodeWidthPerState, normalize, varRate);
Getting Values
If we want a single Q-value, we can use this:
int action = 0;
double state[2] = {0,0};
int width = 2; //Here we determine the number of neighbours we use
double value = EXAMPLE.getValueFast(action, state, width);
If we want the values of a couple of particular actions:
int nActions = 3;
int actions[3] = {0,2,3};
double state[2] = {0,0};
int width = 2; //Here we determine the number of neighbours we use
fvec values = EXAMPLE.getValuesFast(state, actions, nActions, width);
If we want all values, we can use getAllValuesFast.
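These value queries are typically used for action selection. Below is a minimal sketch of [math]\displaystyle{ \epsilon }[/math]-greedy selection built on top of getValuesFast; the helper itself is hypothetical and not part of libvfa, and it assumes (as the name suggests) that fvec is Armadillo's float column vector:
#include <cstdlib> // rand(), RAND_MAX
// Hypothetical helper: pick an action epsilon-greedily from the given set.
// Assumes the libvfa and Armadillo headers are included.
int selectActionEpsilonGreedy(vfa &Q, double *state, int *actions, int nActions, int width, double epsilon){
    if((double)rand() / RAND_MAX < epsilon){
        return actions[rand() % nActions]; // explore: uniformly random action
    }
    fvec values = Q.getValuesFast(state, actions, nActions, width);
    int best = 0;
    for(int i = 1; i < nActions; i++){
        if(values(i) > values(best)){ best = i; } // exploit: greedy action
    }
    return actions[best];
}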
Performing updates
To perform a Q-learning update we need [math]\displaystyle{ s_t, a_t, s_{t+1}, r_{t+1} }[/math]. Currently we also provide all the learning parameters ([math]\displaystyle{ \alpha, \beta, \gamma, \lambda, \epsilon }[/math]) with every call. You can choose either Approximate-Q or Greedy-GQ to perform the update. For Greedy-GQ we use:
VFA.updateGQReallyFast(double * currentState, double * nextState, int action, double reward, double alpha, double beta, double gamma, double lambda, int width);
For Approximate-Q we use:
VFA.updateGQReallyFast(double * currentState, double * nextState, int action, double reward, double alpha, double gamma, int width);
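Putting the pieces together, a single learning step could look as follows (a sketch only; observeState, applyAction and computeReward are hypothetical placeholders for project-specific code, and selectActionEpsilonGreedy is the helper sketched above):
// One Greedy-GQ learning step (surrounding functions are hypothetical).
double currentState[2], nextState[2];
observeState(currentState);                   // measure the current state
int action = selectActionEpsilonGreedy(VFA, currentState, actions, nActions, width, epsilon);
applyAction(action);                          // actuate
observeState(nextState);                      // measure the resulting state
double reward = computeReward(nextState);     // project-specific reward signal
VFA.updateGQReallyFast(currentState, nextState, action, reward, alpha, beta, gamma, lambda, width);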
Doing eligibility traces
Eligibility traces are handled separately, after a normal update. To perform such a trace update, use:
VFA.updateGQLambda(double * traceState, int traceAction, double alpha, double beta, double gamma, double lambda, int width, int depth);
Here int depth states how many steps back we are from the current state ([math]\displaystyle{ s_{t-depth} }[/math]). Similarly, for Approximate-Q we can use:
VFA.updateTDLambda(double * traceState, int traceAction, double alpha, double gamma, double lambda, int width, int depth);
So, to perform an update on all previous states within the episode, you need to loop back through a buffer. Below is some example code, where phiBuf is used to break off the trace once a sub-optimal action was taken (if useWatkins is true).
for(int i = 1; i < traceDepth_; i++){
    traceIndex = bufIndex - i;
    /* Stop when we leave the current episode, or (with Watkins' cut-off)
       once a non-greedy action was taken. */
    if(eBuf(traceIndex) != eBuf(bufIndex) || (useWatkins && phiBuf(traceIndex+1) == 0)){
        break;
    }
    for(int j = 0; j < nStates; j++){ stateTrace[j] = csBuf(traceIndex, j); }
    if(useGQ_){
        VFA.updateGQLambda(stateTrace, aBuf(traceIndex), alpha_, beta_, gamma_, lambda_, epsilon_, 2, i);
    }
    else if(useApproxQ_){
        VFA.updateTDLambda(stateTrace, aBuf(traceIndex), alpha_, gamma_, lambda_, epsilon_, 2, i);
    }
}
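The loop above assumes that csBuf, aBuf, eBuf and phiBuf are filled at every control step. How this is done is project-specific, but conceptually the bookkeeping amounts to something like the following (a sketch; the ring-buffer layout is an assumption, with names taken from the example above):
// Hypothetical per-step bookkeeping for the trace buffers used above.
bufIndex = (bufIndex + 1) % bufSize;                                      // ring buffer index
for(int j = 0; j < nStates; j++){ csBuf(bufIndex, j) = currentState[j]; } // store the state
aBuf(bufIndex) = action;      // action taken in this state
eBuf(bufIndex) = episode;     // episode counter, used to cut off the trace
phiBuf(bufIndex) = wasGreedy; // nonzero if the action was greedy (Watkins' cut-off)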
- ↑ Reinforcement Learning: An Introduction, Sutton & Barto, http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
- ↑ Reinforcement Learning and Dynamic Programming using Function Approximation, Lucian Busoniu, Robert Babuska, Bart De Schutter & Damien Ernst, 2010, http://www.crcnetbase.com/isbn/9781439821091
- ↑ Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning, R.M. Kretchmar & C.W. Anderson, http://www.cs.colostate.edu/pubserv/pubs/Kretchmar-anderson-res-rl-matt-icnn97.pdf