Perseus

PURPOSE
The Perseus point-based POMDP solver.
SYNOPSIS
function [V Value Alpha t]=Perseus(P,B,stopCriteria)
DESCRIPTION
Perseus algorithm: a point-based solver for POMDPs. This corresponds to
the algorithm in Table 2 (Section 6.1) of the paper.

Inputs
  P: A POMDP object. This can be any of the specializations of the
     POMDP (discrete/continuous states/actions/observations).
  B: A set of beliefs.
  stopCriteria: A function with 3 parameters (planning horizon,
                execution time, and maximum change in the value over
                all beliefs in B in the last iteration). The function
                returns true if the main loop of Perseus must be
                stopped.

Outputs
  V: A cell array of policies. V{n} is the policy for planning
     horizon 'n'.
  Value: A cell array of vectors. Value{n}(i) is the value for belief
         'i' at planning horizon 'n'.
  Alpha: A cell array of vectors. Alpha{n}(i) is the index of the
         alpha assigned to belief 'i' at planning horizon 'n'.
  t: A vector. t(n) is the time taken to compute the policy for
     planning horizon 'n'.

CROSS-REFERENCE INFORMATION
This function calls (see the source below):
Policy, Alpha0, AddAlpha, RandomAction, MaxAlpha, ComputeAlphas_j_a_o,
DiscretizeActionModel, Backup, Expectation
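EXAMPLE
A minimal usage sketch. It assumes a POMDP object 'P' and a cell array of
beliefs 'B' built elsewhere with the toolbox; the stopping thresholds below
are illustrative, not part of the toolbox:

    % Stop after 100 planning horizons, one hour of CPU time, or when
    % the maximum value change over B falls below 1e-4.
    stop=@(h,t,d)(h>=100 || t>=3600 || d<1e-4);

    [V Value Alpha t]=Perseus(P,B,stop);

    policy=V{end};   % policy for the deepest planning horizon computed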
SOURCE CODE

function [V Value Alpha t]=Perseus(P,B,stopCriteria)
% The Perseus point-based POMDP solver.
%
% Perseus algorithm. A point-based solver for POMDPs.
% This corresponds to the algorithm in Table 2 (Section 6.1) of the paper.
%
% Inputs
%   P: A POMDP object. This can be any of the specializations of the
%      POMDP (discrete/continuous states/actions/observations).
%   B: A set of beliefs.
%   stopCriteria: A function with 3 parameters (planning horizon,
%                 execution time, maximum change in the value for all
%                 beliefs in B in the last iteration).
%                 The function returns true if the main loop of Perseus
%                 must be stopped.
% Outputs
%   V: A cell array of policies. V{n} is the policy for planning
%      horizon 'n'.
%   Value: A cell array of vectors. Value{n}(i) is the value for belief
%          'i' at planning horizon 'n'.
%   Alpha: A cell array of vectors. Alpha{n}(i) is the index of the
%          alpha assigned to belief 'i' at planning horizon 'n'.
%   t: A vector. t(n) is the time taken to compute the policy for
%      planning horizon 'n'.

% Get the number of beliefs in the set B
nb=size(B,2);

% Set up the initial policy (policy with planning horizon=1)
n=1;
V{n}=Policy;
alpha0=Alpha0(P);
V{n}=AddAlpha(V{n},alpha0,RandomAction(P));

[Alpha{n} Value{n}]=cellfun(@(b)(MaxAlpha(V{n},b)),B);

% And now start the iterations to improve the policy
t_init=cputime;
stop=false;

while ~stop

    n1=n+1; % Horizon for the new policy to compute
    fprintf(' Iteration %u: ',n1);

    % Pre-compute the set of \Alphas_j_a_o.
    % This is only feasible for some types of POMDPs.
    % In the remaining cases the output is empty.
    fprintf('[P');
    Alphas_j_a_o=ComputeAlphas_j_a_o(P,V{n});
    fprintf('] ');

    % The new policy is initially empty
    V{n1}=Policy;

    % Initially all beliefs in B are to be improved
    pending=true(1,nb);
    nPending=nb;
    it=1;
    while nPending>0
        fprintf('%u(%u)',it,nPending);
        it=it+1;

        % Select a belief pending to be improved at random
        ndx=find(pending);
        nPending=size(ndx,2);
        cb=ndx(ceil(rand*nPending));

        % POMDPs with continuous action sets are converted to their
        % discrete counterparts by sampling actions.
        % We sample a set of actions (taking into account the action
        % previously assigned to the belief and the planning horizon).
        [alpha action]=V{n}{Alpha{n}(cb)};
        P1=DiscretizeActionModel(P,action,n);

        % Execute the backup (this is a different function for the
        % different types of POMDP)
        fprintf('b[');
        [alpha optimalAction v]=Backup(P1,B{cb},V{n},Alphas_j_a_o);
        fprintf(']');

        % If the new value is not better than the one in the previous
        % planning horizon, use the previous alpha.
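        % (Keeping the previous alpha here is what makes Perseus
        % monotone: the value assigned to every belief in B never
        % decreases from one planning horizon to the next.)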
        if v<Value{n}(cb)
            [alpha optimalAction]=V{n}{Alpha{n}(cb)};
            fprintf('p ');
        else
            fprintf('n ');
        end

        % Add the new alpha to the new policy
        [V{n1} l]=AddAlpha(V{n1},alpha,optimalAction);

        % Remove from the pending set all the beliefs whose value is
        % already improved (or at least preserved) by the new alpha
        for i=1:nb
            if pending(i)
                nv=Expectation(B{i},alpha);
                if nv>=Value{n}(i)
                    pending(i)=false;
                    nPending=nPending-1;
                    Value{n1}(i)=nv;
                    Alpha{n1}(i)=l;
                end
            end
        end

    end % while nPending>0

    %[Alpha{n1} Value{n1}]=cellfun(@(b)(MaxAlpha(V{n1},b)),B);

    t(n1)=cputime-t_init;
    vd=Value{n1}-Value{n};
    mvd=max(vd);
    avd=sum(vd)/nb;

    fprintf('\n mvd:%g avd: %g t: %f\n',mvd,avd,t(n1));

    n=n1;

    stop=stopCriteria(n,t(n),mvd);
end
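For intuition, in the discrete-state case the quantity computed by
Expectation above (the value of a belief under an alpha element) reduces
to the inner product of the belief and the alpha vector. A minimal sketch
with plain numeric vectors, not the toolbox's belief and alpha objects:

    b=[0.2; 0.5; 0.3];          % belief over 3 states
    alphaVec=[1.0; 0.4; -0.2];  % an alpha vector
    v=b'*alphaVec;              % value of the belief under this alpha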