Perseus

PURPOSE
The Perseus point-based POMDP solver.
SYNOPSIS
function [V Value Alpha t]=Perseus(P,B,stopCriteria)
DESCRIPTION
Perseus algorithm: a point-based solver for POMDPs. This corresponds to
the algorithm in Table 2 (Section 6.1) of the paper.

Inputs
  P: A POMDP object. This can be any of the specializations of the
     POMDP (discrete/continuous states/actions/observations).
  B: A set of beliefs.
  stopCriteria: A function with 3 parameters (planning horizon,
                execution time, and maximum change in the value over
                all beliefs in B in the last iteration). The function
                returns true if the main loop of Perseus must be
                stopped.

Outputs
  V: A cell array of policies. V{n} is the policy for planning
     horizon 'n'.
  Value: A cell array of vectors. Value{n}(i) is the value for belief
         'i' at planning horizon 'n'.
  Alpha: A cell array of vectors. Alpha{n}(i) is the index of the
         alpha assigned to belief 'i' at planning horizon 'n'.
  t: A vector. t(n) is the time taken to compute the policy for
     planning horizon 'n'.

CROSS-REFERENCE INFORMATION
This function calls (see the source below):
Policy, Alpha0, AddAlpha, RandomAction, MaxAlpha, ComputeAlphas_j_a_o,
DiscretizeActionModel, Backup, Expectation
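EXAMPLE
A minimal usage sketch. It assumes a POMDP object 'P' and a cell array of
beliefs 'B' built elsewhere with the toolbox; the stopping thresholds below
are illustrative, not part of the toolbox:

    % Stop after 100 planning horizons, one hour of CPU time, or when
    % the maximum value change over B falls below 1e-4.
    stop=@(h,t,d)(h>=100 || t>=3600 || d<1e-4);

    [V Value Alpha t]=Perseus(P,B,stop);

    policy=V{end};   % policy for the deepest planning horizon computed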
SOURCE CODE

function [V Value Alpha t]=Perseus(P,B,stopCriteria)
% The Perseus point-based POMDP solver.
%
% Perseus algorithm. A point-based solver for POMDPs.
% This corresponds to the algorithm in Table 2 (Section 6.1) of the paper.
%
% Inputs
%   P: A POMDP object. This can be any of the specializations of the
%      POMDP (discrete/continuous states/actions/observations).
%   B: A set of beliefs.
%   stopCriteria: A function with 3 parameters (planning horizon,
%                 execution time, maximum change in the value for all
%                 beliefs in B in the last iteration).
%                 The function returns true if the main loop of Perseus
%                 must be stopped.
% Outputs
%   V: A cell array of policies. V{n} is the policy for planning
%      horizon 'n'.
%   Value: A cell array of vectors. Value{n}(i) is the value for belief
%          'i' at planning horizon 'n'.
%   Alpha: A cell array of vectors. Alpha{n}(i) is the index of the
%          alpha assigned to belief 'i' at planning horizon 'n'.
%   t: A vector. t(n) is the time taken to compute the policy for
%      planning horizon 'n'.

% Get the number of beliefs in the set B
nb=size(B,2);

% Set up the initial policy (policy with planning horizon=1)
n=1;
V{n}=Policy;
alpha0=Alpha0(P);
V{n}=AddAlpha(V{n},alpha0,RandomAction(P));

[Alpha{n} Value{n}]=cellfun(@(b)(MaxAlpha(V{n},b)),B);

% And now start the iterations to improve the policy
t_init=cputime;
stop=false;

while ~stop

    n1=n+1; % Horizon for the new policy to compute
    fprintf(' Iteration %u: ',n1);

    % Pre-compute the set of \Alphas_j_a_o.
    % This is only feasible for some types of POMDPs.
    % In the remaining cases the output is empty.
    fprintf('[P');
    Alphas_j_a_o=ComputeAlphas_j_a_o(P,V{n});
    fprintf('] ');

    % The new policy is initially empty
    V{n1}=Policy;

    % Initially all beliefs in B are to be improved
    pending=true(1,nb);
    nPending=nb;
    it=1;
    while nPending>0
        fprintf('%u(%u)',it,nPending);
        it=it+1;

        % Select a belief pending to be improved at random
        ndx=find(pending);
        nPending=size(ndx,2);
        cb=ndx(ceil(rand*nPending));

        % POMDPs with continuous action sets are converted to their
        % discrete counterparts by sampling actions.
        % We sample a set of actions (taking into account the action
        % previously assigned to the belief and the planning horizon).
        [alpha action]=V{n}{Alpha{n}(cb)};
        P1=DiscretizeActionModel(P,action,n);

        % Execute the backup (this is a different function for the
        % different types of POMDP)
        fprintf('b[');
        [alpha optimalAction v]=Backup(P1,B{cb},V{n},Alphas_j_a_o);
        fprintf(']');

        % If the new value is not better than the one in the previous
        % planning horizon, use the previous alpha.
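        % (Keeping the previous alpha here is what makes Perseus
        % monotone: the value assigned to every belief in B never
        % decreases from one planning horizon to the next.)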
        if v<Value{n}(cb)
            [alpha optimalAction]=V{n}{Alpha{n}(cb)};
            fprintf('p ');
        else
            fprintf('n ');
        end

        % Add the new alpha to the new policy
        [V{n1} l]=AddAlpha(V{n1},alpha,optimalAction);

        % Remove from the pending set all the beliefs whose value is
        % already improved (or at least preserved) by the new alpha
        for i=1:nb
            if pending(i)
                nv=Expectation(B{i},alpha);
                if nv>=Value{n}(i)
                    pending(i)=false;
                    nPending=nPending-1;
                    Value{n1}(i)=nv;
                    Alpha{n1}(i)=l;
                end
            end
        end

    end % while nPending>0

    %[Alpha{n1} Value{n1}]=cellfun(@(b)(MaxAlpha(V{n1},b)),B);

    t(n1)=cputime-t_init;
    vd=Value{n1}-Value{n};
    mvd=max(vd);
    avd=sum(vd)/nb;

    fprintf('\n mvd:%g avd: %g t: %f\n',mvd,avd,t(n1));

    n=n1;

    stop=stopCriteria(n,t(n),mvd);
end
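For intuition, in the discrete-state case the quantity computed by
Expectation above (the value of a belief under an alpha element) reduces
to the inner product of the belief and the alpha vector. A minimal sketch
with plain numeric vectors, not the toolbox's belief and alpha objects:

    b=[0.2; 0.5; 0.3];          % belief over 3 states
    alphaVec=[1.0; 0.4; -0.2];  % an alpha vector
    v=b'*alphaVec;              % value of the belief under this alpha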