iPerseus

PURPOSE
Improved version of Perseus.
SYNOPSIS
function [V Value Alpha t]=iPerseus(P,B,stopCriteria)
DESCRIPTION
Improved version of Perseus.

Improved version of the Perseus algorithm, a point-based solver for POMDPs. This corresponds to a slightly optimized version of the algorithm in Table 2 (Section 6.1) of the paper. To understand the code, it is better to first check the plain version of Perseus, which follows the algorithm in Table 2 more closely.

Inputs
  P: A POMDP object. This can be any of the specializations of the POMDP
     (discrete/continuous states/actions/observations).
  B: A set of beliefs.
  stopCriteria: A function with 3 parameters (planning horizon, execution
                time, maximum change in the value for all beliefs in B in
                the last iteration). The function returns true if the main
                loop of Perseus must be stopped.

Outputs
  V: A cell array of policies. V{n} is the policy for planning horizon 'n'.
  Value: A cell array of vectors. Value{n}(i) is the value for belief 'i'
         in planning horizon 'n'.
  Alpha: A cell array of vectors. Alpha{n}(i) is the alpha assigned to
         belief 'i' in planning horizon 'n'.
  t: A vector. t(n) is the time taken to compute the policy for planning
     horizon 'n'.

See also Perseus.

CROSS-REFERENCE INFORMATION
This function calls:
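As a minimal usage sketch (the POMDP object P and the belief set B are assumed to be already available; how they are built is not covered here), the stop criterion can be given as an anonymous function of the planning horizon, the elapsed CPU time, and the maximum value change:

  % P: a POMDP object, B: a set of beliefs (assumed to exist already)
  % Stop after 100 planning horizons, one hour of CPU time, or when the
  % largest value change over the beliefs in B drops below 1e-4.
  stopCriteria=@(h,tm,d) (h>=100) || (tm>=3600) || (d<1e-4);
  [V Value Alpha t]=iPerseus(P,B,stopCriteria);
  Vfinal=V{end}; % policy for the last planning horizon computed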
SOURCE CODE

function [V Value Alpha t]=iPerseus(P,B,stopCriteria)
% Improved version of Perseus.
%
% Improved version of the Perseus algorithm.
% A point-based solver for POMDPs.
% This corresponds to a slightly optimized version of the algorithm
% in Table 2 (Section 6.1) of the paper.
% To understand the code, it is better to first check the plain version of
% Perseus, which follows the algorithm in Table 2 more closely.
%
% Inputs
%   P: A POMDP object. This can be any of the specializations of the
%      POMDP (discrete/continuous states/actions/observations).
%   B: A set of beliefs.
%   stopCriteria: A function with 3 parameters (planning horizon,
%                 execution time, maximum change in the value for all
%                 beliefs in B in the last iteration).
%                 The function returns true if the main loop of Perseus
%                 must be stopped.
% Outputs
%   V: A cell array of policies. V{n} is the policy for planning
%      horizon 'n'.
%   Value: A cell array of vectors. Value{n}(i) is the value for belief
%          'i' in planning horizon 'n'.
%   Alpha: A cell array of vectors. Alpha{n}(i) is the alpha assigned to
%          belief 'i' in planning horizon 'n'.
%   t: A vector. t(n) is the time taken to compute the policy for
%      planning horizon 'n'.
%
% See also Perseus.

% Set up the initial policy (policy with planning horizon=1)
n=1;
V{n}=Policy;
alpha0=Alpha0(P);
V{n}=AddAlpha(V{n},alpha0,RandomAction(P));

[Alpha{n} Value{n}]=cellfun(@(b)(MaxAlpha(V{n},b)),B);

% Get the number of beliefs in the set B
nb=size(B,2);

% And now start the iterations to improve the policy
t_init=cputime;
stop=false;

while ~stop

  n1=n+1; % Horizon for the new policy to compute
  fprintf(' Iteration %u: ',n1);

  % Pre-compute the set of \Alphas_j_a_o
  % This is only feasible in some types of POMDPs.
  % In the rest of the cases the output is empty.
  fprintf('[P')
  Alphas_j_a_o=ComputeAlphas_j_a_o(P,V{n});
  fprintf('] ');

  % The new policy is initially empty
  V{n1}=Policy;

  % Initially all beliefs in B are to be improved
  Alpha{n1}=zeros(1,nb);
  Value{n1}=zeros(1,nb);
  pending=true(1,nb);
  nPending=nb;
  it=1;
  while nPending>0
    fprintf('%u(%u)',it,nPending);
    it=it+1;

    % Select a belief pending to be improved at random
    ndx=find(pending);
    cb=ndx(ceil(rand*nPending));

    % POMDPs with continuous action sets are converted to their discrete
    % counterparts by sampling actions.
    % We sample a set of actions (taking into account the action previously
    % assigned to the belief and the planning horizon).
    [alpha action]=V{n}{Alpha{n}(cb)};
    P1=DiscretizeActionModel(P,action,n);

    % Execute the backup (this is a different function for the different
    % types of POMDP)
    fprintf('b[');
    [alpha optimalAction v]=Backup(P1,B{cb},V{n},Alphas_j_a_o);
    fprintf(']');

    % If the new value is not better than the one in the previous
    % planning horizon, use the previous alpha.
    if v<Value{n}(cb)

      fprintf('p ');

      [alpha optimalAction]=V{n}{Alpha{n}(cb)};

      % Add the old alpha to the new policy
      [V{n1} l]=AddAlpha(V{n1},alpha,optimalAction);

      pending(cb)=false;
      Alpha{n1}(cb)=l;
      Value{n1}(cb)=Value{n}(cb);
      nPending=nPending-1;

      % Remove all (still) pending beliefs that have the old alpha as
      % a maximizing one
      for i=1:nb
        if pending(i) && Alpha{n}(i)==Alpha{n}(cb)
          pending(i)=false;
          Alpha{n1}(i)=l;
          Value{n1}(i)=Value{n}(i);
          nPending=nPending-1;
        end
      end

    else

      fprintf('n ');

      % Add the new alpha to the new policy
      [V{n1} l]=AddAlpha(V{n1},alpha,optimalAction);

      pending(cb)=false;
      Alpha{n1}(cb)=l;
      Value{n1}(cb)=v;
      nPending=nPending-1;

      % Mark as done all pending beliefs whose value improves with
      % the new alpha
      for i=1:nb
        if pending(i)
          nv=Expectation(B{i},alpha);
          if nv>=Value{n}(i)
            pending(i)=false;
            Alpha{n1}(i)=l;
            Value{n1}(i)=nv;
            nPending=nPending-1;
          end
        end
      end

    end

  end

  t(n1)=cputime-t_init;
  vd=Value{n1}-Value{n};
  mvd=max(vd);
  avd=sum(vd)/nb;

  fprintf('\n mvd:%g avd: %g t: %f\n',mvd,avd,t(n1));

  n=n1;

  stop=stopCriteria(n,t(n),mvd);
end
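As an illustrative sketch of how the returned policy might be queried for a new belief, using the same conventions the listing above relies on (MaxAlpha returns the index of the maximizing alpha and its value, and indexing the Policy object with that index returns the alpha together with the action attached to it); the belief b below is assumed to be available:

  % b: a belief of the same type as the elements of B (assumed available)
  Vfinal=V{end};                       % policy for the last planning horizon
  [aNdx val]=MaxAlpha(Vfinal,b);       % index of the maximizing alpha and its value
  [alphaBest actionBest]=Vfinal{aNdx}; % the alpha element and the action to execute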