
Integrating Machine Learning and Optimization for Inventory Management


Lecture slides by Dr. Ziang Liu, Okayama University


MIKIO KUBO

May 25, 2025



Transcript

  1. ML and Opt for Inventory Integrating Machine Learning and Optimization

    for Inventory Management Ziang Liu Faculty of Environmental, Life, Natural Science and Technology Okayama University MOAI, January 26, 2024 1 / 78
  2. ML and Opt for Inventory Table of Contents 1 Reinforcement

    Learning for Inventory Optimization Newsvendor Problem and Multi-armed Bandit Problem Multi-Period Inventory Problem and Markov Decision Process Multi-Period Inventory Problem and TD Learning Inventory Management and Deep Reinforcement Learning 2 Surrogate Models in Optimization Surrogate-based Optimization Trust Region Methods Bayesian Optimization Surrogate-assisted Evolutionary Optimization 3 Appendix 2 / 78
  3. ML and Opt for Inventory References References Books Sutton, Richard

    S., and Andrew G. Barto. 2018. Reinforcement Learning, Second Edition: An Introduction. MIT Press. Snyder, Lawrence V., and Zuo-Jun Max Shen. 2019. Fundamentals of Supply Chain Theory. John Wiley & Sons. Puterman, Martin L. 1990. “Chapter 8 Markov Decision Processes.” In Handbooks in Operations Research and Management Science, 2:331-434. Elsevier. Kochenderfer, Mykel J., and Tim A. Wheeler. 2019. Algorithms for Optimization. MIT Press. Jin, Yaochu, Handing Wang, and Chaoli Sun. 2021. Data-Driven Evolutionary Optimization: Integrating Evolutionary Computation, Machine Learning and Data Science. Springer Nature. Webpages “UCL Course on RL.” 2019. David Silver. December 19, 2019. https://www.davidsilver.uk/teaching/. Snyder, Larry. n.d. RL for Inventory Optimization. Github. Accessed January 4, 2024. https://github.com/LarrySnyder/RLforInventory. Poupart, Pascal. n.d. “CS885 Spring 2020 - Reinforcement Learning.” Accessed January 4, 2024. https://cs.uwaterloo.ca/ ppoupart/teaching/cs885-spring20/schedule.html. 3 / 78
  4. ML and Opt for Inventory Source Code Source Code Source

    code is available on GitHub. https://github.com/zi-ang-liu/ML-and-Opt-for-Inventory [Thumbnail plots: violin plot for 10-armed bandits, discrete newsvendor expected cost, Gaussian process fit with 95% confidence interval and lower confidence bound (k = 5), and optimal policy maps.] 4 / 78
  5. ML and Opt for Inventory Mindmap Mindmap Integrating Machine Learning

    and Optimization for Inventory Management Reinforcement Learning Newsvendor problem Surrogate Model Multi-armed bandit algorithm Multi-period inventory problem Bellman optimality equation Linear programming Dynamic programming Q-learning Deep Q-learning Multi-period perishable inventory problem Surrogate-based optimization Surrogate-assisted optimization Trust region method Surrogate-assisted evolutionary algorithm Bayesian optimization 5 / 78
  6. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Table of Contents 1 Reinforcement Learning for Inventory Optimization Newsvendor Problem and Multi-armed Bandit Problem Multi-Period Inventory Problem and Markov Decision Process Multi-Period Inventory Problem and TD Learning Inventory Management and Deep Reinforcement Learning 2 Surrogate Models in Optimization Surrogate-based Optimization Trust Region Methods Bayesian Optimization Surrogate-assisted Evolutionary Optimization 3 Appendix 6 / 78
  7. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The newsvendor problem: Description A newspaper vendor must decide how many copies of a newspaper to order each morning The demand is uncertain Overage cost occurs when the demand is less than the order quantity Underage cost occurs when the demand is greater than the order quantity What is the optimal order quantity? 7 / 78
  8. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The newsvendor problem: Formulation Parameters: D: Demand (random variable) h: Holding cost (overage cost) p: Stockout cost (shortage cost) Decision variables: Q: Order quantity Objective: g(Q) = E [h(Q − D)+ + p(D − Q)+] (x)+ = max{x, 0} 8 / 78
  9. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The newsvendor problem: Optimization Objective function: g(Q) = E[h max{Q − D, 0} + p max{D − Q, 0}] = h ∫_0^Q (Q − d) f(d) dd + p ∫_Q^∞ (d − Q) f(d) dd First-order derivative: g′(Q) = hF(Q) − p(1 − F(Q)) Second-order derivative: g″(Q) = hf(Q) + pf(Q) ≥ 0, so g is convex Optimal order quantity: Q∗ = F⁻¹(p / (h + p)) 9 / 78
  10. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The newsvendor problem: Example D ∼ N(50, 8²), h = 0.18, p = 0.7. Figure: Continuous newsvendor problem (expected cost as a function of the order quantity). 10 / 78
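    As a quick numerical check of this example (not part of the original slides), the critical-ratio solution and the expected cost g(Q) can be reproduced; the use of SciPy and numerical integration here is an illustrative sketch, not the author's code.

        # Continuous newsvendor example: D ~ N(50, 8^2), h = 0.18, p = 0.7.
        # Sketch reproducing Q* = F^{-1}(p / (h + p)) and the expected cost g(Q).
        import numpy as np
        from scipy import stats
        from scipy.integrate import quad

        h, p = 0.18, 0.7
        D = stats.norm(50.0, 8.0)

        q_star = D.ppf(p / (h + p))          # critical-ratio order quantity

        def expected_cost(q):
            """g(Q) = h E[(Q - D)^+] + p E[(D - Q)^+] by numerical integration."""
            over, _ = quad(lambda d: (q - d) * D.pdf(d), -np.inf, q)
            under, _ = quad(lambda d: (d - q) * D.pdf(d), q, np.inf)
            return h * over + p * under

        print(f"Q* = {q_star:.2f}, expected cost = {expected_cost(q_star):.4f}")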
  11. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The discrete newsvendor problem: Example Figure: Discrete newsvendor problem (expected cost for order quantities 54–59). g(Q) is still convex. 11 / 78
  12. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem Multi-armed bandit problem Problem statement: k different actions we can take After each action, we receive a reward from a stationary probability distribution The reward distribution is unknown Objective: maximize the total reward over some time period 12 / 78
  13. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The k-armed bandit problem: Example Figure: The 10-armed bandit problem (violin plot of the reward distribution of each bandit). 13 / 78
  14. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The k-armed bandit problem: Notations At: Action at time t. Rt: Reward at time t. q∗(a): Expected reward for taking action a, q∗(a) = E[Rt | At = a]. Qt(a): Estimated value of action a at time t, the sample average Qt(a) = (Σ_{i=1}^{t−1} Ri · I{Ai = a}) / (Σ_{i=1}^{t−1} I{Ai = a}) (if the denominator is zero, Qt(a) is defined as a default value such as 0). I{Ai = a} is an indicator function: I{Ai = a} = 1 if Ai = a, otherwise I{Ai = a} = 0. 14 / 78
  15. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem Incremental implementation of sample averages Q_{n+1} = (R_1 + R_2 + · · · + R_n) / n = (1/n) Σ_{i=1}^{n} R_i = (1/n) (R_n + Σ_{i=1}^{n−1} R_i) = (1/n) (R_n + (n − 1) · (R_1 + R_2 + · · · + R_{n−1}) / (n − 1)) = (1/n) (R_n + (n − 1) Q_n) = Q_n + (1/n) (R_n − Q_n) 15 / 78
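    A one-line check (not in the slides) that the incremental update reproduces the batch sample average; the reward stream below is an arbitrary illustration.

        # Incremental sample-average update Q_{n+1} = Q_n + (R_n - Q_n) / n.
        import numpy as np

        rewards = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)

        q = 0.0
        for n, r in enumerate(rewards, start=1):
            q += (r - q) / n                 # O(1) memory, no stored history

        print(q, rewards.mean())             # both estimates coincide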
  16. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem The k-armed bandit problem: ϵ-greedy policy with probability 1 − ϵ, choose the action with the highest estimated value, At = arg maxa Qt (a) with probability ϵ, choose an action randomly 16 / 78
  17. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem A simple bandit algorithm
    Algorithm 1: A simple bandit algorithm
    1 Initialize, for a = 1, . . . , k:
    2     Q(a) ← 0
    3     N(a) ← 0
    4 for t = 1, 2, . . . do
    5     Choose At using ϵ-greedy policy based on Q
    6     Rt ← bandit(At)
    7     N(At) ← N(At) + 1
    8     Q(At) ← Q(At) + (1 / N(At)) (Rt − Q(At))
    9 end
    17 / 78
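    A minimal Python sketch of Algorithm 1 (ϵ-greedy action selection with incremental sample-average updates). The Gaussian reward model used for bandit(At) is an illustrative assumption, not part of the slides.

        import numpy as np

        def simple_bandit(q_true, eps=0.01, steps=2000, seed=0):
            rng = np.random.default_rng(seed)
            k = len(q_true)
            Q = np.zeros(k)                          # estimated action values
            N = np.zeros(k)                          # action counts
            for _ in range(steps):
                if rng.random() < eps:
                    a = int(rng.integers(k))         # explore
                else:
                    a = int(np.argmax(Q))            # exploit
                r = rng.normal(q_true[a], 1.0)       # R_t <- bandit(A_t)
                N[a] += 1
                Q[a] += (r - Q[a]) / N[a]            # incremental sample average
            return Q

        q_true = np.random.default_rng(1).normal(size=10)
        Q = simple_bandit(q_true)
        print("best true arm:", np.argmax(q_true), "best estimated arm:", np.argmax(Q))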
  18. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem Solving newsvendor problem as a k-armed bandit problem Discrete newsvendor problem: h = 0.18, p = 0.7, D ∼ N(5, 1²), D is discrete. Optimal order quantity: 6, expected cost: 0.24446. Simple bandit algorithm (ϵ = 0.01, k = 10, T = 2000):
    Optimal Action | Q Value
    6 | -0.24113
    6 | -0.24389
    6 | -0.24330
    6 | -0.24549
    6 | -0.23680
    Table: Optimal actions and corresponding Q values 18 / 78
  19. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Newsvendor Problem and Multi-armed Bandit Problem k-armed bandit problem and newsvendor problem Summary The k-armed bandit problem is nonassociative: we only need to find the single best action. The newsvendor problem: perishable, lost sales. The newsvendor problem can be solved as a k-armed bandit problem. Next: associative problems, which require learning a policy (a mapping from states to actions); the multi-period inventory problem. 19 / 78
  20. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Inventory Policy Constant order quantity policy: Order Q units at each period. Base-stock policy: Order up to S units when the inventory level is less than S, otherwise do not order. (s, S) policy: Order up to S units when the inventory level is less than s, otherwise do not order. Inventory policy is a mapping from inventory level to order quantity. 20 / 78
  21. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process The multi-period inventory problem: Description Inventory is non-perishable Decide how much to order each morning The demand is uncertain Unsatisfied demand is backordered What is the optimal order policy? 21 / 78
  22. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Notations Dt : Demand at period t h: Holding cost (overage cost) p: Stockout cost (shortage cost) Qt : Order quantity at period t ILt : Inventory level at the end of period t ILt > 0: Inventory on hand ILt < 0: Backorder ILt = ILt−1 + Qt − Dt 22 / 78
  23. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process The multi-period inventory problem Figure: Inventory level over time periods; orders Q1, Q2, Q3 raise the level, demands d1, d2, d3 lower it, and IL1, IL2, IL3 are the ending inventory levels. 23 / 78
  24. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Objective Cost at period t: g_t(Q_t) = h(IL_{t−1} + Q_t − d_t)^+ + p(d_t − IL_{t−1} − Q_t)^+. Objective Total cost: Σ_{t=1}^{T} g_t(Q_t) Total cost, infinite horizon: Σ_{t=1}^{∞} g_t(Q_t) Cumulative discounted cost, infinite horizon: Σ_{t=1}^{∞} γ^{t−1} g_t(Q_t) 24 / 78
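    For a concrete feel of these objectives, the discounted cumulative cost of a fixed base-stock policy can be estimated by simulating the recursion ILt = ILt−1 + Qt − Dt. This is an illustrative sketch only; the Poisson(5) demand, cost values, horizon, and number of runs are assumptions for the example.

        # Monte Carlo estimate of the discounted cost of a base-stock policy S.
        import numpy as np

        def discounted_cost(S, h=1.0, p=10.0, gamma=0.9, T=200, n_runs=500, seed=0):
            rng = np.random.default_rng(seed)
            total = 0.0
            for _ in range(n_runs):
                il, cost = 0.0, 0.0
                for t in range(T):
                    q = max(S - il, 0)               # order up to S
                    d = rng.poisson(5)               # demand D_t ~ Poisson(5)
                    il = il + q - d                  # inventory recursion
                    cost += gamma**t * (h * max(il, 0) + p * max(-il, 0))
                total += cost
            return total / n_runs

        for S in (6, 8, 10):
            print(S, round(discounted_cost(S), 2))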
  25. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Problem statement Objective: minimize the discounted cumulative cost, Σ_{t=1}^{∞} γ^{t−1} g_t(Q_t) The optimal inventory decision at period t depends on the ending inventory level in the previous period, IL_{t−1} We want to learn a policy π that maps IL_{t−1} to Q_t at each period t. The multi-period inventory problem is an associative problem 25 / 78
  26. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Markov Decision Process (MDP) Definition A Markov Decision Process (MDP) is a tuple (S, A, P, R, γ) S: a set of states A: a set of actions P: transition probability, p(s′, r|s, a) = P [St = s′, Rt = r|St−1 = s, At−1 = a], p : S × R × S × A → [0, 1] R: a reward function, r(s, a) = E [Rt |St−1 = s, At−1 = a], r : S × A → R γ: a discount factor, γ ∈ [0, 1] 26 / 78
  27. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Multi-period inventory problem as an MDP S = {IL_t ∈ Z : IL_t is an ending inventory level} A = {Q_t ∈ N : Q_t is an order quantity} p(s′, r|s, a) = P[D_t = s + a − s′] R: a reward function, r(s, a) = −E[h(s + a − D_t)^+ + p(D_t − s − a)^+] γ: a discount factor, γ ∈ [0, 1] 27 / 78
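    A sketch (not from the slides) of how this MDP can be tabulated with the example data used later (h = 1, p = 10, D ∼ Poisson(5), S = {−16, . . . , 16}, A = {0, . . . , 16}); truncating the Poisson support and clipping s′ to the state bounds are implementation assumptions.

        # Tabulate p(s' | s, a) and r(s, a) for the multi-period inventory MDP.
        import numpy as np
        from scipy.stats import poisson

        states = np.arange(-16, 17)                  # ending inventory levels
        actions = np.arange(0, 17)                   # order quantities
        h, p, lam, d_max = 1.0, 10.0, 5, 40
        d_vals = np.arange(d_max + 1)
        d_prob = poisson.pmf(d_vals, lam)
        d_prob /= d_prob.sum()                       # renormalize after truncation

        nS, nA = len(states), len(actions)
        P = np.zeros((nS, nA, nS))                   # P[s, a, s'] = p(s' | s, a)
        R = np.zeros((nS, nA))                       # R[s, a] = expected reward

        for i, s in enumerate(states):
            for j, a in enumerate(actions):
                level = s + a - d_vals               # possible next inventory levels
                R[i, j] = -np.sum(d_prob * (h * np.maximum(level, 0)
                                            + p * np.maximum(-level, 0)))
                nxt = np.clip(level, states[0], states[-1])
                for k, s_next in enumerate(nxt):
                    P[i, j, np.searchsorted(states, s_next)] += d_prob[k]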
  28. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Multi-period inventory problem as an MDP Returns Discounted return: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1} Policies and value functions Policy: π(a|s) = P[A_t = a | S_t = s] State value function: vπ(s) = Eπ[G_t | S_t = s] Action value function: qπ(s, a) = Eπ[G_t | S_t = s, A_t = a] 28 / 78
  29. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Optimal policies and value functions Policies π ≥ π′ if vπ(s) ≥ vπ′ (s), ∀s ∈ S At least one optimal policy exists, denoted as π∗ Value functions optimal state value function: v∗(s) = maxπ vπ(s), ∀s ∈ S optimal action value function: q∗(s, a) = maxπ qπ(s, a), ∀s ∈ S, a ∈ A 29 / 78
  30. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Computing optimal policies and value functions Bellman optimality equation Linear programming Dynamic programming Policy iteration Value iteration 30 / 78
  31. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Bellman optimality equation (Optional) v∗(s) = max_{a∈A(s)} qπ∗(s, a) = max_a Eπ∗[Gt | St = s, At = a] = max_a Eπ∗[Rt+1 + γ Gt+1 | St = s, At = a] = max_a Eπ∗[Rt+1 + γ v∗(St+1) | St = s, At = a] = max_a Σ_{s′,r} p(s′, r|s, a) [r + γ v∗(s′)] 31 / 78
  32. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Solving MDP: Bellman optimality equation Bellman optimality equation: v∗(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γ v∗(s′)] 32 / 78
  33. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Solving MDP: Linear programming A less frequently used method for solving MDPs Idea: If v(s) ≥ r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′) for all s ∈ S and a ∈ A, then v(s) is an upper bound on v∗(s); v∗(s) must be the smallest such solution 33 / 78
  34. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Solving MDP: Linear programming Linear programming formulation: minimize Σ_{s∈S} α_s v(s) s.t. v(s) ≥ r(s, a) + γ Σ_{s′} p(s′|s, a) v(s′), ∀s ∈ S, ∀a ∈ A. The constants α_s are arbitrary positive numbers. Notes: Linear programming methods can also be used to solve MDPs, but they become impractical at a much smaller number of states than DP methods do (by a factor of about 100). 34 / 78
  35. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Policy iteration: Policy evaluation Given a policy π, compute vπ(s) for all s ∈ S. Linear system of |S| equations in |S| unknowns: vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γ vπ(s′)] Iterative policy evaluation (converges to vπ as k → ∞): v_{k+1}(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γ v_k(s′)] 35 / 78
  36. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Policy iteration: Policy Improvement Given vπ(s), compute qπ(s, a) = Σ_{s′,r} p(s′, r|s, a) [r + γ vπ(s′)]. We can improve π by acting greedily: if qπ(s, π′(s)) ≥ vπ(s), then π′ is as good as or better than π. 36 / 78
  37. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Policy Improvement Theorem Policy Improvement Theorem If ∀s ∈ S, qπ(s, π′(s)) ≥ vπ(s), then π′ is as good as or better than π, i.e., vπ′ (s) ≥ vπ(s). Greedy policy: π′(s) = arg max a qπ(s, a) 37 / 78
  38. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Policy iteration: pseudocode
    Algorithm 2: Policy Iteration for estimating π ≈ π∗
    1 Initialize V(s) and π(s) arbitrarily, for all s ∈ S
    2 while True do
          // Policy Evaluation
    3     while ∆ ≥ θ do
    4         ∆ ← 0
    5         foreach s ∈ S do
    6             v ← V(s)
    7             V(s) ← Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a) [r + γ V(s′)]
    8             ∆ ← max(∆, |v − V(s)|)
          // Policy Improvement
    9     policy-stable ← True
    10    foreach s ∈ S do
    11        a ← π(s)
    12        π(s) ← arg max_a Σ_{s′,r} p(s′, r|s, a) [r + γ V(s′)]
    13        if a ≠ π(s) then
    14            policy-stable ← False
    15    if policy-stable then
    16        break
    38 / 78
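    A compact Python version of Algorithm 2, written against the tabular arrays P[s, a, s'] and R[s, a] from the earlier MDP sketch (their availability is an assumption); this is an illustrative implementation, not the author's code.

        import numpy as np

        def policy_iteration(P, R, gamma=0.9, theta=1e-6):
            nS, nA, _ = P.shape
            V = np.zeros(nS)
            pi = np.zeros(nS, dtype=int)
            while True:
                # Policy evaluation (iterative, in-place sweeps)
                while True:
                    delta = 0.0
                    for s in range(nS):
                        v = V[s]
                        V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                        delta = max(delta, abs(v - V[s]))
                    if delta < theta:
                        break
                # Policy improvement (greedy with respect to V)
                stable = True
                for s in range(nS):
                    old = pi[s]
                    q = R[s] + gamma * P[s] @ V      # Q(s, a) for all a
                    pi[s] = int(np.argmax(q))
                    if old != pi[s]:
                        stable = False
                if stable:
                    return V, pi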
  39. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Solving Multi-period inventory problem We can solve the multi-period inventory problem using policy iteration, value iteration (or Bellman optimality equation, linear programming) Example: State space: S = {s ∈ Z : −16 ≤ s ≤ 16} Action space: A = {a ∈ N : 0 ≤ a ≤ 16} Overage cost: h = 1, Underage cost: p = 10 Demand distribution: D ∼ P(5) Discount factor: γ = 0.9 39 / 78
  40. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Solving Multi-period inventory problem: Example Figure: Optimal policy (action vs. state). π∗(s) = 0 for 8 ≤ s ≤ 16; π∗(s) = 8 − s for −8 ≤ s < 8; π∗(s) = 16 for −16 ≤ s < −8. We obtained a base-stock policy for −8 ≤ s ≤ 16, which is exactly the optimal policy for this problem (h = 1, p = 10, D ∼ P(5)). 40 / 78
  41. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and Markov Decision Process Multi-period inventory problem and MDP Summary The multi-period inventory problem is an associative problem We need to find the best action for each state The multi-period inventory problem can be formulated as an MDP Methods for solving MDPs are introduced Next What if we do not know the dynamics of the environment? What if the state space/action space is too large to solve? 41 / 78
  42. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning TD Learning Temporal Difference (TD) learning does not require complete knowledge of the environment. Although a model is required, the model need only generate sample transitions; in many cases, it is difficult to obtain the transition distribution in explicit form. TD learning addresses the prediction problem (estimating vπ or qπ) and the control problem (estimating π∗). 42 / 78
  43. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning TD Prediction Recall that the value function is the expected return: vπ(s) = Eπ [Gt |St = s] = Eπ [Rt+1 + γvπ(St+1)|St = s] The Simplest TD method: TD(0) V(St ) ← V(St ) + α [Rt+1 + γV(St+1) − V(St )] Update V(St ) towards estimated return Rt+1 + γV(St+1) Rt+1 + γV(St+1) is called TD target δt = Rt+1 + γV(St+1) − V(St ) is called TD error 43 / 78
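    A tabular TD(0) update in code, as a sketch; storing V in a dictionary keyed by inventory level and the example transition values are assumptions for illustration.

        def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
            """V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
            td_target = r_next + gamma * V[s_next]   # TD target
            td_error = td_target - V[s]              # TD error delta_t
            V[s] += alpha * td_error
            return V

        V = {s: 0.0 for s in range(-16, 17)}         # tabular values keyed by state
        V = td0_update(V, s=0, r_next=-3.0, s_next=2)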
  44. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning Q-learning Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)] Use an ϵ-greedy policy to select actions. 44 / 78
  45. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning TD Control (Q-Learning): Pseudocode
    Algorithm 3: Q-learning for estimating π ≈ π∗
    Input: a small ϵ > 0, a small α ∈ (0, 1]
    Output: a deterministic policy π ≈ π∗
    1 Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
    2 while True do
    3     t ← 0
    4     Initialize St
    5     while St is not terminal do
    6         Take action At using a policy derived from Q (e.g., ϵ-greedy)
    7         Observe Rt+1 and St+1
    8         Q(St, At) ← Q(St, At) + α [Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]
    9         t ← t + 1
    45 / 78
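    A sketch of tabular Q-learning for this example. Since the inventory problem is a continuing task, the episode length is an arbitrary cutoff; clipping the inventory level to the state bounds, the hyperparameters, and the simulated Poisson demand are illustrative assumptions.

        import numpy as np

        def q_learning(episodes=500, steps=200, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
            rng = np.random.default_rng(seed)
            states = np.arange(-16, 17)
            actions = np.arange(0, 17)
            Q = np.zeros((len(states), len(actions)))
            h, p = 1.0, 10.0
            for _ in range(episodes):
                s = 0                                        # start at IL = 0
                for _ in range(steps):
                    i = s - states[0]                        # row index of state s
                    if rng.random() < eps:
                        a = int(rng.integers(len(actions)))  # explore
                    else:
                        a = int(np.argmax(Q[i]))             # exploit
                    d = rng.poisson(5)
                    level = s + int(actions[a]) - d
                    r = -(h * max(level, 0) + p * max(-level, 0))
                    s_next = int(np.clip(level, states[0], states[-1]))
                    j = s_next - states[0]
                    Q[i, a] += alpha * (r + gamma * Q[j].max() - Q[i, a])
                    s = s_next
            return states, actions, Q

        states, actions, Q = q_learning()
        print(dict(zip(states.tolist(), actions[Q.argmax(axis=1)].tolist())))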
  46. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning TD Control (Q-Learning): Example State space: S = {s ∈ Z : −16 ≤ s ≤ 16} Action space: A = {a ∈ N : 0 ≤ a ≤ 16} Discount factor: γ = 0.9 Environment: h = 1, p = 10, D ∼ P(5)
    State | a0 = 0 | a1 = 1 | ... | a|A| = 16
    s0 = −16 | Q(s0, a0) | Q(s0, a1) | ... | Q(s0, a|A|)
    s1 = −15 | Q(s1, a0) | Q(s1, a1) | ... | Q(s1, a|A|)
    ... | ... | ... | ... | ...
    s|S| = 16 | Q(s|S|, a0) | Q(s|S|, a1) | ... | Q(s|S|, a|A|)
    Table: Q-values for different state-action pairs 46 / 78
  47. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning TD Control (Q-Learning): Results Figure: Policy obtained by value iteration (left) and policy learned by Q-learning (right). 47 / 78
  48. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Multi-Period Inventory Problem and TD Learning Multi-period inventory problem and TD learning Summary Q-learning can be used to solve the multi-period inventory problem. TD learning does not require complete knowledge of the environment; a model is only required to generate sample transitions. A Q-table is required to store and update Q-values. Next: The state space can be very large or even continuous, so a function approximation method can be used. 48 / 78
  49. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning Deep Q-Network (DQN) Recall TD error for Q-learning δt = Rt+1 + γ max a′ Q(St+1, a′) − Q(St , At ) Basic idea of DQN Represent Q(s, a) by a neural network with weights θ Q(s, a; θ) Update θ to minimize the TD error δt = Rt+1 + γ max a′ Q(St+1, a′; θ) − Q(St , At ; θ) 49 / 78
  50. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning Two issues of DQN Correlations between samples Training NN requires independent and identically distributed (i.i.d.) samples But, samples taken from an episode are not i.i.d. Non-stationary targets The target yj = Rj+1 + γ maxa′ Q(Sj+1, a′; θ) changes as θ changes δt = Rt+1 + γ max a′ Q(St+1, a′; θ) − Q(St , At ; θ) 50 / 78
  51. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning Experience replay and target network Experience replay: store transitions (St, At, Rt, St+1) in a replay buffer D; sample a random minibatch of transitions (Sj, Aj, Rj, Sj+1) from D. Target network: use a separate network with weights θ− to compute the target, yj = rj for terminal Sj+1 and yj = rj + γ max_{a′} Q̂(Sj+1, a′; θ−) for non-terminal Sj+1. Update θ− every C steps. 51 / 78
  52. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning Deep Q-Network (DQN): Pseudocode
    Algorithm 4: Deep Q-Network
    Input: replay buffer capacity N, the number of steps C between target updates, a small ϵ > 0, a small α ∈ (0, 1]
    Output: a deterministic policy π ≈ π∗
    1 Initialize empty replay memory D with capacity N
    2 Initialize action-value function Q with random weights θ
    3 Initialize target action-value function Q̂ with weights θ− ← θ
    4 for episode = 1, 2, . . . , M do
    5     t ← 0
    6     Initialize St
    7     while St is not terminal do
    8         With probability ϵ select a random action At
    9         Otherwise take action At using a policy derived from Q
    10        Execute action At and observe reward Rt and St+1
    11        Store transition (St, At, Rt, St+1) in D
    12        t ← t + 1
    13        Sample a random minibatch of transitions (Sj, Aj, Rj, Sj+1) from D
    14        Set yj = rj for terminal Sj+1, and yj = rj + γ max_{a′} Q̂(Sj+1, a′; θ−) for non-terminal Sj+1
    15        Perform a gradient descent step on (yj − Q(Sj, Aj; θ))² with respect to the network parameters θ
    16        Every C steps reset Q̂ ← Q
    52 / 78
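    A minimal PyTorch sketch of the core DQN update (replay buffer, target network, one gradient step on the TD error). The network architecture, hyperparameters, and the single-feature state encoding are illustrative assumptions; the environment interaction loop from Algorithm 4 is omitted.

        import random
        from collections import deque

        import torch
        import torch.nn as nn

        n_state_features, n_actions = 1, 17   # e.g., inventory level in, one Q-value per order quantity

        q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
        target_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
        target_net.load_state_dict(q_net.state_dict())
        optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

        replay = deque(maxlen=10_000)          # stores (s, a, r, s_next, done) tuples

        def dqn_update(batch_size=32, gamma=0.9):
            if len(replay) < batch_size:
                return
            batch = random.sample(replay, batch_size)
            s, a, r, s_next, done = map(torch.tensor, zip(*batch))
            s = s.float().unsqueeze(1)
            s_next = s_next.float().unsqueeze(1)
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():              # target network provides y_j
                y = r.float() + gamma * target_net(s_next).max(dim=1).values * (1 - done.float())
            loss = nn.functional.mse_loss(q_sa, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        def sync_target():
            """Copy theta to theta^- every C steps."""
            target_net.load_state_dict(q_net.state_dict())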
  53. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning DRL for inventory management: Example Figure: Perishable inventory problem (state s1, . . . , s10: inventory in transit and on-hand inventory level by remaining shelf life). De Moor, Bram J., Joren Gijsbrechts, and Robert N. Boute. 2022. “Reward Shaping to Improve the Performance of Deep Reinforcement Learning in Perishable Inventory Management.” European Journal of Operational Research 301 (2): 535-45. 53 / 78
  54. ML and Opt for Inventory Reinforcement Learning for Inventory Optimization

    Inventory Management and Deep Reinforcement Learning DRL for inventory management: Summary Summary Machine learning models can be used to approximate the action-value function Integrating with machine learning enables solving more complex problems Future Directions Offline Methods Explainability Privacy-preserving 54 / 78
  55. ML and Opt for Inventory Surrogate Models in Optimization Table

    of Contents 1 Reinforcement Learning for Inventory Optimization Newsvendor Problem and Multi-armed Bandit Problem Multi-Period Inventory Problem and Markov Decision Process Multi-Period Inventory Problem and TD Learning Inventory Management and Deep Reinforcement Learning 2 Surrogate Models in Optimization Surrogate-based Optimization Trust Region Methods Bayesian Optimization Surrogate-assisted Evolutionary Optimization 3 Appendix 55 / 78
  56. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-based

    Optimization Surrogate-based Optimization Motivation: The objective function f is highly nonlinear. The objective function f can only be obtained by simulation. Basic Idea: Approximate the objective function f with a surrogate model f̂. Solve x = arg min_x f̂(x). Note: This method builds the surrogate model only once; the accuracy of f̂ at the obtained solution is unknown. Gurobi Machine Learning: https://gurobi-machinelearning.readthedocs.io/ 56 / 78
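    A toy sketch of the one-shot idea: fit a surrogate to a few samples of f, then minimize the surrogate instead of f. The cubic-polynomial surrogate and the test function are illustrative choices, not the Gurobi Machine Learning workflow referenced above.

        import numpy as np
        from scipy.optimize import minimize_scalar
        from sklearn.linear_model import LinearRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        def f(x):                                     # "expensive" black-box objective
            return (x - 2.0) ** 2 + 0.5 * np.sin(5 * x)

        X = np.linspace(0, 4, 8).reshape(-1, 1)       # a handful of true evaluations
        y = f(X).ravel()

        surrogate = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
        surrogate.fit(X, y)                           # build the surrogate model once

        res = minimize_scalar(lambda x: surrogate.predict([[x]])[0],
                              bounds=(0, 4), method="bounded")
        print("surrogate minimizer:", res.x, "true f there:", f(res.x))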
  57. ML and Opt for Inventory Surrogate Models in Optimization Trust

    Region Methods Trust Region Methods Basic Idea: Approximate the objective function f with f̂ (normally a second-order Taylor expansion). Solve x = arg min_x f̂(x) within a trust region. Expand or shrink the trust region based on the improvement. Accept the new solution if the improvement is large enough. 57 / 78
  58. ML and Opt for Inventory Surrogate Models in Optimization Trust

    Region Methods Trust Region Methods: Pseudocode
    Algorithm 5: Trust Region Methods
    Input: initial guess x0, trust region radius δ, thresholds η1, η2, scale factors γ1, γ2
    Output: solution x
    1 k ← 0
    2 evaluate f(xk)
    3 while not converged do
    4     solve min_p f̂(xk + p) s.t. ∥p∥ ≤ δk
    5     compute ρk = (f(xk) − f(xk + pk)) / (f(xk) − f̂(xk + pk))
    6     if ρk < η1 then
    7         δk+1 ← γ1 δk
    8     else
    9         xk+1 ← xk + pk
    10        if ρk > η2 then
    11            δk+1 ← γ2 δk
    12    k ← k + 1
    58 / 78
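    In practice a trust-region solver rarely has to be written by hand; below is a minimal usage sketch with SciPy's trust-region Newton-CG method on the Rosenbrock test function (illustrative only, not the inventory objective from these slides).

        import numpy as np
        from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

        x0 = np.array([-1.2, 1.0])
        res = minimize(rosen, x0, method="trust-ncg", jac=rosen_der, hess=rosen_hess)
        print(res.x)                                  # converges to the minimizer [1, 1]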
  59. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Bayesian Optimization Motivation: Evaluating f is expensive or time-consuming. Only a few evaluations of f are allowed. No first- or second-order derivatives are available. Basic Idea: Approximate f with a surrogate model f̂ (Gaussian process). Find the next evaluation point x by maximizing the acquisition function α(x). Update f̂ with the new evaluation point x and f(x). Repeat until the budget is exhausted. 59 / 78
  60. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Gaussian Process A set of points X = {x1, x2, . . . , xn} is given. The corresponding y = {y1, y2, . . . , yn} are observed. Predict yn+1 at a new point xn+1. yn+1|y ∼ N(µ(xn+1), σ2(xn+1)) µ(xn+1) is the estimated mean of yn+1. σ2(xn+1) is the estimated variance of yn+1, which is a measure of uncertainty. 60 / 78
  61. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Gaussian Process: Example Figure: Gaussian process fitted to the true function f(x) = x sin(x), showing the GP mean, 95% confidence interval, and training data. 61 / 78
  62. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Acquisition Function Find the next evaluation point xt+1 by maximizing α(x): xt+1 = arg max_x α(x). Prediction-based exploration: α(x) = −µ(x). Error-based exploration: α(x) = σ(x). Lower confidence bound: α(x) = −(µ(x) − kσ(x)). . . . Figure: GP fit of f(x) = x sin(x) with the lower confidence bound acquisition function (k = 5). 62 / 78
  63. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Bayesian Optimization: Pseudocode
    Algorithm 6: Bayesian Optimization
    Input: X = {x1, x2, . . . , xn}, y = {y1, y2, . . . , yn}, acquisition function α(x), budget T
    Output: solution x∗
    1 while t < T do
    2     Update the Gaussian process model with X and y
    3     xt+1 ← arg max_x α(x)
    4     Evaluate yt+1 = f(xt+1)
    5     X ← X ∪ {xt+1}, y ← y ∪ {yt+1}
    6 x∗ ← the evaluated point with the smallest observed y
    63 / 78
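    A sketch of Algorithm 6 using a scikit-learn Gaussian process and the lower confidence bound acquisition α(x) = −(µ(x) − kσ(x)); maximizing the acquisition over a fixed grid, the f(x) = x sin(x) test function, and the budget are illustrative assumptions.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def f(x):
            return x * np.sin(x)

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 10, size=(3, 1))           # initial design
        y = f(X).ravel()
        grid = np.linspace(0, 10, 500).reshape(-1, 1)

        for t in range(15):                           # budget T = 15
            gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
            gp.fit(X, y)                              # update the GP with X and y
            mu, sigma = gp.predict(grid, return_std=True)
            acq = -(mu - 5.0 * sigma)                 # LCB acquisition, k = 5
            x_next = grid[np.argmax(acq)].reshape(1, 1)
            X = np.vstack([X, x_next])
            y = np.append(y, f(x_next).ravel())       # evaluate the true function

        print("best observed x:", X[np.argmin(y)].item(), "f:", y.min())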
  64. ML and Opt for Inventory Surrogate Models in Optimization Bayesian

    Optimization Properties of Bayesian Optimization x ∈ Rd , d is not too large. Typically, d ≤ 20. Feasible region X is a simple set (e.g., a hyper-rectangle). 0Frazier, Peter I. ”A tutorial on Bayesian optimization.” arXiv preprint arXiv:1807.02811 (2018). 64 / 78
  65. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Surrogate-assisted Evolutionary Optimization Motivation: The objective function f is expensive. Evolutionary computation needs many evaluations. Basic Idea: Approximate the objective function or constraint functions (regression models); predict the feasibility or superiority of solutions (classification models). Use the surrogate models to evaluate candidate solutions. Only evaluate the “promising” solutions with the true function. Update the surrogate models with a certain frequency. 65 / 78
  66. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Pseudocode Algorithm 7: Surrogate-assisted Evolutionary Optimization Input: X = x1, x2, . . . , xn, y = y1, y2, . . . , yn, surrogate model M, real function f, budget T 1 while not terminate do 2 Update M using X and y; 3 Generate new candidate solutions X′; 4 Evaluate X′ with M; 5 Select promising solutions from X′; 6 Evaluate the selected solutions with the real function f; 7 Add the evaluated solutions to X and y; 66 / 78
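    A sketch of Algorithm 7 as a simple mutation-based evolutionary loop in which a random-forest surrogate pre-screens offspring and only the most promising few receive a true evaluation. The sphere test function, population sizes, and surrogate choice are illustrative assumptions.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def f(x):                                     # expensive objective (sphere function)
            return np.sum(x ** 2, axis=-1)

        rng = np.random.default_rng(0)
        dim, pop, n_offspring, n_true = 5, 20, 100, 5
        X = rng.uniform(-5, 5, size=(pop, dim))       # initial population
        y = f(X)

        for gen in range(20):
            model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
            parents = X[np.argsort(y)[:pop]]          # keep the best evaluated points
            idx = rng.integers(pop, size=n_offspring)
            children = parents[idx] + rng.normal(scale=0.5, size=(n_offspring, dim))
            pred = model.predict(children)            # cheap surrogate evaluation
            promising = children[np.argsort(pred)[:n_true]]
            X = np.vstack([X, promising])             # true evaluations only for these
            y = np.append(y, f(promising))

        print("best found:", y.min())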
  67. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Surrogate-assisted Optimization for Inventory Motivation: Difficult to explicitly formulate supply chain simulation models. Simulation model under uncertainty requires a large number of samples, which is time-consuming. Supply chain digital twin can be expensive to evaluate. Figure: Supply Chain Digital Twin Developed by anyLogistix 0 Liu, Ziang, and Tatsushi Nishi. 2023. “Data-Driven Evolutionary Computation for Service Constrained Inventory Optimization in Multi-Echelon Supply Chains.” Complex & Intelligent Systems, August. https://doi.org/10.1007/s40747-023-01179-0. 67 / 78
  68. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Surrogate-assisted Optimization for Inventory Multi-period multi-echelon perishable inventory problem Predict total cost/service level of an inventory policy with a surrogate model (s, S) policy, (r, Q) policy, (s, Q) policy, . . . e.g., x = (s1, S1, s2, S2, . . . , sN, SN ) Figure: Perishable inventory problem in a distribution system 68 / 78
  69. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Surrogate-assisted Optimization for Inventory Figure: Convergence graph (Left), correct selection rate (Right) 0 Liu, Ziang, and Tatsushi Nishi. 2023. “Surrogate-Assisted Evolutionary Optimization for Perishable Inventory Management in Multi-Echelon Distribution Systems.” Expert Systems with Applications, October, 122179. 69 / 78
  70. ML and Opt for Inventory Surrogate Models in Optimization Surrogate-assisted

    Evolutionary Optimization Summary Summary: Surrogate-assisted optimization is useful when the objective function is expensive. Machine learning models can be used to approximate the objective function or constraint functions. It is able to obtain a good solution with a small number of evaluations. 70 / 78
  71. ML and Opt for Inventory Appendix Table of Contents 1

    Reinforcement Learning for Inventory Optimization Newsvendor Problem and Multi-armed Bandit Problem Multi-Period Inventory Problem and Markov Decision Process Multi-Period Inventory Problem and TD Learning Inventory Management and Deep Reinforcement Learning 2 Surrogate Models in Optimization Surrogate-based Optimization Trust Region Methods Bayesian Optimization Surrogate-assisted Evolutionary Optimization 3 Appendix 71 / 78
  72. ML and Opt for Inventory Appendix Policy Improvement Theorem (Optional)

    vπ(s) ≤ qπ(s, π′(s)) = E[Rt+1 + γ vπ(St+1) | St = s, At = π′(s)] ≤ Eπ′[Rt+1 + γ qπ(St+1, π′(St+1)) | St = s] ≤ Eπ′[Rt+1 + γ Rt+2 + γ² qπ(St+2, π′(St+2)) | St = s] . . . = vπ′(s) 72 / 78
  73. ML and Opt for Inventory Appendix Policy iteration: Policy Improvement

    Suppose the new policy π′ is as good as, but not better than, the old policy π. Then vπ′(s) = vπ(s) for all s ∈ S, and vπ′(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γ vπ′(s′)]. This is the Bellman optimality equation, so π′ must be an optimal policy. 73 / 78
  74. ML and Opt for Inventory Appendix Value Function ∞-Norm (Optional)

    We use ∞-norm to measure the difference between two value functions u and v ∞-norm is the maximum absolute difference between state values ∥u − v∥∞ = max s∈S |u(s) − v(s)| 74 / 78
  75. ML and Opt for Inventory Appendix Bellman Expectation Backup is

    a Contraction (Optional) Define the Bellman operator Tπ: Tπ(v) = Rπ + γPπv. This operator is a γ-contraction, i.e., it brings value functions closer by at least a factor of γ: ∥Tπ(u) − Tπ(v)∥∞ = ∥Rπ + γPπu − Rπ − γPπv∥∞ = γ∥Pπu − Pπv∥∞ ≤ γ∥u − v∥∞ 75 / 78
  76. ML and Opt for Inventory Appendix Value iteration Must we

    wait for policy evaluation to converge before improving the policy? Policy improvement after one update of each state: V_{k+1}(s) = max_a Σ_{s′,r} p(s′, r|s, a) [r + γ V_k(s′)]. For arbitrary v0, vk converges to v∗. The output policy can then be obtained as π(s) = arg max_a Σ_{s′,r} p(s′, r|s, a) [r + γ V∗(s′)]. 76 / 78
  77. ML and Opt for Inventory Appendix Value iteration: pseudocode Algorithm

    8: Value Iteration for estimating π ≈ π∗
    Input: a small threshold θ > 0 determining the accuracy of estimation
    Output: a deterministic policy π ≈ π∗, such that π(s) = arg max_a Σ_{s′,r} p(s′, r|s, a) [r + γ V(s′)]
    1 Initialize V(s) arbitrarily, for all s ∈ S+
    2 Initialize V(terminal) = 0
    3 while ∆ ≥ θ do
    4     ∆ ← 0
    5     foreach s ∈ S do
    6         v ← V(s)
    7         V(s) ← max_a Σ_{s′,r} p(s′, r|s, a) [r + γ V(s′)]
    8         ∆ ← max(∆, |v − V(s)|)
    77 / 78
  78. ML and Opt for Inventory Appendix Gaussian Process (Optional) µ(x)

    = K(x, X) K(X, X)⁻¹ (y − m(X)) + m(x) σ²(x_{n+1}) = K(x_{n+1}, x_{n+1}) − K(x_{n+1}, X) K(X, X)⁻¹ K(X, x_{n+1}) Mean function and Kernel: m(x): Mean function K(x, x′): Kernel function Commonly used functions: Zero mean function: m(x) = 0 Gaussian kernel: α exp(−∥x − x′∥²) . . . 78 / 78
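    The posterior mean and variance above, with a zero mean function and the Gaussian kernel α exp(−∥x − x′∥²), can be computed directly as a sketch; the jitter term added for numerical stability is an implementation assumption.

        import numpy as np

        def gaussian_kernel(A, B, alpha=1.0):
            sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
            return alpha * np.exp(-sq)                # alpha * exp(-||x - x'||^2)

        def gp_posterior(X, y, X_new, alpha=1.0, jitter=1e-8):
            K = gaussian_kernel(X, X, alpha) + jitter * np.eye(len(X))
            K_star = gaussian_kernel(X_new, X, alpha)         # K(x, X)
            mu = K_star @ np.linalg.solve(K, y)               # zero mean function m = 0
            v = np.linalg.solve(K, K_star.T)                  # K(X, X)^{-1} K(X, x)
            var = (gaussian_kernel(X_new, X_new, alpha).diagonal()
                   - np.einsum("ij,ji->i", K_star, v))
            return mu, var

        X = np.array([[1.0], [3.0], [5.0], [7.0]])
        y = (X * np.sin(X)).ravel()
        mu, var = gp_posterior(X, y, np.array([[4.0]]))
        print(mu, var)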