

Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

IEEE COMPSAC 2024
https://ieeecompsac.computer.org/2024/
Osaka, Japan, July 2–4, 2024


monochromegane

July 04, 2024



Transcript

  1. Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

    Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1] / [1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc. July 2-4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024)
  2. Background
     • Selecting an optimal behavior from many candidates is crucial for practical applications.
     • The effectiveness of the candidates cannot be known in advance.
     • Continuous comparative evaluation in an actual environment is essential.
     • However, such evaluation results in opportunity loss (both short-term and long-term).
     • The reduction of opportunity loss can be treated as a multi-armed bandit (MAB) problem.
  3. Multi-Armed Bandit (MAB) Problem
     • The player must select one arm from multiple arms to maximize the reward.
     • The reward is stochastically given based on the chosen arm.
     • The player needs to infer the reward distribution from the trial results.
     • To find the optimal arm, the player must balance exploitation and exploration.
     • The term 'arm' is derived from the 'arm' of a slot machine.
  4. The simplest MAB policy: ϵ-greedy
     • The ϵ-greedy policy selects an arm uniformly at random with probability ϵ (exploration) and selects the arm with the highest average reward so far with probability 1 − ϵ (exploitation).
     • Selection rule: choose argmax_{l=1,…,L} ŷ(l) with probability 1 − ϵ, and each arm a ∈ A with probability ϵ/L. A/B testing, in contrast, keeps assigning every arm a ∈ A the same probability. A minimal sketch of ϵ-greedy follows below.
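To make the selection rule concrete, here is a minimal Python sketch of ϵ-greedy. It is not taken from the slides; the 3-armed environment, reward distribution, and running-average update are illustrative assumptions.

```python
import random

def epsilon_greedy(avg_rewards, epsilon=0.1):
    """Explore uniformly at random with probability epsilon, otherwise exploit
    the arm with the highest average reward observed so far."""
    if random.random() < epsilon:
        return random.randrange(len(avg_rewards))  # exploration: each arm with prob. epsilon / L
    return max(range(len(avg_rewards)), key=lambda l: avg_rewards[l])  # exploitation

# Illustrative usage with a hypothetical 3-armed environment where arm 2 is best.
avg_rewards, counts = [0.0] * 3, [0] * 3
for _ in range(1000):
    arm = epsilon_greedy(avg_rewards)
    reward = random.gauss(1.0 if arm == 2 else 0.0, 0.1)
    counts[arm] += 1
    avg_rewards[arm] += (reward - avg_rewards[arm]) / counts[arm]  # running average per arm
```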
  5. Motivation of our Approach
     • Nonlinearity: in practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models.
     • Nonstationarity: user preferences and environmental conditions in real-world applications change over time.
     • Online performance: response delays are detrimental to the user experience in real-world applications.
  6. Gaussian Process Regression
     • A method to infer the distribution of a function as a stochastic process from data; effective in modeling nonlinear functions and prediction uncertainties.
     • Use of a kernel function k(xi, xj): efficiently computes the inner product in a high-dimensional feature space of basis functions.
     • Application to nonlinear MAB problems: widely explored.
     • Challenge of GP regression: training time grows steeply as the dataset grows, because prediction requires the inverse K⁻¹ of the kernel matrix K ∈ ℝN×N. In addition, few policies have been developed to handle nonstationarity. (A minimal sketch of exact GP prediction follows below.)
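As a reference point for the later approximations, here is a minimal numpy sketch of exact GP prediction. It is not from the slides; the RBF kernel, noise level, and synthetic data are illustrative assumptions. The solve against the N × N matrix K is the step whose cost grows with the dataset.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=0.1):
    """Exact GP posterior mean and variance; the solve on K in R^{N x N} dominates the cost."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(X))   # K in R^{N x N}
    K_star = rbf_kernel(X_star, X)
    alpha = np.linalg.solve(K, y)
    mean = K_star @ alpha
    var = 1.0 - np.einsum('ij,ji->i', K_star, np.linalg.solve(K, K_star.T)) + noise ** 2
    return mean, var

# Tiny usage example with synthetic 1-D data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, var = gp_predict(X, y, np.array([[0.0], [2.0]]))
```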
  7. Weighted GP-UCB [Y. Deng 2022]
     • A GP regression model-based nonstationary and nonlinear policy.
     • Nonstationarity: focuses on new training data by using two new weight matrices.
     • Online performance: using Random Fourier Features (RFF), computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, where R ⋘ N. Since K ≃ ZZ⊤ with K ∈ ℝN×N and Z ∈ ℝN×R, the matrix to invert becomes Z⊤Z ∈ ℝR×R, keeping its size constant and reducing computational complexity.
     • Limitation: absence of a recursive learning algorithm, so it only partially mitigates the issue of escalating training time with a growing dataset.
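For context, the UCB-style decision rule shared by these GP-based policies picks the arm maximizing the predictive mean plus a scaled predictive standard deviation. The sketch below is an assumption-level illustration (the helper name and inputs are hypothetical); β is the exploration scale that the experiments later fix to 1.

```python
import numpy as np

def ucb_select(means, variances, beta=1.0):
    """Generic UCB rule: choose the arm maximizing mean + beta * std,
    where mean/variance come from the policy's predictive distribution."""
    scores = np.asarray(means) + beta * np.sqrt(np.asarray(variances))
    return int(np.argmax(scores))

# Hypothetical usage: per-arm predictive means/variances from any model (GP, RFF linear model, ...).
chosen = ucb_select(means=[0.2, 0.5, 0.4], variances=[0.30, 0.01, 0.10], beta=1.0)
```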
  8. NysKRLS [T. Zhang 2020]
     • A nonstationary kernel Recursive Least Squares (RLS) method.
     • Nonstationarity: introduces a forgetting mechanism for old training data.
     • Online performance: using the Nyström approximation, computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, where R ⋘ N, and applies a linear RLS algorithm.
     • Limitation: the regularization effect decreases as the number of recursive computations increases (after M recursive updates the maintained inverse is (Z⊤ΓZ + γᴹΛ)⁻¹, so the regularizer shrinks by γᴹ), and the resulting loss of estimation accuracy due to overfitting can be critical.
  9. Our Proposed Policy
     • An online nonstationary and nonlinear contextual MAB policy.
     • Nonlinearity and nonstationarity: introduces a forgetting mechanism into a nonlinear GP regression model.
     • Online performance: using RFF, computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, where R ⋘ N, and applies a linear RLS algorithm.
     • Key features: fast decision-making with recursive learning, and accurate error correction in the predictive distribution.
     • [Figure: recursive chain of model parameters (P_{N−1,M−1}, Q_{N−1,M−1}) to (P_{N,M}, Q_{N,M}), each update consuming features z of a new observation (x_N, y_N) with forgetting factor γ, and each state yielding a predictive distribution p(y_N | x_N, X, y) and p(y* | x*, X, y); the (M = 0) markers indicate points where error correction resets the accumulated error.]
  10. Random Fourier Features
     • A method to approximate the kernel function k(xi, xj) using R′ = R/2 samples from a probability distribution p(ω), so that k(xi, xj) ≃ z(xi)⊤z(xj).
     • Linear method: by decomposing the kernel function into inputs with R-dimensional features, the original problem can be solved as a linear problem in R-dimensional space. With K ≃ ZZ⊤ and Z ∈ ℝN×R, the matrix to invert becomes Z⊤Z ∈ ℝR×R instead of K ∈ ℝN×N. (A minimal sketch follows below.)
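A minimal numpy sketch of RFF for the RBF kernel, assuming Gaussian frequency samples and stacked cos/sin features; the lengthscale, seed handling, and helper names are illustrative choices, since the slides do not pin down the kernel.

```python
import numpy as np

def rff_features(X, R=100, lengthscale=1.0, seed=0):
    """Random Fourier Features for an RBF kernel: draw R' = R/2 frequencies
    omega ~ N(0, I / lengthscale^2) and stack cos/sin components so that
    z(x) is R-dimensional and k(xi, xj) ~= z(xi) @ z(xj)."""
    rng = np.random.default_rng(seed)
    R_half = R // 2
    omega = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], R_half))
    proj = X @ omega                                   # (N, R/2) projections omega^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R_half)

# Quick check of the approximation K ~= Z Z^T against the exact RBF kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
Z = rff_features(X, R=2000)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
K_approx = Z @ Z.T                                     # close to K_exact for large R
```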
  11. Comparative Learning Methods
     • GP learning: at every new observation x1, …, xN, recompute the inverse of the kernel matrix K′ ∈ ℝN×N, whose size grows with the number of observations.
     • GP with RFF learning: at every new observation, recompute the inverse (Z⊤Z + Λ)⁻¹, where Z⊤Z ∈ ℝR×R keeps a constant size.
  12. Forgetting Mechanism
     • Incorporating exponential forgetting for past inputs and outputs enhances estimation accuracy in nonstationary environments.
     • The proposed method assumes that more distant past data has larger observation errors ϵn, and thus lower accuracy. (A minimal weighting sketch follows below.)
     • [Figure: transition of the observation-error distribution ϵn; past data has larger observation errors, recent data has smaller observation errors.]
     • [Figure: predictive distribution (μ̂ and 1σ confidence based on Σ̂) for a nonstationary and nonlinear regression problem, with past training data from fA and recent training data from fB; the prediction μ̂ fits the recent training data, i.e., quick adaptation to changing environments.]
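A minimal sketch of how exponential forgetting can be realized as a weighted ridge solution, in the spirit of the (Z⊤ΓZ + Λ)⁻¹ form on the following slides. The assumptions Λ = λI and Γ = diag(γ^{N−1}, …, γ, 1) are illustrative readings, not the paper's exact definitions.

```python
import numpy as np

def weighted_batch_fit(Z, y, gamma=0.99, lam=1.0):
    """Batch ridge regression with exponential forgetting:
    theta = (Z^T Gamma Z + Lambda)^{-1} Z^T Gamma y,
    where Gamma down-weights old rows (oldest sample gets gamma^(N-1), newest gets 1)."""
    N, R = Z.shape
    weights = gamma ** np.arange(N - 1, -1, -1)   # forgetting weights, oldest to newest
    Gamma = np.diag(weights)
    A = Z.T @ Gamma @ Z + lam * np.eye(R)         # (Z^T Gamma Z + Lambda), an R x R matrix
    return np.linalg.solve(A, Z.T @ Gamma @ y)
```

Because old rows carry exponentially small weights, the fit tracks the most recent data, which is the behavior illustrated by the predictive-distribution figure on this slide.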
  13. Comparative Learning Methods
     • GP with RFF learning: at every step, recompute the inverse (Z⊤Z + Λ)⁻¹, with Z⊤Z ∈ ℝR×R.
     • Nonstationary GP with RFF learning: at every step, recompute the inverse (Z⊤ΓZ + Λ)⁻¹, with Z⊤ΓZ ∈ ℝR×R, where Γ applies the forgetting weights.
  14. Recursive Learning Mechanism
     • Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations.
     • Efficiency: the algorithm computes the parameters at time N + 1 efficiently from the results up to time N and the observation at time N + 1.
     • Reduced computational cost: by avoiding the need to recompute the inverse matrix from scratch, the algorithm significantly reduces computational cost. (A minimal sketch follows below.)
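A minimal sketch of exponentially weighted RLS with a rank-one (Sherman-Morrison) update; the class name and the assumption Λ = λI are illustrative, not the paper's exact formulation. Note how the regularization term is also discounted by γ at every update, which is exactly the error source discussed on the next slides.

```python
import numpy as np

class ForgettingRLS:
    """Exponentially weighted recursive least squares (minimal sketch).
    Maintains P equal to (Z^T Gamma Z + gamma^M Lambda)^{-1}, with Gamma the
    running forgetting weights: the regularizer is discounted by gamma at every step."""

    def __init__(self, R, gamma=0.99, lam=1.0):
        self.gamma = gamma
        self.P = np.eye(R) / lam        # Lambda^{-1} at M = 0, with Lambda = lam * I assumed
        self.theta = np.zeros(R)

    def update(self, z, y):
        # Rank-1 (Sherman-Morrison) update: no R x R inverse is recomputed from scratch.
        Pz = self.P @ z
        k = Pz / (self.gamma + z @ Pz)
        self.theta += k * (y - z @ self.theta)
        self.P = (self.P - np.outer(k, Pz)) / self.gamma

    def predict(self, z):
        return z @ self.theta
```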
  15. Comparative Learning Methods
     • Batch learning: at every step, recompute (Z⊤ΓZ + Λ)⁻¹ from scratch.
     • Recursive learning: update the inverse incrementally with the forgetting factor γ; after M recursive updates the maintained matrix is (Z⊤ΓZ + γᴹΛ)⁻¹, i.e., the regularization term has been discounted along with the data.
  16. Estimation Error in Recursive Learning
     • Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term as well: the recursion yields (Z⊤ΓZ + γᴹΛ)⁻¹ instead of the batch solution's (Z⊤ΓZ + Λ)⁻¹.
     • [Figure: estimation error of the predictive-distribution parameters μ̂ and Σ̂ for each M (M = 0, 200, 400, 600); the error grows as M increases.]
     • For MAB policies that run for long periods, the loss of estimation accuracy due to overfitting can be fatal.
     • Addressing this error causes a trade-off between accuracy and online performance.
  17. Error Correction Method
     • A novel recursive error correction method balances estimation accuracy and online performance.
     • Correction steps: (1) invert the recursively maintained (Z⊤ΓZ + γᴹΛ)⁻¹, (2) subtract the accumulated error so that the regularization term γᴹΛ is restored to Λ (= γ⁰Λ), (3) invert again. After correction there is no estimation error (M = 0).
     • The correction has a high computational cost because it requires two inversions of an R × R matrix, so it is executed at intervals based on the acceptable estimation accuracy rather than at every step. (A minimal sketch follows below.)
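A minimal sketch of the correction step under the illustrative assumption Λ = λI; the function name and the explicit (1 − γᴹ)λI form are mine, inferred from the slide's "invert, subtract error, invert" description.

```python
import numpy as np

def correct_regularizer(P, gamma, M, lam=1.0):
    """Restore the recursively maintained P = (Z^T Gamma Z + gamma^M Lambda)^{-1}
    to the batch form (Z^T Gamma Z + Lambda)^{-1}, assuming Lambda = lam * I.
    Costs two R x R inversions, so the policy runs it only every tau steps."""
    R = P.shape[0]
    A = np.linalg.inv(P)                          # 1. invert: Z^T Gamma Z + gamma^M * Lambda
    A += (1.0 - gamma ** M) * lam * np.eye(R)     # 2. swap the discounted regularizer gamma^M * Lambda for Lambda
    return np.linalg.inv(A)                       # 3. invert again: (Z^T Gamma Z + Lambda)^{-1}
```

After this call the recursion effectively restarts with M = 0, matching the "(M = 0)" markers on the earlier policy diagram.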
  18. Simulation Setup
     • Nonstationary and nonlinear contextual MAB simulation.
     • The arm's reward follows a normal distribution 𝒩(μ, σ²), with the mean μ corresponding to the context xt = (xt,d)1≤d≤2 as shown on the slide.
     • The banded curve moves leftward over time (one full rotation in ρ = 4000 trials).
     • The arm a(1) always has mean μ = μ1.
     • Parameters: μ1 = 0.1, μ2 = 0.0, μ3 = 1.0, σ² = 0.01, δ = 0.8, ρ = 4000.
     • The magnitudes of the means satisfy μ2 < μ1 ≪ μ3. To maximize the expected reward, the player must select the corresponding arm within the band curve and choose arm a(1) otherwise.
  19. Baseline Policies (properties: nonlinear / nonstationary / recursive learning)
     • RW-GPB (Proposal): nonlinear, nonstationary, recursive learning. Evaluated with multiple correction intervals τ to compare error-correction effects.
     • GP+UCB (Weighted, RFF): nonlinear, nonstationary. State-of-the-art; reduced learning time with RFF.
     • GP+UCB (Weighted): nonlinear, nonstationary.
     • GP+UCB (Sliding Window): nonlinear, nonstationary.
     • GP+UCB (Restarting): nonlinear, nonstationary.
     • GP+UCB: nonlinear.
     • Decay LinUCB: nonstationary, recursive learning.
     • A constant exploration scale β = 1 is used for all policies to clarify the effect of each policy's regression model.
  20. Simulation Results: Trade-off Analysis
     • The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy.
     • Compared to GP+UCB (Weighted, RFF): RW-GPB (τ = 1) reduces the execution time by 71% with equal rewards, and RW-GPB (τ = 40) reduces the execution time by 92% with higher rewards.
     • The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time, so improving the approximation accuracy of the kernel function is also essential.
  21. Simulation Results: Trade-off Analysis
     • Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time; error correction should therefore be performed aggressively.
     • Interestingly, the cumulative reward is higher for τ = 4 than for the most accurate setting τ = 1. This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments.
     • [Figure: cumulative rewards vs. trials per second trade-off for RW-GPB (τ = 1, 4, 40, 100, 400, 800, 1600), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]
  22. Simulation Results: Computation Time
     • The proposed RW-GPB policy keeps the per-trial execution time constant.
     • The policy without recursive learning increases the execution time linearly with the number of trials.
     • In addition, the policy without RFF increases it far more steeply.
     • [Figure: cumulative execution time (sec) over 4000 trials for RW-GPB (τ = 4), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).]
  23. Conclusion
     • RW-GPB Policy: we introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems, which balances accuracy and online performance.
     • RW-GPR Model: our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy.
     • Experimental Results: experiments demonstrate that RW-GPB significantly reduces computational time while maintaining cumulative reward in simulations.
  24. Future Work
     • Meta-Recommender System: we aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy.
     • Client-Side Agents: future research will explore applying this lightweight policy to client-side agents for solving complex autonomous tasks on resource-constrained devices.
     • Real-World Effectiveness: we expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.