Julia Wrobel
October 08, 2023

Analysis of wearable device data using functional data models

Talk for Georgia statistics day 2023

Transcript

  1. Analysis of “big N” wearable device data using functional data models

    Julia Wrobel, PhD
    Department of Biostatistics and Bioinformatics
  2. BIOSTATISTICS, EPIDEMIOLOGY, & RESEARCH DESIGN FORUM: Advances and Challenges in Wearables Research

    • Keynote speaker: Julia Wrobel, PhD
    • Friday, November 3, 10:00 AM to 3:00 PM
    • In-person: Morehouse School of Medicine, Building A, 4th Floor
    • Virtual: Zoom
    • Register: bit.ly/BERD2023
  3. Accelerometers

    • Physical activity is key to many health-related questions
      • Active individuals tend to live longer and healthier lives
    • Traditionally, physical activity has been measured using retrospective questionnaires
    • Accelerometers have become hugely popular
      • Objective
      • Collection “in the wild”
      • High resolution
  4. Accelerometer data processing pipeline

    • Physical activity (PA) measures: total steps / counts, moderate-to-vigorous physical activity (MVPA) minutes
    • Sedentary measures: sedentary time, number of sedentary bouts
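
As a rough illustration of the summary measures on this slide, the sketch below computes them from minute-level activity counts. The cut points `SEDENTARY_MAX` and `MVPA_MIN` and the function name `summarize_day` are hypothetical placeholders, not values from the talk; real cut points depend on the device, its placement, and the population.

```python
from itertools import groupby

# Hypothetical cut points (counts per minute); assumptions for illustration only.
SEDENTARY_MAX = 100   # counts below this are treated as sedentary
MVPA_MIN = 2000       # counts at or above this are treated as MVPA


def summarize_day(counts):
    """Return scalar PA and sedentary summaries for minute-level counts."""
    sedentary = [c < SEDENTARY_MAX for c in counts]
    # A sedentary bout is a maximal run of consecutive sedentary minutes.
    bouts = sum(1 for is_sed, _ in groupby(sedentary) if is_sed)
    return {
        "total_counts": sum(counts),
        "mvpa_minutes": sum(c >= MVPA_MIN for c in counts),
        "sedentary_minutes": sum(sedentary),
        "sedentary_bouts": bouts,
    }


# Ten synthetic minutes of counts.
day = [0, 10, 50, 500, 2500, 3000, 20, 0, 1500, 0]
summary = summarize_day(day)
```

Each summary collapses the whole day into a single number, which is the information loss that the functional approach later in the talk tries to avoid.
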
  5. Reproducibility and rigor

    • Much of this is still up for debate
    • Consider moderate-to-vigorous physical activity (MVPA)
      • How are “activity counts” generated?
      • How are cut points formed (no PA / light PA / MVPA)?
      • Are these consistent across devices? Age groups? Placements?
    • Some general recommendations
      • Keep data in rawest form possible
      • Process using non-proprietary software
  6. Functional data analysis (FDA)

    • Wearable devices record signal over 24-hour periods, the exact focus of FDA!
    • In FDA, the outcome is a curve or function $Y_i(t)$
    • For accelerometer data, $Y_i(t)$ is a 24-hour activity profile

    [Figure: activity profile $Y_i(t)$ plotted against $t$ (hour)]
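
One simple way to picture a 24-hour activity profile is sketched below: a multi-day minute-level stream is cut into days and averaged at each minute of the day. Averaging across days is an assumption made here for illustration; the talk does not say how profiles are constructed, and `activity_profile` is a hypothetical helper name.

```python
MINUTES_PER_DAY = 1440


def activity_profile(stream):
    """Average a multi-day minute-level stream into one 24-hour profile."""
    n_days = len(stream) // MINUTES_PER_DAY
    if n_days == 0:
        raise ValueError("need at least one full day of data")
    days = [stream[d * MINUTES_PER_DAY:(d + 1) * MINUTES_PER_DAY]
            for d in range(n_days)]
    # Y_i(t): mean over days at each minute t of the day.
    return [sum(day[t] for day in days) / n_days
            for t in range(MINUTES_PER_DAY)]


# Two synthetic days: constant 10 counts, then constant 30 counts.
stream = [10] * MINUTES_PER_DAY + [30] * MINUTES_PER_DAY
profile = activity_profile(stream)
```
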
  7. Uses for FDA in wearables

    • Less pre-processing of the raw data
      • Less information is discarded
    • Better ways of imputing data
      • Missing data is a big problem in wearables
    • Time-dependent interpretations
      • Timing and consistency
      • Does it matter when and how regularly someone moves?
  8. FDA tools for massive accelerometer studies

    • Function-on-scalar regression (FoSR)
      • Functional outcome, scalar predictors (e.g. age)
      • UK Biobank Accelerometry Study
        • 80,000+ participants
    • Generalized functional principal components analysis (gFPCA)
      • National Health and Nutrition Examination Survey (NHANES)
        • 4,000+ participants (2011-2014 wave)
    • Registration
      • How does timing of wake/sleep, PA differ across people?
      • Baltimore Longitudinal Study of Aging (BLSA)
        • 500+ participants
  9. Function-on-scalar regression

    $Y_i(t) = \beta_0(t) + \sum_{p=1}^{P} \beta_p(t) X_{ip} + b_i(t) + \epsilon_i(t)$

    • $Y_i(t)$: magnitude of physical activity at time $t$
    • $X_{ip}$: scalar covariate $p$ (e.g. age) for subject $i$
    • $\beta_p(t)$: coefficient function for covariate $p$
    • $b_i(t) \sim GP(0, \Sigma_b)$; $\epsilon_i(t) \overset{iid}{\sim} N(0, \sigma_\epsilon^2)$
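
To make the coefficient functions concrete, here is a deliberately crude sketch that estimates an intercept function and one slope function by fitting a separate least-squares line at every time point. This is not the estimator used in the talk: it ignores the subject-level process b_i(t), applies no smoothing across time, and handles a single covariate. The function name `pointwise_fosr` and the synthetic data are assumptions.

```python
def pointwise_fosr(Y, x):
    """Estimate beta0(t), beta1(t) by separate least squares at each t.

    Y: list of curves (one per subject), each a list over a common time grid.
    x: list of scalar covariate values, one per subject.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta0, beta1 = [], []
    for t in range(len(Y[0])):
        y_t = [curve[t] for curve in Y]
        y_bar = sum(y_t) / n
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y_t))
        slope = sxy / sxx
        beta1.append(slope)
        beta0.append(y_bar - slope * x_bar)
    return beta0, beta1


# Synthetic noise-free data: Y_i(t) = (1 + t) + (2 * t) * x_i
grid = [0.0, 0.5, 1.0]
ages = [40.0, 50.0, 60.0, 70.0]
curves = [[(1 + t) + (2 * t) * a for t in grid] for a in ages]
b0, b1 = pointwise_fosr(curves, ages)
```

On this noise-free example the recovered `b1` traces the true slope function 2t, so the effect of age is itself a curve over the day rather than a single number.
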
  10. FDA of 88,693 subjects from UK Biobank study

    • Average daily activity patterns across ages from functional regression
    • Left panel shows males, right panel shows females

    J. Wrobel, J. Muschelli, and A. Leroux (2021). Sensors.
  11. Exponential family functional data

    • Functional data methods typically assume $Y_i(t)$ is Gaussian
    • Wearable device data is often non-Gaussian
      • Poisson: $Y_i(t) \in \{0, 1, 2, \ldots\}$ (activity counts)
      • Binary: $Y_i(t) \in \{0, 1\}$ (sedentary/active minutes)
    • Instead assume $Y_i(t)$ follows an exponential family distribution
      • Assumes smooth latent subject-specific mean $\mu_i(t) = E[Y_i(t)]$
      • Leads to GLM-like framework $g(E[Y_i(t)]) = \eta_i(t)$
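
For the two data types named on this slide, the GLM-like framework can be written out as follows. The canonical logit and log links are assumed here because they are the usual defaults; the slide itself does not say which links are used.

```latex
% Binary activity indicators with the canonical logit link (assumed)
Y_i(t) \mid \mu_i(t) \sim \text{Bernoulli}\{\mu_i(t)\}, \qquad
g\{\mu_i(t)\} = \log\frac{\mu_i(t)}{1 - \mu_i(t)} = \eta_i(t)

% Activity counts with the canonical log link (assumed)
Y_i(t) \mid \mu_i(t) \sim \text{Poisson}\{\mu_i(t)\}, \qquad
g\{\mu_i(t)\} = \log \mu_i(t) = \eta_i(t)
```
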
  12. Example binary “curve” or “binary activity profile”

    • Subject shown below is from BLSA data
    • Active $Y_i(t) = 1$ vs. inactive $Y_i(t) = 0$
  14. Binary activity profiles for studying sedentary behavior

    • Raw counts at each minute are dichotomized at a low value to distinguish activity from inactivity
  15. Generalized functional principal components analysis

    • Generalized FPCA and generalized regression model exponential family functional data using a GLM-like framework

    $g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$

    • $Y_i \sim$ exponential family; $g(\cdot)$ is a link function
    • $\beta_0(s)$ is a population mean function
    • $\phi_k(s)$ are population-level eigenfunctions
    • $\xi_{ik}$ are subject-specific scores
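
The sketch below shows how the pieces of the gFPCA model combine for binary data: a population mean, eigenfunctions, and one subject's scores give a linear predictor, and the inverse link turns it into a probability of being active. The logit link, the four-point grid, and the two toy eigenfunctions are assumptions chosen only to keep the example small.

```python
import math


def expit(eta):
    """Inverse logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def subject_mean(beta0, phis, scores):
    """mu_i(s) = expit(beta0(s) + sum_k xi_ik * phi_k(s)) on a grid."""
    return [
        expit(b + sum(xi * phi[j] for xi, phi in zip(scores, phis)))
        for j, b in enumerate(beta0)
    ]


# Hypothetical grid of 4 time points and K = 2 orthonormal eigenfunctions.
beta0 = [0.0, 0.0, 0.0, 0.0]
phis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]

mu_average = subject_mean(beta0, phis, scores=[0.0, 0.0])
mu_subject = subject_mean(beta0, phis, scores=[2.0, 2.0])
```

A subject with all scores equal to zero follows the population mean exactly; non-zero scores shift the subject's curve along the eigenfunction shapes.
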
  16. The NHANES 2011-2014 accelerometer study

    • National Health and Nutrition Examination Survey (NHANES)
      • Accelerometer data from 2011-2014 wave released in 2021
      • Accelerometer data over multiple days from > 4000 subjects
      • 1440 minutes per day of PA measurement
    • Goal is to understand population patterns in sedentary behavior
    • Existing FDA methods cannot handle data of this size
    • We proposed a fast, general-purpose algorithm for generalized FPCA
  17. Four-step fast GFPCA algorithm

    $g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$

    1. Bin the data along the functional domain $s$ into $L$ bins
    2. Estimate separate local GLMMs in each bin to obtain $\eta_i(s_{m_l})$ at each bin midpoint
    3. Estimate FPCA on local latent estimates $\eta_i(s_{m_l})$ to obtain eigenfunctions $\boldsymbol{\phi}(s)$
    4. Estimate global model conditioning on eigenfunctions $\boldsymbol{\phi}(s)$ by re-estimating subject-specific scores $\xi_{ik}$

    A. Leroux, C. Crainiceanu, and J. Wrobel (2023+). Fast generalized functional principal components analysis. Under review.
  18. fastGFPCA simulation results

    • Compared with two existing methods
      • Variational Bayes binary FPCA (Wrobel, 2019), bfpca
        • Can’t estimate Poisson or other distributions
      • Two-step conditional model (Gertheiss, 2017), tsGFPCA
        • Breaks for N > 100
    • fastGFPCA is
      • More accurate than tsGFPCA for binary and Poisson data
        • Order of magnitude faster
      • As or more accurate than bfpca for binary data
        • Comparable computation time
  19. GFPCA results for NHANES data

    • 4286 participants with 1440 observations each
    • 3-4 hours of computation time (step 4 is the slow step)
    • Subsampled version of step 4 led to ~22 minutes of computation time
  20. Misalignment in accelerometer data

    • Time variation: subjects start and end the day at different times
    • Activity level variation: people have higher or lower levels of activity
  21. Registration methods align functional data by warping the domain

    • Most methods are computationally inefficient and handle only continuous data

    [Figure labels: unregistered mean $\mu_i(t_i^*)$; inverse warping $h_i^{-1}(t_i^*) = t$; registered mean $\mu_i(h_i^{-1}(t_i^*)) = \mu_i(t)$]
  22. Two-step exponential family registration algorithm

    • Computationally efficient and geared towards binary data
    • Step 1: estimate template
    • Step 2: estimate warping

    [Figure labels: unregistered curve $Y_i(t_i^*)$; registered curve $Y_i(t)$]
  23. Algorithm and software optimized for computational efficiency

    • Step 1: Estimates template to which curves are registered
      • Uses fast, novel variational EM algorithm for binary functional data
    • Step 2: Estimates warping function for each subject
      • Uses constrained maximum likelihood estimation
    • Implemented in R package registr
      • Implemented in C++

    • Wrobel, Goldsmith (2019). Registration for exponential family functional data. Biometrics.
    • Wrobel (2018). registr: Registration for exponential family functional data. Journal of Open Source Software, 3.
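
The warping step can be pictured with the toy sketch below: given a known template for the probability of being active, it picks the inverse warping function that maximizes the Bernoulli likelihood of one subject's binary curve. Everything specific here is an assumption made for illustration: the template shape, the one-parameter power family t = (t*)^a (monotone, with endpoints fixed at 0 and 1), and the grid search. It is not how the registr package parameterizes or optimizes warping functions.

```python
import math


def template(t):
    """Hypothetical template: probability of being active at registered time t."""
    return 0.1 + 0.8 * math.exp(-((t - 0.5) ** 2) / 0.02)


def log_lik(y, t_star, a):
    """Bernoulli log-likelihood of y under the inverse warp t = t_star ** a."""
    total = 0.0
    for yj, tj in zip(y, t_star):
        p = template(tj ** a)
        total += yj * math.log(p) + (1 - yj) * math.log(1 - p)
    return total


def estimate_warp(y, t_star, candidates):
    """Grid search for the warp parameter with the highest likelihood."""
    return max(candidates, key=lambda a: log_lik(y, t_star, a))


# Synthetic subject whose activity peak is shifted late in observed time:
# the curve is generated by thresholding the template under a_true = 2.
t_star = [j / 50 for j in range(51)]
y = [1 if template(t ** 2.0) > 0.5 else 0 for t in t_star]

a_hat = estimate_warp(y, t_star, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
```

Restricting the search to monotone warps with fixed endpoints is what makes this a constrained likelihood problem: time can be stretched or compressed, but never reversed.
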
  24. Future methods work in these areas

    • Fast GFPCA
      • Multilevel data (Monday-Sunday)
        • Xinkai Zhou
      • Sparse and irregular data
    • Fast generalized function-on-scalar regression
      • Dustin Rogers
    • Registration
      • Multilevel registration
  25. Acknowledgements

    • Colorado SPH Biostatistics
      • Andrew Leroux
      • Dustin Rogers
    • Columbia Biostatistics Functional Data Analysis Working Group
      • Jeff Goldsmith
    • Johns Hopkins School of Public Health WIT: Wearable and Implantable Technology
      • Vadim Zipunnikov
      • Jennifer Schrack
      • John Muschelli
      • Ciprian Crainiceanu
      • Xinkai Zhou
  28. Step 1: bin the data

    Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$

    Considerations
    • Bin width: for simplicity, bins are equal-width and non-overlapping
    • Number of bins
      • Too many bins: bin width is too small, identifiability issues
      • Too few bins: bin width is too big, shape of the underlying function is not captured
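
A minimal sketch of this binning step, assuming a regular grid of minute indices and a number of bins that divides the grid evenly. The helper name `make_bins` and the choice of 24 hourly bins are illustrative assumptions, not settings from the talk.

```python
def make_bins(n_points, n_bins):
    """Split indices 0..n_points-1 into equal-width, non-overlapping bins.

    Returns a list of (start, stop, midpoint) tuples with stop exclusive.
    Assumes n_points is divisible by n_bins.
    """
    if n_points % n_bins != 0:
        raise ValueError("n_points must be divisible by n_bins")
    width = n_points // n_bins
    return [(l * width, (l + 1) * width, l * width + (width - 1) / 2)
            for l in range(n_bins)]


# 1440 minutes per day split into 24 hourly bins.
bins = make_bins(1440, 24)
```

With 24 bins each local model sees 60 observations per subject; using 1440 bins would leave a single observation per subject per bin, which is the identifiability problem noted above.
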
  30. Step 2: fit Generalized Linear Mixed Model in each bin

    Fit separate GLMM in each bin to get latent estimates
    • $g(E[Y_i(s_{m_l})]) = \beta_0(s_{m_l}) + b_i(s_{m_l}) = \eta_i(s_{m_l})$
      • $s_{m_l}$: time $s$ at the midpoint of bin $l$
      • $\beta_0(s_{m_l})$: fixed effect mean
      • $b_i(s_{m_l})$: subject-specific random effect
      • $\eta_i(s_{m_l})$: linear predictor, local latent estimates
    • Estimates are not on the original domain
      • On domain defined by bin midpoints
    • Model assumes constant effect for $\beta_0$, $b_i$ across each bin
    • Used for estimating covariance matrix and eigenfunctions
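
Written out for binary data, the local model in bin l is a random-intercept logistic regression in which every observation in the bin shares the same fixed and random effect. The logit link, the Gaussian random intercept, and the symbol \sigma_l^2 for its bin-specific variance are assumptions; they are the standard GLMM choices rather than details given on the slide.

```latex
% Local GLMM in bin l for binary data (logit link and Gaussian random intercept assumed)
\operatorname{logit} P\{Y_i(s_j) = 1\} = \beta_0(s_{m_l}) + b_i(s_{m_l}),
\qquad b_i(s_{m_l}) \sim N(0, \sigma_l^2),
\qquad \text{for all } s_j \text{ in bin } l
```
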
  31. Step 3: estimate eigenfunctions using FPCA

    Estimate FPCA using linear predictor from Step 2
    • $\hat{\eta}_i(s_{m_l}) = \hat{\beta}_0(s_{m_l}) + \sum_{k=1}^{K} \hat{\xi}_{ik} \hat{\phi}_k(s_{m_l})$
    • Estimated using refund::fpca.face()
    • Eigenfunctions $\hat{\boldsymbol{\phi}}$ characterize covariance
      • $K$: chosen by percent variance explained
      • Evaluated at bin midpoints rather than original domain
    • Project eigenfunctions onto original domain
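
The idea behind this step can be sketched without any packages: take the matrix of local latent estimates (subjects by bin midpoints), form its sample covariance, extract the leading eigenvector, and carry it from the bin midpoints back to a finer grid. This sketch swaps in plain power iteration and linear interpolation; it is not what refund::fpca.face() does (which also smooths the covariance), and the helper names and toy data are assumptions.

```python
def leading_eigenfunction(eta, n_iter=200):
    """Leading eigenvector of the sample covariance of eta (subjects x bins).

    Uses power iteration; assumes the start vector of ones is not
    orthogonal to the leading eigenvector.
    """
    n, n_bins = len(eta), len(eta[0])
    mean = [sum(row[j] for row in eta) / n for j in range(n_bins)]
    centered = [[row[j] - mean[j] for j in range(n_bins)] for row in eta]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(n_bins)] for a in range(n_bins)]
    v = [1.0] * n_bins
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(n_bins)) for a in range(n_bins)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def interpolate(x_new, xs, ys):
    """Piecewise-linear interpolation, held constant outside [xs[0], xs[-1]]."""
    if x_new <= xs[0]:
        return ys[0]
    if x_new >= xs[-1]:
        return ys[-1]
    for j in range(1, len(xs)):
        if x_new <= xs[j]:
            w = (x_new - xs[j - 1]) / (xs[j] - xs[j - 1])
            return (1 - w) * ys[j - 1] + w * ys[j]


# Toy latent estimates at 3 bin midpoints: score_i times a fixed shape.
shape = (0.6, 0.8, 0.0)
eta = [[s * p for p in shape] for s in (-2.0, -1.0, 0.0, 1.0, 2.0)]
midpoints = [0.5, 1.5, 2.5]

phi_hat = leading_eigenfunction(eta)
phi_at_1 = interpolate(1.0, midpoints, phi_hat)
```

Interpolated eigenfunctions are generally no longer exactly orthonormal on the finer grid, which is one reason a real implementation needs more care than this sketch takes.
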
  32. Step 4: estimate GFPCA

    Estimate GFPCA conditional on eigenfunctions from Step 3
    • $g(E[Y_i(s) \mid \hat{\boldsymbol{\phi}}(s)]) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \hat{\phi}_k(s)$
    • Eigenfunctions are orthogonal basis functions
      • Reduces number of covariance parameters that need to be estimated for random effects
    • Simple implementation
      • mgcv::bam()
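
Once the eigenfunctions are treated as known, estimating a subject's scores reduces to a small regression problem. The sketch below handles the simplest case: one subject, binary data, a logit link, K = 1, and a known mean function, fitted by Newton's method. It is an unpenalized per-subject maximum-likelihood fit, swapped in for illustration; it is not the joint random-effects model that mgcv::bam() fits, and the function name and toy data are assumptions.

```python
import math


def fit_score(y, beta0, phi, n_iter=25):
    """Maximum-likelihood score xi for one subject (K = 1, logit link).

    Model: P(y_j = 1) = expit(beta0_j + xi * phi_j), with beta0 and phi known.
    """
    xi = 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b + xi * f))) for b, f in zip(beta0, phi)]
        grad = sum((yj - pj) * f for yj, pj, f in zip(y, p, phi))
        hess = -sum(pj * (1 - pj) * f * f for pj, f in zip(p, phi))
        if hess == 0.0:
            break
        xi -= grad / hess  # Newton step on the concave log-likelihood
    return xi


# Toy subject on an 8-point grid with a step-shaped eigenfunction.
y = [1, 0, 1, 1, 0, 1, 0, 0]
beta0 = [0.0] * 8
phi = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]

xi_hat = fit_score(y, beta0, phi)
```

In this toy example six of the eight observations agree with the sign of the eigenfunction, so the fitted score satisfies expit(xi) = 6/8, that is xi = log 3.
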