Your Smartphone Knows What You're Doing

A data science project that used a Kaggle dataset to build a logistic regression model that predicts, with more than 98% accuracy, which of six activities a test participant was performing.

I'm no data scientist, nor do I play one on YouTube, but this was a great project for extending my knowledge of how data from gyroscopes and accelerometers can be used in the real world.

C. Todd Lombardo

May 06, 2020

Transcript

  1. OUTLINE
    • Problem statement
    • Performance summary
    • Process
      ◦ Exploratory data analysis
      ◦ Model selection
      ◦ Feature selection
      ◦ Model tuning
      ◦ Error analysis
      ◦ Tried and failed
    • Next steps
  2. Problem statement
    Hypothesis: By examining accelerometer and gyroscope sensor data from a smartphone, a model can classify which activity a person performed.
    Goals: Accurately predict a human movement from accelerometer and gyroscope smartphone data, both the data provided in the data repo and data captured by a mobile app.
    Risks and limitations: Feature engineering may need to go beyond the scope of the course content. One possibility is exploring principal component analysis and other feature reduction techniques.
  3. Results: Classify by logistic regression
    Logistic regression accuracy = 0.98640 on the test data set.
    With PCA, the 561 features could be reduced to 120 principal components.
    Sitting and Standing were the most difficult to discern: 20 false positives among them.
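The result above combines PCA with logistic regression. A minimal sketch of that kind of pipeline in scikit-learn follows. It runs on synthetic stand-in data, not the HAR dataset, and the helper name `build_pca_logreg`, the component count, and the solver settings are all assumptions rather than the settings used in the deck.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def build_pca_logreg(n_components):
    """Hypothetical helper: PCA for dimensionality reduction, then logistic regression."""
    return make_pipeline(
        PCA(n_components=n_components, random_state=0),
        LogisticRegression(max_iter=2000),
    )


# Synthetic stand-in with six classes and 60 features (the real data has 561).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=20,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = build_pca_logreg(n_components=20).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Rows are true classes, columns are predicted classes; off-diagonal cells
# show which pairs of classes get confused with each other.
cm = confusion_matrix(y_test, model.predict(X_test))
```

On the real data, the confusion matrix is where a Sitting versus Standing mix-up would show up as off-diagonal counts.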
  4. Target variable: Activity
    The model will need to accurately predict one of these movements:
    Dynamic: Walking, Walking_upstairs, Walking_downstairs
    Static: Sitting, Standing, Laying
  5. About the dataset
    Human Activity Recognition w/Smartphone (Sources: Kaggle, UC Irvine)
    1. Inertial sensor data
      a. Raw triaxial signals from the accelerometer & gyroscope of all the trials with participants
      b. The labels of all the performed activities
    2. Records of activity windows. Each one is composed of:
      a. A 561-feature vector with time and frequency domain variables
      b. Its associated activity label
      c. An identifier of the subject who carried out the experiment
    The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The obtained dataset was randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data.
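The dataset description says the train/test partition was made by volunteer, not by individual window. A minimal sketch of a subject-level split is shown here with scikit-learn's `GroupShuffleSplit`. The subject ids and feature values are made up for illustration, and this is not necessarily how the dataset authors produced their split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: 30 subjects with 10 activity windows each, 5 features per window.
subjects = np.repeat(np.arange(1, 31), 10)
X = rng.normal(size=(subjects.size, 5))

# Split on subject id so no volunteer appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=subjects))

train_subjects = set(subjects[train_idx].tolist())
test_subjects = set(subjects[test_idx].tolist())
```

Splitting by subject keeps the test score honest about how the model behaves on people it has never seen.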
  6. About the dataset
    Features will be selected and/or engineered from:
    Time-series signals
      ▸ tBodyAcc-XYZ
      ▸ tGravityAcc-XYZ
      ▸ tBodyAccJerk-XYZ
      ▸ tBodyGyro-XYZ
      ▸ tBodyGyroJerk-XYZ
      ▸ tBodyAccMag
      ▸ tGravityAccMag
      ▸ tBodyAccJerkMag
      ▸ tBodyGyroMag
      ▸ tBodyGyroJerkMag
    Fourier Transformed Signals
      ▸ fBodyAcc-XYZ
      ▸ fBodyAccJerk-XYZ
      ▸ fBodyGyro-XYZ
      ▸ fBodyAccMag
      ▸ fBodyAccJerkMag
      ▸ fBodyGyroMag
      ▸ fBodyGyroJerkMag
    Features are normalized and bounded within [-1,1]. Each feature vector is a row in the text file. Some feature derivations are included in the dataset. The units used for the accelerations (total and body) are 'g's (gravity of earth -> 9.80665 m/s²). The gyroscope units are rad/s.
  7. Which correlations matter?
    Feature_1                      Feature_2                      Correlation  Abs_Corr
    tBodyAccJerk-energy()-X        fBodyAccJerk-energy()-X        0.999999     0.999999
    fBodyAccJerk-energy()-X        tBodyAccJerk-energy()-X        0.999999     0.999999
    fBodyAcc-bandsEnergy()-1,24    fBodyAcc-energy()-X            0.999878     0.999878
    fBodyAcc-energy()-X            fBodyAcc-bandsEnergy()-1,24    0.999878     0.999878
    fBodyGyro-energy()-X           fBodyGyro-bandsEnergy()-1,24   0.999767     0.999767
    fBodyGyro-bandsEnergy()-1,24   fBodyGyro-energy()-X           0.999767     0.999767
    fBodyAcc-bandsEnergy()-1,24.1  fBodyAcc-energy()-Y            0.999661     0.999661
    fBodyAcc-energy()-Y            fBodyAcc-bandsEnergy()-1,24.1  0.999661     0.999661
    tBodyAccJerkMag-mean()         tBodyAccJerk-sma()             0.999656     0.999656
    tBodyAccJerk-sma()             tBodyAccJerkMag-mean()         0.999656     0.999656
    corr_values = hua[feature_cols].corr()
    corr_values[(corr_values.Correlation < 1.0) & (corr_values.Correlation > 0.9)]
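The slide shows only two lines of code, so the step that turns the square correlation matrix into a table with `Feature_1`, `Feature_2`, `Correlation`, and `Abs_Corr` columns is not visible. One way it could be done is sketched here. The function name `high_correlation_pairs` and the demo columns are hypothetical, and the filter is applied to the absolute correlation, which is an assumption.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, lower=0.9, upper=1.0):
    """Return feature pairs whose absolute correlation lies strictly between lower and upper."""
    corr = df.corr()
    # Reshape the square matrix into one row per (Feature_1, Feature_2) pair.
    pairs = (
        corr.rename_axis(index="Feature_1", columns="Feature_2")
        .stack()
        .rename("Correlation")
        .reset_index()
    )
    # Drop each feature's correlation with itself.
    pairs = pairs[pairs["Feature_1"] != pairs["Feature_2"]].copy()
    pairs["Abs_Corr"] = pairs["Correlation"].abs()
    mask = (pairs["Abs_Corr"] > lower) & (pairs["Abs_Corr"] < upper)
    return pairs[mask].sort_values("Abs_Corr", ascending=False).reset_index(drop=True)


# Demo on made-up data: "b" is a noisy copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pairs = high_correlation_pairs(demo)
```

As in the slide's table, each pair appears twice, once in each order.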
  8. Can we separate static and dynamic activities?
    Yes:
    If tBodyAccMag < -0.5 then it’s a static activity
    If tBodyAccMag > -0.5 then it’s a dynamic activity
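The threshold rule above can be written as a one-line classifier. This is a minimal sketch: the function name is hypothetical, and because the slide does not say what happens at exactly -0.5, treating that value as dynamic is an assumption.

```python
import numpy as np

STATIC_THRESHOLD = -0.5  # threshold taken from the slide


def static_or_dynamic(t_body_acc_mag):
    """Label each tBodyAccMag value as 'static' or 'dynamic' using the slide's threshold."""
    values = np.asarray(t_body_acc_mag, dtype=float)
    # Assumption: a value of exactly -0.5 is labeled dynamic.
    return np.where(values < STATIC_THRESHOLD, "static", "dynamic")


labels = static_or_dynamic([-0.9, 0.1])
```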
  9. Model Selection: “Naive” (No feature selection, no tuning)
    Algorithm               Train Cross-Validation Score  Test Accuracy Score
    DecisionTreeClassifier  0.840870                      0.852392
    RandomForestClassifier  0.917985                      0.922973
    KNeighborsClassifier    0.897175                      0.900238
    LogisticRegression      0.933495                      0.957923
    Naive running is fitting models with no feature selection, tuning, or optimization. Yes, I ran all 561 features. And yes, it took a while.
    The model must be a classification algorithm; I ran four.
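A "naive" comparison like the one in this table can be sketched as a loop over untuned scikit-learn classifiers. The data here is synthetic, and the fold count, the `compare_models` helper, and the reduced forest size are assumptions made to keep the example small, so the scores will not match the slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, y_train, X_test, y_test, cv=5):
    """Return {name: (mean cross-validation score on train, accuracy on test)}."""
    results = {}
    for name, model in models.items():
        cv_score = cross_val_score(model, X_train, y_train, cv=cv).mean()
        test_score = model.fit(X_train, y_train).score(X_test, y_test)
        results[name] = (cv_score, test_score)
    return results


# Synthetic six-class stand-in for the 561-feature HAR data.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=15,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=2000),
}
results = compare_models(models, X_train, y_train, X_test, y_test)
```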
  10. PCA

  11. Map back to actual features?
    Which are the most predictive signals?
    Signal                                Value
    angle(tBodyGyroJerkMean,gravityMean)  0.479798
    tBodyAccJerk-mean()-X                 0.201355
    tGravityAcc-energy()-Y                0.115365
    fBodyAcc-kurtosis()-Z                 0.098975
    fBodyAcc-skewness()-Z                 0.098076
    tGravityAcc-correlation()-X,Y         0.085849
    angle(tBodyGyroMean,gravityMean)      0.075399
    angle(Z,gravityMean)                  0.074471
    tGravityAcc-min()-Y                   0.059661
    tGravityAcc-mean()-Y                  0.055262
    pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)
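The slide does not show how `pc1` was obtained. A plausible reading is that it is the first row of a fitted PCA's `components_` array, which holds the weight (loading) of each original feature in the first principal component. The sketch below works under that assumption, with made-up feature names and a hypothetical helper `top_loadings`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def top_loadings(pca, feature_names, component=0, n=10):
    """Return the n largest loadings of one principal component, labeled by feature name."""
    loadings = pd.Series(pca.components_[component], index=feature_names)
    return loadings.sort_values(ascending=False).head(n)


# Made-up data: "f_big" has far more variance than the others, so it dominates PC1.
rng = np.random.default_rng(0)
names = ["f_big", "f_a", "f_b", "f_c"]
X = rng.normal(size=(300, 4)) * np.array([10.0, 1.0, 1.0, 1.0])

pca = PCA(n_components=2, random_state=0).fit(X)
top = top_loadings(pca, names, component=0, n=3)
dominant = pd.Series(pca.components_[0], index=names).abs().idxmax()
```

Sorting signed loadings in descending order pushes large negative loadings to the bottom, so ranking by absolute value may be the better view when the goal is to find the most influential signals.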
  12. How many PCA features needed to increase accuracy?
    Plotting all the accuracy scores by increasing the number of PCA variables in the model
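The sweep described here can be sketched as a loop that refits a PCA plus logistic regression pipeline for each component count and records the accuracy. The data is synthetic, the helper name is hypothetical, and the slide does not say which data split the plotted scores came from, so scoring on a held-out set is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def accuracy_by_n_components(X_train, y_train, X_eval, y_eval, component_counts):
    """Return {k: accuracy} for a PCA(k) + logistic regression pipeline."""
    scores = {}
    for k in component_counts:
        model = make_pipeline(
            PCA(n_components=k, random_state=0),
            LogisticRegression(max_iter=2000),
        )
        scores[k] = model.fit(X_train, y_train).score(X_eval, y_eval)
    return scores


# Synthetic six-class stand-in data.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=15,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
scores = accuracy_by_n_components(X_train, y_train, X_eval, y_eval, [2, 5, 10, 20])
```

Plotting `scores` against the component count gives the kind of curve the slide describes. If the count is chosen from this curve, a separate validation split or cross-validation avoids tuning on the final test set.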
  13. Which are the most important PCA Features?
    coef       variable  abscoef
    6.015424   PC3       6.015424
    3.503071   PC22      3.503071
    2.330216   PC48      2.330216
    1.976726   PC32      1.976726
    -1.844982  PC26      1.844982
    1.678715   PC23      1.678715
    1.291059   PC45      1.291059
    1.269009   PC13      1.269009
    -1.153241  PC6       1.153241
    1.136382   PC24      1.136382
    coefs_vars.sort_values('abscoef', ascending=False, inplace=True)
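A table like `coefs_vars` can be built by pairing each coefficient with its variable name and sorting by absolute value. This is a sketch with a hypothetical helper. The three demo values are copied from the slide, and the slide does not say which class's coefficients are shown (a multiclass logistic regression has one coefficient row per class), so how they were extracted from the model is left out.

```python
import pandas as pd


def rank_coefficients(coefs, variable_names):
    """Build a coef/variable/abscoef table sorted by absolute coefficient size."""
    coefs_vars = pd.DataFrame({"coef": coefs, "variable": variable_names})
    coefs_vars["abscoef"] = coefs_vars["coef"].abs()
    return coefs_vars.sort_values("abscoef", ascending=False).reset_index(drop=True)


# Three coefficient values taken from the slide, deliberately out of order.
ranked = rank_coefficients(
    [1.678715, -1.844982, 6.015424],
    ["PC23", "PC26", "PC3"],
)
```

Sorting on `abscoef` keeps strongly negative coefficients such as PC26 near the top, which a sort on the signed `coef` column would not.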
  14. These didn’t work so well
    1. Scaling the data with StandardScaler: Accuracy dropped to 0.911 (“features are normalized”)
    2. Further feature reduction of PCA or Signal features
      a. It was a challenge to find specific signals; it seems the combination of signals is far more useful
      b. Even when using the top 20 principal components
    3. Ridge: struggled to get this to work properly
  15. With more time
    1. Horn's parallel analysis
    2. Optimize Lasso and Ridge: would a penalty improve the model?
    3. Dig further into t-SNE for feature selection & reduction
    4. Run experiment with my own phone
      a. Perform the six activities with a sensor recorder app
      b. Process the raw data, run the model and score it
    5. Try this with data from a smartwatch!
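For the first item in the list above, Horn's parallel analysis picks the number of components to keep by comparing the eigenvalues of the data's correlation matrix with eigenvalues from random noise of the same shape. A minimal NumPy sketch follows. The function name, the 95th-percentile cutoff, and the iteration count are assumptions, and the demo data is synthetic with two built-in factors.

```python
import numpy as np


def parallel_analysis(X, n_iter=50, percentile=95, seed=0):
    """Count leading components whose eigenvalue beats the random-data percentile."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Eigenvalues of the observed correlation matrix, largest first.
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.normal(size=(n, p))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    exceeds = observed > threshold
    # Keep components up to (not including) the first one that fails the test.
    n_keep = p if exceeds.all() else int(np.argmin(exceeds))
    return n_keep, observed, threshold


# Synthetic data: two latent factors, each driving four of the eight features.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
loadings = np.zeros((2, 8))
loadings[0, :4] = 1.0
loadings[1, 4:] = 1.0
X_demo = latent @ loadings + 0.3 * rng.normal(size=(300, 8))

n_keep, observed, threshold = parallel_analysis(X_demo)
```

On this synthetic data the two built-in factors are recovered; on the HAR features the retained count would need to be checked against the 120 components reported earlier.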