Tokyo.R Is at Its 100th Meetup, So It's About Time to Make Friends with Python

Tokyo.R #100 (2022/07/23) @bob3bob3

Tokyo.R has reached 100 meetups! Congratulations!

I have been attending since the very first one!

"Penguins of Palmer Station"
• An ongoing series of sessions for moving beyond the beginner level.
• The next installment is planned to be "Data Cleaning".

Now for the main topic

R and Python: eternal rivals!

Tokyo.R is at its 100th meetup, so it's about time to make friends with Python.

With RMarkdown and the reticulate package, you can make friends with Python!

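For example, in situations like these
• You want to use LightGBM.
• Machine learning is, after all, Python's strong suit. There is a module called Optuna LightGBM Tuner that even handles automatic hyperparameter optimization for you, and frankly I am envious.
• But R is far more convenient for data preprocessing.
• And for visualization you still want ggplot2.
• Can't we do "preprocessing in R" |> "machine learning in Python" |> "visualization in R"?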

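With RMarkdown, it's easy
• R and Python can live side by side in a single RMarkdown document.
• This time, using the penguins_raw dataset from palmerpenguins, we build a model that identifies the penguin species (multiclass classification), tuned with Optuna.
• For Python basics and environment setup, see the talk by やわらかクジラ in the applied session. I will assume Python and the required modules are already installed.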

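1. Preprocessing in R
These are the steps we want to run in R.
• Cleaning
  ◦ Standardize variable names to snake_case
  ◦ Keep only the variables we need
• Shaping the data for LightGBM
  ◦ Convert the nominal variables (penguin species, sex, and island of residence) to integer codes starting from zero.
• Split into training data and validation data
• Split into explanatory variables and the target variable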

R's turn

1. Preprocessing in R

```{r library, include=FALSE}
library(tidyverse)       # general data manipulation
library(palmerpenguins)  # demo data
library(janitor)         # data cleaning
library(rsample)         # splitting into training and validation data
library(withr)           # controlling the random seed
library(reticulate)      # passing data between R and Python
library(MLmetrics)       # model evaluation metrics
```

1. Preprocessing in R

```{r cleaning}
dat <- penguins_raw |>
  clean_names()  # standardize variable names to snake_case

dat_cleaned <- dat |>
  # keep only the variables we need
  select(
    species, island, culmen_length_mm, culmen_depth_mm,
    flipper_length_mm, body_mass_g, sex,
    delta_15_n_o_oo, delta_13_c_o_oo
  ) |>
  # keep only the first word of the species name (e.g. "Adelie")
  mutate(species = species |> str_split(" ") |> map_chr(1)) |>
  # convert the nominal variables (species, island, sex) to zero-based integer codes
  mutate(
    species = case_when(
      species == "Adelie"    ~ 0,
      species == "Chinstrap" ~ 1,
      species == "Gentoo"    ~ 2),
    island = case_when(
      island == "Biscoe"    ~ 0,
      island == "Dream"     ~ 1,
      island == "Torgersen" ~ 2),
    sex = case_when(
      sex == "FEMALE" ~ 0,
      sex == "MALE"   ~ 1,
      is.na(sex)      ~ 2)  # missing sex gets its own code
  )
```

The preprocessing itself is not today's topic, so I will skip the explanation.

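1. Preprocessing in R

```{r split}
# Split into training and validation data.
# Both pipelines use the same seed, so they produce the same split.
# The `_` placeholder for the native pipe needs R 4.2 or later.
df_train <- dat_cleaned |>
  initial_split(prop = 0.8, strata = "species") |>
  with_seed(1234, code = _) |>
  training()

df_test <- dat_cleaned |>
  initial_split(prop = 0.8, strata = "species") |>
  with_seed(1234, code = _) |>
  testing()

# Split into explanatory variables and the target variable
x_train <- df_train |> select(!species)
y_train <- df_train |> select(species)
x_test  <- df_test |> select(!species)
y_test  <- df_test |> select(species)
```

The preprocessing itself is not today's topic, so I will skip the explanation.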
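For readers more at home in Python, here is a minimal sketch of what a seeded, stratified 80/20 split does, using only the standard library. It is not rsample's implementation and will not reproduce the same rows; the function name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, prop=0.8, seed=1234):
    """Return (train_idx, test_idx), sampling `prop` of each class for training."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched

    # Group row indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_train = round(len(idx) * prop)  # per-class training size
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 rows of class 0, 5 of class 1, 5 of class 2
labels = [0] * 10 + [1] * 5 + [2] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split is done within each class, the class proportions in the training and validation sets stay close to those of the full data, and fixing the seed makes the split repeatable.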

Python's turn

2. LightGBM in Python

```{python r_object1}
r.x_train.head()
```

```{python r_object2}
r.x_train.info()
```

R data can be pulled out of the r object on the Python side! An R data frame becomes a pandas DataFrame.

2. LightGBM in Python

```{python lgb, cache=TRUE, include=FALSE}
import pandas as pd
import numpy as np
import optuna.integration.lightgbm as lgb

# Convert the data preprocessed in R into LightGBM datasets
train = lgb.Dataset(r.x_train, r.y_train)
test = lgb.Dataset(r.x_test, r.y_test)

# Basic parameter settings
params = {
    'task': 'train',
    'boosting': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 3,
    'verbosity': 0,
    'seed': 0
}

# Build the model and optimize the hyperparameters
lgb_trained = lgb.train(
    params,
    train,
    categorical_feature=["island", "sex"],
    valid_sets=test
)
```

This took a little over ten minutes on an underpowered laptop.

2. LightGBM in Python

```{python lgb2, cache=TRUE}
# Extract the optimized parameters
best_params = lgb_trained.params

# Predicted probability of each class for the validation data, using the optimized model
y_test_pred_prop = lgb_trained.predict(r.x_test, num_iteration=lgb_trained.best_iteration)

# Assign each row to the class with the highest predicted probability
y_test_pred = np.argmax(y_test_pred_prop, axis=1)

# Extract the feature importances
cols = list(r.x_test.columns)
f_importance = np.array(lgb_trained.feature_importance(importance_type='gain'))
df_importance = pd.DataFrame({'feature': cols, 'importance': f_importance})
```

Extracting the results and applying the model to the validation data.
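To make the class-assignment step concrete, here is a small sketch of what a row-wise argmax computes, spelled out without NumPy. The probabilities are invented for illustration, and `argmax_rows` is a hypothetical helper, not part of any library.

```python
def argmax_rows(probabilities):
    """For each row, return the index of the largest value (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Hypothetical predicted probabilities for three penguins.
# Columns are class 0 (Adelie), 1 (Chinstrap), 2 (Gentoo),
# matching the integer codes assigned during preprocessing.
y_prob = [
    [0.90, 0.07, 0.03],
    [0.10, 0.15, 0.75],
    [0.20, 0.60, 0.20],
]

y_pred = argmax_rows(y_prob)
print(y_pred)  # [0, 2, 1]
```

Since the species were encoded as 0, 1, 2, the column index of the largest probability is directly the predicted species code.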

R's turn

3. Visualization and evaluation in R

```{r output, cache=TRUE}
py$df_importance |>
  ggplot(aes(
    x = reorder(feature, importance, mean),
    y = importance,
    fill = importance
  )) +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip() +
  geom_col(width = 0.5) +
  labs(
    title = "LightGBM feature importance",
    subtitle = "Using R and Python together",
    x = "Feature",
    y = "Importance",
    fill = "Importance"
  )
```

Python data can be pulled out of the py object on the R side!

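3. Visualization and evaluation in R

```{r output5, cache=TRUE}
# Note: when `positive` is not given, MLmetrics computes Precision, Recall
# and F1_Score for a single class (the first one it finds), not an average
# over all three species.
tibble(
  score = c("Accuracy", "Precision", "Recall", "F1_Score"),
  value = c(
    Accuracy(y_test$species, py$y_test_pred),   # Accuracy
    Precision(y_test$species, py$y_test_pred),  # Precision
    Recall(y_test$species, py$y_test_pred),     # Recall
    F1_Score(y_test$species, py$y_test_pred)    # F1 score
  ) |> round(4)
)
```

Python data can be pulled out of the py object on the R side!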
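Precision, recall, and F1 are defined per class, so for a three-class problem one common option is to compute them one-vs-rest for each class and then average (a macro average). The sketch below shows that calculation in plain Python on invented labels; the helper names are hypothetical, and this is an illustration of the definitions rather than what the chunk above computes.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as 'positive' (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of rows whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted species codes (0 = Adelie, 1 = Chinstrap, 2 = Gentoo)
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

acc = accuracy(y_true, y_pred)
scores = {cls: per_class_scores(y_true, y_pred, cls) for cls in (0, 1, 2)}
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

print(f"accuracy = {acc:.4f}")
for cls, (precision, recall, f1) in scores.items():
    print(f"class {cls}: precision = {precision:.4f}, recall = {recall:.4f}, F1 = {f1:.4f}")
print(f"macro F1 = {macro_f1:.4f}")
```

Reporting the per-class values next to the macro average makes it easier to see whether one species is being confused with another.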

Enjoy!
