• one-hot encoding
• label encoding
• target encoding
• categorical feature support of LightGBM
• other encoding techniques
2. Experiments and the results
◦ The basic idea is to sort the categories according to the training objective at each split.
◦ More specifically, LightGBM sorts the histogram of a categorical feature according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
• There is a risk of overfitting, as mentioned before, and the official documentation recommends the parameters 'min_data_per_group' and 'cat_smooth' for very high-cardinality features to avoid overfitting (see the sketch below).
• This idea is the same as target encoding.
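For reference, a minimal sketch of training LightGBM with its native categorical support and the two parameters mentioned above; the data and parameter values here are made up for illustration and are not the settings used in the experiments.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Toy data: two categorical features (pandas 'category' dtype) and a numeric target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat1": pd.Categorical(rng.choice(list("ABC"), size=1000)),
    "cat2": pd.Categorical(rng.choice(list("DEFGH"), size=1000)),
})
y = rng.normal(size=1000)

params = {
    "objective": "regression",
    "num_leaves": 16,
    # Regularization for (very) high-cardinality categorical features;
    # the values below are illustrative only.
    "min_data_per_group": 100,
    "cat_smooth": 10.0,
    "verbose": -1,
}

train_set = lgb.Dataset(df, label=y, categorical_feature=["cat1", "cat2"])
booster = lgb.train(params, train_set, num_boost_round=100)
```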
The dataset consists of only categorical features and a real-valued target variable (M categorical features).
• Each categorical feature has the same number of levels, and the level assigned to each feature of each row was chosen at random.
• Each level of each categorical feature has a corresponding real value, and the target variable is calculated as the sum of those values plus random noise (a generation sketch follows the example below).
Example of dataset (number of levels = 3):
• Dataset rows (M categorical features and the target), e.g.:
    C  …  B  →  6.52
    A  A  …  A  → -1.28
    B  B  …  A  → -5.46
    C  A  …  A  → -3.83
    :  :     :      :
• Corresponding values (each value was sampled from U(−1, 1)):
    level    v1      v2     …    vM
    A       -0.58   -0.38   …   -0.24
    B       -0.77   -0.70   …    0.49
    C       -0.71    0.05   …    0.39
• The target of each row is obtained by summing up the corresponding values of its levels and adding random noise that follows N(0, 3²) (e.g., -0.77 + 0.05 + … + 0.49 + noise for one of the rows).
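A minimal sketch of how such a dataset could be generated; the function name, the numbers of rows, features, and levels, and the noise scale below are placeholders, not the exact settings of the experiments.

```python
import numpy as np
import pandas as pd

def make_dataset(n_rows=100_000, n_features=10, n_levels=100, noise_std=3.0, seed=0):
    """Synthetic data: only categorical features; target = sum of level values + noise."""
    rng = np.random.default_rng(seed)
    # Each level of each feature gets a corresponding value drawn from U(-1, 1).
    level_values = rng.uniform(-1.0, 1.0, size=(n_features, n_levels))
    # Randomly assign one level per feature to every row.
    codes = rng.integers(0, n_levels, size=(n_rows, n_features))
    # Target: sum of the corresponding values of the assigned levels, plus Gaussian noise.
    signal = level_values[np.arange(n_features), codes].sum(axis=1)
    y = pd.Series(signal + rng.normal(0.0, noise_std, size=n_rows), name="target")
    X = pd.DataFrame(codes, columns=[f"x{j + 1}" for j in range(n_features)]).astype("category")
    return X, y

X, y = make_dataset()
```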
As categorical features have higher cardinality, RMSE becomes worse (intuitively obvious). But the performance drop is limited with a smaller number of leaves. (Because the data structure is simple, using complex tree structures is likely to overfit.)
[Plot annotations: the theoretical limit of RMSE, set by the standard deviation of the random noise N(0, 3²); about 5000 data points per level.]
Label encoding is more likely to overfit compared with one-hot encoding. At = 4 or 5, however, label encoding performed slightly better with num_leaves less than 16.
[Plot annotation: about 5000 data points per level.]
The gaps in performance caused by the difference in the number of leaves were small. When the cardinality of categorical features is high and we use complex tree structures, target encoding seems to be the better strategy.
[Plot annotation: about 5000 data points per level.]
The result of LGBM encoding was very similar to that of target encoding: with simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better.
[Plot annotation: about 5000 data points per level.]
The number of leaves made a large difference to the performance: larger num_leaves seems to cause more severe overfitting, especially with high-cardinality categorical variables.
[Plot annotation: about 500 data points per level.]
The result was much different from that of label encoding: in contrast with label encoding, larger num_leaves did not deteriorate the performance when the cardinality of the categorical features was high.
[Plot annotation: about 500 data points per level.]
The result of LGBM encoding was similar to that of target encoding, but with a slightly stronger dependency on num_leaves. With simple tree structures, LGBM encoding was better, and with complex tree structures, target encoding was better. (A rough comparison sketch follows.)
[Plot annotation: about 500 data points per level.]
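A comparison along these lines could be run as sketched below; this reuses the hypothetical make_dataset helper from earlier and, for brevity, compares only label encoding against LightGBM's native categorical handling over an illustrative grid of num_leaves values.

```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_dataset(n_rows=100_000, n_features=10, n_levels=100)  # helper from the earlier sketch
X_label = X.apply(lambda col: col.cat.codes)  # label encoding: plain integer codes

results = {}
for num_leaves in [4, 16, 64, 256]:
    params = {"objective": "regression", "num_leaves": num_leaves, "verbose": -1}
    candidates = {
        "label encoding": (X_label, "auto"),     # integers treated as ordinary numeric features
        "LGBM encoding": (X, list(X.columns)),   # LightGBM's built-in categorical support
    }
    for name, (X_enc, cat_cols) in candidates.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y, test_size=0.2, random_state=0)
        train_set = lgb.Dataset(X_tr, label=y_tr, categorical_feature=cat_cols)
        booster = lgb.train(params, train_set, num_boost_round=100)
        rmse = mean_squared_error(y_te, booster.predict(X_te)) ** 0.5
        results[(name, num_leaves)] = rmse

print(pd.Series(results).unstack())  # RMSE per encoding (rows) and num_leaves (columns)
```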
• The target is additive
• No feature interaction
• Levels of each categorical feature are evenly distributed
Insights from the experiment results:
• If the cardinality of categorical features is high, it is difficult to capture all of the effect with one-hot encoding or label encoding, so target encoding or LGBM encoding is preferable (see the sketch below).
• If the cardinality is much higher, LGBM encoding also causes overfitting.
• Even if there are many useless features, target encoding is not affected by them.
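Since target encoding is the recurring recommendation above, here is a minimal sketch of one common variant, out-of-fold target encoding with additive smoothing toward the global mean; the function name, smoothing weight, and fold count are illustrative choices, not necessarily what was used in the experiments.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train_col, y, test_col, smoothing=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding of a single categorical column (pd.Series in, arrays out)."""
    y = np.asarray(y, dtype=float)
    global_mean = y.mean()

    def smoothed_means(cats, targets):
        # Per-level target mean, shrunk toward the global mean by `smoothing` pseudo-counts.
        stats = pd.DataFrame({"cat": np.asarray(cats), "y": targets}).groupby("cat")["y"].agg(["mean", "count"])
        return (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)

    # Encode each training fold with statistics from the remaining folds,
    # so no row sees its own target value (reduces leakage/overfitting).
    encoded_train = np.full(len(train_col), global_mean)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train_col):
        means = smoothed_means(train_col.iloc[fit_idx], y[fit_idx])
        encoded_train[enc_idx] = train_col.iloc[enc_idx].astype(object).map(means).fillna(global_mean).to_numpy()

    # Encode test rows with statistics from the whole training set.
    encoded_test = test_col.astype(object).map(smoothed_means(train_col, y)).fillna(global_mean).to_numpy()
    return encoded_train, encoded_test
```

The encoded column can then be fed to LightGBM (or any other model) as an ordinary numeric feature.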