understanding AUC ROC, let's look at the relationship between AUC and Cohen's effect size. Cohen's effect size, d, expresses the difference between two groups as the number of standard deviations between the means. As d increases, we expect it to be easier to distinguish between groups, so we expect AUC to increase. I'll start in one dimension and then generalize to multiple dimensions.
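The cells that generate the one-dimensional samples are not included in this excerpt; here is a minimal sketch of what they might look like, assuming the names n, mu1, mu2, sample1, and sample2 that the later cells rely on.

    import numpy as np

    n = 1000                      # assumed sample size
    d = 1                         # effect size: means one standard deviation apart
    mu1, mu2 = 0, d

    # two normal groups with unit standard deviation (assumed names used below)
    sample1 = np.random.normal(mu1, 1, size=n)
    sample2 = np.random.normal(mu2, 1, size=n)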
means, we can compute the fraction of Group 0 that would be above the threshold. I'll call that the false positive rate.

In [4]: thresh = (mu1 + mu2) / 2
        np.mean(sample1 > thresh)

Out[4]: 0.301
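The true positive rate is the analogous fraction of Group 1 above the same threshold; presumably it is computed the same way:

    np.mean(sample2 > thresh)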
operating curve (ROC) represents this tradeoff. To plot the ROC, we have to compute the false positive rate (which we saw in the figure above), and the true positive rate (not shown in the figure). The following function computes these metrics.

In [11]: def fpr_tpr(sample1, sample2, thresh):
             """Compute false positive and true positive rates.

             sample1: sequence
             sample2: sequence
             thresh: number

             returns: tuple of (fpr, tpr)
             """
             fpr = np.mean(sample1 > thresh)
             tpr = np.mean(sample2 > thresh)
             return fpr, tpr
low, but so is the true positive rate.

In [12]: fpr_tpr(sample1, sample2, 1)

Out[12]: (0.15, 0.501)

As we decrease the threshold, the true positive rate increases, but so does the false positive rate.

In [13]: fpr_tpr(sample1, sample2, 0)

Out[13]: (0.483, 0.846)
I sweep thresholds from high to low so the ROC goes from left to right.

In [14]: from scipy.integrate import trapz

         def plot_roc(sample1, sample2, label):
             """Plot the ROC curve and return the AUC.

             sample1: sequence
             sample2: sequence
             label: string

             returns: AUC
             """
             threshes = np.linspace(5, -3)
             roc = [fpr_tpr(sample1, sample2, thresh)
                    for thresh in threshes]
             fpr, tpr = np.transpose(roc)

             plt.plot(fpr, tpr, label=label)
             plt.xlabel('False positive rate')
             plt.ylabel('True positive rate')

             auc = trapz(tpr, fpr)
             return auc
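In the notebook this function is presumably applied to the one-dimensional samples next; a hedged example of how it might be called, with the label chosen to match d=1:

    auc = plot_roc(sample1, sample2, label='d=1')
    plt.legend()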
we have more than one variable, with a difference in means along more than one dimension. First, I'll generate a 2-D sample with d=1 along both dimensions.

In [19]: from scipy.stats import multivariate_normal

         d = 1
         mu1 = [0, 0]
         mu2 = [d, d]

         rho = 0
         sigma = [[1, rho], [rho, 1]]

Out[19]: [[1, 0], [0, 1]]

In [20]: sample1 = multivariate_normal(mu1, sigma).rvs(n)
         sample2 = multivariate_normal(mu2, sigma).rvs(n);
features.

In [21]: np.mean(sample1, axis=0)

Out[21]: array([ 0.01204411, -0.05193738])

And the mean of sample2 should be near 1.

In [22]: np.mean(sample2, axis=0)

Out[22]: array([0.97947675, 1.02358947])
In [23]: x, y = sample1.transpose()
         plt.plot(x, y, '.', alpha=0.3)

         x, y = sample2.transpose()
         plt.plot(x, y, '.', alpha=0.3)

         plt.xlabel('X')
         plt.ylabel('Y')
         plt.title('Scatter plot for samples with d=1 in both dimensions');
2-D density function and make a contour plot.

In [25]: X, Y, Z = kde_scipy(sample1)
         plt.contour(X, Y, Z, cmap=plt.cm.Blues, alpha=0.7)

         X, Y, Z = kde_scipy(sample2)
         plt.contour(X, Y, Z, cmap=plt.cm.Oranges, alpha=0.7)

         plt.xlabel('X')
         plt.ylabel('Y')
         plt.title('KDE for samples with d=1 in both dimensions');
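The helper kde_scipy is defined earlier in the notebook and not shown in this excerpt; here is a minimal sketch of what it might look like, using scipy.stats.gaussian_kde and an assumed grid size:

    import numpy as np
    from scipy.stats import gaussian_kde

    def kde_scipy(sample, n=101):
        """Estimate a 2-D density and evaluate it on a grid.

        sample: array with shape (num_points, 2)
        n: number of grid points per axis (assumed)

        returns: X, Y, Z mesh arrays suitable for plt.contour
        """
        x, y = sample.transpose()
        kde = gaussian_kde(sample.transpose())

        X, Y = np.meshgrid(np.linspace(x.min(), x.max(), n),
                           np.linspace(y.min(), y.max(), n))
        positions = np.vstack([X.ravel(), Y.ravel()])
        Z = kde(positions).reshape(X.shape)
        return X, Y, Z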
To see how distinguishable the samples are, I'll use logistic regression. To get the data into the right shape, I'll make two DataFrames, label them, concatenate them, and then extract the labels and the features.
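Those preparation cells are not included in this excerpt; a sketch of what they might look like, assuming the variable name df used in the next cell:

    import pandas as pd

    # label the two samples and stack them into one DataFrame (assumed construction)
    df1 = pd.DataFrame(sample1)
    df1['label'] = 0

    df2 = pd.DataFrame(sample2)
    df2['label'] = 1

    df = pd.concat([df1, df2], ignore_index=True)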
of labels.

In [31]: X = df[[0, 1]]
         y = df.label;

Now we can fit the model.

In [32]: from sklearn.linear_model import LogisticRegression

         model = LogisticRegression(solver='lbfgs').fit(X, y);

And compute the AUC.

In [33]: from sklearn.metrics import roc_auc_score

         y_pred_prob = model.predict_proba(X)[:,1]
         auc = roc_auc_score(y, y_pred_prob)

Out[33]: 0.853391

With two features, we can do better than with just one.
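The sweeps below use a helper called multivariate_normal_auc, which is defined elsewhere in the notebook. Here is a hedged sketch of what it might do, reusing the pipeline above; the signature, parameter names, and default sample size are assumptions:

    import numpy as np
    import pandas as pd
    from scipy.stats import multivariate_normal
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def multivariate_normal_auc(d, num_dims=2, rho=0, n=1000):
        """Generate two multivariate normal samples separated by d and compute AUC.

        d: effect size along each dimension
        num_dims: number of dimensions
        rho: correlation between dimensions (assumed parameter)
        n: sample size per group (assumed default)

        returns: AUC of a logistic regression classifier
        """
        mu1 = np.zeros(num_dims)
        mu2 = np.full(num_dims, d)

        # covariance matrix with constant off-diagonal correlation
        sigma = np.full((num_dims, num_dims), float(rho))
        np.fill_diagonal(sigma, 1)

        sample1 = multivariate_normal(mu1, sigma).rvs(n)
        sample2 = multivariate_normal(mu2, sigma).rvs(n)

        # label the samples and stack them, as above
        df1 = pd.DataFrame(sample1)
        df1['label'] = 0
        df2 = pd.DataFrame(sample2)
        df2['label'] = 1
        df = pd.concat([df1, df2], ignore_index=True)

        X = df.drop(columns='label')
        y = df.label

        model = LogisticRegression(solver='lbfgs').fit(X, y)
        y_pred_prob = model.predict_proba(X)[:, 1]
        return roc_auc_score(y, y_pred_prob)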
np.linspace(-0.9, 0.9)]

         rhos, aucs = np.transpose(res)
         plt.plot(rhos, aucs)
         plt.xlabel('Correlation (rho)')
         plt.ylabel('Area under ROC')
         plt.title('AUC as a function of correlation');
In [39]: def compute_auc_vs_d(num_dims):
             """Sweep a range of effect sizes and compute AUC.

             num_dims: number of dimensions

             returns: list of (d, AUC) pairs
             """
             effect_sizes = np.linspace(0, 4)
             return [(d, multivariate_normal_auc(d, num_dims))
                     for d in effect_sizes]

In [40]: res1 = compute_auc_vs_d(1)
         res2 = compute_auc_vs_d(2)
         res3 = compute_auc_vs_d(3)
         res4 = compute_auc_vs_d(4);
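The plotting helper plot_auc_vs_d used in the next cell is not shown in this excerpt; a minimal sketch, under the assumption that it unpacks the (d, AUC) pairs and plots them:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_auc_vs_d(res, label):
        """Plot AUC as a function of effect size.

        res: list of (d, AUC) pairs
        label: string used in the legend
        """
        ds, aucs = np.transpose(res)
        plt.plot(ds, aucs, label=label)
        plt.xlabel('Cohen effect size (d)')
        plt.ylabel('Area under ROC')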
         plot_auc_vs_d(res2, 'num_dim=2')
         plot_auc_vs_d(res1, 'num_dim=1')

         plt.title('AUC vs d for different numbers of features')
         plt.legend();

With more features, the AUC gets better, assuming the features are independent. That makes sense: with k independent features, each with effect size d, the distance between the group means grows like d√k, so the groups become easier to separate.