Machine learning won't save us: Dependencies bias cross-validation estimates of model performance

Momin Malik


Dependencies are the bane of attempts to statistically analyze network data. Failing to account for dependencies between dyads, arising from processes like reciprocity or transitivity, produces omitted-variable bias in estimates and can leave us wrong about everything. Even without misspecification, dependencies deflate standard errors, leading to incorrect inferences. Can machine learning save us, if we give up on trying to explain network processes and instead just try to predict? After all, machine learning is concerned neither with unbiased estimation nor with statistical inference that requires correct standard errors, so perhaps dependencies don't matter. Even more promisingly, machine learning specializes in low-assumption models applied to high-dimensional data, and networks can be formulated as one of the most high-dimensional forms of data. In this poster, I show that the answer is no. Machine learning relies enormously on a technique called cross-validation to estimate the generalizability of its results: data are split into a training set and a test set, a model is fit on the training set, and its performance on the test set is what gets reported. But dependencies between observations can mean that the training set and the test set are not independent! In such cases, just as performance on the training set gives an "overly optimistic" picture of how well a model will perform, dependencies make test-set performance an unreliable signal for assessing model generalizability. Unfortunately for hopes for machine learning, this means that properly doing machine learning on networks requires dealing with the same issues that have always plagued network statistics. But fortunately, the hard-won lessons of the past 40 years of statistics can help structure data splits to respect dependency structure, partially alleviating the problem of doing cross-validation with dependent data.
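One way to structure a data split so that it respects dyadic dependency is to hold out nodes rather than dyads: any dyad touching a held-out node goes to the test set, so no training dyad shares an endpoint with the held-out nodes. The following is a minimal sketch of this idea on a toy directed network; the function name and toy data are illustrative, not from the poster, and this is only one of several possible dependency-respecting splitting schemes.

```python
# Sketch: node-level ("leave-nodes-out") splitting for dyadic network data.
# Hypothetical helper for illustration; assumes dyads are (sender, receiver) pairs.
import random

def node_level_split(dyads, nodes, test_frac=0.3, seed=0):
    """Hold out a fraction of nodes; any dyad touching a held-out node is test data."""
    rng = random.Random(seed)
    nodes = list(nodes)
    rng.shuffle(nodes)
    n_test = max(1, int(len(nodes) * test_frac))
    test_nodes = set(nodes[:n_test])
    # Training dyads share no endpoint with the held-out node set.
    train = [(u, v) for (u, v) in dyads if u not in test_nodes and v not in test_nodes]
    test = [(u, v) for (u, v) in dyads if u in test_nodes or v in test_nodes]
    return train, test, test_nodes

# Toy example: all ordered dyads among 10 nodes.
nodes = range(10)
dyads = [(i, j) for i in nodes for j in nodes if i != j]
train, test, test_nodes = node_level_split(dyads, nodes)
```

Note the trade-off: with a directed network of n nodes, holding out even a small fraction of nodes removes a much larger fraction of dyads from training, since each node participates in 2(n-1) ordered dyads.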
