Minipost: Considerations for "Small" Data Sets

Why might a data set be small?

We’d like to be up to our ears in all the data we want, all the time, but sometimes that just isn’t the case. This could be because you’re having trouble sourcing data, in which case you might want to work directly on that problem, but it could be for other reasons that are harder to work around. Maybe the thing you’re looking at is just naturally limited in size, like the number of different species of fish on the planet. Maybe your goals require deeper analysis of a very specific subset or slice of your data. You might be stuck with the data you have.

Common challenge areas

Overfitting

A powerful model can often quickly and easily overfit to every single point in your relatively small data set. Be on the lookout even more than usual for common signs of overfitting, be sure to properly cross-validate, and definitely consider using regularization terms.
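As a minimal sketch of the regularization-plus-cross-validation idea (scikit-learn, Ridge, and the synthetic data here are my own choices for illustration, not anything prescribed above):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical small data set: 40 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=40)

# Ridge adds an L2 penalty on the coefficients, which discourages the model
# from chasing every individual point in a small sample.
model = Ridge(alpha=1.0)

# Score with cross-validation rather than a single train/test split.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```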

More fundamentally, model selection can be key here: conceptually speaking, what is your hypothetical model suggesting about the underlying relationships in your data, and how well does that match your understanding of the context of the data and the processes that generated it?
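One way to make that concrete (again a hypothetical scikit-learn sketch with made-up data): compare a model whose form roughly matches the generating process against a much more flexible one, and see how each holds up under cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data from a roughly linear process, only 20 points.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=20)

simple = LinearRegression()
flexible = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())

# A model whose form matches the process tends to hold up under CV;
# an over-flexible one fits each training fold closely and tends to fall
# apart on the held-out fold.
print(cross_val_score(simple, X, y, cv=5).mean())
print(cross_val_score(flexible, X, y, cv=5).mean())
```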

Model cross-validation

You might not have enough data points for a train-test-holdout split to be a fruitful endeavor: either you’ll be lacking training data and end up with a poor model, or you’ll be lacking testing data and be unable to draw useful conclusions about model performance. K-fold or leave-one-out cross-validation over your full data set could be a better option.
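A rough sketch of both options with scikit-learn (the estimator, data, and scoring choices below are placeholders, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Placeholder data: 25 samples is too few for a comfortable train/test/holdout split.
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=25)

model = LinearRegression()

# 5-fold CV: every point is used for training in four folds and for testing in one.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Leave-one-out: n models, each tested on a single held-out point.
# Use an error metric here; r2 is undefined on a one-sample test fold.
loo_errors = -cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error"
)

print(kfold_scores.mean(), loo_errors.mean())
```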

Outliers

You don’t have the leeway to be careless with outlier removal. On one side, inappropriately including outliers in a small data set could skew your model in a major way. On the other, removing too aggressively could exclude too many data points, shrinking an already small sample and reducing the statistical power of your analysis. Thinking critically about the conceptual underpinnings of the outliers in your data set can be helpful.
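As one hedged illustration, here is a simple interquartile-range rule on a single hypothetical column, used to flag candidates for inspection rather than to delete them automatically:

```python
import numpy as np

# Hypothetical measurement column from a small data set.
values = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 9.7, 4.0, 4.2])

# Interquartile-range rule: flag points far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, don't delete: with this few points, each candidate outlier deserves
# a look at how it was generated before deciding whether it belongs.
flagged = values[(values < lower) | (values > upper)]
print(flagged)  # e.g. [9.7]
```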