Data Mining the Water Table
This is a competition from DrivenData that I've wanted to enter for some time - it's a geospatial problem, and it's related to water, and it's related to development - this is a win-win-win from my perspective.
Of course, the first attempt is never actually going to be about winning, especially in an established competition like this one. Here I'll just try to figure out what data is available to us and what API the competition requires (what is the input to our pipeline? what does the output look like?). Sometimes I'll even use randomly generated mock predictions to validate the API, but in this case I'll actually put together a simple baseline model.
First things first, what's the objective here?
Your goal is to predict the operating condition of a waterpoint for each record in the dataset.
So we have some working wells, and some broken wells, and it's our job to figure out which is which.
Next, some imports to make and some files to open:
!pip install geopandas > /dev/null
!pip install folium matplotlib mapclassify > /dev/null
!pip install contextily > /dev/null
import pandas as pd
import geopandas as gpd
train_labels = pd.read_csv("/content/drive/MyDrive/data_mining_water_table/0bf8bc6e-30d0-4c50-956a-603fc693d966.csv")
train_features = pd.read_csv("/content/drive/MyDrive/data_mining_water_table/4910797b-ee55-40a7-8668-10efd5c1b960.csv")
test_features = pd.read_csv("/content/drive/MyDrive/data_mining_water_table/702ddfc5-68cd-4d1d-a0de-f5f566f76d91.csv")
train_labels.status_group.value_counts()
So this is a multi-class classification problem! I'll leave that be for the time being, but I think there may be a way around that particular hitch.
The class balance looks alright between functional and non functional, but functional needs repair is super underrepresented.
Only one more column to check here - it may say it's a primary key but I've been burned too many times:
train_labels.id.is_unique
train_features.info()
We also get a preview of missingness. In a welcome departure from the meticulously curated datasets that often show up in ML competitions, this one has missing data. That is a frequent feature of real ML problems, and handling it gracefully is important in applications.
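A quick way to put numbers on that missingness (just the per-column fraction of NaNs, largest first):

train_features.isna().mean().sort_values(ascending=False).head(10)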
What sticks out at me:
- We have latitude and longitude here, so we are working with geospatial data as expected (yay!)
- They are already floats, so we won't need to deal with strings or WKB or anything
- None of them are missing!
What will probably require more feature engineering work:
- date_recorded - if we have datetimes, we will experience at least one datetime-related error (sorry, that's just the rule!), and the year and/or month will probably be more useful than some high-cardinality year-month-day combination (sketched below, along with the cardinality check)
Speaking of cardinality - we will need to check on these object columns! We might need to prune some of them down to a manageable number of categories.
And finally - some of these look like spatial subdivisions. Those require special care. I'll address that later. Fortunately, few of those seem to be missing.
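Here's the sketch referenced above - parsing date_recorded into coarser features and counting categories in the object columns. The column name is real; the particular transforms are just one plausible first pass:

# Parse the recording date and pull out lower-cardinality year/month features.
dates = pd.to_datetime(train_features["date_recorded"])
train_features["year_recorded"] = dates.dt.year
train_features["month_recorded"] = dates.dt.month
# Cardinality of the object columns - large counts flag pruning candidates.
train_features.select_dtypes("object").nunique().sort_values(ascending=False)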
Maps
So we've confirmed that the coordinates exist - what's our next step?
What is absolutely positively the FIRST thing you should do with geospatial data? MAP IT!
The floating point latitude and longitude will take a little massaging to get into the right datatype, but it's not anything geopandas can't handle:
train_feat_gdf = gpd.GeoDataFrame(
    train_features,
    geometry=gpd.points_from_xy(train_features.longitude, train_features.latitude),
).set_crs(epsg=4326)
This dataframe is big, so we should probably subsample first and restrict the values we display.
import contextily as cx
# Subsample 10% of the wells, project to Web Mercator, and color by status.
ax = (
    train_feat_gdf.merge(train_labels)[["geometry", "status_group"]]
    .sample(frac=0.1)
    .to_crs(epsg=3857)
    .plot(figsize=(20, 12), column="status_group", legend=True)
)
cx.add_basemap(ax)
The wells are pretty clustered, presumably around human settlements. The sparse areas are reserves and national parks, so we wouldn't expect a large or permanent human presence there. All of that can be established from the label locations alone, not their color.
From the class labels it is pretty clear that the non-functional wells are clustered in proximity to one another. This shouldn't be terribly surprising: we can imagine a number of different variables that vary spatially and affect the functioning of the wells (like the depth to water table, or geology, or topography, or a shared maintenance crew etc.)
This spatial correlation shows up in many different applications and there are many methods to help us make good spatial predictions. This will probably be important in this particular competition.
Given that we can see the spatial correlation visually, it makes sense to exploit it immediately. This can serve as a good baseline method.
Within the universe of spatial models there are still many to choose from. The simplest is probably a k nearest neighbors approach - where the class is determined by a vote of the nearest k labeled examples in feature space (in this case, $(x,y)$ space).
kNN is most appropriate for "interpolation"-type problems, where the unlabeled data is interspersed with the labeled data and there are plenty of close neighbors to choose from - a common requirement for spatial algorithms that use weighted combinations of surrounding values. I verified earlier that the test wells are interspersed with the training wells in this way, but didn't want to include the plot here to avoid bloating the notebook with maps.
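For a non-map version of that check, we can measure how far each test well sits from its nearest training well. A rough sketch (distances come out in raw degrees, so this is only good as a sanity check):

from sklearn.neighbors import NearestNeighbors
# Distance from each test well to the nearest training well.
nn = NearestNeighbors(n_neighbors=1).fit(train_features[["longitude", "latitude"]])
dist, _ = nn.kneighbors(test_features[["longitude", "latitude"]])
print(pd.Series(dist.ravel()).describe())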
So we can move right on to preprocessing, parameter selection, and training:
xy_ = train_features.merge(train_labels)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, cross_validate
X = xy_[["longitude", "latitude"]]
y = xy_[["status_group"]]
pipe = Pipeline(
    [
        ("s_scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]
)
Explicitly setting the value of k to 5 neighbors:
pipe.set_params(knn__n_neighbors = 5)
scores = cross_validate(pipe, X, y.values.ravel(), cv=5, scoring=["roc_auc_ovo", "balanced_accuracy"])
scores
Bumping k up to 7:
pipe.set_params(knn__n_neighbors = 7)
scores = cross_validate(pipe, X, y.values.ravel(), cv=5, scoring=["roc_auc_ovo", "balanced_accuracy"])
scores
And now to 15:
pipe.set_params(knn__n_neighbors = 15)
scores = cross_validate(pipe, X, y.values.ravel(), cv=5, scoring=["roc_auc_ovo", "balanced_accuracy"])
scores
The ROC score improved! However the balanced accuracy did not.
kNN methods are known to struggle with imbalanced classes, and we are likely seeing the minority class get erased from the predictions: as the neighborhood size increases, the local majority vote looks more and more like the global majority, which by definition doesn't include many instances of the minority class.
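One way to confirm that suspicion is to compare the predicted class counts at a small and a large k - a sketch using cross_val_predict (not part of the scoring runs above):

from sklearn.model_selection import cross_val_predict
# If the minority class is being erased, its count should shrink as k grows.
for k in (5, 30):
    pipe.set_params(knn__n_neighbors=k)
    preds = cross_val_predict(pipe, X, y.values.ravel(), cv=5)
    print(k, pd.Series(preds).value_counts().to_dict())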
The competition uses a global accuracy measurement that doesn't specifically penalize errors on the minority class, so we can simply optimize for AUC and improve our overall classification accuracy.
pipe.set_params(knn__n_neighbors = 20)
scores = cross_validate(pipe, X, y.values.ravel(), cv=5, scoring=["roc_auc_ovo", "balanced_accuracy"])
scores
pipe.set_params(knn__n_neighbors = 30)
scores = cross_validate(pipe, X, y.values.ravel(), cv=5, scoring=["roc_auc_ovo", "balanced_accuracy"])
scores
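For the record, sklearn can automate this kind of sweep. A minimal GridSearchCV sketch over the same pipeline, scoring on the same AUC metric - equivalent to the manual experiments above:

from sklearn.model_selection import GridSearchCV
# Sweep the same candidate values of k and keep the best by roc_auc_ovo.
search = GridSearchCV(
    pipe, {"knn__n_neighbors": [5, 7, 15, 20, 30]}, cv=5, scoring="roc_auc_ovo"
)
search.fit(X, y.values.ravel())
print(search.best_params_)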
Based on these experiments (really, just a manual hyperparameter optimization routine), the optimal value for k is around 15, so we can train and predict using k=15:
pipe.set_params(knn__n_neighbors = 15)
pipe = pipe.fit(X, y.values.ravel())
Xtest = test_features[["longitude", "latitude"]]
test_features["status_group"] = pipe.predict(Xtest)
test_features[["id", "status_group"]].to_csv(
"/content/drive/MyDrive/data_mining_water_table/nshea3_submission_110422_2.csv",
index=False)
At the time, that puts me at number 4933 out of 5400 submissions with an accuracy of 0.6737 - which is calculated as $\frac{1}{N} \sum_{i=1}^{N} I(y_i = \hat{y}_i)$
This result isn't super impressive but it shows that there is indeed spatial correlation in this problem that can be exploited.
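That leaderboard number can also be estimated locally before spending a submission. A sketch using a held-out split (the split fraction and seed are my own arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 20% of the training wells and score them the way the leaderboard does.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y.values.ravel(), test_size=0.2, random_state=0
)
print(accuracy_score(y_val, pipe.fit(X_tr, y_tr).predict(X_val)))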
Next steps
As I mentioned before: I don't like working with multiclass classification problems unless it is completely necessary. Multiclass problems, multiclass models, and multiclass model diagnostics and interpretations are more difficult to understand and more difficult to explain. In this case, non functional and functional needs repair could both be treated as subsets of needs repair - so we can bundle them into that variable and perform a first stage of binary classification on needs repair versus not needing repair. Or we could bundle the other way around first - we'll try both and use whatever works best.
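The first of those bundlings is a one-line label map - a sketch, where the replacement label names are my own invention:

# Collapse the two "broken-ish" classes into a single binary target.
y_binary = xy_["status_group"].map({
    "functional": "no repair needed",
    "functional needs repair": "needs repair",
    "non functional": "needs repair",
})
y_binary.value_counts()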
We'll address that in combination with the following:
- Experiment with AutoML
- Address the missing data problem
- Rigorously model spatial dependence