Predicting building age with EUBUCCO footprints

Complete notebook available at: https://github.com/ai4up/ufo-prediction/blob/main/demo/demo.ipynb

Motivation

Building attributes such as height, type, and construction year are not available for all buildings in EUBUCCO. Many prospective uses of the dataset, such as energy modeling, depend on these attributes. This notebook shows how the available building footprints can be used to estimate missing building attributes with supervised machine learning. For more details on the conceptualization and feature engineering, see:

Milojevic-Dupont, Nikola, et al. "Learning from urban form to predict building heights." PLOS ONE 15.12 (2020): e0242010.

Data

The demo sample contains ~20k buildings for Spain, ~50k for France, and ~170k for the Netherlands. It includes all 117 urban form features, latitude and longitude, and auxiliary attributes such as city name, neighborhood, and building type.

The demo samples are stored using Git Large File Storage (LFS). To download them explicitly, clone the repository and pull the LFS objects from inside the clone:

!git clone git@github.com:ai4up/ufo-prediction.git
!cd ufo-prediction && git lfs pull
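
If the LFS objects have not been pulled, the `.pkl` files in the clone are small text pointer files rather than pickled DataFrames, and loading them fails with an unhelpful error. The following is a minimal sketch for detecting that case before loading. `is_lfs_pointer` is a hypothetical helper written for this illustration, not part of ufo-prediction; it relies only on the fact that Git LFS pointer files start with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer file.

    Hypothetical helper for illustration; not part of ufo-prediction.
    """
    with Path(path).open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

Calling `is_lfs_pointer('df-NLD-exp.pkl')` before `pd.read_pickle` gives a clear signal that `git lfs pull` still needs to be run.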

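import os

import pandas as pd

# Directory that contains the demo pickle files pulled via Git LFS
DATA_DIR = '.'

path_data_NLD = os.path.join(DATA_DIR, 'df-NLD-exp.pkl')
path_data_FRA = os.path.join(DATA_DIR, 'df-FRA-exp.pkl')
path_data_ESP = os.path.join(DATA_DIR, 'df-ESP-exp.pkl')

# Load the Netherlands sample; use path_data_FRA or path_data_ESP for the other countries
df = pd.read_pickle(path_data_NLD)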

Prediction

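from xgboost import XGBClassifier, XGBRegressor

# Arguments passed directly to the XGBoost model constructors
xgb_model_params = {'tree_method': 'hist'}
# Hyperparameters shared by the regression and classification runs below
xgb_hyperparams = {
    'max_depth': 5,
    'learning_rate': 0.1,
    'n_estimators': 500,
    'colsample_bytree': 0.5,
    'subsample': 1.0,
}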

Regression

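# AgePredictor and pp (preprocessing helpers) come from the ufo-prediction repository; see the linked notebook for the imports
predictor = AgePredictor(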
    model=XGBRegressor(**xgb_model_params),
    df=df,
    test_training_split=pp.split_80_20,
    # cross_validation_split=pp.cross_validation,
    early_stopping=True,
    hyperparameters=xgb_hyperparams,
    preprocessing_stages=[pp.remove_outliers]
)

predictor.evaluate()

MAE: 10.73 y
RMSE: 16.85 y
R2: 0.5483
R2: nan
MAPE: nan
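
For readers who want to reproduce the MAE, RMSE, and R2 values reported above from raw predictions, the following sketch uses the standard textbook definitions. Whether `predictor.evaluate()` computes them in exactly this way is an assumption, and `regression_metrics` is a hypothetical helper, not part of ufo-prediction. The example construction years are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Standard MAE, RMSE, and R2 for predicted construction years.

    Hypothetical helper for illustration; not part of ufo-prediction.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mae = np.mean(np.abs(errors))            # mean absolute error in years
    rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error in years
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}


# Made-up construction years, for illustration only
example = regression_metrics([1950, 1970, 1990, 2010], [1960, 1970, 1980, 2010])
```

MAE and RMSE are expressed in years, so an MAE of about 10 means the predicted construction year is off by roughly a decade on average.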

Classification

tabula_nl_bins = [1900, 1965, 1975, 1992, 2006, 2015, 2022]
equally_sized_bins = (1900, 2020, 10)

classifier = AgeClassifier(
    model=XGBClassifier(**xgb_model_params),
    df=df,
    test_training_split=pp.split_80_20,
    # cross_validation_split=pp.cross_validation,
    preprocessing_stages=[pp.remove_outliers],
    hyperparameters=xgb_hyperparams,
    mitigate_class_imbalance=True,
    # bin_config=equally_sized_bins,
    bins=tabula_nl_bins,
)
classifier.evaluate()

Classification report:
               precision    recall  f1-score  support
1900-1964      0.751537  0.842825  0.794567     8850
1965-1974      0.875129  0.834151  0.854149     7133
1975-1991      0.904658  0.799159  0.848642     8798
1992-2005      0.852081  0.774682  0.811540     6209
2006-2014      0.595462  0.695315  0.641526     3095
2015-2021      0.496798  0.711664  0.585129      763
accuracy       0.801911  0.801911  0.801911        0
macro avg      0.745944  0.776299  0.755926    34848
weighted avg   0.813968  0.801911  0.805261    34848
Cohen’s kappa: 0.7501
Matthews correlation coefficient (MCC): 0.7513
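
The class labels in the report above (`1900-1964`, `1965-1974`, and so on) are consistent with reading each entry of `tabula_nl_bins` as an inclusive lower edge and the following entry as an exclusive upper edge. The sketch below illustrates that mapping under this assumption. `bin_label` is a hypothetical helper, not part of ufo-prediction, and the library may implement the binning differently.

```python
from bisect import bisect_right

tabula_nl_bins = [1900, 1965, 1975, 1992, 2006, 2015, 2022]


def bin_label(year, bins):
    """Map a construction year to a 'start-end' class label.

    Assumes bins[i] is inclusive and bins[i + 1] is exclusive.
    Returns None for years outside the covered range.
    Hypothetical helper for illustration; not part of ufo-prediction.
    """
    if year < bins[0] or year >= bins[-1]:
        return None
    i = bisect_right(bins, year) - 1  # index of the bin's lower edge
    return f"{bins[i]}-{bins[i + 1] - 1}"


labels = [bin_label(y, tabula_nl_bins) for y in (1964, 1965, 2021)]
```

Under this reading, a building from 1964 falls into `1900-1964` and one from 1965 into `1965-1974`, while years before 1900 or from 2022 onward are not assigned to any class.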

Country and generalization comparison

The AgePredictorComparison facilitates comparisons between differently configured training runs, for example to compare prediction performance across countries, cross-validation strategies, oversampling strategies, or other preprocessing steps.
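
The names `neighborhood-cv` and `city-cv` used in this section suggest cross-validation in which whole neighborhoods or cities are held out, so that the model is tested on areas it has not seen during training. The sketch below shows that general idea, often called grouped k-fold, using only the standard library. It is an assumption that `pp.neighborhood_cross_validation` and `pp.city_cross_validation` work along these lines; `group_k_fold` is a hypothetical helper, and the city names are made up.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group is split.

    Hypothetical helper for illustration; not part of ufo-prediction.
    """
    # Collect the row indices belonging to each group (e.g. each city).
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Assign the largest groups first, always to the currently smallest fold.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest_fold = min(folds, key=len)
        smallest_fold.extend(members[group])

    # Each fold serves once as the test set; all other rows form the training set.
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield train_idx, sorted(test_idx)


# Made-up city labels, one per building
cities = ["Utrecht"] * 3 + ["Delft"] * 2 + ["Leiden"] * 2 + ["Breda"]
splits = list(group_k_fold(cities, n_splits=2))
```

In practice, scikit-learn's `GroupKFold` provides the same kind of split. With more folds than groups, some folds in this sketch would be empty, so `n_splits` should not exceed the number of distinct groups.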

comparison_config = {
    'Spain': {'df': path_data_ESP},
    'France': {'df': path_data_FRA},
    'Netherlands': {'df': path_data_NLD},
}

grid_comparison_config = {
    'random-cv': {'cross_validation_split': pp.cross_validation},
    'neighborhood-cv': {'cross_validation_split': pp.neighborhood_cross_validation},
    'city-cv': {'cross_validation_split': pp.city_cross_validation},
}
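
Judging by the run names in the results table of this notebook (for example `Spain_random-cv`), the two dictionaries above appear to be combined as a grid: every entry of `comparison_config` is paired with every entry of `grid_comparison_config`. The sketch below reproduces only the naming of the resulting runs. It is an illustration of that assumed pairing, not the library's implementation, and `experiment_names` is a hypothetical helper.

```python
from itertools import product

countries = ["Spain", "France", "Netherlands"]                # keys of comparison_config
cv_strategies = ["random-cv", "neighborhood-cv", "city-cv"]   # keys of grid_comparison_config


def experiment_names(base_names, grid_names):
    """Pair every base configuration with every grid configuration.

    Hypothetical helper for illustration; not part of ufo-prediction.
    """
    # The grid key varies slowest, matching the index order of the results table.
    return [f"{base}_{grid}" for grid, base in product(grid_names, base_names)]


names = experiment_names(countries, cv_strategies)
```

Three countries and three cross-validation strategies therefore yield nine training runs, which is why `frac=0.5` and `garbage_collect_after_training=True` are useful for keeping runtime and memory in check.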

comparison = AgePredictorComparison(
    exp_name='demo',
    model=XGBRegressor(**xgb_model_params),
    df=None,
    frac=0.5,
    cross_validation_split=None,
    preprocessing_stages=[pp.remove_outliers],
    hyperparameters=xgb_hyperparams,
    compare_feature_importance=False,
    compare_classification_error=False,
    include_baseline=False,
    save_results=False,
    garbage_collect_after_training=True,
    comparison_config=comparison_config,
    grid_comparison_config=grid_comparison_config,
)

results = comparison.evaluate()

results

   name                         R2        MAE        RMSE       within_5_years  within_10_years  within_20_years  R2_seed_0
8  Netherlands_city-cv          0.135401  18.030643  23.598668  0.221385        0.392977         0.638903         0.135401
7  France_city-cv               0.187767  18.645831  23.772030  0.176875        0.345911         0.615315         0.187767
6  Spain_city-cv                0.197072  23.840955  29.563272  0.126411        0.247178         0.494357         0.197072
3  Spain_neighborhood-cv        0.198503  23.779078  29.536916  0.129797        0.247178         0.506772         0.198503
5  Netherlands_neighborhood-cv  0.304538  15.884060  21.164937  0.241489        0.444702         0.699700         0.304538
4  France_neighborhood-cv       0.330228  16.306574  21.586864  0.211348        0.408337         0.705209         0.330228
0  Spain_random-cv              0.363164  20.108252  26.328608  0.180587        0.355530         0.592551         0.363164
1  France_random-cv             0.511105  12.372172  18.443089  0.369564        0.593466         0.806340         0.511105
2  Netherlands_random-cv        0.575725  10.203823  16.531180  0.525335        0.695626         0.827052         0.575725
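
The `within_5_years`, `within_10_years`, and `within_20_years` columns most plausibly report the share of buildings whose predicted construction year lies within the given tolerance of the true year. The sketch below computes such a share under that assumption. `fraction_within` is a hypothetical helper, not part of ufo-prediction, and whether the library treats the boundary as inclusive is not known from the notebook. The example years are made up.

```python
import numpy as np


def fraction_within(y_true, y_pred, tolerance_years):
    """Share of predictions whose absolute error is at most `tolerance_years`.

    Hypothetical helper for illustration; the inclusive boundary is an assumption.
    """
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return float(np.mean(abs_err <= tolerance_years))


# Made-up construction years, for illustration only
y_true = [1950, 1970, 1990, 2010]
y_pred = [1953, 1982, 1990, 1985]
shares = {k: fraction_within(y_true, y_pred, k) for k in (5, 10, 20)}
```

Read this way, the table shows that the Netherlands model with random cross-validation places about 53% of buildings within 5 years of their true construction year, while the same country under city-wise cross-validation reaches about 22%.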