Complete notebook available at: https://github.com/ai4up/ufo-prediction/blob/main/demo/demo.ipynb

Motivation

Building attributes such as height, type, and construction year are not available for all buildings in EUBUCCO. Many prospective use cases of the dataset, such as energy modeling, depend on these attributes. This notebook shows how the available building footprints can be used to estimate missing attributes with supervised machine learning. For details on the conceptualization and feature engineering, see:

Milojevic-Dupont, Nikola, et al. "Learning from urban form to predict building heights." PLOS ONE 15.12 (2020): e0242010.

Data

The demo sample contains ~20k buildings for Spain, ~50k for France, and ~170k for the Netherlands. It includes all 117 urban form features, latitude and longitude, and auxiliary attributes such as city name, neighborhood, and building type.

The demo samples are stored using Git Large File Storage (LFS). To download them explicitly, use:

!git clone git@github.com:ai4up/ufo-prediction.git
!git lfs pull
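
import os
import pandas as pd

# Directory that contains the demo pickle files.
DATA_DIR = '.'

path_data_NLD = os.path.join(DATA_DIR, 'df-NLD-exp.pkl')
path_data_FRA = os.path.join(DATA_DIR, 'df-FRA-exp.pkl')
path_data_ESP = os.path.join(DATA_DIR, 'df-ESP-exp.pkl')

# Load the Dutch sample; use path_data_FRA or path_data_ESP for the other countries.
df = pd.read_pickle(path_data_NLD)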

Prediction

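from xgboost import XGBClassifier, XGBRegressor

# Constructor arguments for the XGBoost estimators.
xgb_model_params = {'tree_method': 'hist'}
# Hyperparameters handed to the predictors via `hyperparameters=`.
xgb_hyperparams = {
    'max_depth': 5,
    'learning_rate': 0.1,
    'n_estimators': 500,
    'colsample_bytree': 0.5,
    'subsample': 1.0,
}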

Regression

predictor = AgePredictor(
    model=XGBRegressor(**xgb_model_params),
    df=df,
    test_training_split=pp.split_80_20,
    # cross_validation_split=pp.cross_validation,
    early_stopping=True,
    hyperparameters=xgb_hyperparams,
    preprocessing_stages=[pp.remove_outliers]
)

predictor.evaluate()
MAE: 10.73 y
RMSE: 16.85 y
R2: 0.5483
R2: nan
MAPE: nan
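
For reference, MAE, RMSE and R2 follow their standard definitions. The snippet below is a minimal sketch in plain Python, not the code used by `AgePredictor`; the helper name `regression_metrics` and the construction years are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for two equally long sequences of years."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical construction years, for illustration only.
y_true = [1950, 1970, 1990, 2010]
y_pred = [1960, 1965, 1985, 2000]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.2f} y, RMSE: {rmse:.2f} y, R2: {r2:.4f}")
```

MAE and RMSE are expressed in years, so an MAE of 10.73 means the predicted construction year is off by roughly a decade on average.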

Classification

tabula_nl_bins = [1900, 1965, 1975, 1992, 2006, 2015, 2022]
equally_sized_bins = (1900, 2020, 10)

classifier = AgeClassifier(
    model=XGBClassifier(**xgb_model_params),
    df=df,
    test_training_split=pp.split_80_20,
    # cross_validation_split=pp.cross_validation,
    preprocessing_stages=[pp.remove_outliers],
    hyperparameters=xgb_hyperparams,
    mitigate_class_imbalance=True,
    # bin_config=equally_sized_bins,
    bins=tabula_nl_bins,
)
classifier.evaluate()
Classification report:
               precision    recall  f1-score  support
1900-1964      0.751537  0.842825  0.794567     8850
1965-1974      0.875129  0.834151  0.854149     7133
1975-1991      0.904658  0.799159  0.848642     8798
1992-2005      0.852081  0.774682  0.811540     6209
2006-2014      0.595462  0.695315  0.641526     3095
2015-2021      0.496798  0.711664  0.585129      763
accuracy       0.801911  0.801911  0.801911        0
macro avg      0.745944  0.776299  0.755926    34848
weighted avg   0.813968  0.801911  0.805261    34848
Cohen’s kappa: 0.7501
Matthews correlation coefficient (MCC): 0.7513
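
The class labels in the report (for example 1900-1964) suggest that each pair of consecutive bin edges defines a half-open interval that includes the lower edge and excludes the upper one. The sketch below reproduces that labeling in plain Python. The helpers `bin_labels` and `assign_bin` are hypothetical, and reading `equally_sized_bins` as `(start, stop, step)` is an assumption, not confirmed behavior of `AgeClassifier`.

```python
from bisect import bisect_right


def bin_labels(edges):
    """Build labels such as '1900-1964' from consecutive bin edges."""
    return [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]


def assign_bin(year, edges):
    """Return the index of the half-open bin [lo, hi) containing year, else None."""
    if year < edges[0] or year >= edges[-1]:
        return None
    return bisect_right(edges, year) - 1


tabula_nl_bins = [1900, 1965, 1975, 1992, 2006, 2015, 2022]
labels = bin_labels(tabula_nl_bins)

# Assumption: the tuple (start, stop, step) expands to evenly spaced edges.
start, stop, step = (1900, 2020, 10)
equal_edges = list(range(start, stop + 1, step))

print(labels)
print(labels[assign_bin(1964, tabula_nl_bins)], labels[assign_bin(1965, tabula_nl_bins)])
```

Under this reading, buildings from before 1900 or after 2021 fall outside every class.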

Country and generalization comparison

The AgePredictorComparison facilitates comparisons between differently configured training runs, for example to compare prediction performance across countries, cross-validation strategies, oversampling strategies, or any other preprocessing steps.
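
Cross-validation strategies are worth comparing because a random split places buildings from the same city in both the training and the test folds, which can overstate how well a model generalizes to unseen places. The sketch below shows a group-wise split in plain Python, assuming that a city-level strategy holds out whole cities. The helper `group_k_fold` and the city names are hypothetical and do not reproduce the implementation in `pp`.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group appears on both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Distribute groups over folds round-robin, largest groups first.
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    folds = [[] for _ in range(n_splits)]
    for i, g in enumerate(ordered):
        folds[i % n_splits].extend(by_group[g])
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical city label per building.
cities = ["Delft", "Delft", "Utrecht", "Leiden", "Utrecht", "Leiden", "Delft"]
splits = list(group_k_fold(cities, n_splits=3))
for train_idx, test_idx in splits:
    print("test cities:", sorted({cities[i] for i in test_idx}))
```

Each test fold then contains only cities the model has not seen during training.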

comparison_config = {
    'Spain': {'df': path_data_ESP},
    'France': {'df': path_data_FRA},
    'Netherlands': {'df': path_data_NLD},
}

grid_comparison_config = {
    'random-cv': {'cross_validation_split': pp.cross_validation},
    'neighborhood-cv': {'cross_validation_split': pp.neighborhood_cross_validation},
    'city-cv': {'cross_validation_split': pp.city_cross_validation},
}

comparison = AgePredictorComparison(
    exp_name='demo',
    model=XGBRegressor(**xgb_model_params),
    df=None,
    frac=0.5,
    cross_validation_split=None,
    preprocessing_stages=[pp.remove_outliers],
    hyperparameters=xgb_hyperparams,
    compare_feature_importance=False,
    compare_classification_error=False,
    include_baseline=False,
    save_results=False,
    garbage_collect_after_training=True,
    comparison_config=comparison_config,
    grid_comparison_config=grid_comparison_config,
)

results = comparison.evaluate()
results
   name                               R2  R2_std        MAE  MAE_std       RMSE  RMSE_std  within_5_years  within_10_years  within_20_years  R2_seed_0
8  Netherlands_city-cv          0.135401     0.0  18.030643      0.0  23.598668       0.0        0.221385         0.392977         0.638903   0.135401
7  France_city-cv               0.187767     0.0  18.645831      0.0  23.772030       0.0        0.176875         0.345911         0.615315   0.187767
6  Spain_city-cv                0.197072     0.0  23.840955      0.0  29.563272       0.0        0.126411         0.247178         0.494357   0.197072
3  Spain_neighborhood-cv        0.198503     0.0  23.779078      0.0  29.536916       0.0        0.129797         0.247178         0.506772   0.198503
5  Netherlands_neighborhood-cv  0.304538     0.0  15.884060      0.0  21.164937       0.0        0.241489         0.444702         0.699700   0.304538
4  France_neighborhood-cv       0.330228     0.0  16.306574      0.0  21.586864       0.0        0.211348         0.408337         0.705209   0.330228
0  Spain_random-cv              0.363164     0.0  20.108252      0.0  26.328608       0.0        0.180587         0.355530         0.592551   0.363164
1  France_random-cv             0.511105     0.0  12.372172      0.0  18.443089       0.0        0.369564         0.593466         0.806340   0.511105
2  Netherlands_random-cv        0.575725     0.0  10.203823      0.0  16.531180       0.0        0.525335         0.695626         0.827052   0.575725
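
The `within_5_years`, `within_10_years` and `within_20_years` columns are not defined in this notebook. Assuming they report the share of buildings whose absolute prediction error does not exceed the given number of years, they could be computed as sketched below. The helper `share_within`, the inclusive threshold, and the sample values are all illustrative assumptions.

```python
def share_within(y_true, y_pred, years):
    """Fraction of predictions whose absolute error is at most `years`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= years)
    return hits / len(y_true)


# Hypothetical construction years, for illustration only.
y_true = [1950, 1970, 1990, 2010]
y_pred = [1953, 1982, 1985, 1985]
shares = {n: share_within(y_true, y_pred, n) for n in (5, 10, 20)}
print(shares)
```

Read this way, a `within_10_years` value of 0.695626 would mean that about 70% of the predictions land within a decade of the true construction year.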