PrecioPiso (Master’s): housing price prediction with POI and Streamlit

Introduction

PrecioPiso is a master’s project that estimates housing prices in Spain from listings data and contextual variables describing each property's surroundings. It combines a reproducible scraping pipeline, rigorous exploratory data analysis (EDA), supervised models and a Streamlit web app for interactive querying.

Data and pipeline

Sources: sale/rental listings enriched with Points of Interest (POI) from OpenStreetMap via Overpass.
Extraction: Python scripts in Docker containers; randomized pauses, simulated headers and retries.
Structure: staging in CSV → integration in PostgreSQL → unified dataset with property variables and distances to POI (1 km/3 km/5 km).
Cleaning: removal of obvious outliers, city→region mapping, elimination of irrelevant columns and explicit missing‑value handling.
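
As an illustration of the POI enrichment step, the sketch below counts POI within 1, 3 and 5 km of a listing using the haversine distance. It is one plausible reading of the 1 km/3 km/5 km features, not necessarily the project's implementation: the function names, the feature names and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_counts(lat, lon, pois, radii_km=(1, 3, 5)):
    """Count POI within each radius; `pois` is an iterable of (lat, lon)."""
    distances = [haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in pois]
    # Feature names are illustrative (assumption, not from the project).
    return {f"poi_within_{r}km": sum(d <= r for d in distances)
            for r in radii_km}


# Made-up coordinates around central Madrid, for illustration only.
listing = (40.4168, -3.7038)
sample_pois = [
    (40.4180, -3.7040),  # roughly 0.1 km away
    (40.4300, -3.7038),  # roughly 1.5 km away
    (40.4550, -3.7038),  # roughly 4.2 km away
    (40.5000, -3.7038),  # roughly 9.3 km away
]
features = poi_counts(*listing, sample_pois)
```

For a full dataset, a brute-force loop like this grows with listings × POI, so a spatial index would likely be needed; that choice is outside what the summary describes.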

EDA in brief

Skewed distributions; concentration of areas below 100 m² and presence of outliers in the tails.
Missing values are under control after targeted imputation; variables with a high share of missing values (e.g., floor) are dropped.
Correlations: latitude and longitude show only a weak linear relationship with area; territorial categories and context variables are relevant.
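
The missing-value policy described above (drop variables that are mostly absent, impute the rest) could look roughly like the following. This is a minimal sketch for numeric columns only; the 50% threshold, the use of the median and the sample rows are assumptions for illustration, since the summary does not state the actual rules.

```python
from statistics import median


def missing_share(rows, columns):
    """Fraction of rows where each column is None."""
    n = len(rows)
    return {c: sum(r.get(c) is None for r in rows) / n for c in columns}


def drop_and_impute(rows, columns, max_missing=0.5):
    """Drop columns above `max_missing`, median-impute the remaining ones.

    Assumes numeric columns and max_missing < 1, so every kept column
    has at least one observed value.
    """
    shares = missing_share(rows, columns)
    kept = [c for c in columns if shares[c] <= max_missing]
    medians = {c: median(r[c] for r in rows if r.get(c) is not None)
               for c in kept}
    cleaned = [
        {c: (r[c] if r.get(c) is not None else medians[c]) for c in kept}
        for r in rows
    ]
    return cleaned, kept


# Made-up rows: `floor` is missing in 3 of 4 listings.
rows = [
    {"m2": 80, "rooms": 3, "floor": None},
    {"m2": 65, "rooms": None, "floor": None},
    {"m2": 120, "rooms": 4, "floor": 2},
    {"m2": 95, "rooms": 2, "floor": None},
]
cleaned, kept = drop_and_impute(rows, ["m2", "rooms", "floor"])
```

Here `floor` is dropped and the missing `rooms` value is filled with the median of the observed ones.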

Models and evaluation

Two approaches: Linear Regression (interpretable baseline) and Random Forest (non‑linear, robust). Models are trained with a train/test split, cross‑validation and hyperparameter tuning, and evaluated with R², RMSE and MAE. The forest consistently outperforms the baseline. Dominant variables: area (m²), bathrooms, rooms and a block of POI features.
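
For reference, the three metrics can be computed directly from true and predicted prices. The sketch below uses the standard definitions in plain Python; the function name and the sample numbers are illustrative and are not results from the project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Made-up prices (in thousands of euros), for illustration only.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

RMSE penalizes large errors more heavily than MAE, so reporting both helps show whether a model's error is driven by a few badly mispriced listings.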

From model to product

Architecture: Docker containers per stage (ingestion, processing, training, app).
CI/CD: GitHub Actions to build images; automated deployment and frequent rollouts.
Light orchestration: cron jobs to update data/model daily.
Web app: Streamlit front end with forms and visualizations (distributions, heatmaps and comparisons).

UX and visualization

Guided inputs for key fields (location, area, rooms, bathrooms, context).
Results with point estimate, error bands and decomposition by variable contributions.
Heatmaps: simulations that hold property attributes fixed and vary geolocation to compare areas.
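
The heatmap idea can be sketched as a grid sweep: hold the property's attributes fixed, vary latitude and longitude, and call the model at each cell. In the sketch below `toy_predict` is a made-up stand-in for a trained model, and the feature names, bounds and grid size are assumptions for illustration.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def simulate_grid(predict, base_features, lat_bounds, lon_bounds, steps=5):
    """Predict a price at every grid cell, keeping other features fixed."""
    grid = []
    for lat in linspace(*lat_bounds, steps):
        for lon in linspace(*lon_bounds, steps):
            # Copy the base features so the caller's dict is not mutated.
            features = {**base_features, "lat": lat, "lon": lon}
            grid.append((lat, lon, predict(features)))
    return grid


def toy_predict(features):
    """Hypothetical model: price falls with distance from a fixed centre."""
    penalty = abs(features["lat"] - 40.42) + abs(features["lon"] + 3.70)
    return features["m2"] * 3000 * (1 - penalty)


base = {"m2": 80, "rooms": 3, "bathrooms": 2}
cells = simulate_grid(toy_predict, base, (40.38, 40.46), (-3.74, -3.66), steps=3)
best = max(cells, key=lambda cell: cell[2])  # highest-priced cell
```

The resulting list of (lat, lon, price) triples is the kind of input a map or heatmap layer would consume. In a real pipeline the POI features would also have to be recomputed for each cell, since they depend on location.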

Limitations

Geographic scope limited to several regions in the current version.
Dependence on listings and POI coverage/quality; macro factors are out of scope.
Computational cost on large datasets; optimization and inference‑time monitoring are needed.

Data risks and compliance

Scraping commercial portals is subject to restrictions. For commercial use or redistribution, negotiate licenses or use data providers that grant explicit permissions. Define update policies and respect robots.txt.
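
Respecting robots.txt can be checked programmatically before any request is sent. The sketch below uses Python's standard `urllib.robotparser` on an inline example file so that it runs offline; the user agent name, the rules and the URLs are made up for illustration. A real crawler would load the portal's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example rules (assumption, not taken from any real portal).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "preciopiso-bot"  # hypothetical user agent name
parser = build_parser(ROBOTS_TXT)

allowed = parser.can_fetch(AGENT, "https://example.com/listings/123")
blocked = parser.can_fetch(AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(AGENT)  # seconds to wait between requests, if set
```

Passing the robots.txt check does not by itself settle licensing or terms-of-use questions; it only covers the crawl rules the site publishes.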

Next steps

Nationwide expansion and finer territorial normalization.
Alternative models: Gradient Boosting/XGBoost; probabilistic calibration; explainability (SHAP).
Temporal analysis: price nowcasting, local elasticities, sensitivity to POI openings/closures.

Conclusion

PrecioPiso shows that a contained architecture, a context‑enriched dataset and a well‑governed non‑linear model can deliver useful, actionable estimates. The value is not only prediction, but enabling comparisons and communicating uncertainty clearly to non‑technical users.

Example repository: github.com/Alonsomar/TFM_inmobiliario_av.