
Data Science Project

Project Summary

Goal: Predict Yelp star ratings for unseen user-business pairs using engineered features and model-based learning.


This project builds a system that predicts Yelp star ratings for unseen user-business pairs by learning from historical data. The pipeline uses Spark RDDs for scalable feature engineering and XGBoost regression to capture nonlinear relationships between those features and the rating. The final output is a CSV of predicted ratings, evaluated with Root Mean Squared Error (RMSE) to measure how closely the predictions match the ratings users actually gave.
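As a rough illustration of the final step, the sketch below writes predictions to CSV using only the Python standard library. The column names (`user_id`, `business_id`, `prediction`), the helper names, and the clipping of scores to the 1–5 range are assumptions made for this example; the project's actual output code may differ.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def clip_rating(value: float, low: float = 1.0, high: float = 5.0) -> float:
    """Keep a raw regression score inside the valid star range (assumed step)."""
    return max(low, min(high, value))


def write_predictions(rows: Iterable[Tuple[str, str, float]], handle: TextIO) -> None:
    """Write one (user, business, predicted rating) row per line."""
    writer = csv.writer(handle, lineterminator="\n")
    # Hypothetical header; the real file layout is not given on this page.
    writer.writerow(["user_id", "business_id", "prediction"])
    for user_id, business_id, score in rows:
        writer.writerow([user_id, business_id, clip_rating(score)])


# Example with made-up IDs and scores, written to an in-memory buffer.
buffer = io.StringIO()
write_predictions(
    [("u1", "b1", 4.37), ("u2", "b9", 5.6), ("u3", "b4", 0.2)],
    buffer,
)
csv_text = buffer.getvalue()
```

A regressor can produce scores outside the rating scale, so clipping is a cheap safeguard that can only reduce the error on those rows.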


The model draws on multiple structured sources from the Yelp dataset. User profiles include review counts, average ratings, and reputation signals like compliments and elite status. Businesses are described by review volume, categories, and attributes. Reviews provide ground-truth ratings for training; tips and photos enrich the context with user sentiment and visual indicators. Together, these inputs support a robust set of features for understanding user preferences and business appeal.
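A minimal sketch of how such inputs might be combined into one feature row per user-business pair is shown below. It is plain Python rather than Spark so it runs on its own; in an RDD pipeline the same function could be applied per record with the lookup tables shared across workers. The field names follow the public Yelp dataset as I understand it, and the default values and helper names are assumptions, not the project's actual choices.

```python
from typing import Dict, List, Optional

# Hypothetical neutral defaults for users or businesses that are missing
# from the profile data (cold start). The values are illustrative only.
DEFAULT_USER = {"review_count": 0, "average_stars": 3.75, "fans": 0, "elite": ""}
DEFAULT_BUSINESS = {"review_count": 0, "stars": 3.75, "categories": None}


def user_features(user: Optional[dict]) -> List[float]:
    """Numeric features from one user profile record."""
    u = {**DEFAULT_USER, **(user or {})}
    # 'elite' is assumed to be a comma-separated string of years.
    elite_years = [y for y in (u["elite"] or "").split(",") if y.strip()]
    return [
        float(u["review_count"]),
        float(u["average_stars"]),
        float(u["fans"]),
        float(len(elite_years)),
    ]


def business_features(business: Optional[dict]) -> List[float]:
    """Numeric features from one business record."""
    b = {**DEFAULT_BUSINESS, **(business or {})}
    # 'categories' is assumed to be a comma-separated string or None.
    categories = [c for c in (b["categories"] or "").split(",") if c.strip()]
    return [float(b["review_count"]), float(b["stars"]), float(len(categories))]


def pair_features(
    user_id: str,
    business_id: str,
    users: Dict[str, dict],
    businesses: Dict[str, dict],
) -> List[float]:
    """Concatenate user and business features for one rating to predict."""
    return user_features(users.get(user_id)) + business_features(
        businesses.get(business_id)
    )


# Example with made-up records.
users = {"u1": {"review_count": 12, "average_stars": 4.1, "fans": 2, "elite": "2017,2018"}}
businesses = {"b1": {"review_count": 250, "stars": 3.5, "categories": "Pizza, Italian"}}

known_row = pair_features("u1", "b1", users, businesses)
cold_start_row = pair_features("unknown_user", "b1", users, businesses)
```

Falling back to defaults keeps every pair at the same feature length, so a pair with an unseen user or business can still be scored.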

Explore the Code on GitHub

Full pipeline in PySpark RDDs and XGBoost, including feature extraction, model tuning, and output generation.

Hybrid Recommender System for Yelp Reviews Using Spark and XGBoost

TECHNICAL SUMMARY

This project presents a hybrid recommender system built to predict Yelp user ratings using Spark RDDs and XGBoost. It leverages metadata from users, businesses, reviews, photos, and tips to engineer features, then applies a model-based regression approach to generate accurate predictions. The system achieves a validation RMSE of 0.979 and was designed for scalability, running fully in-memory without intermediate storage.

Model Performance

Root Mean Squared Error (RMSE) measures how far predicted ratings deviate from actual Yelp ratings, penalizing larger errors more heavily. On a 1–5 star scale, a perfect model scores 0.0, and the worst possible RMSE is 4.0, provided predictions stay within the 1–5 range.
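The metric is simple enough to compute by hand. The sketch below is a plain-Python version with made-up ratings; the function name `rmse` and the sample values are illustrative and not taken from the project code.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Square root of the mean squared difference between two rating lists."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Boundary cases on a 1-5 star scale.
perfect = rmse([5.0, 3.0, 1.0], [5.0, 3.0, 1.0])  # every prediction exact
worst = rmse([1.0, 5.0], [5.0, 1.0])              # every prediction off by 4 stars

# Same average miss (0.5 stars) distributed differently:
even = rmse([3.5, 3.5], [4.0, 4.0])    # two half-star misses
uneven = rmse([3.0, 4.0], [4.0, 4.0])  # one full-star miss, one exact hit
```

Both of the last two cases have a mean absolute error of 0.5 stars, but the uneven one scores about 0.71 against 0.5 for the even one, which is what "penalizing larger errors more heavily" means in practice.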


This system achieved a validation RMSE of 0.9790, meaning the typical prediction error was just under one star. In this domain, where user preferences are noisy and data is sparse, an RMSE under 0.98 can be considered a strong result.
