Data Engineering Project
Project Summary
Goal: Extract multi-year U.S. vehicle sales from poorly structured PDF and image sources to create a unified, analysis-ready dataset.
This project extracts model-level automotive sales data from over 10 years of Toyota, Nissan, and Honda reports. Toyota and Nissan published text-based PDFs, whereas Honda published its tables as images, which required a separate OCR pipeline. The goal was to convert all sources into a single, analysis-ready dataset for Power BI dashboards.
Text-based PDFs were parsed using pdfplumber and layout heuristics. Honda screenshots went through an OCR pipeline using doctr, OpenCV, and rapidfuzz to extract and standardize sales data. Final outputs were merged into a unified CSV format.
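As an illustration of what a layout heuristic for text-based PDFs might look like, the sketch below splits one extracted text line into a model name and its numeric columns. It is not the project's actual parser: the function name `parse_sales_line`, the assumed row shape (name first, numbers last), and the sample values are all hypothetical. In practice the input lines could come from pdfplumber's `page.extract_text()`, but the sketch works on plain strings so it runs without any PDF.

```python
import re

# Assumption: a data row is a model name followed by one or more numeric
# columns such as "1,000". Real report layouts may differ.
NUMBER = re.compile(r"-?\d[\d,]*")

def parse_sales_line(line):
    """Return (model_name, [int, ...]) for a data row, or None otherwise."""
    tokens = line.split()
    values = []
    # Walk from the right: trailing numeric tokens are treated as data columns.
    while tokens and NUMBER.fullmatch(tokens[-1]):
        values.append(int(tokens.pop().replace(",", "")))
    if not tokens or not values:
        return None  # header, blank line, or a row with no name or no numbers
    values.reverse()  # restore left-to-right column order
    return " ".join(tokens), values

# Made-up sample lines, not taken from any real report.
sample_lines = ["Monthly Sales Summary", "Camry 1,000 12,000", "RAV4 Hybrid 250 3,000"]
parsed_rows = [row for row in map(parse_sales_line, sample_lines) if row is not None]
```

A heuristic this simple would misread a model whose name ends in a bare number, so a real parser would likely need per-report rules on top of it.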
Honda’s inconsistent image layouts made extraction particularly difficult, requiring preprocessing, row detection, and fuzzy matching to align messy OCR results with known model names. This highlighted the importance of resilient pipelines for working with unstructured data.
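The fuzzy-matching step can be sketched as follows. To keep the example dependency-free it uses `difflib` from the Python standard library as a stand-in for rapidfuzz, so it shows the idea rather than the project's code. The model list, the `normalize` helper, and the 0.75 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference list; the project's actual list is not shown.
KNOWN_MODELS = ["Accord", "Civic", "CR-V", "HR-V", "Odyssey", "Pilot", "Ridgeline"]

def normalize(text):
    """Lowercase and keep only letters and digits, so 'CR-V' and 'cr v' compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def match_model(ocr_text, known=KNOWN_MODELS, cutoff=0.75):
    """Return the closest known model name, or None if nothing is similar enough."""
    lookup = {normalize(name): name for name in known}
    hits = difflib.get_close_matches(
        normalize(ocr_text), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

# Typical OCR slips: a zero read in place of "o", a dropped hyphen.
cleaned = [match_model(raw) for raw in ["Acc0rd", "CR V", "Total"]]
```

Returning `None` below the cutoff, rather than always taking the best candidate, keeps summary rows such as totals from being forced onto a model name. With rapidfuzz, `process.extractOne` with a `score_cutoff` would play a similar role.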
Extracting Auto Sales Data from PDFs and Images with OCR
TECHNICAL SUMMARY
This project builds a multi-source data extraction pipeline for Toyota, Nissan, and Honda U.S. sales reports. Toyota and Nissan data, provided in structured PDFs, were parsed using pdfplumber. In contrast, Honda data—available only as images—required OCR (doctr, OpenCV) and fuzzy string matching to convert screenshots into structured tabular data. The unified outputs were used as a backend for interactive Power BI dashboards, enabling model-level analysis across more than a decade of U.S. auto sales.
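One way to turn OCR output into table rows is to cluster recognized words by vertical position. The sketch below assumes each word arrives with a left edge and a vertical center in pixels. The `Word` class, the `group_into_rows` function, and the 8-pixel tolerance are hypothetical and not taken from the project. OCR libraries often report normalized coordinates instead, in which case the tolerance would need rescaling.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge, assumed to be in pixels
    y: float  # vertical center, assumed to be in pixels

def group_into_rows(words, tolerance=8.0):
    """Group words whose vertical centers lie within `tolerance` of a row's running mean."""
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        if rows and abs(word.y - rows[-1]["mean_y"]) <= tolerance:
            row = rows[-1]
            row["words"].append(word)
            # Update the running mean so slightly skewed rows stay together.
            row["mean_y"] += (word.y - row["mean_y"]) / len(row["words"])
        else:
            rows.append({"mean_y": word.y, "words": [word]})
    # Within each row, read the words left to right.
    return [[w.text for w in sorted(r["words"], key=lambda w: w.x)] for r in rows]

# Made-up word boxes standing in for OCR output.
words = [
    Word("Civic", 10, 100), Word("1,000", 200, 103),
    Word("Accord", 10, 130), Word("2,000", 200, 128),
]
table = group_into_rows(words)
```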
Data Output & Usability
This pipeline extracted and standardized over a decade of U.S. vehicle sales data across three major manufacturers, despite inconsistent formats and varying data quality. The Toyota and Nissan pipelines processed 25+ PDF files with high reliability, while the Honda OCR pipeline accurately mapped noisy image text to structured labels using layered preprocessing and fuzzy matching.
The final dataset combines model-level sales counts, brand metadata, and reporting year into a clean, analysis-ready CSV—enabling visual exploration and long-term trend analysis via Power BI dashboards. This output now serves as the backbone for the Auto Industry Visualization Project.
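A minimal sketch of the final merge, assuming each brand's pipeline yields a mapping from model name to units sold for a given year. The column names in `FIELDS`, both function names, and the sample figures are assumptions for illustration only. The source does not specify the actual schema.

```python
import csv
import io

# Assumed column names for the unified dataset.
FIELDS = ["brand", "model", "year", "units_sold"]

def to_unified_rows(brand, year, parsed):
    """Turn {model: units} for one brand and year into unified-schema dicts."""
    return [
        {"brand": brand, "model": model, "year": year, "units_sold": units}
        for model, units in sorted(parsed.items())
    ]

def write_unified_csv(rows, handle):
    """Write rows with a header to any file-like object opened in text mode."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Made-up figures; an in-memory buffer stands in for the output file.
rows = to_unified_rows("Honda", 2020, {"Civic": 1000, "Accord": 2000})
rows += to_unified_rows("Toyota", 2020, {"Camry": 3000})
buffer = io.StringIO()
write_unified_csv(rows, buffer)
```

A long format with one row per brand, model, and year tends to be easier to filter and aggregate in a BI tool than one column per year.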

