
Football Statistics Tracker 📊⚽

Python 3.12 | GCP Powered | IaC Terraform | CI/CD passing | Svelte | BigQuery | Firestore | Repo Size

An end-to-end data engineering pipeline that collects, processes, and analyzes match results, standings, weather data, and Reddit data for the top 5 European football leagues, and summarizes each matchday using Gemini. Data sources include the football-data.org API, the Open-Meteo API, PRAW (Reddit API), Maps...

Introduction

This project demonstrates a complete data pipeline for football (soccer) results, from data extraction to visualization. It applies several data engineering practices, including a data lake, transformation layers, and Infrastructure as Code (IaC) with Terraform.

Features

  • Automated Data Collection: Scheduled data fetching from multiple APIs using Google Cloud Functions
  • Multi-layer Data Architecture: Raw data stored in GCS, processed data in BigQuery, and user-facing data in Firestore
  • Weather Integration: Match statistics enriched with weather data at match time
  • Social Media (Reddit) Data: Reddit comments for fan sentiment
  • Infrastructure as Code: Cloud Functions and Pub/Sub topics and subscriptions are defined and deployed with Terraform
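
As a rough illustration of how a scheduled, Pub/Sub-driven collection step might look, the sketch below decodes a Pub/Sub event and derives a target path for the raw file. It assumes the function is triggered by a Pub/Sub message; the payload fields `league` and `date`, the function names, and the `raw/...` path layout are hypothetical and not taken from this repository.

```python
import base64
import json


def parse_pubsub_event(event: dict) -> dict:
    """Decode the JSON payload of a Pub/Sub-triggered function event.

    Pub/Sub delivers the message body base64-encoded under the "data" key.
    """
    raw = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(raw)


def handle_fetch_request(event: dict, context=None) -> str:
    """Hypothetical entry point: work out which league and date to fetch."""
    payload = parse_pubsub_event(event)
    league = payload["league"]    # hypothetical field name
    match_date = payload["date"]  # hypothetical field name
    # A real function would call the API here and write the raw JSON to GCS.
    # This sketch only returns the object path it would write to (assumed layout).
    return f"raw/{league}/{match_date}.json"
```

Keeping the decoding in its own small function makes it easy to unit-test the handler without deploying anything.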

Architecture

Architecture Diagram

The pipeline is organized as follows:

  1. Data Ingestion: Cloud Functions trigger on schedule to fetch data
  2. Storage Layers: Raw data (JSON) → external BigQuery tables (Parquet) → processed data in BigQuery → Firestore
  3. Validation: Basic validation and data quality checks with Dataplex
  4. Summarization: Creation of short summaries in Markdown with Gemini 2.0 Flash
  5. Visualization: This web app, which presents the insights
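
The step from raw JSON to tabular data in the storage layers can be sketched as a small flattening function. This is a minimal sketch, not the project's actual transformation: the input shape (`matches`, `homeTeam`, `awayTeam`, `score.fullTime`) is assumed from the football-data.org response format, and the output column names are hypothetical.

```python
from typing import Any


def flatten_match(match: dict[str, Any]) -> dict[str, Any]:
    """Turn one nested match object into a flat row suitable for a columnar file."""
    full_time = match.get("score", {}).get("fullTime", {})
    return {
        "match_id": match["id"],
        "utc_date": match["utcDate"],
        "home_team": match["homeTeam"]["name"],
        "away_team": match["awayTeam"]["name"],
        # Goals are None for matches that have not been played yet.
        "home_goals": full_time.get("home"),
        "away_goals": full_time.get("away"),
    }


def flatten_matches(raw: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten every match in a raw API response (assumed top-level key: "matches")."""
    return [flatten_match(m) for m in raw.get("matches", [])]
```

Rows in this shape can then be written to Parquet and exposed through an external table, with further modelling left to the transformation layer.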

Data Sources

  • Football-data.org: Match data, team data, and standings
  • Open-Meteo API: Historical weather data
  • Reddit (via PRAW): Fan comments and sentiment
  • Maps SDK: Location of stadiums
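
To show how weather from one source could be matched to a fixture from another, here is a minimal sketch that picks the hourly weather reading closest to kickoff. It assumes an Open-Meteo-style `hourly` block (parallel lists keyed by variable name, with ISO timestamps under `time`) and a kickoff time expressed in the same timezone as those timestamps; the function name and the variables used in the example are illustrative only.

```python
from datetime import datetime


def weather_at_kickoff(hourly: dict, kickoff: datetime) -> dict:
    """Return the hourly weather values nearest to the kickoff time.

    `hourly` holds parallel lists: "time" plus one list per weather variable.
    `kickoff` must be a naive datetime in the same timezone as hourly["time"].
    """
    times = [datetime.fromisoformat(t) for t in hourly["time"]]
    # Index of the hourly slot with the smallest absolute distance to kickoff.
    idx = min(range(len(times)), key=lambda i: abs(times[i] - kickoff))
    return {name: values[idx] for name, values in hourly.items() if name != "time"}


# Example with made-up values:
sample = {
    "time": ["2024-05-19T14:00", "2024-05-19T15:00", "2024-05-19T16:00"],
    "temperature_2m": [17.1, 18.4, 18.0],
    "precipitation": [0.0, 0.2, 0.0],
}
reading = weather_at_kickoff(sample, datetime(2024, 5, 19, 15, 20))
```

Choosing the nearest hour is a simplification; a production join might instead average over the full duration of the match.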

Technology Stack

Category | Technologies
Cloud Platform | Google Cloud Platform (GCP)
Infrastructure as Code | Terraform
Programming Languages | Python, TypeScript (Svelte)
Data Storage | Cloud Storage, BigQuery, Firestore
Data Quality | Dataplex
Data Transformation | Dataform
Serverless Computing | Cloud Functions
Event-Driven Architecture | Pub/Sub
API Consumption | Football-data.org, Open-Meteo, Reddit API, Google Maps
CI/CD | GitHub Actions
Package Management | uv, pyproject.toml
Code Quality | Ruff, Bandit, Mypy
Testing | pytest
Web Framework | Svelte, ShadCN UI Components
Hosting | Firebase App Hosting
LLM | Google Gemini 2.0 Flash

The idea for this project came from this repo by digitalghost-dev.