When AI Audited My Thesis: The Retrospective That Clicked a Missing Puzzle Piece into Place
That original work was the high-stakes catalyst that pushed me toward Big Data Analytics. But while the code itself was a massive undertaking, the academic writing now feels incredibly superficial. It highlighted the final insights, completely burying the fragmented, manually stitched data pipelines and architectural heavy lifting that made those insights possible. Here is what the thesis failed to tell you, viewed through the lens of a modern data professional.

The physical copy of my 2020 Master’s thesis is currently collecting dust in a corner of my university's Kresge library. I had largely made peace with that, until I decided to index the original GitHub repository using DeepWiki AI, just to see how it held up, while on my #reMomentum sabbatical and attending the Data Engineering Zoomcamp 2026 cohort. The result was a brutal reality check.
Addressing the Obvious "Whys"
Why audit a 6-year-old academic project now? As I progressed through the DE Zoomcamp 2026 cohort, I had a sudden, nagging realization: the core engineering skills I used back in 2020 are still entirely valid today. My thesis wasn't just research; it required a massive data ingestion phase, heavy data wrangling, and a complex ML inference pipeline running data through an external Senti4SD "AI brain."
To establish a completely objective baseline of my past work, I decided to run the repository through DeepWiki AI. It generated the system design documentation and architecture diagrams straight from my raw Java code.
Seeing it mapped out visually confirmed the reality: I had built a massive, tightly-coupled batch processing monolith defined by strict sequential execution.
Why didn't this work immediately translate into a Data Engineering role in 2020? DeepWiki's final verdict: "The engineering grit is solid, but the architecture was a fragile, localized monolith." The puzzle wasn't that my code was irrelevant. The problem was twofold. First, I never had the industry vocabulary to explain what I actually built to anyone outside of a university. Second, while the pipeline flow was mathematically sound, the underlying architecture was optimized for academic completion, not the enterprise resilience I recognize today.
Why dissect this dusted-off technical debt in public? Because making peace with my past architecture is the only way to gain a profound understanding of the data roles I am targeting today. Dissecting my own technical debt equips me with a deeper, more practical understanding of enterprise data pipelines than any theoretical coursework ever could. It builds the exact self-awareness required to make resilient, scalable architectural decisions in the future. By putting my own legacy monolith under the microscope, I want to prove that my #reMomentum journey isn't just about collecting new cloud certifications. It is about taking raw, proven engineering grit and upgrading it with the architectural maturity required to build governed, scalable enterprise platforms.
Today, as I dive deep into Kubernetes orchestration and NVIDIA GPU infrastructure during my #reMomentum sabbatical, I am looking at this legacy pipeline with a completely fresh perspective. Armed with modern cloud-native tooling, I can see exactly how to re-architect this exact same workflow with the scalability, efficiency, and engineering maturity of a true enterprise platform.
To prove it, I broke the teardown into five distinct architectural vantages. Here is exactly how I built it then, and how I build it now.
🎩 The Data Engineer Vantage (Batch ETL & Ingestion)
The 2020 Academic Reality (What was done)
As DeepWiki accurately summarized, I built a raw batch ETL and analytics pipeline. I ingested massive 1 TB XML 7z data dumps of StackOverflow posts, performing heavy data wrangling using JSoup and regex to strip HTML and extract clean text. From there, I enriched the data with sentiment and topic labels, managed complex nested HashMaps to aggregate counts and merge duplicate topics (e.g., merging topics 10+24), and wrote custom data quality validation logic to compare input/output record counts to guarantee zero data loss.
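To make those three steps concrete, here is a minimal sketch of the same flow in Python rather than the original Java. It is not the thesis code: the record keys (`id`, `topic`, `sentiment`), the `TOPIC_MERGES` map, and the function names are hypothetical, and the standard-library HTML parser stands in for JSoup.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body: str) -> str:
    """Return the visible text of a post body with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical merge map: topic 24 is folded into topic 10.
TOPIC_MERGES = {24: 10}


def aggregate(posts):
    """Count sentiment labels per merged topic.

    `posts` is a list of dicts with the hypothetical keys
    'id', 'topic' and 'sentiment'.
    """
    counts = Counter()
    for post in posts:
        topic = TOPIC_MERGES.get(post["topic"], post["topic"])
        counts[(topic, post["sentiment"])] += 1
    return counts


def check_no_loss(input_count: int, counts: Counter) -> None:
    """Data quality gate: output records must equal input records."""
    output_count = sum(counts.values())
    if output_count != input_count:
        raise ValueError(f"expected {input_count} records, got {output_count}")
```

A stage would call `check_no_loss(len(posts), aggregate(posts))` before handing its output downstream, so a dropped record stops the run instead of skewing the counts.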
The 2020 Enterprise Gap (What was missing)
DeepWiki exposed the exact enterprise bottlenecks: the pipeline was modular but entirely manually orchestrated. File paths were hardcoded—completely killing portability—and downstream popularity metrics were calculated manually in SPSS rather than programmatically. It was a functional ETL pipeline, but it lacked data observability and the automated scheduling required of a reliable, enterprise-grade data product.
The 2026 Standard (What I can confidently build now)
Cloud-native batch processing and the ELT (Extract, Load, Transform) paradigm. Today, hardcoded paths and manual SPSS steps are entirely eliminated. Raw data is programmatically ingested into a Google Cloud Storage (GCS) Data Lake. Instead of writing brittle Java HashMaps to aggregate topics, transformations are handled using declarative, partitioned SQL models inside BigQuery, treating the entire workflow as a Directed Acyclic Graph (DAG) automatically scheduled and monitored via Bruin. Most importantly, I now approach this architecture knowing that Data Platform Engineers don't build pipelines in a vacuum—we build governed, automated platforms designed specifically to unblock internal customers (Data Scientists, BI Analysts) and accelerate the company's time-to-insight.
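The DAG idea itself is independent of any one tool. The sketch below uses only Python's standard library (`graphlib`, Python 3.9+) to show what "treating the workflow as a DAG" means in practice. It does not use Bruin's API, and the stage names are hypothetical labels for the thesis steps.

```python
from graphlib import TopologicalSorter

# Each key is a stage; its value is the set of stages it depends on.
# Stage names are hypothetical.
PIPELINE = {
    "extract_posts": set(),
    "strip_html": {"extract_posts"},
    "score_sentiment": {"strip_html"},
    "assign_topics": {"strip_html"},
    "aggregate_metrics": {"score_sentiment", "assign_topics"},
}


def execution_order(dag):
    """Return one valid run order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, tasks):
    """Run each stage's callable only after all of its upstream stages.

    `tasks` maps a stage name to a function that receives a dict of
    upstream results and returns this stage's result.
    """
    results = {}
    for stage in execution_order(dag):
        upstream = {dep: results[dep] for dep in dag[stage]}
        results[stage] = tasks[stage](upstream)
    return results
```

Declaring dependencies instead of running stages by hand is what lets a scheduler decide the order, run independent stages such as sentiment scoring and topic assignment side by side, and restart from the failed stage only.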
🎩 The Platform Engineer Vantage (Environment provisioning, infrastructure, and Orchestration)
The 2020 Academic Reality (What was done)
I built a functional, research-grade batch pipeline that tightly coupled compute and execution and chained together multiple external tools (JSoup, Senti4SD, Mallet LDA). I pushed a single 32GB RAM Windows machine to its absolute physical limits, checking that data wasn't being silently dropped mid-flight and that the local hardware didn't crash from memory exhaustion during massive workloads.
The 2020 Enterprise Gap (What was missing)
DeepWiki’s audit exposed the exact infrastructure risks: the system was a localized monolith with zero environment parity. Hardcoded absolute file paths made the pipeline non-portable, and external dependencies weren't containerized, creating massive reproducibility risks: if my laptop died, the project died. The orchestration was entirely manual, with no scheduler and no automated retries, and the execution lacked observability (no structured logging or telemetry). Furthermore, the pipeline was not idempotent; rerunning a failed stage meant risking duplicate outputs, and having separate manual pipelines for different datasets (Big Data vs. Concurrency) duplicated platform technical debt.
The 2026 Standard (What I can confidently build now)
Immutable infrastructure and Infrastructure as Code (IaC), with containerization and declarative orchestration. Today, the execution environment is never a single laptop: it is containerized via Docker, and compute resources are dynamically provisioned on Kubernetes (GKE) using OpenTofu or Terraform. Hardcoded paths are replaced by environment variables. Drawing from my background in hardware-level testing, I now design for fault tolerance. Manual execution is replaced by a modern workflow engine (Bruin) that handles retries and treats data pipelines as directed acyclic graphs (DAGs). Stages are written to be idempotent, state is managed safely, and the system can gracefully recover from node failures without manual intervention.
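Two of those properties, environment-driven paths and idempotent reruns, are easy to illustrate without any orchestrator. The following is a minimal sketch under stated assumptions: `PIPELINE_DATA_ROOT` is a hypothetical variable name, `run_stage` is an illustrative helper rather than anything from Bruin or the thesis, and the retry loop is deliberately simplistic.

```python
import os
import tempfile
import time
from pathlib import Path


def data_root() -> Path:
    """Resolve the data directory from the environment, not a hardcoded path.

    PIPELINE_DATA_ROOT is a hypothetical variable name.
    """
    return Path(os.environ.get("PIPELINE_DATA_ROOT", "./data"))


def run_stage(output: Path, produce, retries: int = 3, delay: float = 0.0) -> bool:
    """Run `produce()` and write its text to `output` at most once.

    Returns False if the output already exists (stage skipped) and True if
    it was written. The text goes to a temp file that is then renamed, so a
    crash mid-write never leaves a half-written output behind.
    """
    if output.exists():
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            text = produce()
            fd, tmp_name = tempfile.mkstemp(dir=output.parent, suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(text)
            os.replace(tmp_name, output)  # atomic on the same filesystem
            return True
        except Exception as error:  # a real pipeline would catch narrower errors
            last_error = error
            time.sleep(delay * attempt)
    raise RuntimeError(f"stage failed after {retries} attempts") from last_error
```

Calling `run_stage(data_root() / "sentiment" / "part-0.csv", produce)` twice does the work once: the second call sees the finished file and skips, which is the behaviour the 2020 pipeline lacked when a failed stage was rerun.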
🎩 The Analytics Engineer Vantage (Building the Semantic Layer)
The 2020 Academic Reality (What was done)
I materialized a clean, topic-aware sentiment model from raw XML. By establishing a clear data grain, I built a reliable fact table joining post IDs to sentiment labels generated by Senti4SD, and integrated topic assignments from an external LDA model. From there, I built a functional metric layer: handling duplicate topic merges, applying basic dimensional modeling, and aggregating sentiment counts to produce a final dataset ready for Kendall Tau rank correlations in SPSS.
The 2020 Enterprise Gap (What was missing)
DeepWiki’s audit accurately pointed out the lack of analytics governance and data lineage. While intermediate CSVs were used for debugging, the pipeline completely lacked dbt-like versioning, isolated staging environments, and automated data testing. Crucially, the final enrichment step—joining the downstream popularity and difficulty metrics—was done manually inside SPSS rather than being programmatically integrated as core business logic within the pipeline.
The 2026 Standard (What I understand now)
Decoupled transformations and automated metric joins. What I used to call "data cleaning" is now governed as a true Semantic Layer, which unblocks Data Analysts from repetitive SQL debugging. Instead of relying on manual SPSS interventions, I write modular, version-controlled, and heavily tested SQL models directly in a cloud data warehouse like BigQuery. Complex joins and metric aggregations are handled declaratively via Bruin, ensuring that downstream Business Intelligence (BI) tools and dashboards consume a fully automated, governed source of truth.
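The rank-correlation step that was run by hand in SPSS is a good example of logic that can live inside the pipeline and be tested like any other code. The sketch below computes Kendall's tau in pure Python. It assumes the tau-b variant (the one that corrects for ties); the thesis does not say which variant was used, and the function name is illustrative.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences.

    O(n^2) pair comparison, which is fine for a few dozen topics.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counts toward neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator
```

For example, `kendall_tau_b(sentiment_share, popularity)` over per-topic values returns 1.0 when both rankings agree exactly and -1.0 when one is the reverse of the other. Here `sentiment_share` and `popularity` are hypothetical lists with one entry per topic.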
🎩 The MLOps & NLP Engineer Vantage (Batch Inference & Model Orchestration)
The 2020 Academic Reality (What was done)
I built a heavy batch NLP inference pipeline to process roughly 400,000 posts. It handled deep text normalization using JSoup and orchestrated two external models: Senti4SD for sentiment classification and Mallet LDA for topic modeling. Because of hardware limits, I had to manually split the files, run the batch inference, and use complex HashMap lookups with position-based keys (e.g., matching row t0 to t0) to stitch the sentiment predictions back to the original post IDs.
The 2020 Enterprise Gap (What was missing)
DeepWiki exposed the exact fragility of this research-grade approach. First, position-based matching is incredibly risky in production—if a single row silently drops during inference, the entire downstream alignment breaks. Second, the system completely lacked model lifecycle management. There was no Model Registry, no artifact tracking for the Senti4SD or LDA models, and zero NLP-specific observability to monitor model degradation or data drift (like tracking shifting sentiment confidence scores).
The 2026 Standard (What I understand now)
ID-preserving, containerized ML pipelines. Today, I would never rely on position-based matching; feature engineering and transformations must be ID-preserving by design. Machine learning models are no longer treated as loose external scripts; they are containerized, pinned to specific versions, and managed as tracked model artifacts, which allows Data Scientists to deploy models faster without worrying about infrastructure. Instead of dumping data into fragile CSVs, intermediate outputs are saved in structured, columnar formats like Parquet, with the entire inference lifecycle orchestrated dynamically to prevent manual file chunking.
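The difference between the two joining strategies fits in a few lines. This is an illustrative sketch, not the thesis code: the function names and the sample IDs are hypothetical, and it assumes the inference step can emit each prediction together with its post ID.

```python
def join_by_position(post_ids, predictions):
    """Fragile: assumes row i of the model output belongs to row i of the input."""
    return dict(zip(post_ids, predictions))


def join_by_id(post_ids, predictions_by_id):
    """ID-preserving: every prediction carries its own post ID.

    Raises if any input post has no prediction instead of silently
    shifting the remaining rows.
    """
    missing = [pid for pid in post_ids if pid not in predictions_by_id]
    if missing:
        raise KeyError(f"{len(missing)} posts have no prediction, e.g. {missing[0]}")
    return {pid: predictions_by_id[pid] for pid in post_ids}


# Hypothetical run in which the model silently dropped post 102.
post_ids = [101, 102, 103, 104]
positional_output = ["positive", "neutral", "negative"]
keyed_output = {101: "positive", 103: "neutral", 104: "negative"}

# Position-based join: 102 wrongly inherits the label that belongs to 103,
# 103 inherits 104's label, and 104 disappears without any error.
shifted = join_by_position(post_ids, positional_output)
```

With the keyed output, `join_by_id(post_ids, keyed_output)` fails loudly on post 102, which turns a silent misalignment into a visible, fixable error.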
🎩 The FinOps Engineer Vantage (Compute & Storage Optimization)
The 2020 Academic Reality (What was done)
To accommodate the memory limits of the Senti4SD tool, I had to hardcode file splits at 10,000 records each. Processing over 150,000 posts meant running multiple Java stages, executing redundant I/O passes over the same text, and generating a massive proliferation of intermediate CSV files for questions, answers, and sentiment mappings.
The 2020 Enterprise Gap (What was missing)
In a university setting, the only cost is the electric bill, but DeepWiki’s audit highlighted the massive infrastructure and opportunity cost of this architecture. Manual orchestration meant idle resources waiting for human intervention, killing compute utilization. Hardcoded splits led to inefficient small-file operations, and redundant I/O passes amplified storage bloat and data egress overhead. In an enterprise cloud environment, this lack of storage lifecycle management would cause cloud bills to hemorrhage money.
The 2026 Standard (What I understand now)
Rightsizing compute and resource efficiency. True FinOps isn't just about minimizing cloud bills; it is about reducing time-to-insight so the business can make faster, data-driven decisions without scaling infrastructure costs linearly. Today, I replace fixed file splits with dynamic batching matched to instance memory. Intermediate CSVs are replaced by optimized columnar formats or managed BigQuery storage to slash I/O costs. Furthermore, by automating scheduling via Bruin, I eliminate manual wait times and can leverage preemptible spot instances via OpenTofu to process heavy batch workloads for pennies on the dollar, proving out a highly efficient chargeback model.
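As a rough illustration of "dynamic batching matched to instance memory", the sketch below replaces a fixed 10,000-record split with batches sized by a byte budget. Both function names and the 25% safety fraction are hypothetical; a real sizing rule would depend on how much memory the model needs per record.

```python
def budget_from_memory(available_bytes: int, safety_fraction: float = 0.25) -> int:
    """Hypothetical sizing rule: spend only a fraction of free memory per batch."""
    return max(1, int(available_bytes * safety_fraction))


def batches_by_budget(records, max_bytes: int):
    """Yield lists of text records whose combined UTF-8 size fits max_bytes.

    A single record larger than the budget is emitted alone rather than
    dropped, so no input is lost.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    batch, used = [], 0
    for record in records:
        size = len(record.encode("utf-8"))
        if batch and used + size > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(record)
        used += size
    if batch:
        yield batch
```

Because the generator streams its input, the same code produces a few large batches on a high-memory node and many small ones on a cheaper instance, without editing a hardcoded split size.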
The Architecture Audit (Then vs. Now)
If I were tasked to architect this entire data platform today, here is how the approach changes from a localized script to an enterprise-grade cloud architecture:
| Architecture Layer | The 2020 Academic Baseline | The 2026 Enterprise Standard |
| --- | --- | --- |
| Data Ingestion & ETL | Local extraction of 1 TB XML dumps using brittle Java scripts and nested HashMaps. | Programmatic ingestion into a Google Cloud Storage (GCS) Data Lake with declarative SQL transformations. |
| Infrastructure & Orchestration | A localized, manual monolith running on laptop hardware with uncontainerized dependencies. | Ephemeral Google Kubernetes Engine (GKE) clusters provisioned via OpenTofu, orchestrated as DAGs using Bruin. |
| Data Modeling & Analytics | Unversioned intermediate CSVs with manual metric enrichment handled downstream in SPSS. | A governed Semantic Layer inside BigQuery featuring version-controlled, automated metric joins. |
| MLOps & Inference | Untracked Senti4SD/LDA models relying on highly fragile position-based HashMap matching. | Containerized, version-pinned ML artifacts utilizing ID-preserving pipelines and Parquet/BigQuery storage. |
| FinOps & Resource Management | Hardcoded 10,000-record file splits resulting in redundant I/O passes and idle manual wait times. | Dynamic batching and automated scheduling leveraging preemptible spot instances to slash compute costs. |
The Final Verdict: Putting Legacy Work Under the Microscope.
The ultimate takeaway from this DeepWiki audit is that it didn't just highlight my 2020 gaps; it built the foundation for my next phase. Dissecting this monolith has unlocked a profound new confidence in my #reMomentum journey. I know what happens when data systems are pushed to their physical limits without governance, containerization, or orchestration, and more importantly, I now know exactly how to architect the modern solutions.
I am ready to step into a Data Platform Engineer role, equipped not just with modern Data 3.0 syntax, but with the foundational scars required to know why we build things the right way. I am looking for a team that needs an engineer who can bridge the gap between raw data and scalable AI: someone who can take chaotic, localized pipelines and rebuild them into governed, profitable data products.
Rather than forcing an outdated 2020 dataset into a new tech stack, I am applying this renewed confidence to a modern, high-volume data source. Stay tuned to my GitHub repository, where I am currently architecting my next enterprise data platform from scratch using OpenTofu, Bruin, Kubernetes, and BigQuery.
The #reMomentum continues! To learn more about my past thesis work, chat with it here: DeepWiki AI


