In the previous labs, we defined our strategy and identified the necessary tables. Now, we move to the "engine room" of business analytics: Extract, Transform, and Load (ETL). Raw data is rarely ready for a dashboard because it is optimized for recording transactions, not for high-level analysis.
To make this data useful, we must filter out superfluous information—technical clutter like last_update timestamps or internal address_id keys—and focus on the "signal" that drives business decisions.
Modern analytics often uses a "Medallion" framework to refine data through three distinct stages:
rental, payment) exactly as they exist in the operational database. This data is usually collected in a form that is built for operational speed and simplicity.When moving from Bronze to Silver, the type of "Join" you choose determines which rows are included in your final analysis. Think of these as different ways to handle missing or unmatched data.
customer and rental, you will only see customers who have actually rented at least one movie.customer to rental, you will see every customer in your database—even those who have never rented a movie (their rental columns will just appear as "Null").customer to rental, you would see every rental record, even if—hypothetically—a rental was somehow processed without a valid customer ID.A key part of ETL is understanding how the "grain" (what one row represents) changes as it moves toward the Gold layer. I like to always emphasize the importance of being absolutely clear about what each row of your represents from each table you are reading it from.
<aside> 💡
As you build your ETL, you must track your row counts at every single step. This practice, which I call the Doctrine of Volumetric Integrity, is your primary defense against "garbage in, garbage out."
Imagine you have 100 films, 500 customers, and 4,000 rentals.
customers and films via the rentals table, your final joined table should ideally have 4,000 rows (one for every rental event).If you enter a meeting and your numbers differ from a colleague’s, your ability to demonstrate Volumetric Integrity—showing exactly how and why your row counts changed at each stage—is what proves your analysis is the "source of truth."
</aside>
payment table.