Dataset Leakage Detection
This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.
WITH
overlapping_rows AS (
SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
(overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
There is a very good in depth explanation of leakage in public datasets in this article.
Example Query
You can check out this link for the leakage query for the IMDB dataset.