Dataset Leakage Detection

This snippet demonstrates how to use DuckDB to detect potential data leakage between train and test datasets by calculating the overlap percentage.

WITH
    overlapping_rows AS (
        SELECT COUNT(*) AS overlap_count
        FROM train
        INTERSECT
        SELECT COUNT(*) AS overlap_count
        FROM test
    ),
    total_unique_rows AS (
        SELECT COUNT(*) AS total_count
        FROM (
            SELECT * FROM train
            UNION
            SELECT * FROM test
        ) combined
    )
SELECT
    overlap_count,
    total_count,
    (overlap_count * 100.0 / total_count) AS overlap_percentage
FROM overlapping_rows, total_unique_rows;

There is a very good in depth explanation of leakage in public datasets in this article.

Example Query

You can check out this link for the leakage query for the IMDB dataset.

leakage-query-imdb