DuckDB Histogram

This snippet demonstrates how to use the Histogram function in DuckDB to calculate aggregate statistics for a dataset. The histogram function in DuckDB is used to compute histograms over columns of a dataset. It works for columns of any type and allows for various binning strategies and a custom number of bins.

from histogram(
    table_name,
    column_name,
    bin_count := 10
)

Parameters

  • table_name: The name of the table or a subquery result.
  • column_name: The name of the column for which to create the histogram, you can use different expressions to summarize the data such as length of a string.
  • bin_count: The number of bins to use in the histogram. (Optional)
  • technique: The binning technique to use. (Optional)

Binning Techniques

Technique Description
auto Automatically selects the best binning technique based on the data type. If the data type is not numeric or timestamp, it defaults to sample. For numeric or timestamp data, it defaults to equi-width-nice.
sample Uses distinct values in the column as bins. This technique is useful when the column has a small number of distinct values.
equi-height Creates bins such that each bin has approximately the same number of data points. This technique is useful for ensuring that each bin has a similar number of entries. This can be helpful for skewed distributions.
equi-width Creates bins of equal width. This technique is useful for numeric data. You want each bin to cover the same range of values.
equi-width-nice Creates bins of equal width with “nice” boundaries. This technique is similar to equi-width. It adjusts the bin boundaries to be more human-readable (e.g., rounding to the nearest whole number).

You can find more information in the PR that added this feature.

Histogram of the length of the input persona from the PersonaHub dataset

from histogram(
  instruction,
  len("input persona"),
  bin_count := 5
)