DuckDB Histogram
This snippet demonstrates how to use the Histogram function in DuckDB to calculate aggregate statistics for a dataset. The histogram function in DuckDB is used to compute histograms over columns of a dataset. It works for columns of any type and allows for various binning strategies and a custom number of bins.
from histogram(
table_name,
column_name,
bin_count := 10
) Parameters
table_name: The name of the table or a subquery result.column_name: The name of the column for which to create the histogram, you can use different expressions to summarize the data such as length of a string.bin_count: The number of bins to use in the histogram. (Optional)technique: The binning technique to use. (Optional)
Binning Techniques
| Technique | Description |
|---|---|
auto | Automatically selects the best binning technique based on the data type. If the data type is not numeric or timestamp, it defaults to sample. For numeric or timestamp data, it defaults to equi-width-nice. |
sample | Uses distinct values in the column as bins. This technique is useful when the column has a small number of distinct values. |
equi-height | Creates bins such that each bin has approximately the same number of data points. This technique is useful for ensuring that each bin has a similar number of entries. This can be helpful for skewed distributions. |
equi-width | Creates bins of equal width. This technique is useful for numeric data. You want each bin to cover the same range of values. |
equi-width-nice | Creates bins of equal width with “nice” boundaries. This technique is similar to equi-width. It adjusts the bin boundaries to be more human-readable (e.g., rounding to the nearest whole number). |
You can find more information in the PR that added this feature.
Histogram of the length of the input persona from the PersonaHub dataset
from histogram(
instruction,
len("input persona"),
bin_count := 5
)