DuckDB Histogram
This snippet demonstrates how to use the Histogram
function in DuckDB to calculate aggregate statistics for a dataset. The histogram
function in DuckDB is used to compute histograms over columns of a dataset. It works for columns of any type and allows for various binning strategies and a custom number of bins.
from histogram(
table_name,
column_name,
bin_count := 10
)
Parameters
table_name
: The name of the table or a subquery result.column_name
: The name of the column for which to create the histogram, you can use different expressions to summarize the data such as length of a string.bin_count
: The number of bins to use in the histogram. (Optional)technique
: The binning technique to use. (Optional)
Binning Techniques
Technique | Description |
---|---|
auto | Automatically selects the best binning technique based on the data type. If the data type is not numeric or timestamp, it defaults to sample . For numeric or timestamp data, it defaults to equi-width-nice . |
sample | Uses distinct values in the column as bins. This technique is useful when the column has a small number of distinct values. |
equi-height | Creates bins such that each bin has approximately the same number of data points. This technique is useful for ensuring that each bin has a similar number of entries. This can be helpful for skewed distributions. |
equi-width | Creates bins of equal width. This technique is useful for numeric data. You want each bin to cover the same range of values. |
equi-width-nice | Creates bins of equal width with “nice” boundaries. This technique is similar to equi-width . It adjusts the bin boundaries to be more human-readable (e.g., rounding to the nearest whole number). |
You can find more information in the PR that added this feature.
Histogram of the length of the input persona from the PersonaHub
dataset
from histogram(
instruction,
len("input persona"),
bin_count := 5
)