pyspark.sql.functions.theta_union_agg

pyspark.sql.functions.theta_union_agg(col, lgNomEntries=None)

Aggregate function: merges the Theta sketches in the input column and returns the compact binary representation of the resulting union, itself a DataSketches Theta sketch.

New in version 4.1.0.

Parameters
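col : Column or column name

The column containing the binary Theta sketches to merge.

lgNomEntries : Column or int, optional

The log-base-2 of nominal entries for the union operation, so the union is sized for 2^lgNomEntries nominal entries. Must be between 4 and 26; defaults to 12 (4096 nominal entries).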

Returns
Column

The binary representation of the merged ThetaSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df1 = spark.createDataFrame([1,2,2,3], "INT")
>>> df1 = df1.agg(sf.theta_sketch_agg("value").alias("sketch"))
>>> df2 = spark.createDataFrame([4,5,5,6], "INT")
>>> df2 = df2.agg(sf.theta_sketch_agg("value").alias("sketch"))
>>> df3 = df1.union(df2)
>>> df3.agg(sf.theta_sketch_estimate(sf.theta_union_agg("sketch"))).show()
+--------------------------------------------------+
|theta_sketch_estimate(theta_union_agg(sketch, 12))|
+--------------------------------------------------+
|                                                 6|
+--------------------------------------------------+