array_distinct is marked as Incompatible in Comet, but the specific incompatibility is not documented. This issue tracks documenting and potentially fixing the behavior difference.
Spark's array_distinct removes duplicates while preserving the order in which elements first appear. It uses SQLOpenHashSet for O(1) duplicate detection.
Example:
SELECT array_distinct(array(3, 1, 2, 1, 3));
-- Spark returns: [3, 1, 2] (preserves insertion order)
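The order-preserving behavior described above can be sketched in a few lines of Rust. This is only an illustration of the semantics, not Spark's implementation: the function name `distinct_preserving_order` is hypothetical, and a standard `HashSet` stands in for SQLOpenHashSet.

```rust
use std::collections::HashSet;

/// Keeps the first occurrence of each element, in input order.
/// Hypothetical helper for illustration; a plain HashSet stands in
/// for the hash set Spark uses internally.
fn distinct_preserving_order(input: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &value in input {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(value) {
            out.push(value);
        }
    }
    out
}
```

Calling `distinct_preserving_order(&[3, 1, 2, 1, 3])` yields `[3, 1, 2]`, matching the Spark result shown in the SQL example.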
The test file CometArrayExpressionSuite.scala includes comments explaining the incompatibility:
// The result needs to be in ascending order for checkSparkAnswerAndOperator to pass
// because datafusion array_distinct sorts the elements and then removes the duplicates
And:
// NULL needs to be the first element for checkSparkAnswerAndOperator to pass because
// datafusion array_distinct sorts the elements and then removes the duplicates
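DataFusion's array_distinct, as the test comments above describe, sorts the elements and then removes the duplicates, so the output is in sorted order with NULL first.
Example: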
SELECT array_distinct(array(3, 1, 2, 1, 3));
-- DataFusion returns: [1, 2, 3] (sorted order, not insertion order)
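The sort-then-deduplicate behavior that the test comments describe can be sketched as follows. This is a minimal sketch under stated assumptions, not DataFusion's actual code: the function name `distinct_sorted` is hypothetical, and `None` stands in for SQL NULL. NULL lands first here only because Rust orders `None` before every `Some(_)`, which happens to mirror the NULL-first result noted in the test comments.

```rust
/// Sorts the input, then drops adjacent duplicates.
/// Hypothetical helper for illustration; `None` models SQL NULL.
fn distinct_sorted(input: &[Option<i32>]) -> Vec<Option<i32>> {
    let mut out = input.to_vec();
    // Rust's ordering for Option places None before every Some(_).
    out.sort();
    // After sorting, equal values are adjacent, so dedup removes all repeats.
    out.dedup();
    out
}
```

For the input `[Some(3), Some(1), Some(2), Some(1), Some(3)]` this returns `[Some(1), Some(2), Some(3)]`. It holds the same set of values as the order-preserving result `[3, 1, 2]`, but in a different sequence, which is why a direct comparison of the two outputs fails unless the input is already in ascending order.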
Users who depend on the order of elements in array_distinct output will see different results when the expression is enabled with allow_incompatible=true.
Note: This issue was generated with AI assistance.