Issue Description

Summary

array_distinct is marked as Incompatible in Comet, but the specific incompatibility is not documented. This issue tracks documenting and potentially fixing the behavior difference.

Spark Specification

Spark's array_distinct:

Removes duplicate elements from an array while preserving the order of first occurrence
Handles null values by keeping only the first null encountered
Uses SQLOpenHashSet for O(1) duplicate detection

Example:

SELECT array_distinct(array(3, 1, 2, 1, 3));
-- Spark returns: [3, 1, 2] (preserves insertion order)

Current Comet Behavior

The test file CometArrayExpressionSuite.scala includes comments explaining the incompatibility:

// The result needs to be in ascending order for checkSparkAnswerAndOperator to pass
// because datafusion array_distinct sorts the elements and then removes the duplicates

And:

// NULL needs to be the first element for checkSparkAnswerAndOperator to pass because
// datafusion array_distinct sorts the elements and then removes the duplicates

DataFusion's array_distinct:

Sorts the elements before removing duplicates, so NULL (if present) comes first
Returns the distinct elements in ascending order instead of Spark's first-occurrence order

Example:

SELECT array_distinct(array(3, 1, 2, 1, 3));
-- DataFusion returns: [1, 2, 3] (sorted order, not insertion order)
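
The sort-then-deduplicate behavior described in the test comments can be modeled in a few lines of Rust. This is only an illustrative sketch on a plain slice, not DataFusion's actual code: the function name sort_then_dedup is hypothetical, and Option<i32> with None is an assumed stand-in for a nullable integer array.

```rust
/// Hypothetical model of sort-then-deduplicate semantics.
/// `None` stands in for SQL NULL; this is not DataFusion's implementation.
fn sort_then_dedup(input: &[Option<i32>]) -> Vec<Option<i32>> {
    let mut out = input.to_vec();
    // Option<i32> orders None before every Some(_), so NULL moves to the front.
    out.sort();
    // dedup() removes consecutive equal elements; after sorting, that is
    // every duplicate.
    out.dedup();
    out
}
```

Calling sort_then_dedup on [3, 1, 2, 1, 3] yields [1, 2, 3], and any NULL in the input ends up as the first element, which matches what the two test comments require.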

Impact

Queries that depend on the order of elements in array_distinct output will return different results under Comet than under Spark.

Possible Solutions

Add a custom Rust implementation that uses a HashSet but preserves insertion order (like Spark)
Document clearly in the compatibility matrix that element order differs
Keep as Incompatible and require allow_incompatible=true
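
As a rough idea of what the first option could look like, here is a minimal sketch of order-preserving deduplication. It assumes a plain slice of Option<i32> (None standing in for NULL) rather than Arrow arrays, and the function name distinct_preserving_order is hypothetical. A real implementation would have to operate on Arrow list arrays and cover every supported element type.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keeps the first occurrence of each value in input
/// order. `None` stands in for SQL NULL, so only the first NULL is kept.
fn distinct_preserving_order(input: &[Option<i32>]) -> Vec<Option<i32>> {
    let mut seen: HashSet<Option<i32>> = HashSet::with_capacity(input.len());
    let mut out = Vec::with_capacity(input.len());
    for &item in input {
        // HashSet::insert returns true only the first time a value is added.
        if seen.insert(item) {
            out.push(item);
        }
    }
    out
}
```

For [3, 1, 2, 1, 3] this returns [3, 1, 2], matching the Spark result shown above. Floating-point element types would need extra care, because f32 and f64 do not implement Hash or Eq in Rust.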

Note: This issue was generated with AI assistance.

[Incompatibility] Document array_distinct behavior differences: element ordering
