Comparison fails on dataframes with a single column #253
Thanks for the report. Would you be able to provide a minimal example so we can reproduce the issue? That would be really helpful for debugging here.
Thanks for your reply, here is a sample of some code that throws an exception for us:

```python
import datacompy

# `spark` is an active SparkSession
df_1 = spark.createDataFrame([{"a": 1}])
df_2 = spark.createDataFrame([{"a": 1}])

compare = datacompy.SparkCompare(
    spark,
    df_1,
    df_2,
    join_columns=["a"],
    cache_intermediates=True,
)

compare.rows_both_mismatch.count()
```
The error message is:
Thanks for the details. It seems like I can reproduce it. I also tried using Pandas and Fugue:

Pandas:

```python
import pandas as pd
import datacompy

df_1 = pd.DataFrame([{"a": 1}])
df_2 = pd.DataFrame([{"a": 1}])

compare = datacompy.Compare(df_1, df_2, join_columns=["a"])
print(compare.report())
```

Results in:

```
DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        1     1
1       df2        1     1

Column Summary
--------------

Number of columns in common: 1
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: a
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 1
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 1

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 0
```

Fugue:

```python
df_1 = spark.createDataFrame([{"a": 1}])
df_2 = spark.createDataFrame([{"a": 1}])

print(datacompy.report(df_1, df_2, join_columns=["a"]))
```

Results in:

```
DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        1     1
1       df2        1     1

Column Summary
--------------

Number of columns in common: 1
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: a
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 1
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 1

Column Comparison
-----------------

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 0
```

We should align the Spark output with Pandas and Fugue. @rupertbarton Would you be open to using Fugue for your Spark compare for now? You should be able to run it successfully. I'll need to debug the native Spark compare. I have been debating whether we should just remove it in favor of using Fugue moving forward.
I'll create a ticket in our backlog to investigate switching over, thanks!
@rupertbarton More for my understanding, but could you articulate what sort of use case you have where you are joining on a single column with nothing else to compare?
@jdawang @rupertbarton I have a WIP fix here. Getting the following back:

```
In [2]: print(compare.report())

****** Column Summary ******
Number of columns in common with matching schemas: 1
Number of columns in common with schema differences: 0
Number of columns in base but not compare: 0
Number of columns in compare but not base: 0

****** Row Summary ******
Number of rows in common: 2
Number of rows in base but not compare: 0
Number of rows in compare but not base: 0
Number of duplicate rows found in base: 0
Number of duplicate rows found in compare: 0

****** Row Comparison ******
Number of rows with some columns unequal: 0
Number of rows with all columns equal: 2

****** Column Comparison ******
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 0

None
```

Seems like the Column Comparison is different from the Pandas version. I think this is mostly due to a difference in the underlying logic: in Pandas it would say "Number of columns compared with all values equal: 1". I can see this both ways; this corner case is just a bit odd because you aren't really comparing anything, just joining on the key. @jdawang Another reason why I'm thinking maybe we just drop the native Spark implementation. The differences are annoying.
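To make the discrepancy concrete, here is a small illustrative sketch (hypothetical variable names, not datacompy code) of the two counting conventions described above:

```python
# Single-column case from this issue: the only column is also the join key.
columns_in_common = ["a"]
join_columns = ["a"]

# Pandas-style convention: join columns count as compared columns,
# so the trivially-equal join key is reported as 1 equal column.
pandas_equal_columns = len(columns_in_common)                           # -> 1

# Native-Spark-style convention: only non-join columns are compared,
# so nothing is counted when the join key is the only column.
spark_equal_columns = len(set(columns_in_common) - set(join_columns))   # -> 0
```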
Hi! Our use case is that we have a large number of tables we are running assertions on, and all of them work fine apart from one particular table. This table has multiple columns, but all of the columns apart from one are encrypted, so we exclude them from the comparison, as it's awkward to work out what the encrypted values will be; hence the DF only has a single column. We still want to compare that all the values in the expected DF and the actual DF match up, and we're using the same code for every table.
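For an assertion-style workflow like the one described above, a minimal sketch using datacompy's Fugue-backed `is_match` helper might look like the following (hedged: availability and exact signature depend on your datacompy version; `spark` is assumed to be an active SparkSession):

```python
import datacompy

# Hypothetical example data: the single remaining (unencrypted) column
# is also the join key, matching the case in this issue.
expected = spark.createDataFrame([{"a": 1}, {"a": 2}])
actual = spark.createDataFrame([{"a": 1}, {"a": 2}])

# is_match returns True when the two DataFrames compare equal,
# which fits a per-table assertion loop.
assert datacompy.is_match(expected, actual, join_columns=["a"])
```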
Hi everyone, I guess this will work for you.
When running a comparison on dataframes with a single column, the following exception is thrown:

I believe it's due to this line always including a `WHERE` statement even when the `where_cond` is empty (empty because there are no columns apart from the join column).
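For illustration, here is a minimal sketch of the suspected failure mode and the guard that would avoid it (hypothetical function and variable names, not the actual datacompy source):

```python
def build_row_query(table_name: str, where_cond: str) -> str:
    """Build a row-selection query, appending WHERE only when a condition exists.

    Suspected bug sketch: with a single (join-only) column, where_cond is
    empty, and unconditionally appending "WHERE" would yield invalid SQL
    such as "SELECT * FROM t WHERE".
    """
    query = f"SELECT * FROM {table_name}"
    if where_cond:  # guard against an empty condition
        query += f" WHERE {where_cond}"
    return query

# No non-join columns -> empty condition -> no WHERE clause is emitted.
print(build_row_query("matched_rows", ""))         # SELECT * FROM matched_rows
print(build_row_query("matched_rows", "a_match"))  # SELECT * FROM matched_rows WHERE a_match
```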