[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517
yaooqinn wants to merge 1 commit into apache:master
Conversation
cc @dongjoon-hyun @cloud-fan @gengliangwang, is the fix direction correct? Is this a genuine bug or a design choice? I haven't found any public discussion in this area.
Hi, @yaooqinn
To me, this PR seems to introduce a new feature rather than fix a bug. cc @aokolnychyi, @peter-toth, too.
@dongjoon-hyun
IIUC,
Anyway, let's wait for @gengliangwang's and @cloud-fan's comments. Maybe I lost track of the code changes.
@dongjoon-hyun FYI, NOT NULL constraints for table output rely on it, too.
Got it. So, from Apache Spark 3.5.0 via the following? Then, cc @aokolnychyi, too.
This is probably too breaking a change, and many users may already treat this bug as a feature. If users need strict NOT NULL enforcement, they should migrate to DS v2.
+1 with @cloud-fan. At least, the default behavior should not be changed in 4.2.
Yeah, I agree with you that this is too breaking. However, many of our clients and developers here have raised doubts about this strange behavior of Spark, and there are no clear docs for it.
Do you mean using built-in sources with the v2 code path, or suggesting users migrate to third-party formats?
I think using plain file source tables is not recommended now; people should switch to lakehouse table formats.
Since DSv2 migration is an independent topic from this DSv1 bug, I made a PR for DSv2 specifically, @cloud-fan, @gengliangwang, @yaooqinn.
Hi @dongjoon-hyun, built-in file sources with the DSv2 code path have the same issue, based on my unit tests.
@cloud-fan Instead of making NOT NULL enforcement the default behavior, would you mind if we gave it a shot and provided an alternative option for users who want built-in Parquet and Delta Lake to behave the same on NOT NULL constraints? For streaming sources, we have such an option for toggling asNullable.
What changes were proposed in this pull request?
V1 file-based DataSource writes (Parquet, ORC, JSON, etc.) silently accept null values into NOT NULL columns. This PR adds opt-in NOT NULL constraint enforcement controlled by `spark.sql.fileSource.insert.enforceNotNull`.

Changes:
- `CreateDataSourceTableCommand`: Preserves user-specified nullability by recursively merging nullability flags from the user schema into the resolved `dataSource.schema`. Previously it stored `dataSource.schema` directly, which is all-nullable because `DataSource.resolveRelation()` calls `dataSchema.asNullable` (SPARK-13738).
- `PreprocessTableInsertion`: Restores nullability flags from the catalog schema before null checks, ensuring `AssertNotNull` is injected when needed. Gated behind `spark.sql.fileSource.insert.enforceNotNull`.

New config:
- `spark.sql.fileSource.insert.enforceNotNull` (default `false`): when set to `true`, enables NOT NULL constraint enforcement for V1 file-based tables, consistent with the behavior for other data sources and V2 catalog tables.
- `SparkGetColumnsOperation`: Fixed `IS_NULLABLE` to respect `column.nullable` instead of always returning `"YES"`.

Why are the changes needed?
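One way to see the root cause: `asNullable` conceptually marks every field nullable, recursively, so once the resolved schema is stored in the catalog the NOT NULL flags are unrecoverable. A toy sketch of that effect (plain Python with a made-up dict-based schema, not Spark's actual implementation):

```python
# Toy illustration of what `asNullable` does conceptually. The dict-based
# "schema" is invented for this example, not Spark's StructType: it maps
# a field name to (dtype, nullable), where dtype may itself be a nested
# dict schema.
def as_nullable(schema):
    out = {}
    for name, (dtype, _nullable) in schema.items():
        if isinstance(dtype, dict):   # recurse into nested structs
            dtype = as_nullable(dtype)
        out[name] = (dtype, True)     # every field becomes nullable
    return out

user_schema = {
    "id": ("long", False),                         # declared NOT NULL
    "info": ({"email": ("string", False)}, True),  # nested NOT NULL field
}
stored = as_nullable(user_schema)
print(stored["id"][1])                # True: the NOT NULL flag is gone
print(stored["info"][0]["email"][1])  # True: the nested flag is gone too
```

Since the catalog only ever sees `stored`, later passes have no way to tell which fields the user declared NOT NULL, which is exactly why the merge step in `CreateDataSourceTableCommand` is needed.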
`DataSource.resolveRelation()` calls `dataSchema.asNullable` (added in SPARK-13738 for read safety), which strips all NOT NULL constraints recursively. `CreateDataSourceTableCommand` then stores this all-nullable schema in the catalog, permanently losing the NOT NULL information. As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1 file source tables.

Note:
`InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have this problem because it preserves the original schema (SPARK-24583).

Does this PR introduce any user-facing change?
No change in default behavior. Users can opt in to NOT NULL enforcement for V1 file source tables by setting `spark.sql.fileSource.insert.enforceNotNull` to `true`.

How was this patch tested?
- `InsertSuite`, covering top-level, nested struct, array, and map null constraint enforcement.
- `SparkMetadataOperationSuite`.

Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with GitHub Copilot.
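For illustration, the recursive nullability merge described for `CreateDataSourceTableCommand` might look like the following toy sketch. The `Field`/`Struct` types are hypothetical stand-ins for Spark's `StructField`/`StructType`, and the function name is invented; this is a sketch of the idea, not the PR's code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Field:
    name: str
    dtype: object          # a primitive type name, or a nested Struct
    nullable: bool = True

@dataclass
class Struct:
    fields: List[Field] = field(default_factory=list)

def merge_nullability(resolved: Struct, user: Struct) -> Struct:
    """Copy nullability flags from the user-declared schema onto the
    resolved (all-nullable) schema, recursing into nested structs."""
    user_by_name = {f.name: f for f in user.fields}
    merged = []
    for f in resolved.fields:
        u = user_by_name.get(f.name)
        if u is None:
            merged.append(f)  # no user declaration: keep resolved field
            continue
        dtype = f.dtype
        if isinstance(dtype, Struct) and isinstance(u.dtype, Struct):
            dtype = merge_nullability(dtype, u.dtype)
        merged.append(Field(f.name, dtype, u.nullable))
    return Struct(merged)

user = Struct([Field("id", "long", nullable=False),
               Field("info", Struct([Field("email", "string", nullable=False)]))])
resolved = Struct([Field("id", "long", True),
                   Field("info", Struct([Field("email", "string", True)]), True)])
merged = merge_nullability(resolved, user)
print(merged.fields[0].nullable)                  # False: NOT NULL preserved
print(merged.fields[1].dtype.fields[0].nullable)  # False: nested flag preserved
```

With the merged schema stored in the catalog, and the config enabled, `PreprocessTableInsertion` can then see the original NOT NULL flags and inject the runtime null checks; an `INSERT` of a null into a NOT NULL column would fail instead of being silently written.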