I don’t want to filter out these rows as the rest of the metadata is useful to me, but I also don’t want to have to remap all 900 fields from the schema just to tackle 4 problematic fields - is there an easy way to tell the Type Conversion code to simply make an assumption based on a setting / something I can force?
It’s been a while since you posted this question; have you found an answer to your issue?
I assume the schema is inferred by Spark rather than explicitly specified (i.e. you did not map all 900 fields yourself).
Based on the exception message you posted, the problematic value seems to be coming from the geo or sort field.
Spark samples documents to infer the schema, and in this case it likely sampled only documents where geo was NULL, so it inferred the type of the geo field as NullType. When it then encountered a document with a non-null value of type Document, it saw this as a conflict.
A workaround, without defining your own map of 900 fields, is to let Spark infer the schema and then modify only the selected types:
```python
>>> df.printSchema()
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- a: null (nullable = true)
 |-- b: string (nullable = true)

# Example of changing the NullType to StringType
# (in your case the column would be "geo")
>>> modified_df = df.withColumn("a", df["a"].cast("string"))
>>> modified_df.printSchema()
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
```
See also pyspark.sql.types.