git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Complex Types Support in DDL


Hi,

We want add support for non-primitive types (ROW, ARRAY, MAP) to Apache
Beam SQL DDL (based on Calcite DDL extensions). What would be the best way
to approach this?

*Our Use Case:*
  We want to be able to use DDL to define data sources and sinks for Beam
pipelines, so that users don't have to wrap SQL into custom code which
configures sources/sinks.

*What we have already:*
  We have a customized CREATE TABLE statement which allows users to specify
the type of the data source, its schema, and data location. The
implmentation is based on Calcite DDL extensions.

*What we're missing:*
  We need to be able to define schemas with non-primitive types, e.g.
arrays or rows, so that we can correctly describe data sources and sinks
which supports such types. For example if we want to manipulate data in a
stream of JSON objects, we want to be able to describe the JSON contents
somehow, including arrays or nested objects. Or we would need similar types
to interact with BigQuery which supports arrays and nested struct types.

*Problem:*
  I tried to check if it is possible to extend the parser using the
config.fmpp approach, so that we can hook into the Parser.TypeName()
<https://github.com/apache/calcite/blob/a5d520df76602d25ed66627f08f5e0db4d048a77/core/src/main/codegen/templates/Parser.jj#L4439>
method and parse the complex types ourselves. But Parser.DataType()
<https://github.com/apache/calcite/blob/a5d520df76602d25ed66627f08f5e0db4d048a77/core/src/main/codegen/templates/Parser.jj#L4377>
creates
SqlDataTypeSpec only in two specific ways, without ability to extend it, so
even if we parse the typename ourselves, we would not be able to construct
the SqlDataTypeSpec in a way that supports arrays/rows. But even if we
could, looking at SqlDataTypeSpec
<https://github.com/apache/calcite/blob/09be7e74a6a4d1b1c4f640c8e69b5ebdd467d811/core/src/main/java/org/apache/calcite/sql/SqlDataTypeSpec.java#L327>
it seems that it does not support creating arrays or rows as well: it calls
typeFactory.createSqlType(typename)
<https://github.com/apache/calcite/blob/09be7e74a6a4d1b1c4f640c8e69b5ebdd467d811/core/src/main/java/org/apache/calcite/sql/SqlDataTypeSpec.java#L350>
which
only
<https://github.com/apache/calcite/blob/f47465236b7650f2280092b708fa39062fe79ffd/core/src/main/java/org/apache/calcite/sql/type/SqlTypeFactoryImpl.java#L49>
creates basic types in this call.

*Path forward:*
  It the above is correct, then it appears that we would need to patch
Calcite in couple of places to support arrays, rows, and maps in DDL:
    - update Parser.jj to support parsing the type definitions for the
required types and constructing SqlDataTypeSpec correctly for those cases;
    - update SqlDataTypeSpec.java to handle complex types and invoke
correct typeFactory interfaces;

*Questions:*
 - does the above sound sane/correct?
 - is there a similar work already tracked in Calcite somewhere? I saw
something mentioned in CALCITE-2045
<https://issues.apache.org/jira/browse/CALCITE-2045?focusedCommentId=16351203&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16351203>,
but didn't see any tracking Jiras specifically for this work yet;
 - is there a known/recommended/working syntax for such DDL? If there is
none, then would it make sense to adopt something similar to BigQuery
STRUCT/ARRAY
definition <https://cloud.google.com/bigquery/docs/data-definition-language>
?

Thank you,
Anton