git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

thoughts on fixing trailing separator suppression


I haven't completed this note yet, but as I am having to shift gears for a few days, I am providing my current state of thinking around what's actually required here for posterity/amusement, and to share what I anticipate will actually be a great improvement to Daffodil, but non-trivial to implement.


Comments are welcome. This needs to turn into a plan/design for the fix.


-----------------------


Design/Redesign Issues for Sequences with Separators

We recently found issues associated with nested separators, and with dfdl:separatorSuppressionPolicy 'trailing' and 'trailingStrict'.

It's not possible to get the IBM4690TLog DFDL schema to work on Daffodil until these are fixed, and the bug with nested separators also has impact on the DFDL schema for USMTF.

Fixing these is going to require some refactoring and reimplementation starting from the daffodil-core grammar package.

The Problems

The grammar has some mistakes in it that make it impossible to properly implement.

* The first place we look at dfdl:separatorSuppressionPolicy is in the grammar production for arrayContentsWithSeparators. This is incorrect because separator suppression can apply to terms in a sequence that are not related by being part of an array.

That said, it is not terribly important (from a bug standpoint) to support separator suppression except for when there are repeating/optional elements involved.

The bigger problem is:

* The separatedForArrayPosition(body) method composes together the separators with the term following them that is separated. This prevents independent orchestration of the parse of the separator with that of the following element. It is depending on the parse of the separator failing, and that causing backtracking to try subsequent alternatives, as a way of identifying which delimiter was actually found. This could be made to work, but it requires changes to the way separators and all delimiters really, are parsed.

The current design, where separators are composed together into the child terms, makes implementation of trailing separator suppression basically impossible.

Fact is, all of this was designed to recursively compose by just composing parsers from other parsers, but that turns out to be naive because it doesn't provide enough structure to issue really good clear diagnostics, and it is too hard to tell what is going on.

-----------------------------------------------

A sequence combinator should be laid down depending on whether the sequence has separators or not, and whether it has trailing separator suppression or not.

That is, the decision we see being made in arrayContentsWithSeparators, needs to be made in SequenceGrammarMixin in the orderedSequenceContent production. (When implemented unorderedSequenceContent may be similar.)

The kind of sequence combinator being laid down should depend on separators or not right at that point, and the sequence combinator for the hasSeparators case will have a rich set of parameters.

The asTermInSequence, and asTermInArray both need to go away. Instead of combining all these characteristics into one composed term that we parse/unparse, with separators embedded into it, whether it is even represented embedded in it, etc. Instead we need an array of tuples, each tuple is a (termContentBody, termRuntimeData) where termRuntimeData lets us know if the term isNotRepresented, is scalar, or is array, and what the max/min occurs are, etc.

So the SeparatedSequenceCombinator then must orchestrate everything about prefix/posfix separators, and the loops needed for repeating/array/optional terms.

I expect we will have at least

SeparatedSequenceCombinator
UnseparatedSequenceCombinator

I believe the entire LocalElementGrammarMixin will have to be refactored/rewritten. I think every method will likely have to be removed.

All these methods have to go away:
separatedEmpty
separatedRecurring
StopValue - which isn't implemented anyway, so it's just clutter.
recurrance
separatedContentWithMinUnboundedWithoutTrailingEmpties
separatedContentWithMinAndMaxWithoutTrailingEmpties
separatedContentWithMinUnbounded
separatedContentWithMinAndMax
separatedContentZeroToUnbounded
separatedContentAtMostNWithoutTrailingEmpties
stopValueSize
separatedContentExactlyN
separatedContentExactlyNComputed
arrayContents
arrayContentsNoSeparators
arrayContentsWithSeparators
contentUnbounded

Instead, LocalElementGrammarMixin - which is all about recurrance/optionality - will be in service of several kinds of sequence combinators. Its methods will provide the things needed in order for proper compilation of the parameters given to a sequence combinator. It may not be a grammar mixin anymore. It may just become methods on LocalElementMixin.

The chosen sequence combinator must interoperate with the repeat combinators when dealing with separators. Or alternatively, it must subsume their functionality.

---------------------------
The below is probably NOT true:

Interplay of field isolation that scans for a value, ending at a delimiter, vs., after that, we parse the delimiter again, is where we get into trouble.

We're depending on backtracking - when it tries to "reparse" the separator, but doesn't find the separator, then backtracking causes the parser to then again parse the field, but then try the next outward possible terminating markup. Until something parses successfully.

------------------

TODO and FIXME comments in the DelimiterText code suggest this is not being done properly - elements that could NOT be terminated by the sequence's terminator are nevertheless being scanned with the sequence's terminator as one of the possible delimiters.  (Though the semantics of nested delimiters are a bit unclear here. For example, if a simple type element has a terminator, then one would *think* that one would not need to escape the enclosing group separator appearing within it, because the only thing that matters is the terminator; however, this is not the case. One must still escape the enclosing separator. (This is the agreed upon view of the DFDL Workgroup. I am not sure it is right... but we may have to introduce a different lengthKind='delimited2' variation to provide the alternate behavior where a terminator on a simple element eliminates the need to escape anything except that terminator.

-----------------------

If we create new Sequence combinators and eliminate the composition of separators into child term parsers, we can better orchtestrate how the delimited term parsers identify a simple-type element, scanning until some terminating markup is found, with the testing for whether what was found is the expected separator or not.

That is, we can eliminate the redundant parse of the delimiter that happens currently - first to isolate the element value, second to consume the delimiter, verifying that it is an expected separator, etc.