
[jira] [Updated] (DAFFODIL-931) Variable-width charset with 'replace' can result in wrong length calculations


Michael Beckerle updated DAFFODIL-931:
    Fix Version/s:     (was: deferred)

> Variable-width charset with 'replace' can result in wrong length calculations
> -----------------------------------------------------------------------------
>                 Key: DAFFODIL-931
>                 URL:
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Back End, General
>    Affects Versions: s12
>            Reporter: Michael Beckerle
>            Assignee: Steve Lawrence
>            Priority: Major
>             Fix For: 2.2.0
> Consider a UTF-8 string with a single non-decodable byte in the middle.
> When we parse it, the non-decodable byte contributes one Unicode replacement character (U+FFFD) to the string.
> If you then call getBytes("utf-8") on this string, you will not get the original length: the bad byte is counted as 3 bytes instead of 1, because U+FFFD takes 3 bytes in UTF-8.
> For a variable-width encoding like UTF-8, we currently measure how far to advance in bytes by doing exactly that: calling getBytes to find out how long the string was.
> When a byte has been replaced, this causes us to advance too far into the data.
> A test case to illustrate this is TBD, but it isn't hard to put together: place a string as described above, with its length coming from an expression, between two binary int fields. The binary int field after the string will not be parsed properly, because we will advance too far past the string.
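
The mismatch described above can be reproduced outside Daffodil with plain JDK calls. The following is only a sketch, not Daffodil's code: the class name ReplacementLengthDemo, its method names, and the sample bytes ('a', 0xFF, 'b' between two 4-byte ints) are assumptions made for illustration. It compares the re-encoded length with the number of bytes the decoder actually consumed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Daffodil.
public class ReplacementLengthDemo {

    // The problematic measurement: decode (malformed bytes become U+FFFD),
    // then re-encode the string and count the resulting bytes.
    static int reencodedLength(byte[] stringBytes) {
        // new String(byte[], Charset) always replaces malformed input.
        String s = new String(stringBytes, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // One possible alternative: ask the input buffer how far the decoder
    // advanced, instead of re-encoding the decoded string.
    static int consumedLength(byte[] stringBytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.wrap(stringBytes);
        // UTF-8 never yields more chars than input bytes, so this is enough room.
        CharBuffer out = CharBuffer.allocate(stringBytes.length);
        decoder.decode(in, out, true);
        decoder.flush(out);
        return in.position();
    }

    public static void main(String[] args) {
        // 'a', one non-decodable byte, 'b': 3 bytes on the wire.
        byte[] str = { 'a', (byte) 0xFF, 'b' };

        // Layout assumed for illustration: int, string, int (big-endian).
        ByteBuffer record = ByteBuffer.allocate(4 + str.length + 4);
        record.putInt(1).put(str).putInt(2);

        int right = 4 + consumedLength(str);   // offset of the trailing int
        int wrong = 4 + reencodedLength(str);  // offset after over-advancing

        System.out.println("consumed bytes:   " + consumedLength(str));
        System.out.println("re-encoded bytes: " + reencodedLength(str));
        System.out.println("trailing int at correct offset: " + record.getInt(right));
        System.out.println("over-advanced by " + (wrong - right) + " bytes");
    }
}
```

In this sketch the re-encoded length exceeds the consumed length by two bytes for each replaced byte, which is the amount by which a parser using getBytes would overshoot the start of the following field.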

This message was sent by Atlassian JIRA