Re: [CSV] Inconsistent record separator behavior


On 23 August 2018 at 17:31, Benedikt Ritter <britter@xxxxxxxxxx> wrote:
> Hi,
>
> On Thu, 23 Aug 2018 at 12:11, sebb <sebbaz@xxxxxxxxx> wrote:
>
>> On 23 August 2018 at 07:10, Benedikt Ritter <britter@xxxxxxxxxx> wrote:
>> > Hey sebb,
>> >
>> > On Thu, 23 Aug 2018 at 01:23, sebb <sebbaz@xxxxxxxxx> wrote:
>> >
>> >> On 23 August 2018 at 00:01, Bruno P. Kinoshita
>> >> <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
>> >> >
>> >> >>Maybe I'm just not getting it, but it feels pretty messed up :-)
>> >> >
>> >> >
>> >> > Mutual feeling, and +1 for consistency. From what I understood, users
>> >> > should be able to parse these crazy CSVs, but if they tried to
>> >> > re-create them, with comments, then they wouldn't be able to avoid the
>> >> > println/newline (so it wouldn't be parseable later with the same
>> >> > reader).
>> >> >
>> >> >
>> >> > We probably need a ticket for it to aggregate the discussion and
>> >> > maybe a possible solution.
>> >>
>> >> I'm wondering whether we need to be as flexible when *creating* the CSV
>> >> files.
>> >>
>> >> "Be liberal in what you accept, and conservative in what you send" (Jon
>> >> Postel)
>> >>
>> >> In this case send == create, as it might be sent to other less liberal
>> >> readers.
>> >>
>> >> I don't have a problem with the output being less flexible, so long as
>> >> it is sufficiently flexible (which I think it likely is already).
>> >>
>> >> I don't think consistency is necessary - or even desirable - here.
>> >>
>> >
>> > Okay, but wouldn't you expect that you can use a CSVFormat instance to
>> > read a file that you created with it? This is currently not the case.
>>
>> Sorry, I misread the problem.
>>
>> Yes, it should be able to read what it writes.
>>
>> So the issue remains: should the reader be able to parse the unusual
>> format, or should the writer not be able to create it?
>>
>> I don't have a particular view on that, except that allowing LF and
>> CRLF only seems too restrictive.
>> We should allow at least CR alone. I don't know whether there are any
>> other reasonable separators.
>>
>
> As Bruno pointed out, there seem to be formats whose record separators
> are not newlines. So maybe CSVPrinter.printComment(String) should not
> scan for CR and LF but for the record separator.
>

Makes sense.
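
To make the idea concrete, here is a minimal sketch of a comment writer that
breaks on the configured record separator instead of on CR and LF. This is not
the Commons CSV code: the class CommentSketch and the method formatComment are
hypothetical names used only for illustration, and the sketch assumes the
comment marker is a single character followed by a space.

```java
final class CommentSketch {

    // Hypothetical helper, not part of Commons CSV. It starts a new comment
    // record wherever the record separator occurs inside the comment text,
    // rather than looking for CR or LF.
    static String formatComment(String comment, char commentMarker, String recordSeparator) {
        if (recordSeparator == null || recordSeparator.isEmpty()) {
            throw new IllegalArgumentException("recordSeparator must not be empty");
        }
        StringBuilder out = new StringBuilder();
        out.append(commentMarker).append(' ');
        int i = 0;
        while (i < comment.length()) {
            if (comment.startsWith(recordSeparator, i)) {
                // End the current comment record and open the next one.
                out.append(recordSeparator).append(commentMarker).append(' ');
                i += recordSeparator.length();
            } else {
                out.append(comment.charAt(i));
                i++;
            }
        }
        // Terminate the last comment record.
        return out.append(recordSeparator).toString();
    }

    public static void main(String[] args) {
        // With "!" as the record separator, the comment is split at "!".
        System.out.println(formatComment("first!second", '#', "!"));
    }
}
```

With this approach every comment record starts with the marker and ends with
the same separator as an ordinary record, whatever that separator is.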

>>
>> Perhaps we could just document the method to warn that using anything
>> other than CR, LF or CRLF will produce an output file that is not
>> parseable?
>>
>
> That sounds like a good approach. But how would you implement that? You
> probably don't want to introduce a dependency on a logging framework just
> for that, do you?

I meant: add a warning to the documentation.
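
Something along these lines, as a rough sketch only. The class
RecordSeparatorDocSketch and the helper isConventionalLineEnding are
hypothetical and exist just to carry the Javadoc wording; the real method in
CSVFormat may look different.

```java
final class RecordSeparatorDocSketch {

    private final String recordSeparator;

    RecordSeparatorDocSketch(String recordSeparator) {
        this.recordSeparator = recordSeparator;
    }

    /**
     * Returns a copy of this sketch that uses the given record separator
     * when printing.
     *
     * <p><strong>Warning:</strong> using anything other than CR, LF or CRLF
     * may produce output that cannot be parsed again with the same format.</p>
     *
     * @param separator the record separator to use for output
     * @return a new instance with the given record separator
     */
    RecordSeparatorDocSketch withRecordSeparator(String separator) {
        return new RecordSeparatorDocSketch(separator);
    }

    String getRecordSeparator() {
        return recordSeparator;
    }

    // Hypothetical helper: true only for the three separators named in the warning.
    static boolean isConventionalLineEnding(String separator) {
        return "\r".equals(separator) || "\n".equals(separator) || "\r\n".equals(separator);
    }

    public static void main(String[] args) {
        RecordSeparatorDocSketch format = new RecordSeparatorDocSketch("\n").withRecordSeparator("!");
        System.out.println(isConventionalLineEnding(format.getRecordSeparator()));
    }
}
```

No logging dependency is needed; the warning lives purely in the Javadoc.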

> Regards,
> Benedikt
>
>
>>
>> > Regards,
>> > Benedikt
>> >
>> >
>> >>
>> >> > Cheers
>> >> >
>> >> > ________________________________
>> >> > From: Benedikt Ritter <britter@xxxxxxxxxx>
>> >> > To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>;
>> >> > brunodepaulak@xxxxxxxxxxxx
>> >> > Sent: Thursday, 23 August 2018 7:10 AM
>> >> > Subject: Re: [CSV] Inconsistent record separator behavior
>> >> >
>> >> >
>> >> >
>> >> > Hi Bruno,
>> >> >
>> >> > On Wed, 22 Aug 2018 at 15:10, Bruno P. Kinoshita
>> >> > <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >>
>> >> >> Will try to look at the code and give a better answer during the
>> >> >> weekend. But risking a silly question, would it mean that users are
>> >> >> not able to parse a CSV unless each CSV row is separated by LF or CRLF?
>> >> >
>> >> >
>> >> > Yes.
>> >> >
>> >> >
>> >> >> I remember getting a CSV from a government website some time ago that
>> >> >> was formatted in a very strange way, and if I remember well it was a
>> >> >> small file, but without LF or CRLF. I think it was using | to separate
>> >> >> the rows, and , for columns.
>> >> >>
>> >> >
>> >> > I didn't know that there are formats that don't use a new line as line
>> >> > separator.
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> A quick search returned at least one other person with a similar issue:
>> >> >>
>> >> >> https://stackoverflow.com/questions/29903202/how-to-read-csv-on-python-with-newline-separator
>> >> >>
>> >> >>
>> >> >> Not sure if I understood the problem well, but in case it makes
>> >> >> sense... my suggestion would be to perhaps confirm if we could change
>> >> >> CSVPrinter.printComment to accept other characters for line ending?
>> >> >>
>> >> >
>> >> > The inconsistency I'm seeing is that we, on the one hand, accept any
>> >> > character sequence as a record separator. Comments are in a way like
>> >> > special records to me. But our implementation seems to put them on a
>> >> > new "line" using the println() method. The println() method in turn
>> >> > uses the record separator to start a new record. So it's not
>> >> > necessarily a new line. Nevertheless, while processing a comment, we
>> >> > look out for CR and LF and then we call println() again. Maybe I'm
>> >> > just not getting it, but it feels pretty messed up :-)
>> >> >
>> >> > Regards,
>> >> > Benedikt
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Bruno
>> >> >>
>> >> >>
>> >> >> ________________________________
>> >> >> From: Benedikt Ritter <britter@xxxxxxxxxx>
>> >> >> To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>
>> >> >> Sent: Tuesday, 21 August 2018 7:13 PM
>> >> >> Subject: [CSV] Inconsistent record separator behavior
>> >> >>
>> >> >>
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >>
>> >> >> we have this strange handling of record separator / line endings in
>> >> >> CSV:
>> >> >>
>> >> >> Users can use whatever character sequence they like as a record
>> >> >> separator. I could for example use the ! character to mark the end of
>> >> >> a record.
>> >> >>
>> >> >> Then we have CSVPrinter.printComment(String). This inserts comments
>> >> >> into a CSV output. It detects CRLF and calls println() on the
>> >> >> CSVFormat, which in turn uses the record separator to indicate a new
>> >> >> record...
>> >> >>
>> >> >> So now I'm thinking: Does it make sense to use anything else but LF
>> >> >> or CRLF as record separator? Maybe we should deprecate
>> >> >> CSVFormat.recordSeparator(String) and introduce a LineEnding enum
>> >> >> where users can choose between LF and CRLF. This way we can make the
>> >> >> behavior between parsing and printing consistent.
>> >> >>
>> >> >>
>> >> >> Thoughts?
>> >> >>
>> >> >> Benedikt
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxx
>> >> >> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxx