Re: [CSV] Inconsistent record separator behavior


On Thu, 23 Aug 2018 at 20:17, sebb <sebbaz@xxxxxxxxx> wrote:

> On 23 August 2018 at 17:31, Benedikt Ritter <britter@xxxxxxxxxx> wrote:
> > Hi,
> >
> > On Thu, 23 Aug 2018 at 12:11, sebb <sebbaz@xxxxxxxxx> wrote:
> >
> >> On 23 August 2018 at 07:10, Benedikt Ritter <britter@xxxxxxxxxx> wrote:
> >> > Hey sebb,
> >> >
> >> > On Thu, 23 Aug 2018 at 01:23, sebb <sebbaz@xxxxxxxxx> wrote:
> >> >
> >> >> On 23 August 2018 at 00:01, Bruno P. Kinoshita
> >> >> <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
> >> >> >
> >> >> >>Maybe I'm just not getting it, but it feels pretty messed up :-)
> >> >> >
> >> >> >
> >> >> > Mutual feeling, and +1 for consistency. From what I understood, users
> >> >> > should be able to parse these crazy CSVs, but if they tried to re-create
> >> >> > them, with comments, then they wouldn't be able to avoid the
> >> >> > println/newline (so it wouldn't be parseable later with the same reader).
> >> >> >
> >> >> >
> >> >> > We probably need a ticket for it to aggregate the discussion and maybe a
> >> >> > possible solution.
> >> >>
> >> >> I'm wondering whether we need to be as flexible when *creating* the CSV
> >> >> files.
> >> >>
> >> >> "Be liberal in what you accept, and conservative in what you send" (Jon
> >> >> Postel)
> >> >>
> >> >> In this case send == create, as it might be sent to other less liberal
> >> >> readers.
> >> >>
> >> >> I don't have a problem with the output being less flexible, so long as
> >> >> it is sufficiently flexible (which I think it likely is already).
> >> >>
> >> >> I don't think consistency is necessary - or even desirable - here.
> >> >>
> >> >
> >> > okay, but wouldn't you expect that you can use a CSVFormat instance to read
> >> > a file that you created with it? This is currently not the case.
> >>
> >> Sorry, I misread the problem.
> >>
> >> Yes, it should be able to read what it writes.
> >>
> >> So the issue remains: should the reader be able to parse the unusual
> >> format, or should the writer not be able to create it?
> >>
> >> I don't have a particular view on that, except that allowing LF and
> >> CRLF only seems too restricting.
> >> We should allow at least CR alone. I don't know whether there are any
> >> other reasonable separators.
> >>
> >
> > As Bruno pointed out, there seem to be formats that have record separators
> > that are not newlines. So maybe CSVPrinter.printComment(String) should not
> > scan for CR and LF but for the record separator.
> >
>
> Makes sense.
>
> >>
> >> Perhaps we could just document the method to warn that using anything
> >> other than CR, LF or CRLF will produce an output file that is not
> >> parseable?
> >>
> >
> > That sounds like a good approach. But how would you implement that? You
> > probably don't want to introduce a dependency on a logging framework just
> > for that, do you?
>
> I meant: add a warning to the documentation.
>

+1 for that! CSVPrinter has almost no class level documentation, so I
wanted to improve that anyway.
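
To make the two options from the thread concrete, here is a rough, simplified model of them. It is not the CSVPrinter code: the class and method names are made up for illustration, the comment marker and record separator are passed in directly, and the record separator is assumed to be non-empty.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Commons CSV.
final class CommentSketch {

    // Models the behavior described in this thread: the comment is broken at
    // CR, LF or CRLF, and every piece is terminated with the record separator.
    static String breakOnNewlines(String comment, char marker, String recordSeparator) {
        StringBuilder out = new StringBuilder();
        for (String line : comment.split("\r\n|\r|\n", -1)) {
            out.append(marker).append(' ').append(line).append(recordSeparator);
        }
        return out.toString();
    }

    // Models the suggested alternative: the comment is broken wherever the
    // record separator itself occurs, so no piece can contain it.
    static String breakOnRecordSeparator(String comment, char marker, String recordSeparator) {
        StringBuilder out = new StringBuilder();
        for (String part : comment.split(Pattern.quote(recordSeparator), -1)) {
            out.append(marker).append(' ').append(part).append(recordSeparator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Record separator "!" and a comment that happens to contain "!".
        System.out.println(breakOnNewlines("a!b", '#', "!"));
        System.out.println(breakOnRecordSeparator("a!b", '#', "!"));
    }
}
```

With "!" as the record separator, the first variant leaves the "!" inside the comment untouched, so a reader that splits on "!" would see "b" as a data record. The second variant starts a new comment at that point instead.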

Benedikt


>
> > Regards,
> > Benedikt
> >
> >
> >>
> >> > Regards,
> >> > Benedikt
> >> >
> >> >
> >> >>
> >> >> > Cheers
> >> >> >
> >> >> > ________________________________
> >> >> > From: Benedikt Ritter <britter@xxxxxxxxxx>
> >> >> > To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>;
> >> >> brunodepaulak@xxxxxxxxxxxx
> >> >> > Sent: Thursday, 23 August 2018 7:10 AM
> >> >> > Subject: Re: [CSV] Inconsistent record separator behavior
> >> >> >
> >> >> >
> >> >> >
> >> >> > Hi Bruno,
> >> >> >
> >> >> > On Wed, 22 Aug 2018 at 15:10, Bruno P. Kinoshita
> >> >> > <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >>
> >> >> >> Will try to look at the code and give a better answer during the weekend.
> >> >> >> But risking a silly question, would it mean that users are not able to
> >> >> >> parse a CSV unless each CSV row is separated by LF or CRLF?
> >> >> >
> >> >> >
> >> >> > Yes.
> >> >> >
> >> >> >
> >> >> >> I remember getting a CSV in a government website some time ago that was
> >> >> >> formatted in a very strange way, and if I remember well it was a small
> >> >> >> file, but without LF or CRLF. I think it was using | to separate the rows,
> >> >> >> and , for columns.
> >> >> >>
> >> >> >
> >> >> > I didn't know that there are formats that don't use a new line as line
> >> >> > separator.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> Quick search returned at least another person with similar issue
> >> >> >>
> >> >> >> https://stackoverflow.com/questions/29903202/how-to-read-csv-on-python-with-newline-separator
> >> >> >>
> >> >> >>
> >> >> >> Not sure if I understood the problem well, but in case it makes sense...
> >> >> >> my suggestion would be to perhaps confirm if we could change
> >> >> >> CSVPrinter.printComment to accept other characters for line ending?
> >> >> >>
> >> >> >
> >> >> > The inconsistency I'm seeing is that we, on the one hand, accept any
> >> >> > character sequence as a record separator. Comments are, in a way, like
> >> >> > special records to me. But our implementation seems to put them on a new
> >> >> > "line" using the println() method. The println() method in turn uses the
> >> >> > record separator to start a new record. So it's not necessarily a new
> >> >> > line. Nevertheless, while processing a comment, we look out for CR and LF
> >> >> > and then we call println() again. Maybe I'm just not getting it, but it
> >> >> > feels pretty messed up :-)
> >> >> >
> >> >> > Regards,
> >> >> > Benedikt
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> Bruno
> >> >> >>
> >> >> >>
> >> >> >> ________________________________
> >> >> >> From: Benedikt Ritter <britter@xxxxxxxxxx>
> >> >> >> To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>
> >> >> >> Sent: Tuesday, 21 August 2018 7:13 PM
> >> >> >> Subject: [CSV] Inconsistent record separator behavior
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >>
> >> >> >> we have this strange handling of record separator / line endings in CSV:
> >> >> >>
> >> >> >> Users can use whatever character sequence they like as a record separator.
> >> >> >> I could for example use the ! character to mark the end of a record.
> >> >> >> Then we have CSVPrinter.printComment(String). This inserts comments into a
> >> >> >> CSV output. It detects CRLF and calls println() on the CSVFormat, which in
> >> >> >> turn uses the record separator to indicate a new record...
> >> >> >>
> >> >> >> So now I'm thinking: Does it make sense to use anything else but LF or CRLF
> >> >> >> as record separator? Maybe we should deprecate
> >> >> >> CSVFormat.recordSeparator(String) and introduce a LineEnding enum where
> >> >> >> users can choose between LF and CRLF. This way we can make the behavior
> >> >> >> between parsing and printing consistent.
> >> >> >>
> >> >> >>
> >> >> >> Thoughts?
> >> >> >>
> >> >> >> Benedikt
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxx
> >> >> >> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxx
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>
>
>
>