
Re: [CSV] Inconsistent record separator behavior


Hi,

On Thu, 23 Aug 2018 at 12:11, sebb <sebbaz@xxxxxxxxx> wrote:

> On 23 August 2018 at 07:10, Benedikt Ritter <britter@xxxxxxxxxx> wrote:
> > Hey sebb,
> >
> > On Thu, 23 Aug 2018 at 01:23, sebb <sebbaz@xxxxxxxxx> wrote:
> >
> >> On 23 August 2018 at 00:01, Bruno P. Kinoshita
> >> <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
> >> >
> >> >>Maybe I'm just not getting it, but it feels pretty messed up :-)
> >> >
> >> >
> >> > Mutual feeling, and +1 for consistency. From what I understood, users
> >> > should be able to parse these crazy CSVs, but if they tried to
> >> > re-create them, with comments, then they wouldn't be able to avoid the
> >> > println/newline (so it wouldn't be parseable later with the same
> >> > reader).
> >> >
> >> >
> >> > We probably need a ticket for it to aggregate the discussion and
> >> > maybe a possible solution.
> >>
> >> I'm wondering whether we need to be as flexible when *creating* the CSV
> >> files.
> >>
> >> "Be liberal in what you accept, and conservative in what you send" (Jon
> >> Postel)
> >>
> >> In this case send == create, as it might be sent to other less liberal
> >> readers.
> >>
> >> I don't have a problem with the output being less flexible, so long as
> >> it is sufficiently flexible (which I think it likely is already).
> >>
> >> I don't think consistency is necessary - or even desirable - here.
> >>
> >
> > Okay, but wouldn't you expect that you can use a CSVFormat instance to
> > read a file that you created with it? This is currently not the case.
>
> Sorry, I misread the problem.
>
> Yes, it should be able to read what it writes.
>
> So the issue remains: should the reader be able to parse the unusual
> format, or should the writer not be able to create it?
>
> I don't have a particular view on that, except that allowing only LF and
> CRLF seems too restrictive.
> We should allow at least CR alone. I don't know whether there are any
> other reasonable separators.
>

As Bruno pointed out, there seem to be formats whose record separators are
not newlines. So maybe CSVPrinter.printComment(String) should not scan for
CR and LF but for the record separator.
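
A minimal sketch of that idea, using only the JDK. This is not the actual
CSVPrinter code: the class name CommentSketch and the static printComment
helper are hypothetical, and the comment marker and record separator are
passed in directly instead of coming from a CSVFormat. Every occurrence of
the record separator inside the comment text starts a new comment line, so
no part of the comment can end up looking like a data record.

```java
import java.util.regex.Pattern;

public class CommentSketch {

    /**
     * Hypothetical helper: renders a comment so that each piece between
     * record separators is prefixed with the comment marker and a space,
     * and terminated by the record separator.
     */
    static String printComment(String comment, char commentMarker, String recordSeparator) {
        if (recordSeparator == null || recordSeparator.isEmpty()) {
            throw new IllegalArgumentException("record separator must not be empty");
        }
        StringBuilder out = new StringBuilder();
        // Split on the literal separator; limit -1 keeps trailing empty pieces.
        String[] parts = comment.split(Pattern.quote(recordSeparator), -1);
        for (String part : parts) {
            out.append(commentMarker).append(' ').append(part).append(recordSeparator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Record separator "!" and comment marker '#', as in the example from this thread.
        System.out.println(printComment("first!second", '#', "!"));
    }
}
```

Note that a CR or LF inside the comment is left untouched by this sketch when
the record separator is something else; whether the reader copes with that is
a separate question.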


>
> Perhaps we could just document the method to warn that using anything
> other than CR, LF or CRLF will produce an output file that is not
> parseable?
>

That sounds like a good approach. But how would you implement that? You
probably don't want to introduce a dependency on a logging framework just
for that, do you?
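
One option that needs no logging framework would be to leave the decision to
the caller. The sketch below is an assumption for illustration only: the
class RecordSeparatorCheck and the method isLineBreak are hypothetical names,
not part of Commons CSV. It simply reports whether a record separator is one
of the three values named above (CR, LF, CRLF).

```java
public class RecordSeparatorCheck {

    /**
     * Hypothetical helper: true only for CR, LF or CRLF, the separators
     * named in this thread as producing parseable output with comments.
     */
    static boolean isLineBreak(String recordSeparator) {
        return "\r".equals(recordSeparator)
                || "\n".equals(recordSeparator)
                || "\r\n".equals(recordSeparator);
    }

    public static void main(String[] args) {
        // A caller could check this before writing comments and react as it sees fit.
        System.out.println(isLineBreak("\r\n"));
        System.out.println(isLineBreak("!"));
    }
}
```

The caller can then throw, skip the comment, or ignore the result, so the
library itself would only need the Javadoc warning.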

Regards,
Benedikt


>
> > Regards,
> > Benedikt
> >
> >
> >>
> >> > Cheers
> >> >
> >> > ________________________________
> >> > From: Benedikt Ritter <britter@xxxxxxxxxx>
> >> > To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>;
> >> brunodepaulak@xxxxxxxxxxxx
> >> > Sent: Thursday, 23 August 2018 7:10 AM
> >> > Subject: Re: [CSV] Inconsistent record separator behavior
> >> >
> >> >
> >> >
> >> > Hi Bruno,
> >> >
> >> > On Wed, 22 Aug 2018 at 15:10, Bruno P. Kinoshita
> >> > <brunodepaulak@xxxxxxxxxxxx.invalid> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >>
> >> >> Will try to look at the code and give a better answer during the
> >> >> weekend. But risking a silly question, would it mean that users are
> >> >> not able to parse a CSV unless each CSV row is separated by LF or
> >> >> CRLF?
> >> >
> >> >
> >> > Yes.
> >> >
> >> >
> >> >> I remember getting a CSV from a government website some time ago that
> >> >> was formatted in a very strange way, and if I remember well it was a
> >> >> small file, but without LF or CRLF. I think it was using | to separate
> >> >> the rows, and , for columns.
> >> >>
> >> >
> >> > I didn't know that there are formats that don't use a new line as line
> >> > separator.
> >> >
> >> >
> >> >>
> >> >>
> >> >> A quick search returned at least one other person with a similar issue:
> >> >>
> >> >> https://stackoverflow.com/questions/29903202/how-to-read-csv-on-python-with-newline-separator
> >> >>
> >> >>
> >> >> Not sure if I understood the problem well, but in case it makes
> >> >> sense... my suggestion would be to perhaps confirm if we could change
> >> >> CSVPrinter.printComment to accept other characters for line ending?
> >> >>
> >> >
> >> > The inconsistency I'm seeing is that we, on the one hand, accept any
> >> > character sequence as a record separator. Comments are, in a way, like
> >> > special records to me. But our implementation seems to put them on a
> >> > new "line" using the println() method. The println() method in turn
> >> > uses the record separator to start a new record. So it's not
> >> > necessarily a new line. Nevertheless, while processing a comment, we
> >> > look out for CR and LF and then we call println() again. Maybe I'm
> >> > just not getting it, but it feels pretty messed up :-)
> >> >
> >> > Regards,
> >> > Benedikt
> >> >
> >> >
> >> >
> >> >>
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Bruno
> >> >>
> >> >>
> >> >> ________________________________
> >> >> From: Benedikt Ritter <britter@xxxxxxxxxx>
> >> >> To: Commons Developers List <dev@xxxxxxxxxxxxxxxxxx>
> >> >> Sent: Tuesday, 21 August 2018 7:13 PM
> >> >> Subject: [CSV] Inconsistent record separator behavior
> >> >>
> >> >>
> >> >>
> >> >> Hi,
> >> >>
> >> >>
> >> >> we have this strange handling of record separator / line endings in
> >> >> CSV:
> >> >>
> >> >> Users can use whatever character sequence they like as a record
> >> >> separator. I could for example use the ! character to mark the end of
> >> >> a record. Then we have CSVPrinter.printComment(String). This inserts
> >> >> comments into a CSV output. It detects CRLF and calls println() on the
> >> >> CSVFormat, which in turn uses the record separator to indicate a new
> >> >> record...
> >> >>
> >> >> So now I'm thinking: Does it make sense to use anything else but LF or
> >> >> CRLF as record separator? Maybe we should deprecate
> >> >> CSVFormat.recordSeparator(String) and introduce a LineEnding enum
> >> >> where users can choose between LF and CRLF. This way we can make the
> >> >> behavior between parsing and printing consistent.
> >> >>
> >> >>
> >> >> Thoughts?
> >> >>
> >> >> Benedikt
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxx
> >> >> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxx
> >> >
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >>
>
>
>