Confusing textwrap parameters, and request for RE help


On 22. 3. 2020 at 20:02, Chris Angelico <rosuav at gmail.com> wrote:
>
> When using textwrap.fill() or friends, setting break_long_words=False
> without also setting break_on_hyphens=False has the very strange
> behaviour that a long hyphenated word will still be wrapped. I
> discovered this as a very surprising result when trying to wrap a
> paragraph that contained a URL, and wanting the URL to be kept
> unchanged:
> [...]
>
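
The behaviour described in this first point can be reproduced with a
minimal sketch. The URL and width below are made-up values for
illustration, not the ones from the original message:

```python
import textwrap

# Hypothetical URL, chosen only because it has hyphens between letters.
URL = "https://example.com/some-long-hyphenated-path-to-a-page"
TEXT = "See " + URL + " for details."

# break_long_words=False alone: break_on_hyphens still defaults to True,
# so the wrapper may split the URL after one of its hyphens.
default_hyphens = textwrap.wrap(TEXT, width=30, break_long_words=False)

# With both options disabled, whitespace is the only break point left,
# so the URL stays on one (overlong) line.
no_hyphens = textwrap.wrap(
    TEXT, width=30, break_long_words=False, break_on_hyphens=False
)

for label, lines in (("default", default_hyphens), ("no hyphens", no_hyphens)):
    print(label)
    for line in lines:
        print("   ", line)
```

Passing both options keeps the URL intact, at the cost of also
disabling hyphen breaks in ordinary prose, which is presumably why a
URL-specific rule is being asked for.
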
> Second point, and related to the above. The regex that defines break
> points, as found in the source code, is:
>
> wordsep_re = re.compile(r'''
>         ( # any whitespace
>           %(ws)s+
>         | # em-dash between words
>           (?<=%(wp)s) -{2,} (?=\w)
>         | # word, possibly hyphenated
>           %(nws)s+? (?:
>             # hyphenated word
>               -(?: (?<=%(lt)s{2}-) | (?<=%(lt)s-%(lt)s-))
>               (?= %(lt)s -? %(lt)s)
>             | # end of word
>               (?=%(ws)s|\Z)
>             | # em-dash
>               (?<=%(wp)s) (?=-{2,}\w)
>             )
>         )''' % {'wp': word_punct, 'lt': letter,
>                 'ws': whitespace, 'nws': nowhitespace},
>         re.VERBOSE)
>
> It's built primarily out of small matches with long assertions, eg
> "match a hyphen, as long as it's preceded by two letters or a letter
> and a hyphen". What I want to do is create a *negative* assertion:
> specifically, to disallow any breaking between "\b[a-z]+://" and "\b",
> which will mean that a URL will never be broken ("https://.........."
> until the next whitespace boundary). Regex assertions of this form
> have to be fixed lengths, though, so as described, this isn't
> possible. Regexperts, any ideas? How can I modify this to never break
> inside a URL?
> [...]
>
> ChrisA
>
Hi,
I might be missing something obvious, but it seems to me that the
third-party regex library might help with your originally presented
approach:
https://pypi.org/project/regex/
https://bitbucket.org/mrabarnett/mrab-regex/

It supports variable-length lookbehind assertions (among many other
extra features), so you could make textwrap or other code use it with
a tweaked regex pattern. However, I can't say whether that is
sufficient to achieve the needed functionality.
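
A minimal sketch of the difference, assuming the regex package is
installed (the check is skipped otherwise). PATTERN and
stdlib_accepts are names made up for this illustration; the pattern
only finds hyphens that follow a URL scheme and is not a drop-in
replacement for wordsep_re:

```python
import re

# Hypothetical pattern: a hyphen preceded, at any distance within the
# same whitespace-free run, by something like "https://".
PATTERN = r"(?<=\b[a-z]+://\S*)-"


def stdlib_accepts(pattern):
    """Return True if the standard-library re module compiles the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # The stdlib re module rejects look-behinds that are not fixed-width.
        return False
    return True


print("variable-width look-behind:", stdlib_accepts(PATTERN))
print("fixed-width look-behind:   ", stdlib_accepts(r"(?<=[a-z]{2})-"))

try:
    import regex  # third-party package from PyPI, not part of the stdlib
except ImportError:
    regex = None

if regex is not None:
    sample = "a-b https://x.org/c-d"
    # Offsets of hyphens that sit inside a URL according to PATTERN.
    print([m.start() for m in regex.finditer(PATTERN, sample)])
```

Whether such an assertion can be worked into the textwrap pattern
without side effects is a separate question that this sketch does not
answer.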

Regards,
         vbr