Confusing textwrap parameters, and request for RE help


On 25/03/20 10:30 AM, Chris Angelico wrote:
> On Wed, Mar 25, 2020 at 8:04 AM DL Neil via Python-list
> <python-list at python.org> wrote:
>>
>> On 23/03/20 8:00 AM, Chris Angelico wrote:
>>> When using textwrap.fill() or friends, setting break_long_words=False
>>> without also setting break_on_hyphens=False has the very strange
>>> behaviour that a long hyphenated word will still be wrapped. I
>>> discovered this as a very surprising result when trying to wrap a
>>> paragraph that contained a URL, and wanting the URL to be kept
>>> unchanged:
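
The behaviour described above can be reproduced with a short sketch. The URL, the sample sentence and the width of 30 are made up for illustration; only the two textwrap flags come from the thread.

```python
import textwrap

# Hypothetical URL and text, chosen only to trigger the behaviour.
url = "https://example.com/some-long-hyphenated-path-name"
text = f"See {url} for details."

# break_long_words=False on its own: the URL is still split at its hyphens,
# because break_on_hyphens defaults to True.
hyphen_split = textwrap.wrap(text, width=30, break_long_words=False)

# Both flags off: the URL survives intact, but no other word in the text
# may be broken at a hyphen either.
kept_whole = textwrap.wrap(
    text, width=30, break_long_words=False, break_on_hyphens=False
)

print("\n".join(hyphen_split))
print("---")
print("\n".join(kept_whole))
```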

>> Your idea of sub-classing (as I'm sure YOU know, textwrap is but a
>> convenience-function) struck me as clever-thinking! Could textwrap's
>> 'final format' be caught just before 'return', enabling a post-process
>> to undo anything textwrap has done, and (re-)format the URLs to spec, or
>> to treat textwrap's output as a template and 'inject' the URL
>> appropriately? If not a sub-class, a decorator?
> 
> Hmmmmmm. Very VERY interesting idea, and one I hadn't thought of. Thank you.

A pleasure to be able to offer even such a small 'something', by way of 
return!
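
One way the 'template and inject' idea might look, as a rough sketch rather than a finished solution: mask each URL with a same-length run of a placeholder character, let textwrap do its normal work, then put the URLs back. The name wrap_keeping_urls, the crude URL pattern and the choice of a private-use character as placeholder are all assumptions made for this example.

```python
import re
import textwrap

# Assumption: a crude URL pattern based on the "[a-z]+://" hint in the thread.
URL_RE = re.compile(r"[a-z]+://\S+")
# Assumption: this private-use character never occurs in the text itself.
PLACEHOLDER = "\ue000"
PLACEHOLDER_RUN = re.compile(re.escape(PLACEHOLDER) + "+")


def wrap_keeping_urls(text, width=70, **kwargs):
    """Wrap text, treating each URL as one unbreakable word (hypothetical helper)."""
    urls = URL_RE.findall(text)
    # Same-length stand-ins keep the width arithmetic honest; they contain
    # no hyphens, so break_on_hyphens can stay on for ordinary words.
    masked = URL_RE.sub(lambda m: PLACEHOLDER * len(m.group()), text)
    # kwargs may carry initial_indent/subsequent_indent, but must not
    # repeat break_long_words.
    lines = textwrap.wrap(masked, width=width, break_long_words=False, **kwargs)
    # Placeholder runs come back in their original order, one per URL.
    pending = iter(urls)
    return [PLACEHOLDER_RUN.sub(lambda m: next(pending), line) for line in lines]


url = "https://example.com/some-long-hyphenated-path-name"
text = f"See {url} for details."
lines = wrap_keeping_urls(text, width=30)
print("\n".join(lines))
```

A URL longer than the width still ends up on a line of its own, but a short one shares its line with the surrounding words.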


>> My idea (being more simple-minded than you!), would be to partition the
>> text (yes, am alluding to the Python str.method):
>> - textwrap the 'early text',
>> - treat the URL as a string using the required convention,
>> - textwrap the 'later text', and
>> - str.join() the three components/partitions afterwards.
>>
>> Both likely 'force' the URL to occupy a line of its own, and thus create
>> some odd-looking results!
> 
> The use-case here is a Twitter client I'm building. It works in the
> terminal. I would very much like NOT to build any sort of GUI for it.

+1


> Since tweets often contain URLs, it's important to render them
> correctly. The display style is to have "@Username: " at the start of
> the first line, and the same number of spaces on subsequent lines,
> which creates a very readable display. (Also, quoting retweets show
> the original tweet indented underneath, giving a bit more
> indentation.) Unfortunately, forcing every URL onto its own line would
> make the display a bit too vertical for my liking; often there's a
> very short URL (eg someone's Twitch link) that doesn't want to be
> split across.
> 
> Actually the ultimate solution would be the not-yet-standardized
> protocol for effectively showing hypertext on the console, where the
> abbreviated text could actually be made clickable as the full text.
> But that requires more help from the terminal emulator, unless I'm
> just misreading the examples (it's supposed to work in gnome-terminal
> but I couldn't get it to behave). Or alternatively, as mentioned, a
> way to say to the terminal emulator, "please indent this text by at
> least this amount" (where the amount would most likely be specified as
> a number of characters, but in theory could be millimeters instead).
> 
> For now, though, all I can do is rewrap URLs. And at the moment, what
> I've done is just block all long words from being wrapped AND block
> words from being wrapped at hyphens, when really what I actually want
> is to say "https://......." becomes unbreakable, and leave all the
> rest unchanged. Hence the question about the regex.
> 
> Currently, the regex splits a long line of text into a series of
> words, including hyphen points. What I want is to assert "this ain't a
> word split point, because the word starts [a-z]+:// and is thus a
> URL". But that might be beyond the flexibility of REs.
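
It may be within reach of REs after all. A tentative sketch, leaning on implementation details: in CPython, TextWrapper keeps its word-splitting pattern in the class attribute wordsep_re and feeds it to re.split(), discarding empty or None pieces. Neither is part of the documented API, so this could break between versions. The subclass name URLAwareWrapper and the URL pattern are inventions for this example.

```python
import re
import textwrap


class URLAwareWrapper(textwrap.TextWrapper):
    """Hypothetical subclass: try a URL alternative before the stock pattern."""

    # Assumption: wordsep_re is an undocumented CPython detail, compiled with
    # re.VERBOSE. A URL matched by the first group becomes a single chunk,
    # so the hyphen rules in the stock pattern never see inside it.
    wordsep_re = re.compile(
        r"([a-z]+://\S+)|" + textwrap.TextWrapper.wordsep_re.pattern,
        re.VERBOSE,
    )


url = "https://example.com/some-long-hyphenated-path-name"
text = f"See {url} for details."
wrapper = URLAwareWrapper(width=30, break_long_words=False)
lines = wrapper.wrap(text)
print("\n".join(lines))
```

break_on_hyphens is left at its default here, so ordinary hyphenated words remain breakable while anything starting [a-z]+:// is kept whole.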


As you observe, the problem with terminal emulators is the extent of 
their emulation and the degree of adoption of their 'extended features'!

My concern grows because of the need (I assume) for the URL to be 
'clickable', not merely 'complete' and 'unadorned' (no added hyphen 
rendering it useless).

Despite the almost-irresistible urge to prove that I'm a class-y guy 
[albeit with a warped sense of humor], I'm warming to the 'partition' 
suggestion:
- the output from textwrap.wrap() is a list of strings,
- we are talking about fixed-length lines denominated in characters,
- locating a[ll] URI (from a sub-set of schemes, eg "http", "https", 
...) is a well-worn path.

Thus (simplified to assume the presence of exactly one URI!!!):
- textwrap the first partition
- len( URI )
- it becomes trivial to ascertain whether the URI can be appended to the 
last line (of the first 'wrapped' partition)
- failing that, the URI must start a new line (by defn)
- if it is 'too long', ie would be wrapped by textwrap, treat it separately,
- conversely, prepend the URI to the third partition,
- text-wrap the third partition,
- assemble the tweet-display.
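
The steps above might be sketched roughly as follows, under the same exactly-one-URI simplification. The function name wrap_with_one_url and the URL pattern are assumptions, and the sketch accepts the 'odd-looking results' already admitted: after a URI that was appended to a line, or that was too long for any line, the remaining text starts on a fresh line.

```python
import re
import textwrap

# Assumption: a crude URL pattern; the text holds at most one match.
URL_RE = re.compile(r"[a-z]+://\S+")


def wrap_with_one_url(text, width=70):
    """Partition text around its single URL and wrap the parts (hypothetical)."""
    match = URL_RE.search(text)
    if match is None:
        return textwrap.wrap(text, width=width)
    before = text[:match.start()]
    url = match.group()
    after = text[match.end():].strip()

    lines = textwrap.wrap(before, width=width)
    if lines and len(lines[-1]) + 1 + len(url) <= width:
        # The URL fits on the last line of the first partition.
        lines[-1] += " " + url
        return lines + textwrap.wrap(after, width=width)
    if len(url) > width:
        # Too long for any line: give it a line of its own.
        return lines + [url] + textwrap.wrap(after, width=width)
    # Otherwise the URL starts a new line, shared with the third partition.
    # Hyphen breaks are switched off here so the URL stays whole, which
    # also affects the rest of the text.
    rest = (url + " " + after).strip()
    return lines + textwrap.wrap(
        rest, width=width, break_long_words=False, break_on_hyphens=False
    )


url = "https://example.com/some-long-hyphenated-path-name"
text = f"See {url} for details."
narrow = wrap_with_one_url(text, width=30)
wide = wrap_with_one_url(text, width=60)
print("\n".join(narrow))
print("---")
print("\n".join(wide))
```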

Opinion to the contrary, I am not a twit. I fear missing something in 
those subtleties...
-- 
Regards =dn