Suggestions on mechanism or existing code - maintain persistence of file download history
On 30/01/20 9:35 PM, R.Wieser wrote:
>> MRAB's scheme does have the disadvantages to me that Chris has pointed
> Nothing that can't be countered by keeping copies of the last X number of
> to-be-downloaded-URLs files.
That's a good idea, but how would the automated system 'know' to give up
on the current file and utilise generation n-1? Unable to open the file
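
For what it's worth, one way the system might 'know' is to try each
generation in turn and accept the first that reads cleanly. A rough sketch
only: load_newest_good and looks_like_url are made-up names, the "every
line is an http(s) URL" test is my assumption about what 'not corrupt'
means, and an empty file is (arbitrarily) treated as damaged.

```python
from pathlib import Path
from urllib.parse import urlsplit


def looks_like_url(line):
    """Crude validity test (an assumption): http(s) scheme plus a host."""
    try:
        parts = urlsplit(line)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def load_newest_good(generations):
    """Return (path, urls) for the first generation that reads cleanly.

    `generations` is ordered newest first: current file, then n-1, n-2...
    """
    for path in generations:
        try:
            lines = Path(path).read_text(encoding="utf-8").splitlines()
        except (OSError, UnicodeDecodeError):
            continue  # missing or unreadable: fall back to the older copy
        urls = [ln for ln in lines if ln.strip()]
        if urls and all(looks_like_url(u) for u in urls):
            return path, urls
    raise RuntimeError("no usable generation found")
```

Whether a cruder or stricter test is appropriate depends on the spec.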
> As for rewriting every time, you will /have/ to write something for every
> action (and flush the file!), if you think you should be able to ctrl-c (or
> worse) out of the program.
Which is the nub of the problem!
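
To make the "write something for every action (and flush the file!)" point
concrete, a minimal sketch, assuming a plain-text log with one URL per
line. record_done and log_path are illustrative names, not anything from
the OP's code.

```python
import os


def record_done(log_path, url):
    """Append one URL and push it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(url + "\n")
        log.flush()             # Python's buffer -> operating system
        os.fsync(log.fileno())  # operating system -> storage device
```

The fsync is what costs the time, so calling this once per download is a
trade-off between speed and how much can be lost on a hard stop.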
Using ctrl+c is a VERY BAD idea. Depending upon the sophistication of
the solution/existing code, surely there is another way...
Even closing/pulling-out the networking connection to cause an exception
within Python, would enable management of a more 'clean' and 'data safe'
(see also 'sledgehammer to crack a nut')
Why do you need to abandon the process mid-way?
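
If stopping mid-way really is needed, one possible 'other way' (a sketch
under assumptions, not a recommendation for the OP's code) is to trap the
interrupt, set a flag, and only stop between downloads, so that no record
is left half-written. request_stop, install_handler, process_all and the
fetch callable are all hypothetical names.

```python
import signal

stop_requested = False


def request_stop(signum=None, frame=None):
    """Signal handler: note the request and let the current item finish."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Route ctrl+c to request_stop; call once from the main thread."""
    signal.signal(signal.SIGINT, request_stop)


def process_all(urls, fetch):
    """Fetch each URL in turn; return those completed before a stop."""
    done = []
    for url in urls:
        if stop_requested:
            break  # stop only at a boundary between downloads
        fetch(url)
        done.append(url)
    return done
```

The same loop would work with any other trigger that sets the flag, eg
checking for a 'stop' file between downloads.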
> But, you could opt to write this session's successfully downloaded URLs to a
> separate file, and only merge that with the original one on program start.
> That together with an integrity check of the separate file (eventually on a
> line-by-line (URL) basis) should make the original file's corruption rather
What is the OP's definition of "unlikely" or "acceptable risk"?
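
For illustration, the quoted 'separate session file plus line-by-line
integrity check' idea might look something like the following. This is a
sketch only: the record layout (CRC32, tab, URL) and the names encode,
decode and merge_session are my assumptions, not anything proposed in the
thread.

```python
import zlib


def encode(url):
    """One record per line: CRC32 of the URL, a tab, then the URL."""
    return f"{zlib.crc32(url.encode('utf-8')):08x}\t{url}"


def decode(line):
    """Return the URL if the line is intact, else None."""
    checksum, sep, url = line.rstrip("\n").partition("\t")
    if sep and checksum == f"{zlib.crc32(url.encode('utf-8')):08x}":
        return url
    return None


def merge_session(main_urls, session_lines):
    """Fold intact session records into the main set; count the rejects."""
    merged, rejected = set(main_urls), 0
    for line in session_lines:
        url = decode(line)
        if url is None:
            rejected += 1  # damaged or truncated line: leave it out
        else:
            merged.add(url)
    return merged, rejected
```

A rejected URL is simply downloaded again next time, which may or may not
fall within "acceptable risk".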
If RDBMS == "unnecessary complexity", then (presumably) 'concern' will
be commensurately low, and much of the discussion to-date, moot?
I've not worked on 'downloads' (which I take to mean data files, eg
forms from the tax office - guess what task I'm procrastinating over?)
but have automated the downloading of web page content/headers. There
are so many reasons why such won't work first-time, when they should
every time; that it may be quite difficult to detect 'corruption' (as
distinct from so many of these other issues that may arise)...
> A database /sounds/ good, but what happens when you ctrl-c out of a
> non-atomic operation ? How do you fix that ? IOW: Databases can be
> corrupted for pretty-much the same reason as for a simple datafile (but with
> much worse consequences).
[apologies for personal comment]
I, (with my skill-set, tool-set, collection of utilities, ... - see
earlier mention of "bias") reach for an RDBMS more quickly than many*.
Mea culpa or 'more power to [my] right arm'?
The DB suggestion (posted earlier) involved only a single table, to
which fields would be added/populated during processing as a record of
progress/status. Thus, replacing the single file that the OP
(originally) outlined as fitting his/her needs, with a single DB-table.
Accordingly, there is no non-atomic transaction in the proposal - UPDATE
is atomic in most (competent) RDBMS.
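
A minimal sketch of that single-table idea, assuming SQLite via Python's
standard sqlite3 module (my choice for illustration - the earlier post did
not name an RDBMS). The table name, the column names and the
'pending'/'done' values are likewise invented.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the one-table download history."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def add_url(conn, url):
    """Queue a URL; a URL already in the table is left untouched."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT OR IGNORE INTO downloads (url) VALUES (?)", (url,))


def mark_done(conn, url):
    """Record progress with a single UPDATE inside its own transaction."""
    with conn:
        conn.execute("UPDATE downloads SET status = 'done' WHERE url = ?", (url,))


def pending(conn):
    """URLs still to be fetched, eg after a restart."""
    rows = conn.execute(
        "SELECT url FROM downloads WHERE status = 'pending' ORDER BY url"
    )
    return [row[0] for row in rows]
```

After an interruption, pending() gives the list to resume from; the
wrestling with half-finished writes is left to the 'black box'.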
(again, in my ignorance of that project, please don't (anyone) think I'm
Contrarily, if the 'single table idea' is hardly a "database" by most
definitions, why bother? The answer lies in the very mechanisms to
combat corruptions and interruptions being discussed! As a
fundamentally-lazy person, I'd rather leave the RDBMS-coders to wrestle
with such complexities 'for me'. Then, I can 'stand on the shoulders' of
such 'giants', by driving their (competently working) 'black box'...
Now, it transpires, the OP possesses DB skills. So, (s)he is in a
position to make the go/no-go decision which suits the actual spec. Yahoo!
> Also think of the old adage: "I had a problem, and then I thought I could
> use X. Now I have two problems..." - with X traditionally being "regular
> expressions". In other words: do KISS (keep it ....)
Good point! (I'm not a great fan of RegEx-es either)
- reduce/avoid complexity, "simple is better than complex"! (Python:
Surely though, it is only appropriate to dive into the concerns and
complexities of DB accuracy and "consistency", if we do likewise with
The rationale of my 'laziness' argument 'for' using an RDBMS, also
applies to plain-vanilla file systems. Do I want to deal with the
complexities of managing files and corruptions, in that arena?
(you could easily guess the answer to that!)
(the answer may be quite different - but no matter, I'm not going to say
you are "wrong", as long as in making such a decision (files?DB) we
compare 'like with like' - in fact, before that: as long as the client's
spec says that we need to be worrying about such detail!)
(otherwise YAGNI applies!)
> By the way: The "just write the URLs in a folder" method is not at all a bad
> one. /Very/ easy to maintain, resilient (especially when you consider the
> self-repairing capabilities of some filesystems) and the polar opposite of a
> "customer lock-in". :-)
Be aware that formation rules for URLs are not congruent with OS FS rules!
(such concerns don't apply if the URLs are data within a file/table)
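
If the folder method were chosen anyway, one way around the mismatch (a
sketch, with url_to_filename and the '.url' suffix being my inventions) is
to name each file after a hash of the URL rather than the URL itself:

```python
import hashlib


def url_to_filename(url):
    """Map any URL to a fixed-length name made only of hex digits."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".url"
```

The hash cannot be reversed, so the URL itself would have to be written
inside the file if it is ever to be read back.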
* was astonished to discover (a show-of-hands poll at some conference or
other) that 'the average applications programmer' dislikes SQL/RDBMS and
would rather have 'someone else' handle that side of things. Most of
those ascribed their attitude to not having been able to 'get [their]
heads around SQL' - which left me baffled because I 'just see it'.
However, my mental processes have been queried (more than once)! Upon
reflection, this 'discovery' made me happy - found me another niche to