Suggestions on mechanism or existing code - maintain persistence of file download history


On Thu, 30 Jan 2020 07:26:36 +1100
Chris Angelico <rosuav at gmail.com> wrote:

> On Thu, Jan 30, 2020 at 7:06 AM jkn <jkn_gg at nicorp.f9.co.uk> wrote:

> > The situation is this - I have a long list of file URLs and want to
> > download these as a 'background task'. I want this process to be
> > 'crudely persistent' - you can CTRL-C out, and next time you run
> > things it will pick up where it left off.

> A decent project. I've done this before but in restricted ways.

> > The download part is not difficult. It is the persistence bit I am
> > thinking about.  It is not easy to tell the name of the downloaded
> > file from the URL.

Where do the names of the downloaded files come from now, and why can't
that same algorithm be used later to determine the existence of the
file?  How much control do you have over this algorithm (which leads to
what ChrisA suggested)?

> > I could have a file with all the URLs listed and work through each
> > line in turn.  But then I would have to rewrite the file (say, with
> > the previously-successful lines commented out) as I go.

Files have that problem.  Other solutions, e.g., a sqlite3 database,
don't.  Also, a database might give you a place to store other
information about the URL, such as the name of the associated file.
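
To make the database idea concrete, here is a minimal sketch using the
standard-library sqlite3 module.  The table layout and the function
names (open_history, add_urls, pending, mark_done) are my own
assumptions for illustration, not anything from the original program.

```python
import sqlite3


def open_history(path):
    """Open (or create) the download-history database at `path`."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per URL, plus the saved file name
    # and a flag recording whether the download finished.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        " url TEXT PRIMARY KEY,"
        " filename TEXT,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def add_urls(conn, urls):
    """Register URLs; ones already present are left untouched."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO downloads (url) VALUES (?)",
            ((u,) for u in urls),
        )


def pending(conn):
    """Return the URLs not yet marked done, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM downloads WHERE done = 0 ORDER BY rowid"
    )
    return [row[0] for row in rows]


def mark_done(conn, url, filename):
    """Record a successful download and the name it was saved under."""
    with conn:
        conn.execute(
            "UPDATE downloads SET done = 1, filename = ? WHERE url = ?",
            (filename, url),
        )
```

The main loop would then iterate over pending(conn), download each URL,
and call mark_done() only after the file is safely on disk.  A CTRL-C
in the middle of a download leaves that row with done = 0, so the next
run picks it up again.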

> Hmm. The easiest way would be to have something from the URL in the
> file name. For instance, you could hash the URL and put the first few
> digits of the hash in the file name, so
> http://some.domain.example/some/path/filename.html might get saved
> into "a39321604c - filename.html". That way, if you want to know if
> it's been downloaded already, you just hash the URL and see if any
> file begins with those digits.

> Would that kind of idea work?
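
For what it's worth, ChrisA's hashing idea could be sketched roughly as
below.  The choice of SHA-256, the ten-digit prefix, the "index"
fallback name, and the helper names are all my assumptions; the post
above doesn't say which hash produced its example prefix.

```python
import hashlib
import os
import posixpath
from urllib.parse import urlsplit


def url_prefix(url, digits=10):
    """Return the first `digits` hex digits of a hash of the URL."""
    # SHA-256 is an arbitrary choice here; any stable hash would do.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:digits]


def local_name(url):
    """Build a file name of the form '<prefix> - <last path component>'."""
    base = posixpath.basename(urlsplit(url).path) or "index"
    return f"{url_prefix(url)} - {base}"


def already_downloaded(url, directory):
    """True if any file in `directory` starts with this URL's prefix."""
    prefix = url_prefix(url)
    return any(name.startswith(prefix) for name in os.listdir(directory))
```

One caveat with this sketch: a file that was only partly written when
you hit CTRL-C would still match, so you would probably want to
download to a temporary name and rename on completion.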

Dan

-- 
"Atoms are not things." - Werner Heisenberg
Dan Sommers, http://www.tombstonezero.net/dan