web scraper


On 10/25/19 9:19 AM, joseph pareti wrote:
> but can it be generalized?
> Not all tags are in the form of <a class --->  <div class, so is it doable
> to just replace those tags in the code, should
> one process a different website?

Not really, no.  There is no easy way to generalize this sort of web
scraping.  HTML tags can be used in many different ways, and each web
site uses its own scheme for defining CSS ids and classes.  In practice
a scraper has to be customized for each website, and it's prone to
breaking because the site can change its markup at any time.
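
To make that concrete, here is a rough sketch using only the standard
library's html.parser.  The two HTML snippets, the class names, and the
RULES table are all made up for illustration; they are not taken from any
real site or from the code earlier in this thread.  The point is only
that the (tag, class) pair has to be chosen by hand for each site.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <tag class="cls"> element."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close correctly.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.cls in classes:
                self.depth = 1
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, tag, cls):
    parser = ClassTextExtractor(tag, cls)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical markup from two sites that publish the same kind of data.
SITE_A = '<div class="listing"><a class="title" href="/1">Red bike</a></div>'
SITE_B = '<ul><li><span class="item-name">Red bike</span></li></ul>'

# Hand-written, per-site rules: this is the part that does not generalize.
RULES = {"site_a": ("a", "title"), "site_b": ("span", "item-name")}

titles_a = extract(SITE_A, *RULES["site_a"])
titles_b = extract(SITE_B, *RULES["site_b"])
# Reusing site A's rule on site B silently finds nothing.
wrong = extract(SITE_B, *RULES["site_a"])
```

Note that the mismatched rule does not raise an error, it just returns an
empty list.  That is also what typically happens when a site changes its
markup, which is why scrapers tend to break quietly.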

The only reliable way to access and process the information is through
a stable web services API, if the web site offers one.
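
For comparison, a sketch of what consuming such an API tends to look
like.  The response body and its field names ("items", "title", "price")
are invented for this example; a real service documents its own schema,
and you would fetch the body over HTTP rather than hard-code it.

```python
import json

# Hypothetical JSON body, standing in for what an API endpoint might return.
RESPONSE_BODY = """
{"items": [{"id": 1, "title": "Red bike", "price": 120.0},
           {"id": 2, "title": "Blue bike", "price": 95.5}]}
"""


def item_titles(body):
    """Return the title of every item in the (assumed) response schema."""
    data = json.loads(body)
    return [item["title"] for item in data["items"]]


api_titles = item_titles(RESPONSE_BODY)
```

The code depends on documented field names rather than on page layout,
so a redesign of the site's HTML does not affect it.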