Subject: Fetching PDFs from our website



Hey,

currently we are on Nutch 2.3.1 and are using it to crawl our websites.
One focus for us is getting all the PDFs on our website crawled. The links on
the different websites look like this: https://assets0.mysite.com/asset/DB_product.pdf
I have tried a few things:
In the configuration I removed every occurrence of pdf from regex-urlfilter.txt
and added the download URL, added parse-tika to the plugin list in nutch-site.xml,
added application/pdf to http.accept in default-site.xml, and added pdf to
parse-plugins.xml.
But still no PDF link is being fetched.

regex-urlfilter.txt
+https://assets.*.mysite.com/asset
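
(I am not sure this rule is right: the dots are unescaped and there is no anchor. If that matters, what I actually intended is probably something like the rule below; the ^ anchor, the escaped dots and the [0-9]* part for assets0, assets1, ... are my guesses.)

# sketch only - anchored, with escaped dots, covering assets0, assets1, ...
+^https?://assets[0-9]*\.mysite\.com/asset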

parse-plugins.xml
<mimeType name="application/pdf">
    <plugin id="parse-tika" />
</mimeType>
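
(I left the aliases section of parse-plugins.xml untouched; I am assuming the stock alias entry for parse-tika is still there and is what maps this mimeType to the Tika parser, i.e. something like:

<alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />)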

nutch-site.xml
<property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>
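
(One thing I am unsure about: all of the PDF links are https URLs. Does protocol-http handle https in 2.3.1, or would I need to swap it for protocol-httpclient? If the latter, I guess the value would become the following, everything else unchanged:

<value>protocol-httpclient|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>)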

default-site.xml
<property>
    <name>http.accept</name>
    <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
    <description>Value of the "Accept" request header field.</description>
</property>
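
One property I have not touched yet is http.content.limit. Could the default limit (64 kB if I remember correctly) be cutting the PDFs off before they reach the parser? If so I would add something like this to nutch-site.xml (the -1 value, meaning no truncation, is just my guess at a fix):

<property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Sketch only: disable truncation of fetched content; a large positive value should also work.</description>
</property>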

Is there anything else I have to configure?
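
For debugging, would something like this be the right way to check whether such a URL even passes the filters and whether Tika can parse it? (I am assuming the checker tools in 2.3.1 can be invoked this way; the URL is just one example link.)

echo "https://assets0.mysite.com/asset/DB_product.pdf" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
bin/nutch parsechecker -dumpText https://assets0.mysite.com/asset/DB_product.pdf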

Thanks

David


