Wikimedia Enterprise plans to provide preprocessed HTML dumps

Posted: Wed May 10, 2023 8:10 pm
by Bbb23sucks

Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps

Posted: Thu May 11, 2023 6:17 am
by ericbarbour
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).

There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
Ah ha ha ha ha ha

Re: Wikimedia Enterprise plans to provide preprocessed HTML dumps

Posted: Thu May 11, 2023 6:18 am
by Bbb23sucks
ericbarbour wrote:
Thu May 11, 2023 6:17 am
From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).

There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
Ah ha ha ha ha ha
As usual - STILL NOT FIXED after over a year!