Wikimedia Enterprise plans to provide preprocessed HTML dumps
Posted: Wed May 10, 2023 8:10 pm
BADSITEBADSITEBADSITE
https://www.wikipediasucks.co/forum/
https://www.wikipediasucks.co/forum/viewtopic.php?f=31&t=2803
Ah ha ha ha ha ha

From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.
So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
As usual - STILL NOT FIXED after over a year!

ericbarbour wrote: Thu May 11, 2023 6:17 am
Ah ha ha ha ha ha

From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).
There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.
So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.