Wikimedia Enterprise plans to provide preprocessed HTML dumps

For WMF employee / slave nonsense, developer hijinks, and MediaWiki and related software screw-ups.

Post by Bbb23sucks » Wed May 10, 2023 8:10 pm

"Globally banned" since September 5, 2023 for exposing harassment.

Post by ericbarbour » Thu May 11, 2023 6:17 am

From my experience working with the Wiktionary HTML dumps I can say that the data quality
is quite poor: there are stale and missing entries
(https://phabricator.wikimedia.org/T305407).

There are also entire namespaces excluded from the dumps, and more recently there have
been issues with the dumps not getting updated.

So it depends on what kind of processing you need to do. In general I find the parsing to be
much easier; hopefully they'll manage to sort out the problems.
Ah ha ha ha ha ha

Post by Bbb23sucks » Thu May 11, 2023 6:18 am

As usual - STILL NOT FIXED after over a year!
