
Problems with the WMF database data

Posted: Wed Aug 30, 2017 9:41 pm
by Kumioko
The WMF and the Wikipedia community have become more and more reliant on statistical data to do analysis, perform quality control, and justify that the project is succeeding. People routinely report how many edits have been done, what portion of the community did them, and on the health of the projects and the community. To do this, they use a replica database that runs parallel to the live one (the one all the edits are actually made to); that is what they call the Labs server. Makes sense, right? You don't want the live database bogged down by queries. This is also where the ad hoc report generation tool called Quarry gets its data.
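
To give a feel for the kind of ad hoc report that gets run against the replica, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the real MariaDB replica, and a simplified `revision` table that only loosely approximates the MediaWiki schema; the column names and data are illustrative assumptions, not real WMF data.

```python
import sqlite3

# Stand-in for the Labs replica: an in-memory SQLite database with a
# simplified `revision` table (the real MediaWiki schema has many more columns).
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user TEXT)")
replica.executemany(
    "INSERT INTO revision (rev_id, rev_user) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Alice")],
)

# A typical "who did how many edits" report, run against the replica so the
# live database isn't slowed down by the query.
rows = replica.execute(
    "SELECT rev_user, COUNT(*) AS edits FROM revision "
    "GROUP BY rev_user ORDER BY edits DESC"
).fetchall()
for user, edits in rows:
    print(user, edits)
```

The point is that every number in such a report is only as good as the replica it was computed from.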

Over time, various problems have been identified showing that the data the WMF uses is not as reliable as they would have the world believe. For example, they recently reported "zombie" entries appearing in the data that could not be killed off; and due to differences in the tables between the live database and the backup, they also made changes to the wrong records, because the table updates weren't in sync, thus introducing errors. This is referred to as replica drift here: https://wikitech.wikimedia.org/wiki/Hel ... lica_drift. And there are surely other problems we are not yet aware of; certainly they are not reporting every problem in Phabricator!
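
To make "replica drift" concrete, here is a crude sketch of how you could detect it: compare the same records on the master and the replica and flag any that differ. Two in-memory SQLite databases stand in for the live database and the Labs replica, and the table, rows, and drift scenario are all invented for illustration; a real audit would run against MariaDB and a far larger schema.

```python
import sqlite3

def make_db(rows):
    """Build a toy database with a simplified `revision` table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user TEXT)")
    db.executemany("INSERT INTO revision VALUES (?, ?)", rows)
    return db

# The live database, and a replica that has drifted: rev_id 3 carries the
# wrong user, and rev_id 5 is a "zombie" row that no longer exists upstream.
live = make_db([(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol")])
replica = make_db([(1, "Alice"), (2, "Bob"), (3, "Bob"), (4, "Carol"), (5, "Eve")])

def snapshot(db):
    return dict(db.execute("SELECT rev_id, rev_user FROM revision"))

live_rows, replica_rows = snapshot(live), snapshot(replica)

# Rows present only on the replica are zombies; rows whose values differ drifted.
zombies = sorted(set(replica_rows) - set(live_rows))
drifted = sorted(k for k in live_rows if replica_rows.get(k) != live_rows[k])
print("zombie rev_ids:", zombies)   # rows that should have been deleted
print("drifted rev_ids:", drifted)  # rows with stale or incorrect values
```

Scaled up to billions of rows, even a small drift percentage means a lot of wrong records feeding into reports.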

Without knowing what percentage of the data is "adrift", it's hard to say how bad the problem is. They seem to have a plan to fix it... by spending more money and buying more servers... but it raises the question: just how many records were affected, and was any of this incorrect data used in reports presented to the public, the donors, or the board of trustees?