Data-Mining Wikipedia for Fun and Profit
billpg 2021-08-19 15:08:52 +0000 UTC [ - ]
In an act of divine justice, my website is down.
https://web.archive.org/web/20210711201037/https://billpg.co...
(I'll send you a donation. Thank you!)
papercrane 2021-08-19 14:49:30 +0000 UTC [ - ]
billpg 2021-08-19 14:57:12 +0000 UTC [ - ]
Perhaps, but I already know how to scrape HTML and I know the data I wanted to pull out was in there. I have no idea how to query wikidata and it could have ended up being a blind alley.
Also, it was only my reading your comment just now that told me wikidata was even a thing.
shock-value 2021-08-19 15:03:41 +0000 UTC [ - ]
3pt14159 2021-08-19 15:04:29 +0000 UTC [ - ]
When I was analyzing Wikipedia about 10 years ago for fun and, later, actual profit, I did the responsible thing and downloaded one of their megadumps, because I needed every English page. That's what people here are concerned about, but it doesn't matter for your use case.
papercrane 2021-08-19 15:55:07 +0000 UTC [ - ]
I only thought of it myself because you mentioned the problem with deducing which parent is the mother and which is the father, and I remember in wikidata those are separate fields.
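For anyone curious what that looks like in practice: Wikidata models mother (P25) and father (P22) as separate properties, so there is no guessing which parent is which. A minimal sketch, assuming Q9682 is the item for Elizabeth II and using the public SPARQL endpoint:

    # Minimal sketch (untested): Wikidata keeps mother (P25) and father (P22)
    # as separate properties, so no parent-guessing is needed. Q9682 is assumed
    # to be the item for Elizabeth II.
    import requests

    SPARQL = """
    SELECT ?motherLabel ?fatherLabel WHERE {
      OPTIONAL { wd:Q9682 wdt:P25 ?mother. }
      OPTIONAL { wd:Q9682 wdt:P22 ?father. }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": SPARQL, "format": "json"},
        headers={"User-Agent": "parent-lookup-sketch/0.1"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row.get("motherLabel", {}).get("value"),
              row.get("fatherLabel", {}).get("value"))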
mistrial9 2021-08-19 15:02:24 +0000 UTC [ - ]
Superior RDF triples are like a Martian language to millions of humans.
udp 2021-08-19 15:03:57 +0000 UTC [ - ]
squeaky-clean 2021-08-19 16:38:56 +0000 UTC [ - ]
brodo 2021-08-19 15:00:14 +0000 UTC [ - ]
matkoniecz 2021-08-19 15:10:20 +0000 UTC [ - ]
Yes, I made some improvements ( https://www.wikidata.org/wiki/Special:Contributions/Mateusz_... ).
But overall I would not encourage using it; if I had known how much work it takes to get usable data, I would not have bothered with it.
Queries as simple as "is this entry describing an event, a bridge, or neither?" require extreme effort to get right in a reliable way, including maintaining a private list of patches and exemptions.
And bots creating millions of known duplicate entries, with people expected to resolve them manually, is quite discouraging. Creating Wikidata entries for Cebuano Wikipedia 'articles' was accepted, even though the Cebuano Wikipedia is almost entirely bot-generated.
And that is before the unclear legal status. Yes, they can legally import databases covered by database rights - but they should either make clear that Wikidata is a legal quagmire in the EU or forbid such imports. The Wikidata community did neither.
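To make the "event, bridge, or neither" point concrete: classifying an item means walking the subclass-of hierarchy from its instance-of statements, and you inherit whatever is wrong or missing in that hierarchy. A hedged sketch; the class items Q12280 (bridge) and Q1190554 (occurrence/event) are my assumptions, the P31/P279* path is standard Wikidata modelling:

    # Hedged sketch of a query that sounds simple but isn't: "is this item a
    # bridge?" means walking instance-of (P31) through the subclass-of (P279)
    # hierarchy. Q12280 (bridge) is my assumption for the class item; swap in
    # Q1190554 (occurrence) to ask "is this an event?".
    import requests

    ASK_TEMPLATE = "ASK { wd:%s wdt:P31/wdt:P279* wd:%s. }"

    def is_instance_of(item_qid: str, class_qid: str) -> bool:
        resp = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": ASK_TEMPLATE % (item_qid, class_qid), "format": "json"},
            headers={"User-Agent": "wikidata-class-check-sketch/0.1"},
        )
        return resp.json()["boolean"]

    # Usage: is_instance_of("Q...", "Q12280") is True if the item's class chain
    # reaches "bridge" -- modulo the patches and exemptions mentioned above.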
guidovranken 2021-08-19 15:24:39 +0000 UTC [ - ]
lacksconfidence 2021-08-19 16:24:27 +0000 UTC [ - ]
devbas 2021-08-19 15:09:18 +0000 UTC [ - ]
powera 2021-08-19 14:57:13 +0000 UTC [ - ]
mrVentures 2021-08-19 15:06:44 +0000 UTC [ - ]
barbinbrad 2021-08-19 15:44:40 +0000 UTC [ - ]
Here's a demo: https://www.youtube.com/watch?v=SyhaxCjrZFw
lizhang 2021-08-19 15:06:48 +0000 UTC [ - ]
billpg 2021-08-19 15:59:06 +0000 UTC [ - ]
ad404b8a372f2b9 2021-08-19 15:48:22 +0000 UTC [ - ]
humanistbot 2021-08-19 14:58:09 +0000 UTC [ - ]
cldellow 2021-08-19 15:14:37 +0000 UTC [ - ]
The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).
The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.
[1]: https://github.com/spencermountain/wtf_wikipedia
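As a rough illustration of the "what you see is what you get" point (the post itself uses C# with HtmlAgilityPack, not this), here is the same idea sketched in Python with requests and BeautifulSoup, assuming the infobox is marked up as a <table class="infobox">, the usual markup on current English Wikipedia pages:

    # Rough Python equivalent of scraping the rendered page instead of parsing
    # wikitext from the dumps. Assumes <table class="infobox"> markup.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get(
        "https://en.wikipedia.org/wiki/Elizabeth_II",
        headers={"User-Agent": "infobox-sketch/0.1"},
    ).text

    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox:
        for row in infobox.find_all("tr"):
            header, value = row.find("th"), row.find("td")
            if header and value:
                print(header.get_text(" ", strip=True), "->",
                      value.get_text(" ", strip=True))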
lou1306 2021-08-19 15:24:42 +0000 UTC [ - ]
cldellow 2021-08-19 15:33:22 +0000 UTC [ - ]
But his project really was very reasonable:
- it fetched ~2,400 pages
- he cached them after the first fetch
- Wikipedia aggressively caches anonymous page views (e.g. the Queen Elizabeth II page has a cache age of 82,000 seconds)
English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.
I get the slippery slope arguments, but to me, it just doesn't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.
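That cache-age figure is easy to check yourself: cached anonymous responses come back with an Age header. A small sketch; the header names are what Wikimedia's edge currently exposes, as far as I know, and the comment also spells out the arithmetic behind the 0.001% figure:

    # Quick check of the caching claim: cached anonymous responses carry an Age
    # header (seconds the copy has sat in Wikimedia's cache). For the scale
    # claim: 2_400 / 250_000_000 pageviews ~= 0.00096%, i.e. roughly the
    # 0.001% quoted above.
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/wiki/Elizabeth_II",
        headers={"User-Agent": "cache-age-sketch/0.1"},
    )
    print(resp.headers.get("age"))            # seconds spent in cache
    print(resp.headers.get("cache-control"))
    print(resp.headers.get("x-cache", ""))    # which cache layers answered, if exposed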
habibur 2021-08-19 15:51:04 +0000 UTC [ - ]
For one person consuming content from one of the most popular sites on the web, that really reads as a lot.
cldellow 2021-08-19 16:01:56 +0000 UTC [ - ]
The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.
Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for costs, eg caching, CDNs, negotiated bandwidth discounts.
If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.
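A back-of-the-envelope version of those numbers (the 50 KB average and the 18.5 GB dump size are the assumptions stated above, not measurements):

    # Rough arithmetic behind the comparison above.
    pages, avg_gzip_kb = 2_400, 50
    transfer_mb = pages * avg_gzip_kb / 1_000     # ~120 MB over the wire
    dump_mb = 18.5 * 1_000                        # the full pages-articles dump
    print(transfer_mb, dump_mb / transfer_mb)     # ~120 MB, roughly 150x less than the dump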
jancsika 2021-08-19 15:59:19 +0000 UTC [ - ]
How is it possible that "give me all the infoboxes, please" is more than a single query, download, or even URL at this point?
ceejayoz 2021-08-19 16:13:59 +0000 UTC [ - ]
Look at the template for a subway line infobox, for example. https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT
It's a whole clever little language (https://en.wikipedia.org/wiki/Wikipedia:Route_diagram_templa...) for making complex diagrams out of rather simple pictograms (https://commons.wikimedia.org/wiki/Template:Bsicon).
cldellow 2021-08-19 16:18:03 +0000 UTC [ - ]
That plus a few other key things (categories, opening paragraph, redirects, pageview data) enables a lot of powerful analysis.
That actually might be kind of a neat thing to publish. Hmmmm.
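Most of those "other key things" are already reachable without the dumps, via the MediaWiki action API. A hedged sketch; the parameter names are the standard ones as I recall them, so check the live API docs, and pageview counts live in the separate wikimedia.org REST pageviews API:

    # Hedged sketch: categories, the opening paragraph, and redirects in one
    # call to the MediaWiki action API.
    import requests

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": "Elizabeth II",
            "prop": "categories|extracts|redirects",
            "exintro": 1,       # only the lead section
            "explaintext": 1,   # plain text rather than HTML
            "cllimit": "max",
            "rdlimit": "max",
        },
        headers={"User-Agent": "page-metadata-sketch/0.1"},
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    print(page.get("extract", "")[:200])
    print([c["title"] for c in page.get("categories", [])][:5])
    print([r["title"] for r in page.get("redirects", [])][:5])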
squeaky-clean 2021-08-19 16:37:18 +0000 UTC [ - ]
whall6 2021-08-19 15:06:42 +0000 UTC [ - ]
michaelbuckbee 2021-08-19 15:10:04 +0000 UTC [ - ]
jcun4128 2021-08-19 15:55:52 +0000 UTC [ - ]
thechao 2021-08-19 16:56:50 +0000 UTC [ - ]
billpg 2021-08-19 15:15:20 +0000 UTC [ - ]
That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say easier than if I had the page in some table-based data system.
SlimyHog 2021-08-19 15:19:24 +0000 UTC [ - ]
FalconSensei 2021-08-19 16:32:46 +0000 UTC [ - ]
nonameiguess 2021-08-19 15:20:52 +0000 UTC [ - ]
matkoniecz 2021-08-19 15:36:03 +0000 UTC [ - ]
Bulk downloads (database dumps) are much cheaper to serve for someone crawling millions of pages.
It gets even more significant if generating a response is resource-intensive (I'm not sure whether Wikipedia qualifies, but complex templates might cause this).
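For scale, consuming the dump is one long local read rather than millions of requests. A sketch, assuming the usual enwiki pages-articles file name and the current XML export namespace (check the root element of the dump you actually download):

    # Sketch of bulk consumption: one compressed file streamed locally, no
    # per-page requests. File name and namespace URL are assumptions.
    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                if "{{Infobox" in text:
                    print(title)
                elem.clear()    # free the page subtree as we go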
spoonjim 2021-08-19 15:57:28 +0000 UTC [ - ]
traceroute66 2021-08-19 15:08:52 +0000 UTC [ - ]
Wikimedia no doubt have caching, CDNs and all that jazz in place, so the likely impact on infrastructure is probably de minimis in the grand scheme of things, compared with the thousands or millions of humans who visit the site every second.
learc83 2021-08-19 15:10:43 +0000 UTC [ - ]
They said "please don't", not "don't do it or they'll sue you".
But the content license and the site's terms of use are different things.
From their terms of use, you aren't allowed to:
> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;
Wikipedia is also well within their rights to implement scraping countermeasures.
traceroute66 2021-08-19 16:03:17 +0000 UTC [ - ]
Two things:
_jal 2021-08-19 15:19:27 +0000 UTC [ - ]
I mean, sure, you can do a lot of things you shouldn't with freely available services. There's even an economics term that describes this: the Tragedy of the Commons.
Individual fish poachers' hauls are also de minimis.