
Data-Mining Wikipedia for Fun and Profit

humanistbot 2021-08-19 14:58:09 +0000 UTC [ - ]

Please don't scrape raw HTML from Wikipedia. They do a lot of work to make their content accessible in so many machine-readable formats, from the raw XML dumps (https://dumps.wikimedia.org) to the fully-featured API with a nice sandbox (https://en.wikipedia.org/wiki/Special:ApiSandbox) and Wikidata (https://wikidata.org).
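
For illustration, a minimal Python sketch against the Action API that the sandbox above exposes; the page title, User-Agent string, and contact address are just placeholders:

  import requests

  # Fetch the raw wikitext of one page via the MediaWiki Action API.
  r = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={
          "action": "query",
          "titles": "Queen Victoria",
          "prop": "revisions",
          "rvprop": "content",
          "rvslots": "main",
          "format": "json",
          "formatversion": "2",
      },
      headers={"User-Agent": "wiki-api-demo/0.1 (contact: you@example.org)"},
      timeout=30,
  )
  page = r.json()["query"]["pages"][0]
  print(page["revisions"][0]["slots"]["main"]["content"][:500])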

cldellow 2021-08-19 15:14:37 +0000 UTC [ - ]

The infoboxes, which are what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

The reason is that the dumps just have pointers to templates, and you need to understand quite a bit about Wikipedia's bespoke rendering system to know how to fully realize them (or use a constantly-evolving library like wtf_wikipedia [1] to parse them).

The rendered HTML, on the other hand, is designed for humans, and so what you see is what you get.

[1]: https://github.com/spencermountain/wtf_wikipedia
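
To see why the rendered HTML is the easy path: infoboxes come out as ordinary tables, so a generic HTML parser gives you key/value pairs directly. A rough Python/BeautifulSoup sketch (not the author's C#/HtmlAgilityPack code; the page title is just an example):

  import requests
  from bs4 import BeautifulSoup

  html = requests.get(
      "https://en.wikipedia.org/wiki/Queen_Victoria",
      headers={"User-Agent": "infobox-demo/0.1 (contact: you@example.org)"},
      timeout=30,
  ).text

  # Rendered infoboxes are plain <table class="infobox"> elements.
  infobox = BeautifulSoup(html, "html.parser").find("table", class_="infobox")

  fields = {}
  for tr in infobox.find_all("tr"):
      th, td = tr.find("th"), tr.find("td")
      if th and td:
          fields[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)

  print(fields.get("Father"), fields.get("Mother"))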

lou1306 2021-08-19 15:24:42 +0000 UTC [ - ]

Still, I guess you could get the dumps, do a local MediaWiki setup based on them, and then crawl that instead?

cldellow 2021-08-19 15:33:22 +0000 UTC [ - ]

You could, and if he were doing this on the entire corpus, that'd be the responsible thing to do.

But, his project really was very reasonable:

- it fetched ~2,400 pages

- he cached them after first fetch

- Wikipedia aggressively caches anonymous page views (e.g. the Queen Elizabeth page has a cache age of 82,000 seconds; a quick way to check this is sketched below)

English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.
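
The cache-age point is easy to verify, assuming Wikipedia still exposes the standard Age and X-Cache headers on anonymous page views (the title is just an example):

  import requests

  # Anonymous page views are served from the CDN; the Age header shows how
  # long the cached copy has been sitting there.
  r = requests.get(
      "https://en.wikipedia.org/wiki/Elizabeth_II",
      headers={"User-Agent": "cache-check/0.1 (contact: you@example.org)"},
      timeout=30,
  )
  print(r.headers.get("age"), r.headers.get("x-cache"))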

I get the slippery slope arguments, but to me, it just doesn't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, etc, etc.

habibur 2021-08-19 15:51:04 +0000 UTC [ - ]

> This guy's use was 0.001% of traffic on that day

For one person consuming content from one of the most popular sites on the web, that really does read as a lot.

cldellow 2021-08-19 16:01:56 +0000 UTC [ - ]

He was probably one of the biggest users that day, so that makes sense.

The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.

Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia Foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for cost, e.g. caching, CDNs, negotiated bandwidth discounts.

If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.

jancsika 2021-08-19 15:59:19 +0000 UTC [ - ]

> The infoboxes, which is what this guy is scraping, are much easier to scrape from the HTML than from the XML dumps.

How is it possible that "give me all the infoboxes, please" is more than a single query, download, or even URL at this point?

ceejayoz 2021-08-19 16:13:59 +0000 UTC [ - ]

The problem lies in parsing them.

Look at the template for a subway line infobox, for example. https://en.wikipedia.org/wiki/Template:Bakerloo_line_RDT

It's a whole little clever language (https://en.wikipedia.org/wiki/Wikipedia:Route_diagram_templa...) for making complex diagrams out of rather simple pictograms (https://commons.wikimedia.org/wiki/Template:Bsicon).

cldellow 2021-08-19 16:18:03 +0000 UTC [ - ]

Even just being able to download a tarball of the HTML of the infoboxes would be really powerful, setting aside the difficulty of, say, translating them into a consistent JSON format.

That plus a few other key things (categories, opening paragraph, redirects, pageview data) enable a lot of powerful analysis.

That actually might be kind of a neat thing to publish. Hmmmm.
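
Much of that extra data is already exposed over public endpoints; a hedged sketch, assuming the REST summary and pageviews APIs still look like this (titles and dates are just examples):

  import requests

  UA = {"User-Agent": "wiki-analysis-demo/0.1 (contact: you@example.org)"}

  # Opening paragraph via the page summary endpoint.
  summary = requests.get(
      "https://en.wikipedia.org/api/rest_v1/page/summary/Elizabeth_II",
      headers=UA, timeout=30,
  ).json()
  print(summary["extract"][:200])

  # Daily pageview counts via the Wikimedia pageviews API.
  views = requests.get(
      "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
      "en.wikipedia/all-access/all-agents/Elizabeth_II/daily/2021080100/2021081800",
      headers=UA, timeout=30,
  ).json()
  print(sum(item["views"] for item in views["items"]))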

squeaky-clean 2021-08-19 16:37:18 +0000 UTC [ - ]

The infoboxes aren't standardized at all. The HTML they generate is.

whall6 2021-08-19 15:06:42 +0000 UTC [ - ]

Genuine question from a non-programmer: why? Is it because the volume of requests increases load on the servers/costs?

michaelbuckbee 2021-08-19 15:10:04 +0000 UTC [ - ]

That's part of it, but it's also typically much more difficult, and there's an element of "why are you making this so much harder on yourself?"

jcun4128 2021-08-19 15:55:52 +0000 UTC [ - ]

You could make it even harder: use Puppeteer to take screenshots, then pass them to an OCR engine to get the text.

thechao 2021-08-19 16:56:50 +0000 UTC [ - ]

billpg 2021-08-19 15:15:20 +0000 UTC [ - ]

(Author of original article here.)

That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say easier than if I had the page in some table-based data system.

SlimyHog 2021-08-19 15:19:24 +0000 UTC [ - ]

The HTML is more volatile and subject to change than other sources, though.

FalconSensei 2021-08-19 16:32:46 +0000 UTC [ - ]

I don't remember the last time Wikipedia changed the infobox markup, though.

nonameiguess 2021-08-19 15:20:52 +0000 UTC [ - ]

Unlike APIs, HTML class/tag names provide no stability guarantees. The site owner can break your parser whenever they want, for any reason. They can do that with an API too, but usually won't, since some guarantee of stability is the point of an API.

matkoniecz 2021-08-19 15:36:03 +0000 UTC [ - ]

Scraping, especially on a large scale, can put a noticeable strain on servers.

Bulk downloads (database dumps) are much cheaper to serve for someone crawling millions of pages.

It gets even more significant if generating the response is resource-intensive (not sure whether Wikipedia qualifies for that, but complex templates may cause it).

spoonjim 2021-08-19 15:57:28 +0000 UTC [ - ]

How does scraping raw HTML from Wikipedia hurt them? I'd think the HTML is more likely to be served from cache than an API call is.

traceroute66 2021-08-19 15:08:52 +0000 UTC [ - ]

IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

Wikimedia no doubt has caching, CDNs, and all that jazz in place, so the likely impact on infrastructure is de minimis in the grand scheme of things (against the thousands of humans who visit the site every second).

learc83 2021-08-19 15:10:43 +0000 UTC [ - ]

>IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

They said "please don't", not "don't do it or we'll sue you".

But the content license and the site's terms of use are different things.

From their terms of use you aren’t allowed to

> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Wikipedia is also well within their rights to implement scraping countermeasures.

traceroute66 2021-08-19 16:03:17 +0000 UTC [ - ]

> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Two things:

  1) The English Wikipedia *alone* gets 250 million page views per day! So you would have to be doing an awful lot to cause "undue burden".

  2) The Wikipedia robots.txt page openly implies that crawling (and therefore scraping) is acceptable *as long as* you do it in a rate-limited fashion, e.g.:

  >Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.

  > There are a lot of pages on this site, and there are some misbehaved spiders out there that go _way_ too fast.

  >Sorry, wget in its recursive mode is a frequent problem. Please read the man page and use it properly; there is a --wait option you can use to set the delay between hits, for instance.
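
In the spirit of the --wait advice quoted above, the polite way to fetch a batch of pages is a descriptive User-Agent plus a delay between requests. A minimal Python sketch; the page list and contact address are placeholders:

  import time
  import requests

  PAGES = ["Elizabeth_II", "Queen_Victoria", "William_the_Conqueror"]

  session = requests.Session()
  session.headers["User-Agent"] = "genealogy-crawler/0.1 (contact: you@example.org)"

  for title in PAGES:
      r = session.get(f"https://en.wikipedia.org/wiki/{title}", timeout=30)
      r.raise_for_status()
      # ... parse and cache r.text here ...
      time.sleep(1)  # roughly one request per second, like wget --wait=1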

_jal 2021-08-19 15:19:27 +0000 UTC [ - ]

"Who's gonna stop me" is kind of a crappy attitude to take with a cooperative project like Wikipedia.

I mean, sure, you can do a lot of things you shouldn't with freely available services. There's even an economics term that describes this: the Tragedy of the Commons.

Individual fish poachers' hauls are also de minimis.

billpg 2021-08-19 15:08:52 +0000 UTC [ - ]

(Author here)

In an act of divine justice, my website is down.

https://web.archive.org/web/20210711201037/https://billpg.co...

(I'll send you a donation. Thank you!)

papercrane 2021-08-19 14:49:30 +0000 UTC [ - ]

For this particular problem, I wonder if Wikidata would be a better fit than scraping the HTML.

billpg 2021-08-19 14:57:12 +0000 UTC [ - ]

(Author here.)

Perhaps, but I already know how to scrape HTML, and I knew the data I wanted to pull out was in there. I have no idea how to query Wikidata, and it could have ended up being a blind alley.

Also, it was only reading your comment just now that told me Wikidata was even a thing.

shock-value 2021-08-19 15:03:41 +0000 UTC [ - ]

Generally Wikidata would definitely be the way to go here, though I just now tried to retrace your graph in Wikidata and it seems to be missing at least one relation (Ada of Holland has no children listed -- https://www.wikidata.org/wiki/Q16156475).

3pt14159 2021-08-19 15:04:29 +0000 UTC [ - ]

Don't worry about the haters. You needed a paltry amount of data and you got it with the tools you had and knew.

When I was analyzing Wikipedia about 10 years ago for fun and, later, actual profit, I did the responsible thing and downloaded one of their megadumps, because I needed every English page. That's the scale people here are concerned about, but it doesn't matter for your use case.

papercrane 2021-08-19 15:55:07 +0000 UTC [ - ]

Hard to use it if you don't know about it!

I only thought of it myself because you mentioned the problem of deducing which parent is the mother and which is the father, and I remembered that in Wikidata those are separate properties.
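
Concretely, Wikidata models "father" (P22) and "mother" (P25) as separate properties, so that ambiguity goes away. A minimal sketch against the public SPARQL endpoint, using the Ada of Holland item (Q16156475) linked upthread:

  import requests

  # P22 = father, P25 = mother; both are optional because either may be missing.
  QUERY = """
  SELECT ?fatherLabel ?motherLabel WHERE {
    OPTIONAL { wd:Q16156475 wdt:P22 ?father. }
    OPTIONAL { wd:Q16156475 wdt:P25 ?mother. }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  """
  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY},
      headers={
          "Accept": "application/sparql-results+json",
          "User-Agent": "genealogy-demo/0.1 (contact: you@example.org)",
      },
      timeout=60,
  )
  for row in r.json()["results"]["bindings"]:
      print(row.get("fatherLabel", {}).get("value"),
            row.get("motherLabel", {}).get("value"))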

mistrial9 2021-08-19 15:02:24 +0000 UTC [ - ]

earth calling ivory tower, earth calling ivory tower

superior RDF triples are like martian language to millions of humans

over

udp 2021-08-19 15:03:57 +0000 UTC [ - ]

The ivory tower is working on it: https://github.com/w3c/EasierRDF

squeaky-clean 2021-08-19 16:38:56 +0000 UTC [ - ]

Like people in the other comments have said, if you've actually tried getting data from Wikidata/Wikipedia, you very quickly learn that the HTML is much easier to parse than the results Wikidata gives you.

brodo 2021-08-19 15:00:14 +0000 UTC [ - ]

It’s so sad that almost nobody knows or uses SPARQL…

matkoniecz 2021-08-19 15:10:20 +0000 UTC [ - ]

In my experience, SPARQL is really hard to use, and Wikidata data quality is really low, to the point that one of my larger projects is just trying to filter the data to make it usable for my use case.

Yes, I made some improvements ( https://www.wikidata.org/wiki/Special:Contributions/Mateusz_... ).

But overall I would not encourage using it; if I had known how much work it takes to get usable data, I would not have bothered with it.

Queries as simple as "is this entry describing an event, a bridge, or neither?" require extreme effort to get right in a reliable way, including maintaining a private list of patches and exemptions.

And bots creating millions of known duplicate entries and expecting people to resolve them manually is quite discouraging. Creating Wikidata entries for Cebuano Wikipedia 'articles' was accepted, despite that botpedia being nearly completely bot-generated.

And that is all without getting into the unclear legal status. Yes, they can legally import databases covered by database rights, but they should either make clear that Wikidata is a legal quagmire in the EU or forbid such imports. The Wikidata community did neither.
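
To make the "event, bridge, or neither" example above concrete: the standard pattern is instance-of (P31) followed by any number of subclass-of (P279) hops, and the complaint is that even this needs manual patches and exception lists in practice. A rough sketch, assuming Q12280 is still the "bridge" item:

  import requests

  # Find a handful of items classified (directly or via subclasses) as bridges.
  # Q12280 ("bridge") is an assumption; check the item before relying on it.
  QUERY = """
  SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31/wdt:P279* wd:Q12280 .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 10
  """
  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY},
      headers={
          "Accept": "application/sparql-results+json",
          "User-Agent": "classification-demo/0.1 (contact: you@example.org)",
      },
      timeout=60,
  )
  for row in r.json()["results"]["bindings"]:
      print(row["itemLabel"]["value"])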

guidovranken 2021-08-19 15:24:39 +0000 UTC [ - ]

At least the last few times I checked, the Wikidata SPARQL server was extremely slow, frequently timing out.

lacksconfidence 2021-08-19 16:24:27 +0000 UTC [ - ]

Seems to depend on the query. I can issue straightforward queries that visit a few hundred thousand triples easily, but when I write a query that visits tens of millions of triples, it times out.

devbas 2021-08-19 15:09:18 +0000 UTC [ - ]

Because the syntax is relatively complex and it is difficult to judge which endpoints and definitions to use.

pphysch 2021-08-19 15:10:43 +0000 UTC [ - ]

Why should more people know SPARQL?

powera 2021-08-19 14:57:13 +0000 UTC [ - ]

Definitely. There's even a public query service (https://query.wikidata.org/) which can do a lot of this (though SPARQL is not great at searching for chains).
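
For what it's worth, chain-style queries can be expressed with SPARQL property paths, though deep ones do run into the timeouts mentioned elsewhere in the thread. A hedged sketch, assuming Q9682 is the item for Elizabeth II (P22/P25 are father/mother):

  import requests

  # Walk father/mother links transitively to list ancestors of one person.
  # Q9682 (Elizabeth II) is an assumption; substitute any person item.
  QUERY = """
  SELECT ?ancestor ?ancestorLabel WHERE {
    wd:Q9682 (wdt:P22|wdt:P25)+ ?ancestor .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 200
  """
  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY},
      headers={
          "Accept": "application/sparql-results+json",
          "User-Agent": "ancestry-demo/0.1 (contact: you@example.org)",
      },
      timeout=60,
  )
  for row in r.json()["results"]["bindings"]:
      print(row["ancestorLabel"]["value"])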

mrVentures 2021-08-19 15:06:44 +0000 UTC [ - ]

Lol, I love that it was edited out right away. I probably agree with that decision, but this was a really cool project. I have followed a few YouTube channels that make graphic visualizations of generations, and I think they would really appreciate it if you shared this tech with them.

barbinbrad 2021-08-19 15:44:40 +0000 UTC [ - ]

For anyone who is interested: I used to work with a guy named Richard Wang, who indexed Wikipedia as his training data set in order to do named entity recognition. He'd be a good person to talk to for anyone pursuing this.

Here's a demo: https://www.youtube.com/watch?v=SyhaxCjrZFw

lizhang 2021-08-19 15:06:48 +0000 UTC [ - ]

Where's the profit?

billpg 2021-08-19 15:59:06 +0000 UTC [ - ]

For three minutes, something I wrote was cited on Wikipedia.

ad404b8a372f2b9 2021-08-19 15:48:22 +0000 UTC [ - ]

Good question. As a data hoarder, mining has always been fun, but I was interested in seeing the profit part. I've yet to find any monetary value in the troves of data I accumulate.