Question

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.


Solution

The Stack Overflow database is available for download.

OTHER TIPS

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
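A minimal sketch of that approach, assuming the requests library is installed (the 100MB threshold, the one-second delay, and the output file name are placeholders you'd tune yourself):

```python
import time
import requests

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
TARGET_BYTES = 100 * 1024 * 1024   # stop once ~100 MB of pages have been saved

seen_urls = set()                  # used to discard duplicate articles
total = 0

with open("wikipedia_sample.html", "ab") as out:
    while total < TARGET_BYTES:
        resp = requests.get(RANDOM_URL)  # Special:Random redirects to a random article
        time.sleep(1)                    # be polite: limit requests per minute
        if resp.url in seen_urls:        # final URL after the redirect identifies the article
            continue
        seen_urls.add(resp.url)
        out.write(resp.content)
        total += len(resp.content)
```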

If you wanted to get a copy of the Stack Overflow database, you could do that from the Creative Commons data dump.

Out of curiosity, what are you using all this data for?

One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the whole thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, so you can stop the decompression at any point you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
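A rough sketch of the on-the-fly approach, using only the standard library (the dump file name and the 100 MB cut-off are assumptions; substitute whatever dump you actually downloaded):

```python
import bz2

DUMP = "enwiki-latest-pages-articles.xml.bz2"   # assumed file name; use your own dump
LIMIT = 100 * 1024 * 1024                       # stop after ~100 MB of decompressed data

written = 0
with bz2.open(DUMP, "rb") as src, open("wikipedia_subset.xml", "wb") as dst:
    # read the compressed dump in 1 MB chunks so the whole file never sits in memory
    for chunk in iter(lambda: src.read(1024 * 1024), b""):
        dst.write(chunk)
        written += len(chunk)
        if written >= LIMIT:
            break   # stop decompressing once we have enough data
```

Note that the truncated output is not well-formed XML, which is usually fine for raw-text map-reduce experiments.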

If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
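If you go the export route, here's a hedged sketch that pulls individual pages through Special:Export (the page titles are arbitrary examples; pick the ones you want):

```python
import requests

# Special:Export/<Title> returns that page's wikitext wrapped in XML
titles = ["MapReduce", "Apache_Hadoop"]   # example titles only
with open("export_sample.xml", "wb") as out:
    for title in titles:
        resp = requests.get("https://en.wikipedia.org/wiki/Special:Export/" + title)
        out.write(resp.content)
```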

You could use a web crawler to scrape 100MB of data.

There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English Wikipedia)? The Wikinews archives are much smaller.

One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. This is in the same XML format as the entire article dataset, but smaller (around 400MB as of March 2019), so it can be used for software validation (for example testing GenSim scripts).

https://dumps.wikimedia.org/metawiki/latest/

You want to look for any files with the -articles.xml.bz2 suffix.
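As a quick way to find those files, here's a small sketch that filters the directory listing for that suffix (it assumes the dumps index is served as plain HTML with href links, which is how it normally appears):

```python
import re
import requests

INDEX = "https://dumps.wikimedia.org/metawiki/latest/"

html = requests.get(INDEX).text
# pull every href out of the listing and keep only the *-articles.xml.bz2 files
files = [name for name in re.findall(r'href="([^"]+)"', html)
         if name.endswith("-articles.xml.bz2")]
for name in files:
    print(INDEX + name)
```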

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow