According to Wikipedia's own statistics, and especially the information conveyed in this image, biography and geography are the two largest categories of English Wikipedia. In fact, measured by number of articles, nearly half of Wikipedia’s articles fall into these two categories in my survey.
Getting the number of biography articles on Wikipedia is not as trivial as it seems. Because Wikipedia is crowdsourced, articles come and go at any moment, and their style and format are not consistent. However, two concepts are very helpful for gathering Wikipedia statistics: “category” and “template”.
One can derive the number of biography articles by looking up how many pages sit under the category “WikiProject_Biography_articles”. Here is the query result from the MediaWiki table “Category”.
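The exact query is not reproduced in this post, so the following is only a minimal sketch of what such a lookup might look like. It assumes the standard MediaWiki "category" table columns (cat_title, cat_pages, cat_subcats, cat_files) and uses an in-memory SQLite database as a stand-in for a real MediaWiki replica. The row inserted here is a placeholder rounded from the figure quoted in this post, not real query output, and the function names are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for MediaWiki's `category` table.

    The single row is a PLACEHOLDER (rounded from the figure quoted in the
    post), not data pulled from a real Wikipedia database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE category (
            cat_id      INTEGER PRIMARY KEY,
            cat_title   TEXT UNIQUE,
            cat_pages   INTEGER,  -- all members, including subcats and files
            cat_subcats INTEGER,
            cat_files   INTEGER
        )
        """
    )
    conn.execute(
        "INSERT INTO category (cat_title, cat_pages, cat_subcats, cat_files) "
        "VALUES (?, ?, ?, ?)",
        ("WikiProject_Biography_articles", 1_800_000, 0, 0),
    )
    return conn


def category_page_count(conn, title):
    """Return the number of ordinary pages in a category, or None if absent.

    cat_pages counts every member, so subcategories and files are
    subtracted to leave only regular pages.
    """
    row = conn.execute(
        "SELECT cat_pages, cat_subcats, cat_files "
        "FROM category WHERE cat_title = ?",
        (title,),
    ).fetchone()
    if row is None:
        return None
    pages, subcats, files = row
    return pages - subcats - files


if __name__ == "__main__":
    conn = build_demo_db()
    print(category_page_count(conn, "WikiProject_Biography_articles"))
```

Against a real database, only the connection would change; the SELECT statement relies solely on the column names assumed above.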
From the query result above, we can tell that about 1.8 million pages are in the “WikiProject_Biography_articles” category. This is a solid estimate of the number of biography articles on Wikipedia.
Unfortunately, there is no single category that can be used to ‘tag’ a geography article. Instead, we can estimate the number of geography articles by counting uses of the ‘Coord’ template, which is included in geographic articles. By traversing the 20GB dump file of all Wikipedia articles and detecting this template, we get a figure of 1.2 million.
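The dump traversal can be sketched roughly as follows. This is not the exact script behind the 1.2 million figure, just one plausible way to do it in Python: it streams the XML export page by page, keeps only main-namespace pages that are not redirects, and checks the wikitext for a "{{Coord" call. The regular expression, the function names, and the choice to skip redirects are my assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches "{{Coord|...}}" or "{{coord}}" (case-insensitive), but not other
# templates whose names merely start with "Coord", e.g. "{{Coord missing|".
COORD_RE = re.compile(r"\{\{\s*coord\s*[|}]", re.IGNORECASE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_coord_articles(stream):
    """Return (articles, articles_with_coord) for a MediaWiki XML export.

    `stream` is any binary file-like object. Only pages in namespace 0
    that are not redirects are counted as articles.
    """
    articles = 0
    with_coord = 0
    ns, text, is_redirect = None, "", False
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "ns":
            ns = elem.text
        elif name == "redirect":
            is_redirect = True
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if ns == "0" and not is_redirect:
                articles += 1
                if COORD_RE.search(text):
                    with_coord += 1
            # Reset per-page state and free the page's subtree.
            ns, text, is_redirect = None, "", False
            elem.clear()
    return articles, with_coord


# Usage sketch (the file name is an assumption about the local dump):
#   import bz2
#   with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
#       print(count_coord_articles(f))
```

Streaming with iterparse avoids loading the whole dump into memory, which matters at this file size. A plain text match like this will still miss coordinates supplied through infobox parameters and may count stray mentions, so the result is best read as an estimate.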
So, here are the conclusions:
More than one-fourth (1.8 million / 6.3 million) of all Wikipedia articles are biographies, and nearly one-fifth (1.2 million / 6.3 million) are geography articles. Combined, they make up almost half (3 million / 6.3 million) of the English Wikipedia. To show this kind of data on open maps, the fundamental step is to extract that information. I will cover that work in the next post.