How to extract all titles of biography and geography articles from English Wikipedia

There are more than 1.8 million biography articles and 1.2 million geography articles in English Wikipedia. Here we extract the titles of all of these articles from Wikipedia’s data files.

  1. Download the Wikipedia articles dump file from the database download page (file size about 20 GB).
  2. Download the category links database dump, whose name looks like “enwiki-20210501-categorylinks.sql.gz” (file size about 3 GB).
  3. Restore the category links dump to a MySQL server running in Docker (database size about 30 GB).
  4. Query the “categorylinks” table on the MySQL server for the ids of all pages in the category “WikiProject_Biography_articles”, and save the result.
  5. Parse the articles file downloaded in step 1 with Python, SAX and mwparserfromhell, and retrieve the titles of all biography articles whose talk page id is in the result set of step 4.
  6. Retrieve the titles of all geography articles from the file downloaded in step 1 by keeping only the articles that contain the {{Coord}} template.
  7. The resulting lists of biography and geography article titles in English Wikipedia as of 2021 can be downloaded below.
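
Steps 4 to 6 can be sketched roughly as follows. This is a minimal illustration, not the exact script used here. It makes several assumptions: the dump contains talk pages in namespace 1 with titles of the form "Talk:Foo", the “categorylinks” table has the columns cl_from (page id) and cl_to (category name), and a regular expression stands in for mwparserfromhell so that the example needs only the standard library. The names PageHandler, COORD_RE, BIOGRAPHY_IDS_QUERY and SAMPLE are made up for this example.

```python
import re
import xml.sax

# Step 4 (assumed schema: cl_from = page id, cl_to = category name).
BIOGRAPHY_IDS_QUERY = (
    "SELECT cl_from FROM categorylinks "
    "WHERE cl_to = 'WikiProject_Biography_articles'"
)

# Simplified stand-in for template parsing: matches {{Coord|...}} or {{coord}}.
COORD_RE = re.compile(r"\{\{\s*coord\s*[|}]", re.IGNORECASE)


class PageHandler(xml.sax.ContentHandler):
    """Stream <page> elements and classify each one when it closes."""

    def __init__(self, biography_talk_ids):
        super().__init__()
        self.biography_talk_ids = biography_talk_ids
        self.biography_titles = []
        self.geography_titles = []
        self._stack = []  # names of the currently open elements
        self._buf = []    # character data of the current leaf element
        self._page = {}

    def startElement(self, name, attrs):
        self._stack.append(name)
        self._buf = []
        if name == "page":
            self._page = {}

    def characters(self, content):
        self._buf.append(content)

    def endElement(self, name):
        self._stack.pop()
        parent = self._stack[-1] if self._stack else None
        value = "".join(self._buf)
        self._buf = []
        if name in ("title", "ns", "id") and parent == "page":
            # Only the page-level <id>, not the revision or contributor id.
            self._page[name] = value
        elif name == "text" and parent == "revision":
            self._page["text"] = value
        elif name == "page":
            self._classify(self._page)

    def _classify(self, page):
        title = page.get("title", "")
        ns = page.get("ns")
        if ns == "1" and int(page.get("id", "-1")) in self.biography_talk_ids:
            # Assumption: the talk page "Talk:Foo" belongs to the article "Foo".
            self.biography_titles.append(title.split(":", 1)[-1])
        elif ns == "0" and COORD_RE.search(page.get("text", "")):
            self.geography_titles.append(title)


# Tiny hand-written sample in the shape of a MediaWiki XML export.
SAMPLE = """<mediawiki>
  <page><title>Ada Lovelace</title><ns>0</ns><id>100</id>
    <revision><id>9001</id><text>English mathematician.</text></revision></page>
  <page><title>Talk:Ada Lovelace</title><ns>1</ns><id>101</id>
    <revision><id>9002</id><text>{{WikiProject Biography}}</text></revision></page>
  <page><title>Mount Fuji</title><ns>0</ns><id>200</id>
    <revision><id>9003</id><text>{{Coord|35|21|N|138|43|E}} A volcano.</text></revision></page>
</mediawiki>"""

handler = PageHandler(biography_talk_ids={101})  # ids as returned by step 4
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.biography_titles)  # ['Ada Lovelace']
print(handler.geography_titles)  # ['Mount Fuji']
```

For the real dump, the same handler could be passed to xml.sax.parse together with an open file object (for example one returned by bz2.open, if the dump is bz2-compressed), so that the 20 GB file is streamed rather than loaded into memory.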
