Analyzing Gender Proportions Using Python and Web Scraping

Support your argument with data!

Mor Kapronczay
HCLTech-Starschema Blog


source: https://assets.weforum.org/article/image/large_9M3Mr4G6HoxKpUnt1kMxQ2FOH5Z0dOy0aRWwumfWFmw.jpg

The subject of gender, particularly gender inequality, has generated a lot of debate recently. This post aims to provide helpful insights for anyone who’d like to study gender proportions in specific fields. I will provide some tips for data collection using web scraping, as well as an automated way of finding the probable gender of a person based on their first name.

Data collection

If you are lucky, you may already have your data in a handy format, like an Excel or .csv file. This is rarely the case, though. In most analyses, you have to collect the data yourself, generally from a website. This methodology is called web scraping.

It’s important to note that not all accessible data may be collected. Just because you can see something in your browser does not mean that you are legally allowed to scrape it, and some websites actively protect themselves against web scrapers. Always make sure that what you do is legal! For instance, scraping Wikipedia is perfectly fine, while scraping social media websites is not allowed in most cases unless it is done through the public APIs of these websites.

It may sound intimidating, but basically scraping is just mimicking what your favorite browser does:

  1. Sends an HTTP request to a site.
  2. Parses the response it gets.
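
To make these two steps concrete, here is a minimal sketch that uses only Python's standard library. It is an illustration, not the code used for this analysis: `fetch_html`, `TitleParser`, `page_title` and the User-Agent string are hypothetical names chosen for the example, and the parser does nothing more than pull out the page title.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Step 1: send an HTTP GET request and return the response body as text."""
    # The User-Agent value is only a placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "gender-proportions-demo/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class TitleParser(HTMLParser):
    """Step 2: parse the response; here we only collect the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html: str) -> str:
    """Feed an HTML string to the parser and return the stripped title text."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Calling `page_title(fetch_html(url))` on a page you are allowed to scrape chains the two steps together. Nothing in the block touches the network until `fetch_html` is actually called.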

In some cases, websites protect their data from scrapers, but a very common source of information is Wikipedia, where no such protection is present and the information is free to use. Therefore, you can scrape anything you want from Wikipedia. Python even has a package for that. For didactic reasons, let’s not use the package, but scrape the information the old-fashioned way!

Let’s say we want to assess the gender of composers and lyricists of anthems from around the world. We go to this site and press Ctrl+Shift+I (in Google Chrome), or right-click virtually anywhere on the page and click Inspect. This is what you will see (you may have to switch to the Elements tab in the upper panel on the right):

On the right-hand side, you can inspect the structure of the website, which will be important for how you parse your response. The purple text refers to the tag of this element, through which you will be able to find it when you parse the response of the page.

In this code snippet, you can see what I did in this case: send a request and parse the response into a searchable BeautifulSoup object. In this object, you can easily find the specific piece of information you are looking for. In this case, a row corresponding to an anthem is stored in a <tr> tag, in which <td> tags contain the specific information I need. Do not hesitate to check my GitHub for the full code!
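
As a rough illustration of the <tr>/<td> lookup described above, the sketch below parses a small, made-up HTML table with BeautifulSoup. `SAMPLE_HTML`, its column layout and the `extract_rows` helper are assumptions for the example, not the structure of the real page or the code from the repository.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page; the names and columns are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Anthem</th><th>Lyricist</th><th>Composer</th></tr>
  <tr><td>Country A</td><td>Anthem A</td><td>Jane Doe</td><td>John Smith</td></tr>
  <tr><td>Country B</td><td>Anthem B</td><td>Mary Major</td><td>Richard Roe</td></tr>
</table>
"""


def extract_rows(html: str) -> list[list[str]]:
    """Return the text of every <td> cell, grouped by <tr> row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row holds only <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

For a live page, the `html` argument would typically be the body of an HTTP response, for example `requests.get(url).text`. Real tables often contain merged or nested cells, so expect to adapt the row handling.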

To further clarify, here is what you need to do to collect the data you want:

  1. Navigate to the page where the information is to be found.
  2. Inspect the structure of the website, find the tags where the information is stored.
  3. Using Python, send an HTTP request to the site.
  4. Using the BeautifulSoup object created from the response and the structure you learned in step 2, write the code that extracts and stores the information you need.
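
Step 4 ends with storing what was extracted. One possible way to do that, sketched here with the standard `csv` module, is to write one line per anthem. The column names, the sample rows and the `rows_to_csv` helper are hypothetical, and dropping rows with a mismatched cell count is just one simple policy.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Serialize extracted rows as CSV text, one line per anthem."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Skip rows whose cell count does not match the header (e.g. merged cells).
        if len(row) == len(header):
            writer.writerow(row)
    return buffer.getvalue()


header = ["country", "lyricist", "composer"]
rows = [
    ["Country A", "Jane Doe", "John Smith"],
    ["Country B", "Mary Major"],  # malformed on purpose: one cell is missing
]
csv_text = rows_to_csv(rows, header)
print(csv_text)
```

Writing `csv_text` to a file gives you the handy .csv format mentioned at the start of this section.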

Gender guessing

In order to make Python guess genders for us, the only thing we need is to supply it with a first name. gender-guesser is a Python package written for this purpose. It can return 6 different values: unknown (name not found), andy (androgynous), female, male, mostly_male, or mostly_female. The difference between andy and unknown is that the former is found to have the same probability of being male as of being female, while the latter means that the name wasn’t found in the database.
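
Once every name has one of these six labels, turning them into proportions is a small counting exercise. The sketch below is one way to do it; counting the "mostly" labels together with their gender and grouping the two ambiguous labels as "unresolved" are assumptions of this example, and `COLLAPSE` and `gender_shares` are hypothetical names.

```python
from collections import Counter

# Assumption for this sketch: "mostly_*" labels are counted with their gender.
COLLAPSE = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": "unresolved",
    "unknown": "unresolved",
}
CATEGORIES = ("male", "female", "unresolved")


def gender_shares(labels):
    """Return the share of each collapsed category among the given labels."""
    counts = Counter(COLLAPSE[label] for label in labels)
    total = sum(counts.values())
    if total == 0:  # no labels at all: report zero shares instead of dividing by zero
        return {category: 0.0 for category in CATEGORIES}
    return {category: counts[category] / total for category in CATEGORIES}


labels = ["male", "mostly_male", "female", "unknown", "male", "andy", "mostly_female", "male"]
shares = gender_shares(labels)
print(shares)
```

A large "unresolved" share is a sign that many names still need to be looked up by hand before the proportions can be trusted.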

In this snippet, you can see what I did. After instantiating the detector, I created a function which takes a pandas DataFrame column, extracts the first name, then performs gender guessing on it. Finally, it creates a new column named after the original one with the “_gender” (or any arbitrary) suffix and fills it with the guessed genders.
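
A function along those lines might look like the sketch below. It is not the original snippet: `make_guesser`, `add_gender_column` and `demo_guess` are hypothetical names, the first whitespace-separated token is assumed to be the first name, and a small lookup table stands in for the detector so the example runs without gender-guesser installed.

```python
import pandas as pd


def make_guesser():
    """Return gender-guesser's lookup function (needs `pip install gender-guesser`)."""
    import gender_guesser.detector as gender  # imported lazily so the rest runs without it

    return gender.Detector().get_gender


def add_gender_column(df, column, guess, suffix="_gender"):
    """Return a copy of df with `<column><suffix>` holding the guessed genders."""
    # Assumption: the first whitespace-separated token is the first name.
    first_names = df[column].fillna("").str.strip().str.split().str[0].fillna("")
    result = df.copy()
    result[column + suffix] = first_names.map(guess)
    return result


# Stand-in for the real detector so the example runs offline.
demo_lookup = {"Jane": "female", "John": "male"}


def demo_guess(name):
    return demo_lookup.get(name, "unknown")


df = pd.DataFrame({"composer": ["John Smith", "Jane Doe", None, "Xyzzy Q"]})
out = add_gender_column(df, "composer", demo_guess)
print(out)
```

To use the real package, pass `make_guesser()` instead of `demo_guess`. Names that are written family name first, or that start with a title or an initial, will defeat the first-token assumption and need their own handling.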

Concluding remarks

Do not forget to check the results manually at the end! In some cases, you must do a Google search to clarify unknown or andy cases, and it is always good to double-check your work. With that caveat, these are great tools to speed up collecting gender proportion data.


Machine Learning Team Lead, Chatbot Developer@K&H (KBC Bank Hungary)