IISMA Scraper

A (debatably) simple script to scrape the IISMA universities list and upload it to a publicly available Google Sheet.

January 2024 — February 2024

Summary

This script consists of three parts to do its job of scraping the data and uploading it to the sheet (a rough sketch of how they tie together follows the list):

  1. Scrape the data and put it into JSON

  2. Process the JSON into a human-readable XLSX

  3. Upload the XLSX to a previously created Google Sheet file
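
The orchestration could look something like this (a minimal sketch; the file names, function names, and module layout are assumptions, not the script's actual code):

  // run.js: hypothetical top-level orchestration of the three parts
  const fs = require('fs');
  const { scrapeUniversities } = require('./scrape'); // part 1: scrape with Puppeteer
  const { buildWorkbook } = require('./format');      // part 2: JSON -> XLSX via exceljs
  const { uploadToSheet } = require('./upload');      // part 3: push the XLSX to Google Sheets

  async function main() {
    const universities = await scrapeUniversities();
    fs.writeFileSync('universities.json', JSON.stringify(universities, null, 2));
    await buildWorkbook(universities, 'universities.xlsx');
    await uploadToSheet('universities.xlsx');
  }

  main().catch(console.error);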

Developer Diaries

Origin

I wanted to choose the university I'd be going to based on the courses it provides, so at first I made this primarily to lay out all of the available courses in one sheet. To do that, I'd have to compare and contrast the courses of every university against one another, but using the IISMA site for this is a huge hassle because the courses are buried in each university's individual page. Of course, I could have just made the sheet manually, but here's why I didn't:

  1. I thought it would be a cool project to scrape the site.

  2. I wanted to reuse the web scraping skills I acquired during the database laboratory assistant selection process (especially since I didn't get accepted).

  3. At the time I initialized this project, I knew that the universities list had not been updated, so making a sheet at that point would obviously have been futile. Nonetheless, I wanted to have the most up-to-date sheet as soon as the list was updated. With this script, I can just rerun it once the site is updated and get that up-to-date sheet almost instantly.

  4. Having Excel's powerful filtering tools available would provide me with a rigorous way to filter through the universities.

Thus, the project began. I would say it took me approximately 3-4 working days to put this together (the work was, of course, spread out over a much broader time span, but compacted into one continuous stretch, that'd be it). In the end, I'm glad I did this because this year IISMA ended up having more than 100 universities (114 to be precise), and it would have been so tedious had I made the sheet manually.

Scope Creep

At first, I only wanted to have all of the courses that each university offers, without any of their other attributes (intake statistics, language requirements, etc.). And I didn't even think of uploading the result to a Google Sheet; I only thought of printing the result to a local file. It's just that as I proceeded with the scraping, I was almost always thinking, "hey, wouldn't it be nice to have this info scraped as well?", and that's how the sheet ended up having more than 60 columns for each university.

The thought of uploading the result to a Google Sheet primarily stems from a bug with LibreOffice Calc. As a Linux user, I cannot use Microsoft Office (or at least I haven't invested the time to work around that), so I use LibreOffice on my machine locally. But after the script was able to output XLSX, I found that LibreOffice Calc wasn't able to display the XLSX according to the formatting I had set in the script. I then uploaded the XLSX to Google Drive to check whether Google Sheets could show it in the proper format, and it did. Since it'd be a hassle (it's actually not, I'm just lazy) to have to manually upload the XLSX to Google Sheets just to see the script's output after it's been run, I decided to add another component to this script that handles uploading the XLSX to Google Sheets.

Challenges

The most complex part of this script, by a long shot, is the first component. Its responsibility is not just to scrape and output the raw data but also to preprocess it, because I wanted the outputted JSON to simply be read and formatted into an XLSX without any further processing by the script's second component. And the two tasks it's been given are not trivial.

The IISMA website is a static site made with WordPress (I know because I saw that the HTML elements had Elementor's class names), so it didn't require complex use of the Puppeteer library (I only had to open the browser, load the content, and get the HTML). But WordPress sites notoriously have verbose HTML. I've worked on a WordPress site before, so I've seen how a simple container can have half a dozen HTML elements encapsulating it and how modifying the style of an element can result in a wholly new element being generated. This, combined with the fact that each individual university page on the IISMA website was probably created manually, one by one, makes the semantic structure of each page vary in subtle ways. This wouldn't matter that much if the website had meaningful class name and id attributes on its HTML elements. But this is a WordPress site, so it doesn't have that either.
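
The Puppeteer part really was that simple; something along these lines (a sketch, with a placeholder URL and no error handling):

  const puppeteer = require('puppeteer');

  async function fetchPageHtml(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' }); // static site, so just wait for it to settle
    const html = await page.content();                   // the fully rendered HTML
    await browser.close();
    return html;
  }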

Accounting for the subtle differences in how each page is structured made the scraping code a lot more complex because it has to be littered with handling for exceptions to the rule. Notably, the structure of the HTML elements for the intake statistics is wildly different across pages. Some have each year separated into several p elements, others have it all in a single span element, and so on. The solution in the end was to use this godsend of a library to nuke the issue from orbit by stripping the HTML, leaving only the text content to be parsed.
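
As an illustration only (with cheerio standing in, not necessarily the library actually used), the stripping step amounts to something like:

  const cheerio = require('cheerio');

  // Load the messy intake-statistics markup and keep only its text content, so the parser
  // downstream no longer cares how each page happened to nest its p and span elements.
  function intakeStatisticsText(sectionHtml) {
    const $ = cheerio.load(sectionHtml);
    return $.root().text().replace(/\s+/g, ' ').trim();
  }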

Preprocessing is another issue. The attributes that have to be stored are not laid out in conveniently placed divs; they are buried in the text content of the pages. This is especially apparent for the intake statistics, where I had to chain about two dozen string replacements to "tokenize" the text into an easily parsed form and extract all of the intake statistics' attributes.
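
For a flavor of what that chain looks like (the patterns and markers below are invented for illustration; the real chain is much longer and tuned to the exact wording on the IISMA pages):

  // Hypothetical miniature of the "tokenizing" replacement chain.
  function tokenizeIntakeStats(text) {
    return text
      .replace(/number of applicants/gi, ';applicants=')
      .replace(/number of awardees/gi, ';awardees=')
      .replace(/co-?funding/gi, ';cofunding ')
      .replace(/\s+/g, ' ')
      .split(';')
      .map(token => token.trim())
      .filter(Boolean);
  }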

Originally, I thought that formatting the result into a readable XLSX would be the hardest part, but it was actually the easiest. The exceljs library that I used has very concise and clear documentation that I quickly grasped. The biggest issue with this part was actually deciding what the outputted XLSX should look like, which is explored in the latter part of this post.
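
The gist of the exceljs usage is roughly this (a sketch; the column definitions are made up, and the real sheet has over 60 of them):

  const ExcelJS = require('exceljs');

  async function buildWorkbook(universities, outPath) {
    const workbook = new ExcelJS.Workbook();
    const sheet = workbook.addWorksheet('Universities');
    sheet.columns = [
      { header: 'University', key: 'name', width: 40 },
      { header: 'Country', key: 'country', width: 20 },
      { header: 'Courses', key: 'courses', width: 80 },
    ];
    for (const uni of universities) {
      sheet.addRow({ name: uni.name, country: uni.country, courses: uni.courses.join(', ') });
    }
    sheet.getRow(1).font = { bold: true }; // the kind of formatting LibreOffice Calc mangled
    await workbook.xlsx.writeFile(outPath);
  }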

Another thing that has to be kept in mind is inferring certain aspects of the universities from the outputted JSON. For example, if a university doesn't have any intake statistics at all, it can be inferred that the university didn't participate in previous years. But if the university is merely missing intake statistics for last year's co-funding applicants and awardees, then there simply weren't any co-funding applicants for that university last year.
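
In code, that inference could look something like this (the field names and the year are assumptions for illustration):

  function describeIntake(university) {
    if (!university.intakeStatistics) {
      return 'Did not participate in previous years';
    }
    const lastYear = university.intakeStatistics['2023'];
    if (lastYear && lastYear.cofundingApplicants == null) {
      return 'Participated, but had no co-funding applicants last year';
    }
    return 'Participated with both regular and co-funding applicants';
  }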

The final component of the script, which uploads the result to the Google Sheet, should've been pretty easy to do, but I got lost in Google's official documentation, which is not easy to follow for someone who isn't familiar with how the overall Google API works. I ended up following most of the steps in this tutorial, and the code only differs slightly from the tutorial's end result.
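
One common way to do it (not necessarily exactly what the tutorial does) is to update the existing Drive file with the freshly generated XLSX and let Drive convert it back into Sheets format; a sketch, with the credentials path and file ID as placeholders:

  const fs = require('fs');
  const { google } = require('googleapis');

  async function uploadToSheet(xlsxPath) {
    const auth = new google.auth.GoogleAuth({
      keyFile: 'service-account.json', // placeholder credentials file
      scopes: ['https://www.googleapis.com/auth/drive'],
    });
    const drive = google.drive({ version: 'v3', auth });
    await drive.files.update({
      fileId: 'YOUR_SPREADSHEET_FILE_ID', // the previously created Google Sheet
      media: {
        mimeType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        body: fs.createReadStream(xlsxPath),
      },
    });
  }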

A Lesson in Displaying Data

When I first started this project, I wondered why they didn't present the information in a way that is more practical for selecting universities. But after ending up with a Google Sheet that has 62 columns with numerous empty cells, I think I understand why.

Each university is different; some have quirks that cannot be displayed in a uniform Excel sheet. This is most apparent with the intake statistics, which have many variations. Some universities only became part of IISMA last year (and so don't have statistics for 2022), while others didn't have any co-funding applicants, and so on. The language statistics are also wildly different. For example, some universities had no TOEFL applicants at all in a particular year, while others accepted a range of TOEFL scores in that same year. Accounting for all these differences in an Excel sheet results in a large number of columns that are quite sparsely populated. Even now, some universities have particularities that simply cannot be put into an Excel sheet without making it unwieldy, such as having multiple campuses with different courses. This is the case with the University of Toronto, my first choice of university for the program, which, as I only realized after going to its page, has two campuses, Mississauga and Scarborough, each with different courses.

I've also learned that seemingly simple data may require a somewhat complex format to be saved. The intermediary JSON file created by the script's scraping component is not that complex, but I was still surprised by its complexity during development, since by looking at an individual university's page I had originally thought there wasn't that much information about each university.
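
For a flavor of what I mean, each per-university entry ends up nesting courses, language requirements, and multi-year intake statistics (an invented, abbreviated example, not the script's actual schema or data):

  const exampleEntry = {
    name: 'Example University',
    courses: ['Course A', 'Course B'],
    languageRequirements: { toefl: 90, ielts: 6.5 },
    intakeStatistics: {
      2022: null, // didn't participate that year
      2023: { applicants: 250, awardees: 20, cofundingApplicants: null },
    },
  };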

Conclusion

I had a lot of fun making this script because I got acquainted with many useful libraries that might be of use in other projects in the future. My web scraping skills also got sharpened by the non-uniform structure of the scraped pages, something I didn't encounter with the previous website due to its comparatively modern tech stack.

All things considered, I'm pretty satisfied with the end result because I actually used the outputted Google Sheet to quantitatively decide my choices of university (University of Toronto, University of Manchester, and Lancaster University) based on my interest in the provided courses. Additionally, I'm really glad to be able to help other IISMA applicants choose their destination university by sharing this sheet publicly with my peers.

Gallery