Introduction
This article builds a web scraper with BeautifulSoup and Selenium to scrape the QS World University Rankings and discover the top universities around the world. The university name, ranking, and location are fetched from the results table and stored as a CSV file. A Jupyter notebook is also available through this link.
Python Implementation
Download the Chrome WebDriver (chromedriver) first.
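To confirm the driver works before running the scraper, a minimal smoke test might look like the sketch below. It assumes chromedriver.exe sits in the working directory or on your PATH, and uses the Selenium 3 style API that the rest of the article uses.

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is discoverable; pass an explicit path otherwise
driver.get("https://www.topuniversities.com")
print(driver.title)  # if this prints a page title, the driver is set up correctly
driver.quit()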
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from bs4 import BeautifulSoup
from selenium import webdriver
%matplotlib inline
Main Code
Only 25 universities are listed in the table per page, so I have to set the number of pages I want to crawl (max: 40 pages, i.e. 1,000 universities).
- Open the HTML via the Chrome driver (make sure chromedriver.exe is in the same directory).
- Parse the HTML using BeautifulSoup.
- Loop over every row of the table and crawl its elements (ranking, university name, location).
- Click the next-page button and start the loop in step 3 over again.
- Stop fetching data once all pages are done.
def get_uni_information(year, unilist, page=40):
    url = r"https://www.topuniversities.com/university-rankings/world-university-rankings/{}".format(year)
    # Open the QS Ranking page with the Chrome driver (adjust driver_path to your machine)
    driver_path = r"C:\Users\YangWang\Desktop\machineLearning\indiaNewsClassification\chromedriver.exe"
    driver = webdriver.Chrome(driver_path)
    time.sleep(2)
    driver.get(url)
    time.sleep(5)
    # Crawl all the pages (max page is 40)
    if page <= 40:
        for _ in range(int(page)):
            # Use BeautifulSoup to parse the current page
            soup = BeautifulSoup(driver.page_source, "html.parser")
            # Find the table which contains the information I want
            x = soup.find(name="table", attrs={"class": "dataTable no-footer"})
            # Loop over every row in the table body and append it to the list
            for tr in x.find(name="tbody").find_all(name="tr"):
                try:
                    tds = tr("td")
                    if tds[0].find(name="span") is not None:
                        rank = tds[0].find(name="span").string
                    else:
                        rank = None
                    if tds[1].find(name="a") is not None:
                        uni = tds[1].find(name="a").string
                    else:
                        uni = None
                    if tds[2].find(attrs={"class": "td-wrap"}) is not None:
                        location = tds[2].find(attrs={"class": "td-wrap"}).string
                    else:
                        location = None
                except (RuntimeError, TypeError, NameError):
                    continue  # skip malformed rows instead of appending stale values
                unilist.append([rank, uni, location])
            # Click the next-page button via JavaScript, which avoids
            # "element not clickable" errors when the button is covered
            element = driver.find_element_by_xpath('//*[@id="qs-rankings_next"]')
            driver.execute_script("arguments[0].click();", element)
            time.sleep(5)
    else:
        print("Max page is 40.")
    driver.quit()
    return unilist
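As a quick sanity check, the function can be called on its own before wiring it into a DataFrame. The sketch below crawls only the first two pages; the exact strings returned depend on the live page.

rows = get_uni_information(2020, [], page=2)
print(len(rows))  # 50 rows: 25 universities per page, 2 pages
print(rows[0])    # e.g. ['1', 'Massachusetts Institute of Technology (MIT)', 'United States']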
Get the DataFrame
Use the get_uni_information() function to crawl the information and store it in a DataFrame. Then do some preprocessing to make sure every column has the right data type.
def get_qs_ranking_dataframe(year=2020, page=40):
    unilist = []
    unilist = get_uni_information(year, unilist, page)
    df = pd.DataFrame(unilist)
    df.columns = ["ranking", "uni", "location"]
    df = df.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back
    # DataFrame preprocessing: replace the scraped ranking strings with
    # sequential integers, and make sure the text columns are strings
    df["ranking"] = [int(x) + 1 for x in range(len(df))]
    df["uni"] = df["uni"].map(str)
    df["location"] = df["location"].map(str)
    return df
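The introduction mentions storing the results as a CSV file; once the qs_2020 DataFrame has been created (as in the next snippet), pandas writes it out in one line. The file name here is just an example.

qs_2020.to_csv("qs_ranking_2020.csv", index=False)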
Take a look at the dataframe.
Japan
qs_2020 = get_qs_ranking_dataframe(year=2020, page=40)
qs_2020[(qs_2020["location"] == "Japan") & (qs_2020["ranking"] <= 100)]
ranking | uni | location |
---|---|---|
23 | The University of Tokyo | Japan |
34 | Kyoto University | Japan |
59 | Tokyo Institute of Technology (Tokyo Tech) | Japan |
71 | Osaka University | Japan |
82 | Tohoku University | Japan |
United States
qs_2020[(qs_2020["location"] == "United States") & (qs_2020["ranking"] <= 100)]
ranking | uni | location |
---|---|---|
1 | Massachusetts Institute of Technology (MIT) | United States |
2 | Stanford University | United States |
3 | Harvard University | United States |
5 | California Institute of Technology (Caltech) | United States |
10 | University of Chicago | United States |
13 | Princeton University | United States |
14 | Cornell University | United States |
15 | University of Pennsylvania | United States |
17 | Yale University | United States |
18 | Columbia University | United States |
21 | University of Michigan-Ann Arbor | United States |
24 | Johns Hopkins University | United States |
25 | Duke University | United States |
28 | University of California, Berkeley (UCB) | United States |
31 | Northwestern University | United States |
36 | University of California, Los Angeles (UCLA) | United States |
39 | New York University (NYU) | United States |
45 | University of California, San Diego (UCSD) | United States |
48 | Carnegie Mellon University | United States |
56 | University of Wisconsin-Madison | United States |
57 | Brown University | United States |
65 | University of Texas at Austin | United States |
68 | University of Washington | United States |
72 | Georgia Institute of Technology | United States |
74 | University of Illinois at Urbana-Champaign | United States |
86 | Rice University | United States |
90 | University of North Carolina, Chapel Hill | United States |
93 | Pennsylvania State University | United States |
98 | Boston University | United States |
United Kingdom
qs_2020[(qs_2020["location"] == "United Kingdom") & (qs_2020["ranking"] <= 100)]
ranking | uni | location |
---|---|---|
4 | University of Oxford | United Kingdom |
7 | University of Cambridge | United Kingdom |
8 | UCL | United Kingdom |
9 | Imperial College London | United Kingdom |
20 | The University of Edinburgh | United Kingdom |
27 | The University of Manchester | United Kingdom |
33 | King’s College London | United Kingdom |
44 | The London School of Economics and Political S… | United Kingdom |
49 | University of Bristol | United Kingdom |
62 | The University of Warwick | United Kingdom |
67 | University of Glasgow | United Kingdom |
77 | Durham University | United Kingdom |
78 | The University of Sheffield | United Kingdom |
81 | University of Birmingham | United Kingdom |
94 | University of Leeds | United Kingdom |
96 | University of Nottingham | United Kingdom |
97 | University of Southampton | United Kingdom |
100 | University of St Andrews | United Kingdom |
Taiwan
qs_2020[(qs_2020["location"] == "Taiwan") & (qs_2020["ranking"] <= 1000)]
ranking | uni | location |
---|---|---|
69 | National Taiwan University (NTU) | Taiwan |
174 | National Tsing Hua University | Taiwan |
226 | National Cheng Kung University (NCKU) | Taiwan |
228 | National Chiao Tung University | Taiwan |
251 | National Taiwan University of Science and Tech… | Taiwan |
288 | National Yang Ming University | Taiwan |
333 | National Taiwan Normal University | Taiwan |
381 | Taipei Medical University (TMU) | Taiwan |
412 | National Sun Yat-sen University | Taiwan |
427 | National Central University | Taiwan |
485 | Chang Gung University | Taiwan |
514 | National Taipei University of Technology | Taiwan |
556 | National Chengchi University | Taiwan |
669 | Kaohsiung Medical University | Taiwan |
675 | National Chung Hsing University | Taiwan |
866 | National Chung Cheng University | Taiwan |
Visualise
Visualise the top top_ranking universities and show the top num countries in the chart for a given year.
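Under the hood this is just a bar chart of location value counts. You can preview the underlying series directly; filtering on ranking is equivalent to slicing the first top_ranking rows, since ranking is sequential.

qs_2020[qs_2020["ranking"] <= 100]["location"].value_counts().head(10)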
def visualise_qs_ranking(df, year, top_ranking, num):
    """
    df: dataframe
    year: year of the qs ranking
    top_ranking: top # of universities to be selected
    num: # of countries to be visualised
    """
    plt.style.use('seaborn-paper')
    # Keep only the top-ranked universities, then count how many fall in each country
    top = df.iloc[0:top_ranking]
    ax = (top['location'].value_counts().head(num).plot(
        kind='barh',
        figsize=(20, 10),
        color="tab:blue",
        title="Number of Top {} Universities in QS Ranking {}".format(len(top['location']), str(year))))
    ax.set_xticks(np.arange(0, top['location'].value_counts()[0] + 2, 1))
visualise_qs_ranking(df=qs_2020, year=2020, top_ranking=100, num=10)
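To keep the chart, plt.savefig can be called in the same notebook cell, right after the call above. The file name here is just an example.

plt.savefig("qs_2020_top10_countries.png", dpi=150, bbox_inches="tight")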
Conclusion
This article combined Python's Selenium and BeautifulSoup to build a dynamic web crawler. The complete code can be found on my GitHub; if you run into problems while implementing it, feel free to share them in the comments below.