
Webscraping Fandom to Build a Custom Dataset

Published: at 06:00 PM


Time to Build Our Own Dataset

The first step in building the dataset is to find a good source of information. Fandom seems to be the perfect source for this project, even if the data is unstructured. Now to parse it.

The work from the Kaggle dataset gathers information from the character list page. My goal is to gather more information for my dataset beyond that.

Using Beautiful Soup

Beautiful Soup (bs4) is a Python package for web scraping. Since Fandom has no working API, we will parse the information manually from the rendered HTML.

This is what we see on the Character list page: a set of image galleries, one per group of characters.

Inspecting the page in the browser, we can see that the character galleries sit between h2 tags. So the first thing the script does is collect everything between consecutive h2s. The result is a pile of unreadable markup that we will make sense of later.

# Building upon https://www.kaggle.com/code/birlinha/data-mining-boku-no-hero-my-hero-academia
from bs4 import BeautifulSoup, NavigableString
from rich import print as richprint
import requests

url = 'https://myheroacademia.fandom.com/wiki/List_of_Characters'

html_data = requests.get(url)
soup = BeautifulSoup(html_data.text, 'html.parser')
h2_tags = soup.find_all('h2')
new_soups = []
# Find content between h2 tags
for i in range(len(h2_tags) - 1):
        # Get the text between the current h2 and the next h2
        text_between_h2s = []
        next_tag = h2_tags[i + 1]
        for sibling in h2_tags[i].find_next_siblings():
                if sibling == next_tag:
                        break
                if isinstance(sibling, NavigableString):
                        text_between_h2s.append(str(sibling).strip())
                else:
                        text_between_h2s.append(str(sibling))
        
        # Create a new soup with the text between the h2 tags
        new_soup = BeautifulSoup(''.join(text_between_h2s), 'html.parser')
        new_soups.append(new_soup)
richprint(new_soups)

After some more inspection, we can see that each gallery is numbered (quite arbitrarily, if you ask me), but each one has an "id" attribute with a value of the form "gallery-{i}", where i is a number. Checking ids up to i=100 is enough to capture everything. The relevant things to extract from this page are the character name, the image URL, and the character's URL on Fandom (used later to extract the remaining information).

galleries_element_list = []
people_info = []
name_list = []

# Collect every gallery div first
for s in new_soups:
        for i in range(0, 100): # 100 is arbitrary, large enough to cover every gallery id
                gallery = s.find("div", attrs={"id": f"gallery-{i}"})
                if gallery:
                        galleries_element_list.append(gallery)

# Then extract the name, character URL and image URL from each gallery item
for g in galleries_element_list:
        people = g.find_all(name="div", attrs={"class": "wikia-gallery-item"})
        for person in people:
                chargallery_prof = person.find(name="div", attrs={"class": "chargallery-profile-caption"})
                href_person = chargallery_prof.find(name="a").get("href")
                name = chargallery_prof.find(name="a").text
                img = person.find("img").get("src")
                img_url = img.split("/revision/")[0]
                if name not in name_list: # Prevent duplicates
                        people_info.append({"name": name, "href_person": href_person, "img_url": img_url})
                        name_list.append(name)
richprint(people_info)

Getting Character Attributes

Now that we have the base character information, we can start working on the details of each character. To keep everything organized, we create a class to hold the information. The class is based on SQLModel (a combination of Pydantic and SQLAlchemy), which has several advantages.

The initial idea for this script was to send this data to a database; with SQLModel we just need to add __tablename__ as a class attribute to map it to a table (a table-creation sketch follows the class definition below). On top of that, we get Pydantic-style validation and serialization via model_dump() for free, which we use later when dumping the parsed characters.

Note: the attributes in the class were compiled by making a first pass over all the characters and listing every label that appears (this is not the first version of this script ;)).
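
A rough sketch of what that exploratory pass could look like (assuming the people_info list from the previous step; the label normalization mirrors what transform_atts does further down):

BASE_URL = "https://myheroacademia.fandom.com"

# Exploratory pass: collect every infobox label that appears across all character pages
all_labels = set()
for char in people_info:
        page = requests.get(BASE_URL + char["href_person"])
        aside = BeautifulSoup(page.text, 'html.parser').find("aside")
        if aside is None:
                continue  # Some pages may not have an infobox
        for label in aside.find_all("h3", {"class": "pi-data-label"}):
                all_labels.add(label.text.replace(' ', '_').replace('(', '').replace(')', '').replace('ō', 'o').lower())
richprint(sorted(all_labels))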

from sqlmodel import SQLModel, Field, Column, String, ARRAY

class PeopleAtts(SQLModel, table=True):
        __tablename__ = "people_atts"
        
        id: int | None = Field(default=None, primary_key=True)
        href_person: str | None = Field(default=None)
        img_url: str | None = Field(default=None)
        occupation: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        quirk: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        manga_debut: str | None = Field(default=None)
        romaji_name: str | None = Field(default=None)
        quirk_range: str | None = Field(default=None)
        real_name: str | None = Field(default=None)
        eye_color: str | None = Field(default=None)
        english_va: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        gender: str | None = Field(default=None)
        teams: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        birthday: str | None = Field(default=None)
        participants: str | None = Field(default=None)
        located_in: str | None = Field(default=None)
        status: str | None = Field(default=None)
        leaders: str | None = Field(default=None)
        user: str | None = Field(default=None)
        location: str | None = Field(default=None)
        japanese_name: str | None = Field(default=None)
        blood_type: str | None = Field(default=None)
        alias: str | None = Field(default=None)
        skin_color: str | None = Field(default=None)
        other_members: str | None = Field(default=None)
        epithet: str | None = Field(default=None)
        hair_color: str | None = Field(default=None)
        movie_debut: str | None = Field(default=None)
        height: str | None = Field(default=None)
        japanese_va: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        quirk_type: str | None = Field(default=None)
        anime_debut: str | None = Field(default=None)
        fighting_style: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        vigilantes_debut: str | None = Field(default=None)
        age: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        kanji_name: str | None = Field(default=None)
        weight: str | None = Field(default=None)
        leader: str | None = Field(default=None)
        family: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        birthplace: str | None = Field(default=None)
        affiliation: list[str] | None = Field(sa_column=Column(ARRAY(String)))
        anime_debut_arc: str | None = Field(default=None)

Visiting a character's page, I am presented with the character's infobox (the panel holding all the attributes). We need to parse its information, and the process is similar to what I did before: I first grab the "aside" portion of the page and then read the pairs of "pi-data-label" and "pi-data-value" tags. These match 1-to-1, so there is no need to fix edge cases here.

A few notes regarding the parsing are included as comments in the code below:

import re
BASE_URL = "https://myheroacademia.fandom.com"
counter, total = 0, len(people_info)  # Simple progress tracking across calls

def transform_atts(name, img_url, href_person):
        global counter, total
        counter += 1
        data_html = requests.get(BASE_URL + href_person)
        soup = BeautifulSoup(data_html.text, 'html.parser')

        aside = soup.find("aside")

        pi_label = aside.find_all("h3", {"class": "pi-data-label"})
        pi_val = aside.find_all("div", {"class": "pi-data-value"})

        new_label = [item.text.replace(' ', '_').replace('(', '').replace(')', '').replace('ō', 'o').lower() for item in pi_label]
        pre_parsed_pi = dict(zip(new_label, pi_val))
        parsed = {'real_name': name, 'img_url': img_url, 'affiliation': None, 'occupation': None, 'teams': None, 'family': None, 'fighting_style': None, 'japanese_va': None, 'english_va': None, 'quirk': None}
        for k, v in pre_parsed_pi.items():
                if k == 'anime_debut':
                        a_tags = v.find_all('a')
                        if a_tags:
                                parsed[k] = a_tags[-1].get('href')
                        else:
                                parsed[k] = v.text
                elif k == 'height': # Parse height to only get the number
                        matches = re.findall(r"\b\d{2,3}\s?cm\b", v.text)
                        parsed[k] = matches[0].replace(' ', '')[:-2] if matches else v.text
                elif k in ['alias']: # Only get the first element
                        clean = [s for s in v.stripped_strings if '(' not in s and ')' not in s and '[' not in s and ']' not in s]
                        parsed[k] = clean[0] if clean else None
                elif k in ['affiliation', 'occupation', 'teams', 'family', 'fighting_style', 'japanese_va', 'english_va', 'quirk']:
                        parsed[k] = [s for s in v.stripped_strings if '(' not in s and ')' not in s and '[' not in s and ']' not in s]
                else: # Get text instead of array form
                        cleaned_text = re.sub(r'\[\d+\]|\([^)]*\)', '', v.text) # Remove [1] and (text)
                        parsed[k] = cleaned_text.strip()
        if parsed.get("age") is not None:
                parsed["age"] = parsed.get("age").split(' ')

        richprint(f"Transformed {parsed.get('real_name')} - {counter} of {total}")
        return PeopleAtts(href_person=href_person, **parsed)

Adding Arc Information

To add the arc information, I use a similar strategy to the one used for character info, starting from the episode collected in anime_debut. Each episode page has the same kind of infobox.

Very importantly, it contains an "Arc" pi-data property (for example, "Entrance Exam Arc"). I also found edge cases where an episode has multiple arcs; for consistency, I select the first one (as I do for alias).

def _add_arc(epi_name): # Fix for multiple arcs
        arc_url = BASE_URL + '/wiki/' + epi_name if not epi_name.startswith('/wiki/') else BASE_URL + epi_name
        data_html = requests.get(arc_url)
        soup = BeautifulSoup(data_html.text, 'html.parser')
        aside = soup.find("aside")
        pi_label = aside.find_all("h3", {"class": "pi-data-label"})
        pi_val = aside.find_all("div", {"class": "pi-data-value"})
        new_label = [item.text.replace(' ', '_').replace('(', '').replace(')', '').replace('ō', 'o').lower() for item in pi_label]
        pre_parsed_pi = dict(zip(new_label, pi_val))
        parsed = {}
        for k, v in pre_parsed_pi.items():
                if k == 'arc':
                        clean = [s for s in v.stripped_strings if '(' not in s and ')' not in s and '[' not in s and ']' not in s]
                        parsed[k] = clean[0] if clean else None # Some episodes have multiple arcs so we only get the first one
                else:
                        cleaned_text = re.sub(r'\[\d+\]|\([^)]*\)', '', v.text) # Remove [1] and (text)
                        parsed[k] = cleaned_text.strip()
        return parsed.get("arc", None)
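
One thing worth noting: many characters debut in the same episode, so _add_arc ends up requesting the same page repeatedly. A small cache (not part of the original script) would avoid the duplicate requests:

from functools import lru_cache

@lru_cache(maxsize=None)
def _add_arc_cached(epi_name):
        # Delegates to _add_arc, but each distinct episode page is only fetched once
        return _add_arc(epi_name)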

I also update the transform_atts function with this information:

if parsed.get("anime_debut") is not None:
        parsed["anime_debut_arc"] = _add_arc(parsed.get("anime_debut"))

To run the full pipeline, we just need to put everything together and loop over the characters collected earlier:

pp_list = {}
for char in people_info:
        if char.get('name') is not None:
                pp_list[char['name']] = transform_atts(char.get('name'), char.get('img_url'), char.get("href_person")).model_dump()
richprint(pp_list)
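
Before the post-processing step, the result gets dumped to a JSON file. A minimal sketch, assuming we save pp_list directly; the filename matches what fix_people_data expects below:

import json

with open('filtered_people.json', 'w') as f:
        json.dump(pp_list, f, indent=2, ensure_ascii=False)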

Post Processing

After doing all this, one might think we are done, but no: I still had to fix some things manually (which took about two minutes and wasn't worth automating), making sure that occupation and affiliation are correct.

For this, I created a mini script that prints every affiliation and occupation value (I could verify everything else too, but I am only interested in this subset of attributes), so I can fix the entries that have something wrong (an extra comma, a stray "Leader of the" prefix, and so on).

import json
from models.people import PeopleAtts
from rich import print as richprint

def fix_people_data(path):
        # Print every affiliation and occupation value for manual verification
        with open(path) as f:
                data = json.load(f)
        char = {k: PeopleAtts(**v) for k, v in data.items()}
        affiliations = set()
        for k, v in char.items():
                if v.affiliation is not None:
                        for aff in v.affiliation:
                                affiliations.add(aff)
        richprint(affiliations)
        richprint('----------------')
        occupations = set()
        for k, v in char.items():
                if v.occupation is not None:
                        for occ in v.occupation:
                                occupations.add(occ)
        richprint(occupations)

fix_people_data('filtered_people.json')