Recommending Your Favorite Manhwas, Mangas, and Manhuas with a Machine Learning Approach¶

Sharath Kannan and Eric Cho

UID: 117174137 and 117287493

Introduction¶

Welcome to our cutting-edge recommendation system for manhwas, mangas, and manhuas! Are you tired of spending endless hours searching for your next captivating read in the world of graphic literature? Our advanced machine learning models are here to assist you in discovering hidden gems and popular masterpieces tailored to your unique preferences. In case you are unaware, manhwas are comics created in South Korea, mangas are comics created in Japan, and manhuas are comics created in China. These comics have gained immense popularity worldwide. By leveraging the power of data analysis and machine learning, we have created an intelligent platform that understands your preferences and suggests the most engaging and immersive reading experiences. So, how does our system work? It all starts with you. We invite you to embark on a journey of self-discovery by answering a series of questions that allow us to gain insight into your interests, preferred genres, and narrative themes. Your responses will be carefully analyzed and processed by our state-of-the-art machine learning models, which have been trained on an extensive dataset of manhwas, mangas, and manhuas.

Once we have captured your unique preferences, our models will work their magic, employing advanced pattern recognition techniques and collaborative filtering to match you with manhwas, mangas, and manhuas that align perfectly with your tastes. Whether you're a fan of thrilling action-packed adventures, heartwarming romance, mind-bending mysteries, or thought-provoking dramas, our system has got you covered. We also understand that preferences can evolve and change over time. As you explore the titles recommended to you, our system will continually learn from your interactions, adapting and fine-tuning its suggestions to ensure a personalized experience. The more you engage with our platform, the better it becomes at predicting your future preferences and introducing you to captivating stories that you might have missed otherwise. So, say goodbye to endless searching and let our recommendation system be your guide in the world of manhwas, mangas, and manhuas. Your next immersive and thrilling reading experience is just a few clicks away!

Recommender System¶

Content-based filtering¶

  • Content-based filtering relies on analyzing the characteristics and attributes of the items being recommended. In the context of manhwas, mangas, and manhuas, content-based filtering would involve examining the content itself, such as the genres, themes, and other textual or visual features.
  • The system matches content resources to user characteristics. Content-based filtering techniques normally base their predictions on a user's information (Isinkaye et al., 2015) and then suggest items with attributes similar to the ones the user has enjoyed in the past.
  • For example, if a user has shown a preference for the romance and fantasy genres, the content-based filtering algorithm would recommend manhwas, mangas, or manhuas that contain similar elements (a toy sketch of this idea follows this list).
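To make this concrete, here is a minimal, self-contained sketch of content-based scoring with made-up titles and tag vectors (the real one-hot tag matrix for our dataset is built later in this notebook):

# Minimal sketch of content-based filtering on binary tag vectors.
# The titles and tag vectors here are made up for illustration.
import numpy as np

tags = ["Romance", "Fantasy", "Action", "Comedy"]
items = {
    "Title A": np.array([1, 1, 0, 0]),   # Romance + Fantasy
    "Title B": np.array([1, 1, 0, 1]),   # Romance + Fantasy + Comedy
    "Title C": np.array([0, 0, 1, 1]),   # Action + Comedy
}
# a user who liked Romance/Fantasy titles gets a matching profile vector
user_profile = np.array([1, 1, 0, 0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# rank items by similarity of their tags to the user's profile
for title, vec in sorted(items.items(), key=lambda kv: -cosine(user_profile, kv[1])):
    print(title, round(cosine(user_profile, vec), 3))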

Collaborative filtering¶

  • Collaborative filtering recommends items by identifying other users with similar tastes; it uses their opinions to recommend items to the active user (Isinkaye et al., 2015). It works by finding similarities between users or items based on their interactions and preferences. There are two main types of collaborative filtering: user-based and item-based.
  • User-based collaborative filtering: This approach finds users who have similar preferences to the target user and recommends items that those similar users have liked (Isinkaye et al., 2015). For example, if User A and User B have similar reading habits and User B enjoyed a particular manhwa, then the system would recommend that manhwa to User A (a toy sketch of this approach follows this list).
  • Item-based collaborative filtering: In this approach, the system identifies items that are similar to each other based on how users have interacted with them. If User A enjoys a certain manhwa, the system recommends other manhwas that received similar ratings from the same readers.
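Here is a minimal sketch of the user-based variant on a made-up user-item rating matrix; the users, items, and ratings are all invented for illustration:

# Minimal sketch of user-based collaborative filtering on a toy
# user-item rating matrix (rows = users, columns = items).
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],   # User A (0 = unrated)
    [5, 5, 4, 2],   # User B, tastes similar to A
    [1, 0, 5, 4],   # User C, different tastes
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0  # recommend for User A
sims = [cosine(ratings[target], ratings[u]) for u in range(len(ratings))]
# most similar other user
neighbor = max((u for u in range(len(ratings)) if u != target), key=lambda u: sims[u])
# recommend items the neighbor rated highly (>= 4) that the target hasn't rated
recs = [i for i in range(ratings.shape[1])
        if ratings[target, i] == 0 and ratings[neighbor, i] >= 4]
print("nearest neighbor:", neighbor, "recommended item indices:", recs)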

Hybrid-based filtering¶

  • Hybrid-based filtering combines two or more filtering techniques in order to increase the accuracy and performance of recommender systems (Isinkaye et al., 2015). By integrating both approaches, the system can leverage the advantages of each method and provide more diverse and accurate recommendations.
  • In a hybrid-based filtering system, content-based and collaborative filtering algorithms may work independently or be combined in various ways. For example, the system might use collaborative filtering to identify a set of initial recommendations and then apply content-based filtering to refine those recommendations based on the specific preferences and attributes of the user.
  • Hybrid-based filtering allows for flexibility and can be tailored to suit different recommendation scenarios. It addresses some of the weaknesses of the individual approaches, offering a more robust and comprehensive recommendation solution (a small numeric sketch follows this list).
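One common way to combine the two is a weighted blend of their scores. The sketch below uses made-up scores and an arbitrary weight alpha purely for illustration:

# Minimal sketch of a weighted hybrid: blend a content-based score with
# a collaborative score for each candidate item. All numbers are made up.
content_scores = {"Title A": 0.90, "Title B": 0.40, "Title C": 0.75}
collab_scores  = {"Title A": 0.60, "Title B": 0.85, "Title C": 0.70}

alpha = 0.5  # weight on the content-based component
hybrid = {t: alpha * content_scores[t] + (1 - alpha) * collab_scores[t]
          for t in content_scores}

# rank candidates by the blended score
for title, score in sorted(hybrid.items(), key=lambda kv: -kv[1]):
    print(title, round(score, 3))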

Dataset¶

The dataset that we used is a list of manga scraped by Victor Soeiro in 2022 (link: https://www.kaggle.com/datasets/victorsoeiro/manga-manhwa-and-manhua-dataset).

In [271]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
from operator import itemgetter

import warnings
warnings.filterwarnings("ignore")
In [272]:
data = pd.read_csv("data.csv")
data.head()
Out[272]:
title description rating year tags cover
0 Salad Days (Tang LiuZang) - Part 2 The second season of Salad Days (Tang LiuZang). 4.7 2021.0 ['BL', 'Manhua', 'Romance', 'Shounen-ai', 'Spo... https://cdn.anime-planet.com/manga/primary/sal...
1 The Master of Diabolism As the grandmaster who founded the Demonic Sec... 4.7 2017.0 ['Action', 'Adventure', 'BL', 'Comedy', 'Manhu... https://cdn.anime-planet.com/manga/primary/the...
2 JoJo's Bizarre Adventure Part 7: Steel Ball Run Set in 1890, Steel Ball Run spotlights Gyro Ze... 4.7 2004.0 ['Action', 'Adventure', 'Horror', 'Mystery', '... https://cdn.anime-planet.com/manga/primary/joj...
3 A Sign of Affection Yuki is a typical college student, whose world... 4.7 2019.0 ['Romance', 'Shoujo', 'Slice of Life', 'Disabi... https://cdn.anime-planet.com/manga/primary/a-s...
4 Moriarty the Patriot Before he was Sherlock’s rival, Moriarty fough... 4.7 2016.0 ['Mystery', 'Shounen', 'Detectives', 'England'... https://cdn.anime-planet.com/manga/primary/mor...

Feature Engineering and Data Analysis¶

One of the first things I did was to find out how much data we were dealing with. I discovered that there were 70,948 entries covering anime, manhwa, manga, and manhua in the entire dataset. There were 6 columns, each with a different property: title, description, rating, year, tags, and cover. My partner decided that the best way to clean the data was to remove the 'cover' column, since that information was irrelevant. Since there are 70,939 unique titles available, we decided to keep only the entries from the year 2000 and up, since we thought those would be more appropriate for the general population (this filter also drops rows with a missing year). Finally, we sorted the entire dataset by year.

In [273]:
print("Number of rows in the dataset:", data.shape[0])
print("Number of columns in the dataset:", data.shape[1])
print("There are {} unique manhwa/manga available".format(len(data["title"].unique())))
Number of rows in the dataset: 70948
Number of columns in the dataset: 6
There are 70939 unique manhwa/manga available
In [274]:
useful_data = data.drop(columns=["cover"])
useful_data = useful_data[useful_data["year"] >= 2000]
useful_data = useful_data.sort_values("year")



useful_data.head()
Out[274]:
title description rating year tags
60501 Copy Cat This entry currently doesn't have a synopsis. ... NaN 2000.0 ['BL', 'Drama', 'Slice of Life', 'Yaoi']
30068 Sonna no Koi ja Nai Chiaki found out that her boyfriend had anothe... NaN 2000.0 ['Drama', 'Romance', 'Shoujo', 'Collections']
57167 Maihime Terpsichore Late into the school year, Sudo Kumi transfers... NaN 2000.0 ['Drama', 'Josei', 'Ballet Dancing', 'Dancing']
59215 Almost Paradise (Debbie MACOMBER) This entry currently doesn't have a synopsis. ... NaN 2000.0 ['Josei', 'Romance', 'Harlequin', 'Based on a ...
46087 Easy Writer Monica has finally gotten her dream job as a r... NaN 2000.0 ['Comedy', 'Drama', 'Josei', 'Romance', 'Slice...

Then my partner and I thought that we should have a visualization, such as a bar graph, of the most frequently appearing genres in the entire dataset. I decided that the best way to create a simple bar graph was to first build a dictionary of key-value pairs, the key being the name of a genre and the value being the number of times that genre appears. From that dictionary we chose the top 50 most frequent genres and plotted the bar graph.

In [275]:
tag_column = useful_data["tags"]

# parse each tags string like "['BL', 'Manhua', ...]" into a list of tag names
for index, row in useful_data["tags"].items():  # .items() replaces the deprecated .iteritems()
    row = row[1:-1]                              # strip the surrounding brackets
    row = row.replace("'", "").replace(" ", "")  # drop quotes; removing spaces merges multi-word tags (e.g. 'School Life' -> 'SchoolLife')
    row = row.split(",")
    useful_data.at[index, "tags"] = row

# tally how often each tag appears across the dataset
weed = {}
for tag in useful_data["tags"]:
     for genre in tag:
        if (genre in weed):
            weed[genre] = weed[genre] + 1
        else:
            weed[genre] = 1
            
#print(weed)
In [276]:
N = 50
 
# N largest values in dictionary
# Using sorted() + itemgetter() + items()
weed = dict(sorted(weed.items(), key=itemgetter(1), reverse=True)[:N])
 
# printing result
print("The top N value pairs are " + str(weed))
The top N value pairs are {'Romance': 28058, 'Comedy': 19669, 'Drama': 17042, 'Fantasy': 15179, 'BL': 12347, 'SchoolLife': 11989, 'Action': 11317, 'Yaoi': 9996, 'FullColor': 9668, 'LightNovels': 9287, 'Webtoons': 8871, 'Seinen': 8170, 'SliceofLife': 7653, 'Supernatural': 7472, 'Manhwa': 6861, 'Shoujo': 6677, 'Shounen': 6369, 'Manhua': 5645, 'Adventure': 5471, 'Josei': 5002, 'OneShot': 4452, 'Ecchi': 3572, 'SciFi': 3415, 'ExplicitSex': 3370, 'Historical': 3220, 'PersoninaStrangeWorld': 3134, 'BasedonaWebNovel': 3040, 'Mystery': 3019, 'AdultCouples': 2863, 'Collections': 2751, 'Shounen-ai': 2351, 'BasedonaNovel': 2310, 'WebNovels': 2224, 'Non-HumanProtagonists': 2170, 'Psychological': 2165, 'BasedonaLightNovel': 2065, 'GL': 2010, 'MatureThemes': 2001, 'Horror': 1991, 'MatureRomance': 1898, '4-koma': 1818, 'AdaptedtoAnime': 1802, 'Magic': 1789, 'Isekai': 1788, 'Harem': 1659, 'Harlequin': 1639, 'Royalty': 1587, 'Smut': 1327, 'Shoujo-ai': 1251, 'Demons': 1130}
In [277]:
values_list = list(weed.values())

values_array = np.array(values_list)

keys_list = list(weed.keys())

keys_array = np.array(keys_list)

no_of_colors=len(keys_array)
color=["#"+''.join([random.choice('0123456789ABCDEF') for i in range(6)])
       for j in range(no_of_colors)]

bar_plot = pd.DataFrame({"tags":values_array},index = keys_array)
#bar_plot.plot.bar()

colors = color
plt.figure(figsize=(40, 20))
plt.bar(keys_array, values_array, color=colors,width = 1)
plt.xticks(rotation=30)
plt.title('The different types of genre', fontsize=20)
plt.xlabel('Different genres', fontsize=20)
plt.ylabel('#', fontsize=20)
plt.grid(True)
plt.show()

As you can see, the most common tags are romance, comedy, drama, and fantasy. There are a few suspicious tags in here, but we decided to keep them because they are important factors in generalizing the list of manhwas for a good recommendation. Next, we decided to one-hot encode the tags column so the data can work with a machine learning model. The code below creates the one-hot-encoded dataset. From there, we appended that dataset to the main dataset and reset the indices. With that, the dataset is ready for recommendation.

In [278]:
# one hot encode 
useful_data.head()
ohctags = {}

for tags in useful_data["tags"]:
    splt = tags
    for wtag in weed.keys():
        if (wtag not in ohctags.keys()):
            ohctags[wtag] = []
        if (wtag in splt):
            ohctags[wtag].append(1)
        else:
            ohctags[wtag].append(0)
            
ohctags_df = pd.DataFrame.from_dict(ohctags)
ohctags_df.head(10)
Out[278]:
Romance Comedy Drama Fantasy BL SchoolLife Action Yaoi FullColor LightNovels ... 4-koma AdaptedtoAnime Magic Isekai Harem Harlequin Royalty Smut Shoujo-ai Demons
0 0 0 1 0 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
7 1 0 0 0 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 1 1 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 50 columns

In [279]:
#combine both into one dataframe!

prep_df = useful_data
#prep_df = prep_df.reset_index()
prep_df = pd.concat([useful_data.reset_index(drop=True), ohctags_df.reset_index(drop=True)], axis=1)
prep_df.head()
Out[279]:
title description rating year tags Romance Comedy Drama Fantasy BL ... 4-koma AdaptedtoAnime Magic Isekai Harem Harlequin Royalty Smut Shoujo-ai Demons
0 Copy Cat This entry currently doesn't have a synopsis. ... NaN 2000.0 [BL, Drama, SliceofLife, Yaoi] 0 0 1 0 1 ... 0 0 0 0 0 0 0 0 0 0
1 Sonna no Koi ja Nai Chiaki found out that her boyfriend had anothe... NaN 2000.0 [Drama, Romance, Shoujo, Collections] 1 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Maihime Terpsichore Late into the school year, Sudo Kumi transfers... NaN 2000.0 [Drama, Josei, BalletDancing, Dancing] 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Almost Paradise (Debbie MACOMBER) This entry currently doesn't have a synopsis. ... NaN 2000.0 [Josei, Romance, Harlequin, BasedonaNovel] 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 Easy Writer Monica has finally gotten her dream job as a r... NaN 2000.0 [Comedy, Drama, Josei, Romance, SliceofLife] 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 55 columns

Using K-NN as the Backbone of our Recommendation System¶

K-Nearest Neighbors is a machine learning algorithm that uses the distances between feature values to find the items in training data X nearest to a sample Y. The algorithm calculates the Euclidean distance between Y and all elements in X, sorts them, and takes the first K sorted distances. In classification, it labels Y with the most common label among those K nearest neighbors; in regression, it averages their values instead. For our case, we thought we could get K good recommendations for a specific manga, manhwa, or manhua by running KNN on the one-hot encoded tag features. In other words, we are using a Content-Based Filtering System to acquire recommendations. We decided to use sklearn's neighbors framework for their Nearest Neighbors model. Since we wanted 10 recommendations (not including Y itself), we set K to 11.

A small drawback of KNN is that the distance calculations can take time, especially if X is massive. Luckily, sklearn's KNN implementation comes with a feature that lets us skip some of those calculations: the "ball tree" algorithm, which "recursively divides the data into nodes defined by a centroid and radius, such that each point in the node lies within the hyper-sphere defined by [the radius and the centroid]" (Scikit Learn, 2011). This in turn results in fewer distance calculations. With every parameter set, we are finally able to set up the model.
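To make the distance step concrete before handing it to sklearn, here is a brute-force sketch on toy binary tag vectors; the toy X below merely stands in for our one-hot tag matrix:

# Brute-force sketch of the KNN distance step on toy binary tag vectors.
# X plays the role of the one-hot tag matrix built above; Y is the query.
import numpy as np

X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 1]])
Y = np.array([1, 0, 1, 0])

# Euclidean distance from Y to every row of X
dists = np.linalg.norm(X - Y, axis=1)
k = 2
nearest = np.argsort(dists)[:k]   # indices of the K smallest distances
print("distances:", dists, "-> nearest indices:", nearest)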

In [280]:
from sklearn.neighbors import NearestNeighbors

# get all the features by themselves as a numpy array
X = ohctags_df.to_numpy()

# implement knn using scikit-learn; find the first k nearest neighbors.
# ball tree means we don't compute all ~67,000 distances per query
# we choose 11: the query item itself plus 10 recommendations
knn = NearestNeighbors(n_neighbors=11, algorithm='ball_tree')
nbrs = knn.fit(X)
In [281]:
# code for grabbing a random sample - used for quick testing.
# rand = prep_df.sample()
# Y = (rand.drop(["title", "description", "rating", "year", "tags"], axis=1)).to_numpy()

# This function below prints a readable string representation of the first K recommendations for Y
def knnrecs(Y):
    Ydat = (Y.drop(labels=["title", "description", "tags", "rating", "year"])).to_numpy()
    distances, indices = nbrs.kneighbors([Ydat])
    #print(indices)
    print("Best Reccomendations if You've Read: " + str(Y["title"]))
    print(str(Y["tags"]))
    print()
    i = 0
    for x in indices[0]:
        row = useful_data.iloc[x]
        title = row["title"]
        desc = row["description"]
        tags = row["tags"]
        if (title != Y["title"]):
            print(str(title))
            print("distance: " + str(distances[0][i]))
            print("tags: " + str(tags))
            print("description: " + str(desc))
            print()
        i += 1
    return(indices[0])
In [282]:
# Let's test this with one of Sharath's favorite mangas!
dsid = prep_df.index[prep_df["title"] == "The Eminence in Shadow"]
Y = prep_df.iloc[dsid[0]]

recs = knnrecs(Y)
Best Recommendations if You've Read: The Eminence in Shadow
['Action', 'Comedy', 'Fantasy', 'Shounen', 'Isekai', 'Magic', 'MagicSchool', 'OverpoweredMainCharacters', 'Parody', 'PersoninaStrangeWorld', 'Reincarnation', 'SchoolLife', 'Swordplay', 'ExplicitViolence', 'AdaptedtoAnime', 'BasedonaLightNovel']

Villainess: Reloaded! Blowing Away Bad Ends with Modern Weapons
distance: 1.4142135623730951
tags: ['Action', 'Comedy', 'Fantasy', 'Shounen', 'Guns', 'Isekai', 'Magic', 'ModernKnowledge', 'OtomeGame', 'PersoninaStrangeWorld', 'Reincarnation', 'Villainess', 'Violence', 'BasedonaLightNovel']
description: Astrid von Oldenburg is no ordinary four-year-old. She’s a child prodigy with a passion for military technology who now finds herself reincarnated in the world of an otome game she played during her past life. But not as the heroine! As the game’s villainess, she's born with wealth, power, and a fearsome talent for magic. The only problem is that every route leads to her inevitable destruction. Or does it? What if averting her destruction was a simple matter of amassing enough firepower to annihilate anyone who dared even attempt to bring her down?! In a bid to resist fate, the young villainess embarks on the reproduction of all of her favorite weaponry. Whatever it takes, Astrid’s determined to blow away her bad ends with superior firepower!

Doryoku Shisugita Sekai Saikyou no Butouka wa, Mahou Sekai wo Yoyuu de Ikinuku.
distance: 1.4142135623730951
tags: ['Action', 'Comedy', 'Fantasy', 'Shounen', 'Isekai', 'Magic', 'MartialArts', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'Reincarnation', 'BasedonaLightNovel']
description: One day, a martial artist named Ash was suddenly reborn into another world. He decided that he will become a sorcerer in his second life.He went through harsh training after becoming the apprentice of the former hero, Morris. Then, the "Emperor of darkness" suddenly appeared! Right when the end of the world was approaching, he one-shotted the demon lord?!

Tsukimichi: Moonlit Fantasy
distance: 1.4142135623730951
tags: ['Action', 'Adventure', 'Comedy', 'Fantasy', 'Shounen', 'Isekai', 'KingdomBuilding', 'Magic', 'ModernKnowledge', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'RPG', 'SummonedIntoAnotherWorld', 'AdaptedtoAnime', 'BasedonaLightNovel']
description: Misumi Makoto was just a normal high-school student... until the day he was summoned to another world because of an agreement his parents had made with a goddess. Except when he meets the only goddess in the world he's sent to, she insults him by calling him "extremely unattractive" and makes an arbitrary decision to banish him to a deserted wilderness. Makoto searches the desolate land to find other humans, but for some reason only finds creatures that are all nonhuman. Even the two pretty women who decide to follow him on his journey used to be a dragon and a giant spider. Along with two extremely eccentric characters (but really reliable in battles), so begins Makoto's unlucky adventure in the new world! A fantasy about a teenage boy who keeps getting hit by one problem after another!

Didn't I Say to Make My Abilities Average in the Next Life?!
distance: 1.4142135623730951
tags: ['Action', 'Adventure', 'Comedy', 'Fantasy', 'Shounen', 'Hiatus', 'Isekai', 'Magic', 'Nobility', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'Reincarnation', 'AdaptedtoAnime', 'BasedonaLightNovel']
description: When she turns ten years old, Adele von Ascham is hit with a horrible headache–and memories of her previous life as an eighteen-year-old Japanese girl named Kurihara Misato. That life changed abruptly, however, when Misato died trying to aid a little girl and met god. During that meeting, she made an odd request and asked for average abilities in her next life. But few things–especially wishes–ever go quite as planned.

Ore no Ie ga Maryoku Spot Datta Ken: Sun de Iru Dake de Sekai Saikyou
distance: 1.7320508075688772
tags: ['Action', 'Comedy', 'Fantasy', 'Shounen', 'Dragons', 'Isekai', 'Magic', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'SummonedIntoAnotherWorld', 'ExplicitSex', 'BasedonaLightNovel']
description: Living carefree at home is the greatest shortcut---My House is the world's greatest Magic Power Spot, that being the case both my house and I were summoned to another world by some guys who are aiming for it. However, I've been living in this place for many years and my body is, apparently, abnormally overflowing with magic. Due to some unforeseen circumstances by those guys who summoned me, they quickly ran away. Be that as it may, there are some ill-mannered people who covet the magic leaking out of my house.

Twisted-Wonderland: The Comic - Episode of Heartslabyul
distance: 1.7320508075688772
tags: ['Comedy', 'Fantasy', 'Shounen', 'BoardingSchool', 'Disney', 'Isekai', 'Magic', 'MagicSchool', 'PersoninaStrangeWorld', 'SchoolLife', 'SummonedIntoAnotherWorld', 'BasedonaMobileGame']
description: Enma Yuuken is a high school student and member of kendo club. After an accident with a Black Carriage, he ends in Night Raven College, a pretigious magic school in Twisted Wonderland.

My Instant Death Ability Is So Overpowered, No One in This Other World Stands a Chance Against Me! ΑΩ
distance: 1.7320508075688772
tags: ['Action', 'Adventure', 'Comedy', 'Fantasy', 'Shounen', 'Cheats', 'Isekai', 'Magic', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'SummonedIntoAnotherWorld', 'Violence', 'BasedonaLightNovel']
description: Awaking to absolute chaos and carnage while on a school trip, Yogiri Takatou discovers that everyone in his class has been transported to another world! He had somehow managed to sleep through the entire ordeal himself, missing out on the Gift — powers bestowed upon the others by a mysterious Sage who appeared to transport them. Even worse, he and another classmate were ruthlessly abandoned by their friends, left as bait to distract a nearby dragon. Although not terribly bothered by the thought of dying, he reluctantly decides to protect his lone companion. After all, a lowly Level 1000 monster doesn't stand a chance against his secret power to invoke Instant Death with a single thought! If he can stay awake long enough to bother using it, that is...

Akashic Records of Bastard Magic Instructor
distance: 1.7320508075688772
tags: ['Action', 'Comedy', 'Ecchi', 'Fantasy', 'Shounen', 'Magic', 'MagicSchool', 'SchoolLife', 'Teaching', 'AdaptedtoAnime', 'BasedonaLightNovel']
description: Lumia and Sisti are mages-in-training at a prestigious magical academy where they hope to be taught by the best of the best. However, when their favorite instructor suddenly retires, his replacement turns out to be a total jerk - he's idle, incompetent, and always late! Can Lumia help uncover their new teacher's true potential - and can Sisti still learn magic and unravel the secrets of the mysterious Sky Castle with such a terrible mentor as her guide?

Yankee wa Isekai de Seirei ni Aisaremasu.
distance: 1.7320508075688772
tags: ['Action', 'Comedy', 'Fantasy', 'Delinquents', 'Isekai', 'Magic', 'PersoninaStrangeWorld', 'Reincarnation', 'BasedonaLightNovel']
description: While trying to save a child who was going to be run over by a truck, Manai Zero loses his life. When he wakes up even though he should have lost his life, Manai is given the choice of being reincarnated, but in a different world, with a power called "Beloved by Spirits". A world of magic and sprites, monsters and adventures.

Isekai Cheat Magician
distance: 1.7320508075688772
tags: ['Action', 'Adventure', 'Fantasy', 'Shounen', 'Cheats', 'Guilds', 'Isekai', 'Magic', 'OverpoweredMainCharacters', 'PersoninaStrangeWorld', 'SummonedIntoAnotherWorld', 'AdaptedtoAnime', 'BasedonaLightNovel']
description: As regular high school students Taichi and Rin disappeared in a beam of light. When they came to, the two of them were already in a world of swords and magic. Finally getting away after experiencing an attack by monsters, following the suggestion of adventurers they headed on the path towards the guild. In the guild, the two of them found out that they possessed unbelievably powerful magic. Thus the regular high school students transformed into the strongest cheats...

After developing a function to output an easy-to-read string representation of the results, we decided to test KNN with a manga called "The Eminence in Shadow". As seen above, the distance of each recommendation increases as we go down the list. This is expected of the KNN algorithm, so we have successfully acquired the 10 nearest recommendations for "The Eminence in Shadow". However, there is a problem.

How do we know that these results are good recommendations?

Evaluating Recommendations¶

In the example above, we noticed that the tags of all recommendations are very similar to the tags of "The Eminence in Shadow". As we go further down the recommendation list, the similarity in tags starts to fade. Also, when reading the descriptions of the recommended items, we see parallels between the general plot of "The Eminence in Shadow" and the recommendations. The best way to verify a recommendation is by actually reading it, but we do not have the time for that. All three methods are decent for evaluation, but none of them are rooted in the ground truth.

After discussing this issue, we decided to modify the way we make recommendations. The best way to validate a model is by comparing a value generated by the machine against the corresponding value from the ground truth. For classification, we track the loss curve produced by a loss function. For regression, we would use statistics that compare the generated value to the actual value, such as Root Mean Squared Error.
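As a toy illustration of these two styles of evaluation (all labels and predictions below are made up):

# Toy illustration of the two evaluation styles mentioned above,
# using made-up labels and predictions.
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# classification: log loss compares predicted probabilities to true labels
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.7, 0.6]
print("log loss:", log_loss(y_true, y_prob))

# regression: RMSE compares predicted values to true values
r_true = [4.5, 3.0, 4.0]
r_pred = [4.2, 3.5, 4.1]
print("RMSE:", np.sqrt(mean_squared_error(r_true, r_pred)))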

In a real-world setting, a user profile will contain the mangas, manhwas, and manhuas that a user has read, along with a score on whether they liked each one or not. A recommender system would use this information (as well as information from other users if going for a Collaborative-Filtering approach) to provide recommendations for the user. So, we decided to simulate our own "user profiles" with roughly 25 mangas, manhwas, and manhuas we've each read, labelled with 1. We then grab 35 random items in prep_df not already in our "user profile" and label them with 0.

In [283]:
import re

#Sharath Kannan's List
skidx = []
skidx.append(prep_df.index[prep_df["title"] == "The Eminence in Shadow"][0])
skidx.append(prep_df.index[prep_df["title"] == "Oshi no Ko"][0])
skidx.append(prep_df.index[prep_df["title"] == "Tsukimichi: Moonlit Fantasy"][0])
skidx.append(prep_df.index[prep_df["title"] == "Chainsaw Man"][0])
skidx.append(prep_df.index[prep_df["title"] == "Overlord"][0])
skidx.append(prep_df.index[prep_df["title"] == "The Saga of Tanya the Evil"][0])
skidx.append(prep_df.index[prep_df["title"] == "Kaguya-sama: Love Is War"][0])
skidx.append(prep_df.index[prep_df["title"] == "Reincarnated as a Sword"][0])
skidx.append(prep_df.index[prep_df["title"] == "Black Clover"][0])
skidx.append(prep_df.index[prep_df["title"] == "Dragon Ball Super"][0])
skidx.append(prep_df.index[prep_df["title"] == "Re:ZERO -Starting Life in Another World- Chapter 1: A Day in the Capital"][0])
skidx.append(prep_df.index[prep_df["title"] == "Re:ZERO -Starting Life in Another World- Chapter 2: A Week at the Mansion"][0])
skidx.append(prep_df.index[prep_df["title"] == "Re:ZERO -Starting Life in Another World- Chapter 3: Truth of Zero"][0])
skidx.append(prep_df.index[prep_df["title"] == "Re:ZERO -Starting Life in Another World- Chapter 4: Sanctuary and Witch of Greed"][0])
skidx.append(prep_df.index[prep_df["title"] == "One-Punch Man (Webcomic)"][0])
skidx.append(prep_df.index[prep_df["title"] == "The Quintessential Quintuplets"][0])
skidx.append(prep_df.index[prep_df["title"] == "Classroom of the Elite"][0])
skidx.append(prep_df.index[prep_df["title"] == "Horimiya"][0])
skidx.append(prep_df.index[prep_df["title"] == "Spice and Wolf (Light Novel)"][0])
skidx.append(prep_df.index[prep_df["title"] == "86: Eighty-Six"][0])
skidx.append(prep_df.index[prep_df["title"] == "How a Realist Hero Rebuilt the Kingdom"][0])
skidx.append(prep_df.index[prep_df["title"] == "That Time I Got Reincarnated as a Slime: The Ways of the Monster Nation"][0])
skidx.append(prep_df.index[prep_df["title"] == "That Time I Got Reincarnated as a Slime: Trinity in Tempest"][0])
skidx.append(prep_df.index[prep_df["title"] == "The Rising of the Shield Hero"][0])
skidx.append(prep_df.index[prep_df["title"] == "Goblin Slayer"][0])

# fill out the rest of the dataset.
ld = np.ones(len(skidx))
random_seed = 615
np.random.seed(2129)
# fill rest with random values-0
i = 0
while (i < 35):
    r = np.random.randint(len(prep_df))
    if (r not in skidx):
        skidx = np.append(skidx, r)
        ld = np.append(ld, 0)
        i += 1

sksample = prep_df.iloc[skidx]
sksample["Like_Dislike"] = ld
sksample.head()
Out[283]:
title description rating year tags Romance Comedy Drama Fantasy BL ... AdaptedtoAnime Magic Isekai Harem Harlequin Royalty Smut Shoujo-ai Demons Like_Dislike
43740 The Eminence in Shadow Shadowbrokers are those who go unnoticed, posi... 4.6 2018.0 [Action, Comedy, Fantasy, Shounen, Isekai, Mag... 0 1 0 1 0 ... 1 1 1 0 0 0 0 0 0 1.0
57732 Oshi no Ko In the entertainment industry, lying is both y... 4.6 2020.0 [Drama, Seinen, SliceofLife, Acting, Idols, Ps... 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1.0
30544 Tsukimichi: Moonlit Fantasy Misumi Makoto was just a normal high-school st... 4.5 2015.0 [Action, Adventure, Comedy, Fantasy, Shounen, ... 0 1 0 1 0 ... 1 1 1 0 0 0 0 0 0 1.0
43592 Chainsaw Man Denji's life of poverty is changed forever whe... 4.6 2018.0 [Action, Fantasy, Horror, Shounen, DarkFantasy... 0 0 0 1 0 ... 1 0 0 0 0 0 0 0 1 1.0
28152 Overlord What do you do when your favorite game shuts d... 4.6 2014.0 [Action, Adventure, Fantasy, DarkFantasy, Demo... 0 0 0 1 0 ... 1 1 1 0 0 0 0 0 1 1.0

5 rows × 56 columns

In [284]:
# Eric Cho's List
manhwa_data = []
manhwa_data.append(prep_df.index[prep_df["title"] == "Nano Machine"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Tower of God - Part 3"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "A Good Day to be a Dog"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Tower of God"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "A Business Proposal"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Positively Yours"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "What's Wrong with Secretary Kim?"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Hold Me Tight"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Weak Hero"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Bastard"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Teenage Mercenary"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Surviving Romance"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Homeless"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Walk on Water"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Second Life Ranker (Novel)"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Light and Shadow"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Odd Girl Out"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Love Shuttle"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Sweet Home"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Positively Yours"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Unholy Blood"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Medical Return"][0])
manhwa_data.append(prep_df.index[prep_df["title"] == "Who Made Me a Princess"][0])

ld = np.ones(len(manhwa_data))

random_seed = 615
np.random.seed(2193)


# note: unlike the while-loop used for Sharath's list, this for-loop draws
# 35 candidates and skips duplicates, so it can append slightly fewer than
# 35 dislikes if a duplicate index happens to be drawn
for i in range(35):
    ran = np.random.randint(len(prep_df))
    if(ran not in manhwa_data):
        manhwa_data = np.append(manhwa_data,ran)
        ld = np.append(ld,0)
        

manhwa_data = prep_df.iloc[manhwa_data]
manhwa_data["Like_Dislike"]=ld
manhwa_data.head()
Out[284]:
title description rating year tags Romance Comedy Drama Fantasy BL ... AdaptedtoAnime Magic Isekai Harem Harlequin Royalty Smut Shoujo-ai Demons Like_Dislike
57872 Nano Machine After being held in disdain and having his lif... 4.7 2020.0 [Action, Adventure, Fantasy, Manhwa, SciFi, We... 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1.0
51958 Tower of God - Part 3 The third season of Tower of God. 4.7 2019.0 [Action, Adventure, Drama, Fantasy, Manhwa, We... 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 1.0
42249 A Good Day to be a Dog Hana is cursed into a dog from her first kiss,... 4.6 2017.0 [Comedy, Drama, Manhwa, Romance, Webtoons, Adu... 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 1.0
16076 Tower of God Fame. Glory. Power. Anything in your wildest d... 4.6 2010.0 [Action, Adventure, Drama, Fantasy, Manhwa, We... 0 0 1 1 0 ... 1 0 0 0 0 0 0 0 0 1.0
43833 A Business Proposal Ha-ri made a deal—go on one blind date for her... 4.6 2018.0 [Comedy, Drama, Manhwa, Romance, Webtoons, Adu... 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 1.0

5 rows × 56 columns

Essentially, we've turned this problem into a classification problem. By using a model that can generalize the trends of tags in our "user profile", we can figure out whether a manga, manhwa, or manhua is worth recommending to us. This raises the question of which classification model performs best for recommendation. The three models we will be testing are KNN, Decision Trees, and a Neural Network. We chose these three because they seem to be the most popular models in recommendation systems.

Sharath's taste in manhwa, manga, and manhua leans toward the trope of reincarnating in a fantasy world, and he prefers manga. Eric's taste is action-heavy, and he prefers manhwa. The goal now is to see if these three models can generalize these trends. With that, we can check whether a new manga, manhwa, or manhua is suitable for us.

By using the model with the least loss, we can run the model on potential items and find the next best manga, manhwa, or manhua to enjoy!

Finding the Best Model¶

K-Nearest Neighbors¶

Let us start with K-Nearest Neighbors. Once again, we use the sklearn implementation of the model. This time, we don't need the "ball tree" algorithm, since we are training with only around 60 values; brute-forcing the distances is more efficient. We also used a holdout split to divide our "user profile" data into a training and a test set (80% and 20%, respectively). With this, we can calculate training and testing loss, which tells us whether the model is overfitting or underfitting. Having K be 10 might not be optimal here, so we need to choose the K value that best generalizes the data.

An easy way to do this is to compare model loss across different K values. For our loss function, we went with logistic loss, though we experimented with other functions as well.
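As a side note, with datasets this small a single 80/20 split can be noisy, so k-fold cross-validation is a reasonable sanity check for each K. A sketch, assuming the skX and skY arrays built in the next cell:

# Optional sanity check: k-fold cross-validation gives a steadier
# accuracy estimate per K than a single 80/20 split.
# Sketch only - assumes skX / skY as built in the next cell.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in (1, 3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute")
    scores = cross_val_score(knn, skX, skY, cv=5)  # 5-fold accuracy
    print(f"K={k}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")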

In [285]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

# set up Sharath's samples
skX = (sksample.drop(labels=["title", "description", "tags", "rating", "year", "Like_Dislike"], axis=1)).to_numpy()
skY = (sksample["Like_Dislike"]).to_numpy()

# make training/testing set
X_train, X_test, Y_train, Y_test = train_test_split(skX, skY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)
In [286]:
# we train a KNN model for each K from 1 to 39 (the K=0 entry is bumped to 1)
K = np.arange(0,40,1)
K[0] = 1

# find training and testing loss
tr_loss = []
te_loss = []
tr_acc = []
te_acc = []
for k in K:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute")
    knn = knn.fit(X_train, Y_train)
    # score returns mean accuracy of training/testing
    tr_acc.append(knn.score(X_train, Y_train))
    te_acc.append(knn.score(X_test, Y_test))
    tr_loss.append(log_loss(Y_train, knn.predict_proba(X_train)))
    te_loss.append(log_loss(Y_test, knn.predict_proba(X_test)))

# plot loss values
plt.title("K-value vs Loss (Sharath Kannan's Data)")
plt.xlabel("K-value")
plt.ylabel("Loss")

# might remove this
ax = plt.gca()
ax.set_ylim([0.0, 1.0])

plt.plot(K, te_loss, tr_loss)
plt.legend(["testing", "training"])
plt.show()
In [287]:
# Accuracy
plt.title("K-value vs Accuracy (Sharath Kannan's Data)")
plt.xlabel("K-value")
plt.ylabel("Accuracy")

# might remove this
ax = plt.gca()
ax.set_ylim([0.0, 1.0])

plt.plot(K, te_acc, tr_acc)
plt.legend(["testing", "training"])
plt.show()
In [288]:
# set up Eric's samples
ecX = (manhwa_data.drop(labels=["title", "description", "tags", "rating", "year", "Like_Dislike"], axis=1)).to_numpy()
ecY = (manhwa_data["Like_Dislike"]).to_numpy()

# make training/testing set
X_train, X_test, Y_train, Y_test = train_test_split(ecX, ecY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)

# we train a KNN model for each K from 1 to 39 (the K=0 entry is bumped to 1)
K = np.arange(0,40,1)
K[0] = 1

# find training and testing loss
tr_loss = []
te_loss = []
tr_acc = []
te_acc = []
for k in K:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute")
    knn = knn.fit(X_train, Y_train)
    # score returns mean accuracy of training/testing
    tr_acc.append(knn.score(X_train, Y_train))
    te_acc.append(knn.score(X_test, Y_test))
    tr_loss.append(log_loss(Y_train, knn.predict_proba(X_train)))
    te_loss.append(log_loss(Y_test, knn.predict_proba(X_test)))

# plot loss values
plt.title("K-value vs Loss (Eric Cho's Data)")
plt.xlabel("K-value")
plt.ylabel("Loss")

# might remove this
ax = plt.gca()
ax.set_ylim([0.0, 1.0])

plt.plot(K, te_loss, tr_loss)
plt.legend(["testing", "training"])
plt.show()
In [289]:
# Accuracy
plt.title("K-value vs Accuracy (Eric Cho's Data)")
plt.xlabel("K-value")
plt.ylabel("Accuracy")

# might remove this
ax = plt.gca()
ax.set_ylim([0.0, 1.0])

plt.plot(K, te_acc, tr_acc)
plt.legend(["testing", "training"])
plt.show()

As expected, as the K value increases, the loss increases and the accuracy decreases. This is because the model underfits the data at high K-values, resulting in incorrect classifications. We noticed minimal loss for both datasets between K-values of 0 and 5. The accuracy on the training and testing data for both datasets follows a similar pattern: it is maximized between K-values of 0 and 5. This tells us that the most optimal K value is within this interval. We see 100% accuracy and 0 loss when K is 1 or 2, but we risk overfitting the model in that case, meaning that items we test with later on would have a higher chance of being wrongly classified.

After careful consideration, we decided to settle with K=3. Below are the accuracies of a 3-NN model trained with our datasets.

In [290]:
# Sharath's data with K=3
X_train, X_test, Y_train, Y_test = train_test_split(skX, skY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)
knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn = knn.fit(X_train, Y_train)

print("Training score (Sharath): " + str(knn.score(X_train, Y_train)))
print("Testing score (Sharath): " + str(knn.score(X_test, Y_test)))
print()

# Eric's data with K=3
X_train, X_test, Y_train, Y_test = train_test_split(ecX, ecY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)
knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn = knn.fit(X_train, Y_train)

print("Training score (Eric): " + str(knn.score(X_train, Y_train)))
print("Testing score (Eric): " + str(knn.score(X_test, Y_test)))
Training score (Sharath): 0.9166666666666666
Testing score (Sharath): 0.9166666666666666

Training score (Eric): 0.9565217391304348
Testing score (Eric): 1.0

A testing score of roughly 92% on Sharath's dataset and 100% on Eric's dataset are good values for recommendation. Thus, using KNN to classify a like or a dislike seems like a good idea.

Decision Trees¶

Next, we decided to test decision trees. When training a decision tree, the algorithm essentially looks for the feature split with the highest information gain, measured here with Gini impurity, and splits the items in the dataset on that feature. This recursive process ends when all items in each child node have the same class.

The trends in our datasets can be determined by whether or not a certain tag appears, so we assumed that a decision tree would work extremely well at generalizing this data and predicting a like or dislike. For this, we used sklearn's implementation of a decision tree. Once again, we split our dataset into a training and a test set and trained the model on the training set. We could end the training process early by limiting the height of the tree, but since there are only 48 items to train on, we decided against it. Afterwards, we acquired the accuracy of the model on the training and testing data. Using a visualization of the tree, we can see which features the model prioritized.
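To make the splitting criterion concrete, here is the Gini computation done by hand, using the class counts that appear at the root split of Sharath's tree below:

# Worked Gini impurity example, using the counts from the root split of
# Sharath's tree below: 48 samples [29 dislikes, 19 likes] split on a tag
# into children [28, 4] and [1, 15].
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [29, 19]                  # [dislike, like]
left, right = [28, 4], [1, 15]     # children after splitting on a tag

n = sum(parent)
weighted = sum(sum(child) / n * gini(child) for child in (left, right))
print("parent gini:", round(gini(parent), 3))            # ~0.478
print("weighted child gini:", round(weighted, 3))        # ~0.185
print("impurity decrease:", round(gini(parent) - weighted, 3))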

In [291]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


skX = (sksample.drop(labels=["title", "description", "tags", "rating", "year","Like_Dislike"], axis=1)).to_numpy()
sky = (sksample["Like_Dislike"]).to_numpy()
X_train, X_test, Y_train, Y_test = train_test_split(skX, skY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)


# Create Decision Tree classifer object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer
clf = clf.fit(X_train,Y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Training Accuracy:", clf.score(X_train, Y_train))
print("Testing Accuracy:", clf.score(X_test, Y_test))
Training Accuracy: 1.0
Testing Accuracy: 1.0
In [292]:
from sklearn.tree import plot_tree  # import only the plotting helper we use
feature_cols = sksample.drop(labels=["title", "description", "tags", "rating", "year","Like_Dislike"], axis=1).columns.tolist()

plt.figure(figsize=(15,7))


plot_tree(clf, feature_names=feature_cols, class_names=["dislike", "like"])
#plot_tree(clf, feature_names=feature_cols)
Out[292]:
[Text(0.5769230769230769, 0.9, 'AdaptedtoAnime <= 0.5\ngini = 0.478\nsamples = 48\nvalue = [29, 19]\nclass = dislike'),
 Text(0.3076923076923077, 0.7, 'BasedonaLightNovel <= 0.5\ngini = 0.219\nsamples = 32\nvalue = [28, 4]\nclass = dislike'),
 Text(0.15384615384615385, 0.5, 'MatureThemes <= 0.5\ngini = 0.069\nsamples = 28\nvalue = [27, 1]\nclass = dislike'),
 Text(0.07692307692307693, 0.3, 'gini = 0.0\nsamples = 27\nvalue = [27, 0]\nclass = dislike'),
 Text(0.23076923076923078, 0.3, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = like'),
 Text(0.46153846153846156, 0.5, 'Action <= 0.5\ngini = 0.375\nsamples = 4\nvalue = [1, 3]\nclass = like'),
 Text(0.38461538461538464, 0.3, 'gini = 0.0\nsamples = 3\nvalue = [0, 3]\nclass = like'),
 Text(0.5384615384615384, 0.3, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike'),
 Text(0.8461538461538461, 0.7, 'Action <= 0.5\ngini = 0.117\nsamples = 16\nvalue = [1, 15]\nclass = like'),
 Text(0.7692307692307693, 0.5, 'PersoninaStrangeWorld <= 0.5\ngini = 0.278\nsamples = 6\nvalue = [1, 5]\nclass = like'),
 Text(0.6923076923076923, 0.3, 'gini = 0.0\nsamples = 4\nvalue = [0, 4]\nclass = like'),
 Text(0.8461538461538461, 0.3, 'Demons <= 0.5\ngini = 0.5\nsamples = 2\nvalue = [1, 1]\nclass = dislike'),
 Text(0.7692307692307693, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike'),
 Text(0.9230769230769231, 0.1, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = like'),
 Text(0.9230769230769231, 0.5, 'gini = 0.0\nsamples = 10\nvalue = [0, 10]\nclass = like')]
In [293]:
# set up Eric's samples
ecX = (manhwa_data.drop(labels=["title", "description", "tags", "rating", "year", "Like_Dislike"], axis=1)).to_numpy()
ecY = (manhwa_data["Like_Dislike"]).to_numpy()

# make training/testing set
X_train, X_test, Y_train, Y_test = train_test_split(ecX, ecY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer
clf = clf.fit(X_train,Y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Training Accuracy:", clf.score(X_train, Y_train))
print("Testing Accuracy:", clf.score(X_test, Y_test))
Training Accuracy: 1.0
Testing Accuracy: 1.0
In [294]:
feature_cols = manhwa_data.drop(labels=["title", "description", "tags", "rating", "year","Like_Dislike"], axis=1).columns.tolist()

plt.figure(figsize=(15,7))


plot_tree(clf, feature_names=feature_cols, class_names=["dislike", "like"])
#plot_tree(clf, feature_names=feature_cols)
Out[294]:
[Text(0.4090909090909091, 0.9375, 'Manhwa <= 0.5\ngini = 0.476\nsamples = 46\nvalue = [28, 18]\nclass = dislike'),
 Text(0.3181818181818182, 0.8125, 'gini = 0.0\nsamples = 24\nvalue = [24, 0]\nclass = dislike'),
 Text(0.5, 0.8125, 'Shounen-ai <= 0.5\ngini = 0.298\nsamples = 22\nvalue = [4, 18]\nclass = like'),
 Text(0.4090909090909091, 0.6875, 'FullColor <= 0.5\ngini = 0.245\nsamples = 21\nvalue = [3, 18]\nclass = like'),
 Text(0.18181818181818182, 0.5625, 'Drama <= 0.5\ngini = 0.5\nsamples = 2\nvalue = [1, 1]\nclass = dislike'),
 Text(0.09090909090909091, 0.4375, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = like'),
 Text(0.2727272727272727, 0.4375, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike'),
 Text(0.6363636363636364, 0.5625, 'SciFi <= 0.5\ngini = 0.188\nsamples = 19\nvalue = [2, 17]\nclass = like'),
 Text(0.45454545454545453, 0.4375, 'ExplicitSex <= 0.5\ngini = 0.111\nsamples = 17\nvalue = [1, 16]\nclass = like'),
 Text(0.36363636363636365, 0.3125, 'gini = 0.0\nsamples = 12\nvalue = [0, 12]\nclass = like'),
 Text(0.5454545454545454, 0.3125, 'Yaoi <= 0.5\ngini = 0.32\nsamples = 5\nvalue = [1, 4]\nclass = like'),
 Text(0.45454545454545453, 0.1875, 'MatureRomance <= 0.5\ngini = 0.5\nsamples = 2\nvalue = [1, 1]\nclass = dislike'),
 Text(0.36363636363636365, 0.0625, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike'),
 Text(0.5454545454545454, 0.0625, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = like'),
 Text(0.6363636363636364, 0.1875, 'gini = 0.0\nsamples = 3\nvalue = [0, 3]\nclass = like'),
 Text(0.8181818181818182, 0.4375, 'Adventure <= 0.5\ngini = 0.5\nsamples = 2\nvalue = [1, 1]\nclass = dislike'),
 Text(0.7272727272727273, 0.3125, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike'),
 Text(0.9090909090909091, 0.3125, 'gini = 0.0\nsamples = 1\nvalue = [0, 1]\nclass = like'),
 Text(0.5909090909090909, 0.6875, 'gini = 0.0\nsamples = 1\nvalue = [1, 0]\nclass = dislike')]

Something important to note is that the training accuracy for both datasets is 100%, meaning the training data is generalized perfectly. Usually, this is a sign that the model is overfitting the dataset. We suspect this happens with Sharath's dataset: since the tree is not seeded, some runs gave us a testing accuracy as low as 83% on his data, even though the run shown above reaches 100%. Eric's dataset, on the other hand, consistently tests at 100%. What is the reason for this?

Note that Eric labelled all of the manhwa in his list with 1, since he is a manhwa fan. As seen in his decision tree above, the first split is based on the "Manhwa" tag: if an item is not a manhwa, it is immediately labelled with the 0 class. Meanwhile, Sharath's dataset has more variety in its tags, which is why his tree appears more balanced. From this we concluded that a decision tree works better for "user profile" lists that have a clearly defined trend in the tags; it doesn't perform as well for lists with more variance.

Neural Network¶

As mentioned before, neural networks are becoming popular for recommendation systems with the advent of big data. Their benefit is that they are known to classify non-linear distributions accurately. So we tried using a neural network to generalize our "user profile" dataset.

There are many good rules of thumb for building a neural network. We chose ReLU as our activation function and stochastic gradient descent as the optimizer. A good initial hidden-layer size is "2/3 the size of the input layer, plus the size of the output layer" (Krishnan, 2021); with our 50 tag features and a single output, that comes to about (2/3)*50 + 1 ≈ 34 neurons. We experimented with different arrangements of roughly that many neurons and settled on 3 layers: 15 on the 1st, 15 on the 2nd, and 3 on the 3rd (33 total). With all that in mind, we set up an MLPClassifier (Multi-Layer Perceptron) from sklearn and fit our datasets.

In [295]:
from sklearn.neural_network import MLPClassifier

# set up Sharath's samples
skX = (sksample.drop(labels=["title", "description", "tags", "rating", "year", "Like_Dislike"], axis=1)).to_numpy()
skY = (sksample["Like_Dislike"]).to_numpy()

# make training/testing set
X_train, X_test, Y_train, Y_test = train_test_split(skX, skY, test_size=0.2, train_size=0.8, random_state=614, shuffle=True)

# initialize and train the neural network
# activation function is ReLU (standard)
# the solver (optimizer) is sgd
# learning rate has been set to 0.2 for training
nn = MLPClassifier(hidden_layer_sizes=(15,15,3), activation="relu", solver="sgd", shuffle=True, random_state=614, max_iter=600, verbose=True, learning_rate_init=0.2)
nn = nn.fit(X_train, Y_train)

# see scores (mean accuracy)
print("\nScores (Mean Accuracy)")
print("test score: " + str(nn.score(X_test, Y_test)))
print("training score: " + str(nn.score(X_train, Y_train)))
Iteration 1, loss = 0.66405658
Iteration 2, loss = 0.62816355
Iteration 3, loss = 0.58852031
Iteration 4, loss = 0.51603694
Iteration 5, loss = 0.43474858
Iteration 6, loss = 0.35646902
Iteration 7, loss = 0.29229925
Iteration 8, loss = 0.23382722
Iteration 9, loss = 0.18127073
Iteration 10, loss = 0.12635842
Iteration 11, loss = 0.08155672
Iteration 12, loss = 0.05028862
Iteration 13, loss = 0.03075169
Iteration 14, loss = 0.01758722
Iteration 15, loss = 0.01025202
Iteration 16, loss = 0.00635454
Iteration 17, loss = 0.00429061
Iteration 18, loss = 0.00298275
Iteration 19, loss = 0.00210838
Iteration 20, loss = 0.00150633
Iteration 21, loss = 0.00109183
Iteration 22, loss = 0.00080513
Iteration 23, loss = 0.00060697
Iteration 24, loss = 0.00046955
Iteration 25, loss = 0.00037554
Iteration 26, loss = 0.00030796
Iteration 27, loss = 0.00025825
Iteration 28, loss = 0.00022203
Iteration 29, loss = 0.00019517
Iteration 30, loss = 0.00017497
Iteration 31, loss = 0.00015953
Iteration 32, loss = 0.00014749
Iteration 33, loss = 0.00013794
Iteration 34, loss = 0.00013033
Iteration 35, loss = 0.00012419
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

Scores (Mean Accuracy)
test score: 0.9166666666666666
training score: 1.0

For 35 iterations, we see the loss continuously go down, which is expected of a well-built neural network. For Sharath's dataset, we see a testing accuracy of about 92% and a training accuracy of 100%. This might signify overfitting, but with the testing accuracy also high, we believe it should be fine. With neural networks, adding neurons and layers makes the model more complex, which can improve accuracy (at the risk of overfitting). So, we decided to increase the number of neurons at each layer and see what we get.

In [296]:
# what if we doubled it!
nn = MLPClassifier(hidden_layer_sizes=(20,20,6), activation="relu", solver="sgd", shuffle=True, random_state=614, max_iter=600, verbose=True, learning_rate_init=0.2)
nn = nn.fit(X_train, Y_train)

# see scores (mean accuracy)
print("\nScores (Mean Accuracy)")
print("test score: " + str(nn.score(X_test, Y_test)))
print("training score: " + str(nn.score(X_train, Y_train)))
Iteration 1, loss = 0.69842120
Iteration 2, loss = 0.67927610
Iteration 3, loss = 0.65912810
Iteration 4, loss = 0.63718486
Iteration 5, loss = 0.61311979
Iteration 6, loss = 0.58339721
Iteration 7, loss = 0.54399954
Iteration 8, loss = 0.49506397
Iteration 9, loss = 0.44645408
Iteration 10, loss = 0.40350928
Iteration 11, loss = 0.35974043
Iteration 12, loss = 0.31400745
Iteration 13, loss = 0.26929240
Iteration 14, loss = 0.23055452
Iteration 15, loss = 0.19526259
Iteration 16, loss = 0.16177237
Iteration 17, loss = 0.13005466
Iteration 18, loss = 0.10026079
Iteration 19, loss = 0.07391577
Iteration 20, loss = 0.05362966
Iteration 21, loss = 0.03919814
Iteration 22, loss = 0.02823829
Iteration 23, loss = 0.02007347
Iteration 24, loss = 0.01460553
Iteration 25, loss = 0.01083800
Iteration 26, loss = 0.00808354
Iteration 27, loss = 0.00607342
Iteration 28, loss = 0.00460708
Iteration 29, loss = 0.00353540
Iteration 30, loss = 0.00275041
Iteration 31, loss = 0.00217171
Iteration 32, loss = 0.00174375
Iteration 33, loss = 0.00142549
Iteration 34, loss = 0.00118880
Iteration 35, loss = 0.00100719
Iteration 36, loss = 0.00086580
Iteration 37, loss = 0.00075457
Iteration 38, loss = 0.00066595
Iteration 39, loss = 0.00059460
Iteration 40, loss = 0.00053661
Iteration 41, loss = 0.00048903
Iteration 42, loss = 0.00044958
Iteration 43, loss = 0.00041635
Iteration 44, loss = 0.00038842
Iteration 45, loss = 0.00036480
Iteration 46, loss = 0.00034438
Iteration 47, loss = 0.00032683
Iteration 48, loss = 0.00031169
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

Scores (Mean Accuracy)
test score: 1.0
training score: 1.0
In [297]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix for Sharath's data
pred = nn.predict(X_test)
cm = confusion_matrix(Y_test, pred)

cm_display = ConfusionMatrixDisplay(cm).plot(cmap="pink")

As we assumed, the testing accuracy jumped up to 100%. It is also worth noting that the model trains quickly due to the small size of the training dataset. The confusion matrix above also shows us that everything was classified correctly, with no false positives or false negatives. So, we decided to use the same neural network build for Eric's dataset.

In [298]:
ecX = (manhwa_data.drop(labels=["title", "description", "tags", "rating", "year", "Like_Dislike"], axis=1)).to_numpy()
ecY = (manhwa_data["Like_Dislike"]).to_numpy()

# make training/testing set
X_train, X_test, Y_train, Y_test = train_test_split(ecX, ecY, test_size=0.2, train_size=0.8, random_state=615, shuffle=True)

nn = MLPClassifier(hidden_layer_sizes=(20,20,6), activation="relu", solver="sgd", shuffle=True, random_state=614, max_iter=600, verbose=True, learning_rate_init=0.2)
nn = nn.fit(X_train, Y_train)

# see scores (mean accuracy)
print("\nScores (Mean Accuracy)")
print("test score: " + str(nn.score(X_test, Y_test)))
print("training score: " + str(nn.score(X_train, Y_train)))
Iteration 1, loss = 0.68933202
Iteration 2, loss = 0.66470290
Iteration 3, loss = 0.63535866
Iteration 4, loss = 0.60549905
Iteration 5, loss = 0.57569762
Iteration 6, loss = 0.53604651
Iteration 7, loss = 0.47560466
Iteration 8, loss = 0.39678589
Iteration 9, loss = 0.32356104
Iteration 10, loss = 0.26112361
Iteration 11, loss = 0.20836340
Iteration 12, loss = 0.17205988
Iteration 13, loss = 0.14783393
Iteration 14, loss = 0.13021786
Iteration 15, loss = 0.11168324
Iteration 16, loss = 0.09012525
Iteration 17, loss = 0.07342062
Iteration 18, loss = 0.06490571
Iteration 19, loss = 0.05699885
Iteration 20, loss = 0.04935083
Iteration 21, loss = 0.04064668
Iteration 22, loss = 0.02847394
Iteration 23, loss = 0.01560167
Iteration 24, loss = 0.00860033
Iteration 25, loss = 0.00520570
Iteration 26, loss = 0.00332491
Iteration 27, loss = 0.00230007
Iteration 28, loss = 0.00174464
Iteration 29, loss = 0.00139190
Iteration 30, loss = 0.00111719
Iteration 31, loss = 0.00087734
Iteration 32, loss = 0.00067983
Iteration 33, loss = 0.00052999
Iteration 34, loss = 0.00042258
Iteration 35, loss = 0.00035213
Iteration 36, loss = 0.00030207
Iteration 37, loss = 0.00026559
Iteration 38, loss = 0.00023819
Iteration 39, loss = 0.00021713
Iteration 40, loss = 0.00020050
Iteration 41, loss = 0.00018709
Iteration 42, loss = 0.00017611
Iteration 43, loss = 0.00016701
Iteration 44, loss = 0.00015942
Iteration 45, loss = 0.00015305
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

Scores (Mean Accuracy)
test score: 1.0
training score: 1.0
In [299]:
# confusion matrix for Eric's data
pred = nn.predict(X_test)
cm = confusion_matrix(Y_test, pred)

cm_display = ConfusionMatrixDisplay(cm).plot(cmap="pink")

Again, we see 100% training and testing accuracy, and the confusion matrix shows no false positives or false negatives. The neural network does a good job of generalizing both datasets with a defined trend and datasets with a little more variance. We thought a neural network would need more data points to fully generalize the dataset, but that doesn't seem to be the case.

Due to the high accuracy and low runtime of the neural network on our "reading lists", we think the best model for recommending a user their next manga, manhwa, or manhua is a neural network.
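As a closing sketch of how this model could be put to work, we can score every title in the catalog with the trained network and surface the highest-confidence "likes" that aren't already on the reading list. This assumes the nn (last fitted on Eric's data), prep_df, ohctags_df, and manhwa_data objects from the cells above:

# Sketch: use the trained network to surface new recommendations.
# Assumes nn, prep_df, ohctags_df, and manhwa_data from the cells above.
import numpy as np

candidates = ohctags_df.to_numpy()
probs = nn.predict_proba(candidates)[:, 1]      # P(like) for every title

# skip titles already in the "user profile", then take the top 10
seen = set(manhwa_data["title"])
order = np.argsort(-probs)                      # highest P(like) first
recs = [prep_df.iloc[i]["title"] for i in order
        if prep_df.iloc[i]["title"] not in seen][:10]
print(recs)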

Conclusion¶

In conclusion, our machine learning project has successfully developed a recommendation system that assists users in discovering the manhwas, mangas, and manhuas best suited to their taste. Throughout the project, we explored various techniques and considerations to enhance the accuracy and personalization of our recommendations. One important future modification we are considering is incorporating the ratings and years associated with each of the comics. By considering user ratings, we would be able to factor in the collective opinions of the readership, giving prominence to highly rated works and potentially influencing the recommendations. This would help us create a better system that can help users discover popular titles they have never heard of.

Furthermore, although we implemented content-based techniques in our recommendation system, we acknowledge the potential benefits of exploring alternative models such as Support Vector Machine (SVM). SVM is known for its ability to handle high-dimensional feature spaces and is effective at classification tasks. Integrating SVM into our system could provide an additional perspective for generating accurate and diverse recommendations.

Although we did not test every single machine learning model in our project, my partner and I decided that a Neural Network model might be the best choice. This is because the model achieved 100% accuracy on both of the lists that my partner and I created, which suggests that the Neural Network generalizes best and is a fantastic fit for the average reader.

To further enhance our recommendation system, we can acquire multiple user lists by conducting surveys. By gathering data from a diverse range of users with varying preferences, we can enrich our training dataset and enable broader coverage of genres, art styles, and narrative themes. Incorporating these user lists into our system would allow us to fine-tune our recommendations and account for the unique preferences of different user segments.

Lastly, employing a hybrid filtering method allows for the combination of strengths from both content-based and collaborative filtering. By integrating both approaches, we can leverage the advantages of each method and create a versatile recommendation system. This hybrid approach could further improve recommendation accuracy by considering the attributes of the items being recommended.

In summary, our machine learning project has developed a sophisticated recommendation system that takes into account user preferences, with the potential to incorporate alternative models like SVM. Additionally, by acquiring multiple user lists and exploring hybrid filtering methods, we can enhance the personalization and accuracy of our recommendations. Through our efforts, we aim to guide users towards their next captivating read in the vast world of manhwas, mangas, and manhuas, ensuring an immersive and tailored experience for every reader.

Works Cited¶

Isinkaye, F. O., Folajimi, Y. O., & Ojokoh, B. A. (2015). Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3), 261-273. https://doi.org/10.1016/j.eij.2015.06.005

Krishnan, S. (2021). How to determine the number of layers and neurons in the hidden layer? Geek Culture. https://medium.com/geekculture/introduction-to-neural-network-2f8b8221fbd3

Scikit Learn. (2011). Nearest Neighbors. https://scikit-learn.org/stable/modules/neighbors.html#unsupervised-neighbors