Members: Sharon Yin, Kyra Zhu
Webpage: https://zzzyx21.github.io
As the gaming industry continues to flourish, it stands as a vibrant and influential force in today's cultural landscape. Recognizing that a video game's success goes beyond mere entertainment, we set out to understand the factors that contribute to it.
Among these factors, sales and reviews are pivotal metrics. Sales represent more than revenue; they attest to a game's commercial viability and its resonance with consumer preferences. Positive reviews, in turn, bolster a game's reputation and shape its future success. Striking a balance between sales performance and critical acclaim matters to developers and publishers alike.
From a data science perspective, the dynamics of video game success make a compelling and practical subject. With datasets containing variables such as sales figures, critic scores, and genre classifications, we can uncover patterns and correlations, and build predictive models, using statistical analysis, data visualization, and modeling techniques.
We therefore undertake this analysis to better understand the interplay between sales and reviews, and what drives a game's success in today's digital age.
We're using a dataset from Kaggle containing video game sales data updated as of 2024. It includes game titles, consoles, genres, critic scores, and sales figures across several regions. We've also integrated the Steam Store 2024 dataset to examine pricing and review ratings. By combining the two, we aim to analyze how reviews and other factors relate to the high sales of popular games, and to identify patterns in game sales, pricing, and reviews over time.
These datasets suit our analysis well, offering comprehensive insight into video game sales and associated factors. By exploring them, we aim to uncover trends, and eventually predictions, in video game popularity, sales, and critical reception over time.
Therefore, we'll investigate three main questions:
Data source
Bayne Brannen, and Asaniczka. (2024). Video Game Sales 2024 [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7507070
Kanchana1990. (2024, February 6). Steam Store 2024: Hot Picks & Reviews [Data set]. Kaggle. https://www.kaggle.com/datasets/kanchana1990/steam-store-2024-hot-picks-and-reviews
%cd /content
!git clone https://github.com/zzzyx21/zzzyx21.github.io.git
%cd /content/zzzyx21.github.io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
/content
Cloning into 'zzzyx21.github.io'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 34 (delta 16), reused 25 (delta 10), pack-reused 0
Receiving objects: 100% (34/34), 2.96 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (16/16), done.
/content/zzzyx21.github.io
We imported the first dataset and labeled it games. We then kept only the columns needed for our analysis, as shown in the code, dropping irrelevant columns such as image, publisher, and last update date.
games = pd.read_csv("./vgchartz-2024.csv", encoding="ISO-8859-1")
games = games[["title", "console", "genre", "critic_score", "total_sales", "na_sales", "jp_sales", "pal_sales", "other_sales", "release_date"]]
games
title | console | genre | critic_score | total_sales | na_sales | jp_sales | pal_sales | other_sales | release_date | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 140 | PC | Platform | NaN | NaN | NaN | NaN | NaN | NaN | 2013/10/16 |
1 | 140 | WiiU | Platform | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 757 | PC | Simulation | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1849 | OSX | Misc | NaN | NaN | NaN | NaN | NaN | NaN | 2014/07/01 |
4 | 1849 | PC | Misc | NaN | NaN | NaN | NaN | NaN | NaN | 2014/07/01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64011 | Zwei!! | PSP | Role-Playing | NaN | 0.02 | NaN | 0.02 | NaN | NaN | 2008/12/11 |
64012 | Zwei!! | PS2 | Role-Playing | NaN | NaN | NaN | NaN | NaN | NaN | 2004/08/26 |
64013 | Zwei!! | PC | Role-Playing | NaN | NaN | NaN | NaN | NaN | NaN | 2001/01/01 |
64014 | Zyklus | PC | Adventure | NaN | NaN | NaN | NaN | NaN | NaN | 2003/04/30 |
64015 | Zyuden Sentai Kyoryuger: Game on Gaburincho | 3DS | Action | NaN | 0.05 | NaN | 0.05 | NaN | NaN | 2013/08/08 |
64016 rows × 10 columns
For preprocessing, we clean the data by removing rows with missing values to reduce noise. Using .dropna(), we first drop rows where either the critic score or total sales is NaN, then drop rows where all four regional sales figures are NaN.
# Dropping rows where all specified columns are NaN
columns_to_check = ['critic_score','total_sales']
games = games.dropna(subset=columns_to_check)
regions_to_check = ['na_sales', 'jp_sales', 'pal_sales', 'other_sales']
games = games.dropna(subset=regions_to_check, how="all")
games
title | console | genre | critic_score | total_sales | na_sales | jp_sales | pal_sales | other_sales | release_date | |
---|---|---|---|---|---|---|---|---|---|---|
33 | .hack//G.U. Vol.2//Reminisce | PS2 | Role-Playing | 6.2 | 0.23 | 0.11 | NaN | 0.09 | 0.03 | 2007/05/08 |
35 | .hack//G.U. Vol.3//Redemption | PS2 | Role-Playing | 5.7 | 0.17 | NaN | 0.17 | NaN | NaN | 2007/09/10 |
36 | .hack//Infection Part 1 | PS2 | Role-Playing | 7.7 | 1.27 | 0.49 | 0.26 | 0.38 | 0.13 | 2003/02/11 |
38 | .hack//Mutation Part 2 | PS2 | Role-Playing | 7.5 | 0.68 | 0.23 | 0.20 | 0.18 | 0.06 | 2003/05/07 |
39 | .hack//Outbreak Part 3 | PS2 | Role-Playing | 7.1 | 0.46 | 0.14 | 0.17 | 0.11 | 0.04 | 2003/09/09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63919 | Zoo Tycoon DS | DS | Strategy | 4.6 | 0.98 | 0.86 | 0.01 | 0.03 | 0.07 | 2005/10/11 |
63934 | ZooCube | GBA | Puzzle | 8.6 | 0.05 | 0.03 | NaN | 0.01 | 0.00 | 2002/05/14 |
63935 | ZooCube | GC | Puzzle | 6.9 | 0.02 | 0.02 | NaN | 0.00 | 0.00 | 2002/05/05 |
63977 | Zubo | DS | Misc | 7.5 | 0.11 | 0.08 | NaN | 0.02 | 0.01 | 2009/03/10 |
63988 | Zuma's Revenge! | PC | Puzzle | 8.3 | 0.01 | 0.01 | NaN | NaN | 0.00 | 2009/09/16 |
4126 rows × 10 columns
Here is the summary statistics for the numerical columns after preprocessing:
summary_statistics = games.describe()
summary_statistics
critic_score | total_sales | na_sales | jp_sales | pal_sales | other_sales | |
---|---|---|---|---|---|---|
count | 4126.000000 | 4126.000000 | 3738.000000 | 1402.000000 | 3779.000000 | 4003.000000 |
mean | 7.101890 | 0.737230 | 0.416581 | 0.108959 | 0.263697 | 0.083560 |
std | 1.439307 | 1.408497 | 0.734706 | 0.162062 | 0.612218 | 0.199425 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 6.300000 | 0.110000 | 0.080000 | 0.020000 | 0.020000 | 0.010000 |
50% | 7.300000 | 0.300000 | 0.180000 | 0.050000 | 0.070000 | 0.020000 |
75% | 8.100000 | 0.750000 | 0.430000 | 0.130000 | 0.250000 | 0.080000 |
max | 10.000000 | 20.320000 | 9.760000 | 1.870000 | 9.850000 | 3.120000 |
We proceed to generate the figure depicting the distribution of total sales. It's evident from the graph that the distribution is highly right-skewed.
plt.figure(figsize=(7,3))
plt.hist(games['total_sales'], bins=30)
plt.title('Distribution of Total Sales (in millions)')
plt.xlabel('Total Sales (in millions)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
In this histogram, we observe that the majority of games have total sales between 0.0 and 2.5 million units. This distribution is typical of sales data, where many products sell modestly while a select few achieve outsized success. To enhance clarity, we replot with a logarithmic frequency axis.
plt.figure(figsize=(7,3))
plt.hist(games['total_sales'], bins=30, log = True)
plt.title('Distribution of Total Sales (in millions)')
plt.xlabel('Total Sales (in millions)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
The log-scaled plot gives a clearer view of the distribution of game sales: the majority of games achieve moderate sales, alongside a notable few that are highly successful.
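The skew can also be quantified numerically. Below is a minimal sketch on synthetic lognormal data (a stand-in for total_sales; the values are made up) showing how pandas' .skew() captures the asymmetry and how a log1p transform reduces it:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" figures (lognormal), standing in for total_sales
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=-1.0, sigma=1.0, size=4000))

raw_skew = sales.skew()            # large and positive for right-skewed data
log_skew = np.log1p(sales).skew()  # smaller after compressing the right tail

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

A skewness near zero would indicate symmetry; the raw series is far from it, while the log1p-transformed series is much closer.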
Next, we'll calculate the correlation between the critic score and total sales to determine their relationship.
games["total_sales"].corr(games["critic_score"])
# a correlation coefficient of 0.28 is considered to be a weak correlation.
0.2811658100265469
games.plot.scatter(x="critic_score", y="total_sales", alpha=.5, color = 'green')
# A coefficient of 0.28 indicates a weak positive correlation between critic scores and total sales.
<Axes: xlabel='critic_score', ylabel='total_sales'>
A correlation coefficient of 0.28 indicates only a weak positive relationship between critic score and total sales.
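Because total sales are so right-skewed, Pearson's r can understate a monotonic relationship; a rank-based Spearman correlation is a useful cross-check. A minimal sketch on synthetic data (column names mirror the dataset, but the values are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
critic = rng.uniform(1, 10, n)
# Sales grow multiplicatively with score plus noise, giving a skewed target
sales = np.exp(0.3 * critic + rng.normal(0, 1, n)) / 100

df = pd.DataFrame({"critic_score": critic, "total_sales": sales})
pearson = df["total_sales"].corr(df["critic_score"])                      # linear association
spearman = df["total_sales"].corr(df["critic_score"], method="spearman")  # rank-based, robust to skew

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

On the real data, a Spearman coefficient noticeably above the Pearson one would suggest the score-sales link is monotonic but distorted by a few blockbuster outliers.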
To determine which genre has the highest sales, we'll aggregate the sales figures for each genre and then create a plot illustrating the total sales for each genre. This visualization will provide a clear indication of which genre is the most successful in terms of sales.
# Sum the sales figures for each genre
genre_sales = games.groupby('genre')[['na_sales', 'jp_sales', 'pal_sales', 'other_sales']].sum()
# Add a total sales column for each genre
genre_sales['total_sales'] = genre_sales.sum(axis=1)
# Sort the genres by total sales
genre_sales_sorted = genre_sales.sort_values('total_sales', ascending=False)
# Plotting the total sales for each genre
plt.figure(figsize=(7,3))
genre_sales_sorted['total_sales'].plot(kind='bar')
plt.title('Total Sales by Genre')
plt.xlabel('Genre')
plt.ylabel('Total Sales (in millions)')
plt.xticks(rotation=45)
plt.show()
genre_sales_sorted['total_sales']
genre
Shooter             609.93
Action              574.34
Sports              484.90
Role-Playing        254.23
Racing              252.24
Misc                164.90
Platform            155.96
Fighting            138.59
Adventure           131.52
Simulation           99.43
Action-Adventure     76.36
Strategy             46.81
Puzzle               31.36
Music                13.37
Party                 2.99
Sandbox               1.89
MMO                   1.17
Education             0.61
Board Game            0.31
Visual Novel          0.03
Name: total_sales, dtype: float64
The bar chart visualizes total sales by genre: the top three genres (Shooter, Action, and Sports) account for over half of all recorded sales, while niche genres such as Party, MMO, and Visual Novel barely register.
Similarly, to determine which consoles have the highest sales, we sum the sales figures for each console and visualize the totals for comparison.
# Combine the steps for grouping PlayStation and Xbox consoles into a single cell
# Group all PlayStation consoles under a single 'PS' category and all Xbox consoles under 'Xbox'
games['console_grouped'] = games['console'].replace(
{
r'PS.*': 'PlayStation', # Use regex to match any PlayStation variation
r'X.*': 'Xbox' # Use regex to match any Xbox variation
},
regex=True
)
# Recalculate the total sales figures for each console group
console_grouped_sales = games.groupby('console_grouped')[['na_sales', 'jp_sales', 'pal_sales', 'other_sales']].sum()
console_grouped_sales['total_sales'] = console_grouped_sales.sum(axis=1)
# Sort the console groups by total sales
console_grouped_sales_sorted = console_grouped_sales.sort_values('total_sales', ascending=False)
# Plotting the total sales for each console group in a single chart
plt.figure(figsize=(10, 6))
console_grouped_sales_sorted['total_sales'].plot(kind='bar')
plt.title('Total Sales by Console Group')
plt.xlabel('Console Group')
plt.ylabel('Total Sales (in millions)')
plt.xticks(rotation=90)
plt.show()
# Return the sorted total sales for further inspection if needed
console_grouped_sales_sorted['total_sales']
console_grouped
PlayStation    1546.15
Xbox            772.41
Wii             190.09
DS              133.28
PC               99.98
GC               80.58
GBA              75.08
3DS              43.92
N64              27.67
NS               25.20
WiiU             20.12
DC               10.68
NES               4.17
SAT               3.80
GBC               3.78
GB                3.31
SNES              0.53
GEN               0.19
VC                0.00
Name: total_sales, dtype: float64
When grouped together, PlayStation consoles (across all generations) have the highest cumulative sales, underscoring the brand's strong presence and enduring popularity in the gaming market. The Xbox group ranks second, showcasing significant market penetration and success, particularly in western markets.
Next, we identify which games have the highest critic scores and the highest sales by sorting the cleaned data.
# Select games with highest critic scores
top_games = games.sort_values(by='critic_score', ascending=False).head(20)
# Plot bar graph
plt.figure(figsize=(10, 6))
plt.bar(top_games['title'], top_games['critic_score'])
plt.title('Top Games with Highest Critic Scores')
plt.xlabel('Game Title')
plt.ylabel('Critic Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Sort by total sales in descending order
highest_sales_games = games.sort_values(by='total_sales', ascending=False).head(20)
# Plot horizontal bar graph
plt.figure(figsize=(10, 10))
plt.barh(highest_sales_games['title'], highest_sales_games['total_sales'])
plt.title('Top Games by Total Sales')
plt.xlabel('Total Sales')
plt.ylabel('Game Title')
plt.gca().invert_yaxis() # Invert y-axis to display highest sales at the top
plt.tight_layout()
plt.show()
These graphs indicate that Grand Theft Auto IV receives the highest critic score, while Grand Theft Auto V achieves the highest sales across all regions. It's apparent that both the Grand Theft Auto series and the Red Dead Redemption series are consistently associated with high critic scores and sales figures, followed by the Call of Duty series.
We then explore how various factors impact total sales. Specifically, we'll examine the influence of factors such as console, genre, critic score, and release date on total sales. By analyzing their relationships, we can gain insights into the drivers behind total sales in the video game industry.
# Create a pairplot with total sales against console, genre, critic score, and release date
sns.pairplot(games, x_vars=["console", "genre", "critic_score", "release_date"], y_vars=["total_sales"], kind='scatter', height=4)
plt.show()
These scatter panels summarize how console, genre, critic score, and release date each relate to total sales.
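Scatter panels against categorical axes like console and genre are hard to read; a box plot per category with a log-scaled y-axis is often clearer for skewed sales. A sketch with synthetic stand-in data (so it runs on its own; real use would pass the cleaned games frame):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
# Stand-in for the cleaned games frame: skewed sales within each genre
demo = pd.DataFrame({
    "genre": rng.choice(["Shooter", "Sports", "Puzzle"], size=300),
    "total_sales": rng.lognormal(mean=-1, sigma=1, size=300),
})

plt.figure(figsize=(7, 3))
ax = sns.boxplot(data=demo, x="genre", y="total_sales")
ax.set_yscale("log")  # log scale tames the right skew seen earlier
ax.set_title("Total Sales by Genre (log scale)")
plt.tight_layout()
```

Each box then shows the median and spread of sales within a genre, which the scatter version obscures when points pile up near zero.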
We now read in the second table and name it as steam.
steam = pd.read_csv("./steam_store_data_2024.csv")
steam = steam[["title", "price", "allReviews"]]
steam
title | price | allReviews | |
---|---|---|---|
0 | 100 Ninja Cats | NaN | NaN |
1 | And the Hero Was Never Seen Again | NaN | NaN |
2 | Tycoon Collection | NaN | NaN |
3 | Arzette: The Jewel of Faramore | NaN | NaN |
4 | Atomic Heart | $29.99 | Mostly Positive |
... | ... | ... | ... |
81 | ULTROS | $22.49 | NaN |
82 | Uncharted: Legacy of Thieves Collection | $24.99 | Very Positive |
83 | Undying | $13.99 | Mostly Positive |
84 | Vampire: The Masquerade - Bloodlines 2 | NaN | NaN |
85 | WitchHand | NaN | NaN |
86 rows × 3 columns
We convert the allReviews labels to numeric scores for easier analysis: 'Overwhelmingly Positive' maps to 9, 'Very Positive' to 8, and so on down to 'Overwhelmingly Negative' at 1.
ratings_map = {
'Overwhelmingly Positive': 9,
'Very Positive': 8,
'Positive': 7,
'Mostly Positive': 6,
'Mixed': 5,
'Mostly Negative': 4,
'Negative': 3,
'Very Negative': 2,
'Overwhelmingly Negative': 1
}
steam['allReviews'] = steam['allReviews'].replace(ratings_map)
steam
title | price | allReviews | |
---|---|---|---|
0 | 100 Ninja Cats | NaN | NaN |
1 | And the Hero Was Never Seen Again | NaN | NaN |
2 | Tycoon Collection | NaN | NaN |
3 | Arzette: The Jewel of Faramore | NaN | NaN |
4 | Atomic Heart | $29.99 | 6.0 |
... | ... | ... | ... |
81 | ULTROS | $22.49 | NaN |
82 | Uncharted: Legacy of Thieves Collection | $24.99 | 8.0 |
83 | Undying | $13.99 | 6.0 |
84 | Vampire: The Masquerade - Bloodlines 2 | NaN | NaN |
85 | WitchHand | NaN | NaN |
86 rows × 3 columns
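One subtlety of this conversion: .replace leaves labels missing from the map untouched, whereas .map turns them into NaN. Which behavior you want depends on whether unexpected labels should surface as missing values. A quick standalone contrast (with made-up labels):

```python
import pandas as pd

ratings = pd.Series(["Very Positive", "Mixed", "Somewhat Positive"])  # last label is not in the map
ratings_map = {"Very Positive": 8, "Mixed": 5}

replaced = ratings.replace(ratings_map)  # unmapped label passes through unchanged
mapped = ratings.map(ratings_map)        # unmapped label becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```

With .replace, a typo in a rating label would silently survive as a string in an otherwise numeric column; with .map it would show up as NaN and be caught by a later isna() check.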
We then strip the $ sign from the price column and convert the values to floats.
steam['price'] = steam['price'].astype(str)
steam['price'] = steam['price'].str.replace('$', '', regex = False)
steam['price'] = steam['price'].astype('float64')
steam
title | price | allReviews | |
---|---|---|---|
0 | 100 Ninja Cats | NaN | NaN |
1 | And the Hero Was Never Seen Again | NaN | NaN |
2 | Tycoon Collection | NaN | NaN |
3 | Arzette: The Jewel of Faramore | NaN | NaN |
4 | Atomic Heart | 29.99 | 6.0 |
... | ... | ... | ... |
81 | ULTROS | 22.49 | NaN |
82 | Uncharted: Legacy of Thieves Collection | 24.99 | 8.0 |
83 | Undying | 13.99 | 6.0 |
84 | Vampire: The Masquerade - Bloodlines 2 | NaN | NaN |
85 | WitchHand | NaN | NaN |
86 rows × 3 columns
For preprocessing, we count the missing values, then drop incomplete and duplicate rows.
steam.isna().sum()
title          0
price         25
allReviews    29
dtype: int64
steam.dropna(inplace=True)  # Drop rows with any NaN value
steam = steam.drop_duplicates(subset='title')  # Assign back so duplicate titles are actually removed
steam
title | price | allReviews | |
---|---|---|---|
4 | Atomic Heart | 29.99 | 6.0 |
6 | Bendy and the Dark Revival | 5.99 | 8.0 |
7 | Bendy and the Dark Revival | 5.99 | 8.0 |
8 | BlazBlue | 17.15 | 6.0 |
9 | Boxes: Lost Fragments | 13.49 | 8.0 |
10 | CARRION | 5.99 | 8.0 |
11 | CARRION | 5.99 | 8.0 |
15 | Crisis Core: Final Fantasy VII Reunion | 29.99 | 8.0 |
16 | Days Gone | 12.49 | 8.0 |
17 | Dead by Daylight | 7.99 | 8.0 |
18 | Dead by Daylight | 7.99 | 8.0 |
19 | Dead Space | 23.99 | 8.0 |
21 | Destiny 2 | 14.99 | 6.0 |
22 | Destiny 2: Lightfall + Annual Pass | 33.00 | 4.0 |
23 | Dragon Quest XI: Echoes of an Elusive Age | 23.99 | 8.0 |
24 | Dragon Quest XI: Echoes of an Elusive Age | 23.99 | 8.0 |
25 | Fallout 76 | 9.99 | 6.0 |
27 | Final Fantasy VII Remake Intergrade | 34.99 | 8.0 |
28 | Final Fantasy VII Remake Intergrade | 34.99 | 8.0 |
29 | Final Fantasy X / X-2 HD Remaster | 11.99 | 8.0 |
30 | Final Fantasy XII: The Zodiac Age | 19.99 | 8.0 |
31 | Final Fantasy XV: Windows Edition | 13.99 | 8.0 |
32 | Flashing Lights - Police, Firefighting, Emerge... | 8.49 | 8.0 |
33 | Flashing Lights - Police, Firefighting, Emerge... | 8.49 | 8.0 |
35 | God of War | 24.99 | 9.0 |
36 | Grounded | 26.79 | 8.0 |
38 | Halo Infinite | 23.99 | 6.0 |
39 | Halo Infinite (Campaign) | 23.99 | 5.0 |
40 | Headbangers: Rhythm Royale | 13.39 | 8.0 |
41 | Hell Let Loose | 42.24 | 8.0 |
42 | HELLCARD | 15.79 | 8.0 |
45 | Hogwarts Legacy | 35.99 | 8.0 |
46 | Hogwarts Legacy | 35.99 | 8.0 |
49 | Last Train Home | 26.39 | 8.0 |
50 | LEGO Star Wars: The Complete Saga | 4.99 | 9.0 |
54 | Marvel’s Spider-Man Remastered | 35.99 | 8.0 |
55 | Marvel’s Spider-Man: Miles Morales | 29.99 | 8.0 |
56 | Moonbreaker | 22.49 | 8.0 |
58 | Ori and the Will of the Wisps | 9.89 | 9.0 |
59 | Ori and the Will of the Wisps | 9.89 | 9.0 |
60 | Overcooked! 2 | 6.24 | 8.0 |
61 | Poppy Playtime - Chapter 3 | 9.89 | 8.0 |
62 | Ratchet & Clank: Rift Apart | 40.19 | 8.0 |
63 | Ready or Not | 37.49 | 6.0 |
64 | Ready or Not | 37.49 | 6.0 |
65 | Returnal | 40.19 | 8.0 |
69 | Star Wars Battlefront II | 3.49 | 8.0 |
70 | Star Wars: The Force Unleashed - Ultimate Sith... | 6.99 | 8.0 |
71 | Star Wars: Empire at War - Gold Pack | 6.99 | 9.0 |
72 | Stranger of Paradise: Final Fantasy Origin | 23.99 | 8.0 |
73 | The Elder Scrolls V: Skyrim Special Edition | 9.99 | 8.0 |
75 | Thronefall | 5.24 | 9.0 |
76 | Thronefall | 5.24 | 9.0 |
77 | Thymesia | 14.99 | 8.0 |
78 | Thymesia | 14.99 | 8.0 |
82 | Uncharted: Legacy of Thieves Collection | 24.99 | 8.0 |
83 | Undying | 13.99 | 6.0 |
# Price is already numeric after the cleaning above; coerce defensively before plotting
steam['price'] = pd.to_numeric(steam['price'], errors='coerce')
# Plotting the histogram of prices
steam['price'].hist(bins=20)
plt.title('Distribution of Game Prices on Steam')
plt.xlabel('Price ($)')
plt.ylabel('Number of Games')
plt.show()
After cleaning the data, we have this plot of the distribution of game prices on Steam.
Next, we'll narrow our focus to PC games. We filter the first dataset (games) down to PC titles and merge the result with the steam data; since Steam is a PC storefront, this merge lets us analyze PC games across both sources.
games = pd.read_csv("./vgchartz-2024.csv", encoding="ISO-8859-1")
games = games[["title", "console", "genre", "critic_score", "total_sales", "na_sales", "jp_sales", "pal_sales", "other_sales"]]
steam = pd.read_csv("./steam_store_data_2024.csv")
steam = steam[["title", "price", "allReviews"]]
steam
ratings_map = {
'Overwhelmingly Positive': 9,
'Very Positive': 8,
'Positive': 7,
'Mostly Positive': 6,
'Mixed': 5,
'Mostly Negative': 4,
'Negative': 3,
'Very Negative': 2,
'Overwhelmingly Negative': 1
}
steam['allReviews'] = steam['allReviews'].replace(ratings_map)
steam['price'] = steam['price'].astype(str)
steam['price'] = steam['price'].str.replace('$', '', regex = False)
steam['price'] = steam['price'].astype('float64')
steam
games = games[["title", "console", "genre"]]
pc_games_vgchartz = games[games['console'] == 'PC']
# Merge the filtered PC games dataset with the Steam data
pc_merged_data = pd.merge(steam, pc_games_vgchartz, on='title', how='inner')
pc_merged_data = pc_merged_data.drop_duplicates(subset='title')
pc_merged_data['price'] = pc_merged_data['price'].fillna(pc_merged_data['price'].mean())
pc_merged_data['allReviews'] = pc_merged_data['allReviews'].fillna(pc_merged_data['allReviews'].mean())
pc_merged_data
title | price | allReviews | console | genre | |
---|---|---|---|---|---|
0 | Tycoon Collection | 22.224375 | 7.714286 | PC | Strategy |
1 | Atomic Heart | 29.990000 | 6.000000 | PC | Shooter |
2 | Banishers: Ghosts of New Eden | 49.990000 | 7.714286 | PC | Role-Playing |
3 | Crisis Core: Final Fantasy VII Reunion | 29.990000 | 8.000000 | PC | Role-Playing |
4 | Days Gone | 12.490000 | 8.000000 | PC | Action-Adventure |
5 | Dead by Daylight | 7.990000 | 8.000000 | PC | Action |
7 | Dead Space | 23.990000 | 8.000000 | PC | Shooter |
8 | Deep Rock Galactic | 22.224375 | 7.714286 | PC | Shooter |
9 | Destiny 2 | 14.990000 | 6.000000 | PC | Shooter |
10 | Fallout 76 | 9.990000 | 6.000000 | PC | Role-Playing |
11 | Fight Crab | 22.224375 | 7.714286 | PC | Action |
12 | Final Fantasy X / X-2 HD Remaster | 11.990000 | 8.000000 | PC | Role-Playing |
13 | Final Fantasy XII: The Zodiac Age | 19.990000 | 8.000000 | PC | Role-Playing |
14 | Final Fantasy XV: Windows Edition | 13.990000 | 8.000000 | PC | Role-Playing |
15 | Goat Simulator | 22.224375 | 7.714286 | PC | Misc |
16 | God of War | 24.990000 | 9.000000 | PC | Action-Adventure |
17 | Grounded | 26.790000 | 8.000000 | PC | Action-Adventure |
18 | Gunvolt Records Cychronicle | 22.224375 | 7.714286 | PC | Music |
19 | Halo Infinite | 23.990000 | 6.000000 | PC | Shooter |
20 | Helldivers 2 | 59.990000 | 7.714286 | PC | Shooter |
21 | Hogwarts Legacy | 35.990000 | 8.000000 | PC | Role-Playing |
23 | Last Train Home | 26.390000 | 8.000000 | PC | Strategy |
24 | LEGO Star Wars: The Complete Saga | 4.990000 | 9.000000 | PC | Misc |
25 | Ori and the Will of the Wisps | 9.890000 | 9.000000 | PC | Action-Adventure |
27 | Ratchet & Clank: Rift Apart | 40.190000 | 8.000000 | PC | Action-Adventure |
28 | Ready or Not | 37.490000 | 6.000000 | PC | Shooter |
30 | Returnal | 40.190000 | 8.000000 | PC | Shooter |
31 | Star Wars Battlefront II | 3.490000 | 8.000000 | PC | Shooter |
32 | Star Wars: The Force Unleashed - Ultimate Sith... | 6.990000 | 8.000000 | PC | Action |
33 | Star Wars: Empire at War - Gold Pack | 6.990000 | 9.000000 | PC | Strategy |
34 | Stranger of Paradise: Final Fantasy Origin | 23.990000 | 8.000000 | PC | Action |
35 | The Elder Scrolls V: Skyrim Special Edition | 9.990000 | 8.000000 | PC | Role-Playing |
36 | Nicolas Eymerich - The Inquisitor | 22.224375 | 7.714286 | PC | Misc |
37 | Thymesia | 14.990000 | 8.000000 | PC | Role-Playing |
39 | Tomb Raider I-III Remastered | 26.990000 | 7.714286 | PC | Action-Adventure |
40 | ULTROS | 22.490000 | 7.714286 | PC | Action-Adventure |
41 | Uncharted: Legacy of Thieves Collection | 24.990000 | 8.000000 | PC | Action-Adventure |
42 | Undying | 13.990000 | 6.000000 | PC | Adventure |
43 | Vampire: The Masquerade - Bloodlines 2 | 22.224375 | 7.714286 | PC | Action |
pc_merged_data.shape
(39, 5)
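Only 39 of the 86 Steam titles survive the inner join, so it is worth auditing what an exact-title merge drops. A minimal sketch (with made-up rows) using pd.merge's indicator flag to count matched and unmatched titles:

```python
import pandas as pd

# Hypothetical mini versions of the two tables; real titles often differ in punctuation or edition
steam = pd.DataFrame({"title": ["God of War", "Returnal", "Undying", "WitchHand"],
                      "price": [24.99, 40.19, 13.99, None]})
vg = pd.DataFrame({"title": ["God of War", "Returnal", "Halo Infinite"],
                   "console": ["PC", "PC", "PC"]})

# how="outer" plus indicator labels each row by which table(s) its title appeared in
audit = steam.merge(vg, on="title", how="outer", indicator=True)
print(audit["_merge"].value_counts())

# left_only rows are Steam titles an inner join would silently discard
unmatched = audit.loc[audit["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Inspecting the left_only titles against the other table can reveal near-misses (punctuation, subtitles, trademark symbols) that fuzzy matching or manual renaming could recover.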
First, we'll visualize the distribution of game prices to understand the pricing landscape for PC games on Steam.
plt.figure(figsize=(7, 3))
sns.histplot(pc_merged_data['price'], bins=30)
plt.title('Distribution of Game Prices')
plt.xlabel('Price ($)')
plt.ylabel('Number of Games')
plt.grid(True)
plt.show()
We observe that most games cluster between $20 and $25; note that much of this spike reflects the mean-imputed price (about $22.22) assigned to games whose price was missing.
Moving forward, we'll analyze the correlation between review scores and prices to determine if higher-priced games generally receive better reviews.
plt.figure(figsize=(6, 3))
sns.scatterplot(data=pc_merged_data, x='price', y='allReviews')
plt.title('Price vs. Review Scores')
plt.xlabel('Price ($)')
plt.ylabel('Review Score')
plt.grid(True)
plt.show()
reg_model = LinearRegression()
reg_model.fit(pc_merged_data[['price']], pc_merged_data['allReviews'])
# Predict review scores and report the fitted slope to quantify the (weak) trend
predicted_reviews = reg_model.predict(pc_merged_data[['price']])
print(f"Slope: {reg_model.coef_[0]:.4f}")
According to this figure, the review scores appear to be generally high across all price ranges, indicating that price is not a direct indicator of higher satisfaction or quality perception among players.
Also, many games, irrespective of price, have received high review scores (8-9), suggesting that quality experiences are available across various price points.
However, the graph shows no clear trend or correlation between price and review scores. The points are spread fairly evenly across the price range, with most review scores clustered in the same region regardless of price, suggesting little to no linear relationship between a Steam game's price and its review score.
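This "no clear trend" impression can be quantified: the fitted slope and R² of a LinearRegression should both be near zero when price carries no information about reviews. A standalone sketch on invented price/review pairs (the real analysis would use pc_merged_data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
# Invented prices and review scores with no built-in relationship
X = rng.uniform(3, 60, size=(40, 1)).round(2)  # prices in dollars
y = rng.choice([5.0, 6.0, 8.0, 9.0], size=40)  # mapped review scores

reg = LinearRegression().fit(X, y)
print(f"slope: {reg.coef_[0]:.4f}, R^2: {reg.score(X, y):.3f}")
```

An R² close to zero confirms numerically what the scatter plot suggests visually.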
We now explore which genres are most common and how they vary in terms of pricing and reviews. This will give us insights into market trends and consumer preferences.
# Group by genre and calculate average price and average review score
genre_analysis = pc_merged_data.groupby('genre').agg(
Average_Price=('price', 'mean'),
Average_Review_Score=('allReviews', 'mean'),
Count=('title', 'count')
).sort_values(by='Count', ascending=False)
# Plotting average price and review score by genre
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
genre_analysis['Average_Price'].plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Average Price by Genre')
axes[0].set_xlabel('Genre')
axes[0].set_ylabel('Average Price ($)')
axes[0].tick_params(axis='x', rotation=45)
genre_analysis['Average_Review_Score'].plot(kind='bar', ax=axes[1], color='lightgreen')
axes[1].set_title('Average Review Score by Genre')
axes[1].set_xlabel('Genre')
axes[1].set_ylabel('Average Review Score')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
# Display the genre analysis for number of games per genre
genre_analysis[['Count']]
Count | |
---|---|
genre | |
Role-Playing | 9 |
Shooter | 9 |
Action-Adventure | 8 |
Action | 5 |
Misc | 3 |
Strategy | 3 |
Adventure | 1 |
Music | 1 |
Some key insights include:
Genre Popularity and Count: Role-Playing and Shooter are the most common genres among these PC games, with 9 titles each, followed closely by Action-Adventure with 8.
Average Price by Genre: Shooter games tend to have higher prices on average compared to other genres, which might reflect higher production values or market expectations for this genre. Other genres like Action-Adventure and Role-Playing show lower average prices, indicating more price variability or possibly different pricing strategies within these genres.
The success of any predictive modeling project hinges on robust data preparation. For this analysis, we began by cleaning the video game sales data sourced from Kaggle and the Steam Store, ensuring completeness by removing entries with missing values in critical fields such as console, genre, critic score, release date, and total sales. We then extracted relevant features for our prediction models: game titles, consoles, genres, critic scores, release dates, and sales figures.
To enhance the predictive power of our models, we transformed categorical data (e.g., console and genre) using one-hot encoding, allowing models to better capture the impact of different categories on sales. Numerical data (e.g., critic scores) were standardized to bring them onto a comparable scale, mitigating any bias towards variables with larger ranges.
We selected two distinct types of regression models to predict video game sales: RandomForest and K-Nearest Neighbors (KNN), each with its strengths and computational considerations.
Random Forest Regressor. Rationale: capable of handling complex nonlinear relationships and feature interactions without extensive hyperparameter tuning, and robust against overfitting when the forest has many trees. Configuration: 100 trees, with random state 42 for reproducibility. Strengths: captures complex patterns, provides feature importance metrics, and is generally less sensitive to outliers. Limitations: computationally intensive and less interpretable than simpler models.
K-Nearest Neighbors (KNN) Regressor. Rationale: a simple, instance-based learner that predicts new instances from the closest historical examples in the feature space. Configuration: K=5 for our baseline, so each prediction averages the five nearest points. Strengths: intuitive and straightforward; performs well on smaller datasets. Limitations: sensitive to the local structure of the data, easily affected by noisy or irrelevant features, and requires feature scaling so that all dimensions contribute equally to distances.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_selector as selector
# Load the data
games = pd.read_csv("./vgchartz-2024.csv", encoding="ISO-8859-1")
games = games[["title", "console", "genre", "critic_score", "total_sales", "release_date"]]
# Drop rows missing any field needed for modeling (reassign instead of inplace to avoid chained-assignment issues)
games = games.dropna(subset=["console", "genre", "critic_score", "release_date", "total_sales"])
# Split features and target variable
X = games.drop(columns=["total_sales"])
y = games["total_sales"]
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing
# Define transformers for numerical and categorical features
numerical_features = selector(dtype_exclude="object")(X)
categorical_features = selector(dtype_include="object")(X)
# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
# Define model
model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))])
# Train the model
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
Mean Squared Error: 1.6234001455878784
# Log-transform the actual and predicted total sales
y_test_log = np.log1p(y_test)
y_pred_log = np.log1p(y_pred)
# Fit a linear regression line
slope, intercept = np.polyfit(y_test_log, y_pred_log, 1)
line = slope * y_test_log + intercept
# Plotting actual vs predicted total sales with a trend line on a log scale
plt.figure(figsize=(10, 6))
plt.scatter(y_test_log, y_pred_log, color='blue', alpha=0.5)
plt.plot(y_test_log, line, color='red', linewidth=2)
plt.xlabel('Log(Actual Total Sales)')
plt.ylabel('Log(Predicted Total Sales)')
plt.title('Log(Actual) vs Log(Predicted) Total Sales with Trend Line (Random Forest Regressor)')
plt.grid(True)
plt.show()
# Calculate R-squared value
r_squared = r2_score(y_test_log, y_pred_log)
print(f"R-squared value: {r_squared}")
R-squared value: 0.2792974010449363
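A brief note on the log transform used above: `np.log1p` (i.e. log(1 + x)) is chosen rather than a plain `np.log` because many titles in the dataset have sales at or near zero.

```python
import numpy as np

# log1p(x) = log(1 + x) stays finite at x = 0, so zero-sales titles do
# not blow up to -inf the way a plain np.log(0) would.
sales = np.array([0.0, 0.02, 0.5, 2.0, 20.0])
logged = np.log1p(sales)

# expm1 inverts the transform exactly, which is what would be needed to
# report model predictions back in millions of units.
restored = np.expm1(logged)
print(np.allclose(restored, sales))  # True
```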
from sklearn.neighbors import KNeighborsRegressor
# Define model
knn_model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', KNeighborsRegressor(n_neighbors=5))])
# Train the model
knn_model.fit(X_train, y_train)
# Predictions
y_pred = knn_model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
Mean Squared Error: 1.5529656727272727
# Log-transform the actual and predicted total sales
y_test_log = np.log1p(y_test)
y_pred_log = np.log1p(y_pred)
# Fit a linear regression line
slope, intercept = np.polyfit(y_test_log, y_pred_log, 1)
line = slope * y_test_log + intercept
# Plotting actual vs predicted total sales with a trend line on a log scale
plt.figure(figsize=(10, 6))
plt.scatter(y_test_log, y_pred_log, color='blue', alpha=0.5)
plt.plot(y_test_log, line, color='red', linewidth=2)
plt.xlabel('Log(Actual Total Sales)')
plt.ylabel('Log(Predicted Total Sales)')
plt.title('Log(Actual) vs Log(Predicted) Total Sales with Trend Line (KNN)')
plt.grid(True)
plt.show()
# Calculate R-squared value
r_squared = r2_score(y_test_log, y_pred_log)
print(f"R-squared value: {r_squared}")
R-squared value: 0.27131038170548916
Random Forest Regressor: The scatter plot shows a dense clustering of points around the lower sales values with predictions generally following the actual sales trend as indicated by the linear trend line. This model demonstrates a stronger linear relationship between predicted and actual sales, especially for lower and mid-range sales values, indicated by the closer fit of points along the trend line.
K-Nearest Neighbors (KNN): Similar to the Random Forest model, there is a dense clustering around lower sales values, but the spread of points away from the trend line is slightly more pronounced, especially at higher sales values. The trend line suggests that while KNN follows the overall trend well, it might struggle with accuracy at the extremes, potentially due to its sensitivity to local variations in data.
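Since squared errors are hard to read in sales units, the held-out MSE values printed above convert to RMSE (back in millions of copies sold) as follows:

```python
import math

# Held-out test MSEs printed by the two model cells above.
mse_rf = 1.6234001455878784
mse_knn = 1.5529656727272727

# RMSE is in the target's own units (millions of copies sold), so a
# typical prediction is off by roughly 1.2-1.3 million units.
rmse_rf = math.sqrt(mse_rf)    # ≈ 1.274
rmse_knn = math.sqrt(mse_knn)  # ≈ 1.246
print(f"RMSE RF: {rmse_rf:.3f}, RMSE KNN: {rmse_knn:.3f}")
```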
from sklearn.model_selection import cross_val_score
# Define the RandomForestRegressor model
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))])
# Perform cross-validation
cv_scores_rf = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_squared_error')
# Convert the negative MSE scores to positive and calculate RMSE
cv_rmse_scores_rf = np.sqrt(-cv_scores_rf)
# Print the RMSE scores and mean RMSE
print("Cross-Validation RMSE Scores:", cv_rmse_scores_rf)
print("Mean RMSE:", cv_rmse_scores_rf.mean())
# Plotting cross-validation RMSE scores
plt.figure(figsize=(7, 3))
plt.bar(range(1, 6), cv_rmse_scores_rf)
plt.xlabel('Fold')
plt.ylabel('RMSE')
plt.title('Cross-Validation RMSE Scores for Random Forest Regressor')
plt.grid(axis='y')
plt.show()
Cross-Validation RMSE Scores: [1.82545462 1.64839904 1.01734427 1.04961326 0.96906819]
Mean RMSE: 1.3019758763516938
from sklearn.neighbors import KNeighborsRegressor
# Define the KNeighborsRegressor model
knn_model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', KNeighborsRegressor(n_neighbors=5))])
# Perform cross-validation on the KNN model
cv_scores_knn = cross_val_score(knn_model, X, y, cv=5, scoring='neg_mean_squared_error')
# Convert the negative MSE scores to positive and calculate RMSE for KNN
cv_rmse_scores_knn = np.sqrt(-cv_scores_knn)
# Print the RMSE scores and mean RMSE for KNN
print("Cross-Validation RMSE Scores for KNN:", cv_rmse_scores_knn)
print("Mean RMSE for KNN:", cv_rmse_scores_knn.mean())
# Plotting cross-validation RMSE scores for KNN
plt.figure(figsize=(7, 3))
plt.bar(range(1, 6), cv_rmse_scores_knn)
plt.xlabel('Fold')
plt.ylabel('RMSE')
plt.title('Cross-Validation RMSE Scores for KNN')
plt.grid(axis='y')
plt.show()
Cross-Validation RMSE Scores for KNN: [1.7273006  1.62028745 1.00503727 1.07073941 1.02809062]
Mean RMSE for KNN: 1.2902910677318675
Random Forest Regressor: The RMSE scores across the five folds show variability, with some folds performing significantly better than others. The variation suggests that while the model generally performs well, its predictions can vary depending on the specific subset of data used, indicating potential overfitting or sensitivity to data partitioning.
K-Nearest Neighbors (KNN): KNN shows less variability in RMSE scores across folds than Random Forest, suggesting more consistent performance across different data subsets. Notably, its mean RMSE (1.290) is marginally lower than Random Forest's (1.302), so on this cross-validation comparison the two models are effectively tied. KNN's reliance on local data characteristics makes its per-fold behavior steadier here, while Random Forest's advantage shows up instead in the slightly better held-out log-scale fit reported earlier.
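The fold-to-fold variability described above can be quantified directly from the cross-validation RMSE arrays printed by the two runs:

```python
import numpy as np

# Per-fold RMSE scores as printed by the cross-validation cells above.
cv_rmse_rf = np.array([1.82545462, 1.64839904, 1.01734427, 1.04961326, 0.96906819])
cv_rmse_knn = np.array([1.7273006, 1.62028745, 1.00503727, 1.07073941, 1.02809062])

# The standard deviation across folds summarizes stability: a smaller
# value means performance depends less on which subset was held out.
print(f"RF:  mean={cv_rmse_rf.mean():.3f}, std={cv_rmse_rf.std():.3f}")
print(f"KNN: mean={cv_rmse_knn.mean():.3f}, std={cv_rmse_knn.std():.3f}")
```

The KNN scores have a smaller spread across folds (std ≈ 0.32 vs ≈ 0.36 for Random Forest), which is what the comparison above is describing.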
from sklearn.model_selection import learning_curve
# Define the learning curve parameters
train_sizes, train_scores, test_scores = learning_curve(
knn_model, X, y, cv=5, scoring='neg_mean_squared_error',
train_sizes=np.linspace(0.1, 1.0, 5))
# Convert the negative MSE scores to per-fold RMSE, then aggregate
train_rmse = np.sqrt(-train_scores)
test_rmse = np.sqrt(-test_scores)
train_rmse_scores = train_rmse.mean(axis=1)
test_rmse_scores = test_rmse.mean(axis=1)
# Std of the per-fold RMSEs (note: sqrt of the MSE std is not a valid std)
train_rmse_std = train_rmse.std(axis=1)
test_rmse_std = test_rmse.std(axis=1)
# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_rmse_scores, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_rmse_scores, 'o-', color="g", label="Cross-validation score")
plt.fill_between(train_sizes, train_rmse_scores - train_rmse_std,
train_rmse_scores + train_rmse_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_rmse_scores - test_rmse_std,
test_rmse_scores + test_rmse_std, alpha=0.1, color="g")
plt.title('Learning Curves for KNN Model')
plt.xlabel('Training examples')
plt.ylabel('RMSE')
plt.legend(loc="best")
plt.grid(True)
plt.show()
# Output RMSE scores from cross-validation
cv_rmse_scores_knn
array([1.7273006 , 1.62028745, 1.00503727, 1.07073941, 1.02809062])
# Define the learning curve parameters for RandomForest
train_sizes_rf, train_scores_rf, test_scores_rf = learning_curve(
rf_model, X, y, cv=5, scoring='neg_mean_squared_error',
train_sizes=np.linspace(0.1, 1.0, 5))
# Convert the negative MSE scores to per-fold RMSE for RandomForest, then aggregate
train_rmse_rf = np.sqrt(-train_scores_rf)
test_rmse_rf = np.sqrt(-test_scores_rf)
train_rmse_scores_rf = train_rmse_rf.mean(axis=1)
test_rmse_scores_rf = test_rmse_rf.mean(axis=1)
train_rmse_std_rf = train_rmse_rf.std(axis=1)
test_rmse_std_rf = test_rmse_rf.std(axis=1)
# Plot learning curve for RandomForest
plt.figure(figsize=(10, 6))
plt.plot(train_sizes_rf, train_rmse_scores_rf, 'o-', color="r", label="Training score")
plt.plot(train_sizes_rf, test_rmse_scores_rf, 'o-', color="g", label="Cross-validation score")
plt.fill_between(train_sizes_rf, train_rmse_scores_rf - train_rmse_std_rf,
train_rmse_scores_rf + train_rmse_std_rf, alpha=0.1, color="r")
plt.fill_between(train_sizes_rf, test_rmse_scores_rf - test_rmse_std_rf,
test_rmse_scores_rf + test_rmse_std_rf, alpha=0.1, color="g")
plt.title('Learning Curves for RandomForest Model')
plt.xlabel('Training examples')
plt.ylabel('RMSE')
plt.legend(loc="best")
plt.grid(True)
plt.show()
# Output RMSE scores from cross-validation for RandomForest
cv_rmse_scores_rf
array([1.82545462, 1.64839904, 1.01734427, 1.04961326, 0.96906819])
KNN Model Learning Curve: The training score starts at a higher RMSE and decreases as more training data is used. This suggests that the model initially overfits to smaller samples but generalizes better with more data. The cross-validation score decreases and begins to plateau, indicating that adding more data beyond a certain point might not significantly improve the model's performance on unseen data.
RandomForest Model Learning Curve: The training score starts at a very low RMSE, showing that the RandomForest model fits the training data very well right from the start. The cross-validation score decreases significantly as the number of training examples increases, suggesting that the model is learning effectively. The relatively flat trend at larger training sizes suggests that the model might not gain much from more data in terms of generalization.
Given the Kaggle sales dataset (updated as of 2024) and the 2023 Steam Store dataset, our analysis aimed to understand the factors influencing video game sales, using models to predict sales from features such as critic scores, consoles, and genres. Through careful data preprocessing and machine learning models, namely K-Nearest Neighbors (KNN) and Random Forest, we evaluated model performance via cross-validation and learning curves.
Our findings reveal that the Random Forest model generally provided better performance and robustness than KNN, as indicated by its slightly higher log-scale R² on the held-out set and smoother learning curves, although the cross-validation RMSE values of the two models were nearly identical. This suggests that complex models capable of handling nonlinear relationships and interactions between features are more effective at capturing the dynamics of video game sales.
In conclusion, by applying these predictive models, stakeholders in the gaming industry can better anticipate sales outcomes, aiding in strategic decisions such as marketing and development. Future work may explore more sophisticated models or incorporate additional data such as user engagement metrics to further enhance predictive accuracy.
We also want to explore the relationship between just the critic scores and total sales of video games using the K-Nearest Neighbors (KNN) regression model. The KNN model is chosen for its simplicity and effectiveness in capturing non-linear relationships without the need for complex parameter tuning. By focusing solely on critic scores as predictors, this analysis seeks to quantify how well these scores can predict sales outcomes and to what extent they reflect market success.
In addition to the KNN model, a simple linear regression trend line is also plotted to provide a baseline visualization of the relationship between critic scores and sales. This approach not only complements the KNN model with a traditional regression analysis but also highlights the potential variability and prediction accuracy when using critic scores as a standalone metric.
# Selecting features and target
X = games[['critic_score']] # Using only critic_score as feature for simplicity
y = games['total_sales']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Defining a pipeline for preprocessing and modeling
pipeline = Pipeline([
('scaler', StandardScaler()), # Feature scaling
('knn', KNeighborsRegressor(n_neighbors=5)) # KNN regressor
])
# Train the KNN model
pipeline.fit(X_train, y_train)
# Creating an example with a specific critic score to predict sales
example = pd.DataFrame({'critic_score': [8.5]})  # Example critic score of 8.5
# Predicting sales for the example
predicted_sales = pipeline.predict(example)
predicted_sales
array([2.916])
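The single prediction above is simply the mean of the sales of the five nearest training titles by critic score. A toy sketch (invented numbers, not the project data) makes the mechanics concrete; scaling a lone feature does not change neighbor ordering, so the unscaled version behaves the same way as the pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy critic scores and sales in millions (illustrative values only).
scores = np.array([[6.0], [7.0], [8.0], [8.4], [8.6], [9.0], [9.5]])
sales = np.array([0.2, 0.5, 1.0, 2.0, 3.0, 4.0, 6.0])

knn = KNeighborsRegressor(n_neighbors=5).fit(scores, sales)

# The five scores closest to 8.5 are 8.4, 8.6, 8.0, 9.0 and 9.5, so the
# prediction is the plain average of their sales: (2+3+1+4+6)/5 = 3.2.
pred = knn.predict([[8.5]])
print(pred[0])  # 3.2
```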
# Creating a scatter plot of critic_score vs total_sales
plt.figure(figsize=(10, 6))
plt.scatter(X_train['critic_score'], y_train, color='blue', alpha=0.5, label='Training data')
# Fitting a simple linear regression to add a trend line
slope, intercept = np.polyfit(X_train['critic_score'], y_train, 1)
trend_line = slope * X_train['critic_score'] + intercept
# Plot the trend line
plt.plot(X_train['critic_score'], trend_line, color='red', linewidth=2, label='Trend line')
# Adding the example point with predicted sales
plt.scatter([8.5], predicted_sales, color='red', s=100, label='Example prediction')
# Setting up the plot
plt.xlabel('Critic Score')
plt.ylabel('Total Sales (millions)')
plt.title('Critic Score vs. Total Sales with Trend Line')
plt.legend()
plt.grid(True)
plt.show()
The scatter plot of Critic Score vs. Total Sales with the trend line shows a general relationship between the critic scores of video games and their sales in millions. The data points indicate that while there's a positive correlation, the relationship isn't very strong, as evidenced by the spread of data points around the trend line. Most games, irrespective of high or moderate critic scores, tend to cluster towards the lower sales range, suggesting other factors might be influencing sales more significantly.
The trend line, plotted as a simple linear regression, suggests a modest upward trend, indicating that games with higher critic scores tend to achieve slightly higher sales. However, the vast spread of data points, especially at higher critic scores, highlights the variability in sales outcomes.
The example prediction, marked for a game with a critic score of 8.5, aligns close to the trend line but on the lower end of the sales spectrum compared to other games with similar scores. This suggests that while critic scores can provide a guide, they are not definitive predictors of sales success.
In this analysis, we explored the relationship between video game sales and critic scores using machine learning models, specifically K-Nearest Neighbors (KNN). We preprocessed the data to handle missing values and scale numerical inputs appropriately, ensuring our models could interpret the data effectively.
Our findings indicate that while there is a positive correlation between critic scores and game sales, the relationship is weak and highly variable. This suggests that other factors such as game genre, platform exclusivity, marketing, and external economic conditions might also play significant roles in influencing game sales.
The models used, including KNN and exploratory analysis via regression plotting, provided valuable insights but also highlighted the limitations of using critic scores alone to predict sales. Future analyses could benefit from integrating additional data points such as user reviews, social media sentiment, and detailed market conditions to create more robust predictive models.
This project underscores the complexity of the video game market and the multifaceted influences on sales, suggesting a need for comprehensive analytics approaches to fully understand and predict market behaviors.