CS 4641 Final Project: Trending Video Predictor

Introduction and Background

YouTube is one of the largest video streaming platforms, with millions of daily active users. With a platform this large come questions about what makes user-posted videos popular, and whether we can find trends among highly popular videos. Videos placed on YouTube's Trending page are guaranteed a large boost in engagement and prestige. The dataset we wish to use holds several months of data on the daily top 200 trending videos in 11 countries. Each video entry includes the video title, channel title, publish time, tags, views, likes and dislikes, video description, and comment count.

Problem Definition

Specifically, we will focus on whether we can predict the popularity of YouTube videos based on various features. We want to investigate the factors that correlate with a video's success in earning a high number of views and likes. Much work has already been done in this area, with many different algorithms built using various techniques, such as regression [1] and minimizing a mean relative square error [2]. We hope to identify patterns that up-and-coming content creators can use to boost the performance of their videos.

Data Collection

We used the Kaggle YouTube trending dataset to aggregate data and statistics about trending YouTube videos in the US. This dataset includes a variety of statistics for trending videos from 2020 to 2022 as we described in the introduction. We created seven features from the data in the original dataset. In this section we will detail our rationale for choosing our features and the methods used to create them from the original dataset.

Number of Keywords in Title and Description

A common idea in search engine optimization for webpages is to include keywords in page content. This works similarly for YouTube videos, so we wanted to measure how many times keywords appear in Trending videos' titles and descriptions. We measured keyword presence with two separate features, one for the title and one for the description, because a keyword appearing in the title is probably rarer and also implies a greater likelihood that the video content is actually related to that keyword. This contrasts with the description, where some creators append a long list of tangentially related keywords in the hope of improving search engine optimization.

To get our list of keywords, we used the Google Trends tool to gather the top and rising search terms within the date range of our video dataset (2020-2022). To clean the dataset, we needed to replace N/A values for videos that didn't have a description and to remove carriage returns and line feeds from the raw data file, which were breaking the pandas CSV parser. We then looped through the list of keywords and counted the number of times each occurred in the title and description of every video, creating two features which measure the relevance and search term optimization of the trending videos.
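
A minimal sketch of this cleaning and counting step, assuming hypothetical file and column names (the Kaggle dataset uses similar fields) and a small stand-in keyword list:

```python
import pandas as pd

# Hypothetical file and column names loosely following the Kaggle dataset.
df = pd.read_csv("US_youtube_trending_data.csv")

# Replace missing descriptions and strip carriage returns / line feeds,
# which otherwise break naive CSV parsing downstream.
df["description"] = (
    df["description"].fillna("").str.replace(r"[\r\n]+", " ", regex=True)
)

# A small sample standing in for the full Google Trends keyword list.
keywords = ["minecraft", "among us", "election"]

def count_keywords(text: str) -> int:
    # Total occurrences of every keyword in the lowercased text.
    text = text.lower()
    return sum(text.count(kw) for kw in keywords)

df["title_keywords"] = df["title"].fillna("").map(count_keywords)
df["description_keywords"] = df["description"].map(count_keywords)
```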

Top Channel Boolean

We also wanted a feature capturing whether the trending video was by one of the most subscribed channels on YouTube, a so-called "top" channel. To this end we scraped a list of the most subscribed YouTubers from Social Blade. We looped through this list and checked, for each entry in our original dataset, whether the channel was a top channel; if it was, we set this feature to 1 (true).
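
A short sketch of this feature under the same assumptions, with a hypothetical `channelTitle` column and a stand-in channel list:

```python
# Hypothetical sample of the most-subscribed channels scraped from Social Blade.
top_channels = {"MrBeast", "PewDiePie", "Cocomelon - Nursery Rhymes"}

# 1 (true) if the video's channel is a top channel, 0 otherwise.
df["top_channel"] = df["channelTitle"].isin(top_channels).astype(int)
```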

Number of comments and comment to view ratio

Number of comments was a field in the original dataset, so we could copy that column directly from the pandas dataframe. A separate comment-to-view ratio feature was created by dividing the number of comments by the number of views. There were a few outlier videos in the dataset with 0 views and 0 comments, which were YouTube Originals content that YouTube put on Trending immediately upon posting. We removed these because they caused a divide-by-zero error and did not represent the "successful" videos we wanted to model with the Trending dataset. These two comment-related features were meant to measure the level of viewer engagement with a video.
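
A sketch of both engagement features, assuming hypothetical `view_count` and `comment_count` columns:

```python
# Drop the zero-view outliers (YouTube Originals placed on Trending at
# publish time) before dividing, to avoid division by zero.
df = df[df["view_count"] > 0].copy()

# Comment-to-view ratio; the raw comment count is used as-is.
df["comment_view_ratio"] = df["comment_count"] / df["view_count"]
```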

Time of day posted

Another factor in video creation and posting on YouTube is when in the day you decide to post the video. This could indirectly reflect the geographic audience you are trying to reach, and certain posting times may result in higher reach than others. The original dataset had a formatted datetime string, from which we pulled a time-of-day substring and converted it into an integer value in seconds from the beginning of the day.
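
One way to compute this with pandas, assuming a hypothetical `publishedAt` column holding ISO datetime strings:

```python
# publishedAt looks like "2021-03-15T19:20:14Z"; keep only the time of
# day, converted to an integer number of seconds since midnight.
published = pd.to_datetime(df["publishedAt"])
df["time_of_day"] = (
    published.dt.hour * 3600 + published.dt.minute * 60 + published.dt.second
)
```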

Category ID

The category ID of the video was directly taken from the dataset. It is a numerical value which tells us which video category the video belonged to. This was meant to give the model a way to differentiate between videos of different categories.

Normalization

We normalized all the features by dividing every value in a feature by the maximum value for that feature.
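
A sketch of this normalization, first assembling the seven features into a NumPy array; the column names are the hypothetical ones from the earlier sketches, and we assume every column has a positive maximum:

```python
import numpy as np

feature_names = [
    "title_keywords", "description_keywords", "top_channel",
    "comment_count", "comment_view_ratio", "time_of_day", "categoryId",
]
features = df[feature_names].to_numpy(dtype=float)
views = df["view_count"].to_numpy(dtype=float)

# Max normalization: divide each column by its maximum so that every
# feature lies in [0, 1].
features = features / features.max(axis=0)
```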

List of all the features in our feature array

  • Keyword presence (title)
  • Keyword presence (description)
  • Top Channel Boolean
  • Number of Comments
  • Comment to view ratio
  • Time published
  • Category ID

Methods

We plan to predict the popularity of a video by looking at its title, description, comments, time of day posted, category, and whether it comes from a top channel. We use these characteristics of the video to predict its popularity, measured by its number of views and likes.

View count prediction is hard because views vary greatly in magnitude, causing very large errors. To combat this issue, we opted to evaluate our results against a threshold value. The threshold is the median of our data: 998,262 views. If a video falls above the median, it is classified as successful; if it falls below, it is classified as unsuccessful. We consider a prediction accurate if the predicted view count and actual view count are both above or both below the threshold. We still plan to look at the results of both the raw regression and this threshold-based classification.
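
A minimal sketch of this threshold metric:

```python
import numpy as np

THRESHOLD = 998_262  # median view count of our dataset

def threshold_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # A prediction counts as correct when it lands on the same side of
    # the median as the actual view count.
    return float(np.mean((y_true > THRESHOLD) == (y_pred > THRESHOLD)))
```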

Our working dataset contained 164,190 entries, with the seven features for each entry built as detailed in the data collection section above. We split the dataset into 82,000 entries for training and 82,190 for testing. We used a variety of methods to predict the number of views a video would receive and compared their respective results. The list of methods used is as follows:

  • Linear Ridge Regression
  • Stochastic Gradient Descent
  • Naive Bayes Classifier
  • Decision Tree
  • Support Vector Machine

We also compared using principal component analysis (PCA) to reduce our features down to two against not using PCA, recording the results for each model. The library we used to train models is scikit-learn [3].
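
As an illustration of the overall setup, a hedged sketch of the train/test split and the with/without-PCA comparison for ridge regression, reusing `features`, `views`, and `threshold_accuracy` from the earlier sketches (the hyperparameters shown are scikit-learn defaults, not necessarily the ones we used):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Roughly the 82,000 / 82,190 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    features, views, train_size=82_000, random_state=0
)

for use_pca in (False, True):
    X_tr, X_te = X_train, X_test
    if use_pca:
        # Reduce the seven features down to two principal components.
        pca = PCA(n_components=2).fit(X_tr)
        X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)
    model = Ridge().fit(X_tr, y_train)
    pred = model.predict(X_te)
    print(use_pca, r2_score(y_test, pred), threshold_accuracy(y_test, pred))
```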

Results and Discussion

Midterm Report and Linear Ridge Regression

First, for the midterm report we chose four specific features (keywords in title, keywords in description, time posted, comment status) to predict the number of views of a video. We used PCA to reduce the number of features down to two. We then ran linear regression as supervised learning on the large portion of our data set aside for training, and used that model to predict values for our testing data, the part of the dataset not used in training. From this, we initially thought that our model was an adequate predictor for the testing set. In Fig. 1, we can see that the predicted values are close to the actual values, with the exception of outliers with very high view counts.

figure 1
Fig.1 - Plot of Testing Actual Values vs. Predicted Values

However, after running the model through an R2 score and a root mean square metric, we discovered that the model does not accurately predict the number of views, even though the figure visually suggests that it would. The R2 score returned a number less than 0.75 and the root mean square metric returned a number greater than 0.40. We are not sure about the cause of these errors, but have some predictions as to what could be causing them:
      1. The outliers are having a large effect on the PCA.
      PCA is affected greatly by outliers, but at the time we did not realize how much of an effect they would have. We may need to branch away from PCA and choose another feature reduction method.
      2. Linear regression may not be a good model for our features.
      3. The view counts have a large amount of variance.
      In our dataset, some videos have a huge number of views while others have comparatively few, which affects both our model and PCA.

Continuing this project meant making changes to the model because of how inaccurate it was. We decided to change the features we were using: we removed comment status and added the new features category ID, number of comments, and comment-to-view ratio. We also attempted different models: stochastic gradient descent, Gaussian Naive Bayes, decision trees, and support vector machines. Lastly, since predicting the exact number of views was hard and the values were expected to vary by thousands, we took the median of the view counts (998,262) and created a threshold from that value. We then ran each regression with and without PCA (to compare) and checked the predicted view count against the threshold. If the predicted value and the expected value were both above or both below the median, we counted that as an accurate prediction.

With these new changes, we first revisited the linear ridge regression model. We found that the model was still inaccurate with PCA, but actually more accurate without it. With PCA, it had an accuracy of 44% against the threshold and an R2 value of -0.03 when predicting the raw view count, so it was very inaccurate; even with the new features, running linear regression with PCA yielded very poor results. Figure 2 shows the results of this model. When we did not use PCA and just used the raw features, we found it was slightly more accurate, with an accuracy of 49% and an R2 score of 0.43. Figure 3 shows a slice of the data, since we could only plot three dimensions at a time. In this slice, we can see that the predicted and real values match up better than in the plot with PCA.

figure 2
Fig.2 - Linear Ridge Regression with New Features with PCA
figure 3
Fig.3 - Linear Ridge Regression with New Features and no PCA

Stochastic Gradient Descent

Seeing that linear ridge regression did not have much success, we ran the data through a stochastic gradient descent (SGD) model. Since our data was already normalized, we could run the model without using a standard scaler. After running the model, we found that the R2 score without PCA was 0.33, with an accuracy of 49% against our threshold. With PCA, the R2 score drops to 0.04, with an accuracy of 45%. These results indicate that SGD is a bad predictor of the number of views a video would get and cannot accurately identify whether a video will be successful. Figures 4 and 5 compare the original data with the predictions SGD generated without PCA and with PCA, respectively.

figure 4
Fig.4 - SGD Y-values vs Datapoint for Predicted and Test with no PCA
figure 5
Fig.5 - SGD Y-values vs Datapoint for Predicted and Test with PCA
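
A minimal sketch of this step, reusing the split and metrics from the earlier sketches; the hyperparameters shown are scikit-learn defaults, not necessarily ours:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score

# Features are already max-normalized, so no StandardScaler is applied.
sgd = SGDRegressor(random_state=0).fit(X_train, y_train)
pred = sgd.predict(X_test)
print(r2_score(y_test, pred), threshold_accuracy(y_test, pred))
```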

Naive Bayes Classifier

Another model that we decided to try is the Gaussian Naive Bayes classifier. This model assumes that the likelihood of each feature is Gaussian and that the features are independently distributed. Because of these assumptions, the classifier is extremely fast, as the probability of each feature can be calculated independently of the others. The assumptions also reduce the impact of the curse of dimensionality: since all features are assumed independent, a Naive Bayes classifier's performance does not degrade as the number of features increases. This could be especially helpful for modeling something as complex as YouTube data, where there are many separate components (such as descriptions, channel names, and thumbnails) to be captured.

To appropriately classify the data, we assigned each row of training and testing data a binary attribute instead of a view count: rows with views higher than the median threshold were assigned a 1, and rows with views lower were assigned a 0. This was the only change made to the training and testing sets before passing them into the model. The classification report, including accuracy, F1 score, precision, and recall, is shown below in Figure 6.

figure 6
Fig.6 - Naive Bayes Classification Report
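
A sketch of this binary labeling and classification step, reusing the split and threshold from the earlier sketches:

```python
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

# Binary labels: 1 if the view count exceeds the median threshold, else 0.
y_train_bin = (y_train > THRESHOLD).astype(int)
y_test_bin = (y_test > THRESHOLD).astype(int)

nb = GaussianNB().fit(X_train, y_train_bin)
print(classification_report(y_test_bin, nb.predict(X_test)))
```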

While the Gaussian Naive Bayes model performed better than the regression models, the feature independence assumption likely caused the overall weak performance. For a datapoint like a YouTube video, many features depend heavily on each other; for example, the number of keywords (top search terms) in a video and the channel's status as a top channel are likely highly correlated. Despite Naive Bayes' strengths in being computationally fast and not suffering from the curse of dimensionality, the assumptions made by the model do appear to hamper its performance.

Decision Tree

We also ran our data through a decision tree model. Decision trees find the best feature to split on, and then the best place to split on that chosen feature. Using the tree, we can follow the splits to determine the predicted number of views for a piece of testing data. Running the data through the decision tree without PCA, we found an R2 score of 0.91 and an accuracy of 99%. With PCA, we found an R2 score of 0.95 and an accuracy of 89%. The results show that decision trees are a highly accurate way to predict the number of views a video would get, and can classify whether a video will be successful. Below, in figures 7 and 8, decision trees of depth 5 showcase how the model predicts YouTube views without and with PCA, respectively. Our decision tree predictor does go deeper than 5, but plotting a depth greater than 5 would produce a figure too hard to read.

figure 7
Fig.7 - Decision Tree with no PCA
figure 8
Fig.8 - Decision Tree with PCA
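
A sketch of fitting and plotting the tree with scikit-learn (no-PCA variant, names reused from the earlier sketches); only the top five levels are drawn, matching the figures:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor, plot_tree

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
pred = tree.predict(X_test)
print(r2_score(y_test, pred), threshold_accuracy(y_test, pred))

# The fitted tree grows deeper than five levels, so only the top of the
# tree is drawn to keep the figure readable.
plt.figure(figsize=(20, 10))
plot_tree(tree, max_depth=5, feature_names=feature_names, filled=True)
plt.show()
```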

Support Vector Machine

The data was also run through a support vector machine (SVM); specifically, we used linear support vector regression, which should scale better with a large number of samples. After running the model, we found that the R2 score was as high as 0.45 and as low as 0.15, and that the accuracy for predicting successful videos was consistently around 80%. The results show that linear support vector regression is a subpar way to predict the number of views and a decent way to predict whether a video will be successful. The model could still be improved substantially by tuning the hyperparameters; the regularization parameter was found by iterating over values spread evenly on a logarithmic scale, and we could instead try cross-validation to tune parameters. Below are the results of the model plotted as a line graph and a 3D scatter plot.

figure 9
Fig.9 - SVM Predicted and Real Values without PCA
figure 10
Fig.10 - SVM Y-values vs Data Points for Predicted and Real Values
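
A hedged sketch of that sweep over the regularization parameter C (the grid bounds and iteration cap are assumptions); cross-validation on the training set would be the more principled alternative mentioned above:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import LinearSVR

best_c, best_r2 = None, -np.inf
# Sweep C over an evenly spaced logarithmic grid.
for c in np.logspace(-2, 2, num=5):
    svr = LinearSVR(C=c, max_iter=10_000, random_state=0).fit(X_train, y_train)
    r2 = r2_score(y_test, svr.predict(X_test))
    if r2 > best_r2:
        best_c, best_r2 = c, r2
print(best_c, best_r2)
```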

Conclusion

In conclusion, we found that the two most accurate models were decision trees and the support vector machine. With decision trees, we were able to predict whether a video would have more or fewer views than the median with 89% accuracy with PCA and 99% accuracy without, which is quite good for predicting views against a threshold. What surprised us was that decision trees were also good at predicting the raw number of views, with a high R2 score of 0.91 without PCA and 0.95 with PCA.

As for the support vector machine, we found that it was accurate, but not as accurate as the decision tree model; it still had a high accuracy of 80%. While this accuracy against the threshold was high, the actual predicted view counts were not good, with an R2 score of 0.45 at best.

Our other models were not as accurate: most had below 70% accuracy, and some were even below 50%. Because of this, we deemed the decision tree model and the support vector machine our most accurate models when evaluated against the threshold. If we were strictly trying to predict the number of views, then decision trees are the best model.

Beyond trying new models, we found that changing the features did help with accuracy, since the linear ridge regression model's accuracy changed before and after we made those changes. The other models were not tested against both feature sets, since we had already found a set that proved accurate. We therefore believe this set of features to be a good predictor of whether a video's view count falls above or below the threshold.

This brings us to the next change we made: adding and comparing against a threshold. We used this threshold to turn the problem into a classification problem, since we found that predicting the exact number of views of a video is very hard due to how much view counts vary. We set the threshold to the median number of views in the dataset. With this threshold, we could measure how accurately a video was classified as above or below it after getting the predicted view count from each regression model, or, in the special case of Naive Bayes, directly from the classifier. We found that despite most models having a low R2 score for predicting raw view counts, some models, like SVM, still predicted quite accurately against the threshold. The decision tree model did well both with and without the threshold. Overall, adding the threshold simplified our problem considerably and helped us understand which models do well at predicting the general performance of a video.

Lastly, we compared the models with and without PCA; that is, we had models predict from data that had been run through PCA and compared them to models trained on the raw features. We found that models trained on raw data were more accurate than models trained on PCA-reduced data.

Throughout this project, we have looked at models, features, and data pre-processing to better understand how to predict a video's view count. We found that some models (SVM, decision trees) performed very well for this problem, and discovered that including certain features (comment count, top channels) vastly improved the models' performance. We also learned how complicated capturing something like a YouTube video can be, and how simplifying a datapoint can make it easier to understand. Looking to the future, a lot of work could build on the results of this project, such as incorporating more features from the dataset (like the thumbnail) while using the same models.

Datasets

YouTube Trending Video Dataset

Google Trends

Gantt Chart

Link

Contribution Table

Name | Contribution
Kenny Hoang | PCA, Decision Trees, SGD models and report, video script
Jeffrey Lei | SVM, parsing data and normalizing data
Elizabeth Liu | Set up NumPy matrix for data and incorporated keyword counts, top channel feature, PCA, Naive Bayes model and report
Allison Lu | PCA, Linear Ridge Regression model and report, video script
Michael Zhou | Parsing data and normalizing data

References

  1. "Popularity Prediction of Videos in YouTube as Case Study: A Regression Analysis Study" https://dl.acm.org/doi/10.1145/3090354.3090406
  2. "Predict the Popularity of Youtube Videos Using View" https://www.semanticscholar.org/paper/Predict-the-Popularity-of-YouTube-Videos-Using-View/7dade77c5a6c58ec2543ea10ed499395957fbcf4
  3. "Image Processing Python Libraries for Machine Learning" https://neptune.ai/blog/image-processing-python-libraries-for-machine-learning