Introduction
In the evolving landscape of digital content, understanding audience engagement and optimizing video performance are pivotal for broadcasters and educators alike. Whether you’re a content creator, a broadcaster, or developing a Massive Open Online Course (MOOC), insights into video performance can be the key to success. This project aims to provide an in-depth analysis of YouTube channels and video performance, leveraging advanced data analytics and machine learning techniques.
Table of Contents
- Research Problem
- Technical Considerations
- Implementation
- Key Outcomes
- Limitations
- Areas for Improvement
- Conclusion
Research Problem
Imagine you’re a broadcaster or a MOOC developer looking to understand how “Educational” video streamers perform. You might want to analyze audience engagement across various topics, such as DIY projects versus Social Science discussions, and understand how video duration impacts viewer interaction. YouTube’s Analytics dashboard is robust but limited in scope, especially when it comes to analyzing competitors’ data. This project uses the Google YouTube Data API v3 to extract valuable insights, focusing on engagement rates, video duration, upload frequency, and audience sentiment.
Technical Considerations
Data Collection
Using the YouTube Data API v3 and Social Blade, data was collected for both videos and channels. These sources provide comprehensive metadata crucial for analysis. The decision to include channels with a similar number of subscribers ensures comparability and relevance in the analysis.
Feature Selection
Given the structure of the data, Embedded Methods were chosen for feature selection as an efficient and balanced approach. These methods integrate feature selection into the model training process, offering several advantages:
- Efficiency: Embedded methods are generally more computationally efficient than wrapper methods.
- Performance: They often strike a good balance between computational cost and performance.
- Overfitting: They typically include regularization techniques that help reduce the risk of overfitting, which is particularly important given the data structure.
Recommended Methods for Embedded Feature Selection:
- LASSO (Least Absolute Shrinkage and Selection Operator):
- How it Works: LASSO adds an L1 penalty to the regression objective, which can shrink some coefficients to zero, effectively performing feature selection.
- Why it’s Suitable: LASSO is useful for linear models and can help identify the most important features while controlling for overfitting.
```python
import numpy as np
from sklearn.linear_model import LassoCV

# Select features and target variable
X = video_stats_filtered[columns_to_consider]
y = video_stats_filtered['engagement_category']

# Note: LassoCV expects a numeric target. If 'engagement_category' is a
# categorical label, encode it first, or use an L1-penalised
# LogisticRegression (the classification analogue of LASSO).
lasso = LassoCV(cv=5).fit(X, y)

# Keep features whose coefficients were not shrunk to zero
importance = np.abs(lasso.coef_)
selected_features = X.columns[importance > 0]
print(f'Selected features using LASSO: {list(selected_features)}')
```
- Tree-based Methods (Random Forest, Gradient Boosting):
- How it Works: These methods naturally rank features based on their importance in reducing impurity in decision trees.
- Why it’s Suitable: They are non-linear and can handle interactions between features well. They also provide a clear ranking of feature importance.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Select features and target variable
X = video_stats_filtered[columns_to_consider]
y = video_stats_filtered['engagement_category']

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features by impurity-based importance
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep features whose importance exceeds the mean importance
selected_features_rf = feature_importances[feature_importances > feature_importances.mean()].index
print(f'Selected features using Random Forest: {list(selected_features_rf)}')
```
Engagement Rate Calculation
The engagement rate was chosen as the target metric because it provides a comprehensive measure of audience interaction. It considers likes, comments, and views, offering a balanced view of how engaging a video is. Specifically, the Engagement Rate by Reach (ERR) formula is used, which calculates engagement as likes plus comments relative to the number of views.
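As a concrete illustration of the ERR calculation described above (the function name and the example figures are hypothetical):

```python
def engagement_rate_by_reach(likes: int, comments: int, views: int) -> float:
    """ERR: likes and comments relative to views, as a percentage."""
    if views == 0:
        return 0.0
    return (likes + comments) / views * 100

# Example: a video with 500 likes, 120 comments, and 10,000 views
rate = engagement_rate_by_reach(500, 120, 10_000)
print(f'{rate:.2f}%')  # 6.20%
```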
Avoiding Data Leakage
To avoid data leakage, feature selection was performed after splitting the data into training and test sets. This ensures that the model does not gain any advantage by seeing information from the test set during training.
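A minimal sketch of this split-then-select order, using synthetic stand-in data and scikit-learn's SelectKBest (the project's real feature matrix would replace the random arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (the real project would use video_stats_filtered)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 15))
y = rng.integers(0, 2, size=200)

# 1. Split first, so the test set never influences feature selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2. Fit the selector on the training data only
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)

# 3. Apply the already-fitted transform to both splits
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)  # (160, 5) (40, 5)
```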
Hyperparameter Tuning with Optuna
Optuna was used for hyperparameter tuning because it offers several advantages over traditional methods like GridSearch and RandomizedSearch. Optuna utilizes Bayesian optimization, which is more efficient and effective at exploring the hyperparameter space. It also includes pruning mechanisms that stop unpromising trials early, saving computational resources.
```python
import optuna
from sklearn.model_selection import cross_val_score

# Define the objective function for Optuna
def objective(trial):
    params = {
        'xgb__n_estimators': trial.suggest_int('xgb__n_estimators', 100, 300),
        'xgb__learning_rate': trial.suggest_float('xgb__learning_rate', 0.01, 0.1),
        'lgbm__n_estimators': trial.suggest_int('lgbm__n_estimators', 100, 300),
        'lgbm__learning_rate': trial.suggest_float('lgbm__learning_rate', 0.01, 0.1),
        'gbc__n_estimators': trial.suggest_int('gbc__n_estimators', 100, 300),
        'gbc__learning_rate': trial.suggest_float('gbc__learning_rate', 0.01, 0.1),
    }
    voting_clf.set_params(**params)
    return cross_val_score(voting_clf, X_train, y_train, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```
Passthrough vs Non-Passthrough in Stacking Ensemble
Both passthrough and non-passthrough structures were assessed in the stacking ensemble to determine the best model configuration. Passthrough allows the meta-model to access the original features along with the base model predictions, potentially improving performance.
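The difference between the two configurations can be sketched with scikit-learn's StackingClassifier on synthetic data; the base models below are stand-ins for the XGBoost, LightGBM, and Gradient Boosting models used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gbc', GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

# passthrough=False: the meta-model sees only the base models' predictions;
# passthrough=True: it also receives the original features.
scores = {}
for passthrough in (False, True):
    stack = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),
        passthrough=passthrough,
    )
    scores[passthrough] = cross_val_score(stack, X, y, cv=3).mean()
    print(f'passthrough={passthrough}: {scores[passthrough]:.3f}')
```

Comparing cross-validated scores for both settings, as above, is how the best configuration was determined.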
Implementation
Data Collection and Preprocessing
Data was collected using the YouTube Data API v3 and Social Blade. Preprocessing involved cleaning the data, handling missing values, and engineering new features to improve the quality of the dataset.
Exploratory Data Analysis (EDA)
EDA was conducted to understand the data distribution and relationships between features. Visualization tools like Plotly and Power BI were used to create interactive plots and dashboards.
Modeling and Evaluation
A Stacking Ensemble method was used, combining models like XGBoost, LightGBM, and Gradient Boosting. Hyperparameter tuning was done using Optuna, and the model’s performance was evaluated using various metrics.
Sentiment Analysis
Sentiment analysis was performed on the comments to understand audience reactions. TextBlob was used for sentiment analysis, with plans for further improvements.
Dashboard and Visualization
An interactive Power BI dashboard was created to visualize the findings. Plotly was used for additional visualizations to support EDA and model evaluation.

PowerBI Dashboard: (dashboard screenshot)

Key Outcomes
By understanding the dynamics of video engagement and sentiment, this research aims to offer actionable strategies for content creators to maximize their impact on YouTube. The project delivers interactive dashboards and comprehensive reports with insights and recommendations.
Limitations
While this project provides extensive insights, it’s essential to note that engagement on social media platforms involves complex metrics. Metrics such as Engagement Rate by View and Conversation Rate are valuable for YouTube analytics but do not cover the full spectrum of engagement. Advanced tools such as the Google Analytics API can provide more detailed insights.
Areas for Improvement
Automation
To be a full end-to-end solution, the entire process from data collection to dashboard creation should be automated using tools like Apache Airflow. This includes scheduling data updates, automating preprocessing and modeling steps, and refreshing the dashboard. Automating these processes ensures that the analysis is always up-to-date and reduces the manual effort required to maintain the system.
User Interaction
Adding a user interface or making the dashboard interactive would significantly enhance the project’s utility. Allowing users to explore different aspects of the data dynamically can provide more insights and make the tool more useful for decision-makers. This could involve using web-based dashboard tools like Dash or Streamlit, which provide more flexibility compared to traditional tools.
Deployment
An end-to-end solution should ideally include deployment, where the model is made available for real-time predictions. This could involve creating an API using frameworks like Flask or FastAPI, or deploying the model on cloud platforms such as AWS, GCP, or Azure. This step ensures that the insights from the model can be used in real-time applications.
Reproducibility
Providing detailed documentation and scripts for setting up the environment, dependencies, and running the entire workflow is necessary. This ensures that others can reproduce the results easily. Using Docker containers can also help in creating a consistent environment for running the analysis.
Continuous Integration/Continuous Deployment (CI/CD)
Implementing CI/CD pipelines can automate testing and deployment of new code changes. This ensures that the project is always in a deployable state and reduces the risk of breaking changes.
Conclusion
This project demonstrates the potential of using advanced data analytics and machine learning techniques to gain insights into YouTube channel and video performance. By leveraging the YouTube Data API v3 and various machine learning models, the project provides valuable insights into engagement rates, video duration, upload frequency, and audience sentiment. These insights can help content creators optimize their strategies and improve their impact on YouTube.
While the project is comprehensive, there are areas for improvement, including automation, user interaction, deployment, and reproducibility. Addressing these areas can turn the project into a full end-to-end solution, making it more robust and useful for a wider audience.
This project showcases the importance of data-driven decision-making in the digital content space. By understanding the dynamics of video engagement and sentiment, content creators can develop more effective strategies to engage their audience and stay competitive in the ever-evolving landscape of online video content.