Machine Learning Algorithms and Basic Implementation
Overview
Teaching: 130 min
Exercises: 130 min
Questions
What are the fundamental concepts in ML?
Which frameworks and libraries are used in ML?
How do I train and test ML models?
Objectives
Gain an understanding of fundamental machine learning concepts.
Learn and apply best practices for training, evaluating, and interpreting machine learning models.
WHAT DO THE WORDS MEAN?
Artificial Intelligence: The ability of computing systems to achieve human-like performance on complex tasks.
- Conceptual umbrella term.
- AI is the outcome, not the method.
- It is defined as the science and engineering of making intelligent machines that behave like humans.
Machine Learning Algorithms
Machine learning is a subset of artificial intelligence (AI) that focuses on developing computer systems capable of learning and improving from data without being explicitly programmed.
- Machine learning is a field of study that enables computers to learn and improve from experience without being explicitly programmed.
- Machine learning is a methodology, and it is currently our most successful approach to achieving AI.
- At a high level, ML systems look at large amounts of data and derive rules to predict outcomes for unseen data.
Machine learning is widely used in various fields and applications such as:
• Face Recognition
• Object Detection
• Chatbots
• Recommendation Systems
• Autonomous Vehicles
• Disease Diagnosis
• Fraud detection
And many more: machine learning is nowadays used in almost every sector, including economics, statistics, healthcare, agriculture, education, business, construction, and astronomy.
Types of Machine Learning
- Supervised Learning: Algorithms learn from labeled training data to predict outcomes. It includes classification (predicting categories) and regression (predicting numerical values). The goal is to learn the mapping from input to output. Examples include image classification and email spam detection.
- Unsupervised Learning: Algorithms find hidden patterns in unlabeled data. A common task is clustering, which groups similar examples together.
- Reinforcement Learning: Algorithms learn by interacting with an environment and receiving rewards or penalties for actions to maximize performance.
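To make the contrast concrete, here is a minimal toy sketch (separate from the GDP data used later, and assuming scikit-learn is installed): the same six points are used once with labels (supervised classification) and once without labels (unsupervised clustering).
# Toy example: supervised vs. unsupervised learning on six 1-D points
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])               # labels make this supervised

clf = LogisticRegression().fit(X_toy, y_toy)       # learns the input-to-label mapping
print(clf.predict([[2.5], [11.5]]))                # predicts categories for unseen inputs

km = KMeans(n_clusters=2, n_init=10).fit(X_toy)    # no labels: unsupervised clustering
print(km.labels_)                                  # groups similar points together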
Linear Models
Linear Regression
Assumptions:
- Linearity: The relationship between the predictors and the target is linear.
- Independence of errors: Residuals should be independent of each other.
- Normality of residuals (for inference).
- Homoscedasticity: The variance of the residuals is constant across all levels of the predictor variable(s).
How It Works:
- Finds a linear combination of input features to predict the target value. Minimizes the sum of squared residuals.
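In symbols (a standard formulation, with y_i the target, x_i the feature vector, beta the coefficients, and beta_0 the intercept), ordinary least squares solves:
\min_{\beta_0,\ \beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2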
Pros:
- Simple and interpretable.
- Fast to train and widely understood.
Cons:
- Sensitive to outliers.
- Might underfit if the relationship is non-linear.
Install Machine Learning Frameworks and Libraries
pip install xgboost
pip install lightgbm
# Note: scikit-learn is installed under the package name "scikit-learn" (the old "sklearn" PyPI name is deprecated)
pip install -U scikit-learn
pip install ipywidgets
Importing Libraries
# Import the NumPy library for numerical operations
import numpy as np
# Import the Pandas library for data manipulation and analysis
import pandas as pd
# Import Matplotlib for plotting and visualization
import matplotlib.pyplot as plt
# Import train_test_split from scikit-learn for splitting data into training and testing sets
from sklearn.model_selection import train_test_split
# Import various regression models from scikit-learn
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet # Linear models
from sklearn.tree import DecisionTreeRegressor # Decision tree model
from sklearn.neighbors import KNeighborsRegressor # K-nearest neighbors model
from sklearn.svm import SVR # Support vector regression model
# Import ensemble models from scikit-learn
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
# Import XGBoost library for gradient boosting
import xgboost as xgb
# Import LightGBM library for gradient boosting
import lightgbm as lgb
# Import r2_score from scikit-learn for evaluating model performance
from sklearn.metrics import r2_score
# Import StandardScaler from scikit-learn for feature scaling
from sklearn.preprocessing import StandardScaler
# Import ipywidgets for interactive widgets
import ipywidgets as widgets
# Import display to render widgets and plots in the notebook
from IPython.display import display
Download data
Load the data from the csv file
# Load the data from the csv file
data = pd.read_csv('gdp_data.csv', index_col='Year')
data.head()
Plot the timeseries of the data
# Create a function to plot the time series
def plot_time_series(column):
    plt.figure(figsize=(10, 6))
    data[column].plot()
    plt.title(f'{column}')
    plt.xlabel('Year')
    plt.ylabel(column)
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Create a dropdown widget for selecting the column
column_selector = widgets.Dropdown(
    options=data.columns,
    description='Column:',
    disabled=False,
)

# Link the dropdown widget to the plot_time_series function
interactive_plot = widgets.interactive_output(plot_time_series, {'column': column_selector})

# Display the widget and the interactive plot
display(column_selector, interactive_plot)
Split data
X = data.drop("gdp", axis=1)
y = data["gdp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=False)
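Because shuffle=False keeps the rows in chronological order, the last 30% of years becomes the test set. A quick optional check (assuming the Year index loaded as expected) makes the split explicit:
# Inspect the chronological split (exact years depend on gdp_data.csv)
print("Training years:", X_train.index.min(), "to", X_train.index.max())
print("Testing years: ", X_test.index.min(), "to", X_test.index.max())
print("Shapes:", X_train.shape, X_test.shape)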
Define and Train the Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)
Make predictions
predictions_lr = lr.predict(X_test)
predictions_lr
Evaluate the model
r2_lr = r2_score(y_test, predictions_lr)
print(f" R^2: {r2_lr:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_lr, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Linear Regression')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Ridge Regression
Assumptions:
- Similar to linear regression, with an added assumption that parameter weights should be small.
How It Works:
- Adds an L2 penalty (sum of squared coefficients) to the cost function to shrink coefficients and reduce variance.
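In the same notation as before, Ridge adds an L2 penalty with strength alpha (matching scikit-learn's Ridge objective):
\min_{\beta_0,\ \beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \alpha \lVert \beta \rVert_2^2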
Pros:
- Reduces model complexity and prevents overfitting.
- Handles multicollinearity well.
Cons:
- Still assumes a linear relationship.
- Coefficients are shrunk but not set to zero.
Define and train the Ridge Regression Model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
Make predictions
predictions_ridge = ridge.predict(X_test)
predictions_ridge
Evaluate the model
r2_ridge = r2_score(y_test, predictions_ridge)
print(f" R^2: {r2_ridge:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_ridge, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Ridge Regression')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Lasso Regression
Assumptions:
- Similar to linear regression, but encourages sparsity in coefficients.
How It Works:
- Adds an L1 penalty (absolute sum of coefficients) to the cost function, which can set some coefficients to exactly zero.
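Sketched in the same notation, the Lasso penalty is the absolute sum of the coefficients (scikit-learn also rescales the squared-error term by 1/(2n)):
\min_{\beta_0,\ \beta} \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \alpha \lVert \beta \rVert_1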
Pros:
- Performs feature selection by eliminating irrelevant features.
Cons:
- Can be unstable for some datasets (feature selection might be too aggressive).
Define and train Lasso Regression Model
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
Make predictions
predictions_lasso = lasso.predict(X_test)
predictions_lasso
Evaluate the model
r2_lasso = r2_score(y_test, predictions_lasso)
print(f" R^2: {r2_lasso =:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_lasso, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Lasso Regression')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Elastic Net Regression
Assumptions:
- Similar to linear regression, combines L1 and L2 penalties.
How It Works:
- Balances between Ridge and Lasso.
- Encourages both coefficient shrinkage and sparsity.
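Roughly, writing l1_ratio as rho, the Elastic Net objective blends both penalties (up to scikit-learn's scaling constants):
\min_{\beta_0,\ \beta} \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \alpha \left( \rho \lVert \beta \rVert_1 + \frac{1 - \rho}{2} \lVert \beta \rVert_2^2 \right)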
Pros:
- More flexible than Ridge or Lasso alone.
- Good for high-dimensional data.
Cons:
- Additional hyperparameters to tune (ratio of L1 vs L2).
Define and train ElasticNet Regression Model
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)
Make predictions
predictions_elastic = elastic.predict(X_test)
predictions_elastic
Evaluate the model
r2_elastic = r2_score(y_test, predictions_elastic)
print(f" R^2: {r2_elastic:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_elastic, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Elastic Net Regression')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Tree-Based Models
Decision Tree Regressor
Assumptions:
- Non-parametric model, no strict assumptions about data distribution.
How It Works:
- Splits data into subsets based on feature values, creating a tree structure that ends in leaf nodes predicting average values.
Pros:
- Easy to interpret.
- Can capture non-linear relationships.
Cons:
- Prone to overfitting if not regularized.
- Unstable to small variations in data.
Define and Train Decision Tree Regressor
dt = DecisionTreeRegressor(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
Make predictions
predictions_dt = dt.predict(X_test)
predictions_dt
Evaluate the model
r2_dt = r2_score(y_test, predictions_dt)
print(f" R^2: {r2_dt:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_dt, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Decision Tree Regressor')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Instance-Based Learning
K-Nearest Neighbors (KNN) Regressor
Assumptions:
- Similar points exist in local neighborhoods.
How It Works:
- Predicts target by averaging the values of the nearest k data points.
Pros:
- Simple concept, no explicit training process.
- Can model complex, non-linear relationships.
Cons:
- Sensitive to the scale of features and choice of k.
- Computationally expensive at prediction time.
Define and Train K-Nearest Neighbors (KNN) Regressor
# Step 1: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Step 2: Train the KNN model
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
Make predictions
# Step 3: Standardize the test data
X_test_scaled = scaler.transform(X_test)
# Step 4: Make predictions
predictions_knn = knn.predict(X_test_scaled)
predictions_knn
Evaluate the model
r2_knn = r2_score(y_test, predictions_knn)
print(f" R^2: {r2_knn:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_knn, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with K-Nearest Neighbors (KNN) Regressor')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Support Vector Regressor (SVR)
Assumptions:
- Data can be separated (in a high-dimensional space) by a hyperplane with minimum error.
How It Works:
- Fits a regression line that tries to fit as many points within a certain epsilon margin. Uses kernel tricks for non-linear data.
Pros:
- Robust to outliers.
- Handles non-linearities with kernel functions.
Cons:
- Parameter tuning (C, epsilon, kernel parameters) can be tricky.
- Scalability can be an issue with large datasets.
Define and Train Support Vector Regressor (SVR)
# Step 1: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Step 2: Train the SVR model
svr = SVR(C=1.0, epsilon=0.1, kernel='rbf')
svr.fit(X_train_scaled, y_train)
Make predictions
# Step 3: Standardize the test data
X_test_scaled = scaler.transform(X_test)
# Step 4: Make predictions
predictions_svr = svr.predict(X_test_scaled)
predictions_svr
Evaluate the model
r2_svr = r2_score(y_test, predictions_svr)
print(f" R^2: {r2_svr:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_svr, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Support Vector Regressor (SVR)')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Boosting Methods
Gradient Boosting Machines (GBM)
Assumptions:
- Uses decision trees as weak learners combined in a sequential manner.
How It Works:
- Iteratively adds new models that correct errors of previous ensembles, using gradient descent on the loss function.
Pros:
- Often high accuracy.
- Can handle complex relationships and interactions.
Cons:
- More complex to tune.
- Can overfit without proper regularization.
Define and Train Gradient Boosting Machines
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbr.fit(X_train, y_train)
Make predictions
# Make predictions
predictions_gbr = gbr.predict(X_test)
predictions_gbr
Evaluate the model
r2_gbr = r2_score(y_test, predictions_gbr)
print(f" R^2: {r2_gbr:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_gbr, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Gradient Boosting Machines')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
XGBoost (Extreme Gradient Boosting)
Assumptions:
- Similar to GBM but with additional optimization and regularization.
How It Works:
- Optimized tree-based boosting algorithm with regularization for better generalization and faster training.
Pros:
- Very efficient and often achieves state-of-the-art performance.
- Handles missing values and can be parallelized easily.
Cons:
- Parameter tuning can still be complex.
- Slightly more complex implementation.
Define and Train Extreme Gradient Boosting
xgbr = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgbr.fit(X_train, y_train)
Make predictions
# Make predictions
predictions_xgbr = xgbr.predict(X_test)  # the model was trained on unscaled features, so predict on X_test
predictions_xgbr
Evaluate the model
r2_xgbr = r2_score(y_test, predictions_xgbr)
print(f" R^2: {r2_xgbr:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_xgbr, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Extreme Gradient Boosting')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
LightGBM (Light Gradient Boosting Machine)
Assumptions:
- Similar to GBM but builds trees leaf-wise and uses histogram-based methods.
How It Works:
- More memory and computationally efficient than traditional GBM.
Pros:
- Very fast and scalable.
- Excellent accuracy on large datasets.
Cons:
- Might not perform well on very small datasets.
- Still requires careful hyperparameter tuning.
Define and Train Light Gradient Boosting Machine
# Use a distinct variable name so the "lgb" module is not overwritten
lgbm = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgbm.fit(X_train, y_train)
Make predictions
# Make predictions
predictions_lgb = lgbm.predict(X_test)
predictions_lgb
Evaluate the model
r2_lgb = r2_score(y_test, predictions_lgb)
print(f" R^2: {r2_lgb:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_lgb, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Light Gradient Boosting Machine')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Bagging
Random Forest Regressor
Assumptions:
- Individual decision trees do not have strong assumptions.
How It Works:
- Builds many trees on bootstrap samples and averages their predictions.
- Uses random feature subsets for further diversity.
Pros:
- Reduces variance drastically compared to a single decision tree.
- Generally robust and accurate.
Cons:
- Less interpretable than a single decision tree.
- Can be computationally expensive.
Define and Train Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
Make predictions
# Make predictions
predictions_rf = rf.predict(X_test)
predictions_rf
Evaluate the model
r2_rf = r2_score(y_test, predictions_rf)
print(f" R^2: {r2_rf:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_rf, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Random Forest Regressor')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Stacking (Stacked Regression)
Assumptions:
- Combines multiple different models’ predictions as features for a final meta-model.
How It Works:
- Trains multiple base learners and then trains a meta-learner on their combined predictions to improve overall performance.
Pros:
- Can often improve predictive performance beyond any single model.
Cons:
- More complex workflow.
- Risk of overfitting if not done carefully.
Define and Train Stacked Regression
base_models = [
('lr', LinearRegression()),
('rf', RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)),
('gbr', GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42))
]
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge())
stack.fit(X_train, y_train)
Make predictions
# Make predictions
predictions_stack = stack.predict(X_test)
predictions_stack
Evaluate the model
r2_stack = r2_score(y_test, predictions_stack)
print(f" R^2: {r2_stack:.4f}")
Plot the predictions
plt.figure(figsize=(10, 6))
plt.plot(data.index, data.gdp, label='Actual GDP', marker='o', color='blue')
plt.plot(y_test.index, predictions_stack, label='Predicted GDP', linestyle='--', color='red', marker='o')
plt.title('GDP Prediction with Stacked Regression')
plt.xlabel('Year')
plt.ylabel('GDP')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Machine Learning Model Comparison
r2_scores = pd.DataFrame({
'Model': ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Elastic Net Regression',
'Decision Tree Regressor', 'K-Nearest Neighbors Regressor', 'Support Vector Regressor',
'Gradient Boosting Machines', 'XGBoost Regressor', 'Light Gradient Boosting Machine',
'Random Forest Regressor', 'Stacked Regression'],
'R^2 Score': [r2_lr, r2_ridge, r2_lasso, r2_elastic, r2_dt, r2_knn, r2_svr, r2_gbr, r2_xgbr, r2_lgb, r2_rf, r2_stack]
})
r2_scores = r2_scores.sort_values(by='R^2 Score', ascending=False)
r2_scores
Plot the R^2 scores
plt.figure(figsize=(12, 6))
plt.bar(r2_scores['Model'], r2_scores['R^2 Score'], color='skyblue')
plt.ylabel('R^2 Score')
plt.title('Model Comparison')
plt.xticks(rotation=90)
plt.show()
Key Points
ML algorithms such as linear regression, k-nearest neighbors, support vector machines, gradient boosting (e.g. XGBoost and LightGBM), and random forests are essential tools for prediction tasks.
Train models on a training set, evaluate them on a held-out test set, and compare them with a common metric such as the R^2 score.