Machine Learning & Data Science
Machine Learning Workflow Overview
Typical Project Flow
1. Data Preparation
├── Train-test split (stratified to preserve class proportions)
├── Handle categorical variables (one-hot encoding)
├── Scale numeric features (min-max or standard scaling)
├── Engineer domain-specific features
└── Reduce dimensionality if needed (PCA)
2. Baseline Establishment
└── Calculate naive model performance (majority class prediction)
3. Model Development
├── Start simple (logistic regression, decision tree)
└── Progress to ensemble methods (random forest, gradient boosting)
4. Evaluation & Validation
├── Accuracy, precision, recall, F1, ROC AUC
├── Cross-validate for robust estimates
└── Analyze feature importance
5. Optimization
└── Grid search or randomized search for hyperparameters
Train-Test Split with Stratification
Problem
Random splitting can leave the training and test sets with different proportions of the target classes. Without stratification, performance metrics become unreliable and the model may fail to learn minority-class patterns.
Implementation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    features,
    targets,
    test_size=0.2,
    random_state=42,
    stratify=targets  # Maintain class distribution across both splits
)
# Verify
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
One-Hot Encoding for Categorical Variables
Problem
Converting categories to arbitrary numeric codes implies false ordinal relationships. The encoder must also be fit on the training data only and reused to transform the test data, preventing data leakage.
Implementation
from sklearn.preprocessing import OneHotEncoder
# Fit on training data only
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # sparse_output requires scikit-learn >= 1.2; 'ignore' encodes unseen categories as all zeros
encoder.fit(train_df[categorical_columns])
# Transform both sets
train_encoded = encoder.transform(train_df[categorical_columns])
test_encoded = encoder.transform(test_df[categorical_columns])
feature_names = encoder.get_feature_names_out(categorical_columns)
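If a DataFrame is preferred downstream, the encoded arrays can be wrapped with the generated feature names (a minimal sketch reusing the placeholder names above):
import pandas as pd
train_encoded_df = pd.DataFrame(train_encoded, columns=feature_names, index=train_df.index)
test_encoded_df = pd.DataFrame(test_encoded, columns=feature_names, index=test_df.index)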
Feature Scaling (Min-Max Normalization)
Problem
Features with vastly different scales cause distance-based algorithms to be dominated by high-magnitude features. The scaler must be fit on training data only and then applied to both sets to prevent data leakage.
Implementation
from sklearn.preprocessing import MinMaxScaler
# Fit on training data only
scaler = MinMaxScaler()
scaler.fit(train_df[numeric_columns])
# Transform both sets
train_df[numeric_columns] = scaler.transform(train_df[numeric_columns])
test_df[numeric_columns] = scaler.transform(test_df[numeric_columns])
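The workflow overview also lists standard scaling; StandardScaler is a drop-in alternative that follows the same fit-on-train pattern (a sketch with the same placeholder names):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_df[numeric_columns])  # Learn mean and std from training data only
train_df[numeric_columns] = scaler.transform(train_df[numeric_columns])
test_df[numeric_columns] = scaler.transform(test_df[numeric_columns])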
Dimensionality Reduction with PCA
Problem
High-dimensional datasets suffer from the curse of dimensionality: overfitting, long training times, and difficulty visualizing data. Many features are correlated or redundant.
Implementation
from sklearn.decomposition import PCA
# Fit on training data only
pca = PCA(n_components=2)
pca.fit(train_df)  # Assumes all-numeric features, already scaled; PCA is scale-sensitive
train_pca = pca.transform(train_df)
test_pca = pca.transform(test_df)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.2%}")
# Choosing n_components: plot cumulative explained variance,
# pick elbow point or where cumsum >= 0.95
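That heuristic can be automated; sklearn also accepts a float n_components to keep enough components for a target variance (a sketch under the same assumptions):
import numpy as np
pca_full = PCA().fit(train_df)  # Fit all components to inspect the variance spectrum
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components_95}")
# Equivalent shortcut: PCA(n_components=0.95) retains components up to 95% variance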
Baseline Model Performance Measurement
Problem
Without establishing baseline performance, it's impossible to know if sophisticated models provide meaningful improvements. A model at 92% accuracy is unimpressive if predicting the majority class achieves 91%.
Implementation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
def calculate_baseline_metrics(train_targets, test_targets, prediction_value=1):
    train_predictions = np.full(len(train_targets), prediction_value)
    test_predictions = np.full(len(test_targets), prediction_value)
    return {
        "train": {
            "accuracy": round(accuracy_score(train_targets, train_predictions), 4),
            "recall": round(recall_score(train_targets, train_predictions, zero_division=0), 4),
            "precision": round(precision_score(train_targets, train_predictions, zero_division=0), 4),
            "fscore": round(f1_score(train_targets, train_predictions, zero_division=0), 4),
        },
        "test": {
            "accuracy": round(accuracy_score(test_targets, test_predictions), 4),
            "recall": round(recall_score(test_targets, test_predictions, zero_division=0), 4),
            "precision": round(precision_score(test_targets, test_predictions, zero_division=0), 4),
            "fscore": round(f1_score(test_targets, test_predictions, zero_division=0), 4),
        },
    }
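Hypothetical usage, assuming y_train/y_test hold binary labels; pass the majority class found in the training targets:
majority_class = y_train.mode().iloc[0]  # Most frequent class in the training targets
baseline = calculate_baseline_metrics(y_train, y_test, prediction_value=majority_class)
print(baseline["test"])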
K-Means Clustering and Optimal Cluster Selection
Problem
Determining the optimal number of clusters is non-trivial: too few clusters oversimplify patterns, while too many create artificial divisions. Manual inspection of high-dimensional data cannot reveal cluster structures.
Implementation
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
# Find optimal K
kmeans = KMeans(init='k-means++', n_init=10, max_iter=300, random_state=42)
visualizer = KElbowVisualizer(kmeans, k=(2, 10), metric='distortion')
visualizer.fit(train_features_scaled)  # Select K using the training split only
visualizer.show()
optimal_k = int(visualizer.elbow_value_)
# Train final model
final_model = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
train_labels = final_model.fit_predict(train_features_scaled).tolist()
test_labels = final_model.predict(test_features_scaled).tolist()
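As a complementary check not shown above, the silhouette score (values near 1 indicate well-separated clusters) is available directly in sklearn:
from sklearn.metrics import silhouette_score
score = silhouette_score(train_features_scaled, train_labels)
print(f"Silhouette score for k={optimal_k}: {score:.3f}")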
Classification Models: Logistic Regression, Decision Tree, Ensemble Methods
Standard Metrics Function
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix)
def get_metrics(y_true, y_pred, y_proba):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": round(accuracy_score(y_true, y_pred), 4),
        "precision": round(precision_score(y_true, y_pred, zero_division=0), 4),
        "recall": round(recall_score(y_true, y_pred, zero_division=0), 4),
        "fscore": round(f1_score(y_true, y_pred, zero_division=0), 4),
        "fpr": round(fp / (fp + tn) if (fp + tn) > 0 else 0, 4),
        "fnr": round(fn / (fn + tp) if (fn + tp) > 0 else 0, 4),
        "roc_auc": round(roc_auc_score(y_true, y_proba), 4),
    }
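Hypothetical usage with any fitted sklearn classifier; predict_proba returns one column per class, so the positive-class column is selected:
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of the positive class
print(get_metrics(y_test, y_pred, y_proba))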
Logistic Regression:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
model = LogisticRegression(penalty='l1', fit_intercept=False,
solver='liblinear', random_state=42, max_iter=1000)
model.fit(X_train, y_train)
# RFE for top features
rfe = RFE(estimator=LogisticRegression(penalty='l1', solver='liblinear'), n_features_to_select=10)
rfe.fit(X_train, y_train)
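The retained features can be read back off the fitted selector (assuming X_train is a DataFrame):
selected_features = X_train.columns[rfe.support_]  # Boolean mask of kept columns
print(list(selected_features))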
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=10, min_samples_split=20,
min_samples_leaf=10, random_state=42)
model.fit(X_train, y_train)
# Feature importance: model.feature_importances_ (use this, not RFE)
Random Forest & Gradient Boosting:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
rf = RandomForestClassifier(n_estimators=200, max_depth=20,
random_state=42, n_jobs=-1)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
max_depth=5, random_state=42)
# Both: model.feature_importances_ for rankings
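A sketch for turning those importances into a ranked list once a model is fitted, assuming X_train is a DataFrame:
import pandas as pd
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))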
Grid Search for Hyperparameter Optimization
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # Fix the seed on the estimator, not in the grid
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(f"Best CV ROC AUC: {grid_search.best_score_:.4f}")
Overfitting detection: a train-test gap greater than 0.1 on any metric signals overfitting. Apply regularization, reduce max_depth, or increase min_samples_split/min_samples_leaf.
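With return_train_score=True set above, the gap heuristic can be checked directly on the search results (a minimal sketch):
best = grid_search.best_index_
train_score = grid_search.cv_results_['mean_train_score'][best]
cv_score = grid_search.cv_results_['mean_test_score'][best]
if train_score - cv_score > 0.1:
    print(f"Likely overfit: train {train_score:.4f} vs CV {cv_score:.4f}")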
Text-Based Feature Extraction
URL Feature Extraction
from urllib.parse import urlparse
import re, math
import pandas as pd  # Used below to build the feature DataFrame
def extract_url_features(url):
    f = {}
    f['url_length'] = len(url)
    f['num_dots'] = url.count('.')
    f['num_hyphens'] = url.count('-')
    f['num_at'] = url.count('@')
    f['num_digits'] = sum(c.isdigit() for c in url)
    try:
        p = urlparse(url)
        f['has_https'] = int(p.scheme == 'https')
        f['domain_length'] = len(p.netloc)
        f['path_length'] = len(p.path)
        f['num_subdomains'] = max(len(p.netloc.split('.')) - 2, 0)
        f['has_ip'] = int(bool(re.search(r'\b\d{1,3}(\.\d{1,3}){3}\b', p.netloc)))
    except Exception:
        f.update({'has_https': 0, 'domain_length': 0, 'path_length': 0,
                  'num_subdomains': 0, 'has_ip': 0})
    suspicious = ['login', 'signin', 'verify', 'update', 'secure', 'bank', 'password', 'admin']
    f['suspicious_word_count'] = sum(1 for w in suspicious if w in url.lower())
    n = len(url) or 1
    f['digit_ratio'] = f['num_digits'] / n
    # Shannon entropy of the character distribution
    counts = {}
    for c in url:
        counts[c] = counts.get(c, 0) + 1
    f['entropy'] = -sum((v / n) * math.log2(v / n) for v in counts.values())
    return f
# Apply to DataFrame
features_df = pd.DataFrame(df['url'].apply(extract_url_features).tolist())
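A quick sanity check on a single made-up URL (exact values depend on the heuristics above):
sample = extract_url_features('https://secure-login.example.com/verify?user=1')
print(sample['suspicious_word_count'], round(sample['entropy'], 3))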