MMIS692 Customer Segmentation¶
Our goal is to classify customers into segments based on input features that represent customer characteristics.
- We shall train and evaluate candidate classifiers using the labeled training samples in the file "customer_segmentation.train.csv" through 5-fold cross-validation, eliminating irrelevant input features if possible.
- Choose a classifier that performs well, find a good set of hyper-parameters for the classifier through cross-validation, train our model with the chosen hyper-parameters on the training examples, and evaluate its classification accuracy on the labeled validation samples in the file "customer_segmentation.valid.csv".
- Use our trained model to classify customers in the file "customer_segmentation.unlabeled.csv" into segments, based on their characteristics.
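The steps above can be sketched end-to-end on synthetic data. This is a minimal illustration, not the notebook's actual run: the real CSV files and features are replaced by generated stand-ins, and the classifier and hyper-parameter choice (`SVC(C=10.0)`) are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# synthetic stand-in for the three CSV files
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

cv_acc = cross_val_score(SVC(), X_train, y_train, cv=5).mean()  # step 1: 5-fold CV on training data
model = SVC(C=10.0).fit(X_train, y_train)                       # step 2: train with chosen hyper-parameters
valid_acc = accuracy_score(y_valid, model.predict(X_valid))     # step 2: evaluate on validation data
pred = model.predict(X_valid[:5])                               # step 3: classify new (unlabeled) samples
print(round(cv_acc, 3), round(valid_acc, 3), pred)
```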
Mount Drive¶
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Import libraries¶
A list of Scikit-Learn supervised learning classifiers can be found at https://scikit-learn.org/stable/supervised_learning.html
Use any classifier that you are familiar with.
import pandas as pd # for data handling
import matplotlib.pyplot as plt # for plotting
from time import time # to record time for training and cross-validation
# scikit-learn classifiers (import other classifiers if you want to)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix # to evaluate models
from sklearn.model_selection import cross_val_score, GridSearchCV # for cross-validation and tuning hyper-parameters
import warnings
warnings.filterwarnings("ignore") # ignore warnings
Get data¶
For this task, we are going to use data from 3 CSV files:
- 'customer_segmentation.train.csv'
- 'customer_segmentation.valid.csv'
- 'customer_segmentation.unlabeled.csv'
! unzip '/content/drive/MyDrive/NSU - MS - Data Analytics and AI/MMIS 0692 - Data Analytics and AI Project/Task 3/data.MMIS692.Fall2025.zip'
train = pd.read_csv('customer_segmentation.train.csv')
valid = pd.read_csv('customer_segmentation.valid.csv')
unlabeled = pd.read_csv('customer_segmentation.unlabeled.csv')
! rm *.csv
Archive: /content/drive/MyDrive/NSU - MS - Data Analytics and AI/MMIS 0692 - Data Analytics and AI Project/Task 3/data.MMIS692.Fall2025.zip
  inflating: quality_control.new_batches.csv
  inflating: quality_control.measurements.csv
  inflating: quality_control.defective.csv
  inflating: production_planning.resource.csv
  inflating: production_planning.product.csv
  inflating: customer_segmentation.unlabeled.csv
  inflating: customer_segmentation.valid.csv
  inflating: customer_segmentation.train.csv
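Before training, it helps to sanity-check the loaded frames (shape, missing values, class balance). A minimal sketch, using a small synthetic stand-in for the real `train` frame since the column count and sizes here are illustrative only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in for the training frame: label column 'y' plus features x1..x5
train = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f'x{i}' for i in range(1, 6)])
train.insert(0, 'y', rng.integers(0, 3, size=100))

print(train.shape)               # (rows, columns)
print(train.isna().sum().sum())  # total count of missing values
print(train.y.value_counts())    # class balance of the label column
```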
Specify classifiers¶
We shall use the following sklearn classifiers with default hyper-parameters.
You can use any set of classifiers that you want.
CLF = {} # dictionary of classifiers
CLF['GNB'] = GaussianNB()
CLF['DT'] = DecisionTreeClassifier()
CLF['RF'] = RandomForestClassifier()
CLF['ET'] = ExtraTreesClassifier()
CLF['AB'] = AdaBoostClassifier()
CLF['SGD'] = SGDClassifier()
CLF['Ridge'] = RidgeClassifier()
CLF['LR'] = LogisticRegression(max_iter=1000)
CLF['Lin_SVC'] = LinearSVC()
CLF['SVC'] = SVC()
CLF['KNN'] = KNeighborsClassifier()
CLF['MLP'] = MLPClassifier()
print('Classifiers:')
for c in CLF:
    print(f'{c} : {CLF[c].__class__.__name__}')
Classifiers:
GNB : GaussianNB
DT : DecisionTreeClassifier
RF : RandomForestClassifier
ET : ExtraTreesClassifier
AB : AdaBoostClassifier
SGD : SGDClassifier
Ridge : RidgeClassifier
LR : LogisticRegression
Lin_SVC : LinearSVC
SVC : SVC
KNN : KNeighborsClassifier
MLP : MLPClassifier
Evaluate classifiers¶
We shall evaluate the classifiers on all available input features using 5-fold cross-validation on the training data alone.
features = list(train)[1:] # input features
res = [] # list with results
for c in CLF: # for each classifier
    model = CLF[c] # classifier object with default hyper-parameters
    st = time() # start time for 5-fold cross-validation
    score = cross_val_score(model, train[features], train.y).mean() # mean cross-validation accuracy
    t = time() - st # time for 5-fold cross-validation
    print(c, round(score,4), round(t,2)) # show results for classifier
    res.append([c, score, t]) # append results for classifier
pd.DataFrame(res, columns=['model', 'score', 'time']).round(4) # show results as dataframe
GNB 0.8642 0.12
DT 0.8796 14.95
RF 0.9357 101.6
ET 0.939 14.64
AB 0.8738 24.54
SGD 0.862 3.22
Ridge 0.873 0.14
LR 0.8814 1.47
Lin_SVC 0.879 1.38
SVC 0.9516 26.11
KNN 0.9355 2.13
MLP 0.8998 115.68
| model | score | time | |
|---|---|---|---|
| 0 | GNB | 0.8642 | 0.1245 |
| 1 | DT | 0.8796 | 14.9515 |
| 2 | RF | 0.9357 | 101.5999 |
| 3 | ET | 0.9390 | 14.6360 |
| 4 | AB | 0.8738 | 24.5439 |
| 5 | SGD | 0.8620 | 3.2175 |
| 6 | Ridge | 0.8730 | 0.1357 |
| 7 | LR | 0.8814 | 1.4711 |
| 8 | Lin_SVC | 0.8790 | 1.3781 |
| 9 | SVC | 0.9516 | 26.1142 |
| 10 | KNN | 0.9355 | 2.1318 |
| 11 | MLP | 0.8998 | 115.6761 |
Eliminate irrelevant features¶
We shall use the 'feature_importances_' attribute of a trained ExtraTreesClassifier model to estimate the importance of each feature, sort the features based on importance, and check if some of the features seem irrelevant for this classification task.
ET = ExtraTreesClassifier().fit(train[features], train.y) # Train ExtraTreesClassifier
fi = sorted([(imp, f) for imp, f in zip(ET.feature_importances_, features)], reverse=True) # features sorted in descending order of importance
k = 20 # consider the k most important features (change as desired)
plt.figure(figsize=(10, 5)) # size of figure to be displayed
_ = plt.bar([v[1] for v in fi][:k], [v[0] for v in fi][:k]) # plot importance
pd.DataFrame(fi[:k], columns=['importance', 'feature']).round(3).T # show importance
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| importance | 0.204 | 0.131 | 0.109 | 0.068 | 0.056 | 0.051 | 0.048 | 0.047 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 |
| feature | x46 | x6 | x22 | x13 | x40 | x27 | x24 | x42 | x21 | x3 | x44 | x20 | x12 | x35 | x11 | x36 | x9 | x5 | x1 | x34 |
We keep the k most important features as relevant_features and train subsequent models using only these features.
k = 8
relevant_features = [v[1] for v in fi][:k]
print("Relevant features:", ', '.join(relevant_features))
Relevant features: x46, x6, x22, x13, x40, x27, x24, x42
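The choice of k can be checked by sweeping it and watching how cross-validation accuracy changes as less important features are added. A sketch on synthetic data (the sample sizes, feature counts, and k values below are illustrative, not the notebook's real data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# synthetic data: only 5 of the 20 features are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
imp = ExtraTreesClassifier(random_state=0).fit(X, y).feature_importances_
order = np.argsort(imp)[::-1]  # feature indices, most important first

for k in (2, 5, 10, 20):       # accuracy as a function of the k features kept
    score = cross_val_score(ExtraTreesClassifier(random_state=0),
                            X[:, order[:k]], y, cv=5).mean()
    print(k, round(score, 3))
```

Accuracy typically plateaus once all informative features are included, which is the signal used to pick k.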
Evaluate models using relevant features¶
res = [] # list with results
for c in CLF: # for each classifier
    model = CLF[c] # classifier object with default hyper-parameters
    st = time() # start time for 5-fold cross-validation
    score = cross_val_score(model, train[relevant_features], train.y).mean() # mean cross-validation accuracy
    t = time() - st # time for 5-fold cross-validation
    print(c, round(score,4), round(t,2)) # show results for classifier
    res.append([c, score, t]) # append results for classifier
res_df = pd.DataFrame(res, columns=['model', 'mean accuracy', 'time']).round(4) # show results as dataframe
res_df.to_csv('cross_validation_results.csv', index=False)
res_df
GNB 0.8656 0.04
DT 0.8964 1.61
RF 0.9511 25.23
ET 0.9562 5.3
AB 0.8748 5.99
SGD 0.8673 0.95
Ridge 0.8736 0.04
LR 0.8832 0.39
Lin_SVC 0.8801 0.32
SVC 0.9605 12.03
KNN 0.9598 0.83
MLP 0.9623 53.99
| model | mean accuracy | time | |
|---|---|---|---|
| 0 | GNB | 0.8656 | 0.0435 |
| 1 | DT | 0.8964 | 1.6134 |
| 2 | RF | 0.9511 | 25.2330 |
| 3 | ET | 0.9562 | 5.2962 |
| 4 | AB | 0.8748 | 5.9941 |
| 5 | SGD | 0.8673 | 0.9531 |
| 6 | Ridge | 0.8736 | 0.0436 |
| 7 | LR | 0.8832 | 0.3880 |
| 8 | Lin_SVC | 0.8801 | 0.3187 |
| 9 | SVC | 0.9605 | 12.0255 |
| 10 | KNN | 0.9598 | 0.8328 |
| 11 | MLP | 0.9623 | 53.9871 |
Choose good model¶
Based on cross-validation results we shall create a short-list of the best performing models and then use Grid Search to find a good set of hyper-parameters for these models through cross-validation.
para = {'C':[1.0, 5.0, 10.0, 20.0, 50.0, 100.0]}
clf = GridSearchCV(SVC(), para, scoring='accuracy',
n_jobs=-1, verbose=1) # grid search model
print(clf) # show model
print()
print("Tuning hyper-parameters ... " )
clf.fit(train[relevant_features], train.y) # tune using 5-fold cross-validation
print()
print("Accuracy: mean +/- 2*standard_dev") # show results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print("Best parameters:", clf.best_params_)
GridSearchCV(estimator=SVC(), n_jobs=-1,
param_grid={'C': [1.0, 5.0, 10.0, 20.0, 50.0, 100.0]},
scoring='accuracy', verbose=1)
Tuning hyper-parameters ...
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Accuracy: mean +/- 2*standard_dev
0.961 (+/-0.003) for {'C': 1.0}
0.963 (+/-0.002) for {'C': 5.0}
0.963 (+/-0.004) for {'C': 10.0}
0.963 (+/-0.004) for {'C': 20.0}
0.963 (+/-0.005) for {'C': 50.0}
0.961 (+/-0.004) for {'C': 100.0}
Best parameters: {'C': 20.0}
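The grid above tunes only C; the RBF kernel used by SVC also has a gamma parameter, and the two could be tuned jointly with a two-parameter grid. A sketch on synthetic data (the grid values and data shape are illustrative assumptions, not the notebook's actual search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
para = {'C': [1.0, 10.0, 20.0], 'gamma': ['scale', 0.01, 0.1]}  # joint grid over C and gamma
clf = GridSearchCV(SVC(), para, scoring='accuracy', cv=5, n_jobs=-1).fit(X, y)
print(clf.best_params_, round(clf.best_score_, 3))  # best combination found
```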
para = {'n_neighbors': [3, 5, 7, 9, 11, 13]}
clf = GridSearchCV(KNeighborsClassifier(), para, scoring='accuracy',
n_jobs=-1, verbose=1) # grid search model
print(clf) # show model
print()
print("Tuning hyper-parameters ... " )
clf.fit(train[relevant_features], train.y) # tune using 5-fold cross-validation
print()
print("Accuracy: mean +/- 2*standard_dev") # show results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
print("Best parameters:", clf.best_params_)
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'n_neighbors': [3, 5, 7, 9, 11, 13]},
scoring='accuracy', verbose=1)
Tuning hyper-parameters ...
Fitting 5 folds for each of 6 candidates, totalling 30 fits
Accuracy: mean +/- 2*standard_dev
0.958 (+/-0.005) for {'n_neighbors': 3}
0.960 (+/-0.004) for {'n_neighbors': 5}
0.960 (+/-0.006) for {'n_neighbors': 7}
0.960 (+/-0.005) for {'n_neighbors': 9}
0.960 (+/-0.004) for {'n_neighbors': 11}
0.960 (+/-0.003) for {'n_neighbors': 13}
Best parameters: {'n_neighbors': 7}
Train the chosen model with desired hyper-parameter values and evaluate it.
model = SVC(C=20.0)
print('Chosen classifier:')
print(model)
model.fit(train[relevant_features], train.y)
pred = model.predict(valid[relevant_features]) # predict labels
acc = accuracy_score(valid.y, pred)
print(f'Validation accuracy with chosen classifier = {acc: .4f}')
print()
print("Classification report with chosen classifier:")
print(classification_report(valid.y, pred, digits=4))
print()
print("Precision for class = %d = %4.3f"
%(0, precision_score(valid.y, pred, average=None)[0]))
print("Recall for class = %d = %4.3f"
%(2, recall_score(valid.y, pred, average=None)[2]))
print('\nConfusion matrix')
cm = pd.DataFrame(confusion_matrix(valid.y, pred))
cm.to_csv("confusion_matrix.csv")
cm
Chosen classifier:
SVC(C=20.0)
Validation accuracy with chosen classifier = 0.9620
Classification report with chosen classifier:
precision recall f1-score support
0.0 0.9630 0.9636 0.9633 1649
1.0 0.9674 0.9583 0.9628 1701
2.0 0.9556 0.9642 0.9599 1650
accuracy 0.9620 5000
macro avg 0.9620 0.9620 0.9620 5000
weighted avg 0.9620 0.9620 0.9620 5000
Precision for class = 0 = 0.963
Recall for class = 2 = 0.964
Confusion matrix
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 1589 | 26 | 34 |
| 1 | 31 | 1630 | 40 |
| 2 | 30 | 29 | 1591 |
model = KNeighborsClassifier(n_neighbors=7)
print('Chosen classifier:')
print(model)
model.fit(train[relevant_features], train.y)
pred = model.predict(valid[relevant_features]) # predict labels
acc = accuracy_score(valid.y, pred)
print(f'Validation accuracy with chosen classifier = {acc: .4f}')
print()
print("Classification report with chosen classifier:")
print(classification_report(valid.y, pred, digits=3))
print()
print("Precision for class = %d = %4.3f"
%(0, precision_score(valid.y, pred, average=None)[0]))
print("Recall for class = %d = %4.3f"
%(2, recall_score(valid.y, pred, average=None)[2]))
print('\nConfusion matrix')
cm = pd.DataFrame(confusion_matrix(valid.y, pred))
cm.to_csv("confusion_matrix.csv")
cm
Chosen classifier:
KNeighborsClassifier(n_neighbors=7)
Validation accuracy with chosen classifier = 0.9604
Classification report with chosen classifier:
precision recall f1-score support
0.0 0.964 0.961 0.963 1649
1.0 0.963 0.961 0.962 1701
2.0 0.954 0.959 0.957 1650
accuracy 0.960 5000
macro avg 0.960 0.960 0.960 5000
weighted avg 0.960 0.960 0.960 5000
Precision for class = 0 = 0.964
Recall for class = 2 = 0.959
Confusion matrix
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 1585 | 25 | 39 |
| 1 | 30 | 1634 | 37 |
| 2 | 29 | 38 | 1583 |
Predict unlabeled samples¶
unlabeled.head()
| ID | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | ... | x41 | x42 | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -1.302 | 0.634 | 0.304 | -0.341 | 0.487 | 1.896 | 1.244 | 2.305 | -1.864 | ... | 1.355 | -2.872 | -1.387 | -0.814 | -1.198 | -1.916 | 0.857 | -0.885 | -1.857 | -1.092 |
| 1 | 2 | -1.154 | -1.939 | 1.744 | 1.248 | 0.177 | -0.336 | -0.356 | -0.019 | -2.898 | ... | 0.665 | 1.654 | 0.169 | -0.901 | -0.406 | 0.374 | -0.040 | 1.520 | 0.910 | 0.274 |
| 2 | 3 | 0.594 | 0.162 | 0.821 | 1.499 | -0.073 | 4.519 | -0.298 | 0.858 | -0.624 | ... | -1.359 | 1.224 | -0.079 | -0.201 | -1.355 | -1.885 | 1.018 | 0.780 | -1.106 | -0.837 |
| 3 | 4 | -0.313 | 0.393 | 0.222 | 0.543 | -0.412 | -0.211 | -1.823 | 1.289 | -0.149 | ... | -0.256 | 0.321 | -0.546 | -0.321 | 1.048 | -0.772 | -1.320 | -1.063 | -1.464 | 0.931 |
| 4 | 5 | -1.811 | 1.041 | 0.417 | -1.184 | -0.535 | 3.124 | -0.614 | 0.043 | -2.141 | ... | -0.281 | 0.269 | -0.565 | 0.895 | -0.043 | -3.652 | 0.591 | -0.833 | 1.654 | 1.455 |
5 rows × 51 columns
predTest = model.predict(unlabeled[relevant_features]) # predict labels for unlabeled examples
new = pd.DataFrame() # results data frame
new['ID'] = unlabeled.ID
new['predicted'] = predTest # predicted values
new.to_csv("unlabeled.results.csv", index=False) # save results
new
| ID | predicted | |
|---|---|---|
| 0 | 1 | 0.0 |
| 1 | 2 | 2.0 |
| 2 | 3 | 0.0 |
| 3 | 4 | 0.0 |
| 4 | 5 | 0.0 |
| 5 | 6 | 0.0 |
| 6 | 7 | 0.0 |
| 7 | 8 | 0.0 |
| 8 | 9 | 0.0 |
| 9 | 10 | 0.0 |
| 10 | 11 | 0.0 |
| 11 | 12 | 0.0 |
| 12 | 13 | 0.0 |
| 13 | 14 | 0.0 |
| 14 | 15 | 0.0 |
| 15 | 16 | 0.0 |
| 16 | 17 | 0.0 |
| 17 | 18 | 0.0 |
| 18 | 19 | 0.0 |
| 19 | 20 | 0.0 |
| 20 | 21 | 0.0 |
| 21 | 22 | 1.0 |
| 22 | 23 | 1.0 |
| 23 | 24 | 1.0 |
| 24 | 25 | 1.0 |
| 25 | 26 | 1.0 |
| 26 | 27 | 1.0 |
| 27 | 28 | 1.0 |
| 28 | 29 | 1.0 |
| 29 | 30 | 1.0 |
| 30 | 31 | 1.0 |
| 31 | 32 | 1.0 |
| 32 | 33 | 1.0 |
| 33 | 34 | 1.0 |
| 34 | 35 | 1.0 |
| 35 | 36 | 1.0 |
| 36 | 37 | 1.0 |
| 37 | 38 | 1.0 |
| 38 | 39 | 1.0 |
| 39 | 40 | 1.0 |
| 40 | 41 | 1.0 |
| 41 | 42 | 2.0 |
| 42 | 43 | 2.0 |
| 43 | 44 | 2.0 |
| 44 | 45 | 2.0 |
| 45 | 46 | 2.0 |
| 46 | 47 | 2.0 |
| 47 | 48 | 2.0 |
| 48 | 49 | 2.0 |
| 49 | 50 | 2.0 |
| 50 | 51 | 2.0 |
| 51 | 52 | 2.0 |
| 52 | 53 | 2.0 |
| 53 | 54 | 2.0 |
| 54 | 55 | 2.0 |
| 55 | 56 | 2.0 |
| 56 | 57 | 1.0 |
| 57 | 58 | 2.0 |
| 58 | 59 | 2.0 |
| 59 | 60 | 0.0 |
new.predicted.value_counts()
| count | |
|---|---|
| predicted | |
| 0.0 | 22 |
| 1.0 | 20 |
| 2.0 | 18 |
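To reuse the trained model later without refitting, it can be serialized to disk. A minimal sketch using joblib on synthetic stand-in data (the file name `svc_model.joblib` is an arbitrary choice):

```python
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = SVC(C=20.0).fit(X, y)        # stand-in for the chosen, trained model

dump(model, 'svc_model.joblib')      # save the fitted model to disk
restored = load('svc_model.joblib')  # reload it in a later session
print((restored.predict(X) == model.predict(X)).all())  # prints True
```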