MMIS692 Customer Segmentation¶

Our goal is to classify customers into segments based on input features that represent customer characteristics.

  1. We shall train and evaluate candidate classifiers using the labeled training samples in the file "customer_segmentation.train.csv" through 5-fold cross-validation, eliminating irrelevant input features if possible.
  2. We shall choose a classifier that performs well, find a good set of hyper-parameters for it through cross-validation, train our model with the chosen hyper-parameters on the training examples, and evaluate its classification accuracy on the labeled validation samples in the file "customer_segmentation.valid.csv".
  3. We shall then use our trained model to classify the customers in the file "customer_segmentation.unlabeled.csv" into segments, based on their characteristics.

Mount Drive¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Import libraries¶

A list of Scikit-Learn's supervised learning classifiers is available at https://scikit-learn.org/stable/supervised_learning.html

Use any classifier that you are familiar with.

In [ ]:
import pandas as pd # for data handling
import matplotlib.pyplot as plt # for plotting
from time import time # to record time for training and cross-validation

# scikit-learn classifiers (import other classifiers if you want to)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix # to evaluate models

from sklearn.model_selection import cross_val_score, GridSearchCV # for cross-validation and tuning hyper-parameters

import warnings
warnings.filterwarnings("ignore") # ignore warnings

Get data¶

For this task, we are going to use data from 3 CSV files:

  • 'customer_segmentation.train.csv'
  • 'customer_segmentation.valid.csv'
  • 'customer_segmentation.unlabeled.csv'
In [ ]:
! unzip '/content/drive/MyDrive/NSU - MS - Data Analytics and AI/MMIS 0692 - Data Analytics and AI Project/Task 3/data.MMIS692.Fall2025.zip'
train = pd.read_csv('customer_segmentation.train.csv')
valid = pd.read_csv('customer_segmentation.valid.csv')
unlabeled = pd.read_csv('customer_segmentation.unlabeled.csv')
! rm *.csv
Archive:  /content/drive/MyDrive/NSU - MS - Data Analytics and AI/MMIS 0692 - Data Analytics and AI Project/Task 3/data.MMIS692.Fall2025.zip
  inflating: quality_control.new_batches.csv  
  inflating: quality_control.measurements.csv  
  inflating: quality_control.defective.csv  
  inflating: production_planning.resource.csv  
  inflating: production_planning.product.csv  
  inflating: customer_segmentation.unlabeled.csv  
  inflating: customer_segmentation.valid.csv  
  inflating: customer_segmentation.train.csv  

Specify classifiers¶

We shall use the following sklearn classifiers with default hyper-parameters.

You can use any set of classifiers that you want.

In [ ]:
CLF = {} # dictionary of classifiers
CLF['GNB'] = GaussianNB()
CLF['DT'] = DecisionTreeClassifier()
CLF['RF'] = RandomForestClassifier()
CLF['ET'] = ExtraTreesClassifier()
CLF['AB'] =  AdaBoostClassifier()
CLF['SGD'] = SGDClassifier()
CLF['Ridge'] = RidgeClassifier()
CLF['LR'] = LogisticRegression(max_iter=1000)
CLF['Lin_SVC'] = LinearSVC()
CLF['SVC'] = SVC()
CLF['KNN'] = KNeighborsClassifier()
CLF['MLP'] = MLPClassifier()

print('Classifiers:')
for c in CLF:
    print(f'{c} : {CLF[c].__class__.__name__}')
Classifiers:
GNB : GaussianNB
DT : DecisionTreeClassifier
RF : RandomForestClassifier
ET : ExtraTreesClassifier
AB : AdaBoostClassifier
SGD : SGDClassifier
Ridge : RidgeClassifier
LR : LogisticRegression
Lin_SVC : LinearSVC
SVC : SVC
KNN : KNeighborsClassifier
MLP : MLPClassifier

Evaluate classifiers¶

We shall train the classifiers on all available input features using 5-fold cross-validation on the training data alone.

In [ ]:
features = list(train)[1:] # input features (all columns except the first)
res = [] # list with results
for c in CLF: # for each classifier
    model = CLF[c] # classifier object with default hyper-parameters
    st = time() # start time for 5-fold cross-validation
    score = cross_val_score(model, train[features], train.y).mean() # mean cross-validation accuracy
    t = time() - st # time for 5-fold cross-validation
    print(c, round(score,4), round(t,2)) # show results for classifier
    res.append([c, score, t]) # append results for classifier
pd.DataFrame(res, columns=['model', 'score', 'time']).round(4) # show results as dataframe
GNB 0.8642 0.12
DT 0.8796 14.95
RF 0.9357 101.6
ET 0.939 14.64
AB 0.8738 24.54
SGD 0.862 3.22
Ridge 0.873 0.14
LR 0.8814 1.47
Lin_SVC 0.879 1.38
SVC 0.9516 26.11
KNN 0.9355 2.13
MLP 0.8998 115.68
Out[ ]:
model score time
0 GNB 0.8642 0.1245
1 DT 0.8796 14.9515
2 RF 0.9357 101.5999
3 ET 0.9390 14.6360
4 AB 0.8738 24.5439
5 SGD 0.8620 3.2175
6 Ridge 0.8730 0.1357
7 LR 0.8814 1.4711
8 Lin_SVC 0.8790 1.3781
9 SVC 0.9516 26.1142
10 KNN 0.9355 2.1318
11 MLP 0.8998 115.6761
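For reference, the call pattern in the loop above can be reproduced on synthetic data (the `make_classification` dataset and its parameters are purely illustrative, not the course data): `cross_val_score` splits the samples into 5 stratified folds, trains on 4 folds, scores on the held-out fold, and returns all 5 accuracy scores.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Illustrative 3-class dataset, standing in for the training CSV
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# cv=5 is the default for classifiers, so this yields 5 fold accuracies
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y)
print(len(scores), round(scores.mean(), 4))
```

Averaging the fold scores, as the loop does with `.mean()`, gives a single cross-validation accuracy per classifier.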

Eliminate irrelevant features¶

We shall use the 'feature_importances_' attribute of a trained ExtraTreesClassifier model to estimate the importance of each feature, sort the features in descending order of importance, and check whether some features seem irrelevant for this classification task.

In [ ]:
ET = ExtraTreesClassifier().fit(train[features], train.y) # Train ExtraTreesClassifier
fi = sorted([(imp, f) for imp, f in zip(ET.feature_importances_, features)], reverse=True) # features sorted in descending order of importance
k = 20 # consider the k most important features (change as desired)
plt.figure(figsize=(10, 5)) # size of figure to be displayed
_ = plt.bar([v[1] for v in fi][:k], [v[0] for v in fi][:k]) # plot importance
pd.DataFrame(fi[:k], columns=['importance', 'feature']).round(3).T # show importance
Out[ ]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
importance 0.204 0.131 0.109 0.068 0.056 0.051 0.048 0.047 0.007 0.007 0.007 0.007 0.007 0.007 0.007 0.007 0.007 0.007 0.007 0.007
feature x46 x6 x22 x13 x40 x27 x24 x42 x21 x3 x44 x20 x12 x35 x11 x36 x9 x5 x1 x34

The importances drop sharply after the eighth feature (from about 0.047 to 0.007), so we keep only the top features as relevant_features and use them to train all subsequent models.

In [ ]:
k = 8
relevant_features = [v[1] for v in fi][:k]
print("Relevant features:", ', '.join(relevant_features))
Relevant features: x46, x6, x22, x13, x40, x27, x24, x42
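The same importance-ranking idea can be sketched on synthetic data where we control which features carry signal (the dataset below is illustrative only; with `shuffle=False`, the informative columns come first, so their importances should dominate the noise columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# 5 informative features followed by 15 noise features (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

et = ExtraTreesClassifier(random_state=0).fit(X, y)
imp = et.feature_importances_  # one non-negative weight per feature, summing to 1

# Indices of the 5 highest-importance features
top5 = sorted(range(20), key=lambda i: imp[i], reverse=True)[:5]
print(sorted(top5))
```

Because the importances are normalized to sum to 1, a long tail of near-uniform small values (like the 0.007s in the table above) is a hint that those features contribute little beyond noise.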

Evaluate models using relevant features¶

In [ ]:
res = [] # list with results
for c in CLF: # for each classifier
    model = CLF[c] # classifier object with default hyper-parameters
    st = time() # start time for 5-fold cross-validation
    score = cross_val_score(model, train[relevant_features], train.y).mean() # mean cross-validation accuracy
    t = time() - st # time for 5-fold cross-validation
    print(c, round(score,4), round(t,2)) # show results for classifier
    res.append([c, score, t]) # append results for classifier
res_df = pd.DataFrame(res, columns=['model', 'mean accuracy', 'time']).round(4) # results as dataframe
res_df.to_csv('cross_validation_results.csv', index=False)
res_df
GNB 0.8656 0.04
DT 0.8964 1.61
RF 0.9511 25.23
ET 0.9562 5.3
AB 0.8748 5.99
SGD 0.8673 0.95
Ridge 0.8736 0.04
LR 0.8832 0.39
Lin_SVC 0.8801 0.32
SVC 0.9605 12.03
KNN 0.9598 0.83
MLP 0.9623 53.99
Out[ ]:
model mean accuracy time
0 GNB 0.8656 0.0435
1 DT 0.8964 1.6134
2 RF 0.9511 25.2330
3 ET 0.9562 5.2962
4 AB 0.8748 5.9941
5 SGD 0.8673 0.9531
6 Ridge 0.8736 0.0436
7 LR 0.8832 0.3880
8 Lin_SVC 0.8801 0.3187
9 SVC 0.9605 12.0255
10 KNN 0.9598 0.8328
11 MLP 0.9623 53.9871

Choose good model¶

Based on the cross-validation results, we short-list the best-performing models and use GridSearchCV to find a good set of hyper-parameter values for each through 5-fold cross-validation.

In [ ]:
para = {'C':[1.0, 5.0, 10.0, 20.0, 50.0, 100.0]}
clf = GridSearchCV(SVC(), para, scoring='accuracy',
                   n_jobs=-1, verbose=1) # grid search model

print(clf) # show model
print()

print("Tuning hyper-parameters ... " )
clf.fit(train[relevant_features], train.y) # tune using 5-fold cross-validation

print()
print("Accuracy: mean +/- 2*standard_dev") # show results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))

print("Best parameters:", clf.best_params_)
GridSearchCV(estimator=SVC(), n_jobs=-1,
             param_grid={'C': [1.0, 5.0, 10.0, 20.0, 50.0, 100.0]},
             scoring='accuracy', verbose=1)

Tuning hyper-parameters ... 
Fitting 5 folds for each of 6 candidates, totalling 30 fits

Accuracy: mean +/- 2*standard_dev
0.961 (+/-0.003) for {'C': 1.0}
0.963 (+/-0.002) for {'C': 5.0}
0.963 (+/-0.004) for {'C': 10.0}
0.963 (+/-0.004) for {'C': 20.0}
0.963 (+/-0.005) for {'C': 50.0}
0.961 (+/-0.004) for {'C': 100.0}
Best parameters: {'C': 20.0}
In [ ]:
para = {'n_neighbors': [3, 5, 7, 9, 11, 13]}
clf = GridSearchCV(KNeighborsClassifier(), para, scoring='accuracy',
                   n_jobs=-1, verbose=1) # grid search model

print(clf) # show model
print()

print("Tuning hyper-parameters ... " )
clf.fit(train[relevant_features], train.y) # tune using 5-fold cross-validation

print()
print("Accuracy: mean +/- 2*standard_dev") # show results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))

print("Best parameters:", clf.best_params_)
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 7, 9, 11, 13]},
             scoring='accuracy', verbose=1)

Tuning hyper-parameters ... 
Fitting 5 folds for each of 6 candidates, totalling 30 fits

Accuracy: mean +/- 2*standard_dev
0.958 (+/-0.005) for {'n_neighbors': 3}
0.960 (+/-0.004) for {'n_neighbors': 5}
0.960 (+/-0.006) for {'n_neighbors': 7}
0.960 (+/-0.005) for {'n_neighbors': 9}
0.960 (+/-0.004) for {'n_neighbors': 11}
0.960 (+/-0.003) for {'n_neighbors': 13}
Best parameters: {'n_neighbors': 7}
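One detail worth noting about the two grid searches above: with the default `refit=True`, GridSearchCV retrains the best configuration on the full training set after the search, so the fitted search object can predict directly. A minimal sketch on synthetic data (the dataset parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative 3-class dataset standing in for the training data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

grid = {'n_neighbors': [3, 5, 7]}
clf = GridSearchCV(KNeighborsClassifier(), grid, scoring='accuracy').fit(X, y)

# best_params_ holds the winning grid point; because refit=True (default),
# clf itself is a fitted model and can predict without a separate .fit call
print(clf.best_params_, round(clf.best_score_, 4), clf.predict(X[:3]).shape)
```

In the notebook we nonetheless re-instantiate the chosen model explicitly (e.g. `SVC(C=20.0)`), which makes the final hyper-parameter choice visible in the code.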

Train each short-listed model with the chosen hyper-parameter values on the full training set and evaluate it on the validation samples.

In [ ]:
model = SVC(C=20.0)
print('Chosen classifier:')
print(model)
model.fit(train[relevant_features], train.y)
pred = model.predict(valid[relevant_features]) # predict labels
acc = accuracy_score(valid.y, pred)
print(f'Validation accuracy with chosen classifier = {acc: .4f}')
print()
print("Classification report with chosen classifier:")
print(classification_report(valid.y, pred, digits=4))
print()
print("Precision for class = %d = %4.3f"
      %(0, precision_score(valid.y, pred, average=None)[0]))
print("Recall for class = %d = %4.3f"
      %(2, recall_score(valid.y, pred, average=None)[2]))
print('\nConfusion matrix')
cm = pd.DataFrame(confusion_matrix(valid.y, pred))
cm.to_csv("confusion_matrix.csv")
cm
Chosen classifier:
SVC(C=20.0)
Validation accuracy with chosen classifier =  0.9620

Classification report with chosen classifier:
              precision    recall  f1-score   support

         0.0     0.9630    0.9636    0.9633      1649
         1.0     0.9674    0.9583    0.9628      1701
         2.0     0.9556    0.9642    0.9599      1650

    accuracy                         0.9620      5000
   macro avg     0.9620    0.9620    0.9620      5000
weighted avg     0.9620    0.9620    0.9620      5000


Precision for class = 0 = 0.963
Recall for class = 2 = 0.964

Confusion matrix
Out[ ]:
0 1 2
0 1589 26 34
1 31 1630 40
2 30 29 1591
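The diagonal of the confusion matrix above holds the correctly classified counts per class, so overall accuracy equals the diagonal sum over the total. A quick check of that identity on toy labels (unrelated to the course data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 4 of 6 predictions are correct
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
acc = np.trace(cm) / cm.sum()  # diagonal (correct) over total count
print(cm.tolist(), acc)
```

The off-diagonal cells show which pairs of segments get confused; in the validation matrix above the errors are spread fairly evenly across all class pairs.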
In [ ]:
model = KNeighborsClassifier(n_neighbors=7)
print('Chosen classifier:')
print(model)
model.fit(train[relevant_features], train.y)
pred = model.predict(valid[relevant_features]) # predict labels
acc = accuracy_score(valid.y, pred)
print(f'Validation accuracy with chosen classifier = {acc: .4f}')
print()
print("Classification report with chosen classifier:")
print(classification_report(valid.y, pred, digits=3))
print()
print("Precision for class = %d = %4.3f"
      %(0, precision_score(valid.y, pred, average=None)[0]))
print("Recall for class = %d = %4.3f"
      %(2, recall_score(valid.y, pred, average=None)[2]))
print('\nConfusion matrix')
cm = pd.DataFrame(confusion_matrix(valid.y, pred))
cm.to_csv("confusion_matrix.knn.csv") # separate file so the SVC matrix above is not overwritten
cm
Chosen classifier:
KNeighborsClassifier(n_neighbors=7)
Validation accuracy with chosen classifier =  0.9604

Classification report with chosen classifier:
              precision    recall  f1-score   support

         0.0      0.964     0.961     0.963      1649
         1.0      0.963     0.961     0.962      1701
         2.0      0.954     0.959     0.957      1650

    accuracy                          0.960      5000
   macro avg      0.960     0.960     0.960      5000
weighted avg      0.960     0.960     0.960      5000


Precision for class = 0 = 0.964
Recall for class = 2 = 0.959

Confusion matrix
Out[ ]:
0 1 2
0 1585 25 39
1 30 1634 37
2 29 38 1583

Predict unlabeled samples¶

In [ ]:
unlabeled.head()
Out[ ]:
ID x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x41 x42 x43 x44 x45 x46 x47 x48 x49 x50
0 1 -1.302 0.634 0.304 -0.341 0.487 1.896 1.244 2.305 -1.864 ... 1.355 -2.872 -1.387 -0.814 -1.198 -1.916 0.857 -0.885 -1.857 -1.092
1 2 -1.154 -1.939 1.744 1.248 0.177 -0.336 -0.356 -0.019 -2.898 ... 0.665 1.654 0.169 -0.901 -0.406 0.374 -0.040 1.520 0.910 0.274
2 3 0.594 0.162 0.821 1.499 -0.073 4.519 -0.298 0.858 -0.624 ... -1.359 1.224 -0.079 -0.201 -1.355 -1.885 1.018 0.780 -1.106 -0.837
3 4 -0.313 0.393 0.222 0.543 -0.412 -0.211 -1.823 1.289 -0.149 ... -0.256 0.321 -0.546 -0.321 1.048 -0.772 -1.320 -1.063 -1.464 0.931
4 5 -1.811 1.041 0.417 -1.184 -0.535 3.124 -0.614 0.043 -2.141 ... -0.281 0.269 -0.565 0.895 -0.043 -3.652 0.591 -0.833 1.654 1.455

5 rows × 51 columns

In [ ]:
# 'model' is the most recently fitted classifier above (KNeighborsClassifier)
predTest = model.predict(unlabeled[relevant_features]) # predict labels for the unlabeled examples
new = pd.DataFrame() # results data frame
new['ID'] = unlabeled.ID
new['predicted'] = predTest # predicted values
new.to_csv("unlabeled.results.csv", index=False) # save results
new
Out[ ]:
ID predicted
0 1 0.0
1 2 2.0
2 3 0.0
3 4 0.0
4 5 0.0
5 6 0.0
6 7 0.0
7 8 0.0
8 9 0.0
9 10 0.0
10 11 0.0
11 12 0.0
12 13 0.0
13 14 0.0
14 15 0.0
15 16 0.0
16 17 0.0
17 18 0.0
18 19 0.0
19 20 0.0
20 21 0.0
21 22 1.0
22 23 1.0
23 24 1.0
24 25 1.0
25 26 1.0
26 27 1.0
27 28 1.0
28 29 1.0
29 30 1.0
30 31 1.0
31 32 1.0
32 33 1.0
33 34 1.0
34 35 1.0
35 36 1.0
36 37 1.0
37 38 1.0
38 39 1.0
39 40 1.0
40 41 1.0
41 42 2.0
42 43 2.0
43 44 2.0
44 45 2.0
45 46 2.0
46 47 2.0
47 48 2.0
48 49 2.0
49 50 2.0
50 51 2.0
51 52 2.0
52 53 2.0
53 54 2.0
54 55 2.0
55 56 2.0
56 57 1.0
57 58 2.0
58 59 2.0
59 60 0.0
In [ ]:
new.predicted.value_counts()
Out[ ]:
count
predicted
0.0 22
1.0 20
2.0 18
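The save-and-summarize pattern above can be sketched with made-up predictions (the IDs and labels below are illustrative, not model output): build a two-column frame, write it to CSV, and tally the predicted segments with `value_counts`.

```python
import pandas as pd

# Illustrative predictions frame, mirroring the 'new' DataFrame above
preds = pd.DataFrame({'ID': [1, 2, 3, 4],
                      'predicted': [0.0, 1.0, 1.0, 2.0]})
preds.to_csv('example.results.csv', index=False)  # same save pattern as above

counts = preds.predicted.value_counts()  # samples assigned to each segment
print(counts.to_dict())
```

The counts above (22 / 20 / 18 across the three segments) show the unlabeled customers are distributed fairly evenly, which is consistent with the balanced class supports seen in the validation set.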