For my final project in Data Mining, we were given the Wisconsin Breast Cancer Database and asked to apply six different types of models to the data: KNN, Decision Tree, Random Forest, Polynomial SVM, Gaussian SVM, and a Multilayer Perceptron. For each of these model types, we needed to build multiple models with various combinations of hyperparameters and compare how those combinations performed.

At first, this seemed like a very tedious and long assignment. Not only would we have to implement each of these models from scratch, we would need to implement multiple versions of each model. This is the kind of thing that we would do over the course of an entire semester in a machine learning course.

However, the professor allowed us to use existing machine learning packages for this assignment, which significantly reduced the effort required. In fact, with the functionality built into Scikit-Learn, the code for this assignment was quite easy to write.

Not only does Scikit-Learn already implement each of these model types, with constructor arguments for setting their hyperparameters, its model_selection module also includes a class called GridSearchCV, which makes it very easy to tune a model by trying every combination in a grid of hyperparameter values and scoring each one with cross-validation.

Here's what the code for KNN looks like, assuming that data preprocessing is already taken care of:

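import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
params = {'n_neighbors': [1, 3, 5, 7, 9, 11], 'metric': ['euclidean', 'manhattan', 'chebyshev']}
# 10-fold CV on four metrics, refit on F1; X_train and y_train are the preprocessed training data
clf = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=10, scoring=['precision', 'recall', 'accuracy', 'f1'], refit='f1').fit(X_train, y_train)
df = pd.DataFrame.from_dict(clf.cv_results_)
# Drop the per-split, std, params, and rank columns, keeping the mean scores and param_* columns
df.drop(df.filter(regex=r'(split)|(std)|(params)|(rank)').columns, axis=1, inplace=True)
print(df.sort_values(by=['mean_test_f1'], ascending=False))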

In just a few lines of code, we're able to evaluate 18 KNN hyperparameter combinations (6 values of k × 3 distance metrics), each with 10-fold cross-validation, on four evaluation metrics. In my results, ranked by mean F1, the KNN model using Euclidean distance and k=5 performed best on this dataset.
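For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes Scikit-Learn's bundled load_breast_cancer data as a stand-in, which may not be the same file we were given, and the train/test split settings are my own choices. It also shows where the count of 18 comes from and how many fits the search actually performs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset stands in for the assignment's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

params = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "metric": ["euclidean", "manhattan", "chebyshev"],
}
n_combos = len(ParameterGrid(params))  # 6 values of k * 3 metrics

clf = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=params,
    cv=10,
    scoring=["precision", "recall", "accuracy", "f1"],
    refit="f1",  # the combination with the best mean F1 is refit on all of X_train
)
clf.fit(X_train, y_train)

# Every combination is fit once per fold, plus one final refit of the winner.
n_cv_fits = n_combos * clf.n_splits_
print(f"{n_combos} combinations, {n_cv_fits} cross-validation fits")
print("Best parameters by mean F1:", clf.best_params_)
print("Mean cross-validated F1:", clf.best_score_)

# The refit model predicts directly, so it can be checked on held-out data.
print("Held-out F1:", f1_score(y_test, clf.predict(X_test)))
```

The winning combination on this stand-in data will not necessarily match the one I found in the assignment, since the data and preprocessing may differ.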

I had never used GridSearchCV before this assignment, but it's a very powerful and easy-to-use framework for model tuning, and I'll definitely be using it more in the future.


