Train a model in Python from start to finish

Training and assessing a machine learning model in Python tends to follow a fairly standard process.

The individual steps usually include:

  • Accessing the data
  • Preparing the data as appropriate. This step is highly dependent on the particular situation (see the sketch just after this list).
  • Splitting the data into training, testing, and validation sets
  • Training the model
  • Tuning the model iteratively based on how it performs against the testing data (a quick tuning sketch appears after the main snippet below)
  • Assessing the model against the validation data by examining the accuracy, confusion matrix, and other metrics reflected in a detailed report
  • Identifying the most important features. While this isn’t always essential, it’s a question that’s almost guaranteed to come up in any business context.
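
The first two steps vary the most from project to project, so they're only sketched here. As a rough example (the file name, cleaning steps, and encoding choices below are placeholder assumptions, not part of the snippet further down), accessing and preparing the data might look something like:

import pandas as pd

# load the raw data; the path and file format are assumptions
dat = pd.read_csv('source_data.csv')

# drop rows missing the target, then one-hot encode any categorical predictors
# (this assumes the target itself is not an object-dtype column)
dat = dat.dropna(subset=['<whatever the dependent variable is>'])
dat = pd.get_dummies(dat, drop_first=True)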

Here I focus on the random forest method, but the same approach works with essentially any supervised classifier, with minor modifications as needed.
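
For instance, swapping in a different scikit-learn classifier is usually a one-line change in the snippet below, since the estimators share the same fit/predict interface (the gradient boosting choice here is purely an illustration):

from sklearn.ensemble import GradientBoostingClassifier

# any scikit-learn classifier exposing fit() and predict() can stand in
# for RandomForestClassifier() in the snippet that follows
rf = GradientBoostingClassifier()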

The code snippet in its entirety:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

dat_prep = dat # modify the source data as needed, keeping the target variable in place

# split into test/train/validate

set_seed = <some numeric value>
target_field = '<whatever the dependent variable is>'

# convert data into matrices of values
X = dat_prep.drop(labels=target_field, axis=1)
y = dat_prep.loc[:, target_field]

# first split into a training set and an initial test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=set_seed)

# second split: carve a validation set out of the training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=set_seed)


# train model
rf = RandomForestClassifier()

model_rf = rf.fit(X_train, y_train)


# generate predictions
predictions_rf = model_rf.predict(X_test)


# get accuracy score
print(accuracy_score(y_test, predictions_rf))

# confusion matrix
print(confusion_matrix(y_test, predictions_rf))

# detailed report, including precision, recall, and F1 score
print(classification_report(y_test, predictions_rf))

# run model against the validation
validation_predictions_rf = model_rf.predict(X_val)

print(accuracy_score(y_val, validation_predictions_rf))

print(confusion_matrix(y_val, validation_predictions_rf))

print(classification_report(y_val, validation_predictions_rf))



# Assuming the model looks good, let's get a glimpse of the feature importances
importances = model_rf.feature_importances_
print(len(importances))  # should match the number of features

importances_df = pd.DataFrame(importances, columns=['importance'])
importances_df['feature'] = X_train.columns

print(importances_df.sort_values('importance', ascending=False))
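
If the importances need to be shared with a wider audience, a quick horizontal bar chart is often clearer than the raw table. This sketch assumes matplotlib, which the snippet above doesn't otherwise use:

import matplotlib.pyplot as plt

# bar chart of feature importances, largest at the top
order = importances.argsort()
plt.barh(X_train.columns[order], importances[order])
plt.xlabel('importance')
plt.tight_layout()
plt.show()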

You’ll definitely have to make some tweaks to suit your particular situation, but the core of the code is here.
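
One tweak worth calling out is the tuning step from the list above, which the snippet skips. A common approach is a cross-validated grid search over a few hyperparameters; this is only a sketch, and the grid values below are illustrative assumptions rather than recommendations:

from sklearn.model_selection import GridSearchCV

# illustrative hyperparameter grid; adjust to your data
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=set_seed),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

# carry the tuned model forward into the test and validation steps above
model_rf = grid_search.best_estimator_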