Python test-train split

Suitability #

Good for supervised learning, in both classification and regression settings.

Setup #

Define split proportion.

Code #

Bare bones:

from sklearn.model_selection import train_test_split

train, test = train_test_split(<data>, test_size=<proportion 0.0-1.0 or number of rows>)

Alternatively, specify split with train_size.
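For example, a minimal sketch on a toy list of 10 rows (the data here is illustrative):

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # toy dataset of 10 rows

# train_size mirrors test_size: a float is a proportion,
# an int is an absolute row count
train, test = train_test_split(data, train_size=0.7, random_state=0)

print(len(train), len(test))  # 7 rows for training, 3 for testing
```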

With the data split into input X (possibly with many columns) and output y, structured as arrays.

X_train, X_test, y_train, y_test = train_test_split(X, y, ...)

Standardize the split with a seed for repeatability, specified with random_state.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<some value>, random_state=<some integer>)

Unbalanced data #

Sometimes the classes in classification work are unbalanced. Specify stratify=y so the original class proportions are maintained through the split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<some number>, random_state=<some number>, stratify=y)
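A quick sketch showing that stratification preserves the class balance, using a made-up 80/20 class mix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy unbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Both splits keep the original 80/20 proportions
print((y_train == 1).mean())  # 0.2
print((y_test == 1).mean())   # 0.2
```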

Split the dataframe #

Assuming the data comes in as a dataframe, it can be split into X (independent variables) and y (dependent variable):

data = df.values
X, y = data[:, :-1], data[:, -1]  # assuming the dependent variable is the last column

If the source is a .csv, drop the header if present.
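A sketch with pandas; the in-memory string stands in for a hypothetical .csv file whose first row is a header:

```python
import io

import pandas as pd

# Stand-in for a .csv file with a header row (contents are illustrative)
csv_text = "x1,x2,y\n1,2,0\n3,4,1\n5,6,0\n"

# header=0 treats the first row as column names rather than data;
# use header=None instead if the file has no header row
df = pd.read_csv(io.StringIO(csv_text), header=0)

data = df.values                  # column names are dropped; only values remain
X, y = data[:, :-1], data[:, -1]  # dependent variable in the last column

print(X.shape, y.shape)  # (3, 2) (3,)
```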

For more details, check out Python modeling.

Training with the split #

Once the data is split, train on the training data.

model = <some model>(...)

model.fit(X_train, y_train)

Predict with the split #

Once the model has been trained, use it to predict on the test data independent variables.

predictions = model.predict(X_test)

Evaluate predictions #

Once we have the predictions, we can compare them against the true values.

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, predictions)
print(acc)
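Putting the steps together, a minimal end-to-end sketch; the iris dataset and logistic regression model are illustrative choices, not part of the notes above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratified, reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)  # model choice is illustrative
model.fit(X_train, y_train)

predictions = model.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(acc)
```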

Warnings #

  • Not suitable for small datasets (i.e., when the data is not sufficiently representative)
  • Unbalanced data, if stratification is not used

Try k-fold cross-validation instead. See details in Python cross validation.

Resources #