Suitability #
Good for supervised learning in both classification and regression form.
Setup #
Define split proportion.
Code #
Bare bones:
from sklearn.model_selection import train_test_split
train, test = train_test_split(<data>, test_size = <value 0.00-1.00 or number of rows>)
Alternatively, specify the split with train_size.
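A minimal runnable sketch with made-up data (the array values and the 0.2 test proportion below are arbitrary choices):
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)  # 10 rows of dummy values
train, test = train_test_split(data, test_size=0.2)

print(train.shape)  # (8, 2)
print(test.shape)   # (2, 2)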
With the data split into input X (possibly with many columns) and output y, both structured as arrays:
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
Make the split repeatable by setting a seed, specified with random_state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<some value>, random_state=<some integer>)
Unbalanced data #
Sometimes the classes in a classification problem are unbalanced. Specify stratify=y so the original class proportions are maintained through the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<some number>, random_state=<some number>, stratify=y)
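As a quick illustration, stratification keeps roughly the same class ratio in both halves. The 90/10 class balance below is invented:
import numpy as np
from sklearn.model_selection import train_test_split

# Invented unbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(np.bincount(y_train))  # [72  8] -- still roughly 9:1
print(np.bincount(y_test))   # [18  2] -- still roughly 9:1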
Split the dataframe #
Assuming the data comes in as a dataframe, it can be split into X (independent variables) and y (dependent variable):
data = df.values
X, y = data[:, :-1], data[:, -1]  # assuming the dependent variable is the last column
If the source is a .csv, drop the header if present.
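A sketch of the same idea reading straight from a file; the file name and column layout are assumptions:
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; read_csv treats the first row as a header,
# so a header row (if present) becomes column names rather than data
df = pd.read_csv("data.csv")

data = df.values
X, y = data[:, :-1], data[:, -1]  # dependent variable assumed to be the last column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)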
For more details, check out Python modeling.
Training with the split #
Once the data is split, train on the training data.
model = <some model>(...)
model.fit(X_train, y_train)
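For instance, a sketch with one possible model; logistic regression here is just an example choice, not part of the original recipe:
from sklearn.linear_model import LogisticRegression

# Any scikit-learn estimator with fit/predict works the same way
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)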
Predict with the split #
Once the model has been trained, use it to predict from the independent variables of the test data.
predictions = model.predict(X_test)
Evaluate predictions #
Once we have the predictions, we can compare them against the true values.
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, predictions)
print(acc)
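accuracy_score only applies to classification. For a regression target, swap in an error metric; mean absolute error below is one example choice:
from sklearn.metrics import mean_absolute_error

# For regression, report an error instead of accuracy
mae = mean_absolute_error(y_test, predictions)
print(mae)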
Warnings #
- Not suitable for small datasets (i.e., when the data is not sufficiently representative)
- Unbalanced data
Try k-fold cross-validation instead. See details in Python cross validation.
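A minimal sketch of that alternative, assuming X, y, and model as defined above (cv=5 is an arbitrary choice):
from sklearn.model_selection import cross_val_score

# Score the model across 5 folds instead of a single train/test split
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())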