
Making decisions or predictions
Before we build a prediction, we need to separate our data into training and test sets. Model validation is a large and very important topic that will be covered later in the book, but for the purpose of this end-to-end example, we will do a basic train-test split. We will then build the prediction model on the training data and score it on the test data using the F1 score.
# Split into train and test sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_lda, test_size=0.3, random_state=42)
# Sanity check
print('train set shape = ' + str(df_train.shape))
print('test set shape = ' + str(df_test.shape))
print(df_train.head())
You will see the following output after executing the preceding code:

Now we can move on to predictions. Let's first try a Support Vector Machine (SVM) by using the Support Vector Classifier (SVC) module. Notice how the classifier objects in scikit-learn have similar API calls to the PCA and LDA transforms from earlier. So, once you gain an understanding of the library, you can learn how to apply different transformations, classifiers, or other methods with very little effort:
# classify with SVM
from sklearn.svm import SVC
from sklearn.metrics import f1_score
clf = SVC(kernel='rbf', C=0.8, gamma=10)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
# predict on test set
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')
# check prediction score
print("f1 score for SVM classifier = %.2f" % f1)
The F1 score for this classifier, calculated on the test set, is 0.79. At this point, we can change a model setting and fit again. The C parameter was set to 0.8 in our first run, using the C=0.8 argument when instantiating the clf object. C is a penalty term and is called a hyperparameter; this means that it is a setting an analyst can use to steer a fit in a certain direction. Here, we will use the penalty hyperparameter C to tune the model towards better predictions. Let's change it from 0.8 to 1, which will effectively raise the penalty term:
# classify with SVM
from sklearn.svm import SVC
from sklearn.metrics import f1_score
clf = SVC(kernel='rbf', C=1, gamma=10)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')
print("f1 score for SVM classifier = %.2f" % f1)
The F1 score for this classifier is now 0.85. The obvious next step is to tune the parameters to maximize the F1 score. Of course, it would be very tedious to change a parameter, refit, analyze, and repeat by hand. Instead, you can employ a grid search to automate this parameter search. Grid search and cross-validation will be covered in more detail in later chapters. An alternative to employing a grid search is to choose an algorithm that doesn't require tuning. A popular algorithm that requires little-to-no tuning is Random Forest. The forest refers to how the method combines multiple decision trees into a voted prediction:
# classify with RF
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=2, random_state=42)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')
# check prediction score
print("f1 score for RF classifier = %.2f" % f1)
The F1 score for this classifier is 0.96, with no tuning at all. The Random Forest method will be discussed in more detail in later chapters.
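As a preview of the grid search mentioned above, here is a minimal sketch using scikit-learn's GridSearchCV to automate the search over C and gamma. For the example to be self-contained it loads the raw Iris features directly rather than the LDA components from earlier, and the parameter grid values are illustrative assumptions, not the ones a full search would necessarily use:

```python
# Sketch: automate the C/gamma search with GridSearchCV
# NOTE: uses raw Iris features (not the lda1/lda2 components) so it runs standalone;
# the parameter grid below is an illustrative assumption
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Candidate hyperparameter values to try exhaustively
param_grid = {'C': [0.1, 0.8, 1, 10], 'gamma': [0.1, 1, 10]}

# 5-fold cross-validated search, scored with the weighted F1 used above
search = GridSearchCV(SVC(kernel='rbf'), param_grid,
                      scoring='f1_weighted', cv=5)
search.fit(X_train, y_train)

print("best parameters =", search.best_params_)
print("best CV f1 score = %.2f" % search.best_score_)
```

Each (C, gamma) pair is fit and scored with cross-validation on the training set only, so the test set remains untouched for a final, unbiased evaluation via search.predict(X_test).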