Python Data Mining Quick Start Guide

Making decisions or predictions

Before we build a prediction model, we need to separate our data into training and test sets. Model validation is a large and very important topic that will be covered later in the book, but for the purpose of this end-to-end example, we will do a basic train-test split. We will then build the prediction model on the training data and score it on the test data using the F1 score.

I recommend setting a random seed so that your data selection is reproducible. The seed tells the pseudo-random number generator where to begin its randomization routine, so it makes the same random choices every time. In this example, I've used a random seed when splitting into test and training sets. Now, if I stop working on the project and pick it back up later, I can split with the same seed and get the exact same training and test sets. I used 42 for my seed, as is common in the field due to the popularity of The Hitchhiker's Guide to the Galaxy by Douglas Adams.
# Split into train/test set
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_lda, test_size=0.3, random_state=42)

# Sanity check
print('train set shape = ' + str(df_train.shape))
print('test set shape = ' + str(df_test.shape))
print(df_train.head())

You will see output similar to the following after executing the preceding code:
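Assuming df_lda holds the 150-row iris data with the columns lda1, lda2, and species (as in the earlier steps), the 70/30 split should report shapes along these lines, followed by the first few rows of the training set:

train set shape = (105, 3)
test set shape = (45, 3)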

Now we can move on to predictions. Notice how the classifier objects in scikit-learn have API calls similar to the PCA and LDA transforms from earlier. Once you gain an understanding of the library, you can apply different transformations, classifiers, or other methods with very little effort.
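For instance, here is a purely illustrative sketch, assuming the df_train frame from the split above, showing that transformers and classifiers are all driven through the same fit() call (refitting PCA and LDA on the LDA scores is redundant and only serves to demonstrate the shared API):

# minimal illustration of scikit-learn's uniform estimator API
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

X = df_train[['lda1', 'lda2']]
y = df_train['species']

for estimator in (PCA(n_components=2), LinearDiscriminantAnalysis(), SVC()):
    estimator.fit(X, y)  # same fit() call for transformers and classifiers
    print(type(estimator).__name__ + ' fitted')

With that in mind, let's first try a Support Vector Machine (SVM) by using the Support Vector Classifier (SVC) module: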

# classify with SVM
from sklearn.svm import SVC
from sklearn.metrics import f1_score
clf = SVC(kernel='rbf', C=0.8, gamma=10)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])

# predict on test set
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')

# check prediction score
print("f1 score for SVM classifier = %2f " % f1)

The F1 score for this classifier is 0.79, as calculated on the test set. At this point, we can change a model setting and fit again. The C parameter was set to 0.8 in our first run, via the C=0.8 argument when instantiating the clf object. C is a penalty term and is called a hyperparameter; that is, a setting that an analyst can use to steer the fit in a desired direction. Here, we will use the penalty hyperparameter C to tune the model towards better predictions. Let's change it from 0.8 to 1, which effectively raises the penalty term.

C is the penalty term in an SVM. It controls how heavily a misclassified example is penalized internally during the model fit. In practical terms, it is called the soft-margin penalty because it tunes how hard or soft the separating boundary is drawn. Common hyperparameters for SVMs will be covered in more detail in a later chapter.
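As an aside, the effect of the soft margin is easy to observe. The following is a minimal hypothetical sketch (reusing df_train from above): lowering C relaxes the margin, which typically leaves more training points on or inside it as support vectors:

# compare support vector counts at a soft (low C) and hard (high C) setting
from sklearn.svm import SVC

for c in (0.1, 100):
    m = SVC(kernel='rbf', C=c, gamma=10)
    m.fit(df_train[['lda1', 'lda2']], df_train['species'])
    print('C = %5.1f -> %d support vectors' % (c, len(m.support_)))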
# classify with SVM
from sklearn.svm import SVC
from sklearn.metrics import f1_score
clf = SVC(kernel='rbf', C=1, gamma=10)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')
print("f1 score for SVM classifier = %2f " % f1)

The F1 score for this classifier is now 0.85. The obvious next step is to tune the parameters to maximize the F1 score. Of course, changing one parameter at a time (refit, analyze, and repeat) is very tedious. Instead, you can employ a grid search to automate this parameter search. Grid search and cross-validation will be covered in more detail in later chapters.
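As a quick preview, here is a minimal sketch using scikit-learn's GridSearchCV, assuming the same df_train frame from the split above; the candidate values in param_grid are illustrative choices, not tuned recommendations:

# exhaustively search C/gamma combinations with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 0.8, 1, 10], 'gamma': [1, 10, 100]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='f1_weighted', cv=5)
grid.fit(df_train[['lda1', 'lda2']], df_train['species'])
print('best parameters:', grid.best_params_)

An alternative to employing a grid search is to choose an algorithm that requires little-to-no tuning. A popular algorithm of this kind is Random Forest. The forest refers to how the method adds together multiple decision trees into a voted prediction: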

# classify with RF
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
clf = RandomForestClassifier(n_estimators=2, random_state=42)
clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
y_pred = clf.predict(df_test[['lda1', 'lda2']])
f1 = f1_score(df_test['species'], y_pred, average='weighted')

# check prediction score
print("f1 score for RF classifier = %.2f" % f1)

The F1 score for this classifier is 0.96, with no tuning at all. The Random Forest method will be discussed in more detail in later chapters.
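Note that n_estimators=2 means only two trees vote. As a hypothetical follow-up experiment (reusing the same data frames), you can sweep the number of trees to see how the prediction behaves as the forest grows:

# sweep the forest size; random_state fixed for reproducibility
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

for n in (2, 10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    clf.fit(df_train[['lda1', 'lda2']], df_train['species'])
    y_pred = clf.predict(df_test[['lda1', 'lda2']])
    print('n_estimators = %3d -> f1 = %.2f' %
          (n, f1_score(df_test['species'], y_pred, average='weighted')))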