Should you standardize test data?

Yes, you need to apply normalisation to test data if your algorithm works with or needs normalised training data. That is because your model works on the representation given by its input vectors, and the scale of those numbers is part of the representation. The test data must therefore be scaled with the same parameters that were computed on the training data.
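As a minimal sketch of that idea, the snippet below standardizes a training set and then reuses the same mean and standard deviation for the test set. The arrays are synthetic and the names (`X_train`, `X_test`, `mean`, `std`) are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np

# Synthetic data, purely illustrative: 3 features on a scale far from [0, 1].
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# The statistics are computed on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Both sets are transformed with the training statistics, so the test
# inputs live in the same representation the model was trained on.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The scaled test set will generally not have exactly zero mean and unit variance, and that is expected: it is mapped with the training parameters, not its own.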

Why can we calculate containment features across all data (training & test) prior to splitting the DataFrame for modeling?

Answer: We can calculate containment features across all data (training & test) prior to splitting the DataFrame because, at the moment of computing the containment value, we are only using one answer text (belonging to either the training or the test set) and one source text. Each row's feature does not depend on any other row, so nothing leaks from the test set into the training set.
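To make the row-by-row nature of the feature concrete, here is a rough sketch. The answer above does not define containment, so this assumes a common definition: the number of n-grams shared by the answer and the source, divided by the number of n-grams in the answer. The helper names `ngrams` and `containment` are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams in text (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer_text, source_text, n=1):
    """Assumed definition: shared n-gram count / n-gram count of the answer."""
    answer_counts = Counter(ngrams(answer_text, n))
    source_counts = Counter(ngrams(source_text, n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


# The value depends only on this one pair of texts, not on any other row.
value = containment("a b c d", "a b x y", n=1)
```

Because the function takes exactly one answer and one source, calling it before or after the train/test split gives the same number for every row.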

How to split data into train validation and test sets?

Now that you know what these datasets do, you might be looking for recommendations on how to split your dataset into train, validation and test sets. This mainly depends on two things: first, the total number of samples in your data, and second, the actual model you are training.
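One way to produce the three sets with scikit-learn is to call `train_test_split` twice. The 60/20/20 ratio below is only an assumption for illustration; as noted above, a sensible ratio depends on the sample count and the model. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of all samples as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp
)
```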

How to use standardization / StandardScaler for cross validation?

At first I thought this was how it should be, but I’m about to change my mind: I think I have to take the mean and std computed on the train set and apply them to the test set?

Which is the best method for cross validation?

Basically, you use your training set to generate multiple splits into train and validation sets. Cross validation helps guard against overfitting and is getting more and more popular, with K-fold cross validation being the most popular method.
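A small sketch of how K-fold produces those multiple splits, using scikit-learn's `KFold`. The tiny array and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy training samples with two features each.
X_train = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen_val = []
for train_idx, val_idx in kfold.split(X_train):
    # Each fold trains on 4/5 of the samples and validates on the rest.
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_val.extend(val_idx.tolist())
```

Every training sample lands in the validation part of exactly one fold, so each sample is used for validation once and for training in the other folds.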

How to use scikit-learn for cross validation?

The idea is to prevent data leakage from the test set into the training set: the aim of model validation is to subject the test data to the same conditions as the data used for model training. I guess you are using scikit-learn… What you have to do is fit the pipeline on X_train and only transform X_test.
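The advice above can be sketched with a scikit-learn pipeline. The dataset, the `LogisticRegression` model and the 5-fold setting are assumptions picked for illustration; the point is only that the scaler sits inside the pipeline, so it is refit on the training part of every fold and never sees the held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, purely illustrative.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaler and model are bundled, so scaling is learned only from whatever
# data the pipeline is fit on.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Cross validation on the training set: within each fold the scaler is fit
# on the training part and only applied to the validation part.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit on X_train; X_test is only transformed (inside score/predict).
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
```

Calling `pipeline.score` or `pipeline.predict` on `X_test` applies the scaler's `transform` with the statistics learned from `X_train`, which is the "fit on train, only transform test" behaviour described above.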
