Why are large data sets better than small data sets?

Contents

1 Why are large data sets better than small data sets?
2 How to deal with small data sets in machine learning?
3 What’s the best way to reduce model size?
4 How to reduce the size of source data?

Why are large data sets better than small data sets?

With a small dataset, running a large number of training iterations can result in overfitting. A large dataset helps avoid overfitting and lets the model generalize better, because it captures the underlying data distribution more effectively. Here are a few important factors that influence the network optimization process:
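As a rough illustration of why more data captures the underlying distribution better, the sketch below estimates the mean of a standard normal distribution from small and large samples and compares the average error. The function name `mean_estimation_error`, the sample sizes, and the number of trials are assumptions chosen for illustration only; this is not a model-training experiment.

```python
import random
import statistics


def mean_estimation_error(sample_size, trials=200, seed=0):
    """Average absolute error when estimating a true mean of 0.0 from a sample."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Draw a fresh sample from a normal distribution with mean 0 and std 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        # The true mean is 0.0, so the absolute sample mean is the error.
        errors.append(abs(statistics.fmean(sample)))
    return statistics.fmean(errors)


small_error = mean_estimation_error(sample_size=10)
large_error = mean_estimation_error(sample_size=1000)
print(f"n=10: {small_error:.3f}, n=1000: {large_error:.3f}")
```

The larger sample should give a noticeably smaller average error, which is the same effect that helps a model trained on more data generalize.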

How to deal with small data sets in machine learning?

1. Data augmentation: Data augmentation can be an effective way to work with a small dataset without overfitting. It is also a good technique for making the model invariant to changes in size, translation, viewpoint, illumination, etc. We can achieve this by augmenting the data in a few of the following ways:
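A few simple augmentations can be sketched in plain Python, treating a grayscale image as a list of pixel rows. The helper names (`flip_horizontal`, `translate_right`, `scale_brightness`, `augment`) and the parameter ranges are hypothetical and chosen for illustration; a real project would more likely rely on an image library.

```python
import random


def flip_horizontal(image):
    """Mirror each row (a simple viewpoint change)."""
    return [row[::-1] for row in image]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]


def scale_brightness(image, factor, max_value=255):
    """Multiply pixel values by `factor` (illumination change), clamped to [0, max_value]."""
    return [[min(max_value, max(0, round(p * factor))) for p in row] for row in image]


def augment(image, rng):
    """Apply a random combination of the transformations above."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    out = translate_right(out, rng.randint(0, 1))
    out = scale_brightness(out, rng.uniform(0.8, 1.2))
    return out


image = [[10, 20, 30],
         [40, 50, 60]]
rng = random.Random(0)
# Generate four augmented variants of the same source image.
augmented = [augment(image, rng) for _ in range(4)]
```

Each variant keeps the original label, so one labeled example yields several training examples.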

Can a small data set be a deal breaker?

Recent advances in deep learning have shown that sequence models such as LSTM or GRU are very good at such tasks. However, a small dataset can be a deal breaker, and finding a good model for transfer learning in such a use case is very difficult.

Why is training with a whole dataset slow?

Training on the whole dataset at every step is computationally expensive and slow. Adam, RMSprop, Adagrad, and stochastic gradient descent are a few variations of gradient descent that optimize the gradient update process and improve model performance.
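To make the cost difference concrete, here is a minimal sketch of the mini-batch form of stochastic gradient descent, where each update uses only a small batch instead of the whole dataset. The linear model, the function name `minibatch_sgd`, and all hyperparameter values are assumptions for illustration. Adam, RMSprop, and Adagrad would change how the step is computed, which is not shown here.

```python
import random


def minibatch_sgd(xs, ys, batch_size=8, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # Visit the examples in a new random order each epoch.
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i]
                grad_b += 2 * error
            # Each update uses the average gradient over the batch only.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Synthetic, noise-free data generated from y = 3x + 1.
xs = [i / 10 for i in range(40)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Each update touches 8 examples rather than all 40, so the cost per step does not grow with the size of the dataset.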

What’s the best way to reduce model size?

Perhaps the most effective technique to reduce model size is to load pre-summarized data. This technique can be used to raise the grain of fact-type tables. There is a distinct trade-off, however: summarizing the data loses detail.
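As a sketch of what raising the grain means, the example below summarizes a daily sales fact table to one row per month and product before it is loaded. The column names (`date`, `product`, `amount`) and the function `summarize_by_month` are hypothetical and used for illustration only.

```python
from collections import defaultdict


def summarize_by_month(rows):
    """Aggregate daily fact rows to one row per (month, product)."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[(month, row["product"])] += row["amount"]
    return [
        {"month": month, "product": product, "amount": amount}
        for (month, product), amount in sorted(totals.items())
    ]


sales = [
    {"date": "2024-01-03", "product": "A", "amount": 10.0},
    {"date": "2024-01-17", "product": "A", "amount": 5.0},
    {"date": "2024-01-20", "product": "B", "amount": 7.0},
    {"date": "2024-02-02", "product": "A", "amount": 3.0},
]
summary = summarize_by_month(sales)
```

Four daily rows become three monthly rows here. The individual sale dates are no longer recoverable from the summary, which is the loss of detail the trade-off refers to.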

The figure above tries to capture the core issues faced when dealing with small data sets, along with possible approaches and techniques to address them. This part focuses only on the techniques used in traditional machine learning; the rest will be discussed in part 2 of the blog.

How to reduce the size of source data?

When source data is loaded into memory, it is possible to see 10x compression, so it is reasonable to expect that 10 GB of source data can compress to about 1 GB. Further, when persisted to disk, an additional 20% reduction can be achieved.
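The arithmetic can be written out as a small estimate. This sketch assumes the 20% disk reduction applies to the in-memory size; the function name `estimate_sizes` and its parameters are hypothetical, and actual ratios depend on the data.

```python
def estimate_sizes(source_gb, memory_ratio=10.0, disk_reduction=0.20):
    """Estimate in-memory and on-disk sizes from the source data size."""
    # 10x compression in memory: 10 GB of source data becomes about 1 GB.
    in_memory_gb = source_gb / memory_ratio
    # Assumed: a further 20% reduction relative to the in-memory size.
    on_disk_gb = in_memory_gb * (1.0 - disk_reduction)
    return in_memory_gb, on_disk_gb


in_memory, on_disk = estimate_sizes(10.0)
print(f"in memory: {in_memory:.1f} GB, on disk: {on_disk:.1f} GB")
```

Under these assumptions, 10 GB of source data comes to about 1 GB in memory and about 0.8 GB on disk.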
