• Sheikh Aman

Limitations and Challenges for Machine Learning Models. | 2020 edition.

Common challenges that beginners and experts alike face when training a model:

  • Insufficient Quantity of Training Data

  • Non-Representative Training Data

  • Poor Quality Data

  • Irrelevant Features

  • Overfitting Training Data

  • Underfitting Training Data

Insufficient Quantity of Training Data

Consider this: for a child to learn what a mango is, all it takes is for you to point to a mango and say "mango". After a few repetitions, the child will be able to recognise mangoes in all colours and shapes. Humans are remarkably good at this.

But machine learning is not there yet. Most algorithms need a lot of data to work properly, even for simple problems. For complex problems such as image or speech recognition, you may need millions of examples (unless you can reuse parts of an existing model).

So, with too little training data, a model is unlikely to perform well.
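The effect of training set size can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the "true" relationship (slope 2, intercept 1), the noise level, and the helper names `fit_line` and `average_slope_error` are all hypothetical and chosen for illustration. It fits a straight line to noisy data many times and compares how far the fitted slope lands from the true one with 5 samples versus 500.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def average_slope_error(n_samples, n_trials=200, seed=0):
    """Average squared error of the fitted slope over repeated experiments."""
    rng = random.Random(seed)
    true_slope, true_intercept = 2.0, 1.0  # assumed ground truth, for illustration only
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.uniform(0, 10) for _ in range(n_samples)]
        ys = [true_slope * x + true_intercept + rng.gauss(0, 3) for x in xs]
        slope = fit_line(xs, ys)[0]
        total += (slope - true_slope) ** 2
    return total / n_trials

error_small = average_slope_error(n_samples=5)
error_large = average_slope_error(n_samples=500)
print(f"average slope error with 5 samples:   {error_small:.4f}")
print(f"average slope error with 500 samples: {error_large:.4f}")
```

With only a handful of points, the noise dominates and the fitted slope varies a lot from one experiment to the next; with more data, the estimate settles close to the true value.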

I want to share my own experience here. During the final semester of my degree, I ran into this problem while building a machine learning model for my final-year project. I was trying to build a model that would predict a suitable college for students in India based on their performance. Initially, it gave 43% accuracy. When I asked my mentor, he looked at the model and said there was not enough data. After I collected more data, it reached 92% accuracy, which was pretty good.

For testing, I entered my own performance as input, and the prediction was correct: it predicted the college I was studying at.

Non-Representative Training Data

For a model to generalize well, it is crucial that the training data be an accurate representation of the population. In other words, whenever a new sample is drawn from the population, it must paint an accurate picture of that population. A training set must be representative of the cases you want to generalize to. This is, however, harder than it sounds. If the sample is too small, you will have sampling noise (i.e., non-representative data as a result of chance), but even very large samples can be non-representative if the sampling method is flawed. This is called sampling bias.
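A small simulation can make sampling bias concrete. This is a sketch with an invented population: two equally sized groups whose values are centred on 30 and 70 (the numbers and the variable names are assumptions for illustration). A random sample of the whole population is compared with a sample drawn by a flawed method that only ever reaches the first group.

```python
import random

rng = random.Random(42)

# Hypothetical population: two equally sized groups with different typical values.
group_a = [rng.gauss(30, 5) for _ in range(5000)]
group_b = [rng.gauss(70, 5) for _ in range(5000)]
population = group_a + group_b

def mean(values):
    return sum(values) / len(values)

# Both samples have the same size; only the sampling method differs.
random_sample = rng.sample(population, 1000)
biased_sample = rng.sample(group_a, 1000)  # flawed method: only reaches group A

population_mean = mean(population)
random_error = abs(mean(random_sample) - population_mean)
biased_error = abs(mean(biased_sample) - population_mean)

print(f"population mean:       {population_mean:.2f}")
print(f"random sample error:   {random_error:.2f}")
print(f"biased sample error:   {biased_error:.2f}")
```

Both samples are equally large, so the gap between them comes from the sampling method alone, not from sample size.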

It is worth noting a related but distinct idea here, the bias-variance tradeoff, which concerns the model rather than the sample: reducing a model's bias (for example, by making it more flexible) tends to increase its variance, while reducing its variance tends to increase its bias. The goal is to find a sweet spot that keeps both the bias and the variance low.

Poor Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., because of poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. In reality, most data scientists spend a significant part of their time doing just that. For example:

  • If some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually. To deal with outliers, we can either drop those observations, which reduces the size of the dataset, or winsorize them. Winsorized observations are values that were originally outliers but have been capped so that they stay within a specific boundary (for example, a chosen percentile).

  • If some instances are missing a few features (e.g., 5% of your customers didn't specify their income), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median income), or train one model with the feature and one model without it, and so on. Blanks in a dataset can be either missing values or non-existing values. Non-existing values typically appear in surveys, where particular questions were not asked or were left unanswered, so no data exists for them. Missing values are more common in general: a value does exist but was not recorded, and such gaps can often be handled with ordinary techniques such as filling in a typical value.
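The two clean-up ideas above, winsorizing outliers and filling missing values with the median, can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the helper names `winsorize` and `impute_median`, the quantile cut-offs, and the sample numbers are all assumptions, and the quantile lookup uses a simple index-based rule rather than interpolation.

```python
from statistics import median

def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values outside the given quantiles to the quantile bounds."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]   # simple index-based quantile
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]

def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

# Hypothetical measurements with one obvious outlier (500).
measurements = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 500]
capped = winsorize(measurements, lower_q=0.1, upper_q=0.9)
print(capped)

# Hypothetical incomes where two customers left the field blank.
incomes = [30, None, 50, 40, None]
filled = impute_median(incomes)
print(filled)
```

Winsorizing keeps every row, so the dataset does not shrink the way it would if the outliers were dropped; the trade-off is that the capped values are no longer the ones that were measured.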

Irrelevant Features

As the saying goes: garbage in, garbage out. Your model will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a machine learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

  • Feature selection: selecting the most useful features to train on among the existing features.

  • Feature extraction: combining existing features to produce a more useful one (dimensionality reduction algorithms can help).

  • Creating new features by gathering new data.
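As one possible illustration of feature selection, the sketch below ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. This is just one simple heuristic among many, and everything in it is hypothetical: the helper names `pearson` and `select_top_k`, the feature names, and the tiny dataset are made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def select_top_k(features, target, k):
    """Return the names of the k features most correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical data: one relevant feature and one irrelevant feature.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [42, 38, 41, 39, 40, 41],
}
exam_score = [52, 55, 61, 64, 70, 73]

selected = select_top_k(features, exam_score, k=1)
print(selected)
```

Correlation only captures linear relationships with the target, so a feature that scores low here is not necessarily useless; it is a quick first filter rather than a final verdict.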

Overfitting Training Data

Overfitting happens when the model is made to fit the training data too closely, noise included. The results on the training set usually look too good to be true, which is why overfitting is sometimes called fool's gold: whenever such a model sees new data, it performs very poorly. Possible solutions to this problem are:

  • Regularization: simplifying the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model. The amount of regularization to apply during learning can be controlled by a hyperparameter.

  • Gathering more training data

  • Reducing noise in the training data
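To show how a regularization hyperparameter constrains a model, here is a minimal sketch using ridge (L2) regression on a single feature with no intercept. Ridge is swapped in here as one common example of regularization; the post does not name a specific technique. The function name `ridge_slope`, the parameter name `alpha`, and the data are assumptions for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w of a no-intercept ridge regression.

    Minimizes sum((y - w * x) ** 2) + alpha * w ** 2,
    whose closed-form solution is w = sum(x * y) / (sum(x * x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Hypothetical centred data that roughly follows y = 2 * x.
xs = [-2, -1, 0, 1, 2]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

# alpha is the regularization hyperparameter: 0 means no constraint.
for alpha in (0.0, 1.0, 10.0):
    print(f"alpha={alpha:>4}: slope={ridge_slope(xs, ys, alpha):.3f}")
```

A larger `alpha` pulls the slope towards zero, which makes the model simpler and less able to chase noise. Set it too high, though, and the model becomes too constrained to fit the real pattern.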

Underfitting Training Data

Underfitting is the opposite of overfitting: it occurs when a model is too simple to learn the underlying structure of the data. This often leads to high unexplained variance, because the model cannot account for the variation in the data. Solutions to this problem include:

  • Selecting a more powerful model, with more parameters

  • Feeding better features to the learning algorithm (feature engineering)

  • Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
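Underfitting can be spotted on the training data itself: a model that is too simple has a high error even on the examples it was trained on. The sketch below compares a constant model (always predict the mean) with a straight-line model on data that clearly follows a linear trend. The data and the helper name `mse` are hypothetical, chosen only to illustrate the first solution in the list above.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical training data: a linear trend plus a small alternating wobble.
xs = list(range(10))
ys = [3 * x + 2 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

# Model 1 (too simple): always predict the mean of the targets.
mean_y = sum(ys) / len(ys)
constant_mse = mse([mean_y] * len(ys), ys)

# Model 2 (more powerful): a least-squares straight line.
mean_x = sum(xs) / len(xs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
linear_mse = mse([slope * x + intercept for x in xs], ys)

print(f"training MSE, constant model: {constant_mse:.2f}")
print(f"training MSE, linear model:   {linear_mse:.2f}")
```

The constant model has no parameter that can follow the trend, so its training error stays high no matter how much data it sees; moving to a model with more parameters fixes that.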

So, these are the basic problems you are going to face while training your model. If you found this helpful, show your love by hitting the heart button, and fill in the newsletter form in the footer to stay updated.

Copyright © 2020 MR. Machine. All Rights Reserved