Random Machine Learning Tips

stillbigjosh
Oct 17, 2018


I often get random thoughts and discover interesting tips during my research or while taking a break. I usually share the ones I find fascinating, but a few distinguished tips would go to waste if I didn't write them down.

1. Scaling datasets
We sometimes scale features such as age or time of day by dividing every value by the maximum found in the data.
I find this method seriously lacking because it disconnects the feature from a realistic baseline, even though it can work in practice. Take time as a prime example: our clockwork is 24 hours a day, but a limited dataset might have 22:00 as its maximum value. What about the other 2 hours? The better dividing factor is the realistic span of a day, 24 hours, since a model ought to reflect the human condition.
Likewise, when scaling age as a feature, the better dividing factor is the life expectancy of the population. The same applies to any other feature with a realistic baseline.
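A minimal sketch of the two approaches side by side. The sample values and the life-expectancy figure are made up for illustration:

```python
# Time-of-day feature: scale by the realistic 24-hour baseline rather
# than by whatever maximum happens to appear in the data.
hours = [2.0, 9.5, 22.0]  # toy sample; 22.0 is the observed maximum

scaled_by_observed = [h / max(hours) for h in hours]  # dataset-dependent
scaled_by_domain = [h / 24.0 for h in hours]          # fixed, realistic baseline

# Age feature: divide by the population's life expectancy
# (72 years is an assumed figure here, not from any real dataset).
LIFE_EXPECTANCY = 72.0
ages = [18, 36, 54]
scaled_ages = [a / LIFE_EXPECTANCY for a in ages]
```

With the observed maximum, 22:00 scales to exactly 1.0 and the scaling changes whenever the dataset changes; with the 24-hour baseline it scales to about 0.917 and stays comparable across datasets.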

2. NLP false positives
A sentiment-analysis polarity that tests positive does not translate into an approval rating of the subject. Proof: a TextBlob sentiment analysis program will score the statement "Buhari needs to do better" as a positive sentiment (0.5), which is reasonable on its face, but it does not mean I am handing him my approval.

To solve the problem, use preprocessed, semantically analysed corpus data with the same polarity as a baseline to filter out false positives, then run a classification algorithm over the test data. This is highly effective with a domain-specific corpus.
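A minimal sketch of the baseline-filter idea. The `polarity` function below is a hypothetical stand-in for a real scorer such as TextBlob's, and the baseline phrases are invented examples, not a real analysed corpus:

```python
# Hypothetical stand-in for a sentiment scorer (e.g. TextBlob's polarity):
# returns a positive score for statements containing positive words.
POSITIVE_WORDS = {"better", "good", "great"}

def polarity(text):
    words = set(text.lower().split())
    return 0.5 if words & POSITIVE_WORDS else 0.0

# Baseline of pre-analysed statements that carry positive surface polarity
# but a known non-approving stance toward the subject (invented examples).
FALSE_POSITIVE_BASELINE = [
    "needs to do better",
    "could be so much better",
]

def is_approval(text):
    """Positive polarity AND no match against the false-positive baseline."""
    if polarity(text) <= 0:
        return False
    t = text.lower()
    return not any(phrase in t for phrase in FALSE_POSITIVE_BASELINE)

print(is_approval("Buhari needs to do better"))  # False: positive polarity, not approval
print(is_approval("Buhari is doing great"))      # True
```

In a real pipeline the baseline would come from a preprocessed, domain-specific corpus, and the final step would be a proper classifier rather than substring matching.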

3. K-fold cross-validation or leave-one-out
Choosing between k-fold cross-validation and leave-one-out when working with imbalanced data depends largely on the degree of imbalance. In a dataset with a high number of classes and few samples per class, leave-one-out is a good choice. K-fold, on the other hand, can create blind spots: whole classes may land entirely in a test fold and be absent from training.
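The blind-spot effect can be sketched with a toy dataset of 5 classes and 2 samples each, using hand-rolled split functions (in practice you would use a library's `KFold` and `LeaveOneOut`):

```python
# Toy dataset: 5 classes, 2 samples per class, ordered by class.
labels = [c for c in range(5) for _ in range(2)]  # [0,0,1,1,2,2,3,3,4,4]
n = len(labels)

def kfold_splits(n, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test

def loo_splits(n):
    """Yield (train, test) index lists, leaving one instance out at a time."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def missing_classes(train, test):
    """Classes that appear in the test set but nowhere in training."""
    return {labels[t] for t in test} - {labels[j] for j in train}

kfold_gaps = sum(len(missing_classes(tr, te)) for tr, te in kfold_splits(n, 5))
loo_gaps = sum(len(missing_classes(tr, te)) for tr, te in loo_splits(n))

print(kfold_gaps)  # 5: each fold removes both samples of one class from training
print(loo_gaps)    # 0: the twin sample always remains in training
```

With k = 5 and no shuffling, every test fold swallows both samples of one class, so the model is trained on a dataset that has never seen that class; leave-one-out always keeps at least one sample of each class in training. (Stratified k-fold also mitigates this, at the cost of needing enough samples per class per fold.)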

Follow my GitHub https://github.com/alojoecee for more interesting research from me. Happy hacking.
