We’re in part six of our series on the Keys to Effective Data Science Projects. I won’t cover basic Feature Engineering in this article – it’s a huge topic, and it’s central to any work in Machine Learning. I do recommend you check out as many articles as you can find on the subject, and once you’ve grasped the basics, you’ll see why this may be the most difficult part of the project.
Note that the most important part of the project is getting the question right. But after you do have that question, you need data to be able to answer it with Machine Learning.
Machine Learning works in a way that loosely resembles what our own brains do. We take in lots of information – and most of it we never use. We focus on the attributes, or features, we think will get us what we want the quickest.
When we walk down the aisle of a grocery store, we pass hundreds of products we aren’t interested in. We’re actually focusing on just what we need – say, chocolate donuts. (One always needs chocolate donuts.) Once we find the aisle that has sweets or pastries, or perhaps the bakery department, we slow down and scan the shelves more carefully, because we know this is where the donuts are likely to be. We do this because donuts are sweet, made from flour and sugar, and are normally kept in those areas. Those attributes – features – are what led to our prediction that this is where the prize is.
And that’s exactly how Machine Learning predictions are formed. If you throw too many features at the algorithm, it can’t make an accurate prediction – there’s simply too much “noise” in the data. If you don’t have enough features, the algorithm also doesn’t return an accurate prediction, because there isn’t enough data to make a choice and learn from. And of course even if you have the right number of features, they need to be features that are likely to predict your outcome. Knowing the price of the items in a grocery store aisle probably isn’t that predictive of the desired chocolate confection. Color, ingredients, size of objects kept on a given aisle – those features might be more predictive.
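To make the “noise” point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the aisles, the feature names (sweetness, flour, price_cents), the numbers, and the choice of a simple nearest-neighbor guess as the algorithm. It predicts whether a new aisle has donuts by finding the most similar aisle we already know about.

```python
# Hypothetical aisles we already know; every value is invented for illustration.
KNOWN_AISLES = [
    {"name": "bakery", "sweetness": 9, "flour": 9, "price_cents": 450, "has_donuts": True},
    {"name": "produce", "sweetness": 3, "flour": 0, "price_cents": 250, "has_donuts": False},
    {"name": "canned goods", "sweetness": 2, "flour": 1, "price_cents": 200, "has_donuts": False},
]


def predict_donuts(new_aisle, features):
    """Guess by copying the answer from the most similar known aisle."""

    def squared_distance(known):
        # Only the chosen features take part in the comparison.
        return sum((known[f] - new_aisle[f]) ** 2 for f in features)

    closest = min(KNOWN_AISLES, key=squared_distance)
    return closest["has_donuts"]


# A new aisle that is sweet and floury, but happens to be cheap.
new_aisle = {"sweetness": 8, "flour": 8, "price_cents": 240}

with_good_features = predict_donuts(new_aisle, ["sweetness", "flour"])
with_price_added = predict_donuts(new_aisle, ["sweetness", "flour", "price_cents"])

print("sweetness + flour:", with_good_features)
print("sweetness + flour + price:", with_price_added)
```

With only sweetness and flour, the new aisle looks most like the bakery. Once price is added, the price differences swamp everything else and the closest match becomes produce. Part of that is because price is measured in cents here and is on a much larger scale than the other two features. Rescaling would soften the effect, but an unhelpful feature still adds noise to the comparison.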
So what is the real Key here in successful projects? There is actually more than one.
The first is that you need some domain experience to be able to start selecting the most predictive features. If you have never seen, eaten, or been exposed to donuts, especially of the chocolate persuasion, you’re less likely to know what features to look for when searching out the right grocery aisle.
This doesn’t mean you can’t do the Data Science work for a medical problem if you’re not a doctor. The point is that you either need to be familiar with the domain, or be able to get someone who is to assist you. You might ask a doctor what attributes they look for in a diseased cell and so on.
The second important Key in this area is repeated experimentation. The danger in Data Science is that you can often get an answer from data – it just may not be the right one. You need to add and subtract features in various combinations to see if your algorithm performs better. And each type of algorithm may need different features, and a different shape for the data those features hold.
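Adding and subtracting features can be as simple as a loop. The sketch below is one way to do it in Python, under some loud assumptions: the six aisles and their numbers are made up, the three feature names are hypothetical, and the “algorithm” is a bare nearest-neighbor guess scored by leaving each aisle out in turn. A real project would use far more data and a proper library.

```python
from itertools import combinations

FEATURES = ("sweetness", "flour", "price")

# Hypothetical aisles: (feature values, 1 if donuts are there, else 0).
AISLES = [
    ({"sweetness": 9, "flour": 8, "price": 3}, 1),
    ({"sweetness": 8, "flour": 9, "price": 5}, 1),
    ({"sweetness": 7, "flour": 7, "price": 2}, 1),
    ({"sweetness": 2, "flour": 1, "price": 4}, 0),
    ({"sweetness": 1, "flour": 2, "price": 3}, 0),
    ({"sweetness": 3, "flour": 1, "price": 5}, 0),
]


def nearest_label(query, others, features):
    """Return the label of the aisle closest to `query` on the given features."""

    def squared_distance(pair):
        row, _label = pair
        return sum((row[f] - query[f]) ** 2 for f in features)

    _row, label = min(others, key=squared_distance)
    return label


def leave_one_out_accuracy(features):
    """Hide each aisle in turn, predict it from the rest, and count the hits."""
    hits = 0
    for i, (row, label) in enumerate(AISLES):
        others = AISLES[:i] + AISLES[i + 1:]
        hits += nearest_label(row, others, features) == label
    return hits / len(AISLES)


# Try every non-empty combination of features and record how each one scores.
scores = {}
for size in range(1, len(FEATURES) + 1):
    for subset in combinations(FEATURES, size):
        scores[subset] = leave_one_out_accuracy(subset)

for subset, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{accuracy:.2f}  {', '.join(subset)}")
```

On this toy data, price by itself does badly, while the combinations that include sweetness or flour do well. Trying every combination only works for a handful of features, since the number of combinations doubles with each one you add.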
It’s interesting to note here that several tools have evolved within Data Science that can actually help you select the features you need. Azure ML has one such, well, feature, and other tools can also ask the data which columns most accurately predict the answer (the label) you are looking for. So Data Science can help you do Data Science.
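As a rough idea of what “asking the data” can look like, here is a small Python sketch that ranks columns by how strongly each one correlates with the label. This is not how Azure ML or any particular tool does it. It is a stand-in using plain Pearson correlation, and the column names and values are invented for illustration.

```python
from math import sqrt

# Hypothetical columns, one value per aisle; the label says whether donuts are there.
COLUMNS = {
    "sweetness": [9, 8, 7, 2, 1, 3],
    "flour": [8, 9, 7, 1, 2, 1],
    "price": [3, 5, 2, 4, 3, 5],
}
LABEL = [1, 1, 1, 0, 0, 0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Score each column against the label, then rank by strength (sign ignored).
correlations = {name: pearson(values, LABEL) for name, values in COLUMNS.items()}
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name:10s} {r:+.3f}")
```

Here flour and sweetness come out on top and price lands at the bottom, which matches the grocery store intuition. A ranking like this only catches simple straight-line relationships, so treat it as a starting point for experiments rather than a final answer.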
Remember, the Key in this step is to know a lot about the data you’re working with and the problem you want to solve – each of the steps in the Team Data Science Process builds on the previous one. And repeated experimentation is important as well.
I’ll be back with another Key next time. I just remembered that I need to run to the grocery store for a moment and pick something up.