Machine Learning (ML) has emerged as one of the most popular fields in the world of artificial intelligence in the past few years. The scope of machine learning is so immense that Gartner estimates there will be 2.3 million jobs in the field of AI and ML by 2022.
With the average salary of a machine learning engineer being higher than the salaries offered to other job profiles in Europe, a career in ML becomes lucrative. A machine learning professional not only gets to chart the course of AI at their organisation but can become a key leader in the AI tech stack being built by the company.
A term first coined by IBM researcher Arthur Samuel, machine learning is now central to the success of every data-driven company. In our deep dive on ML, we have already explained how ML works, its various categories, use cases, and challenges. However, a common question is how to get started with machine learning. We are here to help answer it.
What is machine learning?
For beginners, let’s start by familiarising ourselves with machine learning. Machine Learning is essentially a sub-field of artificial intelligence designed to turn data into numbers and find patterns in those numbers. This ability of computers to find patterns using available data is often referred to as drawing conclusions or inferences.
Organisations around the world have been using machine learning to draw insights that help make business decisions. A machine learning algorithm, according to UC Berkeley, comprises three main components: a decision process, an error function, and a model optimisation process.
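To make those three components concrete, here is a minimal sketch in plain Python. It assumes a one-feature linear model trained by gradient descent on toy data. The data, function names, and learning rate are all illustrative choices of ours, not something prescribed by UC Berkeley or Bourke.

```python
# Toy data generated from y = 2x + 1 (an assumption made for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def predict(w, b, x):
    """Decision process: map an input to a prediction."""
    return w * x + b


def mean_squared_error(w, b):
    """Error function: average squared gap between predictions and targets."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(w, b, learning_rate=0.05):
    """Model optimisation process: one gradient-descent update of w and b."""
    n = len(xs)
    grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


w, b = 0.0, 0.0
initial_error = mean_squared_error(w, b)
for _ in range(2000):
    w, b = gradient_step(w, b)
final_error = mean_squared_error(w, b)
```

After the loop, w and b should sit close to the 2 and 1 used to generate the data, and the final error should be far below the initial one.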
As Daniel Bourke, a machine learning engineer turned YouTuber, observes in one of his popular videos, machine learning “is amazing” but it still requires traditional programming to be successful. In his video titled Machine Learning Roadmap, Bourke says that, where possible, engineers should first build a simple rule-based system that doesn’t require machine learning.
Citing Google’s Machine Learning Handbook, Bourke argues that ML should not be the first option engineers adopt. He instead suggests looking at the problem to be solved before resorting to implementing an ML-based system. He also explains the machine learning process, tools, and even resources.
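As a rough illustration of what a rule-based starting point might look like, the sketch below flags spam with a hand-written keyword list and measures how well that does on a few labelled examples. The keywords, messages, and names are hypothetical and do not come from Bourke's video.

```python
# Hypothetical keyword list; a real system would derive its rules from domain knowledge.
SPAM_KEYWORDS = ("free money", "click here", "winner")


def is_spam(message):
    """Flag a message as spam when it contains any known spam phrase."""
    text = message.lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)


# A handful of made-up labelled messages: (text, is it actually spam?).
examples = [
    ("Click here to claim your FREE MONEY", True),
    ("Meeting moved to 3pm tomorrow", False),
    ("You are our lucky winner!", True),
    ("Lunch on Friday?", False),
]

correct = sum(is_spam(text) == label for text, label in examples)
baseline_accuracy = correct / len(examples)
```

If a baseline like this already meets the need, an ML model may be unnecessary. If it falls short, its accuracy gives a number that any ML model has to beat.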
Machine Learning Process: steps to solve an ML problem
Before understanding the steps to solve a machine learning problem, it is important to understand the problem itself. In Bourke’s words, the most common problem is that engineers and clients tend to apply machine learning to everything. He likens this to putting the cart before the horse while trying to solve the problem of moving things.
There are also various categories of learning such as supervised learning, unsupervised learning, reinforcement learning, and transfer learning. You can read more about these categories here. There is also a need to learn classification and regression.
To address these challenges, Bourke recommends following a series of processes. The first is data collection, where he suggests ML engineers ask what type of problem they are trying to solve and consider the data sources already available, any privacy concerns, where to store the data, and whether the data is public. As part of data collection, it is also important to consider the type of data, which is most easily divided into structured and unstructured data.
The next process is data preparation, which starts with exploratory data analysis, or learning about the data you will be working with. The exploratory phase is followed by data processing, where engineers prepare the data for modelling. This includes filling in missing data, turning values into numbers, scaling or standardising the data, transforming data into meaningful representations, selecting the most valuable features of the dataset, and dealing with imbalances.
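Three of those processing actions (filling in missing data, scaling, and turning values into numbers) can be sketched with the standard library alone. The records and column names below are hypothetical, and mean imputation and one-hot encoding are just one common choice for each step.

```python
from statistics import mean, pstdev

# Hypothetical raw records: one numeric feature with a gap, one categorical feature.
rows = [
    {"age": 25.0, "city": "Amsterdam"},
    {"age": None, "city": "Utrecht"},
    {"age": 35.0, "city": "Amsterdam"},
    {"age": 45.0, "city": "Rotterdam"},
]

# 1. Fill in missing data: replace a missing age with the mean of the known ages.
known_ages = [r["age"] for r in rows if r["age"] is not None]
age_mean = mean(known_ages)
ages = [r["age"] if r["age"] is not None else age_mean for r in rows]

# 2. Scale: standardise ages to zero mean and unit variance.
mu, sigma = mean(ages), pstdev(ages)
scaled_ages = [(a - mu) / sigma for a in ages]

# 3. Turn values into numbers: one-hot encode the city column.
cities = sorted({r["city"] for r in rows})
one_hot = [[1 if r["city"] == c else 0 for c in cities] for r in rows]

# Final numeric feature rows: [scaled_age, one column per city].
features = [[s] + h for s, h in zip(scaled_ages, one_hot)]
```

In a real project the mean and standard deviation would normally be computed on the training data only and then reused for validation and test data, so that no information leaks from the held-out sets.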
The next step is splitting the data into three sets. The training set, usually 70 to 80 per cent of the dataset, is what the ML model learns from. A further 10 to 15 per cent is used as a validation set, on which the model’s hyperparameters are tuned, and the remaining 10 to 15 per cent is held back as a test set for final evaluation.
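A split along these lines could look like the sketch below, which assumes a 70/15/15 division and a fixed random seed. The function name and defaults are our own illustrative choices.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle a copy of the samples and cut it into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left after train and validation
    return train, val, test


train, val, test = split_dataset(range(100))
```

Shuffling before cutting avoids accidental ordering effects, and the three lists never share a sample, so the test set stays unseen until final evaluation.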
Once the data is split, an ML engineer needs to train the model on the training data. For this, Bourke describes three steps: choosing an algorithm, overfitting the model, and reducing overfitting with regularisation. We recommend watching this video to understand these steps in detail. The next process involves analysis, deployment, and retraining the model for accurate results.
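To give a feel for what regularisation does, the sketch below fits a single weight with and without an L2 penalty (ridge regression), using the closed-form solution for a one-feature model with no intercept. The data and penalty value are made up, and L2 is only one of several regularisation techniques. The source does not say which one Bourke has in mind.

```python
def fit_ridge_1d(xs, ys, l2_penalty):
    """Minimise sum((w*x - y)^2) + l2_penalty * w^2 for a single weight w (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + l2_penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2_penalty)


# Hypothetical noisy samples of roughly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_plain = fit_ridge_1d(xs, ys, l2_penalty=0.0)        # ordinary least squares
w_regularised = fit_ridge_1d(xs, ys, l2_penalty=5.0)  # weight is pulled toward zero
```

The penalty shrinks the weight toward zero. In larger models the same pressure discourages the extreme weights that tend to come with overfitting, at the cost of a slightly worse fit to the training data.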
Machine Learning tools: what should you use to build your ML solution?
At ai.nl, we have extensively covered the Python-based solutions available for ML engineers. Bourke suggests classifying the tools into two categories: libraries and toolbox. He further divides the toolbox into pretrained models, experiment tracking, data and model tracking, cloud compute services, hardware, AutoML, explainability, and machine learning lifecycle.
For Python-flavoured libraries, Bourke recommends Scikit-Learn, PyTorch, TensorFlow, and ONNX. You can read about all these open-source tools here. For transfer learning, he recommends TensorFlow Hub, PyTorch Hub, HuggingFace Transformers for NLP, and Detectron2 for computer vision.
For experiment tracking, an ML engineer can rely on TensorBoard, Dashboard by Weights & Biases, or neptune.ai, while data and model tracking can be done using either Artifacts by Weights & Biases or Data Version Control (DVC).
For cloud compute services, Bourke recommends Google Colab, which offers free GPU-powered Jupyter Notebooks. An ML engineer can also look at SageMaker from AWS, AI Platform from Google Cloud Platform, or Azure Machine Learning from Microsoft Azure. For hardware, it is important to understand the right GPU for your workload before going on a mission to build a $1,000 PC.
In the past few years, we have also seen an explosion in the use of AutoML, which automates model building and hyperparameter tuning for your dataset. From TPOT and Google Cloud AutoML to Microsoft Automated Machine Learning, Sweeps by Weights & Biases, and Keras Tuner, there are a number of options for engineers. For explainability, there are the What-If Tool and SHAP values.