
How AutoML Can Simplify ML Model-Building

AutoML is the process of automating the tasks for applying machine learning to real-world problems and covers the complete pipeline from the raw dataset to the deployable machine learning model.

As with every technology, especially in the world of AI, marketing and sales hype tends to inflate the capabilities of AutoML in the minds of enterprise buyers. Articles about AutoML replacing data scientists or making machine learning a candidate for “citizen developers” give the impression that AutoML is a drag-and-drop interface that anyone can use to unlock machine learning across the enterprise. While we’re not quite there yet, AutoML does simplify the creation of machine learning models and enables rapid experimentation and prototyping when coupled with a well-understood dataset.

In a 2015 presentation on AutoML, Rich Caruana of Microsoft wrote: “75 percent of machine learning is preparing to do machine learning … and 15 percent is what you do afterward.” That leaves 10 percent of the work to be done by the machine learning models themselves. AutoML exists to simplify, automate and accelerate the 75 percent preparation phase.

The best way to think about how AutoML can be applied to a business problem is to use an example:

BigCo has a mobile application that is viewed as central to the company’s future. It’s a complex cloud-native application with backend services that provide data to BigCo’s app on customer devices. BigCo has noticed that defects in its defect tracking system are taking longer to resolve, and historical trend analysis confirms this. As an important release approaches, BigCo’s leadership wants to know how much defect resolution times will affect deployment. To plan resourcing and the release date, they want the application team’s best estimate of how long currently open defects, and defects raised in the future, will take to resolve.

To provide this estimate, the team could just extrapolate the trend analysis on resolved defects, but they want to provide more than a single summarized resolution time estimate. They want detailed estimates for different categories of defects, based on ML models run against the data stored in their defect tracking system. With thousands of rows and 28 columns of inconsistently filled-out data, where do they start?

This is where AutoML can provide a shortcut to help BigCo solve a business problem. The team splits their data into three buckets. The first holds resolved defects used to train the ML models. The second holds resolved defects kept out of training, which will be used to test the accuracy of the models. The third holds unresolved defects and sample defects. The team will use the tested models to predict the resolution times of the issues in the third bucket.
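
The three-bucket split can be sketched with pandas. This is a minimal illustration, not the team's actual code: the status values "Resolved" and "Open", the `resolution_days` column, the 20 percent test share and the example rows are all assumptions made for the sketch.

```python
import pandas as pd


def split_defects(defects: pd.DataFrame,
                  status_col: str = "Resolved or Open",
                  test_fraction: float = 0.2,
                  seed: int = 0):
    """Return (train, test, to_predict) frames from one defect export."""
    # Only resolved defects have a known resolution time to learn from.
    resolved = defects[defects[status_col] == "Resolved"]
    # Everything else is what the team wants predictions for.
    to_predict = defects[defects[status_col] != "Resolved"]
    # Hold back a random share of resolved defects to test the models.
    test = resolved.sample(frac=test_fraction, random_state=seed)
    train = resolved.drop(test.index)
    return train, test, to_predict


# Tiny made-up example, not BigCo's data.
example = pd.DataFrame({
    "Resolved or Open": ["Resolved"] * 10 + ["Open"] * 3,
    "resolution_days": [float(i) for i in range(1, 11)] + [None] * 3,
})
train, test, to_predict = split_defects(example)
print(len(train), len(test), len(to_predict))
```

Keeping the test rows out of training is what makes the later accuracy check meaningful: the models are scored on defects they have never seen.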

In this case, the team uses an AutoML tool released by Amazon called AutoGluon to analyze the data, build the models and apply those models to their open defects. By writing some Python code to import the correct AutoGluon classes and execute their methods, the team can use AutoML to accelerate the initial 75 percent of ML effort.
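
The article does not show the team's script, so the following is only a sketch of what such code might look like. It uses AutoGluon's `TabularPredictor` class and its `fit` method; the function name `train_resolution_model` and the label column `resolution_days` are hypothetical.

```python
import pandas as pd


def train_resolution_model(train_df: pd.DataFrame,
                           label: str = "resolution_days",
                           time_limit: int = 300):
    """Fit AutoGluon on resolved defects and return the fitted predictor."""
    # Imported lazily so the rest of a script loads without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    # `label` names the column to predict (hypothetical column name here).
    predictor = TabularPredictor(label=label)
    # time_limit is in seconds; 300 matches the time limit reported in the
    # training log.
    predictor.fit(train_df, time_limit=time_limit)
    return predictor
```

Calling `train_resolution_model(train_df)` on a frame of resolved defects would start the preprocessing and model fitting that produce the log output.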

Let’s look at the output of the Python code the team wrote to see how AutoGluon helps “prepare to do machine learning”:

Beginning AutoGluon training ... Time limit = 300s
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
        Label info (max, min, mean, stddev): (294.7052199, 0.228738426, 28.3506, 42.71245)
        Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
        Stage 1 Generators:
                Fitting AsTypeFeatureGenerator...
                        Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
        Useless Original Features (Count: 7): ['Parent Issue Key (Ex)', 'Parent Link', 'Resolved or Open', 'Product', 'Device', 'Device Platform:', 'RCA']
                These features carry no predictive signal and should be manually investigated.
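
To make the two inferences in this log more concrete, here is a deliberately simplified sketch. It is not AutoGluon's implementation: the `unique_ratio` threshold, the function names and the sample columns `resolution_days` and `Priority` are assumptions. It guesses "regression" when the label is a float with many distinct values, and treats a column as useless when every row holds the same value, which is one common reason a feature carries no signal.

```python
import pandas as pd


def guess_problem_type(label: pd.Series, unique_ratio: float = 0.05) -> str:
    """Rough stand-in for problem-type inference; not AutoGluon's real logic."""
    n_unique = label.nunique()
    if n_unique == 2:
        return "binary"
    # A float label with many distinct values looks like a quantity to predict.
    if pd.api.types.is_float_dtype(label) and n_unique / len(label) > unique_ratio:
        return "regression"
    return "multiclass"


def constant_columns(df: pd.DataFrame) -> list:
    """Columns with at most one distinct value, which cannot help a model."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Made-up rows; only "Resolved or Open" is a column name taken from the log.
sample = pd.DataFrame({
    "resolution_days": [0.5, 3.2, 12.0, 28.4, 41.1],
    "Resolved or Open": ["Resolved"] * 5,
    "Priority": ["High", "Low", "Low", "Medium", "High"],
})
print(guess_problem_type(sample["resolution_days"]))  # regression
print(constant_columns(sample))  # ['Resolved or Open']
```

In training data made up only of resolved defects, a status column like 'Resolved or Open' would hold one value throughout, which would explain why it appears in the useless-features list.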

After AutoGluon completes preprocessing and feature generation, it fits and trains the ML models.

…Fitting model: KNeighborsUnif … Training model for up to 299.49s of the 299.48s of remaining time.

Once the models have been trained, the team can use them to predict resolution times for the test data, then compare those predictions with the actual resolution times to evaluate the accuracy of the models.

Prediction of Resolution times by row:
0      13.983104
1      28.530710
2      13.991171
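
Comparing predictions with actual resolution times usually comes down to a single error figure. The sketch below computes root mean squared error (RMSE) by hand; RMSE is AutoGluon's default metric for regression, but whether the team kept that default is an assumption, and the actual values are invented for illustration.

```python
import numpy as np


def rmse(actual, predicted) -> float:
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# With a fitted AutoGluon predictor, these would come from
# predictor.predict(test_df); here they are copied from the output above.
predicted_times = [13.983104, 28.530710, 13.991171]
# Made-up actual resolution times, for illustration only.
actual_times = [10.0, 30.0, 20.0]

print(rmse(actual_times, predicted_times))
```

AutoGluon can also produce this kind of score directly through `predictor.evaluate(test_df)`, as long as the test frame still contains the label column.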

The team can also view evaluations of each individual model.

                  model  score_test
0              CatBoost   -5.915733
1         LightGBMLarge   -6.150060
2   WeightedEnsemble_L2   -6.502910
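
A table like this is what AutoGluon's `predictor.leaderboard(test_df)` returns. The scores are negative because AutoGluon reports every metric so that higher is better, which means error metrics are negated. The sketch below rebuilds the table by hand and turns the scores back into positive errors; which metric was used is not stated in the output, so treating it as RMSE is an assumption.

```python
import pandas as pd

# Values copied from the leaderboard output above.
leaderboard = pd.DataFrame({
    "model": ["CatBoost", "LightGBMLarge", "WeightedEnsemble_L2"],
    "score_test": [-5.915733, -6.150060, -6.502910],
})

# Flip the sign to read each score as an error in the label's own units.
leaderboard["error"] = -leaderboard["score_test"]

# The highest score_test (the smallest error) marks the best model on test data.
best_model = leaderboard.loc[leaderboard["score_test"].idxmax(), "model"]
print(best_model)  # CatBoost
```

Read this way, CatBoost has the lowest test error of the three models shown, with the weighted ensemble slightly behind it on this particular test set.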

Instead of doing extensive data preparation and preprocessing, the team was able to run their existing data through AutoGluon. It inferred the prediction problem that needed to be solved and sorted through the tabular data to assign data types and importance. Using what it learned, it trained several models using the training data, which were then tested and evaluated before being used to predict the resolution time of currently open defects. Based on experience with this example, the team can use AutoGluon to quickly process data for predictive analysis without needing a data scientist to help with preprocessing and feature engineering. While this is ideal for a small project or a prototype, more skill and effort may need to be applied for larger, more complex ML projects.
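
Since the team wanted estimates per category of defect rather than one summary number, the predictions for open defects can be grouped afterward. This is a sketch under assumptions: the `Priority` column, the `predicted_days` name and the sample values are hypothetical, and the predictions would normally come from the trained AutoGluon predictor.

```python
import pandas as pd


def summarize_by_category(open_defects: pd.DataFrame,
                          predictions,
                          category_col: str = "Priority") -> pd.DataFrame:
    """Count, mean and max predicted resolution time per defect category."""
    report = open_defects[[category_col]].copy()
    # Attach one prediction per open defect, in row order.
    report["predicted_days"] = list(predictions)
    return report.groupby(category_col)["predicted_days"].agg(["count", "mean", "max"])


# Made-up open defects and predictions, for illustration only.
open_defects = pd.DataFrame({"Priority": ["High", "Low", "High"]})
summary = summarize_by_category(open_defects, [10.0, 4.0, 20.0])
print(summary)
```

A per-category table like this gives leadership more to plan with than a single average, because the slowest categories are visible on their own rows.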

While AutoGluon’s implementation of AutoML does not replace the need for data scientists or create a drag-and-drop interface for machine learning, it does provide a shortcut through the most time-consuming parts of the machine learning development process. This reduces the investment of time and resources normally required to begin an ML project and thus allows organizations to apply the technology to problems not normally associated with ML.

Today, AutoML is not suitable for every project. Using ML to detect fraud by analyzing millions of financial transactions, or to aid in accurate cancer diagnoses by analyzing radiological reports, requires the skills of an experienced data scientist. Instead, AutoML can and should be used to solve smaller real-world problems, like improving the performance of a company’s customer support helpdesk by analyzing ticket data, or identifying opportunities for savings by analyzing expense reports.
