Self-Healing Testing using
Deep Learning Algorithms
Why Deep Learning?
I’m part of the DBS iTES team, which supports the automation testing framework used by Software Development Engineers in Test (SDETs) to test Treasury & Markets (T&M) applications. Each day, numerous regression tests are conducted by various Investment Trading Technology (ITT) project teams to verify application stability before their applications go live.
From time to time, project teams may encounter test failures due to upgrades or UI changes, which take plenty of time to debug and resolve. With the exponential growth in the Artificial Intelligence (AI) space, our team was curious to find out whether we could leverage some of the available solutions to resolve the above-mentioned issues while running tests.
The selection process for which feature to roll out was rather straightforward. At one of our guild meetings, our team conducted a poll, putting forth several candidate features for SDETs to choose from. From the poll, we found that 47% of the test engineers would like AI to automatically fix their regression errors for them.
It is the End Result That Matters
Early on, it was clear to our team that we needed a proof of concept (POC) to show that our wild idea was worthy of further pursuit.
“To get what you want, you have to deserve what you want. The world is not yet a crazy enough place to reward a whole bunch of undeserving people.” ― Charles T. Munger
With reference to papers and studies by other companies who have applied some form of AI/ML for testing, our team decided to explore the possibility of a mixed technique consisting of both visual recognition and element properties weight optimisation to resolve broken locators.
Rapid prototyping was the approach chosen to get the project off the ground. First, a simple Object Detection Model, trained on images retrieved from regression-testing screenshots, was deployed as an application on IBM® Red Hat® OpenShift®. This was followed by the development of an interface between our testing framework and the AI application (VSDojo) hosted on OpenShift, which sends over screenshots of the errors and receives the predicted results.
What if the situation was more complex? Does the model hold up for the various user interface (UI) design within the ITT space?
Truth be told, there were times where the model did not perform to expectations. But more on that later.
We will go into detailed explanations of the architectural design, strategy, and decision-making later in this article. Briefly, the current Object Detection Model is trained on manually labelled data (supervised learning) to detect only two classes of web elements: buttons and inputs.
We used an 80/20 train/test split while keeping certain projects out of the training sample. Our team manually labelled each image using the Python tool LabelImg, which conveniently saves the coordinates and class definitions as XML that is then used to train our model. Further evaluation was conducted on the 20% test split to ensure that the model can be used beyond the images it was trained on.
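LabelImg writes one Pascal VOC XML file per image; a minimal sketch of turning such a file into (class, bounding-box) training labels might look like this (the file contents below are illustrative, not from our actual dataset):

```python
# Parse a LabelImg (Pascal VOC) annotation into (label, box) pairs.
# The filename and element classes here are illustrative examples.
import xml.etree.ElementTree as ET

SAMPLE_XML = """<annotation>
  <filename>login_page.png</filename>
  <object>
    <name>button</name>
    <bndbox><xmin>40</xmin><ymin>200</ymin><xmax>160</xmax><ymax>240</ymax></bndbox>
  </object>
  <object>
    <name>input</name>
    <bndbox><xmin>40</xmin><ymin>120</ymin><xmax>300</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((label, coords))
    return boxes

print(parse_voc(SAMPLE_XML))
# [('button', (40, 200, 160, 240)), ('input', (40, 120, 300, 150))]
```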
The Journey, Challenges Faced, Hurdles We Overcame
Now, let’s take a closer look at the overall architecture. The diagram below might look overwhelming, but it can be broken down into three broad categories: Object Detection Model (OpenShift), Logging (ELK) and DriveX (Object Storage).
The process kicks off the moment SDETs run their tests via Testcenter’s Serenity framework; upon failure, a screenshot is captured and sent to the Object Detection Model hosted on OpenShift (VSDojo). As mentioned in the previous section, the model then returns its prediction and the coordinates of detected web elements back to the framework.
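The exchange between the framework and VSDojo can be sketched roughly as follows; the payload field names and the response shape are assumptions for illustration, not the actual API contract:

```python
# Sketch of the framework-to-VSDojo exchange: the failing step's screenshot
# is base64-encoded into a JSON payload, and the model replies with detected
# classes and coordinates. Field names here are hypothetical.
import base64
import json

def build_payload(screenshot_bytes, test_id):
    """Package a failure screenshot for the detection service."""
    return json.dumps({
        "testId": test_id,
        "image": base64.b64encode(screenshot_bytes).decode("ascii"),
    })

# An illustrative response shape (not the real schema):
fake_response = {
    "predictions": [
        {"class": "button", "confidence": 0.91, "box": [40, 200, 160, 240]},
    ]
}

payload = build_payload(b"\x89PNG...", "TM-login-001")
print(json.loads(payload)["testId"])  # TM-login-001
```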
The model’s ability to generalise across multiple ITT projects without having to programmatically specify conditions is not something to be sniffed at! In order to appreciate its elegance, it is necessary to understand “deeply” how it works.
Traditionally, the first thing that comes to mind when dealing with computer vision is Convolutional Neural Net (CNN) models. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialised kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
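To make the point concrete, here is convolution stripped to its essentials: a small kernel sliding over the input and taking a dot product at each position (written as cross-correlation, the form most deep-learning layers actually implement):

```python
# A minimal 2-D "valid" convolution in NumPy: each output cell is the
# dot product of the kernel with one window of the input.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
edge = np.array([[1., -1.]])  # horizontal difference kernel
print(conv2d(image, edge))
# [[-1. -1.]
#  [-1. -1.]
#  [-1. -1.]]
```

Stacking many such kernels, with learned weights, is all a convolutional layer does; the network learns which kernels extract useful features.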
In short, the kernel strides across the input image to extract key features, and the resulting feature maps are flattened before being passed through the fully connected layers. The final layer is typically a softmax or sigmoid function that outputs the predicted result. Backpropagating the errors between predictions on training images and their corresponding labels is the machine-learning step that allows the model to learn and classify other, similar images.
Since our team requires a solution that is not only able to identify the type of web elements but also their coordinates, the Object Detection Model lends itself naturally as a good solution.
There are certainly many other object detection architectures currently practiced in the AI industry, including Single Shot Detection (SSD), ResNet, Faster RCNN, and You Only Look Once (YOLO). Each of them has their trade-offs, for instance, a larger model would have slower inference time for higher accuracy.
Our team chose to stick with YOLOv3 due to its low inference time of 22 ms, to minimise impact and delay to the testing regression. The superior speed of YOLOv3 can be attributed to its unique bounding-box strategy. For more information, refer to the original paper linked in the references below.
Finally, to overcome the problem of localisation failure, we plan to increase the amount of data used to train the model. We plan to do this using data augmentation, which involves generating different variations of the image by rotating and inverting the original images for improved generalisation by the model. This feature is included in the Tensorflow library under the pre-processing layer.
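The same label-preserving flips and rotations that TensorFlow’s preprocessing layers (e.g. RandomFlip, RandomRotation) apply can be sketched in plain NumPy:

```python
# Generate simple deterministic variants of an image array, mirroring the
# idea behind tf.keras preprocessing layers (this NumPy version is a
# stand-in sketch, not our actual training pipeline).
import numpy as np

def augment(image):
    """Return simple flipped/rotated variants of a (H, W) image."""
    return [
        image,
        np.fliplr(image),      # horizontal mirror
        np.flipud(image),      # vertical mirror
        np.rot90(image, k=2),  # 180-degree rotation
    ]

img = np.arange(6).reshape(2, 3)
variants = augment(img)
print(len(variants))  # 4
```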
Now that we have a better understanding of the model, let’s explore the next key piece: ELK logging. One key point made by our Tech Lead before our team kicked off the project was to ensure a proper test bed that would measure the performance of our proof of concept, instead of simply measuring the Intersection over Union (IoU) or accuracy of our model.
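For reference, IoU is simply the overlap area of a predicted and a ground-truth box divided by the area of their union; a minimal implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333...
```

A high IoU only says a box was drawn in the right place; it says nothing about whether the test actually healed, which is why we log end-to-end outcomes instead.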
“If you cannot measure it, you cannot improve it.” ― William Thomson, Lord Kelvin.
Staying true to the advice given, our team addressed the need by piping the results from the framework directly into ELK. Through the dashboard, we can pinpoint exactly where the model is not performing well, along with the stack trace which can be used to further optimise our model.
The final essential component for the proof of concept would be the storage of metadata to support the sustainability of the model’s performance over time. The main idea revolves around storing past success and failure information in DriveX (Object Storage) that can be used as training data for the model.
The other question to be addressed was what kind of information our team should retain in order to automate the training process. For each successful test run, the original selector, coordinates, outer HTML and, most importantly, images of the web elements were kept. This data would eventually be the cornerstone on which to build future models.
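A hypothetical shape for one such DriveX record (the field names and paths are illustrative, not the actual schema):

```python
# Illustrative metadata record for one successfully located web element;
# every field name and path below is a hypothetical example.
import json

record = {
    "selector": "#btn-submit",
    "coordinates": {"xmin": 40, "ymin": 200, "xmax": 160, "ymax": 240},
    "outerHtml": '<button id="btn-submit" class="primary">Submit</button>',
    "screenshot": "drivex://vsdojo/runs/2021-07-01/login_page.png",
    "outcome": "success",
}
print(json.dumps(record, indent=2))
```

Accumulating records like this is what would let retraining be triggered automatically rather than by hand-labelling new screenshots.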
To recap, the process flow begins with SDETs running regression via our framework. If there are web element failures due to UI changes, an API call is made to the model hosted on OpenShift to retrieve predictions, allowing self-healing to take place. Testing logs are sent to ELK to track real-time performance, and finally, metadata is sent to DriveX to train future models.
As such, we decided to stick with using the visual recognition model to identify possible matches for a failure. During subsequent phases, we used a weighted optimisation model to pick the best match out of all the returned web elements.
The Weakest Link
But we have yet to address the elephant in the room: the selection of the best-matching element when multiple web elements are detected by the model. What we are doing right now is simply using regex to slice the outer HTML of our model’s predictions and match it against the original input selector.
This approach is rather primitive, and we foresee issues such as false-positive selections arising from the nature of the implementation. Therefore, we plan to train another model to select the best-matching web element for any given selector. Essentially, we will have two models: one for detecting elements from the webpage on failure, and a separate model to pick the right element for self-healing.
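A rough sketch of the current regex approach, counting how many selector tokens reappear in each candidate’s attributes (the choice of tokens and attributes here is an assumption for illustration):

```python
# Score each predicted element's outer HTML against the original selector
# by counting shared tokens. A deliberately primitive heuristic sketch.
import re

def match_score(selector, outer_html):
    """Count selector tokens that also appear in the element's attributes."""
    sel_tokens = set(re.findall(r"[\w-]+", selector.lower()))
    attr_values = re.findall(r'(?:id|name|class)="([^"]*)"', outer_html.lower())
    attr_words = set()
    for value in attr_values:
        attr_words.update(value.split())  # split multi-valued class lists
    return len(sel_tokens & attr_words)

candidates = [
    '<button id="submit-btn" class="primary">OK</button>',
    '<input name="username" class="field">',
]
best = max(candidates, key=lambda html: match_score("#submit-btn", html))
print(best)  # the <button> element
```

Ties and partial overlaps are exactly where this heuristic produces false positives, which is the motivation for the second, learned model.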
The list of crucial pending tasks includes piloting the proof of concept on a couple of projects to validate its performance live, as well as building the weighted optimisation model, trained on metadata from successful test runs, to select the best match.
Once the proof-of-concept results shown in ELK have proven useful, our team will place greater emphasis on the sustainability and scalability of the system, mainly through the ability to retrain and replace the model automatically if the performance of the existing model begins to decay.
Deep Dive into Data
When we retrieved statistics from the ELK Stack, to which we had been diligently sending our testing results, we found that the AI model helped resolve 14.32% of the total test cases, which looks underwhelming at first glance. However, this proof of concept was scoped to fixing only input- and button-related errors.
After closer inspection, if we only consider input- and button-related test errors, the model has a successful fix rate of 77%, as seen in the chart below.
Our team also made some minor changes to improve the overall performance of the system, mainly the introduction of VSMode, which consists of three workflow states: training mode, which is only activated during local machine test runs; evaluation mode, which is activated during the CI/CD pipeline; and finally, disabled mode for edge cases like performance testing, which does not need the feature.
These workflow states allow the model to learn from new data as the SDETs develop new test cases, as well as perform self-healing on failed test cases within the CI/CD pipeline, improving the overall efficiency of computing resources.
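The three VSMode states can be modelled as a simple enum; the environment-variable convention used to resolve the state below is an assumption for illustration, not our actual configuration:

```python
# A sketch of the three VSMode workflow states; the env-var names used to
# pick a state are hypothetical.
from enum import Enum

class VSMode(Enum):
    TRAINING = "training"      # local runs: collect labelled data
    EVALUATION = "evaluation"  # CI/CD runs: self-heal failed locators
    DISABLED = "disabled"      # e.g. performance tests: skip the feature

def resolve_mode(env):
    if env.get("CI") == "true":
        return VSMode.EVALUATION
    if env.get("PERF_TEST") == "true":
        return VSMode.DISABLED
    return VSMode.TRAINING

print(resolve_mode({"CI": "true"}))  # VSMode.EVALUATION
```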
Getting your hands dirty
Inspired to get started with your own micro-innovation? Here are some articles and video tutorials to help you kick things off. If you are sitting on the fence whether you should start your own AI journey, know that motivation comes after action — all you need is to be brave enough to take the first step!
A good place to start would be the MNIST dataset, where we train a model to predict handwritten digits from 0 to 9. I highly recommend that beginners use high-level libraries such as Keras, PyTorch or scikit-learn to train their model.
These libraries make the coding much more intuitive and easier to learn. Once you have mastered any of them, proceed to learn TensorFlow, which allows greater control. Check out the Deep Learning with Python, TensorFlow, and Keras tutorial.
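As a lighter stand-in for the full MNIST tutorial, here is the same supervised train/test-split-then-evaluate pattern on scikit-learn’s built-in 8×8 digits dataset:

```python
# Train a simple classifier on scikit-learn's small digits dataset (0-9)
# using the same 80/20 split-then-evaluate workflow described above.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)           # 8x8 handwritten digits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)      # hold out 20% for evaluation
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Swapping this classifier for a Keras CNN on full-size MNIST is the natural next step once the workflow feels familiar.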
This exercise is a simple follow-along supervised-learning tutorial. The beauty of machine learning models is their flexibility and ability to generalise across various unseen inputs, much like our VSDojo model’s ability to generalise across multiple ITT web applications without application-specific code in place.
Interested to learn more? Check out the latest cutting-edge techniques that industry leaders are practicing or using within their projects here. The possibility for learning is endless but the best way to learn is to apply what you have learnt in a project. It can even be a small feature that does simple optimisation.