Have you ever tried to forget something you had already learned? You can imagine how difficult that would be.
As it turns out, forgetting information is also difficult for machine learning (ML) models. So what happens when these algorithms are trained on outdated, incorrect or private data?
It is hugely impractical to retrain the model from scratch every time a problem occurs with the original data set. This has led to the demand for a new field in AI called machine unlearning.
With new lawsuits being filed what seems like every other day, the need for ML systems to effectively ‘forget’ information is becoming paramount for businesses. Algorithms have proven to be incredibly useful in many fields, but the inability to forget information has significant implications for privacy, security and ethics.
Let’s take a closer look at the nascent field of machine unlearning: the art of teaching artificial intelligence (AI) systems to forget.
Understanding machine unlearning
Machine unlearning is the process of erasing the influence that specific data sets have had on an ML system.
Most often, when a concern arises with a data set, the fix is to modify or simply delete it. However, in cases where the data has already been used to train a model, things can get tricky. ML models are essentially black boxes. This means that it is difficult to understand exactly how specific data sets affected the model during training, and even more difficult to undo the effects of a problematic data set.
OpenAI, the creators of ChatGPT, have repeatedly come under fire regarding the data used to train their models. A number of generative AI art tools are also facing legal battles over their training data.
Privacy concerns have also been raised after membership inference attacks have shown that it is possible to infer whether specific data was used to train a model. This means that the models can potentially reveal information about the people whose data was used to train them.
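To make the idea concrete, here is a minimal sketch of one simple form of membership inference, a loss-threshold test: records on which the model's error is unusually low are guessed to be training members. The toy "memorizing" model, the helper names and the threshold value are all illustrative assumptions, not a description of any specific published attack.

```python
from statistics import mean


def train_memorizing_model(examples):
    """Toy overfit model (assumption): memorizes every training label
    and falls back to the mean training label for unseen inputs."""
    table = dict(examples)
    fallback = mean(label for _, label in examples)
    return lambda x: table.get(x, fallback)


def squared_loss(model, x, y):
    """Squared error of the model's prediction on one record."""
    return (model(x) - y) ** 2


def infer_membership(model, candidates, threshold):
    """Guess 'member' when the loss on a record falls below the threshold."""
    return [squared_loss(model, x, y) < threshold for x, y in candidates]


train = [(1, 10.0), (2, 20.0), (3, 30.0)]
model = train_memorizing_model(train)

# The first candidate was in the training set, the second was not.
candidates = [(2, 20.0), (7, 50.0)]
guesses = infer_membership(model, candidates, threshold=1.0)
print(guesses)  # [True, False]
```

Real models do not memorize this cleanly, so practical attacks have to calibrate the threshold carefully, but the underlying signal is the same: a model tends to fit its own training data better than unseen data.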
While machine unlearning may not keep companies out of court, it would certainly help the defense’s case to show that datasets of concern have been completely removed.
With current technology, if a user requests data deletion, the entire model would have to be retrained, which is hugely impractical. An efficient way to handle data removal requests is imperative to the development of widely available AI tools.
The mechanics behind machine unlearning
The simplest way to produce an unlearned model is to identify the problematic data sets, exclude them, and retrain the entire model from scratch. Although this method is currently the simplest, it is prohibitively expensive and time-consuming.
Recent estimates show that training an ML model currently costs around $4 million. Due to an increase in both dataset size and computational power requirements, this figure is expected to rise to a whopping $500 million by 2030.
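As a rough back-of-the-envelope check, the two figures above imply a compound annual growth factor. The sketch below assumes the $4 million figure refers to 2023 (the year of the events mentioned elsewhere in this article) and that costs grow at a constant rate; both are assumptions, not claims from the underlying estimates.

```python
def implied_annual_growth(start_cost, end_cost, years):
    """Constant yearly multiplier that turns start_cost into end_cost."""
    return (end_cost / start_cost) ** (1 / years)


# Assumption: baseline year is 2023, target year is 2030.
factor = implied_annual_growth(4e6, 500e6, years=2030 - 2023)
print(f"{factor:.2f}x per year")  # about 1.99x per year
```

Under those assumptions, the projection amounts to training costs roughly doubling every year.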
The “brute force” retraining approach may be appropriate as a last resort in extreme circumstances, but it is far from a silver bullet solution.
The conflicting goals of machine unlearning pose a challenging problem: the model must forget bad data while retaining its utility, and it must do so with high efficiency. There is no point in developing a machine unlearning algorithm that uses more energy than retraining would.
Progression of machine unlearning
All this is not to say that there hasn’t been progress towards developing an efficient unlearning algorithm. The first mention of machine unlearning was seen in this 2015 paper, with a follow-up paper in 2016. The authors propose a system that allows incremental updates to an ML system without expensive retraining.
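One way such incremental updates can work is when a model depends on its training data only through a handful of running sums: forgetting a record then means subtracting its contribution from each sum. The sketch below illustrates this for a least-squares line fit. It is a toy illustration under that assumption, not the cited authors' system, and the class and variable names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class LineFitStats:
    """Running sums that fully determine the line y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        self._update(x, y, sign=1)

    def remove(self, x, y):
        # Unlearning step: subtract the record's contribution from every sum.
        self._update(x, y, sign=-1)

    def _update(self, x, y, sign):
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def fit(self):
        # Assumes at least two records with distinct x values remain.
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


def train(records):
    stats = LineFitStats()
    for x, y in records:
        stats.add(x, y)
    return stats


records = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 30.0)]
to_forget = (5, 30.0)

# Incremental unlearning: one constant-time update, no pass over the data.
stats = train(records)
stats.remove(*to_forget)
unlearned = stats.fit()

# Reference: retrain from scratch without the forgotten record.
retrained = train([r for r in records if r != to_forget]).fit()

matches = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(unlearned, retrained))
print(matches)
```

The comparison against a from-scratch retrain is the important part: an unlearning method is only exact if its result is indistinguishable from a model that never saw the forgotten data. Most modern models, such as deep neural networks, cannot be reduced to a few sums, which is why the problem remains hard.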
A 2019 paper advances machine unlearning research by introducing a framework that speeds up the unlearning process by strategically limiting the influence of data points in the training procedure. This means that specific data can be removed from the model with minimal negative impact on performance.
A 2019 paper also describes a method for “scrubbing” network weights clean of information about a particular set of training data without access to the original training data set. This method prevents anyone from recovering insights about the forgotten data by probing the weights.
This 2020 paper introduced the novel approach of sharding and slicing. Sharding aims to limit the influence of a data point, while slicing divides the shard’s data further and trains incremental models. This approach aims to speed up the unlearning process and eliminate extensive retraining.
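The sharding half of this idea can be sketched in a few lines: each record lives in exactly one shard, each shard has its own sub-model, and forgetting a record retrains only the shard that held it. The code below is a simplified illustration with a deliberately trivial sub-model (the mean label of the shard) and hypothetical names. It omits slicing, and it is not the paper's implementation.

```python
from statistics import mean


def train_shard(shard):
    """Toy sub-model (assumption): the mean label of the shard, or None if empty."""
    return mean(label for _, label in shard) if shard else None


class ShardedEnsemble:
    def __init__(self, data, num_shards):
        # Each record is assigned to exactly one shard (round-robin here),
        # so its influence is confined to a single sub-model.
        self.shards = [data[i::num_shards] for i in range(num_shards)]
        self.models = [train_shard(shard) for shard in self.shards]

    def predict(self):
        # Aggregate the sub-models by averaging their outputs.
        return mean(m for m in self.models if m is not None)

    def unlearn(self, record):
        """Remove a record and retrain only the shard that contained it."""
        for i, shard in enumerate(self.shards):
            if record in shard:
                shard.remove(record)
                self.models[i] = train_shard(shard)
                return i
        raise ValueError("record not found in any shard")


# Records are (id, label) pairs; the last one is the record to forget.
data = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0), (5, 60.0)]
ensemble = ShardedEnsemble(data, num_shards=3)

before = ensemble.predict()
touched = ensemble.unlearn((5, 60.0))
after = ensemble.predict()
print(touched, before, after)  # 2 12.5 3.0
```

Only one of the three sub-models is retrained, so the cost of a deletion request scales with the size of a shard rather than the size of the full data set. The trade-off is that each sub-model sees less data, which can reduce accuracy.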
A 2021 study introduces a new algorithm that can unlearn more data samples from the model compared to existing methods while maintaining model accuracy. Later in 2021, researchers developed a strategy for handling data deletion in models, even when deletions are based only on the model’s output.
Since the term was introduced in 2015, various studies have proposed increasingly effective and efficient unlearning methods. Despite considerable progress, a complete solution has not yet been found.
Challenges in machine unlearning
As with any emerging field of technology, we generally have a good idea of where we want to go, but not a good idea of how to get there. Some of the challenges and limitations facing machine unlearning algorithms include:
- Efficiency: Any successful machine unlearning tool must use fewer resources than retraining the model would. This applies to both computational resources and time consumption.
- Standardization: Currently, the method used to evaluate the effectiveness of machine unlearning algorithms varies between each piece of research. To make better comparisons, standard metrics need to be identified.
- Efficacy: Once an ML algorithm has been instructed to forget a data set, how can we be sure that it has really forgotten it? Solid validation mechanisms are needed.
- Privacy: Machine unlearning must ensure that it does not inadvertently compromise sensitive data in its efforts to forget. Care must be taken to ensure that no traces of data are left behind by the unlearning process.
- Compatibility: Machine unlearning algorithms should ideally be compatible with existing ML models. This means that they must be designed in such a way that they can be easily implemented in different systems.
- Scalability: As datasets grow larger and models more complex, it’s important that machine unlearning algorithms are able to scale to match. They must handle large amounts of data and potentially perform unlearning tasks across multiple systems or networks.
Addressing all these issues poses a significant challenge, and a healthy balance must be found to ensure steady progress. To help navigate these challenges, companies can hire interdisciplinary teams of AI experts, data protection lawyers, and ethicists. These teams can help identify potential risks and keep track of progress in the field of machine unlearning.
The future of machine unlearning
Google recently announced the first machine unlearning challenge, which aims to address the issues outlined so far. Specifically, Google hopes to unify and standardize the evaluation metrics for unlearning algorithms, as well as promote new solutions to the problem.
The competition, which considers an age prediction tool that must forget certain training data to protect the privacy of certain individuals, began in July and runs until mid-September 2023. For business owners who may have concerns about data used in their models, the results of this competition are definitely worth paying attention to.
In addition to Google’s efforts, the continuous build-up of lawsuits against AI and ML companies will undoubtedly trigger action in these organizations.
Looking further ahead, we can foresee advances in hardware and infrastructure to support the computational demands of machine unlearning. There may be an increase in interdisciplinary collaboration, which can help streamline development. Legal professionals, ethicists and data protection experts can join forces with AI researchers to align the development of unlearning algorithms.
We should also expect machine unlearning to attract attention from lawmakers and regulators, potentially leading to new policies and regulations. And as data privacy issues continue to make headlines, increased public awareness may also impact the development and use of machine unlearning in unforeseen ways.
Actionable insight for companies
Understanding the value of machine unlearning is essential for companies that want to implement or have already implemented AI models trained on large data sets. Some actionable insights include:
- Monitoring research: Keeping an eye on the latest academic and industry research will help you stay ahead of the curve. Pay particular attention to the results of events like Google’s machine unlearning challenge. Consider subscribing to AI research newsletters and following AI thought leaders for up-to-date insights.
- Implementation of rules for data handling: It is critical to examine your current and historical data handling practices. Always try to avoid using questionable or sensitive data during the model training phase. Establish procedures or review processes for proper handling of data.
- Consider multidisciplinary teams: The multifaceted nature of machine unlearning benefits from a diverse team that could include AI experts, data protection lawyers and ethicists. This team can help ensure your practices comply with ethical and legal standards.
- Consider retraining costs: It never hurts to prepare for the worst. Consider the cost of retraining in the event that machine unlearning is unable to resolve any issues that may arise.
Keeping up with machine unlearning is a smart long-term strategy for any company that uses large data sets to train AI models. By implementing some or all of the strategies outlined above, companies can proactively address any issues that may arise due to the data used in training large AI models.
AI and ML are dynamic and constantly evolving fields. Machine unlearning has emerged as a crucial aspect of these fields, allowing them to adapt and evolve more responsibly. It ensures better data handling options while maintaining the quality of the models.
The ideal scenario is to use the right data from the start, but the reality is that our perspectives, information and privacy needs change over time. Adopting and implementing machine unlearning is no longer optional, but a necessity for businesses.
In the broader context, machine unlearning fits into the philosophy of responsible AI. It underscores the need for systems that are transparent and accountable and that prioritize user privacy.
It’s still early days, but as the field progresses and evaluation metrics become standardized, implementing machine unlearning will inevitably become more manageable. This new trend warrants a proactive approach from companies that regularly work with ML models and large data sets.
Matthew Duffin is a mechanical engineer, dedicated blogger and founder of Rare Connections.