AI sets off a big battle for data

August 13, 2023

Not so long ago, analysts openly wondered whether artificial intelligence (AI) would be the death of Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure images from text, seemed to render Adobe’s image-editing offerings redundant. As recently as April, Seeking Alpha, a financial-news site, published an article headlined “Is AI the Adobe killer?”

Far from it. Adobe has used its database of hundreds of millions of stock photos to build its own suite of AI tools, dubbed Firefly. Since its release in March, the software has been used to create over 1 billion images, says Dana Rao, an executive at the company. By avoiding scraping the internet for images, as competitors did, Adobe has sidestepped the deepening copyright dispute now engulfing the industry. The company’s share price has risen by 36% since Firefly was launched.

Adobe’s triumph over the doomsayers illustrates a broader point about the contest for dominance in the fast-growing market for AI tools. The super-sized models driving the latest wave of so-called “generative” AI rely on gigantic amounts of data. Having already helped themselves to much of the internet, often without permission, model-builders are now seeking out new sources of data to sustain the feeding frenzy. Meanwhile, companies with large troves of the stuff are weighing how best to profit from it. A data land grab is under way.

The two essential ingredients of an AI model are data sets, on which the system is trained, and processing power, through which the model detects relationships within and between those data sets. The two are, to some extent, substitutes: a model can be improved either by ingesting more data or by adding more processing power. The latter, however, is becoming difficult owing to a shortage of specialist AI chips, leading model-builders to focus all the more on seeking out data.

Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, estimates Epoch AI, a research outfit. The latest AI models from Google and Meta, two tech giants, are believed to have been trained on more than 1 trillion words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4 billion.

It is not only the size of the data set that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually accurate writing, notes Russell Kaplan of Scale AI, a data startup. Models fed this kind of material are more likely to produce similarly high-quality output. Likewise, AI chatbots give better answers when asked to explain their working step by step, increasing demand for sources, such as textbooks, that do the same. Specialized information sets are also prized, as they allow models to be “fine-tuned” for more niche applications. Microsoft’s purchase of GitHub, a repository for software code, for $7.5 billion in 2018 helped it develop a code-writing AI tool.
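To make the idea concrete, here is a minimal sketch, not drawn from the article, of what a single fine-tuning record with step-by-step reasoning might look like when stored as JSONL; the field names, file name and example are illustrative assumptions rather than any company’s actual format.

```python
# Minimal sketch (illustrative only) of a fine-tuning record whose answer
# shows its working step by step, the kind of data the article says is prized.
import json

record = {
    "prompt": "A train travels 120 km in 2 hours. What is its average speed?",
    # Step-by-step answers tend to make better training data than bare answers.
    "completion": "Speed = distance / time. 120 km / 2 h = 60 km/h. Answer: 60 km/h.",
    "domain": "physics",  # hypothetical tag used to fine-tune niche models
}

# Append the record to a JSONL training file, one JSON object per line.
with open("finetune_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```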

As demand for data grows, accessing it is becoming trickier, and content creators are now demanding compensation for material ingested into AI models. A number of copyright-infringement lawsuits have already been filed against model-builders in America. A group of writers, including Sarah Silverman, a comedian, is suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists is likewise suing Stability AI, which builds text-to-image tools, and Midjourney.

The result of all this has been a flurry of dealmaking as AI companies race to secure data sources. In July OpenAI struck a deal with the Associated Press, a news agency, to access its archive of stories. It also recently expanded an agreement with Shutterstock, a provider of stock photography with which Meta also has a deal. On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists’ voices to feed a songwriting AI tool. Fidelity, an asset manager, has said it has been approached by technology firms seeking access to its financial data. Rumors are swirling that AI labs have approached the BBC, Britain’s public broadcaster, about access to its archive of images and films. Another supposed target is JSTOR, a digital library of academic journals.

Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have raised the cost of accessing their data. Both sites are particularly valuable because users “upvote” favored answers, helping models learn which are most relevant. Twitter (now known as X), a social-media site, has introduced measures to limit the ability of bots to scrape it and now charges anyone who wants access to its data. Elon Musk, its mercurial owner, plans to build his own AI business using the data.

Model-builders are also working harder to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators to perform tasks such as labeling images and rating responses. Some of that work is complex; one advertisement for such a job seeks applicants with a master’s degree or doctorate in the life sciences. But much of it is mundane and is outsourced to places such as Kenya, where labor is cheap.

AI companies are also collecting data through users’ interactions with their tools. Many of these have some form of feedback mechanism in which users indicate which outputs are useful. Firefly’s text-to-image generator lets users choose from one of four options. Bard, Google’s chatbot, similarly proposes three responses. Users can give ChatGPT a thumbs-up or thumbs-down when it answers queries. That information can be fed back into the underlying model, forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the “data flywheel.” An even stronger signal of the quality of a chatbot’s response is whether users copy the text and paste it elsewhere, he adds. Analyzing such information helped Google rapidly improve its translation tool.
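As a rough illustration of how such a flywheel might be wired up in practice, the sketch below, which is an assumption rather than a description of any company’s system, logs each user judgement on a model output so it can later be folded back into training data; the function name, fields and file path are all hypothetical.

```python
# Minimal "data flywheel" sketch: record user feedback on model outputs
# in a JSONL log that can later be mined for fine-tuning data.
import json
from datetime import datetime, timezone

def log_feedback(prompt: str, response: str, rating: str,
                 path: str = "feedback.jsonl") -> None:
    """Append one user judgement (e.g. 'up', 'down' or 'copied') to a JSONL log."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "rating": rating,  # copying text elsewhere is an even stronger signal
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# Example: a user gives a thumbs-up to an answer, feeding the flywheel.
log_feedback("What is the capital of Kenya?", "Nairobi.", "up")
```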

Pushing the limits

There is, however, one source of data that remains largely untapped: the information held within the walls of tech companies’ corporate customers. Many businesses sit, often unwittingly, on vast amounts of useful data, from call-center transcripts to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, such as helping call-center agents answer customer queries or business analysts find ways to boost sales.

Tapping that rich resource is not always straightforward, though. Roy Singh of Bain, a consultancy, notes that historically most companies have paid little attention to the kinds of vast but unstructured data sets that would prove most useful for training AI tools. Often these are spread across multiple systems and buried on company servers rather than in the cloud.

Unlocking that information would help companies adapt AI tools to their specific needs. Amazon and Microsoft, two tech giants, now offer tools to help companies manage their unstructured data sets better, as does Google. Christian Kleinerman of Snowflake, a database firm, says business is booming as customers look to “break down data silos”. Startups are piling in. In April Weaviate, an AI-focused database company, raised $50 million at a valuation of $200 million. Barely a week later Pinecone, a rival, raised $100 million at a valuation of $750 million. Earlier this month Neon, another database startup, raised a further $46 million. The battle for data has only just begun.
