AI is developing rapidly. ChatGPT has become the fastest-growing online service in history. Google and Microsoft are integrating generative AI into their products. And world leaders are excitedly embracing AI as a tool for economic growth.
As we move beyond ChatGPT and Bard, we’re likely to see AI chatbots become less generic and more specialized. An AI system is only as good as the data it is trained on. In this case, that data is what teaches it to mimic human speech and provide users with useful answers.
Training often casts the net wide, with AI systems absorbing thousands of books and web pages. But a more select, focused set of training data could make AI chatbots even more useful for people working in particular industries or living in certain areas.
The Value of Data
An important factor in this evolution will be the growing costs of amassing training data for advanced large language models (LLMs), the type of AI that powers ChatGPT. Companies know data is valuable: Meta and Google make billions from selling advertisements targeted with user data. But the value of data is now changing. Meta and Google sell data “insights”; they invest in analytics to transform many data points into predictions about users.
Data is valuable to OpenAI, the developer of ChatGPT, in a subtly different way. Imagine a tweet: “The cat sat on the mat.” This tweet is not valuable for targeted advertisers. It says little about a user or their interests. Maybe, at a push, it could suggest interest in cat food and Dr. Seuss.
But for OpenAI, which is building LLMs to produce human-like language, this tweet is valuable as an example of how human language works. A single tweet cannot teach an AI to construct sentences, but billions of tweets, blog posts, Wikipedia entries, and so on, certainly can. For instance, the advanced LLM GPT-4 was probably built using data scraped from X (formerly Twitter), Reddit, Wikipedia and beyond.
The AI revolution is changing the business model for data-rich organizations. Companies like Meta and Google have been investing in AI research and development for several years as they try to exploit their data resources.
Organizations like X and Reddit have begun to charge third parties for access to their APIs, the interfaces through which outside developers retrieve data from these websites in bulk. Serving these requests costs companies like X money, as they must spend more on computing power to fulfill data queries.
Moving forward, as organizations like OpenAI look to build more powerful versions of their GPT models, they will face greater costs for acquiring data. One solution to this problem might be synthetic data.
Synthetic data is created from scratch by AI systems and used to train more advanced AI systems, with the aim of improving them. It is designed to perform the same task as real training data but is generated by AI.
It’s a new idea, but it faces many problems. Good synthetic data needs to be different enough from the original data it’s based on in order to tell the model something new, while similar enough to tell it something accurate. This can be difficult to achieve. Where synthetic data is just convincing copies of real-world data, the resulting AI models may struggle with creativity, entrenching existing biases.
Another problem is the “Hapsburg AI” problem: the concern that training AI on synthetic data will cause a decline in the effectiveness of these systems. The name is an analogy with the infamous inbreeding of the Hapsburg royal family. Some studies suggest this is already happening with systems like ChatGPT.
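The decline described above can be illustrated with a toy simulation. This is only an analogy, not how any real LLM is trained: it assumes a "model" that is nothing more than a normal distribution fitted to its training data, and each generation is trained solely on a small sample produced by the previous generation. The function name simulate_generations and its parameters are hypothetical, chosen for this sketch.

```python
import random
import statistics


def simulate_generations(generations=100, sample_size=10, seed=0):
    """Toy analogy for training each model only on its predecessor's output.

    Assumption: a "model" is just a normal distribution (mean, std) fitted
    to its training samples. Returns the fitted std for every generation,
    starting with the original "real" distribution at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "real" data distribution
    spreads = [std]
    for _ in range(generations):
        # The current model generates synthetic training data...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data alone.
        mean = statistics.mean(samples)
        std = statistics.pstdev(samples)
        spreads.append(std)
    return spreads


if __name__ == "__main__":
    spreads = simulate_generations()
    for generation in (0, 10, 50, 100):
        print(f"generation {generation:3d}: std = {spreads[generation]:.6f}")
```

Under these assumptions the fitted spread tends to shrink over the generations, because each small sample under-represents the tails of the distribution it was drawn from and nothing in the loop reintroduces them. The later "models" end up describing a much narrower range of values than the original data contained.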
One reason ChatGPT is so good is that it uses reinforcement learning from human feedback (RLHF), where people rate its outputs in terms of accuracy. If synthetic data generated by an AI has inaccuracies, AI models trained on this data will themselves be inaccurate. So the demand for human feedback to correct these inaccuracies is likely to increase.
However, while most people would be able to say whether a sentence is grammatically accurate, fewer would be able to comment on its factual accuracy—especially when the output is technical or specialized. Inaccurate outputs on specialist topics are less likely to be caught by RLHF. If synthetic data means there are more inaccuracies to catch, the quality of general-purpose LLMs may stall or decline even as these models “learn” more.
Little Language Models
These problems help explain some emerging trends in AI. Google engineers have revealed that there is little preventing third parties from recreating LLMs like GPT-3 or Google’s LaMDA AI. Many organizations could build their own internal AI systems, using their own specialized data, for their own objectives. These will probably be more valuable for these organizations than ChatGPT in the long run.
Recently, the Japanese government noted that developing a Japan-centric version of ChatGPT is potentially worthwhile to their AI strategy, as ChatGPT is not sufficiently representative of Japan. The software company SAP has recently launched its AI “roadmap” to offer AI development capabilities to professional organizations. This will make it easier for companies to build their own, bespoke versions of ChatGPT.
Consultancies such as McKinsey and KPMG are exploring the training of AI models for “specific purposes.” Guides on how to create private, personal versions of ChatGPT can be readily found online. Open source systems, such as GPT4All, already exist.
As development challenges mount for generic LLMs, alongside potential regulatory hurdles, it is possible that the future of AI will be many specific little language models rather than large ones. Little language models might struggle if they are trained on less data than systems such as GPT-4.
But they might also have an advantage in terms of RLHF, as little language models are likely to be developed for specific purposes. Employees who have expert knowledge of their organization and its objectives may provide much more valuable feedback to such AI systems, compared with generic feedback for a generic AI system. This may overcome the disadvantages of less data.