6/6/2024
Something I haven’t been able to intuitively understand, and which I’ve seen repeated and seemingly well-understood by others, is how it is possible for us to train a model on synthetic data and expect to produce something better than the model that fed it.
My understanding of current cutting-edge models (whether that’s 4o, or a rumored 5) is that we’re starting to near a data wall. Perhaps this isn’t the case yet for video or audio, and those can keep the information coal fire burning for a little longer, but at the rate of growth we’ve seen we’re going to run out of those sources very quickly too.
And so the conventional wisdom goes that we should be able to use existing models to generate data instead. And so it also goes that this is a reasonable methodology for producing new and better models.
Let me back up for a moment, because in order to ask my question, I need to state my understanding of what these models actually are. Maybe the answer to my curiosity is just “you fundamentally misunderstand LLMs.” I’m not an ML researcher or AI scientist. Hell, I’ve never even used PyTorch.
The way I understand the current technology is that it ingests some large amount of sequential, related data as input, breaks it down into tokens of its own, and records (to put it in an unsophisticated way) the relationships between those tokens and the broader context. In this way, it creates an initial series of probabilities. This process continues (and in parallel, since other workers can process information sourced from different places), and with each iteration it recalculates and applies some differential to those probabilities.
At the end of all of this it will have created a function of sorts. A semantic mapping. A probabilistic relationship between all of the tokens it knows, and their likelihood of appearing alongside one another. A glorious multi-dimensional Markov chain from hell.
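To pin down what I mean (and at the risk of embarrassing myself), here’s a toy version in plain Python: a bigram model that literally counts which token follows which and turns the counts into probabilities. Real transformers obviously don’t build a lookup table like this, so treat it as a caricature of the intuition, not a description of the architecture.

```python
# A toy "model": count which token follows which in a tiny corpus,
# then turn the counts into next-token probabilities.
# This is a bigram Markov chain -- a massively simplified stand-in for
# what a transformer learns, not how one actually works.
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# Normalize the counts into probabilities: P(next | prev)
model = {
    prev: {tok: c / sum(followers.values()) for tok, c in followers.items()}
    for prev, followers in counts.items()
}

print(model["the"])  # -> {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```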
Given this behemoth we can then query it with a series of tokens, and it will happily provide us with the most likely next token. There are, I suppose, other details I’m unconcerned with (system prompts, grounding, filtering, and so on) that don’t matter to my question. But this is more or less the gist of the state of things as I understand them. We have created a highly multi-dimensional fitting function for language (or video, or audio, whatever).
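Continuing the caricature, “querying” the toy model above is just: hand it the last token of a prompt and it returns the most likely continuation. A real LLM conditions on its entire context window and does something vastly more sophisticated, but the shape of the operation is what I’m gesturing at.

```python
# "Query" the toy model from above: given a token, return the most
# probable next token according to the learned probability table.
def next_token(model, token):
    followers = model.get(token)
    if not followers:
        return None  # token never seen during "training"
    return max(followers, key=followers.get)

print(next_token(model, "the"))  # -> 'cat'
```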
Back to my point, and maybe you spotted my misunderstanding a few paragraphs ago (or maybe even at the beginning of my paper, I’m sorry): the supposition that we should be able to create a more useful, more accurate, qualitatively better model just by using synthetic data from existing models. To which I say: wait, how?
It’s my layperson’s intuition that this should not result in the creation of novel, interesting, or unexpected relationships in the model which then lead to more usefulness. Instead, it seems to me like this would simply reinforce the existing model’s own relationships: at best creating a copy of it, and at worst eliminating data from the new model because not all of the original’s relationships were represented during generation.
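Here’s a crude sketch of that worry, using the same toy framing. The numbers are made up and the sample size is deliberately tiny, so this is an illustration of the intuition rather than a claim about real training pipelines:

```python
# Start with a distribution over continuations, then repeatedly
# (1) sample a small synthetic "corpus" from the model and
# (2) re-fit the next model on those samples alone.
import random

probs = {"cat": 0.50, "mat": 0.25, "fish": 0.20, "axolotl": 0.05}

for generation in range(8):
    tokens, weights = zip(*probs.items())
    synthetic = random.choices(tokens, weights=weights, k=25)  # tiny synthetic corpus
    probs = {t: synthetic.count(t) / len(synthetic) for t in tokens}
    probs = {t: p for t, p in probs.items() if p > 0}  # extinct tokens never come back
    print(f"gen {generation}: {probs}")
```

Run it a few times and the low-probability continuations tend to vanish within a handful of generations, and once they’re gone they can never come back, because each new model only ever sees what the previous one happened to emit.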
It’s not clear to me how this makes it more useful, or more accurate, unless accuracy is measured in the same way we think of model temperature, because it seems like this would just steer the newly created model toward the least novel relationships.
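(For the curious: temperature, as I understand it, is just a knob on the output distribution. Divide the model’s raw scores by T before normalizing them into probabilities. Low T piles the probability onto the already-most-likely token; high T flattens things out. The scores below are made up for illustration.)

```python
# Temperature-scaled softmax: scores / T, then normalize.
import math

def softmax_with_temperature(logits, T):
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens

for T in (0.2, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```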
A common rebuke I’ve seen online to this, when others have raised it (I don’t dare post these questions online), is something along the lines of ‘well, what is it you suppose humans are doing, if not inferencing?’ I’m probably conflating things here, because I think this is usually a reply to those who think these models won’t reach human-level intelligence, but assuming that is the goal, and assuming we need to get past this data wall to reach it, I think it’s at least somewhat fair to apply that rebuke here.
But that rebuke really misses out on some fundamental details about the human experience. I mean, for starters, we’re… y’know… in the world. We are constantly receiving new information from our environment, which we re-integrate in real time with our existing understanding. It might be the case that human ingenuity is just the consequence of inference, but that ingenuity is fueled by a relentless barrage of new and novel data. It could be said that, as a species, our quest toward truth is just an ad-hoc network of LLMMs (large language meat models) finding the function of best fit for the universe. But we are billions of eyes and billions of hands and billions of ears receiving new data from our environment. An LLM is trained once (tuning notwithstanding).
How could this form of the technology ever be intelligent? How could creating an LLM ouroboros ever seriously make something more than what it started with? These systems only respond when asked. They only reason when instructed. The human mind performs inference continuously (conscious and subconscious thought).
What the fuck am I missing? It seems like the best metaphor for current LLM technology is that it’s just a very large data compression and querying mechanism.
For fun, Claude.ai responds (with a bit of back and forth):