Using a Model to Model
TL;DR
Once we get past ‘bullshit work’, the primary enterprise use cases for Large Language Models (LLMs) appear to converge on various ways to make it easier to work with unstructured data. That’s because an LLM can generate an ‘understanding’ of the data, saving the painstaking process of getting humans to provide context. Of course there are quick wins, but there are also booby traps to beware of.
Unstructured or unmodelled?
Many times over the years I’ve heard Simon Wardley say “it’s not unstructured data, it’s unmodelled data”, and I think he’s completely right about that.
The ‘books and records’ of most organisations live in relational database management systems (RDBMS). This is what we typically call ‘structured’ data, as it’s forced into the schema of the database. It’s also ‘modelled’ data, because modelling is how we arrive at the schema definitions. The flip side of that coin is that everything else gets referred to as ‘unstructured’. That usually starts with a bunch of stuff in various document formats, but it runs on to pretty much everything that’s not in an RDBMS. Generally we can do things like plain text search over that data, but it’s hard to ask deeper questions of it (at least not with the ease that SQL brings to asking questions of ‘structured’ data in an RDBMS).
Of course there’s an entire industry that’s positioned itself as unlocking the value of all that unstructured data – Autonomy, Elastic and Google easily spring to mind. But those things have usually allowed us to find the document that contains some data we’re looking for. It’s another thing entirely to then do something useful with the data in the document. And that’s where LLMs come in.
A model
Specifically a (large) language model.
The training process for language models involves throwing documents into a statistical analysis process in order to build a model of how the language (of the source documents) works. Generally more data gives a ‘better’ model, which is why we find ourselves talking about large language models.
The current crop of LLMs have been trained on (approximately) everything that can be scraped off the open web (including huge volumes of copyright material). So they know how language works, or at least the ‘language of the web’, complete with all the search engine optimisation (SEO) nonsense that was perpetrated in the pre-AI era[1].
LLMs give us a representation of how words relate to each other; a model of language.
To model
All of that ‘unstructured’ data hasn’t been modelled because (in the past at least) that would mean having people look at the data and come up with a model; and that’s painstaking, time-consuming and expensive work.
But what if I throw all the documents containing my ‘unstructured’ data through an LLM? The documents will light up certain pathways through the model, and we can extract that information and infer a model for the source data. It’s probably not going to be 100% accurate (in the way the ‘books and records’ need to be), but it might very well be ‘good enough’ for much of the data to be used as if it were structured.
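As a minimal sketch of the idea (not any particular product’s approach): prompt the model to map each document onto a fixed schema and validate what comes back. The `call_llm` function here is a hypothetical stand-in for whatever model API you use, with a canned response so the example is self-contained.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call; returns a canned
    # response here so the sketch stays self-contained and runnable.
    return '{"vendor": "Acme Ltd", "invoice_date": "2024-03-01", "total": 1290.50}'

# Example schema we want to impose on the 'unstructured' documents.
SCHEMA_FIELDS = {"vendor", "invoice_date", "total"}

def extract_record(document_text: str) -> dict:
    """Ask the model to map one unstructured document onto the schema."""
    prompt = (
        "Extract the following fields from the document below and reply "
        f"with JSON only, using the keys {sorted(SCHEMA_FIELDS)}:\n\n"
        f"{document_text}"
    )
    record = json.loads(call_llm(prompt))
    # Model output isn't guaranteed to be accurate (or even well-formed),
    # so validate before treating it as 'structured' data.
    missing = SCHEMA_FIELDS - record.keys()
    if missing:
        raise ValueError(f"model response missing fields: {missing}")
    return record

print(extract_record("Invoice from Acme Ltd, 1 March 2024, total 1290.50"))
```

The validation step is where ‘good enough’ gets decided: records that fail the schema check can be routed to a human, while the rest flow into the same pipelines as conventionally structured data.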
Beware terminology overloading
One of the pitfalls with the process described above is that it’s very vulnerable to confusing things that have the same (or similar names), but which are entirely different.
I’ll give a concrete example. The company I work for has developed atProtocol(TM), whilst the folk at Bluesky typically refer to their Authenticated Transfer Protocol as AT Protocol. LLMs aren’t good at spotting the difference between atProtocol and AT Protocol, so if you ask a public LLM chatbot about either, you’ll get answers that mix up the two. This sort of thing happens with horrifying regularity in corporate data, because people use the same word to describe different things.
It’s particularly bad with marketing stuff, because people jump on bandwagons and hijack terminology. Try asking a chatbot about ‘observability’ and see if there’s any mention of high cardinality data.
Conclusion
Using a model to model seems to be emerging as the #2 use case for generative AI in the enterprise[2]. That’s because so many problems boil down to: “there’s huge value locked up in unstructured data, but not so much that it’s worth manually picking it apart using the traditional data modelling techniques”. LLMs have dramatically lowered the cost (and improved the accuracy) of building models for unstructured data, and that’s allowing companies to tap into that locked-away value. On the other side of the coin, it’s providing revenue for AI startups that had a solution looking for a problem.
If I reflect back on the ‘data science’ we did in my last job, far too much of the work was what I’d call ‘janitorial’ – cleaning the data up so it was in good enough shape for analysis. The emergent LLM-based tools are dramatically shifting the balance of human time versus software needed to get that stuff done, which (of course) opens the door to doing more with less[3].
Notes
[1] Things are only getting worse now as the web fills up with AI ‘slop’, which is not particularly useful for training the next generation of models.
[2] Bullshit jobs is still #1.
[3] At least in the present state of play, where LLMs are (approximately) free, and the enormous costs for training are being absorbed by venture capitalists and hyperscale cloud providers.