Last week, at a large (and relevant) Dutch research event, ChatGPT came up, along with the claim that it was going to change the world. The critiques came up too, but were effectively dismissed with “these methods get better very quickly”. That is not untrue, but not really true either. I murmured “not even wrong”. I know how hard it is to get computers to find meaningful patterns; I did a PhD on this in the early 21st century.

What strikes me is that ChatGPT is now pitched as an information retrieval (IR) system. This is a system that tries to find information, that is, it “retrieves” information from a knowledge base. Like SQL or SPARQL. Or like Google Maps. IR is about reproducing existing knowledge.
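To make that distinction concrete, here is a purely illustrative toy sketch (the knowledge base, questions, and the “predictor” are all made up, not real systems): an IR system either returns a stored fact or admits it has none, while a purely predictive system always produces a plausible-looking answer, recorded or not.

```python
# Toy knowledge base for the IR side: facts that were explicitly recorded.
knowledge_base = {
    "Who wrote De Revolutionibus?": "Nicolaus Copernicus",
}

def retrieve(question):
    """IR: look the answer up; return None when the fact was never stored."""
    return knowledge_base.get(question)

def predict(question):
    """Prediction (caricature): always emit a fluent answer,
    with no link back to where, or whether, the fact was recorded."""
    return "Nicolaus Copernicus"  # plausible, but without provenance

print(retrieve("Who wrote De Revolutionibus?"))     # stored fact: retrieved
print(retrieve("Who wrote The Sceptical Chymist?")) # not stored: None
print(predict("Who wrote The Sceptical Chymist?"))  # answered anyway
```

The point of the caricature: the retrieval system can say “I don’t know”, because its answers are tied to stored records; the predictive one cannot.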

Now, deep learning starts with a different premise: we can find the patterns and in this way compress an unlimited number of facts into a mathematical equation, a physical law. That way, you do not have to record that the sun comes up every day; we predict it does. We do not have to record that raindrops fall (that they do; when they do not, that actually is something to record). At best, we would record when raindrops start “falling” to the sky. That is, we have the laws of gravitation.

But here lies the problem with systems like ChatGPT: they are only as good as the predictive patterns they learned. They do not retrieve information; they predict information. This is why ChatGPT does not know about references. It lost the link between its predictions and the shelf on which the book was stored.

So, when it was mentioned at last week’s research event that lawyers were starting to use it, citing existing work, I was skeptical: that would actually mean they had moved ChatGPT into IR. And I had already learned (*) that ChatGPT predicts references rather than looking them up. It is a prediction method, not an IR method. So how could it accurately cite court cases?

It didn’t. It’s all over the news now. It “hallucinated” legal citations.

Does this matter? I think it does. This is why, after my PhD, I moved my research focus back to IR, away from machine learning. Deep learning can only generalize from facts, so we had better start accurately recording those facts. This is why I study interoperable and reusable knowledge bases, like WikiPathways and Wikidata, and technologies like RDF in science. Actually, this realization predates my machine learning work. I guess I already had this notion when I started the Woordenboek Organische Chemie back in the nineties.

Someone has to. I just hope the funding for this fundamental aspect of research does not run out. Information retrieval will remain essential to science for a few decades more.