Thursday, December 22, 2022

Of Dogs, AIs, and Photographs

I regret, slightly, that there will be no follow-up to the previous teaser. Dr. Low has found a less disreputable publisher for his research into the scandalous behaviors of PhotoIreland. Alas. But, onwards!

My dog, upon hearing the word "walk," goes insane. Whatever her inner life actually looks like, it certainly appears that to her this word, in some meaningful way, means a specific activity. Rather, it means one of a sort of cloud of activities. It has synonyms: "hike," "park," "leash," and "poo-bag" at least. It appears that hearing the word triggers a set of emotions (pleasurable) and memories of previous walks, hikes, etc., and most definitely an expectation of more of the same. The word connects in some meaningful way to a set of emotions and a set of ideas about reality, about things that actually happen from time to time.

To me, the word is a shortened form of a sentence: "would you like to go for a walk?" which carries meaning in a bunch of ways. The dog and I agree on the important ones, which are the bundle of emotions and the real-world activities which occur from time to time. We agree on the way the sentence connects to the emotional and the real. What the dog misses is the linguistic content. She has no notion of the interrogative mood, she has no notion of preferences, not really. Her vocabulary in general, while very real, is limited to a handful of nouns which connect to things she likes very much, and other noises, "commands," which prompt her to do things (e.g. sit) in exchange for things she likes very much (food.)

You and I, on the other hand, have a rich linguistic structure to play with. We know about prepositions, like "of" as in "the ear of the dog" which is a meaningless idea to my dog. My dog knows about things being attached to other things, and she seems to have a notion of possession or perhaps ownership, but I cannot imagine it would even occur to her to express these things. They simply are. For you and me, words like "dog" refer to a real thing, refer to (probably) a bundle of emotional material, and also refer to a bunch of other words. Dogs are mammals, they have four feet, they like to go on walks, they bite you. Words are "defined" in terms of other words, and live in grammatical relationships to other words. "Dog" is a word that appears in some sentences in some places, and not in others.

If you pay attention in the right places, we're seeing a lot of "AI" systems appearing. Most recently a chat-bot based on GPT3, with which you can have a sort of conversation. You can ask it to write a song in the style of Aerosmith about prosciutto and, by all accounts, it will do a weirdly good job of it.

These things are, essentially, pure language. They are built by dropping a half a trillion words of written English into a piece of software that builds another piece of software. This second piece of software "knows" a great deal about where and how the word "dog" appears in English text. It "knows" in some sense that "ear" is a word that can exist in an "of" relationship with a "dog," and that the reverse is rare. To GPT3 the word "dog" is connected to a great mass of material, none of which is emotional (GPT3 lacks an endocrine system) and none of which is reality (GPT3 has no model of the world, only of language.) In a real sense, GPT3 is the opposite of a dog, being composed of precisely those facets of language which a dog lacks. Or, to put it another way, if you could somehow combine GPT3 with a dog, you'd have a pretty fair representation of a human.
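That purely distributional sort of "knowing" can be caricatured in a few lines. This is a toy bigram counter, nothing remotely like GPT3's actual architecture; it only illustrates what it means to "know" where a word appears without knowing anything about ears, dogs, or possession:

```python
from collections import Counter, defaultdict

# A toy corpus standing in for half a trillion words of English.
corpus = (
    "the ear of the dog . the dog likes a walk . "
    "the tail of the dog . the dog bites you ."
).split()

# Count which word follows which: purely distributional "knowledge".
follows = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1][w2] += 1

# The model "knows" that "of" is commonly followed by "the",
# and that "dog" tends to sit after "the" -- and nothing else.
print(follows["of"].most_common(1))   # [('the', 2)]
print(follows["the"].most_common(1))  # [('dog', 4)]
```

Everything such a counter can ever say about "dog" is a fact about neighboring glyphs, which is the point.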

It happens that a great deal of what humans do these days is a lot like GPT3. Many of us live online enough that much of our "input" comes in the form of language, and much of the language we "output" is really just pattern matching and remixing, not that different from what GPT3 does. We don't really think up a response to whatever we just read, we dredge up the fragments of an appropriate response and assemble them roughly into something or other and mash the reply button. Usually angrily.

I promised you photographs.

Consider visual art, specifically representational art.

This is a drawing of a dog. It is not a dog, nor is it the word "dog." Most likely, though, it connects to the same emotional and reality-based set of material that the word "dog" connects to, or close enough. What it lacks is the linguistic connection. Just as my dog understands the word "walk" we might understand a drawing of a "dog." A drawing of a dog generally will refer to, will be connected to, an abstraction of dog-ness. You might recognize the specific dog, or not. If you don't, you'll get "a dog" from it. Even if you do recognize the dog, the distance a drawing creates might push you a little toward the abstraction of dog-ness.

A photograph of a dog, like this one,

inevitably refers to a specific dog. Whether or not you recognize the dog, the dog in the photo is a specific dog. This one is named Julia, and she knows several words, among them "walk."

A picture, like a dog, functions as a kind of inverse of a contemporary GAN-based AI system. It is emotional and real; it is not linguistic.

It may be worth noting that AI systems in the old days went the other way around. They tried to hand-build a model of the (or a) world, and hand-code some rules for interacting with that world, or answering questions, and so on. Just like modern AI systems, these systems also produced interesting toys and almost no actual use cases, but the results were a lot less eerily "good" than the current systems.

In the modern era, the systems don't know anything, really. GPT3 does not know that the Tigers won the World Series in 1968. You can probably persuade it to produce the right answer to a properly formed question about the 1968 World Series, but GPT3 actually knows only that "Tigers" is the glyph that naturally appears in the "answer" position relative to your textual question. It's also likely to guess that the name of a baseball team appears there, and randomly shovel one in there until you rephrase your question. You can get a remarkable amount of what looks like knowledge into this kind of enormous but purely linguistic system. What follows "What is 2 times 3?" Well, it might be "6," or it might be some other numeral, or perhaps it's just x, or some sentence about mathematics. It depends on which pseudo-neurons you tickle with the way you phrase your question.

The current systems for making pictures are, weirdly enough, based on language models as well. As far as I know they work by moving back and forth between picture and language. When you ask for a picture of a dog, it makes a picture of something, uses an image-describing AI system to describe it, and then measures how much the textual description of the current picture matches the textual prompt you gave it. Then it... very cleverly? modifies the picture, and repeats until the computed description text is close enough to your prompt text. Somewhere in there, fragments of pictures it's been trained on show up.
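The describe/compare/modify loop can be sketched as follows. Every piece of this is a stand-in: a "picture" is just a set of tags, the "describing AI" reads the tags back out, and "similarity" is word overlap, where real systems (CLIP-style guidance) compare learned embeddings of images and text. Only the control flow is the point:

```python
VOCAB = ["dog", "cat", "tree", "ball", "grass", "sky"]

def describe(picture):
    """Stand-in for the image-describing model: read the tags back as text."""
    return " ".join(sorted(picture))

def similarity(text, prompt):
    """Stand-in for an embedding comparison: crude word overlap."""
    a, b = set(text.split()), set(prompt.split())
    return len(a & b) / len(a | b)

def generate(prompt, steps=10):
    picture = {"cat", "sky"}  # pretend this is random noise
    for _ in range(steps):
        best = picture
        # Try every one-tag modification; keep whichever one makes
        # the *description* of the picture closer to the prompt text.
        for out_tag in picture:
            for in_tag in VOCAB:
                candidate = (picture - {out_tag}) | {in_tag}
                if similarity(describe(candidate), prompt) > \
                        similarity(describe(best), prompt):
                    best = candidate
        if best == picture:
            break  # description is as close to the prompt as it gets
        picture = best
    return picture

print(sorted(generate("dog grass")))  # ['dog', 'grass']
```

Note what is missing: at no point does anything in the loop consult a model of dogs or grass. It only pushes one piece of text toward another.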

Notably, there is no model of reality in there. MidJourney can't do hands, because it has no idea that hands have 4 fingers and a thumb. It doesn't know that hands are a thing. It "knows" that certain glyphs appear at certain places in certain pictures. And, to be fair, hands are hard and you learn nothing at all about how to draw hands or even hand anatomy by looking at pictures. Neither, of course, is there a model of emotion in there anywhere. Not in the text systems, not in the picture systems. These are all made by delicately, surgically, removing the complex mesh of linguistic relationships from the world and from emotion. They operate by analyzing this isolated linguistic system as a system of glyphs and relationships.

I am certainly not the first to propose that genuine intelligence, intelligence that we recognize as such, might require a body, a collection of sense organs, and perhaps an emotional apparatus, but I think we are seeing convincing evidence of that today.

What makes this terrible and terribly interesting is that we respond to pictures and to words with emotion and attempts to nail them to reality. We imagine the world of the novel, and of the photograph. We respond with joy and anger and sadness. We're attempting to reach through whatever it is to the creator, to the author, to feel perhaps what they felt, to see what they saw, to imagine what they imagined. We do this with GPT3 output as well as Jane Austen output. We do it with DALL-E output as well as Dali output. At least, we try. With the AIs, there is no creator, author, painter, not as we imagine them. There is no emotional creature there, there is no observer of reality, there is no model of reality involved at all. All we get is remixes of previously made text and pictures. Very very convincing remixes, but remixes nevertheless.

A photograph, or something that looks like a photograph, feels to us more closely nailed to reality than a drawing or a painting; we react to it as if that stuff had really, for real, been in front of the lens of a camera. When an AI is in play, the distance between reality and the picture is infinite, at the moment of creation. At the moment of consumption, the apparent gap drops to zero, with consequences we cannot really guess at. People say things like "uncanny valley" and also speculate that the system will improve until the uncanny valley goes away. The last assertion is questionable, in my mind. Some detectable essence of uncanny valley may well be irreducibly present, the trace of a complete lack of a reality, the trace of the machine without emotion, the trace of the engine that remixes convincingly but knows and feels nothing. These systems always seem to tap out right around the uncanny valley, and then the wonks produce a new toy to distract us.

Does it make any difference if the author who writes "the ear of the dog" understands that phrase? Does it matter that they know what an ear is, and what a dog is? Does it matter whether they have stroked the ear of a dog, and felt its warmth? Is it enough that they know that of the glyphs "dog", "ear" and "of" the ordering 2-3-1 is common, and all the other ones are not? We react the same either way.

Does the emptiness of the "author" somehow come through, inevitably, or could we get along with an author who has no heart? We're finding out now, and so far the answer seems to be yes, yes the emptiness does come through, albeit subtly. We shall see.
