On well-behaving AI, now with the GPT
February 16, 2023
This is again very speculative, and I'm by no means an expert on Large Language Models (LLMs), but I wanted to take up a peculiar possibility: that LLM/GPT/Dall-E-type, learning-based systems could be designed to generate specific forms and styles of behavior, just as GPT and Dall-E do today with texts and images. They could produce behaviors of any kind for video, animation and robots, especially good behaviors. Hence, I believe this is worth considering.
On the 5th of February, 2018, I published a blog post, Teaching manners to AI: Internet of good Behavior. GPT was still being developed and GPT-3 had not been released. Perhaps there was something in the air, since I wrote "…we would let it use all the relevant data related to that behavior and to learn from it". Here is the excerpt:
“What if there was a systematic way to offer models of good behavior for AI to follow, to teach it behaviors we know and define as good behavior? In many cases it would be easy to define the criteria and to use such behaviors as models for AI to follow and learn. With the Internet of good Behaviors (IogB) approach we could offer AI access to behaviors (and companionships) we think are good for its development just like we do to our children. By allowing this we would let it use all the relevant data related to that behavior and to learn from it. It is quite possible we could learn from that too, but that’s another matter.”
Of course, there was no explicit method to accomplish this then, but now we almost have it. Having seen what LLMs can and will do, I wanted to return to this topic, now knowing the potential of GPT, Dall-E and others. I have briefly touched on this topic in my book Internet of Behaviors (IoB) – With a human touch.
Coding behaviors is nothing new
IoB is a system for coding mental and physical behaviors and their components, detailed or larger. The behavior can be physical, artistic or of any abstract form. Incidentally, when I have introduced the IoB, its "mental side" has often been ignored, although it has significant and novel value in IoB.
Codes for certain classes of behaviors can be defined in code books, just as has been done for musical notation. Musical notations have evolved for centuries and are now standard and in global use. When a musician plays according to the notes, we don't usually consider it a behavior that obeys a code which has been invented, stored and shared. Originally, it is the composer who has coded her or his "playing behavior" (or imagined a player with a certain instrument) and written the behavior codes down for others to imitate. Lento, Adagio, ppp, fp etc. are, indeed, behavior codes.
A violinist playing a composition by Sibelius, for example, repeats the behaviors of Sibelius or what he imagined for a violinist. In front of a symphony orchestra, the conductor makes sure the orchestra follows the codes, perhaps with slightly modified interpretations. In other words, by systematic coding of musical behaviors, extremely complex, even creative behaviors can be coded, stored and then generated, by humans or machines.
It is possible to build behavior code systems for any behavior. Historically, we know that around 1677 Louis XIV commissioned Pierre Beauchamp to create a notation for baroque dance so that it could be "put on paper". The king wanted to preserve, and perhaps share, these dances outside France and for generations to come. (Karl-Johnson, G. (2017). Signs and Society, vol. 5, no. 2.)
Coding visual behavior material of the kind seen in videos and movies is not easy. Manual video coding has required massive work, but AI-based, deep-learning systems for it have been evolving fast. Clever multiple object tracking (MOT) methods have emerged for video, and their overall performance is improving. Some have considered the potential of the general Transformer: it aims at modelling spatio-temporal interactions of objects and can be considered a candidate for modelling behavior sequences and interactions among humans in video material.
I cannot estimate the future potential of these rather complex tracking methods. Nevertheless, they seem a promising way to extract both visual and textual data from movies and videos and to use it as training material in machine learning. Interestingly, in countries like Finland, where subtitles are used, movies and other filmed or televised materials carry synchronous visual and textual information about the presented scenes. Then there are the scripts and the spoken language, which could also serve as sources of text directly related to the visual material, and these too could be used as training material. There are many complexities here, and using such materials is a demanding challenge, but it would be possible to demonstrate.
I hope to present the following idea clearly enough that it could be improved, and perhaps even tested. My educated guess is that something like this is already happening, but I have no knowledge of it. Below is my speculative outline for such an approach.
The idea is to use a pretrained generative network which has been trained on extensive textual and visual material (movies, videos and other sources), so that its "attention" mechanism is focused on the visual and textual material in parallel, in synchrony, and with suitable temporal windows. The aim is to teach the system to connect verbal expressions with the related, visually shown behaviors. It learns to predict the scenes and texts that follow a prompt scene and/or its texts. With a GPT-like arrangement we could then expect the system to react to textual, visual or textual+visual prompts by generating textual, visual or textual+visual expressions of behaviors.
I have presented a very general outline for the system in Figure 1 as I think about it.
Figure 1. Outline for a visual-textual transformer system
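The core of the outline above, attention running over visual and textual material in parallel, can be sketched as one joint self-attention step over the concatenation of both token streams. This is a bare NumPy illustration of the mechanism only: the random matrices stand in for learned weights, and a real system would be a full trained multimodal Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, visual_tokens, seed=0):
    """One self-attention step over the concatenation of text and visual
    tokens, so every token can attend across both modalities.
    Both inputs are (n_tokens, d_model) arrays with the same d_model."""
    x = np.concatenate([text_tokens, visual_tokens], axis=0)  # (T+V, d)
    d = x.shape[1]
    rng = np.random.default_rng(seed)
    # untrained random projections stand in for learned Q/K/V weights
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(d))  # (T+V, T+V): cross-modal attention
    return weights @ v
```

Because text and visual tokens sit in one sequence, the attention matrix contains text-to-visual and visual-to-text entries, which is exactly where the system could learn to tie a verbal expression to the behavior shown on screen.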
Out of curiosity, I tested how ChatGPT generates instructions for good behaviors. If we had a code book for good behaviors in a specific context, and a way to generate those behaviors, then texts by ChatGPT that provide guidance for "good behavior" could be fed to a visual behavior generator and the outcome observed as a video, or perhaps as a static image. Different ways to accomplish this can be imagined.
ChatGPT suggesting good behavior
Here I have demonstrated how ChatGPT performs when asked to generate a manuscript for good behavior in a simple human scene. It is easy to imagine that if the system had been trained on associated visual and textual data, its response could be in visual form as well. It would rely on a "score book" for certain behaviors.
EXAMPLE. I presented a simple prompt to ChatGPT and received a good manuscript for a simple behavior episode. I did not (yet) test it, but Dall-E could be used to generate the imagery for it.
Write a short manuscript of a scene where an American man, of age 50, comes to a Japanese home for the first time. He does not know how to behave. Please describe what would be good behavior in this situation?
John, a 50-year-old American man, was feeling nervous as he walked up to the front door of his Japanese friend’s home. He had never been to Japan before and was worried about making a mistake and offending his hosts.
As he entered the home, he noticed that everyone had taken their shoes off and placed them neatly by the door. John quickly followed suit and took off his shoes, placing them next to the others.
He was then greeted by his Japanese hosts with a bow. John wasn’t sure what to do, but he tried to imitate their actions as best he could. He lowered his head and gave a slight bow in return.
The hosts then offered John a seat on a cushion on the floor, which he accepted gratefully. John tried to sit with his legs crossed, as he had seen in movies, but found it uncomfortable. He shifted his position several times, worried that he was being rude or disrespectful.
As they began to serve tea and snacks, John watched carefully to see how his hosts behaved. He noticed that they always used both hands when passing objects, and that they never poured their own tea.
Taking these cues, John tried to be as respectful and polite as possible. He always used both hands when passing things, and made sure to pour tea for his hosts before pouring his own.
Despite his initial nervousness, John began to relax as …
Is there any sense in coding behaviors?
The idea behind IoB is to code behaviors at a relevant resolution, depending on the situation and context, and not necessarily in every detail. No explicit IoB system exists as far as I know, although in computer game environments and in animation, coding of behaviors has been used in various forms for quite some time.
One possibility for testing the generation of behaviors with the IoB approach is to start top down. That would mean using behavior codes that cover high-level, perhaps complex behavioral components (e.g. a video of a man approaching a Japanese house) in a specific situation/context. In other words, codes for the walking style of a man would not be used unless the style carries information value and is something to be controlled. The output behavior codes could then be fed to software that generates virtual characters with that behavior. Dozens of such tools are already on the market, but so far they evidently have compatibility issues.
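To make the top-down idea concrete, here is a minimal sketch of what a behavior code book could look like as a data structure, with high-level codes expanded into instructions a character generator could consume. All names and codes are hypothetical illustrations of my own, not an existing IoB specification.

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorCode:
    code: str          # short identifier, analogous to a musical marking
    description: str   # what the behavior looks like when performed
    level: int = 0     # 0 = high-level component; higher = finer resolution

@dataclass
class CodeBook:
    context: str                              # the situation the codes apply to
    codes: dict = field(default_factory=dict)

    def define(self, code, description, level=0):
        self.codes[code] = BehaviorCode(code, description, level)

    def expand(self, sequence):
        """Turn a coded behavior sequence into readable instructions,
        e.g. for feeding a character-animation or video generator."""
        return [self.codes[c].description for c in sequence]

# hypothetical code book for the Japanese-home scene above
book = CodeBook("visiting a Japanese home")
book.define("REMOVE_SHOES", "take off shoes at the door and place them neatly")
book.define("BOW_GREET", "return the hosts' greeting with a slight bow")
book.define("PASS_TWO_HANDS", "use both hands when passing objects", level=1)
```

A generator tool would then consume something like `book.expand(["REMOVE_SHOES", "BOW_GREET"])` rather than free text, which is what would make the behaviors shareable and reusable across tools, once the compatibility issues are solved.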
What use could this be? If the ambition level is as high as it now is with recent AI developments, one can imagine various meaningful uses where certain types of behaviors can be generated, for purposes ranging from education and entertainment to work and art. And then, returning to the question of well-behaving AI, this could offer one way towards generated, positive or ethically sustainable AI behaviors.