cross-posted from: https://lemmy.sdf.org/post/2617125

A written out transcript on Scott Aaronson’s blog: https://scottaaronson.blog/?p=7431


My takes:

ELIEZER: What strategy can a like 70 IQ honest person come up with and invent themselves by which they will outwit and defeat a 130 IQ sociopath?

Physically attack them. That might seem like a non-sequitur, but what I’m getting at is that Yudowski seems to underestimate how powerful and unpredictable meatspace can be over the short-to-medium term. I really don’t think you could conquer the world over wifi either, unless maybe you can break encryption.

SCOTT: Look, I can imagine a world where we only got one try, and if we failed, then it destroys all life on Earth. And so, let me agree to the conditional statement that if we are in that world, then I think that we’re screwed.

Also agreed, with the caveat that there’s wide differences between failure scenarios, although we’re probably getting a random one at this rate.

ELIEZER: I mean, it’s not presently ruled out that you have some like, relatively smart in some ways, dumb in some other ways, or at least not smarter than human in other ways, AI that makes an early shot at taking over the world, maybe because it expects future AIs to not share its goals and not cooperate with it, and it fails. And the appropriate lesson to learn there is to, like, shut the whole thing down. And, I’d be like, “Yeah, sure, like wouldn’t it be good to live in that world?”

And the way you live in that world is that when you get that warning sign, you shut it all down.

I suspect little but reversible incidents are going to happen more and more, if we keep being careful and talking about risks the way we have been. I honestly have no clue where things go from there, but I imagine the tenor and consistency of response will be pandemic-ish.

GARY: I’m not real thrilled with that. I mean, I don’t think we want to leave what their objective functions are, what their desires are to them, working them out with no consultation from us, with no human in the loop, right?

Gary has a far better impression of human leadership than me. Like, we’re not on track for a benevolent AI if such a thing makes sense (see his next paragraph), but if we had that it would blow human governments out of the water.

ELIEZER: Part of the reason why I’m worried about the focus on short-term problems is that I suspect that the short-term problems might very well be solvable, and we will be left with the long-term problems after that. Like, it wouldn’t surprise me very much if, in 2025, there are large language models that just don’t make stuff up anymore.

GARY: It would surprise me.

Hey, so there’s a prediction to watch!

SCOTT: We just need to figure out how to delay the apocalypse by at least one year per year of research invested.

That’s a good way of looking at it. Maybe that will be part of whatever the response to smaller incidents is.

GARY: Yeah, I mean, I think we should stop spending all this time on LLMs. I don’t think the answer to alignment is going to come from through LLMs. I really don’t. I think they’re too much of a black box. You can’t put explicit, symbolic constraints in the way that you need to. I think they’re actually, with respect to alignment, a blind alley. I think with respect to writing code, they’re a great tool. But with alignment, I don’t think the answer is there.

Yes, agreed. I don’t think we can un-invent them at this point, though.

ELIEZER: I was going to name the smaller problem. The problem was having an agent that could switch between two utility functions depending on a button, or a switch, or a bit of information, or something. Such that it wouldn’t try to make you press the button; it wouldn’t try to make you avoid pressing the button. And if it built a copy of itself, it would want to build a dependency on the switch into the copy.

So, that’s an example of a very basic problem in alignment theory that is still open.

Neat. I suspect it’s impossible with a reasonable cost function, if the thing actually sees all the way ahead.

So, before GPT-4 was released, [the Alignment Research Center] did a bunch of evaluations of, you know, could GPT-4 make copies of itself? Could it figure out how to deceive people? Could it figure out how to make money? Open up its own bank account?

ELIEZER: Could it hire a TaskRabbit?

SCOTT: Yes. So, the most notable success that they had was that it could figure out how to hire a TaskRabbit to help it pass a CAPTCHA. And when the person asked, ‘Well, why do you need me to help you with this?’–

ELIEZER: When the person asked, ‘Are you a robot, LOL?’

SCOTT: Well, yes, it said, ‘No, I am visually impaired.’

I wonder who got the next-gen AI cold call, haha!