Related to Wikipedia in the broadest sense, both as input for AI models and for its own likely future.
https://www.ft.com/content/ae507468-7f5 ... a81c6bf4a5
AI model collapse
-
- Sucks Fan
- Posts: 248
- Joined: Thu Jun 27, 2024 5:19 pm
- Has thanked: 2 times
- Been thanked: 55 times
Re: AI model collapse
Looks like a paywall, though I suppose I could sign up for their trial. Tried an archive site, but the paywall is still there.
The title seems dubious. Companies like Google and Facebook, NGOs like Wikipedia (despite what they might say), and government agencies like the NSA must hold tremendous amounts of data, and for most of these organizations it's probably their most valuable asset. There's no lack of data per se, but the public has access to very little of it, and as such the public is at a large disadvantage. Things like LLMs receive a lot of favorable press, but I don't think they can or should replace a site like Wikipedia, even though Wikipedia is shite. They're a one-way, one-to-many form of communication. They simulate discourse, which is ideal for the propagandist but detrimental to the public interest. AI is dangerous, just not in the way one might expect.
A few notes, not having read the article, but in general:
- LLMs that cite sources will probably come about soon. I've seen this talking point before, but there doesn't seem to be any large technical obstacle (a toy sketch follows this list).
- I get the sense that many search results already point to websites generated by LLMs.
- It's becoming hard to talk with actual people on the internet and have a real conversation.
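On the first bullet: here is a minimal retrieve-then-cite sketch of why citation isn't architecturally hard. The corpus, the word-overlap scoring, and the citation format are all invented for illustration; a real system would use an actual retriever and pass the retrieved text to the model as context.
[code]
# Toy retrieval-then-cite loop: the generator never has to "remember"
# sources, because the retriever hands them over alongside the text.
CORPUS = {
    "doc-1": "Model collapse occurs when models train on model output.",
    "doc-2": "Wikipedia is a major source of LLM training data.",
}

def retrieve(query, k=1):
    # Rank documents by crude word overlap with the query (a stand-in
    # for a real embedding-based retriever).
    def score(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(CORPUS, key=lambda d: score(CORPUS[d]), reverse=True)[:k]

def answer_with_citation(query):
    doc_ids = retrieve(query)
    # A real system would feed the retrieved passages to the LLM and ask
    # it to answer from them; here we just echo them with inline citations.
    return " ".join(f"{CORPUS[d]} [{d}]" for d in doc_ids)

print(answer_with_citation("why do models collapse"))
[/code]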
-
- Sucks Mod
- Posts: 626
- Joined: Wed Jul 26, 2017 3:24 am
- Has thanked: 786 times
- Been thanked: 382 times
Re: AI model collapse
There ya go.
It's about what happens to AI models when AI-generated content is among their inputs.
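A minimal sketch of that feedback loop, assuming nothing beyond the article's premise: fit a simple model to data, replace the data with the model's own samples, and repeat. With a plain Gaussian standing in for an LLM, the fitted spread tends to shrink over the generations as the tails of the original distribution get lost.
[code]
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data, a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 101):
    # "Train" a model on the current data (here the model is just a
    # Gaussian fitted by its sample mean and standard deviation).
    mu, sigma = data.mean(), data.std()
    # The next generation sees only the previous model's output,
    # with no fresh real data mixed back in.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 10 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")

# In expectation sigma shrinks every generation (the fitted variance is
# a biased, noisy estimate), so the distribution's tails are gradually
# lost -- a toy version of the degeneration the paper describes.
[/code]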
But since you'd like to discuss something else.
The opposite: the developers of at least ChatGPT added a hack which instructs it to not name its sources. This is to obscure plagiarism and the associated potential copyright claims. From their perspective this is one of the points, maybe even the main point, of chatbots: it provides a layer of plausible deniability for the developers which would not exist if they just copy-pasted the content they steal.
Some Wikipedia content is like this too.
-
- Sucks Fan
- Posts: 248
- Joined: Thu Jun 27, 2024 5:19 pm
- Has thanked: 2 times
- Been thanked: 55 times
Re: AI model collapse
That's not a hard problem, particularly for one of these organizations. I imagine they have a fair idea of what's original and what's LLM slop. LLMs do not write original material.
Whether or not it attempts to cite sources would depend largely (or perhaps entirely) on the training set (and perhaps a penalty term in the objective function, sketched below). It probably wouldn't be hard to achieve either outcome.

Your point seems like a good one, though. The engineers, and generally whoever is responsible for a given LLM, can always use the "black box" characteristic as an excuse, presuming an artificial-neural-network-based model, which of course they all are. The general public understands AI even less and has been primed with large quantities of sci-fi schlock, so the excuse is even more believable from their perspective.

The usefulness of LLMs is probably quite limited outside of advertising, surveillance, propaganda and other such deceit, and thus of little value to the general public. I suppose this is part of why I lost interest in applied AI. While it is interesting in a theoretical sense, layer-stacking and organizing datasets is essentially clerical work and gets very boring very fast, and the field is very over-saturated.
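For what it's worth, here is a minimal sketch of what a "penalty term" for citations could look like. Everything in it (the [CITE] token id, the weight lam, the form of the penalty) is made up for illustration; it is not any lab's actual objective.
[code]
import torch
import torch.nn.functional as F

CITE_TOKEN_ID = 7  # hypothetical vocabulary id for a "[CITE]" marker

def loss_with_citation_penalty(logits, targets, lam=0.1):
    """Next-token cross-entropy plus a penalty that grows when the model
    is unlikely to ever emit the citation token in the sequence.

    logits:  (batch, seq_len, vocab) raw model outputs
    targets: (batch, seq_len)        reference token ids
    lam:     weight of the citation penalty
    """
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    probs = logits.softmax(dim=-1)
    # P(citation token never appears) ~= prod over positions of (1 - p_t);
    # penalising this pushes probability toward citing at least once.
    p_never_cites = (1.0 - probs[..., CITE_TOKEN_ID]).prod(dim=-1)
    return ce + lam * p_never_cites.mean()

# Smoke test with random numbers in place of a real model's outputs.
logits = torch.randn(2, 16, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 16))
print(loss_with_citation_penalty(logits, targets))
[/code]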
-
- Sucks Mod
- Posts: 626
- Joined: Wed Jul 26, 2017 3:24 am
- Has thanked: 786 times
- Been thanked: 382 times
Re: AI model collapse
Same story.
https://www.popsci.com/technology/ai-tr ... gibberish/
Here's the actual paper in Nature.
https://www.nature.com/articles/s41586-024-07566-y
-
- Sucks Fan
- Posts: 248
- Joined: Thu Jun 27, 2024 5:19 pm
- Has thanked: 2 times
- Been thanked: 55 times
Re: AI model collapse
boredbird wrote: ↑Fri Jul 26, 2024 1:37 am
Same story.
https://www.popsci.com/technology/ai-tr ... gibberish/
Here's the actual paper in Nature.
https://www.nature.com/articles/s41586-024-07566-y
More or less what I'd have expected. Thanks for the links, though.