gaborous

TL;DR

Even from the point of view of AI researchers, Stack Overflow and other sites with mostly human-generated content should ban AI-generated content or require that it be labelled as such. Otherwise, it will cause a catastrophic circular-reasoning failure: content generated after 2022 can no longer be fed into the training of newer AI models, since we can't know whether it was generated by humans or by older AI models.

Longer argument

I would like to provide an alternative perspective, not from the standpoint of Stack Overflow human users, but from Artificial Intelligence researchers.

It's highly likely that GPT-3, and hence ChatGPT, was trained on all of Stack Overflow's data. This worked because, at the time, all of the input was human-generated. (Let's put aside the question of whether it's ethical for AI researchers to use third-party content to train AI models without asking the respective owners. I am focusing here on the facts that it already happened, that it cannot be undone, and on the impact on our current and future situation.)

Now, if answers from humans are mixed with answers generated by AI, we get a tainted dataset that is unusable for training future LLMs or other language models, because it creates a deeply flawed circular-reasoning loop: we would be feeding a new AI model data that an older AI model generated, without being able to determine what was written by humans and what by AI.

This means that if we can't ensure that most answers remain human-generated, AI models face a catastrophic failure, as it will simply become impossible to use newer data to build newer models. 2022 would become an "event horizon" for AI: data generated before that year remains usable for training, but anything generated afterwards is mostly unusable, because it is potentially tainted, in large proportions, by AI-generated content.

So this issue is not even specific to Stack Overflow: every website should either ban AI-generated content or require that it be labelled as such. Even then, this only works with compliant users. Since there is no 100% reliable way to detect AI-generated text, and since we can always expect people to game the system, especially when there are incentives to do so, this catastrophic failure seems all but inevitable.
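The circular training loop described above can be illustrated with a deliberately simplified sketch (my own toy example, not something from the original argument): stand in for an LLM with a Gaussian fitted by maximum likelihood, repeatedly refit it to its own samples, and watch the distribution degenerate over "generations" of self-training.

```python
# Toy sketch of training a model on an older model's output.
# Assumption: a Gaussian fitted by MLE stands in for a language model;
# each "generation" is trained only on the previous generation's samples.
import random
import statistics

random.seed(42)

def fit_and_resample(data):
    """Fit a Gaussian (MLE) to the data, then sample a same-sized
    dataset from the fitted model, simulating AI-generated content."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # MLE estimate (divides by n)
    return [random.gauss(mu, sigma) for _ in data]

# Generation 0: "human" data, drawn from N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(50)]
v0 = statistics.pvariance(data)

# 300 generations, each trained on the previous generation's output.
for _ in range(300):
    data = fit_and_resample(data)

v_final = statistics.pvariance(data)
print(f"variance: generation 0 = {v0:.3f}, generation 300 = {v_final:.6f}")
```

In this toy setting, the variance shrinks systematically across generations, so the "model" forgets the diversity of the original human data; real language models are vastly more complex, but this is the flavour of degradation the circular loop risks.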