Codewashing
I have little understanding for people using large language models to generate slop: words and images that nobody asked for.
I have more understanding for people using large language models to generate code. Code isn’t the thing in the same way that words or images are; code is the thing that gets you to the thing.
And if a large language model hallucinates some code, you’ll find out soon enough:
With code you get a powerful form of fact checking for free. Run the code, see if it works.
But I want to push back on one justification I see repeatedly about using large language models to write code. Here’s Craig:
There are many moral and ethical issues with using LLMs, but building software feels like one of the few truly ethically “clean”(er) uses (trained on open source code, etc.)
That’s not how this works. Yes, the large language models are trained on lots of code (most of it open source), but they’re not only trained on that. That’s on top of everything else; all the stolen books, all the unpaid creative work of others.
Even Robin Sloan, who first says:
I think the case of code is especially clear, and, for me, basically settled.
…goes on to acknowledge:
But, again, it’s important to say: the code only works because of Everything. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.
When large language models are trained on domain-specific data, it’s always in addition to the mahoosive amount of content they’ve already stolen. It’s that mahoosive amount of content—not the domain-specific data—that enables them to parse your instructions.
(Note that I’m being very deliberate in saying “parse”, not “understand.” Though make no mistake, I’m astonished at how good these tools are at parsing instructions. I say that as someone who tried to write natural language parsers for text-only adventure games back in the 1980s.)
So, sure, go ahead and use large language models to write code. But don’t fool yourself into thinking that it’s somehow ethical.
What I said here applies to code too:
If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.