Everything You Need to Know About the Reddit and OpenAI Collaboration

UPDATED: June 20, 2024
PUBLISHED: June 21, 2024
Reddit and OpenAI apps on an iPhone

Reddit and OpenAI announced a collaboration in May 2024. So what does this mean for both platforms? Reddit has become synonymous with honest information from real people—so much so, in fact, that many Google users now type “Reddit” at the end of their search queries, and Google search algorithms prioritize content from Reddit and other public forum sites. 

If you’ve ever looked for an honest review of a product, you’ll know why. Let’s say you search for the best passport wallet. It’s likely you will be inundated with websites that have strategically gamed the results through search engine optimization and product guides that may be not-so-subtly influenced by sponsored reviews. 

We’ve learned that the workaround is to add “Reddit” to the end of a search, because the results come back with user posts and comments from a variety of people who are more likely to give you the review straight. 

So it’s not surprising that Reddit and OpenAI have partnered to use Reddit posts as a training dataset for generative AI tools and to allow Reddit to add new AI-powered features. In today’s world, we have to add generative AI to everything. 

In the announcement on the OpenAI website, Reddit co-founder Steve Huffman was quoted as saying, “Reddit has become one of the internet’s largest open archives of authentic, relevant and always up to date human conversations about anything and everything. Including it in ChatGPT upholds our belief in a connected internet, helps people find more of what they’re looking for, and helps new audiences find community on Reddit.”

But when I found out about the partnership, I had some concerns. First of all, people post some very personal stuff to Reddit.

Redditors’ feelings about the use of their content 

Although numerous Redditors expressed privacy concerns, many also indicated that the content was freely posted on a public forum.

In a thread about the partnership on r/technology, Reddit user Chicano_Ducky wrote, “The moment an AI says ‘thanks for the gold’ then I know humanity is cooked lmao.” 

In a thread in r/OpenAI, Reddit user danpinho wrote, “People write for free on a public space and expect it to remain private? Reality check: Some people scrape your comments for free. At least OpenAI is paying for it. And remember: If it’s free, you are the product.” 

Many Reddit users pointed out that the data is already being used, which is true. 

Parameters of OpenAI’s Reddit use

Let’s start with a small dose of reality. OpenAI’s models have always used Reddit, and any other publicly available internet information, for training data. In fact, you can download datasets from nearly a million subreddits from the Reddit corpus. 

Meredith Broussard is a data journalism associate professor at the Arthur L. Carter Journalism Institute of New York University and the author of the books Artificial Unintelligence and More Than a Glitch. She says that, initially, there were no intellectual property concerns around the data. 

“Nobody really thought about it much because nobody was making money off of using it,” says Broussard. “So now that OpenAI has received so much investment, they’re going around and making these deals to make nice with organizations whose content they’ve already used.” 

Privacy risks with the Reddit and OpenAI partnership

Reddit may be a great source of public information, but many of its users post about very personal issues on its forums. There is a subreddit for everything, and many users ask for advice about the most intimate parts of their lives: sex, love, struggles with mental health or addiction. It is concerning that this content is out there to train AI. 

“One of the problems with generative AI is that it does not distinguish between sensitive data and other data,” says Broussard. “So you do need to have protections in place so that people’s PII [personal identifying information] is not widely distributed. Some of those guardrails are in place inside systems like ChatGPT and other places, and sometimes they’re not. In the same way that there’s no way to stop generative AI chatbots from hallucinating, there’s also no way to totally stop them from disclosing any PII that is in the training data.” 

And even if the data is scrubbed, it’s still a risky game. 

“Overall, there’s no good way to be absolutely certain that personal information is not being disclosed somehow,” says Broussard. 
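To see why scrubbing is so hard to get right, consider a minimal sketch of pattern-based redaction. The patterns and function below are illustrative assumptions, not any platform’s actual pipeline: they catch obvious formats like email addresses and phone numbers, but personal details expressed in ordinary free text slip straight through.

```python
import re

# Illustrative patterns only -- a real redaction pipeline would use many
# more patterns plus named-entity recognition, and would still miss cases.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace recognizable PII formats with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

# Structured formats are caught:
print(scrub("Email me at jane.doe@example.com or call 555-867-5309."))

# But identifying details in free text are not:
print(scrub("I'm the only night-shift nurse at the hospital in my small town."))
```

The second example is the crux of Broussard’s point: no pattern list can anticipate every way a person might identify themselves in conversation, so some PII inevitably survives into the training data.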

Broussard believes this could put OpenAI at risk of litigation from individuals whose personal information is leaked by the large language model (LLM), similar to the lawsuit the Authors Guild brought after many authors found that their work had been used to train ChatGPT. 

“It would make sense that there would be upcoming lawsuits around personal information being leaked by generative AI models,” says Broussard. 

AI’s algorithmic bias 

AI is also known to carry human biases into its algorithms. Data is human, and human patterns have not always been the most inclusive. This means that AI has been shown to have bias based on race, gender, ability and other factors that marginalize individuals. 

“One of the first things I think about for AI bias concerns with the deal is the community of Reddit users, which is only a very small subset of the people out there or the people online,” says Broussard. “So if the voices of Reddit users are overrepresented in the training data… this is going to disproportionately affect the results of whatever ChatGPT or other generative AI systems create.” 

AI has a long way to go

As a journalist who has covered AI from all angles, my main takeaway has always been that artificial intelligence is not as “smart” as we make it out to be. AI is great at recognizing patterns in data, and these patterns are always generated from the past. 

So while AI can create undeniable efficiencies, it can’t provide the level of expertise that a human who knows what they’re doing can. Reddit might be able to add to the capabilities of GPT models and other LLMs, but as previously mentioned, it has already been used. 

“An expert human is still better than a mediocre AI,” says Broussard. “If you’re a human who doesn’t know how to do a thing, it looks like the AI is doing an adequate job. But we don’t strive for mediocrity in the world. We strive for excellence, right?”

Photo by Koshiro K/Shutterstock.