AI and the open web

There seems to be a protectionist trend at the moment, with large platforms restricting access to their data more tightly. It is seen mostly as a response to large language models, such as the GPT models behind ChatGPT, scooping up data from the web. If it leads to more closed behaviour on the web, it will be a negative trend.

Protectionist trend – Reddit, now Twitter

In June, Reddit raised the prices for its API. Reddit’s owners are planning to take the company public, and they need to boost revenue from the social news site before they do. Reddit co-founder and CEO Steve Huffman told The New York Times: “The Reddit corpus of data is really valuable, but we don’t need to give all of that value to some of the largest companies in the world for free.”

This has led to an ongoing strike by volunteer moderators that has caused mass disruption on the platform. Huffman has said that he is not backing down. He told The Associated Press: “Protest and dissent is important. The problem with this one is it’s not going to change anything because we made a business decision that we’re not negotiating on.” The dispute has reached an impasse.

Yesterday, Elon Musk announced that Twitter is putting a limit on how many posts you can read per day. This is what he said in a tweet:

To address extreme levels of data scraping & system manipulation, we’ve applied the following temporary limits:

  • Verified accounts are limited to reading 6000 posts/day
  • Unverified accounts to 600 posts/day
  • New unverified accounts to 300/day

Later, Musk tweeted that the limits had been raised to 10,000, 1,000, and 500 posts per day respectively.

“Several hundred organizations (maybe more) were scraping Twitter data extremely aggressively, to the point where it was affecting the real user experience,” Musk said.

It doesn’t make much sense that companies would be scraping data at this scale; it is an inefficient way to gather that kind of data. Even if Twitter is worried that some companies are getting around paying for API access by scraping web pages, restricting usage for regular users seems like cutting off your nose to spite your face. Usually, businesses want to encourage people to use their service as much as possible, because that is how they make money!

How will it play out?

It is hard to tell how this will play out. It is a battle to monetize a new frontier. The data holders want a slice of the pie if their platforms are prime sources for language models to build knowledge and interact in a more human-like fashion.

It could be that this is being used opportunistically to justify increasing prices for API access. Blame the bots! The truth is that it is hard to know what the reality is unless you are behind the scenes.

Users suffer as they are caught in the middle. The market for third-party apps shrinks and can become untenable for some small businesses. That is bad for consumer choice.

Web standards need to adapt. At the moment, I assume AI bots crawl pages much like search engine bots do, respecting the robots.txt file. As far as I know, there is no explicit way to grant or withhold permission for data to be used in training language models. You may have to explicitly block a bot to opt out; OpenAI, for example, has published instructions for blocking its bot.
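As a rough sketch, opting out currently looks something like the robots.txt entry below. The user-agent token shown (GPTBot) is the one OpenAI documents for its crawler; other AI crawlers would need their own entries, and the exact tokens depend on each vendor’s documentation.

    # Block OpenAI's crawler from the whole site.
    # "GPTBot" is OpenAI's documented user-agent token; other AI
    # crawlers need their own entries.
    User-agent: GPTBot
    Disallow: /

    # Search engine crawlers are unaffected unless listed separately.

Note that robots.txt is advisory: well-behaved crawlers honour it, but it does not technically prevent scraping, which is part of why platforms are reaching for harder measures.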

It is likely that regulation will be required in the long term. The major players are large companies, and they have a big advantage. Much will depend on whether they choose to defend their high ground aggressively.

Final thoughts

Personally, I don’t find this alarming. This is a familiar fight. It is just something that we need to figure out.

Open information and commerce have always sat uneasily together. This is a battle over information: who produces it, how you access it, and who gets paid for it. In Reddit’s case, it is galling that the rich data it wants to monetize is moderated by volunteers for free; it will be an interesting test case for how this side of the AI revolution evolves. How this is settled is important, because it will shape what the web becomes.

We should try to preserve openness; it is a great strength of the web. There needs to be a viable commercial solution that satisfies business needs. If one is not found, we will need regulation to mitigate the harm being done.
