Common Crawl is accused of providing paywalled content to AI companies

If you’ve ever wondered how AI companies like Google, Anthropic, OpenAI, and Meta get their training data from paywalled publishers like The New York Times or The Washington Post, we may finally have an answer.

In a detailed investigation for The Atlantic, journalist Alex Reisner reports that several AI companies have partnered with the Common Crawl Foundation, a nonprofit that scrapes the web to build a large public archive of the internet for research purposes. According to the report, Common Crawl, whose database spans many petabytes, has effectively opened a backdoor that allows AI companies to train their models on paywalled news content. In a blog post published today, Common Crawl strongly denies the allegations.

The foundation’s website claims its data is collected from freely available web pages. But its executive director, Richard Skrenta, told The Atlantic he believes AI models should be able to access everything on the internet. “The robots are people too,” Skrenta told The Atlantic.

AI chatbots such as ChatGPT and Google Gemini have caused a stir in the journalism industry. They glean information from publishers and share it directly with readers, diverting clicks and visitors away from those publishers. This phenomenon has been called the traffic apocalypse and the AI armageddon.

According to The Atlantic’s report, some news publishers have become aware of Common Crawl’s activities, and some have blocked the foundation’s scraper by adding an instruction to their website code. However, that only protects future content, not anything that has already been scraped.
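
For readers curious what such a blocking instruction can look like: crawlers are conventionally addressed through a site's robots.txt file. The sketch below is illustrative only. The robots.txt contents, the example.com URL, and the helper name `is_allowed` are assumptions made for this example, not details from The Atlantic's report. It uses Python's standard `urllib.robotparser` module to check whether a crawler identifying itself as CCBot would be permitted to fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks CCBot site-wide and leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

A rule like this only tells a compliant crawler what to skip on future visits. As the report notes, it does nothing about pages that were collected before the rule was added.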

Many publishers have asked Common Crawl to remove their content from its archives. The foundation indicated that it was complying, albeit slowly, because of the volume of data. One organization shared multiple emails from Common Crawl with The Atlantic stating that the removal process was “50 percent, 70 percent, then 80 percent” complete. But Reisner found that none of those requests appeared to have been fulfilled, and that Common Crawl’s archives hadn’t been modified since 2016.

Skrenta told The Atlantic that the file format used to store the archives is “intended to be immutable,” meaning content cannot be removed once it has been added. However, Reisner reports that the public search tool, the only non-technical way to look through Common Crawl’s archives, returns misleading results for certain domains, hiding the extent of what has been scraped and stored.

Mashable reached out to Common Crawl, and a team member pointed us to a public blog post from Skrenta. In it, Skrenta denied that the organization had misled publishers, saying that its web crawler does not bypass paywalls. He also emphasized that Common Crawl is financially independent and “doesn’t do the dirty work of AI.”

“The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has ‘lied to publishers’ about our activities,” the blog post states. “Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go ‘behind paywalls,’ do not log in to any websites, and do not employ any method designed to evade access restrictions.”

However, as Reisner reports, Common Crawl has previously received donations from OpenAI, Anthropic, and other AI-focused companies. It also lists Nvidia as a “collaborator” on its website. In addition to collecting raw text, Reisner writes, the foundation also helps assemble and distribute AI training datasets, even hosting them for wider use.

Either way, the fight over how the AI industry uses copyrighted material is ongoing. OpenAI, for example, remains at the center of several lawsuits from major publishers, including The New York Times and Mashable’s parent company, Ziff Davis.
