How OpenAI’s bot crushed this seven-person company’s web site ‘like a DDoS attack’



On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s ecommerce site was down. It looked to be some kind of distributed denial-of-service attack. 

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. 

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.” 

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. 

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site. 

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models. 

It sells the 3D object files, as well as photos – everything from hands to hair, skin, and full bodies – to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)

Robots.txt, the file that implements the Robots Exclusion Protocol, was created to tell search engine crawlers what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.

As Tomchuk experienced, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their hearts’ content. It’s not an opt-in system.
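As a rough illustration of what “properly configured” can mean in practice, the sketch below uses Python’s standard `urllib.robotparser` to check whether a robots.txt file would turn away the three OpenAI user agents named above. The file contents, the `blocked_agents` helper, and the `example.com` URL are assumptions made up for this example, not Triplegangers’ actual configuration, and the check only shows what a compliant crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The user-agent tokens are the three
# OpenAI bots named in the article; everything else is illustrative.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: True} for each agent the rules forbid from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(
    ROBOTS_TXT,
    ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Googlebot"],
    "https://example.com/products/1",  # placeholder URL
)
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

With these sample rules, the three OpenAI agents are reported as blocked while an ordinary search crawler is still allowed. A check like this says nothing about whether a given bot actually obeys the file.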

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.

Robots.txt also isn’t a failsafe. AI companies voluntarily comply with it. Another AI startup, Perplexity, pretty famously got called out last summer by a Wired investigation when some evidence implied Perplexity wasn’t honoring it.

Triplegangers product page
Each of these is a product, with a product page that includes multiple additional photos. Used by permission. Image Credits: Triplegangers

Can’t know for certain what was taken

By Wednesday, after days of OpenAI’s bot returning, Triplegangers had a properly configured robots.txt file in place, and also a Cloudflare account set up to block GPTBot and several other bots Tomchuk discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). He is also hopeful he’s blocked crawlers from other AI model companies. On Thursday morning, the site didn’t crash, he said.

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He’s found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is an especially tricky issue for Triplegangers. “We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

Triplegangers’ website was also an especially delicious find for AI crawlers. Multibillion-dollar-valued startups, like Scale AI, have been created where humans painstakingly tag images to train AI. Triplegangers’ site contains photos tagged in detail: ethnicity, age, tattoos vs scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had it scraped more gently, Tomchuk never would have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

OpenAI crawler log
Triplegangers’ server logs showed how ruthlessly an OpenAI bot was accessing the site, from hundreds of IP addresses. Used by permission.

He wants other small online businesses to know that the only way to discover if an AI bot is taking a website’s copyrighted belongings is to actively look. He’s certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem grew sharply in 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024, meaning traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” warns Tomchuk. “Now we have to daily monitor log activity to spot these bots.”
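The kind of daily log check Tomchuk describes could look something like the following sketch, which counts requests and distinct IP addresses per known bot. It assumes access logs in the common “combined” format, where the user agent is the last quoted field. The sample lines, the `summarize_bots` helper, and the token list are invented for illustration; the bot names are the ones mentioned in this article.

```python
import re
from collections import Counter

# Assumes "combined" log format: client IP first, user agent as the last
# quoted field. Other log formats would need a different pattern.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"$')

# Bot tokens named in the article; extend as needed.
BOT_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler")


def summarize_bots(lines):
    """Return {token: (request_count, distinct_ip_count)} for matching bots."""
    hits = Counter()
    ips = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                hits[token] += 1
                ips.setdefault(token, set()).add(match.group("ip"))
                break
    return {token: (hits[token], len(ips[token])) for token in hits}


# Invented sample lines using documentation-reserved IP addresses.
SAMPLE = [
    '203.0.113.5 - - [11/Jan/2025:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '203.0.113.5 - - [11/Jan/2025:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '198.51.100.7 - - [11/Jan/2025:10:00:01 +0000] "GET /p/3 HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.1"',
    '198.51.100.9 - - [11/Jan/2025:10:00:02 +0000] "GET /p/4 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '192.0.2.44 - - [11/Jan/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/133.0"',
]

summary = summarize_bots(SAMPLE)
for token, (requests, ip_count) in summary.items():
    print(f"{token}: {requests} requests from {ip_count} IPs")
```

Matching on the user-agent string only catches bots that identify themselves, so a crawler that disguises its user agent would slip past a check like this.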

When you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” Tomchuk says.



