benchmarks Archives - GenixPlay Studios

Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late […]

OpenAI launches program to design new ‘domain-specific’ AI benchmarks

AI / Sasandara Dilmina

OpenAI thinks AI benchmarks are broken. Now the company is launching a program to fix how AI models are scored. The new OpenAI Pioneers Program will focus on creating evaluations for AI models that “set the bar for what good looks like,” as OpenAI phrased it in a blog post. “As the pace of AI

People are using Super Mario to benchmark AI now

AI / Sasandara Dilmina

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini

Did xAI lie about Grok 3’s benchmarks?

AI / Sasandara Dilmina

Debates over AI benchmarks, and how they’re reported by AI labs, are spilling out into public view. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was

AI isn’t very good at history, new paper finds

AI / Sasandara Dilmina

AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found. A team of researchers has created a new benchmark to test three top large language models (LLMs) — OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini — on historical questions.

AI researcher François Chollet is co-founding a nonprofit to build benchmarks for AGI

AI / Sasandara Dilmina

Former Google engineer and influential AI researcher François Chollet is co-founding a nonprofit to help develop benchmarks that’ll probe AI for “human-level” intelligence. The nonprofit, the ARC Prize Foundation, will be led by Greg Kamradt, an ex-Salesforce engineering director and founder of the AI product studio Leverage. Kamradt will serve as president and a member

Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024

AI / Sasandara Dilmina

When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti. It’s become something of a meme as well as a benchmark: Seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself

Deep tech startups with very technical CEOs raise larger rounds, research finds

Venture / Sasandara Dilmina

SaaS founders trying to figure out what it takes to raise their next round can refer to Point Nine’s famous yearly SaaS Funding Napkin. (The term refers to “back of the napkin” plans or calculations.) Now, European hardware deep tech teams have a similar resource from First Momentum, a pre-seed fund investing in technical B2B

Anthropic looks to fund a new, more comprehensive generation of AI benchmarks

AI / Sasandara Dilmina

Anthropic is launching a program to fund the development of new types of benchmarks capable of evaluating the performance and impact of AI models, including generative models like its own Claude. Unveiled on Monday, Anthropic’s program will dole out grants to third-party organizations that can, as the company puts it in a blog post, “effectively
