Meta is facing scrutiny over how it benchmarked its Llama 4 Scout and Maverick models, as critics raise concerns about transparency in AI model evaluation. The controversy stems from the company’s presentation of benchmark results, which some argue may not accurately represent the models actually available to users.
The two new open-weight large language models, Scout and Maverick, use a mixture-of-experts architecture. Scout, which features 16 experts specialized for particular tasks or data types, is designed to run on a single Nvidia H100 GPU and to serve developers with limited resources. Maverick, by contrast, scales to 128 experts and is intended for more complex workloads. Both models are derived from Llama 4 Behemoth, an unreleased model with 288 billion active parameters that is still in training.
The mixture-of-experts design allows Scout and Maverick to activate only a subset of experts for each input, so just a fraction of the total parameters participate in any given forward pass. This selectivity offers efficiency gains over dense models and lowers the cost of inference, making deployment across a range of enterprise use cases more practical.
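To illustrate the idea rather than Meta’s actual implementation, the sketch below shows a minimal mixture-of-experts feed-forward layer in PyTorch with top-1 routing: a small router scores each token, and only the selected expert’s weights are used for that token. The dimensions, the expert count, and the top-1 routing choice are assumptions made to keep the example small.

```python
# Illustrative sketch only -- not Llama 4 code. Sizes and top-1 routing are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneMoE(nn.Module):
    """Minimal mixture-of-experts feed-forward layer with top-1 routing."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each token for every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Route each token to its highest-scoring expert.
        gate_probs = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)       # chosen expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                           # tokens assigned to expert e
            if mask.any():
                # Only this expert's weights run for these tokens,
                # which is where the inference savings come from.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 8 tokens pass through the layer, each touching exactly one of 16 experts.
tokens = torch.randn(8, 64)
layer = TopOneMoE()
print(layer(tokens).shape)  # torch.Size([8, 64])
```

With 16 experts and top-1 routing, each token exercises roughly one sixteenth of the expert parameters per layer, which is the efficiency argument behind the design described above; production systems typically add load-balancing losses and batched expert dispatch that this sketch omits.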
Although Llama 4 Maverick reached the second spot on the LM Arena leaderboard, researchers have questioned which version of the model was actually benchmarked. They pointed out that the version submitted to the leaderboard was an “experimental” chat variant of Maverick, not the open-weight model available to the public. Benchmarking non-public versions has drawn criticism within the AI community, with researchers arguing that the practice undermines the integrity of leaderboard comparisons.
In response to the backlash, Meta’s vice president of generative AI, Ahmad Al-Dahle, defended the company’s practices. He denied claims that Meta trained its models on test sets to artificially inflate results, saying that performance variations can occur depending on the platform on which the models are run. Al-Dahle added that Meta is still fine-tuning public deployments to match the quality levels seen internally.
The debate surrounding Meta’s Llama 4 models highlights the importance of transparency and accountability in AI model evaluation. As models grow more complex, ensuring that benchmark results reflect the versions actually available to users is crucial for maintaining trust within the AI community. For companies like Meta, keeping benchmarking practices open and consistent will be essential to upholding the credibility of their models.