Establishing Best Practices for Building Rigorous Agentic Benchmarks (arxiv.org)
2 points by consumer451 10 hours ago