Beyond Benchmarks: Human-Aligned Evaluation Frameworks for Large Language Models