Introduction
Recently, I came across the MathVista benchmark, which evaluates how well LLMs can solve math problems presented as pictures. Remarkably, Claude 3.5 Sonnet, Gemini 1.5 Pro (May 2024), and GPT-4o have all outperformed the average human on it.
The newly released Claude 3.5 Sonnet reached a score of 67, noticeably higher than Gemini 1.5 Pro (May 2024) at 63.9 and GPT-4o at 63.8.
What is MathVista?
Here is the introduction of MathVista quoted from its homepage:
To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging.
This is the published leaderboard:
My Experiments
Data source
I found a website called Math-Exercises that categorizes math problems. It offers pictures of the problems and their answers.
I picked 5 problems from this page and tested them on Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. Let me show you each problem and the models' answers. Let's start.
Round 1 - Set Operations
Problem:
Find the intersection A∩B, union A∪B and differences A-B, B-A of sets A, B if :
Claude 3.5 Sonnet:
Gemini 1.5 Pro:
GPT-4o:
Right Answer:
Result:
🌟 Awesome! All three models are correct! Maybe it’s too simple for AI.
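Since the problem and answer screenshots aren't reproduced here, here is a small made-up example of the same type (the sets A and B below are hypothetical, not the ones from the actual screenshot), just to show what the models were asked to compute:

```latex
% Hypothetical sets, not the ones from the actual screenshot:
% A = {1, 2, 3, 4}, B = {3, 4, 5, 6}
\begin{align*}
A \cap B &= \{3, 4\}             && \text{elements in both sets} \\
A \cup B &= \{1, 2, 3, 4, 5, 6\} && \text{elements in either set} \\
A - B    &= \{1, 2\}             && \text{in } A \text{ but not in } B \\
B - A    &= \{5, 6\}             && \text{in } B \text{ but not in } A
\end{align*}
```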
Round 2 - Set Operations Again!
This time, I am using a slightly more abstract set problem.
Problem:
Find the intersection A∩B, union A∪B and differences A-B, B-A of sets A, B if :
Claude 3.5 Sonnet:
Gemini 1.5 Pro:
GPT-4o:
Right Answer:
Result:
🌟 Incredible, they are all right again! Is AI really good at set operations?
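Again, the screenshots aren't shown here, so here is a hypothetical version of such a problem using intervals instead of finite sets (my own numbers, not the ones from the screenshot):

```latex
% Hypothetical intervals, not the ones from the actual screenshot:
% A = (-2, 5], B = [1, 8)
\begin{align*}
A \cap B &= [1, 5] \\
A \cup B &= (-2, 8) \\
A - B    &= (-2, 1) \\
B - A    &= (5, 8)
\end{align*}
```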
Round 3 - Algebraic Expressions
I chose a different type of problem this time because I wanted to test the AIs' algebraic skills.
Problem:
By grouping the terms factor the polynomials and algebraic expressions :
Claude 3.5 Sonnet:
Gemini 1.5 Pro:
GPT-4o:
Right Answer:
Result:
😱 What? They are all wrong this time. I can't understand how they can know the method for solving this problem and still make mistakes at the end.
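For reference, factoring by grouping means pairing terms so that a common factor can be pulled out of each pair. Here is a made-up example of the technique (not the expression from the screenshot):

```latex
% Hypothetical expression, not the one from the actual screenshot:
\begin{align*}
ax + ay + bx + by
  &= a(x + y) + b(x + y) \\ % factor a common term out of each pair
  &= (a + b)(x + y)         % then factor out the common binomial (x + y)
\end{align*}
```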
Round 4 - Linear Equations
Problem:
Solve the linear equations and check the solution :
Claude 3.5 Sonnet:
Gemini 1.5 Pro:
GPT-4o:
It printed out too much, so I only took a screenshot of the final answer.
Right Answer:
Result:
🌟 They all found the right answer again! However, Claude failed its own answer check, even though its answer is correct. It's so odd.
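As before, the screenshots aren't included, so here is a made-up linear equation solved and checked the same way (not the actual equation from the screenshot):

```latex
% Hypothetical equation, not the one from the actual screenshot:
\begin{align*}
3(x - 2) &= x + 4 \\
3x - 6   &= x + 4 \\
2x       &= 10 \\
x        &= 5
\end{align*}
% Check: substituting x = 5 gives 3(5 - 2) = 9 on the left and 5 + 4 = 9 on the right.
```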
Round 5 - Inequalities
Maybe linear equations are too simple for the AIs, so this time I used inequalities!
Problem:
Solve the linear inequalities with absolute value :
Claude 3.5 Sonnet:
Gemini 1.5 Pro:
It printed out too much, so I only took a screenshot of the final answer.
GPT-4o:
It printed out too much, so I only took a screenshot of the final answer.
Right Answer:
Result:
😢 Maybe it's too difficult for AI. They all tried hard, but every answer is wrong.
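To give a sense of what this round involved, here is a made-up absolute-value inequality worked the standard way (not the actual inequality from the screenshot):

```latex
% Hypothetical inequality, not the one from the actual screenshot:
\begin{align*}
|2x - 3|    &< 5 \\
-5 < 2x - 3 &< 5 \\
-2 < 2x     &< 8 \\
-1 < x      &< 4
\end{align*}
% Solution set: x \in (-1, 4)
```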
Conclusion
AI can solve many math problems. Although it may fail in some cases, it can still provide useful hints.
In the end, Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o earned the same score in this little math match! Based on this small sample, state-of-the-art AIs seem to perform similarly on math.
Links:
- I used Poe to run this test, and it's really awesome. It offers many advanced models and community models for users.
- I am building a web application to help solve math problems. Feel free to try it out: AI Math Solve.