Can AI really solve math from pictures?

ppaanngggg - Jul 12 - Dev Community

Image of AI solving math

Introduction

Recently, I came across the MathVista benchmark, which evaluates how well LLMs solve math problems presented in pictures. Remarkably, Claude 3.5 Sonnet, Gemini 1.5 Pro (May 2024), and GPT-4o have all outperformed the average human on it.

The newly released Claude 3.5 Sonnet scored 67, well ahead of Gemini 1.5 Pro (May 2024) at 63.9 and GPT-4o at 63.8.

What is MathVista?

Here is the introduction to MathVista, quoted from its homepage:

To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging.

Here is the published leaderboard:

latest mathvista leaderboard

My Experiments

Data source

I found a website called Math-Exercises that organizes math problems by category and provides pictures of each problem along with its answer.

I picked five problems from the site and tested them on Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. For each round, I'll show the problem, each model's answer, and the correct answer. Let's start.

Round 1 - Set Operations

Problem:

Find the intersection A∩B, union A∪B, and differences A-B, B-A of sets A and B if:

set operation problem 1

Claude 3.5 Sonnet:

Claude solution of set operation problem 1

Gemini 1.5 Pro:

Gemini solution of set operation problem 1

GPT-4o:

GPT solution of set operation problem 1

Right Answer:

Answer of set operation problem 1

Result:

🌟 Awesome! All three models are correct! Maybe it’s too simple for AI.
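If you want to check answers like these yourself, Python's built-in set type supports all four operations directly. The sets below are hypothetical placeholders, since the real ones only appear in the problem picture. This is a minimal sketch and not what the models or the website used.

```python
# Hypothetical sets for illustration; the actual sets are in the picture.
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

intersection = A & B  # A∩B: elements in both sets
union = A | B         # A∪B: elements in either set
a_minus_b = A - B     # A-B: elements of A that are not in B
b_minus_a = B - A     # B-A: elements of B that are not in A

print(sorted(intersection))  # [4, 5]
print(sorted(union))         # [1, 2, 3, 4, 5, 6, 7]
print(sorted(a_minus_b))     # [1, 2, 3]
print(sorted(b_minus_a))     # [6, 7]
```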

Round 2 - Set Operations Again!

This time, I am using a slightly more abstract set problem.

Problem:

Find the intersection A∩B, union A∪B, and differences A-B, B-A of sets A and B if:

set operation problem 2

Claude 3.5 Sonnet:

Claude solution of set operation problem 2

Gemini 1.5 Pro:

Gemini solution of set operation problem 2

GPT-4o:

GPT solution of set operation problem 2

Right Answer:

Answer of set operation problem 2

Result:

🌟 Incredible, they are all right again! Is AI really good at set operations?
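When a set is defined by a rule rather than listed element by element, one way to check an answer is to first enumerate each set over a finite range. The rules and the range 1 to 20 below are hypothetical assumptions for illustration, not the sets from the picture. The approach only works if the chosen range contains every element that could belong to either set.

```python
# Hypothetical rules, enumerated over the finite universe 1..20.
universe = range(1, 21)

A = {x for x in universe if 12 % x == 0}             # divisors of 12
B = {x for x in universe if x % 2 == 0 and x <= 10}  # even numbers up to 10

print(sorted(A & B))  # [2, 4, 6]
print(sorted(A | B))  # [1, 2, 3, 4, 6, 8, 10, 12]
print(sorted(A - B))  # [1, 3, 12]
print(sorted(B - A))  # [8, 10]
```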

Round 3 - Algebraic Expressions

I chose a different type of problem. This time, I want to test AI's algebraic skills.

Problem:

Factor the polynomials and algebraic expressions by grouping the terms:

Algebraic Expressions problem
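For readers who haven't seen the technique, here is factoring by grouping on a hypothetical expression (not the one in the picture). You split the terms into groups that share a common factor, pull that factor out of each group, and then factor out the common binomial.

```latex
\begin{aligned}
ax + ay + bx + by &= (ax + ay) + (bx + by) \\
&= a(x + y) + b(x + y) \\
&= (a + b)(x + y)
\end{aligned}
```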

Claude 3.5 Sonnet:

Claude solution of Algebraic Expressions problem

Gemini 1.5 Pro:

Gemini solution of Algebraic Expressions problem

GPT-4o:

GPT solution of Algebraic Expressions problem

Right Answer:

Answer of Algebraic Expressions problem

Result:

😱 What? They are all wrong this time. I can’t understand how they can know the method for solving this problem yet still make mistakes at the end.
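One cheap way to catch this kind of last-step mistake is to evaluate the original expression and the proposed factorization at random points. A single mismatch proves the factorization is wrong. Matching at every tested point is strong evidence that it is right, though not proof. The functions below are a hypothetical sketch built on a made-up expression, not the problem from the picture.

```python
import random

def original(a, b, x, y):
    # Hypothetical expression to factor: ax + ay + bx + by
    return a*x + a*y + b*x + b*y

def factored(a, b, x, y):
    # Candidate factorization obtained by grouping: (a + b)(x + y)
    return (a + b) * (x + y)

def wrong(a, b, x, y):
    # A deliberately wrong candidate, to show the check catching it
    return (a + b) * (x - y)

def agrees(f, g, trials=20):
    """Spot-check that f and g agree at random integer points.

    A mismatch proves f != g; agreement at all points is only evidence.
    """
    for _ in range(trials):
        args = [random.randint(-100, 100) for _ in range(4)]
        if f(*args) != g(*args):
            return False
    return True

print(agrees(original, factored))  # True
print(agrees(original, wrong))     # False with overwhelming probability
```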

Round 4 - Linear Equations

Problem:

Solve the linear equations and check the solutions:

Linear Equations problem
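As a reminder of what "check the solution" means, here is a hypothetical example (not one of the equations in the picture). First solve for x, then substitute x back into both sides and confirm that they are equal.

```latex
\begin{aligned}
5x + 2 &= 2x + 11 \\
3x &= 9 \\
x &= 3 \\[4pt]
\text{Check: } 5(3) + 2 &= 17 = 2(3) + 11
\end{aligned}
```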

Claude 3.5 Sonnet:

Claude solution of Linear Equations problem

Gemini 1.5 Pro:

Gemini solution of Linear Equations problem

GPT-4o:

Its output was too long, so I only took a screenshot of the final answer.

GPT solution of Linear Equations problem

Right Answer:

Answer of Linear Equations problem

Result:

🌟 They all found the right answer again! However, Claude's check of its own solution failed, even though the solution is correct. That's really odd.
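The check step is just substitution, so it is easy to do mechanically. Here is a minimal sketch for equations of the form ax + b = cx + d. It uses exact fractions to avoid floating-point rounding. The helper names and the example equation are hypothetical.

```python
from fractions import Fraction

def solve_linear(a, b, c, d):
    """Solve a*x + b = c*x + d exactly (hypothetical helper).

    Returns None when there is no unique solution (a == c).
    """
    if a == c:
        return None
    return Fraction(d - b, a - c)

def check(a, b, c, d, x):
    # Substitute x into both sides; the solution checks out only if
    # the two sides are exactly equal.
    return a * x + b == c * x + d

x = solve_linear(4, -3, 1, 9)  # 4x - 3 = x + 9  ->  3x = 12
print(x)                       # 4
print(check(4, -3, 1, 9, x))   # True
print(check(4, -3, 1, 9, 5))   # False, since 17 != 14
```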

Round 5 - Inequalities

Maybe equations are too simple for AI, so this time I'm using inequalities!

Problem:

Solve the linear inequalities with absolute value:

Inequalities problem
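The standard way to handle these is to rewrite the absolute-value inequality as ordinary linear inequalities. Below are the rules for c > 0, followed by a worked hypothetical example (not one from the picture).

```latex
\begin{aligned}
|ax + b| < c &\iff -c < ax + b < c \\
|ax + b| > c &\iff ax + b < -c \ \text{ or } \ ax + b > c \\[6pt]
|2x - 3| \le 5 &\iff -5 \le 2x - 3 \le 5 \\
&\iff -2 \le 2x \le 8 \\
&\iff -1 \le x \le 4
\end{aligned}
```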

Claude 3.5 Sonnet:

Claude solution of Inequalities problem

Gemini 1.5 Pro:

Its output was too long, so I only took a screenshot of the final answer.

Gemini solution of Inequalities problem

GPT-4o:

Its output was too long, so I only took a screenshot of the final answer.

GPT solution of Inequalities problem

Right Answer:

Answer of Inequalities problem

Result:

😢 Maybe it’s too difficult for AI. They tried hard, but all three answers are wrong.
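You can verify an answer to an inequality like these by comparing the claimed interval against direct evaluation at many sample points. This minimal sketch covers only the ≤ case and uses a hypothetical helper and example. The > case would give two rays instead of one interval.

```python
from fractions import Fraction

def solve_abs_le(a, b, c):
    """Solve |a*x + b| <= c for a != 0 (hypothetical helper).

    Returns the closed interval (lo, hi), or None if there is no solution.
    """
    if c < 0:
        return None  # an absolute value is never negative
    # |a*x + b| <= c  <=>  -c <= a*x + b <= c; dividing by a flips the
    # inequalities when a < 0, so sort the two endpoints.
    lo, hi = sorted([Fraction(-c - b, a), Fraction(c - b, a)])
    return lo, hi

# Hypothetical example: |2x - 3| <= 5
lo, hi = solve_abs_le(2, -3, 5)
print(lo, hi)  # -1 4

# Sanity check against direct evaluation on a grid of points
grid = [Fraction(k, 4) for k in range(-40, 41)]
assert all((abs(2 * x - 3) <= 5) == (lo <= x <= hi) for x in grid)
```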

Conclusion

AI can solve many math problems. Although it may fail in some cases, it can still provide useful hints.

In the end, Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o received the same score in this math match: each solved 3 of the 5 problems. So, at least in this small test, state-of-the-art AIs perform similarly at math.
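For the record, here is the tally behind that conclusion. Each round counts as correct when the model's final answer was right, so Claude's failed self-check in Round 4 still counts as a correct answer, matching the Round 4 result.

```python
# Per-round results reported in this post (True = correct final answer).
results = {
    "Claude 3.5 Sonnet": [True, True, False, True, False],
    "Gemini 1.5 Pro":    [True, True, False, True, False],
    "GPT-4o":            [True, True, False, True, False],
}

# True counts as 1 and False as 0, so sum() gives the number solved.
scores = {model: sum(rounds) for model, rounds in results.items()}
print(scores)  # {'Claude 3.5 Sonnet': 3, 'Gemini 1.5 Pro': 3, 'GPT-4o': 3}
```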

Links:

  1. Poe is really awesome. I used Poe to run this test; it offers many advanced models as well as community-built models.
  2. I am building a web application to help solve math problems. Feel free to try it out: AI Math Solve.