DeepSeek-R1 has certainly created a lot of excitement and anxiety, especially for its rival, OpenAI o1. So, we put the two models to the test in a head-to-head comparison on a few simple data analysis and market research tasks.
To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond the benchmarks and see whether the models could actually perform tasks that require gathering information from the web, picking out the right pieces of data and performing simple calculations that would otherwise require a lot of manual effort.
Both models are impressive but make mistakes when prompts lack specificity. o1 is slightly better at reasoning tasks, but R1's transparency gives it an advantage in the cases (and there will be more than a few) where it makes mistakes.
Below is a breakdown of a few of our experiments, with links to the Perplexity pages where you can review the results yourself.
Our first test measured whether the models could calculate return on investment (ROI). We considered a scenario where a user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, NVIDIA, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.
To accomplish this task, the model would have to pull Mag 7 price information for the first day of each month, split the monthly investment evenly across the stocks ($20 per stock), sum the purchases and calculate the portfolio's value according to the stock prices on the current date.
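For reference, here is a minimal Python sketch of the calculation the models were asked to perform, assuming the price data is already in hand. The ticker symbols and the data layout are our own choices for illustration, not part of the prompt:

```python
MONTHLY_INVESTMENT = 140.0
TICKERS = ["GOOGL", "AMZN", "AAPL", "META", "MSFT", "NVDA", "TSLA"]
PER_STOCK = MONTHLY_INVESTMENT / len(TICKERS)  # $140 / 7 = $20 per stock

def portfolio_value(monthly_prices: dict[str, list[float]],
                    current_prices: dict[str, float]) -> float:
    """Buy $20 of each ticker on the first day of each month (Jan-Dec 2024),
    then value the accumulated shares at the current prices."""
    total = 0.0
    for ticker in TICKERS:
        # Shares bought each month = dollars invested / that month's price
        shares = sum(PER_STOCK / price for price in monthly_prices[ticker])
        total += shares * current_prices[ticker]
    return total
```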
Both models failed at this task. o1 returned a list of stock prices for January 2024 and January 2025, along with a formula for calculating the portfolio's value. However, it failed to calculate the correct values and essentially said there would be no return on investment. R1, for its part, made the mistake of investing only once, in January 2024, and calculating the returns as of January 2025.
What was interesting, however, was the models' reasoning process. While o1 did not provide much detail on how it reached its result, R1's reasoning trace showed that it did not have the correct information because Perplexity's retrieval engine had failed to obtain the monthly stock data (many retrieval-augmented generation applications fail not because of a lack of model capability but because of bad retrieval). This proved to be an important piece of feedback that led us to our next experiment.
We decided to run the same experiment as before, but instead of prompting the model to retrieve the information from the web, we provided it in a text file. For this, we copied the monthly data for each stock from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table with the price for the first day of each month from January to December 2024, along with the last recorded price. The data was not cleaned, both to minimize the manual effort and to test whether the model could pick the right pieces out of the data.
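For illustration, a few lines of pandas could do the extraction we were asking of the models. This is a hedged sketch: it assumes the file contains one HTML table per stock, in the order the tickers appear, and that the price column is named "Open"; both are assumptions about the file layout, not guarantees:

```python
import io
import pandas as pd

def parse_price_tables(raw_text: str, tickers: list[str]) -> dict[str, pd.DataFrame]:
    # pandas extracts every <table> in the text in document order;
    # we assume one table per ticker, in the same order as `tickers`.
    frames = pd.read_html(io.StringIO(raw_text))
    return dict(zip(tickers, frames))

def first_day_prices(df: pd.DataFrame) -> list[float]:
    # Event rows (dividends, splits) carry text in the price column;
    # coercing to numeric and dropping NaNs filters them out.
    return pd.to_numeric(df["Open"], errors="coerce").dropna().tolist()
```

Once parsed this way, the per-ticker price lists could feed directly into the portfolio calculation sketched earlier.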
Once again, both models failed to provide the right answer. o1 seemed to have extracted the data from the file, but suggested performing the calculation manually in a tool like Excel. Its reasoning trace was very vague and contained no useful information for troubleshooting the model. R1 also failed and did not provide an answer, but its reasoning trace contained a lot of useful information.
For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the right information. It also managed to work out the month-by-month investments separately, sum them and calculate the final value according to the latest stock price in the table. However, that final value remained stuck in its reasoning chain and never made it into the final answer. The model was also confused by a row in the NVIDIA chart that marked the company's 10:1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio.
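To make the split issue concrete, here is a hypothetical sketch of the adjustment a correct answer has to account for. Note that Yahoo! Finance historical tables are often already split-adjusted, in which case the split row only needs to be skipped; this sketch just illustrates the arithmetic, assuming raw pre-split quotes:

```python
from datetime import date

SPLIT_FACTOR = 10                # NVIDIA's 10:1 split
SPLIT_DATE = date(2024, 6, 10)

def to_post_split(price: float, quote_date: date) -> float:
    """Express a pre-split quote in post-split terms so that share counts
    bought before and after the split date are comparable."""
    return price / SPLIT_FACTOR if quote_date < SPLIT_DATE else price
```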
Again, the real differentiator was not the result itself, but the ability to investigate how the model arrived at its response. In this case, R1 provided us with a better experience, allowing us to understand the model's limitations and how we could rephrase our prompt and format our data to get better results in the future.
In another experiment, we asked the models to compare the stats of four leading NBA centers and determine which one had the best improvement in field goal percentage (FG%) from the 2022/2023 to the 2023/2024 seasons. This task required the model to do multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who only entered the league as a rookie in 2023.
Retrieval for this prompt was much easier, since player stats are widely reported on the web and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it's Giannis, if you were curious), although depending on the sources they used, their figures were a bit different. However, they did not realize that Wemby did not qualify for the comparison and instead gathered other stats from his time in the European league.
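As a sketch of the eligibility check both models missed, the comparison boils down to something like the following. The names and FG% figures here are made up for illustration, not real NBA stats:

```python
# Hypothetical FG% figures by season -- placeholders, not real data.
fg_pct = {
    "Center A": {"2022/23": 0.553, "2023/24": 0.611},
    "Center B": {"2022/23": 0.589, "2023/24": 0.575},
    "Center C": {"2023/24": 0.465},  # rookie: no 2022/23 NBA season
}

def best_fg_improvement(stats: dict[str, dict[str, float]]) -> str:
    # A player qualifies only with an NBA FG% for both seasons --
    # the detail both models initially missed.
    eligible = {
        name: s for name, s in stats.items()
        if "2022/23" in s and "2023/24" in s
    }
    return max(eligible,
               key=lambda n: eligible[n]["2023/24"] - eligible[n]["2022/23"])

print(best_fg_improvement(fg_pct))  # -> "Center A"
```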
In its answer, R1 provided a better breakdown of the results, with a comparison table and links to the sources it used. The added context enabled us to correct our prompt: after we modified it to specify that we were looking for FG% from NBA seasons, the model correctly ruled Wemby out of the results.
Reasoning models are powerful tools, but they still have a way to go before they can be fully trusted with tasks, especially as other components of large language model (LLM) applications continue to evolve. From our experiments, both o1 and R1 can still make basic mistakes. Despite showing impressive results, they still need a bit of handholding to give accurate answers.
Ideally, a reasoning model should be able to explain to the user when it lacks the information for a task. Alternatively, the model's reasoning trace should be able to guide users toward better understanding errors and correcting their prompts to increase the accuracy and stability of the model's responses. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI's upcoming o3 series, will provide users with more visibility and control.