Evaluating the Performance of LLM-Generated Code for ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions

In the domain of software development, making informed decisions about the use of large language models (LLMs) requires a thorough examination of their advantages, disadvantages, and associated risks. This paper contributes to such analyses in several ways. It first conducts a comparative analysis, pitting the best-performing code solutions selected from a pool of 100 generated by ChatGPT-4 against the highest-rated human-produced code on Stack Overflow. Our findings reveal that, across the spectrum of problems we examined, selecting from ChatGPT-4's top 100 solutions proves competitive with or superior to the best human solutions on Stack Overflow. We next examine the AutoGen framework, which harnesses multiple LLM-based agents that collaborate to tackle tasks.
