Fine-Tuning

Phind Outperforms GPT-4 with Their Fine-Tuned CodeLlama Model


Sven

August 27th, 2023

~ 3 min read

We delve into the recent accomplishments of Phind, the company behind a fine-tuned version of CodeLlama that now outperforms GPT-4 on HumanEval.

Beating Benchmarks: How CodeLlama Triumphed Over GPT-4

When it comes to AI models, benchmarks are the ultimate tests of prowess: benchmark scores provide an objective measure of how well a model performs in comparison to others. Until recently, GPT-4 held a steady lead with a pass@1 score of 67% on HumanEval, according to OpenAI’s official technical report from March.
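For context, pass@1 is the fraction of HumanEval problems for which a single generated sample passes the problem’s unit tests; the more general pass@k metric is typically computed with the unbiased estimator introduced in the original HumanEval paper. A minimal sketch in Python:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 10 of them correct -> estimated pass@1 of 5%
print(pass_at_k(n=200, c=10, k=1))  # 0.05
```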

However, the tables have turned as Phind’s fine-tuned CodeLlama-34B and CodeLlama-34B-Python have achieved scores of 67.6% and 69.5% respectively, thereby surpassing GPT-4. This noteworthy achievement is not only impressive but also indicative of the potential that lies within Phind’s CodeLlama project.

To ensure the validity of these results, Phind applied OpenAI’s decontamination methodology to their dataset, revealing no contamination. This step reflects Phind’s commitment to maintaining high scientific standards and integrity in their research.

A Closer Look at CodeLlama’s Performance on HumanEval

Meta recently released two CodeLlama models that exhibited remarkable performance on HumanEval. CodeLlama-34B achieved a pass@1 score of 48.8%, while CodeLlama-34B-Python accomplished a slightly higher score of 53.7%.

Phind fine-tuned both models using a proprietary dataset comprising approximately 80,000 high-quality programming problems and solutions. Unlike other datasets that merely offer code completion examples, Phind’s dataset features instruction-answer pairs, giving it a unique structure compared to HumanEval.
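The dataset itself is proprietary and its exact schema has not been published, but an instruction-answer pair of the kind described might look roughly like this (purely illustrative):

```python
# Hypothetical shape of one training record; the real Phind dataset is
# proprietary and its actual schema has not been released.
example = {
    "instruction": (
        "Write a Python function that returns the n-th Fibonacci number "
        "using an iterative approach."
    ),
    "answer": (
        "def fibonacci(n: int) -> int:\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
}
```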

The models were trained over two epochs, totaling around 160k examples. The training process excluded LoRA, opting instead for native fine-tuning. Advanced techniques such as DeepSpeed ZeRO 3 and Flash Attention 2 were employed to facilitate the training process, which was completed in just three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
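Phind has not released their training code, but a native full-parameter fine-tuning run combining DeepSpeed ZeRO 3 and Flash Attention 2 could be set up along these lines with a recent version of the Hugging Face transformers library; the model ID, hyperparameters, and config file name below are placeholders, not Phind’s actual settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Placeholder base model and hyperparameters; not Phind's actual training script.
MODEL_NAME = "codellama/CodeLlama-34b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash Attention 2 kernels
)

training_args = TrainingArguments(
    output_dir="phind-style-finetune",
    num_train_epochs=2,                 # two epochs, as described above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_zero3.json",          # DeepSpeed ZeRO stage 3 config file
    logging_steps=10,
    save_strategy="epoch",
)

# A Trainer would then be constructed with a tokenized dataset of
# instruction-answer pairs (sequence length 4096) and launched with
# `deepspeed` across the GPU cluster.
```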

Decontamination Methodology: Ensuring Valid Results

The validity of a model’s benchmark performance relies heavily on the integrity of the dataset used for training. To ensure this, Phind applied OpenAI’s decontamination methodology. This rigorous process involves randomly sampling three substrings of 50 characters from each evaluation example, or using the entire example if it comprises fewer than 50 characters.

A match is identified when any sampled substring appears as a substring of the processed training example. Following this stringent approach, Phind found zero contaminated examples in their dataset, solidifying the validity of their impressive results. For an in-depth understanding of the decontamination methodology, readers can refer to Appendix C of OpenAI’s technical report.
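The check as described is easy to reproduce. A minimal sketch, assuming the training and evaluation examples are available as plain strings and omitting any text normalization the full pipeline might apply:

```python
import random

def sample_substrings(example: str, k: int = 3, length: int = 50) -> list[str]:
    """Randomly sample k substrings of `length` characters from an eval example.

    If the example is shorter than `length`, the whole example is used,
    mirroring the rule in OpenAI's decontamination methodology.
    """
    if len(example) < length:
        return [example]
    n_positions = len(example) - length + 1
    starts = random.sample(range(n_positions), min(k, n_positions))
    return [example[s:s + length] for s in starts]

def is_contaminated(train_example: str, eval_examples: list[str]) -> bool:
    """Flag a training example if any sampled substring from any
    evaluation example occurs verbatim inside it."""
    for eval_example in eval_examples:
        for sub in sample_substrings(eval_example):
            if sub in train_example:
                return True
    return False

# contaminated = [ex for ex in training_set if is_contaminated(ex, humaneval_prompts)]
```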

Celebrating Success: CodeLlama’s Achievement on HumanEval

The Phind team’s hard work has culminated in the successful fine-tuning of their models, leading to remarkable scores on HumanEval. Phind-CodeLlama-34B-v1 achieved a pass@1 score of 67.6%, while Phind-CodeLlama-34B-Python-v1 scored even higher at 69.5%.

The Future of CodeLlama: Open Sourcing for Greater Collaboration

In an exciting move, Phind is releasing both models on Hugging Face. This decision aims to promote verifiability and foster greater collaboration within the open-source community. Independent verification of the results is welcomed and encouraged, signifying Phind’s transparency and commitment to collective growth and advancement in AI technology.
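With the weights on Hugging Face, trying the model out is straightforward with the transformers library. A minimal sketch; the prompt and generation settings here are illustrative rather than the format recommended on the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Phind/Phind-CodeLlama-34B-v1"  # repository linked below

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread the 34B weights across available GPUs
)

prompt = "Write a Python function that checks whether a string is a palindrome.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```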

Links:
Blog Post: https://www.phind.com/blog/code-llama-beats-gpt4
HuggingFace: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1