I worked on fine-tuning a ChatGPT model for Tactician TM, a turn-based game engine. The goal was pretty cool: let anyone describe game rules in their own words, and have our NLP model tidy those descriptions into a clear, standardized format. This was the first time I had faced a challenge like this, and let's just say it was a rough start. To make it more approachable, I decided to start small and focus on the game of Tic Tac Toe (TTT).
Initially, I was overwhelmed. Nonetheless, I embraced the challenge, realizing I had nothing to lose. My first choice was spaCy, a Python NLP library. My original plan was to deconstruct each input sentence word by word, but I soon realized this approach was too time-consuming given the multitude of variables involved. So I returned to the drawing board and did further research.
During this phase, I discovered fine-tuning and various AI tools that facilitate it. Notably, I found that ChatGPT offered fine-tuning capabilities. Initially, I used the "davinci-002" model for its accessibility. This model required training data in a JSONL file, formatted as prompt-completion pairs. The next step was creating the dataset.
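For reference, that format puts one JSON object per line, with the messy input as the prompt and the standardized rule as the completion. The rule wording below is illustrative, not taken from my actual dataset:

```jsonl
{"prompt": "Players go back and forth putting their symbol on the grid.", "completion": "Players alternate turns, each marking one empty cell with their symbol (X or O)."}
{"prompt": "Get three of your marks in a line to win.", "completion": "A player wins by placing three of their symbols in a horizontal, vertical, or diagonal line."}
```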
I aimed for a 'waterfall effect' in my model: just as water in a river inevitably flows to a common destination, the model should generalize any input into one of the standard TTT rules, ensuring that diverse phrasings are interpreted consistently and aligned with the established rule set. I drafted a set of rules for TTT and rephrased them in four distinct styles. I then asked various people to write out the TTT rules themselves, gathering perspectives and descriptions I hadn't considered.
Despite this effort, I felt I needed more data, so I used ChatGPT to rewrite the collected rules into 20 unique variations. I adopted the standard 80/20 split for training and testing data, which yielded consistent results. However, a meeting with my mentor and their guest revealed a significant oversight: my model was overfitted to TTT. The guest advised me to use newer ChatGPT models and to leverage the knowledge already baked into them, reducing the need for extensive datasets. He introduced me to "few-shot learning".
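The 80/20 split itself is a few lines of Python. A minimal sketch (the placeholder rule variations here stand in for my real dataset):

```python
import random

def split_dataset(examples, train_frac=0.8, seed=42):
    """Shuffle examples and split them into train/test sets (80/20 by default)."""
    shuffled = examples[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 20 paraphrased rule variations (placeholders for illustration)
rules = [{"prompt": f"variation {i}", "completion": "standard rule"} for i in range(20)]
train, test = split_dataset(rules)
print(len(train), len(test))  # 16 4
```

Seeding the shuffle keeps the split reproducible between runs, which makes comparing fine-tuning experiments fair.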
The invaluable advice from my mentor's guest led me to rethink my approach to data collection. Instead of creating an extensive dataset from scratch, I explored ways to leverage what the model already knew. This shift not only streamlined the process but also let me focus on refining the model's accuracy and efficiency. Through carefully constructed sentences, I was able to bring the model's training loss down from 1.81 to 0.26, as you can see in the first two photos above. Following this advice, I selected ChatGPT's "gpt-3.5-turbo-1106" model for my final fine-tuning, using the few-shot learning strategy. The data structure evolved from a simple "prompt-completion" format to a more nuanced "system-user-assistant" format, which allowed more precise control over the model's responses. The 'system' component defines the desired behavior of the model, such as being a helpful assistant. The 'user' represents the input, in this case our paraphrased TTT rules. Finally, the 'assistant' defines the model's response in line with the system's persona.
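In this chat-style format, each JSONL training line bundles the three roles together. A line for this project could look roughly like the following (the wording is illustrative, not from my actual dataset):

```jsonl
{"messages": [{"role": "system", "content": "You are a helpful assistant that rewrites Tic Tac Toe rules into a standardized form."}, {"role": "user", "content": "Take turns putting your mark down until someone lines up three."}, {"role": "assistant", "content": "Players alternate turns placing their symbol; the first to align three symbols in a row, column, or diagonal wins."}]}
```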
I developed code to generate both training and testing data, shifting the focus from data collection to crafting effective prompts that would get ChatGPT to produce the desired output. This phase introduced its own challenges: data came back incomplete or sometimes not at all, and achieving a uniform presentation proved difficult. These obstacles forced a meticulous refinement of the generation process to ensure both consistency and completeness in the dataset.
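The generation step boils down to assembling a good request for each standard rule. A minimal sketch of that assembly (the function name and prompt wording are my own for illustration; the resulting dict would be sent to the OpenAI chat completions endpoint, which needs an API key and so is not called here):

```python
def build_paraphrase_request(standard_rule, n_variations=20):
    """Assemble a chat request asking the model to paraphrase one rule."""
    prompt = (
        f"Rewrite the following Tic Tac Toe rule in {n_variations} distinct ways. "
        "Keep the meaning identical and put each variation on its own line.\n\n"
        f"Rule: {standard_rule}"
    )
    return {
        "model": "gpt-3.5-turbo-1106",
        "messages": [
            {"role": "system", "content": "You rewrite game rules without changing their meaning."},
            {"role": "user", "content": prompt},
        ],
    }

request = build_paraphrase_request("The game is played on a 3x3 grid.")
# `request` would be passed to the OpenAI chat completions API.
```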
To address the incomplete data, I increased the word limit allowed for each generation; this simple change significantly improved completeness. To tackle the formatting issue, I embedded a specific format style directly into the prompts, which ensured the data was not only uniformly formatted but also aligned with the desired output structure.
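Both fixes amount to tweaking the request itself: a larger generation cap plus an explicit format instruction appended to the prompt. A sketch under assumed values (the `max_tokens` figure and the format wording are illustrative, not my exact settings):

```python
def apply_fixes(request, max_tokens=2048):
    """Raise the generation cap and pin the output format in the prompt."""
    fixed = dict(request)
    fixed["max_tokens"] = max_tokens  # a higher limit avoids truncated variations
    fixed["messages"] = [dict(m) for m in request["messages"]]  # copy before editing
    fixed["messages"][-1]["content"] += (
        "\n\nFormat every variation exactly as: '<number>. <rule text>' "
        "with no extra commentary."
    )
    return fixed

base = {"model": "gpt-3.5-turbo-1106",
        "messages": [{"role": "user", "content": "Rewrite this rule 20 ways."}]}
fixed = apply_fixes(base)
```

Copying the request before editing keeps the original untouched, so the same base request can be reused across rules.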
After extensive experimentation and adjustment, I finalized a set of prompts that consistently delivered optimal results and maintained uniform data formatting. My code can now take a standard set of rules and generate the training data for model creation, as well as the corresponding test data. The first image above shows the prompt I created for the training data, while the second shows the prompt used for the testing data.
The image above is a screenshot of my GitHub, where the code is available for use. Feel free to download it and give it a shot yourself. I took the time to make the documentation straightforward and easy to follow.