Completion Settings
What are Completions?
Completions are what the LLM APIs ultimately do: they “complete” the prompt that you give them. A large language model simply tries to predict what the next “token” in the input should be. When you ask a chatbot a question, it takes your input, plus the context that Montag provides from the Embeddings, and then “finishes” the prompt. In the case of a chatbot, finishing the prompt means replying to the question or carrying out the task.
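To make this concrete, here is a rough sketch of the idea in Python. The prompt template, variable names, and example chunks below are illustrative only, not Montag’s actual code; it just shows how a question plus retrieved context becomes one piece of text for the model to “finish”.

```python
# Illustrative only: the template and example values below are made up,
# not taken from Montag's source.
question = "How do I rotate my API keys?"
retrieved_chunks = [
    "API keys can be rotated from the Settings > Security page.",
    "Old keys remain valid for 24 hours after rotation.",
]  # stands in for the context Montag pulls from the Embeddings

prompt = (
    "Use the following context to answer the question.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question + "\n"
    "Answer:"
)

# The LLM simply predicts the tokens most likely to follow "Answer:",
# and that continuation is what the chatbot returns as its reply.
print(prompt)
```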
An LLM’s completions can be tweaked to change their behaviour. That’s what the settings in the “Completions” section of Montag provide: a way to store tweaked settings to send to the LLM. These settings control things such as how creatively the model responds and how often it repeats itself; a sketch of how they map onto the underlying API call follows the list below.
Completion Settings in Montag
- Model: The model used to run the completion. For OpenAI, gpt-4 and gpt-3.5-turbo are the entries that give meaningful results; other model names should work, but the API client may not properly interpret how to access them. For open-source LLMs, set this to gpt-3.5-turbo: because these models are self-hosted rather than hosted by OpenAI, they do not support model selection at the completion level, and the model is instead determined by the API client you select.
- Temperature: Takes a value between 0.01 and 1.0 (it must be a decimal). The lower the value, the more predictable the output; the higher the value, the more varied the output will be for the same question. For example, if you want solid, consistent answers, a low value such as 0.01 might be best, whereas if you want to introduce some “creativity” into your responses, raise this value.
- TopP: Similar to Temperature, but instead of scaling the randomness of each choice it restricts which tokens the model can sample from: only the smallest set of tokens whose combined probability reaches the TopP value is considered. Setting this low means the model only ever picks from the most likely tokens, while setting it high lets it choose from almost the full range of probabilities. I like to keep this the same as Temperature.
- Frequency Penalty: The frequency penalty penalizes tokens that have already appeared in the preceding text (including the prompt), and scales based on how many times that token has appeared. So a token that has already appeared 10 times gets a higher penalty (which reduces its probability of appearing) than a token that has appeared only once. This setting is useful if you want to get rid of repetition in your outputs.
- Presence Penalty: Applies the penalty regardless of frequency: as long as the token has appeared at least once before, it gets penalized. This is also useful for reducing repetition, and tends to nudge the model towards new words and topics.
- Max Tokens: As with Embeddings, Max Tokens sets the maximum number of tokens the model should generate in its response.
- Token Limit: Specifies the model’s total token limit, including the prompt. Montag uses this when building the final prompt to truncate text that is too long.
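As mentioned above, most of these settings correspond directly to parameters of the underlying completion API. The sketch below uses the OpenAI Python client to show where each value ends up; the specific numbers are arbitrary examples, and how Montag assembles the request internally may differ.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # Model
    temperature=0.2,         # Temperature: low = more predictable answers
    top_p=0.2,               # TopP: kept the same as Temperature here
    frequency_penalty=0.5,   # Frequency Penalty: discourage repeated tokens
    presence_penalty=0.0,    # Presence Penalty: no flat penalty in this example
    max_tokens=512,          # Max Tokens: cap on the generated response
    messages=[{"role": "user", "content": "Summarise the context above."}],
)
print(response.choices[0].message.content)
```

Note that Token Limit has no API parameter of its own: it is applied on Montag’s side, when the final prompt is built, to keep the prompt plus the generated response within the model’s context window.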