The Need for Explainability and Simulation in Preference Optimization

Sundar Narayanan
Jul 10, 2024


Preference Optimization and Explainability

The world is multilingual, and technology is increasingly expected to be multilingual as well. Working with multilingual data, or with models spanning diverse languages, poses many challenges and opportunities. In this context, research on Reinforcement Learning from Human Feedback (RLHF) has shown promising potential for multilingual applications. Cohere recently published a study claiming that RLHF capabilities transfer across languages, reporting performance improvements and highlighting the strengths of online RLHF methods over offline approaches.

The team at Cohere ran a series of experiments to test these RLHF transfer capabilities. They created a synthetic multilingual preference dataset by translating approximately 50,000 English prompts from ShareGPT into 22 different languages. Completions were generated using Cohere’s Command and Command R+ models, with the Cohere May 2024 model serving as the reward model. It is pertinent to note that this reward model scores 88.2 on the RewardBench leaderboard for reward models.
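As a way of visualising this construction, here is a minimal sketch in Python of how such a synthetic preference-data pipeline could be wired together. The functions translate_prompt, generate_completion and score_with_reward_model are hypothetical stand-ins for the NLLB translation step, the Command / Command R+ completion models and the reward model; their stub bodies exist only to keep the sketch runnable and do not reflect the paper’s actual implementation.

```python
# Minimal sketch of a synthetic multilingual preference-data pipeline.
# All three helper functions are placeholders, not real APIs.

from dataclasses import dataclass


def translate_prompt(prompt: str, target_language: str) -> str:
    # Placeholder for machine translation (NLLB in the paper).
    return f"[{target_language}] {prompt}"


def generate_completion(prompt: str, model: str) -> str:
    # Placeholder for a completion model (Command / Command R+ in the paper).
    return f"{model} completion for: {prompt}"


def score_with_reward_model(prompt: str, completion: str) -> float:
    # Placeholder for a learned reward model; here a dummy length-based score.
    return float(len(completion))


@dataclass
class PreferencePair:
    language: str
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(english_prompt: str, language: str) -> PreferencePair:
    prompt = translate_prompt(english_prompt, target_language=language)
    completion_a = generate_completion(prompt, model="model_a")
    completion_b = generate_completion(prompt, model="model_b")
    score_a = score_with_reward_model(prompt, completion_a)
    score_b = score_with_reward_model(prompt, completion_b)
    if score_a >= score_b:
        return PreferencePair(language, prompt, completion_a, completion_b)
    return PreferencePair(language, prompt, completion_b, completion_a)


print(build_preference_pair("Explain photosynthesis in simple terms.", "hindi"))
```

Repeating build_preference_pair over the translated prompts and languages yields the kind of (chosen, rejected) pairs that DPO and RLOO consume.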

Four dataset mixtures were developed, varying in the number of languages and prompts. The Aya 23 8B model was used as the base model for preference optimization with Direct Preference Optimization (DPO) and REINFORCE Leave-One-Out (RLOO).

Key findings related to preference optimization in a multilingual context from the paper include:

Outcome 1: Cross-Lingual Transfer: Preference optimization of multilingual language models using English-only preference data improves performance in other languages. In addition, including even a few additional languages in the preference data significantly enhances cross-lingual transfer, yielding higher win-rates on unseen languages.

Outcome 2: Multilingual Alignment: Multilingual preference data is necessary for aligning multilingual large language models (LLMs). Increasing the number of languages in preference-optimization training consistently improves multilingual performance compared to training with English-only data.

Outcome 3: Online vs. Offline Optimization: Online preference optimization, in particular the REINFORCE Leave-One-Out (RLOO) method, appears to outperform offline optimization such as Direct Preference Optimization (DPO). RLOO seems to achieve better overall performance and to show improved cross-lingual transfer, even on languages not included in preference training (a brief sketch of RLOO’s leave-one-out baseline appears after these outcomes).

Outcome 4: Preference-Optimized Aya 23 8B: The preference-optimized Aya 23 8B model appears to achieve the highest win-rates among open-weight models. It outperforms the original Aya model as well as widely used open-weight models across all 23 languages evaluated.
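To make the online method in Outcome 3 concrete: RLOO samples several completions per prompt, scores each with the reward model, and uses the mean reward of the other samples as a per-sample baseline. The sketch below shows only that leave-one-out baselining step, with made-up reward values, and is not the paper’s training code; the resulting advantages would then weight the policy-gradient update on each sampled completion.

```python
import torch


def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out baselined advantages for k sampled completions per prompt.

    rewards has shape (batch, k): one scalar reward per sampled completion.
    Each sample's baseline is the mean reward of the other k - 1 samples
    drawn for the same prompt.
    """
    k = rewards.shape[-1]
    baseline = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline


# Toy usage: 2 prompts, 4 sampled completions each, with made-up rewards.
rewards = torch.tensor([[0.1, 0.5, 0.2, 0.9],
                        [0.3, 0.3, 0.7, 0.1]])
print(rloo_advantages(rewards))
```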

Questions for further exploration:
While providing valuable insights, this research also raises certain questions about multilingual optimization that warrant further exploration, specifically to understand how LLM preference optimization works:

1. DPO works by deriving the #reward function ‘r’ implicitly from the policy ‘pi’, using KL divergence against a reference model as a constraint; the optimal policy maximises this reward function. Unless the preference data is multilingual, the rewards should degrade for other languages, resulting in misalignment. This raises the question of how ‘English data only’ optimization improves multilingual performance. Even if the preference data has similar context (as the content is from ShareGPT), the languages and the completions in those languages could make it difficult for the optimization to improve performance in other languages. This is without even considering the implicit linguistic nuances of each language.

2. The #KLPenalty is set at 0.5. This governs the extent to which the policy is sensitive to the reward function. If the policy is not trained to be sensitive to a high (> 0.7) or low (< 0.3) degree, then the policy will optimize for the reward only in a relative sense. Even if the preference data were tailored to be multilingual, this KL penalty would lead to a relative reward rather than one that is highly sensitive. This is another reason why the phenomenon of ‘English data only’ optimization improving multilingual performance needs to be understood.

3. #Transformers training is non-deterministic in practice; essentially, two consecutive DPO runs may produce very different results. This means DPO results need a sufficient confidence level to demonstrate why and how ‘English data only’ optimization improves multilingual performance. However, determining confidence intervals or performing uncertainty quantification for DPO would be difficult and resource intensive.

4. The reward is calculated as the difference between the model’s log-probs and the reference model’s log-probs, multiplied by the KL penalty, for the chosen or rejected responses; the loss is then formulated by comparing the log-odds of winning completions against losing ones (a minimal sketch of this construction appears after this list). Log-probs / log-sigmoid / logits at the extremes (where probability is 0 or 1) may not be well-behaved or differentiable, resulting in unreliable optimization. Further, the assumption that the optimal policy and the reward model are not correlated cannot be proven easily, because of environment complexity, the non-linear and non-deterministic problem space, and the choice of optimization algorithm and its hyperparameters.

5. #DPO does not perform well on out-of-distribution data, and multilinguality is essentially out of distribution here. DPO is also known to reduce the likelihood of preferred completions when the relative probability of preferred over dispreferred completions increases. When English-only preference data is applied in a multilingual context, these failure modes should affect DPO. So why and how does ‘English data only’ optimization improve multilingual performance?

6. The dataset is built by leveraging #ShareGPT prompts translated with NLLB (Meta’s No Language Left Behind model). While post-translation completions are generated for each language to bring diversity, this process is constrained by the original training data for these languages. Further, translations are not fully reliable, given transliteration issues, translation errors, mistranslations and misinformation.

7. #GPT4 Turbo is used as the ‘LLM as judge’ in the evaluations. Given the lack of human validation of these evaluations, the question remains as to why and how ‘English data only’ optimization improves multilingual performance.

8. The success of DPO depends on the quality and fairness of the preference data used to train the model. For instance, if the preference data is biased, it will affect the reliability of the log-probs, thereby affecting the weights updated on the basis of those flawed log-probs. This may not matter much in certain domains, but may matter significantly in others. Unless DPO efforts are tied to a specific downstream use case, the effect of the bias, or the ability to address it, would be difficult to estimate.
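For reference, here is a minimal sketch of the standard DPO objective that points 1, 2 and 4 refer to: the implicit reward for a completion is the KL-penalty coefficient beta times the policy-vs-reference log-probability ratio, and the loss is the negative log-sigmoid of the chosen-minus-rejected reward margin. This is the textbook formulation, not the paper’s training code; the beta = 0.5 setting and the toy log-probabilities are assumptions used purely for illustration.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.5):
    """DPO loss from sequence-level log-probabilities.

    Implicit reward = beta * (policy logp - reference logp);
    loss = -log-sigmoid of the chosen-minus-rejected reward margin.
    beta plays the role of the KL penalty discussed above.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()


# Toy usage with made-up log-probabilities for a batch of two preference pairs.
policy_chosen = torch.tensor([-12.3, -15.1])
policy_rejected = torch.tensor([-14.0, -14.8])
ref_chosen = torch.tensor([-13.0, -15.5])
ref_rejected = torch.tensor([-13.5, -15.0])
loss, r_w, r_l = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(loss.item(), r_w.tolist(), r_l.tolist())
```

Seen this way, the questions above amount to asking how a margin computed entirely on English (chosen, rejected) pairs ends up shifting policy log-probabilities in other languages.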

Given the analysis above, the best ways forward to understand the underlying reasons for the empirical finding that ‘English data only’ optimization improves multilingual performance are as follows:
1. Disclosure of the test results along with the test parameters to enable further research.
2. Establishing explainability (reasoning at an embedding level) to understand the underlying reasons for such phenomena.
3. Conducting simulation studies to understand the impact of such phenomena across varied scenarios (a small illustration follows below).
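As one small illustration of points 1 and 3, and of the uncertainty-quantification concern raised earlier, per-language win-rates from repeated runs could be reported with bootstrap confidence intervals rather than as point estimates. The sketch below uses entirely synthetic win/loss outcomes; the languages and win probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def bootstrap_win_rate_ci(wins, n_resamples=10_000, alpha=0.05):
    """Bootstrap confidence interval for a win rate from per-example win/loss outcomes."""
    wins = np.asarray(wins, dtype=float)
    resampled = rng.choice(wins, size=(n_resamples, len(wins)), replace=True).mean(axis=1)
    lo, hi = np.quantile(resampled, [alpha / 2, 1 - alpha / 2])
    return wins.mean(), (lo, hi)


# Hypothetical per-example outcomes (1 = optimized model preferred) for two languages,
# pooled over repeated training runs with different seeds.
outcomes = {
    "hindi": rng.binomial(1, 0.62, size=200),
    "turkish": rng.binomial(1, 0.55, size=200),
}
for language, wins in outcomes.items():
    mean, (lo, hi) = bootstrap_win_rate_ci(wins)
    print(f"{language}: win-rate {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```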

These approaches would not only allow critical examination of the optimization outcomes, but also provide a cautioning impetus against overly bold claims, enabling better research and development efforts. Further, by delving deeper into these aspects, we can unlock the full potential of RLHF in multilingual contexts and promote responsible and effective multilingual performance.
