Human-in-the-loop or on-the-loop is not a silver bullet. Evaluate their effectiveness

Framework for evaluating effectiveness of Human-in-the-loop or Human-on-the-loop

Sundar Narayanan
Jan 4, 2022
Image credit: Process vector created by freepik (https://www.freepik.com/vectors/process)

HTL in Context

Human-in-the-loop/on-the-loop (hereafter ‘HTL’) is a mechanism for integrating human intelligence into the use of machine learning, so that human discretion complements machine activity and outcomes. HTL is about two key factors: (1) the design of human-computer interaction and (2) the autonomy of decision making. With the increased adoption of machine learning, both factors help to (a) ensure quality in the actions performed by the algorithm, (b) limit false positives in the algorithm's outputs and (c) maintain oversight of the risks caused by the algorithm. But how effective HTL models are for high-risk artificial intelligence systems is difficult to assess.

Human-in-the-loop/on-the-loop is essentially a mechanism to include adequate model oversight and consideration of human factors in the modelling decision process. Typically, humans take part in this process by providing appropriate labels for the data, providing insights on edge cases of model prediction or classification, and testing or validating the model for correctness.

A combined human-automation system performs a sequence of four consecutive information processing functions: the acquisition of information, the analysis of that information, the decision on what action to take based on the information, and the implementation of the action [here]. With the rise of autonomous systems, HTL is considered an essential risk mitigator (from both moral and legal perspectives) or even a method for a fairer paradigm [here] and [here]. Most research papers published in 2020 and 2021 (on arXiv) containing ‘Human-in-the-loop/on-the-loop’ examine HTL as an enabler of better accuracy, better predictions, and better and more effective models overall [here].

Effectiveness of HTL

The effectiveness of HTL in algorithmic decision making, specifically in high-risk AI, could be subject to two key issues. (a) Previous studies have shown that humans involved in HTL may contribute to increasing disparities across racial and societal groups through their decisions [here] & [here]. (b) In addition, human-in-the-loop or on-the-loop actions cannot be effective at every algorithmic decision point. This is because of inherent limitations that we as humans may have: human oversight may end up (i) rubber stamping automated decisions, (ii) amplifying automation bias and (iii) blurring responsibility, where humans are blamed for algorithmic errors and biases [here].

The above clarifies that the effectiveness of HTL depends heavily on two factors, namely (1) the appropriateness of HTL in the model lifecycle and (2) the opportunity, willingness, subjective experience, and capability of the humans involved in HTL. The appropriateness of HTL depends on the specific use case in which it is applied and its underlying business environment. For instance, the oversight capabilities of the humans involved in HTL depend on the availability of data, the ability to reach a conclusion with the available information, and a mechanism to track the after-effects of such decisions in the context of the original prediction by the AI system. Similarly, the subjective experience and capability of the humans to interpret and make context-aware decisions or actions are critical. This applies both to fulfilling their role responsibility and to their causal responsibility [here].

Past research offers the following insights on HTL:

  • Putting a human into the loop does not assure a meaningful role for that human. This may create discrepancies between humans’ role responsibility and their causal responsibility, and may also expose them to unjustified legal measures and psychological burdens [here].
  • Behavioural economics experiments (task: piloting an unmanned aerial vehicle) show that humans in HTL are inefficient because of their overconfidence or under-confidence in algorithmic results [here].
  • The presentation style in which outcomes are produced for HTL consideration can affect the actions or decisions humans take on them [here]. Framing and messaging also influence human actions on decisions involving moral judgment [here].
  • Relying solely on the intuitions of ML experts and practitioners to capture the relevant nuances is likely inadequate and ineffective [here].
  • Lack of authority for humans and limited opportunities to examine facts (specifically in opaque or vague process scenarios, e.g. lack of authority or access to gain further information on a select application from sources beyond traditional sources) or to highlight the irrelevance of the probabilistic thresholds are key reasons for the less than optimal effectiveness of HTL [here].
  • The responsibility measure is significantly affected by the (subjectively) assessed level of abilities of the AI system and the human. Research on the responsibility measure is based on static effects and not on temporal effects [here].

With the above, it is clear that there is a need to examine the effectiveness of HTL. While [here] proposes a mathematical approach to a responsibility measure, it does not completely provide a framework for deciding on the appropriateness of HTL or for assessing capability or experience. In addition, the responsibility measure is intended for a static-effect scenario rather than a temporal one, which would include the time available to take a decision, relative reliance on the system, and behavioural effects (e.g. fatigue effects or learning effects) that affect the human-computer interaction [here]. Also, causal responsibility in HTL is affected by the Opportunity-Willingness-Experience-Capability-Capacity (OWECC) of the humans involved in the loop. This shows the necessity for a comprehensive framework to determine the appropriateness of HTL and the OWECC of the humans involved.

Framework for evaluating effectiveness of HTL

From the above factors, an HTL effectiveness assessment needs to cover both the design of HTL and how outcomes and insights are represented to the HTL. As we have seen, representation can influence decision making by the HTL.

The framework introduced here examines HTL effectiveness along three dimensions: Design, Representation and Computation. These dimensions are further classified into specific process areas of attention for examining effectiveness.

A. Design

The design of HTL shall consider the implications of (a) data and model choices, (b) selection of humans, (c) treatment of outcomes and (d) human limitations in performing such roles.

Data and model choices include (a) data adequacy, (b) data quality and representativeness, (c) documentation of causatives and inferences (including thresholds considered for metrics) and (d) algorithm choices. These choices and documentation can contribute to errors or bias in outcomes and possibly influence how the HTL treats such outcomes. It is necessary to consider that the perceived extent of influence may differ across use cases and domain applications. For instance, the extent of influence of causal reasoning or construct validation may be different for fraud prediction at the credit lending decision stage than for account hacking fraud detected through behavioural biometrics. This is because the extent to which the ground truth can be tested differs between the two cases.

Selection of humans for HTL roles is a critical part of the assessment of effectiveness. In traditional high-risk environments, HTL roles often have detailed selection criteria and processes. These consider aspects including suitability, capability for the role, preparedness to handle stress, ability to make decisions that support the larger human cause, and health and mental wellbeing. They are aimed at providing holistic perspectives and minimum acceptable thresholds for handling such sensitive roles. Similar aspects need to be considered for HTL effectiveness assessment, as HTL effectiveness depends on the effectiveness of the human selection process. The selection of humans shall cover (a) capability assessment, (b) domain knowledge, (c) understanding of the impact of algorithms (in the specific domain context), (d) stress preparedness and (e) awareness of causal responsibilities in the role.

Processes associated with the treatment of outcomes are core to the activities under HTL. Hence, assessing them is essential to understanding the effectiveness of HTL. Treatment of outcomes typically involves (a) activities and/or actions to be performed on the outcomes, (b) thresholds, classification, and filtration applied to the outcomes prior to various actions, (c) access to resources for performing activities that could contribute to a decision and (d) decision factors that determine HTL actions on outcomes. The oversight opportunities and the resulting actions are mostly based on the availability of data, insights from the data to reach a conclusion, and a mechanism to track the after-effects of such actions. Further, it is also relevant to consider how the feedback from HTL is used to update or reinforce the model.
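To make the thresholds and filtration in point (b) more concrete, here is a minimal Python sketch of how outcomes might be routed before any human sees them. Everything in it is an assumption made for illustration: the `Outcome` class, the `route_outcome` function, the queue names and the threshold values do not come from any particular system, and suitable thresholds would depend on the domain and use case.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    case_id: str
    score: float  # model's estimated probability of the positive class, in [0, 1]


def route_outcome(outcome: Outcome, low: float = 0.2, high: float = 0.8) -> str:
    """Return the (hypothetical) queue an outcome is sent to.

    Scores inside the uncertain band [low, high] go to a human reviewer;
    scores outside it are handled automatically. The default thresholds
    are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    if outcome.score < low:
        return "auto_clear"
    if outcome.score > high:
        return "auto_flag"
    return "human_review"


outcomes = [Outcome("A-1", 0.05), Outcome("A-2", 0.55), Outcome("A-3", 0.93)]
queues = {o.case_id: route_outcome(o) for o in outcomes}
print(queues)
```

A routing rule like this decides which cases the HTL ever gets an opportunity to act on, which is why the choice of thresholds is itself part of what an effectiveness assessment would need to examine.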

Humans have limitations in many areas. Examining such limitations in the context of the role played in HTL is essential. These limitations include (a) the extent to which information can be reviewed for the decision process, (b) awareness, or the limits thereof, with reference to domain knowledge, and (c) the extent to which experiences and beliefs affect the decision process.

B. Representation

Representation of data, outcomes and impacts can also influence the HTL, as noted above. It is necessary to consider how the outcomes are represented to the HTL and examine whether that can influence their decision. This also means examining whether these processes can induce confirmation bias, thereby affecting the effectiveness of HTL. Outcome representation and its influence on HTL is subjective and may differ from one domain to another.

Impact is the effect on the ultimate user or customer caused by the decision the HTL takes on the basis of the algorithm's outcome. Impact representation is about how the impacts are represented. Under-representation or misrepresentation can influence the HTL and may also affect the effectiveness of decisions.

HTL may degrade the algorithm if outcomes are incorrectly scored or mislabeled, or if there are errors in the decisions or in updates to such decisions [here]. Understanding these varying factors when assessing the effectiveness of HTL can help in making meaningful use of HTL.

C. Computation

Computational assessment and monitoring of HTL performance at both individual and group levels is necessary to assess whether the design and representation factors are well adapted in the HTL environment. It includes (a) ground truth validation and (b) consistency of actions.

Ground truth validation is a measure to assess whether the decision arrived at by the HTL is consistent with the facts of the case in question. Consistency of actions is about whether the actions proposed or undertaken by the HTL, individually or as a group, are statistically consistent. Assessing these aspects using computational methodologies and KPIs on the performance of HTL helps in assessing its effectiveness. These assessments could be performed prior to deployment (testing phase) and during the post-deployment monitoring phase of the HTL process.
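As a rough illustration of what such KPIs could look like, the sketch below computes two simple measures in plain Python: the share of a reviewer's decisions that match the later-established ground truth, and Cohen's kappa as one possible measure of consistency between two reviewers. The choice of these two metrics, the function names and the sample labels are all assumptions made for the example; the framework itself does not prescribe specific metrics.

```python
from collections import Counter


def ground_truth_agreement(decisions, truths):
    """Share of reviewer decisions that match the established ground truth."""
    if len(decisions) != len(truths) or not decisions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(d == t for d, t in zip(decisions, truths)) / len(decisions)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two reviewers, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both reviewers pick the same label at random,
    # given each reviewer's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six cases (illustrative data only).
truth = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
reviewer_1 = ["fraud", "ok", "ok", "ok", "ok", "ok"]
reviewer_2 = ["fraud", "ok", "fraud", "ok", "ok", "ok"]

print("reviewer 1 vs ground truth:", ground_truth_agreement(reviewer_1, truth))
print("reviewer 2 vs ground truth:", ground_truth_agreement(reviewer_2, truth))
print("reviewer 1 vs reviewer 2 (kappa):", cohen_kappa(reviewer_1, reviewer_2))
```

Kappa is used here rather than raw agreement because two reviewers who both label almost everything "ok" will agree often by chance alone. With more than two reviewers, a group-level statistic would be needed instead.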

Effectiveness assessment will depend on the domain, the industry, and the use case in context.

Assessing individuals deployed in HTL roles

Understanding the capability of individuals in HTL is critical in a high-risk environment. In this context, one needs to assess individuals from the perspective of Willingness-Experience-Capability-Capacity (that is, OWECC without its first aspect, ‘Opportunity’). Opportunity is examined based on whether there is a need for HTL in the process.

When assessing individuals, their willingness is important, specifically in the context of assessing the harms or impacts of the algorithmic system or its outcomes. A less motivated or unwilling human in the loop can have a detrimental impact on the people affected.

Experience in handling high-risk roles and specific experience in dealing with the domain, industry and/or use case can be a useful parameter for assessing individuals. So can capability associated with the industry, domain or use case, and the ability to deal with high-risk circumstances (stress testing). In addition, the capacity to perform HTL and the efficiency that can be achieved in it, considering the volume and frequency of the outcomes, also need to be assessed. These shall be assessed independently of the effectiveness assessment of HTL described above.
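The capacity point can be approximated with simple arithmetic. The sketch below is a back-of-the-envelope check, assuming a hypothetical `reviewer_utilisation` function and made-up workload figures; it ignores fatigue, case complexity and uneven arrival of cases, all of which would matter in practice.

```python
def reviewer_utilisation(cases_per_day: int,
                         minutes_per_case: float,
                         reviewers: int,
                         review_minutes_per_reviewer: float) -> float:
    """Ratio of review time required to review time available per day.

    A value above 1.0 means the reviewers cannot give every case the
    assumed amount of attention.
    """
    available = reviewers * review_minutes_per_reviewer
    if available <= 0:
        raise ValueError("available review time must be positive")
    return (cases_per_day * minutes_per_case) / available


# Illustrative figures only: 400 cases a day at 6 minutes each,
# handled by 5 reviewers with 360 review minutes each per day.
utilisation = reviewer_utilisation(400, 6, 5, 360)
print(f"utilisation: {utilisation:.2f}")
if utilisation > 1.0:
    print("Volume exceeds reviewer capacity under these assumptions.")
```

A ratio well above 1.0 under realistic figures would suggest that reviewers are likely to skim cases, which ties capacity back to the rubber-stamping risk discussed earlier.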

Conclusion

HTL is an important tool to ensure that the outcomes of the algorithmic system are monitored and managed effectively. However, as with any tool, there are limitations to its effectiveness. Being aware of such limitations is essential when implementing human-centered HTL.

While the HTL effectiveness assessment framework described above is good guidance to start with, it has its limitations due to subjectivity in assessing various aspects.
