Reliability of Observational Assessment Methods for Outcome-based Assessment of Surgical Skill: Systematic Review and Meta-analyses

Marleen Groenier, Leonie Brummer, Brendan Bunting, Anthony Gallagher

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

BACKGROUND: Reliable performance assessment is a necessary prerequisite for outcome-based assessment of surgical technical skill. Numerous observational instruments for technical skill assessment have been developed in recent years, but methodological shortcomings in the reported studies may undermine the interpretation of their inter-rater reliability. OBJECTIVE: To synthesize the evidence on the inter-rater reliability of observational instruments for technical skill assessment used for high-stakes decisions. DESIGN: We performed a systematic review and meta-analysis, searching Scopus (including MEDLINE), PubMed, and key publications through December 2016. We included original studies that evaluated the reliability of instruments for the observational assessment of technical skills. Two reviewers independently extracted information on the primary outcome (the reliability statistic), secondary outcomes, and general study characteristics. Where appropriate, we calculated pooled estimates using multilevel random-effects meta-analyses. RESULTS: A total of 247 documents met our inclusion criteria and provided 491 inter-rater reliability estimates. Inappropriate inter-rater reliability indices were reported for 40% of the checklist estimates, 50% of the rating-scale estimates, and 41% of the estimates from other types of assessment instruments. Only 14 documents provided sufficient information to be included in the meta-analyses. The pooled Cohen's kappa was 0.78 (95% CI 0.69-0.89, p < 0.001) and the pooled proportion agreement was 0.84 (95% CI 0.71-0.96, p < 0.001). A moderator analysis explored the type of assessment instrument as a possible source of heterogeneity. CONCLUSIONS AND RELEVANCE: For high-stakes decisions, there was often insufficient information available on which to base conclusions. The use of suboptimal statistical methods and incomplete reporting of reliability estimates do not support the use of observational technical skill assessment instruments for high-stakes decisions. Interpretations of inter-rater reliability should take into account both the reliability index and the assessment instrument used. Reporting of inter-rater reliability needs to be improved through detailed descriptions of the assessment process.

Original language: English
Journal: Journal of Surgical Education
DOI: 10.1016/j.jsurg.2019.07.007
Publication status: Accepted/In press - 20 Aug 2019

Keywords

  • outcome-based assessment
  • surgical skill
  • inter-rater reliability
  • reporting guidelines
  • Patient Care
  • Medical Knowledge

Cite this

@article{c40dbfd619cf4576a3eceff65ab9eb5b,
title = "Reliability of Observational Assessment Methods for Outcome-based Assessment of Surgical Skill: Systematic Review and Meta-analyses",
abstract = "BACKGROUND: Reliable performance assessment is a necessary prerequisite for outcome-based assessment of surgical technical skill. Numerous observational instruments for technical skill assessment have been developed in recent years, but methodological shortcomings in the reported studies may undermine the interpretation of their inter-rater reliability. OBJECTIVE: To synthesize the evidence on the inter-rater reliability of observational instruments for technical skill assessment used for high-stakes decisions. DESIGN: We performed a systematic review and meta-analysis, searching Scopus (including MEDLINE), PubMed, and key publications through December 2016. We included original studies that evaluated the reliability of instruments for the observational assessment of technical skills. Two reviewers independently extracted information on the primary outcome (the reliability statistic), secondary outcomes, and general study characteristics. Where appropriate, we calculated pooled estimates using multilevel random-effects meta-analyses. RESULTS: A total of 247 documents met our inclusion criteria and provided 491 inter-rater reliability estimates. Inappropriate inter-rater reliability indices were reported for 40{\%} of the checklist estimates, 50{\%} of the rating-scale estimates, and 41{\%} of the estimates from other types of assessment instruments. Only 14 documents provided sufficient information to be included in the meta-analyses. The pooled Cohen's kappa was 0.78 (95{\%} CI 0.69-0.89, p < 0.001) and the pooled proportion agreement was 0.84 (95{\%} CI 0.71-0.96, p < 0.001). A moderator analysis explored the type of assessment instrument as a possible source of heterogeneity. CONCLUSIONS AND RELEVANCE: For high-stakes decisions, there was often insufficient information available on which to base conclusions. The use of suboptimal statistical methods and incomplete reporting of reliability estimates do not support the use of observational technical skill assessment instruments for high-stakes decisions. Interpretations of inter-rater reliability should take into account both the reliability index and the assessment instrument used. Reporting of inter-rater reliability needs to be improved through detailed descriptions of the assessment process.",
keywords = "outcome-based assessment, surgical skill, inter-rater reliability, reporting guidelines, Patient Care, Medical Knowledge",
author = "Marleen Groenier and Leonie Brummer and Brendan Bunting and Anthony Gallagher",
year = "2019",
month = "8",
day = "20",
doi = "10.1016/j.jsurg.2019.07.007",
language = "English",
journal = "Journal of Surgical Education",
issn = "1931-7204",
publisher = "Elsevier",

}

Reliability of Observational Assessment Methods for Outcome-based Assessment of Surgical Skill: Systematic Review and Meta-analyses. / Groenier, Marleen; Brummer, Leonie; Bunting, Brendan; Gallagher, Anthony.

In: Journal of Surgical Education, 20.08.2019.

Research output: Contribution to journal › Article › Academic › peer-review

TY - JOUR

T1 - Reliability of Observational Assessment Methods for Outcome-based Assessment of Surgical Skill

T2 - Systematic Review and Meta-analyses

AU - Groenier, Marleen

AU - Brummer, Leonie

AU - Bunting, Brendan

AU - Gallagher, Anthony

PY - 2019/8/20

Y1 - 2019/8/20

N2 - BACKGROUND: Reliable performance assessment is a necessary prerequisite for outcome-based assessment of surgical technical skill. Numerous observational instruments for technical skill assessment have been developed in recent years, but methodological shortcomings in the reported studies may undermine the interpretation of their inter-rater reliability. OBJECTIVE: To synthesize the evidence on the inter-rater reliability of observational instruments for technical skill assessment used for high-stakes decisions. DESIGN: We performed a systematic review and meta-analysis, searching Scopus (including MEDLINE), PubMed, and key publications through December 2016. We included original studies that evaluated the reliability of instruments for the observational assessment of technical skills. Two reviewers independently extracted information on the primary outcome (the reliability statistic), secondary outcomes, and general study characteristics. Where appropriate, we calculated pooled estimates using multilevel random-effects meta-analyses. RESULTS: A total of 247 documents met our inclusion criteria and provided 491 inter-rater reliability estimates. Inappropriate inter-rater reliability indices were reported for 40% of the checklist estimates, 50% of the rating-scale estimates, and 41% of the estimates from other types of assessment instruments. Only 14 documents provided sufficient information to be included in the meta-analyses. The pooled Cohen's kappa was 0.78 (95% CI 0.69-0.89, p < 0.001) and the pooled proportion agreement was 0.84 (95% CI 0.71-0.96, p < 0.001). A moderator analysis explored the type of assessment instrument as a possible source of heterogeneity. CONCLUSIONS AND RELEVANCE: For high-stakes decisions, there was often insufficient information available on which to base conclusions. The use of suboptimal statistical methods and incomplete reporting of reliability estimates do not support the use of observational technical skill assessment instruments for high-stakes decisions. Interpretations of inter-rater reliability should take into account both the reliability index and the assessment instrument used. Reporting of inter-rater reliability needs to be improved through detailed descriptions of the assessment process.

AB - BACKGROUND: Reliable performance assessment is a necessary prerequisite for outcome-based assessment of surgical technical skill. Numerous observational instruments for technical skill assessment have been developed in recent years, but methodological shortcomings in the reported studies may undermine the interpretation of their inter-rater reliability. OBJECTIVE: To synthesize the evidence on the inter-rater reliability of observational instruments for technical skill assessment used for high-stakes decisions. DESIGN: We performed a systematic review and meta-analysis, searching Scopus (including MEDLINE), PubMed, and key publications through December 2016. We included original studies that evaluated the reliability of instruments for the observational assessment of technical skills. Two reviewers independently extracted information on the primary outcome (the reliability statistic), secondary outcomes, and general study characteristics. Where appropriate, we calculated pooled estimates using multilevel random-effects meta-analyses. RESULTS: A total of 247 documents met our inclusion criteria and provided 491 inter-rater reliability estimates. Inappropriate inter-rater reliability indices were reported for 40% of the checklist estimates, 50% of the rating-scale estimates, and 41% of the estimates from other types of assessment instruments. Only 14 documents provided sufficient information to be included in the meta-analyses. The pooled Cohen's kappa was 0.78 (95% CI 0.69-0.89, p < 0.001) and the pooled proportion agreement was 0.84 (95% CI 0.71-0.96, p < 0.001). A moderator analysis explored the type of assessment instrument as a possible source of heterogeneity. CONCLUSIONS AND RELEVANCE: For high-stakes decisions, there was often insufficient information available on which to base conclusions. The use of suboptimal statistical methods and incomplete reporting of reliability estimates do not support the use of observational technical skill assessment instruments for high-stakes decisions. Interpretations of inter-rater reliability should take into account both the reliability index and the assessment instrument used. Reporting of inter-rater reliability needs to be improved through detailed descriptions of the assessment process.

KW - outcome-based assessment

KW - surgical skill

KW - inter-rater reliability

KW - reporting guidelines

KW - Patient Care

KW - Medical Knowledge

UR - http://www.scopus.com/inward/record.url?scp=85070899845&partnerID=8YFLogxK

U2 - 10.1016/j.jsurg.2019.07.007

DO - 10.1016/j.jsurg.2019.07.007

M3 - Article

JO - Journal of Surgical Education

JF - Journal of Surgical Education

SN - 1931-7204

ER -