Psychometric Properties

Whether you are a student, clinician or researcher, having confidence in the tools you use as a professional is important. Clinicians and researchers use various tools on a daily basis for clinical assessments and evaluations, measuring change over time and establishing prognosis for patients. [1] Our clinical reasoning, interventions and research suggestions can only be as strong as the tools we use. [2]

Psychometrics is the field of mathematics that is concerned with the statistical description of instrumental data as variables and with the inferential statistical description of the relationships between variables.[3] In rehabilitation medicine, psychometric tools usually measure individual parameters such as physical characteristics, ability, perception of change, pain and functional ability.

What are Psychometric Properties?

Having confidence in clinical tools means knowing that they measure what they are intended to measure (validity), are stable over time (reliability) and can detect changes in a condition (responsiveness). Collectively, these are called the psychometric properties (or methodological qualities) of a tool, scale or outcome measure. [4]

Psychometric properties can be applied to questionnaires, outcome measures, clinical tools, scales or special tests. For the remainder of the page, the term “tool” is used to describe all of these categories.

Measurement instruments play an important role in research, clinical practice and health assessment.[5] Researchers and clinicians use measurement as a way of quantifying, understanding, evaluating and differentiating physical characteristics of the human body.[6] This is achieved through the use of clinical tools with patients.

Measurement means quantifying bodily characteristics, for example level of pain, range of motion, strength or functional outcomes. In clinical research and practice, measurement supports decision making and tracking progress during rehabilitation.

Level of Measurement Types

There are four types of measurement for data classification in psychometrics: nominal, ordinal, interval and ratio.

For further clarification, see the video cited in reference [7].
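As a rough illustration of why the level of measurement matters, the sketch below picks a summary statistic suited to each level. The function name, variable names and values are hypothetical examples chosen for illustration, not data from the sources cited on this page.

```python
from statistics import mean, median, mode


def summarise(values, level):
    """Return a central-tendency summary suited to the level of measurement."""
    if level == "nominal":
        return mode(values)      # unordered categories: most frequent category
    if level == "ordinal":
        return median(values)    # ranked categories: middle rank
    if level in ("interval", "ratio"):
        return mean(values)      # equal intervals: arithmetic mean is meaningful
    raise ValueError(f"unknown level of measurement: {level}")


# Hypothetical example data, one variable per level.
affected_side = ["left", "right", "right", "left", "right"]  # nominal
muscle_grade = [3, 4, 4, 5, 2]             # ordinal (0-5 manual muscle test)
temperature_c = [36.5, 37.0, 36.8, 37.2]   # interval (no true zero)
knee_flexion_deg = [110, 120, 125, 135]    # ratio (true zero, ratios meaningful)

print(summarise(affected_side, "nominal"))
print(summarise(muscle_grade, "ordinal"))
print(summarise(temperature_c, "interval"))
print(summarise(knee_flexion_deg, "ratio"))
```

The same reasoning guides the choice of reliability statistic later on: categorical ratings call for agreement coefficients such as Kappa, while continuous measurements call for statistics such as the ICC.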

Possible combinations of validity and reliability. Retrieved from Souza et al. (2017). [5]

Validity refers to the tool’s ability to measure what it is supposed to measure. [4] Is the tool measuring the construct it is intended to? For example: does the goniometer truly measure range of motion? 

Validity implies that a tool is relatively free from error (i.e. reliable): a tool that is not consistent cannot produce confidence in a measurement. [8] The reverse does not hold: a tool can be reliable without being valid (consistent over time, but not measuring the construct of interest).

To be classified as having strong psychometric properties, a tool needs to be both valid and reliable.[9]

Types of validity

Validity types and measurements. Retrieved from Souza et al. (2017). [5]

Content Validity – The degree to which the content (i.e. subsections or items) of the tool adequately reflects the construct of interest (mostly used with questionnaires). [4][6]

  • Face Validity – An aspect of content validity that refers to the degree to which the items of a tool appear to be an adequate reflection of what the tool is supposed to measure (the weakest form of validity).[4]

Construct Validity – The degree to which the scores of a tool are consistent with hypotheses based on the abstract concept (does it measure the theoretical component of the construct or variable?). [4][6]

  • Structural (or Factorial) Validity – An aspect of construct validity that refers to the degree to which the scores of a tool are an adequate reflection of the dimensionality of the measured construct. [4]

Discriminant Validity – Tests the hypothesis that the tool is not improperly related to different constructs. [5]

Criterion (or Criterion-based) Validity – The degree to which the measurement of one tool can be used as a substitute for an established reference standard (gold standard). [6]

Concurrent Validity – Establishes the validity of two measurements taken at the same time (perhaps one tool is considered more efficient than the gold standard). [6]

Predictive Validity – The measurement of one tool can be used to predict a future score of another tool. [6]

Cross-cultural Validity – The degree to which a culturally adapted tool is equivalent to the original instrument. [5]
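Criterion and concurrent validity are often examined by correlating scores from the new tool with scores from the reference standard. The sketch below computes a Pearson correlation by hand; the smartphone app and goniometer readings are hypothetical values invented for illustration. Note that a correlation measures association rather than agreement: a tool that reads consistently 10° too high would still correlate perfectly with the reference.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical knee flexion readings (degrees) for five patients.
goniometer = [100, 110, 120, 130, 140]   # reference standard
phone_app = [102, 109, 123, 128, 141]    # new tool, measured at the same visit

print(round(pearson_r(goniometer, phone_app), 3))
```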

Reliability refers to the extent to which a measurement is consistent and free from error.[6] As it relates to the reproducibility or dependability of a measurement, it is absolutely key to a strong clinical tool, because without it, we cannot have confidence in our tools or measurements, nor can we have strong clinical reasoning. However, it is important to understand that measurements are rarely perfectly reliable, as humans do respond with some degree of inconsistency. For example, if you measure someone’s knee flexion range of motion three times, will the measurements be identical all three times? Most likely not, as there will be inconsistencies with the precision of the evaluator and the state of the patient. 

Reliability measurements. Retrieved from Souza et al. (2017)[5].

Reliability refers mainly to the stability, internal consistency and equivalence of a tool.[10] It is important to highlight that reliability is not a fixed property. On the contrary, it depends on the function of the instrument, the population in which it is used, the circumstances and the context; that is, the same tool may not be considered reliable under different conditions.[11]

Reliability estimates are affected by several aspects of the assessment environment (raters, sample characteristics, type of tool, administration method) and by the statistical method used.[12] Therefore, the results of research using measurement instruments can only be interpreted when the assessment conditions and the statistical approach are clearly presented.[13]

Types of reliability

1. Test-retest reliability: The test-retest reliability of a test describes the stability of scores obtained by a patient when they are evaluated on two separate occasions. This appears similar to intra-rater reliability, but in this case the patient evaluates themselves (for example, with a pain-rating scale).[14]

Quantitative measure:

  • Intraclass correlation coefficients (ICC)
  • Bland and Altman method (agreement between two sets of measurements)

Qualitative measure:

  • Two coefficients, Kappa or weighted Kappa

2. Intra-rater: The same evaluator over time. The intra-rater reliability of a test relates to the stability of the scores obtained by a rater when they carry out the test on two separate occasions. A single rater tests each patient twice (or more) with several days in between each test. The patient’s state must remain unchanged during this time.[14]

Quantitative measure:

  • Intraclass correlation coefficients (ICC)
  • Bland and Altman method (agreement between two sets of measurements)

Qualitative measure:

  • Two coefficients, Kappa or weighted Kappa

3. Inter-rater: Different evaluators, usually within the same time period. The inter-rater reliability of a test describes the stability of the scores obtained when two different raters carry out the same test. Each patient is tested independently at the same moment in time by two (or more) raters.[14]

Quantitative measure:

  • Intraclass correlation coefficients (ICC)
  • Bland and Altman method (agreement between two sets of measurements)

Qualitative measure:

  • Two coefficients, Kappa or weighted Kappa
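The two quantitative measures listed above can be sketched in a few lines. The sources cited here do not say which form of the ICC to use, so the example assumes ICC(2,1) (two-way random effects, absolute agreement, single measurement); the Bland and Altman limits assume the differences are roughly normally distributed. The rater readings are hypothetical.

```python
from statistics import mean, stdev


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores: list of rows, one row per patient, one column per rater (or occasion).
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares: patients (rows), raters (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) of the 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical knee flexion readings (degrees) from two raters on five patients.
rater_a = [110, 120, 125, 135, 140]
rater_b = [112, 118, 127, 134, 143]

print(icc_2_1([list(pair) for pair in zip(rater_a, rater_b)]))
print(bland_altman(rater_a, rater_b))
```

The same functions apply to test-retest and intra-rater designs by treating the two occasions as the two columns.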

Other statistics associated with reliability:

  • Pearson product-moment coefficient of correlation;
  • Spearman's rho (ordinal data);
  • Intraclass correlation coefficient (ICC) (correlations and level of agreement);
  • Kappa statistics (agreement corrected for chance).
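To show how Kappa differs from raw percent agreement, the sketch below computes both for two raters scoring a special test as positive or negative. It uses the unweighted Cohen's Kappa for two raters; the ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Proportion of cases on which the two raters gave the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Unweighted Cohen's Kappa: agreement corrected for chance agreement."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    counts1, counts2 = Counter(r1), Counter(r2)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts1[c] * counts2[c] for c in set(r1) | set(r2)) / n ** 2
    if expected == 1:
        raise ValueError("Kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical results of a special test scored by two raters on ten patients.
rater_1 = ["pos", "pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
rater_2 = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "pos"]

print(percent_agreement(rater_1, rater_2))
print(round(cohen_kappa(rater_1, rater_2), 2))
```

In this example the raters agree on 8 of 10 patients, but half of that agreement would be expected by chance alone, so Kappa is lower than the raw percentage.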

If there is a question about the stability of the measurement over time, the standard error of measurement (SEM) can also be calculated.[6]
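One common way to estimate the SEM is from the standard deviation of the scores and a reliability coefficient such as the ICC: SEM = SD × √(1 − reliability). The sketch below assumes that formula; the scores and the reliability value are hypothetical.

```python
from math import sqrt
from statistics import stdev


def standard_error_of_measurement(scores, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability between 0 and 1."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return stdev(scores) * sqrt(1 - reliability)


# Hypothetical baseline scores on a 0-100 functional scale and an assumed ICC.
baseline_scores = [40, 50, 60, 70, 80]
assumed_icc = 0.9

print(standard_error_of_measurement(baseline_scores, assumed_icc))
```

The SEM is expressed in the units of the tool itself, which makes it easier to interpret at the bedside than a unitless coefficient.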

Minimal Detectable Change (MDC), also called the Minimal Detectable Difference (MDD): The smallest amount of change in a variable (a measurement) that exceeds measurement error and can therefore be considered a true difference.

It is important to understand that the MDC is not the same as the Minimal Clinically Important Difference (MCID). The MCID reflects the amount of change that needs to occur to be clinically meaningful. In general, the MDC will be smaller than the MCID values.[6]
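The MDC is commonly derived from the SEM at a 95% confidence level as MDC95 = 1.96 × √2 × SEM. The sketch below assumes that formula and then places an observed change relative to the MDC and the MCID. The SEM, the MCID and the function names are hypothetical values and labels chosen for illustration.

```python
from math import sqrt


def mdc95(sem):
    """Minimal detectable change at 95% confidence: 1.96 * sqrt(2) * SEM."""
    if sem < 0:
        raise ValueError("SEM cannot be negative")
    return 1.96 * sqrt(2) * sem


def interpret_change(change, mdc, mcid):
    """Label an observed change relative to the MDC and the MCID."""
    size = abs(change)
    if size < mdc:
        return "within measurement error"
    if size < mcid:
        return "detectable, but below the MCID"
    return "detectable and at or above the MCID"


# Hypothetical values on a 0-100 functional scale.
sem = 3.0
mcid = 10.0
mdc = mdc95(sem)

for change in (5, 9, 12):
    print(change, interpret_change(change, mdc, mcid))
```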

Population-specific reliability refers to the degree to which a measure reflects the population it is intended to be used with. [15] It allows generalisations to be made, which requires that the study sample (of patients and raters) be representative of the target population. [2]

Responsiveness, also known as sensitivity to change, is the ability of an instrument to measure small changes that are clinically important, such as those seen when participants or patients respond to effective therapeutic interventions. It is considered an important part of assessing constructs longitudinally.[16]

A tool is said to be sensitive to change if it can precisely measure increases and decreases in the construct measured. This is important for tools which are used to evaluate changes following a therapeutic action. The aim is to measure the capacity of the scale to detect small but clinically significant changes. When an outcome measure is sensitive to change, the score increases as the patient improves, decreases as the patient worsens and does not change if the patient’s state remains stable.[14]

The responsiveness of a measuring instrument is its ability to detect change over time. A commonly used index of responsiveness is the effect size for paired differences. [17]
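A minimal sketch of an effect size for paired differences is the mean change divided by the standard deviation of the change, often called the standardized response mean. The sketch assumes that definition, and the baseline and follow-up scores are hypothetical.

```python
from statistics import mean, stdev


def paired_effect_size(before, after):
    """Mean of the paired differences divided by their standard deviation."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / stdev(diffs)


# Hypothetical scores on a functional scale before and after an intervention.
baseline = [40, 45, 50, 55, 60]
follow_up = [48, 50, 60, 62, 70]

print(round(paired_effect_size(baseline, follow_up), 2))
```

A larger value means the average change is large relative to how much the change varies between patients.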

When you are using a tool, a questionnaire or a functional outcome measure with your patient, you want to have confidence in your tool. Is it measuring what it is supposed to? Is it reliable over time? Can it detect the difference between a healthy and a pathological state?

To be confident with your tool, you need to have strong validity, reliability and responsiveness. These constructs are not fixed properties but vary according to the population and setting they are applied to. However, you cannot be certain as a clinician that your interventions are truly helping your patient without these.

Be supportive and encouraging to researchers who are conducting methodological studies on psychometric properties: they are raising the quality of rehabilitation medicine.


COSMIN: A global initiative for the selection and development of outcome measure instruments.

  1. Walton M., Powers J., Hobart J., Patrick D., Marquis P., Vamvakas S., Isaac M., Molsen E., Cano S., Burke L. Clinical Outcome Assessments: Conceptual Foundation–Report of the ISPOR Clinical Outcomes Assessment – Emerging Good Practices for Outcomes Research Task Force. Value Health. 2015 Sep; 18(6): 741–752.
  2. Jerosch-Herold C. An Evidence-Based Approach to Choosing Outcome Measures: A Checklist for the Critical Appraisal of Validity, Reliability and Responsiveness Studies. Br J Occup Ther 2005; 68(8):347-53.
  3. Russell EW. Chapter 2: The Nature of Science. In: The Scientific Foundation of Neuropsychological Assessment: With Applications to Forensic Evaluation. Copyright © 2012 Elsevier Inc. 
  4. Mokkink L., Terwee C., Patrick D., Alonso J., Stratford P., Knol D., Bouter L., de Vet H. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010 Jul;63(7):737-45.
  5. Souza A., Alexandre N., Guirardello E. Psychometric properties in instruments evaluation of reliability and validity. Epidemiologia e Serviços de Saúde. 2017 Jul;26:649-59.
  6. Portney L., Watkins M. Chapter 4: Principles of Measurement, within Foundations of Clinical Research : Applications to Practice, 3rd Edition. F.A. Davis Company, Pennsylvania, United States 2015. ISBN10 0803646577. 
  7. 365 Data Science. Data Science & Statistics: Levels of Measurement. Available from: [accessed 2 April 2023]
  8. Price P., Jhangiani R., Chiang I-Chant A. Chapter 5: Psychological Measurement. In: Research methods in Psychology – 2nd Canadian edition. Available online:[accessed 14-2-2023]
  9. Gellman M., Turner J. Psychometric properties. In: Encyclopedia of Behavioral Medicine. 2013 Edition. Springer, New York.
  10. Martins GA. Sobre confiabilidade e validade. RBGN. 2006 jan-abr;8(20):1-12.
  11. Keszei A., Novak M., Streiner D. Introduction to health measurement scales. Journal of psychosomatic research. 2010 Apr 1;68(4):319-23.
  12. Roach KE. Measurement of health outcomes: reliability, validity and responsiveness. JPO: Journal of Prosthetics and Orthotics. 2006 Jan 1;18(6):P8-12.
  13. Kottner J., Audigé L., Brorson S., Donner A., Gajewski B., Hróbjartsson A., Roberts C., Shoukri M., Streiner DL. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. International journal of nursing studies. 2011 Jun 1;48(6):661-71.
  14. Fermanian J. [Validation of assessment scales in physical medicine and rehabilitation: how are psychometric properties determined?.] Ann Readapt Med Phys. 2005; 48(6):281-287.(Article in French)
  15. Portney L.G., Watkins M.P. Foundations of clinical research: Applications to practice. 2nd Edition, Prentice Hall Health, Upper Saddle River, 2000.
  16. Lohr KN. Assessing health status and quality-of-life instruments: attributes and review criteria. Quality of life Research. 2002 May;11(3):193-205.
  17. King M, Dobson A. Estimating the responsiveness of an instrument using more than two repeated measures. Biometrics. 2000 Dec;56(4):1197-203. doi: 10.1111/j.0006-341x.2000.01197.x. PMID: 11129479.
  18. The Psychometric World. Crash Course in Psychometric Testing – Module 3: Reliability, Validity and Norms. Available from: [accessed 17-02-2023]
