Under the Australian VET Framework and the AQTF, assessments used by Registered Training Organisations (RTOs) must comply with the “Principles of Assessment: fair, flexible, reliable and valid”.
Trainers and RTO managers frequently ask me: “How can we ensure our assessment tools are reliable and valid? Do you have a template for that?” I decided to write the following notes to shed some light on these questions.
Testing the reliability of an assessment means finding out how much error is included in the evidence. In other words, it means finding the degree to which an assessment tool produces stable and consistent results.
According to the NVR Standards’ definitions, there are five types of assessment reliability:
- Internal consistency,
- Parallel forms,
- Split-half,
- Inter-rater, and
- Intra-rater.
Let’s look at the meaning of each type of reliability below.
Internal consistency is a measure of reliability used to evaluate the degree to which different items that probe the same construct (area of knowledge or competency) produce similar results. It can be obtained by grouping assessment items by area of knowledge or competency and determining the correlation between them. In other words, we want to ensure that there is consistency across all tasks within an assessment tool.
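One simple way to put a number on "the correlation between them" is the average correlation over every pair of items in a group. The sketch below is only an illustration under that assumption: the function names and the candidate scores are hypothetical, and the NVR Standards do not prescribe this particular statistic.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_inter_item_correlation(item_scores):
    """Average Pearson correlation over every pair of items.

    item_scores maps an item name to the scores each candidate obtained
    on that item (same candidate order for every item).
    """
    pairs = list(combinations(item_scores.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores (0 to 5) of six candidates on three tasks that
# all probe the same competency.
scores = {
    "task_1": [5, 4, 2, 3, 1, 4],
    "task_2": [4, 4, 1, 3, 2, 5],
    "task_3": [5, 3, 2, 2, 1, 4],
}
print(round(mean_inter_item_correlation(scores), 2))
```

A value close to 1 suggests the tasks rank candidates in much the same way; a value near 0 suggests at least one task is measuring something else.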
Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same area of knowledge or competencies) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternative versions. You may be interested in developing a large set of assessment items (tasks, questions) for a particular competency standard and then randomly splitting the items into two sets, which would represent the parallel forms.
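The random split described above can be sketched in a few lines. The item names, the pool size and the `split_into_parallel_forms` helper are hypothetical, chosen only for illustration.

```python
import random

def split_into_parallel_forms(item_pool, seed=None):
    """Randomly split a pool of items into two forms of (near) equal size."""
    rng = random.Random(seed)   # a fixed seed makes the split reproducible
    shuffled = list(item_pool)  # copy, so the original pool is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pool of ten questions written for one competency standard.
pool = [f"Q{n}" for n in range(1, 11)]
form_a, form_b = split_into_parallel_forms(pool, seed=42)
print(form_a)
print(form_b)
```

Both forms would then be administered to the same group, and the two sets of total scores correlated.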
Split-half is in fact another subtype of internal consistency reliability. The idea is to “split in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
Inter-rater reliability is a measure of reliability used to assess the degree to which different assessors or raters agree in their assessment decisions. It is about how much homogeneity, consistency or consensus there is in the assessment decisions made by different assessors. Inter-rater reliability is useful because human observers will not necessarily interpret answers or behaviour the same way; assessors may disagree as to how well certain responses demonstrate the knowledge or skills being assessed.
Intra-rater reliability is a measure of reliability used to assess the degree of agreement among multiple repetitions of an assessment tool administered by a single assessor to different individuals or groups.
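Agreement between two assessors can be quantified in more than one way. The sketch below shows two common options that are not named in the NVR Standards: simple percent agreement, and Cohen's kappa, which discounts the agreement expected by chance. The assessor decisions are hypothetical ("C" for competent, "NYC" for not yet competent).

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of candidates on which both assessors made the same decision."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance (undefined if chance agreement is 1)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both pick the same category at random,
    # given how often each assessor uses each category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical decisions by two assessors on the same ten candidates.
assessor_1 = ["C", "C", "NYC", "C", "NYC", "C", "C", "NYC", "C", "C"]
assessor_2 = ["C", "C", "NYC", "NYC", "NYC", "C", "C", "C", "C", "C"]

print(percent_agreement(assessor_1, assessor_2))
print(round(cohens_kappa(assessor_1, assessor_2), 2))
```

Kappa is lower than raw agreement here because both assessors judge most candidates competent, so a good deal of agreement would occur by chance alone.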
Validity refers to how well an assessment tool measures what it is purported to measure. For example, if your scale is off by 5 kg, it reads your weight every day with an excess of 5 kg. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 kg to your true weight. It is not a valid measure of your weight.
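The scale example can be expressed with two numbers: the spread of the readings (consistency, which relates to reliability) and their offset from the true value (accuracy, which relates to validity). The true weight and the daily readings below are hypothetical.

```python
from statistics import mean, stdev

TRUE_WEIGHT_KG = 70.0  # hypothetical true weight

# Hypothetical readings over five days from a scale that is off by 5 kg.
readings = [75.0, 75.1, 74.9, 75.0, 75.0]

spread = stdev(readings)                 # small spread: consistent readings
bias = mean(readings) - TRUE_WEIGHT_KG   # large offset: inaccurate readings

print(f"spread = {spread:.2f} kg, bias = {bias:.2f} kg")
```

A small spread with a large bias is exactly the "reliable but not valid" situation described above.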
There are five types of validity defined in the NVR Standards:
- face,
- content,
- criterion (i.e. predictive and concurrent),
- construct, and
- consequential.
Face Validity ascertains that the measure appears to be assessing the intended competency. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of candidates. If the candidates do not believe the measure is an accurate assessment of the competency standard, they may become disengaged from the proposed assessment task.
Content Validity ensures that the measure covers all the performance criteria within the competency standard. It is important that all required workplace conditions and standards have been included in the assessment.
Criterion Validity is used to predict future or current performance: it correlates assessment outcomes with competency standards. A criterion validity study is completed by collecting both the assessment results and information about the student’s performance. For example, students taking a Certificate IV in Accounting are assessed in basic accounting calculations at an early stage of the course. The results of that assessment can be correlated with the students’ performance in applying those basic accounting calculation techniques in other units of competency within the course. The criterion validity test will show how accurately the first assessment predicts the students’ behaviour and performance in the later units of competency.
Concurrent criterion validity must be used whenever there are two or more different assessment paths. A classic example is the RPL process. To study criterion validity for concurrent assessments, it is necessary to correlate the performance of students who took the course learning and assessment path with the performance of those who took the RPL option.
Construct Validity is used to ensure that the assessment is actually assessing what it is intended to assess (i.e. the construct), and not other competencies. Using a panel of assessors familiar with the competency is one way in which this type of validity can be assessed. The assessors can examine the items and decide what each specific item is intended to assess. For example, marketing students may be asked to read a case study about an IT organisation and prepare a marketing plan based on the case scenario information. If the case scenario is written with complicated IT-specific wording and phrasing, the assessment can inadvertently become an assessment of technical reading comprehension rather than an assessment of marketing planning.
Consequential Validity refers to the social consequences of using a particular assessment for a particular purpose. Assessment tasks must be relevant to the workplace and the student target group. For example, suppose some subgroups obtain lower scores in a particular communication skills assessment and, as a consequence, are required to take further training in communication skills. If the assessment measured traits of the subgroup that are not important for effective communication in their relevant workplace, the assessment is not valid for consequential validity reasons.
Validity is not simply a property of the assessment tool. An assessment tool designed for a particular purpose and target group may not necessarily lead to valid interpretations of performance and valid assessment decisions if it is used for a different purpose and/or target group.
We do have “Validation templates” that we use to record evidence of the validation activities. They have been created to guide assessors through the process of auditing compliance with the principles of assessment and the rules of evidence. It is important to state here that those templates may become useless if the assessors participating in the validation process have not mastered the principles described above.
As a contribution to the VET sector, Insources delivers training in this area including the workshop: Lead Assessment Validations.