Lesson observations: Approach with caution!
For any measure of teaching effectiveness to be useful, it needs to be valid. To be valid, a measure also needs to be reliable. Reliability represents the consistency of a measure. A measure is said to have a high reliability if it produces similar results each time – for example if two observers independently rate the same lesson, those ratings should agree with one another. Validity represents the extent to which a measurement corresponds to what it aimed to measure. So a valid observation would measure genuine learning gains, rather than be subject to bias.
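To make the idea of reliability concrete, here is a minimal sketch of how agreement between two observers could be quantified. The grades below are invented purely for illustration (1 = Outstanding to 4 = Inadequate), and Cohen's kappa is just one common agreement statistic; it is not necessarily the one used in the studies discussed in this post.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of lessons where both observers gave the same grade."""
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both observers must rate the same non-empty set of lessons")
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    # Agreement expected if each observer graded at random
    # according to their own overall grade frequencies.
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both observers used a single identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades for eight lessons, made up for this example.
observer_a = [1, 2, 2, 3, 1, 2, 3, 4]
observer_b = [1, 2, 3, 3, 2, 2, 3, 3]

print(percent_agreement(observer_a, observer_b))
print(cohen_kappa(observer_a, observer_b))
```

Kappa is lower than raw percentage agreement because some matches would happen by chance even if the observers were grading at random.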
We’ve known for some time that classroom observations lack the reliability for high-stakes judgements of teacher effectiveness. For example, the MET project – which spent millions of dollars to produce robust observation protocols – found that even for these carefully constructed instruments, the reported reliabilities ranged from 0.24 to 0.68. Rob Coe gives a great example of why such low reliability represents a problem for teacher appraisal.
“One way to understand these values is to estimate the percentage of judgements that would agree if two raters watch the same lesson. Using Ofsted’s categories, if a lesson is judged ‘Outstanding’ by one observer, the probability that a second observer would give a different judgement is between 51% and 78%.”
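A rough simulation can show why reliabilities in this range translate into so much disagreement. This is only a sketch under stated assumptions, not Coe's actual calculation: it treats the reliability as the correlation between two observers' underlying scores, assumes those scores are normally distributed, and assumes a hypothetical 20% of lessons are graded ‘Outstanding’ (the `top_share` parameter). The exact percentages depend heavily on those assumptions.

```python
import math
import random
from statistics import NormalDist


def disagreement_given_top(reliability, top_share=0.2, n=100_000, seed=0):
    """Estimate P(second observer gives a different grade | first gives the top grade).

    Assumptions (not from the source): scores are standard normal, the two
    observers' scores correlate at `reliability`, and the top `top_share`
    of scores receive the top grade.
    """
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - top_share)
    noise_sd = math.sqrt(1 - reliability ** 2)
    first_top = 0
    both_top = 0
    for _ in range(n):
        first = rng.gauss(0, 1)
        # Second observer's score shares `reliability` of the first's signal.
        second = reliability * first + noise_sd * rng.gauss(0, 1)
        if first > cutoff:
            first_top += 1
            if second > cutoff:
                both_top += 1
    return 1 - both_top / first_top


for r in (0.24, 0.68):
    print(f"reliability {r}: disagreement about {disagreement_given_top(r):.0%}")
```

Under these assumptions, the lower the reliability, the more often the second observer fails to confirm a top grade given by the first.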
Evidence of poor validity has also been around for a while. For example, Strong et al. (2011) asked participants to watch videos of teaching and rate whether the teacher was ‘effective’ or ‘ineffective’. These ratings were compared to the value-added scores for the students of those teachers. Where the observers had not received specific training on observations, even experienced teachers and head teachers matched ‘effective’ teaching to high value-added less than 50% of the time. Again, Rob Coe gives an example of what this would mean for classroom teachers being graded in the UK.
“Fewer than 1% of lessons judged inadequate and only 4% of lessons judged outstanding produce matching learning gains. Overall, 63% of judgements will not correspond with value-added.”
Why is observation such a problematic measure of effective teaching?
There are lots of possible reasons why observations may not be a valid or reliable measure of teaching. One problem is that learning is invisible – it takes place in a student’s head, and anything that we can see in the classroom is merely a proxy for that learning. Another problem is that observers likely have strong ideas about what ‘good practice’ looks like – whether those practices lead to learning gains is another matter. Teaching is also based on a natural ability – something humans have evolved to do – so even experienced teachers find it really difficult to describe explicitly what it is they do.
However, there also appears to be another, quite simple, reason why observations used to make high-stakes judgements about teaching tend to lack validity: it seems the prior attainment of students in a class biases the ratings of an observer.
Steinberg, M. P., & Garrett, R. (2016). Classroom Composition and Measured Teacher Performance: What Do Teacher Observation Scores Really Measure? Educational Evaluation and Policy Analysis, 0162373715616249.
Steinberg and Garrett used data from the MET study to explore the extent to which the class a teacher is timetabled to teach might influence observation measures of that teacher’s performance. They review a number of previous studies in this area, which identify other factors that appear to influence the outcome of observation ratings. For example, observation scores tend to be lower for teachers whose students come from more disadvantaged backgrounds. They also note the problem that teachers are not randomly assigned to teaching groups in schools – and that often inexperienced teachers are allocated to more disadvantaged students, while more experienced teachers tend to work with higher achieving students.
To examine whether the prior attainment of students influenced observation ratings, they used the data from the MET study. The MET study was carried out over 2 years and across six districts in the US. One of the advantages of the MET data was the fact that the project randomised the allocation of teachers to classes prior to the second year of the study. They used this random allocation of 834 teachers to classes (Grades 4-9) to generate estimates of the effect of prior achievement on measured teacher performance for that second year. Their conclusion suggests that teachers of lower-ability groups may be unfairly rated as relatively ineffective, even when very strict observation protocols involving considerable training are used:
“In this article, we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time.” Page 20
Interestingly, the study found that the influence of prior attainment was greater for teachers of ELA (English Language Arts) than for maths teachers, and for subject-specialist teachers (common in secondary in the UK) compared to generalist teachers (more like primary). Another interesting finding was that prior attainment appeared to particularly influence measures of teaching related to ‘classroom climate’ – suggesting that teachers of higher-performing students may be judged better at behaviour management than they actually are.
This study has significant implications for schools which use high-stakes (let alone graded) observations as the basis for appraisal. If a teacher’s effectiveness is, in part, determined by which groups they are allocated to teach, then withholding a pay rise or placing a teacher on capability based on observations of teaching becomes potentially unmerited and inequitable.
How can teachers know their impact?
Observations of teaching can (and I’d say should) provide teachers with useful feedback they can use to develop their professional practice – but if observations lack validity, then they won’t help provide useful formative feedback (let alone summative judgement).
There’s a great video of Rob Coe presenting some of the problems and possible ways forward at Teach First: What is the future of lesson observation in our schools? (Part 1) (Jan 2014)
He also has some suggestions about how schools might approach observations:
- Stop assuming that untrained observers can either make valid judgements or provide feedback that improves anything
- Apply a critical research standard and the best existing knowledge to the process of developing, implementing and validating observation protocols
- Ensure that good evidence supports any uses or interpretations we make for observations. It follows that appropriate caveats around the limits of such uses should be clearly stated and the use should not go beyond what is justified
- Undertake robustly evaluated research to investigate how feedback from lesson observation might be used to improve teaching quality (EEF already has one such study underway).
Other than observations, value-added data and student survey feedback might be used to help provide teachers with more valid feedback on their teaching.
The MET study, for example, found that value-added (VA) data correlates reasonably well with a teacher’s long-term success. However, VA data tends to come too infrequently (and too late) in a school cycle to help identify where things might be going well or need to improve. It also doesn’t provide ‘fine-grain’ detail – i.e. it can tell you that students did well, but can’t really tell you what it was the teacher was doing well, or what they should be doing to improve. There are also some other issues with VA scores – for example, one study tested VA modelling techniques by seeing what effect teachers had on their students’ heights. In their analysis, they found that teachers appeared to influence the height of their students almost as much as English and maths scores.
Student surveys are another method used by the MET. I’ve used these within a coaching context to help teachers identify areas they might work on – and used follow-up surveys to see whether students felt the changes had any impact. You can read a bit about this here: Investigating teaching using a student survey
In the end, though, the problem for teachers arises when high-stakes judgements rest on any form of measure which lacks the reliability and validity to form a reasonable basis for such judgement. I suspect part of the issue has been the impression that school leaders needed to have such observation data to support their judgements of the quality of teaching in their schools to Ofsted. One of the proposals in the recent white paper (Educational Excellence Everywhere) was to remove the separate Ofsted judgements for Teaching and Learning from future inspections. On the basis of the evidence this seems like a very good idea indeed!
Perhaps once out of Ofsted’s shadow, schools will be able to think about how to use observations much more constructively – perhaps as a coaching tool to help teachers improve their impact rather than a sword of Damocles to hang over their heads three times a year.