Lesson observations: Would picking a top set get you a better grading?

Lesson observations: Approach with caution!

For any measure of teaching effectiveness to be useful, it needs to be valid. To be valid, a measure also needs to be reliable. Reliability represents the consistency of a measure: a measure has high reliability if it produces similar results each time – for example, if two observers independently rate the same lesson, those ratings should agree with one another. Validity represents the extent to which a measurement corresponds to what it aims to measure. So a valid observation would measure genuine learning gains, rather than being subject to bias.
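
One way to make this concrete is the classical test theory framing (my gloss, not something the studies below rely on): an observed rating decomposes into a ‘true score’ plus measurement error, and reliability is the share of rating variance that is true score rather than noise.

$$X = T + E, \qquad \text{reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}$$

On this reading, a reliability of 0.5 would mean that half the variation in observation grades reflects something other than the quality of the lesson being graded.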

We’ve known for some time that classroom observations lack the reliability needed for high-stakes judgements of teacher effectiveness. For example, the MET project – which spent millions of dollars to produce robust observation protocols – reported reliabilities for even these carefully constructed instruments ranging from 0.24 to 0.68. Rob Coe gives a great example of why such low reliability is a problem for teacher appraisal.

“One way to understand these values is to estimate the percentage of judgements that would agree if two raters watch the same lesson. Using Ofsted’s categories, if a lesson is judged ‘Outstanding’ by one observer, the probability that a second observer would give a different judgement is between 51% and 78%.”

Classroom observation: it’s harder than you think
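
To see how a correlation in that range produces disagreement on this scale, here is a toy simulation – my own illustration, not Coe’s calculation (he used Ofsted’s actual grade distribution, whereas this assumes four equal-sized quantile bands): two observers whose underlying ratings correlate at r each bucket a lesson into one of four grades, and we ask how often the second observer disagrees when the first awards the top grade.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def p_disagree_on_top(r, cutoffs=(0.25, 0.50, 0.75)):
    """Probability that a second observer gives a different grade when
    the first awards the top grade, for inter-rater correlation r."""
    # Two observers' latent ratings, bivariate normal with correlation r
    a = rng.standard_normal(n)
    b = r * a + np.sqrt(1 - r**2) * rng.standard_normal(n)
    # Bucket each observer's ratings into four quantile bands, graded 0-3
    grade_a = np.searchsorted(np.quantile(a, cutoffs), a)
    grade_b = np.searchsorted(np.quantile(b, cutoffs), b)
    top = grade_a == 3
    return (grade_b[top] != 3).mean()

for r in (0.24, 0.68):
    print(f"r = {r}: P(second observer disagrees with top grade) "
          f"≈ {p_disagree_on_top(r):.0%}")
```

The exact numbers depend on the grade distribution assumed – Coe’s 51% to 78% used Ofsted’s actual, more skewed distribution – but the qualitative point survives any reasonable choice: even the top of the MET reliability range leaves a substantial chance that a second observer would not confirm an ‘Outstanding’.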

Evidence of poor validity has also been around for a while. For example, Strong et al (2011) showed participants videos of teaching and asked them to rate whether the teacher was ‘effective’ or ‘ineffective’. These ratings were compared with the value-added scores for the students of those teachers. Where the observers had not received specific training on observations, even experienced teachers and head teachers matched ‘effective’ teaching to high value-added less than 50% of the time. Again, Rob Coe gives an example of what this would mean for classroom teachers being graded in the UK.

“Fewer than 1% of lessons judged inadequate and only 4% of lessons judged outstanding produce matching learning gains. Overall, 63% of judgements will not correspond with value-added.”

Classroom observation: it’s harder than you think

Why is observation such a problematic measure of effective teaching?

There are lots of possible reasons why observations may not be a valid or reliable measure of teaching. One problem is that learning is invisible – it takes place in a student’s head, and anything we can see in the classroom is merely a proxy for that learning. Another is that observers are likely to have strong ideas about what ‘good practice’ looks like – whether those practices actually lead to learning gains is another matter. Teaching also draws on natural ability – something humans have evolved to do – so even experienced teachers will find it really difficult to describe explicitly what it is they do.

However, there also appears to be another, quite simple, reason why using observations to make high-stakes judgements about teaching tends to lack validity: it seems the prior attainment of students in a class biases the ratings of an observer.

Steinberg, M. P., & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 0162373715616249.

Steinberg and Garrett used data from the MET study to explore the extent to which the class a teacher is timetabled to teach might influence observation measures of that teacher’s performance. They review a number of previous studies identifying other factors which appear to influence observation ratings. For example, observation scores tend to be lower for teachers whose students come from more disadvantaged backgrounds. They also note that teachers are not randomly assigned to teaching groups in schools – often inexperienced teachers are allocated to more disadvantaged students, while more experienced teachers tend to work with higher-achieving students.

To examine whether the prior attainment of students influenced observation ratings, they used the data from the MET study, which was carried out over two years across six districts in the US. One advantage of the MET data is that the project randomised the allocation of teachers to classes prior to the second year of the study. Steinberg and Garrett used this random allocation of 834 teachers to classes (Grades 4–9) to estimate the effect of prior achievement on measured teacher performance in that second year. Their conclusion suggests that teachers of lower-ability groups may be unfairly rated as relatively ineffective, even when very strict observation protocols involving considerable training are used:

“In this article, we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time.” (p. 20)
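
A minimal sketch of the kind of analysis this implies – my illustration, not the authors’ actual specification, and the file name, column names and simple fixed-effects setup are all assumptions: with teachers randomly assigned to classes, regress observation scores on class mean prior achievement while teacher fixed effects absorb stable teacher quality, so the coefficient on prior achievement picks up compositional bias in the ratings.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per teacher-class observation rating.
#   obs_score     - standardised observation rating
#   prior_achieve - class mean prior-year test score (standardised)
#   teacher_id    - the (randomly assigned) teacher
df = pd.read_csv("met_year2_ratings.csv")  # illustrative file name

# C(teacher_id) adds teacher fixed effects, absorbing aspects of teacher
# quality that are fixed over time; under random assignment, the remaining
# association between class prior achievement and the rating reflects
# how class composition shifts the score.
model = smf.ols("obs_score ~ prior_achieve + C(teacher_id)", data=df).fit()
print(model.params["prior_achieve"])  # > 0 would indicate composition bias
```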

Interestingly, the study found that the influence of prior attainment was greater for teachers of ELA (English Language Arts) than for maths teachers, and for subject-specialist teachers (common in UK secondary schools) than for generalist teachers (more typical of primary). Another interesting finding was that prior attainment appeared particularly to influence measures of teaching related to ‘classroom climate’ – suggesting that teachers of higher-performing students may be judged better at behaviour management than they actually are.

This study has significant implications for schools which use high-stakes (let alone graded) observations as the basis for appraisal. If a teacher’s measured effectiveness is, in part, determined by which groups they are allocated to teach, then withholding a pay rise or placing a teacher on capability on the basis of observations of teaching becomes potentially unmerited and inequitable.

How can teachers know their impact?

Observations of teaching can (and I’d say should) provide teachers with useful feedback they can use to develop their professional practice – but if observations lack validity, they won’t provide useful formative feedback (let alone support summative judgement). Once again, Rob Coe has some suggestions about how schools might approach observations:

There’s a great video of Rob Coe presenting some of the problems and possible ways forward at Teach First: What is the future of lesson observation in our schools? (Part 1) (Jan 2014)

  • Stop assuming that untrained observers can either make valid judgements or provide feedback that improves anything
  • Apply a critical research standard and the best existing knowledge to the process of developing, implementing and validating observation protocols
  • Ensure that good evidence supports any uses or interpretations we make for observations. It follows that appropriate caveats around the limits of such uses should be clearly stated and the use should not go beyond what is justified
  • Undertake robustly evaluated research to investigate how feedback from lesson observation might be used to improve teaching quality (EEF already has one such study underway).

Beyond observations, value-added data and student surveys might be used to provide teachers with more valid feedback on their teaching.

The MET study, for example, found that VA data correlates reasonably with a teacher’s long-term success. However, VA data tends to come too infrequently (and too late) in a school cycle to help identify where things might be going well or need to improve. It also doesn’t provide ‘fine-grain’ detail – it can tell you that students did well, but not what the teacher was doing well, or what they should do to improve. There are also some other issues with VA scores – for example, one study tested VA modelling techniques by seeing what effect teachers had on their students’ heights. In their analysis, they found that teachers appeared to ‘influence’ the height of their students almost as much as their English and maths scores.
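
The logic of that falsification test can be sketched in a few lines – again my illustration, with assumed file and column names and a deliberately simple VA model: fit the same value-added machinery to test scores and to a placebo outcome the teacher cannot plausibly affect; if the spread of estimated ‘teacher effects’ on height rivals the spread on scores, the model is picking up classroom composition rather than teaching.

```python
import pandas as pd
import statsmodels.formula.api as smf

def teacher_effect_spread(df, outcome, prior):
    """Fit a simple value-added model (current outcome on prior outcome
    plus teacher indicators) and return the standard deviation of the
    estimated teacher effects."""
    fit = smf.ols(f"{outcome} ~ {prior} + C(teacher_id)", data=df).fit()
    return fit.params.filter(like="C(teacher_id)").std()

# Hypothetical columns: score, prior_score, height, prior_height, teacher_id
df = pd.read_csv("students.csv")  # illustrative file name

print("spread of teacher effects on scores:",
      teacher_effect_spread(df, "score", "prior_score"))
print("spread of teacher effects on height:",
      teacher_effect_spread(df, "height", "prior_height"))
```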

MET predictors of success

Testing Teachers: What works best for teacher evaluation and appraisal

Student surveys are another measure used in the MET study. I’ve used these within a coaching context to help teachers identify areas they might work on – and used follow-up surveys to see whether students felt the changes had any impact. You can read a bit about this here: Investigating teaching using a student survey

Ultimately, though, the problem for teachers is that high-stakes judgements are being made using measures which lack the reliability and validity to form a reasonable basis for such judgements. I suspect part of the issue has been the impression that school leaders needed observation data to support, to Ofsted, their judgements of the quality of teaching in their schools. One of the proposals in the recent white paper (Educational Excellence Everywhere) was to remove the separate Ofsted judgements for Teaching and Learning from future inspections. On the basis of the evidence, this seems like a very good idea indeed!

Perhaps once out of Ofsted’s shadow, schools will be able to think about how to use observations much more constructively – as a coaching tool to help teachers improve their impact, rather than a sword of Damocles hanging over their heads three times a year.


7 Responses to Lesson observations: Would picking a top set get you a better grading?

  1. One simple thing and one unsimple thing.

    1. As I understand it, the concepts of reliability and validity are separate (and sometimes conflicting) variables used in assessing the value of a research technique or measurement. Reliability is a measure of whether the technique yields the same result for each observation of the same thing. Validity is a measure of whether the result is a genuine representation of what it has been designed to measure. Observation schedules can be made reliable by measuring trivia, such as the number of pupils who are addressed by name during a lesson. However, if those reliable and readily checked data are taken to indicate a teacher’s genuine interest in the individual characteristics of pupils, then there could be a problem with validity.

    2. This leads directly to the second problem, which is a tendency to under-theorise. Teachers in general don’t value theory, and to that extent tend to lose the value that it could have. In the case of teacher appraisal, assessment and reward, it matters very much what the observer, or the observer’s employers, think they are doing. The theory that justifies the practice of observing and deciding has to be explicit, detailed and justified. Are we sure that it usually is? With something as bewilderingly complex as the relationship between a teacher’s observed behaviour in a classroom and pupils’ improved test scores, I would claim that the theory has to be pretty sophisticated and pretty well confirmed by empirical studies on a wide range of behaviours and types of pupil outcome.

    Denying all this by correlating all observable inputs with readily measurable exam scores just isn’t enough, even when huge efforts are made to allow for complicating external variables. We are still just guessing (hypothesising) – while acting as if we really did know what we were doing. We are pretending to use science while only honouring part of the model.

    I don’t think the theoretical work behind what actually happens has been done yet. Nonetheless, we have no choice but to assume that the personal judgements of experienced and trained teacher evaluators are reliable, while assuming (and not yet knowing) whether or in what ways their judgements are valid.

    Sorry to be so glum.


    • 1) Yes – reliability is necessary but not sufficient for a measure to have validity.

      2) I have some sympathy with this – given that when you make a measure a target you tend to distort that measure. Ideally any measure of teacher effectiveness would be independent from, but strongly correlate to, outcome measures for students (e.g. value-added). If we understood teaching well enough, then the proxies we observe in the classroom might do this – and theory could have a role to play in that.

      I suppose the problem is, what do you take as your theory? Discredited Piagetian ideas about discovery learning? Unfalsifiable Vygotskian ideas about a zone of proximal development? Feminist theory? Bruner’s theory of instruction? Freire’s critical pedagogy? Dewey’s pedagogic creed? To me, teaching is what Thomas Kuhn might call a pre-science – in that there isn’t a well-evidenced paradigm accepted across the profession.

      However, there’s nothing as practical as a *good* theory – in my opinion – and if teaching possessed a good theory, then we could perhaps untangle the relationship between ‘great teaching’ and ‘great outcomes for students’. Perhaps there’s some hope in current attempts to apply theory from cognitive psychology to inform practice: like the Deans for Impact report (http://www.deansforimpact.org/pdfs/The_Science_of_Learning.pdf) for example.


  2. Many thanks for that link to the Deans for Impact paper – at first glance it reminds me of work that Peter Tomlinson was doing when I worked in the School of Education at Leeds for a while, back in the 90s. A lot of his thinking at that time went into his book “Understanding Mentoring”, which many hurriedly enrolled teacher trainers (school-based mentors) found very helpful in finding their own ways to observe and then discuss teaching skills with novices. A quick Google scan for related work led me to:
    https://www.researchgate.net/publication/5308637_Psychological_theory_and_pedagogical_effectiveness_The_learning_promotion_potential_framework
    which might give some reassurance along the lines you are suggesting.
    Whether the insights we adopt come from “real science” or not will, I would argue, always be disputed. Knocking down the obvious straw men and women still leaves us with the current year’s crop of green shoots. The nature of science itself floats on a shifting set of paradigms and practices.

    (As I write, I am conscious of saying things in shorthand that shout out for challenge – but which I would claim are no more than indications of how far we are from being able to agree on conclusions, when even the questions we ask and the ideas we bring with us can be so contentious.)


    • Thanks for the link – the full article is behind a paywall, but looks quite intriguing.

      “Knocking down the obvious straw men and women still leaves us with the current year’s crop of green shoots. The nature of science itself floats on a shifting set of paradigms and practices.”

      That seems overly pessimistic to me. There’s more than ‘straw men’ to knock down by applying what we’ve known for decades about memory and learning from psychology. Challenging misconceptions and pseudoscientific ideas about learning isn’t a waste of time – even if new nonsense does try to arise to take the place of the old. Whilst scientific ideas change over time – and certainly some areas of psychology have problems producing replicable findings – I would say that’s ‘business as usual’ in science.

      There are some pretty sound psychological insights which might inform classroom practice (e.g. the Deans for Impact summary). Perhaps the harder question is how to get teachers in touch with the evidence so far, and how to help them shape the research to come.


  3. Rob Czar says:

    Trying to define the least bad way to evaluate teachers misses the point. What are we trying to accomplish with teacher evaluation? To decide which ones to fire or pay better? Where is the evidence for that solving any of our educational problems? Tying teacher evaluation to student performance (i.e., value-added assessment) is an illusion. Would it be a good idea to tie doctor evaluation to patient improvement? Wouldn’t the evaluation depend on the characteristics of the patients and their health problems rather than the quality of the care provided? What doctors would choose to treat the terminally ill? Doctors should be judged on whether they followed a scientifically established model of treatment and otherwise followed best practices. So should teachers.
    It should also be noted, despite what is printed and said, that teachers have only a small overall effect on differences in student learning. Reasonably competent teachers get pretty much the same results. Different students tend to differ in what kinds of teachers they respond to; one style, personality or level of instructional acumen does not fit all.
    The goal of evaluation should be to provide feedback for teachers to improve. VA does not help with that goal at all. The way forward for teacher “accountability” is to determine what teachers should be doing and to judge how well they are doing it. Teachers should be evaluated against a model of high-quality teaching and instruction, which would include peer observation and evaluation of instructional materials such as lesson plans and course components.


    • Yes, VA data tends to come too infrequently (and too late) in a school cycle to help identify where things might be going well or need to improve. Ideally we need more ‘formative’ feedback for teachers. I’ve some sympathy with your view that ideally teachers would be evaluated against a model of high quality teaching, but such a model doesn’t really exist. The problem is we know too little about what great teaching looks like. There have been attempts to summarise what little we do know (e.g. http://www.suttontrust.com/wp-content/uploads/2014/10/What-Makes-Great-Teaching-REPORT.pdf ). Peer observation could be quite effective, so long as there is an element of challenge, though I’m not sure that artefacts like instructional materials or lesson plans would necessarily be good proxies for high-quality teaching.

