Your kids are doing better than their end-of-year test labels suggest. Both you and your kid need to know this.
The deck is stacked against students in demonstrating their success because nearly all standardized educational testing is built around a curve designed to knock kids down rather than lift them up.
It’s like a diabolical “sorting hat” from the Harry Potter series.
In North Carolina, the reading level labels assigned to students are predetermined:
- Only around 10% of students will receive a “Level 5”
- Only around 20% of students will receive a “Level 4”
- Only around 20% of students will receive a “Level 3”
- Around 50% of students will be labeled “not proficient”
Here’s proof from the Standard Setting Technical Report for NC General Reading Assessments. Notice that the consistency of “cut scores” across grade levels is based not on student scale-score performance, but on the percent of students who would “fall” into each category.
NC DPI claims the “cut scores” are based only on demonstrated content knowledge. Baloney. Data trends and their own process descriptions conflict with this claim.
I’m a teacher and mama bear who has experienced or heard about tears, shaken confidence and frustration among kids, their families, and their teachers, all because education bureaucrats and contractors care more about curves and quotas than kids.
I’m baffled that “student impact” is framed in terms of population distribution rather than the actual impact this data has on the academic and emotional well-being of children who are fighting an uphill battle to prove their worthiness in a system designed to tell only half of them they’re good enough.
Keep reading to learn more about how this is done, to see the attempts at plausible deniability that defend this practice debunked, and to find out how you can help change this system into one that prioritizes kids over curve quotas.
This is how they do it:
Whereas your kid’s teacher assesses your child using criterion referencing without predetermined performance quotas, meaning any student who answers more than 90% of an assessment correctly gets an A, 80-89% earns a B, and so on, the North Carolina Department of Public Instruction and many other public and private standardized testing agencies prioritize norm-referencing.
Norm-referencing ranks student performance in comparison to other students. For example, a student ranked in the 75th percentile is deemed to have performed as well as or better than about 75 out of every 100 students in the comparison group.
The percentile rank is not the same as a percentage correct, nor is it an indicator of it. The student ranked in the 75th percentile may have answered questions with 50% accuracy or with 90% accuracy. A percentile only communicates how a student performed compared to others, not whether that student performed well or poorly based on demonstrated knowledge.
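To make that distinction concrete, here’s a minimal sketch in Python using made-up percent-correct scores for a hypothetical class of 20 students (none of these numbers come from NC data):

```python
# Hypothetical percent-correct scores for a class of 20 students
# (made-up numbers for illustration only, not NC data).
scores = [72, 95, 88, 70, 91, 74, 85, 93, 76, 80,
          71, 90, 78, 83, 96, 73, 87, 79, 92, 75]

def percentile_rank(score, all_scores):
    """Percent of scores in the group that fall below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

student = 76  # this student answered 76% of questions correctly
print(f"Percent correct:  {student}%")
print(f"Percentile rank:  {percentile_rank(student, scores):.0f}th")
# Prints a 30th percentile rank even though the student got 76% correct.
# Every student here scored above 70% correct, yet by definition
# half the class must land in the bottom half of the percentile rankings.
```

In a norm-referenced system, it’s that 30th-percentile-style ranking, not the 76% the student actually demonstrated, that drives the label sent home.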
Because it is designed only to compare students to each other, norm-referencing produces a bell-curve distribution of students.
The image above demonstrates the faults of using norm referencing to communicate academic achievement:
- It assumes that students underperforming their peers must be below academic expectations.
- It assumes students in the middle of the pack are at a basic proficient academic level.
- It assumes students at the top of the curve are content masters (and nobody below them strongly understands the material).
These aren’t valid assumptions: even if classmates performed better than your child, that doesn’t mean your child failed to meet grade-level expectations.
Remember, norm-referencing compares your child to other children, not to baseline content knowledge prioritized in criterion referencing used by your child’s teacher throughout the year. In criterion referencing, there are no quotas for the number of students who can earn a particular grade.
Prioritizing percentile over content performance dilutes data on actual achievement. It insists that only 10% of students deserve a “Level 5” rating and sets up 50% of students to be labeled “not proficient” in reading.
This dynamic first hit my radar when I noticed 75% of NC students are told each year they’re not “career and college ready” since that label is reserved only for students scoring Levels 4 or 5.
The “renorming process” regularly moves the goalposts, realigning “cut scores” so they preserve the curve quotas. As a result of raising cut scores (moving the goalpost higher), student growth and success isn’t incentivized – it’s hidden.
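Here’s a rough sketch of how quota-style cut scores behave, using made-up scale scores rather than any actual NC data: when the cuts are re-set (“re-normed”) so that fixed percentages of students land in each level, the labels come out the same even when every student’s underlying performance improves.

```python
import numpy as np

# Made-up scale scores for two hypothetical cohorts; the second cohort
# genuinely performs better (every score shifted upward).
rng = np.random.default_rng(0)
cohort_before = rng.normal(450, 10, 1000)
cohort_after = cohort_before + 8  # real, across-the-board improvement

def quota_labels(scores):
    """Assign levels by percentile quotas: bottom 50% 'not proficient',
    next 20% Level 3, next 20% Level 4, top 10% Level 5."""
    cuts = np.percentile(scores, [50, 70, 90])  # cut scores move with the group
    levels = np.digitize(scores, cuts)
    names = ["Not proficient", "Level 3", "Level 4", "Level 5"]
    return {names[i]: int((levels == i).sum()) for i in range(4)}

print("Before:", quota_labels(cohort_before))
print("After: ", quota_labels(cohort_after))
# Both lines print roughly 500 / 200 / 200 / 100 students per label.
# The improvement is invisible because the cut scores were re-normed
# to preserve the same distribution of labels.
```

Under this approach, growth only shows up when a student leapfrogs peers; a cohort that improves together is still told that half of its members aren’t proficient.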
Hiding in plain sight
None of these claims are speculative or far-fetched. Throughout this Bursting the Bubble Sheet series, I’ve used primary sources from the NC Department of Public Instruction.
There’s an element of transparency in the availability of these publications, but seemingly an assumption that few people will read or understand them well enough to question the practices they describe. As a teacher and a mom, I’m doubly motivated.
The Standard Setting Technical Report for NC General Reading Assessments thoroughly describes the process by which “cut scores” are created. Here’s a crash course:
Educators engaged in the cut score process were told to focus on these criterion-based benchmarks when considering where they would “cut” each level…
…however, as the process continues, it pivots to an emphasis on “impact data,” which is the predicted proportion of students who would fall into each level (norm referencing)…
…part of the training emphasized that educators participating in the process should not “overreact” to the impact data…
…this is the impact data participants examined after round 2:
It must have been quite alarming for the educators on the committee to realize that the test questions they received to make “cut scores” would have produced a distribution in which around 50% of 8- to 14-year-old kids would be told they’re not proficient readers.
I didn’t find anything in the document specifying what was said to educators to encourage their complicity in this distribution. Since participants were required to sign non-disclosure agreements, it’s unlikely we could find out:
DRC stands for Data Recognition Corporation, the agency that contracts with NC DPI to work on state standardized tests and that produced the technical report referenced throughout this piece.
After round 3, participants evaluated the population distribution shown as “student impact” across grade levels, and were encouraged to ensure the population distribution was closely aligned (“articulated”) across grade levels:
This continues the pattern of shifting the process away from criterion referencing and toward norm referencing by prioritizing population distributions instead of offering a fair test that accurately portrays students’ skills and content knowledge.
The influence of DRC is particularly prominent near the end of the process:
Participants were shown the range of their individual cut score recommendations…
…and this range of cut scores offered by educator participants was leveraged to justify DRC and NC DPI’s “final evaluation and articulation process,” so long as the final cuts fell within the range of individual educators’ preferred cut scores.
DRC & NC DPI now had the facade of educator voice and criterion referencing to justify the same norm-referenced underrepresentation of student abilities as previous versions of reading tests:
Now what?
As a parent, I reassure my children that no standardized test score defines them. I encourage them to do their best, but I take the results with a grain of salt and prioritize their teachers’ feedback throughout the year as the best measurements of their skills and content knowledge.
As a teacher, I no longer have a state-tested course but I advise my colleagues to look at these tests in a new light and help others understand the “game.” For the most part, anything I’ve said in this Bursting the Bubble Sheet series substantiates what teachers’ instincts and observations have been telling them about standardized tests for years. When folks propose “merit-based” pay using test scores, it’s nonsense like what has been described here that causes educators to push back on measuring their merit using a deck stacked against teachers and students.
The folks holding power over the testing process are the NC Department of Public Instruction and NC State Board of Education. Reach out to them and encourage a transition away from norm-referencing and toward true criterion referenced tests. The State Superintendent, Lieutenant Governor and Treasurer are all elected officials and other members of the State Board are nominated by the Governor and confirmed by the General Assembly.
The folks holding power over public education purse strings are those in the NC General Assembly. Too many legislative leaders champ at the bit to use this misleading test data to justify a transition to private school vouchers and to vilify public school employees.
The folks who have the most to gain from continued distortion of student performance data are those trying to sell their solutions to address the manufactured problem. We need elected leaders to look out for the academic well-being of children, not the bottom line of contractors.
Further Reading
Catch up on previous posts in this series:
Sniping at students to sell vouchers – Bursting the Bubble Sheet part 3
Bursting the Bubble Sheet Part 2: Seeing the forest before the trees
Bursting the Bubble Sheet: NC DPI’s Disingenuous Claims on K-12 Testing Data Part 1
Earlier related posts:
Why does NC insist on telling 75% of its students and teachers they’re not good enough?
Common App Essay: Students succeed in spite of state education policy, not because of it
This work is dedicated to my children, students and colleagues. In this together.

I’d like to know if they look at the tests holistically to take into account that the length makes them a test of “stamina” just as much as reading “ability.” Data shows how many students do well on the check-ins, which are shorter, and do worse on the longer EOG. From what I’ve read in the posted technical manual, they don’t look at it that way.
This is a provocative article with some good research behind it. I disagree with your conclusions, but let me explain why. The bookmark procedure used by DPI is intended to be criterion-referenced (Table 1 from the report that you showed). All standard-setting exercises involve providing the raters (educators) with feedback about impact, so, from that perspective, there is nothing too fishy going on here with the DPI process. The criticisms you are raising, about how standard setting procedures can become just a thin veneer for norm-referenced cuts, would bring into question the legitimacy of any standard-setting exercise. But, based on the DPI documentation, I wouldn’t say they did anything especially out of line with other standard-setting exercises.
FWIW I think your point about merit-based pay has more legs to it. Value-added measures do not use proficiency cut-offs, but the raw test scores. Often the test scores are standardized (i.e., norm-referenced) within grade level before computing VAMs. I am not sure if this is the process that SAS uses, but it is very common, so I wouldn’t be surprised at all if your argument applied there.
I appreciate your blog and the work you are doing to keep parents informed.
Thank you for your comment. I understand that DPI practices are in line with typical standard-setting exercises. It is precisely those exercises, and their demonstrated emphasis on norm-referencing, that I find inappropriate in an educational setting, even though they are generally accepted as statistically sound. Education is specifically designed to help students defy the “norm” through growth, and to recognize achievement when it’s demonstrated in order to foster that growth.
Whereas criterion-referencing has no agenda on how many students can be labeled as “meeting the bar,” norm-referencing moves the bar via its commitment to preserving a bell curve distribution of student performance. When we prioritize curves over kids, we send harmful messages to students by telling them they’re not good enough if they’re not overachieving, and misleading messages to the public by mischaracterizing actual student performance and, by extension, school performance. It’s no wonder those who want to cash in on school privatization and curriculum contracts love this data.
If I were to apply this norm-referenced mindset to my classroom, I would begin every school year having already decided how many students will earn an A, B, C, D or F. Folks would justifiably be outraged if I used that approach, and I think we also need to push back on its application to standardized test results.

Because I’m more committed to kids than curves, I set achievement levels based ONLY on the skills and content knowledge I expect a student to demonstrate to earn a letter grade. When I finish a school year in which no student finished with an F, I do not see that as a statistical flaw or a lack of rigor, but as a sign of the partnerships I develop with students, families and coworkers to support student growth in meeting the established criteria. When I have more students finish with an A than with any other letter grade, I do not see it as a statistical flaw suggesting my classroom doesn’t offer a rigorous enough environment. Any student who has been in my classroom would laugh at that suggestion. It is again a testament to the work that happens in my classroom to set the bar high, scaffold supports, and hold myself and my students accountable to meeting those expectations.

This is what all highly effective educators do, yet the norm-referenced system is designed to limit which students, schools, and teachers can be labeled “good enough.”
With regard to EVAAS and student growth calculations, I’ve written about that as well and the data prioritized is actually based on student percentile, not raw score performance. I use racing Mickeys to explain that process here: https://educatedpolicy.com/2024/02/03/sniping-at-students-to-sell-vouchers-bursting-the-bubble-sheet-part-3/
There’s comfort in the consistency of bell curve norm-referenced outcomes, but educational settings are called on to defy the norm and grow student achievement. We should be celebrating when more students achieve, not treating it as statistically problematic and “re-norming” the test by moving the bar on cut scores so that students are misidentified as “not proficient” simply because they’re not reading above grade level, or not outperforming half of their peers in a sample.
In the spirit of solution-oriented advocacy, I would like to see a more robust discussion on which criterion-referenced metrics we should use (DIBELS, Lexile, etc.) to ensure student success is measured in a way that is fair to students, instead of the current coin-flip insistence that 50% of students must be told they’re not proficient because… bell curve.