
Harnessing the Power of Natural Language Processing to Mass Produce Test Items
Advances in eAssessment (Special Series)

By Martin C. Yu, Taylor Sullivan / October 2022


When developing items for one static test form (or even several static forms) to be used for research purposes or in practice, test developers often execute a thoughtful, iterative, comprehensive process that involves several rounds of careful review and revision by testing experts, subject matter experts, members of the sponsor organization, etc. In these circumstances, we often have the luxury of time and resources on our hands, and it is not very difficult to create a high-quality test that features an array of distinct topics and situations.

When developing content for high-stakes, high-volume testing programs, the circumstances are quite different. Developers must routinely amass and maintain a large bank of items to feed multiple forms that are administered for a finite amount of time before they are replaced with other forms. In addition, the issue of content overlap/redundancy in the item bank becomes more salient as hundreds if not thousands of items must be developed to measure the same set of competencies or knowledge domains. The sheer volume of unique content needed, coupled with development timelines that are typically quite aggressive, necessitates a more strategic development process that focuses on process/operations efficiency, standardization, and waste reduction. We will discuss several of these strategies and welcome audience members to contribute their strategies as well.

First, it is critical to build a rich repository of stimulus materials and references. For example, when writing situational judgment test items, having a robust collection of critical incidents submitted by professionals in the field helps to ensure test developers have fresh ideas and examples of challenging situations to draw from when writing items. At HumRRO, we try to take every opportunity to collect critical incidents and do so on a rolling basis—at the end of each workshop, during our clients’ conferences or professional gatherings, in conjunction with any activity in which subject matter experts will be present. We’ve streamlined the critical incident documentation training process and provided a straightforward, online submission portal with built-in instructions and prompts. The quantity-quality trade-off comes into play when gathering critical incidents, so we try our best to set subject matter experts up for success but recognize and plan for reality—not all submissions will be usable. Similarly, working with test sponsors and subject matter experts to build a high-quality reference library, replete with the types of resources and references that skilled professionals may turn to, can help test developers and subject matter experts to develop high-quality content in a timely manner.

To stockpile exam content, it is also necessary to carefully evaluate the production process and identify ways to enhance operational efficiency. We have found some success using a cascading development model where we rely on short-burst protocol “sprints” organized by user-friendly tracking tools and highly coordinated review processes. Items can be at different stages of development using this approach—the focus is on moving each one through as quickly as possible.

One thing to keep in mind: when a test development team is going “fast and furious,” the potential for errors rises sharply. Thus, weaving multifaceted quality control into the very fabric of the development process is critical. For example, train staff and subject matter experts in quality control procedures, send quality control checklists to subject matter experts so that they can check their own work before submitting it, and track subject matter experts’ adherence to instructions to inform future recruitment.

Whether gathering and storing source materials, communicating with subject matter experts, or synchronizing and coordinating activities during a complex or fast-paced development cycle, it is clear that technology can play a key role in supplementing human resources. As technologies develop, we are continuing to learn about the various ways that they can supplement the human element. Our work in automated item generation using natural language processing is one example of this process unfolding in practice.

Automated Item Generation

A commonly used method to speed up the process without compromising item quality is automated item generation (AIG). Past attempts at AIG have relied on the cognitive model or item template approach, where item writers develop templates and models for creating complete items. This essentially creates an item component bank as a precursor to the item bank: the model is fed into an algorithm that generates new test items from the item components and templates.
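To make the template approach concrete, here is a minimal Python sketch. The template text, the component names (`role`, `problem`), and the `generate_items` function are all hypothetical and chosen only for illustration; they are not taken from any operational item bank.

```python
from itertools import product

# Hypothetical item template and component bank (illustrative only).
TEMPLATE = "A {role} notices that {problem}. What should the {role} do first?"
COMPONENTS = {
    "role": ["nurse", "team lead"],
    "problem": [
        "a coworker skipped a safety check",
        "a deadline will be missed",
    ],
}


def generate_items(template, components):
    """Fill the template with every combination of the listed components."""
    keys = list(components)
    items = []
    for values in product(*(components[key] for key in keys)):
        items.append(template.format(**dict(zip(keys, values))))
    return items


items = generate_items(TEMPLATE, COMPONENTS)
for item in items:
    print(item)
```

Two roles and two problems yield exactly four items, which illustrates the constraint of this approach: the output can only be as varied as the templates and components that item writers supply up front.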

The cognitive model approach has been a proven method for generating knowledge-based items where there is a clear and well-defined correct answer. However, it is a rigid method limited to filling in templates and only offers as much variability as there are templates. There is still significant upfront effort requiring item writers to develop item templates and models for how different item components can be used to fill in these templates. For item types where variability and natural-sounding language are desired, such as those for situational judgment tests (SJT) where respondents consider behavioral responses to a descriptive situation, a more flexible approach is needed.

The other approach is to use natural language generation techniques, a broad class of text generation methods that construct new text by predicting the most plausible word to follow the word or words that precede it. For a long time, these methods were unsuitable for AIG as they had difficulty generating coherent text once the text became longer than a few words. Early attempts to harness machine learning techniques were still inadequate as they could not consistently generate realistic-looking text due to the lack of capabilities such as discerning context and applying subject-verb agreement.
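The idea of predicting the most plausible next word can be shown with a very small bigram model in Python. This is a deliberately simple stand-in for the early methods described above, not a modern technique; the corpus and function names are made up for this sketch.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus of personality-style statements (illustrative only).
CORPUS = [
    "i enjoy meeting new people",
    "i enjoy meeting new people at parties",
    "i enjoy quiet evenings",
    "i avoid large crowds",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_plausible_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_words):
    """Extend `start` one word at a time using only the previous word."""
    words = [start]
    while len(words) < max_words:
        nxt = most_plausible_next(counts, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


counts = train_bigrams(CORPUS)
print(generate(counts, "i", max_words=5))
```

Because each prediction looks only at the single preceding word, the model has no memory of how the sentence began, which is one reason methods of this kind struggled to stay coherent beyond a few words.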

While these issues are yet to be completely resolved, recently developed machine learning techniques for natural language understanding and generation (NLU/NLG) based on neural networks trained on large representations of the English language have made strides in capturing the nuances of genuine language and in generating realistic text. Their capabilities include accounting for the continually changing context of a text passage as it is generated and giving differential attention to preceding terms when generating text sequentially. The result of these methods is that a significant proportion of generated texts can be considered as coherent as human-written text with little to no additional editing required. Based on evaluating the effectiveness of these neural network techniques for text generation in general, as well as testing them specifically for generating personality, vocational interests, and SJT items, we determined that machine learning has advanced sufficiently for AIG to have potential in item development for operational testing programs.
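"Differential attention to preceding terms" can be sketched with scaled dot-product attention, one common formulation in neural language models. The article does not say which mechanism is used, so treat this as an assumption; the two-dimensional vectors below are invented for illustration and do not come from a trained model.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    return softmax(scores)


# Preceding words and made-up key vectors (not from a real model).
tokens = ["the", "managers", "who", "lead"]
keys = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [0.5, 0.3]]
# Hypothetical query vector for the word about to be generated.
query = [1.5, 1.0]

weights = attention_weights(query, keys)
for token, weight in zip(tokens, weights):
    print(f"{token:>9}: {weight:.2f}")
```

In this toy setup most of the weight falls on "managers", the kind of behavior that lets a model keep a later verb in agreement with a subject several words back.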

In our experience, natural language generation techniques hold great promise for achieving increased flexibility during item development. The advancements in text generation provided by NLU/NLG’s ability to comprehend and contextualize words and then predict the most plausible word or words that should follow them are compelling and exciting.

Although the application of NLU/NLG to AIG remains in its infancy, its use has surged in other areas, drawing on large representations of the English language. As a result, a significant proportion of generated texts can be considered comparably coherent to human-written text, with little to no additional editing required.

This approach to AIG is agnostic to the specific test content and can theoretically be applied to generate new test content for any testing program. We have taken advantage of NLU/NLG advances across several assessment formats including SJTs, personality tests, and interest inventories. We have found that this approach generates diverse, high-quality content across all three item types. Our experience indicates that NLU/NLG approaches to AIG can offer huge advantages over more traditional approaches to AIG:

  • Decreasing time-consuming, upfront analysis of item components or schema development to produce new items.
  • Reducing reliance on human item writers, with downstream benefits of reducing costs and reallocating personnel toward more complex and less automatable tasks, such as item review.
  • Lessening the impacts of biases or predispositions in item writing due to the different writing styles of individual writers.
  • Integrating content that human item writers may not have considered.
  • Increasing the number of items developed while preserving item quality, with larger item banks providing additional benefits for other aspects of test development such as form assembly and test security.

Making AIG Accessible

At HumRRO, we have successfully used natural language processing (NLP) to generate test items for a variety of assessment types including situational judgment tests, personality measures, and interest inventories (see Figure 1). Based on this expertise, we have developed an innovative interface for on-demand automated generation of test items using finely tuned NLU/NLG models. The NLU aspect of NLP allows computers to understand the nuance inherent in human speech, which feeds into NLG models that can write natural-sounding language—in this case, variable test items—that reflects such nuance.

Figure 1. AIG interface for model tuning and generation.


Our point-and-click interface is designed to be understood and used by item or test developers without prior experience in machine learning or natural language processing. It fine-tunes a natural language model (e.g., Radford et al.) on human-written test items so that the model will “learn” to replicate the structure of the specific items of interest, automatically generates new items from this model, and programmatically evaluates the quality of the generated items.

In addition to the overall benefits of using NLU/NLG for automated item generation, we also tailor our AIG models to the needs of specific test or item development programs. The customized options we provide enable the user to tune parameters available in the model development and item generation procedures. These include choosing between:

  • Speed or quality in the model development procedure. Underlying these options are parameters that affect the model learning rate. Although the actual speed varies depending on the content input into the program, a quality model could take a few hours longer for the program to develop than a speedy model. A higher quality model is certainly worth the time investment for operational use, but a speedier model may be more reasonable when experimenting with different items, for example.
  • Deterministic or diversified items in the item-generation procedure. Underlying these options are parameters that affect the amount of randomness included in the item generation procedure. More randomness means that a greater variety of item content could be obtained, with the tradeoff being a higher chance of error such as nonsense items. Text cleaning steps after items are generated can help mitigate some of these errors while maintaining diversification.
  • Construct-specific or construct-agnostic item generation. The AIG model can be tuned to understand features that characterize different constructs, such as items measuring different personality traits or items that reflect different proficiency levels. In the model development procedure, users can specify whether to train the model to learn construct labels, and if constructs are specified, users can further specify which constructs they would like to generate items for in the item generation procedure.
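The trade-off between deterministic and diversified generation can be illustrated with temperature sampling, a common way to control randomness when picking the next word. The article does not name the parameters involved, so temperature is an assumption here, and the word scores are invented for the example.

```python
import math
import random


def sample_next_word(logits, temperature, rng):
    """Pick the next word; temperature 0 means always take the top word."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    words = list(logits)
    scaled = [logits[word] / temperature for word in words]
    peak = max(scaled)
    weights = [math.exp(score - peak) for score in scaled]
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up scores for candidate words completing "Am very ...".
logits = {"sociable": 3.0, "talkative": 2.5, "quiet": 1.0, "buck": -1.0}
rng = random.Random(0)  # seeded so repeated runs give the same draws

deterministic = [sample_next_word(logits, 0.0, rng) for _ in range(5)]
diversified = [sample_next_word(logits, 1.5, rng) for _ in range(200)]

print("deterministic:", set(deterministic))
print("diversified:  ", sorted(set(diversified)))
```

Raising the temperature flattens the distribution, so less plausible words are drawn more often. That is the source of both the added variety and the higher chance of nonsense items that post-generation text cleaning is meant to catch.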

Our present interface serves as a baseline that can be further customized for client/user preferences, and these options can be presented in different formats for different complexity levels. For example, in adjusting speed or quality in model development, one user may prefer simplicity, in which case we provide a simple slider between speed and quality with the underlying parameters adjusted in the background. Another user may prefer greater control, so we also provide a version where the user is able to input values for specific model development parameters.
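One way such a slider could map onto underlying parameters is sketched below. The parameter names, value ranges, and the `config_from_slider` function are hypothetical; the article states only that the options affect the model learning rate, so the epoch count and the specific numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical model development parameters."""
    learning_rate: float
    epochs: int


def config_from_slider(position):
    """Map a slider position (0.0 = speed, 1.0 = quality) to parameters."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("slider position must be between 0 and 1")
    fast_lr, careful_lr = 1e-3, 1e-5  # assumed bounds, not from the article
    # Interpolate on a log scale, since learning rates span orders of magnitude.
    learning_rate = fast_lr * (careful_lr / fast_lr) ** position
    epochs = 1 + round(position * 9)  # assumed range of 1 to 10 passes
    return TrainingConfig(learning_rate=learning_rate, epochs=epochs)


# Simple mode: the user only moves the slider.
print(config_from_slider(0.0))
print(config_from_slider(1.0))

# Advanced mode: the user supplies the parameter values directly.
print(TrainingConfig(learning_rate=5e-5, epochs=4))
```

Both modes produce the same kind of configuration object, so the rest of a pipeline would not need to know which version of the interface the user chose.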

Example AIG for Personality Items

Figure 2 is an example output from an AIG model fine-tuned on personality test items and asked to generate five extraversion items. None of these items appeared in the original set of personality items used to fine-tune the AIG model, and each generated item is evaluated on fit with the construct using text similarity metrics. For example, “Am very sociable” is likely a usable extraversion item, whereas “Always look for a quick buck” will likely need to be excluded from consideration in an operational personality measure.  

Figure 2. Example output from AIG for personality test items.
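The construct-fit check can be approximated with a simple text similarity measure. The article does not specify which metrics are used, so the sketch below substitutes bag-of-words cosine similarity against a handful of reference items; the reference items and function names are hypothetical, and an operational system would more plausibly compare learned text embeddings.

```python
import math
from collections import Counter

# Hypothetical reference items for the extraversion construct.
EXTRAVERSION_REFERENCE = [
    "Am the life of the party",
    "Am sociable and outgoing",
    "Talk to many people at parties",
]


def bag_of_words(text):
    """Lowercased word counts for a short item."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def construct_fit(item, reference_items):
    """Highest similarity between the item and any reference item."""
    item_vector = bag_of_words(item)
    return max(
        cosine_similarity(item_vector, bag_of_words(reference))
        for reference in reference_items
    )


for generated in ["Am very sociable", "Always look for a quick buck"]:
    score = construct_fit(generated, EXTRAVERSION_REFERENCE)
    print(f"{generated!r}: {score:.2f}")
```

With these reference items, “Am very sociable” scores well above “Always look for a quick buck,” which shares no words with any reference item. A score like this can flag items for human review but does not replace it.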



The continuing need to mass-produce test items will push test developers to find new tools and technologies to improve process efficiencies. There is no better time than now to start integrating NLU/NLG approaches to AIG into the test development process. The technology has matured enough that it can be confidently implemented for use in operational testing programs. While the technology behind these AIG models requires computer or data science expertise to implement, it can be packaged into item or test development applications that are accessible to test developers in general.

Ultimately, the intention of AIG is not to completely replace human item writers, but rather to supplement them. Since AIG models are created and trained on human-written items, a continuing pool of these items will be needed to update the models so that they do not stagnate. Personnel can then be redistributed based on their skills and preferences, especially given individual inclinations for item writing versus item reviewing. The combination of AIG and this reallocation of effort should make it even more practical to produce, review, and implement test items.

About the Authors

Dr. Martin C. Yu is a senior scientist at the Human Resources Research Organization in Alexandria, Virginia. His research and consulting work aim to address and improve personnel assessment, selection, and development practices for organizations across a range of industries and sectors through the application of his expertise in advanced psychometric and data science techniques, including machine learning, multi-objective optimization, and natural language processing. Martin has also conducted research, published, and presented on a variety of practical topics relevant to organizational research, including differential prediction, judgment and decision-making, and research methods. Yu received his Ph.D. in industrial-organizational psychology from the University of Minnesota.

Dr. Taylor Sullivan is a senior staff I/O psychologist at Codility where she applies state-of-the-art I/O psychology practices and principles to support Codility’s product, customer success, marketing, and sales teams. Her expertise spans a variety of areas including talent assessment and selection, learning and development, leadership, and credentialing and licensure.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Copyright is held by the owner/author(s). Publication rights licensed to ACM. 1535-394X/2022/10-3533773 $15.00



