Professional Standards in Education

Siegfried Engelmann
2004

A graduate student who does a research study involving high-risk subjects who go through non-traditional untried methods for teaching beginning reading has to justify the proposal and follow established protocol for research on human subjects. The student is to provide a rationale that describes why he thinks the method will work. He also has to describe possible benefits for the subjects, a backup plan to be used if the subjects are experiencing stress or failure that may affect their later learning, and indicators that are to be used to determine if children are not progressing as anticipated. The student must make thorough disclosures to the parents of the subjects, explaining risks, and possible compensation, and indicating who will respond to questions or problems. Finally, the student must obtain parental permission.

Ironically, a state or a school district that adopts the same untested program the graduate student uses is not required to follow any of the protocol and rules of conduct that govern the graduate student's procedures. Yet, both are experimenting with children.

No, you say. The state adoption is not research. Yes it is.

If both the graduate student and the state experiment with children and both derive the same knowledge from the outcomes of the experiment, both are doing research. The state is simply doing it in a clandestine manner, calling it something other than research, and charging for it.

One Webster's definition of an experiment is, "any action or process undertaken to discover something not yet known or to demonstrate something known." According to this definition, the state adopts programs and teaching methods on the assumption that they will work well with children; however, at the time of the adoption, the state has no data that programs or practices will work well. If the state later receives data on the effectiveness of the approach, and if these data are generated by the students who went through the program or practice, the children were experimental subjects whose data generates knowledge of the approach's effectiveness. If the results are positive, confirming the decision-makers' expectations, the research was "a process undertaken to demonstrate something known." If the results are negative or null (which is nearly always the case) the research functioned as "a process undertaken to discover something not yet known." In either case, it was an experiment, even though it was labeled an advancement, a breakthrough, or a reform. Functionally, the name means little. If the "reform" was the basis for the field obtaining convincing documentation that the intervention was ineffective, the intervention served a research function. In fact, if it failed, the research function would be one of the few positive results of the intervention.

The problem of experimentation by states and districts is documented by an uninterrupted sequence of failed reforms, starting with the bussing of inner city blacks and the new math in the 1960's, continuing through the open-school concept, the down-with-science humanism, the back to basics resolution, the teaching of reading through literature and whole language, and back to phonics. Whole language is a good example of a failed reform. The central argument that supports the approach holds that language is a whole. Reading is part of language. So reading should be governed by other facts we know about language. We see that language is effectively learned through situations in which language is used, not explicitly taught. Therefore, reading should be learned by actually reading, not by being taught how to read. To many educators, this argument, although guilty of part-whole confusion, apparently seemed sound.

To support this argument, promoters of whole language presented what they assumed was evidence. The evidence was often not of an experimental nature but consisted of analytical "research," possibly showing something about the structure of language, the structure of words, and obliquely relevant information, such as the fact that New Zealand is the most literate country in the world. The argument:

New Zealand is the most literate country in the world.
New Zealand uses whole language.
Therefore, our country will become as literate as New Zealand if we use whole language.

Of course, the conclusion doesn't follow from the evidence. We don't know whether whole language caused this remarkable performance, which means that there is no data about how students in New Zealand perform with a program known to produce superior results in the US.

Following the lead of Honig in California, states and districts installed whole language wholesale. In California, schools were monitored to make sure they complied with the whole language mandates and discarded whatever reading programs were in use, without regard to the performance data of children. At least three districts in California that had exceptional results using Direct Instruction were forced to drop the DI and install whole language.

Within months after the implementation of whole language, even teachers who believed the hype and were trying to use whole language as it is specified observed that a large percentage of children were not learning to read. At the end of the first grade year, achievement test scores were significantly down.

In response to the performance of children, the states and districts issued caveats that had not been disclosed as part of the initial projections. The main assertion was that although children may be far behind at the end of kindergarten and first grade, they will catch up by the fourth grade. Exactly where the proponents of the reform got this information is not obvious. What is obvious is that many teachers told many parents, "Oh don't worry. He'll catch up by the fourth grade."

In the end, enough performance data was accumulated to discredit whole language completely. The data came in various forms, but mainly from achievement test performance of children in the early grades, and in Grade 4 (which revealed that the whole-language promise was a fabrication). Data also came from the rising number of referrals to special classes and from the number of retentions.

Perhaps as curious as the irresponsibility of state and district decision makers in installing and maintaining failed practices is what happens to them after the failed reform.

Following the disclosure of the reform's performance, decision makers do not say anything to the effect, "We screwed up. We are ashamed of ourselves for launching into a reform without sufficient data. We will never do it again." Instead, they present a new reform based on their new insights about how children learn or about the structure of reading, as if science has just uncovered relevant data about the brain, learning, or human development; however, the new reform may have no more basis in data than the one it superseded. (After whole language, Honig became a phonics advocate, but without great contrition over the harm whole language did.) Furthermore, the administrators who engineer egregious failure do not have diminished status, but may actually go to a new district at a higher salary.

Ethical Standards

Most states and districts abandoned whole language and placed serious restrictions on using "literature" as the primary vehicle for teaching reading in the early grades. However, the system has not been reformed so that it is consistent with our commitments both to science and children. Obviously, the research data could have been obtained far less painfully through smaller-scale studies conducted in accordance with the protocol the graduate student must follow.

This protocol is spelled out in detail in the American Psychological Association (APA) standards for "Ethical Principles of Psychologists and Code of Conduct." The Ethical Standards articulate proper precautions and requirements that are implied by the power that psychologists may use or misuse. Some standards are applicable to states and districts that conduct educational experiments that are billed as reforms. The Standards are not only easily adapted to the kind of experiments that states and districts perform; they seem to be more necessary here than they are with small-scale experiments if we consider the "greater good."

Possibly, the key standard in the APA code is 3.04, which expresses the goal of "avoiding harm."

3.04 Avoiding Harm: Psychologists take reasonable steps to avoid harming their clients/patients, students, supervisees, research participants, organizational clients, and others with whom they work, and to minimize harm where it is foreseeable and unavoidable. (2002)

In the case of reforms, the harm is foreseeable and possibly unavoidable. To conduct research that provides evidence that whole language is not effective, some human subjects are required, and their failure must be documented. But the harm would be minimized by limiting the number of subjects and by terminating the treatment as soon as it became apparent that children were progressing below projections (which would mean long before the fourth grade or even the end of the first grade). The experiment would produce limited harm. Following clear signals of failure, the failing children could be placed in compensatory programs known to be effective. The state or district does not need to subject the entire school population to an experimental treatment for seven years (which is the period of adopting instructional programs in many states). The state or district does not need documentation of students who begin in K and go through the sixth grade before terminating the experiment. A much smaller sample of students and shorter experimental treatment would be able to generate data that is adequate.

A related issue is whether, if causing harm is unavoidable, the "compensatory instruction" is adequate compensation even for the minimized harm. A strong argument could be made that injured subjects should receive additional compensation. In any case there should be some form of disclosure to the subjects (or their parents) before the experiment. Section 8 of the APA Ethics Code provides guidelines that address this issue and others.

Standard 8.01 is institutional approval. According to the standard, psychologists are to "conduct the research in accordance with the approved research protocol." Once states and districts acknowledge that their reforms function as research for some populations, the need for protocol logically follows.

Standard 8.02 presents guidelines for situations in which consent is required and outlines the features of the disclosures as well as the provisions for subjects to decline or withdraw from the research. The participants or their parents are to be informed of the purpose of the research and possible factors that may affect willingness to participate: potential risks, possible adverse effects, and possible positive benefits. Participants or parents also receive information about who will answer questions about details of the research or outcomes. Participants are to receive information about possible treatment alternatives and about compensation or costs.

Standard 8.05 describes conditions that do not require informed consent for research. One condition is "the study of normal educational practices, curricula or classroom management methods conducted in educational settings." This condition is prefaced by the qualification that the "research would not reasonably be assumed to create distress or harm." That qualification is not met by adoptions of significant reform measures or the adoption of new instructional material or practices that have no evidence of effectiveness. These are high-risk enterprises for at least the lower half of the school population.

Standard 8.07, Deception in Research, indicates that "psychologists do not deceive prospective participants about research that is reasonably expected to cause…severe emotional distress." For a small-scale educational experiment involving a discovery math program, the researcher may not know the extent to which distress is anticipated. For a larger population, however, the fact that there is no hard data on emotional distress presents a serious problem. In the absence of data, we can assume that adverse consequences are probable if the failure rate is high. Failure in learning to read or do math causes strong emotional reactions in most students. So if a district were to install a new math program that featured discovery, the district would have to disclose that (a) it doesn't know the extent to which students will fail but (b) some who fail will have strong emotional reactions to the failure.

Standard 8.09 refers to humane care and use of animals in research. One provision is that "psychologists trained in research methods and experienced in the care of laboratory animals supervise all procedures involving animals and are responsible for ensuring appropriate consideration of their comfort, health, and humane treatment." Also, "psychologists make reasonable efforts to minimize the discomfort…of animal subjects…Psychologists use a procedure subjecting animals to pain, stress…only when an alternative procedure is unavailable."

Obviously, children are different from laboratory animals. For research purposes, however, it would seem reasonable to assume that the subjects' pain and stress would be monitored by an experienced person who supervises all procedures involving the experimental children, and who is responsible for ensuring appropriate consideration of their treatment.

Standard 10.09 refers to therapy; however, it is relevant to the kind of experimentation that school districts and states conduct:

Psychologists terminate therapy when it becomes reasonably clear that the client…is not likely to benefit, or is being harmed by continued service.

Because districts and states do not have counterparts for any of these requirements, they have no form of advocacy for the children who serve as subjects of their experimentation. The state or district does not provide disclosure of possible risks. It does not carefully monitor the installations of the approaches. It does not have anybody assigned to observe in the field and play devil's advocate. Nor does it terminate obviously poor approaches when it becomes reasonably clear that the children are being harmed.

Textbook Adoptions

Textbook adoptions are prime exemplars of experimenting with children. Instructional products, particularly those for the primary grades, are extremely important because they account for a large part of the variance in student performance. A well-designed instructional program with demonstrated effectiveness may produce an effect size that is more than a standard deviation above that of a poorly designed instructional sequence (Adams & Engelmann, 1996).

Textbooks for beginning reading, math and all other subjects in the elementary grades are virtually never evaluated on the basis of effectiveness with students before they are adopted. Furthermore, there are no standards of effectiveness, and worse, no requirements for publishers to first try out the material with children, secure data on effectiveness, and disclose the results, which means that publishers create programs for use in schools without any data on how they work. This would be like mass producing an automobile without ever testing the design before launching a sales campaign. The first time any children see the program for a new approach is after it has been adopted. And the first time any performance data is generated by the program is usually a year or more after it has been in use in classrooms.

The publishers' attitude about creating instructional material may seem cavalier, but they are not the villains. Their procedures are a consequence of the way adoptions are configured. The publishers' products are referenced to the adoption criteria formulated by the district or state. The agency sets up criteria for instructional programs; the publishers attempt to design the material so that it meets the criteria. The agency evaluates the program not by trying it out on a small scale, but by assembling committees to inspect the material and judge from inspection how well it seems to meet the criteria. Historically, nowhere in this procedure is the question of research data on effectiveness addressed.

At least one state—California—had statutes that called for publishers to field-test material, but during the whole-language era the California State Board openly rejected these statutes. The 1976 statutes (section 60226) specified that the publishers are to "develop plans to improve the quality and reliability of instructional materials through learner verification." The 1988 California adoption criteria even included a requirement that publishers were to provide a description of the field-testing process and an explanation of how the materials are to be improved "on the basis of the field-testing data collected."

Although this sounds as if the adoption process was aligned with the legislation, the following sentence in the 1988 Language Arts Framework declared, "This additional information is not to be considered as part of the criteria for recommending materials to the state board…"

A 1989 suit against the Board argued that the state had to comply with the legislation on learner verification. The Board argued that it had a self-executing authority to do as it chose in adopting textbooks and that its actions were not subject to review by the legislature. The Board lost the lawsuit and was ordered to require publishers to provide learner verification, but that ruling made little difference because the laws were repealed within a year, and the adoption process has gone on ever since without concern for learner verification. So California, like other states and districts, declared that it is not interested in assuring that programs that reach the classroom have a high probability of working.

Another practical reason for the publishers' inattention to data on effectiveness is that usually there is not sufficient time to conduct the kind of field-test research needed to shape effective instructional material. The timeline that the state presents allows the publisher possibly only two years to create a K-5 sequence that meets the state's new requirements; however, it would probably take 2-3 years to try out the material for one grade level, revise it to avoid the specific performance problems identified in the first tryout, and try out the material again. To test all the levels, at least some "continuing students" would start in K and go through at least the third-grade level. To do a responsible job on a K-5 sequence, therefore, would require four or five years with the most efficient design that had various groups on each grade level starting two to four months apart (so the group receiving the final revised version in the first grade would start possibly ten months after the group that received the first tryout version of the program).

Another problem is that states and districts have primitive rules for adopting programs. Every seven years many of the statewide adoptions are referenced to a new framework with new criteria; therefore, the accepted standard has become for publishers to revise or redesign their products every seven years. Many districts will not adopt any program for beginning reading that has a copyright older than seven years. This practice assumes either that first graders change so much every seven years that they need new instructional approaches, or that the revised program will always produce better results than the earlier version. Given that the results of student performance are not used in any practical way by the state or district, the adoption practices for subjects in the primary grades are enigmatic.

Instructional materials, like the overall reforms, are experimental. If the only information that the publisher, state, and district have about the effectiveness of the product comes from the field after the product has been adopted, the adoption process is functionally research. The principle of avoiding harm applies here.

Just as there is a Food and Drug Administration, there should be an Educational Protection Administration that tests products to be used in schools with the same rigor that drugs, prosthetics, machines, and other health-care products are tested by the Food and Drug Administration.

Carnine (2000) points out that education is probably like the Food and Drug Administration from its formation in 1938 until the thalidomide disaster in 1962. During this period, the administration relied partly on opinion from clinical experts. The Kefauver Bill of 1962 required research evidence that documented that products were effective before they could be marketed.

Education relies not partly, but almost exclusively on expert opinion. The committee that "reviews" a particular instructional program forms opinions about how relatively effective the program will be. The committee's opinions are consistently wrong. Education needs a Kefauver Bill. The damage created by faulty instructional programs does not produce outcomes that parallel the physical deformities created by thalidomide, but a wealth of data shows that school failure is the factor most highly correlated with all of the teen problems: drugs, felonies, pregnancy, dropping out of school, emotional problems (NICHD, 1998).

If even some of this harm is corrected by using products and practices that lead to school success, there should be no reason for not testing and validating them in the same way drugs and related products are tested and validated. A bottle of aspirin has qualifications for its use with younger children. Some instructional programs that produce reasonable results with higher performers fail seriously with lower performers; however, there are no cautions for the use of these programs. The cost of an administration that provided such cautions should not be a barrier when the health of millions of children is at stake.

One of the most outrageous examples of states not avoiding harm occurred in California. In 1985 the Curriculum Commission of California had established criteria for evaluating programs submitted for teaching mathematics. A small publisher in California designed a program meeting all these criteria. It received a score of 96, 16 points higher than any other submission. The only field-testing that occurred before the program was published involved 18 students. Data on 7 of them were excluded from the final data analysis. Of the remaining 11 students, 61% made gains or had no change in score, while 39% experienced a loss. The average gain of the group was 19 percentile points. The average loss was 22 percentile points. This program captured 60 percent of sales in the state the first year. When questioned about these results, G. Thomas of the California Department of Education explained that the State Board of Education "has never asserted that any specific score correlates with the quality of potential success of a particular program."

Research in Education

Researchers are providing additions to our knowledge of effective teaching practices, but research does not name specific products and rarely identifies them as exemplars. The research shuns specifics and attempts to derive general principles and general schemes. The aversion to specifics seems to be based on the assumption that if teachers are provided with general information about the various types of phonemic-awareness activities, or successful phonics techniques, they will be able to transduce this general information into effective, specific applications. (See National Reading Panel, April 2000.) There is no data that teachers have the ability (or the necessary training) to do this.

The research's failure to identify specific effective programs is both ironic and circular. The only basis that the researchers have for knowing that phonological awareness and phonics are effective is through an analysis of superior programs. The consumer of educational material wants information about which programs work, just as the purchaser of an automobile wants information about which cars in a class are more "effective." Instead of providing the consumer with specific information, the researchers present general principles and often discussions that go far beyond the data. The logic they use is flawed. It is parallel to this:

All Dalmatians have spots.
Therefore, all dogs with spots are Dalmatians.

Here's the educational parallel:

All highly successful programs present explicit phonics.
Therefore, all programs that present explicit phonics are highly successful.

The logic is as flawed for explicit phonics as it is for Dalmatians. There is no data that teachers are able to create highly successful instruction from the kind of recommendations about phonological awareness or phonics provided by the National Reading Panel (2000). Furthermore, this excursion into general principles isn't needed. Just as the patient with serious heart problems requires a specific surgical procedure that has been demonstrated to be effective, the teacher needs specific products that have been demonstrated to be effective for teaching reading, math, and language. Just as the surgeon must be trained in specific procedures, teachers need training in how to use specific products so they are effective. If the researchers know which specific products work, the first responsibility of at least some of them should be to identify these products. Then the researchers have some kind of known base for developing what they believe to be the underlying principles that account for the success of these programs.

Conclusions

For a real educational reform to occur, the system must first recognize that it has done harm and continues to do harm. It must be institutionalized so that it follows standards for professional conduct that avoids unnecessary harm. Research should be conducted before the fact—before reform agendas are installed, before textbooks are adopted, before teachers enter the classroom or use a new procedure.

Next, educational agencies must identify all their practices that use teachers and children as the experimental subjects, from in-service formats to their textbook adoption practices and copyright requirements. Finally, the agencies need to apply a code of ethics to provide protocol for these experimental areas. States and districts need to find out information about effectiveness of proposed programs or practices through well-designed research that is governed by a strict code of conduct and strict guidelines of accountability. Concurrent with a sensible search for information about what works, adoption criteria and practices need to be scrapped. They have not worked in identifying programs that produce superior results. At best, they have generated indifferent practices in the publishing business and many products that range from mediocre to ineffective.

States need to work with major publishers to set up a new way to evaluate programs, a new way to adopt them, and a timeline that is appropriate for proper development of material that uses field tryouts and obtains data that the material works well with children.

Finally, researchers need to recognize that the basic-research model of deriving general "scientific principles" does not apply to education because education is an applied science. The procedures for reporting should parallel those of medicine or automobile design and should recognize that teachers need specific products and practices, not anything general or something they are supposed to invent. A start would be for researchers to evaluate how well teachers are actually able to apply general principles generated by research and use them to create highly successful applications.

The sum of the above would be a system that is both scientific and governed by the ethical code implied by the potential power of effective instruction.

Bibliography

Adams, G. L., & Engelmann, S. (1996). Research on Direct Instruction: 25 years beyond DISTAR. Seattle, WA: Educational Achievement Systems.

American Psychological Association. (2002). Ethical principles of psychologists and code of conduct. Washington, DC: Author.

Carnine, D. (2000). Why education experts resist effective practices (and what it would take to make education more like medicine). Washington, DC: Thomas B. Fordham Foundation.

Carnine, D., & Gersten, R. (2000). The nature and roles of research in improving achievement in mathematics. Journal for Research in Mathematics Education, 31(2).

National Institute of Child Health and Human Development. (1998). Overview of reading and literacy initiative. Washington, DC: National Institute of Child Health and Human Development, National Institutes of Health.

National Reading Panel. (2000). Report of the National Reading Panel: Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. Washington, DC: National Institute of Child Health and Human Development, National Institutes of Health.