VOLUME 78
Number 4
Journal of the
WASHINGTON
ACADEMY OF SCIENCES
December, 1988
ISSN 0043-0439
Issued Quarterly
at Washington, D.C.
CONTENTS
Editor’s Introduction:
DR. R. CLIFTON BAILEY
Articles:
DR. RICHARD ALLEN: The Washington Statistical Society and Its
History
.......................................................... 291
DR. MILES DAVIS and DR. DONALD WOLFE: Clustering of Senators
and Their Votes .......................................................... 296
DR. EDWARD J. WEGMAN: Computational Statistics: A New Agenda for
Statistical Theory and Practice .......................................................... 310
DR. KEITH R. EBERHARDT: Statistical Analysis of Experiments to Mea-
sure ..........................................................
DR. N. PHILLIP ROSS and DR. GILAH LANGNER: Environmental
Statistics
DR. R. CLIFTON BAILEY: Some Uses of a Modified Makeham Model to
Evaluate Medical Practice
Washington Academy of Sciences
Founded in 1898
EXECUTIVE COMMITTEE
President
James E. Spates
President-Elect
Robert H. McCracken
Secretary
Donald O. Buttermore
Treasurer
R. Clifton Bailey
Past President
Ronald W. Manderscheid
Vice President (Membership Affairs)
M. Sue Bogner
Vice President (Administrative Affairs)
Jo-Anne A. Jackson
Vice President (Junior Academy Affairs)
Marylin F. Krupsaw
Vice President (Affiliate Affairs)
John G. Honig
Academy Members of the Board of
Managers
William M. Benesch
Carl E. Pierchala
Lawson M. McKenzie
Marcia S. Smith
Jean K. Boek
Thomas N. Pyke
BOARD OF AFFILIATED
SOCIETY REPRESENTATIVES
All delegates of affiliated
Societies (see inside rear cover)
EDITORS
Irving Gray
Joseph Neale
Lisa J. Gray, Managing Editor
ACADEMY OFFICE
1101 N. Highland St.
Arlington, Va. 22201
Telephone: (703) 527-4800
The Journal
This journal, the official organ of the Washing-
ton Academy of Sciences, publishes historical
articles, critical reviews, and scholarly scientific
articles; proceedings of meetings of the Acad-
emy and its Executive Committee; and other
items of interest to Academy members. The
Journal appears four times a year (March, June,
September, and December)—the December is-
sue contains a directory of the Academy mem-
bership.
Subscription Rates
Members, fellows, and life members in good
standing receive the Journal without charge.
- Subscriptions are available on a calendar year
basis only, payable in advance. Payment must
be made in U.S. currency at the following rates:
U.S. and Canada ....... $19.00
Foreign ............... 22.00
Single Copy Price ..... 7.50
Back Issues
Obtainable from the Academy office (address
at bottom of opposite column): Proceedings:
Vols. 1-13 (1898-1910) Index: To Vols. 1-13
of the Proceedings and Vols. 1—40 of the Journal
Journal: Back issues, volumes, and sets (Vols.
1-75 1911-1985) and all current issues.
Claims for Missing Numbers
Claims will not be allowed if received more than
60 days after date of mailing plus time normally
required for postal delivery and claim. No
claims will be allowed because of failure to no-
tify the Academy of a change in address.
Change of Address
Address changes should be sent promptly to the
Academy office. Such notification should show
both old and new addresses and zip number.
Published quarterly in March, June, September, and December of each year by the
Washington Academy of Sciences, 1101 N. Highland St., Arlington, Va. 22201. Second
class postage paid at Arlington, Va. and additional mailing offices.
Editor’s Note
The Washington Statistical Society (WSS) was founded in 1926 and joined the
American Statistical Association (ASA) in 1935. The American Statistical Associa-
tion, founded in Boston in 1839 at No. 15 Cornhill, will celebrate its sesquicentennial
in 1989. The ASA was founded because of concerns about inadequate and inaccurate
national statistics. An excellent account of the early history of the American Statistical
Association can be found in its 1918 publication The History of Statistics, Their De-
velopment and Progress in Many Countries, collected and edited by John Koren and
published for the American Statistical Association by the Macmillan Company of New
York in 1918. John Koren recounts some of the early history of the ASA (1839-1914)
and its concern with issues of national statistics such as the Census and vital statistics,
interests clearly consistent with the provisions of their bylaws which state (Koren, pp.
4-5) that "the operation of this Association shall principally be directed to the statistics
of the United States; and they shall be as general and as extensive as possible and
not confined to any particular part of the country ...” I also highly recommend
Revolution in United States Government Statistics 1926-1976, for an overview of sta-
tistical issues as they evolved with the Federal Statistical System. It was prepared by
Joseph W. Duncan and William C. Shelton and issued October 1978 by the U.S.
Department of Commerce, Office of Federal Statistical Policy and Standards.
With the forthcoming sesquicentennial of the ASA, many special publications are
anticipated. My purpose in this brief note is to let the reader know of the roots of
the statistical profession in the U.S. The articles in this special issue demonstrate in
a very limited way the range and richness of the science of statistical application and
methodology.
This special issue of the WAS Journal is devoted to a collection of articles by
members of the Washington Statistical Society, an affiliate of the Washington Acad-
emy of Sciences. This special issue appears as the American Statistical Association
enters its 150th year.
The articles in this issue convey the broad scope of statistical activities. The lead
article by Richard Allen, President of the Washington Statistical Society, provides an
overview of WSS activities and its history.
The article by Miles Davis and Donald Wolfe applies the classical statistical meth-
odology of principal components for multivariate data to the voting records of the
U.S. Senate. Their insights and graphs will capture your imagination and encourage
you to ask of the data your own thought-provoking questions.
Edward Wegman’s article examines the implications of computational statistics on
statistical theory, education, and practice. This article provides an elegant description
of some recent advances in graphical techniques for exploring multivariate data of
many dimensions.
Keith Eberhart’s article on ignition propensities of cigarettes follows in a long
tradition of the National Bureau of Standards (NBS) recently renamed the National
Institute of Standards and Technology (NIST). Traditionally, Bureau statisticians work
closely with their colleagues to design and analyze statistical studies. The tradition at
NBS is long and includes such notables as W. J. Youden, John Mandel, Joseph Cam-
eron, Churchill Eisenhart, Harry Ku, Mary Natrella, Joan Rosenblatt and many others
who have had the good fortune to work there. The article by Joan Rosenblatt in the
Encyclopedia of Statistics (Volume 6, pp. 148-155, John Wiley and Sons, New York)
is a valuable reference to this tradition.
The article by N. Philip Ross and Gilah Langner examines the tradition of statistical
agencies in government and argues the case for credible centralized environmental
statistics for the U.S.
My own article addresses some methodological issues in medical statistics and dis-
cusses some uses of administrative data and its augmentation for special studies.
It has been my pleasure to bring these articles together for this special issue. I hope
the Washington Academy regularly will feature articles devoted to statistical science
and its applications.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 291-295, December 1988
The Washington Statistical Society
and its History
Richard Allen
President, Washington Statistical Society
ABSTRACT
The Washington Statistical Society (WSS), the Washington, D.C.-based chapter of the
American Statistical Association (ASA), serves about 1,500 statisticians. Each year, WSS
organizes and sponsors 35-40 technical sessions, a tradition that began when WSS was
organized in 1926. Accomplishments over the years are described.
About the Washington
Statistical Society
The Washington Statistical Society
(WSS) is the Washington, D.C. based
chapter of the American Statistical As-
sociation (ASA), the leading professional
society for statisticians in the world. The
WSS normally has about 1,500 members,
which is approximately 13 percent of the
members of all chapters of ASA.
While one of the strengths of the WSS
is the contributions of members from the
Federal statistical agencies, its member-
ship is broad based. Approximately 55
percent of members are from govern-
ment, 10 percent from academia, 30 per-
cent from private and nonprofit organi-
zations, and 5 percent self-employed or
retired. Members of the WSS form the
backbone of many ASA activities and
committees. WSS members can be found
in leadership roles for nearly every ASA
Correspondence should be sent to: Richard Allen,
President WSS, c/o USDA/NASS/DAP Room 4133
So. Bldg., Washington, D.C. 20250.
section, committee, and publication. For
example, two of the last three presidents
of the ASA have been WSS members.
One of the major goals of the WSS is
to organize and sponsor a wide variety
of technical sessions relating to statistics
each year. These sessions are advertised
through the monthly WSS Newsletter and
newsletters of organizations which might
be interested in a specific session. Usually
35-40 sessions are offered each year rang-
ing from very technical sessions on work
in progress to presentations of general in-
terest from renowned professionals. Many
of these sessions are cosponsored with
other professional organizations and uni-
versity departments. Sessions are planned
by a number of program committees, which
currently include agriculture and natural
resources, economics, methodology, pub-
lic health and biostatistics, physical sci-
ences and engineering, social and demo-
graphic statistics, computing technology,
and quality assurance. In addition to the
regular program sessions, the WSS has
organized and presented a few short
courses in each of the recent years. These
short courses have brought in leading
speakers and video tape presentations on
developing statistical methodologies. At-
tendance has been upwards of 200 partic-
ipants for some of these short courses.
One special activity of the WSS News-
letter is a monthly employment column
for both employers and prospective em-
ployees. The society participates in Wash-
ington, D.C. area science fairs by judging
all projects for noteworthy statistical con-
tent. Prizes and other recognition are pro-
vided. The society also sponsors an an-
nual award at all local universities for
outstanding graduate students with inter-
ests in statistics. This award consists of a
year’s membership in both WSS and the
ASA. The Society also cooperates with
six other government agencies and profes-
sional associations in the annual Julius
Shiskin memorial award for economic sta-
tistics.
Most members of WSS are also mem-
bers of the ASA. However, WSS does
offer associate memberships for individ-
uals who may be professionals in related
fields and who wish to keep abreast of
statistical activities in the Washington
area.
Below is a brief description of the
history and development of WSS.
Brief History of the Washington
Statistical Society*
The WSS was organized in 1926 with a
short constitution that proclaimed it was
to be a chapter of the American Statistical
Association. Officers established were
President, Vice-President, Secretary, and
two Representatives-at-Large. However,
a charter for ASA membership was not
applied for until November 1935.
The names of most early Society offi-
cers are lost in the mists of history. Wil-
*Acknowledgement: Credit goes to Al Mindlin for
the original history version.
liam M. Stewart, Director of the Census
Bureau, was President in 1928 and in 1930.
The 1930 Secretary-Treasurer was E. A.
Goldenweiser.
Although the 1926 constitution set the
annual dues at $1, during the Great
Depression no serious effort was made to
collect dues. Instead, the Secretary, who
by 1935 was the ASA District Repre-
sentative, stood at the meeting house door
and collected 15 cents from each attendee
(25 cents at luncheon meetings). Dinner
meetings cost $1.25.
Apparently, in the earliest years, WSS
program meetings were held about once
a month. Information on topics and
speakers for meetings before 1940 is scarce
but there is a reference that in a 1939
meeting titled “Irrelevant Remarks on
Trivial Matters in Modern Statistical
Theory," addressed by Leon Henderson,
Arne Fisher and Bassett Jones, the dis-
cussion got so ribald that “the doors were
closed.”
The informal financial management of
the Society’s early days was simply in-
adequate, and in 1941 WSS was insolvent
(a $10 deficit). The regular meeting door
fee of 15 cents could not be raised because
most of the meetings were held at George
Washington University, where a charge
greater than 15 cents would make the
University liable for taxes as conducting
meetings for pay. WSS discontinued door
collections and ordered collection of $1
from each member at the beginning of the
season. Heretofore, chapter membership
had been defined as every ASA member
living in the Washington area. With the
increasing number of statisticians migrat-
ing to Washington because of the “Peo-
ple’s War of Survival,” WSS trumpeted
to the world that the membership in 1942
was a thousand persons—approximately
one third of the total membership of
ASA. However, when the Board of Di-
rectors sought to collect its dues only
about 200 persons responded to the call.
WSS membership grew steadily after
World War II. The 1953 report of the Sec-
retary-Treasurer stated that more than
700 of the nearly 900 ASA members in
the Washington area were WSS members.
Membership increased particularly rap-
idly in the period 1958 to 1964 and con-
tinued inching up to a temporary peak of
1,500 in 1967. Membership declined in the
period 1969-1974 to about 1,350. How-
ever, the trend reversed in 1975 and a
probable membership peak of 1,773 was
reached in 1979.
The first Annual Dinner meeting was
held in 1947, with Isadore Lubin as
speaker. The event has been held each
year thereafter with the exception of five
years.
Through the 1950’s and early 1960’s at-
tention of the Society’s Board of Direc-
tors focused heavily on scheduling meet-
ings. It was mainly a program committee,
leaving little time to develop additional
activities. For the most part an evening
meeting was held each month through a seven
or eight month season. In 1963 a special
committee was established to develop over
the summer the entire year’s set of seven
meetings. Under the leadership of T. D.
Woolsey the first program committees
were established in 1964—Economics un-
der Hyman Kaitz, Public Health and Bio-
statistics under Oswald Sagen and Mon-
roe Sirken, and Social and Demographic
under David Kaplan. The program com-
mittee system flourished from the begin-
ning, rapidly taking over the main burden
of meetings and freeing the Board of Di-
rectors for further improvements.
As it became apparent that daytime
meetings were more popular than evening
meetings, more and more meetings were
held during the day. In the 1969-70 sea-
son there were no evening meetings at all,
except the Annual Dinner.
The 1960’s saw other dramatic expan-
sions of Society activities. In 1962 an
award of one-year membership in ASA
and WSS to an outstanding graduate stu-
dent in each of seven universities in the
Washington-Baltimore area was initiated.
Also in 1962 a Methodology Committee
was established under Jerome Cornfield
and Seymour Geisser. Under the subse-
quent leadership of Samuel Greenhouse,
and with an enabling revision of the con-
stitution, this evolved by 1963 into a semi-
autonomous Section of the Society with
its own elected officials and separate pro-
gramming. In 1966 a Baltimore Commit-
tee was established under Harold Gross-
man. WSS financed the growth of a
statistical program in the Baltimore area,
until by 1969 it felt sturdy enough to cut
the parental tie and strike out as a "Mary-
land” chapter of ASA. In 1967 an Em-
ployment Service was established under
Marie Eldridge. Also in 1967 an Out-
standing Paper Award Committee for
young statisticians was established under
Churchill Eisenhart. In 1968 a WSS Com-
mittee on ASA Fellows was established
under Margaret Martin.
The increase in Society activities during
the 60’s and rapidly rising costs spelled
the end of the $1.00 dues in 1967 even
though a number of economy measures
such as switching to bulk rate newsletter
mailings were taken. Dues were raised to
$2.00 a year in 1967 and remained at that
level for 10 years. Increases in mailing
costs necessitated raises to $3.00 in 1977,
$4.00 in 1979, and $6.00 in 1984, which has
been maintained since then.
A major membership survey was con-
ducted in 1969 which pointed out a gen-
eration gap. The majority of members of
WSS were age 40 or above and favored
nontechnical meetings. Nontechnical
meetings tended to have higher attend-
ance, but technical meetings drew more
of the younger members. The WSS pro-
gram attempted to provide a balance of
the two types of sessions.
The 70’s saw WSS attempt a number of
new programs. Some of them were en-
visioned as annual programs but did not
prove to have such permanence. In 1973
a W. J. Youden memorial scholarship was
established in conjunction with the Amer-
ican Society for Quality Control. The
program was to provide scholarship
assistance for a worthy student at the
Washington Technical Institute. Appar-
ently the scholarship was awarded only in
1973 and in 1976.
In 1974 WSS participated with a dozen
other private and government sponsors in
a three-day symposium on Statistics and
the Environment. This symposium fol-
lowed other similar presentations in Cal-
ifornia and was very successful, with over
200 participants. There was considerable
interest in establishing such a symposium
as an annual event but it was not to be.
In 1976, to commemorate the 50th an-
niversary of the Society, five past presi-
dents of the American Statistical Asso-
ciation addressed the WSS annual dinner
on the topic of "The Past as Prologue to
the Future.” This proved to be one of the
most interesting annual dinner meetings
of all time.
Another notable accomplishment of the
1975-76 Society activities was the estab-
lishment of a “local associate member-
ship” program. That feature allows a non-
ASA member such as a retired statistician
or an individual working in some other
field to receive the WSS newsletter and
keep abreast of statistical activities in the
Washington, D.C. area.
The 1977-78 program year was marked
by two very popular activities. A short
course on variance estimation was planned
which ran on six consecutive Fridays. Over
140 people applied for the 50 available
spots and a random process was used to
select a lucky 50 participants. A reception
was held for visiting statisticians from
Latin America. Since the room would hold
only 135 people, over 60 individuals had
to be turned away.
A major new award program was started
during the 1978-79 program year, the
Julius Shiskin Memorial Award for Eco-
nomic Statistics. WSS is joined in this pro-
gram by the Bureau of Labor Statistics,
the Bureau of the Census, the National
Association of Business Economists, the
Bureau of Economic Analysis, the Na-
tional Bureau of Economic Research, and
the Office of Management and Budget,
all of which Julius Shiskin was associated
with during his career. This award has
been presented annually with the selec-
tion made by a committee of represen-
tatives from each participating organiza-
tion.
The fundraising activities for the Shis-
kin Award led WSS to separate the two
positions of secretary and treasurer, an
arrangement which had been provided for
in the WSS Constitution. Other changes
over time in the structure of the elected
positions of the Society were the establish-
ment of the Vice President as the Presi-
dent Elect in 1966 and the establishment
of two-year terms for Representatives at
Large, Secretary, and Treasurer in 1984.
The emphasis on maintaining continuity
of WSS activities has been reflected in the
encouragement of two-person teams for
organizing various program sessions.
The thawing of relations with China led
to one of the most successful activities of
the Society. Five statisticians from the
National Statistical Society of China vis-
ited Washington, D.C. for nearly a full
week in conjunction with their visit to the
1981 ASA annual meetings. WSS served
as the host for the D.C. part of the trip.
Activities arranged included visits with
statistical organizations, a reception at the
National Academy of Sciences, an eve-
ning “pot luck”’ dinner, chaperoned sight-
seeing, and a most successful evening din-
ner in honor of the visitors.
The success of the China delegation visit
led to the sponsoring of other special pre-
sentations by the Society. In the spring of
1982, an all day program was scheduled
to commemorate the 200th anniversary of
the birth of S. D. Poisson. That fall, W.
E. Deming’s birthday was celebrated in
an evening social session. Morris Han-
sen’s 73rd birthday was recognized in 1983
with a similar evening social.
The 1984-86 time period led to several
important developments in the operations
of the Society. The interest in the Deming
and Hansen birthday parties was institu-
tionalized into an annual WSS holiday
party. This early evening event held every
December provides a second social activ-
ity to accompany the June annual dinner.
During the 1984-85 program year,
Terry Ireland urged WSS to try some WSS-
sponsored short courses. These were not
to compete with universities or private
vendors but were intended to provide in-
struction or knowledge on statistical top-
ics of broad interest. A short course might
involve the use of video tapes from an
ASA tutorial, with a knowledgeable in-
dividual available as a resource person.
These short courses were to be provided
on a fee basis, with fees covering rentals,
printing of materials, and travel expenses
of speakers.
The short courses were an instant suc-
cess. All had been well researched and
participants felt they got a fair return for
their investment. From the simple begin-
nings of low key one-day sessions in the
Martin Luther King Library the complex-
ity of courses and arrangements in-
creased. The 1987-88 program year was
marked by a very ambitious two-day sym-
posium on quality assurance in the gov-
ernment. This proved to be a landmark
session bringing together a unique com-
bination of quality professionals and
agency staff members. Very professional
materials were available and registration
included meals and receptions for speak-
ers. This session, planned for 150 partic-
ipants, drew over 200 people.
Another key development during the
1984—86 time period was the beginning of
WSS’ involvement in judging local science
fairs in the metropolitan area. Susan El-
lenberg, in her position as a Represent-
ative-at-Large to the WSS Board, initi-
ated contact with all local school systems
and found volunteers to do the judging.
Since no statistics category was offered,
WSS members judge all projects for sta-
tistical content. Prizes have included books
on statistics appropriate to high school age
individuals.
The success of the short course program
led to another change in WSS program
philosophy. WSS normally does not pay
any honoraria for presentations. Many of
the name speakers each year from outside
the Washington, D.C. area have spoken
for free when they are in the area for other
commitments. In the 1986-87 program
year, the concept of one invited lecture
was originated. Because some
extra proceeds were available from short
course operations, one outside speaker a
year would be brought in. The concept
was successful in 1987 and was repeated
in 1988. In that case, the science fair win-
ners were invited to display their projects
at the invited lecture.
In 1984, WSS started its own internal
award program. This Presidents’ Award
goes annually to a member or members
for outstanding contributions to the So-
ciety. The award carries no monetary
value but since it is presented at the An-
nual Dinner, it does provide good recog-
nition to the recipient.
Many WSS members are actively in-
volved in the celebration of the 150th an-
niversary of the American Statistical
Association. This celebration will last from
August 1988 through December of 1989.
WSS has planned a series of 10 monthly
special presentations on broad topics such
as statistics and the law and statistics and
the media.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 296-310, December 1988
Clustering of Senators and
Their Votes
Miles Davis and Donald Wolfe
Loyola College, Baltimore, Maryland
ABSTRACT
Votes by Senators of the United States of America on major bills are analyzed by
principal components and clustering techniques. Clusters of Senators and bills are shown
in graphs. The three classes of Senators elected in successive Senatorial elections are
studied to detect systematic differences in their voting patterns. A dimension of liberal-
conservative gradation and a pattern of change over time are found from the voting
records.
Introduction
Voting in the Senate of the United
States of America is an interesting and
important source of complex data. Al-
though it is highly structured by party af-
filiation, by ideology and by regional
interest, it is still sufficiently unpredicta-
ble to be interesting. We analyze voting
on key bills by all senators from 1969 to
1986, using principal component analysis.
We offer the data as a readily understood
and important body of illustrative but real
data for experimentation with statistical
methods.
Voting Data
We abstracted data from the complete
record of voting in the Senate appearing
regularly in the Congressional Quarterly
Correspondence should be sent to: Dr. Miles
Davis, 1214 Bolton Street, Baltimore, Maryland
21217-4111.
for key bills as selected by Michael Bar-
one in the Almanac of American Politics.
We identify the 144 key bills in Table 1.
They are selected to reflect the decisive
action on issues that often come to several
votes. They are thus more closely con-
tested than the entire record of votes and
represent the high drama of decision.
The actors in the drama are the 217
senators who served from 1969 to 1986.
Their classes, party affiliations, states,
names, and the bills on which they voted
appear in Table 2. The term "class" refers
to the three divisions of the Senate de-
termined by the year of election. Since
senators are elected for six year terms,
and one-third of the Senate is elected
every two years, each Senator holds a seat
in one of the three classes.
The Senators’ votes on the key bills ap-
pear in Table 3 as the data for our study.
To show the votes compactly, the rows of
Table 3 are identified by class and state,
and the columns are identified by bill
number. For example, the first row of Ta-
Table 1
Senate Bills
1069/54 Prevent funds for Safeguard.
2069/142 Reduce depletion allowance on oil & gas.
30@69/245 Reject Philadelphia Plan amendment.
4070/11 Drug Control, striking "no-knock" provision.
5070/19 Drug Control. Hughes amendment reducing further marijuana penalties.
6070/89 Voting Rights Act—Voting at 18.
770/103 Corundum disposal.
8070/112 Carswell Nomination to Supreme Court.
9@70/157 School bus in desegregated districts.
10@70/180 Bar U.S. military in Cambodia.
11@70/195 Consumer Products Warranty and Guaranty Act.
12€70/206 Limit agricultural subsidy to $20,000 to any producer in a single year.
13070/211 Bar price support for tobacco.
14070/240 Require weapons systems be tested.
15070/249 Increase funds for Bureau of Prisons.
16070/251 Create volunteer army.
17070/252 Prohibit military use of defoliants.
18070/256 Reduce authorizations for defense.
19070/328 Elect Senate committee chairmen.
20070/380 Reduce defense public information.
21071/23 Restore funds for supersonic transport.
22071/354 Presidential campaign fund from tax.
23071/355 Extend to single persons tax rates applicable to married persons.
24071/361 Limit US military in Europe to 250,000.
25071/417 Rehnquist Nomination to Supreme Court.
26072/54 Bar school busing on basis of race.
2772/97 Table National Voter Registration Act.
28072/110 Equal Rights Amendment.
29072/144 Unfair Billing Practices.
30072/215 Delete criminal penalties in 1972 Food, Drug and Consumer Product Act.
31072/219 One-year authorization for Corporation for Public Broadcasting.
32072/226 Delete National Legal Services Corporation from Economic Opportunity Amendments.
33072/262 Delete minimum wage for domestics.
34072/292 Give states exclusive authority to manage fish and wildlife, unless endangered species.
35072/296 Reduce annual crop subsidy maximum.
36072/334 Refer no-fault auto insurance to Judiciary Committee.
37€@72/339 Outlaw Saturday-night special guns.
38072/383 Table reduction of exemption on preference income for minimum tax.
39072/391 General Revenue Sharing.
40073/27 Senate committee meetings closed only by public vote at start.
4173/34 Highway Trust Fund for bus or rail.
42073/154 Prohibit U.S. combat in Cambodia or Laos.
43073/286 Permit Alaska Pipeline.
44073/372 Restrict limousines for officials.
45073/400 Reduce military headquarters overseas.
46073/551 New northeast rail labor agreements.
47@73/571 Halt import of Rhodesian chrome.
48074/66 Table required registration of handguns.
49074/69 New standards for death penalty.
50©74/138 Federal Election Campaign Financing.
51074/156 No-Fault Auto Insurance.
52074/187 Prohibit required student testing.
53074/194 Table prohibition of school busing.
54074/212 Freedom of Information Amendment.
55074/225 Demobilize 76,000 U.S. oversea military.
56074/286 Repeal no-knock provision of Drug Abuse Control Act of 1970.
Table 1.— Continued
Senate Bills
570@74/395 Consumer Protection Agency.
58074/479 Prohibit food stamps for strikers.
590@74/496 Foreign Aid Authorization.
6075/55 Amend Cloture Rule.
61075/67 Rescind F-111 fighter-bombers.
62@75/130 Table barring Medicaid abortions.
63075/190 Resume military aid to Turkey.
64075/382 "Redlining" Disclosure.
65075/431 Emergency Natural Gas.
66075/516 Common-Site Picketing.
67€76/27 Prohibit arms sales to Chile.
6876/65 Federal Employees' Political Activities.
69076/93 Prohibit federal supersonic air funds.
70€@76/103 Recommit no-fault auto insurance.
71076/141 Reduce national defense budget.
72076/180 Bar production of B-1 bomber.
73076/333 Utilities pay for Clinch River reactor.
74076/471 Retain nitrogen oxide standards.
75€@76/521 Delete House ban on abortion funding.
76076/554 Water Pollution Control Amendment.
77@77/11 Presidential Pardon for Draft Resisters.
78@77/41 Warnke SALT Nomination.
79@77/42 Warnke as Director of Arms Control and Disarmament Agency.
8077/59 Halt Rhodesian Chrome Imports.
81077/164 Reduce target price for wheat.
82077/232 User fees for water resources.
83077/263 Prohibit federal funds for abortion.
84077/275 Limit Clinch River reactor spending.
85077/280 Prohibit production of neutron bomb.
86077/320 Federal tax fund for Senate elections.
87€77/523 Natural Gas Pricing.
88078/66 Ratify Panama Canal Treaty.
89078/161 Disapprove mideast fighter plane sales.
90078/166 Amend National Labor Relations Act.
91078/435 Extend ERA ratification deadline.
92078/447 Revenue Act of 1978.
93078/480 Medicare-Medicaid Cost Containment.
9479/70 Establish Department of Education.
95079/169 Defer new nuclear power plants.
96079/206 Food Stamps.
97@79/438 Windfall Profits Tax.
98079/445 Indexing individual income tax.
99@79/490 Chrysler Loan Guarantees.
100080/60 Spending Limits.
101@80/101 Fiscal 1981 Budget Targets.
102€80/197 Draft Registration Funding.
103@80/272 Aid to Nicaragua.
104080/315 Exempt small business from OSHA.
105@80/345 Invoke cloture on Alaska Lands.
106080/441 Kill federal funds for abortion.
107080/466 Cut MX missile funds.
108080/496 Fair Housing Act Amendments.
109081/140 Require paying for food stamps.
110081/182 Budget reconciliation.
111€81/239 Cut individual income tax rates.
112081/275 Helms (R-NC) Foreign Aid Amendment.
113081/335 Disapprove AWACS sale.
Table 1.— Continued
Senate Bills
114081/368 Legal Services Corporation.
115@82/2 Bar court-ordered school busing.
116@82/118 Chemical weapons.
117@82/288 Balanced Budget/Tax Limitation Amendment to the Constitution.
118@82/420 Bar MX missile procurement.
119@82/422 Eliminate Clinch River reactor funds.
120082/463 1982 Transportation Assistance Act.
121083/35 Social Security Disability.
122083/65 Postpone date for resident status.
123@83/101 Immigration Reform and Control Act.
124083/169 Human Life Federalism Amendment.
125083/170 Tax Rate Equity Act.
126083/293 Martin Luther King, Jr. Holiday.
127€83/355 Table tuition tax credits.
128084/34 School Prayer Amendment.
129084/51 Table combat troops in El Salvador.
130084/132 Table ban on new MX missiles.
131084/252 Prohibit activities against Nicaragua.
132084/266 Table freeze on nuclear weapons.
133@85/142 Firearm Owners’ Protection.
134@85/191 Immigration Reform and Control Act.
135@85/300 Textile Import Quotas.
136@85/310 Discount sales of tobacco.
137085/371 Sequester funds for national defense.
138086/51 Aid to Nicaraguan "contras".
139@86/176 Table limit to "star wars".
140@86/209 Reduce limits on PAC’s.
141©86/266 Rehnquist Nomination as Supreme Court Chief Justice.
142086/296 Tax Overhaul.
143086/300 Omnibus Drug Bill.
144086/311 South Africa Sanctions.
ble 3 shows votes by the senators occu-
pying the seat reserved for Arizona in
Class I. Reference to Table 2 shows that
this seat was held by Senator Fannin (R,
AZ) while voting on bills 1 through 76
and by Senator DeConcini (D, AZ) for
bills 77 through 144.
Votes are recorded as letters or sym-
bols:
y = yea vote            n = nay vote
# = paired for          x = paired against
f = CQ poll for         « = CQ poll against
+ = announced for       - = announced against
p = voting present
c = not voting to avoid conflict of in-
terest
? = unknown
space means not a senator for that vote.
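For readers who want to process rows of Table 3 by machine, the legend translates directly into a small lookup table. The sketch below is illustrative only; it is not code from the paper, it assumes Python, and it covers just those symbols whose glyphs are unambiguous in this reproduction.

```python
# Hypothetical lookup table for the Table 3 vote symbols (not the authors' code).
# Symbols whose printed glyphs are uncertain in this reproduction are omitted.
VOTE_SYMBOLS = {
    "y": "yea vote",
    "n": "nay vote",
    "#": "paired for",
    "x": "paired against",
    "f": "CQ poll for",
    "+": "announced for",
    "-": "announced against",
    "?": "unknown",
    " ": "not a senator for that vote",
}

def describe(symbol: str) -> str:
    """Return the legend meaning of one character from a Table 3 row."""
    return VOTE_SYMBOLS.get(symbol, "symbol not covered by this sketch")

if __name__ == "__main__":
    # A made-up fragment of a vote string, purely for illustration.
    for ch in "yn#x? ":
        print(repr(ch), "->", describe(ch))
```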
Bills are identified in Table 1. Our se-
quential bill numbers are followed by the
years of the Congressional Quarterly
(CQ) volumes and the numbers assigned
by CQ to the bills. Bills marked @ were
passed, but those marked O were rejected.
Short descriptions of the bills follow the
numbers. Much more thorough descrip-
tions are in CQ.
Analysis of the Data
We began to analyze the data by con-
centrating on the process of voting. We
formed a matrix of 217 rows and 144 col-
umns containing a +1 wherever a yea
vote (y) was cast and a -1 wherever a
nay vote (n) was cast. Elsewhere, the ma-
Table 2
Senators listed by class, party, state and name, with
the numbers of the earliest and latest bills in their
terms of office.
1 RAZ Fannin 76
1 DAZ DeConcini 77 144
1 RCA Murphy ies)
1 DCA Tunney 2k), 76
1-RCA Hayakawa TT 20
1 RCA Wilson 121. 144
1p CT Dodd, T. oy 20
LR Weicker 21 144
1 RDE Williams, J. fi 20
1 RDE Roth 21 144
1,DFL Holland Keg 20
LD BE Chiles 21 144
1 RHI Fong Ly 6
1 DHI Matsunaga 77 144
1 DIN Hartke TKS)
1 RIN Lugar 77 144
1 DME Muskie L 10%
1 DME Mitchell 102 144
1 DMD Tydings ey 20
1 RMD Beall ZA 6
1 DMD Sarbanes 77 144
1 DMA Kennedy 1 144
1 DMI Hart dG
1 DMI Riegle 77 144
1 DMN McCarthy 20
1 DMN Humphrey, H. Pasi te
1 DMN Humphrey, M. S8r 195
1 RMN Durenberger 94 144
1 DMS Stennis 1 144
1 DMO Symington P76
1 RMO Danforth 77 144
1 DMT Mansfield ire WS)
1 DMT Melcher 77 144
PR INE Hruska LOG 7 6
1 DNE Zorinsky 77 144
1 DNV Cannon 1.120
1 RNV Hecht 121 144
1 DNJ Williams, H. MWe ibis)
1 RNJ Brady 116 120
1 DENy Lautenberg 121 144
1 DNM Montoya Wie, 76
1 RNM Schmitt TE OY
1 DNM Bingaman 121 144
1 RNY Goodell 20
1 RNY Buckley PA BETS)
t DNY Moynihan 77 +144
1 DND Burdick 1 144
1 ROH Young, S. Lf 920
1 ROH Taft, Jr. PAINS | TAG)
1 DOH Metzenbaum(II) 77 144
1 RPA Scott, H. © LW IK
1 RPA Heinz 77 144
1D RI Pastore 1s 6
1 RRI Chafee 77 144
1 DTN Gore, A., Sr. tna 20
1 REN Brock PAV TKS)
Sasser
Yarborough
Bentsen
Moss
Hatch
Prouty
Stafford
Byrd, H., Jr.
Trible
Jackson
Evans
Byrd, R.
Proxmire
McGee
Wallop
Sparkman
Heflin
Stevens
McClellan
Hodges
Pryor
Allott
Haskell
Armstrong
Boggs
Biden
Russell
Gambrell
Nunn
Jordan
McClure
Percy
Simon
Miller
Clark
Jepsen
Harkin
Pearson
Kassebaum
Cooper
Huddleston
McConnell
Ellender
Edwards, E.
Johnston
Smith, M. C.
Hathaway
Cohen
Brooke
Tsongas
Kerry
Griffin
Levin
Mondale
Anderson, W.
Boschwitz
Eastland
Cochran
Metcalf
Hatfield, P.
Baucus
Curtis
Table 2.— Continued
EOomtagS
Exon
McIntyre
Humphrey, G.
Case
Bradley
Anderson, C.
Domenici
Jordan
Helms
Harris
Bartlett, D.
Boren
Hatfield, M.
Pell
Thurmond
Mundt
Abourezk
Pressler
Baker
Gore. A; 55.
Tower
Gramm
Spong
Scott, W.
Warner
Randolph
Rockefeller
Hansen
Simpson
Allen, J.
Allen, M.
Stewart
Denton
Gravel
Murkowski
Goldwater
Fulbright
Bumpers
Cranston
Dominick
Hart
Ribicoff
Dodd, C.
Gurney
Stone
Hawkins
Talmadge
Mattingly
Inouye
Church
Symms
Dirksen
Smith, R. T.
Stevenson
Dixon
Bayh
Quayle
Hughes
Culver
Grassley
3 RKS Dole 1 144
a DRY. Cook 1) 7-59
3¢) KY. Ford 60 144
3 sD LA Long 1 144
3 RMD Mathias 1 144
3 DMO Eagleton 1 144
300 D NN: Bible Lex59
3 RNV Laxalt 60 144
3 RNH Cotton(I) E50
3 RNH Cotton(II) 64 64
eet DM BI Durkin 65 108
3 RNH Rudman 109 144
3 RNY Javits 1 108
a7 RNY D’ Amato 109 144
3 DNC Ervin 1°) 59
a INC Morgan 60 108
3.4 Rk NC East 109 138
SS ING Broyhill 139 144
3 RND Young, M. 1 108
3 RND Andrews 109 144
3a OH Saxbe Len 47
3.9 DOH Metzenbaum(I) 48 59
3. DOr Glenn 60 144
3 ROK Bellmon 1 108
3 ROK Nickles 109 144
Ewa el GIF Packwood 1 144
3 RPA Schweiker 1 108
3 RePA Specter 109 144
3) SC Hollings 1 144
3) DSD McGovern 1 108
Samed its) B) Abdnor 109 144
SPROUT Bennett PUP 'S9
SR Ae WG Garn 60 144
3. RVE Aiken Le. 59
3-4 -DeViE Leahy 60 144
3 DWA Magnuson 1 108
3 RWA Gorton 109 144
Sh WI Nelson 1 108
3 RWI Kasten 109 144
trix contains zeroes. This matrix was con-
densed into Table 3 by placing the votes
of all Senators into rows corresponding to
the Senate seats that they hold, identified
by class and state. Rows in the data ma-
trix, unlike Table 3, correspond to sena-
tors. The columns of the data matrix
correspond to bills, as in Table 3. Gra-
dations of support short of voting are ig-
nored by this choice, and an alternative
analysis might well take them into ac-
count.
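A minimal sketch of this encoding step is given below, assuming the votes are held as strings of the Table 3 symbols; the variable names, the toy data, and the use of numpy are illustrative choices, not the authors' program.

```python
import numpy as np

# Toy vote strings in the Table 3 symbol code; real rows have 144 columns.
# The senator labels are placeholders, not entries from Table 2.
votes = {
    "Senator A": "yyn?x",
    "Senator B": "nny#+",
    "Senator C": "ynyn ",
}

def encode(vote_strings):
    """Build the senators-by-bills matrix described above:
    +1 for a recorded yea (y), -1 for a recorded nay (n), 0 otherwise.
    Pairs, announcements, and absences are deliberately left at zero,
    matching the choice described in the text."""
    names = list(vote_strings)
    n_bills = max(len(s) for s in vote_strings.values())
    matrix = np.zeros((len(names), n_bills))
    for i, name in enumerate(names):
        for j, symbol in enumerate(vote_strings[name]):
            if symbol == "y":
                matrix[i, j] = 1.0
            elif symbol == "n":
                matrix[i, j] = -1.0
    return names, matrix

names, X = encode(votes)
print(names)
print(X)
```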
A spectral analysis of the matrix was
done, following the ideas of Good. The
TABLE 3.1
Votes in the Senate. Senate seats arranged by class and state identify the
rows, and senate bill numbers identify the columns.
111111111122222222222333333333444444444455555555556666666666777
123456789012345678901234567890123456789012345678901234567890123456789012
nnnnnyyy ?n?n?xnyn- ?nynnnyyynnyyyyynyny? ?nnyynynyynnyn?nnnyn?nnynnnnnnynn
nn?n-yyyxnnx+n?#--xnnynnn-nyy ?nnnnnnynytyyn. nnynnyynyyyyynyyyynyyyyty? ?x
nnynn-+n+—-++—-ntnn-? ?nnynynyyn?nnyn?nnyyyy#ynxnyynnyyyy ?yynnnnynynyynynyn
nynnnnyynnnyyynynynynnnnyyyynyyyynynyyyyynnnyyyyynyynynyyynnynnnn?nnnyn?
nnnynynynnnnnxnnnnnnnyynyy-#y?? ?anyyynnyyynn?nnyyynynynnyynnyyynynyynynn
nn?nnn#nynt+nn-ny ?-?nynynyyyynynyynnyyyynyty ?nyy?? ?cynynyynynyyynnn.nnec.n
VYYY? ?nyn?yy?? ?yy ?+?ynyytn.nyynnnnnynyyynnyynynyynynnn?yyyn?y?nn?yyyy?y?y
yyyyynnn#yyyn?ynyy ?yny? ?n-?y??? ?nn?ny ?yyy *nnnnynnyynyyyyyNyyVVVYVyvyy ?ynny
yyyntnnn?yytntyn-#? ?ynnnyyyynyyyynynyyyyynyynynyyyyynynnynyynynynnynnynn
YYYY ?n?nyyyyn? ?nny#ynynynnnyynnnnnyny . fyyynnfnynnyynyyyy+nyyyynfyyyyynyy
yyyyyn?ntyyy#+yyyynynyyynnny fnnnnn#nynyyyynnfnynnyynyyyyynyyfyyfyyfynny?
yy ?yy--nty++-yynty-ynyynnx-y ?nnnann#nyxyynynnynynnyynyyyyynytyy .tyyyyxnyy
nnnynynynnynnnnnnnnnynnnyyynnynyynnynynxx? ?yn?nyyxnynnnnnyn ?nnynnnnannynn
y# ?nnxnnyyy-—? ?yny-nyyyyynnyynnnnnnnynnnyynny?ynyty ?yy#yynnynyyynyyyxynn
yni#yynnn#yyyx?y#yyyynynyxnny??? ?nnnnnnnxnynnyny? ?y#yyyynynny ?yyyyyy*x=nny
nnnnnyt+tynnnnannnannnnnynnnyyyynyyyyynyyyynnnyy.ynyynnynnnnny.nnnynnnn-nynn
ynnnnnnnny ?y#x?x?x ?nyynnyynyynnny fnynnynnt+ynnnnyyyynnnnnynnyn?yyynyynynn
yyyyynnnyytyyy ?yyyyynyyynxnyyn.nn. ynyyy+#ynnynynnyynyyyyynyyyy-yyyyyynyy
yyynynyn#yyyy#yyyy##ynnnyyynnyyyy-nyyyynynny fynyynnyn ?nnn? ?n ?nynn ?nnyynx
ynnynnynyyynnnyy-ynynyyyynnyynnnyynnyynnnynnfnyynyynyyyyynnyyynynyyyynnn
yy ?yynnntyyt+t+y-yyynyynnnytyyny.xyn#nnynyyny? ?ynyynyyyynynny#f£+yynynnnyn-
nnynnyxy#nnynnt+ynnnnynnnynyy.yynyy##yyynynyynyyyyyyyynnnynyyytyynnnn-nnn
nyyyynynyy+tyyy-yny-nnynyynny+nnnnn#nyyynyynnnnynnyyyytyyynnyynnyyyt#yny#
yynynnnnny??x??n?? ?yynnnyyyyNyyyVyNn?yyyyyny.- -ynyynnynnnynynnnyynnnnn.nn.
ynx#nnnn?y ?nnn-ynnnynynnyynynnnnnnnynynynyyn?nyyyyn?nannyyyyynynnnnn?nyn?
ynyy-ntnyyyty-yyyttyyynynnnyynnnnnnnnnyyyyn. -nyyyyy-yyfyynyyyyxfyyyy.-yn.
nyy??n?nyn?yn-nynn-nnnnyynyyyynnnntyntyyyynynnyyyyynyynyynyynyyyyynxnyyn
nnnyyyyynnyyn?nnnnnnnnnnyyyynyyyynyyyynnnyynnynyyxnynnnynynnynnnnannnyynn
nyynynnnyyyyyntnnnnnyyynn--t++nnnannnnynynynnnnnyyyyynyynyynyynynyyyty-xyn
nynnxnyynyyynnnnnynyyynyyynyynnynnnanyynnnyynynnyynyynnynynnynyynnynynnny
yynynnynyyyynynyyynynyyyyynyynnnnnynynnyyynnynyynyyynyyyynnyynnyyyynynny
nn? ynnnnyny-? ?ynnn?nyyynynnyynn?nynnxytnn?yynnyyytynyyxyynynnn#ynyny?n??
nnnyyytynny-—-nnnnn?nyynny?yy -nnyynnyyftnnxyynnnyffin.nf.ynnynnnyynnyynyn.
nn?+nnnyyyy? ?--#-nynynynynyyfynnnynnnyynnyyynnyyyyynnynyynyyny+tnyn#nn. —
nnnnny+ynnnnnnnnnnnnynyyy#?#?? 7nynnynyynnyyynnnyynnynnynnyn?nnynnnnnn#?n
nnxnny#ynnxynnnnnnnnynynyyyyxyyynyxyyt#yytny fnyyyyynnyyyyn?yyy-yyyyyynyy
nyynnnyyynnynnnynnnnynnnynyynyyynyynyyyyyynnynfynyynyyyyy ?nyynnyyyyy#nyy
nnnnnnt+ynnnynnnynnnnnnnnyyyynyyynyF#yyyx#nynnnynyynnyn?nannynnnnynnnn?ny-n
y-yxnnxnyy###nyynnyynnnn#¥n-ynnnnn-ynyynyyyny#nynnyy ?yynyy ?yynynynyn-nnn.
nnynnyyynnnnannnnnnntnnnny.yynynyynnyyt+#yyynnynyynyynyyyyynyyyy-yyyyyynyy
yn?nnnyyyyynnnxynynyynynynyyynn?yynnyyynyyny .yyynyyyyynyynyynyyynnynynnn
yy ?ynnyyyynnnyynnynynn ?nynyyyyyynnnyyyynnyyn ?nnyyynynnnyynyyyyypnnyynyny
ynxnnyyynn+nnnnnnnnnyynyyyyyy#nyt ?xyyyyynyynn?yyynnynnn?nynynnyynnnynyn.
yynnnnnnnny+t+y-nnnnnnnnntynynyyynnnnyyyyyynnyyynnyynyyyyynynyy ?yyyyynyyy
yyynynynyyyyynyynnynnnynnnnyynnnnnynyyyyyyyynnynnyynyynyynyyyynyyyyvyy *yy
n#ynnyyyyn?y#nynnnnnnnynyyyyn?y#tynyxynynynyynyyyynyynnnnnyyynnynnnnx?nnn
yyyyynynyyyy? ?ynyyyynyyynnnyynnnnn#n#nyyyynnyny? ?yynyyyyynyy ?y ?yyyyyynyy
nn?ynynynn?nn?nnnxnnynnnyyyxnyn#ynny-yynnnyy ?nny ?nnynnnnn ?n? ?n?nnnn?nynn
TABLE 3.2
Votes in the Senate. Senate seats arranged by class and state identify the
rows, and senate bill numbers identify the columns.
111111111122222222222333333333444444444455555555556666666666777
123456789012345678901234567890123456789012345678901234567890123456789012
yy#yyxtntynnt+y-~-yynynyyynn.yy. .nnynnxnyxyynnynynnyyyyyyyynyyy- fyyyyyynyn
nnnnnyyynnnx-nnnnnnnynnnyyyynynyyynyyyxnnnyyxynyynntnnnnnynxnnynnnnnnynn
yyynnn?nyyyynnnnnnnnnyyyyn?ty? .nnnynynt+n?ynynnyynyynyyyyynyy ?ynyyn?y?nnn
yytyynynyyyyyyynyyyynynnnnnyynnnnnynyyyyyynnynynnyynyynyynyyyyyyyyyyynyy
n-?n?n? ?yy?? ?nnnn?nnnyny ?nnyn?n? Pnnyyy ?nnyyynynyyynynnnyyyynynynnnnnnynn
nnn#nyt+tynyy-xnnnnn?n-ynnyyyy ?yny? ?xyyyynnnynnynyynnynnnnnynnynynnnynnynn
yy#nnnxnyyyyy#yyyyy#nnyyynyy#-—-nyytyyynyyyn. #yyt+-yynyyyyynnyyny-ynyyyny#
yyyyynyxyytyyynnny ?ynyynynnyynn?nn#n? ?ynyynnynynnyyny ?yyynnyyynyyyyynnyy
nnnnny#ynnnnnxnnnnnnynynyyyynyyyynnyyy#xnnyynynyynnyn-nnnyynnnynnnnnnynn
n-nnn-t+ynnnnx?+ynnnnynynyyyynyy?? ?xyn?yny ?y ?nynyynnynnnynyynn? Pnnnannnyn?
nn?xn+nynnxnnnnnnnnxynnnyyyynyyyyynynyynnnyynynyynnynnnnnyynnyynnnnnnynn
nynynnynnyyyny ?nnnnynynnyyyyyynnyn#ny#ynnnynny ?yy ?nynnynnynnnynnnnnnnynn
ynnynnnynyyynynnnyyyyynyyynyynnnnnxyyyynnyynynnyy#n-ynyyynnynx. nnyyynyyn
nnnnnnyynnnnn-nnannnnnannnyyynnyyyyynynfynnnyynynyynnynnnannynnnnynnannnnynn
nnnynyyynnynnnnnnn?nnynnyyyynynyynnynyynn. ynnynyynnnnnnynynnynn.nnnnnynn
yn##?-nnyyy? ?#+++# ?yyyyynnnyt. -xnyxnnny#nyyny? ?ynyy. ft+tyfynynyynynyyfc-?n
nN???n??y?n?n?n?yn?ynyn?nyyyNn?? ?y?yxyn?xnnxynnynyy-Nn?nnnnnyn?nnynnnn-nynn
yxnn?nyn?y ?nyynx?ynynynyny-yynnn?nnyyyynnynnyny. ??x????y??? ?yyynynyyyynn
ynyy ?nnnyyynynyyy#yynyyynnnynnnnnnnny ?yyyy ?nynynnyynyyyyynyyyynyyyytynyn
nnnx-nyyynnynnnannnnxynnnyyyynyyyyyxynyynytyn. yn? ?nnnn?nnnyyyyynyyy -yynny
yy#ty#n#nyy++?#yyy #nynyyynn-yynnnnnynyyyyyynntnynyyynyyyyy ?yynyn?yyy-ynyy
nnnnnt+yynnnnnxnynxnnynnnyyyynyyyynnyyyyynynn.nnyyy? ?nannnn?nyynnnnnyynynn
nn?ynynynnynnnnnnnnnyynyyyyyyynyynnyyyynn#ynnnnyynnnnnynny? ?yy ?7nnnnynynn
yyyyy-nnyytnnn+++-yyyyyyn.nyyffnnnxnyyyxyyynynynnfy.fffyynyyn#tfyyyy??n.n
ynynyn+nyyyynnnyny-ynyntnnnyynnnnnfynnxnytnnynfyy#ny#tyyyy ?nninnyyyf#ynftt+
nxyn-#yyny-yyyyyyx#ynyynynnyynnnnnynynnyyynnynynnnynyyyyynyyyynyyyfynnny
yyynnnnnyyyynyyny?yy-yyyn-nynnnnnnynyny#xynnynyyyyynyyyyynnyfynfy+fy#nyy
yyyyyn+nyyyynyy#yyyynyyynnnyyx .nnnynynyynyn. fnynnyynyfyyynnyyynyyyyyynyy
nnnnnnyyyynnnntynnnnynynyyyy ?yVYVynyyyyynnnynynyynnynnnyyynyynnnnnnnnynn
yn?nnnt+nynn?xnnynny ?ynnnyyyynnnynynyynyynynnnynyynn?nynyynyyyyYNyyVVYVVY°?y
nnxnn#?yxxynnnnnn-?yyynyy#nynnnyynnnyyynnnynynyyytnnnny ?nynyyn? ?nynnnnny
yny# ?n?nyynyn? # ?nyyyyynnynnyynnnnn#yyyyyyynnfyyynyynyyyyy ?yYyyvy ?yynny ?y
yyynnnnnyyynn? ?nyyyynyyyynnyynnnn? ?yynn#?yn?ynyynynnyyy ?ynn? ?nnynyyy#y?n
nnnnnnnynyyyn-nnnnnnyynyyyny -nnnnnnynynnnt+ynynnyyynnnnynnnnannynnnn?nnynn
nn?nnnyyxnnynn?nn?nnynn?yyynnyyyyv#yyyynnty? ?y?? ?nnnnnnnnyy nyyyyynyy
yyyttnynyy##?nynny ?ynnnnnnnyynnnnnynyyynyynynnynnyy ?yynyy ?yynynftnyyyynyy
nnxynynynnynnnnnnnnnnnnnyyyn#ynyynnyyynxnty.ny-yynnynnnyny. ?y??xnnnyny-n
nnnnnnnynnnnnnnnnnnnynyyyynynyn.yynyyynnnyyynnnyyynyny .nnynynnynnn.nnyn.
yn?nn? ?yyyyynn?x?#yyynn?y-yyyynnyyyynynn?yyy ?ynnn#ynyyyyynyyyynfnnnynnnn
nnynnxnyynnnn?nnnn?nynnnynnynyyynynyy ?yynyyy ?ynyyn?y ?nn?ny ?nynynnnnnnynn
nny? ?nynyynyyyyy? ?ynnnyyy ?y+?nyyynxynyy#yynynyy? ?yy ?yy ?yyynyyyy ?yynnnyny
ynynnnnnyyyyyyyynnyynnynynnyynnnnnynnyyyyyynnnyyyyynnyyyynytynnyyyyyynny
n?xynnnynyynnn-nn-xyyynyyynynftnnynnyynyynyynyy???xn?x? ?Pnyynn.yn.ynny?ynn
yyy ?ynnnyyyy ?y ?yyyyynyy+tn-n+xnn?n?t+ny?#yyynnynyynynny ?yyynnyyynyyyy ? ?nyy
nnynnyn#nnnnannnnnnnnynnntyynnyyyyyxynyyny.yy-y-ffnyf. .nnn?yynnynnn.nnynn
yyynnnyyyynnny-nnn?nnnnyynyyyyyynnnyyyynyynnnnyynnynyyyyynyyyy-yyyyy ?nny
yyynynxnyyy ?#nnnyxnnyyyynnnyt+nnnnnnnynyn#y-nnnyyyyynyyyyyn-yntn#tyyyynnnn
yy#yy-yny# ?ynyyyyyyynyyynnnyynnnnnynynnyyynnyny? ?yyyyyyyynn?yynyyn?yynyy
TABLE 3.3
Votes in the Senate. Senate seats arranged by class and state identify the
rows, and senate bill numbers identify the columns.
777777788888888889999999999000000000011111111112222222222333333333344444
345678901234567890123456789012345678901234567890123456789012345678901234
nnyyyyyyyyMnyyyyyynnyyyyyynnnyyyyyny ?#ynyyynyynyyyxyyynn?#ynynyynnnyynny
?++-nnnnynynnnyynnyyynnynynyyynyynnnyyyyn?yyyyyynynnnynyyynyyynyyyyyyyny
nyy -nnnyynynynyyyynn.ynnynnynyy ?yn?ynyynyynnnnyynnynnyynnnynyyyynnnnnnyy
ynnnnny ?yynnnnynynyyyynyyyyyynyyyynnyyyyynynyynynnynnynyyynyyyynnyynynny
nnyynnyyynnnnyyyynynnynnynynyynyynnynyynynynyynyynyyyyyyynnyyy? ?#ynyynny
nnyfyyyyn?yyyynyyynnnyynynynnnyny ?nynyy ?yynnnnnyyyynyyynnnynnynynnnynyyy
n?xynnnnnnnnnnynnnyyynnynyyyyyyyyynnyyyynnyyyynynnyynynyyynyyynyyyynyyny
nnynyyyyy-nyyynynynnn?ynynyn?nynyynynyynyynnnnnnynnnyyynnnynynynynnynyyy
nn?ynyyyynyyyynyyynnnyynynynnnynynnynnynyyn?nnnny ?ynyyynnnynnyynnnnynyyy
yyynyyyynnyyyynyyynnnyyny ?y? ?n? ?ynyynnnnyynxnnn-yynnyyynnnynnnyyynnyny ?y
f.ynyyyy-nytyynyyynnnyynyyynnnynynyynny-yynnnnnnynnnyyynnnynynynnnnynyny
nyynyyyynyyynyxynynnnyynyynyyyyyyynynyyyyynnynnynnyynynnynnyyynnynynyyyy
nnnynnnnnynnnnynnnynyyny ?nnnyyy ?7n?nnyyynnnyyyyyynnyyynyyyyny ?yyYYYYYYYYYY
? 2ynynnyyynnynyynnyyyynnyyyyynnyyynnnyyyyyynyyyynnyynynnyynyyynyyyynynyy
yyy#yyyynynnyyynyynnyyynnnynnn? ?7nynynyynn ?ynynnyyyyyyyyynnynyynynnnynnyy
nnnynnnynnnnny#nynyynynynynyyyyyyynnnyynnnyyyyynynnynnnyyyyyynAnynnynyyny
nny+nnyxynnnnynynnynnynnynyny#?ynnn-n?ynytyyyyy ?nnyynnnyyynyyynyyyynyynn
yyynyyyyyyyyyynyyynnnyynynynny ?ny ?yynnynyy ?yyyyyyyynyyynnnynnnynnnnyny?y
nnn?nnnnnnynnnynnnnyynnynyyyyy ?ynn?nnyyynyyyyyyyyynnyyynnnynynnnnnyynyny
?nn?fnyyyyyynynyyynnnnnnynynnyynynnynnynytnnnnn?nyynyynnnnynnyynnnnynyny
-nnyyyyynyynnyynyynnnynnynnnny ?nyn?ynyynyynnynyyy ?ynyyyn?nynynnnnnnynyyy
-NYYYVYYYYNRYYYynyyynnnyy ?ynynnnynyn?yn?yny#nnnnnnyyynyyynnnynnyynnnnynyyy
nnynynyyyyynynyyyynyyyynyyn?ynnyyynynyynyynnnny ?nnynyy ?nyynyyyyy#yynyyny
nyxnyyyyynyyynyynynynyynynnynyynynnynyynnynnnnnynhyynnyynyynnnyynynnyyyyy
. #YVYvyyyynnynyyyynntnny ?ynny ?yynnynyynyyynynynynynyyyynnynyyyyynnynnny
nnnynyyyn-ynnnyynnynyynynnynyyy ?ynnnnyynyyyyyynynnynyyyynxnyyyyyyynyyy ?y
nynnnnn-yynnnnynnnyyyynynynyyn?yny ?nyyyynnyyyyy-nannynnnyyynyynynyyynynnn
nnynnyyyyny .nyyynynxn?nnyyyynyynynnynyyynnnnynnyn?ynnyyn?nynyynnnnnyyy ?y
nnnynnnnynnnnnynnnynynnynnnyyynynnnnyyyynnyyyynynnyynynyyynyyyvyyyyyyyny
nyynnnyynyynnynyyyynnynnynynyyynynnynyynyynnnyytyyynyyynyynyyynyynnnyyyy
yynyyyynynyynynynynnnynnynynyyynynnynyynynyyyynnynynyyyynyynyyyynnyynyny
yynnyyyyynnyyynyyynynnyyyynyynnnyyyyyyynynynynnnynyyyynynnynyyynynnyyyyy
nny+nnnnnnynnnynnnyyyn-ynynyyynynn ?nnyyynnyyyyynenynnnnyyynyyynyyyynyn?n
nnnyyyyynyyynnnynnnynynyyyynyynnnynnnyyyyynyyyynyn ?ynyyyyynyynyyyyyyynny
nn#ynnyynyynnnynnyyyyynyn?yyyyyynn ?nnyyynyyyyyyynnynnynyyynyyyyyyyynyynn
?nnynynn?y?? ?-yynnnnny ?nnnnnyyyyynnynyyny ?ynynnnnnynyyyynny ?ynyynnnyy? ?y
yyy- -yyynyyyyynyyyn? ?nnynynyynnyny ?nyyyynnyyyynnnnnynynyyyny ?nnyyyynyynn
y ?nnnyyyynn?yynyyynynyyntnyny ?ynyy ?yynynyyynnnnny ?ynyyynnnynynyn?nntnyny
nnyynnyyynyynnnyynyyyynyynnnyyyyynnnyyyynnyyyynynnyyyyyyyynyyyyyyyyyynny
?nn?nnnnnnnnnnynnnyytynynynyynny ?y ?nyyyynny ?yyyyn?nynnnyyynyynnyyyynyynn
?nynyyyyyyyyy ?#ynynyyyynyyyyyyyyynnynyynnynnyynynnynnyyyy ?n? ?nynynnyn-?y
yyynyyyyynyyyynyyynn?nnyn ?nyynnyyynnyyyynnyyyyyynnyynnnyyynyyynynnnynyyy
nnyyyyyynyynnyyyny+nynnynynyyny ?ynynnyynnnynnnnynnynnyynyyyyyynynynyyyny
nyyyynyynynnnyny ?yynyynnynynyyyyyynnnyynnyynynyynnyyyynyynynyyyyyyynyyny
nnnyyyyyyynAnnnynnnyyyynnnnynyyy ?nynnnyynnyyyyyynn?yynynyynnyyyyynynyyyny
ynyn.yyyynyyynnyyynn.nynyynyynnyyn?nn?yynynynynnnnnnyyynyyyyynynyyyyyyyy
nyynyyyyynyyyynyyynnny ?nyyynnnynynyynnnnyynnnnnyyyynyyyn ?nynnnynynnyny ?y
nnnynnnyynn?nnynnnnyxyynynynnn?nynyynnnnytnnannnyyyynyyynnnynnnynynnynnny
y?? ?yyyynyyyyynyyyn ?nyyynynyyyyyyynynyynyynyyynynnyynyynyynyyynyyyynyyyy
nnnynyynnyn?nnynnnynyynnnynyyynyyynnnyyynnynyyynnnyynyyyyynyyyyyyyynyy?n
TABLE 3.4
Votes in the Senate. Senate seats arranged by class and state identify the
rows, and senate bill numbers identify the columns.
777777788888888889999999999000000000011111111112222222222333333333344444
345678901234567890123456789012345678901234567890123456789012345678901234
yvyynyyy#nnyffynynynnny? ?ynynny ?nynnynyyny+nnnnnynnynyyynnnynyynnynnynyny
nnnynnnnnnnnnnynnnyyynnyynnnyynyyynnnyynn-yyyynnnnyyynyynnyyyynynnyynnny
ynyxtyyyynyy ?ynyyynn?nnynynyyynynynnyyyynnnyyynnnnnynnnyynnyynnyyyynyynn
yyynyyyyynyyynnyyynnnyynynynnn?n?nnynnnnyynnnnnnynynyynnnn?n?nynnynynyny
nnnynnnnnnnnnnynynyy ?ynyn?nyyyyyyynnnyyynyyyyyyy ?nnynyyyyynyynnyyyynyyny
ynnynnnnnnnnnnynnnyyynnynynyyynynynnyyyynnyyyyynnnnpnnnyyynyynyyyyynynnn
nnny ?nnnn?? ?7nnynnnyyyy ?nnnnanyynynynnyyynn?ynyyynnnynnyyyynn?yynyyynyyn?y
t+ynnyyyynynyynyyyynnyyynytnynnnyyyytnyynyynnynx ?n-yynyynnnyntynnnnnyyyyy
nyynyyyyynnyyynyyynnnyynynnnnty ?ynyynnynyynnnnnyyyynyyynnnynnyynnnnynyyy
nnxynnnnnynnnnynnnyyyynynynyyynyyynnyyyynnyyyyyynnyynytyyynyyyyyyyyyyynn
yyynfyyyn. fyyynynynnnyn?nyny?y?y?? ?nnyynnyyyynyy ?nyynnyynnyyyynynyyyyn?n
nnynnnnynnynnnyynnyyyy ?y?? ?yyynynxnnyyyyny ?yyyyynn?ynyyyyynyyyyyynnynyny
-hyynnnnnyynnnynnny##nnynyyyyyn?n?nnyyyynnyyyyyn?nnnnnnyyynyynnyyyynyynn
?nyynnnnyy ?nnnynnnyy ?nnynynyyynyyynnyyyynyyyyyyynnyynyyyyynyyyyyyyyyynny
nnnnyyynynnyyyynnynnnyynynynnynnyynynnynnyyynnnyynyyynyynnynyyyyynnynyny
nnyynnnnnnynnnynnnyyyynynynyyyn?nnnnnyyynnyyyyynnnynnyyyyynyyynyyyyyynnn
nnnynnnnnynnnnynynynyynyynynyyyn? ?nyyyyynnyyyyynnnyynynyyynyyyyynyynyynn
ynyyyyyy -nfityyyynynyn?y?n?n???ynn? ?ynyyynyyyyyyynnyynnnyyynyyyyyyyyyyyny
?nn+?nnnnnnnnnynnnyyynny??-yyy? ?y?nnyyyyn? ?yy?#?nnynnnnnyynyynyy?yyy?y?n
yyynnyyyyyyyyynynnnnnyynynnnyyyyynnynynnyynynnnnnnynyyynnnynyyyyynnyyyny
yyynyyyynnyyyynyyynnnyynynynnnynynnynxyny ?nnnnnx? ynnyytnnnynnnnynnnnnyyy
YYY-YyyyNnyyyynyyynnnyynyynnyn?y ?n?ynnnnytnnnnny? ? ?nyyyn?nynnnnnannnynyyy
YVY-VYVYYYVyynyxynynnnyynynnnyytnyn?ynnnnyynnnnnyy ?ynyyynnnynnyynynnynnny
nynynnynnynnnyyyynnnnynnyyyyyynyy ?nynyyyyn+t ?yyynnnyynynyyynyyyyynytyyyny
?nyyyny-nny .nnnyynynyynn? ?yyyy?? ?n? ?yyyynnyyyyynnnyynyyyyynyyyyyyyyyyyny
-hynyyyfnyy.nynynynny?n?ynynyyyn?nnynyynyynnnnnnyy#nyyynnnynnnyyxnnynnyy
yny ?yyyy -nnnyynyyyynytnn? ?ynnxy??? ?yyyyynnyyyyyynynynnnyyynyynnyyyynyynn
nnynyyyyyyynnynynyynnynnynnnnF#y ?y ?yynyynyyn#ynnynyynyynnnnnnynynyynyyyny
fy#ny#+ynnynyynyyynnn ?ynynynnyyny ?nyyyyynnyyyyny ?nyynynyyynyyynyyyynyy?y
yyy-yyyyynyyyynynynnnyynynynnn?n? ?nyyyyynnynyyyynnyynnnyynnyyynyyynyyyny
nnxt+nnn-nnnnnnynynyyynnnnyyyynnynynnnyyynnyyyyyynryynynyyynyyynyyyynyynn
nnnynyyynynynyynyyynyyyyynynyyynyynynyynyyynnnnnynyyyyyynnynyyyyynnyyyny
nnyyyyyynyyn?nyynnynyy ?ynnyny ?n?n?nnyyyyn?yyyyyynnyyyynyynny ?yyyyynnyyny
VVY ?YVYYYYNYYVYYynynn?ynnynyy ?nynyn?ynynnn?nnnnyy ?7nynnyynynnnnyynxnnynyyy
nnnnyyyynynnnynyny ?nny ?nynynnnnny ?yynnnnyynnnnnnynyyyyynnnynyyynnnnynnyy
nnn?nnnnnyn?nnyn?nyyynn?nynyyynyn? ?nyyyynnyyyyyynnyynynyyynyyyyy ?yynyynn
yynnnyyyyynynynyyynnnyynyyynyynn ?ynynyyynynyyyyynnynnnynyynyyyyyyyyyyynn
nyynyyyyyny ?ynnyyynnnynnynyyny ?n+? ?ynyyyyyyyyyyynnyynynyyynyyyyyyyyyyyny
nnnynyynynyynnnyn-ynynnyynnnyyyyn? ?nyyyyn-yyyyynnnnynnnyyynyy ?yy?yynyynn
nnnynyynnynnnnynnnyyyynynyyyyyn?nynnyyyynyynynyynnyynyynn?ynyynynnnyyyyy
nnynyyyyynyynynynynnnynnnnynyy ?nynnynyyny ?nyn?-yyyynyyynynynyyynnnyynyyy
nnnynnnyynynnnyynnyynynynnnyynyynnnnyyyynnyyyynnnnyynnyyyynyyynyyyyyynnn
yyynnnyynyynnnyyyynyyynnyynyyn?yynnnnyyyy ?nnynny ?nynnynnynynyynyynnyyyny
y ?nnnnnyyynnnnynyyyyynynyynyynny ?#nnnyynyynnynyynnynyyynnnnnyyynynnyyyny
ynynnyy .-yyynnnnynnynyynnyn?nyynyyn?ynynnyyy ?ynn?y? ?nyyyynnnnyyyyyyyyyyny
y ?ynyyyynnyyyynynynnnyyn? ?ynnn?yy? ?yyyyynyyyyyyynnyynnyyyynyyynyyyyyynny
?n.ynnnnyynnnnynnnyyyynynynyyn?ynynnyyyynnyyyyynnnnynnyyyynyynynyyynt-??
yyynyyyynnnyyynynynnnyynyyyn?n?nynyynnnnytnnnnnynyynyyynnnynyyynynnynyny
~y# ?nxyynyynnynynynnnynnyyynnxynynnynyyynyynnyyynnynnyynyynyyynyynyyyy ?y
yyynyyyynnyyyynyyynnnyynynynnnynynnyyyyyynyyyyynnnyynynyyynyyynnyyyyyyny
[Plot: senators of Class I displayed on the first (horizontal) and second (vertical) principal components; each point is labeled by state and name and marked by party symbol.]
Fig. 1. Principal components for Senators in Class I.
mathematics involves a singular value decomposition by the Jacobi method [10]. Perhaps more efficient methods would be found in LINPACK [4] and EISPACK [11]. Faddeeva [6] wrote about older but interesting methods. MacRae [9] and Easterling [5] used the related method of factor analysis to analyze voting in the Senate. Davis and McCoy [3] analyzed survey responses in a similar way. The analysis by principal components is far from new, but it seems useful as a first look at the data.
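To make the computation concrete, the following minimal sketch (Python with NumPy; not the authors' program, which used a Jacobi-method singular value decomposition, and with a purely synthetic vote matrix) shows how principal component coordinates for a biplot of senators and bills might be obtained from a general-purpose SVD.

import numpy as np

# Hypothetical vote matrix: one row per senator, one column per bill,
# coded +1 for "yea", -1 for "nay", 0 for not voting (illustrative data only).
rng = np.random.default_rng(0)
votes = rng.choice([-1.0, 0.0, 1.0], size=(100, 144), p=[0.45, 0.10, 0.45])

# Center each column and take the singular value decomposition; the leading
# singular vectors give the first and second principal components.
X = votes - votes.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

senator_coords = U[:, :2] * s[:2]   # senators plotted on the first two components
bill_coords = Vt[:2].T              # bill loadings, for a biplot-style display

Plotting senator_coords (separately by Senate class) and bill_coords would give displays in the spirit of Figures 1-4.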
Results
In the figures, the principal component
vectors corresponding to the largest and
the next largest principal values are plot-
ted, following the widely used methods named "biplots" by Gabriel [7]. In Figures 1, 2 and 3, the senators are divided among
the classes to which they belong. The
three classes of the Senate differ in their
characteristics, which may be seen in the
figures. Figure 4 shows the Senate bills,
which are also identified by name in Ta-
ble 1.
Inspection of the figures allows us to
interpret the first, or largest, principal
component as a liberal-conservative gra-
dation. The second is not so easily under-
stood, but the sequence of bills in Figure
4 helps us to see a pattern. Early bills,
from 1969 and the early ’70’s, seem to fall
along a diagonal line from upper left to
lower right. As time passes, the line pivots
about the origin until, in the mid-1980’s,
the line passes from upper right to lower
left. A further rotation of the axes might
well simplify the picture.
In Figures 1-3, the party affiliation of the Senator is shown by a filled symbol (●) for Republicans and an open symbol (○) for Democrats. More liberal Senators are more Democratic or north-eastern; more conservative Senators are more Republican or south-western. These observations are well known, but it is interesting that the spectral analysis was performed with no reference to state or party affiliation. Only the voting record contributed to the results.
[Figure 2: Class II Senators plotted on the first and second principal components, each point labeled with the Senator's state and name.]
Fig. 2. Principal components for Senators in Class II.
[Figure 3: Class III Senators plotted on the first and second principal components, each point labeled with the Senator's state and name.]
Fig. 3. Principal components for Senators in Class III.
In Figure 4, bills that were adopted are shown as solid circles (●) and those that were not are shown as open circles (○).
There is a pattern reflecting the legislative
leadership at varying times. In the earlier
years, more liberal bills, appearing in the
lower right corner, are passed more often;
but in later years, more conservative bills, appearing in the lower left corner, are more often carried. Bills toward the edges
of the patterns are not often passed, per-
haps reflecting a moderating tendency of
the Senate to avoid support of extreme
legislation.
As Colin Mallows remarked on seeing
the graphs in the poster session at the
Joint Statistical Meetings in New Orleans,
the data would be usefully explored fur-
ther by computer graphics and shown
on videotapes. We are working on that
possibility, and we invite others to do so.
The data are available on a floppy disc
in an ASCII file in IBM-PC compatible
form.
[Figure 4: Senate bills plotted on the first and second principal components, identified by bill number; solid circles mark bills that were adopted and open circles bills that were not.]
Fig. 4. Principal components for Senate Bills.
Acknowledgment
We thank Charles Stembler for pre-
paring a draft of the voting data in Ta-
ble 3.
References Cited
1. Barone, Michael: Almanac of American Poli-
tics, Washington, DC, National Journal, 1972-
1987.
2. Congressional Quarterly: Washington, DC,
Congressional Quarterly, Inc., 1970-1986.
3. Davis, M. and McCoy, J.: "A multivariate model for response reliability in surveys," Proceedings of the Section on Survey Research Methods of the American Statistical Association, 603-608, 1978.
4. Dongarra, J. J., Moler, C. B., Bunch, J. R. and Stewart, G. W.: LINPACK Users' Guide, Philadelphia, Society for Industrial and Applied Mathematics, 1979.
5. Easterling, Douglas V.: Political Science: Using the Generalized Euclidean Model to Study Ideological Shifts in the U.S. Senate, Chapter 10 in Young, Forrest W.: Multidimensional Scaling: History, Theory and Applications, Hillsdale, NJ, Lawrence Erlbaum Associates, 1987.
6. Faddeeva, V. N.: Computational Methods of Linear Algebra, New York, Dover, 1959.
7. Gabriel, K. R.: The biplot graphic display of matrices with application to principal component analysis, Biometrika, 58:453-467, 1971.
8. Good, I. J.: The Estimation of Probabilities, Cambridge, MA, The MIT Press, 1965.
9. MacRae, D.: Issues and Parties in Legislative Voting: Methods of Statistical Analysis, New York, Harper and Row, 1970.
10. Ralston, A. and Wilf, H. S.: Mathematical Methods for Digital Computers, New York, John Wiley, 1960.
11. Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., Klema, V. C., Moler, C. B.: Matrix Eigensystem Routines: EISPACK Guide, 2nd ed., New York, Springer-Verlag, 1976.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 310-322, December 1988
Computational Statistics:
A New Agenda for
Statistical Theory and Practice
Edward J. Wegman*
Center for Computational Statistics
4400 University Drive
George Mason University, Fairfax, VA 22030
ABSTRACT
The impact of workstation and personal computing has important implications for the future of statistics. We argue that the capabilities of new computing environments will change methodological focus because computationally intensive algorithms free of onerous restrictions are feasible in place of mathematically tractable but potentially nonrobust algorithms. Moreover, electronic instrumentation allows us to collect data substantially different from traditional data collection. In particular, we argue that in place of small, low dimensional homogeneous data sets chosen according to a well designed experiment, we are more likely to see very large, high dimensional nonhomogeneous data sets collected opportunistically. We outline a comparison between traditional statistics and what we call computational statistics. We give several examples of computational statistics and complete our thesis with a discussion of the implications for graduate curricula.
1. Introduction
The spectacular growth in the field of computing science is obvious to all. Indeed, the most obvious manifestation, the ubiquitous microcomputer, is in some ways perhaps the least significant aspect
of this revolution. The new pipeline and
parallel architectures including systolic
arrays and hypercubes, the emergence of
artificial intelligence, the cheap availabil-
ity of RAM and color graphics, the impact
of high resolution graphics, the potential
for optical, biological and chemical com-
puting machines and the pressing need for
software/language models for parallel
computing are all aspects of the compu-
tation revolution which figure promi-
nently in the other sciences.
While not as obvious to the casual ob-
server, the fields of statistics and proba-
bility have experienced equally spectac-
ular technical achievements including
weak convergence theory, the almost sure
invariance principle, exploratory data
analysis, nonlinear time series methods,
bootstrapping, semiparametric methods,
percolation theory, simulated annealing
and the like principally within the last dec-
ade. The computing and statistical tech-
nologies both figure prominently in the
manipulation and analysis of data and in-
formation. It is natural then to consider
the linkage between these two discipline
areas. The interface between these two
discipline areas has been labeled by the
phrase, "statistical computing," and more recently, "computational statistics." We
should like to distinguish between the two
and, indeed, argue that the latter embod-
ies a rather significantly different ap-
proach to statistical inference.
In thinking carefully about the rela-
tionship between computing science and
statistical science it is possible to describe
a large number of linkages. Two that come
immediately to mind are the stochastic
description of data flow through a com-
puter and the characterization of uncer-
tainty in expert systems. In the former
case, we can view a computing architec-
ture, particularly a parallel or distributed
architecture as a network with messages
being passed from node to node. This is
essentially a queueing network and the
characterization of the distribution state
of the network becomes a problem in sto-
chastic model building and estimation. In
the latter example, a rule-based expert
system is invariably characterized by im-
precision in the specification of the basic
predicates derived from the expert or ex-
perts. This may be because either the rule is a "rule-of-thumb" and hence inherently imprecise, or because the rule is not yet a fully formulated and verified inference and hence exogenously stochastic. The as-
signment of probabilities (or such alter-
natives to probabilities as belief functions
or fuzzy set functions) in a useful way is
another application of statistical meth-
odologies to computing science. In both
of these examples statistical methodology
is employed in the development of com-
puting science. Both of these examples
might legitimately be called statistical
computing since the focus is on computing
with statistics as an adjectival modifier.
Traditionally, of course, this is not at
all what statistical computing means. Sta-
tistics is fundamentally an applied sci-
ence, hence, a computationally oriented
science. A statistical theory is useless
without a suitable algorithm to go with it.
Statistical computing has traditionally
meant the conversion of statistical algo-
rithms into a reasonably friendly com-
puter code. This enterprise became fea-
sible with the development of the IBM
360 series in the early 1960s and the slightly
later development of the DEC Vax series
of computers. Of course, the trend has
accelerated with the ubiquitous PC. Pack-
ages such as SAS, BMDP, Minitab, SPSS
and the like represent a very high evo-
lution in statistical computing. When we
use the phrase, “computational statis-
tics,” we have in mind a rather stronger
focus on the exploitation of computing in
the creation of new statistical methodol-
ogy. We shall give some explicit examples
shortly.
2. Statistics as an Information
Technology
Statistics is fundamentally about the
transformation of raw data into useful in-
formation. As such statistics is a funda-
mental information technology. It is,
312
therefore, appropriate to see statistical
science as intimately related not only to
computing science, but also communica-
tion technology, electrical engineering
and systems engineering as part of the
spectrum of modern information proces-
sing and handling technologies. Statistics
is perhaps the oldest and most theoreti-
cally well developed of these information
technologies.
It is thus important to understand how
the changing face of these technologies affects statistics and, more importantly,
how they offer new opportunity for the
development of statistical methodologies.
In particular, it is important to understand
how the computer revolution is affecting
the accumulation of data. Electronic in-
strumentation implies an ability to ac-
quire a large amount of high dimensional
data very rapidly. While such capabilities
have existed for some time, the emerg-
ence of cheap RAM in the 1980’s has given
us the ability to store and access that data
in an active computer memory. Satellite-
based remote sensing, weather and pol-
lution monitoring, data base transactions,
computer controlled industrial automa-
tion and computer controlled laboratory
instrumentation as well as computer sim-
ulations are all sources of such complex
data sets. We contend that this new class
of data represents a challenge for statis-
ticians which is substantially different in
kind. In many ways the characteristics of
automated data are different from tradi-
tional data. Automated data is generally
untouched by human brain. While there
are fewer transcription errors, there are
also fewer checks for reasonableness.
There are likewise different economic
considerations. In many traditional data
collection regimes, the cost per item of
data is expensive. In the automated mode,
set-up costs are expensive, but once ac-
complished there is low incremental cost
for taking additional data. Thus it is often
easy to replicate, hence, increased sample
size, and it is easy to take additional mea-
surements on each sample item (vari-
ables), hence an increase in dimensional-
EDWARD J. WEGMAN
ity. As an adjunct, it is worth pointing out
that when an individual datum is expen-
sive, we collect no more than absolutely
needed to complete the inference. How-
ever, when the marginal cost of additional
data is low, we tend to collect much more
both in sample size and dimension moti-
vated by the belief that it might be useful
for some not-yet-precisely-specified pur-
pose. Thus, the computing revolution
often implies that there are less sharply fo-
cused reasons for collecting data.
The majority of existing methodology
is focused on the univariate, IID random
variable model. Even in the circumstance
that a multivariate model is entertained,
it is usually assumed to be multivariate
normal. We contend, in addition, that
while arbitrary sample size is frequently
assumed, the truth of the matter is that
these techniques implicitly assume small
to moderate sample sizes. For example,
a regression problem with 5 design vari-
ables and 1000 observations would rep-
resent no problem for traditional tech-
niques. By contrast, a regression problem
with 40,000 design variables and 8 million
observations would. The reason is clear.
In the former case the emphasis is on sta-
tistical efficiency which is the operational
goal for most current statistical technol-
ogy. However, with massive replications
the need for statistical efficiency is much
less pervasive. Indeed, we may find it de-
sirable to exchange highly efficient,
parametric procedures for less efficient,
but more robust, nonparametric proce-
dures to guard against violations of (pos-
sibly untestable) model assumptions. The
emphasis on parsimony in many contem-
porary books and papers is a further re-
flection of the mind-set that implicitly fo-
cuses on small to moderate sample sizes
since few parameters do not necessarily
make sense in the context of very large
sample sizes. Finally, we note that the
very fact of largeness in sample size im-
plies that it is unlikely we would see IID
homogeneity.
Thus, the computer revolution implies
a revolution in the type of data we are
Table 1.—Comparison of Traditional Statistics and Computational Statistics

Traditional Statistics                        Computational Statistics
Small to Moderate Sample Size                 Large to Very Large Sample Size
IID Data Sets                                 Nonhomogeneous Data Sets
One or Low Dimensional                        High Dimensional
Manually Computational                        Computationally Intensive
Mathematically Tractable                      Numerically Tractable
Well Focused Questions                        Imprecise Questions
Strong Unverifiable Assumptions:              Weak or No Assumptions:
  relationships (linearity, additivity)         relationships (nonlinearity)
  error structures (normality)                  error structures (distribution free)
Statistical Inference                         Structural Inference
Predominantly Closed Form Algorithms          Iterative Algorithms Possible
Statistical Optimality                        Statistical Robustness
able to accumulate, i.e. large, high-di-
mensional nonhomogeneous data sets.
More importantly, however, the richness
of these new data structures suggests that we are more demanding of the data. In a
simple univariate IID setting we primarily
are concerned with variability of a single
random variable and questions related to
this variability. In a more complex data
set, however, it is natural to ask more
involved questions while simultaneously
having less a priori insight into the struc-
ture of the data. When more variables are
available, we would certainly ask about
the functional relationship among them.
We have used the phrase, structural in-
ference, for methodologies aimed at de-
scribing such relationships, deliberately
contrasting this phrase with the phrase,
statistical inference, which has tradition-
ally been about determining variability
(i.e. probability distributions). With a
more complex structure, we are open to
ask more of the data than just simple de-
cisions. We may want to ask forecasting
questions, e.g. about weather, economic
projections, medical diagnosis, university
admissions and tax audits, or questions
about automated pattern recognition, e.g.
printed or hand-written characters, speech
recognition and robotic vision, or ques-
tions about system optimization, e.g.,
process and quality control, drug ef-
fectiveness, chemical yields and physical
design.
The contrast between traditional statis-
tics and the new statistics is marked and
strong. We summarize in Table 1. We think
that the implications of the computer re-
volution on statistics are sufficient to war-
rant new terminology, specifically com-
putational statistics. This terminology not
only provides the linkage of statistical sci-
ence with computing science, but also puts
the focus on statistics rather than on com-
puting.
3. Some Examples
In this section we should like to illus-
trate these notions of computational sta-
tistics with three examples of approaches
to data that flow from thinking about sta-
tistics in light of contemporary compu-
tation. It goes without saying that such
techniques as bootstrapping, cross vali-
dation and high interaction graphics are
clear well-known examples of what we call
computational statistics. We wish to de-
scribe some less well known ideas: 1. high
dimensional graphical representation, 2.
functional inference, and 3. data set map-
ping.
3.1 Example: High Dimensional
Graphical Representation
Large, high dimensional data sets often
have a complex, non-linear structure. For
this reason, exploratory analysis is even
more important for such data sets than it
is in more traditional well-structured data
sets. Visualization of data structures in
higher dimensions is, however, even more
difficult than in low dimensional cases
since geometric representation of data
with cartesian coordinates is impossible in
dimensions higher than three.
One interesting approach, the parallel coordinate representation, has been pursued by Inselberg (1985) [3] and Wegman (1986) [8]. A point in n-dimensional space
is represented in Figure 3.1. A parallel
coordinate representation consists of n
parallel axes, each axis meant to represent
one dimension of the n-dimensional vec-
tor. A point in n-space is plotted by marking x_i on the ith axis and joining x_1 through x_n by a broken line segment. Thus a point
in Euclidean n-space is mapped into a
broken line segment in the parallel rep-
resentation. Figure 3.1 represents two
points coinciding in the 4th coordinate.
Figure 3.2 represents a more complex but
artificially contrived data set. While we
will not try to develop intuition for the
parallel coordinate representation in this
paper, our experience has been that with
a very modest amount of training, people
rapidly become adept at interpreting these
diagrams. Figure 3.2 has several features
of interest including a normal marginal
density in dimension one, a negative chi-
square of dimension two, a three dimen-
sional cluster in dimensions 3 to 5, a five
dimensional mode as well as linear and
nonlinear functional relationships.
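As a concrete illustration of the construction just described, here is a minimal sketch (Python with matplotlib; the function name and the synthetic five-dimensional sample are ours, not from the paper) that draws each observation as a broken line across parallel axes.

import numpy as np
import matplotlib.pyplot as plt

def parallel_coordinates(data, ax=None):
    """Plot each row of `data` (an m x n array: m points in n dimensions)
    as a broken line across n parallel vertical axes."""
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    lo, hi = data.min(axis=0), data.max(axis=0)
    scaled = (data - lo) / np.where(hi > lo, hi - lo, 1.0)  # put each axis on [0, 1]
    if ax is None:
        ax = plt.gca()
    xs = np.arange(n)
    for row in scaled:
        ax.plot(xs, row, color="steelblue", alpha=0.3)      # one broken line per point
    for x in xs:
        ax.axvline(x, color="black", linewidth=0.5)         # the parallel axes
    ax.set_xticks(xs)
    ax.set_xticklabels([f"x{i + 1}" for i in xs])
    return ax

# a synthetic five dimensional sample, in the spirit of Figure 3.2
rng = np.random.default_rng(0)
parallel_coordinates(rng.normal(size=(200, 5)))
plt.show()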
Fig. 3.1. Two points in n-dimensional space plotted in parallel coordinates. These points agree in the 4th coordinate.
Fig. 3.2. Illustration of a five dimensional data set represented in parallel coordinates.
The mathematical structure of parallel coordinate diagrams turns out to be extremely interesting. Focusing on the first
ment, one observes that a point in ordi-
nary cartesian coordinates maps into a line
(segment) in parallel coordinates. This
point-line mapping is suggestive of the
duality found in projective geometry. In-
deed, projective geometry plays a key role
in the development of the parallel coor-
dinate representation. If both the carte-
sian plane and two dimensional parallel
coordinate plane are augmented with the
appropriate ideal points so that they are
both projective planes, the transforma-
tion from the cartesian coordinates to the
parallel coordinates becomes a projective
transformation or projectivity. A projec-
tivity can be represented as a matrix trans-
formation on the so-called natural ho-
mogeneous coordinates. The particular
cartesian-to-parallel-coordinates transfor-
mation induces a number of interesting
dualities.
Not only do points map into lines, but
lines map into points as well. Conics map
into conics, rotations map into transla-
tions, translations into rotations and in-
terestingly enough points of inflection map
into cusps and vice versa. There are sev-
eral interesting implications of these facts
beyond just the graphic display value of
parallel coordinates. For one thing, since
translations are relatively easy to compute
while rotations are relatively harder, there
is a computational advantage to a parallel
coordinate representation when heavy use
of rotations is expected. It is also clear
that points of inflection are relatively dif-
ficult to detect while cusps are relatively
easy. We believe that the parallel coor-
dinate representation will be an advan-
tageous one for computational geometry
as well as statistical data analysis.
The parallel coordinate diagram serves
as a hyperdimensional analogue to the
traditional scatter diagram and thus may
be used as a fundamental tool for high
dimensional exploratory data analysis.
Several notions have already been ex-
plored, but these are certainly prelimi-
nary. It is known, for example, that the
number of crossings of line segments be-
tween adjacent pairs of parallel axes is
invariant with scale transformation. Such
invariance suggests that features of the
parallel coordinate axes depend only on
ranks and thus may have robustness fea-
tures common to rank-based statistical
methods. Yet another question of signif-
icant interest relates to dimensionality
reduction. Is it possible, for example, to
find a simple graphical algorithm for di-
mensionality reduction based on rota-
tions, translations and nonlinear scaling?
If such a procedure were available either
automatically or with heuristic guidance,
it would be a fundamental tool in model
building.
3.2 Example: Functional Inference
Our fundamental premise is that we are
interested in the structural relationship
among a set of random variates. We for-
mulate this as follows. Given the random
variables X_1, X_2, . . . , X_n, which are functionally related by an equation, f(X_1, . . . , X_n) = ε, determine f. This is a generalization of the problem of finding the prediction equation in the standard regression problem, Y − xβ = ε, but in a nonparametric, nonlinear setting. We suggest the following notion. Let M = {(x_1, . . . , x_n) : f(x_1, . . . , x_n) = 0}. M in general is an algebraic variety and un-
der reasonable regularity conditions a
manifold. Thus, there is a fundamental
equivalency between estimating the geo-
metric manifold M and estimating the
function f.
By turning our attention to the mani-
fold M, we make the problem a geometric
one, one whose structure is easier to vis-
ualize using computer graphics. For this
reason we believe there is an intimate
connection between the structural esti-
mation problem and the visualization of
high dimensional manifolds. While graph-
ical methods for looking at point clouds
have proven stimulating to the imagina-
tion, it is extremely difficult to understand
true hyperdimensional structure, partic-
ularly when rotating about an invisible
axis. We believe that a solid structure as
opposed to a point cloud would provide
the visual continuity to alleviate much of
this problem. This solid structure is what
we identify with the manifold M. To get
a handle on the procedure for estimating
a manifold we note a d-ridge is the extre-
mal d-dimensional feature on a hyper-
space structure of dimension greater than
d. The 0-ridge corresponds to the usual
mode. For some d, we contend that a rea-
sonable estimate of M is the d-ridge of
the n-dimensional density function of (X_1, . . . , X_n). This in effect is estimating the
d-dimensional summary manifold with a
mode-like estimator. In essence what we are suggesting is to skeletonize hyperdimensional structures. This type of process
has been done in the image processing
context with good computational effi-
ciency.
The inference technique then is to es-
timate M nonparametrically. This reduces
the scatter diagram to a geometrically de-
scribed hyperdimensional solid. This can
then be explored geometrically using
computer graphics. Features can then be
parametrized and a composite parametric
model constructed. The inference can be
completed by a confirmatory analysis on
the (perhaps nonlinear) parametric model.
Techniques presently in use often assume
linearity or special forms of nonlinearity,
e.g. polynomial or spline fits, and often assume additivity of low dimensional subcomponents, e.g. projection pursuit. We
would not like to take this perspective a
priori. While this methodology may con-
sequently seem complex, the premise is
that geometric-based structural analysis
will offer tools superior to traditional
purely analytic methods for building high
dimensional functional models.
The key to this development is to ap-
preciate the role of ridges in describing
relationships between random variables.
A simple two-dimensional example is il-
lustrated in Figure 3.3. Note that the con-
tours represent the density and the 1-ridge
represents the functional relationship be-
tween x and y in a traditional linear
regression. Since densities are key, a fast,
efficient multidimensional density esti-
mation technique is important.
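To make the two-dimensional case of Figure 3.3 concrete, the following rough sketch (our code; the synthetic data, the grid sizes, and the use of an off-the-shelf kernel density estimator are all illustrative assumptions) traces, for each x on a grid, the y at which the estimated joint density is largest, a crude stand-in for the 1-ridge summary line.

import numpy as np
from scipy.stats import gaussian_kde

# Synthetic data with a roughly linear relationship, as in Figure 3.3.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

kde = gaussian_kde(np.vstack([x, y]))          # two-dimensional density estimate
xs = np.linspace(x.min(), x.max(), 60)
ys = np.linspace(y.min(), y.max(), 200)

# For each x on the grid, record the y maximizing the estimated density:
# a crude approximation to the 1-ridge summary line.
ridge = []
for xv in xs:
    dens = kde(np.vstack([np.full_like(ys, xv), ys]))
    ridge.append((xv, ys[np.argmax(dens)]))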
The slowness of the traditional kernel
estimators in a high dimensional setting
arises from the fact that they are essen-
tially point estimators. That is to say, to
compute f(x) one needs to do smoothing
in a neighborhood of x. For a satisfying
visual representation, the x’s must be cho-
sen reasonably dense. Moreover, tradi-
tional kernels are frequently nonlinear
functions and thus the computation in-
volves the repeated evaluation of a non-
linear function at each of the observations
of a potentially large data set on a dense
Fig. 3.3. Two dimensional scatter diagram with the contours of the two dimensional density function
and the 1-ridge summary line.
set of points in the domain. Furthermore,
as dimension increases, exponentially
more domain points are required to main-
tain a constant number of points per unit
hypervolume. Traditional kernel estima-
tors become essentially useless for even
relatively low dimensions.
The traditional histogram provides an
alternative strategy. The histogram is a
two-step procedure. The first step is a tes-
selation of the line. The second step is an
assignment of each observation to a tile
of that tesselation. The computation of
the actual density estimator amounts to a
simple rescaling of tile- (cell-) count. The
histogram is a global estimator since the
function is constant on the tiles which are
finite in number and, indeed, relatively
few in number compared with the dense-
ness of points required for the kernel es-
timator. The traditional histogram, of
course, operates with fixed equally spaced
uniform tiles. There is no reason why the
tiles must be fixed or uniform. Wegman (1975) [7] suggests a data-dependent tesse-
lation in the one dimensional setting and
shows that, if the number of tiles is al-
lowed to grow at an appropriate rate with
the increase of sample size, then asymp-
totic consistency can be achieved.
The ingenious papers by Scott (1985, 1986) [5, 6] introduce the notion of the average shifted histogram, ASH. Scott recognizes the computational speed of a global estimator such as the histogram.
His algorithm computes the histogram for
a variety of tesselations and then averages
these together to obtain smoothing proper-
ties. In this paper we are suggesting a
combination of these two ideas. We pro-
pose a data-driven tesselation of the fol-
lowing sort. Take an α% (10%) subsample of the sample. Use these points to
form a Dirichlet tesselation of n-space. A
two dimensional example is given in Fig-
ure 3.4. The tiles of the Dirichlet tesse-
lation form the data-dependent convex
regions upon which to base the density
estimator. One pass through the data will
be sufficient to classify each point ac-
cording to tile and thence a simple res-
caling to compute the estimator. Re-
peated subsampling will yield additional
estimators which can then be averaged in
the manner of Scott’s ASH. The details
of this algorithm need to are being ex-
plored, but the following conjectures are
made. Asymptotic properties similar to
those found in Wegman (1975)’ hold.
Maximum likelihood and nearest neigh-
bor properties will hold. Computational
efficiency will be substantially better than
with kernel methods. Because of the re-
peated sampling, bootstrap-type behavior
will hold.
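A rough sketch of how the proposed estimator might be organized is given below. It is our code under stated assumptions: the data are a Euclidean array, tile volumes are approximated by Monte Carlo sampling of the bounding box (a detail not specified in the text), and every function and parameter name is ours. It is meant only to make the subsample, tessellate, count, rescale, and average cycle concrete, not to settle the asymptotic or algorithmic questions.

import numpy as np
from scipy.spatial import cKDTree

def tessellation_density(data, query, frac=0.10, n_repeats=20, n_mc=200_000, seed=0):
    """Sketch of a data-driven Dirichlet-tessellation density estimate:
    repeatedly subsample the data, use the subsample as cell centers,
    estimate the density on each cell as (cell count) / (n * cell volume),
    evaluate at the query points, and average over repeats (ASH-style)."""
    data, query = np.asarray(data, float), np.asarray(query, float)
    n, d = data.shape
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    box_vol = float(np.prod(hi - lo))
    uniform = rng.uniform(lo, hi, size=(n_mc, d))         # points for volume estimates

    estimates = np.zeros((n_repeats, len(query)))
    for r in range(n_repeats):
        idx = rng.choice(n, size=max(2, int(frac * n)), replace=False)
        tree = cKDTree(data[idx])                          # subsample points = cell centers
        cells_data = tree.query(data)[1]                   # nearest center for each data point
        cells_unif = tree.query(uniform)[1]
        cells_query = tree.query(query)[1]
        counts = np.bincount(cells_data, minlength=len(idx))
        vols = np.bincount(cells_unif, minlength=len(idx)) / n_mc * box_vol
        dens = counts / (n * np.maximum(vols, 1e-12))      # rescale cell counts to densities
        estimates[r] = dens[cells_query]
    return estimates.mean(axis=0)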
It should be clear that the notions we
are suggesting rely heavily on the gener-
alization of present geometric algorithms
to higher dimensional space. The Diri-
chlet tesselation, for example, has useful
algorithms in 2 and 3 dimensions (see Bowyer, 1981 [1]; Green and Gibson, 1978 [2]), but the analogues in higher dimensions are poorly developed (see Preparata and Shamos, 1986 [4]). Thus a fundamental ex-
ploration of algorithms for these tessela-
tions in hyperspace must still be done.
Interestingly enough, the computation of tesselations in hyperspace is closely related to the computation of convex hulls (again see Preparata and Shamos, 1986 [4], p. 246).
The construction of the n-dimensional
Dirichlet tesselation (Voronoi diagram) is
a key element of the density estimation
technique we suggest. A related issue is
the assignment problem, that is, given the
tiles, what is the best algorithm for de-
termining to which of the tiles a given
observation belongs. In part the 2-dimen-
sional Voronoi diagrams were con-
structed to answer nearest-neighbor-type
questions. That is, given n points, what is
the best algorithm (minimum compute
time) for finding the nearest neighbor. In
general, the answer is known to be O(nd)
where d is the dimension of the space.
There is some reasonable expectation
that, since the construction of the tesse-
lation and the assignment problem are
closely linked, there is an efficient one-
step algorithm for sorting the observa-
Fig. 3.4. Dirichlet tesselation (dashed lines) and Delaunay triangulation (solid lines) of the plane
based on 13 points.
tions in convex regions about certain
nearest neighbors. If this could be done
in linear, near linear or even polynomial
time, the density estimation technique we
are suggesting may be computationally
feasible for relatively high-dimensional
cases. In any case, the sorting, clustering
and classification results which form the
core of computational geometry also form
the core of this approach to higher di-
mensional data structures.
We hope that this example makes clear
the mathematical complexity inherent
in computational statistics. It is our view
that there is an extremely important
role for mathematical statistics under
the general rubric of computational
statistics.
3.3 Example: Data Set Mapping
A traditional way of thinking about the
model building process is that we begin
with a fixed data set and apply a number
of exploratory procedures to it in search
of structure within the data set. The data
set is regarded as fixed and the analysis
procedures as variable. Of course, the
model is iteratively refined by checking
the residual structure until a suitable
model reduces the residuals to an unstruc-
tured set of “random numbers.” We sug-
gest an alternative way of thinking.
Normal Mode of Analysis:
Data Set Fixed ← Try Different Techniques on It
Alternative Mode of Analysis:
Techniques Set Fixed ← Try Different Data on It
We have in mind the following. With the
cost or availability of computational re-
sources essentially not a significant con-
sideration in the analysis procedure (i.e.,
they are essentially a free good), we can
afford to standardize on say a dozen or
more techniques which are always com-
puted no matter what data are presented.
Others, of course, would still be option-
ally available. Such standard techniques
might include for example standard de-
scriptive statistics, smoothers, spectral es-
timators, probability density estimators
and graphical displays including 3-D pro-
jections, scatter diagram matrices, par-
allel coordinate plots, grand tours, Q-Q
plots, variable aspect ratio plots and so
on. Each of these might be implemented
on a different node of a parallel comput-
ing device and displayed in a window of
a high resolution graphics workstation,
analogous to having a set of papers on our
desk through which we might shuffle at
will, the difference being that each sheet
of paper would, in effect, contain a dy-
namic, possibly multidimensional display
with which the analyst might interact. We
have in mind viewing each of these data
representations as an attribute of the data
set (object orientation) so that if we mod-
ify the data set representation in one win-
dow, the fundamental data set is modified
and consequently its representations in
all of the windows are modified simulta-
neously.
With the set of techniques fixed, a data
analysis proceeds through an iterative
mapping of the data set, i.e. the data set
is iteratively re-expressed. This is done
by a series of techniques. A discriminant
procedure or a graphical brushing pro-
cedure allows us to transform one data
set into a number of more homogeneous
data subsets. Data transformations, re-
scaling either linear or nonlinear, clus-
tering, removing outliers, transforming to
ranks, bootstrapping, spline fitting and
EDWARD J.
WEGMAN
model building are all techniques for
mapping an old data set into one or more
new ones. Notice that we treat model
building as a simple data map. It is our
perspective that a model fit is just a trans-
formation of one data set to another (spe-
cifically the residuals) similar to any of
the others mentioned.
Two points of interest can be made.
First, the analysis of a data set can be
viewed as the development of a data tree
structure—each node is a data set and
each edge is a transformation or re-
expression of that data set to a new data
set. The data tree structure preserves the
record of the data analysis, indeed, the
data tree is the data analysis. At the bot-
tom of the data tree presumably we will
have data sets with no remaining structure
for, if not, then another iteration of our
analysis and another edge in the data tree
are required. The second point to be made
is that thinking in the terms just described
helps clarify our thinking by conceptually
separating the representational methods
(e.g. graphics, descriptive statistics) from
the re-expression methods (transforma-
tions, brushing, outlier removal, model
building). These are really two separate
functions of our statistical methodology
which are not commonly distinguished,
but when distinguished, aid in clearer
thinking. We particularly think it is help-
ful to understand, for example, that a
square-root data transformation and an
ARMA-model fitting are really quite
similar operations each resulting in a new
data set save that in the latter case we usu-
ally call the new data set the set of resid-
uals. Indeed, when the data analysis is
completely laid out as a data tree, the
full model is really accumulated by start-
ing at the root node (original data set) and
following the edges all the way to the end-
ing node (unstructured residual data set).
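A toy sketch of the data-tree bookkeeping just described might look as follows; the class and method names are ours, and the transformations in the comments are placeholders rather than anything prescribed by the paper.

from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class DataNode:
    """A node of the data tree: the data set at this stage of the analysis,
    together with the child data sets produced by re-expressing it."""
    data: Any
    label: str = "root"
    children: List["DataNode"] = field(default_factory=list)

    def apply(self, transform: Callable[[Any], Any], label: str) -> "DataNode":
        """Re-express this data set (transformation, model fit, outlier
        removal, ...) and record the result as a child node (an edge)."""
        child = DataNode(transform(self.data), label)
        self.children.append(child)
        return child

# e.g. root -> square-root re-expression -> residuals from a fitted model:
# root = DataNode(raw_data)
# sqrt_node = root.apply(lambda d: d ** 0.5, "square-root transform")
# resid_node = sqrt_node.apply(fit_model_and_return_residuals, "model residuals")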
4. Curriculum Implications
First of all, it is important to recognize
that operating from the perspective of
computational statistics does not imply
that traditional statistical methodologies
are obsolete. While some automated data
sets will have the characteristics described
earlier, many will not. In addition there
will, no doubt, continue to be many care-
fully designed experiments in which each
data point is costly to acquire, and, hence,
traditional methods will continue to be
used. Nowhere is this probably more true
than in the case of bio-medical clinical
trials. We believe there are certain shifts
in emphasis, however, that may be useful
to recognize.
In terms of mathematical preparation,
real and complex analysis and measure
theory are frequently emphasized as ele-
ments of a graduate curriculum. These are
tied to probability theory and the more-
or-less standard IID parametric assump-
tions. To the extent that the questions we
ask of data are less structured, much of
the need for the standard frameworks and
probability models is lessened. In their
place we will tend to have a more geo-
metric analysis and a more function-ori-
ented, nonparametric framework. The
first and second examples were deliber-
ately chosen to illustrate elements of pro-
jective geometry, differential geometry
and computational geometry. In addition,
there is a strong element of nonpara-
metric functional inference in example
3.2 suggesting a heavier reliance on func-
tional analysis. We believe therefore that
functional analysis and geometric anal-
ysis should become part of the mathe-
matical precursors to a statistics curricu-
lum.
That computation will play a role is obvious; exactly what form it will take is not
so obvious. Certain elements of comput-
ing science would appear to be candi-
dates. Computational literacy means
more than FORTRAN or C programming
or familiarity with the statistical pack-
ages. Clearly such concepts as object-ori-
ented programming, parallel architec-
tures, computer graphics and numerical
methods will play a significant role in
future curricula.
In terms of the statistical core material
itself, I believe the obsession with the
parametric framework (be it classical or
Bayesian) must end. It is important to
recognize that we are often dealing with
opportunistically collected data and that
classical formulations are too rigid. More-
over, as indicated earlier, we are asking
more of our data than simply distribu-
tional questions. Indeed the more inter-
esting questions are the structural ques-
tions (i.e., in the face of uncertainty, how
are two or more random variables related
to each other?). Thus, it would seem that
there should be some de-emphasis of tra-
ditional mathematical statistics (testing
and estimation) and more emphasis on
exploratory and structural inference. A
contrast between a traditional first year
curriculum and a revised curriculum is
laid out in Figure 4.1.
1st sem: Mathematical Statistics I; Applied …; Complex Variables; Measure Theory
2nd sem: Mathematical Statistics II; Design of Experiments; Applied Statistics; Probability Theory
Fig. 4.1.a. A straw-man traditional first year curriculum for a Ph.D. Program in Statistics. Linkages between courses are indicated by directed edges.
1st sem: Statistical Inference I; Measure and Probability; Geometric Methods in Statistics; Analytic and Numerical Methods
2nd sem: Statistical Inference II; Functional Analysis for Statistics; Data Analysis and Graphics; Structural Inference
Fig. 4.1.b. A straw-man first year graduate curriculum in Computational Statistics. Additional course work should include data structures and programming, and computing architectures.
Acknowledgements
This paper has benefitted from several
conversations with Jerry Friedman. I
would like to thank him for not only these
discussions, but also his enthusiasm for computational statistics. This research
was supported by the Air Force Office of
Scientific Research under Grant AFOSR-
87-0179 and by the Army Research Office
under Grant DAAL03-87-G-0070.
References Cited
1. Bowyer, A. (1981), "Computing Dirichlet Tesselations," Computer J. 24, 164-166.
2. Green, P. J. and Gibson, R. (1978), "Computing Dirichlet Tesselations in the Plane," Computer J. 21, 168-173.
3. Inselberg, A. (1985), "The Plane with Parallel Coordinates," The Visual Computer 1, 69-91.
4. Preparata, F. P. and Shamos, M. L. (1986), Computational Geometry: An Introduction, New York: Springer-Verlag, Inc.
5. Scott, D. W. (1985), "Average Shifted Histograms: Effective Nonparametric Density Estimators in Several Dimensions," Ann. Statist. 13, 1024-1040.
6. Scott, D. W. (1986), "Data Analysis in 3 and 4 Dimensions with Nonparametric Density Estimation," in Statistical Image Processing and Graphics, (Wegman, E. and DePriest, D. eds.) New York: Marcel Dekker, Inc.
7. Wegman, E. J. (1975), "Maximum Likelihood Estimation of a Probability Density," Sankhya Ser. A 37, 211-224.
8. Wegman, E. J. (1986), "Hyperdimensional Data Analysis Using Parallel Coordinates," Technical Report 1, Center for Computational Statistics and Probability, George Mason University, Fairfax, VA, July 1986.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 323-332, December 1988
Statistical Analysis of Experiments
to Measure Ignition of Cigarettes
Keith R. Eberhardt
Statistical Engineering Division
Center for Computing and Applied Mathematics
National Institute of Standards and Technology
Administration Building, Room A337
Gaithersburg, Maryland 20899
ABSTRACT
Under the Cigarette Safety Act of 1984, NIST was given the task of studying several
types of commercial and experimental cigarettes to determine their relative propensities
to ignite soft furnishings. The analysis of the data came under close scrutiny by the
Technical Study Group appointed to oversee the research. In one experiment where the usual chi-squared test could not be readily justified, an extension of Fisher's Exact Test to 2 × 12 contingency tables was adopted. In another experiment, a modification of the angular transformation for count data was used along with normal probability plots of the effects to analyze a 2⁵ factorial experiment.
Key Words: angular transformation, chi-squared test, contingency table, factorial exper-
iment, Fisher’s exact test, normal probability plot, statistical analysis
1. Introduction
This report describes some statistical
data analysis aspects of two related re-
search projects'” concerned with the pro-
pensity of commercial and experimental
cigarettes to ignite upholstered furniture.
These projects were conducted in the
Center for Fire Research at the National
Institute of Standards and Technology
(NIST, formerly the National Bureau of
Standards) during 1986 and 1987.
Cigarette ignition of furniture is by far
the leading cause of fire deaths and in-
juries in the United States. While the ig-
nition resistance of manufactured furnish-
323
ings has been greatly improved over the
last decade, fire casualties could be fur-
ther reduced if cigarettes were manufac-
tured to cause fewer ignitions. In response
to this situation, Public Law 98-567, the
“Cigarette Safety Act of 1984,” estab-
lished a Technical Study Group on Cig-
arette and Little Cigar Fire Safety com-
posed of representatives of the tobacco
industry, the furniture industry, the fire
service, the public health advocacy, and
concerned Federal agencies. As part of
their charge to design and oversee a re-
search program on cigarette ignition pro-
pensity, the Technical Study Group en-
gaged NIST to conduct the laboratory
experiments described here.
The following two sections describe ex-
periments designed to compare the per-
formance of various types of commercial
and experimental cigarettes under a va-
riety of test conditions. To measure the
ignition propensity of cigarettes, test
mockups were constructed to simulate
conditions corresponding to what hap-
pens when a lighted cigarette is dropped
on an upholstered chair. The mockups
were constructed using a variety of up-
holstery fabrics and padding types which
were chosen to represent a range of sub-
strates (i.e. fabric and padding combi-
nations) that can ignite with commercial
cigarettes. For each test condition, four
or five cigarettes were lighted and placed
on mockups, and the number of cigarettes
that ignited the substrate was recorded.
Thus, the basic data in these experiments
consist of counts of the number of igni-
tions in a given number of trials. A com-
mon probability model, the binomial dis-
tribution, applies to the data for both cases
to be described, but different statistical
methods were required due to differences
in the questions asked and in the exper-
iment designs used.
2. Differences among
Commercial Cigarettes
An important part of the first NIST
project for the Technical Study Group was
an experiment to determine whether there
are measurable differences in the ignition
propensities of different types of com-
mercial cigarettes. To study this question,
a test protocol was developed under which
12 types of commercial cigarettes were
tested on 18 different mockup configu-
rations. The experiment results are dis-
played in Table 1..The 18 mockup con-
figurations, which are described fully in
[1], differ in type ofsubstrate used, whether
or not the cigarette was placed in a crev-
ice, and whether or not the cigarette un-
der test was covered with a piece of cotton
sheeting.
Inspection of Table 1 reveals that many
of the substrates were either too ignition
prone (columns 1-3), or too resistant
(columns 12-18), to show any differences
among the cigarettes tested. This was dis-
appointing to the experimenters because
the fabric and padding combinations used
for the mockups had been pretested and
were chosen carefully to represent cases
where differences could occur. However,
the pretest samples came from different
lots of material than the larger quantities
which were procured for construction of
the mockups, and although both were
nominally the same, their behavior in cig-
arette ignition tests was quite different.
When the test data of Table 1 were pre-
sented to the Technical Study Group, a
wide variety of statistical analyses—with
contradictory interpretations—were pro-
posed. The reader can readily imagine how
the vested interests of the organizations
represented on the Technical Study Group
would lead some members to favor anal-
yses with conclusions opposite to those
preferred by other members. When the
collection of competing statistical anal-
yses was presented to this author for his
opinion, it was clear that whatever anal-
ysis he proposed would be carefully, and
perhaps critically, reviewed by at least half
of the members of the Technical Study
Group. Thus careful attention was given
to choosing statistical methods based on
assumptions and mathematical approxi-
mations that would be readily accepted.
Some of the proposed analyses of Table
1 treated pairwise comparisons between
cigarettes, within a given test configura-
tion, as constituting a 2 x 2 contingency
table with four trials for each of two cig-
arettes. The most extreme difference in
such a table has 4 ignitions for one ciga-
rette and 0 for the other. Applying the
chi-squared test to a cigarette pair show-
ing a 4 vs. 0 difference in ignitions yields
a significance level of 0.005, which would
indicate strong evidence of a true differ-
ence. Use of the chi-squared test for this
situation was criticized because the sam-
ple sizes (4 cigarettes of each type tested)
Table 1.—Comparison of Ignition Propensities for Commercial Cigarettes: Numbers of Ignitions in 4 Trials
[The body of the table is not legibly reproduced here; it gives, for each of the 12 cigarette types, the number of ignitions in 4 trials under each of the 18 test configurations, together with row and column totals.]
*The 18 test configurations represented here are fully described in reference [1].
†The 12 cigarette types represent 12 different commercial cigarette packings which are distinguished by name, length, sometimes diameter, whether menthol or non-menthol, whether filter or non-filter, and by package type (e.g. soft pack).
are too small to justify the approximation
implied by use of chi-squared tables. To
properly accommodate the small sample
sizes, use of Fisher’s Exact Test had also
been suggested. By this criterion, a 4 vs.
0 difference in ignitions corresponds to a
significance level of only p = 0.029 —
still less than the often-used figure of 0.05,
but substantially larger than 0.005. While
a result with p = 0.029 might be considered statistically significant were there only
one pair of cigarettes under considera-
tion, the fact that there are 66 possible
pairs among the 12 cigarette types further
diminishes the strength of evidence im-
plied by obtaining one, or a few, p-values
less than 0.05. One commenter pointed
out that one should expect to obtain an
average of 3.3 "significant differences" by
chance when the 0.05 level is used for 66
comparisons.
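For readers who want to check the two significance levels quoted above, a small sketch using SciPy's standard contingency-table routines (our code, not part of the original analysis) reproduces them for the extreme 4-versus-0 pattern.

from scipy.stats import chi2_contingency, fisher_exact

# 2 x 2 table for the most extreme pairwise outcome:
# 4 ignitions out of 4 trials for one cigarette, 0 out of 4 for the other.
table = [[4, 0], [0, 4]]

chi2, p_chi2, dof, _ = chi2_contingency(table, correction=False)
_, p_exact = fisher_exact(table)

print(f"chi-squared approximation: p = {p_chi2:.3f}")  # about 0.005
print(f"Fisher's Exact Test:       p = {p_exact:.3f}") # about 0.029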
Another proposed analysis was based
on the “TOTAL” column of Table 1. In
this analysis, the chi-squared test was used
to test the global hypothesis that all cig-
arettes have equal probabilities of igni-
tion. The result is χ² = 9.7 on 11 degrees
of freedom, a value that does not ap-
proach significance. A fault with this ap-
proach is that the row totals from the table
do not satisfy the conditions required for
validity of the chi-squared test. A prac-
tical indication of the problem can be seen
by noting that the computed value of χ² would change by a large amount if columns 1-3 and 12-18 were not included in the row totals. Alternately, if a sufficiently large number of "flat" columns like 1-3 and 12-18 were appended to the table, the computed value of χ² for the
TOTAL column could be made to ap-
proach zero. [Probabilistically, the situ-
ation can be characterized by noting that
the row totals do not have binomial dis-
tributions, as is assumed by use of χ², even
though the individual entries within the
rows are binomial. The large differences
in ignition behavior across the 18 test con-
figurations imply that the row totals are
distributed like sums of binomial vari-
ables, but with different probabilities of
ignition for each term in the sum. Such a
sum is not binomial.]
The approach finally adopted for Table
1 performs a separate analysis for each
column (test configuration) of the table.
326
Lewontin and Felsenstein have shown that the chi-squared test for 2 × N contin-
gency tables is valid when the expected
frequencies exceed 1.0 in all cells. Since
the data for configurations 8 and 9 (only)
satisfy this condition, the chi-squared test
was used to analyze each of these columns
as a 2 × 12 contingency table. Configuration 8 shows highly significant differences between cigarettes (χ² = 36.6, p < 0.001) while no significant differences are indicated for configuration 9 (χ² = 12.9, p = 0.30). These two tests can be combined by adding the respective χ² values to obtain χ² = 49.6 on 22 degrees of free-
dom, indicating highly significant differ-
ences for the combined tests (p < 0.001).
What about the remainder of the table?
Clearly, no differences between cigarettes
are indicated for configurations 1—3, where
all cigarettes ignited, or 12-18, where none
ignited. The remaining columns, 4—7, 10
and 11, can be evaluated for significant
differences in the numbers of ignitions by
an exact calculation, analogous to Fisher’s
Exact Test for a 2 × 2 contingency table.
This procedure is based on the conditional
probability distribution of the data given
the total number of ignitions, and it re-
quires enumeration of essentially all pos-
sible patterns of ignitions and non-igni-
tions that are consistent with the marginal
totals of the observed data. Applying this
exact test to columns like 8 and 9, which
have relatively large numbers of both ig-
nitions and non-ignitions, would have been
a very tedious process. Fortunately the
chi-squared test could be used in those
cases, and it is known that the two pro-
cedures give practically identical results
in situations where use of χ² is valid.
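The enumeration is tedious by hand; the sketch below (our code, with illustrative parameter names and made-up example counts) approximates the same conditional test by Monte Carlo, holding each column at 4 trials and the total number of ignitions fixed, which is often a convenient check on an exact calculation.

import numpy as np

def conditional_test_mc(counts, n_trials=4, n_sim=200_000, seed=1):
    """Monte Carlo approximation to the exact conditional test for a 2 x N
    table of ignition counts (one column per cigarette type, n_trials per
    column), conditioning on the total number of ignitions."""
    counts = np.asarray(counts)
    k = counts.size
    total = int(counts.sum())
    expected = total / k
    # Discrepancy of the ignition counts from their common expectation;
    # with equal column totals this orders tables like the Pearson statistic.
    stat_obs = ((counts - expected) ** 2).sum()

    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n_trials)       # column label of each trial
    exceed = 0
    for _ in range(n_sim):
        ignited = rng.permutation(labels)[:total]    # which trials ignited under the null
        sim = np.bincount(ignited, minlength=k)
        exceed += ((sim - expected) ** 2).sum() >= stat_obs - 1e-9
    return exceed / n_sim

# illustrative call with made-up counts for 12 cigarette types:
# p = conditional_test_mc([4, 0, 0, 1, 2, 0, 0, 3, 1, 0, 0, 1])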
Of these remaining configurations, col-
umns 6 (p = 0.003) and 7 (p = 0.011)
show significant differences between the
cigarettes tested. None of the other con-
figurations show differences that approach statistical significance.
Overall, the table shows some signs of
interaction between test configurations
and cigarette types. But to conclude that
these data do not provide evidence that
commercial cigarettes differ in ignition
propensity would seem unreasonable.
3. A Factorial Experiment
To identify characteristics of cigarettes
that could lead to a reduction in ignition
propensity, a second experiment was de-
signed using cigarettes of well-charac-
terized composition and construction
supplied by the cigarette industry. The
experimental cigarettes were custom-
made to differ on five design character-
istics, each at two levels: tobacco packing
density (low and high), cigarette circum-
ference (21 mm and 25 mm), paper
permeability (low and high), paper citrate
concentration (0.0 and 0.8%), and to-
bacco type (burley and flue-cured). Small
lots of cigarettes were produced corresponding to each of the 2⁵ = 32 possible
combinations of the five design factors,
and five cigarettes of each type were tested
for ignition propensity on upholstered
furniture mockups, as described above.
Four test configurations were used for the
experiments; these were chosen, based on
the experience gained from testing com-
mercial cigarettes, to be conditions where
differences between cigarettes could be
shown. The data resulting from the tests,
shown in Table 2, constitute a complete 2⁵ factorial experiment for each of the test
configurations.
The standard statistical technique for
analyzing data from a factorial design is
the analysis of variance (ANOVA). How-
ever, the hypothesis tests and estimation
procedures of ANOVA produce valid sta-
tistical inferences only if the data for each
cell of the table (defined by the cigarette
design and test configuration) come from
populations that are at least approxi-
mately normal with equal variances. In
fact, the data in Table 2 are binomial count
data, and thus not well-modelled by equal
variance normal distributions: the binom-
ial distribution is asymmetric and the vari-
ance depends strongly on the mean. To
Table 2.—Ignition Propensity Results for Experimental Cigarettes

                                                         Number of Ignitions by
              Cigarette Design*                          Test Configuration†
Packing    Permea-    Circum-    Citrate    Tobacco
Density    bility     ference    Conc.      Type         1      2      3      4
E          L          21         N          B            0      1      0      0
E          L          21         N          F            1      3      0      0
E          L          21         C          B            0      3      3      0
E          L          21         C          F            0      5      1      0
E          L          25         N          B            3      2      2      0
E          L          25         N          F            3      1      0      0
E          L          25         C          B            5      5      =      0
E          L          25         C          F            5      3      2      0
E          H          21         N          B            3      os     0      0
E          H          21         N          F            4      5      3      0
E          H          21         C          B            4      5      2      0
E          H          21         C          F            4      5      5      0
E          H          25         N          B            5      3      5      0
E          H          25         N          F            5      5      2      0
E          H          25         C          B            5      5      5      0
E          H          25         C          F            5      5      5      0
N          L          21         N          B            2      5      5      0
N          L          21         N          F            5      5      5      1
N          L          21         C          B            3      5      5      0
N          L          21         C          F            5      5      5      0
N          L          25         N          B            5      5      5      3
N          L          25         N          F            5      5      5      2
N          L          25         C          B            5      5      5      3
N          L          25         C          F            5      3      3      3
N          H          21         N          B            5      5      5      -
N          H          21         N          F            5      5      5      5
N          H          21         C          B            3      5      5      2
N          H          21         C          F            5      5      5      =
N          H          25         N          B            5      5      5      5
N          H          25         N          F            5      5      5      5
N          H          25         C          B            5      2      5      5
N          H          25         C          F            5      5      5      5
*Design Factors:
Packing Density: E = low (expanded tobacco), N = high (non-expanded tobacco)
Permeability: L = low, H = high
Circumference: 21 = 21 mm, 25 = 25 mm
Citrate Concentration: N = 0.0%, C = 0.8%
Tobacco Type: B = Burley, F = Flue-cured
†Test configurations are described in reference 2.
address this situation, the basic data were transformed using the Freeman-Tukey modification of the commonly used angular transformation.⁵ This transformation produces a response variable having a more nearly symmetric distribution with nearly constant variance. The formula for obtaining the transformed response variable is

Y = 0.5{ARCSIN[SQRT(X/6)] + ARCSIN[SQRT((X + 1)/6)]},
Table 3.—Significance Probabilities (in Percent) of Design Factors for Experimental Cigarettes
(Experimental error was estimated from 4- and 5-way interactions.)
Test Configuration*
Design Factors 1 2 3 4
D (Packing Density) 0.05% 0.10% 0.01% 0.01%
P (Permeability) 0.08 0.23 0.69 0.01
R (Circumference) 0.03 85 Del 0.01
C (Citrate Conc.) 25 2.0 0.69 7.6
T (Tobacco Type) 8.9 52 56 7.6
Factor Interactions
D x P                      W555)     0.23      0.69      0.01
D x R                      0.45      85        Pl        0.01
D x C                      43        2.0       0.69      7.6
D x T                      45        52        56        7.6
P x R                      0.93      52        39        Se)
P x C                      45        6.6       85        DI
P x T                      i         85        7.4       2)
R x C                      29        85        62        22
R x T                      8.9       328       2.8       22
C x T                      45        52        62        81
*Test configurations are described in reference 2.
[Figure 1 plot appears here: estimated effects plotted against normal order statistic medians, with 95% and 99% limits indicated.]
Fig. 1. Normal probability plot of estimated effects from a 2⁵ factorial analysis of variance for the data in Table 2 for test configuration 1. The factors are coded as: 1 = Citrate concentration, 2 = Paper permeability, 3 = Packing density, 4 = Tobacco type (burley or flue-cured), 5 = Circumference (21 mm vs. 25 mm). Multiple factor labels represent corresponding interaction effects. The indicated 95% and 99% limits on the plot were computed using the 4- and 5-way interaction terms to estimate experimental error.
[Figure 2 plot appears here: estimated effects plotted against normal order statistic medians, with 95% and 99% limits indicated.]
Fig. 2. Normal probability plot of estimated effects from a 2⁵ factorial analysis of variance for the data in Table 2 for test configuration 2. The factors are coded as: 1 = Citrate concentration, 2 = Paper permeability, 3 = Packing density, 4 = Tobacco type (burley or flue-cured), 5 = Circumference (21 mm vs. 25 mm). Multiple factor labels represent corresponding interaction effects. The indicated 95% and 99% limits on the plot were computed using the 4- and 5-way interaction terms to estimate experimental error.
where X denotes the number of ignitions
(out of 5 trials).
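A small Python sketch of this transformation (assuming, as above, n = 5 trials so that the divisor is n + 1 = 6):

    from math import asin, sqrt

    def freeman_tukey(x, n=5):
        """Freeman-Tukey transformed response for x ignitions out of n trials."""
        return 0.5 * (asin(sqrt(x / (n + 1))) + asin(sqrt((x + 1) / (n + 1))))

    # freeman_tukey(0) is about 0.21 and freeman_tukey(5) about 1.36 (radians).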
After transformation, the data were
analyzed by standard ANOVA methods.⁶
The results of the ANOVA are summa-
rized numerically in Table 3. While this
summary is adequate to convey the main
conclusions of the analysis, the graphical
summary of the same results given in Fig-
ures 1-4 was much more effective for
communicating with the members of the
Technical Study Group.
Figures 1—4 present normal probability
plots of the estimated factorial effects from
the 2⁵ factorial analysis of variance for the
data of Table 2. Under the null hypothesis
that none of the design factors affects ig-
nition propensity, the estimated effects
would constitute (approximately*) a ran-
dom sample from a normal distribution,
and so the plotted points would be ex-
pected to cluster about a single straight
line on the normal probability plot. The
indicated 95% and 99% limits on Figures
1—4 were computed using the 4- and 5-
way interaction terms in the ANOVA to
estimate experimental error. These limits
correspond, respectively, to 5% and 1%
tests of the hypothesis that the factorial
effects are zero.
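The following Python sketch (not the authors' code) indicates how the 31 estimated effects and the normal probability plot can be produced from the 32 transformed responses. Filliben's approximation to the normal order statistic medians is an assumption, as the paper does not state which plotting positions were used.

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt
    from itertools import combinations, product

    # The 32 transformed responses, one per run, in the same order as the coded
    # design matrix built below; the zeros here are placeholders for real data.
    y = np.zeros(32)

    # Coded levels (-1, +1) for the five factors of the full 2^5 design.
    design = np.array(list(product([-1, 1], repeat=5)))

    # Estimated effect for each of the 31 factorial terms: the difference
    # between the mean response at the + level and at the - level.
    effects = []
    for order in range(1, 6):
        for cols in combinations(range(5), order):
            contrast = np.prod(design[:, list(cols)], axis=1)
            effects.append(np.dot(contrast, y) / 16.0)
    effects = np.sort(np.array(effects))

    # Normal order statistic medians (Filliben's approximation).
    n = len(effects)
    p = (np.arange(1, n + 1) - 0.3175) / (n + 0.365)
    p[0], p[-1] = 1.0 - 0.5 ** (1.0 / n), 0.5 ** (1.0 / n)
    medians = stats.norm.ppf(p)

    plt.plot(medians, effects, "o")
    plt.xlabel("Normal order statistic medians")
    plt.ylabel("Estimated effects")
    plt.show()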
In addition to providing an effective
means of presentation, the normal prob-
ability plot analysis has the advantage that
it automatically encourages consideration
*“Approximately” here relates to the degree of success achieved by the Freeman-Tukey transformation in achieving approximate normality for the distribution of the response variable Y. For exactly normal data, the estimated effects would follow an exact normal distribution. The figures suggest that normality was more nearly achieved in some cases (e.g., Figure 1) than in others (e.g., Figure 2).
[Figure 3 plot appears here: estimated effects plotted against normal order statistic medians.]
Fig. 3. Normal probability plot of estimated effects from a 2⁵ factorial analysis of variance for the data in Table 2 for test configuration 3. The factors are coded as: 1 = Citrate concentration, 2 = Paper permeability, 3 = Packing density, 4 = Tobacco type (burley or flue-cured), 5 = Circumference (21 mm vs. 25 mm). Multiple factor labels represent corresponding interaction effects. The indicated 95% and 99% limits on the plot were computed using the 4- and 5-way interaction terms to estimate experimental error.
of the simultaneous inference aspects of
the analysis.⁷ This relates to the well-
known phenomenon that if, say, 20 hy-
potheses are simultaneously tested at the
5% significance level, one of the hy-
potheses will be rejected, on the average,
even if all 20 are true. In the present case,
the normal probability plots compare the
magnitudes of 31 estimated factorial ef-
fects. Setting aside the 4- and 5-way in-
teraction effects that were used to esti-
mate experimental error, there are 25 null
hypotheses that could be tested (5 main
effects, 10 two-factor interactions, and 10
three-factor interactions.) Even if none of
the factors or interactions were active (i.e.
all 25 null hypotheses were true), the ex-
pected number of rejections at the 5%
significance level would be 1.25 (5% of
25) simply by chance. However, the prob-
ability plot analysis accounts for this phe-
nomenon by drawing attention to only
those effects that appear as outliers on the
plot, whether or not they lie outside the
95% or 99% bands. For example, in Figure 4, several of the estimated interaction effects lie very close to the 99% bands, corresponding to p-values near 1%. However, the fact that those effects do not appear as outliers on the normal probability plot indicates that those p-values should not be interpreted as strong evidence of active effects.
To summarize the results of this exper-
iment, Figures 1-4 and Table 3 show that
two factors, namely, packing density and
paper permeability, were consistently
highly significant across all four test con-
figurations. Two additional factors, cir-
cumference and citrate concentration,
showed clear significance in two of the
test configurations. The factor for tobacco
STATISTICAL ANALYSIS OF CIGARETTE IGNITION PROPERTIES 331
2°53 <2e
ORM EDs bit Fie s
-3 =/) =
0 1 2 5
NORMAL ORDER STATISTIC MEDIANS
Fig. 4. Normal probability plot of estimated effects from a 2⁵ factorial analysis of variance for the data in Table 2 for test configuration 4. The factors are coded as: 1 = Citrate concentration, 2 = Paper permeability, 3 = Packing density, 4 = Tobacco type (burley or flue-cured), 5 = Circumference (21 mm vs. 25 mm). Multiple factor labels represent corresponding interaction effects. The indicated 95% and 99% limits on the plot were computed using the 4- and 5-way interaction terms to estimate experimental error.
type (i.e., burley vs. flue-cured) did not
show a significant effect on number of
ignitions in any of the cases.
The interactions among these factors
were frequently significant whenever the
main effects were. Significant interactions
indicate that the magnitude of the effect
on ignition propensity for a given factor
is not constant across the levels of the
interacting factor. For example, on test
configuration number 1, the significant in-
teraction between Packing Density and
Circumference indicates that the effect of
Packing Density on ignition propensity is
different in magnitude for the smaller cir-
cumference cigarettes than for the larger
circumference cigarettes in the experi-
ment. Detailed study of the data suggests
that many of the significant interactions
can be explained by the single fact that
the maximum number of ignitions per test
condition could not exceed five in this ex-
periment, thus limiting the possible mag-
nitudes of the estimated effects.
In practical terms, the relation between
the cigarette design parameters studied
and ignition propensity can be summa-
rized as follows: The most significant fac-
tor was low packing density, achieved in
this case by using expanded, large particle
size tobacco. Low packing density appar-
ently lowers ignitions because the avail-
able fuel per unit length is reduced. An-
other very influential factor is the use of
low permeability paper, which reduces
ventilation to the tobacco column. Ciga-
rettes with a 21 mm circumference showed
evidence of reducing ignitions in three of
the four test configurations. Again, this
may be due to the reduction in fuel (less
tobacco and paper) per unit length. The
effect of citrate, which is added to ciga-
rette paper to regulate the paper burn rate
and to obtain ash of the desired appear-
ance, did not show a consistent effect on
ignition propensity. And tobacco type
consistently showed no effect on igni-
tions.
4. Acknowledgements
The author is grateful to Ms. Susannah Schiller and Dr. Richard Gann, of the National Institute of Standards and Technology, for their careful reading of, and comments on, this manuscript.
References Cited
1. Krasny, J. F. and Gann, R. G. 1986. Relative
propensity of selected commercial cigarettes to
ignite soft furnishings mockups. NBSIR 86-3421,
[U.S.] National Bureau of Standards.
2. Gann, R. G., Harris, R. H., Jr., Krasny, J. F., Levine, R. S., Mitler, H. E., and Ohlemiller, T. J. 1988. The effect of cigarette characteristics on the ignition of soft furnishings. NBS Technical Note 1241, [U.S.] National Bureau of Standards.
3. Lewontin, R. C. and Felsenstein, J. 1965. The robustness of homogeneity tests in 2 x N tables. Biometrics, 21: 19-33.
4. Plackett, R. L. 1981. The Analysis of Categorical Data, 2nd edition. Macmillan, New York. Section 6.3.
5. Freeman, M. F., and Tukey, J. W. 1950. Transformations related to the angular and the square root. Annals of Mathematical Statistics, 21: 607-611.
6. Box, G. E. P., Hunter, W. G., and Hunter, J. S. 1978. Statistics for Experimenters. John Wiley and Sons, New York. Chapter 10.
7. Daniel, C. 1959. Use of the half-normal plot in interpreting factorial two-level experiments. Technometrics 1: 311-341.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 333-338, December 1988
Environmental Statistics
N. Phillip Ross and Gilah Langner
ABSTRACT
Centralized environmental statistics are needed to provide credible answers to complex
environmental issues. Environmental hazards arise from multiple sources, transcending
geopolitical boundaries, and challenging our limited understanding of how ecosystems
operate. Meanwhile, environmental data are scattered across a number of federal agencies.
To deal coherently with the complex environmental issues, there must be sustained,
planned, long-term data collection and analysis efforts.
Environmental issues are becoming an
increasingly vital concern of our society.
Polls taken early in last year’s presidential
campaign showed that high percentages
of the American population consider the
quality of the environment to be a na-
tional priority. More and more, Ameri-
cans are recognizing the interdependence
of their quality of life with the quality of
the environment in which they live.
However, if one were to ask a repre-
sentative group of American citizens
whether (and in what ways) the state of
the environment had improved or dete-
riorated in recent years, most of the an-
swers received would likely fall in the
“Don’t Know” category. Common sense
and intuition can no longer be relied on
to provide an accurate indicator of the
quality of any individual environmental
Correspondence should be sent to: Dr. N. Phillip
Ross, Chief, Statistical Policy Branch, Office of Pol-
icy, Planning and Evaluation, PM232, U.S. EPA,
401 M. Street S.W., Washington, D.C. 20460
Dr. Ross is Chief of the Statistical Policy Branch
Office of Policy, Planning and Evaluation, U.S.
EPA, Washington, D.C.
Ms. Langner is President of Stretton Associates,
a Washington-based consulting firm specializing in
policy analysis and communications.
resource, let alone of the state of the en-
vironment as a whole. One cannot step
outside and sniff the air to get an accurate
feel for air quality in major cities; indoors,
one cannot rely on one’s senses to deter-
mine whether a house is contaminated by
radon, a radioactive gas that cannot be
seen, smelled, or tasted.
Nor would the experts do much better
than the layperson in answering the same
question. Many old environmental prob-
lems have been addressed, but new ones
have sprung up to take their place. On
balance, it is hard to determine how well
we are doing in keeping our environment
safe and healthy and whether there has
been a net improvement in recent years.
Need for Centralized
Environmental Statistics
One major reason that it is so difficult
to come up with credible answers is that
we are not collecting and integrating the
type of quality-controlled, scientifically-
rigorous data that are necessary in order
to develop a coherent picture of environ-
mental trends. The Environmental Pro-
333
tection Agency (EPA) alone has an in-
formation collection budget of over 120
million hours and spends half a billion
dollars annually on data collection. A va-
riety of other federal agencies are actively
involved in different aspects of environ-
mental data collection as well. However,
within the federal government, there is
no dedicated bureau or agency with re-
sponsibility for the collection, integra-
tion, and analysis of the environmental
data necessary for setting priorities and
determining directions on a national
scale.
Statistical agencies or bureaus are com-
monplace in almost every major federal
department. Such agencies include: the
Bureau of Economic Analysis and the Bu-
reau of the Census at the Department of
Commerce; the Bureau of Justice Statis-
tics at the Justice Department; the Bureau
of Labor Statistics at the Labor Depart-
ment; the National Center for Education
Statistics in the Department of Educa-
tion; the National Agricultural Statistics
Service and the Economic Research Ser-
vice in the Department of Agriculture;
the Statistics of Income Division of the
Internal Revenue Service; the National
Center for Health Statistics in the De-
partment of Health and Human Services;
the Social Security Administration’s Of-
fice of Research and Statistics; the Energy
Information Administration in the De-
partment of Energy; and the Division of
Housing and Demographic Analysis in the
Department of Housing and Urban De-
velopment.
As Paul Portney pointed out in a recent
article,¹
While no one would argue that our cur-
rent measures of population or eco-
nomic activity are exact, it is impossible
to imagine modern government oper-
ating in their absence. Indeed, these
measures drive important federal grant
and entitlement programs; they also
help trigger, and then measure the suc-
cess of, major tax and spending pro-
grams, monetary policies, and even for-
eign policy decisions. . . Simply put, we
have not a single data series for the en-
vironment that goes back as far as even
the most recently established of the ec-
onomic and demographic series. . . nor
one that is subject to the same quality
control, careful measurement proto-
cols, or subsequent thorough analyses.
The absence of an active, centralized
focus for environmental statistics in this
country is felt on an international scale as
well. The United States is one of the few
developed nations that lack a centralized
governmental agency responsible for col-
lecting and integrating environmental data
and for producing a base of quantitative
information to support an annual State of
the Environment Report. The time has
come for environmental statistics to
emerge as the next major focus of federal
statistical data collection.
Defining Environmental Statistics
What do we mean by environmental
statistics? Beyond the simple tabulation
of environmental data, the environmental
statistics process we are speaking of is
much broader. It concerns itself with eval-
uating the state of the biosphere (envi-
ronmental media and the fauna and flora
that inhabit media) and its changes over
time. Environmental statistics thus in-
volves identifying information needs, de-
signing appropriate data collection activ-
ities (such as ambient monitoring and
population surveys), ensuring the quality
of the data, and conducting statistical
analysis and interpretation of data in or-
der to produce meaningful indicators of
the state of the environment.
According to the Statistical Office of
the United Nations, environmental sta-
tistics:
(a) cover natural phenomena, human ac-
tivities that exert impacts on the en-
vironment, and the impacts them-
selves on the environment and on
human living conditions;
(b) refer to the media of the natural en-
vironment, i.e., air, water, land/soil,
and to the man-made environment
which includes housing, working con-
ditions, and other aspects of human
settlements;
(c) synthesize data from different subject
areas and statistical sources to facil-
itate integrated socio-economic and
environmental planning and poli-
cies.
In examining the current roster of en-
vironmental issues, we can identify three
primary reasons why an enhanced capa-
bility to produce environmental statistics
will be an indispensable aid to environ-
mental decision-making in the coming
years.
(1) Increased Complexity of
Environmental Problems and
Solutions
Environmental problems of the 1980’s
have taken on a subtlety and complexity
that did not characterize the air and water
pollution problems of previous decades.
To be sure, there is still no lack of acute
emergencies—chemical fires, toxic waste
dumps lacking security, and trucks car-
rying hazardous materials overturning on
the highway. However, many of our most
pressing environmental problems involve
us in new and uncertain terrain.
For example, setting standards for
chemicals is an increasingly complex pro-
cess, involving the assessment of the ef-
fects of long-term exposure to pollutants
at levels in the parts per billion and tril-
lion. There are vastly greater numbers of
chemicals in production than were ever
contemplated when much of our original
environmental legislation was put in place.
It is becoming clear that human activities
are influencing and changing the environ-
ment—the balance of ecological systems,
the availability of resources, the climate,
the ozone levels. As each day goes by we
are confronted with more and more po-
tentially serious problems.
The questions facing environmental de-
cision-makers are: which of these prob-
lems are most important; which ones
should receive attention and resources;
which ones require immediate action and
which can wait; how do we educate and
motivate the population to change its
mode of interacting with the environ-
ment; how do we convey information about
real health risks, without scaring the pop-
ulation and without encouraging compla-
cency?
As environmental problems become
less visible to our senses, the need arises
for sophisticated approaches to detecting and monitoring the effects of pollutants on human health and the environment. The
physical and engineering sciences have
rapidly progressed to meet the require-
ment of sophisticated detection and mon-
itoring. However, the data collection and
statistical methods needed to appropri-
ately assess the impacts of newly discov-
ered environmental problems have not
kept pace. The costs of collecting ade-
quate environmental data to determine
health effects and to intelligently manage
natural resources are becoming astro-
nomical. Statistical approaches to data
collection and interpretation are the only
solution.
A major difficulty in looking to the fu-
ture is that we have so little information
available from the past. If we started to-
day to coordinate our environmental data
collection activities across the federal gov-
ernment, it would still be ten or more
years before we could start to examine
trends and to use those trends in the de-
cision process.
Certain sources of environmental data
already exist that have the potential to
provide insights into a number of envi-
ronmental processes. Unfortunately,
these sources often represent “encountered data” that cannot be considered a random sample from the original population.³ A variety of situations can give
rise to encountered data—instances where
the only feasible sampling procedure gives
unequal chances to the population units
or where the only data available are his-
torical data sets from diverse sources.
While a great deal of work has been done
on how to use encountered data in the
context of animal sightings and monitor-
ing of marine resources, this work must
now be extended to a broader array of
environmental modeling and monitoring
situations.
Finally, many of the environmental is-
sues now facing us transcend geopolitical
boundaries—acid rain, the greenhouse
effect, sea-level rise, ozone depletion, and
the continued loss of species on the planet,
to name a few. From an international per-
spective, it is time for the United States
to join its Western neighbors in devel-
oping a comprehensive long-term base of
environmental information. Although the
United States has taken the lead in de-
veloping research programs in some of
these areas, we have not yet committed
ourselves to the type of statistically de-
signed monitoring programs that are
needed to support hard decisions on sub-
jects involving national and global envi-
ronmental effects.
(2) Need for Integration
A second compelling reason for giving
more attention to environmental statistics
arises out of a perceptible change in the
prevailing image of the environment.
From viewing the world as a warehouse
of resources available for our benefit, we
are starting to recognize the exhaustible
nature of certain resources, such as oil
and wood, and even certain living species.
What once appeared as “infinite” is now
clearly finite, and in some cases is quickly
being depleted. There is growing talk of
the interrelationships among organisms
that sustain life, and even of the life-sus-
taining nature of the planet itself.
The logical corollary of such an ap-
proach is the need to develop an inte-
grated view of environmental media. We
can no longer deal separately with each
environmental medium—air, water, and
soil—and ignore the ‘‘environmental
merry-go-round” effect of shifting pollu-
tants from one medium to another. En-
vironmental data bases need to reflect this
recognition and allow us to examine and
control for “cross-over” effects.
Here, part of the problem is that in-
formation and responsibilities for various
segments of the environment are diffused
throughout the Federal Government.
While EPA deals with problems resulting
from pollution of the environmental me-
dia, the stewardship of our environmental
resources is in other hands. Thus, the De-
partment of the Interior maintains pri-
mary responsibility for public lands, min-
erals, national parks, and endangered
species; the Department of Agriculture
handles forestry, soil, and conservation;
and the Department of Commerce is in-
volved in oceanic and atmospheric mon-
itoring and research.
As a result, environmental media are
very often addressed separately from the
environmental resources they support.
There remains a serious need for high
quality data suitable for conducting long-
term evaluations of the state of the en-
vironment, considered as a whole. Our
ability to evaluate environmental prog-
ress in the past and set priorities for the
future is compromised by a lack of ap-
propriate, integrated trend indicators.
Within EPA itself, there are limitations
in what we can do because many of the
Agency’s data bases are primarily ori-
ented towards furthering EPA’s compli-
ance monitoring responsibilities. Com-
pliance data do not necessarily lend
themselves to analyses of long-term trends
in the environment. Take, for example,
the issue of waste reduction. Although
EPA maintains numerous data bases with
facility reports on environmental dis-
charges, at the present time there is no
single data base that can be used to mea-
sure waste reduction efforts. Even if it
were possible to “cross-walk” from the
facilities in one data base to those in an-
other (which it generally is not), the data
bases are designed to measure different
things in different units. Measurement of
concentrations of hazardous chemicals in
waste streams cannot provide useful data
on waste generation.
On a wider ecological level, we need to
collect new environmental data in con-
junction with our environmental and eco-
logical models. Using these models, we
are beginning to develop an understand-
ing of how ecosystems work. We need
better ways of recognizing what consti-
tutes a healthy ecosystem and better mon-
itoring skills that will provide early warn-
ing signs of damage or injury to an
ecosystem.
(3) The Public’s Right and Need
to Know
A third need for improved environ-
mental statistics is the public’s right to
know more about the environment. This
goes well beyond the legislated “right to
know” provisions in Title III of the Su-
perfund Amendments and Reauthoriza-
tion Act of 1986. EPA has a clear re-
sponsibility to make a wide range of
environmental statistics available to the
public both to fulfill the mandate of dem-
ocratic government and to establish the
credibility of its decision-making on en-
vironmental issues.
The public’s need for environmental in-
formation is significant in numerous sec-
tors of society. Accurate and up-to-date
environmental information is important
for public decision-making at the state,
county, and local levels. It is equally im-
portant for eliciting responsible natural
resource management decisions from in-
dustry. Increasingly, we are relying on in-
dustry to voluntarily cooperate in con-
serving environmental resources and to
make farsighted management decisions.
These decisions must be based on an ad-
equate information base. A 1984 report
by the World Wildlife Fund,⁴ for example,
identifies 11 corporate decisions that re-
quire the use of resource information, in-
cluding strategic direction, market re-
search, resource acquisition, production
capacity, plant siting, plant design, en-
vironmental compliance, production and
materials purchasing, research and de-
velopment, bank lending, and investment
recommendations.
Members of the public also need a re-
liable, quality-controlled, credible source
of environmental information to help them
evaluate the plethora of health risks that
appear almost daily in the news, and to
help them make judicious personal deci-
sions, whether that involves testing a home
for radon or installing water treatment
devices. A related need is for clear and
usable explanations of environmental
statistics. Given the complexity of our en-
vironment, the advanced scientific tools
being used, and the probability concepts
in which much of our risk information is
couched, the development and publica-
tion of environmental statistics alone will
not necessarily mean that the public
has “access” to the information. Com-
munication of the significance and con-
text of the information is essential to
satisfying the public’s right and need to
know.
The increasing complexity of environ-
mental problems, the need for integrated
environmental approaches, and the need
to provide more environmental informa-
tion to the public, all point to the impor-
tance of giving serious attention to envi-
ronmental statistics. One way of solving
the problems and deficiencies outlined in
this article may be to create a federal En-
vironmental Statistics Agency or Bureau.
Recent months have seen a renewed in-
terest in this idea, with proposals and rec-
ommendations coming from a variety of
sources. At the Environmental Protection
Agency, a Science Advisory Board panel
chaired by Alvin Alm recently recom-
mended the creation of a new ecological
research institute as part of a renewed
EPA commitment to long-term environ-
mental research. The institute would have
responsibilities for ecological research,
environmental monitoring, and statistics.
Whatever the approach, it is apparent that
an enhanced environmental statistics ca-
pability will be vitally important to the
conduct of environmental protection in
this country and in the international sphere
in the coming decade.
References Cited
1. Paul R. Portney, “Needed: a Bureau of Envi-
ronmental Statistics,” in Resources, Resources
for the Future, Winter 1988.
2. Bartelmus, Peter, “Environmental Statistics: Systems, Frameworks and International Approaches,” in Society of American Foresters et al., Int'l Renewable Resource Inventories for Monitoring Conf., Corvallis, OR, Aug. 15-19, 1983, pp. 524(5).
3. Hennemuth, R. C., G. P. Patil, and N. P. Ross, “Encountered Data Analysis and Interpretation in Ecological and Environmental Work: Opening Remarks.” Presented at the annual ASA meeting, San Francisco, 1987.
4. Corporate Use of Information Regarding Natural Resources and Environmental Quality, prepared by Train, Russell E., World Wildlife Fund for the Council on Environmental Quality, 1984.
Journal of the Washington Academy of Sciences,
Volume 78, Number 4, Pages 339-353, December 1988

Some Uses of a Modified
Makeham Model to Evaluate
Medical Practice
R. Clifton Bailey*
Health Standards and Quality Bureau
Health Care Financing Administration
ABSTRACT
The modified Makeham survival model describes the time course of a medical inter-
vention or illness in many cases. A precise understanding of the time course may be used
to evaluate the follow-up time for special studies, to know when the available follow-up
is not adequate, and to balance follow-up time against the number of cases. Also a
knowledge of the time course may be used to evaluate long-term and short-term risk
factors. Long-term and short-term risk factors are studied separately using the widely
available Cox proportional hazards model. This is compared with a fully integrated model
in which the modified Makeham is used with concomitant variables in each of three
structural parameters. The examples rely on data collected by the Health Care Financing
Administration to evaluate the effectiveness of medical interventions on the course of
illness.
*The opinions expressed in this paper are those
of the author and do not necessarily reflect the opin-
ions or policies of the Health Care Financing
Administration.
Correspondence should be sent to: R. Clifton Bai-
ley, Health Standards and Quality Bureau, Health
Care Financing Administration, 2-D-2 Meadows
East, 6325 Security Blvd., Baltimore, MD 21207.
Introduction
The value of data in making decisions
and understanding medical practice is a
long-standing tradition in medicine. This
article describes some statistical tools
which can be used to advance our under-
standing of medical practice. The main
focus is on a modified Makeham survival
model. This basic model describes the time
course of a medical intervention or illness
in many cases. A precise understanding
of the time course may be used to evaluate
the follow-up time for special studies and
to know when the available follow-up is
not adequate. Examples of the compu-
tations demonstrate the balancing of fol-
low-up time and number of cases. Addi-
tional examples demonstrate how a
knowledge of the time course may be used
to isolate long-term and short-term risk
factors. Two approaches are described.
First, the long-term and short-term risk
factors are studied separately using the
widely used Cox proportional hazards
model. This approach is compared with a
fully integrated modified Makeham
model with concomitant variables in each
of three structural parameters. The ex-
amples in this paper rely on data collected
as part of an effort by the Health Care
Financing Administration to evaluate the
effectiveness of medical interventions on
the course of illness.
Recognition of Statistical Methods in
Extracting Information of Value
from Data
Before beginning a technical descrip-
tion of the Makeham model promised
above, it is worth setting the stage with a
few reminders about the role of statistics
in extracting information from data.
In his presidential address before the
First Indian Statistical Conference, 1938,
R. A. Fisher¹ reminded his colleagues that “in the original sense of the word, ‘Statistics’ was the science of Statecraft.” He points out that the task of providing public information as a function of official Statistics “. . . enables the public, if it will, to size up its own problems. The Socratic dictum ‘know thyself’ is applicable even more to peoples than to individuals.” Also he notes that with the development of a theory of estimation and an understanding of the magnitude and the nature of the sampling errors, “The whole tone of
the subject has been altered. The Statis-
tician is no longer an alchemist expected
to produce gold from any worthless ma-
terial offered him. He is more like a chem-
ist capable of assaying exactly how much
of value it contains, and capable also of
extracting this amount, and no more. In
these circumstances it would be foolish to
commend a statistician because his results
are precise, or to reprove because they
are not. If he is competent in his craft,
the value of the result follows solely from
the value of the material given him. It
containsso much information andnomore.
His job is only to produce what it con-
tains.”’ ““To consult the statistician after
an experiment is finished is often merely
to ask him to conduct a post mortem ex-
amination. He can perhaps say what the
experiment died of.”’
There are many notable examples of
the need to extract useful information
from data. For example, Walter Shewhart²
recognized the value of information gen-
erated by industrial processes and de-
veloped methods to chart measurements
sequentially—the control chart. An ex-
ample closer to the subject of this paper
is found in the medical arena. Florence
Nightingale* noted inconsistency in re-
cording the number of deaths at military
hospitals during the Crimean War. At
home she also found that English hospital
records followed no common nomencla-
ture or standard. She worked with Dr.
Farr, of the Registrar-General’s Office to
prepare standard lists for classes and or-
ders of diseases and model Hospital Sta-
tistical Forms to “enable us to ascertain
the relative mortality in different hospi-
tals, as well as of different diseases and
injuries at the same and at different ages,
the relative frequency of different dis-
eases and injuries among the classes which
enter hospitals in different countries, and
in different districts of the same countries.” Furthermore, use of the proposed forms “could enable the mortality in hos-
pitals, and also the mortality from partic-
ular diseases and injuries, and operations
to be ascertained with accuracy; and these
facts, together with the duration of cases,
would enable the value of particular
methods of treatment and of special op-
erations to be brought to statistical proof.”
For her efforts as a “passionate statistician,” Florence Nightingale was made
Honorary Member of the American Sta-
tistical Association.*
Outcomes in Medical Statistics
In evaluating medical practices and in-
terventions, an evaluation of chances of
death seems to be an ultimate concern.
The history of the life table and of medical
statistics to a large degree centers around
techniques for statistical summary of mor-
tality.” Actually the techniques developed
for mortality statistics have broad appli-
cation. For example, in engineering, fail-
ure time is often used as a measure of
outcome when we want to know how long we
can expect a car, a light bulb, or a per-
sonal computer to last. In all of these
problems, classification of the failure point
and items under test are crucial to the
utility of the studies being conducted. Many
problems in applying statistical methods
to medical data revolve around classifying
and recording conditions present, proce-
dures and interventions used, and the out-
comes. The outcomes of interest in health
care can be classified as mortality, mor-
bidity, disability, and expenditure. In this
paper, the focus is on mortality. How-
ever, methods for mortality analysis have
a role whenever we measure an outcome
by the time or duration for an occurrence
such as a readmission, relapse or failure
of a medical intervention or procedure.
The Modified Makeham Model for
Survival Analysis
A modification of the Makeham model
was proposed by Bailey et al. to evaluate
kidney graft survival.⁶,⁷ The modified
Makeham model has a decreasing hazard
or risk function of the form
r(t) = α exp(−γt) + δ

where α is the initial excess risk, δ is a long-term risk, and γ determines the rate of decay of the initial excess risk. The risk function for the Makeham model as it is commonly found in the actuarial literature (Jordan, 1967) has the form

r(t) = αpᵗ + δ.

In actuarial applications to human mortality data, p is greater than 1 and the risk function is increasing with time, t. This corresponds to an increased risk of death with older age over a lifetime.
When a constant δ is added to a Gompertz force of mortality to obtain the form shown above, the model is referred to as Makeham's modification of the Gompertz, even though Kurtz (1930) attributes the idea for the use of the additive constant δ to Gompertz himself. In our modified form of the Makeham, p is less than 1, since p = exp(−γ). The decreasing-risk form of the Makeham works well when there is a high-risk intervention followed by a period of recovery during which the excess risk of the intervention diminishes with time.
The risk function is also known as the
hazard function or the force of mortality.
The risk curve shows us at what rate fail-
ures occur relative to the number of sur-
vivors at any time, t. The corresponding expression for the proportion surviving at time t is
S(t) = exp{−∫₀ᵗ r(u) du}

or

S(t) = exp{−[δt + (α/γ)(1 − exp(−γt))]}.

The modified Makeham model also includes concomitant variables in each of its positive parameters as follows:

α = exp(α₀ + α₁x₁ + ⋯ + αₙxₙ)

δ = exp(δ₀ + δ₁x₁ + ⋯ + δₙxₙ)

γ = exp(γ₀ + γ₁x₁ + ⋯ + γₙxₙ)
This survival model has great utility for
evaluation of survival data following some
high risk event such as surgery. The fol-
lowing description of this and related tools
will be applied to examples using Medi-
care data to evaluate the effectiveness of
medical interventions.
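A minimal computational sketch of the hazard and survival functions just defined (Python; the example value simply evaluates the formulas with the heart attack estimates reported later in Table 1 and is not a result from the paper):

    import numpy as np

    def hazard(t, alpha, gamma, delta):
        """r(t) = alpha * exp(-gamma * t) + delta"""
        return alpha * np.exp(-gamma * t) + delta

    def survival(t, alpha, gamma, delta):
        """S(t) = exp(-[delta*t + (alpha/gamma)*(1 - exp(-gamma*t))])"""
        return np.exp(-(delta * t + (alpha / gamma) * (1.0 - np.exp(-gamma * t))))

    # Example: survival(1.0, 10.6, 39.1, 0.183) is about 0.63 at one year.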
Time Course of the Makeham Model:
Its Role in Statistical Studies
The time course of the Makeham model
is important for two statistical activities: the design of follow-up studies and the evaluation of factors affecting long-term and short-term risk.
The following sections will attempt to
isolate the long-term from the short-term
components of the modified Makeham
model. In evaluating the design for fol-
low-up studies, we focus our design ef-
forts on getting the information to eval-
uate the long-term risk components. This
approach is in the spirit conveyed in the presidential address cited above¹ when Pro-
fessor Fisher recalls that his teacher Pro-
fessor Whitehead of Cambridge used to
say: “The essence of applied mathematics is to know what to ignore.” With suffi-
cient lapse of time we can ignore the short-
term components. Then an evaluation of
the follow-up time required for a study is
reduced to a simpler problem in which the
risk function is a constant once the short-
term risk has become negligible. As you
will see, we don’t really ignore the short
term components. Instead we use our best
estimates of these components to recast a
complicated problem in terms of a simpler
problem.
The second aspect of isolating the long-
term from the short-term is to evaluate
complex data on medical intervention. An
estimate of the time for the short-term
hazard to become negligible relative to its
initial value allows us to use the popular
Cox proportional hazards model to better
understand the role of important risk fac-
tors. As before the estimates are based
on estimates for the parameters of a mod-
ified Makeham model with no concomi-
tant variables. The full Makeham model
with concomitant variables provides the
more satisfying approach when we have
a high risk intervention followed by a pe-
riod of recovery or return to stable or
constant risk. The full Makeham model
provides in one framework a means to
jointly evaluate the proper balance of var-
ious risk factors. The result is a more com-
plex, comprehensive summary of com-
plex data.
Design of Survival Studies When the
Makeham Model is the Appropriate
Survival Model
To determine an appropriate follow-up
time for a study, there must be sufficient
time to estimate the long-term risk com-
ponent δ when the Makeham model is the appropriate survival model. Basically, information on δ is not available until the short-term risk, α exp(−γt), has become
negligible. A reasonable approach is to
find the short-term time, Tₛ, such that the hazard

r(t) = α exp(−γt) + δ
is equal to a value close to the long-term
risk itself. For this example, we use the
value r(t) = 1.1δ. That is, the short-term effect is diminished to within 10% of δ. This means that

α exp(−γt) + δ = 1.1δ

or equivalently,

α exp(−γTₛ) = 0.1δ.

The solution for the time at which this risk is achieved gives

Tₛ = −(1/γ) ln(0.1δ/α).

Note that Tₛ is negative when α is less than 0.1δ; in this case simply use Tₛ = 0. Now, as an approximation, we have the
constant risk form

r(ΔT) = δ

and the information on δ begins accumulating at time Tₛ according to the expression for the information

I(δ, ΔT) = 1 − exp(−δ ΔT)

where ΔT is the time lapse after time Tₛ. When ΔT is zero, the above equation implies there is no information about δ. As ΔT approaches infinity, or symbolically ΔT → ∞, the information on δ approaches 1 or 100%. Not only is I(δ, ΔT) the information on δ, but also it is related to the survival function for the constant hazard function, where

I(δ, ΔT) = 1 − Sₛ(ΔT)

where Sₛ(ΔT) is the survival curve for those cases not failed at Tₛ. This means that information on the long-term risk accumulates as failures occur after time Tₛ.
Figure 1. For purposes of illustration,
consider a value of 6 = 0.2 per year. Then
to get 50% of the potential information
on 6 would require a study to extend
AT = —(1/8)In(0.5)
0.693/8
years beyond 7; or 3.5 years when 6 =
0.2 per year. A period of one year beyond
T; would provide 18% of the information
on 6. The idea is to design a study based
on getting the more difficult, long-term
information for the parameter 6. In taking
this approach, we rely on the very simple,
pragmatic notion, that it is more difficult
to obtain reliable long-term information.
By focusing the problem on the long-term
risk parameter, 5, we simplify a complex
problem and insure an adequate design
for a full evaluation of long- and short-
term risk factors. On this basis we eval-
uate sample sizes for follow-up study and
illustrate the balance between the number
of cases and the follow-up time.
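A brief sketch of these design quantities in Python (a restatement of the formulas above, not the author's software):

    import numpy as np

    def t_short(alpha, gamma, delta, frac=0.1):
        """Time at which alpha*exp(-gamma*t) = frac*delta; zero if alpha < frac*delta."""
        return max(0.0, -np.log(frac * delta / alpha) / gamma)

    def information(delta, dT):
        """I(delta, Delta_T) = 1 - exp(-delta * Delta_T)."""
        return 1.0 - np.exp(-delta * dT)

    # With delta = 0.2 per year, one extra year gives about 18% of the
    # information, and about 3.5 years (-log(0.5)/0.2) are needed for 50%,
    # as in the text.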
Sample Size for a Follow-up Study
To plan a follow-up study, we must
evaluate a sample size or number of cases
to be included in our study. To do this we
reduce the complex modified Makeham
model to the simpler model with a con-
stant risk function by first estimating the
time for our model to be reduced to this
simpler form. This adaptation permits us
to borrow techniques already established
for the simpler problem. Essentially we
must find the sample size required to es-
timate δ. For example, this can be computed from an approximate formula in Gross and Clark (1975) for the confidence limits on δ. Use the formula for the 100(1 − α)% confidence limits on δ
[Figure 1 plot appears here: information on the long-term risk plotted against additional follow-up time in years, for a range of values of δ (risk/year).]
Fig. 1. Information accumulates on the long-term risk, δ, as the follow-up time increases. The time shown in the figure must be added to the time required for the short-term risk to become negligible.
δ̂(1 ± z/√m),

where z is the standard normal deviate for the chosen confidence level (z = 1.96 for 95% confidence), to find the number of cases m needed to obtain an estimate within 100k% of δ. The result,

m = (z/k)²,

is the number of cases, assuming infinite
follow-up time and full information on δ. Then we adjust the sample size to account for the expected follow-up time. If m′ is the actual sample size at time Tₛ, the effective sample size for a total follow-up time of ΔT + Tₛ is

m = m′(1 − exp(−δ ΔT)).

To adjust the sample size m for this follow-up time, first compute m, then

m′ = m/(1 − exp(−δ ΔT)).
This formula allows us to evaluate op-
tions to use more cases or a longer follow-
up time in survival studies. An additional
adjustment to the sample size accounts for
the expected number of cases lost before
Tₛ. The adjusted sample size is

m″ = m′/S(Tₛ)

where

S(Tₛ) = exp{−[δTₛ + (α/γ)(1 − exp(−γTₛ))]}

is the Makeham survival curve.
For example, to obtain an estimate of δ within 10% of δ (k = 0.1) with 95% confidence, the required full-information sample size is

m = (1.96/0.1)² = 384.
With α = 10.6, γ = 39.1, and δ = 0.183 (parameter estimates from the time course of Medicare heart attack data),

Tₛ = −(1/γ) ln(0.1δ/α)
   = −(1/39.1) ln((0.1)(0.183)/10.6)
   = 0.16 years.

Using the m above, we find that one year of follow-up from the 0.16 years, or a total follow-up of 1.16 years, gives

m′ = m/0.167 = 2,299

and

m″ = m′/0.741 = 3,103 cases.

The calculations for two years of follow-up beyond 0.16 years, for a total follow-up of 2.16 years, give

m′ = m/0.306 = 1,255

and

m″ = m′/0.741 = 1,694 cases.

In contrast, the time course of heart failure, with α = 2.37, γ = 10.77 and δ = 0.377, has Tₛ = 0.38 years. Then for one year of follow-up beyond 0.38 years, or 1.38 years,

m′ = 384/0.314 = 1,223

and

m″ = 1,223/0.698 = 1,752 cases,

and two years of follow-up beyond 0.38 years, for a total of 2.38 years, gives

m′ = 384/0.530 = 725

and

m″ = 725/0.698 = 1,039 cases.
These calculations clearly illustrate the
trade-offs between follow-up time and
sample size that can be used in designing
studies. These computations are impor-
tant even in retrospective studies based
on administrative data because they pro-
vide insight into the number of years of
back records that will be needed for a
study. A careful evaluation of the re-
quired sample size for a study must in-
clude other factors, such as the expected
recruitment rate and an allowance for
cases lost to follow up. A fuller discussion
of these issues can be found in Meinert.
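The sample-size arithmetic above can be collected into a short routine. The sketch below (Python) reproduces the heart attack example; the small differences from the figures quoted in the text come only from intermediate rounding.

    import numpy as np

    def follow_up_sample_size(alpha, gamma, delta, dT, k=0.1, z=1.96):
        """Cases needed to estimate delta to within 100k% at the given
        confidence (z), with dT years of follow-up beyond the short-term time."""
        t_s = max(0.0, -np.log(0.1 * delta / alpha) / gamma)     # short-term time
        m = (z / k) ** 2                                         # full information
        m1 = m / (1.0 - np.exp(-delta * dT))                     # finite follow-up
        s_ts = np.exp(-(delta * t_s
                        + (alpha / gamma) * (1.0 - np.exp(-gamma * t_s))))
        m2 = m1 / s_ts                                           # losses before t_s
        return t_s, m, m1, m2

    # Heart attack example: follow_up_sample_size(10.6, 39.1, 0.183, dT=1.0)
    # returns roughly (0.16, 384, 2297, 3102), close to the 2,299 and 3,103
    # cases quoted above.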
Application of The Makeham Model in
the Evaluation of Data for the
Effectiveness and Use of Medical
Interventions—Long-Term Versus
Short-Term Risk Factors
In a project developed in late 1985 and early 1986, the Health Care Financing Administration (HCFA) and the Peer Review Organizations (PROs) undertook to develop data on the
patterns of use and the effectiveness of
medical interventions. This project is part
of a comprehensive effort undertaken by
HCFA to assess and stimulate improve-
ment in the quality of medical care ren-
dered to Medicare beneficiaries. The ef-
fort is in four segments:
1. Monitoring of trends over time in use
and outcome of medical interventions.
2. Analysis of variation in use and out-
come over geographic areas and pro-
viders of care.
3. Detailed investigation of the patterns
of medical practice that underlie the
time trends and variations in outcomes
among localities.
4. Feedback and education to exchange
and improve information.
The examples which follow are derived
from the third segment in which detailed
investigations are being conducted for
cardiovascular problems. For this inves-
Table 1.—Maximum Likelihood Estimates for the Structural Parameters of the Makeham Model

                        Heart      Heart
                        Attack     Failure
Number of cases         3152       3274
α (per year)            10.6       2.37
γ (per year)            39.1       10.77
δ (per year)            0.183      0.377
Tₗ (days)               22         78
tigation, a sample of just over 3100 cases
for each condition is available for analy-
sis. As a first step in understanding the
short- and long-term course of patient
survival following a medical intervention,
it is useful to fit the Makeham model with
decreasing hazard to the data. The result
of this fit can be used to graph the hazard
over time, estimate the time course of the
risk following intervention, and engage in
a more thorough examination of the role
of key risk factors over time.
To demonstrate these ideas, maximum
likelihood estimates were obtained for the
Makeham model for two data sets. One
has 3152 heart attack cases and the other
3274 cases with heart failure. The results
of this fit are shown in Table 1.
The risk curves are shown in Figure 2
and the corresponding survival probabil-
ities are shown in Figure 3.
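Although the paper does not describe the software used for this fit, the maximum likelihood estimates can in principle be reproduced from right-censored survival data with a short routine such as the following sketch (Python with scipy; the log-likelihood is the standard one for censored data under the hazard and survival functions given earlier).

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(log_params, times, died):
        """times: follow-up in years; died: 1 if death observed, 0 if censored."""
        alpha, gamma, delta = np.exp(log_params)      # keeps parameters positive
        hazard = alpha * np.exp(-gamma * times) + delta
        cum_hazard = delta * times + (alpha / gamma) * (1.0 - np.exp(-gamma * times))
        return -(np.sum(died * np.log(hazard)) - np.sum(cum_hazard))

    def fit_makeham(times, died, start=(1.0, 1.0, 0.1)):
        res = minimize(neg_log_likelihood, np.log(start), args=(times, died),
                       method="Nelder-Mead")
        return np.exp(res.x)      # alpha, gamma, delta on the original scale

    # Usage: alpha, gamma, delta = fit_makeham(times, died)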
Where do the Survival Curves Cross?
One important property of the Make-
ham model is that survival curves can cross
[Figure 2 plot appears here: risk per year plotted against t (days) for the heart attack data (α = 10.6, γ = 39.1, δ = 0.183) and the heart failure data (α = 2.37, γ = 10.77, δ = 0.377).]
Fig. 2. The modified Makeham risk functions for the heart attack and heart failure data. Note the crossing of the hazard (risk) functions at 0.04832 year or 17.6 days.
[Figure 3 plot appears here: survival probability plotted against t (days) for the heart attack and heart failure data.]
Fig. 3. The Survival Probability for the Modified Makeham Model for the Heart Attack and Heart
Failure Data. Note the crossing of the survival functions at a time later than that shown for the risk
functions shown in Figure 2.
when different sets of parameters are used.
For example, when we compare two sur-
vival curves with parameters α, γ, δ and α′, γ′, δ′, a graphical solution of this problem will be apparent in many cases (see Figure 3). An algebraic solution of the nonlinear equation for the cross over time is not generally available. However, if the graphical display of several survival curves does not show a cross over, we know that if a cross over occurs, it will occur for a large value of time. In this case we can equate the expressions for the survival curves and assume γt and γ′t are large, where t is the cross over time. Then the approximate cross over time is

t = (α/γ − α′/γ′)/(δ′ − δ).
Once the approximate cross over time
is known, more precise estimates of the
cross over time can be easily found from
the graph or by iterative search. Note that
a negative cross over time means the curves
don’t cross. For the survival curves shown
in Figure 3, the cross over time from this
approximation is 96 days. Also observe
that the crossing of the risk functions
shown in Figure 2 occurs before the cros-
sing of the survival functions.
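A sketch of the cross-over calculation (Python, using the Table 1 estimates; numerical root-finding is offered here as one way to carry out the iterative search mentioned above):

    import numpy as np
    from scipy.optimize import brentq

    def cum_hazard(t, alpha, gamma, delta):
        """Cumulative hazard of the modified Makeham model."""
        return delta * t + (alpha / gamma) * (1.0 - np.exp(-gamma * t))

    # Parameter estimates from Table 1.
    a1, g1, d1 = 10.6, 39.1, 0.183      # heart attack
    a2, g2, d2 = 2.37, 10.77, 0.377     # heart failure

    # Large-t approximation to the crossing time of the survival curves.
    t_approx = (a1 / g1 - a2 / g2) / (d2 - d1)      # about 0.26 yr, i.e. 96 days

    # The survival curves cross where the cumulative hazards are equal.
    t_refined = brentq(lambda t: cum_hazard(t, a1, g1, d1)
                                 - cum_hazard(t, a2, g2, d2), 0.01, 2.0)
    print(round(t_approx * 365), round(t_refined * 365))    # crossing times in days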
The Heart Attack Data for Separating
the Short- and Long-Term Risk Factors:
A Methodological Example
In this section, we explore the heart
attack data described above. There are
many possible variables to consider and
we wish to evaluate the risk factors that
play a role initially and over the long term.
In considering these variables, we used
the time Tₗ where

Tₗ = −(1/γ) ln(0.1)

to make separate evaluations of the short- and long-term risk factors. This formulation uses the time Tₗ for the initial excess hazard to diminish to 10% of the initial value of the excess risk, α, so

α exp(−γTₗ) = 0.1α.
This formula for separating long and
short term risks uses a criterion which dif-
fers from that for Tₛ used for sample size calculations. In our examples, the values of Tₛ used in the sample size calculations are longer than the corresponding values of Tₗ. The use of a longer time for the sam-
ple design problem is more conservative.
In the estimation problem, there is the
additional consideration that a longer pe-
riod leaves fewer cases for the evaluation
of the long-term risk factors. Obviously
either approach for separating long-term
and short-term components of risk could
be used. With either approach, the first
step is to obtain parameter estimates for
the simple form of the Makeham with no
concomitant variables in the model. The
values shown in Table 1 provide a time
course which can be used to partially iso-
late the long-term from the initial or
short-term risk factors.
Once the time Tₗ has been determined,
two applications of the Cox proportional
hazards model were made to evaluate the
long-term and the short-term risk factors.
The SAS® procedure, PROC PHGLM,
provides a handy tool for evaluating a
number of concomitant variables in the
Cox proportional hazards model. One
handy feature of the SAS implementation
is the option for stepwise model building
in which variables are tried systematically
in succession to assess their contribution
to the model. The initial risk factors were
evaluated using the data with no modifi-
cation. Significant risk factors are found
using the stepwise feature of the SAS
PROC PHGLM. The analysis is then re-
peated on the same data with the survival
time shifted from zero to Tₗ. In the case of the heart attack data, Tₗ = 22 days.
Consequently, the value 22 days is sub-
tracted from each survival time in the in-
itial data set. Negative values are assigned
a missing value and deleted from the long-
term survival analysis. The results of these
analyses are shown in Tables 2 and 3.
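The shifting step described above is simple to express in code. The sketch below (Python with pandas; the data frame layout and column names are hypothetical) prepares the long-term data set, which could then be passed to any Cox proportional hazards routine; the original analysis used the SAS PROC PHGLM stepwise procedure.

    import pandas as pd

    # 22 days is the time, from the Makeham fit of the heart attack data, at
    # which the initial excess risk has fallen to 10% of its initial value.
    SHIFT_DAYS = 22

    def long_term_subset(df, time_col="days_survived"):
        """Shift survival times by SHIFT_DAYS and drop cases whose shifted
        time is negative, as described in the text."""
        out = df.copy()
        out[time_col] = out[time_col] - SHIFT_DAYS
        return out[out[time_col] >= 0]

    # The returned data frame (risk-factor columns unchanged) is then refit
    # with a proportional hazards model to assess the long-term risk factors.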
Table 2 shows the parameter estimates
for 24 risk factors which were significant
(P < 0.05) predictors of patient survival
in the Cox proportional hazards model.
When the survival time was modified by
subtracting Tₗ = 22 days from each sur-
vival time, 702 of the 3152 survival times
became less than zero and had to be ex-
cluded from the long-term analysis shown
in Table 3. Consequently, the number of
cases used in the long-term model was
reduced from the 3152 cases used in the
initial risk evaluation to 2450 cases for the
long-term risk evaluation. At the same
time the number of uncensored survival times is reduced from 1311 to 609. Censored
cases are those with a death or actual fail-
ure not yet observed during the period of
observation. Note that of the 24 variables
found to be significant predictors in the
initial risk model, only 14 remained sig-
nificant in the long-term model.
We consider this approach a handy
kludge for dealing with a complex prob-
lem. However, with the results of the
modified Makeham shown in Table 4, we
estimated simultaneously the components
of risk for all 24 variables shown in Table
2. In Table 4 an asterisk (*) was inserted
by the parameter estimates for each of the
delta components which were excluded
from the long-term model shown in Table
3. These are the 10 variables in Table 2 that are excluded from Table 3. If the
10 delta components with asterisks in Ta-
ble 4 are judged against their standard
errors, these components are not statis-
tically different from zero.
Furthermore the parameter estimates
Table 2.—Parameter Estimates for the Cox Proportional Hazards Model for Unmodified Survival Time for 3152 Cases of Heart Attack
The parameter estimates were obtained from 3152 cases of heart attack, of whom 1311 died during the period observed. The MODEL CHI-SQUARE = 801.90 WITH 24 D.F. (-2 LOG L.R.), P = 0.0. The output is from SAS® PROC PHGLM.

PARAMETER ESTIMATES
STEPWISE PROPORTIONAL HAZARDS GENERAL LINEAR MODEL PROCEDURE
3152 OBSERVATIONS
1311 UNCENSORED OBSERVATIONS
0 OBSERVATIONS DELETED DUE TO MISSING VALUES
-2 LOG LIKELIHOOD FOR MODEL CONTAINING NO VARIABLES = 20381.20
MODEL CHI-SQUARE = 1141.74 WITH 24 D.F.
MAX ABSOLUTE DERIVATIVE = 0.1537D-09.  -2 LOG L = 19579.30.
MODEL CHI-SQUARE = 801.90 WITH 24 D.F. (-2 LOG L.R.)  P = 0.0
FINAL PARAMETER ESTIMATES

VARIABLE                          CODE*      BETA         STD. ERROR   CHI-SQUARE   P
Age                               ERAGE      0.051        0.006        69.23        0.0000
Leukocytosis                      HVWBC      0.012        0.003        14.54        0.0001
Low Value Potassium               LVK        0.172        0.072        5.64         0.0176
High Value pH                     HVPH       (illegible)  0.635        (illegible)  0.0066**
Readmitted <30 da                 F122       0.291        0.115        6.40         0.0114**
Disoriented                       F293000    0.403        0.160        6.36         0.0116**
Ischemia                          F411800    -0.176       0.065        7.44         0.0064**
MI Age not determined             F412000    -0.157       0.065        5.80         0.0160**
Glucose                           V425       0.000989     0.000229     18.53        0.0000**
Dissociation                      F426890    1.004        0.171        34.32        0.0000
Congestive Heart Failure          F428000    0.418        0.064        42.17        0.0000
BUN                               V430       0.00526      0.00153      11.87        0.0006
PO2                               V461       -0.00723     0.00183      15.56        0.0001**
Coma/Stupor                       F780005    0.906        0.128        50.27        0.0000
Pseudonomias                      V785000    0.00185      0.00054      11.47        0.0007
Murmur                            F785201    0.381        0.102        14.05        0.0002
Systolic Blood Pressure           V785501    -0.00997     0.00148      45.16        0.0000**
Respirations                      V786010    0.0120       0.0034       12.18        0.0005
Stroke/TIA History                F826       0.367        0.088        17.36        0.0000
Coronary Heart Failure History    F832       0.261        0.075        11.99        0.0005
Myocardial Infarction History     F876       0.143        0.064        4.96         0.0259
Cancer                            CA         0.632        0.147        18.44        0.0000
Chronic Renal Disease             RN         0.323        0.145        4.99         0.0254**
Diabetic                          DB         -0.219       0.103        4.52         0.0336**

*Codes which begin with the letter F are indicator variables and codes that begin with V are values.
**Indicates variables not significant long-term as shown in Table 3.
Table 3.—Parameter Estimates for the Cox Proportional Hazards Model for a Modified Survival Time for 2450 Cases of Heart Attack
For this analysis the survival time was modified by subtracting 22 days. Negative values were deleted from the analysis. The parameter estimates were obtained from the remaining cases of heart attack using SAS® PROC PHGLM. The estimate of a 22-day period for the long-term risk to dominate was based on the Makeham fit of the time course of the survival data. The estimates shown are for the variables in Table 2 which were identified as significant in the stepwise proportional hazards general linear model procedure with the default settings (P < 0.05).

609 UNCENSORED OBSERVATIONS
702 OBSERVATIONS DELETED DUE TO MISSING VALUES
FINAL PARAMETER ESTIMATES

VARIABLE                          CODE*      BETA       STD. ERROR   CHI-SQUARE   P
Age                               ERAGE      0.0691     0.0093       55.42        0.0000
Leukocytosis                      HVWBC      0.0122     0.0058       4.46         0.0347
Potassium                         LVK        0.277      0.112        6.09         0.0136
A-V Dissociation                  F426890    0.896      0.338        7.02         0.0080
Congestive Heart Failure          F428000    0.555      0.090        37.64        0.0000
BUN                               V430       0.0155     0.0022       48.33        0.0000
Coma/Stupor                       F780005    0.691      0.311        4.92         0.0266
Pseudonomias                      V785000    0.00321    0.00081      15.80        0.0001
Murmur                            F785201    0.378      0.148        6.50         0.0108
Respirations                      V786010    0.0228     0.0053       18.63        0.0000
Stroke/TIA History                F826       0.552      0.123        20.24        0.0000
Coronary Heart Failure History    F832       0.374      0.109        11.74        0.0006
Myocardial Infarction History     F876       0.254      0.091        7.85         0.0051
Cancer                            CA         0.939      0.199        22.31        0.0000

*Codes which begin with the letter F are indicator variables and codes that begin with V are values.
Furthermore, the parameter estimates for the modified Makeham model provide the raw material to evaluate specific individual situations. With a specific case at hand, the component parameters can be used to estimate the structural parameters α, γ, and δ. With these estimates, both the predicted survival curve and the predicted risk function can be evaluated and plotted to provide a comprehensive forecast for the situation. When the model is used with a treatment or procedure that can be administered as appropriate, the results of the model prediction summarize the experience of thousands of observations to succinctly predict the course of survival under each option for the specific individual case. This means that a simple answer that one approach is preferred over another must be replaced by a more elaborate evaluation specific to the case. And the choices are more complex in that the survival curves may cross, just as they did in the comparison of the heart attack and the heart failure data. In such cases, the choice is a trade-off between the short-term outcome and the long-term outcome. The result is an elaborate quantitative assessment of the expected outcome that is custom fit to cover a complex mix of individual situations. Clearly such analyses provide a resource for summarizing experience that awaits further exploration and exploitation.
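As a sketch of how such a forecast could be assembled, the functions below evaluate a survival curve and a risk (hazard) function from structural parameters α, γ, and δ. They assume a hazard of the form h(t) = δ + α exp(-γt), which is consistent with the decay criterion used above for T₁; the parameter values shown are placeholders, not estimates taken from the tables.

    import numpy as np

    def hazard(t, alpha, gamma, delta):
        # Assumed modified-Makeham risk: a long-term rate delta plus a
        # short-term component alpha that decays at rate gamma.
        return delta + alpha * np.exp(-gamma * t)

    def survival(t, alpha, gamma, delta):
        # Survival from the cumulative hazard
        # H(t) = delta*t + (alpha/gamma)*(1 - exp(-gamma*t)).
        H = delta * t + (alpha / gamma) * (1.0 - np.exp(-gamma * t))
        return np.exp(-H)

    t = np.linspace(0.0, 365.0, 366)              # one year, in days
    S = survival(t, alpha=0.02, gamma=0.105, delta=0.0005)
    h = hazard(t, alpha=0.02, gamma=0.105, delta=0.0005)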
Table 4.—Estimates of the Makeham Parameters for the Heart Attack Data Set
Maximum likelihood estimates are shown along with the approximate standard error of each estimate (S.E.). Variables are identified with the codes shown in Table 2; a suffix designates the alpha (α), gamma (γ), or delta (δ) component.
MAXIMUM OF THE LOG LIKELIHOOD = -8209.242

PARAMETER      ESTIMATE       S.E.
F293000_α      0.874          0.267
F412000_α      -0.442         0.138
F426890_α      1.03           0.286
V430_α         -3.72E-04      2.96E-03
F780005_α      0.764          0.175
F785201_α      3.21E-02       0.223
V786010_α      -3.28E-03      6.21E-03
F832_α         0.147          1.174
CA_α           -0.231         0.296
DB_α           -0.376         0.218
F293000_γ      0.201          0.274
F412000_γ      -0.305         0.154
F426890_γ      0.155          0.402
V430_γ         -3.43E-03      3.98E-03
F780005_γ      -0.495         0.221
F785201_γ      -0.388         0.260
V786010_γ      -5.64E-03      6.65E-03
F832_γ         0.157          0.212
CA_γ           -0.837         0.352
DB_γ           6.54E-02       0.216
HVWBC_δ        1.00E-02       (illegible)
F293000_δ      0.127*         0.343
F412000_δ      -0.141*        0.116
F426890_δ      (illegible)    0.397
V430_δ         1.62E-02       3.09E-03
F780005_δ      -0.560         (illegible)
F785201_δ      0.387          0.199
V786010_δ      3.11E-02       6.53E-03
F832_δ         0.484          0.130
CA_δ           0.803          0.280
DB_δ           7.89E-02*      0.161
ERAGE_α        3.41E-02       1.54E-02
LVK_α          -2.93E-03      0.135
F122_α         0.449          0.223
F411800_α      -0.292         0.135
V425_α         1.69E-03       4.12E-04
F428000_α      0.180          0.122
V461_α         -6.26E-03      3.23E-03
V785000_α      3.15E-04       1.03E-03
V785501_α      -2.47E-02      2.37E-03
F826_α         0.205          0.195
F876_α         -3.71E-02      0.136
RN_α           -0.297         0.268
ERAGE_γ        -1.64E-02      2.35E-02
LVK_γ          -0.240         0.140
F122_γ         0.198          0.231
F411800_γ      -0.113         0.140
V425_γ         5.25E-04       4.48E-04
F428000_γ      -0.264         (illegible)
V461_γ         4.07E-03       3.68E-03
V785000_γ      -9.72E-04      1.16E-03
V785501_γ      -1.11E-02      2.59E-03
F826_γ         -5.54E-02      0.240
F876_γ         -2.63E-02      0.152
RN_γ           (illegible)    0.395
ERAGE_δ        (illegible)    (illegible)
F411800_δ      (illegible)*   0.110
V425_δ         1.53E-04*      4.61E-04
F428000_δ      0.461          0.117
V461_δ         -1.37E-03*     3.77E-03
V785000_δ      3.44E-03       9.96E-04
V785501_δ      5.35E-03*      3.55E-03
F826_δ         0.581          0.152
F876_δ         0.338          0.107
RN_δ           -0.468*        0.464

*Estimated delta components which correspond to values not found significant in the Cox proportional hazards model as shown in Table 3.
The appeal of the proportional hazards model rests largely on the fact that the conclusions derived from comparisons based on this model remain simple, even though complex adjustments are made. The comparisons are simple because the proportional hazards model does not result in survival curves that cross. The modeling approach illustrated by the modified Makeham model places the focus on estimation of complex outcomes rather than on the artificial reduction of complex outcomes to simplistic conclusions that come from formulating issues in terms of simplistic statistical hypotheses.
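The non-crossing property follows directly from the model: under proportional hazards the survival curves satisfy S₂(t) = S₁(t)^r for a constant hazard ratio r, so whichever curve starts lower stays lower. Two modified-Makeham curves are not so constrained, as the following numerical check illustrates; it reuses the assumed hazard form and placeholder parameter values from the earlier sketch.

    import numpy as np

    t = np.linspace(0.0, 365.0, 366)

    def survival(alpha, gamma, delta):
        return np.exp(-(delta * t + (alpha / gamma) * (1.0 - np.exp(-gamma * t))))

    # High short-term risk with low long-term risk, and the reverse (placeholder values).
    S_a = survival(alpha=0.06, gamma=0.10, delta=0.0005)
    S_b = survival(alpha=0.01, gamma=0.10, delta=0.0020)

    crossings = int(np.sum(np.diff(np.sign(S_a - S_b)) != 0))
    print(crossings >= 1)  # True: the two Makeham curves cross within the year

    # Under proportional hazards, S_a ** r never crosses S_a for any r > 0,
    # so an adjusted comparison of that form cannot produce this trade-off.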
Conclusion
The examples of the use of the modified
Makeham model to evaluate medical in-
terventions are promising. The compu-
tational cost associated with fitting this
complex model to large data sets may be
a limitation. If computational resources
become a problem, then an appropriate
statistical answer is to use a sample. The
computational limitations are more likely
to come from the need to include many
complex adjustments in our models. To
deal with these problems in an economical
fashion will be a challenge. In this paper,
it has been shown that simpler models,
such as the Cox proportional hazards
model, can be used with the simple struc-
tural form of the Makeham model to gain
useful insights. A related example by Olshen et al.¹⁴ describes the Cox model and other powerful statistical methodology such as classification trees. We anticipate the need to build models in which it is necessary to examine many variables in many combinations. To do this efficiently, we must bring various statistical and computing resources into play. For example, in medicine it is often the case that many variables carry very similar information about the problem. When this is the
case, there are numerous alternative
models that are essentially indistinguish-
able. In such cases, external information
can be very useful in the final selection of
a model for a particular purpose. The ex-
ternal information may be in the form of
costs, convenience, or reliability or risk
to the patient of a clinical measure. The
external information may include a real-
ization that a variable has been miscoded
or that the information is not coded con-
sistently. The biggest challenge to using
complex models for evaluating medical
procedures and practices from the infor-
mation contained in large data bases comes
from the problems inherent in such large
collections of data. On the other hand,
the challenge is to have the courage to
make the best of what is available. The
very probing of the data to glean valuable
insights regarding current medical prac-
tice will stimulate new questions and the
very use of the data in this productive
fashion will encourage those responsible
for providing and maintaining these data
to do a better job. Without use, the in-
formation is lost and the process of learn-
ing from experience is far more parochial
than it need be. Many things are being
tried. We need to study the really good
practices and the really bad practices to
learn what works and when it works. We
need only look at the conventional gath-
ering of medical knowledge to realize that
clinical trials are conducted on very select
populations, each practitioner has at most
limited experience, information in jour-
nals is often based on studies at a single
institution, and the current thinking in-
stilled in medical school graduates changes
slowly with experience. With models which
focus on effective use of data to jointly
evaluate short- and long-term risk factors,
there is a challenge to fully investigate
major data resources in order that we may
better understand medical practice.
Acknowledgement
The author wishes to express his grat-
itude to Henry Krakauer and Miles Davis
of the Health Care Financing Adminis-
tration for their encouragement and com-
ments on this paper.
References Cited
1. Bennett, J. H. 1974. Collected Papers of R. A. Fisher, Volume IV 1937-1947, The University of Adelaide, Australia. Paper 159, pages 160-163.
2. Shewhart, Walter. 1931. Economic Control of Quality of Manufactured Product, New York, D. Van Nostrand Co., Inc.
3. Cook, Edward. 1914. The Life of Florence Nightingale, in two volumes, Macmillan and Co., London, Vol. 1, Chapter II, "The Passionate Statistician (1859-1861)," pages 428-438.
4. Kendall, Maurice and R. L. Plackett. 1977. Studies in the History of Statistics and Probability, Volume II. Macmillan Publishing Company, New York. Chapter 19, Florence Nightingale as a Statistician, paper by E. W. Kopf, Journal of the American Statistical Association, 15, 388-404 (1916).
5. Pearson, E. S. and M. G. Kendall. 1970. Studies in the History of Statistics and Probability, Hafner, Darien, Conn. Chapter 7, Medical statistics from Graunt to Farr, papers by Major Greenwood reprinted from Biometrika, 45, 101-27 (1943); 32, 203-25 (1942); 33, 1-24 (1943).
6. Bailey, R. C. and Homer, L. D. 1977. Computations for a best match strategy for kidney transplantation, Transplantation, 23: 329-336.
7. Bailey, R. C., Homer, L. D. and Summe, J. P. 1977. A proposal for the analysis of kidney graft survival, Transplantation, 24: 309-315.
8. Jordan, Jr., Chester Wallace. 1967. Life Contingencies, 2nd Edition, The Society of Actuaries.
9. Kurtz, E. B. 1930. Life Expectancy of Physical Property Based on Mortality Laws. The Ronald Press Company, New York.
10. Gross, Alan J. and Clark, Virginia A. 1975. Survival Distributions: Reliability Applications in the Biomedical Sciences, John Wiley & Sons, New York, page 61.
11. Meinert, Curtis L. 1986. Clinical Trials, Design, Conduct, and Analysis, Oxford University Press, New York.
12. SUGI Supplemental Library User's Guide, Version 5 Edition. 1986. SAS Institute, Inc., Cary, NC.
13. Roper, William L., Winkenwerder, William, Hackbarth, Glenn M. and Krakauer, Henry. Effectiveness in health care: An initiative to evaluate and improve medical practice. The New England Journal of Medicine, vol. 319, No. 18, Nov. 1988.
14. Olshen, Richard A., Gilpin, Elizabeth A., Henning, Hartmut, LeWinter, Martin L., Collins, Daniel and Ross, Jr., John. 1985. Twelve-month prognosis following myocardial infarction: Classification trees, logistic regression, and stepwise linear discrimination. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Volume I, Wadsworth, Monterey, California and the Institute of Mathematical Statistics, Hayward, California, pp. 245-267.