Interpreting Outcome Data in Hematological Malignancies: A Paradigm for Clinical Studies

Results of clinical studies often appear contradictory as they emerge, and therapies are sometimes adopted on the basis of data that are premature or derived from small studies. Much of the data may carry inherent selection biases, making interpretation confusing and difficult. The hematological literature is full of such examples. This review describes several of them in the hope of sounding a cautionary note, encouraging more precise description of study conditions, and fostering an appreciation of the importance of allowing data from clinical studies to mature. The examples are drawn from clinical studies in lymphoma, leukemia, and bone marrow transplantation.


LYMPHOMA

Diffuse Large Cell Lymphoma
In the mid-1970s the standard of care for diffuse large cell lymphoma (or diffuse histiocytic lymphoma, as it was then known) was a combination of cyclophosphamide, doxorubicin, vincristine, and prednisone. This regimen, or a modified version of it, known as CHOP and initially developed at the National Cancer Institute in the US, 1 was generally given every 3 weeks for six cycles and was the historic standard of care for lymphoma, with reported survivals of 35%-40%. In the late 1970s and early 1980s, following the work of Norton and Simon 2 in 1977 and Goldie and Coldman 3 in 1982, many advances in the design of cancer studies followed the Goldie-Coldman hypothesis which, in essence, held that the intensity and timing of treatment and the use of alternating non-cross-resistant drugs are critical for the success of cancer therapy. As a result, multiple second-generation regimens for lymphoma were reported in the early 1980s, including COP-BLAM, m-BACOD, and M-BACOD, with reported survivals of 55%-60%. These were followed by third-generation regimens for diffuse large cell lymphoma, including ProMACE-MOPP, COP-BLAM III, ProMACE-CytaBOM, and MACOP-B, with reported overall survivals of 65%-75% (Table 1). The reports from these second and third generations were so astonishing that many considered continued use of the "historic" standard of CHOP to be unethical. An editorial in the Annals of Internal Medicine in 1985 concluded that "the results of second- and third-generation chemotherapy regimens are so consistently good from so many independent sources, that they continue to engender even more ferment in the treatment of large cell lymphoma." 4
Against this general background, in the late 1980s, the Southwest Oncology Group and the Eastern Cooperative Oncology Group in the US initiated a prospective randomized phase III trial comparing the standard CHOP regimen with three intensive chemotherapy regimens for advanced lymphomas. The results, published in the New England Journal of Medicine in 1993, astounded the hematology community: overall survival was similar for all regimens, and there was no subgroup of patients in which survival was improved by a third-generation regimen (Figure 1). 5 Furthermore, the CHOP regimen was less toxic, leading to the conclusion that CHOP remained the best available treatment for patients with advanced-stage intermediate- or high-grade lymphomas. These remarkable results highlighted the difficulty of interpreting limited phase II data, given their inherent selection biases. To this day CHOP remains the standard of care for aggressive lymphomas and is the yardstick against which all new advances are compared. The only proven advance in the management of lymphoma has been the addition of rituximab, which was established through a carefully controlled phase III study in which CHOP alone was the comparator arm. 6

Relapsed Aggressive Lymphoma
Another example relates to the management of relapsed aggressive lymphomas. Early data in the 1980s suggested that the results of autologous transplantation were far superior to those of conventional chemotherapy, which in fact yielded almost no cures for the disease. Nevertheless, given the lessons learned from the phase III study of CHOP, some skepticism existed in the hematologic community, and the need for a prospective phase III study was clearly apparent. The PARMA study (Figure 2) was designed specifically for this purpose in 1987. Recruitment was difficult because many practitioners were reluctant to offer standard chemotherapy even to those with the better prognosis among the relapsed groups. Preliminary data, presented at international meetings in 1992 and 1993 (Figure 3), were widely interpreted as demonstrating that high-dose therapy with autologous transplantation did not provide a significant improvement. This created quite a stir in the transplant community until the definitive results of the trial were published in the New England Journal of Medicine in 1995, demonstrating that, compared with conventional chemotherapy, high-dose chemotherapy followed by autologous bone marrow transplantation increased survival in patients with chemotherapy-sensitive relapsed lymphoma (Figure 4).

Complete Remission
It has long been known that achieving a complete remission is the sine qua non for long-term survival, and induction of remission has been fairly standardized over the past four decades. Standard induction for AML consists of 3 days of an anthracycline, usually daunorubicin, together with 7 days of cytarabine. The problem here relates to data published in the late 1980s and the 1990s, which indicated that, with virtually identical drug regimens, the complete remission rate varied from 55%-60% in the Southwest Oncology Group (SWOG) in the US, to 65%-70% in the Eastern Cooperative Oncology Group (ECOG) in the US, 70%-75% in the Cancer and Leukemia Group B (CALGB) in the US, and 75%-85% in the Medical Research Council (MRC) in Britain (Table 2). Despite these differences in the complete remission rate, the overall outcome of AML in younger adults is virtually identical in each of the major groups when survival is evaluated from diagnosis (Figure 5). 7 The question remained how such identical survival results could be achieved when the reported complete remission rates were so heterogeneous. Although not always clearly specified in the manuscripts, it was clear to practitioners that these discrepancies did not reflect an inherent difference in practice or in responses among institutions. The explanation lies in differences in the requirements for, or the definition of, a complete response. In SWOG, for example, patients needed to undergo central review at diagnosis and upon recovery of blood counts in order to confirm a complete remission. In ECOG, although central review was not required at the achievement of complete remission, final blood results needed to be obtained at an ECOG-certified laboratory. This meant that if a patient was discharged from the hospital in apparent remission but with a platelet count of 70,000/µL, and the confirmatory platelet count of over 100,000/µL required for the definition of complete remission was not performed at an ECOG-certified laboratory, such a patient could not be categorized as achieving complete remission (Table 3).
Such subtle differences need to be clearly described in published reports to avoid either under- or over-interpretation of data.
Table notes: CBC, complete blood count; CR, complete remission. Published with permission from Rowe JM and Avivi I. 19 ANC, absolute neutrophil count; G-CSF, granulocyte colony-stimulating factor; GM-CSF, granulocyte-macrophage colony-stimulating factor.

Phase III Studies in AML
Growth factors, granulocyte colony-stimulating factor (G-CSF) or granulocyte-macrophage colony-stimulating factor (GM-CSF), have now been demonstrated in 18 controlled studies to shorten the period of neutropenia by 4-7 days (Table 4). Despite the safety demonstrated in virtually every study, there are still physicians who hesitate to use growth factors during induction therapy because of safety concerns related to the known increased blast cell proliferation.
The controversy has persisted for almost two decades. One of the early negative papers on the use of cytokines was a report from the Cancer and Leukemia Group B which suggested no benefit for the use of growth factors in AML. This was a well-conducted, prospectively randomized study comparing GM-CSF with placebo. 8 However, the GM-CSF used in this study was E. coli-derived, a nonglycosylated GM-CSF that was highly toxic (and for this reason is no longer in clinical use). Many patients developed a rash and a fever, and the drug was discontinued during the trial period because of safety concerns.
The authors correctly noted that the study drug was discontinued in about one-third of patients in each group, presumably because the treating physician perceived that the patient had severe GM-CSF-associated toxicities, mostly rash and fever; 60/187 patients in the GM-CSF group and 56/189 patients in the placebo group were removed from the study.
However, what was not considered is that GM-CSF was discontinued precisely in those patients who were affected by rash or fever. Thus, the lack of benefit in the study may have reflected the fact that the very patients who might have benefited most from the cytokine did not continue to receive it. The point here is to emphasize the need to understand the precise study conditions and the caution needed in interpreting even prospectively designed, placebo-controlled phase III studies.
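The discontinuation figures quoted above can be checked with a few lines of arithmetic. The sketch below is purely illustrative: it uses only the counts given in the text (60/187 and 56/189), and the variable and function names are chosen here for readability rather than taken from the study.

```python
# Discontinuation counts as quoted in the text (removed / randomized).
gm_csf_removed, gm_csf_total = 60, 187
placebo_removed, placebo_total = 56, 189


def proportion(events: int, total: int) -> float:
    """Return events / total as a fraction between 0 and 1."""
    return events / total


gm_csf_rate = proportion(gm_csf_removed, gm_csf_total)
placebo_rate = proportion(placebo_removed, placebo_total)

# Patients in the GM-CSF arm who were not removed from the study.
gm_csf_remaining = gm_csf_total - gm_csf_removed

print(f"GM-CSF arm removed:  {gm_csf_rate:.1%}")
print(f"Placebo arm removed: {placebo_rate:.1%}")
print(f"GM-CSF arm remaining on study: {gm_csf_remaining} of {gm_csf_total}")
```

Both proportions are close to one-third, which is why any comparison of the two arms rests on a substantially reduced group of patients who actually continued the assigned drug.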

Lessons from Very Large Studies
The International Acute Lymphoblastic Leukemia (ALL) Study, jointly conducted by the Eastern Cooperative Oncology Group in the US and the Medical Research Council in Britain, was a large prospective study of 2,000 newly diagnosed patients with ALL. In this study, patients were treated identically on both sides of the Atlantic, with the data centralized in one center. ALL is a relatively uncommon disease in adults, with only approximately 1,500 new adult patients in the US per year. Because accrual to cooperative group studies in acute leukemia in the US is no greater than 5%-10%, fewer than 150 adult patients with ALL per year would be available for major cooperative group trials. These numbers make it immediately clear that national and international collaboration is needed to obtain any definitive information on this disease, and this was, in fact, established in the International ALL trial. Prior to the initiation of this study in 1993, patients with standard-risk ALL were never considered for an allogeneic transplant in first complete remission. In fact, the largest trial of bone marrow transplantation prior to the International ALL Study was the French LALA-94 study, which was published in 2004. 9 That study demonstrated a benefit for high-risk ALL patients who had a sibling donor over those who did not. However, standard-risk patients (i.e. those younger than 35 years who did not have a high white cell count at presentation and who went into remission within the first 4 weeks) were not even studied.
In contrast, the results of the large International ALL Study surprised the international community by demonstrating, first, that standard-risk patients had a better outcome if offered an allogeneic transplant from a matched sibling in first complete remission (Figure 6) and, second, that high-risk patients, mostly those over the age of 35, had an unexpectedly high non-relapse mortality that abrogated the benefit of allogeneic transplantation in this group (Figure 7). Prior to the results of this study, there had been a common perception that the well-known graft-versus-leukemia effect had only a minimal, if any, role in ALL. This study established, quite unequivocally, the very potent graft-versus-leukemia effect in ALL, as demonstrated in both standard- and high-risk patients (Figure 8).
Prior to 2005, there was little definitive information about cytogenetics in ALL. Although cytogenetics had been accepted as prognostically critical in AML, there was a paucity of information in ALL, mostly due to the small number of patients in the studies. It was known that the Philadelphia chromosome conferred a poor prognosis, but little else was confirmed. A complex karyotype in ALL was intuitively thought to portend a poor prognosis, as had been established in AML, but there had been no data to confirm this. The large International ALL Study established, for the first time, the poor prognosis of patients with a complex karyotype (Figure 9), 10 when compared with all other Philadelphia-chromosome-negative patients. This transatlantic partnership confirmed the need for, and the feasibility of, large studies and emphasized the importance of collaboration among groups in uncommon disorders.
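The accrual argument made earlier in this section can be written out as a short calculation. This is a rough sketch using only the approximate figures quoted in the text (about 1,500 new adult cases per year in the US and 5%-10% accrual). The final step, estimating how long a single national effort would need to reach 2,000 patients, is an illustrative extrapolation that assumes every accrued patient enters the same trial; it is not a figure from the study.

```python
import math

# Approximate figures quoted in the text.
new_adult_cases_per_year = 1500
accrual_percent_low, accrual_percent_high = 5, 10
target_enrollment = 2000  # size of the International ALL Study

# Integer arithmetic avoids floating-point rounding in the percentages.
patients_per_year_low = new_adult_cases_per_year * accrual_percent_low // 100
patients_per_year_high = new_adult_cases_per_year * accrual_percent_high // 100

# Illustrative assumption: all accrued patients enter one single trial.
years_best_case = math.ceil(target_enrollment / patients_per_year_high)
years_worst_case = math.ceil(target_enrollment / patients_per_year_low)

print(f"Patients available per year: {patients_per_year_low}-{patients_per_year_high}")
print(f"Years to enroll {target_enrollment}: {years_best_case}-{years_worst_case}")
```

Under these assumptions a purely national trial of this size would take well over a decade, which is the practical reason the text gives for international collaboration.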

Graft-versus-Host Disease
A careful examination of the bone marrow transplantation (BMT) literature illustrates the need for care in assessing the implications of newly published data.
Graft-versus-host disease (GvHD) had been the "scourge" of BMT, with mortality rates approaching 30%-40%, depending on the type of donor and the disease. It was known that GvHD is primarily initiated by donor T-cells, and thus, in the 1980s, investigators considered whether T-cell depletion could prevent or ameliorate GvHD. It was clear in the early 1980s that, although technologies were in place for successful T-cell depletion, the procedure itself carried formidable problems, mostly those of graft failure. 11 It appeared that T-cells in the donor marrow were critical for sustained engraftment, which dampened the enthusiasm for this manipulation. In 1987, the first report of successful GvHD prevention, without graft failure, in human leukocyte antigen (HLA)-identical allogeneic bone marrow transplants was published, using marrow that was depleted of T-cells by monoclonal antibodies and complement. 12 In the same year, multiple reports of successful T-cell depletion resulted in a short-lived euphoria in which the problem of GvHD was thought to be "history." The ink had barely dried on these papers when the excitement was dampened by reports in 1988 that pointed out an increased risk of relapse associated with T-cell depletion. 13 In the subsequent year or two, multiple reports confirmed early relapse after allogeneic transplantation when T-cell depletion had been used.
A seminal study reported in 1991 by Marmont in Italy 14 demonstrated markedly increased relapse among 440 T-cell-depleted patients compared with 1,328 non-T-cell-depleted patients, with no corresponding benefit in overall survival (Figure 10).
The importance of the graft-versus-leukemia effect in humans has now been firmly established and was confirmed across a wide range of diseases in a classic paper summarizing data from the International Bone Marrow Transplant Registry (Figure 11). This retrospective registry study confirmed, in very large numbers, the increased relapse rate among recipients of syngeneic twin transplants or patients undergoing T-cell depletion, compared with those experiencing acute or chronic GvHD, or both. 15

Timing of Bone Marrow Transplantation in Leukemia
Allogeneic transplantation in first remission is, in general, recommended as the standard approach for patients at high risk of relapse with conventional therapy. Without doubt, allogeneic transplantation provides the most efficacious antileukemic therapy due to the potent graft-versus-leukemia (GVL) effect, and data have confirmed that allogeneic transplantation confers the lowest relapse rate for every subtype of AML. The high transplant-related morbidity and mortality is the only reason for not offering this to every patient with ALL or AML. In essence, this is a delicate balance between efficacy and toxicity. 16 One of the most important issues relates to the timing of transplant. The foremost question among practitioners and patients is whether, given the high procedural mortality, such a procedure should preferably be reserved for patients in second complete remission or at relapse. Such considerations are bolstered by data demonstrating reasonable survival if transplant is performed in second remission (Figure 12). Given the high non-relapse mortality of allogeneic transplantation, such data may sway patients away from transplant in first remission. While there is no doubt that allogeneic transplantation can be performed successfully in second complete remission, such reports are highly selective and confined to the small group of patients who survived the relapse, achieved a second complete remission, were fit enough to undergo a transplant, and had a donor available. This represents a small minority of patients. The overall survival for all relapsed patients is no more than about 10%. 17,18 Thus, presenting the optimistic data of second complete remission (CR2) to patients at diagnosis is thoroughly misleading and clearly needs to be avoided.
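The selection effect described above can be made concrete with a small calculation. Every input below is hypothetical and chosen only to show the mechanism. None of the numbers come from the cited studies, and the sketch ignores the few patients who might survive without a transplant.

```python
# All inputs are HYPOTHETICAL, chosen only to illustrate the selection effect.
relapsed_patients = 100
p_second_remission = 0.40       # assumed fraction of relapsed patients achieving CR2
p_reach_transplant = 0.50       # assumed fraction of CR2 patients fit and with a donor
p_survive_after_cr2_bmt = 0.40  # assumed survival among those transplanted in CR2

# Patients who pass every selection step and are actually transplanted in CR2.
transplanted_in_cr2 = relapsed_patients * p_second_remission * p_reach_transplant
survivors = transplanted_in_cr2 * p_survive_after_cr2_bmt

# Survival as reported in a CR2 transplant series (denominator: transplanted only).
survival_among_transplanted = survivors / transplanted_in_cr2
# Survival from the patient's perspective at relapse (denominator: all relapsed).
survival_among_all_relapsed = survivors / relapsed_patients

print(f"Transplanted in CR2: {transplanted_in_cr2:.0f} of {relapsed_patients}")
print(f"Survival among transplanted:  {survival_among_transplanted:.0%}")
print(f"Survival among all relapsed:  {survival_among_all_relapsed:.0%}")
```

With these made-up inputs, the same survivors yield a 40% figure when only transplanted patients are counted and a single-digit figure when all relapsed patients form the denominator, which is the distinction the text draws between CR2 series and the outlook at diagnosis.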

INTENTION-TO-TREAT ANALYSES
Phase III studies, representing prospective randomized trials, are the gold standard, especially when analyzed by intention to treat. However, it is crucial to understand the limitations of such analyses. For example, phase III studies of transplantation usually underestimate the toxicity of the procedure because the donor arm is diluted by patients who do not receive the transplant. They may also underestimate or overestimate efficacy, depending on whether transplant is better than the comparator. Furthermore, intention-to-treat analyses from diagnosis do not provide information for individual patients at specified time points.
Importantly, a generic issue in transplant studies relates to the large number of patients who do not undergo the assigned or randomized procedure. An intention-to-treat analysis can only be reliably interpreted if patients actually receive the treatment specified in their assignment or randomization. This is a notorious problem in autologous transplantation, where as many as 50% of randomized patients do not receive their assigned therapy (Figure 13). Although outcome curves based on intention-to-treat analyses are routinely published for such studies, their true meaning is entirely unknown. Such data need to be interpreted with a great deal of circumspection.
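The dilution described in this section can be sketched numerically. The example below is a simplified illustration, not an analysis of any cited trial. The survival and toxicity rates are hypothetical, and it assumes that patients who do not receive the assigned transplant have the same outcome as the comparator arm. Only the 50% figure for patients not receiving assigned therapy comes from the text.

```python
def itt_arm_rate(rate_if_treated: float, rate_if_not_treated: float,
                 fraction_treated: float) -> float:
    """Outcome rate observed in an arm analyzed by intention to treat,
    as a weighted average of those who did and did not receive the treatment."""
    return (fraction_treated * rate_if_treated
            + (1 - fraction_treated) * rate_if_not_treated)


# HYPOTHETICAL rates, for illustration only.
survival_with_transplant = 0.60
survival_with_comparator = 0.40
procedure_mortality = 0.20      # among patients actually transplanted
fraction_transplanted = 0.50    # from the text: up to 50% do not receive assigned therapy

# Assumption: untransplanted patients do as well as the comparator arm
# and have no procedure-related mortality.
itt_survival = itt_arm_rate(survival_with_transplant, survival_with_comparator,
                            fraction_transplanted)
itt_toxicity = itt_arm_rate(procedure_mortality, 0.0, fraction_transplanted)

true_difference = survival_with_transplant - survival_with_comparator
itt_difference = itt_survival - survival_with_comparator

print(f"True survival difference: {true_difference:.0%}")
print(f"ITT survival difference:  {itt_difference:.0%}")
print(f"Procedure mortality, treated vs ITT arm: {procedure_mortality:.0%} vs {itt_toxicity:.0%}")
```

Under these assumptions both the apparent benefit and the apparent toxicity of the transplant arm are halved, which shows why an intention-to-treat curve with substantial non-receipt of the assigned therapy does not describe what happens to a patient who is actually transplanted.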

CONCLUSION
When comparing studies, differences in patient population, study conditions, study eligibility, and the conduct of a study, however subtle, all limit direct comparability. It is crucial to be particularly careful in interpreting small studies and to be wary of early communication of data.
Lastly, even when studies are well conducted, it is vital to understand precisely what large studies tell us and what they do not. The limitations of intention-to-treat analyses must be understood when considering published data. While good phase II data provide the backbone for further investigations, adequately sized, prospective phase III studies, conducted by a collaborative group of investigators, are the only way to move forward with definitive information.