The ghosts of HeLa: How cell line misidentification contaminates the scientific literature


While problems with cell line misidentification have been known for decades, an unknown number of published papers remain in circulation that report on the wrong cells without warning or correction. Here we attempt to make a conservative estimate of this ‘contaminated’ literature. We found 32,755 articles reporting on research with misidentified cells, in turn cited by an estimated half a million other papers. The contamination of the literature is not decreasing over time and is anything but restricted to countries in the periphery of global science. The decades-old and often contentious attempts to stop misidentification of cell lines have proven insufficient. The contamination of the literature calls for a fair and reasonable notification system, warning users and readers to interpret these papers with appropriate care.

The misidentification of cell lines is a stubborn problem in the biomedical sciences, contributing to the growing concerns about errors, false conclusions and irreproducible experiments [1, 2]. As a result of mislabelled samples, cross-contaminations, or inadequate protocols, some research papers report results for lung cancer cells that turn out to be liver carcinoma, or human cell lines that turn out to be rat [3, 4]. In some cases, these errors may only marginally affect results; in others they render results meaningless [4].

The problems with cell line misidentification [5] have been known for decades, commencing with the controversies around HeLa cells in the 1960s [6–10]. In spite of several warnings and initiatives to remedy the problem, misidentification continues to haunt biomedical research, and new reports of large-scale cross-contaminations and widespread use of misidentified cell lines continue to appear [11–13]. Although no exact numbers are known, the extent of cell line misidentification is estimated at between one fifth and one third of all cell lines [4, 14]. (Currently only 488, or 0.6%, of over 80,000 known cell lines have been reported as misidentified, but most cell lines are used infrequently [15].) In addition, misidentified cell lines continue to be used under their false identities long after they have been unmasked [16], while other researchers continue to build on the results obtained with them. Given the biomedical nature of the research conducted on these cell lines, the consequences of false findings are potentially severe and costly [17], with grants, patents and even drug trials based on misidentified cells [18]. Several case studies performed by the International Cell Line Authentication Committee (ICLAC) highlight some of the potential consequences of using misidentified cell lines [19, 20]. Especially in the last decade, the gravity of the problem has been widely acknowledged, with several calls for immediate action in journal articles [3, 12, 21–23], requirements for grant applications (e.g. [24, 25]) and even an open letter to the US secretary of health [26].

The current calls for action and remediation activities are almost exclusively concerned with avoiding future contaminations, for example through systems for easier verification of cell line identities. Various solutions have been proposed [27–29], including genotypic identification through short tandem repeats (STR) [30]. In addition, authors are expected to check overviews of misidentified cells (such as [12, 15, 27, 31]) before conducting their experiments. However, little attention is currently paid to the damage that has already been done through the past distribution of research articles based on misidentified cells. Although systems such as retractions and corrections are available to alert other researchers to potential problems in publications, these systems are rarely used to flag problems with cell lines [20, 32]. Even if future misidentifications could be avoided completely (which is not likely, given the track record of earlier attempts), these ‘contaminated’ articles will therefore continue to affect research.

Before any action can be taken, it is essential that we get a sense of the size and nature of the problem of contaminated literature. This raises several questions. First, how many research articles have been based on misidentified or contaminated cell lines, and how wide is their influence on the scientific literature? Second, what can we say about origins and trends in the contaminated literature? Is the problem diminishing over time, and is it restricted to peripheral regions of the world’s research, where protocols are perhaps less strict? Third, what could be appropriate ways to deal with the contaminated literature? To answer these questions, we searched the literature for research papers using cell lines that are known to have been misidentified. To put the results of this search in perspective, we analysed the precise complications of misidentification for three particular cell lines.
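The kind of literature search described here can be illustrated with a minimal sketch. It is not the procedure used in this study; it only shows, under stated assumptions, how article texts might be screened against a register of misidentified cell lines. The cell line names (`ABC-1`, `QX`), the article identifiers and the function names are all hypothetical placeholders. The main design concern is false positives: a short name should not match inside a longer designation, so the pattern requires that a hit is not surrounded by further letters, digits or hyphens.

```python
import re

def build_pattern(names):
    """Compile one regex matching any listed cell line name as a whole token."""
    # Longest names first, so a longer designation is tried before its prefix.
    escaped = [re.escape(n) for n in sorted(names, key=len, reverse=True)]
    # Lookarounds reject hits embedded in a longer token (e.g. ABC-1 in ABC-12).
    return re.compile(
        r"(?<![A-Za-z0-9-])(?:" + "|".join(escaped) + r")(?![A-Za-z0-9-])"
    )

def flag_articles(articles, names):
    """Return {article_id: sorted list of misidentified lines mentioned}."""
    pattern = build_pattern(names)
    flagged = {}
    for article_id, text in articles.items():
        hits = sorted(set(pattern.findall(text)))
        if hits:
            flagged[article_id] = hits
    return flagged

# Hypothetical register entries and article texts, for illustration only.
MISIDENTIFIED = ["ABC-1", "QX"]
ARTICLES = {
    "a1": "Cells of the ABC-1 line were cultured for 48 h.",
    "a2": "We used ABC-12 cells as a control.",
    "a3": "QX and ABC-1 cells were compared.",
}

FLAGGED = flag_articles(ARTICLES, MISIDENTIFIED)
for article_id, lines in FLAGGED.items():
    print(article_id, lines)
```

A name match of this kind only identifies candidate articles. Whether a flagged paper actually relied on the misidentified cells, and whether its conclusions are affected, still requires reading the paper.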

