A diversified informatics portfolio covering health sciences and healthcare
Ohno-Machado, Lucila
doi: 10.1093/jamia/ocy163
pmid: N/A
We have been highlighting the expanded quantity and quality of articles published in JAMIA for the past few years. They reflect how informatics has grown from a relatively small and lesser-known healthcare and biomedical science specialty, when the journal started 25 years ago, into a well-recognized discipline with distinct foundations and applications that are relevant to various domain areas. The 2018 closing issue of JAMIA exemplifies the breadth and depth of informatics: it presents applications in global and public health (p. 1608, p. 1586), healthcare (p. 1600, p. 1634), and behavioral science (p. 1675), and it describes foundational work in vocabulary mapping (p. 1618), privacy protection of patient records (p. 1593), statistical methods for longitudinal data (p. 1669), and a study on the integrity of clinical information in diagnostic imaging orders (p. 1651). Articles in this issue also review how mobile health applications can be leveraged for citizen science (p. 1685), discuss factors that are important for patient portal engagement (p. 1626), and show how deep neural networks can be used to provide expert-level sleep scoring (p. 1643). Finally, AMIA’s list of core competencies (to be achieved as a result of health informatics education) is presented. JAMIA’s articles represent the best work in our field, and it is no surprise that they have been featured in the lay press, as well as in popular “Year in Review” panels at AMIA conferences. In 2018, we provided our readers with an assortment of established and emerging research topics, as well as authoritative reviews and perspectives from informaticians around the world. Reporting on the growth of our discipline while affording new authors the opportunity to feature their best work together with established senior professionals in our field is an important function of JAMIA. 
We wish the entire JAMIA family (readers, authors, reviewers, editorial and production teams) happy holidays, and congratulate everyone for the impressive achievements in 2018. We look forward to 2019, in which a new and highly qualified editorial team will guide AMIA’s flagship publication to new heights. © The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Reflections on the journey of editing a scientific journal
Ohno-Machado, Lucila
doi: 10.1093/jamia/ocy167
pmid: 30541126
In my first editorial, I stated my goals to implement changes that would disseminate JAMIA to a broader audience, expand its contents, and optimize its management.1 In my final editorial 8 years later, I share with you my journey toward accomplishing these goals. In 1994, Bill Stead organized a group of senior American Medical Informatics Association (AMIA) members to found the Journal of the American Medical Informatics Association (JAMIA).2 Bill, the founding editor-in-chief from 1994 to 2003, is one of the pioneers of the biomedical informatics field and a recognized leader in the academic medical center community. As its first editor, Bill set JAMIA’s original vision and mission; he stepped down after a decade of service and at a stage when the journal was a recognized asset to AMIA and the informatics community in general.3 Randy Miller (editor-in-chief, 2004–2010) succeeded him, further advancing JAMIA’s mission and solidifying its status as AMIA’s flagship publication. Eight years ago, when Randy passed me the JAMIA torch, he gallantly wrote “All’s well that ends well for JAMIA editors,”4 referring to the outcome of the search for his successor. I learned from Randy to be attentive to every single detail. (A hallmark of an editor-in-chief, in addition to setting up the vision and strategy for the journal, is the search for perfection. I found the best role models for this in Bill and Randy.) It was an honor to be selected for the role. I had been an associate editor for a few years, but I had not planned to be the editor when I first started serving as a reviewer. (When I was in graduate school, Ted Shortliffe taught me how to review a biomedical informatics article, and Mark Musen taught me how to write one. Bob Greenes ensured I did so while I was junior faculty at the Decision Systems Group, Brigham and Women’s Hospital, Harvard Medical School. I am lucky to have had their guidance and support for so many years.) 
I was ecstatic and somewhat surprised to have been selected, particularly because English is not my native language and I still had much to learn about editorial processes. I immediately accepted, having little time to reflect on what it truly meant to steer AMIA’s flagship publication, and how critical this role was for so many readers, authors, reviewers, and JAMIA’s editorial team. If I had thought too much about it, it might have been overwhelming, but being somewhat naïve turned out to be an asset: I did not think at any single moment that I would not be able to do the job; I just did not know how much of my time it would consume, which innovations I would bring forward, and which barriers I would need to overcome. As with similar professional or personal challenges, this was one to be attacked head-on, with confidence, a knowledge-seeking attitude, humility, and pride. In a “trial by fire,” I learned how to deal with extraordinary situations that took a lot of unexpected time, such as response to plagiarism, accusations of delivering biased or uninformed reviews, discovery of hidden conflicts of interest, authorship disputes, retractions, corrections, attempts to influence editorial decisions, and threats of lawsuits and retaliations. Fortunately, there are many sources of knowledge and support for many of these items, and the publisher’s and AMIA’s staff were always ready to help, so these temporary problems were overcome quickly. I learned to be efficient with time so my daytime job would not suffer from my dedication to JAMIA, and continued to improve my own writing for clarity, grammar, and style. I had the invaluable help from a technical editor: Dr Michele Day has provided insightful requests for clarification, suggested word replacements, and noticed lack of flow from paragraph to paragraph for most of the 60-plus “highlight” pieces and editorials. 
She taught me how to write better English (which may have resulted in better writing in Portuguese, too, but the hypothesis remains untested). I started the online-only special issues of JAMIA, and later helped the journal “go green” at the same time we transitioned from a bimonthly to a monthly publication. Another innovation I introduced was the JAMIA Journal Club. The rationale was simple and timely. When I became JAMIA’s editor-in-chief, I had recently started a new biomedical informatics program at the University of California, San Diego, after spending many years as a faculty member in Boston’s Harvard–Massachusetts Institute of Technology system. Given the small size of our new program, I missed meeting with various colleagues in journal clubs and seminars. Additionally, I thought JAMIA could benefit from live presentations by authors of outstanding papers, and the virtual journal clubs would provide an open forum to discuss the latest informatics innovations, especially for informaticians who hold positions in institutions without training programs or academic informatics groups. With a live (and recorded) journal club, JAMIA could also be known to a wider audience that could “listen to” instead of read an article. For these reasons, we started the monthly JAMIA Journal Club in 2012. The JAMIA Journal Club has been accomplishing its goals and is still ongoing because of the work of the student editorial board,5 which was another JAMIA innovation later replicated by other journals. However, this one I did not invent: I encouraged it to continue because it was a brilliant idea. Trainees could witness the review process as reviewers under the supervision of an associate editor, and understand the statistics and trends for the journal, thus cultivating a new generation of editors. I thank all readers and authors of JAMIA, the AMIA staff, and publishers. 
I am especially thankful to the associate editors who served as student editorial board organizers, our current associate editors for their input on the directions of the journal and selection of peers for the editorial board, and all associate editors who have rotated through the position over the past 8 years, including associate guest editors of special issues (there were 44 in total). They brought new themes to JAMIA, as well as new authors and perspectives that enriched our field. Their contributions helped JAMIA continue to stand out at a time when a plethora of new informatics-related dissemination venues emerged and there was great concern about the sustainability of traditional scientific journals.6 I will not name everyone here, as I am afraid of making a critical omission, but I would like to ask that our community keep recognizing their efforts. The editorial team is the secret sauce in running the journal: it is composed of AMIA members voted in by peers as a result of a process that has improved over many years. The vote by the incumbent associate editorial team recognizes informaticians for being outstanding experts in their respective areas, as well as for their ability to review manuscripts fairly, insightfully, constructively, and in a timely manner; our authors and readers deserve no less. The editorial team helps ensure that our service to the scientific community is completed with utmost integrity and that it is inclusive, impactful, and impeccable. A key function of the editor-in-chief is to organize the team to achieve this goal. I trust that we were very effective at that, as can be shown by conventional and nonconventional measures of journal success; our team processed over 10 000 articles in the past 8 years, and we lowered the average and median times to first decision to under 30 days. Our articles have been read by millions of people worldwide, and we have received submissions from over 90 countries.
We had millions of downloads and views, and the skyrocketing number of citations reflects the dissemination of informatics across many other disciplines. We achieved all this because we “stood on the shoulders of giants,” who made the journal an invaluable asset to AMIA and the informatics community at large, and because we kept improving on this legacy. At the end of my second term, one thing is certain: time flies, whether one is having fun or not. In this case, I had lots of fun. With the same blend of sadness, happiness, accomplishment, and anxiety I felt when I left my oldest son for the first time in daycare or when my youngest son departed for college, I am passing the JAMIA torch to the new editor-in-chief, Sue Bakken. It is reassuring to know that the journal will be in great hands, as she is exceedingly qualified and will take JAMIA to new heights. I thank you all for the unique opportunity to serve as your editor-in-chief for 8 productive and enjoyable years. Looking back, it was a lot of work, a lot of rewards, but, most importantly, a lot learned from people with so many different backgrounds, aspirations, beliefs, and goals.
REFERENCES
1 Ohno-Machado L. A new JAMIA. J Am Med Inform Assoc 2011; 18(1): 2.
2 Stead WW. JAMIA – why? J Am Med Inform Assoc 1994; 1(1): 75–6.
3 Brennan PF, Humphreys BL, Masys DR, Miller RA. Kudos to Dr. Stead. J Am Med Inform Assoc 2003; 10(1): 108–9.
4 Miller RA. All’s well that ends well for JAMIA editors. J Am Med Inform Assoc 2010; 17(6): 624–5.
5 Johnson K, Miller RA. The JAMIA student editorial board: peer review education in biomedical informatics. J Am Med Inform Assoc 2004; 11(1): 87–8.
6 Shortliffe EH, Lorenzi NM, Greenwood K, Broussard AN, Miller RA. JAMIA looks to the future amidst profound changes in the world of publishing. J Am Med Inform Assoc 2010; 17(1): 1–2.
Discovering foodborne illness in online restaurant reviews
Effland, Thomas; Lawson, Anna; Balter, Sharon; Devinney, Katelynn; Reddy, Vasudha; Waechter, HaeNa; Gravano, Luis; Hsu, Daniel
doi: 10.1093/jamia/ocx093
pmid: 29329402
Abstract
Objective
We developed a system for the discovery of foodborne illness mentioned in online Yelp restaurant reviews using text classification. The system is used by the New York City Department of Health and Mental Hygiene (DOHMH) to monitor Yelp for foodborne illness complaints.
Materials and Methods
We built classifiers for 2 tasks: (1) determining if a review indicated a person experiencing foodborne illness and (2) determining if a review indicated multiple people experiencing foodborne illness. We first developed a prototype classifier in 2012 for both tasks using a small labeled dataset. Over years of system deployment, DOHMH epidemiologists labeled 13 526 reviews selected by this classifier. We used these biased data and a sample of complementary reviews in a principled bias-adjusted training scheme to develop significantly improved classifiers. Finally, we performed an error analysis of the best resulting classifiers.
Results
We found that logistic regression trained with bias-adjusted augmented data performed best for both classification tasks, with F1-scores of 87% and 66% for tasks 1 and 2, respectively.
Discussion
Our error analysis revealed that the inability of our models to account for long phrases caused the most errors. Our bias-adjusted training scheme illustrates how to improve a classification system iteratively by exploiting available biased labeled data.
Conclusions
Our system has been instrumental in the identification of 10 outbreaks and 8523 complaints of foodborne illness associated with New York City restaurants since July 2012. Our evaluation has identified strong classifiers for both tasks, whose deployment will allow DOHMH epidemiologists to more effectively monitor Yelp for foodborne illness investigations.
Keywords: machine learning, social media, foodborne diseases, text mining, classification
BACKGROUND AND SIGNIFICANCE
Foodborne illness remains a major public health concern nationwide.
The Centers for Disease Control and Prevention (CDC) estimates that there are 48 million illnesses and >3000 deaths caused by the consumption of contaminated food in the United States each year.1 Of the approximately 1200 foodborne outbreaks reported and investigated nationally, 68% are restaurant-related.2 Most restaurant-associated outbreaks are identified via health department complaint systems. However, there are potentially valuable data sources emerging that could be incorporated in outbreak detection. Specifically, the increasing use of social media has provided a public platform for users to disclose serious real-life incidents, such as food poisoning, that may not be reported through established complaint systems. As a result of the increasing interest and potential value of social media data, research institutions are partnering with public health agencies to develop methods and applications to use data from social media to monitor outbreaks of infectious diseases. Textual data from Internet search engines and social media have been used to monitor outbreaks of various infectious diseases, such as influenza.3 An evaluation comparing the use of informal and unconventional outbreak detection methods against traditional methods found that the informal source was the first to report in 70% of outbreaks, supporting the usefulness of such systems.4 The incorporation of social media data into public health surveillance systems is becoming more common. Multiple projects focus on identifying incidents of foodborne illness using data from Twitter. 
Harvard Medical School developed and maintains a machine learning platform, HealthMap Foodborne Dashboard, to identify complaints and occurrences of foodborne illness and send a survey link where Twitter users can provide more information; this platform is freely available for research.5 The Chicago Department of Public Health partnered with the Smart Chicago Collaborative to develop Foodborne Chicago, which also uses machine learning to identify tweets indicating foodborne illness and also sends a survey link where Twitter users can provide more information.6 The Southern Nevada Health District developed nEmesis, an application that associates a user’s previous locations with subsequent tweets indicating foodborne illness.7 In this study, we use data from consumer reviews obtained from the popular website Yelp. A comparison of food vehicles associated with outbreaks from the CDC Foodborne Outbreak Online Database and data extracted from Yelp reviews indicating foodborne illness and implicating a specific food item found that the distribution of food categories was very similar between the 2 sources, supporting the usefulness of these data in public health responses.8 Furthermore, Yelp reviews can be directly linked with individual restaurant locations, allowing for targeted and timely response. Since 2012, the Computer Science Department at Columbia University has been collaborating with the New York City (NYC) Department of Health and Mental Hygiene (DOHMH) to develop a system that applies data mining and uses text classification to identify restaurant reviews on Yelp indicating foodborne illness, which are later manually reviewed and classified by DOHMH epidemiologists. This system was used in a pilot study from July 1, 2012, to March 31, 2013, and found 468 Yelp reviews that described a foodborne illness occurrence.9 Of these 468 reviews, only 3% of the illness incidents had been reported to the DOHMH by calling NYC’s citywide complaint system, 311. 
Investigations as a result of these reviews led to the discovery of 3 previously unknown foodborne illness outbreaks, approximately 10% of the total number of restaurant-associated outbreaks identified during the pilot project’s time period. This highlighted the need to mine Yelp reviews to improve the identification and investigation of foodborne illness outbreaks in NYC. Due to the success of the pilot study, DOHMH integrated Yelp reviews into its foodborne illness complaint surveillance system and continues to mine Yelp reviews and investigate those pertaining to foodborne illness; this process has been instrumental in the identification of 10 outbreaks and 8523 reports of foodborne illness associated with NYC restaurants since July 2012.
OBJECTIVE
In this study, we aimed to evaluate the performance of several classifiers on both tasks: the prototype classifiers used by the deployed DOHMH system and multiple well-known state-of-the-art classification models. Additionally, we sought to investigate the impact of training the classifiers using data collected by the prototype system over years of deployment. This process, however, must be treated with care, as the data collected from the prototype system suffer, unavoidably, from a selection bias. To resolve this issue, we derived principled bias-adjusted training and evaluation objectives and designed training regimes that incorporate data sampled from the complement of this biased set to produce improved classifiers. We investigated the impact of these biased vs bias-adjusted training regimes and identified strong final models for both tasks.
MATERIALS AND METHODS
We first describe the overall DOHMH system design. We then describe the classification models used in our evaluation. Finally, we describe the data used in the evaluation and discuss bias-adjusted training and evaluation objectives.
Yelp system design
The system runs a daily process to pull Yelp reviews of NYC restaurants from a privately available application programming interface (API) and applies text classification techniques to classify reviews according to 2 criteria. The first criterion, referred to as the “Sick” task, corresponds to whether the review mentions the occurrence of a person experiencing foodborne illness from the restaurant. The second criterion, the “Multiple” task, corresponds to whether there was a foodborne illness event experienced by more than one person; although they are quite rare, these cases constitute significant evidence of a foodborne illness outbreak and are of special interest to DOHMH epidemiologists. After automatically classifying all new reviews according to these criteria, all reviews classified as “Sick” (ie, having a “Sick” probability >0.5) are then presented to DOHMH epidemiologists in a user interface for manual review. Upon reviewing a document, the epidemiologists record the gold standard label for both criteria. Yelp messages are sent to the authors of reviews that appear to report true incidents of foodborne illness, and an interview is attempted with each author to collect information regarding symptoms, other illnesses among the author’s dining group, and a 3-day food history. All sources of restaurant-associated foodborne illness complaints are aggregated in a daily report; outbreak investigations are initiated if multiple complaints indicating foodborne illness are received within a short period of time for one establishment, or if a complaint indicates a large group of individuals experiencing illness after a single event.
Classification methods
Prior to classification, the reviews, or documents, are converted into a representation that is usable by the classification algorithms, known as the featurization of documents.
This is done using a bag-of-words (BOW) approach by converting each document into a vector with the counts for each word in the vocabulary. The classifiers built for the operational system at DOHMH, further referred to as “prototype” classifiers, were J4.810 decision tree models, chosen for the interpretability of their decision functions. These models were trained using 500 reviews, labeled by DOHMH epidemiologists for both criteria. The 500 reviews were selected using a mix of an unbiased sample of reviews and reviews from keyword searches for terms that are intuitively indicative of foodborne illness, such as “sick,” “vomit,” “diarrhea,” and “food poisoning.” To identify the most effective classifiers for our classification tasks, we experimentally evaluated several standard document classification techniques in addition to the prototype classifiers. First, we considered improvements to the document featurization over basic BOW by including n-grams (n consecutive words) for n = 1, 2, and 3, and term frequency-inverse document frequency (TF-IDF) weights for the terms.11 For both classification tasks, “Sick” and “Multiple,” we evaluated 3 well-known supervised machine-learning classifiers: logistic regression,12 random forest,13 and support vector machine (SVM).14 Logistic regression is a classical statistical regression model where the response variable is categorical. Random forest is an ensemble of weak decision tree classifiers that vote for the final classification of the input document. SVM is a nonprobabilistic classifier that classifies new documents according to their distance from previously seen training documents. By definition, the positive examples for the “Multiple” task are a subset of the positive “Sick” examples, since at least one person must have foodborne illness for multiple people to have foodborne illness. 
Using this notion, we additionally designed a pipelined set of classifiers, further referred to as “Sick-Pipelined” classifiers, for the “Multiple” task, which first condition their predictions on the best “Sick” classifier. If the “Sick” classifier predicts “Yes,” then the “Multiple” classifier is run. Intuitively, this allows the “Multiple” classifier to focus more on the number of people involved than on whether there was a singular foodborne illness event at all. We evaluated logistic regression for this model class.
Enhanced dataset and selection bias–corrected training
Since July 2012, DOHMH epidemiologists have labeled 13 526 reviews selected for manual inspection by the prototype “Sick” classifier. These reviews are balanced for the “Sick” task, with 51% “Yes” and 49% “No” documents, but are imbalanced for the “Multiple” task, with only 13% “Yes” and 87% “No” documents. For training and evaluation, we split the data chronologically at January 1, 2017, to mirror future performance when training on historical data. This resulted in 11 551 training reviews and 1975 evaluation reviews. The training and evaluation sets have equal class distributions: 51%/49% for “Sick” and 13%/87% for “Multiple.” While these reviews contain useful information, having been selected by the prototype “Sick” classifier before labeling heavily biases them, and so they are not representative of the full (original) Yelp feed. To understand and correct for the impact of such bias, we derived a bias-adjusted training objective and augmented the training and evaluation datasets with a sample of reviews from the complement of the biased datasets in the full Yelp feed.
Selection-bias correction
To account for the selection bias of the prototype “Sick” classifier in the labeled data, we augment the training data with reviews from the set of Yelp reviews that were labeled “No” by the prototype “Sick” classifier.
Reviews from this set, further referred to as “complement-sampled” reviews, likely have nothing to do with foodborne illness, but instead serve as easy “No” examples that the classifiers should predict correctly. Exactly how these 2 datasets are merged, however, requires principled consideration. For classifiers that learn to reduce classification error in training, we can formally model the joint likelihood of the classifier misclassifying some review and that review being selected by the prototype “Sick” classifier. Then, by marginalizing this joint distribution over the indicator that a review is selected by the prototype “Sick” classifier, we arrive at an unbiased estimate of the classification error. The end result is that we weight classification mistakes for the biased and complement-sampled reviews by the inverses of their respective probabilities of being selected at random from the full Yelp dataset.
Training regimes
Using the above sample weights, we incorporate both the biased labeled data and the complement-sampled data to train our classifiers under 3 different regimes. The first, “Biased,” used only the data from the 11 551 reviews selected by the prototype “Sick” classifier. The second, “Gold,” used the “Biased” data plus 1000 reviews sampled from the complement-sampled Yelp feed and labeled by DOHMH epidemiologists. In this sample of 1000 reviews, only 4 were labeled “Yes” for the “Sick” task and 1 was labeled “Yes” for the “Multiple” task. In the third regime, “Silver,” we randomly sampled 10 000 reviews from the complement-sampled Yelp feed before January 1, 2017, and assumed all were negative examples of both tasks. Intuitively, this regime can be helpful if it regularizes out statistical quirks of the “Biased” data more than the noise it may introduce through false negatives.
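The inverse-probability weighting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Review` record, both function names, and the pool sizes in the usage note are hypothetical, and the size of the full Yelp feed is not reported here.

```python
from dataclasses import dataclass


@dataclass
class Review:
    label: int         # gold label: 1 = "Yes", 0 = "No"
    predicted: int     # classifier output: 1 = "Yes", 0 = "No"
    from_biased: bool  # True if the prototype "Sick" classifier selected it


def inverse_probability_weights(n_biased_labeled, n_biased_pool,
                                n_comp_labeled, n_comp_pool):
    """Return (w_biased, w_complement).

    Assumption: a review's selection probability is the labeled count
    divided by the size of the pool it was drawn from, so its weight is
    the inverse of that ratio, pool_size / labeled_count.
    """
    w_biased = n_biased_pool / n_biased_labeled
    w_complement = n_comp_pool / n_comp_labeled
    return w_biased, w_complement


def weighted_error(reviews, w_biased, w_complement):
    """Weighted misclassification rate over biased and complement reviews."""
    total = 0.0
    wrong = 0.0
    for r in reviews:
        w = w_biased if r.from_biased else w_complement
        total += w
        if r.predicted != r.label:
            wrong += w
    return wrong / total
```

With hypothetical pools of 100 selected reviews (all labeled) and 1000 complement reviews (10 labeled), `inverse_probability_weights(100, 100, 10, 1000)` gives each complement review 100 times the weight of a biased review, so a handful of complement examples can stand in for the large unselected portion of the feed.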
Evaluation
The performance of each classifier was evaluated on the 1975 biased reviews from after January 1, 2017, along with another sample of 1000 reviews from the complement-sampled Yelp feed after January 1, 2017. These 1000 reviews were again labeled by DOHMH epidemiologists for both tasks. However, there were no positive examples of either task among the 1000 reviews. We evaluated the models for both tasks using 4 performance metrics common to class-imbalanced binary classification problems: precision, recall, F1-score, and area under the precision-recall curve (AUPR). Precision (often called “positive predictive value”) is the proportion of true positives out of the total number of positive predictions. Recall (often called “sensitivity”) is the true positive rate. F1-score is the harmonic mean of precision and recall. Precision, recall, and F1-score were calculated at a classification threshold of 0.5, meaning that we classified reviews with “Yes” probabilities ≥0.5 as “Yes.” The AUPR was measured by first graphing precision versus recall by varying the classification threshold from 0 to 1, then calculating the area under the curve. For all 4 metrics, 0 is the worst possible score and 1 is a perfect score. Since our evaluation data are biased, the evaluation metrics as described would not reflect unbiased estimates of model performance on the full Yelp feed. We can again derive bias-corrected precision and recall quantities, as we did with the training objective, by weighting test examples from the biased and complement-sampled sets by the inverses of their respective probabilities of being selected from the full Yelp dataset. For each model class, task, and training regime (21 variations total), we performed hyperparameter tuning experiments using 500 trials of random search from reasonable sampling distributions using 5-fold cross-validation on the training data, stratified by class label and biased/complement-sampled label.
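The thresholding step and the bias-adjusted precision, recall, and F1-score described in this section can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the per-example weights are assumed to be the inverse selection probabilities discussed for the training objective.

```python
def classify(probabilities, threshold=0.5):
    """Label a review "Yes" (1) when its "Yes" probability is >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def weighted_precision_recall_f1(labels, predictions, weights):
    """Precision, recall, and F1 where each example counts by its weight."""
    tp = fp = fn = 0.0
    for y, y_hat, w in zip(labels, predictions, weights):
        if y_hat == 1 and y == 1:
            tp += w   # weighted true positive
        elif y_hat == 1 and y == 0:
            fp += w   # weighted false positive
        elif y_hat == 0 and y == 1:
            fn += w   # weighted false negative
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Because complement-sampled reviews carry large weights, a single false positive among them lowers the weighted precision far more than a false positive among the biased reviews, which is the effect the bias adjustment is meant to expose.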
The details of the various featurization techniques and hyperparameter optimization experiments can be found in the Supplementary Appendix. After selecting the best hyperparameter settings for each model variation using best average bias-adjusted F1-score across the development folds, we retrained the models on their full training datasets. We compared the resulting model variations to each other and the prototype classifiers on the 4 evaluation metrics. We calculated 95% confidence intervals for F1-score and AUPR using the percentile bootstrap method15 with 1000 sampled test datasets. We then selected the best variation for both tasks based on test bias-adjusted F1-score as our final classifiers. We report the confusion matrices, perform a detailed error analysis, and identify insightful top features for the final classifiers on both tasks.
RESULTS
We found that the best classifiers achieved bias-adjusted F1-scores of 87% and 66% on the “Sick” and “Multiple” classification tasks, respectively.
Classification evaluation
The performance of the classifier variations for the “Sick” and “Multiple” tasks is presented in Tables 1 and 2, respectively. All models were evaluated on the test data from after January 1, 2017.
Table 1.
Model performance on “Sick” task

| Model | Training Regime | Precision | Recall | F1-Score (95% CI) | AUPR (95% CI) |
|---|---|---|---|---|---|
| J4.8 | Prototype | 0.48 | 0.99 | 0.65 (0.63-0.67) | 0.83 (0.81-0.85) |
| Logistic regression | Biased | 0.05 | 0.94 | 0.10 (0.09-0.11) | 0.63 (0.55-0.76) |
| Logistic regression | Gold | 0.83 | 0.88 | 0.85 (0.83-0.87) | 0.90 (0.88-0.92) |
| Logistic regression | Silver | 0.85 | 0.88 | 0.87 (0.85-0.88) | 0.91 (0.90-0.93) |
| Random forest | Biased | 0.04 | 0.91 | 0.07 (0.06-0.09) | 0.59 (0.54-0.70) |
| Random forest | Gold | 0.36 | 0.89 | 0.51 (0.38-0.68) | 0.81 (0.78-0.84) |
| Random forest | Silver | 0.70 | 0.88 | 0.78 (0.66-0.85) | 0.87 (0.85-0.89) |
| SVM | Biased | 0.09 | 0.95 | 0.16 (0.13-0.20) | 0.82 (0.79-0.87) |
| SVM | Gold | 0.33 | 0.93 | 0.49 (0.37-0.67) | 0.88 (0.85-0.91) |
| SVM | Silver | 0.96 | 0.74 | 0.83 (0.81-0.85) | 0.93 (0.92-0.95) |

The underlined value represents the final selected model from among the variants. This is the model we further analyze in the error analysis. Because the bootstrap distribution of some test statistics exhibited non-normal behavior, their corresponding confidence intervals are wider.
Table 2.
Model performance on “Multiple” task

| Model | Training regime | Precision | Recall | F1-score (95% CI) | AUPR (95% CI) |
|---|---|---|---|---|---|
| J4.8 | Prototype | < 0.01 | 0.69 | 0.01 (0.01-0.01) | < 0.01 (< 0.01-< 0.01) |
| Logistic regression | Biased | 0.08 | 0.56 | 0.15 (0.09-0.26) | 0.25 (0.19-0.40) |
| Logistic regression | Gold | 0.42 | 0.58 | 0.48 (0.30-0.67) | 0.56 (0.49-0.67) |
| Logistic regression | Silver | 0.64 | 0.58 | 0.61 (0.56-0.66) | 0.58 (0.52-0.65) |
| Sick-Pipelined logistic regression | Biased | 0.07 | 0.61 | 0.13 (0.09-0.23) | 0.18 (0.13-0.43) |
| Sick-Pipelined logistic regression | Gold | 0.77 | 0.56 | 0.65 (0.60-0.70) | 0.65 (0.59-0.70) |
| Sick-Pipelined logistic regression | Silver | 0.75 | 0.59 | 0.66 (0.61-0.70) | 0.71 (0.65-0.76) |
| Random forest | Biased | 0.04 | 0.37 | 0.07 (0.05-0.12) | 0.03 (0.02-0.18) |
| Random forest | Gold | 0.75 | 0.24 | 0.36 (0.29-0.42) | 0.31 (0.23-0.45) |
| Random forest | Silver | 0.74 | 0.25 | 0.37 (0.31-0.43) | 0.40 (0.34-0.49) |
| SVM | Biased | 0.07 | 0.65 | 0.12 (0.08-0.20) | 0.18 (0.12-0.48) |
| SVM | Gold | 0.35 | 0.34 | 0.35 (0.21-0.54) | 0.29 (0.21-0.57) |
| SVM | Silver | 0.20 | 0.30 | 0.24 (0.13-0.47) | 0.39 (0.30-0.64) |

The underlined value represents the final selected model from among the variants. This is the model we further analyze in the error analysis.

Table 3. Confusion matrices of best classifiers

| Task | Actual class | Predicted “No”: count | Predicted “No”: rate (%) | Predicted “Yes”: count | Predicted “Yes”: rate (%) |
|---|---|---|---|---|---|
| Sick | No | 1882 (true negatives) | 93 | 144 (false positives) | 7 |
| Sick | Yes | 112 (false negatives) | 12 | 837 (true positives) | 88 |
| Multiple | No | 2643 (true negatives) | 98 | 55 (false positives) | 2 |
| Multiple | Yes | 114 (false negatives) | 42 | 163 (true positives) | 58 |

For the “Sick” task, we found that the logistic regression model trained using the “Silver” regime achieved the highest F1-score, 87%. With the addition of 10 000 silver-labeled complement-sampled reviews, this model gained 77 percentage points in bias-adjusted F1-score over its “Biased” counterpart, a significant increase.
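As a quick check on Table 3, the sketch below recomputes precision, recall, and F1 from the raw confusion-matrix counts and then illustrates a percentile bootstrap interval of the kind reported in Tables 1 and 2. It is only an illustration under stated assumptions: the function names are hypothetical, the resampling is done over the four outcome cells of Table 3 rather than over the actual reviews, and the bias adjustment used in the study is not reproduced, so the resulting interval need not match the published one.

```python
import random
from collections import Counter


def precision_recall_f1(tp, fp, fn):
    """Plain (unadjusted) precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_interval(tp, fp, fn, tn, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1.

    Assumption: each test review is represented only by its outcome cell
    (tp/fp/fn/tn), so drawing n cells with replacement, weighted by the
    observed counts, stands in for resampling the test set itself.
    """
    rng = random.Random(seed)
    cells = ["tp", "fp", "fn", "tn"]
    counts = [tp, fp, fn, tn]
    n = sum(counts)
    scores = []
    for _ in range(n_resamples):
        draw = Counter(rng.choices(cells, weights=counts, k=n))
        scores.append(precision_recall_f1(draw["tp"], draw["fp"], draw["fn"])[2])
    scores.sort()
    # Percentile method: take the empirical alpha/2 and 1 - alpha/2 quantiles.
    lower = scores[int(n_resamples * alpha / 2)]
    upper = scores[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Counts copied from Table 3.
sick = dict(tp=837, fp=144, fn=112, tn=1882)
multiple = dict(tp=163, fp=55, fn=114, tn=2643)

for name, c in (("Sick", sick), ("Multiple", multiple)):
    p, r, f1 = precision_recall_f1(c["tp"], c["fp"], c["fn"])
    lo, hi = bootstrap_f1_interval(**c)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f} "
          f"95% interval=({lo:.2f}, {hi:.2f})")
```

With the Table 3 counts, the point estimates round to the precision, recall, and F1-score values listed for the two selected models in Tables 1 and 2 (0.85, 0.88, 0.87 for “Sick” and 0.75, 0.59, 0.66 for “Multiple”).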
The low bias-adjusted F1-score of 10% for the “Biased” “Sick” logistic regression is due to the misrepresentation of the full Yelp dataset by the “Biased” training, which causes the model to substantially over-predict “Yes” on the complement-sampled test data. This behavior is heavily penalized by the bias-adjustment because each false positive in the small complement-sampled test data is representative of many more false positives in the full Yelp dataset.

For the “Multiple” task, we found that the “Sick-Pipelined” logistic regression model trained using the “Silver” regime achieved the highest F1-score, 66%. The use of pipelined training and prediction caused a gain of 5 percentage points for the “Silver” “Sick-Pipelined” logistic regression over its single-step counterpart.

Precision-recall trade-off

Given the rarity of reviews discussing foodborne illness, it is desirable to explore settings of the “Sick” classifiers that favor recall over precision, since DOHMH epidemiologists are willing to accept some extra false positives to reduce the risk of missing an important positive “Sick” review. We analyzed this trade-off by examining the precision-recall curves of the “Sick” logistic regression classifiers, presented in Figure 1. From the plot, we can see that “Gold” and “Silver” models begin to experience an approximately equal trade-off of precision for recall in the region of 80%–90% recall, illustrated by the slope of the curves being close to 1 point of precision lost per point of recall gained. In the 90%–100% recall region, the “Gold” model begins to experience a steep drop in precision at a recall of 92%, while the “Silver” model does not experience a steep drop in precision until a recall of 98%. At this point, the precision of the “Silver” logistic regression is still 69%, 21 percentage points higher than that of the prototype classifier, which has 48% precision at 99% recall.
This indicates that even in a high-recall setting the “Silver” “Sick” classifier should outperform the “Sick” prototype.

Figure 1. Precision-recall curves of “Sick” logistic regression models in the high-recall region. While the “Biased” logistic regression performance lags below, the “Gold” and “Silver” models show relatively mild losses in precision per point of recall gained until the 90%–100% recall region. After 92% recall the “Gold” model begins to experience a steep drop in precision, while the “Silver” model does not experience a steep drop in precision until a recall of 98%.

Error analysis of best “Sick” classifier

Of the 2975 reviews in the test dataset, there are 949 positive examples and 2026 negative examples for the “Sick” task. The best “Sick” classifier, “Silver” trained logistic regression, achieved an F1-score of 87%, a statistically significant 22% absolute increase over the prototype classifier’s F1-score of 65%. On this test dataset, the best “Sick” classifier correctly classified many reviews containing major sources of false positives for the prototype classifier. These gains are not surprising, given that this model uses 40 times more data and better document representations (TF-IDF and trigrams rather than vanilla BOW). This large performance increase will qualitatively change the efficacy of the system for DOHMH epidemiologists.

Examination of the 144 false positives identified various causes.
Many of these false positives cannot be identified by a classifier only using n-grams up to n = 3. For example, one reviewer wrote, “I didn’t get food poisoning,” which would require 4-grams for the classifier to capture the negation. This example illustrates a major shortcoming of n-gram models: important dependencies or relationships between words often span large distances across a sentence. Another major source of false positives is reviews that do talk about food poisoning but are not current enough to meet the DOHMH criteria for follow-up, and thus are labeled “No.” A third type of false positive occurs when a review talks about food poisoning in a hypothetical or future sense. For example, one reviewer reported that the food “had a weird chunky consistency…hopefully we won’t get sick tonight.”

Multiple causes of the 112 false negatives were also identified. One notable cause is misspellings of key words related to food poisoning in the review, such as “diherrea.” Another major cause is reviews that make grave references to food poisoning but for which the classifier predicts “No” because of a prevalence of negatively weighted n-grams, such as “almost threw up.” A final source of false negatives is human error in the labeling of reviews for the test data. For example, one review’s only reference to illness was “she began to feel sick” while at the restaurant, yet the review was labeled positive.

Many of the reviews contained negation, which the best “Sick” classifier can detect due to the use of n-grams. N-grams also allow the classifier to identify that the pattern “sick of,” as in “sick of the pizza,” does not typically refer to actual food poisoning, compared to “got sick,” which typically does.

Finally, we examined the highest-weighted n-grams of the best “Sick” classifier.
The most highly positive-weighted features were phrases indicative of foodborne illness, such as “diarrhea,” “food poisoning,” and “got sick,” while the most highly negative-weighted features were either very positive phrases or indicative of false positives, such as “amazing” and “sick of.” These top features are encouraging, as they show the model has identified features that epidemiologists would also deem important.

Error analysis of best “Multiple” classifier

Of the 2975 reviews in the test dataset, there are 277 positive examples and 2698 negative examples for the “Multiple” task. The best “Multiple” classifier, “Silver” trained “Sick-Pipelined” logistic regression, achieved an F1-score of 66%.

We examined the reasons behind the 114 false negatives. Many false negatives were due to incorrect predictions made by the pipelined “Sick” classifier. Most other false negatives were caused by the inability of trigram models to capture longer phrases. Phrases indicating multiple illnesses, such as “we both got really sick,” typically span more than 3 contiguous words, leaving no way for a classifier using trigrams to detect them directly.

Of the 277 positive examples, 163 were correctly classified. Reviews containing phrases clearly indicating multiple illnesses in a bigram or trigram, such as “both got sick,” scored highest; however, such concise n-grams are rare. The classifier’s highly weighted features are n-grams that simply refer to multiple people without referring to food poisoning. The classifier can capture references to multiple people in a trigram, but these references are often devoid of context, making it hard to determine if multiple people simply did something together or multiple people became ill. Analysis of the true positive test reviews with respect to these feature weights suggests that the classifier tends to select reviews that contain an abundance of n-grams about multiple people.
Examination of these features shows that the n-gram model class is not sufficient for the “Multiple” task, as indicated by its low performance relative to the “Sick” task and by the task’s need to detect long phrases, which n-gram models cannot do. While it is tempting to simply extend the n-gram range to longer sequences, this approach fails due to a well-known statistical issue called “sparsity”: specific longer phrases become extremely rare in the data and are not seen in enough quantity for models to learn from them.

DISCUSSION

In this study, we have presented an automated text-classification system for the surveillance and detection of foodborne illness in online NYC restaurant reviews from Yelp. Using this system, NYC DOHMH epidemiologists are able to monitor millions of reviews, a previously impossible task, to aid in the identification and investigation of foodborne illness outbreaks in NYC. As of May 21, 2017, this system has been instrumental in the identification of 10 outbreaks and 8523 reports of foodborne illness associated with NYC restaurants since July 2012.

Aided by simple prototype classifiers, DOHMH epidemiologists have evaluated and labeled 13 526 Yelp reviews for 2 key indicators of foodborne illness since July 2012. Although these data are biased by the prototype classifier’s selection criterion, an issue that commonly plagues deployed needle-in-a-haystack systems, we showed how these biased data and additional complement-sampled data could be combined in a bias-adjusted training regime to build significantly higher-performing classifiers. We evaluated the performance of our prototype classifiers and several other well-known classification models on 2 tasks, namely “Sick” and “Multiple.” We found that logistic regression trained with the “Silver” regime performed best for the “Sick” task and that the “Silver” “Sick-Pipelined” logistic regression performed best on the “Multiple” task, with bias-adjusted F1-scores of 87% and 66%, respectively.
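The span limitation identified in the error analysis can be made concrete with a small sketch. The helpers below are hypothetical (naive whitespace tokenization, not the study's featurization code); they list contiguous word n-grams and find the smallest n for which a single n-gram contains both cue words of a phrase.

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of `text` (naive whitespace split)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shortest_covering_n(text, cue_a, cue_b):
    """Smallest n for which one n-gram contains both cue words, else None."""
    tokens = text.lower().split()
    for n in range(1, len(tokens) + 1):
        if any(cue_a in gram and cue_b in gram for gram in ngrams(text, n)):
            return n
    return None


# Phrases quoted in the error analysis, with the two words a feature
# would need to see together (cue choices are illustrative assumptions).
examples = [
    ("i didn't get food poisoning", "didn't", "poisoning"),
    ("we both got really sick", "both", "sick"),
    ("both got sick", "both", "sick"),
]
for text, a, b in examples:
    print(f"{text!r}: needs n = {shortest_covering_n(text, a, b)}")
```

The first two phrases need a 4-gram, one word longer than the trigram features used here, whereas “both got sick” fits within a trigram. Simply raising n runs into the sparsity problem described above, since each specific longer phrase occurs too rarely to learn from.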
As future work, we are currently exploring the use of modern deep learning techniques to further improve upon the classifiers by using soft measures of word similarity and models that are not limited to short contiguous spans of text, the key limitation found in the error analysis. We also intend to examine the performance of our system in locations outside of NYC.

This study is granted institutional review board exempt status under National Science Foundation grant IIS-15-63785, titled “III: Medium: Adaptive Information Extraction from Social Media for Actionable Inferences in Public Health.” Although the raw Yelp data are not publicly available, all code used to reproduce the final experiments in this manuscript can be found at https://github.com/teffland/FoodborneNYC/tree/master/jamia_2017/.

CONCLUSION

The importance of effective information extraction regarding foodborne illness from social media sites is increasing with the rising popularity of online restaurant review sites and the decreasing likelihood that younger people will report food poisoning via official government channels. In this investigation, we described details of the DOHMH system for foodborne illness surveillance in online restaurant reviews from Yelp. Our system has been instrumental in the identification of 10 outbreaks and 8523 reports of foodborne illness associated with NYC restaurants since July 2012. Our evaluation has identified strong classifiers for both tasks, whose deployment will allow DOHMH epidemiologists to more effectively monitor Yelp for improved foodborne illness investigations.

FUNDING

This work was supported by National Science Foundation grant IIS-15-63785, Google Research Award “Information Extraction from Social Media: Detecting Disease Outbreaks,” Alfred P. Sloan Foundation grant G-2015-14017, the Centers for Disease Control and Prevention PHEP grant NU90TP000546, and CDC/ELC grant NU50CK000407-03.
COMPETING INTERESTS

This material is based on work supported in part by a Google Research Award. In accordance with Columbia University reporting requirements, LG acknowledges ownership of Google stock as of the writing of this paper.

CONTRIBUTORS

Columbia author contributions

TE: Designed and evaluated the alternative machine learning techniques for the classification of foodborne illness occurrence in Yelp restaurant reviews. Co-authored manuscript.
AL: Designed and evaluated the alternative machine learning techniques for the classification of foodborne illness occurrence in Yelp restaurant reviews. Co-authored manuscript.
LG: Coordinated the design and evaluation of the alternative machine learning techniques for the classification of foodborne illness occurrence in Yelp restaurant reviews. Co-authored, critically reviewed, and provided extensive feedback on manuscript.
DH: Coordinated the design and evaluation of the alternative machine learning techniques for the classification of foodborne illness occurrence in Yelp restaurant reviews. Co-authored, critically reviewed, and provided extensive feedback on manuscript.

DOHMH author contributions

SB: Conceptualized and coordinated the incorporation of Yelp reviews into the DOHMH foodborne illness complaint system. Co-authored, critically reviewed, and provided extensive feedback on manuscript.
VR: Conceptualized and coordinated the incorporation of Yelp reviews into the DOHMH foodborne illness complaint system. Oversaw the collection of feedback data and data cleaning. Co-authored, critically reviewed, and provided extensive feedback on manuscript.
KD: Conducted literature review regarding other uses of social media to detect foodborne illness complaints and outbreaks. Co-authored, critically reviewed, and provided extensive feedback on manuscript.
HW: Conceptualized and coordinated the incorporation of Yelp reviews into the DOHMH foodborne illness complaint system.
Co-authored, critically reviewed, and provided extensive feedback on manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGEMENTS

We thank Fotis Psallidas for his contributions to the incorporation of Yelp reviews and the original machine learning classifiers into the DOHMH foodborne illness complaint system and helpful discussions. We thank Giannis Karamanolakis and Lampros Flokas for helping with data cleaning and validation of the results. We thank Yelp for providing the DOHMH with access to its raw feed of business reviews for New York City. We also thank Lan Li, Faina Stavinsky, Daniel O’Halloran, and Jazmin Fontenot for helpful discussions.

REFERENCES

1 Scallan E, Griffin PM, Angulo RV et al. Foodborne illness acquired in the United States: unspecified agents. Emerg Infect Dis. 2011;17(1):16–22.
2 Gould LH, Walsh KA, Vieria AR et al. Surveillance for foodborne disease outbreaks: United States, 1998–2008. MMWR Surveill Summ. 2013;62(2):1–34.
3 Santillana M, Nguyen AT, Dredze M et al. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11(10):1–15.
4 Bahk CY, Scales DA, Mekaru SR et al. Comparing timeliness, content, and disease severity of formal and informal source outbreak reporting. BMC Infect Dis. 2015;15(135):1–6.
5 Freifeld CC, Mandl KD, Resi BY et al. HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports. J Am Med Inform Assoc. 2008;15(2):150–57.
6 Harris JK, Mansour R, Choucair B et al. Health department use of social media to identify foodborne illness: Chicago, Illinois, 2013–2014. MMWR Morb Mortal Wkly Rep. 2014;63(32):681–85.
7 Sadilek A, Kautz H, DiPrete L et al. Deploying nEmesis: preventing foodborne illness by data mining social media. Proc Conf AAAI Artif Intell; February 12–17, 2016; Phoenix, Arizona; 3982–90.
8 Nsoesie EO, Kluberg SA, Brownstein JS. Online reports of foodborne illness capture foods implicated in official foodborne outbreak reports. Prev Med. 2014;67:264–69.
9 Harrison C, Jorder H, Stern F et al. Using online reviews by restaurant patrons to identify unreported cases of food-borne illness: New York City, 2012–2013. MMWR Morb Mortal Wkly Rep. 2014;63(20):441–45.
10 Quinlan R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.
11 Leskovec J, Rajaraman A, Ullman JD. Mining of Massive Datasets. Cambridge: Cambridge University Press; 2014.
12 Cox DR. The regression analysis of binary sequences with discussion. J R Stat Soc Series B Stat Methodol. 1958;20:215–42.
13 Breiman L. Random forests. Mach Learn. 1997;45(1):5–32.
14 Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
15 Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL: CRC Press; 1994.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Should parents see their teen’s medical record? Asking about the effect on adolescent–doctor communication changes attitudes
Ancker, Jessica S; Sharko, Marianne; Hong, Matthew; Mitchell, Hannah; Wilcox, Lauren
doi: 10.1093/jamia/ocy120 pmid: 30247699
Abstract

Objective

Parents routinely access young children’s medical records, but medical societies strongly recommend confidential care during adolescence, and most medical centers restrict parental records access during the teen years. We sought to assess public opinion about adolescent medical privacy.

Materials and Methods

The Cornell National Social Survey (CNSS) is an annual nationwide public opinion survey. We added questions about a) whether parents should be able to see their 16-year-old child’s medical record, and b) whether teens would avoid discussing sensitive issues (sex, alcohol) with doctors if parents could see the record. Hypothesizing that highlighting the rationale for adolescent privacy would change opinions, we conducted an experiment by randomizing question order.

Results

Most respondents (83.0%) believed that an adolescent would be less likely to discuss sensitive issues with doctors with parental medical record access; responses did not differ by question order (P = .29). Most also believed that parents should have access to teens’ records, but support for parental access fell from 77% to 69% among those asked the teen withholding question first (P = .01).

Conclusions

Although medical societies recommend confidential care for adolescents, public opinion is largely in favor of parental access. A brief “nudge,” asking whether parental access might harm adolescent–doctor communication, increased acceptance of adolescent confidentiality, and could be part of a strategy to prepare parents for electronic patient portal policies that medical centers impose at the beginning of adolescence.
Keywords: ethics, adolescents, electronic patient portal, confidentiality, children

INTRODUCTION

With the advent of electronic medical records and associated patient portals, increasing numbers of patients are accessing their own medical records to better understand and manage their healthcare.1,2 Parents, who have primary ethical and legal responsibility for their children’s healthcare, generally have full access to young children’s medical records. Medical records access could help parents manage well-child care such as vaccinations3 and is likely to be especially valuable for parents of children with chronic illnesses and those attempting to coordinate care across healthcare providers.4–8

However, during the adolescent years, medical confidentiality, including protection from parental notification, may encourage teens to seek care for sensitive medical issues that become newly salient at this time.9 When confidentiality is not ensured or parental notification is mandated, adolescents may delay or avoid sexual healthcare, or withhold information from healthcare providers.10–13 One survey found that, if their parents were notified, almost 59% of adolescents seeking prescription contraceptives would stop seeking sexual health services but would not stop sexual activity.11 Other topics adolescents might prefer to keep between themselves and their doctors could include sexual identity or questioning,9,14 alcohol and drug use,15,16 or other sensitive issues.17,18 For these reasons, medical societies focusing on adolescent healthcare strongly recommend confidential care in this age group.19–21

Yet policies and actual practices about parental access to adolescent medical records and patient portal accounts are heterogeneous, varying by medical situation, care type, jurisdiction, healthcare organization policy, and even payer type.22–24 Minor consent laws vary by state, granting adolescents different degrees of autonomy for different types of care, while some states mandate parental notification or authorization for specific medical decisions at different ages or leave these issues ambiguous.25 To date, adolescent reproductive healthcare funded under Title X is confidential.26,27 Physicians using electronic health records may find it challenging to keep information confidential and may have to use awkward methods such as putting some information in a separate confidential electronic encounter.28–31 And even when doctors do offer confidential care, parents may find out about it later when they receive an explanation of benefits from the insurer.32

In our recent studies of electronic patient portals across the United States, almost all medical centers we studied restricted parental access to an adolescent child’s medical record.24,33 However, because the restrictions were developed locally in response to legal, cultural, and technical factors, they varied widely in terms of how much a parent could see of an adolescent’s record (from nothing, to a partial record with sensitive information redacted, to the entire record), the extent of adolescents’ access to their own records (from none to partial to complete), and age thresholds (with some centers providing confidentiality to patients as young as 10).24,33 Policies also varied about whether teens could or should agree to parental access, and a few centers simply turned off portal accounts altogether (for both child and parent) during the adolescent years.24 Regardless of the policy type, medical center leaders frequently encountered angry or bewildered parents when their child reached the age that triggered the restrictions.24

As a comparison to the medical leadership perspectives previously studied, the current study assessed public attitudes toward parental access to adolescent medical records.
Given the complexity of the issues and the diversity of policies around the country, we conjectured that many people had not been exposed to a rationale for medical confidentiality for teens. Therefore, we also tested the hypothesis that support for parental access would decrease when respondents were presented with one of the primary reasons to offer confidentiality, which is to encourage adolescents to share information freely with their physicians.

MATERIALS AND METHODS

Data source

The Cornell National Social Survey is a random-digit-dial telephone survey conducted annually by the Cornell Survey Research Institute. Every year, the sample size of 1000 provides a margin of error of plus or minus 3.1 percentage points. The Cornell University Institutional Review Board approved the study, and respondents provided oral consent. Each year, sampling is conducted on a dual frame of landline and cell phone numbers in the continental United States for a simple random sample not stratified by region or other variables. The proportion of cell phone numbers is calculated from county-level data on prevalence of cell phone-only households. Listed and unlisted numbers are both included; known business and non-household numbers are excluded, as are disconnected numbers. When the telephone is answered, the interviewer asks to speak with the adult with the most recent birthday, a technique that ensures each adult in the household has an equal chance of being selected.34 Researchers submit potential questions, which are competitively reviewed by the Cornell Survey Research Institute. All questions are pilot tested with a small sample before being finalized.

Three questions about portals, medical records, and privacy were included by our research team, with the order of questions 2 and 3 randomized.

In order A:
- Should a 16-year-old be able to have their own electronic patient portal account? (Options: Always, Only with parental permission, Never)
- Should a parent or guardian be able to see their 16-year-old child’s entire medical record? (Options: Always, Only with the 16-year-old’s permission, Never)
- Do you think teens would be less likely to talk to their doctors about sensitive issues (for example, sexual activity and alcohol or drug problems) if they knew their parents could see their medical record afterwards? (Options: Yes, No)

In order B:
- Should a 16-year-old be able to have their own electronic patient portal account?
- Do you think teens would be less likely to talk to their doctors about sensitive issues (for example, sexual activity and alcohol or drug problems) if they knew their parents could see their medical record afterwards?
- Should a parent or guardian be able to see their 16-year-old child’s entire medical record?

These questions were introduced with a brief definition: “An online patient portal is a website offered by your doctor’s office. You can use a patient portal to see your lab test results, prescriptions, and medical record, or to privately message your doctor.” “Don’t know” was not offered as a response option but was recorded when given as an answer. The entire survey, which required approximately 20 minutes to administer, contained multiple demographic questions as well as research questions submitted by other social science researchers.

Statistical analysis

Descriptive analysis was conducted with frequencies and percents. Bivariate associations were assessed with chi-square tests. Multivariable relationships between sociodemographics and the portal questions were assessed with logistic regression models; all variables significant at .05 were tested for interaction with question order in the multivariable models. In both bivariate analyses and logistic models for the question about whether parents should be able to see the teen’s medical record, we modeled “always” responses vs all other responses.
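As an illustration of the chi-square tests mentioned above, the following minimal sketch applies Pearson's chi-square test to the question-order counts reported in Table 2. It is not the authors' SAS code. It assumes that "don't know" responses are excluded and that the parental access question is tested on its 3 response categories; the function names are hypothetical. Under these assumptions the P values should land close to those reported in Table 2.

```python
import math


def pearson_chi_square(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi_square_p_value(chi2, df):
    """Upper-tail P value, using the closed forms that exist for df 1 and 2."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# Rows are question order A and B (counts from Table 2).
# Columns: Always, Only with 16-year-old's permission, Never
parent_access = [[409, 110, 9], [324, 128, 15]]
# Columns: Yes, No
teen_withholding = [[436, 93], [394, 70]]

for label, table in [("Parental access", parent_access),
                     ("Teen withholding", teen_withholding)]:
    chi2, df = pearson_chi_square(table)
    p = chi_square_p_value(chi2, df)
    print(f"{label}: chi2 = {chi2:.2f}, df = {df}, P = {p:.3f}")
```

The closed-form P values avoid a dependency on a statistics library; for tables with more degrees of freedom, a library routine would be the more practical choice.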
All hypothesis tests were 2-sided with an alpha of .05. Analyses were conducted in SAS v.9.3 (Cary, NC).

RESULTS

Of 1703 eligible individuals reached by phone, 1000 completed the survey, for a cooperation rate of 58.7%.35 The 1000 respondents were the result of calls to 8064 working numbers (including non-answered calls as well as calls to those who were ineligible, refused, or unable to participate), for an overall response rate of 12.4%. The final sample of 1000 respondents was diverse and compared well with the US population in age, sex distribution, ethnicity, family composition, and geographic diversity, but had an overrepresentation of white and well-educated respondents (Table 1).

Table 1. Characteristics of the sample

| Characteristic | Sample n | Sample % | National % |
| --- | --- | --- | --- |
| Sex | | | |
| Male | 498 | 49.8 | 48.6 |
| Female | 502 | 50.2 | 51.3 |
| Age | | | |
| 18-24 | 117 | 11.7 | 12.8 |
| 25-44 | 310 | 31.0 | 34.3 |
| 45-64 | 370 | 37.0 | 34.1 |
| 65+ | 203 | 20.3 | 18.9 |
| Race | | | |
| White | 815 | 81.5 | 75.1 |
| Black | 115 | 11.5 | 12.2 |
| All other | 70 | 7.0 | 12.7 |
| Ethnicity | | | |
| Non-Hispanic | 872 | 87.2 | 85.8 |
| Hispanic | 127 | 12.7 | 14.2 |
| Declined | 1 | 0.1 | – |
| Census division | | | |
| 1 (New England) | 59 | 5.9 | 4.8 |
| 2 (Middle Atlantic) | 129 | 12.9 | 13.2 |
| 3 (East North Central) | 155 | 15.5 | 14.7 |
| 4 (West North Central) | 75 | 7.5 | 6.5 |
| 5 (South Atlantic) | 190 | 19.0 | 19.9 |
| 6 (East South Central) | 58 | 5.8 | 5.9 |
| 7 (West South Central) | 138 | 13.8 | 11.6 |
| 8 (Mountain) | 62 | 6.2 | 7.1 |
| 9 (Pacific) | 134 | 13.4 | 16.2 |
| Education level | | | |
| HS or less | 257 | 25.7 | 41.0 |
| Some college or tech | 284 | 28.4 | 31.3 |
| College degree | 266 | 26.6 | 17.6 |
| Graduate degree | 192 | 19.2 | 10.1 |
| Declined | 1 | 0.1 | – |
| Social beliefs | | | |
| Liberal | 310 | 31.0 | 26 |
| Moderate | 363 | 36.3 | 35 |
| Conservative | 327 | 32.7 | 35 |
| Household income | | | |
| <$50K | 359 | 35.9 | 45.5 |
| $50K to <$75K | 290 | 29.0 | 17.8 |
| $75K to <$100K | 91 | 9.1 | 12.2 |
| $100K to <$150K | 115 | 11.5 | 13.5 |
| $150K+ | 145 | 14.5 | 11.1 |
| Has children in the home | | | |
| No | 665 | 66.5 | 68.0 |
| Yes | 335 | 33.5 | 32.0 |

Dash (–) indicates not available. National percentages represent estimates from the adult population (18 and older) from American Community Survey 2016 5-year estimates except for ethnicity distribution, which is from the 2010 Census, and the social beliefs estimates, which are from Gallup 2017.36

Most respondents thought that a 16-year-old should be able to obtain a patient portal account with parental permission, with another 20% endorsing adolescent accounts even without parental permission, and a similar proportion saying that adolescents should not have accounts at all (Table 2). About 83% of respondents thought that parental access to teen medical records would reduce the likelihood of teens consulting with their doctors, and question order made no difference (P = .29). However, the proportion who thought that parents should always have access to adolescent medical records varied by question order, falling from almost 77% to 69% among those asked the teen withholding question first (P = .01).

Table 2. Perceptions about parental access to adolescent medical records

Nonrandomized question

| Response options | N | % |
| --- | --- | --- |
| 1. Should a 16-year-old be able to have their own patient portal account? | | |
| Always | 207 | 20.7% |
| Only with parent permission | 602 | 60.2% |
| Never | 189 | 18.9% |
| Do not know/refused | 2 | 0.2% |

Randomized questions

| Response options | Order A, n | Order A, % | Order B, n | Order B, % | P |
| --- | --- | --- | --- | --- | --- |
| 2. Should a parent or guardian be able to see their 16-year-old child’s entire medical record? | | | | | .01 |
| Always | 409 | 76.9% | 324 | 69.2% | |
| Only with 16-y-o permission | 110 | 20.7% | 128 | 27.4% | |
| Never | 9 | 1.7% | 15 | 3.2% | |
| 3. Do you think teens would be less likely to talk to their doctors about sensitive issues (for example, sexual activity and alcohol or drug problems) if they knew their parents could see their medical record afterwards? | | | | | .29 |
| Yes | 436 | 82.0% | 394 | 84.2% | |
| No | 93 | 17.5% | 70 | 15.0% | |
| Total | 532 | 53.2% | 468 | 46.8% | |

In bivariate analyses (details not shown), support for teens to have their own portal accounts (Question 1) was significantly more common among respondents with younger age and liberal beliefs. Support for parental access to the teen’s medical record was more common among men, older respondents, those with children in the household, and those with conservative beliefs. Belief that teens would be less open with their physician with parental medical record access was more common among women, younger respondents, and those with liberal beliefs. Demographics significant in these bivariate analyses were used to construct the multivariable model (Table 3). This model demonstrates that individuals were significantly more likely to support full parental records access if they had question order A (did not answer the teen withholding question first), were men, were 65 or older, had conservative social beliefs, had children in the home, or thought teens should not have their own accounts.

Table 3. Adjusted odds of supporting full parental access to teen records

| Effect | AOR | 95% CI | P*** |
| --- | --- | --- | --- |
| Question order A vs B* | NA** | NA | .004 |
| Female vs male | NA | NA | .01 |
| Question order x gender interaction | NA | NA | .04 |
| Female vs male with question order A | 0.92 | 0.58-1.47 | |
| Female vs male with question order B | 0.46 | 0.29-0.74 | |
| Age | | | <.001 |
| 18-24 vs 65+ | 0.23 | 0.12-0.41 | |
| 25-44 vs 65+ | 0.46 | 0.27-0.76 | |
| 45-64 vs 65+ | 0.87 | 0.53-1.41 | |
| Social beliefs | | | <.001 |
| Conservative vs moderate | 1.24 | 0.81-1.90 | |
| Liberal vs moderate | 0.44 | 0.30-0.64 | |
| Children in home | | | .01 |
| No children vs at least 1 child | 0.63 | 0.44-0.91 | |
| Would teen withhold from doctor? (Q3) | | | .24 |
| Does not believe vs does believe teen would withhold | 1.33 | 0.82-2.15 | |
| Should teen have portal account? (Q1) | | | <.001 |
| Always vs only with parent permission | 0.23 | 0.15-0.33 | |
| Never vs only with parent permission | 1.63 | 1.00-2.67 | |

* Question order A: Parent access question before teen withholding question. Question order B: Teen withholding question before parent access question.
** Because of the interaction between question order and gender, odds ratios cannot be computed for the question order and gender variables; odds ratios are provided for the interactions only.
*** Type 3 analysis of effects P value indicates significance of entire variable.

In addition, there was a significant interaction between question order and gender, such that the effect of question order occurred largely among female respondents. In question order A (in which respondents did not answer the teen withholding question first), women and men were roughly equally likely to support parental access to adolescent medical records (AOR 0.92; 95% CI 0.58-1.47). However, in question order B (prompted to consider teen withholding first), women were only half as likely as men to support parental medical record access (AOR 0.46; 95% CI 0.29-0.74). The participant’s answer to the question about teen withholding (Question 3) itself was not statistically significant (P = .24), nor was the interaction between the Question 3 answer and question order (data not shown). Adding race, ethnicity, and household income as additional demographics to the model made no appreciable difference to the odds ratios or P values (data not shown). Census division was not a significant predictor at the univariate level and also could not be included in the multivariable models because of small sample sizes within cells.

DISCUSSION

This survey suggests that majorities of the public endorse 2 somewhat conflicting views: that parents should have access to their teen children’s medical records, and that this parental access would prompt teens to withhold important information from their physicians. Support for parental access was much lower among respondents who answered the withholding question first, as well as among women, younger respondents, those with liberal social beliefs, those without children in the home, and those who thought teens should always be able to have their own portal accounts.
Answering the withholding question was particularly influential among women. Very interestingly, the respondent’s answer to the question about teen withholding was not a significant predictor of support for parental access. In other words, the mere fact of prompting respondents to consider this withholding question was associated with reduced support for parental access, regardless of whether they answered the withholding question yes or no. Strong arguments have been made both for and against full parental access to adolescent medical records. On the one hand, parents have both moral and financial responsibility for their children’s healthcare, as well as their education about health and other topics. Healthcare providers seek to support communication and positive relationships between teenage patients and their parents, because such strong relationships are associated with better health-related behaviors among adolescents, including reduced rates of sexual risk factors.37 In our recent key informant study, many medical center leaders explicitly hoped to develop portal access policies that would encourage teenage patients to discuss problems with their parents.24 At a pragmatic level, some medical centers may decide trying to ensure adolescent confidentiality is futile because parents will ultimately receive an insurance company statement of benefits for the child’s care.24 Recent news coverage of college suicides suggests that many people consider it unacceptable to withhold a troubled student’s mental health information from parents, even if colleges believe they are acting in compliance with federal education privacy law. 
A front-page feature article in The New York Times, for example, included multiple stories in which keeping information from parents was followed by a tragedy, with no counterexamples in which disclosing information to parents had adverse consequences.38 However, lack of confidentiality is known to discourage young people from approaching their physicians with concerns about sexual health, mental health, drug and alcohol use, and other sensitive issues.10–12 It is noteworthy that in a large longitudinal survey, teens who reported having poor communication with their parents were more likely to cite confidentiality concerns as the reason for skipping healthcare that they needed.12 We have previously found that for these reasons, many medical centers do impose confidentiality restrictions on access to adolescent medical records through electronic patient portals.24,33 The restrictions are idiosyncratic to each medical center, and include blocking parental access to the adolescent record entirely, or blocking parental access only to certain types of medical information considered sensitive, or requiring the teen’s permission for continued parental access, or even turning off portal accounts altogether during the adolescent years.24 Previous studies on adolescent medical privacy have found varying attitudes in different populations. 
In a qualitative study, parents of adolescents in juvenile detention generally wanted the adolescents to have and to control online access to their medical information.39 Another qualitative study among commercially insured adults found enthusiasm about potential teen use of a patient portal, accompanied by concerns that granting complete access to medical records, messaging, and scheduling would give adolescents too much autonomy and privacy.40 It is challenging to consider what measures might be appropriate to address the conflict between professional society ethical statements (endorsing confidential care for adolescents) and current public opinion. This conflict (together with technical limitations, lack of standards, and other constraints24) places medical centers in the unenviable position of having to develop policies and procedures that are likely to be unwelcome to at least some of their stakeholders. Within medical organizations, a shared decision-making session at the onset of adolescence might be helpful to fully educate all parties (parents, the adolescent, and the medical team) about information available in the portal and through medical bills, and to help parents and providers understand each other’s perspectives on confidentiality. Some healthcare organizations in our previous study had implemented such sessions.24 However, such sessions place resource and time burdens on healthcare organizations. More granular information control might help to strike an acceptable balance between the expectations of different stakeholders.
Several medical organizations had implemented different levels of protection for different types of medical information, and one had tiered information access levels by age.24 (Similarly, more granular control has been endorsed by members of another vulnerable group: patients receiving care for mental and behavioral health conditions.41) Professionals and professional societies endorsing confidential care for adolescents might also consider ways to address the unpopularity of this viewpoint. For example, collaborative policy development with patient advocates holding different opinions could potentially lead to novel policies or new ways to frame policy. Healthcare organizations already promote the benefits of accessing medical records through patient portals: in light of our study, perhaps educational or public communication interventions should also raise awareness about known adverse consequences of inappropriate information disclosure. Because parents are among the most important stakeholders in the development of policies about adolescent confidentiality, it may seem irrelevant to assess the beliefs of non-parents. However, as we and others have demonstrated, many other stakeholders have input into policy. These could include advocacy groups (especially those advocating for minors), medical center staff and employees themselves, patients who might vote with their feet, donors, and voters considering legal issues. In addition, many people who are not currently parents of children in the home may be parents of adult children who previously lived at home, or may become parents in the future. This study therefore includes both parents and non-parents.

Limitations

The sample size of 1000 produced a margin of error of plus or minus 3.1 percentage points; subgroup analyses have lower power, and conclusions about subgroups should be drawn only with caution.
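To make the point about subgroups concrete, the following minimal sketch computes the conventional 95% margin of error for a proportion under simple random sampling, using the most conservative assumption of a true proportion of 0.5. The function name and the choice of subgroups are illustrative assumptions; the subgroup sizes are taken from Table 1, and the calculation ignores any weighting or design effects.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% confidence interval for a proportion.

    Assumes simple random sampling; p=0.5 gives the widest interval.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Sample sizes taken from Table 1 (the choice of subgroups is illustrative)
groups = {
    "Full sample": 1000,
    "Has children in the home": 335,
    "Age 65+": 203,
    "Age 18-24": 117,
}

for label, n in groups.items():
    moe = 100 * margin_of_error(n)
    print(f"{label} (n = {n}): plus or minus {moe:.1f} percentage points")
```

Because the margin of error scales with the inverse square root of the sample size, an estimate for a subgroup of about 100 respondents is roughly 3 times less precise than one based on the full sample.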
The survey used up-to-date methods for sampling landline and cell phones and produced a diverse sample, but the sample nonetheless overrepresented white and well-educated people relative to the US population. The demographic questions allowed us to determine whether the respondent had children in the household, but not whether the children were adolescents; among respondents with no children in the household, we do not know how many were parents. Due to space limitations, we could add only 3 questions to the survey and therefore could not assess other potential confounders such as personal or family experience with electronic patient portals or with sensitive medical conditions. The description of the patient portal that was provided in the survey was general and did not list all types of potentially sensitive information that might be available. The policies studied here pertained to viewing the electronic medical record; notifications and other forms of communication might be covered by different policies.

Policy implications

Medical society guidelines suggest that ethical practice requires providing confidential care to adolescents, and many medical centers operationalize this guidance by placing various restrictions on parental medical record access. These restrictions are likely to lead to conflict, given our findings that public opinion is strongly in favor of full parental access. It is likely that there will always be a diversity of parental viewpoints about the extent to which adolescents should have medical privacy, with opinions influenced by characteristics and beliefs as described in the current study. In addition, opinions are likely to vary in light of the situation and the adolescent in question; some situations are more challenging than others, and some young patients are more mature and capable of managing their own healthcare than others.
However, we also found strong endorsement for the statement that parental access impairs open communication between adolescents and their doctors about important topics. A very gentle “nudge” of prompting people to consider this potential negative effect reduced subsequent support for full parental access. It seems likely that broader educational interventions around the benefits of confidential medical care for adolescents would increase support for confidentiality, as well as help prepare parents for restrictions on medical records access triggered by the age of their child.

AUTHORSHIP AND CONTRIBUTORS

JSA conceptualized the study, formulated survey questions, conducted statistical analyses, and drafted the paper. MS, MH, and LW contributed to the study concept and survey question development, and provided critical feedback and final approval on the manuscript. The Cornell National Social Survey is administered by the Cornell University Survey Research Institute.

FUNDING

The Cornell National Social Survey is supported by the Office of the Senior Vice Provost of Cornell University. Dr Ancker is supported by AHRQ K01 HS 021531. Dr Wilcox and Mr Hong are supported by NSF CAREER 1652302. None of the sponsors had any role in study design; collection, analysis, and interpretation of data; writing the report; or decision to submit the report for publication.

Conflict of interest statement. The authors have no conflicts to report.

ACKNOWLEDGMENTS

We thank the staff at the Cornell National Social Survey.

REFERENCES

1 Henry J, Pylypchuk Y, Patel V. Electronic Capabilities for Patient Engagement among US Nonfederal Acute Care Hospitals: 2012-2015. Washington, DC: Office of the National Coordinator for Health Information Technology; 2016.
2 Ancker JS, Silver M, Kaushal R. Rapid growth in use of personal health records. J Gen Intern Med 2014; 29 (6): 850–4.
3 Clark SJ, Costello LE, Gebremariam A, Dombkowski KJ. A national survey of parent perspectives on use of patient portals for their children’s health care. Appl Clin Inform 2015; 06 (01): 110–9.
4 Britto MT, Hesse EA, Kamdar OJ, Munafo JK. Parents’ perceptions of a patient portal for managing their child’s chronic illness. J Pediatr 2013; 163 (1): 280–1.e281–2.
5 Fiks AG, DuRivage N, Mayne SL, et al. Adoption of a portal for the primary care management of pediatric asthma: a mixed-methods implementation study. J Med Internet Res 2016; 18 (6): e172.
6 Bush RA, Stahmer AC, Connelly CD. Exploring perceptions and use of the electronic health record by parents of children with autism spectrum disorder: a qualitative study. Health Informatics J 2016; 22 (3): 702–11.
7 Hong MK, Wilcox L, Feustel C, Wasileski-Masker K, Olson TA, Simoneaux SF. Adolescent and caregiver use of a tethered personal health record system. AMIA Annu Symp Proc 2016; 2016: 628–37.
8 Jackson SL, DesRoches CM, Frosch DL, Peacock S, Oster NV, Elmore JG. Will use of patient portals help to educate and communicate with patients with diabetes? Patient Educ Couns 2018; 101 (5): 956–9.
9 Fuzzell L, Fedesco HN, Alexander SC, Fortenberry JD, Shields CG. “I just think that doctors need to ask more questions”: sexual minority and majority adolescents’ experiences talking about sexuality with healthcare providers. Patient Educ Couns 2016; 99 (9): 1467–72.
10 Brittain AW, Williams JR, Zapata LB, Moskosky SB, Weik TS. Confidentiality in family planning services for young people: a systematic review. Am J Prev Med 2015; 49 (2 Suppl 1): S85–92.
11 Reddy DM, Fleming R, Swain C. Effect of mandatory parental notification on adolescent girls’ use of sexual health care services. JAMA 2002; 288 (6): 710–4.
12 Lehrer JA, Pantell R, Tebb K, Shafer MA. Forgone health care among U.S. adolescents: associations between risk characteristics and confidentiality concern. J Adolesc Health 2007; 40 (3): 218–26.
13 Jones RK, Purcell A, Singh S, Finer LB. Adolescents’ reports of parental knowledge of adolescents’ use of sexual health services and their reactions to mandated parental notification for prescription contraception. JAMA 2005; 293 (3): 340–8.
14 Caputi TL, Smith D, Ayers JW. Suicide risk behaviors among sexual minority adolescents in the United States, 2015. JAMA 2017; 318 (23): 2349–51.
15 Jacob JA. Single question can identify youth at risk for alcohol use disorder. JAMA 2016; 315 (20): 2158.
16 Chisolm DJ, Manganello JA, Kelleher KJ, Marshal MP. Health literacy, alcohol expectancies, and alcohol use behaviors in teens. Patient Educ Couns 2014; 97 (2): 291–6.
17 Mercado MC, Holland K, Leemis RW, Stone DM, Wang J. Trends in emergency department visits for nonfatal self-inflicted injuries among youth aged 10 to 24 years in the United States, 2001-2015. JAMA 2017; 318 (19): 1931–3.
18 Gini G, Espelage DL. Peer victimization, cyberbullying, and suicide risk in children and adolescents. JAMA 2014; 312 (5): 545–6.
19 Blythe MJ, Del Beccaro MA. Standards for health information technology to ensure adolescent privacy. Pediatrics 2012; 130 (5): 987–90.
20 Ford C, English A, Sigman G. Confidential health care for adolescents: position paper of the Society for Adolescent Medicine. J Adolesc Health 2004; 35 (2): 160–7.
21 American College of Obstetricians and Gynecologists (ACOG) Committee on Adolescent Health Care. Committee Opinion 599: Adolescent Confidentiality and Electronic Health Records. 2014. Washington, DC: American College of Obstetrics and Gynecology.
22 Bayer R, Santelli J, Klitzman R. New challenges for electronic health records: confidentiality and access to sensitive health information about parents and adolescents. JAMA 2015; 313 (1): 29–30.
23 Bourgeois FC, DesRoches CM, Bell SK. Ethical challenges raised by OpenNotes for pediatric and adolescent patients. Pediatrics 2018. e20172745; DOI: 10.1542/peds.2017-2745.
24 Sharko M, Wilcox L, Hong MK, Ancker JS. Variability in adolescent portal privacy features: How the unique privacy needs of the adolescent patient create a complex decision-making process. J Am Med Inform Assoc 2018; 25 (8): 1008–1017.
25 Anonymous. An Overview of Minors’ Consent Law. New York, NY: The Guttmacher Institute; 2018.
26 Beeson T, Mead KH, Wood S, Goldberg DG, Shin P, Rosenbaum S. Privacy and confidentiality practices in adolescent family planning care at federally qualified health centers. Perspect Sex Reprod Health 2016; 48 (1): 17–24.
27 Anonymous. OPA Program Policy Notice 2014-01—Confidential Services to Adolescents. Washington, DC: US Department of Health and Human Services, Office of Population Affairs; 2014.
28 Stablein T, Loud KJ, DiCapua C, Anthony DL. The catch to confidentiality: the use of electronic health records in adolescent health care. J Adolesc Health 2018; 62 (5): 577–582.
29 Anoshiravani A, Gaskin GL, Groshek MR, Kuelbs C, Longhurst CA. Special requirements for electronic medical records in adolescent medicine. J Adolesc Health 2012; 51 (5): 409–14.
30 Bourgeois FC, DesRoches CM, Bell SK. Ethical challenges raised by OpenNotes for pediatric and adolescent patients. Pediatrics 2018; 141 (6): e20172745.
31 Bourgeois FC, Nigrin DJ, Harper MB. Preserving patient privacy and confidentiality in the era of personal health records. Pediatrics 2015; 135 (5): e1125-7.
32 Wisk LE, Gray SH, Gooding HC. I thought you said this was confidential?—Challenges to protecting privacy for teens and young adults. JAMA Pediatr 2018; 172 (3): 209–10.
33 Wilcox L, Sharko M, Hong MK, Hollberg J, Ancker JS. The need for guidance and consistency in adolescent privacy policies: a survey of CMIOs. In: Proceedings/AMIA Annual Symposium AMIA Symposium. In press; 2018.
34 O'Rourke D, Blair J. Improving random respondent selection in telephone surveys. J Mark Res 1983; 20 (4): 428–32.
35 American Association for Public Opinion Research (AAPOR). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. Oakbrook Terrace, IL: AAPOR; 2016.
36 Gallup. Conservative lead in U.S. ideology is down to single digits. Politics 2018; http://news.gallup.com/poll/225074/conservative-lead-ideology-down-single-digits.aspx Accessed February 7, 2018.
37 Sieving RE, McRee AL, McMorris BJ, et al. Youth-adult connectedness: a key protective factor for adolescent health. Am J Prev Med 2017; 52 (3s3): S275–8.
38 Hartocollis A. His college knew of his despair. His parents didn’t, until it was too late.
The New York Times . May 12, 2018 : 2018 . 39 Gaskin GL , Bruce J , Anoshiravani A. Understanding parent perspectives concerning adolescents’ online access to personal health information . J Particip Med 2016 ; 8 : e3 . Google Scholar PubMed 40 Bergman DA , Brown NL , Wilson S. Teen use of a patient portal: a qualitative study of parent and teen attitudes . Perspect Health Inf Manag 2008 ; 5 : 13 . Google Scholar PubMed 41 Grando MA , Murcko A , Mahankali S , et al. . A study to elicit behavioral health patients’ and providers’ opinions on health records consent . J Law Med Ethics 2017 ; 45 ( 2 ): 238 – 59 . Google Scholar Crossref Search ADS © The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Predicting individual physiologically acceptable states at discharge from a pediatric intensive care unit
Carlin, Cameron S; Ho, Long V; Ledbetter, David R; Aczon, Melissa D; Wetzel, Randall C
doi: 10.1093/jamia/ocy122 pmid: 30295770
Abstract

Objective
Quantify physiologically acceptable PICU-discharge vital signs and develop machine learning models to predict these values for individual patients throughout their PICU episode.

Methods
EMR data from 7256 survivor PICU episodes (5632 patients) collected between 2009 and 2017 at Children’s Hospital Los Angeles were analyzed. Each episode contained 375 variables representing physiology, labs, interventions, and drugs. Between medical discharge (when clinicians determined the patient was ready for ICU discharge) and physical discharge, patients were assumed to be in a physiologically acceptable state space (PASS) for discharge. Each patient’s heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP) in the PASS window were measured and compared to age-normal values, regression-quantified PASS predictions, and recurrent neural network (RNN) PASS predictions made 12 hours after PICU admission.

Results
Mean absolute errors (MAEs) between individual PASS values and age-normal values (HR: 21.0 bpm; SBP: 10.8 mm Hg; DBP: 10.6 mm Hg) were greater (p < .05) than regression prediction MAEs (HR: 15.4 bpm; SBP: 9.9 mm Hg; DBP: 8.6 mm Hg). The RNN models best approximated individual PASS values (HR: 12.3 bpm; SBP: 7.6 mm Hg; DBP: 7.0 mm Hg).

Conclusions
The RNN model predictions better approximate patient-specific PASS values than regression and age-normal values.

Keywords: neural networks, patient discharge, electronic health records, pediatric intensive care units, supervised machine learning

Introduction
Patients in intensive care undergo high-frequency monitoring and are treated to achieve and maintain homeostasis, a state approximating health.1 Homeostasis can be thought of as a composition of homeostatic volatility, the stability of a patient’s state over time, and homeostatic deviance, the distance of a patient’s current state from a medically acceptable state.
An implicit goal of achieving homeostasis in an intensive care setting is restoring a patient’s physiological state towards a medically acceptable state. In practice, the values that determine this medically acceptable state are often estimated by the clinical team’s experience and expertise, and in general are not explicitly defined.

Existing measures of health and mortality risk, like PRISM-III, PIM2, and PEWS, rely on the deviance of routine physiologic signs from explicitly defined normal values.2–4 Similarly, many ICU admission criteria5–8 include assessments of a patient’s mental and physical state, which base their standards of health on acceptable vitals. Data describing age-normal vital signs and their implications in a clinical setting9–11 primarily focus on samples of healthy individuals from the general population. Recently, Fleming et al9 described issues with reference ranges of vital signs currently used in acute settings, and aggregated existing studies’ heart rate and respiratory rate baselines to generate centile distributions to update clinical guidelines. Studies have reported the age distribution of vital signs in hospitalized children12 and age-normalized centiles of vital signs in children in intensive care.13,14 There are also differences found in EMR data sourced from primary care vs intensive care patients.15 Nevertheless, quantifying how far a patient’s physiologic state deviates from acceptable during the ICU episode requires an explicit measure of a defined acceptable state.

The physiologic state of a critically ill child changes during an ICU episode. This can be represented as a dynamic system that transitions through multiple states over time. At any point in time, this dynamic system can be described as the patient’s state space, a multidimensional, mathematical characterization of the features that contribute to their condition.
During an ICU episode, a successfully discharged child transitions through states of critical illness, stability, and a physiologically acceptable state space (PASS) for discharge from the ICU. The concept of PASS encompasses the physiologic state of health in which clinicians, based on all available information, have determined that a patient is acceptable for discharge from intensive care.

This study took advantage of the period between medical discharge and physical discharge from the ICU, during which ICU monitoring continued and was available in the EMR, to quantify 3 vital signs (HR, SBP, DBP) associated with the PASS. It explored whether features of an individual child’s PASS could be predicted by 2 machine learning methodologies (regression analysis and RNNs). The ability to predict features of PASS throughout a patient’s PICU stay may provide an explicit estimate of the patient’s homeostatic deviance from their PASS over time.

This study aims to quantify 3 vital signs as examples of features associated with PASS, and to develop machine learning models that accurately predict these values for individual patients throughout the duration of their PICU stay.

Methods

Electronic Medical Records
The data were extracted from anonymized observational clinical data collected in Electronic Medical Records (EMR, Cerner Millennium, Cerner Corporation, Kansas City, Mo.) in the PICU of Children’s Hospital Los Angeles (CHLA) between January 2009 and October 2017. The CHLA Institutional Review Board (IRB) reviewed the study protocol and waived the need for IRB approval. An episode of ICU care was defined as one continuous stay. If a patient had multiple PICU admissions, these were considered separate episodes. Each episode had charted time series measurements for over 375 variables representing vital signs, laboratory results, interventions, and drugs.
The full list of variables can be found in Clinical Data Used in the Supplementary Appendix.16 CHLA’s PICU also records the times of medical discharge, that is, the time the clinical team deems the patient healthy enough for discharge from the PICU, and physical discharge, the time the patient departs the PICU. The time from PICU admission to medical discharge is defined as the Pre-Medical Discharge (PMD) period, and the time from medical discharge to physical discharge is defined as the Medical-to-Physical Discharge period. The Length of Stay is defined as the time from PICU admission to physical discharge from the PICU.

Previous work has discussed many of the complications of working with EMR data. Huff et al17 described electronic medical data encoded as linked events over time, and Lasko et al18 discussed many of the challenges in working with sparse, time-dependent pediatric EMR data. As such, a combination of pre-processing techniques is required to generate a matrix representation of EMR data amenable to machine learning algorithm development, as previously described.16,19 Due to the sparse nature of charted medical data, pre-processing included imputation and normalization of variables. The resultant matrix is a sequence of feature vectors, where each row holds one variable over time and each column holds every variable at one time step. The matrix representation of a patient’s physiologic state space, as illustrated in Figure 1(a), has a 375-dimensional feature vector at each time point describing the patient’s state space. The pre-medical discharge and medical-to-physical discharge periods are shown in Figure 1(b).

Figure 1. a) Data for a single episode is shown in a processed matrix format. A single row of data contains actual and imputed measurements from a single variable. A column of data comprises all measurements at one time point. Adapted with permission from reference 16. b) The means of a patient’s vitals between medical and physical discharge from the PICU define this patient’s PASS. Data from the pre-medical discharge period are used to predict individual PASS values.

Models were trained only on episodes from the population of 9879 episodes that met the definition of Successful Discharge: patients survived their PICU episode (9470 episodes) and were not readmitted within 48 hours (9310). A medical-to-physical discharge period of at least 2 hours (8476) and a length of stay of at least 12 hours (8327) were required. It was also required that at least 2 observations of heart rate, systolic blood pressure, and diastolic blood pressure exist within the medical-to-physical discharge period. Within the final cohort of 7256 episodes (5632 patients), the 25th/50th/75th percentiles of length of stay were 33/60/120 hours, and the 25th/50th/75th percentiles of the medical-to-physical discharge period length were 5/9/27 hours, respectively.

Prior to analysis, the episodes were randomly split into training, validation, and test sets such that all episodes from a single patient belonged to only one of the 3 sets, to prevent biasing test set metrics. Sixty percent of patients were in the training set (4399 episodes, 3398 patients), 20% in the validation set (1447 episodes, 1119 patients), and the remaining 20% in the test set (1410 episodes, 1115 patients).
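The patient-level split described above can be illustrated with a minimal Python sketch. It is not the authors’ code: the function name `split_by_patient`, the seed, and the rounding of set sizes are assumptions made for illustration. The only property taken from the text is that every episode of a given patient lands in exactly one of the 3 sets, with roughly 60%/20%/20% of patients per set.

```python
import random


def split_by_patient(episode_patient_ids, fractions=(0.6, 0.2), seed=0):
    """Assign episode indices to train/validation/test sets by patient.

    episode_patient_ids[i] is the (hypothetical) patient identifier of
    episode i. fractions gives the train and validation shares of
    patients; the remaining patients form the test set.
    """
    # Shuffle unique patients, not episodes, so that a patient's
    # episodes can never be spread across different sets.
    patients = sorted(set(episode_patient_ids))
    random.Random(seed).shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))

    assignment = {}
    for rank, patient in enumerate(patients):
        if rank < n_train:
            assignment[patient] = "train"
        elif rank < n_train + n_val:
            assignment[patient] = "validation"
        else:
            assignment[patient] = "test"

    # Collect episode indices per set, following each patient's assignment.
    splits = {"train": [], "validation": [], "test": []}
    for index, patient in enumerate(episode_patient_ids):
        splits[assignment[patient]].append(index)
    return splits
```

Because the split is made over patients, the episode counts per set are not exactly 60/20/20, which is consistent with the episode and patient counts reported above differing slightly in proportion.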
The Physiologically Acceptable State Space
The concept of a patient’s Physiologically Acceptable State Space (PASS) encompasses the entire spectrum of values associated with successful ICU discharge. Delays between medical and physical discharge allow quantification of this state in patients successfully discharged from the ICU. As physiologic representatives of this state, a triad of vital signs was selected due to their importance in determining physiologic stability in the PICU.1,2,5 The means of heart rate ($\mu_{hr}$), systolic blood pressure ($\mu_{sbp}$), and diastolic blood pressure ($\mu_{dbp}$) within the medical-to-physical discharge period were calculated for each of the 7256 episodes, representing a portion of the physiologic state space considered acceptable for PICU discharge. This aggregation from the full EMR can be seen in Figure 1(b). This triad was used to demonstrate predictive modeling of a representative portion of features from the overall PASS distribution. Note that these 3 variables do not determine the PASS but are associated with it. The heart rate, systolic blood pressure, and diastolic blood pressure PASS values for each patient in the training set were plotted as a function of age in Figure 2 for comparison to published age-normal vital signs and age-dependent regression models.

Figure 2. Published age-normal values, PASS values, and the PASS regression for heart rate, systolic blood pressure, and diastolic blood pressure.

Age-normal Values
Published age-normal vital signs for heart rate9,20,21 and blood pressure11 were used for comparisons to machine learning model predictions. Age-normal values are traditionally given as ranges, as shown by Age-Normal Min and Age-Normal Max in Figure 2.
To assess the machine learning models, the midpoint of the minimum and maximum normal values (shown as Age-Normal Mean in Figure 2) was used as a baseline.

Machine Learning Models for Predicting PICU-PASS
Predictions made by 2 machine learning methodologies, regression and RNNs, were compared to individual patients’ actual PASS values. A polynomial regression model was generated to estimate each PASS vital sign from age. The polynomial order for each model was selected by fitting candidate models on the training set and choosing the order with the lowest error on the validation set. A fifth-order polynomial was found to have the lowest error for each vital sign. The polynomial equations are found in Regression Polynomials in the Supplementary Appendix. These polynomials were used to generate PASS predictions on the test set episodes.

Predicting PASS using Recurrent Neural Networks
Because factors other than age likely influence the PASS, recurrent neural networks (RNNs) were trained to capture the relationships among these factors to predict individual patient PASS values. Designed with a feedback loop, RNNs can sequentially ingest and integrate time series data to learn temporal relationships.16,19 Specifically, Hochreiter’s Long Short-Term Memory (LSTM) architecture22 was used, as LSTMs can generate patient-specific predictions that update as more information is processed over time. Medical applications with successful RNN use include a time-varying severity of illness score,16 early detection of critical decompensation in children,23 onset of heart failure,24 de-identification of patient notes,25 and disease diagnosis from EMR.26

Because the PASS triad, [$\mu_{hr}$, $\mu_{sbp}$, $\mu_{dbp}$], is calculated from the medical-to-physical discharge period for each patient, data from this period were not included in our training process.
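The separation between model inputs and PASS targets described above can be sketched as follows. This is an illustrative assumption rather than the authors’ pipeline: the function name `build_inputs_and_targets`, the argument names, and the use of a dense matrix (rows are variables, columns are time steps, as in Figure 1) are hypothetical, and the sketch counts time steps in the window where the study counted charted observations.

```python
import numpy as np


def build_inputs_and_targets(features, times, t_medical, t_physical, vital_rows):
    """Split one episode into pre-medical-discharge inputs and PASS targets.

    features   : array of shape (n_variables, n_steps)
    times      : array of shape (n_steps,), hours since PICU admission
    t_medical  : hour of medical discharge
    t_physical : hour of physical discharge
    vital_rows : row indices of heart rate, SBP, and DBP (hypothetical layout)
    """
    features = np.asarray(features, dtype=float)
    times = np.asarray(times, dtype=float)

    # Columns inside the medical-to-physical discharge window define the PASS.
    in_window = (times >= t_medical) & (times <= t_physical)
    if in_window.sum() < 2:
        raise ValueError("need at least 2 time steps in the discharge window")

    # PASS triad: mean of each selected vital over the window.
    pass_triad = features[vital_rows][:, in_window].mean(axis=1)

    # Model inputs: only columns charted before medical discharge, so the
    # window used to compute the target never reaches the model.
    inputs = features[:, times < t_medical]

    # The same triad is the target at every input time step.
    targets = np.tile(pass_triad[:, None], (1, inputs.shape[1]))
    return inputs, pass_triad, targets
```

Repeating the triad at every time step mirrors the statement that the models predict the PASS values at all time steps, while the strict `times < t_medical` mask keeps the target window out of the inputs.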
Two RNN models with identical architecture (see RNN Architecture in the Supplementary Appendix) were developed, differing only in how they were trained. RNNPMD was trained on all data available prior to medical discharge, as shown in Figure 3(a). The second model, RNN12h, was trained on only the first 12 hours of data, as shown in Figure 3(b). Both RNN models were trained on the same patients, and both models can predict PASS at all time steps. These 2 models allowed comparisons of model performance between training on data near PICU admission (RNN12h) and training on all pre-medical discharge data (RNNPMD). Both RNN models were trained by minimizing prediction error in the validation set,27 and both models were then used to predict patient PASS values on the test set.

Figure 3. a) During training, RNNPMD minimizes errors over all data prior to medical discharge. b) RNN12h minimizes errors over the first 12 hours after PICU admission. Both models predict mean heart rate ($\mu_{hr}$), systolic blood pressure ($\mu_{sbp}$), and diastolic blood pressure ($\mu_{dbp}$) derived from the medical-to-physical discharge period at every time step. During assessment, test set predictions are generated at the 12th hour following PICU admission for comparing model errors.

Assessment Metrics
The 1410 episodes in the test set were used to assess model predictions.
The published age-norms, the values predicted from the regression, and both RNN model predictions made at the 12th hour following PICU admission were compared to each child’s measured PASS triad values. Mean absolute error (MAE) was used to compare model errors, defined by:

$$\mathrm{MAE}_{vital} = \frac{1}{N} \sum_{i=1}^{N} \left| \mu_{vital}(P_i) - \hat{\mu}_{vital}(P_i) \right|,$$

where $N$ denotes the number of episodes in the test set. The notation $\mu_{vital}(P_i)$ represents the PASS value for a vital ($\mu_{hr}$, $\mu_{sbp}$, or $\mu_{dbp}$) of the $i$th patient episode in the test set, while $\hat{\mu}_{vital}(P_i)$ denotes the model prediction for the same PASS value.

For each patient’s episode, there is 1 true PASS value, 1 age-normal value, and 1 value determined by regression. In contrast, the 2 RNN models make predictions at every time step, as illustrated in Figure 4. To assess performance, RNN predictions made at the 12th hour following PICU admission were used to calculate MAEs. For pairwise model comparisons, two-sample t-tests28 between the PASS errors of the age-normal, regression, and RNN model predictions were calculated, with significant differences determined by p < .05.

Figure 4. This figure contains predictions for one test set patient, age 12, throughout the course of their PICU episode. While the Age-Normal value (80) and PASS Regression prediction (97) are constant over time, the predictions of both RNNs are made at every time step of the patient’s episode. Note that only RNNPMD is shown for clarity. Our metrics assess the predictions made at the 12th hour, but predictions can be updated as new information enters the EMR over time.

To gain further insight into distinct PICU subpopulations, MAEs were calculated for several population subsets. Results were partitioned by pre-existing age-normal ranges, by ICD-9 primary diagnoses for a better understanding of our model’s validity across distinct illnesses,29 and by PIM2 score quartiles to better understand validity across severity of illness.3

Results
Table 1 displays the MAEs between the true PASS values and the age-normal values, regression predictions, and 2 RNN predictions, respectively. The regression predictions outperformed published age-normal values for all vital signs, with MAE reductions of 27% for heart rate, 8% for systolic blood pressure, and 19% for diastolic blood pressure (p < .05). Compared to published age-normal values, RNNPMD reduced MAE by 41% for heart rate, 30% for systolic blood pressure, and 34% for diastolic blood pressure. Compared to the regression predictions, RNNPMD had MAE reductions of 20%, 23%, and 19%, respectively (p < .05). There was no statistically significant difference between the 2 RNN models.

Table 1. Comparing mean absolute error (MAE) between patient-specific PASS vital signs, age-normal, age-dependent regression model, and 12th hour predictions from two RNN models

| Model | Heart rate (bpm) | Systolic blood pressure (mm Hg) | Diastolic blood pressure (mm Hg) |
|---|---|---|---|
| Age-Normal | 21.0 | 10.8 | 10.6 |
| Regression | 15.3 *** | 9.7 ** | 8.5 *** |
| RNN12h | 12.5 ***/*** | 7.8 ***/*** | 7.5 ***/*** |
| RNNPMD | 12.3 ***/***/. | 7.6 ***/***/. | 7.0 ***/***/. |

P-values of error comparisons to age-normal, regression, and RNN12h within the same column are denoted by sequential markers (.: p > .05, *: p ≤ .05, **: p ≤ .01, ***: p ≤ .001) in each cell.

Results for Select Diagnoses
Results were partitioned and compared by ICD-9 encoded primary diagnoses.29 Table 2 displays the results for the 5 most common primary diagnoses in the dataset. Because it is possible for episodes to have more than 1 primary diagnosis, the diagnosis subsets in Table 2 are not necessarily isolated populations. Across all diagnoses and PASS vital signs, RNN12h and RNNPMD had the lowest errors of all 4 methods. Published age-normal heart rate values were not statistically different from the regression predictions for Brain Neoplasm patients.

Table 2. Comparison of model performance, in MAE, parsed by the most common ICD-9 primary diagnoses in our dataset

| Vital | Model | Idiopathic Scoliosis | Brain Neoplasm | Acute Respiratory Failure | Asthma with Status Asthmaticus | Anomalies of Skull/Face Bones |
|---|---|---|---|---|---|---|
| | Count | 81 | 73 | 56 | 56 | 55 |
| HR | Age-Normal | 31.3 | 15.5 | 22.6 | 30.8 | 18.1 |
| HR | Regression | 19.4 *** | 17.5 . | 14.8 ** | 17.8 *** | 13.5 . |
| HR | RNN12h | 14.9 ***/* | 12.6 ./** | 13.0 **/. | 15.8 ***/. | 11.9 */. |
| HR | RNNPMD | 14.4 ***/**/. | 12.7 ./**/. | 12.7 ***/./. | 15.3 ***/./. | 11.9 */./. |
| SBP | Age-Normal | 11.5 | 10.2 | 10.3 | 7.9 | 11.4 |
| SBP | Regression | 10.2 . | 8.6 . | 9.2 . | 7.0 . | 9.5 . |
| SBP | RNN12h | 8.4 **/. | 7.8 */. | 7.6 ./. | 6.7 ./. | 7.7 */. |
| SBP | RNNPMD | 7.8 ** | 7.8 */./. | 7.7 ././. | 6.6 ././. | 7.3 **/./. |
| DBP | Age-Normal | 12.8 | 10.7 | 8.8 | 9.3 | 11.2 |
| DBP | Regression | 7.8 *** | 7.9 * | 7.1 . | 7.4 . | 7.3 ** |
| DBP | RNN12h | 6.6 ***/. | 6.5 ***/. | 6.6 */. | 7.0 ./. | 7.0 ***/. |
| DBP | RNNPMD | 6.3 ***/./. | 6.4 ***/./. | 6.3 */./. | 6.6 */./. | 6.4 ***/./. |

P-values of error comparisons to age-normal, regression, and RNN12h within the same column and vital are denoted by sequential markers (.: p > .05, *: p ≤ .05, **: p ≤ .01, ***: p ≤ .001) in each cell.

Results by Age-normal Bins
Results partitioned by age are shown in Table 3. Note that the heart rate and blood pressure bins are not identical because they come from different sources. Heart rate regression predictions showed improvement over published age-normal values across all bins except 0-1 month. Systolic blood pressure regression predictions had significantly lower MAEs than published age-normal values in the 1-12 months, 1-2 years, and 12+ years bins. For diastolic blood pressure, regression showed significant improvement over published age-normal values in the 1-12 months, 1-2 years, 10-11 years, and 12+ years bins.

Table 3. Comparison of model MAEs parsed by pre-existing age bins from age-normal baselines. Note that the heart rate and blood pressure bins are not identical

Heart rate (HR):

| Model | 0-1 Mos. | 1-12 Mos. | 1-2 Yrs. | 3-4 Yrs. | 5-6 Yrs. | 7-9 Yrs. | 10+ Yrs. |
|---|---|---|---|---|---|---|---|
| Count | 14 | 211 | 228 | 124 | 74 | 159 | 600 |
| Age-Normal | 16.7 | 19.3 | 21.5 | 19.5 | 20.6 | 20.4 | 22 |
| Regression | 11.0 . | 13.4 *** | 12.8 *** | 14.7 *** | 16.0 * | 14.3 *** | 17.3 *** |
| RNN12h | 8.9 ./. | 12.8 ***/. | 11.4 ***/. | 13.2 ***/. | 12.7 ***/. | 12.6 ***/. | 12.8 ***/*** |
| RNNPMD | 9.3 ././. | 12.8 ***/./. | 11.4 ***/./. | 12.9 ***/./. | 12.6 ***/./. | 12.5 ***/./. | 12.3 ***/***/. |

Systolic (SBP) and diastolic (DBP) blood pressure:

| Vital | Model | <96 Hrs. | 1-12 Mos. | 1-2 Yrs. | 3-5 Yrs. | 6-9 Yrs. | 10-11 Yrs. | 12+ Yrs. |
|---|---|---|---|---|---|---|---|---|
| | Count | 2 | 223 | 228 | 157 | 200 | 91 | 509 |
| SBP | Age-Normal | 2.8 | 11.2 | 9.7 | 10.4 | 9.3 | 9.5 | 12.1 |
| SBP | Regression | 12.8 . | 9.4 * | 8.1 * | 10.2 . | 9.2 . | 9.4 . | 10.5 ** |
| SBP | RNN12h | 5.4 ./. | 7.7 ***/** | 7.1 ***/* | 8.7 */. | 7.6 */** | 7.0 */** | 8.2 ***/*** |
| SBP | RNNPMD | 6.6 ././. | 7.4 ***/***/. | 7.1 ***/*/. | 8.5 */*/. | 7.5 **/**/. | 6.8 **/**/. | 7.9 ***/***/. |
| DBP | Age-Normal | 3.4 | 10.9 | 9.8 | 9.3 | 8.8 | 10.1 | 11.9 |
| DBP | Regression | 5.1 . | 8.2 ** | 7.5 *** | 9.6 . | 8.2 . | 7.8 * | 8.9 *** |
| DBP | RNN12h | 3.3 ./. | 7.4 ***/. | 7.2 ***/. | 8.8 ./. | 7.4 */. | 6.5 ***/. | 7.4 ***/*** |
| DBP | RNNPMD | 4.2 ././. | 6.9 ***/./. | 6.9 ***/./. | 8.0 ./*/. | 6.9 **/*/. | 6.0 ***/*/. | 7.1 ***/***/. |

P-values of error comparisons to age-normal, regression, and RNN12h within the same column and vital are denoted by sequential markers (.: p > .05, *: p ≤ .05, **: p ≤ .01, ***: p ≤ .001) in each cell.

Except in the youngest blood pressure bin, which only had 2 data points, the RNNs better approximated PASS values than age-normal values. For heart rate, the RNN models only had statistical improvements over regression in the 10+ years bin. For systolic blood pressure, at least 1 RNN model showed improvements over regression in all bins except the youngest. For diastolic blood pressure, at least 1 RNN model showed improvements over regression in all bins over 6 years old. There was no statistically significant difference between the 2 RNN models’ performances in any bin.

Results by PIM2 Score
Table 4 displays model errors partitioned by PIM2 quartiles, where lower PIM2 scores indicate a lower severity of illness. The regression model showed improvements over age-normal approximations for heart rate and diastolic blood pressure across all PIM2 quartiles, and improvements in the 1st and 4th PIM2 quartiles for systolic blood pressure. For all vitals and all quartiles, the RNNs more closely approximated individual PASS values than age-normal values. Similarly, the RNN models outperformed regression across all conditions except the 4th quartile for heart rate predictions.
There were no differences in model performance between the 2 RNNs.

Table 4. Comparison of model performance, in MAE, parsed by PIM2 score quartiles

Vital  PIM2 Quartile   1st Quartile      2nd Quartile      3rd Quartile      4th Quartile
       Quartile Range  (−8.411, −6.31)   (−6.30, −4.83)    (−4.82, −4.15)    (−4.14, 2.53)
HR     Age-Normal      20.9              23                19.2              21
       Regression      15.7 ***          15.9 ***          14.8 ***          14.7 ***
       RNN12h          12.7 ***/***      12.6 ***/***      11.5 ***/***      13.4 ***/.
       RNNPMD          12.3 ***/***/.    12.4 ***/***/.    11.4 ***/***/.    13.1 ***/./.
SBP    Age-Normal      10.5              10.8              10.7              11.2
       Regression      9.1 *             10.1 .            9.7 .             9.9 *
       RNN12h          7.6 ***/**        8.0 ***/***       7.4 ***/***       8.2 ***/***
       RNNPMD          7.4 ***/**/.      7.7 ***/***/.     7.3 ***/***/.     8.1 ***/***/.
DBP    Age-Normal      11.4              10.5              10.1              10.3
       Regression      8.2 ***           8.6 **            8.5 **            8.6 **
       RNN12h          7.3 ***/.         7.7 ***/.         7.1 ***/***       7.7 ***/.
       RNNPMD          6.9 ***/**/.      7.0 ***/**/.      6.8 ***/***/.     7.4 ***/**/.

P-values of error comparisons to age-normal, regression, and RNN12h within the same column and vital are denoted by sequential superscripts, shown here as slash-separated markers after each value (.: p > .05, *: p ≤ .05, **: p ≤ .01, ***: p ≤ .001) in each cell. Bold numbers indicate the lowest error metric for a given vital/PIM2 Quartile combination.

Discussion
Patients clinically determined suitable for medical discharge, yet remaining in the PICU, afforded the opportunity to observe their clinical features in their physiologically acceptable state space for discharge.
In this analysis, the PASS was represented by 3 vital signs for each child, measured during their medical-to-physical discharge period. In the ICU, published age-normal values are often implied as a guide for individual patient therapy. These results demonstrate that clinicians do not wait for age-normal vital signs to determine whether a child is well enough to be discharged from the ICU. Defining PASS associated with medically acceptable discharge was the first step towards being able to quantify a patient’s deviation from it during their ICU course. The next step was determining whether the individual child’s PASS could be predicted. An age-dependent regression analysis was used to model and predict PASS values for 3 vital signs in the PICU population. The regression PASS predictions were significantly different from published age-normal values, consistent with previous studies.12–15 In addition, 2 RNN models demonstrated the ability to predict individual patient PASS values, and these predictions approximated the child’s true PASS values better than age-normal and regression approximations across different diagnostic categories and severities of illness. The similarities and distinctions between the 2 RNN models, RNNPMD and RNN12h, are important for clinical application. While RNNPMD was trained to learn patient trajectories from admission to medical discharge, RNN12h learned trajectories only from the first 12 hours following PICU admission. Both models generate PASS predictions over time, but both were compared only on their 12th-hour predictions. It is important that the RNN models were accurate not only in the overall PICU population, but also in distinct subpopulations, because this validates the patient-specific nature of the RNN predictions. Patients with systemic disease, such as those with Acute Respiratory Failure,30 had more deviant vital signs than those with localized disease processes such as Brain Neoplasm.
Table 2 shows that heart rate differences between age-normal and true PASS values were greater for patients with Acute Respiratory Failure, Scoliosis, and Asthma than those with Brain Neoplasm and Skull Anomalies. In contrast, the errors corresponding to the regression and RNN models had smaller variations across diagnostic categories. Furthermore, the RNNs’ predictions across PIM2 quartiles were consistent regardless of diagnosis. This implies that the RNNs incorporate other EMR data in distinguishing how healthy or sick a patient may be. The RNNs also outperformed age-normal and regression approximations on the subsets, which further verifies the RNNs’ ability to make patient-specific assessments across patient stratifications. The goal of this project was to define and predict patient-specific vital signs associated with medically acceptable discharge status. This extends the work of those who have reported on the physiologic status of patients during the entirety of their PICU stay2–4,9–14 by specifically characterizing individual children’s PASS. The ability to estimate a patient’s PASS within 12 hours of admission may have uses in developing machine learning approaches to clinical management of critically ill children. Quantifying the deviance of a PICU patient’s state space from a defined acceptable state, as the patient transitions from critically ill to discharge ready, can be facilitated by having more accurate determinations of the patient’s PASS vital signs or other features. The ability to posit a child’s physiologically acceptable state space shortly after ICU admission suggests potential for developing personalized monitoring of patient status, stability, and trajectory. There are limitations to this study. Only 3 variables associated with the complete PASS were assessed, and the predictions of these variables do not necessarily represent actual targets for care.
Further, PASS as defined in this study does not identify all possible acceptable state spaces, merely those observed in the study population. Complex analysis of PASS will be necessary to further define other aspects of this state space across variables and conditions. Future work will expand the predicted physiologic features of the patient’s PASS. Another limitation is that the population came from only 1 tertiary PICU. Future work will endeavor to validate these results using data from multiple collaborating sites. Lastly, the study is retrospective in nature. Longitudinal clinical trials are necessary to evaluate the impact of patient-specific PASS predictions as a clinical support and patient monitoring tool throughout the treatment process.
Conclusion
The concept of PASS encompasses a defined acceptable PICU discharge state. The quantified PASS vital signs acceptable for PICU discharge were compared to published age-normal values and predictions from age-dependent regression and RNN models. The RNN model predictions better approximate patient-specific PASS values than regression and age-normal values.
Funding
This work was supported by a grant from the Laura P. and Leland K. Whittier Foundation.
Contributors
All listed authors contributed to the design, analysis, drafting, and approval of this work.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
Competing interests
None.
References
1 Yeh TS, Pollack MM, Ruttimann UE, et al. Validation of a physiologic stability index for use in critically ill infants and children. Pediatr Res 1984;18(5):445–51.
2 Pollack MM, Patel KM, Ruttimann UE. The pediatric risk of mortality III—acute physiology score (PRISM III-APS): a method of assessing physiologic instability for pediatric intensive care unit patients. J Pediatr 1997;131(4):575–81.
3 Slater A, Shann F, Pearson G. PIM2: a revised version of the paediatric index of mortality. Intensive Care Med 2003;29(2):278–85.
4 Duncan H, Hutchison JS, Parshuram CS. The pediatric early warning system score: a severity of illness score to predict urgent medical need in hospitalized children. J Crit Care 2006;21(3):271–8.
5 Chalmers JD, Singanayagam A, Akram AR, et al. Severity assessment tools for predicting mortality in hospitalised patients with community-acquired pneumonia. Systematic review and meta-analysis. Thorax 2010;65(10):878–83.
6 Lockrem J. Recommendations for intensive care unit admission and discharge criteria. Crit Care Med 1989;17(6):597.
7 Dawson J. Admission, discharge, and triage in critical care. Principles and practice. Crit Care Clin 1993;9(3):555–74.
8 Nates J, Nunnally M, Kleinpell R, et al. ICU admission, discharge, and triage guidelines: a framework to enhance clinical operations, development of institutional policies, and further research. Crit Care Med 2016;44(8):1553–602.
9 Fleming S, Thompson M, Stevens R, et al. Normal ranges of heart rate and respiratory rate in children from birth to 18 years of age: a systematic review of observational studies. Lancet 2011;377(9770):1011–8.
10 Knudson RJ, Slatin RC, Lebowitz MD, et al. The maximal expiratory flow-volume curve. Am Rev Respir Dis 2015;113(5):587–600.
11 Falkner B, Daniels S, Flynn J, et al. The fourth report on the diagnosis, evaluation, and treatment of high blood pressure in children and adolescents. Pediatrics 2004;114(2 III):555–76.
12 Bonafide CP, Brady PW, Keren R, et al. Development of heart and respiratory rate percentile curves for hospitalized children. Pediatrics 2013;131(4):e1150.
13 Eytan D, Goodwin AJ, Greer R, et al. Distributions and behavior of vital signs in critically ill children by admission diagnosis. Pediatr Crit Care Med 2018;19(2):115–24.
14 Eytan D, Goodwin A, Greer R, et al. Heart rate and blood pressure centile curves and distributions by age of hospitalized critically ill children. Front Pediatr 2017;5:52.
15 Albers DJ, Elhadad N, Claassen J, et al. Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms. J Biomed Inform 2018;78:87–101.
16 Aczon M, Ledbetter D, Ho LV, et al. Dynamic mortality risk predictions in pediatric critical care using recurrent neural networks. arXiv preprint arXiv:1701.06675; 2017.
17 Huff SM, Rocha RA, Bray BE, et al. An event model of medical information representation. J Am Med Inform Assoc 1995;2(2):116–34.
18 Lasko T, Denny J, Levy M. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 2013;8(6):e66341.
19 Ho LV, Ledbetter D, Aczon M, et al. The Dependence of Machine Learning on Electronic Medical Record Quality. AMIA Annual Symposium Proceedings 2017;2017:883–891.
20 Simel D. Approach to the patient: history and physical examination. Goldmans Cecil Med 2016. 24th edition, Pages 22–27.
21 Behrman RE, Kligman RM, Jensen HB. Nelson’s Textbook of Pediatrics. Philadelphia, PA: WB Saunders; 2000.
22 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
23 Shah S, Ledbetter D, Aczon M, et al. Early prediction of patient deterioration using machine learning techniques with time series data. Crit Care Med 2016;44(12):87.
24 Choi E, Schuetz A, Stewart W, et al. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc 2017;24(2):361–70.
25 Dernoncourt F, Lee J, Uzuner O, et al. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc 2017;24(3):596–606.
26 Razavian N, Sontag D. Temporal convolutional neural networks for diagnosis from lab tests. arXiv preprint arXiv:1511.07938; 2015.
27 Greff K, Srivastava RK, Koutník J, et al. LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069; 2015.
28 Cressie NA, Sheffield LJ, Whitford HJ. Use of the one sample t-test in the real world. J Chronic Dis 1984;37(2):107–14.
29 Hazelwood A. ICD-9-CM Diagnostic Coding and Reimbursement for Physician Services 2006 Edition. Chicago, IL: American Health Information Management Association; 2005.
30 ARDS Definition Task Force. Acute respiratory distress syndrome. JAMA 2012;307(23):2526–33.
© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Project Tycho 2.0: a repository to improve the integration and reuse of data for global population health
van Panhuis, Willem G; Cross, Anne; Burke, Donald S
doi: 10.1093/jamia/ocy123 pmid: 30321381
Abstract
Objective
In 2013, we released Project Tycho, an open-access database comprising 3.6 million counts of infectious disease cases and deaths reported for over a century by public health surveillance in the United States. Our objective is to describe how Project Tycho version 1 (v1) data have been used to create new knowledge and technology and to present improvements made in the newly released version 2.0 (v2).
Materials and Methods
We analyzed our user database and conducted online searches to analyze the use of Project Tycho v1 data. For v2, we added new US data and dengue data for other countries, and grouped data into 360 datasets, each with a digital object identifier and rich metadata. In addition, we used standard vocabularies to encode data where possible, improving compliance with FAIR (findable, accessible, interoperable, reusable) guiding principles for data management.
Results
Since release, 3174 people have registered to use Project Tycho data, leading to 18 new peer-reviewed papers and 27 other creative works, such as conference papers, student theses, and software applications. Project Tycho v2 comprises 5.7 million counts of infectious diseases in the United States and of dengue-related conditions in 98 additional countries.
Discussion
Project Tycho v2 contributes to improving FAIR compliance of global health data, but more work is needed to develop community-accepted standard representations for global health data.
Conclusion
FAIR principles are a valuable guide for improving the integration and reuse of data in global health to improve disease control and save lives.
Keywords: global health, information storage and retrieval, public health surveillance, communicable diseases, information dissemination
BACKGROUND AND SIGNIFICANCE
Decisions in global population health can affect the lives of millions of people and can change the future of entire communities.
For example, the decision to declare an influenza pandemic and stockpile vaccines can save millions of lives if a pandemic of highly pathogenic influenza actually occurs, or can waste millions of dollars if the decision is based on a false alarm.1 Decisions in global health are often made under a high degree of uncertainty and with incomplete information. New data are rapidly emerging from mobile technology, electronic health records, and remote sensing.2 These new data can expand opportunities for data-driven decision making in global health. In reality, multiple layers of challenges, ranging from technical to ethical barriers, can limit the effective (re)use of data in global health.3,4 For example, composing an epidemic model to inform decisions about vaccine stockpiling requires the integration of existing data from a wide range of data sources, such as a population census, disease surveillance, environmental monitoring, and research studies.5 Integrating data can be a daunting task, especially since global health data are often stored in domain-specific data silos that can each use different formats and content standards, ie, they can be syntactically and semantically heterogeneous. The heterogeneity of data in global health can slow down scientific progress, as researchers have to spend much time on data discovery and curation.6 To improve access to standardized data in global health, we created the Project Tycho data repository in 2013.7 The first version of Project Tycho (v1) comprised over a century of infectious disease surveillance data for the United States that had been published in weekly reports between 1888 and 2014.7 Weekly US disease surveillance reports were previously available in PDF and HTML formats on various online repositories and were not usable for research without substantial data curation. We digitized, transformed, and standardized these data and publicly released the entire database through www.tycho.pitt.edu.
Since the public release of Project Tycho in 2013, over 3000 users have registered and have used Project Tycho data for research and technology development, leading to 45 new creative works. Now, we have released Project Tycho version 2.0 (v2). We have updated the content of Project Tycho data with new weekly US disease surveillance data, and we have added surveillance data for dengue-related conditions from 98 additional countries. In addition, we have redesigned the Project Tycho data format by grouping data into 360 datasets. Each dataset can now be identified by a digital object identifier (DOI) and can have its own metadata. We followed FAIR (findable, accessible, interoperable, and reusable) guiding principles where possible during the design of data and metadata representations.8 In this paper, we describe how others have used Project Tycho v1 data, illustrating the value of investing in a domain-specific open-data resource for accelerating science and creating new knowledge. We also describe the significant update of Project Tycho into version 2.0 with new data and improved FAIR compliance, towards a FAIR-compliant data repository for global population health.
METHODS
Analyze Project Tycho data use
We analyzed our database of users who registered between the release of Project Tycho v1 (November 28, 2013) and December 31, 2017, to describe the type of users and the data they downloaded. To access Project Tycho data, users have to create an account with their name, institutional affiliation, country, and email. Users entered their information, except country, as free text. We standardized the names of the 100 most commonly listed institutions. We then classified users into 4 categories: 1) academic, 2) government, 3) personal, 4) other.
We transformed all text to uppercase and classified all affiliations matching the regular expression “UNIVERSITY|UNIV|COLLEGE|SCHOOL|RESEARCH|FACULTY|ECOLE” as academic, affiliations matching the regular expression “NATIONAL|AGENCY|MINISTRY|^US|DEPARTMENT|DEPT” as government, affiliations with the words ‘NONE’, ‘N/A’, ‘SELF’, ‘PERSONAL’, ‘PRIVATE’, ‘STUDENT’, ‘HOME’, ‘-’, ‘RETIRED’, ‘INDIVIDUAL’, ‘ME’, ‘INDEPENDENT’, ‘NA’, ‘NOT APPLICABLE’, or ‘CONSULTANT’ as personal, and all remaining affiliations as other. For all users first classified in the “other” category, we classified users with an email address ending with “.edu” as academic, with “gmail.com” as personal, with “.com” and not “gmail.com” as corporate, and with “.gov” as government. All users who could not be classified remained in the “other” category. Since November 28, 2013, we have recorded the queries and downloads made by users through the online Graphical User Interface (GUI). When signing up for an account, users agree to a privacy statement that notifies them of query and download tracking. Since February 4, 2014, we have also recorded calls to our Application Programming Interface (API). We extracted the disease names from user queries and analyzed the number of queries for each disease by user category. We also conducted online searches to determine what creative works have resulted from use of Project Tycho v1 data by others (not including our own team). We defined a creative work as any publication (journal, newspaper, magazine, blog, website), algorithm, or software application that used Project Tycho data. During registration, users had to agree to an attribution license that requires citation of our primary paper describing the Project Tycho database, published in 2013 in the New England Journal of Medicine.7 We conducted searches in Google Scholar for citations of the primary Project Tycho paper to identify creative works potentially based on Project Tycho data.
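The affiliation rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `classify_user`, the order in which the rules are tried, and the treatment of the personal keywords as whole-string matches are all assumptions. Only the two regular expressions, the keyword list, and the email suffixes are taken from the text.

```python
import re

# Regular expressions quoted in the text, applied to uppercased affiliations.
ACADEMIC = re.compile(r"UNIVERSITY|UNIV|COLLEGE|SCHOOL|RESEARCH|FACULTY|ECOLE")
GOVERNMENT = re.compile(r"NATIONAL|AGENCY|MINISTRY|^US|DEPARTMENT|DEPT")

# Keywords quoted in the text for the "personal" category.
PERSONAL_WORDS = {
    "NONE", "N/A", "SELF", "PERSONAL", "PRIVATE", "STUDENT", "HOME", "-",
    "RETIRED", "INDIVIDUAL", "ME", "INDEPENDENT", "NA", "NOT APPLICABLE",
    "CONSULTANT",
}


def classify_user(affiliation: str, email: str) -> str:
    """Return a user category from free-text affiliation and email address.

    Assumptions (not stated in the text): rules are tried in the order
    academic, government, personal; personal keywords must match the whole
    affiliation string.
    """
    text = affiliation.strip().upper()
    if ACADEMIC.search(text):
        return "academic"
    if GOVERNMENT.search(text):
        return "government"
    if text in PERSONAL_WORDS:
        return "personal"

    # Second pass for users still in "other": fall back on the email suffix.
    address = email.strip().lower()
    if address.endswith(".edu"):
        return "academic"
    if address.endswith("gmail.com"):
        return "personal"
    if address.endswith(".com"):
        return "corporate"
    if address.endswith(".gov"):
        return "government"
    return "other"
```

With these assumptions, an affiliation such as "University of Pittsburgh" is classified as academic by the first rule, while an unrecognized affiliation with an address ending in ".com" falls through to the email rules and is classified as corporate.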
To identify papers that potentially used Project Tycho data, but did not cite the primary paper, we also searched PubMed, Scopus, Web of Science, Google Scholar, Google, GitHub, the Comprehensive R Archive Network (CRAN), Dryad, Figshare, and Zenodo for titles and/or abstracts containing the words “Project Tycho” and combinations of “Project Tycho” and disease names, or the words “vaccines” or “data.” We included only creative works published between the release of Project Tycho v1 and December 31, 2017. We then reviewed the main content of all papers and other creative works that potentially used Project Tycho data and verified that the data used were indeed derived from Project Tycho by reviewing data source descriptions and references. All papers that cited the Project Tycho primary paper but that did not use Project Tycho data were categorized as citations only and were not included in the analysis of Project Tycho data use.
Expanding data content
Project Tycho v1 included weekly National Notifiable Disease Surveillance System (NNDSS) reports published by the US Centers for Disease Control and Prevention (CDC) between September 9, 1887, and August 2, 2014. For version 2.0, we updated the US weekly surveillance data through the last week of 2017 by retrieving NNDSS data from the API of the CDC Morbidity and Mortality Weekly Report (MMWR).9 We standardized new NNDSS data into the Project Tycho standard format for pre-compiled datasets. We also added surveillance data for dengue-related conditions that we previously retrieved from the WHO DengueNet database, from WHO regional office websites, and from Ministries of Health in partner countries in Southeast Asia.
We have previously published details about the origin and collection methods of these dengue data.10,11 Dengue surveillance data comprised counts of the number of cases or deaths reported for a specific location (country, first administrative subdivision, second administrative subdivision) through passive disease surveillance systems.
Improved findability, accessibility, interoperability, and reusability
We grouped Project Tycho data into pre-compiled datasets, each of which comprises all data from one condition in one country. Many users conduct disease- or country-specific analysis, making country-condition groups of data a natural fit. A dataset can include highly heterogeneous information, eg, for a diversity of sublocations in a country and from numerous different sources. We also enabled the creation of custom datasets by users through a “create-your-own” GUI or the API. Grouping data into pre-compiled datasets enabled us to assign a DOI to each dataset, and to create a standard metadata file for each dataset. We minted Project Tycho DOIs consisting of a standard sequence “10.25337/T7/PTYCHO.V2.0/” and a unique sequence for each dataset. A dataset DOI resolves to the landing page of the dataset. Each landing page lists the dataset DOI, the download links for the data file (Comma Separated Value (CSV) format) and metadata files, and information about the condition, country, and time period represented by the dataset. We also listed the citation of the dataset on its landing page. For each dataset, we created metadata in the XML format specified by DataCite12 and in the JSON format specified by the Data Tag Suite (DATS).13 We defined a standard data format for pre-compiled Project Tycho datasets that includes 20 variables, and a format for custom compiled datasets that includes one additional variable, ie, the DOI of pre-compiled datasets from which each count was derived.
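The DOI scheme and the country-condition grouping described above can be sketched as follows. This is an illustrative sketch only: the unique sequence `EXAMPLE-0001`, the field names `country`, `condition`, and `count`, and the function names are hypothetical, since the text gives neither the format of the unique sequence nor the names of the 20 variables. The `https://doi.org/` prefix is the general DOI resolver rather than anything specific to Project Tycho.

```python
from collections import defaultdict

# Standard sequence quoted in the text; the unique part is dataset-specific.
DOI_PREFIX = "10.25337/T7/PTYCHO.V2.0/"


def dataset_doi(unique_sequence: str) -> str:
    """Join the standard sequence with a dataset's unique sequence."""
    return DOI_PREFIX + unique_sequence


def doi_url(doi: str) -> str:
    """Build the resolver URL that redirects to the dataset landing page."""
    return "https://doi.org/" + doi


def group_into_datasets(counts):
    """Group count records so each dataset holds one condition in one country.

    `counts` is an iterable of dicts with hypothetical keys "country" and
    "condition"; all other keys are carried along unchanged.
    """
    datasets = defaultdict(list)
    for record in counts:
        datasets[(record["country"], record["condition"])].append(record)
    return dict(datasets)
```

Keying each group by the (country, condition) pair mirrors the design choice stated in the text, in which one pre-compiled dataset, and therefore one DOI and one metadata file, corresponds to one condition in one country.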
Datasets comprise counts of cases or deaths due to certain conditions as reported by public health surveillance. The dataset variables represent attributes of counts, such as the location, time interval, condition, pathogen, etc. For each attribute, we tried to use a standard vocabulary or ontology. We searched the BioPortal and FAIRsharing catalogues, and the Google search engine to find appropriate standard vocabularies or ontologies. We encoded metadata for each dataset according to the DataCite XML and DATS JSON schemas. We first compiled information for as many metadata attributes as possible. For most attributes, information could be derived from the dataset itself, but for some attributes we conducted additional online searches. For example, we conducted additional searches for contact information and ORCID identifiers for dataset creators and authors. We searched the BioPortal and FAIRsharing catalogues, and the Google search engine to find appropriate standard vocabularies or ontologies to represent metadata attributes. A valuable component of the DATS schema is the “CitedBy” attribute, which enables papers that have cited a dataset to be represented in the dataset metadata. Although Project Tycho v1 data have been used by others, we have not yet used the “CitedBy” attribute because users have not yet cited the version 2.0 datasets and their DOIs. We plan to regularly update our dataset metadata with new “CitedBy” information, as users start to cite version 2.0 datasets.
RESULTS
Project Tycho v1 comprised 3 666 141 provisional weekly counts of cases and deaths due to 50 infectious disease conditions (Supplementary Table S1), reported by the US NNDSS between September 9, 1887, and August 2, 2014. In the United States, the number of cases or deaths due to a select set of infectious diseases has been reported routinely for weekly time intervals since 1887 in public health journals such as Public Health Reports and MMWR.
The US CDC and its precursors have published disease counts on a provisional basis every week and the final counts in annual summary reports.14 Project Tycho v1 included the provisional counts because the weekly time resolution enabled a greater range of epidemiological investigations compared to annual data, as illustrated by published work based on Project Tycho v1 data (https://www.tycho.pitt.edu/featured-works/).

Project Tycho user community

Between November 28, 2013, and December 31, 2017, 3174 new users registered to access Project Tycho data (Figure 1A). We identified 1869 unique institutional affiliations. About one third (1203) of all registered users were affiliated with one of the 100 most frequently listed institutions (Supplementary Table S2). Over half of the users (1734) had an academic affiliation, 502 had a corporate affiliation (16%), 200 listed a government affiliation (6%), 479 listed no affiliation (15%), and for the 259 remaining users, we could not determine the type of affiliation (8%). Most users were based in the United States (2208), followed by the United Kingdom (133) and India (101). Users represented a total of 92 countries (Figure 1B).

Figure 1. Project Tycho v1 registered users by day and country. (A) Number of user registrations per day since the release of Project Tycho v1, by type of user affiliation; (B) number of registered users per country.

Data reuse and knowledge generation

Users could download Project Tycho v1 data through our online GUI or via calls to our API. Between release and December 31, 2017, 1048 distinct users have downloaded 6809 datasets through our GUI.
We started tracking API calls on February 4, 2014, and between then and December 31, 2017, 161 users made 1 022 480 API calls. Seventy-four users downloaded data through both the GUI and the API. The maximum number of API calls made by one user was 390 519, and the maximum number of datasets downloaded by one user through the GUI was 375. We found that 6757 GUI downloads and 953 712 API calls included disease-specific information. We explored the number of downloads per disease and user affiliation (Figure 2). GUI users downloaded the most datasets for measles (1682), followed by Hepatitis A (851) and Pertussis (753) (Figure 2A). The distribution of conditions selected for downloading was not identical among the different types of users. For example, all cholera datasets were downloaded by personal users (not affiliated to a government, academic, or corporate institution), and a larger percentage of dengue datasets were downloaded by government users compared to other conditions (Figure 2A). User preferences for specific datasets likely reflected a combination of data availability and user interest. For example, the longest time series are available for measles and pertussis, and these diseases are also very highly studied in the infectious disease epidemiology domain; and the emergence of dengue is likely a concern for government health agencies. The relatively large proportion of datasets downloaded by personal users likely reflects a high level of interest among the general public in infectious diseases such as cholera and pneumonia, although these data should be interpreted carefully, as many professional users may not have listed their affiliation or may have used their Gmail address to login. We inspected the distribution of diseases downloaded by users across countries and did find very similar patterns across countries, suggesting no relationship between the disease burden in user countries and user interest in specific diseases. 
Users from India downloaded relatively more data on respiratory diseases compared to users from other countries, possibly related to the importance of respiratory diseases in India. Disease preferences were slightly different for API users (Figure 2B). API calls most frequently included measles (63 589), followed by influenza (38 460), and anthrax (37 205). Interestingly, API calls were almost exclusively made by academic and corporate users. It is likely that these user groups have more advanced computational capability enabling them to make API calls compared to government or personal users. Project Tycho data for anthrax are very limited, which is immediately visible through the GUI, but not before an API call is made, explaining why GUI users did not show the same interest in anthrax as API users. We also studied the data download patterns over time (Figure 3). Data for most conditions were downloaded continuously throughout the time period, but behavior was different for the API vs the GUI. API calls included all conditions much more frequently compared to the GUI, reflecting the advantage of the API vs the GUI for accessing data for multiple conditions at once, which would be much more laborious through the GUI.

Figure 2. Project Tycho v1 data downloads by condition and type of user affiliation. (A) Proportion of datasets downloaded through the graphical user interface by each type of user affiliation, per condition, and the total number of datasets downloaded per condition in gray; (B) as (A), but for API calls.

Figure 3. Project Tycho v1 datasets downloaded per condition and week. (A) Weekly number of datasets downloaded through the graphical user interface, per condition, with weekly totals in the top panel; (B) as (A), but for API calls.

We identified new knowledge generated based on Project Tycho data. We found 150 published works that cited the Project Tycho release paper, 47 published by authors from one of the 100 institutions most commonly listed as affiliation by registered Project Tycho users. Not all works that cited the primary Project Tycho paper used our data to generate new knowledge. We found 45 creative works published before December 31, 2017, including 18 peer-reviewed papers, 8 conference papers or pre-prints, 3 student theses, and 16 newspaper, blog, or website articles that used Project Tycho data and that were not produced by our own team (Figure 4, Supplementary Table S3). Based on the acknowledgments made in published papers, we found that 5 papers were funded by a foundation, 6 papers were funded by the US National Institutes of Health, 6 papers by the United States or China National Science Foundation, and 1 paper was funded by industry grants. Most of the new knowledge derived from Project Tycho data was about disease transmission patterns for respiratory pathogens (measles, pertussis, chickenpox, streptococcus) and fecal-orally transmitted pathogens (polio, hepatitis A, typhoid). Project Tycho data were also used to create new technology including statistical clustering, machine learning, data mining, disease forecasting algorithms, and visualization software.
New knowledge or technology was not only created by post-doctoral researchers, but also by pre-doctoral students, including a master’s thesis on measles vaccination in the United States and 2 doctoral theses describing new data integration methods. Furthermore, Project Tycho data were used for advocacy, eg, by news articles about the importance of vaccination programs and disease surveillance.

Figure 4. Creative works resulting from Project Tycho v1 data. (A) Peer-reviewed papers published that used Project Tycho v1 data in chronological order; (B) as (A), but for conference papers, pre-prints, and student theses; (C) as (A), but for newspaper articles, visualizations, and software.

Project Tycho v1 data have also been used by training programs. For example, the University of Pittsburgh organized an Undergraduate Data Palooza in 2012 and 2013, for which undergraduate students from around the world used Project Tycho data to study historical disease patterns. The three best submissions won a $1000 prize each. The Last Mile initiative used Project Tycho data to train inmates in the 2015 class of the San Quentin 7370 program for web development. The 2017 Computational Biology of Infectious Disease (CBID) course in Southeast Asia, organized by the French Institute of Research for Development (IRD), used Project Tycho data to train students in statistical analysis of disease surveillance data. Examples of using Project Tycho data for training programs became known to us due to our personal involvement or because colleagues mentioned it to us. More examples could exist that have remained unknown to us.
Project Tycho version 2.0

For Project Tycho v2, we updated the US data and added dengue data for 98 additional countries. We added new US NNDSS counts published in weekly MMWR surveillance tables between August 2, 2014, and December 31, 2017. We compiled dengue data from 13 different sources including the World Health Organization, and from country-specific databases from 8 countries in Southeast Asia (Supplementary Table S4), as previously described.10,11 We also improved the data format, standardization, and metadata, following FAIR guiding principles where possible. Project Tycho v2 includes 5 676 424 counts of cases or deaths due to reported disease conditions, an increase of 50% from v1. Version 2.0 data represent 99 countries, 92 conditions, and 58 pathogens (Supplementary Table S5). We could not assign a pathogen to each condition because some conditions were not caused by a pathogen, such as aseptic meningitis, and some conditions could be caused by pathogens from multiple superkingdoms, such as dysentery. We grouped data into pre-compiled datasets in which each comprises all counts for a single condition in one country. For example, we grouped all counts for measles in the United States in one dataset. We created 360 datasets. Ninety-two datasets are for disease conditions in the United States, and 268 are for up to 3 dengue-related conditions in other countries (Figure 5). Datasets for conditions in the United States can include counts reported between 1887 and 2017, while most datasets for dengue-related conditions in other countries include counts reported between 1960 and 2012 (Figure 6).

Figure 5. Project Tycho v2 data available per country. (A) Number of counts available for each country; (B) highest spatial resolution of data available for each country.

Figure 6. Project Tycho v2 data available per condition, country, and year. (A) Annual number of counts available for each condition (ranked alphabetically), with yearly totals in the top panel; (B) number of counts available for each condition and country (both ranked alphabetically), with country totals in the top panel; (C) annual number of counts available for each country (ranked alphabetically), with annual totals in the top panel.

Findability

We created a DOI for each Project Tycho v2 pre-compiled dataset. Users can also compile custom datasets through the online GUI or API, but custom datasets do not have DOIs because there are an infinite number of possible datasets that can be compiled by users. For example, Project Tycho v1 users have created over 300 000 datasets in 1 day through the API. For custom-compiled datasets, we include, for every count, the DOI of the pre-compiled dataset that also includes that count. The DOIs will enable users to cite each dataset and will enable us to more accurately detect the reuse of datasets by others.
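Because a custom-compiled dataset carries, for every count, the DOI of the pre-compiled dataset it came from, a user could recover the list of DOIs to cite with a few lines of Python. The sketch below is only an illustration: the CSV column names, the sample rows, and the DOI suffixes are invented and should not be read as the actual Project Tycho data format.

```python
import csv
import io

# Invented sample of a custom-compiled dataset. The column names and DOI
# suffixes are hypothetical; only the DOI prefix comes from the paper.
CUSTOM_CSV = """CountValue,ConditionName,CountryISO,SourceDatasetDOI
12,Measles,US,10.25337/T7/PTYCHO.V2.0/EXAMPLE-A
7,Measles,US,10.25337/T7/PTYCHO.V2.0/EXAMPLE-A
30,Dengue,TH,10.25337/T7/PTYCHO.V2.0/EXAMPLE-B
"""


def source_dois(csv_text, doi_column="SourceDatasetDOI"):
    """Return the sorted, de-duplicated DOIs referenced by a custom dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[doi_column] for row in reader})


dois_to_cite = source_dois(CUSTOM_CSV)
for doi in dois_to_cite:
    print(doi)
```

De-duplicating the per-count DOIs yields one citation per contributing pre-compiled dataset, however many counts were drawn from it.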
We also created metadata for each dataset in XML format following the DataCite schema,12 and in JSON format following the DATS schema developed by the Biomedical and Healthcare Data Discovery Index Ecosystem (bioCADDIE).13 We used metadata attributes and values from standard vocabularies and ontologies where possible. We used 82 attributes from the DataCite XML schema to describe each Project Tycho dataset (Supplementary Table S6), and 163 attributes from the DATS specification (Supplementary Table S7). The main attributes, included in both schemas, were the dataset identifier, creators, subjects (IsAbout attribute in DATS), license, associated publications, and funders. In addition, the DATS schema included information about the distribution of the dataset (CSV files) and about the Project Tycho repository containing the distributions. We used the DataCite XML metadata files to mint DOIs for our datasets through the University of California EZID service.15 Project Tycho 2.0 datasets are now indexed by the EZID catalogue at the University of California and by DataCite.16 We also registered Project Tycho with the DataMed catalogue for indexing.17 Through our participation in the Models of Infectious Disease Agent Study (MIDAS), Project Tycho 2.0 datasets have also been indexed by the MIDAS Digital Commons (MDC).18 Furthermore, the Project Tycho database is registered with FAIRSharing19 and with various university libraries. Accessibility Project Tycho pre-compiled datasets are listed on our website and can be accessed through download links on the dataset landing pages (Supplementary Figure S1A-B). The dataset landing pages list the DOI, the reported conditions and pathogens included in the dataset, the country included, and the time period covered by the dataset. Each landing page also includes a download link for metadata files. Users can also compile a custom dataset through the online GUI or API (Supplementary Figure S1C). 
For example, users may want to combine data from multiple conditions or countries in one dataset instead of downloading a pre-compiled dataset for each country-condition combination. Also, the GUI enables users to visualize data before downloading. Only authorized users can download data after authentication via their login or API web key. All users that have created a user account are authorized to download Project Tycho data. We have included the URL of the main data access page (www.tycho.pitt.edu/data) and of dataset-specific landing pages in dataset metadata files. We also encoded data access, authorization, and authentication procedures in the metadata. Interoperability We have designed a standard Project Tycho data format for pre-compiled and for custom-created datasets and registered both formats with FAIRsharing (Supplementary Table S8).19 We used 17 standard vocabularies and ontologies to represent information in Project Tycho datasets (Table 1). All Project Tycho data are represented as counts of cases or deaths related to disease conditions reported by public health surveillance. We used Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) names and codes to represent disease conditions, NCBI Taxonomy names and identifiers to represent pathogens, ISO-3166 names and codes to represent countries and first level administrative subdivisions, and names from the Geonames vocabulary to represent second-level administrative subdivisions and cities. We also used SNOMED-CT terminology to represent the reporting of cases or deaths due to diseases, and to represent diagnostic certainty. We did not use the International Classification of Diseases (ICD) vocabulary to encode disease conditions because many conditions were not specified in the ICD standard, and the SNOMED-CT ontology represented the logical relationships between conditions in more detail. 
If needed, existing maps between SNOMED and ICD vocabularies can be used to convert between the two.20 In metadata files, we also encoded information that applied to the entire dataset, including SNOMED-CT condition names and codes, NCBI Taxonomy names and identifiers for pathogens, and ISO 3166 country names and codes. In addition, we encoded information about the general type of data (disease surveillance), and the general method used to collect the data (disease notification) in the metadata. Also, we encoded the time period represented, identifiers and roles of researchers involved in creating datasets, and funding agencies involved (Supplementary Tables S6–S7).

Table 1. Standard vocabularies and ontologies used by Project Tycho data and metadata

Attribute | Vocabulary/ontology | Vocabulary/ontology identifier
Condition, case, death, diagnosis certainty | SNOMED-CT | doi: 10.25504/fairsharing.d88s6e
Country and Admin1 | ISO-3166 | https://www.iso.org/standard/63545.html
Admin2 and city | Geonames | http://www.geonames.org/
Pathogen | NCBI Taxonomy | doi: 10.25504/fairsharing.fj07xj
Funding agency | Crossref Funder Registry | https://www.crossref.org/services/funder-registry/
Fatality | SNOMED-CT | doi: 10.25504/fairsharing.d88s6e
Time intervals | Time Intervals Ontology | http://reference.data.gov.uk/def/intervals
Researchers | Open Researcher and Contributor ID (ORCID) | doi: 10.25504/fairsharing.nx58jg
Surveillance data | Apollo SV | doi: 10.25504/fairsharing.ngv2xx
Disease notification | Medical Subject Headings (MeSH) | doi: 10.25504/fairsharing.qnkw45
Contributor type, related identifier relation type | DataCite XSD | doi: 10.25504/fairsharing.me4qwe
Cumulative incidence | Epidemiology Ontology | http://www.ontobee.org/ontology/EPO
Dates | ISO 8601 | https://www.iso.org/iso-8601-date-and-time-format.html
Infectious disease incidence | Infectious Disease Ontology | doi: 10.25504/fairsharing.aae3v6
Creator and author roles | Scholarly Contributions and Roles Ontology | http://purl.org/spar/scoro
Type of standard for dataset distribution | EMBRACE Data and Methods Ontology | doi: 10.25504/fairsharing.a6r7zs
Megabytes | Unit Ontology | doi: 10.25504/fairsharing.mjnypw

Reusability

We aimed to create rich metadata for pre-compiled Project Tycho datasets by using as many attributes as possible from the DataCite and DATS schemas (Supplementary Tables S6–S7). All Project Tycho data are licensed under a Creative Commons Attribution 4.0 International license that allows any use free of charge as long as Project Tycho is attributed as the source of the data.21 We list the dataset citation, which includes the DOI, on the landing page of each dataset to enable appropriate attribution (Supplementary Figure S1B).
Users who compiled a custom dataset through the GUI or API can cite the DOI of each dataset from which counts for the custom dataset were taken. DISCUSSION In this paper, we presented the value of sharing historical epidemiological data for creating new knowledge and technology with the example of Project Tycho v1 and improvements made for Project Tycho v2. We anticipate that the improved FAIR representation of Project Tycho (meta)data will lead to a broader and more efficient reuse of disease surveillance data for research and technology development. Our aim is to grow Project Tycho into a domain repository for global population health, serving as an integrated data resource for the global health community. Creating an open-access repository for standardized information that was previously available only in PDF format required significant upfront investments of time and effort, but will be cost effective in the long run by avoiding duplication of data digitization and curation for many individual research projects. From a researcher perspective, digitizing and standardizing all data were more expensive than digitizing one specific section of interest. From a funder and community perspective, digitizing and standardizing all data in one sweep were more efficient and cost effective compared to doing the same work for small parts of the data at a time, as many operations are done only once, such as compiling all files, establishing standard operating procedures, opening, closing, and saving files, and standardizing file formats and contents. Updating Project Tycho v1 to v2 required an additional investment (50% faculty, 50% senior developer, and 100% junior developer full-time equivalent over the course of 5 months) for redesigning the data format, improving FAIR compliance, and for development of a new online user interface. 
The return on this investment will be further prevention of duplication and the acceleration of new science and discovery in global health through improved FAIR compliance of the Project Tycho data repository. We presented new knowledge based on Project Tycho data in terms of published papers, theses, software products, etc., and expect that this knowledge will have a positive impact on global health. This impact will take time and will be difficult to measure. Project Tycho v1 has enabled research in 3 ways. First, the large spatial (entire United States) and temporal (1888-2013) scope has enabled studies to expand their scope and make more robust inference.22–25 For example, a highly impactful study on the immunomodulatory effect of measles infection, published by Mina et al. in Science,22 used US data from Project Tycho in conjunction with data from Denmark and the United Kingdom, expanding the scope of this study beyond Europe. Second, the standard format of Project Tycho data has enabled the application of new analytical algorithms developed by computer scientists to biomedical data,23,26–29 bridging the disciplinary gap between these domains. Without standardized data from Project Tycho, these scientists may have used non-medical data instead, possibly depriving the biomedical domain of new innovations. Third, the user-friendly format of Project Tycho data has led to the use of biomedical data by journalists for advocacy about the value of vaccination and surveillance in widely read magazines and newspapers,30–33 such as Nature,30 the New York Times,31 the Wall Street Journal,32 and Forbes Magazine,33 leading to a better informed general public. 
In many of these instances, the research would have been possible without Project Tycho, as demonstrated by similar papers that did not use Project Tycho data,34–37 but would have required significant investments in data digitization and standardization, thus duplicating efforts without resulting in an open-access, community resource. With Project Tycho v2, we have contributed to improving FAIR compliance of data in global health, but more work remains to be done. For example, we used a set of 17 external standard vocabularies and ontologies to represent disease surveillance data, but many additional standards exist and may be applicable. It would be relevant for the global health community to develop a collection of preferred, existing standards and ontologies that can be used by most stakeholders. In addition, new standards and ontologies could be created for data attributes that cannot easily be represented with current standards. For example, it would be valuable to represent data provenance in detail, given the heterogeneity in data methods, sources, and workflows used in global health. Also, global health data are dynamic and always changing, which complicates data identifiability. Fortunately, metadata schemas have started to incorporate attributes that can represent changing data, eg, through the “relationType” attribute of the DataCite XML schema that can take on values such as “IsContinuedBy,” “Continues,” or “IsNewVersionOf” to represent relationships between datasets.12

CONCLUSION

The FAIR principles are a valuable guide for improving the integration and reuse of global health data. Adoption of these principles by Project Tycho and other data repositories in global health can accelerate the integration and reuse of data to discover new knowledge and technology for improving the lives of populations around the world.
FUNDING

This research was funded by the National Institute of General Medical Sciences Models of Infectious Disease Agent Study (MIDAS, 5U54GM088491-09), the NIH Big Data to Knowledge (BD2K) program (5K01ES026836-03 and 3K01ES026836-02S1), and the Bill and Melinda Gates Foundation (49276).

CONTRIBUTOR

WGVP conceptualized the project, conducted the analyses, and wrote the manuscript; AC conducted the analysis and contributed to writing the manuscript; DSB conceptualized the project and contributed to the analysis and writing of the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

Conflict of interest statement. The authors declare no competing interests.

ACKNOWLEDGMENTS

The authors would like to acknowledge the MIDAS Informatics Services Group (ISG, 5U24GM110707-04) for its technical support, and public health agencies in countries around the world for collecting and making available disease surveillance data.

REFERENCES

1 Fineberg HV. Pandemic preparedness and response—lessons from the H1N1 influenza of 2009. N Engl J Med 2014; 370 (14): 1335–42.
2 Wyber R, Vaillancourt S, Perry W, et al. Big data in global health: improving health in low- and middle-income countries. Bull World Health Organ 2015; 93 (3): 203–8.
3 van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health 2014; 14 (1): 2579.
4 Heymann DL, Chen L, Takemi K, et al. Global health security: the wider lessons from the west African Ebola virus disease epidemic. Lancet (Lond, Engl) 2015; 385 (9980): 1884–901.
5 Zhang Q, Sun K, Chinazzi M, et al. Spread of Zika virus in the Americas. Proc Natl Acad Sci USA 2017; 114 (22): E4334–43.
6 Press G. Cleaning big data: most time-consuming, least enjoyable data science task, survey says. Forbes Magazine. 2016. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#143240186f63. Accessed April 13, 2018.
7 van Panhuis WG, Grefenstette J, Jung SY, et al. Contagious diseases in the United States from 1888 to the present. N Engl J Med 2013; 369 (22): 2152–8.
8 Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016; 3: 160018.
9 Centers for Disease Control and Prevention. National Notifiable Diseases Surveillance System. 2018. https://data.cdc.gov/browse?category=NNDSS&tags=nndss. Accessed April 13, 2018.
10 van Panhuis WG, Choisy M, Xiong X, et al. Region-wide synchrony and traveling waves of dengue across eight countries in Southeast Asia. Proc Natl Acad Sci USA 2015; 112 (42): 13069.
11 Ruberto I, Marques E, Burke DS, Van Panhuis WG. The availability and consistency of dengue surveillance data provided online by the World Health Organization. PLoS Negl Trop Dis 2015; 9 (4): e0003511.
12 DataCite Metadata Working Group. DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.1. 2017. doi:10.5438/0014
13 Sansone S-A, Gonzalez-Beltran A, Rocca-Serra P, et al. DATS, the data tag suite to enable discoverability of datasets. Sci Data 2017; 4: 170059.
14 US Centers for Disease Control and Prevention. Notifiable Infectious Diseases and Conditions Data Tables|NNDSS. 2017. https://wwwn.cdc.gov/nndss/infectious-tables.html. Accessed April 13, 2018.
15 University of California. EZID: Identifiers Made Easy. 2018. https://ezid.cdlib.org/. Accessed April 13, 2018.
16 DataCite. Welcome to DataCite. 2018. https://www.datacite.org/. Accessed April 13, 2018.
17 biomedical and healthCAre Data Discovery Index Ecosystem. Home - DataMed | bioCADDIE Data Discovery Index. 2018. https://datamed.org/. Accessed April 13, 2018.
18 MIDAS Informatics Services Group. MIDAS Digital Commons. 2018. http://betaweb.rods.pitt.edu/digital-commons/main#_. Accessed April 13, 2018.
19 The FAIRsharing team. FAIRsharing.org: Standards, Databases, Policies. 2018. https://fairsharing.org. Accessed April 13, 2018.
20 US National Library of Medicine. SNOMED CT to ICD-10-CM Map. 2015. https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html. Accessed April 13, 2018.
21 Creative Commons. Creative Commons—Attribution 4.0 International (CC BY 4.0). 2018. https://creativecommons.org/licenses/by/4.0/. Accessed April 13, 2018.
22 Mina MJ, Metcalf CJE, de Swart RL, Osterhaus ADME, Grenfell BT. Long-term measles-induced immunomodulation increases overall childhood infectious disease mortality. Science 2015; 348 (6235): 694–9.
23 Magpantay FMG, Rohani P. Dynamics of pertussis transmission in the United States. Am J Epidemiol 2015; 181 (12): 921–31.
24 Shrestha S, Foxman B, Berus J, et al. The role of influenza in the epidemiology of pneumonia. Sci Rep 2015; 5 (1): 1–12.
25 Dalziel BD, Bjørnstad ON, van Panhuis WG, et al. Persistent chaos of measles epidemics in the prevaccination United States caused by a small change in seasonal transmission patterns. PLoS Comput Biol 2016; 12 (2): e1004655.
26 Herlands W, Wilson A, Nickisch H, et al. Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces. 2015. http://arxiv.org/abs/1511.04408. Accessed April 13, 2018.
27 Liu Z, Song HA, Zadorozhny V, Faloutsos C, Sidiropoulos NH. Fuse: efficient fusion of aggregated historical data. In: Nitesh Chawla and Wei Wang, eds. Proceedings of the 2017 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics; 2017: 786–94. Philadelphia, Pennsylvania. doi:10.1137/1.9781611974973.88
28 Ghosh S, Chakraborty P, Nsoesie EO, et al. Temporal topic modeling to assess associations between news trends and infectious disease outbreaks. Sci Rep 2017; 7: 40841.
29 Scarpino SV, Petri G. On the Predictability of Infectious Disease Outbreaks. 2017. http://arxiv.org/abs/1703.07317. Accessed April 13, 2018.
30 Scully T. The age of vaccines. Nature 2014; 507 (7490): S2–3.
31 Lohr S. The vaccination effect: 100 million cases of contagious disease prevented. New York Times. 2013. https://bits.blogs.nytimes.com/2013/11/27/the-vaccination-effect-100-million-cases-of-contagious-disease-prevented/?_r=0%0A. Accessed April 13, 2018.
32 DeBold T, Friedman D. Battling infectious diseases in the 20th century: the impact of vaccines. Wall Street Journal. 2015. http://graphics.wsj.com/infectious-diseases-and-vaccines/. Accessed April 13, 2018.
33 Bigman D. Worries beyond ebola: infographic shows what else is on America’s deadly disease watchlist. Forbes Magazine. 2014. https://www.forbes.com/sites/danbigman/2014/10/15/beyond-ebola-what-else-is-on-americas-deadly-disease-watchlist/#f7293967b7ef%0A. Accessed April 13, 2018.
34 Martinez-Bakker M, King AA, Rohani P. Unraveling the transmission ecology of polio.
PLoS Biol 2015 ; 13 ( 6 ): e1002172. Google Scholar OpenURL Placeholder Text WorldCat 35 Choisy M , Rohani P. Changing spatial epidemiology of pertussis in continental USA . Proc Biol Sci 2012 ; 279 ( 1747 ): 4574 – 81 . Google Scholar OpenURL Placeholder Text WorldCat 36 Cummings DAT , Irizarry RA, Huang NE, et al. . Travelling waves in the occurrence of dengue haemorrhagic fever in Thailand . Nature 2004 ; 427 ( 6972 ): 344 – 5 . Google Scholar OpenURL Placeholder Text WorldCat 37 Martinez-Bakker M , Bakker KM, King AA, Rohani P. Human birth seasonality: latitudinal gradient and interplay with childhood disease dynamics . Proc R Soc B Biol Sci 2014 ; 281 ( 1783 ): 20132438 . Google Scholar OpenURL Placeholder Text WorldCat © The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. © The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Effect of vocabulary mapping for conditions on phenotype cohorts
Hripcsak, George; Levine, Matthew E; Shang, Ning; Ryan, Patrick B
doi: 10.1093/jamia/ocy124
pmid: 30395248
Abstract
Objective: To study the effect on patient cohorts of mapping condition (diagnosis) codes from source billing vocabularies to a clinical vocabulary.
Materials and Methods: Nine International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9-CM) concept sets were extracted from eMERGE network phenotypes, translated to Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concept sets, and applied to patient data that were mapped from source ICD9-CM and ICD10-CM codes to SNOMED CT codes using Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership (OMOP) vocabulary mappings. The original ICD9-CM concept set and a concept set extended to ICD10-CM were used to create patient cohorts that served as gold standards.
Results: Four phenotype concept sets could be translated to SNOMED CT without ambiguity and performed perfectly with respect to the gold standards. The other 5 lost performance when 2 or more ICD9-CM or ICD10-CM codes mapped to the same SNOMED CT code. The patient cohorts had a total error (false positive and false negative) of up to 0.15% compared to querying ICD9-CM source data and up to 0.26% compared to querying ICD9-CM and ICD10-CM data. Knowledge engineering was required to produce that performance; simple automated methods to generate concept sets had errors up to 10% (one outlier at 250%).
Discussion: The translation of data from source vocabularies to SNOMED CT resulted in very small error rates that were an order of magnitude smaller than other error sources.
Conclusion: It appears possible to map diagnoses from disparate vocabularies to a single clinical vocabulary and carry out research using a single set of definitions, thus improving efficiency and transportability of research.
Keywords: vocabulary, terminology mapping, observational research, phenotyping

INTRODUCTION
Much observational research relies on structured data such as diagnoses, medications, procedures, and laboratory tests. Each area draws its structured codes from some combination of disparate vocabularies and local coding schemes. Diagnoses are among the most used data types in phenotype definitions in observational research, and in the United States, they include International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9-CM) [1] billing codes for data before October 2015, ICD10-CM [2] billing codes for data after that, Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) [3] codes for some problem lists and natural language processing of narrative clinical notes, MedDRA [4] for drug side effects, and local codes for some problem lists and narrative notes. International databases show more diversity, also including, eg, ICD10 (non-CM) codes and Read Codes [5].

While it is possible to define phenotypes made of sets of concepts defined separately from each of the above vocabularies, the process is difficult because of the number of vocabularies and because query authors do not generally have access to databases with all of the vocabularies to train or test on; the result is also hard to maintain. Limiting the number of vocabularies in the phenotype definition limits the generalizability of the phenotype. For example, with only 5% of the world population, the United States can study hypotheses only on more prevalent diagnoses, treatments, and effects. Focusing only on ICD codes, as most U.S. phenotyping activities do, relies on coarse diagnostic codes and suffers from the limitations of ICD’s hierarchical organization.
The Observational Health Data Sciences and Informatics (OHDSI) [6,7] initiative has produced and maintains mappings from 80 source vocabularies to a smaller set of “standard” vocabularies that are usually queried. SNOMED CT is OHDSI’s standard vocabulary for diagnoses, which are called “conditions” in the OHDSI data model. SNOMED CT was chosen for its international reach, its clinical as opposed to billing focus, its fine granularity, its extensive hierarchy, and its increasing use in clinical data entry methods such as natural language processing and problem lists. Source data are mapped to standard vocabularies, and both the mapped and source data are stored in OHDSI’s data model, commonly referred to as the Observational Medical Outcomes Partnership (OMOP) Common Data Model [8], named after OHDSI’s predecessor, OMOP.

For ICD9-CM and ICD10-CM conditions, OHDSI’s primary source of mappings is a combination of the National Library of Medicine’s Unified Medical Language System Metathesaurus Mapping Project [9] and a mapping from the United Kingdom National Health Service Terminology Service. OHDSI contracts a knowledge engineering vendor (Odysseus Data Services, Cambridge, MA) to import these mappings, expand and correct them as needed, and, as appropriate, suggest additions and corrections back. Mappings for other vocabularies may have other sources or may be generated by the vendor. All mappings are freely available (OHDSI.org).

Typical mappings are ICD9-CM 3-digit non-billing code 410 “Acute myocardial infarction” to SNOMED CT 57054005 “Acute myocardial infarction,” and ICD9-CM 5-digit billing code 410.00 “Acute myocardial infarction of anterolateral wall, episode of care unspecified” to SNOMED CT 70211005 “Acute myocardial infarction of anterolateral wall.” The first two terms, 410 and 57054005, are both ancestors of the second two terms, 410.00 and 70211005, in the respective hierarchies.
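As a rough illustration of the mapping step, the following Python sketch treats a source-to-standard mapping as a lookup table and flags standard codes that are reached from more than one source code. The two myocardial infarction rows are the examples quoted above; the `SRC-A`, `SRC-B`, and `STD-1` entries, the dictionary layout, and the function names are hypothetical and do not reflect the actual OHDSI vocabulary tables.

```python
from collections import defaultdict

# Toy source-to-standard mapping keyed by (vocabulary, code).
SOURCE_TO_STANDARD = {
    ("ICD9CM", "410"): "57054005",     # Acute myocardial infarction
    ("ICD9CM", "410.00"): "70211005",  # Acute MI of anterolateral wall
    # Hypothetical placeholder codes: two sources sharing one standard code.
    ("ICD9CM", "SRC-A"): "STD-1",
    ("ICD10CM", "SRC-B"): "STD-1",
}


def map_codes(source_codes):
    """Split source codes into mapped standard codes and unmapped leftovers."""
    mapped = [SOURCE_TO_STANDARD[c] for c in source_codes if c in SOURCE_TO_STANDARD]
    unmapped = [c for c in source_codes if c not in SOURCE_TO_STANDARD]
    return mapped, unmapped


def find_collisions(mapping):
    """Return standard codes that two or more source codes map to."""
    by_target = defaultdict(list)
    for source, target in mapping.items():
        by_target[target].append(source)
    return {t: s for t, s in by_target.items() if len(s) > 1}


if __name__ == "__main__":
    print(map_codes([("ICD9CM", "410.00"), ("ICD9CM", "999")]))
    print(find_collisions(SOURCE_TO_STANDARD))
```

Collisions of this kind matter because, once two source codes share a standard code, a query on the mapped data can no longer select one without the other.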
There is concern about information loss any time there is a mapping: does the new coding scheme retain distinctions that were apparent in the original? Previous work by Reich et al. [10] showed that while there are vocabulary differences in mapping from ICD9-CM to SNOMED CT, and while those differences cause differences in cohorts, the studies that use the mappings showed minimal differences from the original studies. In this study, we extend the analysis to ICD10-CM source data in addition to ICD9-CM, looking at the accuracy of code mappings and at the effect on patient cohorts that are generated by the mappings. We hypothesize that differences between vocabularies will create imperfect code mappings, but that the actual effect on patient cohorts will be minor, possibly because more frequently used codes tend to be better matched between vocabularies and because of redundancy in the concepts that define a cohort (several related codes may be included in the definition) and in patient records (one patient may have several related codes).

METHODS
In this study, we create patient cohorts by selecting all patients whose structured patient record contains at least one code that is included in a list of concepts, referred to in this paper as a “concept set.” Our goal was to assess the effect of mapping patient data from a source vocabulary, which was ICD9-CM and ICD10-CM in this case, to an OHDSI standard vocabulary, which was SNOMED CT for conditions (diagnoses). We used the OHDSI mappings for the conversion. Once the patient data are mapped to a different vocabulary, then any concept sets used to query those data must also be mapped.
For example, if a concept set includes ICD9-CM 410.00 “Acute myocardial infarction of anterolateral wall, episode of care unspecified” to query the ICD9-CM data, then after the data are mapped, a new concept set should include SNOMED CT 70211005 “Acute myocardial infarction of anterolateral wall.” As will be seen below, mapping codes in concept sets is different from mapping codes in patient records, and several approaches are possible. A secondary goal was therefore to assess the performance of different approaches to mapping concept sets. We assessed both the effect on the codes in the concept sets and the effect on patient cohorts generated by applying those concept sets to our clinical database. See Figure 1 for an overview of the study.

Figure 1. Design of the vocabulary study. The OHDSI (OMOP) database comprises the source data in ICD9-CM and ICD10-CM (bottom left) and the mapped data in SNOMED CT (bottom right). The gold standard concept sets include the original ICD9-CM concept set, run only on the ICD9-CM codes in the source data, and the extension of that concept set to ICD10-CM (and SNOMED CT but not used here) based on the current authors’ interpretation of the original authors’ intent [11–17]. New SNOMED CT concept sets are generated from the original concept set both using knowledge engineering and via automatic translation. The generated concept sets and the gold standard concept sets are run against their respective data sets, and the resulting patient cohorts are compared for false positives (FP) or false negatives (FN) with the original concept set serving as the gold standard for Table 3 and the extended concept set that is based on the original authors’ intent serving as the gold standard in Table 4.

Source of phenotypes
We used 9 phenotypes [11–17] from the eMERGE [18] initiative (Table 1).
This initiative was chosen because the phenotype definitions were validated, because the phenotypes were explained in each case, thus allowing us to assess intent, and because the sets were made available on the Internet. The phenotypes were chosen based on having a predominant concept set (as opposed to, say, relying primarily on laboratory values). Our study addressed only the concept sets, not the logic that surrounds them, because our goal was specifically to study the mappings. Such logic may, for example, require multiple diagnosis concept sets, impose temporal constraints, combine diagnosis evidence with evidence from medications and other areas, or exploit narrative data.

Table 1. Original ICD9-CM definition of concept sets used in phenotypes

| Algorithm | Original ICD9-CM concept set‡ |
|---|---|
| Heart failure (HF) [11] | 428.* |
| Heart failure as exclusion diagnosis (HF2) [11] | 428.* |
| Type-1 diabetes mellitus (T1DM) [12] | 250.x1, 250.x3 |
| Type-2 diabetes mellitus (T2DM) [12] | 250.x0, 250.x2 |
| Appendicitis (Appy) [13] | 540.* |
| Attention deficit hyperactivity disorder (ADHD) [14] | 314, 314.0, 314.01, 314.1, 314.2, 314.8, 314.9 |
| Cataract (Catar) [15] | 366.10, 366.12, 366.13, 366.14, 366.15, 366.16, 366.17, 366.18, 366.19, 366.21, 366.30, 366.41, 366.45, 366.8, 366.9 |
| Crohn’s disease (Crohn) [16] | 555, 555.0, 555.1, 555.2, 555.9 |
| Rheumatoid arthritis (RA) [17]† | 714, 714.0, 714.1, 714.2 |

‡ Within a code list, “*” means one or more digits or a period; “x” means one digit.
† Only rheumatoid arthritis also had ICD10-CM codes in its original definition, namely, M05* and M06*, and these were used in the second gold standard.

Concept sets
We used several versions of concept sets to query the data (Table 2). The original ICD9-CM concept set served as a baseline, and it was run on the unmapped ICD9-CM patient data. The rest of the concept sets comprised SNOMED CT codes and were run on the mapped data. A hand-engineered SNOMED CT concept set was intended to mimic the behavior of the original ICD9-CM concept set. A second hand-engineered concept set was optimized to extend the original query author’s intent to ICD10-CM codes (and SNOMED CT codes, but we had no such data for testing). These two concept sets might not be identical because adding a SNOMED CT concept that pulls in a needed ICD10-CM code might pull in unwanted ICD9-CM codes or because a SNOMED CT concept that is needed to pull in an ICD9-CM code might also pull in an unwanted ICD10-CM code that has many more patients.

Table 2. Methods to generate concept sets from ICD9-CM concept set

| Method | Description |
|---|---|
| Original (no mapping) | Original concept set. |
| ICD9 set | Original ICD9-CM concept set generated by the phenotype author. This set is always run against the patients’ original ICD9-CM terms to show what would have happened before either data or concept sets were mapped. |
| Knowledge engineered (automatically map data; manually translate concept sets) | These SNOMED CT concept sets were created by hand. They are run against data in the form of SNOMED CT terms that were generated by mapping data from ICD9-CM and ICD10-CM to SNOMED CT using the OHDSI vocabulary mappings. |
| SNOMED mimic | SNOMED CT concept set designed to mimic the original ICD9-CM concept set as much as possible, ignoring data from other vocabularies. |
| SNOMED optimize | SNOMED CT concept set designed to carry out the phenotype author’s intent across ICD9-CM, ICD10-CM, and SNOMED CT. |
| Automatically generated (automatically map data and concept sets) | These SNOMED CT concept sets were generated automatically from the original ICD9-CM set using OHDSI vocabulary mappings. Like the knowledge-engineered sets, they are run against data in the form of SNOMED CT terms that were generated by mapping data from ICD9-CM and ICD10-CM to SNOMED CT using the OHDSI vocabulary mappings. |
| SNOMED no desc | SNOMED CT concept set generated by using OHDSI vocabulary mappings to map from ICD9-CM terms to SNOMED CT, not using the SNOMED hierarchy. |
| SNOMED all desc | Like “SNOMED no desc,” but includes all terms in the SNOMED CT hierarchy that are descendants of the mapped terms. |
| SNOMED desc x child | Like “SNOMED no desc,” but includes descendants for mapped terms only if none of the term’s children is also in the concept set. Can be seen as limited descendants. |
| SNOMED desc x desc | Like “SNOMED no desc,” but includes descendants for mapped terms only if none of the term’s descendants is also in the concept set. Can be seen as more limited descendants. |

Several concept sets were generated automatically by applying OHDSI data mappings to the ICD9-CM codes in the original concept sets to generate SNOMED CT concept sets.
The first version, “no descendants,” takes only the SNOMED CT concepts that are directly mapped from ICD9-CM codes in the original concept set. This works for the ICD9-CM patient data, but can miss much of the ICD10-CM data because ICD10-CM has greater granularity than ICD9-CM. If one includes descendants of the SNOMED CT concepts, one can often pull in these needed ICD10-CM codes. Therefore a second choice is to include all descendants of the SNOMED CT concepts. Because that often also pulls in many unwanted ICD9-CM and ICD10-CM codes, we also studied a pair of intermediate algorithms. They pull in the descendants of a SNOMED CT concept only if none of that concept’s children or descendants (depending on which algorithm) is also in the list. The intuition in these algorithms is that if a query author is selecting some children or descendants and not others, then there may be a reason that those other children or descendants are excluded.

We then assessed concept sets by determining whether the SNOMED CT codes included in a given set would retrieve data that were mapped from the desired ICD9-CM codes and implied ICD10-CM codes (see “Gold standards,” below). The concept sets are available in the Supplementary Materials and online (https://github.com/mattlevine22/emerge2ohdsi_information_loss_CURATED.git).

Patient cohorts
We applied the concept sets to the New York Presbyterian/Columbia University Irving Medical Center patient database. The database has over 5 million patients. It has ICD9-CM codes for about 30 years and ICD10-CM codes since October 2015. While our concept sets would also retrieve data originally stored as SNOMED CT codes, our OHDSI database did not actually include SNOMED CT codes as source concepts (we use them for natural language processing, and that has not been pulled into the database yet). We defined cohorts of patients as those who had at least one of the codes from the concept set ever in their records.
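The four automatic translation strategies and the at-least-one-code cohort rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the single-letter concept codes, the toy `CHILDREN` hierarchy, the patient records, and the function names are all hypothetical, and the hierarchy is assumed to be given as a parent-to-children dictionary.

```python
# Hypothetical toy hierarchy: parent concept -> list of child concepts.
CHILDREN = {
    "A": ["B", "C"], "B": ["D"], "C": [], "D": [],
    "E": ["F"], "F": [],
    "G": ["H", "I"], "H": [], "I": [],
}


def descendants(concept, children):
    """All concepts strictly below `concept` in the hierarchy."""
    found, stack = set(), list(children.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def translate_concept_set(mapped, children, strategy):
    """Expand directly mapped concepts according to one of four strategies."""
    mapped = set(mapped)
    result = set(mapped)
    for concept in mapped:
        desc = descendants(concept, children)
        if strategy == "no_desc":
            continue  # keep only the directly mapped concepts
        elif strategy == "all_desc":
            result |= desc
        elif strategy == "desc_x_child":
            # expand only if no child was itself selected
            if not mapped & set(children.get(concept, [])):
                result |= desc
        elif strategy == "desc_x_desc":
            # expand only if no descendant at any depth was itself selected
            if not mapped & desc:
                result |= desc
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return result


def cohort(records, concept_set):
    """Patients with at least one code from the concept set in their record."""
    return {pid for pid, codes in records.items() if concept_set & set(codes)}


MAPPED = {"A", "D", "E", "G", "H"}  # hypothetical directly mapped concepts
RECORDS = {"p1": ["B"], "p2": ["D"], "p3": ["I"], "p4": ["Z"], "p5": ["F"]}

if __name__ == "__main__":
    for name in ("no_desc", "all_desc", "desc_x_child", "desc_x_desc"):
        concept_set = translate_concept_set(MAPPED, CHILDREN, name)
        print(name, sorted(concept_set), sorted(cohort(RECORDS, concept_set)))
```

In this toy example the four strategies yield progressively different concept sets: expansion of `A` is blocked under "desc x desc" because its grandchild `D` was selected, while expansion of `G` is blocked under both intermediate strategies because its child `H` was selected.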
This OHDSI study was approved by our institutional review board.

Gold standards
We used 2 gold standards to assess patient cohorts. The first was the simple application of the original ICD9-CM concept sets to the original ICD9-CM patient data. This gold standard reflects the basic fidelity of the translated query. The second gold standard was created by the authors to extend the query beyond ICD9-CM codes to ICD10-CM and SNOMED CT codes based on the original intent described by the phenotype definition authors. It was created by casting a broad net using mappings and search terms on the source vocabularies, enumerating every code in the hierarchies under those terms, and, code by code in all 3 vocabularies, deciding whether it matched the phenotype authors’ intent. In 2 cases, heart failure as an exclusion diagnosis and cataract, ICD9-CM codes were also added because it appeared that the query authors had missed the codes. In general, we assumed the query authors were correct unless there were other included codes that made it clear that new codes were intended and that the new codes would improve the phenotype. These gold standard concept sets were applied to the source codes (not the mapped data) in the database to create gold standard patient cohorts. For the evaluation, we counted the number of patients inappropriately included in (false positive or FP) or missing from (false negative or FN) the patient cohort generated by the new SNOMED CT concept sets compared to the patient cohorts generated from the gold standard.

RESULTS
Patient cohorts
Here we show how mapping source data affected patient cohorts and, further, how different approaches to mapping concept sets affected that performance (examples will be provided in the diagnostics section). Table 3 shows how the different concept set mapping methods performed for the specific task of mimicking what the original ICD9-CM concept sets returned for only ICD9-CM patients.
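The FP and FN counts described in the evaluation amount to two set differences between a generated cohort and the gold standard cohort. The sketch below is a minimal illustration in Python; the function names and the toy patient identifiers are hypothetical, and the error rate is taken as (FP + FN) divided by the number of gold standard cases, which is how the Results define it. The two numeric checks reuse counts reported in Table 3.

```python
def compare_cohorts(generated, gold):
    """Count false positives and false negatives relative to the gold cohort."""
    fp = len(generated - gold)  # patients inappropriately included
    fn = len(gold - generated)  # patients missing from the generated cohort
    return fp, fn


def error_rate(fp, fn, n_gold):
    """Total error as a fraction of the gold standard cohort size."""
    return (fp + fn) / n_gold


# Hypothetical toy cohorts of patient identifiers.
gold = {"p1", "p2", "p3", "p4"}
generated = {"p2", "p3", "p4", "p9"}

if __name__ == "__main__":
    fp, fn = compare_cohorts(generated, gold)
    print(fp, fn, error_rate(fp, fn, len(gold)))

    # Counts taken from Table 3: ADHD with "SNOMED no desc" (FP 1362, FN 0,
    # 14 399 cases) and RA with "SNOMED all desc" (FP 25 103, FN 0, 9655 cases).
    print(f"{error_rate(1362, 0, 14399):.2%}")
    print(f"{error_rate(25103, 0, 9655):.0%}")
```

Because the denominator is the gold standard cohort size rather than the whole database, a concept set that pulls in many extra patients can exceed 100% error, as in the rheumatoid arthritis case.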
By definition, the “ICD9 set” was perfect, as it was the gold standard and was run on the unmapped data. The knowledge-engineered concept sets performed well, with the query intended to mimic the ICD9-CM concept set, “SNOMED mimic,” having a maximum error rate, defined as the number of FP and FN divided by the total true cases, of less than 0.15%.

Table 3. Performance on ICD9-CM source data mapped to SNOMED CT (FP false positive, FN false negative)
Column groups: Original (ICD9 set); Knowledge engineered (SNOMED mimic, SNOMED optimize); Automated concept set creation (SNOMED no desc, SNOMED all desc, SNOMED desc x child, SNOMED desc x desc).

| Pheno | #Cases | ICD9 set^a FP | ICD9 set^a FN | SNOMED mimic FP | SNOMED mimic FN | SNOMED optimize FP | SNOMED optimize FN | SNOMED no desc FP | SNOMED no desc FN | SNOMED all desc FP | SNOMED all desc FN | SNOMED desc x child FP | SNOMED desc x child FN | SNOMED desc x desc FP | SNOMED desc x desc FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HF | 75 312 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1262 | 0 | 1054 | 0 | 1054 | 0 |
| HF2 | 75 312 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1262 | 0 | 1054 | 0 | 1054 | 0 |
| T1DM | 27 861 | 0 | 0 | 0 | 23 | 0 | 23 | 108 | 0 | 943 | 0 | 943 | 0 | 108 | 0 |
| T2DM | 125 342 | 0 | 0 | 3 | 30 | 3 | 30 | 34 | 0 | 1318 | 0 | 104 | 0 | 34 | 0 |
| Appy | 9887 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ADHD | 14 399 | 0 | 0 | 0 | 19 | 0 | 19 | 1362 | 0 | 1362 | 0 | 1362 | 0 | 1362 | 0 |
| Catar | 50 879 | 0 | 0 | 50 | 0 | 74 | 0 | 50 | 0 | 2491 | 0 | 80 | 0 | 80 | 0 |
| Crohn | 4679 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| RA | 9655 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25 103 | 0 | 0 | 0 | 0 | 0 |

a This column is used as the gold standard and is run on unmapped source data and therefore must have perfect performance.

The concept sets that were created automatically from the original ICD9-CM concept sets varied in performance. All of the errors were FP because the algorithms always included all the SNOMED CT codes that the ICD9-CM codes mapped to.
The most restrictive, “SNOMED no desc,” resulted in the fewest FP: FP were less than half a percent for all but 1 phenotype, attention deficit hyperactivity disorder, which reached almost 10% (2 orders of magnitude greater than the knowledge-engineered query). The least restrictive, “SNOMED all desc,” had an error rate of over 250% on rheumatoid arthritis. Table 4 shows performance on ICD9-CM and ICD10-CM data, based on the current authors’ interpretation of the original authors’ intents. The original ICD9-CM query missed the ICD10-CM codes, resulting in errors up to 2.2%.

Table 4. Performance on ICD9-CM and ICD10-CM source data mapped to SNOMED CT (FP, false positive; FN, false negative)

Column groups: Original (ICD9 set); Knowledge engineered (SNOMED mimic, SNOMED optimize); Automated concept set creation (SNOMED no desc, SNOMED all desc, SNOMED desc x child, SNOMED desc x desc).

| Pheno | #Cases | ICD9 set FP | ICD9 set FN | SNOMED mimic FP | SNOMED mimic FN | SNOMED optimize FP | SNOMED optimize FN | SNOMED no desc FP | SNOMED no desc FN | SNOMED all desc FP | SNOMED all desc FN | SNOMED desc x child FP | SNOMED desc x child FN | SNOMED desc x desc FP | SNOMED desc x desc FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HF | 75 626 | 0 | 314 | 0 | 0 | 0 | 0 | 0 | 0 | 1332 | 0 | 1116 | 0 | 1116 | 0 |
| HF2 | 75 958 | 0 | 1646 | 0 | 1332 | 0 | 0 | 0 | 1332 | 0 | 0 | 0 | 216 | 0 | 216 |
| T1DM | 27 935 | 0 | 74 | 0 | 23 | 0 | 23 | 108 | 67 | 943 | 0 | 943 | 67 | 108 | 67 |
| T2DM | 126 828 | 0 | 1486 | 3 | 1412 | 3 | 30 | 34 | 1486 | 1317 | 0 | 104 | 1382 | 34 | 1382 |
| Appy | 9920 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 8 | 0 | 8 |
| ADHD | 14 547 | 0 | 148 | 0 | 39 | 0 | 19 | 1359 | 19 | 1359 | 0 | 1359 | 19 | 1359 | 19 |
| Catar | 50 953 | 0 | 194 | 39 | 26 | 39 | 2 | 39 | 26 | 2451 | 0 | 51 | 8 | 51 | 8 |
| Crohn | 4679 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| RA | 9793 | 0 | 138 | 0 | 25 | 0 | 25 | 0 | 25 | 25 151 | 0 | 0 | 25 | 0 | 25 |

The optimized knowledge-engineered query performed well: its 2 highest error rates were 0.26% (rheumatoid arthritis) and 0.13% (attention deficit hyperactivity disorder), and the rest were below 0.1%. The automated queries had error rates of up to 10%, apart from the one outlier at over 250%.

Code mapping diagnostics

The following analysis is based on the “SNOMED optimize” concept set, which represents the current authors’ best effort at generating a concept set.

Error-free translations

Some phenotype concept sets, such as appendicitis, could be translated without inappropriately including or losing codes and therefore without FP or FN patients. The optimized query used a single code, SNOMED CT 85189001 “Acute appendicitis,” and all its descendants. Similarly, Crohn’s disease was encoded without FP or FN using 2 codes, SNOMED CT 34000006 “Crohn’s disease” and 1085911000119103 “Complication due to Crohn’s disease,” and all their descendants. Heart failure as an exclusion diagnosis was straightforward without FP or FN, using 3 codes (SNOMED CT 84114007 “Heart failure,” 371037005 “Systolic dysfunction,” and 3545003 “Diastolic dysfunction”) and all their descendants. Heart failure as an inclusion diagnosis, which emphasizes specificity over sensitivity, was less straightforward, including 29 SNOMED CT terms, some with and some without descendants, but could be mapped to the intended terms without FP or FN. (This latter phenotype achieved higher specificity despite more terms than heart failure as an exclusion diagnosis because each of the terms was more specific.)
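The error rates quoted for Tables 3 and 4 follow from the definition given with Table 3: the number of FP and FN divided by the total true cases. As a minimal sketch, assuming each cohort is held as a set of patient identifiers (the function names and the toy identifiers are illustrative and not taken from the study), the calculation could look like this. The last two calls plug in the “SNOMED optimize” counts from Table 4 for rheumatoid arthritis and attention deficit hyperactivity disorder.

```python
def cohort_errors(gold_cohort, candidate_cohort):
    """Count false positives and false negatives of a candidate cohort.

    Both arguments are iterables of patient identifiers; the gold cohort
    is treated as the set of true cases.
    """
    gold = set(gold_cohort)
    candidate = set(candidate_cohort)
    false_positives = len(candidate - gold)  # included but not in gold
    false_negatives = len(gold - candidate)  # in gold but missing
    return false_positives, false_negatives


def error_rate(false_positives, false_negatives, true_cases):
    """Error rate as defined in the text: (FP + FN) / total true cases."""
    return (false_positives + false_negatives) / true_cases


# Toy cohorts (hypothetical patient identifiers).
gold = {1, 2, 3, 4, 5}
candidate = {2, 3, 4, 5, 6}
fp, fn = cohort_errors(gold, candidate)

# Counts read from Table 4, "SNOMED optimize" columns.
ra_rate = error_rate(0, 25, 9793)      # rheumatoid arthritis
adhd_rate = error_rate(0, 19, 14547)   # attention deficit hyperactivity disorder

print(fp, fn)
print(f"{ra_rate:.2%}")
print(f"{adhd_rate:.2%}")
```

Because the denominator is the number of true cases rather than the cohort size, a concept set that pulls in many extra patients can exceed 100%, which is how the “SNOMED all desc” outlier arises.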
Multiple source codes (ICD) to one standard code (SNOMED CT)

The primary difficulty we found was ambiguity when 2 or more ICD9-CM or ICD10-CM source codes mapped to the same SNOMED CT standard code. The difficulty arises when a phenotype concept set includes 1 of the source codes but excludes another. There is then no way in the mapped data to get them all right; some must be erroneously included or excluded. Take attention deficit hyperactivity disorder as an example. The original definition includes ICD9-CM 314.0 “Attention deficit disorder of childhood” but excludes its child, 314.00 “Attention deficit disorder without mention of hyperactivity.” Both of these terms map to SNOMED CT 192127007 “Child attention deficit disorder.” Because 314.0 is not a reimbursable code and should have fewer cases, it was deemed more important to exclude 314.00, which is a reimbursable code; this also meant excluding SNOMED CT code 192127007. Type-1 diabetes mellitus had 8 standard codes with multiple source codes, 1 of which caused consequential ambiguities (“consequential” here means it caused FP or FN in the patient cohorts): SNOMED CT 420662003 “Coma associated with diabetes mellitus” was mapped from ICD9-CM 250.30 “Diabetes with other coma, type II or unspecified type, not stated as uncontrolled,” 250.31 “Diabetes with other coma, type I [juvenile type], not stated as uncontrolled,” and 2 others (thus this set of type 1 and type 2 patients could not be separated after mapping). Type-2 diabetes mellitus had 8 standard codes with multiple source codes, and 2 of these caused consequential ambiguities. Cataract also had several ambiguities: ICD9-CM 366 “Cataract” was included because it was mapped to SNOMED CT 193570009 “Cataract,” which was needed to pull in other source codes; 1 other extra code was included, and 7 codes were excluded because of ambiguities. Rheumatoid arthritis was similarly affected: ICD9-CM 714 “Rheumatoid arthritis and other inflammatory polyarthropathies” was included but pulled in some other inappropriate codes (they turned out to have no consequence), and ICD10-CM M06.4 “Inflammatory polyarthropathy” was not included because it pulled in too many inappropriate codes (it did turn out to cause some FN). Based on the definitions, we do not believe either one should have been in the original concept set, but the original authors included them, and we acquiesced. A number of codes related to specific joints were ambiguous: ICD10-CM M08.011 “Unspecified juvenile rheumatoid arthritis, right shoulder” had to be inappropriately included because it mapped to a more general SNOMED CT term for the joint, such as SNOMED CT 201766009 “Rheumatoid arthritis of shoulder”; it was not consequential.

One source code (ICD) to multiple standard codes (SNOMED CT)

In some cases, 1 source code mapped to multiple standard codes. This usually occurs because the source code is a compound concept that exists only as separate codes in the standard vocabulary. This is generally easily addressed in the mapped concept set by including a conjunction of both terms. In type-1 diabetes mellitus, ICD9-CM 250.03 “Diabetes mellitus without mention of complication, type I [juvenile type], uncontrolled” mapped to both SNOMED CT 46635009 “Type 1 diabetes mellitus” and 444073006 “Type 1 diabetes mellitus uncontrolled,” which was likely just an oversight in the mapping process. No conjunction was necessary because the first subsumes the second. Type-2 diabetes mellitus had a similar circumstance with ICD9-CM 250.02 “Diabetes mellitus without mention of complication, type II or unspecified type, uncontrolled,” which also had no consequence.

Missing OMOP codes

In some cases, source codes had no corresponding OHDSI code and therefore could not be mapped to a standard code. This generally reflected a lag between the creation of new ICD10-CM codes and their incorporation into OHDSI. Type-1 diabetes mellitus had 57 such codes at the time of the evaluation, such as ICD10-CM E10.3211 “Type 1 diabetes mellitus with mild nonproliferative diabetic retinopathy with macular edema, right eye.” We verified that the codes were new enough that they had not yet been used in patient care in our institution. Type-2 diabetes mellitus also had 57 missing codes, also with no consequence.

Information gain

We found that the mapping process also produced some benefits. For example, in heart failure as an exclusion diagnosis, the original ICD9-CM query eliminated heart failure only under ICD9-CM 428 “Heart failure.” The SNOMED CT hierarchy pulled in relevant exclusion diagnoses that were not under 428, but under other codes such as 398, 402, 404, and 415. Because the goal was for the phenotype to be specific, these codes were deemed to be relevant to an exclusion diagnosis and included in the intent gold standard.

DISCUSSION

Main finding: source data mapping produces minimal error

We found that the vocabulary mapping process produced little error in creating cohorts. For 4 of 9 phenotypes, the concept set mapping was straightforward without ambiguity. For the other 5, some number of ambiguities arose, although the number was always small compared to the number of concepts involved. For the patient cohorts, the differences were very small at a few patients per thousand (0.26%) or less. Compare that rate to the rate of erroneous diagnosis codes at 14% (reference 19) or the rate of entering notes on the wrong patient at 0.5% (reference 20). The consequential ambiguities were always in the form of 2 or more source codes (ICD9-CM or ICD10-CM) mapping to 1 SNOMED CT code. Most of those ambiguities involved ICD9-CM codes, so we expect that, over time, as more billing data are encoded in ICD10-CM, the error rates will drop further.
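The consequential ambiguities described here arise when a concept set includes one source code but excludes another that maps to the same standard code. A minimal sketch of how such collisions might be flagged before translating a concept set is shown below. The mapping dictionary is a hand-typed illustration built from the attention deficit hyperactivity disorder example in the diagnostics section, not an extract of the OHDSI vocabulary tables; the function name is hypothetical, and the target listed for 314.01 is a placeholder because the text does not give it.

```python
from collections import defaultdict


def find_ambiguous_targets(source_to_standard, included_sources):
    """Flag standard codes reached by both included and excluded source codes.

    source_to_standard: dict of source code -> list of standard codes
    included_sources:   set of source codes in the phenotype concept set
    Returns a dict: standard code -> {"included": [...], "excluded": [...]}.
    """
    # Invert the mapping so each standard code lists its source codes.
    sources_by_target = defaultdict(set)
    for source, targets in source_to_standard.items():
        for target in targets:
            sources_by_target[target].add(source)

    ambiguous = {}
    for target, sources in sources_by_target.items():
        inside = sources & included_sources
        outside = sources - included_sources
        if inside and outside:  # the standard code cannot satisfy both
            ambiguous[target] = {
                "included": sorted(inside),
                "excluded": sorted(outside),
            }
    return ambiguous


# Illustrative ICD9-CM -> SNOMED CT mappings (from the ADHD example).
mapping = {
    "314.0": ["192127007"],
    "314.00": ["192127007"],
    "314.01": ["PLACEHOLDER-TARGET"],  # hypothetical; not given in the text
}
included = {"314.0", "314.01"}  # 314.00 is excluded by the definition

conflicts = find_ambiguous_targets(mapping, included)
print(conflicts)
```

A check like this only locates the collisions; deciding whether to keep or drop the shared standard code (as was done for 192127007) remains a knowledge-engineering judgment.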
Concept set mapping

The OHDSI vocabulary mappings were designed to map data from source vocabularies to their standard vocabularies. They were not designed to map concept sets. We found that none of our automated algorithms to map concept sets performed well, with error rates up to 10% (with 1 larger outlier). The alternative is manually translating the concept sets using knowledge engineering. In our study, the process of creating the gold standard was merged with the process of translating the concept set, and we estimate times from 1 hour to 2 days to create an optimized concept set depending on complexity. While we found that knowledge engineering was necessary to optimize the query, we also found that the process could improve the concept sets. ICD9-CM has a strict hierarchy, so that a term can have only 1 parent; for example, an infection of an anatomical structure must be stored with infections or with the relevant structure but not with both. SNOMED CT is a multiple hierarchy and can place concepts under several parents. We found, especially for heart failure as an exclusion diagnosis, that the SNOMED CT hierarchy could identify codes that would have been missed by simply looking at the ICD9-CM hierarchy.

Use of SNOMED CT

Our study demonstrates the feasibility of using SNOMED CT as the basis of concept sets for accurate phenotype definitions. The use of SNOMED CT brings several benefits. International studies can use a single coding scheme for conditions and distribute studies broadly. Attempting to write every phenotype definition to accommodate ICD9-CM, ICD10-CM, ICD10, SNOMED CT, Read codes, MedDRA, etc., separately is not feasible, especially because no phenotype author will have access to patient databases with all the diagnosis codes to test the accuracy of the phenotype. As electronic health records advance, enabling clinicians to enter clinically relevant data more easily, we should see greater availability of problem lists and clinical documentation that are encoded in SNOMED CT either directly or through natural language processing. We believe that shifting the entire nation toward clinically oriented vocabularies may do more to improve research than carefully querying poorly coded billing data.

Related work

Our work corroborates previous studies about vocabulary mapping. In the study closest to ours, Reich et al.10 looked at ICD9-CM, SNOMED CT, and MedDRA diagnosis concept sets in OMOP, and they found the same kinds of ambiguities in mapped concept sets due to many-to-one source-to-standard mappings. They then carried out drug-outcome studies using those concept sets, pre- and post-mapping, and they found that the ambiguities caused minimal changes in the study results. Our study uses independently validated concept set definitions, uses newer versions of the OHDSI mappings, bridges ICD9-CM and ICD10-CM, and includes an optimized knowledge-engineered version of the mapped concept sets that is intended to minimize the differences before and after mapping. In related work, Defalco et al.21 showed incomplete OMOP mappings among 3 drug classification schemes but did not look at the consequences of those differences. A number of studies look at coverage. For example, Cartagena et al.22 measured the incomplete overlap between SNOMED CT problem lists and ICD10-CM codes using National Library of Medicine SNOMED-to-ICD mappings in the Unified Medical Language System,23 on which the OHDSI mappings are largely based. Fung et al.24 look at ways to automate mappings from ICD9-CM to ICD10-CM exploiting the Centers for Medicare and Medicaid Services General Equivalent Maps (GEMs). With respect to the potential switch from billing to more clinical vocabularies, Bodenreider25 looked at using SNOMED CT to enter clinical concepts related to drug reactions and the possible subsequent mapping to MedDRA for research and reporting, and Elkin et al.26 showed that the use of a more clinically oriented terminology such as SNOMED-RT outperformed the ICD9-CM billing terminology for encoding and querying clinical text diagnoses.

Alternatives

One could carry out an analogous study going from ICD9-CM to ICD10-CM instead of SNOMED CT using Centers for Medicare and Medicaid Services mappings. We do not currently have our database encoded that way, but we expect similarly good performance, especially given that ICD9 is the precursor to ICD10. Some ambiguities do occur. For example, the attention deficit hyperactivity disorder ICD9-CM concept set includes 314.01 but excludes 314.00, but both map to the same ICD10-CM term. We still advocate for the SNOMED CT mapping for the broader reach and clinical focus. While our study measures inaccuracies produced by data mappings and concept set mappings, it does not imply that these inaccuracies are properties of the OHDSI data model. OHDSI retains the source data so that queries can always go back to the original data if desired, and no records are lost even if mappings do not exist yet.

Limitations

This study has several limitations. Only a small number of phenotypes were studied. The size was limited by the amount of work necessary to create the gold standard based on the original authors’ intent, which represented the bulk of the work. Once that gold standard was created, the rest of the knowledge engineering followed logically. We chose phenotypes from the eMERGE set based on the presence of a concept set that steered the majority of the phenotype definition. No phenotype definitions were rejected based on performance, good or bad. A second limitation is that we did not measure the clinical accuracy of the assignment of patients in the cohort but relied on their billing codes in both gold standards. Based on our very low error rate (0.26%), however, we can reuse the original authors’ clinically derived error rates (around 5%) to infer continued good performance. A third limitation is that we cannot prove that the errors in our patient cohorts are not the patients who would have been most important in the study. Our very low error rate points against a large effect, the codes that caused errors did not seem to be especially clinically unique, and the Reich et al.10 study corroborated little effect. Finally, we optimized our codes for our database, which had mostly ICD9-CM codes. The optimal concept set for a database of mostly ICD10-CM codes might be different.

CONCLUSION

Mapping data from source ICD billing codes to SNOMED CT codes produced only a very small effect on the generated patient cohorts, and one can infer that the corresponding phenotype definitions would maintain their accuracies. Only the concept sets that were hand-engineered achieved that performance; simple automated translation of concept sets did not work as well. The implication is that it should be feasible to define phenotypes using a single diagnosis vocabulary.

FUNDING

This work was funded by grants from the National Institutes of Health R01 LM006910 “Discovering and applying knowledge in clinical databases” and U01 HG008680 “Columbia GENIE (GENomic Integration with Ehr).”

Conflict of interest statement. None.

CONTRIBUTORS

All authors made substantial contributions to the conception and design of the work; drafted the work or revised it critically for important intellectual content; had final approval of the version to be published; and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

REFERENCES

1 International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), National Center for Health Statistics. http://www.cdc.gov/nchs/icd/icd9cm.htm Accessed April 24, 2018.
2 International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), National Center for Health Statistics. http://www.cdc.gov/nchs/icd/icd10cm.htm Accessed April 24, 2018.
3 SNOMED CT. http://www.snomed.org/snomed-ct Accessed April 24, 2018.
4 MedDRA. Medical dictionary for regulatory activities. http://www.meddra.org/ Accessed April 24, 2018.
5 RCD (Read Codes) – Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/RCD/ Accessed April 24, 2018.
6 Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. MEDINFO’15; August 19–23, 2015; São Paulo, Brazil.
7 Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proc Natl Acad Sci USA 2016; 113 (27): 7329–36.
8 Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc 2012; 19 (1): 54–60.
9 Unified Medical Language System (UMLS) Metathesaurus - Mapping Projects. https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/mapping_projects/index.html Accessed June 15, 2018.
10 Reich C, Ryan PB, Stang PE, Rocca M. Evaluation of alternative standardized terminologies for medical conditions within a network of observational healthcare databases. J Biomed Inform 2012; 45 (4): 689–96.
11 Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record–based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc 2013; 20 (e1): e147–54.
12 The eMERGE Network. Type 2 diabetes mellitus. https://phekb.org/phenotype/type-2-diabetes-mellitus Accessed April 24, 2018.
13 The eMERGE Network. Appendicitis. https://phekb.org/phenotype/appendicitis Accessed April 24, 2018.
14 The eMERGE Network. ADHD phenotype algorithm. https://phekb.org/phenotype/adhd-phenotype-algorithm Accessed April 24, 2018.
15 The eMERGE Network. Cataracts. https://phekb.org/phenotype/cataracts Accessed April 24, 2018.
16 The eMERGE Network. Crohn’s disease - demonstration project. https://phekb.org/phenotype/crohns-disease-demonstration-project Accessed April 24, 2018.
17 The eMERGE Network. Rheumatoid arthritis (RA). https://phekb.org/phenotype/rheumatoid-arthritis-ra Accessed April 24, 2018.
18 eMERGE Network. Heart failure (HF) with differentiation between preserved and reduced ejection fraction. https://phekb.org/phenotype/heart-failure-hf-differentiation-between-preserved-and-reduced-ejection-fraction Accessed April 24, 2018.
19 Hogan WR, Wagner MM. Accuracy of data in computer-based patient records. J Am Med Inform Assoc 1997; 4 (5): 342–55.
20 Wilcox AB, Chen YH, Hripcsak G. Minimizing electronic health record patient-note mismatches. J Am Med Inform Assoc 2011; 18 (4): 511–4.
21 Defalco FJ, Ryan PB, Soledad Cepeda M. Applying standardized drug terminologies to observational healthcare databases: a case study on opioid exposure. Health Serv Outcomes Res Methodol 2013; 13 (1): 58–67.
22 Cartagena FP, Schaeffer M, Rifai D, Doroshenko V, Goldberg HS. Leveraging the NLM map from SNOMED CT to ICD-10-CM to facilitate adoption of ICD-10-CM. J Am Med Inform Assoc 2015; 22 (3): 659–70.
23 Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004; 32 (90001): D267–D270.
24 Fung KW, Richesson R, Smerek M, et al. Preparing for the ICD-10-CM transition: automated methods for translating ICD codes in clinical phenotype definitions. eGEMs 2016; 4 (1): 4–1211.
25 Bodenreider O. Using SNOMED CT in combination with MedDRA for reporting signal detection and adverse drug reactions reporting. AMIA Annu Symp Proc 2009; 2009: 45–9.
26 Elkin PL, Ruggieri AP, Brown SH, et al. A randomized controlled trial of the accuracy of clinical record retrieval using SNOMED-RT as compared with ICD9-CM. Proc AMIA Symp 2001; 159–63.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
A randomized controlled trial to improve engagement of hospitalized patients with their patient portals
Greysen, S. Ryan; Harrison, James D.; Rareshide, Charles; Magan, Yimdriuska; Seghal, Neil; Rosenthal, Jaime; Jacolbia, Ronald; Auerbach, Andrew D.
doi: 10.1093/jamia/ocy125 pmid: 30346543
Abstract

Objectives

To test a patient-centered, tablet-based bedside educational intervention in the hospital and to evaluate the efficacy of this intervention to increase patient engagement with their patient portals during hospitalization and after discharge.

Materials and Methods

We conducted a randomized controlled trial of adult patients admitted to the hospitalist service in one large, academic medical center. All participants were supplied with a tablet computer for 1 day during their inpatient stay and assistance with portal registration and initial login as needed. Additionally, intervention group patients received a focused bedside education to demonstrate key functions of the portal and explain the importance of these functions to their upcoming transition to post-discharge care. Our primary outcomes were the proportions of patients who logged into the portal and completed specific tasks after discharge. Secondary outcomes were observed ability to navigate the portal before discharge and self-reported patient satisfaction with bedside tablet use to access the portal.

Results

We enrolled 97 participants (50 intervention; 47 control); overall, 57% logged into their portals ≥1 time within 7 days of discharge (58% intervention vs. 55% control). Mean number of logins was higher for the intervention group (3.48 vs. 2.94 control), and mean number of specific portal tasks performed was higher in the intervention group; however, no individual comparison reached statistical significance. Observed ability to log in and navigate the portal in the hospital was higher for the intervention group (64% vs. 60% control), but only 1 specific portal task was significant (view provider messaging tab: 92% vs. 77% control, P = .04). Time needed to deliver the intervention was brief (<15 min for 80%), and satisfaction with the bedside tablet to access the portal was high in the intervention group (88% satisfied/very satisfied).

Conclusion

Our intervention was highly feasible and acceptable to patients, and we found a highly consistent, but statistically non-significant, trend towards higher inpatient engagement and post-discharge use of key portal functions among patients in the intervention group.

Keywords: portals, patient engagement, transitions of care, hospitalization

BACKGROUND AND SIGNIFICANCE

Increased patient engagement in care is both one of the greatest opportunities and one of the greatest challenges for the “digital era” of healthcare.1–3 One of the top priorities of the Centers for Medicare and Medicaid Services’ (CMS’s) incentive program for Meaningful Use of Electronic Health Records (EHRs) is to “engage patients and families” to result in “empowered individuals.”4 Many of these objectives will be met through increased use of personal health records (PHR) or patient portals. Allowing patients access to information such as laboratory results, medication information, patient-specific education resources, and secure messaging with their providers has the potential to engage patients in their healthcare, leading to improved quality, safety, and outcomes.5 Although consensus is emerging around key issues to facilitate portal use in acute care settings,6,7 patient engagement with portals in the acute and post-acute setting still trails far behind use in outpatient settings, and evidence for successful implementation of inpatient portals remains very limited. Most recent studies of portal use in hospital settings have been qualitative or exploratory in nature, with notable exceptions.8,9 For example, a study by Wollen and colleagues noted that following bedside training, hospitalized patients responded favorably to accessing their clinical information and reported high levels of satisfaction at accessing information about medications and patient education materials.10 Another study by O’Leary et al found that inpatients who were oriented to their patient portals were significantly more likely to correctly name their physicians and understand their physicians’ roles.11 Similarly, previous work by our group has demonstrated the feasibility of teaching patients to use their portals during hospitalization12 and has suggested that hospitalized patients are eager to use technology to engage in their care13 and that bedside training can improve portal engagement among non-users.14,15 To our knowledge, there are no studies that have examined the role of portals to engage patients during transitions from acute care and after hospital discharge, although several are ongoing.16,17 Engagement during care transitions from the hospital is especially important because of high stakes for patients as well as high costs for the system. Furthermore, recent Meaningful Use objectives also focus on coordination of care, suggesting opportunities to leverage portals to engage patients during their care transitions.4

To explore this opportunity and address gaps in the literature around portal use and transitions of care, we conducted a pilot randomized controlled clinical trial focused on bedside engagement of hospitalized patients to use their patient portals during hospitalization and after discharge. We hypothesized that patients who received a patient-centered, tablet-based bedside educational intervention on portal use focused on accomplishing key post-discharge tasks (outpatient provider messaging, view results, view medications, and view appointments) would have higher rates of post-discharge portal use than a control group that was provided the same access (bedside tablet computer) but no education.

METHODS

Study design, participants, and setting

We conducted a prospective, randomized controlled trial (RCT) embedded within a larger, observational study of patient engagement in discharge planning.18 We approached patients on either their first or second day of hospitalization, which is aligned with clinical practice at our institution and a widely held maxim of hospital medicine that “discharge planning begins on the day of admission.”19 Participants were eligible for inclusion if they were admitted to the medical service, English speaking, and over 18 years of age (Supplementary Appendix: Study Protocol). Participants were ineligible if they were blind, deaf, cognitively impaired by the assessment of their medical team, or involuntarily hospitalized due to incarceration or major psychiatric illness. Participants did not have to be currently enrolled in the University of California San Francisco’s (UCSF’s) portal platform to be eligible to participate; however, they were required to have access to a personal tablet or home computer when discharged. Patients could elect to participate in the larger, observational discharge planning study but not the RCT. In such cases, the participant was not assigned to a study arm, and the pre-screening allocation established by the randomization procedure (see below) remained concealed and was applied to the next patient who agreed to participate in the RCT. The study took place on the Medicine Service at UCSF, a tertiary referral academic medical center.
This study was approved by the UCSF Institutional Review Board and registered as a clinical trial at www.clinicaltrials.gov (identifier NCT02109601). Randomization and power calculations We used a block randomization procedure to assign patients to intervention or control groups. Prior to screening or approaching patients, research assistants (RAs) used a random number generator to create blocks of 10 with random ordering of repeated numbers (1 = intervention, 2 = control). Due to the nature of the intervention, neither participants nor study personnel were blinded following random assignment. We calculated that 100 patients were needed to enable detection of a 25% absolute difference in ability to perform at least 1 post-discharge portal task (80% power, 2-sided alpha 0.05). Intervention and control group descriptions To ensure that device access and portal login were not barriers to pre-discharge portal access, we provided all patients with tablet computers (iPad® 16 GB 3rd generation Model A1430) and in-person login assistance. The portal used at our institution (MyChart by Epic Systems) is mobile friendly and easily accessible via web browsers commonly used on tablets, desktops, laptops, or smartphones. Control patients received a tablet computer and limited assistance registering and logging in if they were first-time users of our portal; no other assistance or instruction on how to use the portal was offered or delivered to control patients. Intervention patients received extensive, structured bedside education from trained RAs in the hospital, which guided patients through key functions of the portal, including how to verify personal information (eg, address, phone number, email), view medications and request refills, view test results (labs and radiology), view current appointments and request changes, and view and send secure messages to outpatient providers. 
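The block randomization and sample-size calculation described in the Methods can be sketched in Python. This is a minimal illustration, not the authors' procedure. It assumes each block of 10 holds five intervention and five control codes before shuffling, and it assumes a 20% control-arm rate of post-discharge task completion, since the Methods state only the 25% absolute difference, the power, and the alpha level. The function names `block_randomize` and `n_per_group` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def block_randomize(n_blocks, block_size=10, seed=None):
    """Return allocation blocks; 1 = intervention, 2 = control.

    Assumption: every block is balanced (half 1s, half 2s) before shuffling.
    """
    if block_size % 2:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = [1] * (block_size // 2) + [2] * (block_size // 2)
        rng.shuffle(block)  # random ordering of the repeated codes
        blocks.append(block)
    return blocks


def n_per_group(p_control, p_intervention, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 2-sided alpha
    z_beta = NormalDist().inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    delta = p_intervention - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


if __name__ == "__main__":
    schedule = block_randomize(n_blocks=10, seed=2014)
    print("first block:", schedule[0])
    # Assumed baseline of 20% rising to 45% (a 25% absolute difference).
    print("total needed:", 2 * n_per_group(0.20, 0.45))
```

Under these assumptions the formula yields 52 patients per arm, in line with the stated target of roughly 100; a different assumed baseline rate would change that figure.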
To enhance patient engagement in the intervention arm, we explained the relevance of these tasks and gave explicit examples for how the portal could be useful to them after leaving the hospital, eg, checking (or changing) appointments made during hospitalization that they may wish to change after discharge, checking for new medications prescribed in the hospital (or requesting refills), and reviewing test results (especially those pending at the time of discharge). While this tutorial provided uniform basic content for all patients, it was designed to be adaptive enough to allow RAs to “speed up” or “slow down” or otherwise tailor the depth of explanation to the needs of each individual patient. All patients were approached before noon, and they were encouraged to use the tablet on their own for the rest of the day. The RA returned approximately 5 hours later to collect the device and complete a debrief interview (Supplementary Appendix: Study Protocol). Data collection and sources of data Following enrollment and consent, RAs administered a brief pre-study survey to assess baseline technology use. This survey, which was previously used by our group,12 assessed device ownership, internet use, and any pre-admission portal activities (defined as use of any online portal, not just our institutional portal, to access health information or accomplish tasks such as refilling a prescription). At the end of the day, the RAs performed a debrief interview in which participants were asked to independently demonstrate the ability to perform key portal tasks (the same ones addressed in the structured tutorial). The RA recorded which tasks (if any) the participant accomplished independently and whether the RA provided additional assistance. 
To assess portal access and use to accomplish post-discharge tasks (defined as login, outpatient provider messaging, view results, view medications, and view appointments), we accessed searchable databases of our EHR via Epic Systems Clarity clinical databases (Verona, WI). Patient demographic and clinical information was also obtained from the EHR. For clinical severity specifically, we used the Severity of Illness classification developed by the CMS, which categorizes all patients with a given Diagnosis-Related Group into 1 of 4 classes of severity (minor, moderate, major, and extreme). Primary and secondary outcomes Primary outcomes were the proportion of patients who logged in to the portal and completed specific tasks after discharge: login, outpatient provider messaging, view results, view medications, and view appointments. Secondary outcomes were the RA-observed ability to navigate the portal before discharge (login, view medications, view med refills, view appointments, view lab results, view messages) and patient satisfaction (overall satisfaction using the tablet in the hospital and using the tablet specifically to access and navigate the portal in hospital). Data analysis Two hundred fifty patients were eligible for the study and were approached as described in our CONSORT Flow Diagram (Figure 1). Of these, 163 were enrolled in a larger observational study of patient engagement in discharge planning, and 113 of these also agreed to participate in the randomized trial. Of the 113 who agreed to enroll in the trial and were randomized, 97 were able to register/log in successfully to MyChart and were willing to allow monitoring of their post-discharge portal use. 
Given that the outcome of interest was post-discharge portal use, we could not use an intention-to-treat analysis to include the 16 patients who initially agreed to participate but were ultimately unable to log in or would not allow post-discharge data access, as these patients did not have the primary outcome of interest (post-discharge portal usage). Most of these patients had existing accounts but could not remember their login information and were unable to reset their password or were locked out after multiple attempts at the time of study enrollment. Although our team was able to register participants for new accounts, we were unable to reset or unlock existing accounts if the participant could not remember details needed to restore or reset his or her account (such as the correct email used to originally register the account). Therefore, we used a per-protocol approach to data analysis and present results from the 97 participants who were able to register/log in successfully and allow access to their post-discharge portal data. Patient demographic and clinical information was summarized using descriptive statistics. To assess the rigor of randomization, we compared these characteristics by intervention and control group using t tests and chi-square tests. We summed and averaged the total number of logins to the portal and number of clicks in different domains of the portal (medications, labs, appointments). We used logistic regression to adjust for age and previous portal use, as well as functional impairment given our prior work suggesting negative effects on both transitions20 and internet access at home.21 For logistic regressions, we constructed our dichotomous outcome variables as “ever” or “at least once” events, eg, a patient who logged in once or performed a certain task (ex: viewing medications) would be categorized the same as one who logged in (or performed a certain task) many times. 
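The "ever" dichotomization and the two-group comparisons described above can be illustrated with a short Python sketch. It is not the authors' SAS code and covers only unadjusted comparisons; the adjusted models would need a regression library. The function names are hypothetical, the chi-square is computed without continuity correction (an assumption, since the exact procedure is not stated), and the counts are taken from Tables 2 and 3.

```python
import math


def ever(count):
    """Dichotomize a login or click count: 1 if it happened at least once."""
    return 1 if count > 0 else 0


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and P value, 2x2 table.

    a, b = intervention yes/no; c, d = control yes/no.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail area is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald 95% confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # A patient with 7 logins is coded the same as one with a single login.
    print([ever(c) for c in (0, 1, 7)])
    # Table 2, provider messaging: 46 of 50 intervention vs 36 of 47 control.
    print(chi_square_2x2(46, 4, 36, 11))
    # Table 3, post-discharge login: 29 of 50 intervention vs 26 of 47 control.
    print(odds_ratio_ci(29, 21, 26, 21))
```

For the provider messaging row, this calculation gives a P value between .03 and .04, consistent with the reported P = .04. For post-discharge login, the unadjusted odds ratio is above 1.0 with a confidence interval that spans 1.0, which matches the direction of the reported findings.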
Those who logged in or completed tasks at least once were compared to patients who never logged in or never completed the specified tasks. Analyses were conducted in SAS. Figure 1. CONSORT flow diagram. RESULTS Patient characteristics (Table 1) Although our study arms were not powered to detect small differences in patient characteristics, overall demographics appear very similar between groups, and prior technology use rates were comparable between groups in terms of device ownership, internet use, and pre-study online health tasks. Notably, the one area of difference between the 2 groups that was significant was prior MyChart registration: 34 participants in the intervention group (68%) were previously registered vs. 18 (38%) in the control group (P < .01). Thus, 16 patients in the intervention group and 29 in the control group were registered for new MyChart accounts at the time of their enrollment in the study. This difference notwithstanding, our key measure for feasibility, time needed for basic orientation to use of the portal on the tablets, was not different by intervention group. Most participants required less than 15 minutes (40 or 80% intervention; 40 or 85% control) and a few required 15–30 minutes (4 or 8% intervention; 3 or 6% control) or over 30 minutes (6 or 12% intervention; 4 or 9% control). Table 1. 
Participant characteristics

| Characteristic | Overall N (%) | Intervention n (%) | Control n (%) |
| --- | --- | --- | --- |
| Total | 97 | 50 (52%) | 47 (48%) |
| Demographics | | | |
| Age 18–49 | 48 (49) | 20 (40) | 28 (60) |
| Age 50–60 | 41 (42) | 24 (48) | 17 (36) |
| Age ≥70 | 8 (8) | 6 (12) | 2 (4) |
| Gender: Female | 53 (55) | 28 (58) | 25 (55) |
| Race/ethnicity: White | 53 (55) | 26 (52) | 27 (57) |
| Race/ethnicity: Black | 19 (20) | 9 (18) | 10 (21) |
| Race/ethnicity: Hispanic | 9 (9) | 5 (10) | 4 (9) |
| Race/ethnicity: Asian | 7 (7) | 5 (10) | 2 (4) |
| Race/ethnicity: Other/Unknown | 9 (9) | 5 (10) | 4 (9) |
| Married/Living as married | 47 (48) | 22 (44) | 25 (55) |
| Payer: Medicaid | 23 (24) | 10 (20) | 13 (28) |
| Payer: Medicare | 22 (23) | 13 (26) | 9 (19) |
| Payer: Private | 46 (47) | 24 (50) | 22 (47) |
| Payer: Self-pay/Uninsured | 6 (6) | 3 (6) | 3 (6) |
| UCSF primary care provider | 43 (44) | 25 (50) | 18 (38) |
| Clinical characteristics | | | |
| Severity of illness: Minor or moderate | 29 (30) | 14 (28) | 15 (32) |
| Severity of illness: Major | 55 (57) | 27 (54) | 28 (60) |
| Severity of illness: Extreme | 13 (13) | 9 (18) | 4 (9) |
| Length of stay, mean (SD) | 6.4 (13.5) | 5.1 (4.7) | 7.7 (18.8) |
| Technology use characteristics | | | |
| Device ownership: Own desktop computer | 51 (53) | 25 (50) | 26 (55) |
| Device ownership: Own laptop computer | 67 (69) | 36 (72) | 31 (66) |
| Device ownership: Own smartphone | 57 (59) | 26 (52) | 31 (66) |
| Device ownership: Own tablet computer | 48 (49) | 26 (52) | 22 (47) |
| Device ownership: Doesn’t own a device | 6 (6) | 3 (6) | 3 (6) |
| Internet use daily | 79 (81) | 39 (78) | 40 (85) |
| Internet use several times a week | 7 (7) | 4 (8) | 3 (6) |
| Internet use once a week or less | 6 (6) | 2 (4) | 4 (9) |
| Pre-study online health tasks (any platform) | | | |
| Looked up health information | 78 (80) | 40 (80) | 38 (81) |
| Communicated with provider | 55 (57) | 30 (60) | 25 (55) |
| Scheduled medical appointment | 39 (41) | 23 (46) | 16 (34) |
| Refilled prescription | 34 (35) | 25 (50) | 9 (19) |
| None of these | 10 (10) | 3 (6) | 7 (15) |

Inpatient portal use and satisfaction All 97 participants included in the analysis were able to log in and navigate within their portals during their inpatient stay. When study tablets were collected at the end of the day by the RA, however, only 60 (62%) participants were able to demonstrate the ability to log in independently as part of the debrief interview; the remaining 37 (38%) were able to log in with assistance (Table 2). Once logged in, participants were able to accomplish other tasks as part of this debrief interview with varying frequency. For all tasks, we observed a higher percentage of success in accomplishing the task independently in the intervention group (range 46%–92% intervention vs. 
32%–79% control), although this reached statistical significance only for the task of navigating to the outpatient provider messaging tab (46 or 92% intervention; 36 or 77% control, P = .04). With respect to participant satisfaction, overall satisfaction was high for using the tablet during hospitalization in general (78 or 80% were “satisfied” or “very satisfied”) and for using the tablet to access and navigate their portals specifically (83 or 86% “satisfied” or “very satisfied”). Again we observed higher rates in the intervention group vs. control, but these did not reach statistical significance (Table 2). Table 2. Inpatient portal tasks accomplished independently and patient satisfaction

| Measure | Overall N (%) | Intervention n (%) | Control n (%) | P-value |
| --- | --- | --- | --- | --- |
| Tasks accomplished | | | | |
| Login | 60 (62) | 32 (64) | 28 (60) | .65 |
| View provider messaging | 82 (86) | 46 (92) | 36 (77) | .04 |
| View lab results | 79 (81) | 43 (86) | 36 (77) | .23 |
| View medications | 74 (76) | 41 (82) | 33 (70) | .17 |
| View appointments | 82 (86) | 45 (90) | 37 (79) | .12 |
| Patient satisfaction | | | | |
| Overall satisfaction using tablet in the hospital | 78 (80) | 43 (86) | 35 (74) | .15 |
| Satisfaction with tablet to access and navigate portal | 83 (86) | 44 (88) | 39 (83) | .48 |

Post-discharge patterns of portal use Just over half of all participants logged into their portals at least once within 7 days of discharge: 55 or 57% overall, 29 (58%) for intervention; 26 (55%) for control (Table 3). Mean number of total logins was higher for the intervention group (3.48 vs. 2.94 for control), but the difference was not statistically significant. The percentage of participants who performed specific tasks once logged in to the portal ranged from 20%–48% for the intervention group and 10%–38% for control. Mean number of tasks performed was higher for the intervention group in every area measured; however, no individual comparison reached statistical significance. The most frequently performed tasks (highest means) for both groups were outpatient provider messaging (5.98 intervention, 3.98 control), lab results (5.68 intervention, 4.36 control), and appointment review (2.14 intervention, 2.04 control). Table 3. 
Post-discharge portal tasks accomplished by patients

| Task | Clicked at least once: All N (%) | Clicked at least once: Intervention n (%) | Clicked at least once: Control n (%) | P-value (ChiSq) | Mean number of clicks: Intervention | Mean number of clicks: Control | P-value (Wilcoxon) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Login | 55 (57) | 29 (58) | 26 (55) | .86 | 3.48 | 2.94 | .60 |
| Provider messaging | 42 (43) | 24 (48) | 18 (38) | .55 | 5.98 | 3.98 | .33 |
| Lab results | 40 (41) | 22 (44) | 18 (38) | .59 | 5.68 | 4.36 | .49 |
| View medications | 33 (34) | 17 (34) | 16 (34) | .53 | 1.24 | 1.23 | .78 |
| Appointment review | 34 (35) | 21 (42) | 13 (28) | .41 | 2.14 | 2.04 | .23 |

Finally, we performed logistic regression analysis to determine if odds of post-discharge portal access were higher among intervention participants. 
Given the differences observed between intervention and control groups in terms of age (non-significant) and prior MyChart use (significant), we performed logistic regression analyses adjusted for these variables to predict login or performance of any single task listed in Table 3. Although unadjusted odds ratios were higher (>1.0) for each task when modeled for the intervention group, this did not achieve statistical significance. Similarly, adding age and prior MyChart use to the model as adjustor variables did not result in any statistically significant result (all confidence intervals crossed 1.0, results not shown). DISCUSSION We found that a patient-centered, tablet-based bedside educational intervention focused on portal training for hospitalized patients produced a trend towards higher overall use of and engagement with the portal among intervention patients, specifically in their observed ability to login and navigate the portal before discharge, satisfaction with portal use on tablets in the hospital, and frequency of portal use after discharge. Our approach differs from previous hospital-based EHR interventions that have focused on engaging providers to increase the completion and accuracy of key transition tasks such as scheduling appointments, communicating with providers, and completing medication reconciliation.22–24 There are few studies that have examined the potential for portals to improve patient engagement in these tasks during transitions of care. Recent systematic reviews of this area have highlighted that existing studies have largely been qualitative or exploratory in nature8,9; ours is the first randomized clinical trial to rigorously test an interventional approach to increase patient engagement with their portals in the hospital and post-discharge settings. While the positive trend towards greater engagement in tasks for our intervention group was very consistent, only 1 area (outpatient provider messaging) was statistically significant. 
There are several factors that could explain this result. First, it is possible that providing patients with tablets to access their portals could be a more powerful intervention than we anticipated and could have overshadowed the educational intervention. To explore this question, we created a virtual cohort of 400 active users of our institutional portal who were hospitalized around the same time as patients in the current study but did not receive any device during their stay.14 We compared portal use in this virtual cohort to patients in the current study as part of our published study protocol. Virtual cohort patients had a similar number of logins, but patients in our study who received tablets were more likely to use the portal to check their medications during hospitalization, even after adjustment for prior portal use. Second, although we anticipated a modest effect and set our enrollment targets accordingly, our intervention effect was even smaller than anticipated. Our previous work using a 12-hospital quality and research network showed that few patients (20%–30%) had ever used the internet for post-discharge tasks related to medications, appointments, or provider communication.25 In the current study, however, most patients reported they had used the internet for these tasks in the past, and our observed use of the portal for these post-discharge tasks among all study participants ranged from 35%–57% overall. Thus, it is possible that a secular trend towards greater portal use effectively elevated the pre-study baseline we had previously observed and used to set our targets. 
Although a secular trend towards greater portal use would be good news, the larger literature on patient perceptions and experiences using inpatient portals suggests plenty of room for improvement in the hospital to increase engagement.8,9 Indeed, our qualitative work with patients admitted at 12 hospitals across the United States,26 as well as more in-depth interviews with a subset of patients who used iPad tablets to access their portal at our hospital10 and others,27 suggests educational barriers are among the most important—and addressable—challenges to inpatient and post-discharge use of portals.28,29 While the trend we observed in this study toward greater portal engagement among intervention patients who received adaptive educational training in portal use adds validity to this hypothesis, the limited (non-statistically significant) effect may also be due to broad enrollment criteria, which allowed a wide range of experience and aptitude. Future studies should focus interventions to patients with observed inability to perform certain tasks as inclusion criteria or those at highest risk of poor transitions. Future studies should ultimately measure success in terms of outcomes such as fewer missed follow-up appointments or unfilled medications, as these seem plausibly modifiable through better portal engagement to view or change appointments and refill medications. Furthermore, future educational interventions could leverage more sophisticated approaches to enhance patient self-efficacy. 
As one example, a “teach to goal” approach has recently been deployed in the hospital to achieve more consistent and effective use of inhalers after discharge in patients with respiratory conditions such as asthma.30,31 In this approach, patients must independently demonstrate competency in a series of small tasks to complete the intervention (eg, remove cap from inhaler, attach spacer to inhaler, depress inhaler correct number of times for intended dose, inhale and hold breath for 5 seconds, exhale slowly). If the patient performs any of these tasks incorrectly, the provider reviews those steps until the patient can reach the goal of completing all tasks correctly and independently. This teach to goal approach could be applied to hospital-based portal training to ensure higher self-efficacy in portal use, particularly in patients with limited baseline portal skills or at higher risk for poor transitions of care. Finally, while we did not explore the abilities of family members or other caregivers to engage in portal tasks on the patient’s behalf, several studies of parental portal use during pediatric hospitalization demonstrated high engagement and can serve as examples for future work in this area with adult inpatients.32,33 Another step towards greater patient efficacy in portal engagement could be to teach patients to access the portal on their own devices. We provided tablet computers at bedside to ensure a uniform portal experience for participants, but many patients now bring their own mobile devices to the hospital with them. 
Indeed, in a separate study, we found about 2 in 3 patients at our hospital brought and used at least 1 mobile device during their hospitalization (rates were as high as 80% in some units such as oncology).13 As the movement towards “bring your own device” (BYOD) gains momentum for patient engagement with the EHR and other health-related platforms (eg, diet, activity, and medication logs or other health-related apps), there is tremendous opportunity for patients in acute and post-acute phases of care. For example, we are currently studying a BYOD approach to engaging patients in mobility during and after hospitalization using accelerometers and patients’ own smartphones to provide feedback as they strive to achieve walking goals (steps per day) to promote recovery.34 Finally, there may be certain pragmatic advantages to a BYOD approach to portal engagement. In our study, we found that only 62% of all patients were able to log in independently but, once logged in, ability to accomplish other tasks independently was notably higher (76%–86%). While there may be a number of factors at play, one compelling hypothesis is that patients had a difficult time getting to the login page and entering their credentials on an unfamiliar device. Future studies could explore whether a BYOD approach could facilitate stored credentials or even use biometrics (fingerprint or facial recognition), as these are becoming standard for most mobile devices. We believe our results and experience have important implications for policy and practice regarding the use of portals in the hospital and immediately after discharge. First and foremost, more studies of portal use in the hospital are needed, especially given that most hospitals have not yet deployed this feature of the EHR and that Meaningful Use will require higher use in the near future,4 suggesting an impending implementation boom; more evidence is needed to guide this process. 
Furthermore, future implementation work will need to anticipate and study the effect of portals that are unique to the inpatient environment, such as MyChart Bedside; we used a single portal that reflected actual practice at UCSF. While there may be certain advantages to an inpatient-only mode for portals, there may be tradeoffs as well, such as having to “re-learn” how to navigate a different layout and set of functionalities after discharge. Studies of this platform to date have focused on provider experience35 or patient satisfaction,36 but have not yet examined patient usage or outcomes. Indeed, while national data are lacking, our experience networking with peer institutions suggests that most are still using a single portal designed primarily for outpatient use without a separate inpatient version or adaptive portal that has an inpatient view (such as MyChart Bedside). Finally, our portal (MyChart) is “tethered” to our EHR (Epic), which means there is limited ability to share information among health systems that use different EHRs. Future studies should examine how patients who receive care in multiple systems may leverage the portal differently after discharge. While our study brings insights into the use of portals in the hospital, it also has limitations. First, we studied patients at 1 hospital only on a general medicine service; patients at other medical centers or different services may have different experiences and abilities using portals. Second, we were unable to follow an intention-to-treat analysis because some patients with existing MyChart accounts were unable to log in or unlock their accounts after randomization. 
Thus, these patients did not have the opportunity to contribute to the study outcomes (frequency of login and frequency of MyChart task completion) and could not be analyzed together with patients who were able to successfully log in; however, our analysis did include all other patients who were enrolled (there was no loss to follow-up and inability to login was the one and only cause for exclusion from final analysis). Third, although we used a rigorous randomization process and found no differences in virtually every category of clinical/demographic features or technology use, we did observe far higher rates of prior MyChart use in the intervention group. Nonetheless, this did not appear to be a significant predictor of engagement in our analyses that adjusted for prior MyChart use. CONCLUSIONS We conducted a randomized trial of a bedside tablet-based educational intervention to improve post-discharge use of our institutional patient portal and found consistent but statistically non-significant trends towards higher post-discharge use in the intervention group. As demand for patient access to electronic medical records increases from key stakeholders including patients/caregivers, government and private payers, as well as healthcare providers, more evidence from rigorous studies will be needed to guide successful implementation and operationalization of patient portals during and after acute care. FUNDING This research was supported by the National Institutes of Health (NIH), National Institute of Aging (NIA) through the Claude D. Pepper Older Americans Independence Center, grant number P30AG021342, and a Career Development Award, grant number K23AG045338-05 (Dr Greysen). This research was also sponsored by an intramural grant for Digital Health by the University of California, San Francisco (no grant number, Dr Greysen, PI). 
No sponsor had any role in the design or conduct of the study; collection, management, analysis, or interpretation of the data; or preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication. CONTRIBUTION STATEMENT SRG: conception of the work; acquisition, analysis, and interpretation of data; drafting of the manuscript and revisions for important intellectual content; final approval of the version to be published; agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. JDH: analysis and interpretation of data; drafting of the manuscript and revisions for important intellectual content; final approval of the version to be published. CR: analysis and interpretation of data; revisions to the manuscript for important intellectual content; final approval of the version to be published. YM: acquisition, analysis, and interpretation of data; drafting of the manuscript and revisions for important intellectual content; final approval of the version to be published. NH: analysis and interpretation of data; revisions to the manuscript for important intellectual content; final approval of the version to be published. JR: acquisition, analysis, and interpretation of data; revisions to the manuscript for important intellectual content; final approval of the version to be published. ADA: conception of the work and interpretation of data; revisions to the manuscript for important intellectual content; final approval of the version to be published. SUPPLEMENTARY MATERIAL Supplementary material is available at Journal of the American Medical Informatics Association online. Conflict of interest statement. All authors have no competing interests to declare. 
ACKNOWLEDGMENTS Preliminary results from this project were presented as part of a panel presentation at the 2015 Annual Meeting of the American Medical Informatics Association and as a research poster presentation at the 2015 Annual Meeting of the Society for Hospital Medicine.
REFERENCES
1 Laurance J, Henderson S, Howitt PJ, et al. Patient engagement: four case studies that highlight the potential for improved health outcomes and reduced costs. Health Aff (Millwood) 2014; 33(9): 1627–34.
2 Irizarry T, DeVito Dabbs A, Curran CR. Patient portals and patient engagement: a state of the science review. J Med Internet Res 2015; 17(6): e148.
3 Kruse CS, Argueta DA, Lopez L, Nair A. Patient and provider attitudes toward the use of patient portals for the management of chronic disease: a systematic review. J Med Internet Res 2015; 17(2): e40.
4 EHR Incentives and Certification: Meaningful Use Definitions and Objectives. https://www.healthit.gov/providers-professionals/meaningful-use-definition-objectives. Accessed March 22, 2018.
5 Kruse CS, Bolton K, Freriks G. The effect of patient portals on quality outcomes and its implications to meaningful use: a systematic review. J Med Internet Res 2015; 17(2): e44.
6 Collins SA, Rozenblum R, Leung WY, et al. Acute care patient portals: a qualitative study of stakeholder perspectives on current practices. J Am Med Inform Assoc 2017; 24(e1): e9–e17. [Epub ahead of print] PMID: 27357830.
7 Grossman LV, Choi SW, Collins S, et al. Implementation of acute care patient portals: recommendations on utility and use from six early adopters. J Am Med Inform Assoc 2018; 25(4): 370–9. [Epub ahead of print] PMID: 29040634.
8 Prey JE, Woollen J, Wilcox L, et al. Patient engagement in the inpatient setting: a systematic review. J Am Med Inform Assoc 2014; 21(4): 742–50.
9 Kelly MM, Coller RJ, Hoonakker PL. Inpatient portals for hospitalized patients and caregivers: a systematic review. J Hosp Med 2017; doi:10.12788/jhm.2894.
10 Woollen J, Prey J, Wilcox L, et al. Patient experiences using an inpatient personal health record. Appl Clin Inform 2016; 7(2): 446–60.
11 O’Leary KJ, Lohman ME, Culver E, et al. The effect of tablet computers with a mobile patient portal application on hospitalized patients’ knowledge and activation. J Am Med Inform Assoc 2016; 23(1): 159–65.
12 Greysen SR, Khanna RR, Jacolbia R, Lee HM, Auerbach AD. Tablet computers for hospitalized patients: a pilot study to improve inpatient engagement. J Hosp Med 2014; 9(6): 396–9.
13 Ludwin S, Greysen SR. Use of smartphones and mobile devices in hospitalized patients: untapped opportunities for inpatient engagement. J Hosp Med 2015; 10(7): 459–61.
14 Greysen SR, Magan Mendoza Y, Rosenthal J, et al. Using tablet computers to increase patient engagement with electronic personal health records: protocol for a prospective, randomized interventional study. JMIR Res Protoc 2016; 5(3): e176.
15 Greysen SR, Rajkomar A, Ritchie CR, Auerbach AD. Improving engagement of older, hospitalized adults through bedside use of personal health records [abstract]. J Hosp Med 9(Suppl 2): 147.
16 McAlearney AS, Sieck CJ, Hefner JL, et al. High touch and high tech (HT2) proposal: transforming patient engagement throughout the continuum of care by engaging patients with portal technology at the bedside. JMIR Res Protoc 2016; 5(4): e221.
17 Masterson Creber R, Prey J, Ryan B, et al. Engaging hospitalized patients in clinical care: study protocol for a pragmatic randomized controlled trial. Contemp Clin Trials 2016; 47: 165–71.
18 Harrison JD, Greysen SR, Jacolbia R, et al. Not ready, not set…discharge: patient-reported barriers to discharge readiness at an academic medical center. J Hosp Med 2016; 11(9): 610–4.
19 Goodman DM, Burke AE, Livingston EH. Discharge planning. JAMA 2013; 309(4): 406.
20 Greysen SR, Stijacic Cenzer I, Auerbach AD, Covinsky KE. Functional impairment and hospital readmission in Medicare seniors. JAMA Intern Med 2015; 175(4): 559–65.
21 Greysen SR, Chin Garcia C, Sudore RL, Cenzer IS, Covinsky KE. Functional impairment and internet use among older adults: implications for meaningful use of patient portals. JAMA Intern Med 2014; 174(7): 1188–90.
22 Schnipper JL, Liang CL, Hamann C, et al. Development of a tool within the electronic medical record to facilitate medication reconciliation after hospital discharge. J Am Med Inform Assoc 2011; 18(3): 309–13.
23 Schnipper JL, Hamann C, Ndumele CD, et al. Effect of an electronic medication reconciliation application and process redesign on potential adverse drug events: a cluster-randomized trial. Arch Intern Med 2009; 169(8): 771–80.
24 Turchin A, Hamann C, Schnipper JL, et al. Evaluation of an inpatient computerized medication reconciliation system. J Am Med Inform Assoc 2008; 15(4): 449–52.
25 Greysen S, Metlay J, Kripalani S, Sarkar U, Robinson E, Auerbach A. Internet use for postdischarge health care tasks among readmitted patients: preliminary results from the homerun transitions of care study [abstract]. J Hosp Med 2013; 8(suppl 2). https://www.shmabstracts.com/abstract/internet-use-for-postdischarge-health-care-tasks-among-readmitted-patients-preliminary-results-from-the-homerun-transitions-of-care-study/. Accessed October 3, 2018.
26 Greysen SR, Harrison JD, Kripalani S, et al. Understanding patient-centred readmission factors: a multi-site, mixed-methods study. BMJ Qual Saf 2017; 26(1): 33–41.
27 O’Leary KJ, Sharma RK, Killarney A, et al. Patients’ and healthcare providers’ perceptions of a mobile portal application for hospitalized patients. BMC Med Inform Decis Mak 2016; 16(1): 123.
28 Prey JE, Restaino S, Vawdrey DK. Providing hospital patients with access to their medical records. AMIA Annu Symp Proc 2014; 2014: 1884–93.
29 Davis SE, Osborn CY, Kripalani S, et al. Health literacy, education levels, and patient portal usage during hospitalizations. AMIA Annu Symp Proc 2015; 2015: 1871–80.
30 Press VG, Arora VM, Shah LM, et al. Teaching the use of respiratory inhalers to hospitalized patients with asthma or COPD: a randomized trial. J Gen Intern Med 2012; 27(10): 1317–25.
31 Press VG, Kelly CA, Kim JJ, et al. Virtual Teach-To-Goal™ adaptive learning of inhaler technique for inpatients with asthma or COPD. J Allergy Clin Immunol Pract 2017; 5(4): 1032–9.e1.
32 Kaziunas E, Hanauer DA, Ackerman MS, Choi SW. Identifying unmet informational needs in the inpatient setting to increase patient and caregiver engagement in the context of pediatric hematopoietic stem cell transplantation. J Am Med Inform Assoc 2016; 23(1): 94–104.
33 Kelly MM, Hoonakker PL, Dean SM. Using an inpatient portal to engage families in pediatric hospital care. J Am Med Inform Assoc 2017; 24(1): 153–61.
34 Mobility Optimization Via Engagement with Interactive Technology – The MOVE-IT Randomized Trial. https://clinicaltrials.gov/ct2/show/NCT03321279. Accessed April 3, 2018.
35 Hefner JL, Sieck CJ, Walker DM, et al. System-wide inpatient portal implementation: survey of health care team perceptions. JMIR Med Inform 2017; 5(3): e31.
36 Winstanley EL, Burtchin M, Zhang Y, et al. Inpatient experiences with MyChart bedside. Telemed J E Health 2017; 23(8): 691–3.
© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected] This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Expert-level sleep scoring with deep neural networks
Biswal, Siddharth; Sun, Haoqi; Goparaju, Balaji; Westover, M Brandon; Sun, Jimeng; Bianchi, Matt T
doi: 10.1093/jamia/ocy131 pmid: 30445569
Abstract Objectives Scoring laboratory polysomnography (PSG) data remains a manual task of visually annotating 3 primary categories: sleep stages, sleep disordered breathing, and limb movements. Attempts to automate this process have been hampered by the complexity of PSG signals and physiological heterogeneity between patients. Deep neural networks, which have recently achieved expert-level performance for other complex medical tasks, are ideally suited to PSG scoring, given sufficient training data. Methods We used a combination of deep recurrent and convolutional neural networks (RCNN) for supervised learning of clinical labels designating sleep stages, sleep apnea events, and limb movements. The data for testing and training were derived from 10 000 clinical PSGs and 5804 research PSGs. Results When trained on the clinical dataset, the RCNN reproduces PSG diagnostic scoring for sleep staging, sleep apnea, and limb movements with accuracies of 87.6%, 88.2% and 84.7% on held-out test data, a level of performance comparable to human experts. The RCNN model performs equally well when tested on the independent research PSG database. Only small reductions in accuracy were noted when training on limited channels to mimic at-home monitoring devices: frontal leads only for sleep staging, and thoracic belt signals only for the apnea-hypopnea index. Conclusions By creating accurate deep learning models for sleep scoring, our work opens the path toward broader and more timely access to sleep diagnostics. Accurate scoring automation can improve the utility and efficiency of in-lab and at-home approaches to sleep diagnostics, potentially extending the reach of sleep expertise beyond specialty clinics. 
deep learning, sleep scoring, neural network, EEG analysis Introduction Common sleep disorders such as sleep apnea, insomnia, and restless legs syndrome impact tens of millions of adults and are significant risk factors for cardiometabolic and neurodegenerative diseases, impaired performance, and decreased quality of life.1–7 The population health impact is enormous, including medical and psychiatric morbidity, motor vehicle accidents, decreased work productivity and quality of life, and increased mortality.7,8 Timely and accurate diagnosis of sleep disorders is critical to pursue appropriate treatment and improve health outcomes,9 yet most sleep disorders remain undiagnosed.10,11 Recent advances in portable monitoring technology have increased access to sleep diagnostics, yet both at-home and the gold-standard in-lab polysomnography (PSG) still require manual scoring. Previous attempts to automate diagnosis of sleep disorders have generally relied on fewer than 100 PSGs from relatively homogeneous groups of healthy individuals.12 Models trained on such datasets are not likely to generalize well, because PSG signals vary widely due to differences in demographics, medication effects, sleep conditions, and medical conditions. We address this variability using a data-driven approach based on 79 456 hours of clinical data from 10 000 nights of PSG recording. This real-world data, recorded over 8 years in a clinical sleep laboratory, makes our PSG analysis system robust to physiologic variability between patients. Most prior approaches involve preprocessing and extraction of carefully engineered features before classification.12 Our system is trained end-to-end, directly from labeled signals. 
Deep neural networks, fueled by increases in computing power and availability of large labeled datasets, have recently matched the performance of medical experts in complex medical pattern recognition tasks such as visual diagnosis of dermatologic lesions13 and diabetic retinopathy.14 In this paper, we outline the development of a Recurrent Convolutional Neural Network (RCNN) that matches the performance of sleep experts in annotating overnight PSGs. No prior study has simultaneously addressed all 3 key types of PSG information extracted by expert scorers: sleep stages, respiratory events, and limb movements. Our system uses a unified deep network architecture (RCNN) to accomplish all 3 tasks. Prior work, for comparison, mainly uses small datasets, mostly of healthy adults.15–30 Methods Description of deep neural network development Deep learning algorithms such as multi-layer perceptrons, convolutional neural networks (CNN), and recurrent neural networks (RNN) have been successfully applied to many domains to solve challenging tasks. The most basic computation unit in neural networks is a perceptron, which performs a linear combination of input features followed by a nonlinear transformation. Standard deep neural networks (DNN) consist of multiple layers of perceptrons, which are all fully connected across consecutive layers. To avoid the dense connections of DNNs, CNNs introduce local connections and parameter sharing through convolution operations, which have demonstrated numerous successes in computer vision applications such as object recognition. RNNs are another extension of DNNs that are suitable for modeling sequential data such as natural language text and time series. A detailed overview of various deep learning models for analyzing medical data can be found in Xiao et al.31 Here, we briefly describe the rationale for designing and developing deep neural networks for analysis of clinical sleep data.
We initially used classical machine learning algorithms such as logistic regression and random forest directly on expert-defined features. However, the resulting performance was not very high, as shown in Supplementary Table S2. In addition, developing expert-defined features carefully often takes a lot of time and effort, since it requires domain expertise. In contrast, deep neural networks, such as a convolutional neural network, can extract better features and then pass those learned features into a recurrent neural network to detect sleep stages over time. Dataset The datasets used in this paper are from 2 sources: the Massachusetts General Hospital (MGH) sleep laboratory and the Sleep Heart Health Study (SHHS), summarized in Supplementary Table S1. Permissions for the SHHS were obtained via the online portal: www.sleepdata.org. The MGH Institutional Review Board approved retrospective analysis of clinically acquired PSG data without requiring additional consent. These 2 datasets consist of in-lab (MGH) and at-home (SHHS) PSG recordings, which include combinations of electroencephalogram (EEG), respiratory, and electromyogram (EMG) signals. The MGH dataset was scored as part of routine clinical practice by certified sleep technicians using the American Academy of Sleep Medicine (AASM) guidelines. The SHHS dataset was scored using the Rechtschaffen and Kales (R&K) guidelines. R&K scores are converted to AASM scores by combining NREM stages 3 and 4, designated in AASM as the single stage N3. The MGH dataset consists of a mixture of diagnostic, split night, and titration protocols. The SHHS PSGs are all diagnostic. EEG data are used for sleep staging, respiratory channels are used for apnea detection, and, for the MGH set, the bilateral leg EMG channels are used for limb movement detection. The MGH dataset and SHHS dataset have 2 EEG channels in common (central). All 4 respiratory channels are present in both datasets.
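The R&K-to-AASM conversion described above can be sketched in a few lines. This is only an illustration: the text states solely that R&K stages 3 and 4 are merged into N3, so the stage codes used here (`S1`–`S4`, `REM`, `W`) and the remaining one-to-one mappings are assumptions, and `rk_to_aasm` is a hypothetical helper name.

```python
# Assumed R&K stage codes mapped to AASM labels. Only the S3/S4 -> N3 merge
# is stated in the text; the other entries are assumed one-to-one mappings.
RK_TO_AASM = {
    "W": "W",
    "S1": "N1",
    "S2": "N2",
    "S3": "N3",   # R&K stage 3 ...
    "S4": "N3",   # ... and stage 4 collapse into the single AASM stage N3
    "REM": "R",
}


def rk_to_aasm(rk_stages):
    """Convert a sequence of R&K epoch labels to AASM epoch labels."""
    return [RK_TO_AASM[stage] for stage in rk_stages]


if __name__ == "__main__":
    print(rk_to_aasm(["W", "S1", "S2", "S3", "S4", "REM"]))
```

Because the mapping is many-to-one, the conversion cannot be inverted; a model trained on the converted labels never sees the distinction between R&K stages 3 and 4.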
Pressure transducer airflow (PTAF) and EMG channels are available in the MGH dataset only. Classification targets Different target labels are modeled for the 3 scoring tasks. For sleep staging, EEG signals are scored in non-overlapping 30-second epochs according to AASM standards as one of 5 stages: wake (W), rapid eye movement sleep (R), non-REM stage 1 (N1), non-REM stage 2 (N2), and non-REM stage 3 (N3). Thus, sleep staging is formulated as a 5-class classification problem. For respiratory event detection, we consider the following classes: obstructive apnea, central apnea, mixed apnea, and hypopnea (defined using the 4% desaturation rule). We combine these different respiratory event class labels into a single class (apnea event) and thus perform a binary classification, ie, presence or absence of a respiratory event, to mimic the clinical use of the composite apnea-hypopnea index (AHI). Event detections are performed in consecutive, non-overlapping 1-second intervals. Event detections in consecutive time windows are merged into a single “apnea event” in order to calculate the AHI, defined as the total number of apneas during sleep divided by the number of hours of sleep (ie, in sleep stage N1, N2, N3, or R). Calculation of AHI depends on the results of automated sleep staging, which are needed to calculate the total sleep time (the sum of N1-N3 and R). For limb movement detection, EMG signals are marked for presence or absence of limb movement events. The majority (>90%) are periodic, and because isolated limb movements have similar properties from a signal standpoint, we combine both into a single label. Limb movement detection is, therefore, formulated as a binary classification problem in which we detect the presence or absence of limb movement events. Limb movement detections are performed in consecutive, non-overlapping 1-second intervals. As with apnea detection, limb movements detected in consecutive seconds are merged into a single event.
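The event merging and per-hour index described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the rule that an event counts as occurring "during sleep" when its first second falls in a sleep epoch is an assumption, since the text does not specify how events spanning wake and sleep are handled.

```python
SLEEP_STAGES = {"N1", "N2", "N3", "R"}  # stages that count toward total sleep time


def merge_events(flags):
    """Merge runs of consecutive positive 1-second detections into events.

    Returns a list of (start_second, end_second) pairs, end exclusive.
    """
    events, start = [], None
    for t, flag in enumerate(flags):
        if flag and start is None:
            start = t                      # a new run begins
        elif not flag and start is not None:
            events.append((start, t))      # the run ended at second t
            start = None
    if start is not None:                  # run still open at end of recording
        events.append((start, len(flags)))
    return events


def events_per_hour(flags, stages, epoch_sec=30):
    """Events per hour of sleep (AHI for apnea flags, LMI for limb flags).

    flags: per-second binary detections; stages: per-epoch stage labels.
    Assumption: an event is counted if its onset lies in a sleep epoch.
    """
    sleep_hours = sum(s in SLEEP_STAGES for s in stages) * epoch_sec / 3600
    if sleep_hours == 0:
        return 0.0
    n_events = 0
    for start, _ in merge_events(flags):
        epoch = min(start // epoch_sec, len(stages) - 1)
        if stages[epoch] in SLEEP_STAGES:
            n_events += 1
    return n_events / sleep_hours


if __name__ == "__main__":
    stages = ["W"] * 60 + ["N2"] * 60      # 30 min awake, then 30 min asleep
    flags = [0] * 3600
    flags[10:25] = [1] * 15                # detection run while awake
    flags[2000:2015] = [1] * 15            # detection run during N2
    print(events_per_hour(flags, stages))
```

The same function serves both indices because the AHI and the LMI share one definition (merged events per hour of sleep) and differ only in which detector produced the per-second flags.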
Limb movement burden is quantified by the limb movement index (LMI), the number of limb movements per hour of sleep. Calculation of the LMI, like that of the AHI, depends on the results of automated sleep staging. Data preparation EEG data in PSG consist of signals from 6 channels, ie, F3, F4, C3, C4, O1, and O2, each referenced to the contralateral mastoid. In Supplementary Figure S1, we show a schematic of the locations of the electrodes. While the MGH dataset has 6 electrodes, the SHHS dataset has only 2 EEG electrodes (C3, C4). Both the MGH and SHHS datasets contain the following respiratory signals: chest belt, abdomen belt, SaO2 (oximetry), and airflow. The pressure transducer airflow (PTAF) present in the diagnosis phase of the MGH set is not used in the final model, since including it yields no significant performance improvement (data not shown). The left and right anterior tibialis (LAT and RAT) EMG channels for limb movement detection are present in the MGH dataset only. The sampling frequency of the data is 200 Hz. We use both raw waveform and spectrogram representations of the data as inputs for our models. For the spectrogram representation of EEG and EMG data, we segment each 30-second epoch into 29 subepochs of 2 seconds duration with 1-second overlap. For each 2-second subepoch, we use Thomson’s multitaper method to estimate the power spectral density (PSD), with the following parameters: window length T = 2 s, time-bandwidth product TW = 3, number of tapers K = 5.32–34 For respiratory signals, the parameters are T = 30 s, TW = 1.5, K = 2. We split our datasets into train and test sets using 90/10 percentage splits of the original cohorts. Model performance is evaluated on the test sets. There is no overlap between test and training sets. As the MGH dataset has 10 000 PSGs, the train set consists of 9000 cases, and the test set consists of 1000 cases. The SHHS dataset has 5804 PSGs, so the train set has 5224 and the test set has 580 cases.
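The subepoch segmentation used for the EEG and EMG spectrograms can be checked with a short sketch. It covers only the windowing arithmetic (2-second windows advanced in 1-second steps over a 30-second epoch sampled at 200 Hz); the multitaper PSD estimate itself is not reproduced here, and `subepoch_bounds` is a hypothetical helper name.

```python
def subepoch_bounds(epoch_sec=30, win_sec=2, step_sec=1, fs=200):
    """Return (start, end) sample indices of each subepoch, end exclusive.

    Defaults follow the text: 30-second epochs, 2-second windows with
    1-second overlap (ie, a 1-second step), sampled at 200 Hz.
    """
    n_samples = epoch_sec * fs   # samples in one epoch
    win = win_sec * fs           # samples in one subepoch
    step = step_sec * fs         # samples between consecutive subepoch starts
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, step)]


if __name__ == "__main__":
    bounds = subepoch_bounds()
    print(len(bounds), bounds[0], bounds[-1])
```

With these defaults the number of windows is (30 − 2)/1 + 1 = 29, matching the 29 subepochs stated in the text, and each window holds 400 samples.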
Sample selection The sleep staging task has five different target classes: N1, N2, N3, R and W. These classes have approximately 19 million, 75 million, 22 million, 21 million, and 18 million 30 second epochs from the MGH dataset for N1-N3, R and W, respectively. Similarly, the SHHS dataset consists of approximately 11 million, 46 million, 9 million, 8 million, and 11 million, 30 second epochs for N1-N3, R and W, respectively. For sleep apnea detection, the MGH dataset contains approximately 2 million respiratory events. Similarly, the SHHS dataset has about 650 000 apnea events. For limb movement detection, the MGH data has approximately 2.7 million limb movement events. Training algorithms We combine two different primary types of neural networks in all experiments. We use a convolutional neural network (CNN) and recurrent neural network (RNN). We refer to the combination of these models as RCNN. The combination of CNN with RNN enables us to extract features from raw data using the CNN and to model long-range temporal dependencies present in the data with the RNN. The CNN module contains 2 filter sizes (100 and 200 dimensions) to capture patterns across different time scales, which we empirically find to have better performance than just a single filter size. The details of the RCNN architecture used in our experiments are presented in Supplementary Figure S1. For sleep staging, the input for the CNN is the spectrogram representation of the EEG signal. Similarly, for AHI detection, we provide 60 second blocks of respiratory signal data or spectrogram representation of these channels. For the limb movement detection task, we provide 60 second blocks of EMG (LAT, RAT) raw waveform or spectrogram representation to the CNN. Our models are trained using backpropagation. We use cross-entropy as the loss function to train the models. 
The categorical cross-entropy loss is given by $H_{y'}(y) = -\sum_i y'_i \log(y_i)$, where y is the predicted probability distribution and y' is the true distribution. We adopt batch normalization (BN) after each convolution and before activation.35 We initialize the weights as in reference 36 and train all neural networks from scratch. We use stochastic gradient descent (SGD) with a mini-batch size of 100. The learning rate starts from 0.1 and is divided by 10 when the error plateaus. We use a weight decay of 0.0001 and a momentum of 0.9. We perform 50 iterations of random search over a set of parameter choices for hyper-parameter tuning. All models are implemented using PyTorch (http://pytorch.org/). All experiments are conducted on a server with an Intel Xeon E5-2640, 256 GB RAM, four NVIDIA Titan X GPUs, and CUDA 8.0. Scoring algorithms Given an input sample, the trained model outputs a probability distribution over the possible target classes. In the sleep stage detection model, the model provides the probability distribution over the 5 AASM sleep stages. Similarly, in the AHI and LMI detection tasks, the model provides a probability that the sample is an apnea or a limb movement event. We use a sliding window to merge adjacent 1-second output decisions into a single detected apnea or limb movement event. This allows us to compare annotations from the RCNN directly with those from experts, since experts label entire events (eg, by marking the beginning and ending of an apnea) rather than independently labeling 1-second intervals. Expressing detections as single merged events also allows us to calculate the clinically relevant measures of apnea and limb movement abnormality, AHI (apnea-hypopnea index) and LMI (limb movement index), which are the number of apneas or limb movements, respectively, per hour of sleep.
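As a worked illustration of the categorical cross-entropy loss defined under Training algorithms, the sketch below evaluates it for a single 5-class sleep staging prediction. It is a plain-Python sketch rather than the PyTorch loss the authors presumably used; the function name and the small `eps` clamp (added to avoid taking the logarithm of zero) are assumptions.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy: -sum_i y_true[i] * log(y_pred[i]).

    y_true: true distribution (one-hot for a hard expert label).
    y_pred: predicted probability distribution over the same classes.
    eps is an assumed numerical safeguard against log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


if __name__ == "__main__":
    # Class order assumed here: W, N1, N2, N3, R. Expert label is N2.
    y_true = [0, 0, 1, 0, 0]
    y_pred = [0.1, 0.2, 0.5, 0.1, 0.1]
    print(cross_entropy(y_true, y_pred))  # equals -log(0.5)
```

For a one-hot label, the sum reduces to the negative log-probability the model assigns to the expert's class, so the loss falls to zero only when the model puts all of its probability on that class.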
Evaluation To measure performance on sleep stage classification, we use the overall classification accuracy and the classification accuracy broken down by stage, shown as a confusion matrix. Element (i, j) of each confusion matrix represents the empirical probability of predicting class j given that the ground truth (expert label) is class i. To measure performance on apnea classification, we use the correlation value (r²) between the algorithm-predicted AHI and the AHI computed from expert-scored PSGs, where AHI = (apnea + hypopnea events)/hours of sleep. To measure performance in the limb movement detection task, we calculate the correlation value (r²) between the algorithm-predicted LMI and the LMI based on expert scoring of PSGs, where LMI = (number of leg movement events)/hours of sleep. Cross dataset experiments We evaluate our models for sleep stage detection and apnea detection in both the MGH and the SHHS datasets in the supplemental material (Supplementary Tables S2–S4). In Supplementary Table S2, we present the accuracy of models trained using MGH data and tested on SHHS, trained using SHHS and tested using MGH, and trained using the combination of MGH and SHHS and tested on MGH or on SHHS. We also show the test performance of the MGH model using only frontal channels to simulate sleep monitoring using home monitoring devices. Supplementary Table S3 shows the AHI estimation in different test-train and limited-channel contexts. Unlike multi-channel PSG data, home monitoring data often come from a single channel, such as an abdomen or chest belt. To simulate home monitoring, we assess how well models trained on a single channel (either abdomen or chest belt) perform in comparison to models given access to multi-channel PSG data. Finally, Supplementary Table S4 shows model performance for limb movement detection on MGH data.
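The sleep staging metrics described under Evaluation can be sketched as below: a count matrix indexed by (expert label, predicted label), its row-normalized form giving the empirical probability of predicting class j when the expert label is class i, and the overall accuracy. The function names and the toy label sequences are illustrative assumptions, not data from the study.

```python
STAGES = ["W", "N1", "N2", "N3", "R"]


def confusion_counts(expert, predicted, classes=STAGES):
    """counts[i][j] = number of epochs with expert class i predicted as class j."""
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for e, p in zip(expert, predicted):
        counts[index[e]][index[p]] += 1
    return counts


def row_normalize(counts):
    """Convert counts to P(predicted = j | expert = i); empty rows stay zero."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def overall_accuracy(counts):
    """Fraction of epochs on the diagonal (algorithm agrees with expert)."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[k][k] for k in range(len(counts)))
    return correct / total


if __name__ == "__main__":
    expert = ["W", "W", "N1", "N2", "N2", "N2", "N3", "R"]
    predicted = ["W", "N1", "N1", "N2", "N2", "N1", "N3", "R"]
    counts = confusion_counts(expert, predicted)
    print(overall_accuracy(counts))
    for stage, row in zip(STAGES, row_normalize(counts)):
        print(stage, row)
```

Row normalization makes per-stage agreement comparable across stages with very different epoch counts, which matters here because N2 epochs far outnumber the other stages.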
Results Our data consisted of 10 000 clinical PSGs performed at the Massachusetts General Hospital Sleep Laboratory (MGH data), split into 9000 training and validation PSGs and 1000 PSGs held out for testing. We utilized a convolutional neural network (CNN) to model the local spatiotemporal characteristics of 30-second PSG epochs, combined with a recurrent neural network (RNN) to model long-range temporal dependencies. Figure 1 shows the RCNN system architecture. Our dataset was composed of PSGs labeled by certified sleep technologists, following the American Academy of Sleep Medicine (AASM) standards.37 The RCNN was trained to use 6 EEG channels to assign each 30-second PSG epoch to one of 5 sleep stages: awake (W), rapid eye movement (REM) sleep (R), and non-REM stages 1-3 (N1-N3). In addition, the RCNN was trained to use 5 respiratory channels to detect apnea events, quantified as the apnea-hypopnea index (AHI; events/hour of sleep), and to use the leg EMG channels to detect limb movement events, quantified as the limb movement index (LMI; events/hour of sleep).
Figure 1. Deep RCNN layout for automated polysomnography analysis. a. Data are recorded during sleep by sensors that measure brain activity (electroencephalography, EEG), eye movements (electrooculography, EOG), oronasal airflow, heart rhythm (electrocardiography, ECG), blood oxygenation (pulse oximetry), respiration (chest and abdominal belts), and limb movements (limb electromyography, EMG, placed over the anterior tibialis muscles). b. Examples of some of the signals and event labels provided by experts. Top: hypnogram showing sleep stages, and the corresponding spectrogram for one of the 6 EEG channels. Middle: apnea events (black bars) and corresponding spectrogram for the chest belt signal. Bottom: limb movement events (black bars) and corresponding spectrogram for one of the limb EMG signals. c. Close-ups, showing details of the selected signals and labeled events. d.
Architecture of the RCNN model. Signals from consecutive epochs (xi) are sequentially fed into a convolutional neural network (CNN) module. The CNN output is fed into a bidirectional recurrent neural network, which outputs a sequence of inferred labels: sleep stages, apnea detections, and PLM detections. Details of the CNN architecture are provided in the supplemental material.

For sleep staging, the RCNN achieved an overall accuracy of 87.5% [84.2, 90.9], which compares favorably to human expert performance38,39 (Figure 2a). The RCNN also significantly outperformed classical machine learning methods such as logistic regression (accuracy 69.34%) and random forest (accuracy 74.52%), as shown in Supplementary Table S2. Besides achieving lower accuracy, classical machine learning methods require expert-defined features, which are not always available, as is the case for AHI and LMI prediction. AHI inferred by the RCNN strongly correlated with expert scoring (r² = 0.85) (Figure 2b). Converting AHI values into standard clinical categories of mild, moderate, and severe disease, the RCNN achieved an overall diagnostic accuracy of 88.2% [84.7, 91.4] (Table 1). Importantly, when the apnea severity inferred by the RCNN disagreed with experts, misclassification was mainly to an adjacent severity category (Figure 2c). We used the desaturation criteria (rather than arousal criteria) for calculating AHI events, as inter-rater reliability is higher for desaturation criteria.40 The predicted LMI correlated strongly with expert scoring (r² = 0.79; Figure 2d). This level of performance is comparable with expert performance, though annotation performance for limb movements is less well studied, particularly in subjects with concurrent sleep apnea.41

Table 1. Generalization experiments when applying models trained on clinical data (MGH) to the MGH and SHHS test sets

| Task | Experiment setup | Accuracy | Kappa |
| --- | --- | --- | --- |
| Sleep staging | Train and test on MGH (6 channels) | 87.5% | 80.5 |
| | Train on MGH and test on MGH (2 channels) | 81.9% | 76.4 |
| | Train on MGH and test on SHHS (2 channels) | 77.7% | 73.2 |
| Sleep apnea detection | | Accuracy | r² (AHI) |
| | Train and test on MGH | 88.2% | 0.85 |
| | Train on MGH and test on SHHS | 80.2% | 0.77 |
| Limb movement detection | | Accuracy | r² (LMI) |
| | Train and test on MGH data | 84.7% | 0.79 |

Note: Accuracy is measured as the percent agreement between labels inferred by the algorithm and expert labels. For apnea and limb movement detection, accuracy is measured both by the correlation (r²) with expert scores of the algorithm’s estimate of the number of events per hour of sleep (apnea-hypopnea index [AHI] or limb movement index [LMI]), and by the expert-algorithm agreement regarding categorization of the event burden as mild, moderate, or severe. Cohen’s kappa is provided as a complementary measure of accuracy that takes into account the probability of agreement occurring by chance.

Figure 2. Classification performance of the RCNN for polysomnography scoring. The labels inferred by the RCNN are tested against the annotations of medical experts. a. Confusion matrix for sleep staging, showing RCNN agreement with expert scores. Sleep experts score each 30-second EEG epoch as 1 of 5 sleep stages: awake (W), non-REM stage 1, 2, or 3 (N1, N2, and N3), or rapid eye movement sleep (R). The RCNN outputs a probability for each stage, and we compare the highest-probability class against the expert’s score for each epoch. The RCNN’s labels show >80% agreement for all classes except N1, comparable to levels of agreement between human experts. b. Sleep apnea events are detected by the RCNN in 1-second epochs, and the AHI (apnea-hypopnea index: number of RCNN-detected apnea events per hour of sleep) is plotted against the AHI estimated from expert PSG scores. The correlation between expert and RCNN AHI scores is shown. c.
Confusion matrix for the classification of AHI severity (none, <5; mild, 5-15; moderate, 15-30; severe, >30 per hour), comparing AHI scores inferred by the RCNN against expert scores. d. Limb movements are detected in consecutive 1-second intervals, and the total burden of limb movements is summarized as the limb movement index (LMI; number of limb movements per hour of sleep). The LMI inferred by the RCNN is compared with scores from sleep experts.

To further validate the RCNN’s generalization capability, we evaluated the performance of sleep staging and sleep apnea detection on an independent set of publicly available PSGs (SHHS data; n = 5804; www.sleepdata.org). The SHHS utilizes limited-channel EEG data (2 central channels), and respiratory effort, airflow, and oximetry channels, but does not include limb electromyogram (EMG) signals. First, we tested the MGH-trained RCNN on 1000 randomly selected PSGs from SHHS. To enable testing on SHHS, we first retrained the sleep staging RCNN on the MGH training data while allowing access to only 2 central EEG channels to mimic the SHHS EEG configuration. For sleep staging, the MGH-trained RCNN achieved an accuracy of 77.7% [74.3, 79.7] on the SHHS test PSGs. Next, we applied the MGH-trained AHI prediction model to the SHHS test set, which also demonstrated a strong correlation with expert labels (r² = 0.77 [0.72, 0.79]). By comparison, on the MGH test data, the limited-channel RCNN classified sleep stages with 81.9% [78.2, 84.9] overall accuracy, and AHI with r² = 0.85 [0.83, 0.87].

To compare generalization capabilities of RCNNs trained with real-world clinical PSG data (MGH) vs standardized clinical trial data (SHHS), we evaluated sleep staging and apnea detection in cross-training experiments: train on MGH data, test on both MGH and SHHS data; then train on SHHS data, and test on MGH and SHHS data. In all experiments, the PSG sets used for training and testing are kept constant.
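As an illustration of how the severity categories in Figure 2c and the agreement measures reported for these experiments relate to one another, the following is a minimal sketch, not the authors' implementation. It assumes the band edges given in the Figure 2c caption (the handling of values exactly at 5, 15, or 30 is our assumption), and the function names and AHI values are hypothetical and made up for the example.

```python
from collections import Counter


def events_per_hour(n_events, sleep_seconds):
    """Index such as AHI or LMI: number of events per hour of sleep."""
    return n_events / (sleep_seconds / 3600.0)


def ahi_severity(ahi):
    """Map an AHI value to the severity bands named in the Figure 2c caption.

    Assumption: values of exactly 5 and 15 go to the higher band, and
    exactly 30 stays "moderate" because the caption gives severe as >30.
    """
    if ahi < 5:
        return "none"
    if ahi < 15:
        return "mild"
    if ahi <= 30:
        return "moderate"
    return "severe"


def percent_agreement(a, b):
    """Fraction of items on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's label frequencies.
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used a single identical label; treated here as
        # perfect agreement (an assumption for this degenerate case).
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Made-up AHI values for six recordings (not data from the study).
expert_ahi = [2.0, 8.0, 20.0, 40.0, 12.0, 33.0]
model_ahi = [3.0, 16.0, 22.0, 35.0, 9.0, 28.0]

expert_cat = [ahi_severity(v) for v in expert_ahi]
model_cat = [ahi_severity(v) for v in model_ahi]

print("agreement:", round(percent_agreement(expert_cat, model_cat), 3))
print("kappa:", round(cohen_kappa(expert_cat, model_cat), 3))
```

On these made-up values the two scorers agree on 4 of 6 recordings, and both disagreements fall in an adjacent severity band; kappa is lower than the raw agreement (4/7, about 0.57) because part of that agreement is expected by chance.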
Results are shown in Table 1 (with additional experiments shown in Supplementary Tables S2–S4). In all cases, models trained with MGH data performed well on test sets from both MGH and SHHS, confirming the importance and sufficiency of large heterogeneous datasets, even when they derive from routine clinical practice settings, for robust model training.

We next investigated the features learned by the RCNN for scoring sleep stages (Figure 3a), AHI (Figure 3b), and LMI (Figure 3c), using t-SNE (t-distributed Stochastic Neighbor Embedding).42 Each point represents a signal segment projected from the 124-dimensional output of the RCNN’s last hidden layer onto a plane. The RCNN has learned to form well-separated clusters of points from signals belonging to the same annotation classes.

Figure 3. t-SNE visualization of the last hidden layer representations in the CNN. Here we show the CNN’s internal representation of a) sleep stages, b) apnea events, and c) limb movements. Points are obtained by applying t-SNE, a method for visualizing high-dimensional data, to the last hidden layer representation in the RCNN for each model. Colored points represent the different event types, showing how the algorithm learns to cluster the signals. Nearby waveforms show typical examples from each cluster.

Discussion

Our results demonstrate human-level performance of deep learning algorithms trained on large PSG datasets to replicate the primary categories of scoring: sleep stages, sleep apnea, and limb movements. Our large PSG sample size allows cross-validation steps during training (to minimize the risk of model overfitting), and testing on an independent, held-out set of 1000 PSGs to obtain unbiased estimates of performance. The clinical heterogeneity and lack of special selection or exclusion of cases support the generalization of performance when trained on the MGH set and tested on the independent research cohort of the SHHS.
External validation of this kind is crucial to address a common criticism of scoring automation: will the algorithm performance be robust when applied broadly? Our results also address one form of conventional wisdom: that careful standardization of PSG recording conditions and homogeneity of patient characteristics are critical to obtaining generalizable algorithms. Our results suggest that, given sufficiently large datasets, training on real-world data can yield human-level performance and generalize to standardized data sets (such as SHHS). The capacity to generalize is a prerequisite for algorithm deployment in real-world settings, especially in medical diagnostics that routinely encounter heterogeneous pathophysiology. Further, the availability of clinical datasets obtained in routine practice in principle far exceeds that of research studies, and our results provide motivation to utilize such “in-hand” data to develop predictive algorithms.

Feature selection is another key problem in the application of supervised machine learning. Although it is natural to assume that features informed by experts with domain knowledge, sometimes described as “feature engineering,” ought to be an important component in developing machine learning algorithms, our results show that deep learning models can learn better features than humans for this specific task. Specifically, we train our deep learning algorithms using generic features, as well as direct time series data, and obtain human-level scoring accuracy. The advantages of this approach include the minimization of bias, as well as reduced burden on human capital, which can be spent more efficiently on preparation and interpretation.

In addition to automation of in-lab scoring, the accuracy of portable sleep recording systems stands to directly benefit from reliable and robust algorithms.
Because portable systems reach a far larger audience, whether clinical or consumer in nature, robust and scalable scoring is necessary, if only to accommodate the increased scale. The minor reduction in accuracy when moving from 6- to 2-channel EEG for staging, and from 5 respiratory channels to 1 for sleep apnea, is still on par with the level of accuracy attained by experts. These results suggest that accurate automated analysis of sleep stages and apnea is attainable with limited-channel devices such as those available for at-home use. Improvement of classification with limited channels has important implications for clinical diagnostics such as home sleep apnea testing kits,43 as well as consumer-facing devices,44 which the Food and Drug Administration is showing increasing willingness to consider for some medical uses (for example, arrhythmia detection).45

In summary, our deep network is accurate and scalable, and can be deployed on multi-channel (eg, in-lab PSG) or limited-channel (eg, portable) acquisition systems. The potential for substantial clinical impact includes broadening the reach of clinical sleep medicine, augmenting clinical decision-making for sleep specialists, and improving the accuracy and reliability of at-home portable systems. Further work should focus on integrating this new technology into specific monitoring devices and optimizing performance in real-world clinical settings. The ability to automate overnight PSG scoring with the accuracy of a sleep specialist has the potential to expand access to essential medical care.

Funding

Dr Bianchi has received funding from the Center for Integration of Medicine and Innovative Technology, the Milton Family Foundation, the MGH-MIT Grand Challenge, the American Sleep Medicine Foundation, and the Department of Neurology. Dr Westover has received funding from NIH-NINDS (1K23NS090900).
Dr Sun received funding from the National Science Foundation (IIS-1418511, CCF-1533768), NIH (1R01MD011682-01, R56HL138415), Children’s Healthcare of Atlanta, and UCB. This was not an industry supported study.

Contributors

Biswal implemented the algorithms and conducted the experiments. Biswal, Westover, J. Sun and Bianchi developed methods. H. Sun and Goparaju extracted and provided the data. All authors were involved in drafting the paper.

Competing interests

Dr Bianchi has a patent pending on a home sleep monitoring device, has research agreements with MC10 and Insomnisolv, and consulting agreements with McKesson, International Flavors and Fragrances, and Apple Inc., serves as a medical monitor for Pfizer, and has provided expert testimony in sleep medicine. This was not an industry supported study, and none of these entities had any role in the study.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Abbreviations

AASM: American Academy of Sleep Medicine
AHI: apnea-hypopnea index
CNN: convolutional neural network
EEG: electroencephalogram
EMG: electromyogram
EOG: electrooculogram
LMI: limb movement index
MGH: Massachusetts General Hospital
NREM: non-rapid eye movement
PSG: polysomnogram
R&K: Rechtschaffen and Kales
REM: rapid eye movement
RNN: recurrent neural network
SHHS: Sleep Heart Health Study

References

1. Buysse DJ. Insomnia. JAMA 2013; 309 (7): 706–16.
2. Iranzo A. Sleep in neurodegenerative diseases. Sleep Med Clin 2016; 11 (1): 1–18.
3. Kapur VK. Obstructive sleep apnea: diagnosis, epidemiology, and economics. Respir Care 2010; 55 (9): 1155–67.
4. Budhiraja R, Budhiraja P, Quan SF. Sleep-disordered breathing and cardiovascular disorders. Respir Care 2010; 55 (10): 1322–32; discussion 30–2.
5. Tregear S, Reston J, Schoelles K, Phillips B. Obstructive sleep apnea and risk of motor vehicle crash: systematic review and meta-analysis. J Clin Sleep Med 2009; 5 (6): 573–81.
6. Smolensky MH, Di Milia L, Ohayon MM, Philip P. Sleep disorders, medical conditions, and road accident risk. Accid Anal Prev 2011; 43 (2): 533–48.
7. Skaer TL, Sclar DA. Economic implications of sleep disorders. Pharmacoeconomics 2010; 28 (11): 1015–23.
8. Pietzsch JB, Garner A, Cipriano LE, Linehan JH. An integrated health-economic analysis of diagnostic and therapeutic strategies in the treatment of moderate-to-severe obstructive sleep apnea. Sleep 2011; 34 (6): 695–709.
9. McDaid C, Duree KH, Griffin SC, et al. A systematic review of continuous positive airway pressure for obstructive sleep apnoea-hypopnoea syndrome. Sleep Med Rev 2009; 13 (6): 427–36.
10. Usmani ZA, Chai-Coetzer CL, Antic NA, McEvoy RD. Obstructive sleep apnoea in adults. Postgrad Med J 2013; 89 (1049): 148–56.
11. Leger D, Bayon V. Societal costs of insomnia. Sleep Med Rev 2010; 14 (6): 379–89.
12. Sun H, Jia J, Goparaju B, et al. Large-scale automated sleep staging. Sleep 2017; 40 (10): doi:10.1093/sleep/zsx139.
13. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017; 542 (7639): 115–8.
14. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016; 316 (22): 2402–10.
15. Fraiwan L, Lweesy K, Khasawneh N, Fraiwan M, Wenz H, Dickhaus H. Classification of sleep stages using multi-wavelet time frequency entropy and LDA. Methods Inf Med 2010; 49 (3): 230–7.
16. Lajnef T, Chaibi S, Ruby P, et al. Learning machines and sleeping brains: automatic sleep stage classification using decision-tree multi-class support vector machines. J Neurosci Methods 2015; 250: 94–105.
17. Liang SF, Kuo CE, Hu YH, Cheng YS. A rule-based automatic sleep staging method. J Neurosci Methods 2012; 205 (1): 169–76.
18. Anderer P, Gruber G, Parapatics S, et al. health solution for automatic sleep classification according to Rechtschaffen and Kales: validation study of the Somnolyzer 24 x 7 utilizing the Siesta database. Neuropsychobiology 2005; 51 (3): 115–33.
19. Berthomier C, Drouot X, Herman-Stoica M, et al. Automatic analysis of single-channel sleep EEG: validation in healthy individuals. Sleep 2007; 30 (11): 1587–95.
20. Wang Y, Loparo KA, Kelly MR, Kaplan RF. Evaluation of an automated single-channel sleep staging algorithm. Nat Sci Sleep 2015; 7: 101–11.
21. Hassan AR, Bhuiyan MI. A decision support system for automatic sleep staging from EEG signals using tunable Q-factor wavelet transform and spectral features. J Neurosci Methods 2016; 271: 107.
22. Punjabi NM, Shifa N, Dorffner G, Patil S, Pien G, Aurora RN. Computer-assisted automated scoring of polysomnograms using the somnolyzer system. Sleep 2015; 38 (10): 1555–66.
23. Malhotra A, Younes M, Kuna ST, et al. Performance of an automated polysomnography scoring system versus computer-assisted manual scoring. Sleep 2013; 36 (4): 573–82.
24. Anderer P, Moreau A, Woertz M, et al. Computer-assisted sleep classification according to the standard of the American Academy of Sleep Medicine: validation study of the AASM version of the Somnolyzer 24 x 7. Neuropsychobiology 2010; 62 (4): 250–64.
25. Schaltenbrand N, Lengelle R, Toussaint M, et al. Sleep stage scoring using the neural network model: comparison between visual and automatic analysis in normal subjects and patients. Sleep 1996; 19 (1): 26–35.
26. Younes M, Younes M, Giannouli E. Accuracy of automatic polysomnography scoring using frontal electrodes. J Clin Sleep Med 2016; 12 (05): 735–46.
27. Younes M, Soiferman M, Thompson W, Giannouli E. Performance of a new portable wireless sleep monitor. J Clin Sleep Med 2017; 13 (02): 245–58.
28. Shambroom JR, Fabregas SE, Johnstone J. Validation of an automated wireless system to monitor sleep in healthy adults. J Sleep Res 2012; 21 (2): 221–30.
29. Vilamala A, Madsen KH, Hansen LK. Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring. arXiv:1710.00633, 2017.
30. Zhang J, Wu Y, Bai J, Chen F. Automatic sleep stage classification based on sparse deep belief net and combination of multiple classifiers. Trans Inst Meas Control 2016; 38 (4): 435–51.
31. Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc 2018; 25 (10): 1419–28.
32. Thomson DJ. Spectrum estimation and harmonic analysis. Proc IEEE 1982; 70: 1055–96.
33. Bokil H, Andrews P, Kulkarni JE, Mehta S, Mitra PP. Chronux: a platform for analyzing neural signals. J Neurosci Methods 2010; 192 (1): 146–51.
34. Bokil H, Purpura K, Schoffelen JM, Thomson D, Mitra P. Comparing spectra and coherences for groups of unequal size. J Neurosci Methods 2007; 159 (2): 337–45.
35. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv 2015; 1502.03167v3.
36. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Proc Mach Learn Res 2010; 9: 249–56.
37. Silber MH, Ancoli-Israel S, Bonnet MH, et al. The visual scoring of sleep in adults. J Clin Sleep Med 2007; 3 (2): 121–31.
38. Danker-Hopfe H, Anderer P, Zeitlhofer J, et al. Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. J Sleep Res 2009; 18 (1): 74–84.
39. Magalang UJ, Chen NH, Cistulli PA, et al. Agreement in the scoring of respiratory events and sleep among international sleep centers. Sleep 2013; 36 (4): 591–6.
40. Redline S, Budhiraja R, Kapur V, et al. The scoring of respiratory events in sleep: reliability and validity. J Clin Sleep Med 2007; 3 (2): 169–200.
41. Stefani A, Heidbreder A, Hackner H, Hogl B. Validation of a leg movements count and periodic leg movements analysis in a custom polysomnography system. BMC Neurol 2017; 17 (1): 42.
42. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008; 9: 2579–605.
43. Collop NA, Tracy SL, Kapur V, et al. Obstructive sleep apnea devices for out-of-center (OOC) testing: technology evaluation. J Clin Sleep Med 2011; 7 (5): 531–48.
44. Bianchi MT. Sleep devices: wearables and nearables, informational and interventional, consumer and clinical. Metabolism 2017; doi: 10.1016/j.metabol.2017.10.008.
45. Gottlieb S. FDA Announces New Steps to Empower Consumers and Advance Digital Healthcare. 2017. https://www.fda.gov/NewsEvents/Newsroom/FDAVoices/ucm612014.htm. Accessed October 18, 2018.

Author notes

These authors contributed equally to this work.

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited.
For commercial re-use, please contact [email protected]