
Predictions in Antibiotics Resistance and nosocomial infections monitoring.

Mary Gerontini
Athens University of Economics, Department of Informatics
76, Patission Str., GR10434 Athens, Greece
mgerontini@gmail.com

Michalis Vazirgiannis
Athens University of Economics, Department of Informatics
mvazirg@aueb.gr

Alkiviadis C. Vatopoulos
National School of Public Health, Athens, Greece, Department of Microbiology
196, Alexandras Str., GR11521 Athens, Greece
avatopoulos@esdy.edu.gr

Michalis Polemis
National School of Public Health, Athens, Greece, Department of Microbiology
mpolemis@freemail.gr

Abstract
Nosocomial infections and antibiotic resistance are regarded as critical issues both in clinical medicine and in public health; understanding their epidemiology is therefore a priority in the health sector. Our research aims to demonstrate that data mining techniques such as regression, classification and association rules can assist in discovering interesting patterns in the epidemiological trends of antibiotic resistance in Greek hospitals. In this work, we present a novel framework which integrates data from multiple hospitals and discovers association rules that are stored in a data warehouse. Furthermore, this data warehouse is used as a source for extracting interesting and valid predictions by applying techniques such as regression and classification. Our system is fully operational and processes real-world data from WHONET, a software package installed in the majority of the member hospitals of the Greek System for Surveillance of Antimicrobial Resistance network. The contributions of the proposed framework are i. a standardized workflow for the seamless integration of data produced in various hospitals into a consistent data warehouse and ii. the use of mechanisms that predict hidden future behavior on large datasets, using regression and classification algorithms, which can provide significant surveillance warnings.

Prof. M. Vazirgiannis is partially supported by the DIGITEO Chair grant LEVETONE in France and the Research Centre of the Athens University of Economics and Business, Greece.

1. Introduction
The increasing rate of isolation, over recent years, of antibiotic-resistant bacteria in hospitals and in the community is considered a major public health threat in many parts of the world. The respective infections are difficult to treat, since few antibiotics remain active against the infectious agent. Moreover, these antibiotics are expensive and in a few cases toxic and pharmacokinetically/pharmacodynamically less appropriate. Antibiotic resistance is the result of various genetic alterations in the bacterial cell, such as mutations of the target site of the antibiotic, the acquisition of efflux pumps and, more importantly, the acquisition through horizontal gene transfer among bacteria of genes encoding enzymes that destroy the respective antibiotic, such as beta-lactamases. In that respect, the epidemiology of antibiotic resistance is the result of the spread of pathogenic bacteria evolving through mutations and/or acquisition of genes. Surveillance of antibiotic resistance is based on monitoring this mobility of bacteria, as well as of genes, and is carried out either through the collection and study of representative bacterial isolates or through the analysis of routinely collected data from the microbiology laboratories of the hospitals.

In Greece, a national network for continuous monitoring of bacterial antibiotic resistance in the Greek hospitals (Greek System for the Surveillance of Antimicrobial Resistance) has been in place since 1995 [6]. Its function is based on the assumption that the routine results of the antibiotic sensitivity tests performed daily in each hospital clinical laboratory should be considered a major resource for antibiotic resistance surveillance. Moreover, since the quality and compatibility of these data are in principle uncertain, our approach is to work in parallel on both accessing the data and assessing their quality. This is accomplished through the establishment of a quality control procedure and the adoption of a common code and data format in all hospitals through the use of the WHONET software [7], originally developed by the WHO Collaborating Centre for Surveillance of Antibiotic Resistance in Boston, USA, and further developed in the Division of Emerging and other Communicable Diseases Surveillance and Control, WHO (WHO/EMC), Geneva, Switzerland [7][8]. WHONET was adopted as a common software platform because of its user-friendly, flexible features, its support for expanding the pyramidal reporting structure and its capability to interface with other statistical packages and programs. Data are collected from all sources every 6 months and analyzed, and the relevant reports are published on the respective Web site: www.mednet.gr/whonet.

However, the complexity of the antibiotic resistance phenomenon, which involves many bacterial species, evolving bacterial clones and horizontally transferred genes, motivated the pursuit of techniques for further analyzing these data in order to reveal hidden associations, time trends and time/space clustering that are important for an effective strategy to confront the antibiotic resistance epidemic.
For the above reasons, numerous data mining algorithms have recently been used to extract knowledge from large databases [3][4][5]. Since traditional manual activities such as antibiogram summaries have proven to be time consuming, the resulting measures and patterns are often not up to date and many useful patterns remain undiscovered. We therefore designed and developed a web-based framework that contributes to antibiotic resistance surveillance by identifying outbreaks in antibiotic resistance and applying extensive analysis of hospital data. In addition, other studies [2][3] have applied data mining to such data. Our system:

i. supports data collection from multiple hospitals via a user-friendly interface with noise-cleaning capabilities; the data are stored in a central data warehouse;
ii. extracts useful, previously unknown patterns based on state-of-the-art data mining algorithms (such as association rules, Support Vector Machines and linear regression) in order to build antibiotic sensitivity prediction models as well as forecasts of nosocomial infections; and
iii. provides an advanced visualization and reporting mechanism via customizable graphs, in order to present critical information to experts instantly and thus make data management and decision making easier and more effective.

The paper is organized as follows: in Section 2 we present the architecture of our model and the data format, in Section 3 we present the results of our manifold data analysis, and in Section 4 we conclude with a brief summary and further research directions.

2. System Model

Our system is a web-based framework which manages the collection and integration of incoming public health data from multiple hospitals. While in a previous work [2] we reported on a system to extract association rules, here we extend this idea by providing new features for prediction and visualization of data that capitalize on these association rules. Specifically, we provide visualization of the temporal validity of these rules. We also attempt predictions regarding a. the future validity of the rules and b. the antibiotic resistance of bacteria, based on the public health data stored in a data warehouse.

2.1 Data

The data set used in this work was collected via the WHONET system during the period 2003-2009. The data are stored, preprocessed, cleaned and formatted using an interface which we developed for this purpose. Each record describes a bacterial strain isolated from a patient together with its sensitivity test, and contains the following attributes: organism group (bacterial species), specimen group, the department in which the patient was hospitalized, the period in which the strain was isolated (we use trimesters in our implementation) and the resistance to the antibiotics tested against the specific bacterium; see Table 1 for a summary of the data we capitalize on. The data set consisted of 1768 training instances (retrieved association rules), covering 442 organism groups, 53 specimen groups, 56 hospitals in Greece and 41 types of antibiotics.
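To make the record layout described above concrete, the following is a minimal Python sketch of one isolate record and of a helper that maps a calendar month to a three-month period label. The class name `IsolateRecord`, its field names, the function `period_label` and the label style "1-3/2003" are assumptions made for illustration; they are not taken from the WHONET file format or from the system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolateRecord:
    """One sensitivity-test result for one isolated strain (hypothetical layout)."""
    specimen_group: str   # e.g. "urine"
    hospital: str         # e.g. "GR61"
    department: str       # e.g. "out"
    period: str           # three-month period label, e.g. "4-6/2003"
    organism_group: str   # e.g. "E.coli"
    antibiotic: str       # e.g. "CLI"
    resistance: str       # "R", "I" or "S"


def period_label(month: int, year: int) -> str:
    """Return the three-month period that contains `month` (assumed label style)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    start = 3 * ((month - 1) // 3) + 1  # 1, 4, 7 or 10
    return f"{start}-{start + 2}/{year}"


record = IsolateRecord(
    specimen_group="urine",
    hospital="GR61",
    department="out",
    period=period_label(5, 2003),
    organism_group="E.coli",
    antibiotic="CLI",
    resistance="R",
)
print(record)
```

Keeping the period as a single categorical label lets it be used directly as an item in an association rule or as a nominal attribute in a classifier.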

Name                    Values
Specimen Group          wound, blood, urine
Hospital                GR0001, GR002, ...
Period                  1-3/2003, 4-7/2003, ...
Department              icu, meth, out, ...
Organism Group          E. Coli, ...
Antibiotic Resistance   Resistant, Intermediate, Sensitive, ...

Table 1. Data attributes and values

2.2 Extraction of Association Rules

An initial data mining step involved the extraction of association rules (a very popular technique for discovering correlations), which represent non-obvious relations and hidden patterns in public health data. The produced rules are aggregated and stored in an appropriate warehouse which provides easy access to them. The algorithm used to produce the association rules is Apriori [2][3]; its pseudo code is given in Figure 1. The extracted rules have the specific format Specimen, Hospital, Department, Period, Organism → AntibioticResistance or Specimen, Hospital, Department, Period → Organism, and are used for further statistical analysis in order to predict the future behavior of these rules. Samples of the extracted rules are displayed in Tables 2 and 3.

LHS                                                  RHS
Spc.   Hosp.   Dep.   Per.     Path.    Antib.       Res.
UR     GR61    out    1-6/05   E.coli   CLI          R

Table 3. Sample for second type of extracted association rules
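The level-wise idea behind Apriori, and the two interestingness measures used for the rules (confidence and leverage), can be sketched as follows. For a rule A → B, confidence is supp(A ∪ B) / supp(A) and leverage is supp(A ∪ B) − supp(A)·supp(B). This is a minimal Python sketch under stated assumptions, not the system's implementation: the toy records, the attribute=value item encoding, the function names and the 0.5 support threshold are all made up for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while level:
        # Keep the size-k candidates that reach the support threshold.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join frequent k-itemsets into (k+1)-candidates and prune every
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        level = set()
        for a in kept:
            for b in kept:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in kept for sub in combinations(union, k - 1)
                ):
                    level.add(union)
    return frequent


def confidence(lhs, rhs, transactions):
    """supp(lhs and rhs together) / supp(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


def leverage(lhs, rhs, transactions):
    """supp(lhs and rhs together) - supp(lhs) * supp(rhs)."""
    joint = support(lhs | rhs, transactions)
    return joint - support(lhs, transactions) * support(rhs, transactions)


# Toy records (made up): each record is a set of attribute=value items.
records = [
    {"spec=urine", "hosp=GR61", "dep=out", "org=eco"},
    {"spec=urine", "hosp=GR61", "dep=out", "org=eco"},
    {"spec=urine", "hosp=GR61", "dep=icu", "org=kpn"},
    {"spec=blood", "hosp=GR61", "dep=icu", "org=eco"},
]
frequent = apriori(records, min_support=0.5)

lhs = frozenset({"spec=urine", "dep=out"})
rhs = frozenset({"org=eco"})
print(confidence(lhs, rhs, records))  # 1.0
print(leverage(lhs, rhs, records))    # 0.125
```

A leverage above zero indicates that the two sides of the rule occur together more often than they would if they were independent, which is why it complements confidence when rules are tracked over time.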

Figure 1. Apriori Algorithm

LHS                                             RHS
Specimen   Hospital   Depart.   Period          Pathogen
Genital    GR61       out       4-6/2003        Strept.

Table 2. Sample for first type of extracted association rules

2.3 Prediction methods design

The temporal dimension of the nosocomial infections and antibiotic resistance data is a critical one. We attempt, via state-of-the-art data mining methods, to predict the validity of the extracted association rules. The techniques we used include time series analysis for the statistical analysis, and Support Vector Machines. Hereafter we elaborate on the usage and results of those methods.

2.3.1 Linear Regression

In most hospitals, it is vital to forecast the trends in the isolation rate of a variety of pathogens and in antibiotic resistance. In that respect, we used the previously extracted association rules in order to check their temporal validity and their future behavior. In this work we used two basic interestingness measures, confidence and leverage, which provide measures of the interestingness and validity of a specific rule.

Assume a set of n association rules for which we observe the leverage and the confidence values at m time steps. Let $x_{1i} = (x_{1i1}, \ldots, x_{1im})$ be the leverage values of the i-th rule at the time points $t = (t_1, \ldots, t_m)$ and $x_{2i} = (x_{2i1}, \ldots, x_{2im})$ be the confidence values of the i-th rule at the same time points. Further, we assume that the $n \times m$ design matrix $X_1$ stores all the observed leverage values and the $n \times m$ design matrix $X_2$ stores all the observed confidence values, such that each row corresponds to a rule and each column to a time point. Given these observations, we aim to predict the leverage $X_{1i}(t^{*})$ and the confidence $X_{2i}(t^{*})$ of each rule i at some time $t^{*}$. $t^{*}$ will typically correspond to a future time point, i.e. $t^{*} > t_j$ for $j = 1, \ldots, m$.

We now discuss a simple prediction method, based on linear regression, where the input variable corresponds to time and the response variable is the leverage or the confidence value. The general linear regression equation for a line that fits the data is

$$x = a + b\,t$$

where t is the independent variable, time (represented by the id of the respective trimester), x is the dependent variable (confidence or leverage) and a, b are the constant regression parameters that must be computed to optimally fit a line to the available data points. The parameters a and b are determined by the following equations, in which all sums run over the m observed time points of a rule:

$$a = \frac{\left(\sum x\right)\left(\sum t^{2}\right) - \left(\sum t\right)\left(\sum t x\right)}{m \sum t^{2} - \left(\sum t\right)^{2}}$$

$$b = \frac{m \sum t x - \left(\sum t\right)\left(\sum x\right)}{m \sum t^{2} - \left(\sum t\right)^{2}}$$
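The closed-form estimates of a and b can be checked numerically with a few lines of code. The sketch below is a plain Python illustration of the least-squares fit of x = a + b t for a single rule, where m denotes the number of observed time points; the function names and the confidence values are invented for the example and do not come from the data set.

```python
def fit_line(t, x):
    """Least-squares fit of x = a + b*t; returns the pair (a, b)."""
    m = len(t)
    if m != len(x) or m < 2:
        raise ValueError("need at least two (t, x) pairs of equal length")
    sum_t = sum(t)
    sum_x = sum(x)
    sum_tt = sum(ti * ti for ti in t)
    sum_tx = sum(ti * xi for ti, xi in zip(t, x))
    denom = m * sum_tt - sum_t ** 2
    if denom == 0:
        raise ValueError("all time points are identical")
    a = (sum_x * sum_tt - sum_t * sum_tx) / denom
    b = (m * sum_tx - sum_t * sum_x) / denom
    return a, b


def predict(a, b, t_star):
    """Value of the fitted line at time t_star."""
    return a + b * t_star


# Made-up confidence values of one rule over four consecutive trimesters.
trimester_ids = [1, 2, 3, 4]
confidence_values = [0.50, 0.55, 0.60, 0.65]

a, b = fit_line(trimester_ids, confidence_values)
next_value = predict(a, b, 5)  # extrapolate to the next trimester
print(a, b, next_value)
```

Because the fit extrapolates a straight line, predicted confidence values can leave the interval [0, 1] for distant time points, so a caller may want to clip them.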

2.3.2 Classification

The second prediction method we employ in our framework is the SVM classification algorithm, which aims to predict the antibiotic resistance of certain organisms in hospitals per season. A summary of the attributes used can be seen in Table 1. We approached the classification problem with three classifiers; after an extensive series of experiments, the Support Vector Machine algorithm presented the best classification results. Support Vector Machines (SVM) are learning predictors based on the Structural Risk Minimization (SRM) principle from statistical learning theory. The SRM principle seeks to minimize an upper bound of the generalization error rather than minimizing the training error (Empirical Risk Minimization, ERM). This approach results in better generalization than conventional techniques based on the ERM principle [4].

Consider an n-dimensional object x with coordinates $x = (x_1, x_2, \ldots, x_n)$, where each $x_i \in \mathbb{R}$ for $i = 1, 2, \ldots, n$. Each object $x_j$ belongs to a class $y_j \in \{-1, +1\}$. Furthermore, we have a training set T of m objects together with their classes, $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$. The objects are embedded in a dot product space S, $x_1, x_2, \ldots, x_m \in S$. Any hyperplane in the space S can be written as $\{x \in S \mid w \cdot x + b = 0\}$. The dot product $w \cdot x$ is defined by:

$$w \cdot x = \sum_{i=1}^{n} w_i x_i$$

A training set of objects is linearly separable if there exists at least one linear classifier, defined by the pair (w, b), which correctly classifies all training objects. This linear classifier is represented by the hyperplane H ($w \cdot x + b = 0$) and defines a region for class +1 objects ($w \cdot x + b > 0$) and another region for class -1 objects ($w \cdot x + b < 0$). After training, the classifier is ready to predict the class membership of new objects, different from those used in training. The class of an object $x_k$ is determined by the equation:

$$\mathrm{class}(x_k) = \begin{cases} +1 & \text{if } w \cdot x_k + b > 0 \\ -1 & \text{if } w \cdot x_k + b < 0 \end{cases}$$

3. Experimental Methodology

3.1 Experimental protocol

A very important concept in machine learning and data mining is overfitting, which occurs when a model is fitted too closely to a limited set of training data points. The resulting model then cannot predict unknown data to a satisfactory degree and its accuracy is low. There are many techniques to tackle this issue. One of them is cross-validation, which we used in order to produce an accurate prediction model without setting data aside solely for testing. In cross-validation we divide the data into k folds. For each fold, we use the whole data set excluding the current fold as the learning set and the current fold as the test set. The mean error over the folds gives a low-bias estimator. In our implementation we used 10-fold cross-validation, since many references report that accuracy differences for additional folds are insignificant [10].

In the linear regression approach, on the other hand, we chose the following method to avoid overfitting. The user defines the number T of time points, in the interval [3, n - 1), where n is the last (most recent) timestamp in our data. This range was chosen because, after extensive experimentation, a stable prediction model could only be extracted from three or more observations of the confidence/leverage-timeslot measurements per rule. In addition, the most accurate model extracted consisted of m observations, where m is equal to (timestamps - 1).

We also measure the accuracy of our model through a variety of statistical measures such as TP rate, FP rate, precision, recall and F-measure. TP is the number of items correctly labeled as belonging to a class, and FP is the number of items falsely labeled as belonging to that class. Precision is a measure of exactness, recall is a measure of completeness and, finally, the F-measure is the weighted harmonic mean of precision and recall.
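The fold construction and the evaluation measures named above can be illustrated with a short, library-free Python sketch. It is not the Weka procedure used in the experiments: the function names are hypothetical, the folds are taken in order without shuffling or stratification, and the F-measure is computed with equal weights on precision and recall (the balanced F1 case).

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs so that each item is tested exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must satisfy 2 <= k <= n_items")
    # The first (n_items % k) folds receive one extra item.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def binary_metrics(actual, predicted, positive):
    """TP rate (recall), FP rate, precision and F1 for the class `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp_rate": recall, "fp_rate": fp_rate,
            "precision": precision, "recall": recall, "f_measure": f1}


# Ten items split into three folds: every index appears in exactly one test fold.
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), test_idx)

# Made-up labels: R = resistant, S = sensitive.
actual = ["R", "R", "S", "S", "R"]
predicted = ["R", "S", "S", "R", "R"]
print(binary_metrics(actual, predicted, positive="R"))
```

In a full cross-validation run, a model would be trained on `train_idx` and scored on `test_idx` for each fold, and the per-fold measures would then be averaged.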

3.2 Experiments, Results and Goals

Recall that, once an SVM has been trained, the classification of new objects depends only on the sign of the expression $w \cdot x + b$. In our implementation the objects x are association rules in a specific form, and with them we can predict the future resistance of specific pathogenic organisms.

3.2.1 Linear Regression Analysis

Regarding the association rules, we experimented with all the pathogenic microorganisms as members of the RHS of the rules, and the results are comparable. Due to lack of space we report only the results on Escherichia coli, which are representative of the whole result set. As we can see in Figure 2, the confidence and leverage of the specific association rules show some repeated patterns and outbreaks. From this we can infer that in certain recurring periods there is a high isolation rate of this pathogen (Escherichia coli in our case), so we should be prepared to prevent its spread. We claim that hospitals could benefit from this framework and exploit these observations in order to prevent hospital-acquired infections. Figure 2 presents some of our predictions. The results show that the predicted values for confidence and leverage are very close to the real values, which supports a robust prediction framework in this context. The prediction error is calculated as follows: $\text{error} = |\text{expected value} - \text{predicted value}|$.
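The absolute prediction error defined above can be computed for a single rule by fitting a line on the earliest observations and comparing its extrapolation with the remaining ones. The following Python sketch illustrates that evaluation; the function names, the mean-centred form of the least-squares fit and the series values are assumptions chosen for the example rather than details of the reported experiments.

```python
def fit_line(t, x):
    """Least-squares line x = a + b*t via mean-centred sums; returns (a, b)."""
    m = len(t)
    t_mean = sum(t) / m
    x_mean = sum(x) / m
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_tx = sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, x))
    b = s_tx / s_tt
    a = x_mean - b * t_mean
    return a, b


def holdout_errors(series, n_train):
    """Fit on the first n_train values; return |actual - predicted| for the rest."""
    if not 2 <= n_train < len(series):
        raise ValueError("need at least two training points and one held-out point")
    t = list(range(1, len(series) + 1))  # time-point ids 1, 2, 3, ...
    a, b = fit_line(t[:n_train], series[:n_train])
    return [abs(x - (a + b * ti))
            for ti, x in zip(t[n_train:], series[n_train:])]


# Made-up leverage series of one rule; the last value is held out.
leverage_series = [0.2, 0.4, 0.6, 0.8, 0.9]
errors = holdout_errors(leverage_series, n_train=4)
print(errors)
```

Averaging these errors over all held-out time points of a rule gives one number per rule, which is a convenient quantity to plot when comparing rules.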

Regarding the linear regression method, we used ten (10) time points to determine the line (x = a + bt, see Section 2.3) and predicted six (6) time points based on the fitted line. The graph of Figure 3 illustrates the amount of prediction error and shows how low it is for each rule. Through linear regression we can predict the future importance of a rule and, as a result, forecast outbreaks in the presence of a pathogenic organism. For example, if a rule such as Urine, Gr0061, out, t1 → E.coli occurred with a confidence near 1, we could infer that Escherichia coli is isolated at a rate higher than expected in urine specimens from the outpatient department of Gr0061 at time point t1. Furthermore, thanks to the visualization of the data we can observe some repeated trends and patterns over the months.

Figure 2. Confidence and leverage values produced via the Apriori algorithm. The values show some repeated peaks of leverage and confidence over time, which indicate high sensitivity (S) of the E. coli organism (eco) to the antibiotic AMC.

Figure 3. Regression error rate for predicted confidence and leverage values

3.2.2 Classification

Valid and reliable automatic disease classifiers are considered vital components of an antibiotic resistance monitoring system. In our work we measured the actual performance of three classifiers (Naive Bayes, SVM and C4.5, as implemented in the open source library Weka 3.7) designed for the early detection of special cases of antibiotic resistance that occur regularly in hospitals. We formulated a classification problem aiming to predict the antibiotic resistance of a pathogen based on the following attributes: hospital, specimen group, hospital department, pathogenic organism and the respective season. The accuracy results of our implementation are shown in Table 4. As we can see, all three algorithms have similar results according to the measures mentioned above. However, Support Vector Machines achieved the best results according to the F-measure, which is essential for distinguishing accurate from inaccurate models. Furthermore, the TP rate values are considerably high, which means that all three algorithms can correctly predict the pathogenic organisms that may be observed in a hospital. With this type of prediction it is feasible to forecast possible diseases that could be acquired during the upcoming trimester and the antibiotic resistance of these diseases. The models are trained on historical data stored in our data warehouse, and a prediction of the antibiotic resistance of the pathogenic organism is made for the next trimester. For example, for a given organism, specimen group, antibiotic, hospital and season we can predict the antibiotic resistance of the specific organism with 98 percent accuracy. Table 4 shows the prediction accuracy of each of the aforementioned algorithms.

Model         TP Rate   FP Rate   Precision   Recall   F-measure
Naive Bayes   0.946     0.469     0.942       0.946    0.943
C4.5          0.935     0.78      0.915       0.935    0.919
SVM           0.978     0.157     0.978       0.978    0.978

Table 4. Prediction quality for Escherichia coli

The results are very encouraging, as all the measures used reveal quite precise predictions for all algorithms. In most cases the Support Vector Machines algorithm gives the best prediction results.

4. Conclusion and Discussion

Surveillance of nosocomial infections and of antibiotic resistance are two of the most important functions of a hospital infection control program. In public health, and more specifically in the surveillance of antibiotic resistance, it is important to discover new associations and patterns before they become widespread in a hospital or a region. Furthermore, it is really important to predict future behavior from epidemic data, so that hospitals are prepared for outbreaks in the isolation of pathogenic organisms. In this paper we have presented a fully functional, implemented framework for prediction and visualization in this context. The system capitalizes on the real-world data of the Greek national network for continuous monitoring of bacterial antibiotic resistance in the Greek hospitals (Greek System for the Surveillance of Antimicrobial Resistance), in place since 1995 [6]. We achieved robust and accurate predictions that are quite promising in terms of better understanding the problem and the patterns of nosocomial infections. Moreover, the system offers a friendly interface which can be used by people who are not data mining experts. The results were achieved using patient data from the last seven years. Finally, future work will be devoted to using larger data set collections, spanning longer time periods. Likewise, infection control systems require, or will require, data mining tools such as clustering for further research into future trends. The system implementation and full functionality are available online at http://195.251.235.83/en/index.html .

References

[1] R. P. Trueblood, J. N. Lovett, Jr., Data Mining and Statistical Analysis Using SQL, Apress, Berkeley, California, 2001.

[2] Eugenia G. Giannopoulou, V. P. Kemerlis, Michalis Polemis, J. Papaparaskevas, Alkiviadis C. Vatopoulos, Michalis Vazirgiannis, A Large Scale Data Mining Approach to Antibiotic Resistance Surveillance, Twentieth IEEE International Symposium on Computer-Based Medical Systems (CBMS), pp. 439-444, 2007.

[3] Mykola Pechenizkiy, Alexey Tsymbal, Seppo Puuronen, Michael Shifrin, Irina Alexandrova, Knowledge Discovery from Microbiology Data: Many-Sided Analysis of Antibiotic Resistance in Nosocomial Infections, in: WM05, 3rd International Conference on Professional Knowledge Management: Experience and Visions, Kaiserslautern, Germany, pp. 360-372, April 2005.

[4] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, C. Pellegrini, A. Geissbuhler, An Application of One-Class Support Vector Machines to Nosocomial Infection Detection, in: Proc. of Medical Informatics, 2004.

[5] Brossette SE, Sprague AP, Jones WT, et al. A data mining system for infection control surveillance. Methods Inf Med 2000;39:303-10.

[6] Vatopoulos AC, Kalapothaki V, Legakis NJ. An electronic network for the surveillance of antimicrobial resistance in bacterial nosocomial isolates in Greece. The Greek Network for the Surveillance of Antimicrobial Resistance. Bull World Health Organ. 1999;77:595-601.

[7] O'Brien TF, Stelling JM. WHONET: an information system for monitoring antimicrobial resistance. Emerg Infect Dis. 1995;1:66.

[8] Stelling JM. WHONET: removing obstacles to the full use of information about antimicrobial resistance. Diagn Microbiol Infect Dis. 1996;25:162-8.

[9] Samore M, Lichtenberg D, Saubermann L, Kawachi C, Carmeli Y. A clinical data repository enhances hospital infection control. Proc AMIA Annu Fall Symp. 1997:56-60.

[10] Sterlin, P. Overfitting prevention with cross-validation. Master's thesis. University Pierre and Marie Curie (Paris VI): Paris, France, 200
