Analysis of massive data streams using R

Description

Today, ubiquitous sensors continuously stream data about the environments in which they operate. A typical monitoring and analysis system may use this sensor data to track the status of a particular device and predict its future behaviour, or to infer diagnostically the most likely system configuration that produced the observed data. Even sources with a modest updating frequency can produce extremely large volumes of data, which makes efficient and accurate analysis and prediction difficult. One of the main challenges is handling uncertainty in the data: massive data applications require principled methods and algorithms for dealing with it. Probabilistic graphical models (PGMs) provide a well-founded and principled approach to inference and belief updating in complex domains endowed with uncertainty.

The ongoing EU-FP7 research project AMIDST (Analysis of MassIve Data STreams, http://www.amidst.eu) aims to produce scalable methods, based on Bayesian network technology, that can handle massive data streams. All of the developed methods will be made available through the AMIDST toolbox, a software suite composed of the HUGIN software (http://amidst.hugin.com) and the open source AMIDST Toolbox. Meanwhile, the R statistical environment (http://www.cran.r-project.org) has become a widely used standard for data manipulation and statistical analysis.

The main goal of the tutorial is to show how R and the AMIDST toolbox can be linked to support the complete lifecycle of data stream processing, from exploratory analysis to probabilistic inference. To this end, several existing R packages will be used, and the Ramidst package will be introduced to the community.
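
To give a concrete flavour of belief updating on a data stream, the following minimal sketch processes a stream of binary sensor readings in small batches and updates a Beta belief about the fault rate after each batch, so that no past readings need to be stored. It is an illustration only, not code from the AMIDST toolbox or the Ramidst package: the Beta-Bernoulli model, the readings, the batch size and all function names are assumptions made for this example, and it is written in Python rather than R purely for brevity.

```python
from typing import Iterable, Iterator, List, Tuple


def batches(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `size` readings from the stream."""
    batch: List[int] = []
    for reading in stream:
        batch.append(reading)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def update_beta(alpha: float, beta: float, batch: List[int]) -> Tuple[float, float]:
    """Conjugate update of a Beta(alpha, beta) belief from 0/1 readings."""
    faults = sum(batch)
    return alpha + faults, beta + len(batch) - faults


def posterior_mean(alpha: float, beta: float) -> float:
    """Expected fault rate under the current Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)


# Hypothetical stream: 1 means the sensor reported a fault, 0 means it did not.
stream = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Assumed uniform prior, Beta(1, 1), over the unknown fault rate.
alpha, beta = 1.0, 1.0

for batch in batches(stream, size=4):
    # Only the two belief parameters are kept between batches.
    alpha, beta = update_beta(alpha, beta, batch)
    print(f"after {batch}: Beta({alpha}, {beta}), mean {posterior_mean(alpha, beta):.3f}")
```

With the assumed readings (two faults in ten), the belief ends at Beta(3, 9), whose mean is 0.25. The memory needed stays constant however long the stream runs, which is the property that matters for massive streams.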

 

Content and schedule

Part 1: Introduction to Bayesian networks (20 mins.)

  • Static and dynamic Bayesian networks
  • Inference
  • Learning

Part 2: Exploratory analysis of data streams in R (30 mins.)

  • Data preparation
  • Time series analysis in R
  • Break (15 mins.)

Part 3: Reporting the results of the exploratory analysis in LaTeX (15 mins.)

  • Automatic report generation using Sweave

Part 4: The Ramidst package (40 mins.)

  • Download and installation
  • The AMIDST functionality
  • Practical examples

 

Material

The slides of the tutorial can be downloaded in PDF format [SLIDES]. A download link for the Ramidst package will also be provided.

Audience

The target audience of the tutorial includes:

  • Data scientists.
  • R users and developers with an interest in data stream processing.
  • Researchers from the Bayesian networks community.

 

 

Speakers

 

Antonio Salmerón

is Professor of Statistics and Operations Research. He obtained his PhD in Artificial Intelligence from the University of Granada in 1998. He has a long record of publications in international journals and conferences, covering approximate inference in Bayesian networks, hybrid Bayesian networks (including modelling, inference and learning), probabilistic decision graphs, classification and regression. In 2001, he received the José Cuena award from the Spanish Association for Artificial Intelligence for a paper on approximate inference in Bayesian networks. He was program co-chair of the First European Workshop on Probabilistic Graphical Models (2002) and of the 13th Conference of the Spanish Association for Artificial Intelligence.

 

Helge Langseth

is Professor of Computer Science and Machine Learning. He obtained his PhD in mathematical statistics from the Norwegian Institute of Technology in 2002. His current research interests cover different aspects of decision support systems. His machine learning research targets model structures that will either replace or work in cooperation with the human-made models in decision support systems. The main type of model studied is probabilistic graphical models, in particular Bayesian networks, with the main focus on how to learn these graphical models from data. Recently, he has moved towards data mining applications. He was previously employed as a Senior Research Scientist at SINTEF, where, over a ten-year period, he was project manager for several research projects.

 

Anders L. Madsen

is Chief Executive Officer of HUGIN EXPERT and Adjunct Professor of Computer Science at Aalborg University. He has a PhD in Decision Support Systems from Aalborg University (1999) and a Master of Business Administration (MBA) from Henley Business School, United Kingdom (2011). He is an expert in probabilistic modelling, and his research interests are mainly focused on inference in, and applications of, Bayesian networks and influence diagrams. He has participated in numerous international RTD projects on the application of probabilistic graphical models, including projects funded or supported by customers, the Danish government and the European Commission.

 

Thomas D. Nielsen

is an associate professor of computer science at Aalborg University, where he obtained his PhD in 2001. His main research interests concern learning probabilistic graphical models from (hybrid) data and using these models for machine learning and decision support systems. He is an expert in the use of probabilistic graphical models for decision analysis, and he is a co-author of a well-known textbook on probabilistic graphical models and decision analysis. He has recently been program co-chair of the European Workshop on Probabilistic Graphical Models (2012) and of the Scandinavian Conference on Artificial Intelligence (2013). He is a member of the editorial boards of the Journal of Artificial Intelligence Research and the Progress in Artificial Intelligence journal, and he is area editor for the International Journal of Approximate Reasoning.

 

References

 

  • AMIDST. Analysis of Massive Data Streams. Project funded by the European Commission’s 7th Framework Programme, grant 619209. [Web].
  • R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. [Web].

 

 

Illustrations of the city of Albacete courtesy of Alicia Gosalbez
Copyright © 2019 Conference CAEPIA 2015