Python Environment for Bayesian Learning (pebl)

Category Intelligent Software>Bayesian Network Systems/Tools

Abstract pebl is a Python library and command line application for learning the structure of a Bayesian Network given prior knowledge and observations.

pebl Features - pebl provides many features for working with data and Bayesian Networks (BNs); some of the more notable ones are listed below:

1) Structure Learning --

pebl can load data from tab-delimited text files with continuous, discrete and class variables and can perform maximum entropy discretization.
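Maximum entropy discretization is commonly implemented as equal-frequency (quantile) binning: each bin receives about the same number of observations, which maximizes the entropy of the discretized variable. The following is a minimal sketch of that idea in plain Python. The function name and the default of three bins are illustrative assumptions, not pebl's own API.

```python
def equal_frequency_discretize(values, num_bins=3):
    """Map each value to a bin index in 0..num_bins-1 with near-equal bin sizes."""
    n = len(values)
    # Rank observations by value; equal values are ranked by original position,
    # so ties may be split across neighbouring bins in this simple sketch.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * num_bins // n
    return bins

expression = [0.1, 2.3, 1.7, 0.4, 3.9, 2.8]
print(equal_frequency_discretize(expression))  # [0, 1, 1, 0, 2, 2]
```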

Data collected following an intervention is important for determining causality but requires an altered scoring procedure.

pebl uses the BDe metric (Bayesian metric with Dirichlet priors and equivalence) for scoring networks and handles interventional data using the method described by:

(Yoo C, Thorsson V, Cooper GF - Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. Pac Symp Biocomput. 2002;7:498-509).
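To make the scoring idea concrete, here is a hedged sketch of a per-node score in plain Python. It assumes the uniform-prior variant of the metric (often called BDeu) and assumes that interventional data is handled by skipping, for a given node, every sample in which that node was itself manipulated. The function name, argument layout and default equivalent sample size are hypothetical and are not taken from pebl.

```python
from collections import Counter
from itertools import product
from math import lgamma

def bdeu_local_score(data, interventions, child, parents, arities, ess=1.0):
    """Log BDeu score of one node given its parents.

    data: list of tuples of discrete states, one tuple per sample.
    interventions: one set per sample holding the variables that were manipulated.
    arities: number of states of each variable.
    """
    r = arities[child]
    q = 1
    for p in parents:
        q *= arities[p]
    a_jk = ess / (q * r)  # pseudo-count per (parent configuration, child state)
    a_j = ess / q         # pseudo-count per parent configuration

    counts = Counter()
    for row, intervened in zip(data, interventions):
        # A sample in which the child was set by the experimenter carries no
        # information about how the child depends on its parents (assumption).
        if child in intervened:
            continue
        config = tuple(row[p] for p in parents)
        counts[(config, row[child])] += 1

    score = 0.0
    for config in product(*(range(arities[p]) for p in parents)):
        n_j = sum(counts[(config, k)] for k in range(r))
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for k in range(r):
            score += lgamma(a_jk + counts[(config, k)]) - lgamma(a_jk)
    return score

# Variable 1 copies variable 0, so the parent set [0] should score higher.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1)]
observed_only = [set() for _ in data]
print(bdeu_local_score(data, observed_only, 1, [0], [2, 2]) >
      bdeu_local_score(data, observed_only, 1, [], [2, 2]))
```

Because the score decomposes into one term per node, the score of a whole network under this sketch is the sum of the local scores of its nodes.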

pebl can handle missing values and hidden variables using exact marginalization and Gibbs sampling.

The Gibbs sampler can be resumed from a previously suspended state, allowing for interactive inspection of preliminary results or a manual strategy for determining satisfactory convergence.

Note: A key strength of Bayesian analysis is the ability to use prior knowledge.

pebl supports structural priors over edges specified as ‘hard’ constraints or ‘soft’ energy matrices and arbitrary constraints specified as Python functions or lambda expressions.
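One way to combine the three kinds of prior knowledge is sketched below. Hard constraints and arbitrary constraint functions rule a network out entirely, while the energy matrix adds a penalty for each edge present. The function, its parameters and the convention that lower energy means stronger belief in an edge are assumptions made for illustration, not pebl's actual interface.

```python
import math

def log_prior(edges, energy=None, required=(), prohibited=(),
              constraints=(), weight=1.0):
    """Unnormalized log prior of a network given as a set of (src, dst) edges."""
    # Hard constraints: any violation excludes the network.
    if any(e not in edges for e in required):
        return -math.inf
    if any(e in edges for e in prohibited):
        return -math.inf
    # Arbitrary constraints: each is a callable that returns True if satisfied.
    if not all(check(edges) for check in constraints):
        return -math.inf
    # Soft constraints: every edge present pays its energy.
    if energy is None:
        return 0.0
    return -weight * sum(energy[src][dst] for src, dst in edges)

energy = [[0.5, 0.1, 0.9],
          [0.5, 0.5, 0.5],
          [0.5, 0.5, 0.5]]
at_most_two_edges = lambda edges: len(edges) <= 2

print(log_prior({(0, 1)}, energy=energy))                    # favoured edge
print(log_prior({(0, 2)}, energy=energy))                    # disfavoured edge
print(log_prior({(0, 1)}, prohibited=[(0, 1)]))              # -inf
print(log_prior({(0, 1), (1, 2), (0, 2)},
                constraints=[at_most_two_edges]))            # -inf
```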

pebl includes greedy hill-climbing and simulated annealing learners and makes writing custom learners easy.

Efficient implementation of learners requires careful programming to eliminate redundant computation.

pebl provides components to alter, score and roll back changes to BNs in a simple, transactional manner; with these, efficient ‘learners’ look remarkably similar to pseudo-code.
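The alter, score and roll back pattern can be illustrated with a small greedy hill-climbing sketch. The Network class, its methods and the toy score below are hypothetical stand-ins and do not reproduce pebl's classes. For brevity the sketch rescores the whole network after each change, whereas an efficient learner would rescore only the node whose parent set changed.

```python
import random

class Network:
    """A tiny directed graph that remembers its last change so it can be undone."""
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.edges = set()
        self._undo = None

    def toggle(self, edge):
        """Add the edge if absent, remove it if present, and record the change."""
        self._undo = edge
        self.edges ^= {edge}

    def rollback(self):
        """Undo the most recent toggle."""
        if self._undo is not None:
            self.edges ^= {self._undo}
            self._undo = None

    def is_acyclic(self):
        # Repeatedly strip nodes that have no incoming edge (Kahn's algorithm).
        remaining = set(self.edges)
        nodes = set(range(self.num_nodes))
        while nodes:
            roots = {n for n in nodes if not any(d == n for _, d in remaining)}
            if not roots:
                return False
            nodes -= roots
            remaining = {(s, d) for s, d in remaining if s not in roots}
        return True

def greedy_learn(net, score, iterations=200, seed=0):
    rng = random.Random(seed)
    best = score(net)
    for _ in range(iterations):
        src, dst = rng.sample(range(net.num_nodes), 2)
        net.toggle((src, dst))                      # alter
        new = score(net) if net.is_acyclic() else float("-inf")
        if new > best:
            best = new                              # keep the change
        else:
            net.rollback()                          # discard it
    return best

# Toy score: negative count of edges that differ from a known target network.
target = {(0, 1), (1, 2)}
net = Network(3)
print(greedy_learn(net, lambda n: -len(n.edges ^ target)), sorted(net.edges))
```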

2) Convenience and Scalability --

pebl includes both a library and a command line application.

It aims for a balance between ease of use, extensibility and performance. The majority of pebl is written in Python, a dynamically-typed programming language that runs on all major operating systems.

Critical sections use the NumPy library for high-performance matrix operations and custom extensions written in ANSI C for portability and speed.

NumPy - NumPy is the fundamental package needed for scientific computing with Python.

pebl’s use of Python makes it suitable for both programmers and domain experts.

Python provides interactive shells and notebook interfaces and includes an extensive standard library and many third-party packages.

Note: While many tasks related to Bayesian learning are embarrassingly parallel in theory, few software packages take advantage of this.

pebl can execute learning tasks in parallel over multiple processors or CPU cores, an Apple Xgrid, an IPython cluster or the Amazon EC2 platform.

The EC2 platform is especially attractive for scientists because it allows one to rent processing power on demand and execute pebl tasks on it.

With appropriate configuration settings and the use of parallel execution, pebl can be used for large learning tasks.
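Independent learner runs share no state, which is what makes them easy to distribute. The sketch below shows the general pattern with Python's standard concurrent.futures module: dispatch several runs with different random seeds, then merge the results. The run_learner function is a made-up stand-in for a real learning run, and a thread pool is used only to keep the example portable; for CPU-bound scoring, a process pool or a cluster back end would be the natural substitute.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_learner(seed):
    """Stand-in for one independent learning run; returns (score, network)."""
    rng = random.Random(seed)
    edges = frozenset(tuple(rng.sample(range(4), 2)) for _ in range(3))
    score = -rng.random() * 100
    return score, edges

def run_parallel(seeds, max_workers=4):
    # Each run is independent, so the workers need no coordination.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_learner, seeds))
    # Merge: keep the distinct results, best score first.
    return sorted(set(results), key=lambda r: r[0], reverse=True)

merged = run_parallel(range(8))
print(merged[0])
```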

Although pebl has been tested successfully with datasets containing 10,000 variables and samples, BN structure learning is a known NP-Hard problem, and analysis of datasets with more than a few hundred variables is likely to give poor results because only a small part of the search space can be covered.

Summary of pebl Features --

1) Can learn with observational and interventional data;

2) Handles missing values and hidden variables using exact and heuristic methods;

3) Provides several learning algorithms, makes creating new ones simple;

4) Has facilities for transparent parallel execution;

5) Calculates edge marginals and consensus networks; and

6) Presents results in a variety of formats.
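As an illustration of item 5, edge marginals can be approximated from a list of top-scoring networks by weighting each network by its normalized posterior score, and a consensus network can then be formed from the edges whose marginal meets a threshold. The sketch below assumes the listed networks stand in for the whole posterior; the function names and the 0.5 threshold are illustrative choices, not pebl's API.

```python
from math import exp

def edge_marginals(scored_networks):
    """Posterior-weighted frequency of each edge over (log_score, edges) pairs."""
    top = max(score for score, _ in scored_networks)
    # Subtract the best log score before exponentiating to avoid underflow.
    weights = [exp(score - top) for score, _ in scored_networks]
    total = sum(weights)
    marginals = {}
    for weight, (_, edges) in zip(weights, scored_networks):
        for edge in edges:
            marginals[edge] = marginals.get(edge, 0.0) + weight / total
    return marginals

def consensus_network(scored_networks, threshold=0.5):
    """Edges whose marginal is at least the threshold."""
    marginals = edge_marginals(scored_networks)
    return {edge for edge, p in marginals.items() if p >= threshold}

networks = [(-10.0, {(0, 1), (1, 2)}),
            (-10.0, {(0, 1)}),
            (-12.0, {(2, 1)})]
print(consensus_network(networks))  # {(0, 1)}
```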

pebl Concepts --

Every pebl analysis includes data, a learner and a result. An analysis may also include prior models and task controllers.

1) Data - This is the set of observations that is used to score a given network.

The data can include missing values and hidden/unobserved variables and observations can be marked as being the result of specific interventions. Data can be read from a file or created programmatically.

2) Learner - A learner implements a specific learning algorithm. It is given some data, prior model and stopping criteria and returns a result object.

3) Result - A result object contains a list of the top-scoring networks found during a learner run and some statistics about the analysis. Results from different learning runs with the same data can be merged and visualized in various formats.

4) Prior Models - A key strength of Bayesian analysis is the ability to integrate knowledge with observations. A pebl prior model specifies the prior belief about the set of possible networks and can include hard and soft constraints.

5) Task Controllers - pebl uses task controllers to run analyses in parallel. As noted above, users can utilize multiple CPU cores or computational clusters without managing any of the details of parallel programming.

Note: The manufacturers hope that their open development model will convince others to use pebl as a platform for research on Bayesian Network algorithms.

System Requirements

Contact manufacturer.

Manufacturer

Manufacturer Web Site pebl Information and Download

Price Contact manufacturer.

G6G Abstract Number 20700

G6G Manufacturer Number 104272