Machine Learning Toolkit

The Machine Learning Toolkit is a suite of machine learning libraries for kdb+/q users. It covers a wide range of tasks, including time series analysis, natural language processing, and automated machine learning. It integrates with kdb+/q for data handling and processing, and combines traditional machine learning techniques with modern NLP models.

The repository is structured as three modules: ml and nlp can each be used independently for their respective feature sets as further described below; automl builds upon ml and nlp to deliver automated machine learning capabilities.

Requirements

kdb+ >= 3.5 64-bit

The Python packages needed to run every function in the toolkit can be installed via:

pip:

bash
pip install -r requirements.txt

or via conda:

bash
conda install --file requirements.txt

Alternatively, use requirements_pinned.txt for a fully resolved, pinned and known-working set of dependencies, or a module-specific requirements.txt (e.g. ml/requirements.txt) when using only a subset of the toolkit.
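The choice between the top-level and module-specific requirements files can be scripted. The following is a minimal sketch, not part of the toolkit: requirements_for is a hypothetical helper name, and it assumes it is run from the repository root and that a module's file, when present, lives at <module>/requirements.txt as in the ml/requirements.txt example above.

```shell
# Hypothetical helper (not shipped with the toolkit): print the requirements
# file to install for a given module, falling back to the top-level file.
# Assumes the current directory is the repository root.
requirements_for() {
  module="${1:-}"
  if [ -n "$module" ] && [ -f "$module/requirements.txt" ]; then
    # A module-specific file exists, so prefer it.
    printf '%s\n' "$module/requirements.txt"
  else
    # No module given, or no module-specific file: use the full set.
    printf '%s\n' "requirements.txt"
  fi
}

# Example usage (left as a comment so that sourcing this file installs nothing):
#   pip install -r "$(requirements_for ml)"
```

The function only prints a path, so the same helper can feed either pip or conda.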

While the nlp framework may be used with other models, automl and the nlp tests use en_core_web_sm. You can download this model after installing the Python requirements like so:

bash
python -m spacy download en_core_web_sm

Installation

To install, simply copy or link the desired components to your $QHOME directory, for example: cp -r {ml,nlp,automl} $QHOME/.
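For the linking option, a minimal sketch is shown here. link_components is a hypothetical helper, not something the toolkit provides; it assumes the components are directories directly under the repository root and that the target q home directory already exists.

```shell
# Hypothetical helper (not shipped with the toolkit): symlink components
# into a q home directory.
# Arguments: <repo_root> <qhome> <component>...
link_components() {
  # Resolve the repository root to an absolute path so the links stay valid
  # regardless of the directory they are later read from.
  repo=$(cd "$1" && pwd) || return 1
  qhome="$2"
  shift 2
  if [ -z "$qhome" ] || [ ! -d "$qhome" ]; then
    echo "QHOME is unset or not a directory: '$qhome'" >&2
    return 1
  fi
  for component in "$@"; do
    if [ ! -d "$repo/$component" ]; then
      echo "missing component: $repo/$component" >&2
      return 1
    fi
    # -s: symbolic link; -f: replace an existing link;
    # -n: treat an existing link to a directory as a file, so it is
    #     replaced rather than followed.
    ln -sfn "$repo/$component" "$qhome/$component"
  done
}

# Example usage (left as a comment so that sourcing this file changes nothing):
#   link_components . "$QHOME" ml nlp automl
```

Linking rather than copying means updates pulled into the repository are picked up without reinstalling. If a real directory of the same name already exists in the q home directory, remove it first, since ln would otherwise create the link inside it.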

To load all functionality into the .automl, .ml, and .nlp namespaces, run the following from q:

q
\l automl/automl.q
.automl.loadfile`:init.q

To load only specific modules, replace automl with ml or nlp in the commands above.

Once installed, you can explore the toolkit's capabilities by trying out our examples.

Components

ml

This library contains functions that cover the following areas:

An implementation of the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm for use in the extraction of features from time series data and the reduction in the number of features through statistical testing.
Cross-validation and grid-search functions allowing for testing of the stability of models to changes in the volume of data or the specific subsets of data used in training.
Clustering algorithms used to group data points and to identify patterns in their distributions. The algorithms make use of a k-dimensional tree to store points and scoring functions to analyze how well they performed.
Statistical time series models and feature-extraction techniques for applying machine learning to time series problems. These models allow the future behavior of a system to be forecast under various conditions.
Numerical techniques for calculating the optimal parameters for an objective function.
A graphing and pipeline library for creating modularized, executable workflows based on a structure described by a mathematical directed graph.
Utility functions relating to areas including statistical analysis, data preprocessing and array manipulation.
A multi-processing framework to parallelize work across many cores or nodes.
Functions for integration with PyKX or embedPy, which provide interoperability between Python and kdb+/q in either environment.
A location for storing and versioning ML models on-prem, together with a common model-retrieval API that allows models to be retrieved and used on kdb+ data regardless of their underlying requirements. Centralising work in a common storage location improves team collaboration and management oversight.

These areas are explained in greater depth in the FRESH, cross-validation, clustering, time series, optimization, graph/pipeline, utilities and registry documentation.

nlp

The natural language processing (NLP) module allows users to parse datasets using a spaCy model from Python, which performs tokenisation, sentence detection, part-of-speech tagging and lemmatization. In addition to parsing, users can cluster text documents using algorithms such as MCL, k-means and radix. Sentiment analysis is also available, indicating whether a word has a positive or negative sentiment.

automl

The automated machine learning library is built on top of ml and nlp. Its purpose is to help you automate the process of applying machine learning techniques to real-world problems. In the absence of expert machine-learning engineers, it handles the following processes within a traditional workflow:

Data preprocessing
Feature engineering and feature selection
Model selection
Hyperparameter tuning
Report generation and model persistence

Each of these steps is outlined in depth within the documentation.