Glossary#

Machine learning terms#

Categorical feature#

A feature with a discrete domain. Can be nominal or ordinal.

Synonyms: discrete feature, qualitative feature

Classification#

A supervised machine learning problem of assigning labels to feature vectors.

Examples: predict what type of object is in the picture (a dog or a cat?), predict whether or not an email is spam

Clustering#

An unsupervised machine learning problem of grouping feature vectors into clusters, which are usually encoded as nominal values.

Example: find large star clusters in space images

Continuous feature#

A feature with values in a domain of real numbers. Can be interval or ratio.

Synonyms: quantitative feature, numerical feature

Examples: a person’s height, the price of a house

CSV file#

A comma-separated values (CSV) file is a type of text file. Each line in a CSV file is a record containing fields that are separated by a delimiter. Fields can be in a numerical or a text format; text usually represents categorical values. By default, the delimiter is a comma, but, in general, it can be any character.

Dimensionality reduction#

A problem of transforming a set of feature vectors from a high-dimensional space into a low-dimensional space while retaining meaningful properties of the original feature vectors.

Feature#

A particular property or quality of a real object or an event. Has a defined type and domain. In machine learning problems, features are treated as input variables that are independent of each other.

Synonyms: attribute, variable, input variable

Feature vector#

A vector that encodes information about a real object, an event, or a group of objects or events. Contains at least one feature.

Example: A rectangle can be described by two features: its width and height

Inference#

A process of applying a trained model to a dataset in order to predict response values based on input feature vectors.

Synonym: prediction

Inference set#

A dataset used at the inference stage. Usually without responses.

Interval feature#

A continuous feature with values that can be compared, added or subtracted, but cannot be multiplied or divided.

Examples: a time frame scale, a temperature in Celsius or Fahrenheit

Label#

A response with categorical or ordinal values. This is an output in classification and clustering problems.

Example: the spam-detection problem has a binary label indicating whether the email is spam or not

Model#

An entity that stores information necessary to run inference on a new dataset. Typically a result of a training process.

Example: in the linear regression algorithm, the model contains a weight value for each input feature and a single bias value

Nominal feature#

A categorical feature with no ordering between values. Only the equality operation is defined for nominal features.

Examples: a person’s gender, color of a car

Nu-classification#

An SVM-specific classification problem where the \(\nu\) parameter is used instead of \(C\). \(\nu\) is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

Nu-regression#

An SVM-specific regression problem where the \(\nu\) parameter is used instead of \(\epsilon\). \(\nu\) is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

Observation#

A feature vector and zero or more responses.

Synonyms: instance, sample

Ordinal feature#

A categorical feature with defined operations of equality and ordering between values.

Example: student’s grade

Outlier#

An observation that is significantly different from the other observations.

Ratio feature#

A continuous feature with defined operations of equality, comparison, addition, subtraction, multiplication, and division. A zero value means the absence of the measured quantity.

Example: the height of a tower

Regression#

A supervised machine learning problem of assigning continuous responses to feature vectors.

Example: predict temperature based on weather conditions

Response#

A property of a real object or event whose dependency on the feature vector is to be determined in a supervised learning problem. While a feature is an input of the machine learning problem, the response is one of the outputs the model can produce at the inference stage.

Synonym: dependent variable

Result options#

Result options are entities that mimic C++ enums. They are used to specify which results of an algorithm should be computed. The use of result options may alter the default algorithm flow and result in performance differences; in general, computing fewer results means faster performance. An error is thrown when you use an invalid set of result options or try to access results that have not been computed.

Example: the k-NN classification algorithm can perform classification and also return indices of and distances to the nearest observations as result options.

Search#

A kNN-specific optimization problem of finding the point in a given set that is the closest to the given points.

Supervised learning#

Training process that uses a dataset containing information about dependencies between features and responses. The goal is to get a model of dependencies between input feature vectors and responses.

Training#

A process of creating a model based on information extracted from a training set. The resulting model is selected in accordance with some quality criterion.

Training set#

A dataset used at the training stage to create a model.

Unsupervised learning#

Training process that uses a training set with no responses. The goal is to find hidden patterns inside feature vectors and dependencies between them.

Graph analytics terms#

Adjacency#

A vertex \(u\) is adjacent to vertex \(v\) if they are joined by an edge.

Adjacency matrix#

An \(n \times n\) matrix \(A_G\) for a graph \(G\) whose vertices are explicitly ordered \((v_1, v_2, ..., v_n)\), with entries

\[\begin{split}(A_G)_{ij}=\begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ are adjacent,} \\ 0, & \text{otherwise.} \end{cases}\end{split}\]
Attribute#

A value assigned to a graph, vertex, or edge. Can be numerical (a weight), a string, or any other custom data type.

Component#

A connected subgraph \(H\) of graph \(G\) such that no subgraph of \(G\) that properly contains \(H\) is connected [Gross2014].

Connected graph#

A graph is connected if there is a walk between every pair of its vertices [Gross2014].

Directed graph#

A graph where each edge is an ordered pair \((u, v)\) of vertices. \(u\) is designated as the tail, and \(v\) is designated as the head.

Edge index#

The index \(i\) of an edge \(e_i\) in an edge set \(E=\{e_1, e_2, ..., e_m\}\) of graph \(G\). Can be an integer value.

Graph#

An object \(G=(V;E)\) that consists of two sets, \(V\) and \(E\), where \(V\) is a finite nonempty set, \(E\) is a finite set that may be empty, and the elements of \(E\) are two-element subsets of \(V\). \(V\) is called a set of vertices, \(E\) is called a set of edges [Gross2014].

Induced subgraph on the edge set#

Each subset \(E' \subseteq E\) defines a unique subgraph \(H' = (V'; E')\) of graph \(G = (V; E)\), where \(V'\) consists of only those vertices that are the endpoints of the edges in \(E'\). The subgraph \(H'\) is called an induced subgraph of \(G\) on the edge set \(E'\) [Gross2014].

Induced subgraph on the vertex set#

Each subset \(V' \subseteq V\) defines a unique subgraph \(H = (V'; E')\) of graph \(G = (V; E)\), where \(E'\) consists of those edges whose endpoints are in \(V'\). The subgraph \(H\) is called an induced subgraph of \(G\) on the vertex set \(V'\) [Gross2014].

Self-loop#

An edge that joins a vertex to itself.

Subgraph#

A graph \(H = (V'; E')\) is called a subgraph of graph \(G = (V; E)\) if \(V' \subseteq V\), \(E' \subseteq E\), and \(V'\) contains all endpoints of all the edges in \(E'\) [Gross2014].

Topology#

A graph without attributes.

Undirected graph#

A graph where each edge is an unordered pair \((u, v)\) of vertices.

Unweighted graph#

A graph where all vertices and all edges have no weights.

Vertex index#

The index \(i\) of a vertex \(v_i\) in a vertex set \(V=\{v_1, v_2, ..., v_n\}\) of graph \(G\). Can be an integer value.

Walk#

An alternating sequence of vertices and edges such that for each edge, one endpoint precedes and the other succeeds that edge in the sequence [Gross2014].

Weight#

A numerical value assigned to a vertex, edge, or graph.

Weighted graph#

A graph where all vertices or all edges have weights.

oneDAL terms#

Accessor#

A oneDAL concept for an object that provides access to the data of another object in a special data format. It abstracts data access away from the interface of an object and provides uniform access to data stored in objects of different types.

Batch mode#

The computation mode for an algorithm in oneDAL, where all the data needed for computation is available at the start and fits the memory of the device on which the computations are performed.

Builder#

A oneDAL concept for an object that encapsulates the creation process of another object and enables its iterative creation.

Contiguous data#

Data that are stored as one contiguous memory block. One of the characteristics of a data format.

CSR data#

Compressed Sparse Row (CSR) data is a sparse matrix representation; one of the characteristics of a data format. This representation stores the non-zero elements of a matrix in three arrays that describe a sparse matrix \(A\) as follows:

  • The values array contains the non-zero elements of the matrix row by row.

  • Element number j of the columns_indices array encodes the column index in the matrix \(A\) for the j-th element of the values array.

  • Element number i of the row_offsets array encodes the index in the values array corresponding to the first non-zero element in rows indexed i or greater. The last element of the row_offsets array encodes the number of non-zero elements in the matrix \(A\).

oneDAL supports zero-based and one-based indexing.

Data format#

Representation of the internal structure of the data.

Examples: data can be stored in array-of-structures or compressed-sparse-row format

Data layout#

A characteristic of a data format that describes the order of elements in a contiguous data block.

Example: row-major format, where elements are stored row by row

Data type#

An attribute of data that tells a compiler how to store and access them. Includes the size in bytes, the encoding principles, and the available operations (in terms of a programming language).

Examples: int32_t, float, double

Dataset#

A collection of data in a specific data format.

Examples: a collection of observations, a graph

Flat data#

A block of contiguous homogeneous data.

Getter#

A method that returns the value of a private member variable.

Example:

std::int64_t get_row_count() const;
Heterogeneous data#

Data that contain values either of different data types or with different sets of operations defined on them. One of the characteristics of a data format.

Example: A dataset with 100 observations of three interval features. The first two features are of float32 data type, while the third one is of float64 data type.

Homogeneous data#

Data with values of a single data type and the same set of available operations defined on them. One of the characteristics of a data format.

Example: A dataset with 100 observations of three interval features, each of type float32

Immutability#

An object is immutable if it is not possible to change its state after creation.

Metadata#

Information about the logical and physical structure of an object. All possible combinations of metadata values represent the full set of possible objects of a given type. Metadata do not expose information that is not part of a type definition, e.g., implementation details.

Example: a table object can contain three nominal features with 100 observations (the logical part of metadata). This object can store data as a sparse CSR array and provide direct access to them (the physical part)

Online mode#

The computation mode for an algorithm in oneDAL, where the data needed for computation becomes available in parts over time.

Reference-counted object#

A copy-constructible and copy-assignable oneDAL object which stores the number of references to the unique implementation. Both copy operations defined for this object are lightweight, which means that each time a new object is created, only the number of references is increased. An implementation is automatically freed when the number of references becomes equal to zero.

Setter#

A method that accepts a single parameter and assigns its value to a private member variable.

Example:

void set_row_count(std::int64_t row_count);
Table#

A oneDAL concept for a dataset that contains only numerical data, either categorical or continuous. Serves as a medium for transferring data between the user’s application and computations inside oneDAL. Hides the details of the data format and generalizes access to the data.

Workload#

A problem of applying a oneDAL algorithm to a dataset.

Common oneAPI terms#

API#

Application Programming Interface

DPC++#

Data Parallel C++ (DPC++) is a high-level language designed for data parallel programming productivity. DPC++ is based on SYCL* from the Khronos* Group to support data parallelism and heterogeneous programming.

Host/Device#

In OpenCL [OpenCLSpec], the host is the CPU that controls the connected device (for example, a GPU) executing kernels.

JIT#

Just-in-Time compilation: compilation performed during the execution of a program.

Kernel#

Code written in OpenCL [OpenCLSpec] or SYCL and executed on a GPU device.

SPIR-V#

Standard Portable Intermediate Representation V (SPIR-V) is a language for the intermediate representation of compute kernels.

SYCL#

SYCL(TM) [SYCLSpec] is a high-level programming model for OpenCL(TM) that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++.

Distributed computational mode terms#

Communicator#

A oneDAL concept for an object that is used to perform inter-process collective operations.

Communicator backend#

A particular library providing collective operations.

Examples: oneCCL, oneMPI

SPMD#

Single Program, Multiple Data (SPMD) is a technique employed to achieve parallelism. In the SPMD model, multiple autonomous processors simultaneously execute the same program at independent points.