---
title: "Introduction to processpredictR: workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to processpredictR: workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  cache = FALSE
)
```
```{r setup, message = F, eval = T}
library(processpredictR)
library(bupaR)
library(ggplot2)
library(dplyr)
library(keras)
library(purrr)
```
# Introduction
The goal of processpredictR is to perform prediction tasks on processes using event logs and Transformer models.
The five supported prediction tasks are defined as follows (each task is selected via the `task` argument of `prepare_examples()`, as sketched after this list):
* _outcome_: predict the case outcome, which can be the last activity or a manually defined variable
* _next activity_: predict the next activity instance
* _remaining trace_: predict the sequence of all next activity instances
* _next time_: predict the start time of the next activity instance
* _remaining time_: predict the remaining time till the end of the case
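As a minimal sketch of how a task is selected (the task label `"next_activity"` is an assumption based on the task names above; `traffic_fines` is the event log introduced in the Preprocessing section below):
```{r}
# Sketch: the prediction task is selected via the task argument of
# prepare_examples(); the label "next_activity" is an assumption here.
prepare_examples(traffic_fines, task = "next_activity")
```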
The overall approach using `processpredictR` is shown in the figure below. `prepare_examples()` transforms a log into a dataset that can be used for training and prediction, which is thereafter split into a train and a test set. Subsequently, a model is created, compiled and fitted. Finally, the model can be used to make predictions and can be evaluated.
```{r echo = F, eval = T, out.width = "60%", fig.align = "center"}
knitr::include_graphics("framework.PNG")
```
Different levels of customization are offered. Using `create_model()`, a standard off-the-shelf model can be created for each of the supported tasks, based on a standard set of features.
A first customization is to include additional features, such as case or event attributes. These can be configured in the `prepare_examples()` step and are processed automatically (numerical features are normalized, categorical features are one-hot encoded).
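For example, attributes of the log used later in this tutorial could be included as follows (a minimal sketch; the `features` argument and the `amount` and `vehicleclass` attributes are assumptions):
```{r}
# Sketch: include log attributes as additional features (hedged example).
# amount is numerical (normalized), vehicleclass is categorical (one-hot encoded).
df_features <- prepare_examples(traffic_fines,
                                task = "outcome",
                                features = c("amount", "vehicleclass"))
```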
A further way to customize your model is to generate only the input layer of the model with `create_model()` and to define the remainder of the model yourself by adding `keras` layers with the provided `stack_layers()` function.
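A minimal sketch of this approach, assuming `create_model()` accepts a `custom` argument to return only the input layers, and using the train set created in the Preprocessing section below:
```{r}
# Sketch: build only the input layers, then stack keras layers on top
# (the custom argument is an assumption).
custom_model <- split$train_df %>%
  create_model(custom = TRUE, name = "my_custom_model") %>%
  stack_layers(layer_dropout(rate = 0.1)) %>%
  stack_layers(layer_dense(units = 64, activation = "relu"))
```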
Going beyond that, you can also create the model entirely yourself using `keras`, including the preprocessing of the data. Auxiliary functions are provided to help you with, e.g., tokenizing activity sequences.
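For instance, assuming the package exposes a `tokenize()` helper (an assumption; consult the package reference for the exact name and return value):
```{r}
# Sketch: turn activity prefixes into integer token sequences for use in a
# hand-built keras model (tokenize() and its output structure are assumed).
tokens <- split$train_df %>% tokenize()
```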
In the remainder of this tutorial, each of the steps and possible avenues for customization will be described in more detail.
# Preprocessing
As a first step in the process prediction workflow, we use `prepare_examples()` to obtain a dataset in which:
* each row/observation corresponds to a unique activity instance id
* the `prefix` (and `prefix_list`) column stores the sequence of activities already executed in the case
* the necessary features and target variables are calculated and/or added
The returned object is of class `ppred_examples_df`, which inherits from `tbl_df`.
In this tutorial we will use the `traffic_fines` event log from `eventdataR`. Note that both `eventlog` and `activitylog` objects, as defined by `bupaR`, are supported.
```{r, eval = T}
df <- prepare_examples(traffic_fines, task = "outcome")
df
```
We split the transformed dataset `df` into a train and a test set for later use in `fit()` and `predict()`, respectively. The proportion of cases in the train set is configured with the `split` argument.
```{r, eval = T}
set.seed(123)
split <- df %>% split_train_test(split = 0.8)
split$train_df %>% head(5)
split$test_df %>% head(5)
```
It is important to note that the split is done at the case level: a case is fully part of either the train data or the test data. Furthermore, the split is done chronologically, meaning that the train set contains the first split% of cases and the test set contains the last (1-split)% of cases.
Note that because the split is done at the case level, the proportion of all examples in the train set can deviate slightly from `split`, as cases differ with respect to their length.
```{r, eval = T}
nrow(split$train_df) / nrow(df)
n_distinct(split$train_df$case_id) / n_distinct(df$case_id)
```
# Transformer model
The next step in the workflow is to build a model. `processpredictR` provides a default set of functions that wrap generics provided by `keras`. For ease of use, preprocessing steps such as tokenizing sequences and normalizing numerical features happen within the `create_model()` function and are abstracted away from the user.
## Define model
Based on the train set, we define the default transformer model using `create_model()`.
```{r}
model <- split$train_df %>% create_model(name = "my_model")
# additional arguments applicable to keras::keras_model() can be passed via ...
model # a list; printing shows the keras model summary
```
```
#> Model: "my_model"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> input_1 (InputLayer) [(None, 9)] 0
#> token_and_position_embedding (Toke (None, 9, 36) 792
#> nAndPositionEmbedding)
#> transformer_block (TransformerBloc (None, 9, 36) 26056
#> k)
#> global_average_pooling1d (GlobalAv (None, 36) 0
#> eragePooling1D)
#> dropout_3 (Dropout) (None, 36) 0
#> dense_3 (Dense) (None, 64) 2368
#> dropout_2 (Dropout) (None, 64) 0
#> dense_2 (Dense) (None, 6) 390
#> ================================================================================
#> Total params: 29,606
#> Trainable params: 29,606
#> Non-trainable params: 0
#> ________________________________________________________________________________
```
Some useful information and metrics are stored for traceability and easy extraction when needed.
```{r}
model %>% names() # names of the elements in the returned list
```
```
#> [1] "model" "max_case_length" "number_features" "task"
#> [5] "num_outputs" "vocabulary"
```
Note that `create_model()` returns a list, in which the actual keras model is stored under the element name `model`. Thus, we can use functions from the keras package as follows:
```{r}
model$model$name # get the name of a model
```
```
#> [1] "my_model"
```
```{r}
model$model$non_trainable_variables # list of non-trainable parameters of a model
```
```
#> list()
```
The result of `create_model()` is assigned its own class (`ppred_model`), for which `processpredictR` provides the methods _compile()_, _fit()_, _predict()_ and _evaluate()_.
## Compilation
The following step is to compile the model. By default, the loss function is the log-cosh for regression tasks (next time and remaining time) and the categorical cross-entropy for classification tasks. It is of course possible to override these defaults, as sketched below.
```{r}
model %>% compile() # model compilation
```
```
#> Compilation complete!
```
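A minimal sketch of overriding the defaults, assuming _compile()_ forwards these arguments to `keras::compile()`:
```{r}
# Sketch: override the default optimizer, loss and metrics (hedged example;
# forwarding of these arguments to keras::compile() is an assumption).
model %>% compile(optimizer = optimizer_adam(learning_rate = 0.001),
                  loss = "sparse_categorical_crossentropy",
                  metrics = metric_sparse_categorical_accuracy())
```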
## Training
Training of the model is done with the `fit()` function. During training, a visualization window opens in the Viewer pane to show the progress in terms of loss. Optionally, the result of `fit()` can be assigned to an object to access the training metrics specified in _compile()_.
```{r}
hist <- fit(object = model, train_data = split$train_df, epochs = 5)
```
```{r}
hist$params
```
```
#> $verbose
#> [1] 1
#>
#> $epochs
#> [1] 5
#>
#> $steps
#> [1] 2227
```
```{r}
hist$metrics
```
```
#> $loss
#> [1] 0.7875332 0.7410239 0.7388409 0.7385073 0.7363014
#>
#> $sparse_categorical_accuracy
#> [1] 0.6539739 0.6713067 0.6730579 0.6735967 0.6747193
#>
#> $val_loss
#> [1] 0.7307042 0.7261314 0.7407018 0.7326428 0.7317348
#>
#> $val_sparse_categorical_accuracy
#> [1] 0.6725934 0.6727730 0.6725934 0.6725934 0.6722342
```
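The training history can also be visualized. Assuming `fit()` returns a standard keras training history, its `plot()` method yields a ggplot of the metrics per epoch:
```{r}
# Sketch: plot loss and accuracy per epoch (assumes hist is a
# keras_training_history object).
plot(hist)
```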
## Make predictions
The method _predict()_ can return 3 types of output by setting the argument `output` to `"append"`, `"y_pred"` or `"raw"`.
Test dataset with appended predicted values (output = "append", the default):
```{r}
predictions <- model %>% predict(test_data = split$test_df,
                                 output = "append") # the default
predictions %>% head(5)
```
```
#> # A tibble: 5 × 13
#> ith_case case_id prefix prefix_…¹ outcome k activ…² resou…³
```
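Predicted values of the target variable only (output = "y_pred"):
```{r}
# Sketch: same call as above with a different output type.
predictions_y <- model %>% predict(test_data = split$test_df,
                                   output = "y_pred")
```
Raw predicted values, i.e. the predicted probability for each outcome class (output = "raw"):
```{r}
# Sketch: same call as above, returning raw probabilities per class.
predictions_raw <- model %>% predict(test_data = split$test_df,
                                     output = "raw")
predictions_raw %>% head(10)
```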
```
#>             Payment Send for Credit Collection    Send Fine
#>  [1,] 4.966056e-01                 0.344094276 1.423686e-01
#>  [2,] 9.984029e-01                 0.001501600 8.890528e-05
#>  [3,] 4.966056e-01                 0.344094276 1.423686e-01
#>  [4,] 9.984029e-01                 0.001501600 8.890528e-05
#>  [5,] 4.966056e-01                 0.344094276 1.423686e-01
#>  [6,] 1.556145e-01                 0.518976271 2.884890e-01
#>  [7,] 2.345311e-01                 0.715000629 5.147375e-06
#>  [8,] 2.627363e-01                 0.726804197 5.480492e-06
#>  [9,] 3.347774e-05                 0.999961376 2.501280e-08
#> [10,] 4.966056e-01                 0.344094276 1.423686e-01
```