Commit 37b07cb

topepo, juliasilge, hfrick, and DavisVaughan authored
Revised model documentation (#456)
* re-organize linear_reg documentation
* search across parsnip adjacent packages for details files
* added stan details
* Made clickable link for lm()
* Move some details down to highlight most important stuff
* Typo
* Spacing
* Edits to wording for clarity
* doc refresh
* extended glmnet documentation
* Update man/rmd/glmnet-details.Rmd
  Co-authored-by: Hannah Frick <[email protected]>
* Update R/linear_reg.R
  Co-authored-by: Hannah Frick <[email protected]>
* Update R/aaa_models.R
  Co-authored-by: Davis Vaughan <[email protected]>
* better documentation based on review comments
* move to underscore in file names
* use linked versions of function names
* more information on additional engines and tidymodels.org
* expand the package exclusion list
* added boosted tree docs
* reworked the parameter code to use tunable
* early_stop re-added (required devel dials)
* minor linear_reg updates and use of templates
* Update man/rmd/glmnet-details.Rmd
  Co-authored-by: Hannah Frick <[email protected]>
* Update man/rmd/glmnet-details.Rmd
  Co-authored-by: Hannah Frick <[email protected]>
* update boosting pages
* better seealso and references
* decision_tree files
* fix failing test case
* logistic_reg files
* mars files
* mlp files
* multinomial files
* fix some file names
* un-needed files
* knn files
* rand_forest files
* svm files
* cleaned up titles (no more "general interfaces")
* standardize on "specific engines only"
* remove "Parameters can be represented by a placeholder" in examples
* Update man/rmd/glmnet-details.Rmd
  Co-authored-by: Hannah Frick <[email protected]>
* Update man/rmd/glmnet-details.Rmd
  Co-authored-by: Hannah Frick <[email protected]>
* suggestions from Hannah
* fixed a few bugs/typos
* bug fix
* dynamic @Seealso
* added an overview of dynamic documentation bits.
* updated glmnet information
* small doc updates for glmnet
* Update NEWS.md
  Co-authored-by: Hannah Frick <[email protected]>
* prototype sections for worked examples
* fix train.test indices
* more roxygenization
* added man-roxygen to build ignore
* mode for null model
* examples for rand_forest() with engines ranger and randomForest
* examples for svm_linear() with engines kernlab and LiblineaR
* set seed for reproducibility
* examples for `svm_poly()` and `svm_rbf()`
* clean-up
* add sentence about model spec
* add example for `multinom_reg()` with penguins
* Need this for `devtools::document()` now
* Edits to doc tools
* Refine boosted tree docs
* Refine decision_tree() docs
* Refine linear/logistic docs
* Refine mars, mlp, multinom (plus logistic again)
* Finish refining model pages
* Refine details pages
* Finish up details pages, and document
* tidy up examples of class prediction
* doc refresh after updating from master
* remove examples as they slow down `document()`
* remove another example
* update tree splitting template (and its name)
* remove default engine text
* add default engine to list
* maybe fix GHA issues when new dependencies are on CRAN
* remove multilevelmod reference
* doc refresh
* missing comma

Co-authored-by: Julia Silge <[email protected]>
Co-authored-by: Hannah Frick <[email protected]>
Co-authored-by: Davis Vaughan <[email protected]>
Co-authored-by: Hannah Frick <[email protected]>
1 parent ae16fb8 commit 37b07cb

166 files changed: +6384 -2909 lines changed

.Rbuildignore

Lines changed: 1 addition & 0 deletions
@@ -19,3 +19,4 @@ derby.log
1919
^README\.html$
2020
^codecov\.yml$
2121
^LICENSE\.md$
22+
^man-roxygen$

.github/workflows/R-CMD-check.yaml

Lines changed: 2 additions & 0 deletions
@@ -65,6 +65,8 @@ jobs:
6565
run: |
6666
pak::local_system_requirements(execute = TRUE)
6767
pak::pkg_system_requirements("rcmdcheck", execute = TRUE)
68+
pak::pkg_system_requirements("textshaping", execute = TRUE)
69+
pak::pkg_system_requirements("gert", execute = TRUE)
6870
shell: Rscript {0}
6971

7072
- name: Install dependencies

DESCRIPTION

Lines changed: 4 additions & 2 deletions
@@ -55,6 +55,8 @@ Suggests:
5555
modeldata,
5656
LiblineaR,
5757
Matrix,
58-
mgcv
59-
Remotes:
58+
mgcv,
59+
dials (>= 0.0.9.9000)
60+
Remotes:
61+
tidymodels/dials,
6062
topepo/C5.0

NAMESPACE

Lines changed: 4 additions & 0 deletions
@@ -138,6 +138,7 @@ export(control_parsnip)
138138
export(convert_stan_interval)
139139
export(decision_tree)
140140
export(eval_args)
141+
export(find_engine_files)
141142
export(fit)
142143
export(fit.model_spec)
143144
export(fit_control)
@@ -158,6 +159,8 @@ export(linear_reg)
158159
export(logistic_reg)
159160
export(make_call)
160161
export(make_classes)
162+
export(make_engine_list)
163+
export(make_seealso_list)
161164
export(mars)
162165
export(maybe_data_frame)
163166
export(maybe_matrix)
@@ -230,6 +233,7 @@ export(update_main_parameters)
230233
export(varying)
231234
export(varying_args)
232235
export(xgb_train)
236+
importFrom(dplyr,"%>%")
233237
importFrom(dplyr,arrange)
234238
importFrom(dplyr,as_tibble)
235239
importFrom(dplyr,bind_cols)

NEWS.md

Lines changed: 5 additions & 1 deletion
@@ -33,7 +33,11 @@
3333

3434
* `set_mode()` now checks if `mode` is compatible with the model class, similar to `new_model_spec()` (@jtlandis, #467). Both `set_mode()` and `set_engine()` now error for `NULL` or missing arguments (#503).
3535

36-
* Re-organized model documentation for `update` methods (#479).
36+
* Re-organized model documentation:
37+
38+
* `update` methods were moved out of the model help files (#479).
39+
* Each model/engine combination has its own help page.
40+
* The model help page has a dynamic bulleted list of the engines with links to the individual help pages.
3741

3842
* `generics::required_pkgs()` was extended for `parsnip` objects.
3943

R/aaa_models.R

Lines changed: 150 additions & 1 deletion
@@ -65,7 +65,7 @@ get_model_env <- function() {
6565
#' @export
6666
get_from_env <- function(items) {
6767
mod_env <- get_model_env()
68-
rlang::env_get(mod_env, items)
68+
rlang::env_get(mod_env, items, default = NULL)
6969
}
7070

7171
#' @rdname get_model_env
@@ -497,6 +497,7 @@ set_model_mode <- function(model, mode) {
497497

498498
#' @rdname set_new_model
499499
#' @keywords internal
500+
#' @importFrom dplyr %>%
500501
#' @export
501502
set_model_engine <- function(model, mode, eng) {
502503
check_model_exists(model)
@@ -951,3 +952,151 @@ get_encoding <- function(model) {
951952
}
952953
res
953954
}
955+
956+
#' Tools for dynamically documenting packages
957+
#'
958+
#' @description
959+
#' These are functions used to create dynamic documentation in Rd files
960+
#' based on which parsnip-related packages are loaded by the user.
961+
#'
962+
#' These functions can be used to make dynamic lists of documentation help
963+
#' files. \pkg{parsnip} uses these, along with files contained in `man/rmd`
964+
#' containing expanded documentation, for specific model/engine combinations.
965+
#' [find_engine_files()] looks for files that have the pattern
966+
#' `details_{model}_{engine}.Rd` to link to. These files are generated by files
967+
#' named `man/rmd/details_{model}_{engine}.Rmd`. `make_engine_list()` creates a
968+
#' list seen at the top of the model Rd files while `make_seealso_list()`
969+
#' populates the list seen in "See Also" below. See the details section.
970+
#'
971+
#' @param mod A character string for the model file (e.g. "linear_reg")
972+
#' @return
973+
#' `make_engine_list()` returns a character string that creates a
974+
#' bulleted list of links to more specific help files.
975+
#'
976+
#' `make_seealso_list()` returns a formatted character string of links.
977+
#'
978+
#' `find_engine_files()` returns a tibble.
979+
#' @details
980+
#' The \pkg{parsnip} documentation is generated _dynamically_. Part of the Rd
981+
#' file populates a list of engines that depends on what packages are loaded
982+
#' *at the time that the man file is loaded*. For example, if
983+
#' another package has a new engine for `linear_reg()`, the
984+
#' `parsnip::linear_reg()` help can show a link to a detailed help page in the
985+
#' other package.
986+
#'
987+
#' To enable this, the process for a package developer is to:
988+
#'
989+
#' 1. Create an engine-specific R file in the `R` directory with the name
990+
#' `{model}_{engine}.R` (e.g. `boost_tree_C5.0.R`). This has a small amount of
991+
#' documentation, as well as the directive
992+
#' "`@includeRmd man/rmd/{model}_{engine}.Rmd details`".
993+
#'
994+
#' 1. Copy the file in \pkg{parsnip} that is in `man/rmd/setup.Rmd` and put
995+
#' it in the same place in your package.
996+
#'
997+
#' 1. Write your own `man/rmd/{model}_{engine}.Rmd` file. This can include
998+
#' packages that are not listed in the DESCRIPTION file. Those are only
999+
#' required when the documentation file is created locally (probably using
1000+
#' [devtools::document()]).
1001+
#'
1002+
#' 1. Run [devtools::document()] so that the Rmd content is included in the
1003+
#' Rd file.
1004+
#'
1005+
#' The examples in \pkg{parsnip} can provide guidance for how to organize
1006+
#' technical information about the models.
1007+
#' @name doc-tools
1008+
#' @keywords internal
1009+
#' @export
1010+
#' @examples
1011+
#' find_engine_files("linear_reg")
1012+
#' cat(make_engine_list("linear_reg"))
1013+
find_engine_files <- function(mod) {
1014+
1015+
# Get available topics
1016+
topic_names <- search_for_engine_docs(mod)
1017+
if (length(topic_names) == 0) {
1018+
return(character(0))
1019+
}
1020+
1021+
# Subset for our model function
1022+
eng <- strsplit(topic_names, "_")
1023+
eng <- purrr::map_chr(eng, ~ .x[length(.x)])
1024+
eng <- tibble::tibble(engine = eng, topic = topic_names)
1025+
1026+
# Combine them to keep the order in which they were registered
1027+
all_eng <- get_from_env(mod) %>% dplyr::distinct(engine)
1028+
all_eng$.order <- 1:nrow(all_eng)
1029+
eng <- dplyr::left_join(eng, all_eng, by = "engine")
1030+
eng <- eng[order(eng$.order),]
1031+
1032+
# Determine and label default engine
1033+
default <- get_default_engine(mod)
1034+
eng$default <- ifelse(eng$engine == default, " (default)", "")
1035+
1036+
eng
1037+
}
1038+
1039+
#' @export
1040+
#' @rdname doc-tools
1041+
make_engine_list <- function(mod) {
1042+
eng <- find_engine_files(mod)
1043+
1044+
res <-
1045+
glue::glue(" \\item \\code{\\link[=|eng$topic|]{|eng$engine|} |eng$default| }",
1046+
.open = "|", .close = "|")
1047+
1048+
res <- paste0("\\itemize{\n", paste0(res, collapse = "\n"), "\n}")
1049+
res
1050+
}
1051+
1052+
get_default_engine <- function(mod) {
1053+
cl <- rlang::call2(mod, .ns = "parsnip")
1054+
rlang::eval_tidy(cl)$engine
1055+
}
1056+
1057+
#' @export
1058+
#' @rdname doc-tools
1059+
make_seealso_list <- function(mod) {
1060+
eng <- find_engine_files(mod)
1061+
1062+
res <-
1063+
glue::glue("\\code{\\link[=|eng$topic|]{|eng$engine| engine details}}",
1064+
.open = "|", .close = "|")
1065+
1066+
main <- c("\\code{\\link[=fit.model_spec]{fit.model_spec()}}",
1067+
"\\code{\\link[=set_engine]{set_engine()}}",
1068+
"\\code{\\link[=update]{update()}}"
1069+
)
1070+
paste0(c(main, res), collapse = ", ")
1071+
}
1072+
1073+
# These will never have documentation and we can avoid searching them.
1074+
excl_pkgs <-
1075+
c("C50", "Cubist", "earth", "flexsurv", "forecast", "glmnet",
1076+
"keras", "kernlab", "kknn", "klaR", "LiblineaR", "liquidSVM",
1077+
"magrittr", "MASS", "mda", "mixOmics", "naivebayes", "nnet",
1078+
"prophet", "pscl", "randomForest", "ranger", "rpart", "rstanarm",
1079+
"sparklyr", "stats", "survival", "xgboost", "xrf")
1080+
1081+
search_for_engine_docs <- function(mod) {
1082+
all_deps <- get_from_env(paste0(mod, "_pkgs"))
1083+
all_deps <- unlist(all_deps$pkg)
1084+
all_deps <- unique(c("parsnip", all_deps))
1085+
1086+
all_deps <- all_deps[!(all_deps %in% excl_pkgs)]
1087+
res <- purrr::map(all_deps, find_details_topics, mod = mod)
1088+
res <- unique(unlist(res))
1089+
res
1090+
}
1091+
1092+
find_details_topics <- function(pkg, mod) {
1093+
meta_loc <- system.file("Meta/Rd.rds", package = pkg)
1094+
meta_loc <- meta_loc[meta_loc != ""]
1095+
if (length(meta_loc) > 0) {
1096+
topic_names <- readRDS(meta_loc)$Name
1097+
res <- grep(paste0("details_", mod), topic_names, value = TRUE)
1098+
} else {
1099+
res <- character(0)
1100+
}
1101+
res
1102+
}
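
The topic lookup and list construction performed by the doc tools above can be sketched without parsnip's model environment. What follows is a minimal base R illustration under stated assumptions, not the package's implementation: the topic names are made up for the example, `sprintf()` stands in for `glue::glue()`, the default engine is passed in by hand, and `engine_table()` and `rd_engine_list()` are hypothetical helper names.

```r
# Hypothetical Rd topic names, standing in for readRDS("Meta/Rd.rds")$Name.
topic_names <- c(
  "boost_tree",
  "details_boost_tree_xgboost",
  "details_boost_tree_C5.0",
  "details_linear_reg_lm"
)

# Hypothetical helper: keep the details topics for one model and take the
# text after the last underscore as the engine name.
engine_table <- function(mod, topics, default_engine) {
  hits <- grep(paste0("details_", mod), topics, value = TRUE, fixed = TRUE)
  if (length(hits) == 0) {
    # Return an empty table (not a bare vector) so column access still works.
    return(data.frame(engine = character(0), topic = character(0),
                      default = character(0), stringsAsFactors = FALSE))
  }
  pieces <- strsplit(hits, "_", fixed = TRUE)
  engine <- vapply(pieces, function(x) x[length(x)], character(1))
  data.frame(
    engine = engine,
    topic = hits,
    default = ifelse(engine == default_engine, " (default)", ""),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: format the table as an Rd bulleted list of links.
rd_engine_list <- function(eng) {
  items <- sprintf("  \\item \\code{\\link[=%s]{%s}}%s",
                   eng$topic, eng$engine, eng$default)
  paste0("\\itemize{\n", paste0(items, collapse = "\n"), "\n}")
}

eng <- engine_table("boost_tree", topic_names, default_engine = "xgboost")
cat(rd_engine_list(eng), "\n")
```

One design difference is worth noting: in the diff, `find_engine_files()` returns `character(0)` when no topics are found, although its documentation says it returns a tibble, and `eng$topic` in `make_engine_list()` would be expected to fail on a bare vector. The sketch returns an empty data frame in that case so the formatting step still runs.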

R/augment.R

Lines changed: 3 additions & 2 deletions
@@ -3,13 +3,14 @@
33
#' `augment()` will add column(s) for predictions to the given data.
44
#'
55
#' For regression models, a `.pred` column is added. If `x` was created using
6-
#' [fit()] and `new_data` contains the outcome column, a `.resid` column is
6+
#' [fit.model_spec()] and `new_data` contains the outcome column, a `.resid` column is
77
#' also added.
88
#'
99
#' For classification models, the results can include a column called
1010
#' `.pred_class` as well as class probability columns named `.pred_{level}`.
1111
#' This depends on what type of prediction types are available for the model.
12-
#' @param x A `model_fit` object produced by [fit()] or [fit_xy()].
12+
#' @param x A `model_fit` object produced by [fit.model_spec()] or
13+
#' [fit_xy.model_spec()].
1314
#' @param new_data A data frame or matrix.
1415
#' @param ... Not currently used.
1516
#' @rdname augment
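
The column behavior described in the `augment()` documentation above can be illustrated with a base R stand-in. This is a sketch only: it uses `lm()` and `predict()` rather than parsnip's `augment()` method, and it assumes the residual is the observed outcome minus the prediction.

```r
# Fit a plain linear model as a stand-in for a fitted regression model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_data <- head(mtcars, 5)

# Add a .pred column for the predictions.
augmented <- new_data
augmented$.pred <- unname(predict(fit, newdata = new_data))

# Add .resid only when the outcome column is present in new_data.
if ("mpg" %in% names(new_data)) {
  augmented$.resid <- new_data$mpg - augmented$.pred
}

print(augmented[, c("mpg", ".pred", ".resid")])
```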

R/boost_tree.R

Lines changed: 28 additions & 67 deletions
@@ -1,98 +1,58 @@
11
# Prototype parsnip code for boosted trees
22

3-
#' General Interface for Boosted Trees
3+
#' Boosted trees
44
#'
5-
#' `boost_tree()` is a way to generate a _specification_ of a model
6-
#' before fitting and allows the model to be created using
7-
#' different packages in R or via Spark. The main arguments for the
8-
#' model are:
9-
#' \itemize{
10-
#' \item \code{mtry}: The number of predictors that will be
11-
#' randomly sampled at each split when creating the tree models.
12-
#' \item \code{trees}: The number of trees contained in the ensemble.
13-
#' \item \code{min_n}: The minimum number of data points in a node
14-
#' that is required for the node to be split further.
15-
#' \item \code{tree_depth}: The maximum depth of the tree (i.e. number of
16-
#' splits).
17-
#' \item \code{learn_rate}: The rate at which the boosting algorithm adapts
18-
#' from iteration-to-iteration.
19-
#' \item \code{loss_reduction}: The reduction in the loss function required
20-
#' to split further.
21-
#' \item \code{sample_size}: The amount of data exposed to the fitting routine.
22-
#' \item \code{stop_iter}: The number of iterations without improvement before
23-
#' stopping.
24-
#' }
25-
#' These arguments are converted to their specific names at the
26-
#' time that the model is fit. Other options and arguments can be
27-
#' set using the `set_engine()` function. If left to their defaults
28-
#' here (`NULL`), the values are taken from the underlying model
29-
#' functions. If parameters need to be modified, `update()` can be used
30-
#' in lieu of recreating the object from scratch.
5+
#' @description
6+
#'
7+
#' `boost_tree()` defines a model that creates a series of decision trees
8+
#' forming an ensemble. Each tree depends on the results of previous trees.
9+
#' All trees in the ensemble are combined to produce a final prediction.
10+
#'
11+
#' There are different ways to fit this model. See the engine-specific pages
12+
#' for more details:
13+
#'
14+
#' \Sexpr[stage=render,results=rd]{parsnip:::make_engine_list("boost_tree")}
15+
#'
16+
#' More information on how \pkg{parsnip} is used for modeling is at
17+
#' \url{https://www.tidymodels.org/}.
3118
#'
3219
#' @param mode A single character string for the prediction outcome mode.
3320
#' Possible values for this model are "unknown", "regression", or
3421
#' "classification".
3522
#' @param engine A single character string specifying what computational engine
36-
#' to use for fitting. Possible engines are listed below. The default for this
37-
#' model is `"xgboost"`.
23+
#' to use for fitting.
3824
#' @param mtry A number for the number (or proportion) of predictors that will
39-
#' be randomly sampled at each split when creating the tree models (`xgboost`
40-
#' only).
25+
#' be randomly sampled at each split when creating the tree models
26+
#' (specific engines only).
4127
#' @param trees An integer for the number of trees contained in
4228
#' the ensemble.
4329
#' @param min_n An integer for the minimum number of data points
4430
#' in a node that is required for the node to be split further.
4531
#' @param tree_depth An integer for the maximum depth of the tree (i.e. number
46-
#' of splits) (`xgboost` only).
32+
#' of splits) (specific engines only).
4733
#' @param learn_rate A number for the rate at which the boosting algorithm adapts
48-
#' from iteration-to-iteration (`xgboost` only).
34+
#' from iteration-to-iteration (specific engines only).
4935
#' @param loss_reduction A number for the reduction in the loss function required
50-
#' to split further (`xgboost` only).
36+
#' to split further (specific engines only).
5137
#' @param sample_size A number for the number (or proportion) of data that is
5238
#' exposed to the fitting routine. For `xgboost`, the sampling is done at
5339
#' each iteration while `C5.0` samples once during training.
5440
#' @param stop_iter The number of iterations without improvement before
55-
#' stopping (`xgboost` only).
56-
#' @details
57-
#' The data given to the function are not saved and are only used
58-
#' to determine the _mode_ of the model. For `boost_tree()`, the
59-
#' possible modes are "regression" and "classification".
41+
#' stopping (specific engines only).
6042
#'
61-
#' The model can be created using the `fit()` function using the
62-
#' following _engines_:
63-
#' \itemize{
64-
#' \item \pkg{R}: `"xgboost"` (the default), `"C5.0"`
65-
#' \item \pkg{Spark}: `"spark"`
66-
#' }
43+
#' @template spec-details
6744
#'
68-
#' For this model, other packages may add additional engines. Use
69-
#' [show_engines()] to see the current set of engines.
45+
#' @template spec-references
7046
#'
71-
#' @includeRmd man/rmd/boost-tree.Rmd details
47+
#' @seealso \Sexpr[stage=render,results=rd]{parsnip:::make_seealso_list("boost_tree")},
48+
#' [xgb_train()], [C5.0_train()]
7249
#'
73-
#' @note For models created using the spark engine, there are
74-
#' several differences to consider. First, only the formula
75-
#' interface to via `fit()` is available; using `fit_xy()` will
76-
#' generate an error. Second, the predictions will always be in a
77-
#' spark table format. The names will be the same as documented but
78-
#' without the dots. Third, there is no equivalent to factor
79-
#' columns in spark tables so class predictions are returned as
80-
#' character columns. Fourth, to retain the model object for a new
81-
#' R session (via `save()`), the `model$fit` element of the `parsnip`
82-
#' object should be serialized via `ml_save(object$fit)` and
83-
#' separately saved to disk. In a new session, the object can be
84-
#' reloaded and reattached to the `parsnip` object.
85-
#'
86-
#' @importFrom purrr map_lgl
87-
#' @seealso [fit()], [set_engine()], [update()]
8850
#' @examples
8951
#' show_engines("boost_tree")
9052
#'
9153
#' boost_tree(mode = "classification", trees = 20)
92-
#' # Parameters can be represented by a placeholder:
93-
#' boost_tree(mode = "regression", mtry = varying())
9454
#' @export
95-
55+
#' @importFrom purrr map_lgl
9656
boost_tree <-
9757
function(mode = "unknown",
9858
engine = "xgboost",
@@ -573,7 +533,8 @@ xgb_by_tree <- function(tree, object, new_data, type, ...) {
573533
#' random proportion of the data should be used to train the model.
574534
#' By default, all the samples are used for model training. Samples
575535
#' not used for training are used to evaluate the accuracy of the
576-
#' model in the printed output.
536+
#' model in the printed output. A value of zero means that all the training
537+
#' data are used.
577538
#' @param ... Other arguments to pass.
578539
#' @return A fitted C5.0 model.
579540
#' @keywords internal
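
The new `boost_tree()` description says that each tree depends on the results of previous trees and that all trees are combined into a final prediction. A toy base R loop can make that concrete. It is a minimal sketch of least-squares boosting with one-split stumps on simulated data, not what xgboost or C5.0 do internally; `fit_stump()`, `predict_stump()`, and `boost()` are hypothetical names, and `trees` and `learn_rate` only mirror the argument names documented above.

```r
# Toy data: a noisy step function of one predictor.
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- ifelse(x > 0.5, 2, 0) + rnorm(50, sd = 0.1)

# Fit a one-split regression stump to a response by least squares.
fit_stump <- function(x, r) {
  cuts <- head(sort(unique(x)), -1)  # candidate split points
  sse <- vapply(cuts, function(ct) {
    left <- r[x <= ct]
    right <- r[x > ct]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

boost <- function(x, y, trees = 20, learn_rate = 0.3) {
  pred <- rep(mean(y), length(y))  # start from the overall mean
  mse <- numeric(trees)
  for (i in seq_len(trees)) {
    stump <- fit_stump(x, y - pred)  # each tree is fit to current residuals
    pred <- pred + learn_rate * predict_stump(stump, x)
    mse[i] <- mean((y - pred)^2)
  }
  list(pred = pred, mse = mse)
}

res <- boost(x, y)
print(round(res$mse, 4))
```

The final prediction is the running sum of the scaled stump predictions, so the training error cannot increase from one iteration to the next with this loss and learning rate.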

R/boost_tree_C5.0.R

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
#' Boosted trees via C5.0
2+
#'
3+
#' [C50::C5.0()] creates a series of classification trees forming an
4+
#' ensemble. Each tree depends on the results of previous trees. All trees in
5+
#' the ensemble are combined to produce a final prediction.
6+
#'
7+
#' @includeRmd man/rmd/boost_tree_C5.0.Rmd details
8+
#'
9+
#' @name details_boost_tree_C5.0
10+
#' @keywords internal
11+
NULL
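
The file above follows the naming pattern laid out in the doc-tools help: an R file named `{model}_{engine}.R` that includes `man/rmd/{model}_{engine}.Rmd` and registers the topic `details_{model}_{engine}`. As a rough sketch for extension-package authors, the stub could be generated as below. `engine_doc_stub()` is a hypothetical helper written for this illustration; it is not part of parsnip.

```r
# Hypothetical helper: build the roxygen stub for one model/engine pair.
engine_doc_stub <- function(model, engine, title, description) {
  topic <- paste0("details_", model, "_", engine)
  rmd <- file.path("man", "rmd", paste0(model, "_", engine, ".Rmd"))
  c(
    paste0("#' ", title),
    "#'",
    paste0("#' ", description),
    "#'",
    paste0("#' @includeRmd ", rmd, " details"),
    "#'",
    paste0("#' @name ", topic),
    "#' @keywords internal",
    "NULL"
  )
}

stub <- engine_doc_stub(
  model = "boost_tree",
  engine = "C5.0",
  title = "Boosted trees via C5.0",
  description = "[C50::C5.0()] creates a series of classification trees forming an ensemble."
)
writeLines(stub)
```

The lines could be written to `R/boost_tree_C5.0.R` with `writeLines(stub, path)`, after which `devtools::document()` pulls the Rmd content into the Rd file, as described in the developer steps.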
