Skip to content

Conversation

llrs-roche
Copy link
Contributor

@llrs-roche llrs-roche commented Mar 5, 2025

Linked to insightsengineering/NEST-roadmap#36

Pull Request

Redesign extraction

This branch/PR is a POC of specification of user input to be extracted.
See tests added about how it works (currently most if not all of them should work), but simple example of specification:

spec <- c(datasets("df"), variables(where(is.factor)))

Features:

  • Specification can be composed, so it allows to reuse specifications.
  • Three different steps: specification, validation and extraction.
  • Supporting a new object class is as easy as adding a new extraction method (if the default doesn't work) so that user supplied functions work on that class.
  • Possible to set a defaults for the choices. (See TODO list).
  • Simple to define and reuse if user defines good names of each specification.

Examples

Resolve input to be used by module developer

The module_input_ui and module_input_server help the developer to get the required data.

devtools::load_all(".")
library("shiny")
library("teal")
options(shiny.reactlog = TRUE)


# UI function for the custom histogram module
histogram_module_ui <- function(id) {
  ns <- shiny::NS(id)
  shiny::tagList(
    shiny::selectInput(
      ns("dataset"),
      "Select Dataset",
      choices = NULL    ),
    shiny::selectInput(
      ns("variable"),
      "Select Variable",
      choices = NULL
    ),
    shiny::plotOutput(ns("histogram_plot")),
    shiny::verbatimTextOutput(ns("plot_code")) # To display the reactive plot code
  )
}

# Server function for the custom histogram module with injected variables in within()
histogram_module_server <- function(id, data, spec) {
  moduleServer(id, function(input, output, session) {
    # Update dataset choices based on available datasets in teal_data
    spec_resolved <- teal.transform::resolver(spec, data())

    shiny::updateSelectInput(
      session,
      "dataset",
      choices = spec_resolved$datasets$names,
      selected = spec_resolved$datasets$select
    )
    react_dataset <- reactive({
      req(input$dataset)
      req(input$dataset %in% spec_resolved$datasets$names)
      if (!input$dataset %in% spec_resolved$datasets$select) {
        spec_resolved |>
          update_spec("datasets", input$dataset) |>
          teal.transform::resolver(data())
      } else {
        spec_resolved
      }
    })

    react_variable <- reactive({
      req(input$variable, react_dataset())
      spec_resolved <- react_dataset()
      req(input$variable %in% spec_resolved$variables$names)
      if (!input$variable %in% spec_resolved$variables$select) {
        react_dataset() |>
          update_spec("variables", input$variable) |>
          teal.transform::resolver(data())
      } else {
        spec_resolved
      }
    })

    observe({
      spec_resolved <- req(react_dataset())
      shiny::updateSelectInput(
        session,
        "variable",
        choices = spec_resolved$variables$names,
        selected = spec_resolved$variables$select
      )
    })

    # Create a reactive `teal_data` object with the histogram plot
    result <- reactive({
      spec_resolved <- req(react_variable())
      # Create a new teal_data object with the histogram plot
      new_data <- within(
        data(),
        {
          my_plot <- hist(
            input_dataset[[input_vars]],
            las = 1,
            main = paste("Histogram of", input_vars),
            xlab = input_vars,
            col = "lightblue",
            border = "black"
          )
        },
        input_dataset = as.name(as.character(spec_resolved$datasets$select)), # Replace `input_dataset` with selection dataset
        input_vars = as.character(spec_resolved$variables$select) # Replace `input_vars` with selection
      )
      new_data
    })

    # Render the histogram from the updated teal_data object
    output$histogram_plot <- shiny::renderPlot({
      req(result())
      result()[["my_plot"]] # Access and render the plot stored in `new_data`
    })

    # Reactive expression to get the generated code for the plot
    output$plot_code <- shiny::renderText({
      teal.code::get_code(result()) # Retrieve and display the code for the updated `teal_data` object
    })
  })
}

# Custom histogram module creation
create_histogram_module <- function(label = "Histogram Module", spec) {

  teal::module(
    label = label,
    ui = histogram_module_ui,
    server = histogram_module_server,
    server_args = list(spec = spec),
    datanames = "all"
  )
}

# Initialize the teal app
app <- init(
  data = teal_data(IRIS = iris, MTCARS = mtcars),
  modules = modules(create_histogram_module(spec = c(datasets(where(is.data.frame)),
                                                     variables(where(is.numeric)))))
)

runApp(app)

Merge arbitrary data

The merge_module_srv accepts the inputs and merge arbitrary data processed by the module_input_server, this is done thanks to extract_input the responsible to obtain the requested data from data().

devtools::load_all(".")
library("teal")
library("shiny")
library("dplyr")

# Initialize the teal app
mod <- function(label, x, y) {
  module(
    ui = function(id, x, y) {
      ns <- NS(id)
      div(
        module_input_ui(ns("x"), "x", x),
        module_input_ui(ns("y"), "y", y),
        shiny::tagList(
          shiny::textOutput(ns("text")),
          shiny::tableOutput(ns("table"))
        )
      )
    },
    server = function(id, data, x, y) {
      moduleServer(id, function(input, output, session) {
        x_in <- module_input_server("x", x, data)
        y_in <- module_input_server("y", y, data)

        merged_data <- reactive({
          req(x_in(), y_in())
          data_to_merge <- list(x_in(), y_in())
          m <- merge_call_multiple(data_to_merge, ids = NULL, merge_function = "dplyr::full_join", data = data())
          m
        })

        output$table <- shiny::renderTable({
          merged_data()[["code"]][["ANL"]]
        })
      })
    },
    ui_args = list(x = x, y = y),
    server_args = list(x = x, y = y)
  )
}

data <- within(teal.data::teal_data(), {
  df1 <- data.frame(id = 1:10, var1 = letters[1:10], var2 = letters[1:10], var3 = 1:10)
  df2 <- data.frame(id = rep(1:10, 2), var2 = letters[1:20], var3 = factor(LETTERS[1:20]), var4 = LETTERS[1:20])
})

ids_df1 <- c("id", "var1", "var2")
ids_df2 <- c("id", "var2")

app <- init(
  data = data,
  modules = modules(
    mod(x = c(datasets("df1"), variables(all_of(ids_df1), all_of(ids_df1))),
        y = c(datasets("df2"), variables(all_of(ids_df2), all_of(ids_df2))))
  )
)

shinyApp(app$ui, app$server)


MAE

devtools::load_all()
library("DFplyr") #
# library("SummarizedExperiment")

# MAE ####
tda <- within(teal.data::teal_data(), {
  library("MultiAssayExperiment")
  m <- diag(5)
  i <- iris
  data(miniACC, envir = environment())
  mae2 <- hermes::multi_assay_experiment
})


r <- resolver(c(datasets(where(function(x){is(x, "MultiAssayExperiment")})),
           mae_colData(where(is.numeric))),
         tda)

tda[[r[[1]]$selected]] |>
  colData() |>
  select(!!r[[2]]$selected)

DONE

  • Data extraction and preparation design.
  • UI update based on partial resolution.
  • Check it is simple for app developers.
  • Use tidyselect instead of own operators and resolution.
  • Update API to not rely on operators on this package (rely only on those on tidyselect)
  • Test extraction on non data.frame objects: MAE objects/matrices, lists...

TODO 1

  • Check missing corners: test required changes on current modules (as per team previous decisions):
    • tmg: tm_missing_data, tm_g_distribution, tm_data_table, tm_g_bivariate;
    • tmc: tm_a_mmrm, tm_t_events, tm_t_summary
      Only up to the server body, check the defaults,
      8 hours each module, mmrm = 10 hours.
      If there's no bug, then this is the checkpoint of the stable feature branch.
  • PoC demo (the new solution with the modules using them) with the team

TODO 2

  • Verify value specification: Some modules request variable A vs other variables or a numeric input.
    • To test the data extraction and merging the module (ANL).
    • With different data types: data.frame, Matrices, MAE, SE
      20 hours
  • Verify bookmarking works
  • Verify it works on multiple data merges from multiple sources (cfr: [Bug]: Forced merge precludes using modules without joinin keys #258 )
  • Check it is simple for module developers: using resolver() is good enough? Is it simple to create the code for reproducibility with teal_data inside the module via extract()?
  • Decide defaults to constructors when multiple options are possible for: datasets(select = ), variables(select = ), values(select = ).
  • Handle UI validation: removing unnecessary input UI or gray out when no input is possible (just one choice). Example: selection that only allows to select data.frames if they have column A, there is no need to select a variable or to select different data.frames if there is only one data.frame with a column named A.
  • Documentation/help
  • Check transition notify dependencies/users
  • Update module packages to use that feature
    • teal.modules.general
    • teal.modules.clinical
    • Other
  • Set sensible defaults to specifications
    What should be the constructor of the modules should have as a default for the extract spec (i.e. is it data.frame? take all numeric columns? Only select the first one (with all available options)?
    Each module might have different requirements.

@chlebowa
Copy link
Contributor

chlebowa commented Mar 5, 2025

spec <- datasets("df") & variables(is.factor)

This would specify "factor columns from df", correct? Variables are specified with predicate functions?
Currently variable choices are specified with functions that act on the whole dataset, which allows for more flexibility, and which was deemed necessary in preliminary discussions. If one wants to use all columns whose names are in all caps, it cannot be done by applying predicates to columns.

@llrs-roche
Copy link
Contributor Author

llrs-roche commented Mar 5, 2025

This would specify "factor columns from df", correct?

Yes, I'm glad it is easy to understand.

Variables are specified with predicate functions? Currently variable choices are specified with functions that act on the whole dataset, which allows for more flexibility, and which was deemed necessary in preliminary discussions. If one wants to use all columns whose names are in all caps, it cannot be done by applying predicates to columns.

Not sure I fully understand your question. Variables can also be manually set, for example variables("A"). But filtering columns whose names are all in caps is currently possible:

td <- within(teal.data::teal_data(), {
  df <- data.frame(A = as.factor(LETTERS[1:5]), Ab = letters[1:5])
  m <- matrix()
})
spec <- datasets("df") & variables(function(x){x==toupper(x)})
resolver(spec, td) #no print method available yet, but basically resolves to df dataset and A variable

The only limitation is that the function provided should return a logical vector. In terms of implementation, the function is applied to the names and to the values, precisely to accommodate cases like this.

@chlebowa
Copy link
Contributor

chlebowa commented Mar 5, 2025

spec <- datasets("df") & variables(function(x){x==toupper(x)})

I don't get this one. I would think it would return columns where the values are in all caps.

@chlebowa
Copy link
Contributor

chlebowa commented Mar 5, 2025

In terms of implementation, the function is applied to the names and to the values, precisely to accommodate cases like this.

The same function is applied to column names and column values? 🤔

@llrs-roche
Copy link
Contributor Author

spec <- datasets("df") & variables(function(x){x==toupper(x)})

I don't get this one. I would think it would return columns where the values are in all caps.

It could mean both things imho. But here it is specifying variables, not values. For filtering by value there is values() (still work in progress).
Perhaps this other example is more representative of what I meant, (didn't realize the confusion when I reused an example with capital letters):

td <- within(teal.data::teal_data(), {
    df <- data.frame(A = as.factor(letters[1:5]), Ab = letters[1:5])
    m <- matrix()
})
spec <- datasets("df") & variables(function(x){x==toupper(x)})
resolver(spec, td)

Still selects A as the name is in capital letters

The same function is applied to column names and column values? 🤔

Yes, if there is a single argument that should accept is.factor and all_caps <- function(x){x == toupper(x)} and filter by names too with functions like stars_with(). I hope this helps.

@chlebowa
Copy link
Contributor

chlebowa commented Mar 5, 2025

Thanks, it does explain things a bit.

@llrs-roche
Copy link
Contributor Author

llrs-roche commented Mar 7, 2025

I updated the branch to make it easy to reflect updates on selections from the possible choices. Besides some internal functions, it now exports a new key function update_spec() to check if updated selections invalidate other selections (If the dataset changes, variables should change too).

Here is a full example with a shiny module and teal:

library("teal")
library("shiny")

# Initialize the teal app
mod <- function(label, x, y) {
  module(
    ui = function(id, x, y) {
      ns <- NS(id)
      div(
        module_input_ui(ns("x"), "x", x),
        module_input_ui(ns("y"), "y", y),
        shiny::tagList(
          shiny::textOutput(ns("text"))
        )
      )
    },
    server = function(id, data, x, y) {
      moduleServer(id, function(input, output, session) {
        x_in <- module_input_server("x", x, data)
        y_in <- module_input_server("y", y, data)

        output$text <- shiny::renderText({
          req(x_in(), y_in())
          l <- lapply(
            list(x = x_in, y = y_in),
            function(sel) {
              if (sum(lengths(sel()))) {
                paste0(
                  "Object: ", sel()$datasets, "\nVariables: ",
                  paste0(sel()$variables, collapse = ", ")
                )
              } else {
                "No info??"
              }
            }
          )
          unlist(l)
        })
      })
    },
    ui_args = list(x = x, y = y),
    server_args = list(x = x, y = y)
  )
}

Here I create three different specifications, you can pass them to the app and they will work:

iris_numeric <- datasets("IRIS") & variables(is.numeric)
mtcars_char <- datasets("MTCARS") & variables(is.character)
iris_numeric_or_mtcars_char <- iris_numeric | mtcars_char

app <- init(
  data = teal.data::teal_data(IRIS = iris, MTCARS = mtcars),
  modules = modules(
    mod(x = datasets("IRIS") & variables(is.factor),
        y = datasets("MTCARS") & variables(is.numeric))
  )
)

shinyApp(app$ui, app$server)

@llrs-roche
Copy link
Contributor Author

Short summary about the discussion we had regarding selection based on the possible choices and how to select them. We decided to keep the tidyselect on the selection but only pass the possible choices.

A small example for the extraction of the factors variables but selecting those without NAs:

devtools::load_all(".")

td <- within(teal.data::teal_data(), {
  i <- iris
  p <- penguins
  m <- diag(5)
})

r <- resolver(c(datasets(where(is.data.frame), "p"),
                variables(where(is.factor),
                          !where(anyNA))),
              td)
r$variables$select
## "species" "island"

We don't need to repeat the initial code for selecting the variables and tidyselection makes it flexible for multiple use cases (for example avoid plotting with NAs or outliers even if they are valid numeric or categorical values).

@gogonzo gogonzo force-pushed the redesign_extraction@main branch from cfcb269 to 3ec348b Compare September 9, 2025 15:15
@gogonzo
Copy link
Contributor

gogonzo commented Sep 11, 2025

Closing a PR and opening NEW soon to have a clean discussion feed

@gogonzo gogonzo closed this Sep 11, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Sep 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design data extract and data merge
4 participants