--- title: "Manipulating individual tables" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Technical: Manipulating individual tables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ``````{r setup, include = FALSE} source("setup/setup.R") `````` This vignette deals with situations where you want to transform tables of your `dm` object and then update an existing table or add a new table to the `dm` object. There are two approaches: 1. extract the tables relevant to the calculation, perform the necessary transformations, and (if needed) recombine the resulting table into a `dm`, 1. do all this within the `dm` object by zooming to a table and manipulating it. Both approaches aim at maintaining the key relations whenever possible. We will explore the first approach here. For the second approach, see `vignette("tech-dm-zoom")`. ## Enabling {dplyr}-workflow within a `dm` The `dm_get_tables()` and `pull_tbl()` functions have a new experimental argument `keyed`, which defaults to `FALSE`. If set to `TRUE`, a list of objects of class `dm_keyed_tbl` is returned instead. Because `dm_keyed_tbl` inherits from `tbl` or `tbl_lazy`, many {dplyr} and {tidyr} verbs will work unchanged. These objects will also attempt to track primary and foreign keys, so that they are available for joins and when recombining these tables later into a `dm` object. When you are finished with transforming your data, you can use `dm()` or `new_dm()` to recombine the tables into a `dm` object. The resulting tables in the `dm` will have all the primary and foreign keys available that could be tracked from the original table. Reconstructing the `dm` object is not strictly necessary if you're primarily interested in deriving one or multiple separate tables for analysis. If this workflow proves as useful as it seems, subsetting tables via `$`, `[[` will default to `keyed = TRUE` in a forthcoming major release of {dm}. ## Examples So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data. ### Use case 1: Add a new column to an existing table Imagine you want to have a column in `flights`, specifying if a flight left before noon or after. Just like with {dplyr}, we can tackle this with `mutate()`. Let us do this step by step: ```{r zoom} library(dm) library(dplyr) flights_dm <- dm_nycflights13(cycle = TRUE) flights_dm flights_keyed <- flights_dm %>% dm_get_tables(keyed = TRUE) # The print output for a `dm_keyed_tbl` looks very much like that from a normal # `tibble`, with additional details about keys. flights_keyed$flights flights_tbl_mutate <- flights_keyed$flights %>% mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm"), .after = dep_time) flights_tbl_mutate ``` To update the original `dm` with a new `flights` table we use `dm()`. The bang-bang-bang (`!!!`) is a technical necessity that will become superfluous in a forthcoming release. ```{r zoom2} updated_flights_dm <- dm( flights = flights_tbl_mutate, !!!flights_keyed[c("airlines", "airports", "planes", "weather")] ) # The only difference in the `dm` print output is the increased number of # columns updated_flights_dm # The schematic view of the data model remains unchanged dm_draw(updated_flights_dm) ``` ### Use case 2: Creation of a surrogate key The same course of action could, for example, be employed to create a surrogate key for a table, a synthetic simple key that replaces a compound key. We can do this for the `weather` table. ```{r} library(tidyr) flights_keyed$weather # Maybe there is some hidden candidate for a primary key that we overlooked? enum_pk_candidates(flights_keyed$weather) # Seems we have to construct a column with unique values # This can be done by combining column `origin` with `time_hour`, if the latter # is converted to a single time zone first; all within the `dm`: weather_tbl_mutate <- flights_keyed$weather %>% # first convert all times to the same time zone: mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>% # paste together as character the airport code and the time unite("origin_slot_id", origin, time_hour_fmt) %>% select(origin_slot_id, everything()) # check if we the result is as expected: weather_tbl_mutate %>% enum_pk_candidates() %>% filter(candidate) # We apply the same transformation to create # the foreign key in the flights table: flights_tbl_mutate <- flights_keyed$flights %>% mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>% unite("origin_slot_id", origin, time_hour_fmt) %>% select(origin_slot_id, everything()) surrogate_flights_dm <- dm( weather = weather_tbl_mutate, flights = flights_tbl_mutate, !!!flights_keyed[c("airlines", "airports", "planes")] ) %>% dm_add_pk(weather, origin_slot_id) %>% dm_add_fk(flights, origin_slot_id, weather) surrogate_flights_dm %>% dm_draw() ``` ### Use case 3: Disentangle `dm` If you look at the `dm` created by `dm_nycflights13(cycle = TRUE)`, you see that two columns of `flights` relate to the same table, `airports`. One column stands for the departure airport and the other for the arrival airport. This generates a cycle which leads to failures with many operations that only work on cycle-free data models, such as `dm_flatten_to_tbl()`, `dm_filter()` or `dm_wrap_tbl()`. In such cases, it can be beneficial to "disentangle" the `dm` by duplicating the referred table. One way to do this in the {dm}-framework is as follows: ```{r} disentangled_flights_dm <- dm( destination = flights_keyed$airports, origin = flights_keyed$airports, !!!flights_keyed[c("flights", "airlines", "planes", "weather")] ) %>% # Key relations are also duplicated, so the wrong ones need to be removed dm_rm_fk(flights, dest, origin) %>% dm_rm_fk(flights, origin, destination) disentangled_flights_dm %>% dm_draw() ``` ### Use case 4: Add summary table to `dm` Here is an example for adding a summary of a table as a new table to a `dm`. Foreign-key relations are taken care of automatically. This example shows an alternative approach of deconstruction reconstruction using `pull_tbl()`. ```{r} flights_derived <- flights_dm %>% pull_tbl(flights, keyed = TRUE) %>% dplyr::count(origin, carrier) derived_flights_dm <- dm(flights_derived, !!!flights_keyed) derived_flights_dm %>% dm_draw() ``` ### Use case 5: Joining tables If you would like to join some or all of the columns of one table to another, you can make use of one of the `..._join()` methods for a `dm_keyed_tbl`. In many cases, using keyed tables derived from a `dm` object allows omitting the `by` argument without triggering a message, because they are safely inferred from the foreign keys stored in the `dm_keyed_tbl` objects. For the syntax, please see the example below. ```{r} planes_for_join <- flights_keyed$planes %>% select(tailnum, plane_type = type) joined_flights_tbl <- flights_keyed$flights %>% # let's first reduce the number of columns of flights select(-dep_delay:-arr_delay, -air_time:-minute, -starts_with("sched_")) %>% # in the {dm}-method for the joins you can specify which columns you want to # add to the subsetted table left_join(planes_for_join) joined_flights_dm <- dm( flights_plane_type = joined_flights_tbl, !!!flights_keyed[c("airlines", "airports", "weather")] ) # this is how the table looks now joined_flights_dm$flights_plane_type # also here, the FK-relations are transferred to the new table joined_flights_dm %>% dm_draw() ``` ### Use case 6: Retrieve all tables Retrieving all tables from a `dm` object requires a lot of boilerplate code. The `dm_deconstruct()` function helps creating that boilerplate. For a `dm` object, it prints the code necessary to create local variables for all tables. ```{r} dm <- dm_nycflights13() dm_deconstruct(dm) ``` This code can be copy-pasted into your script or function.