--- title: "Zooming and manipulating tables" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Technical: Zooming and manipulating tables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ``````{r setup, include = FALSE} source("setup/setup.R") `````` This vignette deals with situations where you want to transform tables of your `dm` and then update an existing table or add a new table to the `dm`. There are two straightforward approaches: 1. extract the tables relevant to the calculation, perform the necessary transformations, and (if needed) recombine the resulting table into a `dm`, 1. do all this within the `dm` object by zooming to a table and manipulating it while maintaining the key relations whenever possible. Both approaches aim at maintaining the key relations whenever possible. We will explore the second approach here. For the first approach, see `vignette("tech-dm-zoom")`. ## Enabling {dplyr}-workflow within a `dm` "Zooming" to a table of a `dm` means: - all information stored in the original `dm` is kept, including the originally zoomed table - an object of class `dm_zoomed` is produced, presenting a view of the table for transformations - you do not need to specify the table when calling `select()`, `mutate()` and other table manipulation functions {dm} provides methods for many of the {dplyr}-verbs for a `dm_zoomed` which behave the way you are used to, affecting only the zoomed table and leaving the rest of the `dm` untouched. When you are finished with transforming the table, there are three options to proceed: 1. use `dm_update_zoomed()` if you want to replace the originally zoomed table with the new table 1. use `dm_insert_zoomed()` if you are creating a new table for your `dm` 1. use `dm_discard_zoomed()` if you do not need the result and want to discard it When employing one of the first two options, the resulting table in the `dm` will have all the primary and foreign keys available that could be tracked from the originally zoomed table. ## Examples So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data. ### Use case 1: Add a new column to an existing table Imagine you want to have a column in `flights`, specifying if a flight left before noon or after. Just like with {dplyr}, we can tackle this with `mutate()`. Let us do this step by step: ```{r zoom} library(dm) library(dplyr) flights_dm <- dm_nycflights13() flights_dm flights_zoomed <- flights_dm %>% dm_zoom_to(flights) # The print output for a `dm_zoomed` looks very much like that from a normal `tibble`. flights_zoomed flights_zoomed_mutate <- flights_zoomed %>% mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm")) %>% # in order to see our changes in the output we use `select()` for reordering the columns select(year:dep_time, am_pm_dep, everything()) flights_zoomed_mutate # To update the original `dm` with a new `flights` table we use `dm_update_zoomed()`: updated_flights_dm <- flights_zoomed_mutate %>% dm_update_zoomed() # The only difference in the `dm` print output is the increased number of columns updated_flights_dm # The schematic view of the data model remains unchanged dm_draw(updated_flights_dm) ``` ### Use case 2: Creation of a surrogate key The same course of action could, for example, be employed to create a surrogate key for a table, a synthetic simple key that replaces a compound key. We can do this for the `weather` table. ```{r} library(tidyr) weather_zoomed <- flights_dm %>% dm_zoom_to(weather) weather_zoomed # Maybe there is some hidden candidate for a primary key that we overlooked enum_pk_candidates(weather_zoomed) # Seems we have to construct a column with unique values # This can be done by combining column `origin` with `time_hour`, if the latter # is converted to a single time zone first; all within the `dm`: weather_zoomed_mutate <- weather_zoomed %>% # first convert all times to the same time zone: mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>% # paste together as character the airport code and the time unite("origin_slot_id", origin, time_hour_fmt) %>% select(origin_slot_id, everything()) # check if we the result is as expected: enum_pk_candidates(weather_zoomed_mutate) %>% filter(candidate) flights_upd_weather_dm <- weather_zoomed_mutate %>% dm_update_zoomed() %>% dm_add_pk(weather, origin_slot_id) flights_upd_weather_dm # creating the coveted FK relation between `flights` and `weather` extended_flights_dm <- flights_upd_weather_dm %>% dm_zoom_to(flights) %>% mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>% # need to keep `origin` as FK to airports, so `remove = FALSE` unite("origin_slot_id", origin, time_hour_fmt, remove = FALSE) %>% dm_update_zoomed() %>% dm_add_fk(flights, origin_slot_id, weather) extended_flights_dm %>% dm_draw() ``` ### Use case 3: Disentangle `dm` If you look at the `dm` created by `dm_nycflights13(cycle = TRUE)`, you see that two columns of `flights` relate to one and the same table, `airports`. One column stands for the departure airport and the other for the arrival airport. ```{r} dm_draw(dm_nycflights13(cycle = TRUE)) ``` This generates a cycle which leads to failures with many operations that only work on cycle-free data models, such as `dm_flatten_to_tbl()`, `dm_filter()` or `dm_wrap_to_tbl()`. In such cases, it can be beneficial to "disentangle" the `dm` by duplicating the referred table. One way to do this in the {dm}-framework is as follows: ```{r} disentangled_flights_dm <- dm_nycflights13(cycle = TRUE) %>% # zooming and immediately inserting essentially creates a copy of the original table dm_zoom_to(airports) %>% # reinserting the `airports` table under the name `destination` dm_insert_zoomed("destination") %>% # renaming the originally zoomed table dm_rename_tbl(origin = airports) %>% # Key relations are also duplicated, so the wrong ones need to be removed dm_rm_fk(flights, dest, origin) %>% dm_rm_fk(flights, origin, destination) dm_draw(disentangled_flights_dm) ``` In a future update, we will provide a more convenient way to "disentangle" `dm` objects, so that the individual steps will be done automatically. ### Use case 4: Add summary table to `dm` Here is an example for adding a summary of a table as a new table to a `dm` (FK-relations are taken care of automatically): ```{r} dm_with_summary <- flights_dm %>% dm_zoom_to(flights) %>% dplyr::count(origin, carrier) %>% dm_insert_zoomed("dep_carrier_count") dm_draw(dm_with_summary) ``` ### Use case 5: Joining tables If you would like to join some or all of the columns of one table to another, you can make use of one of the `join`-methods for a `dm_zoomed`. In addition to the usual arguments for the {dplyr}-joins, by supplying the `select` argument you can specify which columns of the RHS-table you want to be included in the join. For the syntax please see the example below. The LHS-table of a join is always the zoomed table. ```{r} joined_flights_dm <- flights_dm %>% dm_zoom_to(flights) %>% # let's first reduce the number of columns of flights select(-dep_delay:-arr_delay, -air_time:-time_hour) %>% # in the {dm}-method for the joins you can specify which columns you want to add to the zoomed table left_join(planes, select = c(tailnum, plane_type = type)) %>% dm_insert_zoomed("flights_plane_type") # this is how the table looks now joined_flights_dm$flights_plane_type # also here, the FK-relations are transferred to the new table dm_draw(joined_flights_dm) ``` ### Tip: Accessing the zoomed table At each point, you can retrieve the zoomed table by calling `pull_tbl()` on a `dm_zoomed`. To use our last example once more: ```{r} flights_dm %>% dm_zoom_to(flights) %>% select(-dep_delay:-arr_delay, -air_time:-time_hour) %>% left_join(planes, select = c(tailnum, plane_type = type)) %>% pull_tbl() ``` ### Possible pitfalls and caveats 1. Currently, not all {dplyr}-verbs have their own method for a `dm_zoomed` object, so be aware that in some cases it will still be necessary to resort to extracting one or more tables from a `dm` and reinserting a transformed version back into the `dm` object. The supported functions are: `group_by()`, `ungroup()`, `summarise()`, `mutate()`, `transmute()`, `filter()`, `select()`, `relocate()`, `rename()`, `distinct()`, `arrange()`, `slice()`, `left_join()`, `inner_join()`, `full_join()`, `right_join()`, `semi_join()`, and `anti_join()`. 1. The same is true for {tidyr}-functions. Methods are provided for: `unite()` and `separate()`. 1. There might be situations when you would like the key relations to remain intact, but they are dropped nevertheless. This is because a rigid logic is implemented, that does drop a key when its associated column is acted upon with e.g. a `mutate()` call. In these cases, the key relations will need to be re-established after finishing with the manipulations. 1. For each implemented {dplyr}-verb, there is a logic for tracking key relations between the tables. Up to {dm} version 0.2.4 we tried to track the columns in a very detailed manner. This has become increasingly difficult, especially with `dplyr::across()`. As of {dm} 0.2.5, we give more responsibility to the {dm} user: Now those columns are tracked whose **names** remain in the resulting table. Affected by these changes are the methods for: `mutate()`, `transmute()`, `distinct()`. When using one of these functions, be aware that if you want to replace a key column with a column with a different content but of the same name, this column will automatically become a key column.