--- title: "Getting started with dm" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{First read: Getting started with dm} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ``````{r setup, include = FALSE} source("setup/setup.R") `````` In this article, we're going show you how easy it is to move from connecting to the database holding your data to producing the results you need. It's meant to be a quick and friendly introduction to {dm}, so it is low on details and caveats. Links to detailed documentation are provided at the end. (If your data is in data frames instead of a database and you're in a hurry, jump over to `vignette("howto-dm-df")`.) ## Creating a dm object dm objects can be created from individual tables or loaded directly from a relational data model on an RDBMS (relational database management system). For this demonstration, we're going to work with a model hosted on a public server. The first thing we need is a connection to the RDBMS hosting the data. ``````{r connect, eval = FALSE} library(RMariaDB) fin_db <- dbConnect( MariaDB(), username = "guest", password = "ctu-relational", dbname = "Financial_ijs", host = "relational.fel.cvut.cz" ) `````` ``````{r connect-real, echo = FALSE} library(RMariaDB) fin_db <- dm:::financial_db_con() `````` We create a dm object from an RDBMS using `dm_from_con()`, passing in the connection object we just created as the first argument. ``````{r load-full} library(dm) fin_dm <- dm_from_con(fin_db) fin_dm `````` The dm object interrogates the RDBMS for table and column information, and primary and foreign keys. Currently, primary and foreign keys are only available from `r glue::glue_collapse(dm:::schema_supported_dbs()[["db_name"]], sep = ", ", last = " and ")`. ## Selecting tables The dm object can be accessed like a named list of tables: ```{r names} names(fin_dm) fin_dm$loans dplyr::count(fin_dm$trans) ``` Additionally, most `dm` functions are [pipe-friendly](https://r4ds.had.co.nz/pipes.html) and support [tidy evaluation](https://adv-r.hadley.nz/metaprogramming.html). We can use `[` or the `dm_select_tbl()` verb to derive a smaller dm with the `loans`, `accounts`, `districts` and `trans` tables: ```{r select} fin_dm_small <- fin_dm[c("loans", "accounts", "districts", "trans")] fin_dm_small <- fin_dm %>% dm_select_tbl(loans, accounts, districts, trans) ``` ## Linking tables by adding keys In many cases, `dm_from_con()` already returns a dm with all keys set. If not, dm allows us to define primary and foreign keys ourselves. For this, we use `learn_keys = FALSE` to obtain a `dm` object with only the tables. ``````{r load} library(dm) fin_dm_small <- dm_from_con(fin_db, learn_keys = FALSE) %>% dm_select_tbl(loans, accounts, districts, trans) `````` In our data model, `id` columns uniquely identify records in the `accounts` and `loans` tables, and was used as a primary key. A primary key is defined with `dm_add_pk()`. Each loan is linked to one account via the `account_id` column in the `loans` table, the relationship is established with `dm_add_fk()`. ```{r keys} fin_dm_keys <- fin_dm_small %>% dm_add_pk(table = accounts, columns = id) %>% dm_add_pk(loans, id) %>% dm_add_fk(table = loans, columns = account_id, ref_table = accounts) %>% dm_add_pk(trans, id) %>% dm_add_fk(trans, account_id, accounts) %>% dm_add_pk(districts, id) %>% dm_add_fk(accounts, district_id, districts) ``` ## Visualizing a data model Having a diagram of the data model is the quickest way to verify we're on the right track. We can display a visual summary of the dm at any time. The default is to display the table name, any defined keys, and their links to other tables. Visualizing the dm in its current state, we can see the keys we have created and how they link the tables together. Color guides the eye. ```{r visualize_keys} fin_dm_keys %>% dm_set_colors(darkgreen = c(loans, accounts), darkblue = trans, grey = districts) %>% dm_draw() ``` ## Accessing a data model as a table If we want to perform modeling or analysis on this relational model, we need to transform it into a tabular format that R functions can work with. With the argument `recursive = TRUE`, `dm_flatten_to_tbl()` will automatically follow foreign keys across tables to gather all the available columns into a single table. ```{r squash} fin_dm_keys %>% dm_flatten_to_tbl(loans, .recursive = TRUE) ``` Apart from the rows printed above, no data has been fetched from the database. Use `select()` to reduce the number of columns fetched, and `collect()` to retrieve the entire result for local processing. ```{r model} loans_df <- fin_dm_keys %>% dm_flatten_to_tbl(loans, .recursive = TRUE) %>% select(id, amount, duration, A3) %>% collect() model <- lm(amount ~ duration + A3, data = loans_df) model ``` ## Operations on table data within a dm We don't need to take the extra step of exporting the data to work with it. Through the dm object, we have complete access to dplyr's data manipulation verbs. These operate on the data within individual tables. To work with a particular table we use `dm_zoom_to()` to set the context to our chosen table. Then we can perform any of the dplyr operations we want. ```{r zoom} fin_dm_total <- fin_dm_keys %>% dm_zoom_to(loans) %>% group_by(account_id) %>% summarize(total_amount = sum(amount, na.rm = TRUE)) %>% ungroup() %>% dm_insert_zoomed("total_loans") fin_dm_total$total_loans ``` Note that, in the above example, we use `dm_insert_zoomed()` to add the results as a new table to our data model. This table is temporary and will be deleted when our session ends. If you want to make permanent changes to your data model on an RDBMS, please see the "Persisting results" section in `vignette("howto-dm-db")`. ## Checking constraints It's always smart to check that your data model follows its specifications. When building our own model or changing existing models by adding tables or keys, it is even more important that the new model is validated. `dm_examine_constraints()` checks all primary and foreign keys and reports if they violate their expected constraints. ```{r constraints} fin_dm_total %>% dm_examine_constraints() ``` For more on constraint checking, including cardinality, finding candidate columns for keys, and normalization, see `vignette("tech-dm-low-level")`. ``````{r disconnect, echo = FALSE, results = "hide"} dbDisconnect(fin_db) `````` ## Next Steps Now that you have been introduced to the basic operation of dm, the next step is to learn more about the dm methods that your particular use case requires. Is your data in an RDBMS? Then move on to `vignette("howto-dm-db")` for a more detailed look at working with an existing relational data model. If your data is in data frames, then you want to read `vignette("howto-dm-df")` next. If you would like to know more about relational data models in order to get the most out of dm, check out `vignette("howto-dm-theory")`. If you're familiar with relational data models, but want to know how to work with them in dm, then any of `vignette("tech-dm-join")`, `vignette("tech-dm-filter")`, or `vignette("tech-dm-zoom")` is a good next step. ## Standing on the shoulders of giants The {dm} package follows the tidyverse principles: - `dm` objects are immutable (your data will never be overwritten in place) - most functions used on `dm` objects are pipeable (i.e., return new `dm` or table objects) - tidy evaluation is used (unquoted function arguments are supported) The {dm} package builds heavily upon the [{datamodelr} package](https://github.com/bergant/datamodelr), and upon the [tidyverse](https://www.tidyverse.org/). We're looking forward to a good collaboration! The [{polyply} package](https://github.com/russHyde/polyply) has a similar intent with a slightly different interface. The [{data.cube} package](https://github.com/jangorecki/data.cube) has quite the same intent using `array`-like interface. Articles in the [{rquery} package](https://github.com/WinVector/rquery) discuss [join controllers](https://github.com/WinVector/rquery/blob/master/extras/JoinController.md) and [join dependency sorting](https://github.com/WinVector/rquery/blob/master/extras/DependencySorting.md), with the intent to move the declaration of table relationships from code to data. The [{tidygraph} package](https://github.com/thomasp85/tidygraph) stores a network as two related tables of `nodes` and `edges`, compatible with {dplyr} workflows. In object-oriented programming languages, [object-relational mapping](https://en.wikipedia.org/wiki/Object-relational_mapping) is a similar concept that attempts to map a set of related tables to a class hierarchy.