The 💪 of `{targets}` for
Reproducible Data Science

Get ready!
Setup instructions: https://tinyurl.com/r-targets-setup

Rahul S

/setup

Have R and RStudio running
Clone the GitHub repo:
- https://tinyurl.com/r-targets-workshop
If you don’t have `{renv}`, run `install.packages("renv")`
Run `renv::restore()` to install the needed packages
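
The last two steps can be combined into one snippet. This is a minimal sketch, assuming you run it from the root of the cloned repo and that the repo ships an `renv.lock` file there (implied by the `renv::restore()` step, but not confirmed here):

```r
# Install {renv} only if it is not already available.
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}

# Restore the project library from the lockfile (assumed: renv.lock in the repo root).
renv::restore()
```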

/agenda

| Time | Segment | Duration | Description |
| --- | --- | --- | --- |
| 0:00–0:15 | Introduction & Setup | 15 min | Welcome, objectives, environment check, intro to `{targets}`. |
| 0:15–0:35 | 01-basics: Core Concepts Exercise | 20 min | Build your first `{targets}` pipeline. |
| 0:35–0:55 | 02-functions: Modularization Exercise | 20 min | Write and source custom functions for your pipeline. |
| 0:55–1:05 | Break | 10 min | Stretch, coffee, questions. |
| 1:05–1:25 | 03-files: File I/O & Quarto Exercise | 20 min | Handle file inputs/outputs and automate reporting. |
| 1:25–1:45 | 04-parallel: Parallel Computing Exercise | 20 min | Speed up pipelines with parallel computing. |
| 1:45–1:55 | Break | 10 min | Stretch, coffee, questions. |
| 1:55–2:10 | 05-dynamic_branching: Dynamic Branching Exercise | 15 min | Process multiple groups/files efficiently. |
| 2:10–2:25 | 06-database: Database Integration Exercise | 15 min | Integrate databases for robust data management. |
| 2:25–2:35 | Break | 10 min | Stretch, coffee, questions. |
| 2:35–2:45 | 07-full_example: Full Pipeline Exercise | 10 min | Bring together all concepts in a comprehensive example. |
| 2:45–3:00 | Wrap-Up and Q&A | 15 min | Recap, resources, open questions, next steps. |

/engagement

Feel free to ask questions along the way
You don’t have to code along
- It’s more important to absorb the fundamentals and design patterns
- The code is fully reproducible and available on GitHub

/whoami

‘Full Stack’ Data Science Manager, Mechanical Engineer
Focus areas
- Time Series
- Scalable Solutioning
- Reproducible data science
- ML Ops
R and Python package author
GitHub: rsangole

/motivation/reproducibility-crisis

Not good 🙈

A survey of 1,576 researchers found that over 70% had failed to reproduce another scientist’s experiments, and more than 50% couldn’t reproduce their own results!

/motivation/reproducibility-crisis

Concerning… 😩

Scientists tried to replicate 56 studies. Only 19% found results consistent with the original papers.

/motivation/reproducibility-crisis

Gosh! ☠️

A 2019 paper found just 24% of 800k Jupyter notebooks on GitHub could be rerun, and only 4% reproduced the same results!

/motivation/reproducibility-crisis

Say it isn’t so! 🤯

A 2024 study on code reproducibility in economics attempted to reproduce 67 economics papers and found that only about 50% were reproducible, even with author assistance and mandatory code-sharing policies at journals!

/reproducibility

How do we ensure end-to-end reproducibility?

mindmap
  root)Reproducibility(
    3-Code
      [Version]
        [Github]
      [Execution]
        ))targets((
    2-Environment
      🐳 [Docker]
        [OS]
        [Libraries]
        [Environment]
        [R Packages]
        [Py Packages]
    1-Data
      [Databases]
      [Blob storage]
      [Archives]

/{targets}/what-is-it?

Author Will Landau explains…

A Make-like pipeline tool that coordinates the pieces of computationally demanding analyses
The package skips costly runtime for tasks that are already up to date
It orchestrates computation and handles parallel computing
If all the current output matches the code and data, then the whole pipeline is up to date, and the results are more trustworthy
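
To make this concrete, here is a minimal sketch of a `_targets.R` file. The target names (`raw_data`, `clean_data`, `model`, `coefs`) and the built-in `airquality` dataset are illustrative assumptions, not taken from the workshop repo.

```r
# _targets.R (hypothetical example, not from the workshop repo)
library(targets)

# Each tar_target() is one step in the pipeline. Dependencies between steps
# are inferred from the symbols each command uses.
list(
  tar_target(raw_data, datasets::airquality),
  tar_target(clean_data, raw_data[complete.cases(raw_data), ]),
  tar_target(model, lm(Ozone ~ Temp, data = clean_data)),
  tar_target(coefs, coef(model))
)

# In the R console, from the project root:
#   targets::tar_make()        # first run: builds all four targets
#   targets::tar_make()        # second run: up-to-date targets are skipped
#   targets::tar_read(coefs)   # load a stored result
#   targets::tar_visnetwork()  # dependency graph (needs the visNetwork package)
```

Changing the command for `clean_data` and rerunning `tar_make()` should rebuild `clean_data`, while `raw_data` stays skipped because nothing upstream of it changed. Downstream targets rerun only if the new `clean_data` value actually differs.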

/{targets}/what-is-it?

It’ll help you go from here…

/{targets}/what-is-it?

To here…

/{targets}/what-we’ll-cover

The workshop is structured to build your skills step-by-step, from foundational concepts to some advanced features:

Fundamentals

01-basics
02-functions
03-files

Advanced Topics

04-parallel
05-dynamic_branching
06-database
07-full_example

/lets::code()

/recap

We learned:

What `{targets}` is and why reproducibility matters
How to build and modularize pipelines
Managing files, Quarto reports, and databases
Using parallel computing and dynamic branching
Bringing it all together in a full example

/further-reading

We just scratched the surface…

/lets-make-data-science-reproducible!

🙏

rsangole
rahulsangole
