The 💪 of {targets} for
Reproducible Data Science


Get ready!
Setup instructions -> https://tinyurl.com/r-targets-setup

Rahul S

/setup


  • Have R & RStudio running
  • Clone the GitHub repo:
  • If you don’t have {renv}, run install.packages("renv")
  • Run renv::restore() to install the needed packages
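The last two setup steps can be run together from the R console. This is a minimal sketch, assuming the working directory is the root of the cloned repo and that the repo includes an `renv.lock` file (an assumption; the slides do not say so explicitly).

```r
# Install {renv} only if it is not already available.
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}

# Restore the project library from the lockfile in the project root
# (assumed to be renv.lock; renv may ask for confirmation first).
renv::restore()
```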



/agenda


| Time | Segment | Duration | Description |
|---|---|---|---|
| 0:00–0:15 | Introduction & Setup | 15 min | Welcome, objectives, environment check, intro to {targets}. |
| 0:15–0:35 | 01-basics: Core Concepts Exercise | 20 min | Build your first {targets} pipeline. |
| 0:35–0:55 | 02-functions: Modularization Exercise | 20 min | Write and source custom functions for your pipeline. |
| 0:55–1:05 | Break | 10 min | Stretch, coffee, questions. |
| 1:05–1:25 | 03-files: File I/O & Quarto Exercise | 20 min | Handle file inputs/outputs and automate reporting. |
| 1:25–1:45 | 04-parallel: Parallel Computing Exercise | 20 min | Speed up pipelines with parallel computing. |
| 1:45–1:55 | Break | 10 min | Stretch, coffee, questions. |
| 1:55–2:10 | 05-dynamic_branching: Dynamic Branching Exercise | 15 min | Process multiple groups/files efficiently. |
| 2:10–2:25 | 06-database: Database Integration Exercise | 15 min | Integrate databases for robust data management. |
| 2:25–2:35 | Break | 10 min | Stretch, coffee, questions. |
| 2:35–2:45 | 07-full_example: Full Pipeline Exercise | 10 min | Bring together all concepts in a comprehensive example. |
| 2:45–3:00 | Wrap-Up and Q&A | 15 min | Recap, resources, open questions, next steps. |

/engagement


  • Feel free to ask questions along the way
  • You don’t have to code along
    • More important to absorb the fundamentals and design patterns
    • Code bases are fully reproducible & available on GitHub

/whoami


  • ‘Full Stack’ Data Science Manager, Mechanical Engineer
  • Focus areas
    • Time Series
    • Scalable Solutioning
    • Reproducible data science
    • ML Ops
  • R, Py Package Author
  • github: rsangole

/motivation/reproducibility-crisis

Not good 🙈

A survey of 1,576 researchers found that over 70% had failed to reproduce another scientist’s experiments, and more than 50% couldn’t reproduce their own results!

/motivation/reproducibility-crisis

Concerning… 😩

Scientists tried to replicate 56 studies. Only 19% produced results consistent with the original papers.

/motivation/reproducibility-crisis

Gosh! ☠️

A 2019 paper found just 24% of 800k Jupyter notebooks on GitHub could be rerun, and only 4% reproduced the same results!

/motivation/reproducibility-crisis

Say it isn’t so! 🤯

A 2024 study on code reproducibility in economics attempted to reproduce 67 economics papers and found that only about 50% were reproducible, even with author assistance and mandatory code-sharing policies at journals!

/reproducibility

How do we ensure end-to-end reproducibility?

mindmap
  root)Reproducibility(
    3-Code
      [Version]
        [Github]
      [Execution]
        ))targets((
    2-Environment
      🐳 [Docker]
        [OS]
        [Libraries]
        [Environment]
        [R Packages]
        [Py Packages]
    1-Data
      [Databases]
      [Blob storage]
      [Archives]

/{targets}/what-is-it?


Author Will Landau explains…

  • A make-like pipeline tool that coordinates the pieces of computationally demanding analyses
  • The package skips costly run-time for tasks that are already up to date
  • It orchestrates computation and handles parallel computing
  • If all the current output matches the code and data, then the whole pipeline is up to date, and the results are more trustworthy
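The make-like behaviour described above can be sketched with a tiny `_targets.R` file. This is an illustration only, not code from the workshop repo: the target names (`raw_data`, `model`, `summary_tbl`) and the two helper functions are hypothetical, and the data is R's built-in `mtcars` set.

```r
# _targets.R: a minimal, hypothetical pipeline (not from the workshop repo)
library(targets)

# Hypothetical helper functions; a real project would usually keep
# these in separate files and source them here.
fit_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

summarise_model <- function(model) {
  data.frame(term = names(coef(model)), estimate = unname(coef(model)))
}

# The file must end with a list of targets. {targets} infers the
# dependency graph from the symbols each command refers to.
list(
  tar_target(raw_data, mtcars),
  tar_target(model, fit_model(raw_data)),
  tar_target(summary_tbl, summarise_model(model))
)
```

With this file in the working directory, `targets::tar_make()` builds all three targets. A second call should skip them because nothing has changed, while editing `fit_model()` should invalidate `model` and `summary_tbl` but leave `raw_data` alone. `targets::tar_outdated()` lists what would rerun, and `targets::tar_read(summary_tbl)` returns a stored result.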

/{targets}/what-is-it?

It’ll help you go from here…

/{targets}/what-is-it?

To here…

/{targets}/what-we’ll-cover


The workshop is structured to build your skills step-by-step, from foundational concepts to some advanced features:


Fundamentals

  • 01-basics
  • 02-functions
  • 03-files

Advanced Topics

  • 04-parallel
  • 05-dynamic_branching
  • 06-database
  • 07-full_example

/lets::code()

/recap


We learned:

  • What {targets} is and why reproducibility matters
  • How to build and modularize pipelines
  • Managing files, Quarto reports, and databases
  • Using parallel computing and dynamic branching
  • Bringing it all together in a full example

/further-reading

We just scratched the surface…

The user manual

The documentation

/lets-make-data-science-reproducible!


🙏


rsangole
rahulsangole