Unlocking Big Data in R
Using Arrow

Oman R Users

Rahul S

2023-11-08

/whoami


  • Data Scientist, Mechanical Engineer
  • Time Series, Anomaly Detection
  • R Packages
  • Shiny
  • ML Ops, Docker

/motivation

  • Data larger than memory?
  • Data transformation pipelines slow?

/my journey


/caveats


  • I’m assuming you know R
  • You’re familiar with tidyverse, particularly dplyr
  • I’m a practitioner, not a subject matter expert
  • Motivating examples & fundamentals

/arrow/what is it


  • Columnar memory format
  • For flat and hierarchical data
  • Organized for efficient analytic operations
  • Language-independent (C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, & Rust)
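
A minimal sketch of what "columnar" looks like from R, assuming the arrow package is installed. The column names and values are made up for illustration.

```r
library(arrow)

# A small in-memory Arrow Table; each column is stored as its own typed array.
tbl <- arrow_table(
  trip_id  = 1:3,
  distance = c(1.2, 3.4, 0.8),
  payment  = c("card", "cash", "card")
)

tbl$schema           # column names and their Arrow types
tbl$num_rows         # number of rows
names(tbl)           # column names

# Convert back to a regular R data frame when needed.
df <- as.data.frame(tbl)
```

Because the layout is standardized, the same table can be handed to other Arrow implementations without converting it to a different in-memory representation.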

/arrow/advantages


  • Fast Columnar Format
  • Arrow Format Standardization

/arrow/R package

The arrow R package exposes an interface to the Arrow C++ library, enabling access to many of its features in R

Read and write

  • Parquet files, an efficient and widely used columnar format
  • CSV files with excellent speed and efficiency
  • Multi-file and larger-than-memory datasets
  • Read JSON files

Data analysis

  • Analyze larger-than-memory datasets
  • Manipulate Arrow data with dplyr verbs
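
The features above can be sketched roughly as follows, assuming arrow and dplyr are installed. The `trips` data frame is a toy stand-in, not the real taxi data.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical values).
trips <- data.frame(
  vendor   = c("A", "B", "A", "B"),
  distance = c(1.5, 2.0, 7.2, 0.4),
  fare     = c(8.0, 10.5, 30.0, 4.5)
)

pq_path  <- tempfile(fileext = ".parquet")
csv_path <- tempfile(fileext = ".csv")

# Write the same data as Parquet and as CSV.
write_parquet(trips, pq_path)
write_csv_arrow(trips, csv_path)

# as_data_frame = FALSE keeps the result as an Arrow Table instead of a data frame.
trips_arrow <- read_parquet(pq_path, as_data_frame = FALSE)
trips_csv   <- read_csv_arrow(csv_path)

# dplyr verbs work directly on the Arrow Table; collect() returns a tibble.
result <- trips_arrow %>%
  filter(distance > 1) %>%
  group_by(vendor) %>%
  summarize(mean_fare = mean(fare)) %>%
  arrange(vendor) %>%
  collect()
```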

/large data


New York City Taxi

Yellow and green taxi trip records… pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, passenger counts…


Size: 40 GB on disk

Dimensions: 1.15 B rows x 24 cols!

lets::code()
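
The live code is not reproduced here. As a rough sketch of the pattern, assuming arrow and dplyr are installed, the example below points `open_dataset()` at a folder of files and queries it without loading everything into memory. The two small CSV files and their columns are hypothetical stand-ins for the taxi data.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the taxi data: two small CSV files in one folder.
data_dir <- tempfile("taxi_csv_")
dir.create(data_dir)
write.csv(
  data.frame(year = 2019L, passenger_count = c(1L, 2L), total_amount = c(10.5, 22.0)),
  file.path(data_dir, "2019.csv"), row.names = FALSE
)
write.csv(
  data.frame(year = 2020L, passenger_count = c(1L, 4L), total_amount = c(8.0, 40.0)),
  file.path(data_dir, "2020.csv"), row.names = FALSE
)

# open_dataset() discovers the files and infers a schema;
# the rows themselves are not loaded into R at this point.
taxi <- open_dataset(data_dir, format = "csv")

# Build the query first, then bring only the small result into R.
by_year <- taxi %>%
  filter(passenger_count > 1) %>%
  group_by(year) %>%
  summarize(trips = n(), revenue = sum(total_amount)) %>%
  arrange(year) %>%
  collect()
```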

/duckdb + arrow
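
A minimal sketch of handing an Arrow table to DuckDB for a window function, assuming the arrow, dplyr, duckdb, and dbplyr packages are installed. The data and column names are made up for illustration.

```r
library(arrow)
library(dplyr)
# The duckdb and dbplyr packages must also be installed for to_duckdb().

trips <- arrow_table(
  vendor = c("A", "A", "B", "B"),
  fare   = c(8, 30, 10.5, 4.5)
)

ranked <- trips %>%
  to_duckdb() %>%                                # expose the Arrow data to DuckDB
  group_by(vendor) %>%
  mutate(fare_rank = min_rank(desc(fare))) %>%   # window function, executed by DuckDB
  ungroup() %>%
  arrange(vendor, fare_rank) %>%
  collect()
```

The ranking step runs inside DuckDB, and only the final result is collected into R.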

/performance/parquet

Ref: https://parquet.apache.org


Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
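
A rough sketch of writing a partitioned Parquet dataset and reading back a single partition, assuming arrow and dplyr are installed. The `trips` data and the choice of `year` as the partition column are illustrative assumptions.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical values).
trips <- data.frame(
  year         = c(2019L, 2019L, 2020L, 2020L),
  month        = c(1L, 2L, 1L, 2L),
  total_amount = c(10.5, 22.0, 8.0, 40.0)
)

out_dir <- tempfile("taxi_parquet_")

# One folder per year, e.g. year=2019/part-0.parquet
write_dataset(trips, out_dir, format = "parquet", partitioning = "year")

files <- list.files(out_dir, recursive = TRUE)

# Filtering on the partition column lets arrow skip folders that cannot match.
total_2020 <- open_dataset(out_dir) %>%
  filter(year == 2020) %>%
  summarize(total = sum(total_amount)) %>%
  collect()
```

Partitioning on a column that queries commonly filter by means less data has to be read for each query.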

/recap


  • arrow evaluates dplyr queries lazily by default
  • execution only runs on collect() (or compute())
  • supported {dplyr} verbs include filter, select, mutate, joins, distinct, group_by + summarize, and across
  • to_duckdb() saves the day for pivoting and window functions
  • register_scalar_function() can be used to create UDFs
  • massive performance gains using parquet files and smart partitioning
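
Lazy evaluation and UDFs can be sketched together as below, assuming a recent version of arrow (register_scalar_function() is marked experimental) plus dplyr. The `miles_to_km` function and the data are hypothetical.

```r
library(arrow)
library(dplyr)

# Hypothetical UDF: convert miles to kilometres.
# The first argument of the function is a context object supplied by arrow.
register_scalar_function(
  "miles_to_km",
  function(context, x) x * 1.609344,
  in_type = float64(),
  out_type = float64(),
  auto_convert = TRUE
)

trips <- record_batch(distance = c(1, 2.5, 10))

# This only builds a query; nothing is computed yet.
query <- trips %>%
  mutate(distance_km = miles_to_km(distance))

class(query)

# collect() runs the query and returns a tibble.
result <- collect(query)
```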

/i hope


/further reading

Cookbook

2 Day Workshop

R Package Docs

Thank you (شكراً)


rsangole/oman-rusers-arrow

rahulsangole