Unlocking Big Data in R
Using Arrow

Oman R Users

Rahul S

2023-11-08

/whoami

Data Scientist, Mechanical Engineer
Time Series, Anomaly Detection
R Packages
Shiny
ML Ops, Docker

/motivation

Data larger than memory?
Data transformation pipelines slow?

/my journey

/caveats

I’m assuming you know R
You’re familiar with tidyverse, particularly dplyr
I’m not a subject matter expert, but a practitioner
Motivating examples & fundamentals

/arrow/what is it

Columnar memory format
For flat and hierarchical data
Organized for efficient analytic operations
Language-independent (C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, & Rust)
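
To make the columnar idea concrete, here is a minimal sketch in R. The toy data and the names `tbl`, `fare_col`, and `mean_fare` are illustrative assumptions, not from the talk. It builds a small in-memory Arrow Table and reads one column without touching the others.

```r
library(arrow)

# Toy data (hypothetical): three taxi trips held in an Arrow Table
tbl <- arrow_table(
  id = 1:3,
  fare = c(7.5, 12.0, 3.25),
  vendor = c("A", "B", "A")
)

# Every column carries its own Arrow type (int32, double, string here)
print(tbl$schema)

# Columns are stored separately, so one column can be read on its own
fare_col <- tbl$fare                      # a ChunkedArray
mean_fare <- mean(as.vector(fare_col))    # convert to an R vector, then average
```

The same Table could be handed to another Arrow-aware language without converting the data, which is the point of a shared format.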

/arrow/advantages

Fast Columnar Format
Arrow Format Standardization

/arrow/R package

The arrow R package exposes an interface to the Arrow C++ library, enabling access to many of its features in R

Read and write

Parquet files, an efficient and widely used columnar format
CSV files with excellent speed and efficiency
Multi-file and larger-than-memory datasets
Read JSON files
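
A minimal round-trip sketch for the readers and writers listed above. The data frame and the temporary file paths are made up for illustration; JSON is left out to keep it short.

```r
library(arrow)

# Toy data (hypothetical), standing in for a real trip table
trips <- data.frame(
  trip_id = 1:4,
  distance = c(1.2, 3.4, 0.8, 5.1),
  payment = c("card", "cash", "card", "card")
)

parquet_path <- tempfile(fileext = ".parquet")
csv_path <- tempfile(fileext = ".csv")

# Write the same data as Parquet and as CSV
write_parquet(trips, parquet_path)
write_csv_arrow(trips, csv_path)

# Read both back as data frames
from_parquet <- read_parquet(parquet_path)
from_csv <- read_csv_arrow(csv_path)

# Parquet is columnar, so a subset of columns can be read on its own
distances <- read_parquet(parquet_path, col_select = c("trip_id", "distance"))
```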

Data analysis

Analyze larger-than-memory datasets
Manipulate Arrow data with dplyr verbs
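
As a rough sketch of the dplyr workflow, assuming a tiny toy dataset written to a temporary directory (the real taxi data would be opened the same way from its own folder): `open_dataset()` only scans file metadata, the dplyr verbs build a query, and `collect()` brings the result into R.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical) written out as a Parquet dataset
trips <- data.frame(
  vendor = c("A", "B", "A", "B", "A"),
  passengers = c(1L, 2L, 1L, 4L, 3L),
  fare = c(7.5, 12.0, 3.25, 30.0, 18.0)
)
ds_dir <- tempfile("trips_ds_")
write_dataset(trips, ds_dir, format = "parquet")

# Point at the directory; no rows are loaded into R yet
ds <- open_dataset(ds_dir)

# Build the query with familiar dplyr verbs
query <- ds %>%
  filter(passengers > 1) %>%
  group_by(vendor) %>%
  summarize(mean_fare = mean(fare), n = n())

# Run it and pull the (small) result into an R data frame
result <- query %>%
  collect() %>%
  arrange(vendor)
```

Only the two summary rows ever reach R memory, which is what makes the pattern usable on larger-than-memory data.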

/large data

New York City Taxi

Yellow and green taxi trip records… pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, passenger counts…

Size: 40 GB on disk

Dimensions: 1.15 billion rows x 24 columns

lets::code()

/duckdb + arrow
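
A minimal sketch of handing Arrow data to DuckDB for a window function, assuming the duckdb and dbplyr packages are installed. The toy table and the column name `fare_rank` are illustrative, not from the talk.

```r
library(arrow)
library(dplyr)
# Assumes the duckdb and dbplyr packages are installed

# Toy data (hypothetical)
trips <- arrow_table(
  vendor = c("A", "A", "B", "B"),
  fare = c(5, 9, 4, 10)
)

# to_duckdb() exposes the Arrow data to DuckDB as a virtual table;
# the grouped mutate() becomes a SQL window function
ranked <- trips %>%
  to_duckdb() %>%
  group_by(vendor) %>%
  mutate(fare_rank = min_rank(desc(fare))) %>%
  ungroup() %>%
  collect() %>%
  arrange(vendor, fare_rank)
```

The query can also be passed back with `to_arrow()` instead of `collect()` if later steps should stay in Arrow.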

/performance/parquet

Ref: https://parquet.apache.org

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
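
A small sketch of partitioned Parquet, using made-up data and a temporary directory. Partitioning by `year` and `month` is an assumption chosen for illustration; the right keys depend on how the data is usually filtered.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical)
trips <- data.frame(
  year = c(2019L, 2019L, 2020L, 2020L),
  month = c(1L, 2L, 1L, 2L),
  fare = c(10, 20, 30, 40)
)

# One Parquet file per year/month combination, in Hive-style folders
ds_dir <- tempfile("partitioned_")
write_dataset(trips, ds_dir, format = "parquet",
              partitioning = c("year", "month"))

# Relative paths look like "year=2019/month=1/part-0.parquet"
parts <- list.files(ds_dir, recursive = TRUE)

# A filter on a partition column lets Arrow skip non-matching folders
fares_2020 <- open_dataset(ds_dir) %>%
  filter(year == 2020) %>%
  select(month, fare) %>%
  collect() %>%
  arrange(month)
```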

/recap

arrow dplyr queries are evaluated lazily by default
Execution runs only when you call collect()
Supported {dplyr} verbs include filter, select, mutate, join, distinct, group_by + summarize, and across
to_duckdb() saves the day for pivoting and window functions
register_scalar_function() can be used to create user-defined functions (UDFs)
Massive performance gains from Parquet files and smart partitioning
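
A short sketch tying together two of the points above, lazy evaluation and UDFs. The function name `miles_to_km` and the toy table are hypothetical, and it assumes an arrow release that ships `register_scalar_function()` (it is marked experimental in the package documentation).

```r
library(arrow)
library(dplyr)

# Register an R function as an Arrow scalar UDF (hypothetical example).
# The first argument of the function is always the execution context.
register_scalar_function(
  "miles_to_km",
  function(context, miles) miles * 1.609344,
  in_type = schema(miles = float64()),
  out_type = float64(),
  auto_convert = TRUE   # pass plain R vectors in and out
)

# Toy data (hypothetical)
trips <- arrow_table(distance_miles = c(1, 2.5, 10))

# Nothing is computed here; this only describes the work to do
query <- trips %>%
  mutate(distance_km = miles_to_km(distance_miles))

# The UDF actually runs when the query is collected
result <- collect(query)
```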

/i hope

/further reading

Thank you (شكراً)

rsangole/oman-rusers-arrow

rahulsangole
