Big Data with RUST Part 1/3 — Yes we can!!!

Diogo Miyake
Oct 10, 2022 · 10 min read


A series of posts showing Rust frameworks that allow us to work in the Big Data area. In total there are three posts: this one (part 1/3) introduces the series and uses DataFusion, part 2/3 covers Polars, and part 3/3 covers Fluvio.

Photo by Cookie the Pom on Unsplash

After a few months without time to write anything, I decided to write about some tests I have been doing with Rust.

Purpose of this article

The purpose of this article is to share my (small) experience with these Big Data frameworks and to show that Rust can be a lightweight alternative to other computational libraries and frameworks for big data processing… After all, Moore's law is there to guide us about the evolution of hardware, and not everything is solved by increasing computational power.

Other parts of this series

Why Rust?

First, I love studying and testing new technologies. Second, Rust's user base keeps growing, it is a safe language, and it is a good one to know. You can learn more about Rust here:

My Big Data context

I have 4+ years of experience with data, 3+ of them in Big Data, and today I work with data architecture, data pipelines, data security and cloud computing technologies. When I discovered Rust I thought: this is great, why don't I try it in a Big Data context? And after some time, here I am.

Frameworks for Big Data Processing

Today there are a lot of frameworks for big data processing; I will cite the most common and widely used ones:

Spark: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Kafka: Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Flink: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Beam: Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines that simplifies large-scale data processing dynamics. Thousands of organizations around the world choose Apache Beam due to its unique data processing features, proven scale, and powerful yet extensible capabilities.

Cloud options: there are many cloud options that use some of the above frameworks at their core, for example: AWS EMR clusters run Spark (among other tools), Databricks uses Spark, GCP Dataflow uses Apache Beam, etc.

PS: DataFusion is, at the moment, only for batch processing; in the other parts of this series described above I will show other frameworks for streaming.

PS2: Some descriptions are taken from the projects' documentation.

What is DataFusion?

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

DataFusion supports both an SQL and a DataFrame API for building logical query plans as well as a query optimizer and execution engine capable of parallel execution against partitioned data sources (CSV and Parquet) using threads.

Arrow

Apache Arrow is a project used by tools such as Spark, Parquet, Polars, Dremio, Dask and Pandas. For other use cases, see this page.

Why DataFusion?

PS: info from the project's main page…

  • High Performance: Leveraging Rust and Arrow’s memory model, DataFusion achieves very high performance
  • Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
  • Easy to Embed: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
  • High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

Come on, let's get started with the code…


First: all of the code is in this repository on GitLab: https://gitlab.com/miyake-diogo/rust-big-data-playground

Second: Install Rust

Third: Clone the repo or create your own project with `cargo new <any name of your project>`

Fourth: See dataset details here: https://www.stats.govt.nz/large-datasets/csv-files-for-download/

Info about datasets: https://www.stats.govt.nz/information-releases/statistical-area-1-dataset-for-2018-census-updated-march-2020

Click on this link and download the file.

After downloading, extracting and renaming the files, add them to a folder structure like the one below:

data
└── minilake
    ├── curated
    │   ├── aggregated_tables
    │   │   ├── part-0.parquet
    │   │   ├── part-1.parquet
    │   │   ├── part-2.parquet
    │   │   ├── part-3.parquet
    │   │   ├── part-4.parquet
    │   │   ├── part-5.parquet
    │   │   ├── part-6.parquet
    │   │   ├── part-7.parquet
    │   │   ├── part-8.parquet
    │   │   └── part-9.parquet
    │   └── final_obt
    │       ├── part-0.parquet
    │       ├── part-1.parquet
    │       ├── part-2.parquet
    │       ├── part-3.parquet
    │       ├── part-4.parquet
    │       ├── part-5.parquet
    │       ├── part-6.parquet
    │       ├── part-7.parquet
    │       ├── part-8.parquet
    │       └── part-9.parquet
    ├── raw
    │   ├── age_census
    │   │   └── DimenLookupAge8277.csv
    │   ├── area_census
    │   │   └── DimenLookupArea8277.csv
    │   ├── covid
    │   │   └── owid-covid-data.csv
    │   ├── ethnic_census
    │   │   └── DimenLookupEthnic8277.csv
    │   ├── fato_census
    │   │   └── Data8277.csv
    │   ├── sex_census
    │   │   └── DimenLookupSex8277.csv
    │   └── year_census
    │       └── DimenLookupYear8277.csv
    └── stage
        ├── age_census
        │   └── part-0.parquet
        ├── area_census
        │   └── part-0.parquet
        ├── covid
        │   └── part-0.parquet
        ├── ethnic_census
        │   └── part-0.parquet
        ├── fato_census
        │   └── part-0.parquet
        ├── sex_census
        │   └── part-0.parquet
        └── year_census
            └── part-0.parquet


DON'T FORGET TO DOWNLOAD THE DATA AND CREATE THE FOLDER STRUCTURE

Let's Start

Add the dependencies to Cargo.toml:
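As a reference, here is a minimal Cargo.toml sketch (the version numbers are an assumption; pick the latest releases available when you read this):

[dependencies]
# DataFusion query engine (version here is an assumption)
datafusion = "13"
# async runtime needed to await the DataFusion API
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }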

Run cargo build to install the libraries.
**Note:** If you have problems with parquet or arrow, you may need to update rustup.

Overview of commands

First create a context

You can create a SessionContext to run your queries or processing methods.

let ctx: SessionContext = SessionContext::new();
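For context, a minimal sketch of the surrounding boilerplate assumed in the snippets below (the prelude import and an async main via tokio):

use datafusion::error::Result;
use datafusion::prelude::*; // SessionContext, CsvReadOptions, col, lit, etc.

#[tokio::main]
async fn main() -> Result<()> {
    // create the session context used by the examples below
    let ctx = SessionContext::new();
    // ... read, transform and write data here ...
    Ok(())
}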

Read a CSV

To read a CSV we can use two different methods:

Dataframe API

// read csv to dataframe
let df = ctx.read_csv("file.csv", CsvReadOptions::new()).await?;

If you want to set options on CsvReadOptions, you can chain them after `.new()`:

// read csv to dataframe, passing the delimiter option to CsvReadOptions
let df = ctx.read_csv("file.csv", CsvReadOptions::new().delimiter(b';')).await?;
// execute and print results
df.show_limit(5).await?;
Ok(())

SQL API

// Register table from csv
ctx.register_csv("table_name", "file.csv", CsvReadOptions::new()).await?;
// Create a plan to run a SQL query
let df = ctx.sql("SELECT * FROM table_name").await?;

Let's Show our data and Schema

To show the data we can write the code below:

// execute and print results
df.show().await?; // shows the entire df; to show n lines use show_limit
Ok(())

Hint: Note that showing data is a lot like Apache Spark, and some methods are similar too. One difference is in the show method: to show a specific number of rows you need to use show_limit.
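To print the schema as well, here is a minimal sketch (same df as above):

// print the dataframe schema (field names and types)
println!("{:?}", df.schema());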

Save Dataframe as parquet

We can save the dataframe as parquet, and the method is also similar to Spark.

df.write_parquet("folder_to_save", None).await?;

Transforming Data

Let's go to the transformations and do some data exploration.

To select only the columns we need, we can use select; you can see more in this link.

Select

let df = df.select(vec![col("a"), col("b")])?;
df.show_limit(5).await?;

Filter

df.filter(col("column_name").eq(lit(0_i32)))?.show_limit(5).await?;

Distinct

let df = df.distinct()?;
df.show().await?;

Others: you can try a lot of other functions like union, union_distinct, sort, join, collect, aggregate, etc… I will show join and aggregate.
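As an illustration, a small hedged sketch of sort and union (the column name and other_df are hypothetical, and both dataframes must have the same schema):

// sort by a column, ascending with nulls first
let sorted_df = df.sort(vec![col("col1").sort(true, true)])?;
// append the rows of another dataframe with the same schema
let union_df = sorted_df.union(other_df)?;
union_df.show_limit(5).await?;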

Case, When, Then

// Use an auxiliary expression
let function = when(col("column_name").eq(lit("stringtoget")), lit(0_u32)).otherwise(col("column_name"))?;
let dataframe = dataframe.with_column("column_name", cast(function, DataType::Int64))?;

Join

let join_dataframe = left_dataframe.join(right_dataframe, JoinType::Inner, &["col_a", "col_b"], &["col_a2", "col_b2"], None)?;

Aggregate

let agg_df = dataframe.aggregate(
    vec![col("col1"), col("col2"), col("col3"), col("col4"), col("col5")],
    vec![sum(col("total_count"))],
)?;

My test pipeline

I wrote a simple pipeline that reads data from a local folder, saves intermediate results following a data lake approach, and applies some transformations to test and check the functionality of the framework.

I will just summarize what I did, since the code has some comments and I explained some of the methods above.

Pipeline

First, import the libs.

Then open a main function.

Create the context and set the file paths as variables.

Read the files with the DataFrame API. Note that I use a loop to print the schema and the first 5 lines of each DataFrame.

Rename the columns and print them to check.

Save the files to the staging area; after that I do a join and some aggregations, and finally I save these files on local storage.
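The exact code is in the repository; below is a condensed, hedged sketch of the pipeline shape. The file paths, column names and the aggregation are illustrative assumptions rather than the exact code, and the DataFrame API shown matches the DataFusion version used at the time of writing (newer releases changed some signatures and import paths):

use datafusion::error::Result;
use datafusion::prelude::*; // SessionContext, CsvReadOptions, col, sum, JoinType, ...

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // read a raw dimension CSV (path follows the folder structure above)
    let age_df = ctx
        .read_csv("data/minilake/raw/age_census/DimenLookupAge8277.csv", CsvReadOptions::new())
        .await?;
    age_df.show_limit(5).await?;

    // rename columns with select + alias (column names are hypothetical)
    let age_df = age_df.select(vec![
        col("Code").alias("age_code"),
        col("Description").alias("age_description"),
    ])?;

    // save the staged table as parquet
    age_df.write_parquet("data/minilake/stage/age_census", None).await?;

    // read the fact table, then join and aggregate (columns are hypothetical)
    let fact_df = ctx
        .read_csv("data/minilake/raw/fato_census/Data8277.csv", CsvReadOptions::new())
        .await?;
    let joined = fact_df.join(age_df, JoinType::Inner, &["Age"], &["age_code"], None)?;
    let aggregated = joined.aggregate(
        vec![col("age_description")],
        vec![sum(col("count"))],
    )?;

    // save the curated output on local storage
    aggregated.write_parquet("data/minilake/curated/aggregated_tables", None).await?;

    Ok(())
}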

Conclusion

I liked writing some data code in Rust and I think DataFusion is a good library, but I found some problems:

  1. When I tried to save a large file as parquet, the file was not saved and ended up corrupted.
  2. I tried to parse and change a column's schema, but when I tried to save, the file was not saved (I describe this in an issue linked below). I tried a filter as well and the problem was the same.
  3. I tried to save parquet with partitioning and the same problem occurred.
  4. I needed to use a size limit (100K rows) to finish my tests and… yes, I managed to join and aggregate the DataFrame (I mention this because when I tried it with all the columns of the DataFrame I got a compiler error).

My concerns: writing code in Rust is a joy… I don't think Rust is a hard language to learn (and I'm not a master-blaster expert in software engineering), and DataFusion is a lot like Spark but with the particularities of Rust.
Although Rust is awesome and DataFusion is good, I think it needs more maturity before you move your Spark code to DataFusion. I understand it is a project on the rise, its user community is not as large as Spark's or PySpark's, and I hope that in the future we will have more projects in Rust.

Issues I opened:

Thanks

Thanks for reading, and feel free to reach me on LinkedIn.
In a few weeks I will write Part 2 of this series.

References

All the code is in my repo: https://gitlab.com/miyake-diogo/rust-big-data-playground

https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip?_ga=2.20958136.102282402.1663639578-985979153.1663098055
