Big Data with RUST Part 1/3 — Yes we can!!!

Diogo Miyake
Oct 10, 2022 · 10 min read


A series of posts showing Rust frameworks that allow us to work in the Big Data area. In total there are three posts: this one (part 1/3) introduces the series and uses DataFusion, part 2/3 covers Polars, and part 3/3 covers Fluvio.

Photo by Cookie the Pom on Unsplash

After a few months without time to write anything, I decided to write about some tests I have been doing with Rust.

Purpose of this article

The purpose of this article is to share my (small) experience with these Big Data frameworks and to show that Rust can be a lightweight alternative to other computational libraries and frameworks for big data processing… After all, Moore's law is there to guide us about the evolution of hardware, and not everything is solved by increasing computational power.

Other parts of this series

Why Rust?

First, I love studying and testing new technologies. Second, Rust's user base keeps growing, it is a safe language, and it is a good one to know. You can learn more about Rust here:

My Big Data context

I have 4+ years of experience with data, 3+ of them in Big Data, and today I work with data architecture, data pipelines, data security and cloud computing technologies. When I discovered Rust I thought: this is great, why don't I try it in a Big Data context? And after some time, here I am.

Frameworks for Big Data Processing

Today there are a lot of frameworks for big data processing; I will cite the most common and widely used ones:

Spark: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Kafka: Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Flink: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Beam: Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines that simplifies large-scale data processing dynamics. Thousands of organizations around the world choose Apache Beam due to its unique data processing features, proven scale, and powerful yet extensible capabilities.

Cloud options: there are many cloud options that use some of the above frameworks at their core, for example: AWS EMR clusters run Spark (among other tools), Databricks uses Spark, GCP Dataflow uses Apache Beam, etc.

PS: DataFusion is, at the moment, only for batch processing; in the other parts of this series described above I will show other frameworks for streaming.

PS2: Some descriptions are taken from the projects' documentation.

What is DataFusion?

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

DataFusion supports both an SQL and a DataFrame API for building logical query plans as well as a query optimizer and execution engine capable of parallel execution against partitioned data sources (CSV and Parquet) using threads.

Arrow

Apache Arrow is a project used by tools such as Spark, Parquet, Polars, Dremio, Dask and Pandas. For other use cases, see this page.

Why DataFusion?

PS: info from the project's main page…

  • High Performance: Leveraging Rust and Arrow’s memory model, DataFusion achieves very high performance
  • Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
  • Easy to Embed: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
  • High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

Come on, let's get started with the code…


First: all of the code is in this repository on GitLab: https://gitlab.com/miyake-diogo/rust-big-data-playground

Second: Install Rust

Third: Clone the repo or create your own project with `cargo new <any name of your project>`

Fourth: See dataset details here: https://www.stats.govt.nz/large-datasets/csv-files-for-download/

Info about datasets: https://www.stats.govt.nz/information-releases/statistical-area-1-dataset-for-2018-census-updated-march-2020

Click on this link and download the file.

After downloading, extracting and renaming the files, add them to a folder structure like the one below:

data
└── minilake
    ├── curated
    │   ├── aggregated_tables
    │   │   ├── part-0.parquet
    │   │   ├── part-1.parquet
    │   │   ├── part-2.parquet
    │   │   ├── part-3.parquet
    │   │   ├── part-4.parquet
    │   │   ├── part-5.parquet
    │   │   ├── part-6.parquet
    │   │   ├── part-7.parquet
    │   │   ├── part-8.parquet
    │   │   └── part-9.parquet
    │   └── final_obt
    │       ├── part-0.parquet
    │       ├── part-1.parquet
    │       ├── part-2.parquet
    │       ├── part-3.parquet
    │       ├── part-4.parquet
    │       ├── part-5.parquet
    │       ├── part-6.parquet
    │       ├── part-7.parquet
    │       ├── part-8.parquet
    │       └── part-9.parquet
    ├── raw
    │   ├── age_census
    │   │   └── DimenLookupAge8277.csv
    │   ├── area_census
    │   │   └── DimenLookupArea8277.csv
    │   ├── covid
    │   │   └── owid-covid-data.csv
    │   ├── ethnic_census
    │   │   └── DimenLookupEthnic8277.csv
    │   ├── fato_census
    │   │   └── Data8277.csv
    │   ├── sex_census
    │   │   └── DimenLookupSex8277.csv
    │   └── year_census
    │       └── DimenLookupYear8277.csv
    └── stage
        ├── age_census
        │   └── part-0.parquet
        ├── area_census
        │   └── part-0.parquet
        ├── covid
        │   └── part-0.parquet
        ├── ethnic_census
        │   └── part-0.parquet
        ├── fato_census
        │   └── part-0.parquet
        ├── sex_census
        │   └── part-0.parquet
        └── year_census
            └── part-0.parquet


DON'T FORGET TO DOWNLOAD THE DATA AND CREATE THE FOLDER STRUCTURE

Let's Start

Add the dependencies to Cargo.toml:
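As a reference, here is a minimal Cargo.toml sketch (the version numbers are an assumption; pick the latest releases available when you read this):

[dependencies]
# DataFusion query engine (version here is an assumption)
datafusion = "13"
# async runtime needed to await the DataFusion API
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }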

Run cargo build to install the libraries.
**Note:** If you have problems with parquet or arrow, you may need to update rustup.

Overview of commands

First create a context

You can create a SessionContext to run your queries or processing methods.

let ctx: SessionContext = SessionContext::new();
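For context, a minimal sketch of the surrounding boilerplate assumed in the snippets below (the prelude import and an async main via tokio):

use datafusion::error::Result;
use datafusion::prelude::*; // SessionContext, CsvReadOptions, col, lit, etc.

#[tokio::main]
async fn main() -> Result<()> {
    // create the session context used by the examples below
    let ctx = SessionContext::new();
    // ... read, transform and write data here ...
    Ok(())
}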

Read a CSV

To read a CSV we can use two different methods:

Dataframe API

// read csv to dataframe
let df = ctx.read_csv("file.csv", CsvReadOptions::new()).await?;

If you want to set options on CsvReadOptions, you can chain them after `.new()`:

// read csv to dataframe, passing the delimiter option to CsvReadOptions
let df = ctx.read_csv("file.csv", CsvReadOptions::new().delimiter(b';')).await?;
// execute and print results
df.show_limit(5).await?;
Ok(())

SQL API

// Register table from csv
ctx.register_csv("table_name", "file.csv", CsvReadOptions::new()).await?;
// Create a plan to run a SQL query
let df = ctx.sql("SELECT * FROM table_name").await?;

Let's Show our data and Schema

To show the data we can write the code below:

// execute and print results
df.show().await?; // shows the entire df; to show n lines use show_limit
Ok(())

Hint: Note that showing data is a lot like Apache Spark, and some methods are similar too. One difference is in the show method: to show a specific number of rows you need to use show_limit.
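To print the schema as well, here is a minimal sketch (same df as above):

// print the dataframe schema (field names and types)
println!("{:?}", df.schema());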

Save Dataframe as parquet

We can save the dataframe as parquet, and the method is also similar to Spark.

df.write_parquet("folder_to_save", None).await?;

Transforming Data

Let's go to the transformations and do some data exploration.

To select only the columns we need, we can use select; you can see more in this link.

Select

let df = df.select(vec![col("a"), col("b")])?;
df.show_limit(5).await?;

Filter

df.filter(col("column_name").eq(lit(0_i32)))?.show_limit(5).await?;

Distinct

let df = df.distinct()?;
df.show().await?;

Others: you can try a lot of other functions like union, union_distinct, sort, join, collect, aggregate, etc… I will show join and aggregate.
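As an illustration, a small hedged sketch of sort and union (the column name and other_df are hypothetical, and both dataframes must have the same schema):

// sort by a column, ascending with nulls first
let sorted_df = df.sort(vec![col("col1").sort(true, true)])?;
// append the rows of another dataframe with the same schema
let union_df = sorted_df.union(other_df)?;
union_df.show_limit(5).await?;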

Case, When, Then

// Use an auxiliary expression
let function = when(col("column_name").eq(lit("stringtoget")), lit(0_u32)).otherwise(col("column_name"))?;
let dataframe = dataframe.with_column("column_name", cast(function, DataType::Int64))?;

Join

let join_dataframe = left_dataframe.join(right_dataframe, JoinType::Inner, &["col_a", "col_b"], &["col_a2", "col_b2"], None)?;

Aggregate

let agg_df = dataframe.aggregate(
    vec![col("col1"), col("col2"), col("col3"), col("col4"), col("col5")],
    vec![sum(col("total_count"))],
)?;

My test pipeline

I wrote a simple pipeline that reads data from a local folder, saves intermediate results following a data lake approach, and applies some transformations to test and check the functionality of the framework.

I will just summarize what I did, since the code has some comments and I explained some of the methods above.

Pipeline

First, import the libs.

Then open a main function.

Create the context and set the file paths as variables.

Read the files with the DataFrame API. Note that I use a loop to print the schema and the first 5 lines of each DataFrame.

Rename the columns and print them to check.

Save the files to the staging area; after that I do a join and some aggregations, and finally I save these files on local storage.
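The exact code is in the repository; below is a condensed, hedged sketch of the pipeline shape. The file paths, column names and the aggregation are illustrative assumptions rather than the exact code, and the DataFrame API shown matches the DataFusion version used at the time of writing (newer releases changed some signatures and import paths):

use datafusion::error::Result;
use datafusion::prelude::*; // SessionContext, CsvReadOptions, col, sum, JoinType, ...

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // read a raw dimension CSV (path follows the folder structure above)
    let age_df = ctx
        .read_csv("data/minilake/raw/age_census/DimenLookupAge8277.csv", CsvReadOptions::new())
        .await?;
    age_df.show_limit(5).await?;

    // rename columns with select + alias (column names are hypothetical)
    let age_df = age_df.select(vec![
        col("Code").alias("age_code"),
        col("Description").alias("age_description"),
    ])?;

    // save the staged table as parquet
    age_df.write_parquet("data/minilake/stage/age_census", None).await?;

    // read the fact table, then join and aggregate (columns are hypothetical)
    let fact_df = ctx
        .read_csv("data/minilake/raw/fato_census/Data8277.csv", CsvReadOptions::new())
        .await?;
    let joined = fact_df.join(age_df, JoinType::Inner, &["Age"], &["age_code"], None)?;
    let aggregated = joined.aggregate(
        vec![col("age_description")],
        vec![sum(col("count"))],
    )?;

    // save the curated output on local storage
    aggregated.write_parquet("data/minilake/curated/aggregated_tables", None).await?;

    Ok(())
}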

Conclusion

I liked writing some data code in Rust and I think DataFusion is a good library, but I found some problems:

  1. When I tried to save a large file as parquet, the file was not saved and ended up corrupted.
  2. I tried to parse and change a column's schema, but when I tried to save, the file was not saved (I describe this in an issue linked below). I tried a filter as well and the problem was the same.
  3. I tried to save parquet with partitioning and the same problem occurred.
  4. I needed to use a size limit (100K rows) to finish my tests and… yes, I managed to join and aggregate the DataFrame (I mention this because when I tried it with all the columns of the DataFrame I got a compiler error).

My concerns: writing code in Rust is a joy… I don't think Rust is a hard language to learn (and I'm not a master-blaster expert in software engineering), and DataFusion is a lot like Spark but with the particularities of Rust.
Although Rust is awesome and DataFusion is good, I think it needs more maturity before you move your Spark code to DataFusion. I understand it is a project on the rise, its user community is not as large as Spark's or PySpark's, and I hope that in the future we will have more projects in Rust.

Issues I opened:

Thanks

Thanks for reading, and feel free to reach me on LinkedIn.
In a few weeks I will write Part 2 of this series.

References

All the code is in my repo: https://gitlab.com/miyake-diogo/rust-big-data-playground

https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip?_ga=2.20958136.102282402.1663639578-985979153.1663098055
