Big Data with Rust, Part 3/3 — Yes we can!!!

Diogo Miyake
6 min read · Mar 13, 2024


A series of posts showing Rust frameworks that let us work in the Big Data area. There are three posts in total: part 1/3, the introduction, using DataFusion; part 2/3 with Polars; and this part 3/3 with Fluvio.

P.S.: Sorry for the delay; I had some personal problems and needed to pause my personal projects to rest, better understand my goals, and get control of my time and projects and, of course, take care of my family and my mind.

I hope you enjoy this series and that it helps you enter the Rustverse.

Other parts of this series

  • Part 1/3 — Introduction and DataFusion
  • Part 2/3 — Polars

Introduction

Welcome to the third and final part of our series on Big Data with Rust. In this article, we focus on Fluvio, an open-source data streaming platform built in Rust. Fluvio provides low-latency, high-performance programmable streaming on a cloud-native architecture, and it is a strong alternative to other data streaming platforms like Kafka and Pulsar.

In this article, we will give an overview of Fluvio’s architecture and features, and demonstrate how to use it for data streaming with Rust. We will cover the basics of Fluvio, including how to create a cluster and how to produce and consume messages. We will deploy Fluvio locally, but there are many other options, such as Docker, Kubernetes, and the cloud.

I focus only on the Rust API, to show how we can do streaming in Rust with a tool that is itself written in Rust.

By the end of this article, you will have a basic understanding of Fluvio and how to use it for data streaming in Rust. So, let’s get started!

About Fluvio

Fluvio is an open-source data streaming platform written in Rust that provides low-latency, high-performance programmable streaming on a cloud-native architecture.[1]

Fluvio provides a CLI, APIs, and a concept called SmartModules, which lets developers and users write declarative code in YAML to automate repetitive tasks, orchestrate data flows, and enable advanced functionality without requiring deep technical knowledge or extensive coding.[2]

[Figure: Fluvio overview — producers and inbound connectors on the left; core resources (SmartModules, data streams, immutable store) in the middle; consumers and outbound connectors on the right; transformation functions such as filter, aggregate, and map on top. Ref: https://www.fluvio.io/docs/]

Before you start

You can create your own cluster in several ways:

  • Fluvio Cloud [3]
  • Fluvio local installation [4]
  • Fluvio in Kubernetes [5]

High-level architecture

Fluvio’s architecture centers on real-time streaming, and the platform can scale horizontally to accommodate large volumes of data.[6]

Basically, there is a Streaming Controller (SC) that manages the Streaming Processing Units (SPUs). The SPUs are responsible for all data streaming tasks, and the two components are independent and loosely coupled. It is similar to a master/worker node architecture.

If you want a high-level understanding of the SC and SPU, check the documentation referenced in this text.
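As a quick sanity check once a cluster is running, you can list the SPUs the SC is managing from the CLI; a command like fluvio cluster spu list should show them (check fluvio cluster --help for the exact subcommand in your version).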

Connectors and SmartModules

Connectors: a simple way to import (inbound) or export (outbound) data.

[Figure: The stages of a connector — Protocol, Extract, Filter, and Shape — with protocol transformations and user-defined transformations applied between the inbound and outbound sides. Ref: [6]]

A connector has four stages: [6]

  • Protocol: Parses data according to the wire format of the connected data platform.
  • Extract: Extracts raw data from the protocol format and packages it neatly into data structures that may be used by subsequent stages or be produced directly to a topic.
  • Filter (optional): A user-provided SmartModule that may determine whether a given record should be discarded before sending it over the network to Fluvio, saving bandwidth (see the Rust sketch after this list).
  • Shape (optional): A user-provided SmartModule that may take the extracted data structures and transform them into an application-specific format.
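To make the Filter stage concrete, here is a minimal sketch of a filter SmartModule in Rust, following the fluvio-smartmodule crate’s filter API; the predicate (keeping only records that mention “cat”, to match the YAML example below) is my own illustration:

use fluvio_smartmodule::{smartmodule, Record, Result};

// A filter SmartModule returns Ok(true) to keep a record
// and Ok(false) to discard it.
#[smartmodule(filter)]
pub fn filter(record: &Record) -> Result<bool> {
    // Interpret the record value as UTF-8 text
    let value = std::str::from_utf8(record.value.as_ref())?;
    // Keep only records that mention "cat"
    Ok(value.contains("cat"))
}

Compiled to WebAssembly and registered with the cluster, a module like this can drop records early, which is exactly the bandwidth saving described above.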

There are several prebuilt inbound and outbound connectors, and users can create their own. [7]

SmartModules: a way to automate and perform actions or processes; they help users automate tasks, orchestrate data flows, and enable other functionality.[8]

Example:[9]

This SmartModule pipeline reshapes each record (keeping the fact and length fields) and then filters for records matching “cat” or “Cat”.

transforms:
  - uses: infinyon/jolt@0.3.0
    with:
      spec:
        - operation: shift
          spec:
            fact: "animal.fact"
            length: "length"
  - uses: infinyon/regex-filter@0.1.0
    with:
      regex: "[Cc]at"
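In practice, you would save a pipeline like this to a YAML file and reference it from a connector configuration, or from the CLI when producing or consuming; see the SmartModules documentation [9] for the exact invocation.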

Starting Tests

Install Fluvio and FVM (Fluvio Version Manager)

curl -fsS https://hub.infinyon.cloud/install/install.sh | bash

Create a local cluster with the command fluvio cluster start

Create a topic: you can do this with the CLI or with the API.
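With the CLI this is a single command; assuming the same topic name used in the code below, fluvio topic create customer should do it.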

With the API, I used the following code to create a topic and produce messages:

use fluvio::metadata::topic::TopicSpec;
use fluvio::{Fluvio, RecordKey};
use rand::Rng;
use serde::{Deserialize, Serialize};

const TOPIC_NAME: &str = "customer";
const PARTITIONS: u32 = 1;
const REPLICAS: u32 = 1;

#[derive(Serialize, Deserialize, Debug)]
struct Customer {
    id: i32,
    name: String,
    email: String,
}

impl Customer {
    // Build a customer with a random id, name, and email
    fn random() -> Self {
        let mut rng = rand::thread_rng();
        Customer {
            id: rng.gen_range(1..=1_000_000),
            name: format!("Customer-{}", rng.gen::<u32>()),
            email: format!("customer-{}@example.com", rng.gen::<u32>()),
        }
    }
}

#[async_std::main]
async fn main() {
    // Connect to the Fluvio cluster
    let fluvio = Fluvio::connect().await.unwrap();

    // Create the topic; the result is ignored so reruns
    // don't fail when the topic already exists
    let admin = fluvio.admin().await;
    let topic_spec = TopicSpec::new_computed(PARTITIONS, REPLICAS, None);
    let _topic_create = admin
        .create(TOPIC_NAME.to_string(), false, topic_spec)
        .await;

    // Produce 1000 JSON-encoded customers to the topic
    let producer = fluvio::producer(TOPIC_NAME).await.unwrap();
    for _ in 0..1000 {
        let customer = Customer::random();
        let record_key = RecordKey::NULL;
        let record_value = serde_json::to_string(&customer).unwrap();
        producer.send(record_key, record_value).await.unwrap();
    }
    producer.flush().await.unwrap();
}
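Before writing the consumer, you can check that the records landed using the CLI; something like fluvio consume customer -B -d should print them from the beginning and exit at the end (check fluvio consume --help for the exact flags in your version).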

I used this code to consume messages:

use async_std::stream::StreamExt;
use fluvio::Offset;
use serde::{Deserialize, Serialize};

const TOPIC_NAME: &str = "customer";
const PARTITION_NUM: u32 = 0;

#[derive(Serialize, Deserialize, Debug)]
struct Customer {
    id: i32,
    name: String,
    email: String,
}

#[async_std::main]
async fn main() {
    // Create a consumer for one partition of the topic
    // (the helper connects to the cluster on its own)
    let consumer = fluvio::consumer(TOPIC_NAME, PARTITION_NUM).await.unwrap();

    // Stream records starting 10 records before the end of the partition
    let mut stream = consumer.stream(Offset::from_end(10)).await.unwrap();

    // Decode each record's JSON payload and print it
    while let Some(Ok(record)) = stream.next().await {
        let customer: Customer = serde_json::from_slice(record.value()).unwrap();
        println!("{:?}", customer);
    }
}
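A note on the offset: Offset::from_end(10) starts the stream ten records before the current end of the partition; to replay the whole partition, you could use Offset::beginning() instead.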

Just enter the project folder and run cargo run.

To delete the local cluster, run the command fluvio cluster delete

Conclusion

Although I only ran simple tests with Fluvio, I think more testing with SmartModules and Connectors is needed to assess the tool’s usability.

Below are my impressions of the tool:

  • The Producer and Consumer APIs are easy to understand because they are similar to Kafka’s.
  • The Kubernetes Helm chart and deployment configs make it easy to run cloud tests.
  • There is a cloud version supported by InfinyOn.
  • Fluvio has a lot of configuration options and resources, like Kafka.
  • There are few examples of its use with Rust and of other features.
  • The SmartModules documentation is hard to follow, because some examples are in Rust and others in YAML.
  • The crate documentation is thin, for example around MultiplePartitionConsumer.
[Screenshot: fluvio crate documentation for the consumer API, ref.: [11]]

Final observations:

If you are interested in using Fluvio, I recommend testing the tool and benchmarking it against Kafka, Pulsar, and other tools. I also recommend testing SmartModules and Connectors in a real-life (or close to real-life) data pipeline.

I found one example combining Fluvio, DeepCausality, and Rust on the InfinyOn blog:

References

[1] https://www.fluvio.io/docs/

[2] https://www.fluvio.io/smartmodules/core-concepts/

[3] https://infinyon.cloud/ui/login

[4] https://www.fluvio.io/docs/install/

[5] https://www.fluvio.io/docs/kubernetes/install/

[6] https://www.fluvio.io/connectors/core-concepts/

[7] https://www.fluvio.io/connectors/cdk/overview/

[8] https://www.fluvio.io/smartmodules/

[9] https://www.fluvio.io/smartmodules/core-concepts/

[11] https://docs.rs/fluvio/latest/fluvio/struct.Fluvio.html#method.consumer

[12] https://www.fluvio.io/connectors/cdk/github-examples/

[13] https://gitlab.com/miyake-diogo/rust-big-data-playground

Written by Diogo Miyake

Big Data Platform Security Engineer with knowledge of architecture. Technology and software enthusiast, trying to help people live better.
