A Comprehensive Guide to Smallpond, an Open-Source, Lightweight Data Processing Framework

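Following the groundbreaking impact of DeepSeek R1, DeepSeek AI continues to push innovation with its latest release: Smallpond. This lightweight data processing framework combines DuckDB for SQL analytics with 3FS for high-performance distributed storage, and is designed to handle petabyte-scale datasets efficiently. Smallpond aims to simplify data processing for AI and big data applications by removing the need for long-running services and complex infrastructure. In this article we look at the features, components, and applications of DeepSeek AI's Smallpond framework, and learn how to use it.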

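Learning objectives

Understand what DeepSeek Smallpond is and how it extends DuckDB for distributed data processing.
Learn how to install Smallpond, set up a Ray cluster, and configure the compute environment.
Learn how to ingest, process, and partition data with Smallpond's API.
Identify practical use cases such as AI training, financial analytics, and log processing.
Weigh the advantages and challenges of using Smallpond for distributed analytics.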

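What is DeepSeek Smallpond?

Smallpond is an open-source, lightweight data processing framework developed by DeepSeek AI. It extends DuckDB, a high-performance in-process analytical database, to distributed environments.

By integrating DuckDB with the Fire-Flyer File System (3FS), Smallpond offers a scalable way to process petabyte-scale datasets without the heavy overhead of traditional big data frameworks such as Apache Spark.

Released on February 28, 2025 as part of DeepSeek's Open Source Week, Smallpond targets data engineers and data scientists who need an efficient, simple, high-performance tool for distributed analytics.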

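Key features of Smallpond

High performance: builds on DuckDB's native SQL engine and 3FS's multi-terabyte-per-minute throughput.
Scalability: processes petabyte-scale data across distributed nodes through manual partitioning.
Simplicity: no long-running services or complex dependencies; it deploys and runs with minimal setup.
Flexibility: supports Python 3.8 to 3.12 and integrates with Ray for parallel processing.
Open source: MIT-licensed, which encourages community contributions and customization.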

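Core components of DeepSeek Smallpond

Let us now look at the core components of the DeepSeek Smallpond framework.

DuckDB

DuckDB is an embedded, in-process SQL OLAP database optimized for analytical workloads. It runs complex queries over large datasets with very low latency, which makes it a strong choice for single-node analytics. Smallpond extends DuckDB to distributed systems while keeping its performance advantages.

3FS (Fire-Flyer File System)

3FS is a distributed file system that DeepSeek designed for AI and high-performance computing (HPC) workloads. It uses modern SSDs and RDMA networking to deliver low-latency, high-throughput storage (for example, 6.6 TiB/s of read throughput on a 180-node cluster). Unlike traditional file systems, 3FS prioritizes random reads over caching, which matches the access patterns of AI training and analytics.

Integrating DuckDB and 3FS in Smallpond

Smallpond uses DuckDB as its compute engine and 3FS as its storage backbone. Data is stored on 3FS in Parquet format, partitioned manually by the user, and processed in parallel across nodes by DuckDB instances that Ray coordinates. This combines DuckDB's query efficiency with the scalable storage of 3FS for distributed analytics.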
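
The division of labour described above can be pictured with a toy sketch that uses only the Python standard library. It does not call Smallpond, DuckDB, Ray, or 3FS: a thread pool stands in for Ray's scheduling, a plain function stands in for a DuckDB instance, and an in-memory list stands in for Parquet data. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy rows standing in for a Parquet dataset.
ROWS = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 20.0},
    {"ticker": "AAA", "price": 30.0},
    {"ticker": "CCC", "price": 40.0},
    {"ticker": "BBB", "price": 50.0},
]


def split_into_partitions(rows, n):
    """Manually partition rows round-robin into n lists."""
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions


def process_partition(partition):
    """Per-partition work (the role a DuckDB instance plays): count rows and sum prices."""
    return len(partition), sum(row["price"] for row in partition)


def run(rows, n_partitions):
    """Partition, process every partition in parallel, then merge the partial results."""
    partitions = split_into_partitions(rows, n_partitions)
    # The executor plays the role of the scheduler (Ray in Smallpond).
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(process_partition, partitions))
    total_count = sum(count for count, _ in partials)
    total_price = sum(total for _, total in partials)
    return total_count, total_price / total_count


count, mean_price = run(ROWS, n_partitions=3)
print(count, mean_price)
```

The point of the sketch is the shape of the pipeline: each partition is processed independently, and only small partial results are merged at the end.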

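Getting started with Smallpond

Now let us see how to install and use Smallpond.

Step 1: Installation

Smallpond is Python-based, installs through pip, and currently runs only on Linux distributions. Make sure you have Python 3.8 to 3.12 installed, plus a compatible 3FS cluster (or a local filesystem for testing).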

# Install Smallpond with dependencies
pip install smallpond
# Optional: install development dependencies (e.g., for testing)
pip install "smallpond[dev]"
# Install Ray with its default extras, used for running Ray clusters
pip install 'ray[default]'
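
Before running pip, it can help to confirm that the interpreter and operating system match the requirements stated above. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the version bounds are simply the ones quoted in this guide.

```python
import platform
import sys

# Supported range as stated in this guide (assumed to be inclusive).
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)


def check_environment(python_version, system):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    major_minor = tuple(python_version[:2])
    if not (MIN_PYTHON <= major_minor <= MAX_PYTHON):
        problems.append(
            f"Python {major_minor[0]}.{major_minor[1]} is outside the 3.8-3.12 range"
        )
    if system != "Linux":
        problems.append(f"{system} is not Linux")
    return problems


issues = check_environment(sys.version_info, platform.system())
print("environment looks compatible" if not issues else "\n".join(issues))
```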

For 3FS, clone and build it from the GitHub repository:

git clone https://github.com/deepseek-ai/3fs
cd 3fs
git submodule update --init --recursive
./patches/apply.sh
# Install dependencies (Ubuntu 20.04/22.04 example)
sudo apt install cmake libuv1-dev liblz4-dev libboost-all-dev
# Build 3FS (refer to 3FS docs for detailed instructions)

Step 2: Set up the environment

If you are using 3FS, initialize a Ray instance for the Ray cluster with the following command:

# Initialize Ray accordingly: start the head node with your CPU and GPU counts
ray start --head --num-cpus=<NUM_CPUS> --num-gpus=<NUM_GPUS>

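Running the command above prints output similar to the figure below:

[Figure: Smallpond environment setup]

We can now initialize Ray with 3FS using the address shown in that output. To initialize Ray in Smallpond, configure a compute cluster (for example, AWS EC2 or on-premises machines) and deploy 3FS on SSD-equipped nodes; for local testing (Linux/Ubuntu), use a filesystem path instead.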

import smallpond
# Initialize a Smallpond session (local filesystem for testing)
sp = smallpond.init(data_root="Path/to/local/Storage", ray_address="192.168.214.165:6379")  # Enter your own Ray address
# For a 3FS cluster (update with your 3FS endpoint and Ray address)
sp = smallpond.init(data_root="3fs://cluster_endpoint", ray_address="192.168.214.165:6379")  # Enter your own Ray address
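
Because the Ray address is copied by hand from the `ray start` output, a typo is easy to make. As a minimal sketch, the values can be checked before they are handed to `smallpond.init`. Both helpers below, `parse_ray_address` and `build_init_kwargs`, are hypothetical and not part of Smallpond or Ray; the only assumption about Smallpond is the `data_root` and `ray_address` keywords used in the snippet above.

```python
def parse_ray_address(address):
    """Split 'host:port' into (host, port) and validate the port number."""
    host, sep, port_text = address.rpartition(":")
    if not sep or not host or not port_text.isdecimal():
        raise ValueError(f"expected 'host:port', got {address!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


def build_init_kwargs(data_root, ray_address):
    """Hypothetical helper: validate inputs and return keyword arguments for smallpond.init."""
    parse_ray_address(ray_address)  # raises ValueError on malformed input
    if not data_root:
        raise ValueError("data_root must not be empty")
    return {"data_root": data_root, "ray_address": ray_address}


kwargs = build_init_kwargs("Path/to/local/Storage", "192.168.214.165:6379")
print(kwargs)
# With Smallpond installed, the session would then be created as in the snippet above:
# sp = smallpond.init(**kwargs)
```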

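Step 3: Data ingestion and preparation

Supported data formats

Smallpond primarily supports Parquet files, a columnar format that works well with DuckDB. Other formats, such as CSV, can be handled through DuckDB's native capabilities.

Reading and writing data

Use Smallpond's high-level API to load and save data.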

# Read Parquet file
df = sp.read_parquet("data/input.prices.parquet")
# Process data (example: keep only rows where price > 100)
df = df.filter("price > 100")  # SQL predicate; map() is for computing columns
# Write results back to Parquet
df.write_parquet("data/output/filtered.prices.parquet")

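Data partitioning strategies

Manual partitioning is the key to Smallpond's scalability. Choose a strategy based on your data and workload:

By file count: split the data into a fixed number of files.
By rows: distribute rows evenly across partitions.
By hash: split on the hash of a column so that rows are spread evenly and equal values land in the same partition.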

# Partition by file count
df = df.repartition(3)
# Partition by rows
df = df.repartition(3, by_row=True)
# Partition by column hash (e.g., ticker)
df = df.repartition(3, hash_by="ticker")
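
To make the difference between the three strategies concrete, here is a small standard-library sketch of what each one does to a list of files or rows. It illustrates the ideas only and is not Smallpond's implementation; the function names are hypothetical and CRC32 is an arbitrary choice of stable hash.

```python
import zlib


def partition_by_files(files, n):
    """Assign whole files to n partitions round-robin."""
    parts = [[] for _ in range(n)]
    for i, name in enumerate(files):
        parts[i % n].append(name)
    return parts


def partition_by_rows(rows, n):
    """Split rows into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts


def partition_by_hash(rows, n, key):
    """Route each row by a stable hash of one column, so equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % n].append(row)
    return parts


rows = [
    {"ticker": t, "price": p}
    for t, p in [("AAA", 1), ("BBB", 2), ("AAA", 3), ("CCC", 4),
                 ("BBB", 5), ("AAA", 6), ("DDD", 7)]
]

print(partition_by_files(["a.parquet", "b.parquet", "c.parquet", "d.parquet"], 3))
print([len(p) for p in partition_by_rows(rows, 3)])
print([sorted({r["ticker"] for r in p}) for p in partition_by_hash(rows, 3, "ticker")])
```

Row-based splitting gives the most even sizes, while hash-based splitting guarantees that all rows for one key are co-located, which matters for per-partition GROUP BY queries. With skewed keys, hash partitions can be uneven.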

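Step 4: API reference

High-level API overview

The high-level API simplifies loading, transforming, and saving data:

read_parquet(path): loads Parquet files.
write_parquet(path): saves the processed data.
repartition(n, [by_row, hash_by]): partitions the data.
map(expr): applies a transformation.

Low-level API overview

For advanced use, Smallpond exposes DuckDB's SQL engine and Ray's task distribution directly:

Run raw SQL through partial_sql.
Manage Ray tasks for custom parallelism.

Detailed function descriptions

sp.read_parquet(path): reads Parquet files into a distributed DataFrame.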

df = sp.read_parquet("3fs://data/input/*.parquet")

df.map(expr): applies a SQL-like or Python transformation.

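# SQL-like: map() takes SQL select expressions rather than a full SELECT statement
df = df.map("ticker, price * 1.1 AS adjusted_price")
# Python function (an alternative to the SQL form above, applied to the original columns)
df = df.map(lambda row: {"adjusted_price": row["price"] * 1.1})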
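
To see what the Python-function form computes, the same per-row logic can be applied to plain dictionaries without Smallpond. This is only an illustration of the row-in, row-out contract: in this sketch the output row contains exactly the keys the function returns, so a column such as `ticker` has to be copied explicitly if it should be kept. The function names are hypothetical.

```python
def adjust(row):
    """Same logic as the lambda above: derive a price raised by 10%."""
    return {"adjusted_price": row["price"] * 1.1}


def adjust_keep_ticker(row):
    """Variant that also carries the ticker column through to the output row."""
    return {"ticker": row["ticker"], "adjusted_price": row["price"] * 1.1}


rows = [{"ticker": "AAA", "price": 100.0}, {"ticker": "BBB", "price": 250.0}]

mapped = [adjust(row) for row in rows]
kept = [adjust_keep_ticker(row) for row in rows]

print(mapped)
print(kept)
```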

sp.partial_sql(query, df): runs a SQL query on a DataFrame, where {0} in the query refers to df.

df = sp.partial_sql("SELECT ticker, MIN(price), MAX(price) FROM {0} GROUP BY ticker", df)
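
The word "partial" suggests that the query is evaluated partition by partition rather than over the whole dataset at once. Under that assumption, a GROUP BY only yields one complete row per key if the data was first hash-partitioned on that key. The sketch below illustrates the effect with the standard library's sqlite3 module as a stand-in for DuckDB; the helpers `run_on_partition` and `partial_query` are hypothetical and do not come from Smallpond.

```python
import sqlite3
import zlib

QUERY = "SELECT ticker, MIN(price), MAX(price) FROM {0} GROUP BY ticker"


def run_on_partition(query_template, rows):
    """Load one partition into an in-memory table and run the templated query on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE part (ticker TEXT, price REAL)")
        conn.executemany("INSERT INTO part VALUES (?, ?)", rows)
        return conn.execute(query_template.format("part")).fetchall()
    finally:
        conn.close()


def partial_query(query_template, partitions):
    """Run the query independently on every partition and concatenate the results."""
    results = []
    for rows in partitions:
        results.extend(run_on_partition(query_template, rows))
    return sorted(results)


rows = [("AAA", 10.0), ("BBB", 20.0), ("AAA", 30.0), ("BBB", 5.0)]

# Arbitrary split: each ticker appears in both partitions.
arbitrary = [rows[:2], rows[2:]]

# Hash split: rows with the same ticker always share a partition.
hashed = [[], []]
for row in rows:
    hashed[zlib.crc32(row[0].encode("utf-8")) % 2].append(row)

print(partial_query(QUERY, arbitrary))  # one partial group per (partition, ticker)
print(partial_query(QUERY, hashed))     # one complete group per ticker
```

This is why the hash-by-ticker repartition shown earlier pairs naturally with a per-ticker aggregation.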

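Performance benchmarks

Smallpond performs strongly on benchmarks such as GraySort: on a cluster of 50 compute nodes and 25 3FS storage nodes, it sorted 110.5 TiB of data across 8,192 partitions in 30 minutes 14 seconds, an average throughput of 3.66 TiB/min.

[Figure: Smallpond performance benchmark]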
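
The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; dividing the data size by the elapsed time gives roughly 3.65 TiB/min, close to the reported 3.66 TiB/min, and the per-node and per-partition values are derived here rather than taken from the benchmark report.

```python
TIB_IN_GIB = 1024

data_tib = 110.5                 # total data sorted
elapsed_minutes = 30 + 14 / 60   # 30 minutes 14 seconds
partitions = 8192
compute_nodes = 50

# Aggregate throughput across the whole cluster.
throughput_tib_per_min = data_tib / elapsed_minutes
# Average share of that throughput handled by each compute node.
per_node_gib_per_min = throughput_tib_per_min * TIB_IN_GIB / compute_nodes
# Average amount of data in each partition.
partition_size_gib = data_tib * TIB_IN_GIB / partitions

print(f"aggregate throughput: {throughput_tib_per_min:.2f} TiB/min")
print(f"per compute node:     {per_node_gib_per_min:.1f} GiB/min")
print(f"average partition:    {partition_size_gib:.2f} GiB")
```

An average partition of about 13.8 GiB shows the granularity at which the work was split in this run.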

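Best practices for performance

Partition wisely: match partition size to node memory and workload.
Make full use of 3FS: use SSDs and RDMA to get maximum I/O throughput.
Minimize repartitioning: pre-partition the data to reduce network overhead.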
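
As a rough illustration of matching partition size to node memory, the partition count can be estimated from the dataset size and a per-worker memory budget. The heuristic below is an assumption made for this example, not a rule from Smallpond's documentation; the function name, the `memory_fraction` parameter, and the sample numbers are all hypothetical.

```python
import math


def suggest_partition_count(dataset_gib, node_memory_gib, workers_per_node,
                            memory_fraction=0.5):
    """Hypothetical heuristic: size partitions so that each worker's partition
    fits within a fraction of its share of node memory."""
    if dataset_gib <= 0 or node_memory_gib <= 0 or workers_per_node <= 0:
        raise ValueError("sizes and worker count must be positive")
    if not 0 < memory_fraction <= 1:
        raise ValueError("memory_fraction must be in (0, 1]")
    target_partition_gib = node_memory_gib * memory_fraction / workers_per_node
    count = math.ceil(dataset_gib / target_partition_gib)
    return count, target_partition_gib


# Example: a 10 TiB dataset, 256 GiB nodes, 16 workers per node.
count, target_gib = suggest_partition_count(10 * 1024, 256, 16)
print(f"{count} partitions of about {target_gib:.0f} GiB each")
```

The resulting number would then be passed as the first argument to `repartition`, and tuned from there based on observed memory use.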

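Scalability considerations

10 TB to 1 PB: a good fit for Smallpond on a moderately sized cluster.
Above 1 PB: requires substantial infrastructure (for example, 180 or more nodes).
Cluster management: use a managed Ray service (such as Anyscale) to simplify scaling.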

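Applications of Smallpond

AI data preprocessing: prepare training datasets at petabyte scale.
Financial analytics: aggregate and analyze market data across distributed nodes.
Log processing: process server logs in parallel for real-time insights.
AI training at DeepSeek: Smallpond and 3FS sorted 110.5 TiB of data in 30 minutes 14 seconds, supporting efficient model training.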

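Pros and cons of Smallpond

Aspect	Pros	Cons
Scalability	Handles petabyte-scale data efficiently	Cluster management overhead
Performance	Excellent benchmark performance	May not be optimized for single-node performance
Cost	Open source and cost-effective	Depends on external frameworks
Usability	User-friendly API for ML developers	Security concerns associated with DeepSeek's AI models
Architecture	Uses DuckDB and Ray Core for distributed computing	None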

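Summary

By combining DuckDB's analytical power with the high-performance storage of 3FS, Smallpond offers a new take on distributed data processing. Its simplicity, scalability, and open-source license make it a strong choice for modern data workflows. Whether you are preprocessing AI datasets or analyzing terabytes of logs, Smallpond provides a lightweight yet powerful solution. Dive in, experiment with the code, and join the community to help shape its future.

Key takeaways

Smallpond is an open-source distributed data processing framework that extends DuckDB's SQL capabilities using 3FS and Ray.
It currently supports only Linux distributions and requires Python 3.8 to 3.12.
Smallpond is well suited to AI preprocessing, financial analytics, and big data workloads, but it requires careful cluster management.
It is a cost-effective alternative to Apache Spark, with lower overhead and easier deployment.
Despite its strengths, infrastructure concerns such as cluster setup, along with the security concerns around DeepSeek's models, need to be taken into account.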
