# Introducing chunklet-py: The Smart Text Chunking Library You Didn’t Know You Needed

Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?

Yeah, I’ve been there too. That’s exactly why I built chunklet-py — a Python library that actually understands text structure.

This post hits the highlights — visit the full documentation for everything else, including:

  • Custom sentence splitters for specialized languages
  • Custom document processors for unusual file formats
  • Custom tokenizers to match your LLM
  • CLI flags for batch processing, parallel jobs, error handling, timeouts
  • Advanced features like overlap, offset, strict mode, docstring modes

The Problem with Dumb Splitting

Here’s what usually happens:

# The naive approach
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

This works… until it doesn’t:

  • Sentences cut mid-way (“The model got 75%” → “75%” becomes meaningless)
  • No context between chunks
  • Broken code if you’re chunking source files
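You can see the breakage with a quick self-contained demo (plain Python, no library needed):

```python
# Fixed-size slicing with no regard for sentence boundaries.
text = "The model got 75% accuracy on the test set. Training took three hours."
size = 30
chunks = [text[i:i + size] for i in range(0, len(text), size)]

print(chunks[0])  # "The model got 75% accuracy on " -- cut mid-sentence
print(chunks[1])  # "the test set. Training took th" -- cut mid-word
```

The second chunk starts with a dangling clause and ends inside a word, which is exactly the context loss described above.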

Solution: chunklet-py

A smart text and code chunking library that respects natural boundaries.

Features

50+ languages supported — Auto-detects language and applies the right splitting rules. No more treating German the same as English.

Multiple constraint types — Mix and match:

  • max_sentences — group by sentences
  • max_tokens — respect LLM context limits
  • max_section_breaks — keep Markdown sections together (headings ##, horizontal rules ---, and similar structural tags)
  • max_lines — for code chunking
  • max_functions — keep functions together

Multiple file formats — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.

Rich metadata — Every chunk comes with source references, character spans, and structural info.

Composable constraints — Mix and match limits to get exactly the chunks you need.

Pluggable architecture — Swap in custom tokenizers, sentence splitters, or document processors.

What’s New in v2.2.0

  • API Unification — Methods renamed to chunk_text, chunk_file, chunk_texts, chunk_files for consistency
  • Visualizer redesign — Fullscreen mode, 3-row layout, smoother hovers
  • More code languages — ColdFusion, VB.NET, PHP 8 attributes, Pascal support
  • Ruff — Switched to Ruff for faster linting

Check the What’s New page for full details.

Installation

pip install chunklet-py

For document support:

pip install chunklet-py[structured-document]

For code:

pip install chunklet-py[code]

For visualization:

pip install chunklet-py[visualization]

Code Examples

Core Imports

from chunklet import DocumentChunker   # For PDFs, DOCX, and general text
from chunklet import CodeChunker       # For source code
from chunklet import SentenceSplitter  # For just sentences
from chunklet import visualizer        # Web-based visualizer

DocumentChunker API

Four methods cover most use cases:

| Method | Input | Return Type |
| --- | --- | --- |
| chunk_text(text) | str | List[Chunk] |
| chunk_file(path) | Path or str | List[Chunk] |
| chunk_texts(list) | List[str] | Generator[Chunk] |
| chunk_files(list) | List[Path] | Generator[Chunk] |
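Note that the two batch methods return generators, so chunks stream out lazily instead of piling up in memory. A toy sketch of the idea (my own illustration, not chunklet's internals; iter_chunks is a made-up name):

```python
# Hypothetical illustration of why batch APIs yield lazily: each chunk
# is produced on demand, so a huge corpus never materializes as one
# giant list in memory.
def iter_chunks(texts, size=20):
    for doc_id, text in enumerate(texts):
        for i in range(0, len(text), size):
            yield {"doc": doc_id, "content": text[i:i + size]}

gen = iter_chunks(["a" * 50, "b" * 10])
first = next(gen)  # only the first chunk has been computed so far
```

This is why you can pipe thousands of files through chunk_files without worrying about RAM.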

DocumentChunker Example

chunker = DocumentChunker()

# Feel free to mix and match these
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after X sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0               # Skip the first N sentences
)
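The overlap_percent idea is easiest to see in miniature. Here's a rough sketch of sentence-level overlap (my own illustration of the concept, not chunklet's actual algorithm):

```python
# Toy overlap: each new chunk repeats the tail of the previous one,
# so neighbouring chunks share context.
sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
max_sentences, overlap = 3, 1  # carry 1 sentence into the next chunk

chunks, start = [], 0
while start < len(sentences):
    chunks.append(sentences[start:start + max_sentences])
    if start + max_sentences >= len(sentences):
        break
    start += max_sentences - overlap

# chunks -> [["S1.", "S2.", "S3."], ["S3.", "S4.", "S5."]]
```

"S3." appears in both chunks, which is the "memory" that keeps a retrieved chunk from starting cold.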

CodeChunker Example

chunker = CodeChunker()

chunks = chunker.chunk_text(
    code,
    max_lines=50,          # Height limit
    max_tokens=512,        # Width limit
    max_functions=1,       # One function per chunk
    strict=True            # True: Crash on big blocks; False: Slice anyway
)

SentenceSplitter (Just Sentences)

from chunklet import SentenceSplitter

splitter = SentenceSplitter()
sentences = splitter.split_text(text, lang="en")

Handles tricky cases like “Dr.” or “U.S.A.” without breaking them up.
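To see why abbreviation handling matters, here's what a naive regex split does to that kind of input (plain Python, no chunklet):

```python
import re

text = "Dr. Smith lives in the U.S.A. He works downtown."
# Naive rule: split at any whitespace that follows a period.
naive = re.split(r"(?<=\.)\s+", text)
# "Dr." and the final "." of "U.S.A." both trigger false sentence breaks.
print(naive)  # ['Dr.', 'Smith lives in the U.S.A.', 'He works downtown.']
```

A proper splitter needs language-aware rules to know that "Dr." does not end a sentence.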

Output Object

Chunkers return Chunk objects (Box instances), so you use dot notation:

for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata

Visualizer (Interactive Web UI)

Launch a web interface to experiment with chunking parameters:

chunklet visualize

Or programmatically:

from chunklet import visualizer

v = visualizer.Visualizer(host="127.0.0.1", port=8000)
v.serve()  # Opens in your browser

CLI Examples

Prefer the terminal? chunklet-py ships with a full CLI:

# Basic text chunking
chunklet chunk "Your text here." --max-tokens 500

# Chunk a file
chunklet chunk --source document.pdf --max-tokens 500 --metadata

# Split text into sentences
chunklet split "Your text here." --lang en

# Split a file into sentences
chunklet split --source my_file.txt --destination sentences.txt

# Start the interactive visualizer
chunklet visualize

# Code chunking
chunklet chunk --code --source my_script.py --max-functions 1

# Batch processing a directory
chunklet chunk --doc --source ./my_docs --destination ./chunks --n-jobs 4

# With error handling
chunklet chunk --doc --source ./my_docs --on-errors skip

How It Compares

| Library | The Deal | Focus |
| --- | --- | --- |
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters. Good for prototyping. | Full Stack |
| Chonkie | Chunking + embeddings + vector DB all-in-one. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. | Text |

Wrap Up

Chunklet-py is production-ready. It’s lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.

Check it out: github.com/speedyk-005/chunklet-py

Questions? Drop them in the comments!
