# Introducing chunklet-py: The Smart Text Chunking Library You Didn’t Know You Needed

Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?

Yeah, I’ve been there too. That’s exactly why I built chunklet-py — a Python library that actually understands text structure.

This post hits the highlights — visit the full documentation for everything else, including:

  • Custom sentence splitters for specialized languages
  • Custom document processors for unusual file formats
  • Custom tokenizers to match your LLM
  • CLI flags for batch processing, parallel jobs, error handling, timeouts
  • Advanced features like overlap, offset, strict mode, docstring modes

The Problem with Dumb Splitting

Here’s what usually happens:

# The naive approach
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

This works… until it doesn’t:

  • Sentences cut mid-way (“The model got 75%” → “75%” becomes meaningless)
  • No context between chunks
  • Broken code if you’re chunking source files
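You can see the breakage with a quick self-contained demo (plain Python, no library needed):

```python
# Fixed-size slicing with no regard for sentence boundaries.
text = "The model got 75% accuracy on the test set. Training took three hours."
size = 30
chunks = [text[i:i + size] for i in range(0, len(text), size)]

print(chunks[0])  # "The model got 75% accuracy on " -- cut mid-sentence
print(chunks[1])  # "the test set. Training took th" -- cut mid-word
```

The second chunk starts with a dangling clause and ends inside a word, which is exactly the context loss described above.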

Solution: chunklet-py

A smart text and code chunking library that respects natural boundaries.

Features

50+ languages supported — Auto-detects language and applies the right splitting rules. No more treating German the same as English.

Multiple constraint types — Mix and match:

  • max_sentences — group by sentences
  • max_tokens — respect LLM context limits
  • max_section_breaks — keep Markdown sections together (headings ##, horizontal rules ---, and similar structural tags)
  • max_lines — for code chunking
  • max_functions — keep functions together

Multiple file formats — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.

Rich metadata — Every chunk comes with source references, character spans, and structural info.

Composable constraints — Mix and match limits to get exactly the chunks you need.

Pluggable architecture — Swap in custom tokenizers, sentence splitters, or document processors.

What’s New in v2.2.0

  • API Unification — Methods renamed to chunk_text, chunk_file, chunk_texts, chunk_files for consistency
  • Visualizer redesign — Fullscreen mode, 3-row layout, smoother hovers
  • More code languages — ColdFusion, VB.NET, PHP 8 attributes, Pascal support
  • Ruff — Switched to Ruff for faster linting

Check the What’s New page for full details.

Installation

pip install chunklet-py

For document support:

pip install chunklet-py[structured-document]

For code:

pip install chunklet-py[code]

For visualization:

pip install chunklet-py[visualization]

Code Examples

Core Imports

from chunklet import DocumentChunker   # For PDFs, DOCX, and general text
from chunklet import CodeChunker       # For source code
from chunklet import SentenceSplitter  # For just sentences
from chunklet import visualizer        # Web-based visualizer

DocumentChunker API

Four methods cover most use cases:

| Method | Input | Return Type |
| --- | --- | --- |
| chunk_text(text) | str | List[Chunk] |
| chunk_file(path) | Path or str | List[Chunk] |
| chunk_texts(list) | List[str] | Generator[Chunk] |
| chunk_files(list) | List[Path] | Generator[Chunk] |
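Note that the two batch methods return generators, so chunks stream out lazily instead of piling up in memory. A toy sketch of the idea (my own illustration, not chunklet's internals; iter_chunks is a made-up name):

```python
# Hypothetical illustration of why batch APIs yield lazily: each chunk
# is produced on demand, so a huge corpus never materializes as one
# giant list in memory.
def iter_chunks(texts, size=20):
    for doc_id, text in enumerate(texts):
        for i in range(0, len(text), size):
            yield {"doc": doc_id, "content": text[i:i + size]}

gen = iter_chunks(["a" * 50, "b" * 10])
first = next(gen)  # only the first chunk has been computed so far
```

This is why you can pipe thousands of files through chunk_files without worrying about RAM.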

DocumentChunker Example

chunker = DocumentChunker()

# Feel free to mix and match these
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after X sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0               # Skip the first N sentences
)
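The overlap_percent idea is easiest to see in miniature. Here's a rough sketch of sentence-level overlap (my own illustration of the concept, not chunklet's actual algorithm):

```python
# Toy overlap: each new chunk repeats the tail of the previous one,
# so neighbouring chunks share context.
sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
max_sentences, overlap = 3, 1  # carry 1 sentence into the next chunk

chunks, start = [], 0
while start < len(sentences):
    chunks.append(sentences[start:start + max_sentences])
    if start + max_sentences >= len(sentences):
        break
    start += max_sentences - overlap

# chunks -> [["S1.", "S2.", "S3."], ["S3.", "S4.", "S5."]]
```

"S3." appears in both chunks, which is the "memory" that keeps a retrieved chunk from starting cold.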

CodeChunker Example

chunker = CodeChunker()

chunks = chunker.chunk_text(
    code,
    max_lines=50,          # Height limit
    max_tokens=512,        # Width limit
    max_functions=1,       # One function per chunk
    strict=True            # True: Crash on big blocks; False: Slice anyway
)

SentenceSplitter (Just Sentences)

from chunklet import SentenceSplitter

splitter = SentenceSplitter()
sentences = splitter.split_text(text, lang="en")

Handles tricky cases like “Dr.” or “U.S.A.” without breaking them up.
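To see why abbreviation handling matters, here's what a naive regex split does to that kind of input (plain Python, no chunklet):

```python
import re

text = "Dr. Smith lives in the U.S.A. He works downtown."
# Naive rule: split at any whitespace that follows a period.
naive = re.split(r"(?<=\.)\s+", text)
# "Dr." and the final "." of "U.S.A." both trigger false sentence breaks.
print(naive)  # ['Dr.', 'Smith lives in the U.S.A.', 'He works downtown.']
```

A proper splitter needs language-aware rules to know that "Dr." does not end a sentence.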

Output Object

Chunkers return Chunk objects (Box instances), so you use dot notation:

for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata

Visualizer (Interactive Web UI)

Launch a web interface to experiment with chunking parameters:

chunklet visualize

Or programmatically:

from chunklet import visualizer

v = visualizer.Visualizer(host="127.0.0.1", port=8000)
v.serve()  # Opens in your browser

CLI Examples

Prefer the terminal? chunklet-py ships with a full CLI:

# Basic text chunking
chunklet chunk "Your text here." --max-tokens 500

# Chunk a file
chunklet chunk --source document.pdf --max-tokens 500 --metadata

# Split text into sentences
chunklet split "Your text here." --lang en

# Split a file into sentences
chunklet split --source my_file.txt --destination sentences.txt

# Start the interactive visualizer
chunklet visualize

# Code chunking
chunklet chunk --code --source my_script.py --max-functions 1

# Batch processing a directory
chunklet chunk --doc --source ./my_docs --destination ./chunks --n-jobs 4

# With error handling
chunklet chunk --doc --source ./my_docs --on-errors skip

How It Compares

| Library | The Deal | Focus |
| --- | --- | --- |
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters. Good for prototyping. | Full Stack |
| Chonkie | Chunking + embeddings + vector DB all-in-one. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. | Text |

Wrap Up

Chunklet-py is production-ready. It’s lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.

Check it out: github.com/speedyk-005/chunklet-py

Questions? Drop them in the comments!
