Extract Plain Text from Medium Posts for RAG and Search Indexes

May 30, 2026

Chunk clean article content for embeddings, summarization, and full-text search—skip nav, clap bars, and scripts.

HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.

Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB.

Pipeline

1. Discover ids via user feed or search.
2. GET /article/{id}/content → plain text.
3. Optional: GET /article/{id} for title, tags, author metadata.
4. Chunk → embed → upsert vector store.
5. Query in your chat UI or internal search.

Ingest script

const API = 'https://api.zenndra.com';
const headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };

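// Fetch the plain-text body and the metadata for one article in parallel.
export async function fetchArticleText(articleId) {
  const [contentRes, metaRes] = await Promise.all([
    fetch(`${API}/article/${articleId}/content`, { headers }),
    fetch(`${API}/article/${articleId}`, { headers }),
  ]);

  // fetch() only rejects on network errors, so check the HTTP status explicitly.
  for (const res of [contentRes, metaRes]) {
    if (!res.ok) throw new Error(`Request failed: ${res.status} ${res.url}`);
  }

  const { content } = await contentRes.json();
  const meta = await metaRes.json();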

  return {
    id: articleId,
    title: meta.title,
    tags: meta.tags,
    text: content,
  };
}

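// Split text into overlapping windows. size and overlap are counted in words, not tokens.
export function chunkText(text, { size = 800, overlap = 100 } = {}) {
  // A step of zero or less would loop forever.
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '));
    // Stop once a window reaches the end, so no redundant tail chunk is emitted.
    if (i + size >= words.length) break;
  }
  return chunks;
}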

Pass the output of chunkText to OpenAI embeddings, Ollama, or whatever model your host provides. Swap the vector client as needed; the ingest shape stays the same.
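One way to keep the ingest shape independent of the vector client is to pass the embedding and upsert calls in as functions. The sketch below is a minimal illustration under that assumption: `ingestChunks`, `embed`, `upsert`, and `batchSize` are hypothetical names, not part of any particular SDK, and the record shape (`id`, `vector`, `text`) is only an example.

```javascript
// Sketch: embed one article's chunks in batches and upsert them.
// `embed(texts)` should resolve to one vector per input text, and
// `upsert(records)` should write records to your store. Both are
// hypothetical adapters you write around your own model and vector client.
async function ingestChunks(articleId, chunks, { embed, upsert, batchSize = 64 }) {
  let written = 0;
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error(`embed returned ${vectors.length} vectors for ${batch.length} chunks`);
    }
    await upsert(
      batch.map((text, j) => ({
        id: `${articleId}:${start + j}`, // stable id: article id plus chunk position
        vector: vectors[j],
        text,
      })),
    );
    written += batch.length;
  }
  return written; // number of chunks written
}
```

With the functions from the ingest script, a call might look like `await ingestChunks(id, chunkText(text), { embed, upsert })`, where `text` comes from `fetchArticleText(id)`.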

Chunking tips

Include title + tags in the embedding preamble for better retrieval.
Store article_id and chunk_index in metadata for citations.
If posts are edited rarely, store a content hash per article and skip re-ingesting articles whose hash is unchanged.
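The tips above can be sketched as a small record builder. This is an illustration, not a fixed schema: `buildChunkRecords`, `needsReingest`, and `contentHash` are hypothetical helpers, `tags` is assumed to be an array of strings, and the hash is a dependency-free FNV-1a-style stand-in that you would likely replace with SHA-256 in production.

```javascript
// Hypothetical helper: small non-cryptographic hash (FNV-1a over UTF-16 code units).
// It only keeps the sketch dependency-free; prefer SHA-256 for real deduplication.
function contentHash(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

// Build one record per chunk. `article` has the { id, title, tags, text } shape
// returned by fetchArticleText; `chunks` is the output of chunkText.
function buildChunkRecords(article, chunks) {
  const tags = Array.isArray(article.tags) ? article.tags.join(', ') : '';
  const preamble = `Title: ${article.title}\nTags: ${tags}\n\n`;
  const hash = contentHash(article.text);
  return chunks.map((chunk, chunkIndex) => ({
    id: `${article.id}:${chunkIndex}`,
    input: preamble + chunk, // text sent to the embedding model
    metadata: {
      article_id: article.id, // lets an answer cite the source article
      chunk_index: chunkIndex, // lets an answer cite the position within it
      content_hash: hash,
    },
  }));
}

// True when the stored hash differs from the freshly fetched text.
function needsReingest(article, storedHash) {
  return storedHash !== contentHash(article.text);
}
```

Embedding `input` rather than the bare chunk puts the title and tags into every vector, while the metadata keeps the citation fields out of the embedded text.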

Compliance (non-optional)

Respect Medium’s Terms of Service and author rights.
Many teams only index their own posts or licensed partners.
Do not expose paywalled or member-only content through public bots without permission.

For human-readable syndication, see "embed articles"; that use case has a different threat model than LLM training.

Keywords

medium plain text api, medium rag pipeline, medium embeddings, medium article content extraction, llm medium.
