Extract Plain Text from Medium Posts for RAG and Search Indexes

Chunk clean article content for embeddings, summarization, and full-text search—skip nav, clap bars, and scripts.

HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.

Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB.

Pipeline

  1. Discover ids via user feed or search.
  2. GET /article/{id}/content → plain text.
  3. Optional: GET /article/{id} for title, tags, author metadata.
  4. Chunk → embed → upsert vector store.
  5. Query in your chat UI or internal search.

Ingest script

const API = 'https://api.zenndra.com';
const headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };

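// Fetch the plain-text body and the metadata for one article in parallel.
export async function fetchArticleText(articleId) {
  const [contentRes, metaRes] = await Promise.all([
    fetch(`${API}/article/${articleId}/content`, { headers }),
    fetch(`${API}/article/${articleId}`, { headers }),
  ]);

  // Fail fast on non-2xx responses instead of parsing an error body.
  if (!contentRes.ok || !metaRes.ok) {
    throw new Error(
      `Fetch failed for ${articleId}: content ${contentRes.status}, meta ${metaRes.status}`
    );
  }

  const { content } = await contentRes.json();
  const meta = await metaRes.json();

  return {
    id: articleId,
    title: meta.title,
    tags: meta.tags,
    text: content,
  };
}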

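// Split text into overlapping word windows; size and overlap are word counts.
export function chunkText(text, { size = 800, overlap = 100 } = {}) {
  // A step of zero or less would never advance the loop.
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const words = text.trim().split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap;
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(' '));
    if (i + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}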

Wire chunkText to OpenAI embeddings, Ollama, or your host’s model. Swap the vector client as needed, but keep the ingest shape.
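
One way to keep the ingest shape stable while swapping providers is to pass the embedding model and the vector store in as plain functions. The sketch below assumes exactly that: EmbedFn, UpsertFn, ChunkRecord, and ingestChunks are hypothetical names invented for illustration, not part of any SDK, so replace them with your provider's real client calls.

```typescript
// Hypothetical shapes for illustration; neither type comes from a real SDK.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface ChunkRecord {
  id: string;
  vector: number[];
  text: string;
  metadata: { article_id: string; chunk_index: number };
}

type UpsertFn = (records: ChunkRecord[]) => Promise<void>;

// Embed chunks in batches and hand the resulting records to the vector store.
// Returns the number of records written.
async function ingestChunks(
  articleId: string,
  chunks: string[],
  embed: EmbedFn,
  upsert: UpsertFn,
  batchSize = 64,
): Promise<number> {
  let written = 0;
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error(
        `embed returned ${vectors.length} vectors for ${batch.length} chunks`,
      );
    }
    const records: ChunkRecord[] = batch.map((text, j) => ({
      id: `${articleId}:${start + j}`, // stable id, so a re-ingest overwrites
      vector: vectors[j],
      text,
      metadata: { article_id: articleId, chunk_index: start + j },
    }));
    await upsert(records);
    written += records.length;
  }
  return written;
}
```

Because the record id is derived from the article id and chunk index, re-running the ingest for the same article overwrites its earlier chunks rather than duplicating them, provided your store treats upserts as keyed writes.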

Chunking tips

  • Include title + tags in the embedding preamble for better retrieval.
  • Store article_id and chunk_index in metadata for citations.
  • Deduplicate on re-ingest with a content hash: if posts are rarely edited, an unchanged hash means you can skip re-embedding.
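
The three tips above can be sketched together as follows. This is a minimal illustration, assuming Node's built-in crypto module and that tags arrives as an array of strings. ArticleDoc, PreparedChunk, prepareChunks, and needsReingest are hypothetical names, and the "Title: / Tags:" preamble layout is one possible choice rather than a required format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes mirroring what fetchArticleText returns.
interface ArticleDoc {
  id: string;
  title: string;
  tags: string[]; // assumption: tags is an array of strings
  text: string;
}

interface PreparedChunk {
  embeddingInput: string;
  metadata: { article_id: string; chunk_index: number; content_hash: string };
}

// SHA-256 over the full article text; the value changes whenever the post is edited.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Prefix each chunk with title and tags, and attach citation metadata.
function prepareChunks(doc: ArticleDoc, chunks: string[]): PreparedChunk[] {
  const preamble = `Title: ${doc.title}\nTags: ${doc.tags.join(", ")}`;
  const hash = contentHash(doc.text);
  return chunks.map((chunk, index) => ({
    embeddingInput: `${preamble}\n\n${chunk}`,
    metadata: { article_id: doc.id, chunk_index: index, content_hash: hash },
  }));
}

// True when the article is new or its text differs from what was last stored.
function needsReingest(storedHash: string | undefined, freshText: string): boolean {
  return storedHash !== contentHash(freshText);
}
```

Before embedding, look up the stored content_hash for the article and call needsReingest; when it returns false, the existing vectors are still current and the embedding call can be skipped.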

Compliance (non-optional)

  • Respect Medium’s Terms of Service and author rights.
  • Many teams index only their own posts or those of licensed partners.
  • Do not expose paywalled or member-only content through public bots without permission.

For human-readable syndication, see the guide on embedding articles; it carries a different threat model than LLM training.

Keywords

medium plain text api, medium rag pipeline, medium embeddings, medium article content extraction, llm medium.
