I used Node.js to OCR “Meme Monday” threads

August 6, 2023

I love programming-related memes and jokes, and I’m sure you do as well. @ben’s weekly “Meme Monday” posts are an amazing source of humor that I always look forward to.

What we’re building

We will build a simple project that outputs a markdown file with all the memes in a Meme Monday thread. Each meme will be output alongside the text detected in it by OCR (optical character recognition).

OCR detection will be done with Tesseract.

Setup

Spin up a Node.js Repl on Replit.

Installing Tesseract

If you run tesseract in the shell, you will notice the command does not exist since it isn’t installed.

In the top-right corner of the file tree, click the three dots and select “Show hidden files”.

Navigate to the replit.nix configuration file and add pkgs.tesseract4 to the package dependency list.

{ pkgs }: {
  deps = [
    pkgs.tesseract4
    pkgs.nodejs-18_x
    pkgs.nodePackages.typescript-language-server
    pkgs.yarn
    pkgs.replitPackages.jest
  ];
}

Run tesseract in the shell. It should show some options now.
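
If you would rather check this from Node than from the shell, here is a minimal sketch. The helper name isCommandAvailable is made up for illustration, and it assumes the binary accepts a --version flag and exits with status 0, which is the case for tesseract as far as I know.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: returns true when `command --version` can be started
// and exits with status 0.
function isCommandAvailable(command) {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  // result.error is set when the binary cannot be found (ENOENT)
  return !result.error && result.status === 0;
}

if (!isCommandAvailable("tesseract")) {
  console.warn("tesseract was not found on the PATH; check replit.nix");
}
```

Failing early like this is easier to debug than an OCR error halfway through a thread.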

Dependencies

Install node-tesseract-ocr and node-fetch. Pin node-fetch to version 2, since version 3 is ESM-only and cannot be loaded with require.

npm install node-tesseract-ocr node-fetch@2

We’re all set, let’s get coding.

Building the thing

Navigate to index.js.

Require/import the following dependencies at the top of the file.

const tesseract = require("node-tesseract-ocr");
const fetch = require("node-fetch");
const fs = require("fs");

Fetching article comments

Create an asynchronous function fetchArticleComments that takes a slug argument.

const fetchArticleComments = async (slug) => {

}

Let’s hit the dev.to API and get an article by its slug. If the response fails, let’s throw an error.

const articleRes = await fetch("https://dev.to/api/articles/" + slug)

if (!articleRes.ok) throw new Error("Failed to fetch article")

const article = await articleRes.json();

Derive the article’s ID and fetch the article comments with it. Return the comments if the response is successful.

const fetchArticleComments = async (slug) => {
  const articleRes = await fetch("https://dev.to/api/articles/" + slug)

  if (!articleRes.ok) throw new Error("Failed to fetch article")

  const article = await articleRes.json();

  const commentsRes = await fetch("https://dev.to/api/comments?a_id=" + article.id);

  if (!commentsRes.ok) throw new Error("Failed to fetch comments")

  return await commentsRes.json();
}
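
One thing to watch for: as far as I can tell, the comments endpoint returns threaded comments, with replies nested under each comment in a children array. Treat that field name as an assumption and check it against a real response. If you want memes posted in replies too, a small sketch like this flattens the tree first (flattenComments and the sample data are hypothetical):

```javascript
// Recursively collect each comment and all of its replies into one flat array.
// Assumes replies are stored in a `children` array on each comment.
function flattenComments(comments) {
  const flat = [];
  for (const comment of comments) {
    flat.push(comment);
    if (Array.isArray(comment.children) && comment.children.length) {
      flat.push(...flattenComments(comment.children));
    }
  }
  return flat;
}

// Made-up sample data shaped like the assumed API response
const sampleComments = [
  { body_html: "<p>top</p>", children: [{ body_html: "<p>reply</p>", children: [] }] },
  { body_html: "<p>other</p>", children: [] },
];

console.log(flattenComments(sampleComments).length); // 3
```

You could then loop over flattenComments(comments) instead of comments in the next step.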

Extracting URLs

Create and call an asynchronous main function at the end of the file.

async function main() {

}

main();

Within the main function, fetch the comments of a dev.to article and create a urls array in which we’ll store the extracted URLs.

const comments = await fetchArticleComments("ben/meme-monday-59gk");

// Embedded Image URLs found in the comments
const urls = [];

Create a for loop and iterate through the comments. For each comment, let’s use a regular expression to match image URLs in src attributes and push them to urls.

for (const comment of comments) {
  // Get embedded images from the comment
  const images = comment.body_html.match(/src="[^"]+\.(jpg|png|webp|jpeg)"/g);

  // Extract the image URLs from the embedded images
  if (images?.length) {
    const imageUrls = images.map(str => str.replace(/src="/, "").replace(/"/, ""));

    urls.push(...imageUrls);
  }
}
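
To see what the regular expression does and does not pick up, here is the same idea as a standalone sketch you can run without hitting the API. The function name extractImageUrls and the sample HTML are made up for illustration.

```javascript
// Return every image URL found in a src="..." attribute of an HTML string.
function extractImageUrls(html) {
  const matches = html.match(/src="[^"]+\.(jpg|png|webp|jpeg)"/g) ?? [];
  // Strip the leading src=" and the trailing quote
  return matches.map((str) => str.replace(/^src="/, "").replace(/"$/, ""));
}

const sampleHtml =
  '<p>lol</p><img src="https://example.com/a.png" alt="x"><img src="https://example.com/b.gif">';

console.log(extractImageUrls(sampleHtml)); // [ 'https://example.com/a.png' ]
```

Note that GIFs are skipped because .gif is not in the extension list, and a URL with a query string after the extension would not match either.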

OCR Text Extraction

Create an array variable images for storing URLs and the extracted OCR text.

const images = [];

Create a for loop to iterate through urls. Use fetch and res.ok to ensure that the image exists.

for (const i in urls) {
  const url = urls[i];

  // Make sure the image exists
  const res = await fetch(url);

  if (res.ok) {

  }
}

Within the if (res.ok) statement, use await tesseract.recognize(url) to get the text from the respective URL and push it to images.

if (res.ok) {
  const text = await tesseract.recognize(url);

  images.push({
    url,
    text
  });

  console.log("Finished Processing URL", Number(i) + 1, "of", urls.length);
}
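
The res.ok check catches missing images, but the OCR call itself can still reject, and an uncaught error would stop the whole run. As a rough sketch of one way to handle that, the helper below skips images that fail. The name ocrAll is hypothetical, and recognize is passed in as a parameter so the helper can be tried without Tesseract installed; in this project it would be (url) => tesseract.recognize(url).

```javascript
// Run OCR over a list of URLs, skipping any image whose recognition throws.
// `recognize` is any async function that maps a URL to its extracted text.
async function ocrAll(urls, recognize) {
  const images = [];
  for (const [i, url] of urls.entries()) {
    try {
      const text = await recognize(url);
      images.push({ url, text });
    } catch (err) {
      console.warn("Skipping", url, "-", err.message);
    }
    console.log("Finished processing URL", i + 1, "of", urls.length);
  }
  return images;
}
```

Inside main you could then write const images = await ocrAll(urls, (url) => tesseract.recognize(url)); in place of the loop.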

Finally, at the end of the main function, use fs.writeFileSync to write the results to a file named memes.md.

fs.writeFileSync(
  "memes.md",
  images
    .map(({ url, text }) => {
      // Sanitize the text to be an image alt by removing newlines and special markdown tokens
      const sanitizedText = text.replace(/\[|\]|"/g, c => "\\" + c).replaceAll("\n", "");

      // Return the text followed by a markdown-formatted image
      return `${text}\n\n![${sanitizedText}](${url})`
    })
    .join("\n\n")
);
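
Here is the formatting step on its own, as a small sketch with made-up input, so you can see what the escaping does. The function name toMarkdownImage is hypothetical.

```javascript
// Escape [, ] and " with a backslash and drop newlines so the text is safe
// to use as the alt text in ![alt](url).
function toMarkdownImage({ url, text }) {
  const alt = text.replace(/\[|\]|"/g, (c) => "\\" + c).replaceAll("\n", "");
  return `${text}\n\n![${alt}](${url})`;
}

console.log(
  toMarkdownImage({
    url: "https://example.com/a.png",
    text: 'it "works"\n[on my machine]',
  })
);
```

The raw text is printed first, untouched, and only the alt text is escaped. Because newlines are removed rather than replaced with spaces, words on adjacent lines end up joined in the alt text.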

Run the Repl. You should see progress logged as each image gets processed, and at the end you will find a memes.md file full of the memes along with their OCR-extracted text.

If you use the Markdown tool, you can preview the output markdown file.

And that’s it! Thanks for reading

Demo & Source Code

Say hi 👋
