A flexible Node.js crawler library: x-crawl


x-crawl

x-crawl is a flexible Node.js crawler library. It can crawl pages, interfaces, and files, and supports polling crawls. It is flexible, simple to use, and friendly to JS/TS developers.

If you like x-crawl, give the repository a star to support it, both as recognition of the project and as encouragement for its developer.

Features

  • 🔥 Async/Sync – Just change the mode property to toggle between async and sync crawling (see the configuration sketch after this list).
  • ⚙️ Multiple functions – Crawl pages, interfaces, and files, and poll crawls. Single and batch crawling are both supported.
  • 🖋️ Flexible usage – One function adapts to multiple crawling configurations and result shapes, so calls can be written very flexibly.
  • ⏱️ Interval crawling – No interval, a fixed interval, or a random interval; make effective use of, or avoid, high-concurrency crawling.
  • 🔄 Retry on failure – Failed retries can be set globally for all requests, for a single crawl call, or for an individual request.
  • 🚀 Priority queue – Crawl requests in order of each request's priority.
  • ☁️ Crawl SPA – Batch-crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. SSR, Server-Side Rendering).
  • ⚒️ Controlling pages – The headless browser can submit forms, send keystrokes, trigger events, take page screenshots, and more.
  • 🧾 Capture record – Capture and record crawl results, highlighting them on the console.
  • 🦾 TypeScript – Ships with its own types, implemented completely through generics.
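
The sketch below shows how these options combine on one crawler instance. The mode property is named above, and intervalTime/maxRetry appear in the example later in this post; the per-request priority field is an assumption based on the priority-queue feature and may differ between versions.

// A minimal sketch, assuming the options described above; not a complete program
import xCrawl from 'x-crawl'

const crawler = xCrawl({
  mode: 'async', // toggle 'async'/'sync' crawling
  intervalTime: { max: 3000, min: 2000 }, // wait a random 2-3s between requests
  maxRetry: 2 // retry each failed request up to 2 times
})

// Single targets and arrays are both accepted; the priority field (assumed name)
// should make higher-priority requests crawl first
crawler.crawlPage('https://example.com')
crawler.crawlPage([
  { url: 'https://example.com/a', priority: 9 },
  { url: 'https://example.com/b', priority: 1 }
])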

Example

As an example, here is a crawler that automatically takes some pictures of Airbnb Hawaii experiences and Plus listings every day:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 3000, min: 2000 } })

// 3. Set the crawling task
/*
  Call the startPolling API to start the polling function;
  the callback will be executed once a day
*/
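// Note (usage inferred from the callback signature below): calling stopPolling
// inside the callback ends the polling, e.g. if (count >= 7) stopPolling()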
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const res = await myXCrawl.crawlPage([
    'https://zh.airbnb.com/s/hawaii/experiences',
    'https://zh.airbnb.com/s/hawaii/plus_homes'
  ])

  // Collect the image URLs
  const imgUrls = []
  const elSelectorMap = ['.c14whb16', '.a1stauiv']
  for (const item of res) {
    const { id } = item
    const { page } = item.data

    // Get the URLs of the page's carousel image elements
    const boxHandle = await page.$(elSelectorMap[id - 1])
    const urls = await boxHandle.$$eval('picture img', (imgEls) => {
      return imgEls.map((item) => item.src)
    })
    imgUrls.push(...urls)

    // Close the page when done
    await page.close()
  }

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({
    requestConfigs: imgUrls,
    fileConfig: { storeDir: './upload' }
  })
})

Running result:

Note: Do not crawl arbitrarily; check a site's robots.txt before crawling. This example exists only to demonstrate how to use x-crawl.
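
For a rough pre-check (plain Node.js 18+ with the built-in fetch, not an x-crawl API), you could review a site's rules like this:

// Hypothetical robots.txt pre-check; example.com stands in for the real target
const robotsUrl = new URL('/robots.txt', 'https://example.com').href
const robotsTxt = await (await fetch(robotsUrl)).text() // top-level await (ESM)
console.log(robotsTxt) // inspect the Disallow rules before crawling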

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl
