x-crawl: a flexible Node.js crawler library


x-crawl is a flexible Node.js crawler library. It can crawl pages in batches, make network requests in batches, download file resources in batches, poll and crawl on a schedule, and more. It supports both asynchronous and synchronous crawling modes. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.

If you find it useful, you can give the x-crawl repository a Star to support it; your Star will be the motivation for further updates.

  • 🔥 Asynchronous/Synchronous – Supports batch crawling in both asynchronous and synchronous modes.
  • ⚙️ Multiple functions – Batch crawling of pages, batch network requests, batch downloading of file resources, polling crawls, etc.
  • 🖋️ Flexible writing style – Multiple crawling configurations and multiple ways to get crawling results.
  • ⏱️ Interval crawling – No interval, fixed interval, or random interval; use or avoid high-concurrency crawling as needed.
  • ☁️ Crawl SPA – Batch crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR", Server Side Rendering).
  • ⚒️ Controlling pages – Headless browsers can submit forms, send keystrokes, trigger event actions, generate screenshots of pages, etc.
  • 🧾 Capture record – Captures and records crawling results, with highlighted reminders.
  • 🦾 TypeScript – Ships its own types, with complete typing implemented through generics.
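The interval-crawling feature above accepts either a fixed delay or a `{ min, max }` range. The following is a minimal sketch of how such an option can work, not x-crawl's actual implementation; the helper names `resolveInterval`, `sleep`, and `crawlWithInterval` are hypothetical:

```javascript
// Sketch of an intervalTime-style option (hypothetical, not x-crawl internals):
// a number means a fixed delay in ms; an object { min, max } means a random
// delay drawn from that range.
function resolveInterval(intervalTime) {
  if (typeof intervalTime === 'number') return intervalTime
  const { min, max } = intervalTime
  return min + Math.floor(Math.random() * (max - min + 1))
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Crawl targets one by one, pausing between requests so the target
// site is not hit with a burst of concurrent traffic.
async function crawlWithInterval(targets, intervalTime, crawlOne) {
  const results = []
  for (const target of targets) {
    results.push(await crawlOne(target))
    await sleep(resolveInterval(intervalTime))
  }
  return results
}
```

A random interval like `{ max: 3000, min: 2000 }` (as in the example below) makes request timing less uniform, which is gentler on the crawled site than a fixed burst.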

Example

Timed crawling: take automatically capturing the cover images of Airbnb Plus listings every day as an example:

// 1.Import module ES/CJS
import xCrawl from 'x-crawl'

// 2.Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout in ms
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3.Set the crawling task
/*
  Call the startPolling API to start the polling function,
  and the callback function will be called once every day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { page } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Collect the image URLs to build the request configuration
  const plusBoxHandle = await page.$('.a1stauiv')
  const requestConfig = await plusBoxHandle!.$$eval('picture img', (imgEls) => {
    return imgEls.map((item) => item.src)
  })

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })

  // Close the page
  await page.close()
})
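The `{ d: 1 }` argument passed to `startPolling` expresses the polling interval in days (the x-crawl documentation also lists `h` for hours and `m` for minutes). A hypothetical helper, not part of x-crawl, showing how such a config maps to a timer interval in milliseconds:

```javascript
// Hypothetical helper (not x-crawl internals): convert a polling config
// with d (days), h (hours), and m (minutes) into a total in milliseconds.
function pollingConfigToMs({ d = 0, h = 0, m = 0 } = {}) {
  const totalMinutes = m + h * 60 + d * 24 * 60
  return totalMinutes * 60 * 1000
}

// { d: 1 } → 86,400,000 ms, i.e. the callback fires once per day.
```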

Running result: (screenshot omitted)

Note: do not crawl arbitrarily; check a site's robots.txt before crawling. This example only demonstrates how to use x-crawl.

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl
