x-crawl: a flexible Node.js crawler library


x-crawl is a flexible Node.js crawler library. It can crawl pages in batches, make network requests in batches, download file resources in batches, poll and crawl on a schedule, and more. It supports both asynchronous and synchronous crawling modes. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.

If you find it useful, you can give the x-crawl repository a Star to support it; your Star will be the motivation for further updates.

  • 🔥 Asynchronous/Synchronous – Supports batch crawling in both asynchronous and synchronous modes.
  • ⚙️ Multiple functions – Batch crawling of pages, batch network requests, batch downloading of file resources, polling crawls, etc.
  • 🖋️ Flexible writing style – Multiple crawling configurations and multiple ways to get crawling results.
  • ⏱️ Interval crawling – No interval, fixed interval, or random interval; use or avoid high-concurrency crawling as needed.
  • ☁️ Crawl SPA – Batch crawl SPAs (Single Page Applications) to generate pre-rendered content (i.e. "SSR", Server Side Rendering).
  • ⚒️ Controlling pages – Headless browsers can submit forms, send keystrokes, trigger event actions, generate screenshots of pages, etc.
  • 🧾 Capture record – Captures and records crawling results, with highlighted reminders.
  • 🦾 TypeScript – Ships its own types, with complete typing implemented through generics.
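The interval-crawling feature above accepts either a fixed delay or a `{ min, max }` range. The following is a minimal sketch of how such an option can work, not x-crawl's actual implementation; the helper names `resolveInterval`, `sleep`, and `crawlWithInterval` are hypothetical:

```javascript
// Sketch of an intervalTime-style option (hypothetical, not x-crawl internals):
// a number means a fixed delay in ms; an object { min, max } means a random
// delay drawn from that range.
function resolveInterval(intervalTime) {
  if (typeof intervalTime === 'number') return intervalTime
  const { min, max } = intervalTime
  return min + Math.floor(Math.random() * (max - min + 1))
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Crawl targets one by one, pausing between requests so the target
// site is not hit with a burst of concurrent traffic.
async function crawlWithInterval(targets, intervalTime, crawlOne) {
  const results = []
  for (const target of targets) {
    results.push(await crawlOne(target))
    await sleep(resolveInterval(intervalTime))
  }
  return results
}
```

A random interval like `{ max: 3000, min: 2000 }` (as in the example below) makes request timing less uniform, which is gentler on the crawled site than a fixed burst.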

Example

Timed crawling: take automatically capturing the cover images of Airbnb Plus listings every day as an example:

// 1.Import module ES/CJS
import xCrawl from 'x-crawl'

// 2.Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout in ms
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3.Set the crawling task
/*
  Call the startPolling API to start the polling function,
  and the callback function will be called once every day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { page } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Collect the image URLs to build the request configuration
  const plusBoxHandle = await page.$('.a1stauiv')
  const requestConfig = await plusBoxHandle!.$$eval('picture img', (imgEls) => {
    return imgEls.map((item) => item.src)
  })

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })

  // Close the page
  await page.close()
})
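The `{ d: 1 }` argument passed to `startPolling` expresses the polling interval in days (the x-crawl documentation also lists `h` for hours and `m` for minutes). A hypothetical helper, not part of x-crawl, showing how such a config maps to a timer interval in milliseconds:

```javascript
// Hypothetical helper (not x-crawl internals): convert a polling config
// with d (days), h (hours), and m (minutes) into a total in milliseconds.
function pollingConfigToMs({ d = 0, h = 0, m = 0 } = {}) {
  const totalMinutes = m + h * 60 + d * 24 * 60
  return totalMinutes * 60 * 1000
}

// { d: 1 } → 86,400,000 ms, i.e. the callback fires once per day.
```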

Running result: (screenshot omitted)

Note: do not crawl arbitrarily; check a site's robots.txt before crawling. This example only demonstrates how to use x-crawl.

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl
