AI+Node.js x-crawl crawler: Why are traditional crawlers no longer the first choice for data crawling?

AI and Node.js crawler combination

When AI is paired with Node.js crawlers, data collection becomes smarter and more efficient. AI can help a Node.js crawler locate its targets more accurately. Traditional crawlers often rely on fixed rules or templates to capture data, an approach that is frequently powerless in the face of complex, ever-changing page structures.

Why do we need AI-assisted crawlers

With the rapid development of network technology, websites are updated more and more frequently, and changes to class names or page structure pose no small challenge to crawlers that depend on those elements. Against this background, crawlers that combine AI technology have become a powerful weapon for meeting the challenge.

First, a change in class names or structure after a website update can render traditional crawling strategies ineffective, because crawlers typically rely on fixed class names or structures to locate and extract the information they need. Once these elements change, the crawler may no longer be able to find the required data, which hurts both the effectiveness and the accuracy of data fetching.
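
To make that concrete, here is a minimal sketch (the markup and class names are hypothetical) of a fixed-rule extraction that silently breaks when a redesign renames a class:

// A hypothetical "before redesign" page and the same page after the site
// renames its price class. Both markup snippets are invented for illustration.
const htmlV1 = '<span class="price-tag">$19.99</span>'
const htmlV2 = '<span class="pdp-price">$19.99</span>'

// A traditional fixed-rule extraction hard-codes the old class name
const extractPrice = (html) => {
  const match = html.match(/class="price-tag">([^<]+)</)
  return match ? match[1] : null
}

console.log(extractPrice(htmlV1)) // "$19.99"
console.log(extractPrice(htmlV2)) // null: the fixed rule broke silently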

However, crawlers that incorporate AI technology cope with this kind of change much better. Through technologies such as natural language processing, the AI can understand and parse the semantic information of a web page and extract the required data more accurately.

In short, crawlers combined with AI technology are better able to cope with class-name and structural changes after website updates.

What is x-crawl?

x-crawl is a flexible Node.js AI-assisted crawler library. Its flexible usage and powerful AI-assisted features make crawler work more efficient, intelligent, and convenient.

It consists of two parts:

  • Crawler: composed of a crawler API and various functions; it can work properly even without relying on AI (see the minimal sketch below).
  • AI: currently based on the large AI models provided by OpenAI; the AI simplifies many tedious operations.
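
For example, a minimal sketch that uses only the crawler part, with no AI configured at all (the target URL is a placeholder):

import { createCrawl } from 'x-crawl'

// Create a crawler application with no AI configured at all
const crawlApp = createCrawl({ maxRetry: 2 })

// Crawl a single page; res.data exposes the underlying page and browser
crawlApp.crawlPage('https://example.com').then(async (res) => {
  const { page, browser } = res.data
  console.log(await page.title())
  await browser.close()
})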

x-crawl GitHub: https://github.com/coder-hxl/x-crawl

x-crawl Documentation: https://coder-hxl.github.io/x-crawl/cn/

Features

  • 🤖 AI Assistance – Powerful AI-assisted features make crawler work more efficient, intelligent, and convenient.
  • 🖋 Flexible writing – A single crawling API suits multiple configurations, and each configuration method has its own advantages.
  • ⚙ Multiple uses – Supports crawling dynamic pages, static pages, interface data, and file data.
  • ⚒ Control page – Crawling dynamic pages supports automated operations, keyboard input, event operations, and more.
  • 👀 Device Fingerprinting – Zero or custom configuration to avoid being identified and tracked by fingerprinting across different locations.
  • 🔥 Async/Sync – Switch between asynchronous and synchronous crawling modes without changing the crawling API.
  • ⏱ Interval crawling – No interval, fixed interval, or random interval; decide whether to crawl with high concurrency.
  • 🔄 Failed Retry – Customize the number of retries to avoid crawling failures caused by transient problems.
  • ➡ Proxy Rotation – Automatic proxy rotation with failure retry, and customizable error counts and HTTP status codes.
  • 🚀 Priority Queue – A single crawl target can be crawled ahead of others according to its priority.
  • 🧾 Crawl information – Controllable crawl information, output as colored strings in the terminal.
  • 🦾 TypeScript – Ships with its own types; complete typing implemented through generics.
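
To give a feel for how several of these options combine, here is a hedged sketch; the per-target priority shape follows the library's documented examples, and the URLs are placeholders:

import { createCrawl } from 'x-crawl'

// Failed retry plus a random interval between targets (both also appear
// in the example below); the values here are illustrative.
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 3000, min: 1000 }
})

// Priority queue: a detailed per-target object (assumed shape, following
// the library's documented examples) lets an important page jump the queue.
crawlApp.crawlPage({
  targets: [
    { url: 'https://example.com/important', priority: 9 },
    { url: 'https://example.com/other', priority: 1 }
  ]
})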

Example of AI and x-crawl crawler combination

Combining the crawler with AI allows them to fetch pictures of highly rated vacation rentals according to our instructions:

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

// Create a crawler application
const crawlApp = createCrawl({
  maxRetry: 3,
  intervalTime: { max: 2000, min: 1000 }
})

// Create an AI application
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' }
})

// crawlPage is used to crawl pages
crawlApp.crawlPage('https://www.airbnb.cn/s/select_homes').then(async (res) => {
  const { page, browser } = res.data

  // Wait for the element to appear on the page and get the HTML
  const targetSelector = '[data-tracking-id="TOP_REVIEWED_LISTINGS"]'
  await page.waitForSelector(targetSelector)
  const highlyHTML = await page.$eval(targetSelector, (el) => el.innerHTML)

  // Let the AI get the image link and de-duplicate it (the more detailed the description, the better)
  const srcResult = await crawlOpenAIApp.parseElements(
    highlyHTML,
    `Get the image link, don't source it inside, and de-duplicate it`
  )

  browser.close()

  // crawlFile is used to crawl file resources
  crawlApp.crawlFile({
    targets: srcResult.elements.map((item) => item.src),
    storeDirs: './upload'
  })
})

You can even pass the entire HTML to the AI and let it do the work, but since full pages are larger and more complex, you also need to describe the target location more precisely, and it will consume far more tokens.

If you want to see the HTML the AI had to process, or the srcResult (the image URLs) the AI returned after parsing that HTML according to our instructions, the fragments are too long to reproduce inline; you can view the full example in the x-crawl documentation linked above.

AI intelligent on-demand element analysis

There is no need to manually analyze the HTML page structure to extract the required element attributes or values. Just feed the HTML into the AI and tell it which elements you want information about; the AI will automatically analyze the page structure and extract the corresponding element attributes or values.

import { createXCrawlOpenAI } from 'x-crawl'

const xCrawlOpenAIApp = createXCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

// NOTE: the original markup was lost when this article was extracted; the
// wrapper structure below (.scroll-list / .list-item, matching the selector
// results in the next example) is a plausible reconstruction.
const HTMLContent = `
  <div class="scroll-list">
    <div class="list-item">Women's hooded sweatshirt</div>
    <div class="list-item">Men's sweatshirts</div>
    <div class="list-item">Women's sweatshirt</div>
    <div class="list-item">Men's hooded sweatshirt</div>
  </div>
  <div class="scroll-list">
    <div class="list-item">Men's pure cotton short sleeves</div>
    <div class="list-item">Men's pure cotton short sleeves</div>
    <div class="list-item">Women's pure cotton short sleeves</div>
    <div class="list-item">Men's ice silk short sleeves</div>
    <div class="list-item">Men's round neck short sleeves</div>
  </div>
`

xCrawlOpenAIApp
  .parseElements(HTMLContent, `Take all men's clothing and remove duplicates`)
  .then((res) => {
    console.log(res)
    /*
      res:
      {
        elements: [
          { content: "Men's hooded sweatshirt" },
          { content: "Men's sweatshirts" },
          { content: "Men's pure cotton short sleeves" },
          { content: "Men's ice silk short sleeves" },
          { content: "Men's round neck short sleeves" }
        ],
        type: 'multiple'
      }
    */
  })

You can even pass the entire HTML to the AI and let it do the work, but since full pages are larger and more complex, you also need to describe the target location more precisely, and it will consume far more tokens.

Intelligent generation of element selectors

It helps us quickly locate specific elements on a page. Just feed the HTML into the AI and tell it which elements you want selectors for; the AI will automatically generate appropriate selectors based on the page structure, greatly simplifying the otherwise tedious process of working them out by hand.

import { createXCrawlOpenAI } from 'x-crawl'

const xCrawlOpenAIApp = createXCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

// NOTE: as above, the original markup was lost in extraction; this wrapper
// structure is a plausible reconstruction matching the selector result below.
const HTMLContent = `
  <div class="scroll-list">
    <div class="list-item">Women's hooded sweatshirt</div>
    <div class="list-item">Men's sweatshirts</div>
    <div class="list-item">Women's sweatshirt</div>
    <div class="list-item">Men's hooded sweatshirt</div>
  </div>
  <div class="scroll-list">
    <div class="list-item">Men's pure cotton short sleeves</div>
    <div class="list-item">Men's pure cotton short sleeves</div>
    <div class="list-item">Women's pure cotton short sleeves</div>
    <div class="list-item">Men's ice silk short sleeves</div>
    <div class="list-item">Men's round neck short sleeves</div>
  </div>
`

xCrawlOpenAIApp
  .getElementSelectors(HTMLContent, `all Women's wear`)
  .then((res) => {
    console.log(res)
    /*
      res:
      {
        selectors: '.scroll-list:nth-child(2) .list-item:nth-child(3)',
        type: 'multiple'
      }
    */
  })
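
A generated selector can then be used directly in ordinary page code. The following is a minimal hedged sketch that combines the two halves, reusing the createCrawl / createCrawlOpenAI APIs from the first example (the URL is a placeholder, and we assume getElementSelectors behaves as shown above):

import { createCrawl, createCrawlOpenAI } from 'x-crawl'

const crawlApp = createCrawl()
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] }
})

crawlApp.crawlPage('https://example.com/products').then(async (res) => {
  const { page, browser } = res.data

  // Let the AI derive a selector once from the live page's HTML...
  const html = await page.content()
  const { selectors } = await crawlOpenAIApp.getElementSelectors(
    html,
    `all Women's wear`
  )

  // ...then query the page with the generated selector as usual
  const texts = await page.$$eval(selectors, (els) =>
    els.map((el) => el.textContent)
  )
  console.log(texts)

  await browser.close()
})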

You can even pass the entire HTML to the AI and let it do the work, but since full pages are larger and more complex, you also need to describe the target location more precisely, and it will consume far more tokens.

Intelligent reply to crawler questions

The AI can also provide you with intelligent answers and suggestions. Whether the question concerns crawling strategies, anti-crawling techniques, or data processing, you can ask the AI, and it will provide professional answers and suggestions based on its powerful learning and reasoning capabilities, helping you complete your crawling tasks better.

import { createXCrawlOpenAI } from 'x-crawl'

const xCrawlOpenAIApp = createXCrawlOpenAI({
  clientOptions: { apiKey: 'Your API Key' }
})

xCrawlOpenAIApp.help('What is x-crawl').then((res) => {
  console.log(res)
  /*
    res:
    x-crawl is a flexible Node.js AI-assisted web crawling library. It offers powerful AI-assisted features that make web crawling more efficient, intelligent, and convenient. You can find more information and the source code on x-crawl's GitHub page: https://github.com/coder-hxl/x-crawl.
   */
})

xCrawlOpenAIApp
  .help('Three major things to note about crawlers')
  .then((res) => {
    console.log(res)
    /*
      res:
      There are several important aspects to consider when working with crawlers:

      1. **Robots.txt:** It's important to respect the rules set in a website's robots.txt file. This file specifies which parts of a website can be crawled by search engines and other bots. Not following these rules can lead to your crawler being blocked or even legal issues.

      2. **Crawl Delay:** It's a good practice to implement a crawl delay between your requests to a website. This helps to reduce the load on the server and also shows respect for the server resources.

      3. **User-Agent:** Always set a descriptive User-Agent header for your crawler. This helps websites identify your crawler and allows them to contact you if there are any issues. Using a generic or misleading User-Agent can also lead to your crawler being blocked.

      By keeping these points in mind, you can ensure that your crawler operates efficiently and ethically.
   */
  })

Summary

In the latest version of x-crawl, we have introduced powerful AI-assisted features that make crawler work more efficient, intelligent, and convenient. This innovation is mainly reflected in the following aspects:

1. Intelligent on-demand element analysis

Traditional crawler work often requires manual analysis of the HTML page structure to extract the desired element attributes or values. Now, with the AI assistance of x-crawl, you can easily implement intelligent on-demand analysis of elements. Just tell the AI which elements you want to get information about, and the AI will automatically analyze the page structure and extract the corresponding element attributes or values.

2. Intelligent generation of element selectors

Selectors are an integral part of the crawler’s work, helping us quickly locate specific elements on the page. Now, x-crawl’s AI assistant can intelligently generate element selectors for you. Simply input the HTML code into the AI, and the AI will automatically generate the appropriate selector for you based on the page structure, greatly simplifying the tedious process of determining the selector.

3. Intelligent replies to crawler problems

In crawling work we inevitably encounter various problems and challenges, and x-crawl's AI assistance can provide intelligent answers and suggestions. Whether it is about crawling strategy, anti-crawling techniques, or data processing, you can ask the AI, and it will provide professional answers and suggestions based on its strong learning and reasoning abilities to help you complete your crawling tasks better.

4. User-defined AI functions

To meet the individual needs of different users, x-crawl also lets you customize the AI. This means you can tailor and optimize the AI to your needs, making it better suited to your crawler work. Whether it is adjusting the AI's analysis strategy, optimizing selector generation, or adding new functional modules, simple configuration changes can make the AI fit your habits and workflow.
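
Within what this article shows, the customization surface is the configuration object passed to createCrawlOpenAI; here is a minimal sketch of adjusting it (the model name is only an example value):

import { createCrawlOpenAI } from 'x-crawl'

// Customize the AI application: choose the chat model and pass any
// OpenAI client options (API key, etc.), as in the first example above.
const crawlOpenAIApp = createCrawlOpenAI({
  clientOptions: { apiKey: process.env['OPENAI_API_KEY'] },
  defaultModel: { chatModel: 'gpt-4-turbo-preview' } // swap in the model you prefer
})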

x-crawl GitHub: https://github.com/coder-hxl/x-crawl

x-crawl Documentation: https://coder-hxl.github.io/x-crawl

If you find x-crawl helpful, or if you like x-crawl, you can give the x-crawl repository a star on GitHub. Your support is our motivation for continuous improvement. Thank you!

x-crawl v7 version has been released


x-crawl

x-crawl is a flexible Node.js multifunctional crawler library. Its flexible usage and numerous functions help you quickly, safely, and stably crawl pages, interfaces, and files.
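
For instance, interface (API) data is crawled with the crawlData API; here is a minimal hedged sketch (the endpoint URLs are placeholders, and the exact result shape may differ by version):

import xCrawl from 'x-crawl'

const myXCrawl = xCrawl({ maxRetry: 2, intervalTime: { max: 2000, min: 1000 } })

// Batch-crawl two API endpoints; each entry in res corresponds to one target
myXCrawl
  .crawlData({
    targets: [
      'https://example.com/api/products',
      'https://example.com/api/reviews'
    ]
  })
  .then((res) => {
    res.forEach((item) => console.log(item.data))
  })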

If you also like x-crawl, you can give the x-crawl repository a star to support it. Thank you for your support!

Features

  • 🔥 Asynchronous/Synchronous – Just change the mode property to toggle between asynchronous and synchronous crawling modes.
  • ⚙ Multiple purposes – Crawl pages, interfaces, and files, and poll crawls, to meet the needs of various scenarios.
  • 🖋 Flexible writing style – The same crawling API adapts to multiple configurations, and each configuration method has its own strengths.
  • ⏱ Interval Crawling – No interval, fixed interval, or random interval, to generate or avoid highly concurrent crawling.
  • 🔄 Failed Retry – Avoid crawling failures caused by transient problems; customize the number of retries.
  • ➡ Proxy Rotation – Auto-rotate proxies with failure retry, and customize error counts and HTTP status codes.
  • 👀 Device Fingerprinting – Zero or custom configuration to avoid being identified and tracked by fingerprinting across different locations.
  • 🚀 Priority Queue – A single crawl target can be crawled ahead of others according to its priority.
  • ☁ Crawl SPA – Crawl SPAs (single-page applications) to generate pre-rendered content (a.k.a. "SSR", server-side rendering).
  • ⚒ Control Page – Submit forms, type keyboard input, trigger events, generate screenshots of the page, and more.
  • 🧾 Capture Record – Capture and record crawling, with colored string reminders in the terminal.
  • 🦾 TypeScript – Ships with its own types; complete typing implemented through generics.

Example

As an example, let's automatically fetch photos of experiences and homes around the world every day:

// 1.Import module ES/CJS
import xCrawl from 'x-crawl'

// 2.Create a crawler instance
const myXCrawl = xCrawl({
  maxRetry: 3,
  intervalTime: { max: 3000, min: 2000 }
})

// 3.Set the crawling task
/*
  Call the startPolling API to start the polling function,
  and the callback function will be called once a day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const res = await myXCrawl.crawlPage({
    targets: [
      'https://www.airbnb.cn/s/experiences',
      'https://www.airbnb.cn/s/plus_homes'
    ],
    viewport: { width: 1920, height: 1080 }
  })

  // Store the image URL to targets
  const targets = []
  const elSelectorMap = ['._fig15y', '._aov0j6']
  for (const item of res) {
    const { id } = item
    const { page } = item.data

    // Wait for the page to load
    await new Promise((r) => setTimeout(r, 300))

    // Gets the URL of the page image
    const urls = await page.$$eval(
      `${elSelectorMap[id - 1]} img`,
      (imgEls) => {
        return imgEls.map((item) => item.src)
      }
    )
    targets.push(...urls)

    // Close page
    page.close()
  }

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ targets, storeDir: './upload' })
})

Running result: (screenshot omitted)

Note: Do not crawl sites indiscriminately; check the site's robots.txt before crawling. This example only demonstrates how to use x-crawl.
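
As a hedged sketch of that pre-flight check (a deliberately naive fetch-and-scan of robots.txt that ignores User-agent sections; real projects would typically use a dedicated robots.txt parser):

// Minimal pre-flight: fetch the site's robots.txt and scan for Disallow
// rules that cover the path we intend to crawl. Requires Node 18+ for
// the built-in fetch; stricter than a real parser.
const checkRobots = async (origin, path) => {
  const res = await fetch(`${origin}/robots.txt`)
  if (!res.ok) return true // no robots.txt: nothing is explicitly forbidden

  const disallowed = (await res.text())
    .split('\n')
    .filter((line) => line.trim().toLowerCase().startsWith('disallow:'))
    .map((line) => line.split(':')[1].trim())
    .filter((rule) => rule !== '') // an empty Disallow means "allow everything"

  return !disallowed.some((rule) => path.startsWith(rule))
}

checkRobots('https://example.com', '/s/experiences').then((allowed) => {
  console.log(allowed ? 'OK to crawl' : 'Disallowed by robots.txt')
})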

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl

A flexible nodejs crawler library —— x-crawl


x-crawl

x-crawl is a flexible nodejs crawler library. It can crawl pages in batches, make network requests in batches, download file resources in batches, poll and crawl, and more. It supports asynchronous and synchronous crawling modes, runs on nodejs, and its usage is flexible and simple, friendly to JS/TS developers.

If you like it, you can give the x-crawl repository a star to support it; your star will be the motivation for my updates.

  • 🔥 Asynchronous/Synchronous – Supports batch crawling in asynchronous or synchronous mode.
  • ⚙ Multiple functions – Batch crawling of pages, batch network requests, batch downloads of file resources, polling crawls, and more.
  • 🖋 Flexible writing style – Multiple crawling configurations and multiple ways to get crawling results.
  • ⏱ Interval crawling – No interval, fixed interval, or random interval; you can use or avoid highly concurrent crawling.
  • ☁ Crawl SPA – Batch-crawl SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR", server-side rendering).
  • ⚒ Controlling Pages – The headless browser can submit forms, simulate keystrokes, trigger events, generate screenshots of pages, and more.
  • 🧾 Capture Record – Capture and record the crawled results, with highlighted reminders.
  • 🦾 TypeScript – Ships with its own types; complete typing implemented through generics.

Example

Timed capture: take automatically capturing the cover images of Airbnb Plus listings every day as an example:

// 1.Import module ES/CJS
import xCrawl from 'x-crawl'

// 2.Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout (ms)
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3.Set the crawling task
/*
  Call the startPolling API to start the polling function,
  and the callback function will be called once a day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { page } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // set request configuration
  const plusBoxHandle = await page.$('.a1stauiv')
  const requestConfig = await plusBoxHandle!.$$eval('picture img', (imgEls) => {
    return imgEls.map((item) => item.src)
  })

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })

  // Close page
  page.close()
})

Running result: (screenshot omitted)

Note: Do not crawl sites indiscriminately; check the site's robots.txt before crawling. This example only demonstrates how to use x-crawl.

More

For more detailed documentation, please check: https://github.com/coder-hxl/x-crawl
