Software

1 minute read

I built an industrial scale web scraper. Here’s what I learned.

Michael Luchen

July 18, 2023

Recently, I built an industrial scale web scraper. Here’s what I learned.

1. Why build a scalable scraper/crawler?

Google’s primary product, its search engine, is powered by web scrapers and crawlers extracting data from the internet at an unfathomable scale.
OpenAI’s capability (and willingness) to access data using scrapers and crawlers at internet-wide scale is what enabled it to build (and continually improve) ChatGPT.
Unlike in the last decade, intelligence is now something you can build, use, and sell. The one catch is that doing so requires an immense amount of one resource: a hell of a lot of data.

2. Using Chromium programmatically is helpful (I chose Puppeteer)

3. Industrial scale requires using proxies (I rotated between residential proxies)
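As a rough illustration of what rotating proxies can look like, here is a minimal Python sketch using only the standard library. It is not the setup I actually ran: the proxy URLs and the `ProxyRotator` class are hypothetical placeholders, and simple round-robin is just one of several possible rotation strategies.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy endpoints; substitute the ones your provider issues.
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
    "http://proxy-c.example.com:8000",
]


class ProxyRotator:
    """Hand out proxies in round-robin order (an assumed, simple strategy)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._pool)

    def opener(self):
        """Build a urllib opener that sends its requests through the next proxy."""
        proxy = self.next_proxy()
        return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


rotator = ProxyRotator(PROXIES)
```

Each call to `rotator.opener()` returns an opener bound to the next proxy in the list, so `rotator.opener().open(url)` would send successive requests out through different addresses. A real deployment would likely also need to retire proxies that start failing.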

4. Bots can find rules via a robots.txt file for a site (Ask SEO experts about it)
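To make the robots.txt point concrete, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, and the example.com URLs are all made up for illustration, and the rules are inlined so the snippet needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no request is made.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical bot name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this bot may fetch each URL under the parsed rules.
print(parser.can_fetch(USER_AGENT, "https://example.com/blog/post"))
print(parser.can_fetch(USER_AGENT, "https://example.com/private/data"))

# Ask how long the site wants bots to wait between requests, if it says.
print(parser.crawl_delay(USER_AGENT))
```

Against a live site you would instead call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` before checking URLs.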

5. Bypassing CAPTCHAs, although ethically questionable, doesn’t seem to be illegal to program your bot to do. (I explored Python programs on GitHub that were capable of this to satisfy my own curiosity.)



