
GPT Crawler Uniqcret style

Mayta

Updated: Feb 4


In today's rapidly advancing digital landscape, the need for efficient data collection, processing, and utilization has never been greater. The GPT Crawler is changing the way we gather and prepare data for Custom GPTs, AI assistants, and the OpenAI Playground. This blog post walks you through the updated process of using the GPT Crawler to scrape the web and generate JSON knowledge files. With these enhancements, content creation and data compilation become more accessible and efficient, with practical solutions for managing large files and for overcoming common challenges like lazy loading and infinite scroll.

Introduction to the Enhanced GPT Crawler

As artificial intelligence and machine learning continue to evolve, having access to comprehensive and current data is crucial. The GPT Crawler is a sophisticated tool that automates web data collection, transforming it into a structured format that can be directly used by Custom GPTs and AI models. This tool has become indispensable for developers, content creators, and researchers, offering a streamlined, efficient approach to data acquisition.

Step-by-Step Guide

Getting Started

To get started with the GPT Crawler, the first step is setting up your environment. This involves downloading and installing the essential software: Node.js (https://nodejs.org) and Visual Studio Code (https://code.visualstudio.com).

Installation and Setup

Visit the GPT Crawler's GitHub page (https://github.com/BuilderIO/gpt-crawler) to clone or download the repository. Follow the provided instructions to install the necessary dependencies and configure the crawler for your needs.

Running the Crawler

Once your environment is ready, you can run the GPT Crawler. The tool is designed to be user-friendly: in its config.ts file you specify which website to scrape and the name of the JSON file the results are written to, then start the crawl. This flexibility makes it a good fit for a wide range of projects.

Handling Large Files

The GPT Crawler now includes improved functionality for managing large files, which can be challenging to upload directly. There are two main strategies for handling large datasets (a short sketch for working with the resulting split files follows the list):

  1. Splitting Files: Use the maxFileSize option in the config.ts file to automatically split a large output into smaller, more manageable files.

  2. Token Limits: Use the maxTokens option in the config.ts file to cap the number of tokens included in each output file, keeping every file small enough to upload.
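
When these limits are exceeded, the crawler can end up writing its results across several numbered files rather than a single one; the example later in this post uses the name output-1.json. If a downstream step expects one list of scraped entries, a small Python sketch like the one below can load the split files back together. The output-*.json naming pattern and the assumption that each file holds a JSON array of entries come from this post's examples, so adjust them to match your actual output.

```python
import glob
import json

# Gather every split output file produced by the crawler.
# Assumes the numbered naming pattern used in this post (output-1.json,
# output-2.json, ...) and that each file holds a JSON array of entries.
paths = sorted(glob.glob("output-*.json"))

entries = []
for path in paths:
    with open(path, "r", encoding="utf-8") as f:
        entries.extend(json.load(f))

print(f"Loaded {len(entries)} scraped pages from {len(paths)} files")
```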

Overcoming Common Challenges: Lazy Loading and Infinite Scroll

Be aware of potential challenges when using the GPT Crawler on websites that rely on lazy loading or infinite scroll. Because content on such pages only loads as the user scrolls, the crawler may miss material, as happened in some initial runs where content wasn't fully loaded. To tackle this:

  1. Pagination: Adjust your website settings to display content in numbered pages (e.g., 1, 2, 3, 4...) instead of using infinite scroll. This change ensures the crawler can systematically access all content, page by page, delivering comprehensive results.

  2. Crawler Configuration: Update your crawler settings (for example, the start URL and match pattern in config.ts) so the crawler navigates through each numbered page and captures all necessary data.


 

Practical Solution for Splitting Files

After running the GPT Crawler or any web scraping tool, you may end up with a sizable JSON file containing all the scraped data. Splitting this file into smaller segments helps with:

  1. Handling Large Files: Uploading and processing one massive file can be difficult.

  2. Incremental Updates: If you re-crawl or only add newly published articles, splitting and managing those new portions alone is more efficient than redoing everything from scratch.

  3. Ease of Editing: Working with smaller, separate files (one per article or data entry) makes it simpler to fix or update content.

Why Incremental (Partial) Crawling?

Many websites frequently add new articles or content. Rather than starting the crawler from the first page and going all the way to the last each time, an incremental or partial crawl lets you only fetch newly added articles since your last crawl. For instance:

  • Faster: Saves time and reduces unnecessary processing.

  • Easier Updates: Whenever 10 new articles (for example) are posted, you can have your crawler gather just those new ones, then update your JSON output.

  • AI Search Engine Integration: An AI-based search engine often only needs article titles and URLs, so reprocessing just the new batch of articles is enough to keep the search index up to date.

Once you have your main JSON file (either the entire dataset or just the latest crawl), splitting it simplifies your workflow. You can edit or revise specific entries easily before potentially recombining them later if needed.
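
To make the incremental idea concrete, here is a minimal Python sketch that filters a fresh crawl down to only the articles that have not been processed before, using the same kind of titles_list.txt bookkeeping file described in the next section. The file names (output-1.json, new-entries.json, titles_list.txt) and the title field are assumptions taken from this post's examples rather than a fixed interface.

```python
import json
import os

CRAWL_OUTPUT = "output-1.json"   # latest crawl result (assumed name)
SEEN_TITLES = "titles_list.txt"  # one previously processed title per line

# Load the titles that were already processed, if the list exists.
seen = set()
if os.path.exists(SEEN_TITLES):
    with open(SEEN_TITLES, "r", encoding="utf-8") as f:
        seen = {line.strip() for line in f if line.strip()}

with open(CRAWL_OUTPUT, "r", encoding="utf-8") as f:
    entries = json.load(f)

# Keep only articles whose title has not been seen before.
new_entries = [e for e in entries if e.get("title") not in seen]

with open("new-entries.json", "w", encoding="utf-8") as f:
    json.dump(new_entries, f, ensure_ascii=False, indent=2)

# Record the new titles so the next run skips them as well.
with open(SEEN_TITLES, "a", encoding="utf-8") as f:
    for e in new_entries:
        if e.get("title"):
            f.write(e["title"] + "\n")

print(f"{len(new_entries)} new articles out of {len(entries)} crawled")
```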

 


How to Use:

  1. Place your JSON file (e.g., output-1.json) in the same directory as Splitting JSON Files.py.

  2. Run the script: python "Splitting JSON Files.py" (quote the filename, since it contains spaces). A sketch of the script's core logic appears after this list.

  3. Output: A folder named after the input file (e.g., output-1) will be created, containing individual JSON files (one per article).

  4. Titles List: A titles_list.txt file records processed titles to prevent duplicates.
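
The splitting script itself is not reproduced in this post, but the behavior described above is easy to sketch. The following Python snippet is only an illustration of that logic (one JSON file per article, an output folder named after the input file, and a titles_list.txt used to skip duplicates); the title field and the exact file naming are assumptions based on this post, not the original Splitting JSON Files.py code.

```python
import json
import os
import re

INPUT_FILE = "output-1.json"  # the crawler's JSON output (assumed name)

# Create a folder named after the input file, e.g. "output-1".
folder = os.path.splitext(os.path.basename(INPUT_FILE))[0]
os.makedirs(folder, exist_ok=True)

with open(INPUT_FILE, "r", encoding="utf-8") as f:
    entries = json.load(f)

# Track processed titles so re-running the script skips duplicates.
titles_path = "titles_list.txt"
seen = set()
if os.path.exists(titles_path):
    with open(titles_path, "r", encoding="utf-8") as f:
        seen = {line.strip() for line in f if line.strip()}

with open(titles_path, "a", encoding="utf-8") as log:
    for i, entry in enumerate(entries, start=1):
        title = (entry.get("title") or f"untitled-{i}").strip()
        if title in seen:
            continue
        # Build a filesystem-safe filename from the title.
        safe = re.sub(r"[^\w\-]+", "_", title)[:80] or f"article-{i}"
        with open(os.path.join(folder, f"{i:04d}_{safe}.json"), "w",
                  encoding="utf-8") as out:
            json.dump(entry, out, ensure_ascii=False, indent=2)
        log.write(title + "\n")
        seen.add(title)

print(f"Wrote the split files into the '{folder}' folder")
```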


 


Combining and Further Processing

After splitting files, you might need to recombine (merge) them for simplified uploading or for use in other projects. The Combine.py and Get a list of links.py scripts are useful for the following (a rough sketch of the same idea appears after this list):

  • Combining multiple JSON files back into a single JSON file.

  • Removing unneeded fields (e.g., html) and extracting only essential information (like title and url).
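
Those scripts are not included here either, but the core idea is simple enough to sketch. The Python snippet below merges a folder of per-article JSON files back into one list and also writes a slimmed-down version that keeps only title and url while dropping heavy fields such as html. The folder name and the field names are assumptions based on this post, not the exact code of Combine.py or Get a list of links.py.

```python
import glob
import json
import os

SPLIT_FOLDER = "output-1"  # folder produced by the splitting step (assumed name)

# Merge every per-article file back into a single list.
combined = []
for path in sorted(glob.glob(os.path.join(SPLIT_FOLDER, "*.json"))):
    with open(path, "r", encoding="utf-8") as f:
        combined.append(json.load(f))

with open("combined.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)

# Keep only the essentials (title and url) for an AI search index.
links_only = [
    {"title": e.get("title"), "url": e.get("url")}
    for e in combined
    if e.get("url")
]

with open("links.json", "w", encoding="utf-8") as f:
    json.dump(links_only, f, ensure_ascii=False, indent=2)

print(f"Combined {len(combined)} articles; kept {len(links_only)} title/url pairs")
```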






 

Key Takeaways

  1. Incremental Crawling: Only fetch newly published articles rather than re-crawling everything.

  2. Split for Simplicity: Edit or examine articles individually.

  3. Combine for Final Use: Bring them together into a single JSON when needed.

  4. Clean Up: Remove large or unnecessary fields to optimize for AI-based searches.

  5. Titles & URLs: Perfect for building an AI search index or quick lookups.


 

Conclusion

Implementing an incremental (partial) crawling approach with file splitting creates a more efficient workflow. It saves time by focusing only on new content each time you crawl. After that, splitting large JSON files into smaller, per-article files lets you update or edit entries individually without disturbing the rest. If you ever need a consolidated overview, you can recombine the split files. Finally, for AI or search-engine purposes, you can strip out unnecessary details—like html content—and feed the final JSON containing only titles and URLs to your AI-based search index. This modular flow—crawl → split → edit/update → combine → final data—helps keep your project organized, flexible, and ready for ongoing changes.



