Intro
This article will walk through converting search results into Markdown format, suitable for use in large language models (LLMs) and other applications.
Markdown is a lightweight markup language that provides a simple, readable way to format text with plain-text syntax. Check out the Markdown Guide for more information.
Use Cases
Markdown's simple, readable format allows for the transformation of raw webpage data into clean, actionable information across different use cases:
- LLM Training: Generate Q&A datasets or custom knowledge bases.
- Content Aggregation: Create training datasets or compile research.
- Market Research: Monitor competitors or gather product information.
SerpApi
SerpApi is a web scraping company that allows developers to extract search engine results and data from various search engines, including Google, Bing, Yahoo, Baidu, Yandex, and others. It provides a simple way to access search engine data programmatically without dealing directly with the complexities of web scraping.
This guide focuses on the Google Search API, but the concepts and techniques discussed can be adapted for use with SerpApi’s other APIs.
Google Search API
The Google Search API lets developers programmatically retrieve structured JSON data from live Google searches. Key benefits include:
- CAPTCHA handling and browser automation: Avoid manual intervention and IP blocks.
- Structured data: Output is clean JSON that is easy to parse (see the abbreviated example after this list).
- Global and multilingual support: Search in specific languages or regions.
- Scalability: Perform high-volume searches without disruptions.
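For reference, each response includes an organic_results array shaped roughly like the following (abbreviated and illustrative; see SerpApi’s documentation for the full schema):
{
  "organic_results": [
    {
      "position": 1,
      "title": "Page title",
      "link": "https://example.com/page",
      "snippet": "A short description of the page..."
    }
  ]
}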
Getting Started
This section provides a complete code example for fetching Google search results using SerpApi, parsing the webpage content, and converting it to Markdown. While this example uses Node.js (JavaScript), the same principles apply in other languages.
Required Packages
Make sure to install the following packages in your Node.js project; an example install command follows the list.
SerpApi JavaScript: Scrape and parse search engine results using SerpApi. Get search results from Google, Bing, Baidu, Yandex, Yahoo, Home Depot, eBay and more.
Cheerio: A fast, flexible, and elegant library for parsing and manipulating HTML and XML.
Turndown: Convert HTML into Markdown with JavaScript.
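For example, with npm:
npm install serpapi cheerio turndown node-fetch dotenv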
Importing Packages
First, we must import all of our required packages:
import dotenv from "dotenv";
import fetch from "node-fetch";
import fs from "fs/promises";
import path from "path";
import { getJson } from "serpapi";
import * as cheerio from "cheerio";
import TurndownService from "turndown";

// Load environment variables (e.g., SERPAPI_KEY) from the .env file
dotenv.config();
Fetching Search Results
The fetchSearchResults function retrieves search results using SerpApi’s Google Search API:
const fetchSearchResults = async (query) => {
return await getJson("google", {
api_key: process.env.SERPAPI_KEY,
q: query,
num: 5,
});
};
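As a quick sanity check (not part of the final script), you can call the function directly and log the organic result titles, assuming your SERPAPI_KEY is set as described below:
// Top-level await is available in ES modules
const results = await fetchSearchResults("coffee");
results.organic_results?.forEach((result) => console.log(result.title));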
Create a .env file, include your SerpApi key, and install the dotenv package (the dotenv.config() call in the imports above loads it). Or, replace process.env.SERPAPI_KEY with your API key if you are simply running the script locally.
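A minimal .env file looks like this, with the value being a placeholder for your actual key:
SERPAPI_KEY=your_api_key_here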
Parsing Webpage Content
The parseUrl function fetches the HTML of a given URL, cleans it, and converts it to Markdown:
const parseUrl = async (url) => {
try {
// Configure fetch request with browser-like headers
const response = await fetch(url, {
headers: {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
Accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
},
});
if (!response.ok) {
throw new Error(`HTTP error! status: ${response.status}`);
}
const html = await response.text();
// Initialize HTML parser and markdown converter
const $ = cheerio.load(html);
const turndown = new TurndownService({
headingStyle: "atx",
codeBlockStyle: "fenced",
});
// Clean up HTML by removing unnecessary elements
$("script, style, nav, footer, iframe, .ads").remove();
// Extract title and main content
const title = $("title").text().trim() || $("h1").first().text().trim();
const mainContent =
$("article, main, .content, #content, .post").first().html() ||
$("body").html();
const content = turndown.turndown(mainContent || "");
return { title, content };
} catch (error) {
console.error(`Failed to parse ${url}:`, error.message);
return null;
}
};
This function ensures clean, readable Markdown by removing non-essential elements like scripts and ads.
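As a quick illustration (the URL is just a placeholder), you could call it on a single page:
const page = await parseUrl("https://example.com");
if (page) {
  console.log(page.title);
  console.log(page.content.slice(0, 200)); // Preview the first 200 characters of Markdown
}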
Sanitizing Keywords
To prevent filename issues, we can sanitize keywords before using them in filenames:
const sanitizeKeyword = (keyword) => {
return keyword
.replace(/\s+/g, "_") // Replace spaces with underscores
.substring(0, 15) // Truncate to 15 characters
.toLowerCase(); // Convert to lowercase
};
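Note that this only replaces whitespace. If your keywords may contain characters that are invalid in filenames (such as / or ?), a stricter variant can strip those as well; this is a sketch, not part of the original script:
const sanitizeKeywordStrict = (keyword) => {
  return keyword
    .replace(/[^a-zA-Z0-9]+/g, "_") // Replace any run of non-alphanumeric characters
    .substring(0, 15) // Truncate to 15 characters
    .toLowerCase(); // Convert to lowercase
};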
Writing to Markdown
This function writes the parsed content to a Markdown file, using the sanitize function to set the file's name:
const writeToMarkdown = async (data, keyword, index, url) => {
const sanitizedKeyword = sanitizeKeyword(keyword);
const filename = path.join(
"output",
`${new Date().toISOString().replace(/[:.]/g, "-")}_${sanitizedKeyword}_${index + 1}.md` // Replace ":" and "." in the timestamp, which are invalid in filenames on some platforms
);
const content = `[//]: # (Source: ${url})\n\n# ${data.title}\n\n${data.content}`;
await fs.writeFile(filename, content, "utf-8");
return filename;
};
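For example, for the keyword "coffee" and the first result, the script would write a file such as output/2025-01-01T12-00-00-000Z_coffee_1.md (the timestamp and source URL here are hypothetical) that begins with:
[//]: # (Source: https://example.com/coffee-guide)

# A Guide to Coffee

...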
Main Execution
The main script drives the process. Update the keywords array with terms relevant to your use case:
// Example Keyword array
const keywords = ["coffee", "playstation 5", "web scraping"];
// Main execution block
(async () => {
try {
// Create output directory if it doesn't exist
await fs.mkdir("output", { recursive: true });
// Process each keyword
for (const keyword of keywords) {
const results = await fetchSearchResults(keyword);
// Process search results if available
if (results.organic_results && results.organic_results.length > 0) {
for (const [index, result] of results.organic_results.entries()) {
try {
const data = await parseUrl(result.link);
// Skip URLs that could not be fetched or parsed (parseUrl returns null on error)
if (!data) continue;
const filename = await writeToMarkdown(
data,
keyword,
index,
result.link
);
console.log(`Written to: ${filename}`);
} catch (err) {
console.error(`Failed to process ${result.link}:`, err.message);
continue;
}
}
} else {
console.log(`No organic results found for keyword: ${keyword}`);
}
}
} catch (error) {
console.error(error);
}
})();
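Assuming the script is saved as index.js (the filename is arbitrary) and your package.json includes "type": "module" so the import syntax works, run it with:
node index.js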
To summarize the above:
- Set up output directory: Ensures files are saved to an appropriate location.
- Fetch and parse results: Process each search result URL for relevant content.
- Error handling: Prevents the entire process from failing due to individual errors.
Next Steps
While the above should get you started, you may need to configure Cheerio selectors or Turndown rules further to target the exact sections you're scraping.
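For example, assuming you want to drop images from the converted output, you could register a custom rule on the Turndown instance before calling turndown() (a minimal sketch):
// Remove <img> elements instead of converting them to Markdown images
turndown.addRule("stripImages", {
  filter: "img",
  replacement: () => "",
});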
You can find a repository for the above code here:
Conclusion
SerpApi simplifies accessing structured search engine data through programmatic methods. By leveraging code-based solutions, developers can efficiently extract and transform web pages from search results into usable formats, enabling data collection and analysis.