What will be scraped

[Image: Google Scholar Case Law results that will be scraped]

Prerequisites

Separate virtual environment

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

In short, a virtual environment is an isolated set of installed libraries, optionally with its own Python version, that can coexist with other environments on the same system. This prevents library and Python version conflicts.

Install libraries:

pip install pandas google-search-results  

Scrape and save Google Scholar Case Law results to CSV

If you don't need an explanation, try it in the online IDE.

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd

def case_law_results():

    print("Extracting case law results..")

    params = {
        "api_key": "...",                 # https://serpapi.com/manage-api-key
        "engine": "google_scholar",       # Google Scholar search results
        "q": "minecraft education ",      # search query
        "hl": "en",                       # language
        "start": "0",                     # first page
        "as_sdt": "6"                     # case law results. Weird, huh? Try without it.
    }
    search = GoogleSearch(params)

    case_law_results_data = []

    while True:
        results = search.get_dict()

        # stop on any backend service error or failed search
        if "error" in results:
            break

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            title = result.get("title")
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result.get("result_id")
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            # "resources" is missing when a result has no attached file
            try:
                file_title = result["resources"][0]["title"]
            except (KeyError, IndexError):
                file_title = None

            try:
                file_link = result["resources"][0]["link"]
            except (KeyError, IndexError):
                file_link = None

            try:
                file_format = result["resources"][0]["file_format"]
            except (KeyError, IndexError):
                file_format = None

            cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
            cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
            cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
            total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
            all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
            all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})

            case_law_results_data.append({
                "page_number": results["serpapi_pagination"]["current"],
                "position": result["position"] + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                "cited_by_count": cited_by_count,
                "cited_by_link": cited_by_link,
                "cited_by_id": cited_by_id,
                "total_versions": total_versions,
                "all_versions_link": all_versions_link,
                "all_versions_id": all_versions_id,
                "file_format": file_format,
                "file_title": file_title,
                "file_link": file_link
            })

        # next page present -> update the search parameters; otherwise exit the loop
        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            break

    return case_law_results_data


def save_case_law_results_to_csv():
    print("Waiting for case law results to save..")
    pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)

    print("Case Law Results Saved.")


if __name__ == "__main__":
    save_case_law_results_to_csv()

Code explanation

Import libraries:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
  • pandas will be used to easily save extracted data to CSV file.
  • urllib will be used in the pagination process.

Create the search parameters, pass them to SerpApi, and create a temporary list to store the extracted data:

params = {
    "api_key": "...",                 # https://serpapi.com/manage-api-key
    "engine": "google_scholar",       # Google Scholar search results
    "q": "minecraft education ",      # search query
    "hl": "en",                       # language
    "start": "0",                     # first page
    "as_sdt": "6"                     # case law results
}
search = GoogleSearch(params)

case_law_results_data = []

as_sdt determines which court(s) an API call targets. Refer to the supported SerpApi Google Scholar Courts list, or select courts on Google Scholar, and pass the resulting value to the as_sdt parameter.

Note: to search results for the Missouri Court of Appeals, the parameter becomes as_sdt=4,204. Pay attention to the leading "4," prefix: without it, article results will appear instead of case law.
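To make the note above concrete, here is a minimal sketch of a helper that builds the as_sdt value from court IDs. The helper name build_as_sdt is hypothetical and not part of the SerpApi client. The only court ID used, 204 for the Missouri Court of Appeals, comes from the note above. Joining several IDs with commas is an assumption you should check against the supported courts list.

```python
def build_as_sdt(court_ids):
    """Join court IDs into an as_sdt value with the required "4," prefix.

    Hypothetical helper for illustration; not part of the SerpApi client.
    """
    if not court_ids:
        raise ValueError("pass at least one court ID")
    # "4" always goes first, followed by the selected court IDs
    return ",".join(["4", *map(str, court_ids)])


missouri_appeals = build_as_sdt([204])
print(missouri_appeals)  # 4,204
```

The returned string can be used as the "as_sdt" value in the params dictionary in place of "6".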

Set up a while loop, add an if statement to be able to exit the loop:

while True:
    results = search.get_dict()

    # exit if there was a backend service error or the search failed
    if "error" in results:
        break

    # extraction code here...

    # if a next page is present -> update the search parameters with the next page values.
    # if a next page is not present -> exit the while loop.
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        break

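urlsplit() splits the next page URL into parts, and parse_qsl() turns its query string into key-value pairs that dict() converts to a dictionary. search.params_dict.update() then merges those values into the parameters of the existing search object, so the next search.get_dict() call requests the next page.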
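The pagination step can be tried in isolation, without an API key. In this sketch the next_url value is made up for illustration (real values come from results["serpapi_pagination"]["next"]), and params_dict is a plain dictionary standing in for search.params_dict.

```python
from urllib.parse import urlsplit, parse_qsl

# Made-up "next" URL for illustration only.
next_url = "https://serpapi.com/search.json?as_sdt=6&engine=google_scholar&hl=en&q=minecraft+education&start=10"

# Plain dict standing in for search.params_dict from the script.
params_dict = {
    "engine": "google_scholar",
    "q": "minecraft education",
    "hl": "en",
    "start": "0",
    "as_sdt": "6",
}

# Take only the query string, then turn its key-value pairs into a dict.
next_params = dict(parse_qsl(urlsplit(next_url).query))

# Overwrite the old values (here only "start" changes) with the next page values.
params_dict.update(next_params)

print(params_dict["start"])  # 10
```

Only keys present in the URL are overwritten, so a key such as api_key that is absent from the URL would stay in the dictionary untouched.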

Extract results in a for loop and handle exceptions:

for result in results["organic_results"]:
    title = result.get("title")
    publication_info_summary = result["publication_info"]["summary"]
    result_id = result.get("result_id")
    link = result.get("link")
    result_type = result.get("type")
    snippet = result.get("snippet")
  
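    # "resources" is missing when a result has no attached file
    try:
        file_title = result["resources"][0]["title"]
    except (KeyError, IndexError):
        file_title = None

    try:
        file_link = result["resources"][0]["link"]
    except (KeyError, IndexError):
        file_link = None

    try:
        file_format = result["resources"][0]["file_format"]
    except (KeyError, IndexError):
        file_format = None

    # if a key is missing, .get() falls back to an empty {} dict instead of raising KeyError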
    cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
    cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
    cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
    total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
    all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
    all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})
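The repeated try/except blocks and chained .get() calls could also be folded into one small helper. This is a sketch of an alternative, not what the script does: the dig function is hypothetical, and the sample result is made up to match the fields used above.

```python
def dig(data, *path, default=None):
    """Follow keys and list indexes in `path`; return `default` if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Made-up result shaped like the fields used above.
result = {
    "title": "Example v. Example",
    "resources": [{"title": "example.org", "file_format": "PDF"}],
    "inline_links": {"cited_by": {"total": 12}},
}

print(dig(result, "resources", 0, "file_format"))        # PDF
print(dig(result, "resources", 0, "link"))               # None
print(dig(result, "inline_links", "versions", "total"))  # None
```

Unlike the .get(..., {}) chains, a missing value comes back as None rather than an empty dictionary, which pandas writes to CSV as an empty cell.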

Append results to temporary list() as a dictionary {}:

case_law_results_data.append({
    "page_number": results['serpapi_pagination']['current'],
    "position": result["position"] + 1,
    "result_type": result_type,
    "title": title,
    "link": link,
    "result_id": result_id,
    "publication_info_summary": publication_info_summary,
    "snippet": snippet,
    "cited_by_count": cited_by_count,
    "cited_by_link": cited_by_link,
    "cited_by_id": cited_by_id,
    "total_versions": total_versions,
    "all_versions_link": all_versions_link,
    "all_versions_id": all_versions_id,
    "file_format": file_format,
    "file_title": file_title,
    "file_link": file_link
})

Return extracted data:

return case_law_results_data

Save the data returned by case_law_results() with to_csv():

pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)
  • The data argument inside DataFrame is your data.
  • The encoding='utf-8' argument just makes sure everything is saved correctly. I used it explicitly even though it's the default value.
  • The index=False argument drops the default pandas row numbers.
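If pandas is not available, a similar file can be produced with the standard library. The sketch below swaps in csv.DictWriter and is not what the script above does. The rows are made up and contain only a subset of the real columns, and an in-memory buffer stands in for the output file so the result can be inspected directly.

```python
import csv
import io

# Made-up rows shaped like a subset of what case_law_results() returns.
rows = [
    {"page_number": 1, "position": 1, "title": "Example v. Example", "file_link": None},
    {"page_number": 1, "position": 2, "title": "Sample v. Sample", "file_link": "https://example.org/file.pdf"},
]

# In-memory stand-in for open("results.csv", "w", encoding="utf-8", newline="").
buffer = io.StringIO()

# Column order follows the keys of the first row; there is no index column to drop.
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

None values are written as empty cells, which matches how pandas writes missing values by default.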
