Websites like Bing Image Search and Walmart render pages with JavaScript and deliver page content via JSON APIs. While it's possible to scrape dynamic web pages using browser automation, I prefer fetching data from the API endpoints directly. It's usually (not always) faster and more reliable.

I was debugging Bing Image Search to help implement our new Bing Reverse Image Search API. Initially, I used mitmproxy because Ctrl+Shift+F in the browser dev tools didn't find the request. Then I figured out how to filter network requests in the browser dev tools, examined the response, and drafted a data adapter.

Algorithms to reverse engineer the JSON API of an SPA

I've used two ways to reverse engineer the JSON API behind Bing Image Search: mitmproxy and browser developer tools. I start with the devtools process because it's used more often.

Browser devtools

  1. Ctrl+F in the Network tab of browser dev tools.

image

  2. Go to the Preview tab of the JSON response.
  3. Expand the JS object recursively (my Brave Browser doesn't search in collapsed JSON 😕).

image

  4. Ctrl+F the target string.

image

  5. Copy the property path.

image

  6. Navigate up and down in the JS object (with arrow keys) to learn its structure and create an adapter.
  7. Copy the request as cURL and transform the response with jq to check my assumptions.
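
The expand, search, and copy-path steps above can also be approximated offline once the response body is saved. The following is a minimal Python sketch, not the tooling used in this post: `find_paths`, `adapt`, and the `SAMPLE` payload are hypothetical names and data made up for illustration, and the real Bing response is shaped differently. It walks a parsed JSON document, reports the property path of every string that contains the target text, and then applies a draft adapter that keeps only the fields of interest.

```python
import json


def find_paths(node, needle, path=""):
    """Yield (property_path, value) for each string leaf containing `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from find_paths(value, needle, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        # Case-sensitive substring match, like a plain Ctrl+F on the text.
        yield path, node


# Hypothetical response body, invented for this example.
SAMPLE = json.loads("""
{
  "results": [
    {"title": "Freshsales CRM", "link": "https://example.com/a", "internal_id": 1},
    {"title": "Other result", "link": "https://example.com/b", "internal_id": 2}
  ],
  "meta": {"query": "freshsales crm"}
}
""")


def adapt(response):
    """Draft adapter: keep only the fields located with find_paths."""
    return [
        {"title": item["title"], "link": item["link"]}
        for item in response.get("results", [])
    ]


matches = list(find_paths(SAMPLE, "Freshsales"))
for property_path, value in matches:
    print(property_path, "=>", value)

print(adapt(SAMPLE))
```

The printed path plays the same role as "Copy property path" in dev tools: it tells the adapter which keys to read. Keys that contain dots or brackets would make these paths ambiguous, so treat the output as a hint rather than a strict selector.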

mitmproxy

Ctrl+Shift+F in the browser dev tools no longer searches across all responses.

image

I proxied the browser's network connections through mitmproxy, then filtered response bodies with ~bs "TEXT_FROM_THE_HTML_ELEMENT_I_AM_LOOKING_FOR".
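
To make the idea behind the ~bs filter concrete, here is a small Python sketch of the same kind of check: search each response body with a regular expression and keep the requests that match. It is an illustration only and not mitmproxy's implementation; `filter_by_response_body` and the `EXCHANGES` data are hypothetical, and the exact matching flags mitmproxy applies are not reproduced here.

```python
import re


def filter_by_response_body(exchanges, pattern):
    """Return the URLs whose raw response body matches the regex `pattern`."""
    regex = re.compile(pattern.encode())  # bodies are bytes, so match on bytes
    return [url for url, body in exchanges if regex.search(body)]


# Hypothetical captured traffic: (request URL, raw response body) pairs.
EXCHANGES = [
    ("https://example.com/page", b"<html><body>app shell</body></html>"),
    ("https://example.com/api/items", b'{"title": "Freshsales CRM"}'),
    ("https://example.com/logo.png", b"\x89PNG\r\n"),
]

matched = filter_by_response_body(EXCHANGES, "Freshsales")
print(matched)
```

Only the JSON API request survives the filter, which is the request worth inspecting further.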

  1. Start mitmproxy with a view filter
$ mitmproxy --view-filter '~bs "Freshsales"'
  2. Start a Chromium-based browser with the target URL and the following flags and parameters
  • Proxy requests via mitmproxy: --proxy-server='http://127.0.0.1:8080'.
  • Use incognito mode (1) with a temporary user profile (2), ignoring insecure connections (3) and certificate errors (4): --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost. (I ignore certificate errors in a temporary browser profile to avoid installing mitmproxy's certificates system-wide.)
$ brave-browser 'https://www.bing.com/images/search?view=detailV2&insightstoken=bcid_RLKVsIV2BwkFXg*ccid_spWwhXYH&form=SBIHMP&iss=SBIUPLOADGET&sbisrc=ImgPicker&idpbck=1&sbifsz=927+x+524+%c2%b7+25.15+kB+%c2%b7+png&sbifnm=serpapi-serpbear.png&thw=927&thh=524&ptime=223&dlen=34344&expw=798&exph=451&selectedindex=0&id=-1051855017&ccid=spWwhXYH&vt=2&sim=11' --proxy-server='http://127.0.0.1:8080'  --temp-profile -incognito --user-data-dir="`mktemp -d`" --no-first-run --ignore-certificate-errors --allow-insecure-localhost

image

  3. mitmproxy displays the matched requests

image
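
If the browser is launched from a script rather than typed by hand, the same command can be assembled as an argument list, which avoids shell quoting problems with the long URL. This is a rough sketch assuming the `brave-browser` binary and the flags shown above; `build_browser_command` is a hypothetical helper, the example URL is shortened, and `tempfile.mkdtemp()` stands in for `mktemp -d`.

```python
import shlex
import tempfile


def build_browser_command(url, proxy="http://127.0.0.1:8080", binary="brave-browser"):
    """Build the argv list for a throwaway, proxied browser session."""
    profile_dir = tempfile.mkdtemp()  # fresh profile directory, like `mktemp -d`
    return [
        binary,
        url,
        f"--proxy-server={proxy}",          # send traffic through mitmproxy
        "--temp-profile",
        "-incognito",
        f"--user-data-dir={profile_dir}",
        "--no-first-run",
        "--ignore-certificate-errors",      # acceptable only in a throwaway profile
        "--allow-insecure-localhost",
    ]


command = build_browser_command("https://www.bing.com/images/search?q=example")
print(shlex.join(command))
# To actually launch the browser: subprocess.run(command, check=False)
```

Printing the joined command first makes it easy to compare against the hand-written one before launching anything.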

Conclusion

mitmproxy can be used to find the HTTP request with the needed data, in addition to browser dev tools. At some point, I'll explore tcpdump and Wireshark to reverse engineer websites for web scraping and share the findings with you.

If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to reach out via Twitter at @ilyazub_ or @serp_api, or on Mastodon at @iz.