Recently at SerpApi, our customers ran into an issue where a query containing certain Unicode characters wouldn't work, for example synergy²-madison-3 (notice the ²). Browsers support this: navigating to https://www.yelp.com/biz/synergy²-madison-3 in Chrome works just fine. The problem was in the underlying library that we use.

This happened in our Yelp Place API, and we have since fixed it. Feel free to check out our playground.

Our Yelp Place API takes a place_id parameter and forms a complete HTTP URL, for example https://www.yelp.com/biz/synergy²-madison-3 where place_id is synergy²-madison-3. We then make an HTTP call to download the data, but this time it didn't work as expected: the response was always 301 - Moved Permanently. It might seem wise to enable follow redirect in the http library (we are on v4), but that only ended in an EndlessRedirectError.

After digging through the library code, it turns out that the http library normalizes the URL before requesting it, for various reasons. Thus, https://www.yelp.com/biz/synergy²-madison-3 became https://www.yelp.com/biz/synergy2-madison-3 after normalization, which is no longer the same URL. Yelp is smart enough to redirect .../biz/synergy2-madison-3 back to .../biz/synergy²-madison-3, hence the 301 - Moved Permanently response. That also explains the EndlessRedirectError after we enabled follow redirect: the http library normalizes the redirect target to .../biz/synergy2-madison-3 again before requesting it.
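To make the loop easier to see, here is a minimal simulation of it. Nothing here touches the network or the real http library: fake_server and fetch are hypothetical stand-ins, and the NFKC step is only assumed to mimic what the old normalization did to the path.

```ruby
CANONICAL_PATH = "/biz/synergy\u00B2-madison-3" # "/biz/synergy²-madison-3"
FOLDED_PATH    = "/biz/synergy2-madison-3"

# Hypothetical stand-in for the server: the folded path is redirected
# back to the canonical path, which is the only one that returns 200.
def fake_server(path)
  case path
  when CANONICAL_PATH then { status: 200 }
  when FOLDED_PATH    then { status: 301, location: CANONICAL_PATH }
  else { status: 404 }
  end
end

# Hypothetical stand-in for the client: when normalize is true, the path
# is NFKC-normalized before every request, including redirect targets.
def fetch(path, normalize:, max_hops: 5)
  max_hops.times do
    requested = normalize ? path.unicode_normalize(:nfkc) : path
    response = fake_server(requested)
    return response[:status] unless response[:status] == 301
    path = response[:location] # follow the redirect
  end
  :endless_redirect
end

puts fetch(CANONICAL_PATH, normalize: true)  # prints "endless_redirect"
puts fetch(CANONICAL_PATH, normalize: false) # prints "200"
```

With normalization on, every hop asks for the folded path and gets sent back to the canonical one, so the client never arrives.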

http v5 uses a more conservative URL normalization, and it resolves our issue: the Unicode character is percent-encoded, as in .../biz/synergy%C2%B2-madison-3, instead of being replaced with a character that changes its meaning.
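For anyone curious where %C2%B2 comes from, it is simply the two UTF-8 bytes of ² written in hex. The sketch below uses a hand-rolled helper, percent_encode_slug, which is a hypothetical name for illustration and not how http or addressable implement it internally.

```ruby
# Percent-encode every character outside the unreserved set
# (letters, digits, "-", ".", "_", "~"), one UTF-8 byte at a time.
def percent_encode_slug(slug)
  slug.gsub(/[^A-Za-z0-9\-._~]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

slug = "synergy\u00B2-madison-3" # "synergy²-madison-3"

puts percent_encode_slug(slug)
# prints "synergy%C2%B2-madison-3"
puts "https://www.yelp.com/biz/#{percent_encode_slug(slug)}"
# prints "https://www.yelp.com/biz/synergy%C2%B2-madison-3"
```

The encoded form still refers to the same ² character, so the server sees the canonical slug and has no reason to redirect.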

Underlying the http library is addressable, where the actual normalization is done (we were on v2.7). The update that fixed our issue is v2.8.2, which changed the Unicode normalization form from NFKC to NFC. Here is a comparison between them:

synergy²-madison-3 in NFKC: synergy2-madison-3
synergy²-madison-3 in NFC: synergy²-madison-3
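The same comparison can be reproduced with nothing but Ruby's built-in String#unicode_normalize. This is only a sketch of the two normalization forms themselves, not of addressable's code path; normalize_slug is a hypothetical helper name.

```ruby
# Thin wrapper around Ruby's built-in Unicode normalization.
def normalize_slug(slug, form)
  slug.unicode_normalize(form)
end

slug = "synergy\u00B2-madison-3" # "synergy²-madison-3"

# NFKC is a compatibility form: it folds "²" into a plain "2".
puts "NFKC: #{normalize_slug(slug, :nfkc)}"
# NFC is a canonical form: it leaves "²" untouched.
puts "NFC:  #{normalize_slug(slug, :nfc)}"
```

NFKC treats ² as a stylistic variant of 2, which is reasonable for search or comparison, but here it turns the slug into a different identifier.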

There is a lot of information out there about which normalization form should be used for URLs and the reasons behind each choice. I am not an expert, so I can't share more about it, but clearly, using NFC resolved our issue. The final solution we went for was to upgrade addressable to the latest v2.8.6, as it only involves a single dependency upgrade.

Why is URI normalization needed?

There are a lot of reasons why normalization is important, and this topic can get very complicated. For search engines such as Google, URI normalization helps prevent duplicate content. If the same content is accessible via multiple URLs, it can dilute the page's ranking or cause confusion about which version is the canonical one.

In terms of security, URI normalization helps in identifying potential vulnerabilities. Malicious actors might use variations of a URI to bypass security controls or deceive users. By normalizing URIs, it's easier to detect and block such attempts.

Normalizing URIs can also help with efficient caching and storage. If multiple representations of the same resource are cached or stored separately, resources are wasted. Normalization ensures that only one canonical representation is used for a given resource.
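As a small illustration of the caching point, the sketch below keys a cache on the normalized URL. It relies on URI#normalize from Ruby's standard library, which lowercases the scheme and host; cache_key and the in-memory Hash are assumptions made for the example, not a description of any real cache.

```ruby
require "uri"

# Use the normalized URL as the cache key so that equivalent
# spellings of the same URL share a single entry.
def cache_key(url)
  URI.parse(url).normalize.to_s
end

cache = {}
[
  "https://www.yelp.com/biz/synergy%C2%B2-madison-3",
  "HTTPS://WWW.YELP.COM/biz/synergy%C2%B2-madison-3"
].each do |url|
  cache[cache_key(url)] ||= "response body for #{url}"
end

puts cache.size       # prints "1"
puts cache.keys.first # prints "https://www.yelp.com/biz/synergy%C2%B2-madison-3"
```

Both spellings end up under one key, so the second request would be a cache hit instead of a second download.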

Conclusion

It is interesting to get to know how things like URL normalization are implemented. It is something we might not often think about, or even take for granted, but I believe it is one of the important building blocks of the web for security, efficiency, and other critical reasons.