RFC3986 - which pchars need to be percent-encoded?

I need to generate a href to a URI. All easy with the exception when it comes to reserved characters which need percent-encoding, e.g. link to /some/path;element should appear as (I know that path;element represents a single entity).

Initially I was looking for a Java library that does this but I ended up writing something myself (look below for what failed with Java, as this question isn't Java-specific).

So, RFC 3986 does suggest when NOT to encode. This should happen, as I read it, when character falls under unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~") class. So far so good. But what about the opposite case? RFC only mentions that percent (%) always needs encoding. But what about the others?

Question: is it correct to assume that everything that is not unreserved, can/should be percent-encoded? For example, opening bracket ( does not necessarily need encoding but semicolon ; does. If I don't encode it I end up looking for /first* when following . But following I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as RFC goes. As I imagine, encoding everything non-unreserved is a safe bet, but what about SEOability, user friendliness when it comes to localized URIs?

Now, what failed with Java libs. I have tried doing it like
new java.net.URI("http", "site", "/pa;th", null).toASCIISTring()
but this gives http://site/pa;th which is no good. Similar results observed with:

  • javax.ws.rs.core.UriBuilder
  • Spring's UriUtils - I have tried both encodePath(String, String) and encodePathSegment(String, String)

[*] /first is a result of call to HttpServletRequest.getServletPath() in the server side when clicking on

EDIT: I probably need to mention that this behaviour was observed under Tomcat, and I have checked both Tomcat 6 and 7 behave the same way.

10
задан mindas 9 May 2011 в 13:26
поделиться