Tomcat and UTF-8 encoding

I've spent a fair amount of time trying to get Tomcat, and various apps running within it (Apache Solr, SearchSite, SiteMesh), to handle Chinese text and, by implication, UTF-8 encoded URIs and content.
Given how much effort this has taken from various people, I figured I might as well list what was needed to get it all working, since that's the sensible thing to do at 3:15 am.

SiteMesh first. This fix was supplied by my colleague Li Mo.
Problem: SiteMesh refused to decorate HTML files containing UTF-8 encoded Chinese text without converting all of it to gibberish.
Solution: Create an encoding filter as described below:

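import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetEncodingFilter implements Filter {
    public void init(FilterConfig filterConfig) throws ServletException {
    }

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain)
            throws IOException, ServletException {
        // Set the request encoding before anything downstream reads the parameters or body.
        servletRequest.setCharacterEncoding("UTF-8");
        // Declare every response passing through this filter as HTML encoded in UTF-8.
        servletResponse.setContentType("text/html;charset=UTF-8");

        filterChain.doFilter(servletRequest, servletResponse);
    }

    public void destroy() {
    }
}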


Map this filter to /* in your web.xml.
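
A minimal sketch of that mapping, assuming the filter class lives in a hypothetical package com.example and using charsetEncodingFilter as an arbitrary filter name (neither comes from the original setup):

```xml
<!-- Hypothetical package and filter name; adjust to match your own class. -->
<filter>
    <filter-name>charsetEncodingFilter</filter-name>
    <filter-class>com.example.CharsetEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>charsetEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```

Filters matched by URL pattern run in the order their filter-mapping elements appear in web.xml, so this mapping most likely needs to sit above the SiteMesh filter's mapping for the encoding to be set before decoration happens.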

Apache Solr next. This one was supplied by Ram.
Problem: Chinese characters passed in URL query params were reaching Solr all mangled.
Solution: Add the attribute URIEncoding="UTF-8" to the Connector element in Tomcat's server.xml.
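
To see what this kind of mangling looks like, here is a small standalone sketch using only the JDK. It does not touch Tomcat itself. It assumes the client percent-encodes the query parameter as UTF-8 and compares decoding those bytes as ISO-8859-1 (the URI decoding default in older Tomcat versions) with decoding them as UTF-8. The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UriDecodingDemo {
    // Percent-encode a value as UTF-8, as a client is assumed to do for a query parameter.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available on the JVM
        }
    }

    // Decode a percent-encoded value using the given charset name.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        String encoded = encodeUtf8(original); // %E4%B8%AD%E6%96%87

        String wrong = decode(encoded, "ISO-8859-1"); // each byte becomes its own character
        String right = decode(encoded, "UTF-8");

        System.out.println("encoded: " + encoded);
        System.out.println("ISO-8859-1 decode length: " + wrong.length());
        System.out.println("UTF-8 decode matches original: " + right.equals(original));
    }
}
```

Decoded with the wrong charset, the two characters turn into six unrelated ones, which is the gibberish that shows up downstream.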

SearchSite. This was an odd one.
Problem: I had a servlet that needed to read params containing Chinese characters passed using a GET, and I was getting gibberish. Strangely, this app was running in the same Tomcat as Solr, just in a different context, and Solr was receiving Chinese characters in its GET requests just fine (I'd already applied the previous fix).
I trawled through a lot of interesting articles like this one on JGuru as well as this one. Both seemed to indicate that there is a bug in Tomcat. The post at JGuru even had a comment with the code fix for Tomcat. However, I wasn't desperate enough to actually try to rebuild Tomcat (yet).
Solution: Eventually, I tried Li Mo's fix again (what I'd used for SiteMesh), and it worked. Interestingly, doing a servletRequest.setCharacterEncoding("utf-8") directly in the servlet (as opposed to having this in a filter as described above) has no effect whatsoever. Sure, a servletRequest.getCharacterEncoding() at any point afterwards returns "UTF-8", but your params will still be gibberish.
Why this should be necessary for one app running in a Tomcat instance but not for the other beats me. I did spend a fair amount of time going through the Solr source trying to see if they had some sort of fix, but I couldn't find one. They don't seem to be doing anything magical with the incoming servletRequest: they just do a servletRequest.getParameterMap() and inject it into a MultiMapSolrParams, which provides a few utility getters to read params from the map. Solr doesn't have any Filters configured either. If anyone can tell me what's going on, I'd appreciate it.
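
Two documented behaviours may be relevant, although neither is confirmed as the explanation here. First, the Servlet API documentation says setCharacterEncoding() must be called before request parameters are read or getReader() is called, and has no effect otherwise. If something earlier in the chain has already touched the parameters, a call inside the servlet comes too late. Second, Tomcat decodes the query string using the Connector's URIEncoding setting, and only applies the request's character encoding to it when useBodyEncodingForURI="true" is set. As a last resort, a parameter that was decoded as ISO-8859-1 can sometimes be repaired by reinterpreting its bytes. The sketch below assumes exactly that mis-decoding happened, and the class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class ParamRepairDemo {
    // Turn the string back into the raw bytes it came from, then decode them as UTF-8.
    // Only valid if the value really was UTF-8 that got decoded as ISO-8859-1.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // Simulate a container decoding UTF-8 bytes with the wrong charset.
        String misdecoded = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String repaired = reinterpretAsUtf8(misdecoded);
        System.out.println("repaired matches original: " + repaired.equals(original));
    }
}
```

This treats the symptom rather than the cause, and it corrupts values that were decoded correctly in the first place, so fixing the encoding in the filter or the Connector is the safer route.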