Tomcat and UTF-8 encoding

I've spent a fair amount of time trying to get Tomcat and the various apps running within it (Apache Solr, SearchSite, Sitemesh) to handle Chinese text, and by implication UTF-8 encoded URIs and content.
Given how much effort this has taken from various people, I figured I might as well list everything needed to get it all working, since that's the sensible thing to do at 3:15 am.

Sitemesh first. This fix was supplied by my colleague Li Mo.
Problem: Sitemesh would not decorate HTML files containing UTF-8 encoded Chinese text without converting all of it to gibberish.
Solution: Create an encoding filter as described below.

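import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetEncodingFilter implements Filter {
    public void init(FilterConfig filterConfig) throws ServletException {
        // Nothing to initialise.
    }

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain)
            throws IOException, ServletException {
        // Set the encoding before anything downstream reads the request parameters.
        servletRequest.setCharacterEncoding("UTF-8");
        // Note: this marks every response passing through the filter as UTF-8 HTML.
        servletResponse.setContentType("text/html;charset=UTF-8");

        filterChain.doFilter(servletRequest, servletResponse);
    }

    public void destroy() {
        // Nothing to clean up.
    }
}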


Map this filter to /* in your web.xml.

Apache Solr next. This one supplied by Ram.
Problem: Chinese characters passed in URL query params were reaching Solr all mangled.
Solution: Add the attribute URIEncoding="UTF-8" to the Connector element in Tomcat's server.xml.
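
To see why the attribute matters, here is a minimal sketch of the mangling outside Tomcat. It assumes the connector was decoding the query string as ISO-8859-1 (the default in older Tomcat versions) while the browser sent UTF-8 bytes. The class name UriDecodingDemo and the sample value are made up for illustration, and it uses java.net.URLDecoder as a stand-in for Tomcat's own decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriDecodingDemo {
    // Decodes a percent-encoded query value using the given charset name.
    static String decode(String encoded, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Percent-encoded UTF-8 bytes of the two characters U+4E2D U+6587.
        String encoded = "%E4%B8%AD%E6%96%87";

        // Decoded as UTF-8: the original two Chinese characters.
        String good = decode(encoded, "UTF-8");
        // Decoded as ISO-8859-1: each of the six bytes becomes its own character.
        String mangled = decode(encoded, "ISO-8859-1");

        System.out.println("UTF-8 length:      " + good.length());
        System.out.println("ISO-8859-1 length: " + mangled.length());
    }
}
```

The same six bytes come out as two characters or as six, depending only on which charset the decoder assumes, which is roughly what URIEncoding controls for the query string.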

SearchSite. This was an odd one.
Problem: I had a servlet within which I needed to read params containing Chinese characters passed using a GET. I was getting gibberish. Strangely, this app was running in the same Tomcat as Solr, just in a different context. And Solr was receiving Chinese characters in its GET requests just fine (I'd already applied the previous fix).
I trawled through a lot of interesting articles like this one on JGuru as well as this one. Both seemed to indicate that there is a bug in Tomcat. The post at JGuru even had a comment with the code fix for Tomcat. However I wasn't desperate enough to actually try to rebuild Tomcat (yet).
Eventually, I tried Li Mo's fix again (what I'd used for SiteMesh) - and it worked. Interestingly, doing a servletRequest.setCharacterEncoding("utf-8") directly in the servlet (as opposed to having this in a filter as described above) has no effect whatsoever. Sure, a servletRequest.getCharacterEncoding() at any point after returns "UTF-8" but your params will still be gibberish.
Why this should be necessary for one app running in a Tomcat instance but not the other beats me. I did spend a fair amount of time going through the Solr source trying to see if they had some sort of fix, but I couldn't find one. They don't seem to be doing anything magical with the incoming servletRequest - they just do a servletRequest.getParameterMap() and inject it into a MultiMapSolrParams, which provides a few utility getters to read params from the map. Solr doesn't have any Filters configured either. If anyone can tell me what's going on, I'd appreciate it.
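
For what it's worth, the servlet API docs say setCharacterEncoding must be called before the request parameters are read, otherwise it has no effect. The sketch below is a toy model of that parse-once behaviour, not Tomcat's actual code: LazyParamRequest, its ISO-8859-1 default and the sample query string are all assumptions made for illustration, and it glosses over the difference between how a container decodes the query string and the request body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a servlet request; NOT Tomcat's implementation.
public class LazyParamRequest {
    private final String rawQuery;
    private String encoding = "ISO-8859-1"; // assumed container default
    private Map<String, String> params;     // stays null until first access

    public LazyParamRequest(String rawQuery) {
        this.rawQuery = rawQuery;
    }

    public void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    public String getCharacterEncoding() {
        return encoding;
    }

    // Parses the query string once, on first access, with whatever
    // encoding is set at that moment; later calls reuse the cached map.
    public String getParameter(String name) throws UnsupportedEncodingException {
        if (params == null) {
            params = new HashMap<>();
            for (String pair : rawQuery.split("&")) {
                String[] kv = pair.split("=", 2);
                String value = kv.length > 1 ? kv[1] : "";
                params.put(URLDecoder.decode(kv[0], encoding),
                           URLDecoder.decode(value, encoding));
            }
        }
        return params.get(name);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String query = "q=%E4%B8%AD%E6%96%87"; // UTF-8 bytes of U+4E2D U+6587

        // Encoding set before the first read, as a filter would do.
        LazyParamRequest early = new LazyParamRequest(query);
        early.setCharacterEncoding("UTF-8");
        System.out.println("early length: " + early.getParameter("q").length());

        // Something reads a parameter first; the later call is too late.
        LazyParamRequest late = new LazyParamRequest(query);
        late.getParameter("q");
        late.setCharacterEncoding("UTF-8");
        System.out.println("late length:  " + late.getParameter("q").length());
        System.out.println("late reports: " + late.getCharacterEncoding());
    }
}
```

In this model the late request still reports the new encoding from getCharacterEncoding() while its cached parameters stay mangled, which matches the symptom described above.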

7 comments:

Anonymous said...

Once the request parameters are accessed, the encoding is 'determined' for that request -- that's why trying to set it in the servlet wasn't working -- it was 'too late'. The filter was early enough and thus worked. My guess is that the innocent-looking call to getParameterMap was what was causing the encoding to be established for the request.

Anonymous said...

Thanks for the tip about Solr / server.xml / URIEncoding="UTF-8" . That was my problem.

Anonymous said...

Thank you very much. I've searched a couple of hours today to find a similar encoding related problem. The URIEncoding="UTF-8" solved it.

Anonymous said...

This comment about solr is very useful and still is :)

Anonymous said...

Thank you very much for taking the time to share this.

aaron jeffries said...

This blog keeps on giving. thanks.

Nestor Miliyaev said...

Spent a whole day hunting around for the solution. I use Spring+jsp+Tomcat. Your solution nailed it!
Thanks thanks thanks!

Use the following to wire it in web.xml, then stick the URIEncoding="UTF-8" in the server.xml

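<filter>
    <filter-name>charsetFilter</filter-name>
    <filter-class>uk.ac.ed.vfb.servlets.CharsetEncodingFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>charsetFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>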


That was the essence of it.
Optionally (although it is considered good style), set

<%@ page language="java" pageEncoding="UTF-8" contentType="text/html;charset=utf-8" isELIgnored="false"%>
and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
That said, the servlet seems to set the content type correctly, and most sane browsers pick that up without a problem.