Given the amount of effort it has taken on the parts of various people, I figured I might as well list the stuff needed to get it all working since it's the sensible thing to do at 3:15 am.
SiteMesh first. This fix was supplied by my colleague Li Mo.
Problem: SiteMesh would not decorate HTML files containing UTF-8 encoded Chinese text without turning all of it into gibberish.
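Solution: Create an encoding filter as described below:
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetEncodingFilter implements Filter {
    public void init(FilterConfig filterConfig) throws ServletException {
    }
    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException {
        // Must run before anything downstream reads the request parameters.
        servletRequest.setCharacterEncoding("UTF-8");
        // Note: this labels every response passing through the filter as HTML.
        servletResponse.setContentType("text/html;charset=UTF-8");
        filterChain.doFilter(servletRequest, servletResponse);
    }
    public void destroy() {
    }
}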
Map this filter to /* in your web.xml.
Apache Solr next. This one supplied by Ram.
Problem: Chinese characters passed in URL query params were reaching Solr all mangled
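This kind of mangling can be reproduced outside Tomcat. The following is only a minimal sketch: it assumes the browser percent-encodes the query value as UTF-8 bytes and the container decodes it as ISO-8859-1 (the default in older Tomcat versions). The class name QueryDecodingDemo and the sample text are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; it only imitates what a browser and a container do.
public class QueryDecodingDemo {

    // Percent-encode the text as UTF-8 bytes, as a browser typically would.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode the query value with the given charset, standing in for the container.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String encoded = encode(original);
        System.out.println(encoded); // %E4%B8%AD%E6%96%87

        // Assumed container default: each UTF-8 byte becomes its own character.
        String garbled = decode(encoded, "ISO-8859-1");
        System.out.println(garbled.equals(original)); // false

        // Decoding with the charset the sender used restores the text.
        String restored = decode(encoded, "UTF-8");
        System.out.println(restored.equals(original)); // true
    }
}
```

Decoding with the wrong charset yields one junk character per byte (six here instead of two), which is the gibberish Solr was seeing.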
Solution: Add the attribute URIEncoding="UTF-8" to the Connector element in Tomcat's server.xml.
SearchSite. This was an odd one.
Problem: I had a servlet in which I needed to read params containing Chinese characters passed using a GET. I was getting gibberish. Strangely, this app was running in the same Tomcat as Solr, just in a different context, and Solr was receiving Chinese characters in its GET requests just fine (I'd already applied the previous fix). I trawled through a lot of interesting articles like this one on JGuru as well as this one. Both seemed to indicate that there is a bug in Tomcat. The post at JGuru even had a comment with the code fix for Tomcat. However, I wasn't desperate enough to actually try to rebuild Tomcat (yet). Eventually, I tried Li Mo's fix again (what I'd used for SiteMesh) and it worked.
Interestingly, doing a servletRequest.setCharacterEncoding("utf-8") directly in the servlet (as opposed to having this in a filter as described above) has no effect whatsoever. Sure, a servletRequest.getCharacterEncoding() at any point afterwards returns "UTF-8", but your params will still be gibberish.
Why this should be necessary for one app running in a Tomcat instance but not the other beats me. I did spend a fair amount of time going through the Solr source trying to see if they had some sort of fix, but I couldn't find one. They don't seem to be doing anything magical with the incoming servletRequest: they just do a servletRequest.getParameterMap() and inject it into a MultiMapSolrParams, which provides a few utility getters to read params from the map. Solr doesn't have any Filters configured either. If anyone can tell me what's going on, I'd appreciate it.
7 comments:
Once the request parameters are accessed, the encoding is 'determined' for that request. That's why trying to set it in the servlet wasn't working: it was 'too late'. The filter was early enough and thus worked. My guess is that the innocent-looking call to getParameterMap was what was causing the encoding to be established for the request.
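The timing issue described in this comment can be illustrated with a toy class. This is only a sketch under stated assumptions: LazyParamRequest is a hypothetical stand-in, not Tomcat's actual request implementation, and it assumes parameters are decoded once on first access using whatever encoding is set at that moment, with ISO-8859-1 as the default.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical stand-in for a servlet request; NOT the container's real code.
public class LazyParamRequest {
    private final String rawValue;           // percent-encoded value as received
    private String encoding = "ISO-8859-1";  // assumed container default
    private String parsedValue;              // cached on first access

    public LazyParamRequest(String rawValue) {
        this.rawValue = rawValue;
    }

    public void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    public String getCharacterEncoding() {
        return encoding;
    }

    // Decodes on the first call only; later encoding changes do not re-parse.
    public String getParameter() {
        if (parsedValue == null) {
            try {
                parsedValue = URLDecoder.decode(rawValue, encoding);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return parsedValue;
    }

    public static void main(String[] args) {
        String raw = "%E4%B8%AD%E6%96%87"; // UTF-8 bytes of two Chinese characters

        // Filter-style: encoding is set before anything reads the parameter.
        LazyParamRequest early = new LazyParamRequest(raw);
        early.setCharacterEncoding("UTF-8");
        System.out.println(early.getParameter().equals("\u4e2d\u6587")); // true

        // Too late: something already read the parameter with the default.
        LazyParamRequest late = new LazyParamRequest(raw);
        late.getParameter();
        late.setCharacterEncoding("UTF-8");
        System.out.println(late.getCharacterEncoding());                 // UTF-8
        System.out.println(late.getParameter().equals("\u4e2d\u6587")); // false
    }
}
```

The second case mirrors what the post observed: the getter reports the new encoding, yet the already-parsed parameter stays garbled.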
Thanks for the tip about Solr / server.xml / URIEncoding="UTF-8" . That was my problem.
Thank you very much. I've searched a couple of hours today to find a similar encoding related problem. The URIEncoding="UTF-8" solved it.
This comment about solr is very useful and still is :)
Thank you very much for taking the time to share this.
This blog keeps on giving. thanks.
Spent a whole day hunting around for the solution. I use Spring+jsp+Tomcat. Your solution nailed it!
Thanks thanks thanks!
Use the following to wire it in web.xml, then stick the URIEncoding="UTF-8" in the server.xml
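<filter>
    <filter-name>charsetFilter</filter-name>
    <filter-class>uk.ac.ed.vfb.servlets.CharsetEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>charsetFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>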
That was the essence of it.
Optionally (although considered a good style), set
<%@ page language="java" pageEncoding="UTF-8" contentType="text/html;charset=utf-8" isELIgnored="false"%>
and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Although it seems the servlet sets the content type correctly which most sane browsers pick up no problem.