Regex to remove DOCTYPE prolog

While using HTML Tidy, I needed to remove the DOCTYPE prolog to prevent an
‘org.xml.sax.SAXParseException: Already seen doctype.’ exception.

The regex is quite simple. The only catch is that the match has to be able to span line breaks (\n and \r) and has to be non-greedy, so it stops at the end of the declaration instead of running on through the document.

 convertedData = convertedData.replaceAll("<!DOCTYPE((.|\n|\r)*?)\">", "");

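This will consume multiline as well as single-line declarations. Note that the pattern expects the declaration to end with a quoted string followed by >, so it will not match the short form <!DOCTYPE html>.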

/*		
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
*/
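For anyone who wants something to paste in and run, here is a minimal sketch of the same idea as a small helper. The class and method names (DoctypeStripper, stripDoctype) are made up for illustration and are not from HTML Tidy or any library. It swaps in a slightly different pattern, <!DOCTYPE[^>]*>, which also matches line breaks and the short <!DOCTYPE html> form. It assumes the declaration has no internal DTD subset, since that could contain a > before the declaration actually ends.

```java
import java.util.regex.Pattern;

final class DoctypeStripper {
    // Hypothetical helper, not part of the original post.
    // [^>]* matches any character except '>', line breaks included, so
    // multiline declarations are covered without the (.|\n|\r) alternation.
    // Assumption: no internal DTD subset ("[ ... ]") inside the declaration.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the markup with every DOCTYPE declaration removed.
    static String stripDoctype(String markup) {
        return DOCTYPE.matcher(markup).replaceAll("");
    }

    public static void main(String[] args) {
        String multiLine =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                + "<html></html>";
        String singleLine = "<!DOCTYPE html><html></html>";

        System.out.println(stripDoctype(multiLine).trim());
        System.out.println(stripDoctype(singleLine));
    }
}
```

The string returned by stripDoctype can then be handed to whatever parser was throwing the exception. If the parsing happens inside a packaged .jar, this only helps when you control the text before it reaches that library.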

1 thought on “Regex to remove DOCTYPE prolog”

  1. Hi Greg,
    I am using a Java library in my Android project, and I am getting exactly this exception. Its code is packaged as a .jar that uses SAXParser. How can I use your solution above to handle this? Can you please give me a concrete example in Java?
    Thanks
    Regards..
