Converting RTF to HTML From VS.NET
Monday, July 14, 2008 | Brock Weaver
So you write this great section of .NET code that addresses a need lots of people are having. You want to share it with the world so others don't fumble around as much as you did. Or maybe you just need to document it in an HTML-based repository. You go to post the code on your blog or website -- and realize that when you copy from VS.NET, what you get is Microsoft's RTF format instead of the HTML you need.
You google around a bit, find a whole bunch of tools to do it, but don't like any of them because they're too complex, cost you money, you don't want to submit your code to another company, they require you to install something, or they're simply not what you had in mind.
So I sat down and decided to write a very rough-and-tumble version of an RTF-to-HTML conversion function simply for copying code from VS.NET. After about 10 minutes of looking at the latest RTF spec and about 20 minutes of coding, I ended up with the following method:
public static string RtfToHtml(string rtf) {
// escape quotes, apostrophes, less than, greater than, and ampersands
string output = (String.IsNullOrEmpty(rtf) ? "" :
rtf.Replace("&", "&amp;")
.Replace("<", "&lt;")
.Replace(">", "&gt;")
// .Replace("'", "&apos;")
.Replace("\"", "&quot;"));
// throw out line breaks
output = output.Replace("\r\n", "");
// create the color mappings (yep this is a total kludge)
Regex reColor = new Regex(@"\\red([0-9]{1,3})\\green([0-9]{1,3})\\blue([0-9]{1,3})");
MatchCollection mcColor = reColor.Matches(output);
List<string> colors = new List<string>();
colors.Add("#000000");
foreach (Match mColor in mcColor) {
string redHex = int.Parse(mColor.Groups[1].Value).ToString("X2");
string greenHex = int.Parse(mColor.Groups[2].Value).ToString("X2");
string blueHex = int.Parse(mColor.Groups[3].Value).ToString("X2");
colors.Add("#" + redHex + greenHex + blueHex);
}
// rip off everything before first font size statement (\fsN, where N is in half-points)
Regex reFontSize = new Regex(@"\\fs([0-9]+) ");
Match mFontSize = reFontSize.Match(output);
if (mFontSize.Success) {
int fontStart = mFontSize.Index;
int fontSize = int.Parse(mFontSize.Groups[1].Value) / 2;
output = output.Substring(fontStart + mFontSize.Value.Length);
// find control groups
MatchCollection mcControlGroups = Regex.Matches(output, @"[^\\]{\\([^ ]+) ([^}]*?)}");
foreach (Match mControlGroup in mcControlGroups) {
string fullString = mControlGroup.Value;
string tag = mControlGroup.Groups[1].Value;
string withoutFormatting = mControlGroup.Groups[2].Value;
string newString = "";
switch (tag) {
case "tab":
newString = "&nbsp;&nbsp;&nbsp;&nbsp;" + withoutFormatting; // assumption: four non-breaking spaces per tab
break;
case "par":
newString = withoutFormatting + "<br />";
break;
case "b":
newString = "<b>" + withoutFormatting + "</b>";
break;
case "i":
newString = "<i>" + withoutFormatting + "</i>";
break;
}
// we don't capture the first char, but we need to match on it so we
// don't grab unescaped brackets -- so we lop it off when we do the replace
// and just leave the existing value
output = output.Replace(fullString.Substring(1), newString);
}
// replace rudimentary rtf spacing and formatting with html equivalents
output = output.Replace(@"\tab ", "&nbsp;&nbsp;&nbsp;&nbsp;"); // assumption: four non-breaking spaces per tab
output = output.Replace(@"\par ", "<br />");
// unescape rtf-escaped {} chars
output = output.Replace(@"\{", "{").Replace(@"\}", "}");
// replace embedded color tags with their mapping equivalents
for(int i=0;i<colors.Count;i++){
output = output.Replace(@"\cf" + i + " ", "</span><span style='color:" + colors[i] + "'>");
}
// yank off the invalid close span generated from above replacement
const string closeSpan = "</span>";
int firstSpan = output.IndexOf(closeSpan);
if (firstSpan > -1) {
output = output.Remove(firstSpan, closeSpan.Length);
}
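// since we fudged the header stuff (see the fs parsing above -- we just cut it off),
// the } that closes the outermost rtf group is still dangling at the end, so drop it
if (output.Length > 0) {
output = output.Substring(0, output.Length - 1);
}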
// throw in \r\n so page source is easier to read
output = output.Replace("<br />", "<br />\r\n");
// suppress double-escaping
output = output.Replace(@"\\", @"\");
// apply font family to whole string. Yeah I punted on reading it from the input.
// This is intended to be somewhat quick-and-dirty, so I figured why not :-)
output = @"<span style=""font-family:'Courier New'; font-size:" + fontSize + @"pt"">" + output + "</span></span>";
}
return output;
}
Notice there are tons of hacks in there -- I'm ignoring font family and hard coding Courier New, lopping off most everything before the \fs control tag, etc. But for what I needed, this got me to where I could copy C# code and paste it as HTML. You'll need to add using System.Text.RegularExpressions; but otherwise the code should run as-is when plopped into the standard .cs template file.
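The color mapping is probably the least obvious of those hacks, so here is a rough standalone sketch of just that step. The class and method names (ColorTableSketch, BuildColorMap) and the sample color table are made up for illustration; the regex is the same one used in the method above, and index 0 is assumed to be black just as it is there.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ColorTableSketch
{
    // Hypothetical helper: collects every \redN\greenN\blueN triple, in order,
    // as an HTML hex color. Index 0 is reserved for the default color, which
    // this sketch assumes is black.
    public static List<string> BuildColorMap(string rtf)
    {
        List<string> colors = new List<string>();
        colors.Add("#000000");

        Regex reColor = new Regex(@"\\red([0-9]{1,3})\\green([0-9]{1,3})\\blue([0-9]{1,3})");
        foreach (Match m in reColor.Matches(rtf))
        {
            int red = int.Parse(m.Groups[1].Value);
            int green = int.Parse(m.Groups[2].Value);
            int blue = int.Parse(m.Groups[3].Value);
            // "X2" formats each channel as two uppercase hex digits
            colors.Add("#" + red.ToString("X2") + green.ToString("X2") + blue.ToString("X2"));
        }
        return colors;
    }

    public static void Main()
    {
        // Made-up sample color table, not captured from a real clipboard
        string header = @"{\colortbl;\red0\green0\blue255;\red43\green145\blue175;}";
        List<string> colors = BuildColorMap(header);
        for (int i = 0; i < colors.Count; i++)
        {
            Console.WriteLine(@"\cf" + i + " -> " + colors[i]);
        }
        // prints:
        // \cf0 -> #000000
        // \cf1 -> #0000FF
        // \cf2 -> #2B91AF
    }
}
```

Each \cfN in the body then maps to colors[N], which is why the method can get away with a plain string replace per index.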
NOTE! If you uncomment the line near the top of the method where the apostrophe gets escaped to &apos;, it will not display properly in IE.
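If you do want apostrophes escaped, one way around the IE problem is to use the numeric reference &#39; instead of the named entity. As far as I know, &apos; comes from XML and was never defined in HTML 4, which is why older IE versions show it literally. The sketch below is a minimal standalone version of the escaping step with that swap; EscapeSketch and EscapeForHtml are hypothetical names, not part of the method above.

```csharp
using System;

public static class EscapeSketch
{
    // Hypothetical helper: the same escaping as the top of RtfToHtml, but with
    // the apostrophe written as a numeric character reference.
    public static string EscapeForHtml(string text)
    {
        if (String.IsNullOrEmpty(text))
        {
            return "";
        }
        return text.Replace("&", "&amp;")   // must run first so the entities added below are not re-escaped
                   .Replace("<", "&lt;")
                   .Replace(">", "&gt;")
                   .Replace("'", "&#39;")
                   .Replace("\"", "&quot;");
    }

    public static void Main()
    {
        Console.WriteLine(EscapeForHtml("if (c == '\"' && a < b)"));
        // prints: if (c == &#39;&quot;&#39; &amp;&amp; a &lt; b)
    }
}
```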