How to remove the quoted text from an email and only show the new text
I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append the new text to the previous email (even if it's a reply).
1st email (start of conversation)
This is the first email
2nd email (reply to first)
This is the second email
Tim said:
This is the first email
The output for the second email should be only "This is the second email". Different email clients quote text differently, so a way to get mostly just the new text would also be acceptable.
I use the following regexes to match the lead-in for quoted text (the last one is the one that counts):
import java.util.regex.Pattern;

/** general spacers for time and date */
private static final String spacers = "[\\s,/\\.\\-]";
/** matches times */
private static final String timePattern = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";
/** matches day of the week */
private static final String dayPattern = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";
/** matches day of the month (number and st, nd, rd, th) */
private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";
/** matches months (numeric and text) */
// NOTE: the alternatives after July (Aug-Dec and the numeric month) are an assumed completion.
private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
"|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";
/** matches years (only 1000's and 2000's, because we are matching emails) */
private static final String yearPattern = "(?:[1-2]?[0-9])[0-9][0-9]";
/** matches a full date */
private static final String datePattern = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
"(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
spacers + "+" + yearPattern;
/** matches a date and time combo (in either order) */
// Wrapped in an outer group so the alternation stays contained when concatenated below.
private static final String dateTimePattern = "(?:(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
"(?:" + timePattern + "[\\s,]*(?:on)?\\s*" + datePattern + "))";
/** matches a leading line such as
 * ----Original Message----
 * or simply
 * ------------------------
 */
// NOTE: the trailing \n on this and the following line patterns is an assumed completion.
private static final String leadInLine = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";
/** matches a header line indicating the date */
private static final String dateLine = "(?:(?:date)|(?:sent)|(?:time)):\\s*" + dateTimePattern + ".*\n";
/** matches a subject or address line */
// The keyword must be followed by a colon (a stray '|' before the ':' has been removed).
private static final String subjectOrAddressLine = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";
/** matches gmail style quoted text beginning, i.e.
 * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
 */
private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";
/** matches the start of a quoted section of an email */
private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
"(?:(?:" + subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
gmailQuotedTextBeginning + ")"
);
I know that in some ways this is overkill (and might be slow!) but it works pretty well. Please let me know if you find anything that doesn't match this so I can improve it!
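To apply a pattern like this, one option is to search for the first lead-in and keep only the text before it. The sketch below is a minimal, self-contained illustration rather than the full pattern above: the class `QuotedTextStripper`, the method `stripQuotedText`, and the simplified `QUOTE_START` regex are hypothetical names introduced here, and the regex only recognises an "Original Message" separator or a line ending in "wrote:" or "said:".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedTextStripper {

    // Simplified stand-in for QUOTED_TEXT_BEGINNING (an assumption for this sketch):
    // either a "----Original Message----" separator line,
    // or any line that ends in "wrote:" or "said:".
    private static final Pattern QUOTE_START = Pattern.compile(
            "(?im)^(?:-+\\s*Original\\sMessage\\s*-+|.*\\b(?:wrote|said):)\\s*$");

    /** Returns the text before the first quoted-text lead-in, or the whole body if none is found. */
    static String stripQuotedText(String body) {
        Matcher m = QUOTE_START.matcher(body);
        // Everything from the start of the lead-in line onwards is treated as quoted text.
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\nTim said:\nThis is the first email\n";
        System.out.println(stripQuotedText(reply)); // prints "This is the second email"
    }
}
```

Because any line ending in "said:" or "wrote:" counts as a lead-in here, this simplified regex can cut off genuine new text. Substituting the stricter `QUOTED_TEXT_BEGINNING` pattern should reduce such false positives; only the `find()` and `substring` step would stay the same.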