How to remove the quoted text from an email and only show the new text
Problem description
I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append the new text to the previous email (even if it's a reply).
Typically, you would see:
1st email (start of conversation)
This is the first email
2nd email (reply to first)
This is the second email
Tim said:
This is the first email
The output of this would be "This is the second email" only. Although different email clients quote text differently, if there were some way to get mostly the new email text only, that would also be acceptable.
Recommended answer
I use the following regexes to match the lead-in for quoted text (the last one is the one that counts):
// requires: import java.util.regex.Pattern;
/** general spacers for time and date */
private static final String spacers = "[\\s,/\\.\\-]";
/** matches times */
private static final String timePattern = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";
/** matches day of the week */
private static final String dayPattern = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";
/** matches day of the month (number and st, nd, rd, th) */
private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";
/** matches months (numeric and text) */
private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
"|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";
/** matches years (only 1000's and 2000's, because we are matching emails) */
private static final String yearPattern = "(?:[1-2]?[0-9])[0-9][0-9]";
/** matches a full date */
private static final String datePattern = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
"(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
spacers + "+" + yearPattern;
/** matches a date and time combo (in either order); the outer group keeps the alternation contained when this is concatenated into larger patterns */
private static final String dateTimePattern = "(?:(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
"(?:" + timePattern + "[\\s,]*(?:on)?\\s*" + datePattern + "))";
/** matches a leading line such as
* ----Original Message----
* or simply
* ------------------------
*/
private static final String leadInLine = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";
/** matches a header line indicating the date */
private static final String dateLine = "(?:(?:date)|(?:sent)|(?:time)):\\s*" + dateTimePattern + ".*\n";
/** matches a subject or address line: a header name, a colon, then the rest of the line */
private static final String subjectOrAddressLine = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";
/** matches gmail style quoted text beginning, i.e.
* On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
*/
private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";
/** matches the start of a quoted section of an email: either 2 to 6 header lines (optionally after a lead-in line) or a gmail style attribution */
private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
"(?:(?:" + subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
gmailQuotedTextBeginning + ")"
);
I know that in some ways this is overkill (and might be slow!), but it works pretty well. Please let me know if you find anything that doesn't match this so I can improve it!
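The answer gives the pattern but not how to apply it. A minimal sketch of one way to do that: search the body for the first quoted-text lead-in and keep only what comes before it. The class name `QuotedTextStripper`, the method `stripQuotedText`, and the deliberately simplified `QUOTE_START` pattern are assumptions made for illustration. `QUOTE_START` is a stand-in, not the full `QUOTED_TEXT_BEGINNING` above; swapping in the full pattern should work the same way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: cuts an email body at the first quoted-text lead-in. */
final class QuotedTextStripper {

    // Simplified stand-in pattern (an assumption, not the answer's full regex).
    // Matches a whole line that is one of:
    //   -----Original Message-----
    //   On <anything> wrote:
    //   <name> said:          (the style used in the question; quite loose)
    static final Pattern QUOTE_START = Pattern.compile(
        "(?im)^(?:-+\\s*Original(?:\\sMessage)?\\s*-+|On\\s.*wrote:|.*\\ssaid:)\\s*$");

    /** Returns the text before the first lead-in, or the whole body if none is found. */
    static String stripQuotedText(String body) {
        Matcher m = QUOTE_START.matcher(body);
        String fresh = m.find() ? body.substring(0, m.start()) : body;
        return fresh.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\nTim said:\nThis is the first email\n";
        System.out.println(stripQuotedText(reply)); // prints "This is the second email"
    }
}
```

Everything from the first match onward is discarded, so a lead-in that fires too early (for example, a line of new text that itself ends in "said:") would truncate the reply. The stricter date-aware pattern in the answer reduces that risk.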