Programming/Java
Removing HTML tags with regular expressions: reading only the plain text
ehdtnn
2016. 10. 30. 22:52
http://blog.acronym.co.kr/227
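import java.util.regex.Pattern;

// Patterns are compiled once and reused; compiled Pattern objects are immutable and thread-safe.
// Tag names are matched case-insensitively, so <SCRIPT> and <Style> are also removed.
private static final Pattern SCRIPTS = Pattern.compile(
        "<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
// Reluctant .*? stops at the first </style>, so text between two <style> blocks is kept.
private static final Pattern STYLE = Pattern.compile(
        "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
// Any remaining tag; quoted attribute values may themselves contain '>'.
private static final Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
// Named, decimal, and hexadecimal character references such as &nbsp; &#160; &#xA0;
private static final Pattern ENTITY_REFS = Pattern.compile("&(#\\d+|#[xX][0-9a-fA-F]+|\\w+);");
// Runs of two or more whitespace characters.
private static final Pattern WHITESPACE = Pattern.compile("\\s\\s+");

// Strips scripts, styles, tags, and entity references from HTML, then collapses whitespace.
private String getText(String content) {
    content = SCRIPTS.matcher(content).replaceAll("");
    content = STYLE.matcher(content).replaceAll("");
    content = TAGS.matcher(content).replaceAll("");
    content = ENTITY_REFS.matcher(content).replaceAll("");
    content = WHITESPACE.matcher(content).replaceAll(" ");
    return content;
}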
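
When a paired element such as <style> is stripped with Pattern.DOTALL, it matters whether the quantifier between the opening and closing tag is greedy (.*) or reluctant (.*?) once a page contains more than one such block. The sketch below compares the two on a small input. The class name StyleQuantifierDemo, the helper strip, and the sample HTML are made up for illustration and are not part of the original snippet.

```java
import java.util.regex.Pattern;

class StyleQuantifierDemo {
    // Greedy: .* runs on to the LAST </style> in the input.
    static final Pattern GREEDY =
            Pattern.compile("<style[^>]*>.*</style>", Pattern.DOTALL);
    // Reluctant: .*? stops at the FIRST </style> after each <style>.
    static final Pattern RELUCTANT =
            Pattern.compile("<style[^>]*>.*?</style>", Pattern.DOTALL);

    // Hypothetical helper: removes every match of p from html.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style>a{}</style>keep me<style>b{}</style>";
        System.out.println("[" + strip(GREEDY, html) + "]");    // prints "[]"
        System.out.println("[" + strip(RELUCTANT, html) + "]"); // prints "[keep me]"
    }
}
```

With the greedy form, the text between the two style blocks is deleted along with them. The reluctant form removes each block separately and keeps the text in between.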