Question

I want to clean an HTML page of its tags, using Ruby. I have the raw HTML, and would like to define a list of tags, e.g. ['span', 'li', 'div'], and create an array of regular expressions that I could run sequentially, so that I have

clean_text = raw.gsub(first_regex,' ').gsub(second_regex,' ')...

with two regular expressions per tag (start and end).

Is there a way to do this programmatically (i.e. pre-build the regex array from a tag array and then run the regexes in a fluent chain)?

EDIT: I realize I actually asked two questions at once: the first about transforming a list of tags into a list of regular expressions, and the second about applying a list of regular expressions as a fluent chain. Thanks for answering both questions. I will try to make my next questions single-themed.

Solution

This should produce a single regexp to remove all your tags.

clean_text = raw.gsub(/<\/?(#{tags.join("|")})>/, '')

However, you will have to improve it to support tags with attributes (e.g. <a href="...">). As written, it only removes simple tags (e.g. <a>).
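One possible way to extend the pattern to tags with attributes is sketched below. The helper name tag_regex is an assumption made for illustration, not part of the original answer. It allows any run of non-">" characters after the tag name, and uses \b so that "li" does not also match "<link>". It will still fail on attribute values that contain a literal ">", so treat it as a rough sketch rather than a robust HTML cleaner.

```ruby
# Hypothetical helper (not from the original answer): builds one regexp that
# matches the opening and closing forms of the listed tags, with or without
# attributes.
def tag_regex(tags)
  # Escape each name so it is treated literally, then join as alternatives.
  names = tags.map { |t| Regexp.escape(t) }.join('|')
  # <\/?      opening "<" with an optional "/" for closing tags
  # \b        the tag name must end here, so "li" does not match "<link>"
  # [^>]*     any attributes, up to the closing ">"
  /<\/?(?:#{names})\b[^>]*>/i
end

raw = '<div class="box"><span>Hello</span> <a href="x.html">link</a></div>'
clean_text = raw.gsub(tag_regex(%w(span div li)), ' ')
```

Here the <a> tag is left in place because it is not in the tag list; only the span and div tags are replaced by spaces.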

Other tips

Assuming you have a build_regex method to turn a tag into a regex, this should do it:

tags = %w(span div li)
clean_text = tags.inject(raw) {|text, tag| text.gsub build_regex(tag), ' ' }

The inject call passes the result of each substitution into the next iteration of the block, giving the effect of running each gsub on the string one by one.
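For completeness, here is a minimal sketch of what build_regex might look like, together with a pre-built regex array as the question asked for. The body of build_regex is an assumption (the answer only presumes such a method exists). This version uses one regexp per tag covering both the opening and closing form, and only handles tags without attributes.

```ruby
# Hypothetical implementation of build_regex: matches only attribute-free
# tags such as <li> and </li>.
def build_regex(tag)
  /<\/?#{Regexp.escape(tag)}>/
end

tags = %w(span div li)
# Pre-build the regex array once so it can be reused across documents.
regexes = tags.map { |tag| build_regex(tag) }

raw = '<div><span>Hello</span><li>world</li></div>'
# inject threads the string through each gsub in turn.
clean_text = regexes.inject(raw) { |text, regex| text.gsub(regex, ' ') }
```

Building the array up front keeps the regex construction separate from the substitution pass, which is the split the question's edit describes.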

License: CC-BY-SA with attribution
Not affiliated with StackOverflow