SitePoint Sponsor

User Tag List

Results 1 to 2 of 2
  1. #1
    SitePoint Guru Majglow's Avatar
    Join Date
    Aug 1999
    Location
    B-Town
    Posts
    645
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    My HTML sanitizer. Comments please

    For an application I am working on, I have to allow user submitted HTML (via TinyMCE). I only want to allow a tiny subset of the HTML specification. I tried to find a function that would be able to strip away the unwanted HTML tags, attributes, and styles that were submitted, but didn't find one. So, I wrote a function myself.

    The parameter format is similar to TinyMCE's valid_elements, but differs (in that 1) it offers a lot less and 2) I wasn't exactly 100% sure about what each bit of the valid_elements in TinyMCE's documentation did.

    However, being a ruby newbie, I thought I would post what I came up with here for feedback and comments on improving my "rubyisms". As of now, it's still pretty limited, but suits my minimum needs. I probably will expand on it in the future, maybe convert it to a Rails plugin.

    Anyway, here's the source:

    Code:
    # "-strong/b,-em/i,-strike,-u,-p[style|align],-ol[style],-ul[style],-li[style],+br,-blockquote[style],+hr[style],-font[face|size|style|color]"
    
    # alias_for   =>  Element name (string)
    # attributes  =>  Valid Attributes (array of strings)
    # open        =>  Does the element have a closing tag? If not, make sure that there is a trailing slash. ie: <hr/> (true / false)
    # empty       =>  The element can be empty. (not implemented) If false, all empty elements will be removed. (true / false)
    
    require 'erb'
    
    class String
      
      include ERB::Util
      
      def sanitize_html(valid_elements = nil, valid_styles = nil)
        valid_elements = extract_elements(valid_elements ||= "")
        valid_styles   = (valid_styles ||= "").split('|').map{|s| s.strip}.select{|s| s.length > 0}
        
        html    = self
        new_str = ""
        tags    = Array.new  #Stack of elements
        
        while html =~ /<\s*(\/)?\s*([a-z]+)(\s*[a-z]+\s*=\s*"[^"<>]+"\s*)*\s*\/?>/im do 
          # Extract the parts from the RegEx match
          html_tag    = $&
          closing_tag = $1 == '/'
          element     = $2.downcase.to_sym
          attributes  = $3
          before      = $`
          after       = $'
          
          # Append everything before the match to the return string
          new_str << html_escape(before)
          
          # Checks is the element is valid
          if valid_elements[element]
            
            element = valid_elements[element][:alias_for] if valid_elements[element][:alias_for]
            
            if not closing_tag
              
              tags.push(element)
              
              new_str << '<'
              new_str << element.to_s
              new_str << extract_attributes(attributes, valid_elements[element][:attributes], valid_styles)
              new_str << '/' if valid_elements[element][:open]
              new_str << '>'
            elsif tags.last == element
              tags.pop
              new_str << "</#{element.to_s}>" unless valid_elements[element][:open]
            else
              while tags.last && tags.last != element
                last_tag = tags.pop
                new_str << "</#{last_tag.to_s}>"
              end
              
              if tags.last == element
                last_tag = tags.pop
                new_str << "</#{last_tag.to_s}>"
              else
                new_str << html_escape(html_tag)
              end
            end
          else
            new_str << html_escape(html_tag)
          end
          
          # Repeat the loop with the rest of the html string
          html = after
        end
        new_str << html_escape(html)
        #new_str
      end
      
      private
      
      def extract_elements(elements)
        elements = "" unless elements
        elements = elements.split(',').compact.map{|e| e.strip}
        
        # Go over each element and create a hash describing the logic
        valid_elements = elements.inject(Hash.new) do |memo, word|
          
          if word =~ /^([-+]?)([a-z]+)(\/[a-z]+)?(\[[a-z]+(\|[a-z]+)*\])?$/i
            
            element    = $2.downcase.to_sym
            aliases    = $3
            attributes = $4
            
            if not memo[element]
              memo[element] = Hash.new
              memo[element][:open]  = $1 == '+'
              memo[element][:empty] = $1 == '-'
              
              aliases.split('/').select{|e| e.length > 0}.each do |e|
                if not memo[e.downcase.to_sym]
                  memo[e.downcase.to_sym] = {:alias_for => element}
                end
              end if aliases
              
              attributes.delete('[').delete(']').split('|').select{|e| e.length > 0}.each do |a|
                (memo[element][:attributes] ||= Array.new) << a
              end if attributes
            end
          end
          memo
        end
      end
      
      def extract_attributes(attributes, valid_attributes, valid_styles)
        validated_attributes = String.new
        attributes           = "" unless attributes
        
        attributes.scan(/\s*([a-z]+)\s*=\s*"\s*([^"<>]+)\s*"\s*/) do |name, value|
          if valid_attributes.include?(name.downcase)
            if name.downcase == "style"
              value = validate_css(value, valid_styles)
            end
            validated_attributes << " #{name}=\"#{value}\""
          end
        end if attributes.length > 0
        
        validated_attributes
      end
      
      def validate_css(styles, valid_styles)
        styles = styles.split(';').map{|e| e.strip}.select{|e| e.length > 0}
        
        styles = styles.inject(String.new) do |validated_styles, style|
          if style =~ /\A\s*([-a-z]+)\s*:(.*)\z/mi
            if valid_styles.include?($1.downcase)
              validated_styles << " #{$1}: #{$2.strip}"
            end
          end
          validated_styles
        end
        styles.strip
      end
    end
    And a quick unit test (I will expand on it still)

    Code:
    require 'test/unit'
    require '../lib/sanitize.rb'
    
    class TestSanitize < Test::Unit::TestCase
      
      def setup
        @html1 = <<-END_OF_HTML
        <p>Hello <b>World</b>, I'm just <u><i>trying to </U>write out some html so that I can run these test cases</p>
        <div>This is a div thrown out in the middle of the text</div>
        <table>
          <tr>
            <td>Here's a table cell</td>
            <td>Here's another</td>
          </tr>
        </table>
        now for some html in the middle of nowhere.<br><br/>
        <hr>
        oh cssnap! <style>
          body { visibility: none; }
          <style>
        <ul>This is no good
        <li>One</li>
        <li>Two</li>
        <     
        script          > alert('hello'); <
        /
                         script
        >
        END_OF_HTML
      end
      
      def teardown
        # again... zomg
      end
      
      def test_simple_element_list
        # Sanitize the HTML
        valid_elements = "strong/b,em/i,strike,u,p,ol,ul,li,+br,+hr"
        valid_styles   = ""
        sanitized_html = @html1.sanitize_html(valid_elements, valid_styles)
        
        # Make sure the bad tags are removed
        assert_no_match(/<b>|<\/b>/, sanitized_html)
        assert_no_match(/<i>|<\/i>/, sanitized_html)
        assert_no_match(/<div>|<\/div>/, sanitized_html)
        assert_no_match(/<table>|\/table>/, sanitized_html)
        assert_no_match(/<tr>|<\/tr>/, sanitized_html)
        assert_no_match(/<td>|<\/td>/, sanitized_html)
        assert_no_match(/<style>|<\/style>/, sanitized_html)
        assert_no_match(/<\s*script\s*>|<\s*\/\s*script\s*>/, sanitized_html)
        
        # Make sure there are no open tags without the /
        assert_no_match(/<br>|<hr>/, sanitized_html)
        
        # Make sure that the good tags remain
        assert_match(/<p>Hello/, sanitized_html)
        assert_match(/<strong>World/, sanitized_html)
        assert_match(/<\/em><\/u>write/, sanitized_html)
        assert_match(/cases<\/p>/, sanitized_html)
        assert_match(/nowhere\.<br\/><br\/>\s*<hr\/>/, sanitized_html)
        assert_match(/<ul>This is no good/, sanitized_html)
        assert_match(/<li>One<\/li>/, sanitized_html)
        assert_match(/<li>Two<\/li>/, sanitized_html)
        
        # Make sure that the bad tags get converted
        assert_match(/&lt;div&gt;/, sanitized_html)
        assert_match(/&lt;table&gt;/, sanitized_html)
        assert_match(/&lt;tr&gt;/, sanitized_html)
        assert_match(/&lt;td&gt;Here's a table cell&lt;\/td&gt;/, sanitized_html)
        assert_match(/&lt;style&gt;/, sanitized_html)
        # assert_match(/&lt;\/style&gt;/, sanitized_html)
        assert_match(/&lt;\s*script\s*&gt;/, sanitized_html)
        assert_match(/&lt;\s*\/\s*script\s*&gt;/, sanitized_html)
      end
    end
    Ohai!

  2. #2
    SitePoint Guru silver trophy Luke Redpath's Avatar
    Join Date
    Mar 2003
    Location
    London
    Posts
    794
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If it works, then great - I'd expand on the unit tests - lots of fine grained test cases is better than one large test. Once you've done that, and assuming the tests still pass, I'd seriously look into refactoring it. If you start seeing methods over 10 lines long, I'd start getting worried. The first thing I'd look at is removing all of that from Ruby's string class and creating your own class with some kind of class method (process() might be a good name) that invokes the cleanup and returns the cleaned string. I'd then look at doing a lot of extract method refactoring on your code until you are left with some cleaner, more efficient code. Your unit tests should give you the confidence do this. Ruby is at its heart an object-oriented language but you have a very procedural solution.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •