Problem with encoding - simple_html_dom

Hello,
I downloaded the simple_html_dom.php class to get data from a web page.
It works great with English, but when I try to get Hebrew characters from the web page, it shows: ���

I checked the encoding: it is UTF-8 (Hebrew works in UTF-8; I’m using UTF-8 without BOM).
I also used <meta http-equiv="content-type" content="text/html; charset=utf-8" /> on the web page. Hebrew typed into the page works, but the Hebrew characters retrieved by the simple_html_dom class show up as above.

What is the problem?

I used simple_html_dom a few days ago, but only on English pages, and it worked.
So my guess is that the script itself works with a different charset. Therefore you will have to look into the code and find some hints there.

OK, I took a closer look at the simple_html_dom.php file.
If you also have version 1.5, you will find this function at line 764:

function convert_text($text)

Try adding some output/debug there to see whether it converts the text the way you expect.
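
For this kind of debugging it can help to look at the raw bytes rather than the rendered characters. A minimal sketch, assuming a hypothetical helper named describe_bytes() (it is not part of simple_html_dom):

```php
<?php
/**
 * Hypothetical debugging helper: report whether a string is valid UTF-8
 * and show its bytes in hex so the encoding can be judged by eye.
 */
function describe_bytes(string $text): string
{
    // preg_match() with the /u modifier fails on input that is not valid UTF-8.
    $isUtf8 = preg_match('//u', $text) === 1;
    return ($isUtf8 ? 'valid UTF-8' : 'NOT valid UTF-8') . ': ' . bin2hex($text);
}

// The Hebrew letter alef encoded as UTF-8 (two bytes).
echo describe_bytes("\xD7\x90"), "\n";
// A single byte 0xE0, the way a legacy single-byte code page stores one letter.
echo describe_bytes("\xE0"), "\n";
```

Calling it on $movier inside the loop would show whether the text coming out of the parser is UTF-8 at all.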

The output is still ������ ��� �����

My code:

    foreach($div->find('.theater') as $movie) {
        $movier = $movie->find('.name a',0)->innertext;
        $movier = $movie->convert_text($movier);
        print 'Theater: '.$movier.'<br/>';
    }

Output for $movier:
������ ��� �����

Please try this before the output:
http://php.net/manual/de/function.utf8-encode.php
or this here
http://de1.php.net/mb_convert_encoding

Thank you, I used utf8_encode, but it still doesn’t work. Instead of question marks (the symbol above) it now shows gibberish:
âìåáåñ î÷ñ àùãåã
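
The garbled output is itself a clue. utf8_encode() assumes its input is ISO-8859-1, so reversing that step recovers the original bytes, which can then be decoded with a candidate Hebrew encoding. A minimal sketch, assuming the iconv extension is available and assuming the bytes are Windows-1255 (a guess to be checked, not something the page declares):

```php
<?php
// The garbled string reported above, saved in a UTF-8 source file.
$garbled = 'âìåáåñ î÷ñ àùãåã';

// Undo utf8_encode(): map each character back to its single ISO-8859-1 byte.
$rawBytes = iconv('UTF-8', 'ISO-8859-1', $garbled);

// Assumption: those bytes are Windows-1255, one of the Hebrew code pages.
$hebrew = iconv('WINDOWS-1255', 'UTF-8', $rawBytes);

echo $hebrew, "\n";
```

If the result is readable Hebrew, the fetched page was probably not UTF-8 to begin with, and utf8_encode was the wrong conversion to apply.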

Probably not this simple, but are you sure your browser has the font needed to render the characters?
Maybe try looking at the page using different browsers that might have that font.
If you go to a site you know has those characters, can you see them there OK?

Are you sure that there are browsers out there today that don’t support all charsets?

theunreal, can you please describe your problem more specifically?
Or give us the page you are trying to read from?
I’ve never worked with characters like Hebrew, but maybe you converted it the wrong way and need to convert it from a Hebrew encoding.

Yeah, I can type Hebrew freely on the website, but with the code above it shows gibberish where the Hebrew should be.
I’m using Google Chrome and it has no problem with Hebrew, of course…
As for the font, I checked it just now and it’s 100% OK (Arial, which works with Hebrew).

I’m trying to read from Google Movies.
The exact URL is

http://www.google.co.il/movies?near=רוגוזין&q=ההוביט

Even on Google itself you can see
<meta http-equiv=content-type content="text/html; charset=UTF-8">
and I’m using the same on my website…

Seems that would be good enough then :fangel:
Maybe a lang="he" dir="rtl" thing?

Tried, still gibberish.
I’m pretty sure it’s something with simple html dom, because Hebrew works on the website…

I think you’re most likely right. You may need to throw iconv into the mix somewhere.

iconv changes the encoding. How will converting UTF-8 to UTF-8 help me? :S

:lol: Yes, converting to the same encoding is unlikely to help much. I was thinking it would let you test others, e.g.

Hebrew
ISO-8859-8-I
Windows-1255
IBM-862

Hebrew Visual
ISO-8859-8
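
Trying the candidates can be automated. A rough sketch, assuming the iconv extension; the sample bytes are made up for illustration, and the encoding names are spellings iconv commonly accepts (IBM-862 is usually spelled CP862 there), which may vary by platform:

```php
<?php
// Illustrative sample: five bytes that spell a Hebrew word in Windows-1255.
$bytes = "\xE0\xF9\xE3\xE5\xE3";

$candidates = ['ISO-8859-8', 'WINDOWS-1255', 'CP862'];
$results = [];

foreach ($candidates as $encoding) {
    // @ hides the message iconv raises for an unknown encoding or bad bytes.
    $converted = @iconv($encoding, 'UTF-8', $bytes);
    $results[$encoding] = $converted === false ? null : $converted;
}

foreach ($results as $encoding => $text) {
    echo $encoding, ': ', $text ?? '(conversion failed)', "\n";
}
```

Feeding it the raw string returned by the fetch, before any utf8_encode call, shows at a glance which candidate (if any) yields readable Hebrew.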

Everything failed.
I also tried the last one, IBM-862, but it does not seem to work (it does not recognize this encoding).
Blah, it’s annoying :\ I think I will just get the English words and use str_replace to turn them into Hebrew.

I don’t read Hebrew so I don’t know if the direction is correct. But the problem seems to be the use of utf8_encode. Using this

require_once('simple_html_dom.php');
/*
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.google.co.il/movies?q=%D7%92%27%D7%A7');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
*/
$str = "<div id='movie_results'><div class='theater'><div class='name'>";
$str .= "<a>&#1492;&#1513;&#1488;&#1497;&#1500;&#1514;&#1492; &#1513;&#1500;&#1498; - &#1490;'&#1511; - &#1500;&#1488; &#1514;&#1488;&#1502;&#1492; &#1500;&#1488;&#1507; &#1489;&#1497;&#1511;&#1493;&#1512;&#1493;&#1514; &#1505;&#1512;&#1496;&#1497;&#1501;, &#1494;&#1502;&#1504;&#1497; &#1492;&#1510;&#1490;&#1492; &#1513;&#1500; &#1505;&#1512;&#1496; &#1488;&#1493; &#1488;&#1493;&#1500;&#1502;&#1493;&#1514; &#1492;&#1511;&#1493;&#1500;&#1504;&#1493;&#1506;.</a></div>";
$str .= "<div class='address'>&#1492;&#1513;&#1488;&#1497;&#1500;&#1514;&#1492; &#1513;&#1500;&#1498; - &#1490;'&#1511; - &#1500;&#1488; &#1514;&#1488;&#1502;&#1492; &#1500;&#1488;&#1507; &#1489;&#1497;&#1511;&#1493;&#1512;&#1493;&#1514; &#1505;&#1512;&#1496;&#1497;&#1501;, &#1494;&#1502;&#1504;&#1497; &#1492;&#1510;&#1490;&#1492; &#1513;&#1500; &#1505;&#1512;&#1496; &#1488;&#1493; &#1488;&#1493;&#1500;&#1502;&#1493;&#1514; &#1492;&#1511;&#1493;&#1500;&#1504;&#1493;&#1506;.</div>";
$str .= "<div class='times'>&#1492;&#1513;&#1488;&#1497;&#1500;&#1514;&#1492; &#1513;&#1500;&#1498; - &#1490;'&#1511; - &#1500;&#1488; &#1514;&#1488;&#1502;&#1492; &#1500;&#1488;&#1507; &#1489;&#1497;&#1511;&#1493;&#1512;&#1493;&#1514; &#1505;&#1512;&#1496;&#1497;&#1501;, &#1494;&#1502;&#1504;&#1497; &#1492;&#1510;&#1490;&#1492; &#1513;&#1500; &#1505;&#1512;&#1496; &#1488;&#1493; &#1488;&#1493;&#1500;&#1502;&#1493;&#1514; &#1492;&#1511;&#1493;&#1500;&#1504;&#1493;&#1506;.</div></div></div>";
$html = str_get_html($str);

foreach($html->find('#movie_results') as $div) {
    // print all the movies with showtimes
    foreach($div->find('.theater') as $movie) {
		$movier = $movie->find('.name a',0)->innertext;
//		$movier = utf8_encode($movier);
		print '&#1511;&#1493;&#1500;&#1504;&#1493;&#1506;: ' . $movier . '<br/>';
//		print '&#1499;&#1514;&#1493;&#1489;&#1514;: '.utf8_encode($movie->find('.address',0)->innertext).'<br />';
//		print '&#1502;&#1493;&#1511;&#1512;&#1503; &#1489;&#1513;&#1506;&#1493;&#1514;: '.utf8_encode($movie->find('.times',0)->innertext).'<hr/>';
		print '&#1499;&#1514;&#1493;&#1489;&#1514;: ' . $movie->find('.address',0)->innertext . '<br />';
		print '&#1502;&#1493;&#1511;&#1512;&#1503; &#1489;&#1513;&#1506;&#1493;&#1514;: ' . $movie->find('.times',0)->innertext . '<hr/>';
    }
}

I got

You wrote the text in the .php file itself; you didn’t receive it from the web page.
I need to read it from http://www.google.co.il/movies?near=רוגוזין&q=אני

Putting Hebrew text directly into $str works great, as you showed above,
but when $str = curl_exec($curl); (reading it from the web page) the encoding problems appear.
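
One possible explanation, not confirmed in this thread, is that the server answers the cURL request in a different charset than it sends to a browser. A minimal sketch that reads the charset from the Content-Type response header and converts the body to UTF-8 before parsing. The helper names body_to_utf8() and fetch_as_utf8() are hypothetical, and the iconv and cURL extensions are assumed:

```php
<?php
/**
 * Convert a fetched body to UTF-8 using the charset named in the
 * Content-Type header. Hypothetical helper, not part of simple_html_dom.
 */
function body_to_utf8(string $body, ?string $contentType, string $fallback = 'UTF-8'): string
{
    $charset = $fallback;
    if ($contentType !== null && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body; // already in the target encoding
    }
    // @ hides the message iconv raises for an unknown charset name.
    $converted = @iconv($charset, 'UTF-8', $body);
    return $converted === false ? $body : $converted;
}

/**
 * Fetch a URL with cURL and return its body as UTF-8.
 */
function fetch_as_utf8(string $url): string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    $body = curl_exec($curl);
    $contentType = curl_getinfo($curl, CURLINFO_CONTENT_TYPE);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($curl));
    }
    return body_to_utf8($body, $contentType ?: null);
}
```

The converted string could then be passed to str_get_html() in place of the raw curl_exec() result, with the utf8_encode call left out.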

Yes, it does seem to be a Curl thing.

When I put the DOM portion of interest inside HEREDOC I get

With Curl without utf8_encode

With Curl with utf8_encode

Very frustrating! I think at least part of the problem is Google’s horrendous mark-up

<div id=movie_results>
	<div class=movie_results>
		<div class=movie itemscope itemtype="http://schema.org/Movie">
			<div class=header>
				<div class=desc style="margin-right:0px">
					<h2 itemprop="name">&#1488;&#1504;&#1497; &#1508;&#1512;&#1504;&#1511;&#1504;&#1513;&#1496;&#1497;&#1497;&#1503; - &#1514;&#1500;&#1514; &#1502;&#1497;&#1502;&#1491;</h2>
					<div style="display:inline-block;height:15px;position:relative;top:2px">
						<g:plusone href="http://www.google.com/movies?mid=da48f6edd96ff86f&plus=1&gl=il&hl=iw" size="small" width="350" annotation="inline" recommendations="false" source="google:SHOWTIMES"></g:plusone>
					</div>
					<div class=info>&#8207;92 &#1491;&#1511;&#1493;&#1514;&#8207;&#8207; - &#1491;&#1497;&#1512;&#1493;&#1490; &#1492;&#1505;&#1512;&#1496;: PG-13&#8207;&#8207;&#8207; - &#1488;&#1497;&#1502;&#1492;/&#1502;&#1514;&#1495;&#8207;&#8207; - &#1488;&#1504;&#1490;&#1500;&#1497;&#1514;&#8207;</div>
					<div class=syn>
						<span itemprop="description"></span>
					</div>
				</div>
				<meta itemprop="datePublished" content=""/>
				<meta itemprop="sameas" content="http://www.imdb.com/title/tt1418377/"/>
				<p class=clear>
			</div>
			<h2 class=section_title>&#1494;&#1502;&#1504;&#1497; &#1492;&#1510;&#1490;&#1492;</h2>
			<div class=showtimes>
				<div class=show_right>
					<div class=theater>
						<div id=theater_2535238655091238248 >
							<div class=name>
								<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&amp;tid=232efa1fe18d6568" id=link_1_theater_2535238655091238248>&#1490;&#1500;&#1493;&#1489;&#1493;&#1505; &#1502;&#1511;&#1505; &#1488;&#1513;&#1491;&#1493;&#1491;</a>
							</div>
							<div class=address>&#1492;&#1490;&#1491;&#1493;&#1491; &#1492;&#1506;&#1489;&#1512;&#1497; 6 - &#1511;&#1504;&#1497;&#1493;&#1503; &#1505;&#1497;-&#1502;&#1493;&#1500;, &#1488;&#1513;&#1491;&#1493;&#1491;
								<a href="" class=fl target=_top></a>
							</div>
						</div>
						<div class=times>
							<span style="color:">
								<span style="padding:0 ">
								&#8207;</span><!--  -->17:00&#8207;
							</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp
								&#8207;</span><!--  -->19:30
							&#8207;</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp&#8207;
								</span><!--  -->22:00&#8207;
							</span>
						</div>
					</div>
					<div class=theater>
						<div id=theater_12245479986072718034 >
							<div class=name>
								<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&amp;tid=a9f0af2b02050ad2" id=link_1_theater_12245479986072718034>&#1490;&#1500;&#1493;&#1489;&#1493;&#1505; &#1511;&#1504;&#1497;&#1493;&#1503; &#1495;&#1493;&#1510;&#1493;&#1514; &#1488;&#1513;&#1511;&#1500;&#1493;&#1503;</a>
							</div>
							<div class=address>&#1512;&#1495;&#39; &#1513;&#1491;&#39; &#1489;&#1503; &#1490;&#1493;&#1512;&#1497;&#1493;&#1503;, &#1492;&#1504;&#1495;&#1500; 1- &#1511;&#1504;&#1497;&#1493;&#1503; &#1495;&#1493;&#1510;&#1493;&#1514; &#1488;&#1513;&#1511;&#1500;&#1493;&#1503;, &#1488;&#1513;&#1511;&#1500;&#1493;&#1503;
								<a href="" class=fl target=_top></a>
							</div>
						</div>
						<div class=times>
							<span style="color:">
								<span style="padding:0 ">
								&#8207;</span><!--  -->17:00
							</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp
								&#8207;</span><!--  -->19:30
							&#8207;</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp
								&#8207;</span><!--  -->22:00
							&#8207;</span>
						</div>
					</div>
				</div>
				<div class=show_left>
					<div class=theater>
						<div id=theater_17631652651284870555 >
							<div class=name>
								<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&amp;tid=f4b036ab7ae8b59b" id=link_1_theater_17631652651284870555>&#1512;&#1489;-&#1495;&#1503; &#1512;&#1495;&#1493;&#1489;&#1493;&#1514;</a>
							</div>
							<div class=address>&#1489;&#1497;&#1500;&#1493; 1- &#1511;&#1504;&#1497;&#1493;&#1503; &#1512;&#1495;&#1493;&#1489;&#1493;&#1514;, &#1512;&#1495;&#1493;&#1489;&#1493;&#1514;
								<a href="" class=fl target=_top></a>
							</div>
						</div>
						<div class=times>
							<span style="color:">
								<span style="padding:0 ">&#8207;</span><!--  -->17:00
							&#8207;</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp&#8207;</span><!--  -->19:30
							&#8207;</span>
							<span style="color:">
								<span style="padding:0 "> &nbsp&#8207;</span><!--  -->21:50&#8207;
							</span>
						</div>
					</div>
				</div>
				<p class=clear>
			</div>
		</div>
	</div>
</div>

Do they have an API you could use instead of scraping that page (which BTW might be against their TOS)?

They don’t. I think this is the only option, but it’s very frustrating that I can’t read the page in Hebrew. I really don’t get why that is :S

I’ve been trying a lot of different things but haven’t gotten anywhere yet. :frowning: