Hello,
I downloaded the simple_html_dom.php class to get data from a web page,
and it works great with English, but when I try to get Hebrew characters from the web page, it shows: ���
I checked the encoding: UTF-8 (Hebrew works in UTF-8; I’m using UTF-8 without BOM).
I also used <meta http-equiv="content-type" content="text/html; charset=utf-8" /> on the web page. Hebrew works there, but the Hebrew characters retrieved by the simple_html_dom class show up as above.
What is the problem?
Tipo
January 21, 2014, 12:47pm
2
I used simple_html_dom a few days ago, but only on English pages, and it worked.
So my conclusion is that the script itself works with another charset. Therefore you have to look into the code and find some hints there.
Tipo
January 21, 2014, 12:54pm
3
OK, I took a closer look at the simple_html_dom.php file.
If you also have version 1.5, you will find this function at line 764:
function convert_text($text)
Try adding some debug output there to see whether it converts the text the way you expect.
The output is still ������ ��� �����
My code:
foreach($div->find('.theater') as $movie) {
    $movier = $movie->find('.name a',0)->innertext;
    $movier = $movie->convert_text($movier);
    print 'Theater: '.$movier.'<br/>';
}
Output for $movier:
������ ��� �����
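One way to narrow this down is to look at the raw bytes rather than the rendered text, since � is what a browser shows for bytes that are not valid UTF-8. Here is a minimal sketch that does not depend on simple_html_dom at all. describe_bytes() is a hypothetical helper name, and the two sample strings are the letter alef in UTF-8 and in a single-byte Hebrew code page:

```php
<?php
// Hypothetical debugging helper (not part of simple_html_dom):
// shows the hex bytes of a string and whether they form valid UTF-8.
function describe_bytes(string $s): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8 input.
    $isUtf8 = preg_match('//u', $s) === 1;
    return bin2hex($s) . ($isUtf8 ? ' (valid UTF-8)' : ' (not valid UTF-8)');
}

echo describe_bytes("\xD7\x90"), "\n"; // alef encoded as UTF-8
echo describe_bytes("\xE0"), "\n";     // alef as a single byte, e.g. Windows-1255
```

Calling describe_bytes($movier) on the scraped value should show whether the text arrives as two-byte UTF-8 sequences (starting with d7) or as single bytes from some other encoding.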
Tipo
January 21, 2014, 6:00pm
5
Thank you, I used utf8_encode, but it still doesn’t work. Instead of question marks (the symbol above), it now shows gibberish:
âìåáåñ î÷ñ àùãåã
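That gibberish is itself a clue. utf8_encode() assumes its input is ISO-8859-1, so reversing it gives back the original bytes, and those bytes can then be tried against a Hebrew code page. A sketch, assuming the iconv extension is available and guessing Windows-1255 as the source encoding (the guess is mine, not something confirmed in the thread):

```php
<?php
// The gibberish from the post above, saved here as UTF-8 text.
$garbled = 'âìåáåñ î÷ñ àùãåã';

// Undo utf8_encode(): map each character back to its single ISO-8859-1 byte.
$rawBytes = iconv('UTF-8', 'ISO-8859-1', $garbled);

// Assumption: the raw bytes were Windows-1255 (Hebrew) all along.
$hebrew = iconv('Windows-1255', 'UTF-8', $rawBytes);

echo $hebrew, "\n"; // prints: גלובוס מקס אשדוד
```

The result is readable Hebrew, which suggests the bytes returned by the fetch were in a single-byte Hebrew encoding rather than UTF-8, and that utf8_encode() was applied to the wrong source charset.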
Probably not this simple, but are you sure your browser has the font needed to render the characters?
Maybe try looking at the page using different browsers that might have that font.
If you go to a site you know has those characters, can you see them there OK?
Tipo
January 21, 2014, 8:20pm
8
Are you sure that there are browsers out there today that don’t support all charsets?
theunreal, can you please describe your problem a bit more specifically?
Or give us the page you are trying to read from?
I’ve never worked with characters as specific as Hebrew, but maybe you encoded it the wrong way and you need to encode it as Hebrew.
Yes, I can type Hebrew freely on the website, but with the code above it shows gibberish where the Hebrew should be.
I’m using Google Chrome and it has no problem with Hebrew, of course…
As for the font, I checked it now and it’s 100% OK (Arial, which works with Hebrew).
I’m trying to read from Google Movies.
The exact URL is
http://www.google.co.il/movies?near=רוגוזין&q=ההוביט
Even in Google’s own page you can see
<meta http-equiv=content-type content="text/html; charset=UTF-8">
I’m using the same in my website…
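Since the URL contains raw Hebrew, it may be worth percent-encoding the query before handing it to cURL. A small sketch using http_build_query(); the $params array name is just for illustration, and the values are the ones from the URL above:

```php
<?php
// Query values taken from the URL in the post; $params is an illustrative name.
$params = ['near' => 'רוגוזין', 'q' => 'ההוביט'];

// PHP_QUERY_RFC3986 percent-encodes the UTF-8 bytes of each value.
$url = 'http://www.google.co.il/movies?'
     . http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $url, "\n";
```

This only makes sure the request itself is well-formed; it does not change the encoding of the response.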
Seems that would be good enough, then.
Maybe a lang="he" dir="rtl" thing?
Tried, still gibberish.
I’m pretty sure it’s something with simple_html_dom, because Hebrew works on the website…
I think you’re most likely right. You may need to throw iconv into the mix somewhere.
iconv changes the encoding. How will converting UTF-8 to UTF-8 help me? :S
Yes, converting to the same encoding is unlikely to help much. I was thinking it would allow for testing others, e.g.
Hebrew
ISO-8859-8-I
Windows-1255
IBM-862
Hebrew Visual
ISO-8859-8
Everything failed.
The last one I tried was IBM-862, but it does not seem to work (the encoding name is not recognized).
Blah, it’s annoying. I think I will just get the English words and use str_replace to turn them into Hebrew.
I don’t read Hebrew so I don’t know if the direction is correct. But the problem seems to be the use of utf8_encode. Using this
require_once('simple_html_dom.php');
/*
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.google.co.il/movies?q=%D7%92%27%D7%A7');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
*/
$str = "<div id='movie_results'><div class='theater'><div class='name'>";
$str .= "<a>השאילתה שלך - ג'ק - לא תאמה לאף ביקורות סרטים, זמני הצגה של סרט או אולמות הקולנוע.</a></div>";
$str .= "<div class='address'>השאילתה שלך - ג'ק - לא תאמה לאף ביקורות סרטים, זמני הצגה של סרט או אולמות הקולנוע.</div>";
$str .= "<div class='times'>השאילתה שלך - ג'ק - לא תאמה לאף ביקורות סרטים, זמני הצגה של סרט או אולמות הקולנוע.</div></div></div>";
$html = str_get_html($str);
foreach($html->find('#movie_results') as $div) {
    // print all the movies with showtimes
    foreach($div->find('.theater') as $movie) {
        $movier = $movie->find('.name a',0)->innertext;
        // $movier = utf8_encode($movier);
        print 'קולנוע: ' . $movier . '<br/>';
        // print 'כתובת: '.utf8_encode($movie->find('.address',0)->innertext).'<br />';
        // print 'מוקרן בשעות: '.utf8_encode($movie->find('.times',0)->innertext).'<hr/>';
        print 'כתובת: ' . $movie->find('.address',0)->innertext . '<br />';
        print 'מוקרן בשעות: ' . $movie->find('.times',0)->innertext . '<hr/>';
    }
}
I got
You wrote the text in the .php file itself; you didn’t receive it from the web page.
I need to read it from http://www.google.co.il/movies?near=רוגוזין&q=אני
Putting Hebrew text directly in $str works great, as you showed above,
but with $str = curl_exec($curl); (reading it from the web page) the encoding problems appear.
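If the fetched body is not actually UTF-8, one option is to convert it before passing it to str_get_html(). The sketch below reads the charset from the response's Content-Type header and converts with iconv. body_to_utf8() is a hypothetical helper, and whether Google really reports a non-UTF-8 charset for this request is an assumption that would need checking:

```php
<?php
// Hypothetical helper: convert a fetched body to UTF-8 based on the
// charset named in the Content-Type header (falls back to UTF-8).
function body_to_utf8(string $body, ?string $contentType, string $fallback = 'UTF-8'): string
{
    $charset = $fallback;
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body; // already UTF-8, nothing to convert
    }
    // Unknown charset names make iconv fail; keep the original body then.
    $converted = @iconv($charset, 'UTF-8//IGNORE', $body);
    return $converted === false ? $body : $converted;
}

// With cURL this could be used as (before curl_close):
//   $type = curl_getinfo($curl, CURLINFO_CONTENT_TYPE);
//   $html = str_get_html(body_to_utf8($str, $type));

// Offline check: the three bytes below are "אני" in Windows-1255.
echo body_to_utf8("\xE0\xF0\xE9", 'text/html; charset=windows-1255'), "\n";
```

If the header names no charset, the fallback argument can be set to the encoding found by inspecting the bytes.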
Yes, it does seem to be a cURL thing.
When I put the DOM portion of interest inside HEREDOC I get
With cURL without utf8_encode
With cURL with utf8_encode
Very frustrating! I think at least part of the problem is Google’s horrendous markup:
<div id=movie_results>
<div class=movie_results>
<div class=movie itemscope itemtype="http://schema.org/Movie">
<div class=header>
<div class=desc style="margin-right:0px">
<h2 itemprop="name">אני פרנקנשטיין - תלת מימד</h2>
<div style="display:inline-block;height:15px;position:relative;top:2px">
<g:plusone href="http://www.google.com/movies?mid=da48f6edd96ff86f&plus=1&gl=il&hl=iw" size="small" width="350" annotation="inline" recommendations="false" source="google:SHOWTIMES"></g:plusone>
</div>
<div class=info>‏92 דקות‏‏ - דירוג הסרט: PG-13‏‏‏ - אימה/מתח‏‏ - אנגלית‏</div>
<div class=syn>
<span itemprop="description"></span>
</div>
</div>
<meta itemprop="datePublished" content=""/>
<meta itemprop="sameas" content="http://www.imdb.com/title/tt1418377/"/>
<p class=clear>
</div>
<h2 class=section_title>זמני הצגה</h2>
<div class=showtimes>
<div class=show_right>
<div class=theater>
<div id=theater_2535238655091238248 >
<div class=name>
<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&tid=232efa1fe18d6568" id=link_1_theater_2535238655091238248>גלובוס מקס אשדוד</a>
</div>
<div class=address>הגדוד העברי 6 - קניון סי-מול, אשדוד
<a href="" class=fl target=_top></a>
</div>
</div>
<div class=times>
<span style="color:">
<span style="padding:0 ">
‏</span><!-- -->17:00‏
</span>
<span style="color:">
<span style="padding:0 ">  
‏</span><!-- -->19:30
‏</span>
<span style="color:">
<span style="padding:0 ">  ‏
</span><!-- -->22:00‏
</span>
</div>
</div>
<div class=theater>
<div id=theater_12245479986072718034 >
<div class=name>
<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&tid=a9f0af2b02050ad2" id=link_1_theater_12245479986072718034>גלובוס קניון חוצות אשקלון</a>
</div>
<div class=address>רח' שד' בן גוריון, הנחל 1- קניון חוצות אשקלון, אשקלון
<a href="" class=fl target=_top></a>
</div>
</div>
<div class=times>
<span style="color:">
<span style="padding:0 ">
‏</span><!-- -->17:00
</span>
<span style="color:">
<span style="padding:0 ">  
‏</span><!-- -->19:30
‏</span>
<span style="color:">
<span style="padding:0 ">  
‏</span><!-- -->22:00
‏</span>
</div>
</div>
</div>
<div class=show_left>
<div class=theater>
<div id=theater_17631652651284870555 >
<div class=name>
<a href="/movies?near=%D7%A8%D7%95%D7%92%D7%95%D7%96%D7%99%D7%9F&tid=f4b036ab7ae8b59b" id=link_1_theater_17631652651284870555>רב-חן רחובות</a>
</div>
<div class=address>בילו 1- קניון רחובות, רחובות
<a href="" class=fl target=_top></a>
</div>
</div>
<div class=times>
<span style="color:">
<span style="padding:0 ">‏</span><!-- -->17:00
‏</span>
<span style="color:">
<span style="padding:0 ">  ‏</span><!-- -->19:30
‏</span>
<span style="color:">
<span style="padding:0 ">  ‏</span><!-- -->21:50‏
</span>
</div>
</div>
</div>
<p class=clear>
</div>
</div>
</div>
</div>
Do they have an API you could use instead of scraping that page (which BTW might be against their TOS)?
They don’t. I think this is the only option, but it’s very frustrating that I can’t read the page in Hebrew. I really don’t get why that is :S
I’ve been trying a lot of different things but haven’t gotten anywhere yet.