In my PHP project (a Laravel middleware) I use cURL to fetch the HTML DOM of a URL. I then need to store, in a variable, the line number at which each matching word is found inside the DOM. The problem is that running the code against a URL on localhost prints the correct line numbers, but running it against a web URL prints line 1 for every match.
EXAMPLE
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlToTest);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput=false;
$dom->preserveWhiteSpace=true;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
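$curlDom = curl_exec($ch);
// Check for a cURL failure before trying to parse the response
if ($curlDom === false) {
    dd(curl_error($ch), curl_errno($ch));
}
libxml_use_internal_errors(true);
$dom->loadHTML($curlDom);
libxml_use_internal_errors(false);
// Remove malformed entities from the raw HTML string
// (a DOMDocument object cannot be used as the preg_replace subject)
$dom = preg_replace("/&#?[a-z0-9]+;/i", "", $curlDom);
curl_close($ch);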
$resultOfRegex = preg_match_all($regex, $dom, $matches, PREG_OFFSET_CAPTURE);
foreach (current($matches) as $match) {
    $matchElement = $match[0];
    $matchLineNumber = substr_count(mb_substr($dom, 0, $match[1], 'UTF-8'), PHP_EOL) + 1;
    $list_of_match .= $matchElement . " at line " . $matchLineNumber . "<br/>";
}
After the code above I build a PDF in which I print the match list. The outputs for the different kinds of URL are shown below.
OUTPUT for LOCALHOST web pages, e.g. localhost/testpage.php
test at line 4
test at line 9
test2 at line 59
OUTPUT for WWW web pages (https://www.testsite.it/testpage.php) and also for local pages such as localhost/testsite/wordpress/testpage/
test at line 1
test at line 1
test2 at line 1
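One way to investigate why every match reports line 1 is to check which line terminators the fetched string actually contains, since the line count above only looks for PHP_EOL. The following is a minimal diagnostic sketch; the helper name countLineEndings and the sample string are hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper (not from the original code): tallies each
// line-terminator style found in a string.
function countLineEndings(string $text): array
{
    $crlf = substr_count($text, "\r\n");
    return [
        'crlf' => $crlf,
        'lf'   => substr_count($text, "\n") - $crlf, // bare "\n" only
        'cr'   => substr_count($text, "\r") - $crlf, // bare "\r" only
    ];
}

// Sample input mixing both styles; in the real code this would be
// the string returned by curl_exec().
$sample = "<html>\n<head>\n</head>\r\n<body>";
print_r(countLineEndings($sample));
```

If the 'crlf' count is zero for the remote page while PHP_EOL is "\r\n" on the machine running the code, substr_count(..., PHP_EOL) would find nothing and every match would land on line 1.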
UPDATES:
- dd() of the preg-replaced string prints the fetched DOM indented as expected
- Using mb_substr_count instead of substr_count doesn't change the output
- The value of $matches is an array of arrays in which every element is an array consisting of the matched string at index 0 and its string offset into the subject at index 1, as described in the preg_match_all documentation for PREG_OFFSET_CAPTURE
- var_dump() of the preg-replaced string returns the DOM followed by around 30 lines of unexpected text, like the error shown below
cURL error found in the var_dump() output:
int(0) {message: "Attempt to read property "headers" on null", exception: "ErrorException",…} exception: "ErrorException" file: "pathtoproject\vendor\laravel\framework\src\Illuminate\Foundation\Http\Middleware\VerifyCsrfToken.php" line: 191 message: "Attempt to read property "headers" on null" trace: [{,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…}, {,…},…]
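Regarding the shape of $matches noted in the updates: the offset that PREG_OFFSET_CAPTURE reports is a byte offset, not a character offset, so slicing with mb_substr can overshoot when the text before the match contains multi-byte characters. The sketch below illustrates the difference on a made-up subject string; the variable names are assumptions for illustration only.

```php
<?php
// "\xC3\xA8" is the UTF-8 encoding of "è" (two bytes, one character).
$subject = "caff\xC3\xA8\ntest";

preg_match_all('/test/u', $subject, $matches, PREG_OFFSET_CAPTURE);

// Each entry is [matched string, byte offset into the subject].
[$text, $byteOffset] = $matches[0][0];

// Byte-based slice: ends exactly where the match begins.
$bytePrefix = substr($subject, 0, $byteOffset);

// Character-based slice with the same number: takes one character too many
// here, because "è" counts as one character but two bytes.
$charPrefix = mb_substr($subject, 0, $byteOffset, 'UTF-8');

var_dump($byteOffset, $bytePrefix === $charPrefix);
```

Under this assumption, pairing the offset with substr (or substr_count on a substr slice) keeps the prefix aligned with the match position.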
UPDATE: Adding $dom = nl2br($dom); after curl_close($ch) fixes the wrong line count from mb_substr for WordPress site URLs and web URLs (it seems that cURL gets a "minified" version of the DOM from WordPress sites, e.g. <head>\n instead of <head>\r\n), but sometimes 1 or 2 lines are still reported wrongly. However, the error in the Dev-Tools network request is still there: "Attempt to read property "headers" on null"
After some tests, this error seems to be related to CORS in Laravel.
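Given the observation above that some pages arrive with \n where others use \r\n, one possible approach is to normalize all line endings to "\n" before matching and then count "\n" rather than PHP_EOL. This is only a sketch under that assumption, not a confirmed fix; the helper lineNumberAt, the sample HTML, and the pattern are hypothetical.

```php
<?php
// Hypothetical helper: 1-based line number of a byte offset, assuming
// the text uses "\n" as its only line terminator.
function lineNumberAt(string $text, int $byteOffset): int
{
    return substr_count(substr($text, 0, $byteOffset), "\n") + 1;
}

// Sample input mixing "\r\n", bare "\n" and bare "\r".
$html = "<html>\r\n<head>\n<title>test</title>\r</head>";

// Normalize first, so that the offsets returned by preg_match_all
// refer to the same string that is used for counting.
$normalized = str_replace(["\r\n", "\r"], "\n", $html);

preg_match_all('/test/', $normalized, $matches, PREG_OFFSET_CAPTURE);

$lines = [];
foreach ($matches[0] as [$text, $offset]) {
    $lines[] = $text . ' at line ' . lineNumberAt($normalized, $offset);
}

print_r($lines);
```

Counting a fixed "\n" makes the result independent of the operating system the PHP code runs on, whereas PHP_EOL varies by platform.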
source https://stackoverflow.com/questions/68552262/error-on-curl-from-laravel-middleware-attempt-to-read-property-headers-on-null