TAGs
php, regular expression, reg exp, wikipedia, harvesting, collect


Get Wikipedia Article
Get the main Wikipedia article.

The script outputs a plain Wikipedia article: all links and references are removed, but the remaining HTML tags are left in place so the text stays readable.

I don't recommend running this every time your page loads. Cache the result instead, so you don't use up Wikipedia's bandwidth unnecessarily.
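A minimal sketch of the caching idea, assuming a simple file-based cache is good enough. The helper name `cached_fetch`, the cache file path, and the one-day lifetime are illustrative assumptions, not part of the original snippet.

```php
<?php
// Hypothetical helper: return the cached copy while it is fresh,
// otherwise call $fetch and store whatever it returns.
function cached_fetch(string $cacheFile, int $maxAge, callable $fetch)
{
    // Serve the cached copy while it is younger than $maxAge seconds.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    $content = $fetch();

    // Only cache successful fetches (file_get_contents() returns false on failure).
    if ($content !== false) {
        file_put_contents($cacheFile, $content, LOCK_EX);
    }

    return $content;
}

// Usage (assumed path and lifetime): hit Wikipedia at most once a day.
// $str = cached_fetch(sys_get_temp_dir() . '/wiki_U2.html', 86400, function () {
//     return file_get_contents('http://en.wikipedia.org/wiki/U2');
// });
```

Passing the fetch step in as a callable keeps the cache logic separate from the download, so the same helper can wrap the cleaned-up output rather than the raw page if you prefer.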

Don't forget to set a user_agent (in php.ini or with ini_set()) to make this work.
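Two ways to set the user agent are sketched here, assuming PHP's built-in HTTP stream wrapper is used for the request. The helper name `wiki_context`, the agent string, and the timeout value are placeholders chosen for illustration.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header.
function wiki_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // assumed value: give up after 10 seconds
        ],
    ]);
}

// Option 1: set it globally for every request made through the HTTP wrapper.
ini_set('user_agent', 'MyWikiReader/1.0 (admin@example.com)');

// Option 2: pass it for a single request only.
// $str = file_get_contents($mainURL, false, wiki_context('MyWikiReader/1.0 (admin@example.com)'));
```

The global setting is the least intrusive change to the snippet; the per-request context is handy when other code on the same page should keep its own user agent.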
      
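    $lang = "en";
    $item = "U2";
    $mainURL = "http://".$lang.".wikipedia.org/wiki/".$item;
    // file_get_contents() returns false on failure, e.g. when no user_agent is set.
    $str = file_get_contents($mainURL);
    if ($str === false) {
     echo 'Could not fetch the article.';
     die();
    }
    // Grab everything between the content markers. The "e" modifier was dropped:
    // it only ever applied to preg_replace() and is rejected by current PHP.
    // Note: the markers depend on Wikipedia's page template and may have changed.
    preg_match_all('#<!-- start content -->(.*?)<!-- end content -->#s', $str, $array);
    if ( !empty($array[1]) ) {
     $str = $array[1][0];
    } else {
     echo 'Nothing found...';
     die();
    }
    // Remove tables.
    $str = preg_replace("/<table.*?>.+?<\/table>/is" , "" , $str);
    // Remove link tags but keep the link text.
    $str = preg_replace("/(<a .*?\">)/is" , "" , $str);
    $str = preg_replace("/<\/a>/is" , "" , $str);
    // Remove footnote markers, scripts and divs.
    $str = preg_replace("/<sup.+?>.+?<\/sup>/is" , "" , $str);
    $str = preg_replace("/<script.*?>.*?<\/script>/is" , "" , $str);
    $str = preg_replace("/<div.*?>.+?<\/div>/is" , "" , $str);
    // Remove anything up to and including an opening body tag, if one is left.
    $str = preg_replace("/^.*?<body.*?>/is" , "" , $str);
    // Remove section edit links (Dutch and English).
    $str = preg_replace("/(\[bewerken\]|\[edit\])/is" , "" , $str);
    echo $str;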
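
To see what the link, footnote and edit-marker patterns do without hitting Wikipedia, they can be run against a small inline fragment. This is only a sketch: the helper name `strip_wiki_markup` and the sample HTML are made up for illustration, and real article markup may need further patterns.

```php
<?php
// Hypothetical helper wrapping the link, footnote and edit-marker cleanup.
function strip_wiki_markup(string $html): string
{
    $html = preg_replace('/<a .*?">/is', '', $html);           // opening link tags
    $html = preg_replace('/<\/a>/i', '', $html);               // closing link tags
    $html = preg_replace('/<sup.+?>.+?<\/sup>/is', '', $html); // footnote markers
    $html = preg_replace('/\[(bewerken|edit)\]/i', '', $html); // edit links (Dutch/English)
    return $html;
}

// Made-up sample fragment with a link, a footnote marker and an edit link.
$sample = '<p><a href="/wiki/Dublin" title="Dublin">Dublin</a> band'
        . '<sup id="r1">[1]</sup> [edit]</p>';

echo strip_wiki_markup($sample);
```

Wrapping the patterns in a function also makes it easy to test them on saved pages before pointing the script at the live site.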