We want to find a specific HTML tag in a string. For example, we might need to extract the content of a div tag. The difficulty arises when tags are nested, as in the following HTML:
<body> <div class="top-nav"> </div> <div class="book" id="1"> <div class="inner"> </div> </div> <div class="book" id="2"> <div class="inner"> </div> </div> <div class="menu"> </div> </body>
Let's try to find a solution to this problem using regular expressions. To build the correct expression, let's discuss a few cases.
We can start with the following regex: /<div class="book" id="\d*">.*<\/div>/. Applied to the HTML above, it matches everything from the first div with class book to the last closing div tag:
<div class="book" id="1"> <div class="inner"> </div> </div> <div class="book" id="2"> <div class="inner"> </div> </div> <div class="menu"> </div>
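The greedy behavior can be reproduced with a short script. This is a minimal sketch that assumes Python and its standard `re` module (the article does not name a host language); the variable names `html`, `greedy`, and `greedy_match` are illustrative only.

```python
import re

# The sample HTML from above, kept on a single line so that "." can
# cross from one tag to the next without any extra flags.
html = ('<body> <div class="top-nav"> </div> '
        '<div class="book" id="1"> <div class="inner"> </div> </div> '
        '<div class="book" id="2"> <div class="inner"> </div> </div> '
        '<div class="menu"> </div> </body>')

# Greedy ".*" runs as far as it can, then backtracks to the LAST "</div>".
greedy = re.compile(r'<div class="book" id="\d*">.*</div>')

greedy_match = greedy.search(html).group(0)
print(greedy_match)
```

The printed match starts at the first book div and runs through the closing tag of the menu div, so it swallows both books and the menu.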
This happens because quantifiers in regular expressions are greedy by default. We can change this behavior by adding the U modifier (available in PCRE, for example in PHP): /<div class="book" id="\d*">.*<\/div>/U, or by appending a question mark to the asterisk: /<div class="book" id="\d*">.*?<\/div>/. In both cases the result is the same: the regular expression returns two matches, and both are incomplete for our purposes because each one stops at the first closing div tag it meets:
<div class="book" id="1"> <div class="inner"> </div>
and
<div class="book" id="2"> <div class="inner"> </div>
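The two truncated matches can be checked the same way. Again this is only a sketch assuming Python's `re` module, which has no counterpart to the U modifier, so only the `.*?` form is shown; `lazy` and `lazy_matches` are illustrative names.

```python
import re

html = ('<body> <div class="top-nav"> </div> '
        '<div class="book" id="1"> <div class="inner"> </div> </div> '
        '<div class="book" id="2"> <div class="inner"> </div> </div> '
        '<div class="menu"> </div> </body>')

# Lazy ".*?" stops at the FIRST "</div>" after the opening tag, which is
# the closing tag of the nested "inner" div, not of the book div itself.
lazy = re.compile(r'<div class="book" id="\d*">.*?</div>')

lazy_matches = lazy.findall(html)
for m in lazy_matches:
    print(m)
```

Each match is cut off after the inner div, so the closing tag of the book div is left out.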
It seems that we are very close. All we need is to replace the .* part with something that matches every complete div tag nested inside the div we are interested in.
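To make the idea concrete, here is one possible sketch, not necessarily the approach this article goes on to take. It assumes Python's `re` module, which does not support recursive patterns, so it handles exactly one level of nested divs; deeper nesting would need a recursion-capable engine or a real HTML parser. The names `no_div`, `nested_div`, and `book` are hypothetical.

```python
import re

html = ('<body> <div class="top-nav"> </div> '
        '<div class="book" id="1"> <div class="inner"> </div> </div> '
        '<div class="book" id="2"> <div class="inner"> </div> </div> '
        '<div class="menu"> </div> </body>')

# Any run of characters that never begins an opening or closing div tag.
no_div = r'(?:(?!</?div\b).)*'

# A complete div that contains no further divs (one nesting level only).
nested_div = r'<div\b[^>]*>' + no_div + r'</div>'

# The body of a book div: complete nested divs, or single characters
# that do not begin a div tag. This takes the place of ".*".
book = re.compile(
    r'<div class="book" id="\d*">'
    r'(?:' + nested_div + r'|(?!</?div\b).)*'
    r'</div>'
)

book_matches = book.findall(html)
for m in book_matches:
    print(m)
```

With the sample HTML this yields each book div in full, including its inner div and its own closing tag.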
TODO