Recently I have been doing some research on web information extraction. Since I had not worked in this area before, I have been reading blogs and documentation, mainly the blog and documents of finallyly, and summarizing as I learn.

  • To extract information from a web page, the page has to be parsed. There are several approaches:

1. Analyze the distribution of HTML tags

2. Analyze the relationships between HTML tags

3. Analyze the visual features of the page

This approach requires rules to be summarized and tuned by hand, and it needs a lot of them. Adding a rule can affect pages that were already being parsed successfully, so keeping the rule set consistent is a major difficulty.

4. Analyze the layout characteristics of TABLE tags. This is the more commonly used approach.
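
As a rough illustration of the first approach, tag distribution can be gathered by counting start tags. The post does not say how the distribution is measured, so counting is only one simple reading of it; `TagCounter` and `tag_distribution` are names made up for this sketch, which uses Python's standard `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase
        self.counts[tag] += 1


def tag_distribution(html_text):
    """Return a Counter mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


if __name__ == "__main__":
    page = "<table><tr><td>a</td><td>b</td></tr></table><p>x</p>"
    print(tag_distribution(page).most_common())
```

A page dominated by `table`, `tr`, and `td` counts would be a candidate for the TABLE-based analysis in item 4.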

  • Before parsing a web page, the page needs to be normalized. In other words, the HTML document is converted into an XML document.

  Tidying an HTML document mainly involves the following 4 things:

(1) Any "<" and ">" that appear outside of tags are replaced with &lt; and &gt;

(2) All tag attribute values are put in quotation marks, for example: <a href="">

(3) All tags are matched, for example: <div>…</div>

(4) All tags are nested correctly.
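
The four tidying steps can be sketched with Python's standard `html.parser`. This is not the HtmlParser tool mentioned in this post, just a stand-in for illustration, and `Normalizer`, `normalize`, and `VOID_TAGS` are names invented here. The sketch drops comments and doctype declarations and is far from a complete tidy tool.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content; they are written in the <br/> form.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Normalizer(HTMLParser):
    """Re-emits HTML as well-formed markup (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open tags

    def _emit_open(self, tag, attrs, self_close):
        parts = [tag]
        for name, value in attrs:
            # step (2): quote every value; bare attributes get their own name
            value = name if value is None else value
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ("/>" if self_close else ">"))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self._emit_open(tag, attrs, True)
        else:
            self._emit_open(tag, attrs, False)
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag: ignore it
        # steps (3) and (4): close anything still open inside this tag
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # step (1): escape <, > and & in text
        self.out.append(escape(data, quote=False))


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    while parser.stack:  # close whatever is still open at the end
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


if __name__ == "__main__":
    tidy = normalize("<DIV class=box><P>1 &lt; 2<br><b>bold</DIV>")
    print(tidy)
    print(ET.fromstring(tidy).tag)  # parses as XML once normalized
```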

HTML normalization tool: HtmlParser

The benefits of HTML standardization

Standardized HTML code has many advantages for a website: redesigns are easier, the code is easier to maintain, there is less of it, pages open faster, the site can be read by more people, and so on. I won't list them all here. From an SEO point of view alone, standardized HTML code is better for search engine rankings. Many webmasters have not realized this, however, and their sites' rankings in search engines suffer for it.

Some XHTML basics

What is DOCTYPE?

DOCTYPE is short for Document Type. It states which version of HTML or XHTML your page is written in. The browser interprets the page code according to the DTD (Document Type Definition) named in your DOCTYPE, so you can imagine what happens when the DOCTYPE is wrong.

XHTML 1.0 gives us three DOCTYPEs:

1. Transitional

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">

2. Strict

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">

3. Frameset

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "">

The Transitional type is the most lenient and stays compatible with table layouts, presentational markup, and so on. For beginners, just choose Transitional!
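
When extracting information it can help to know which of the three types a page declares. Here is a small sketch with Python's standard `html.parser`; `DoctypeReader` and `xhtml_flavor` are names made up for this example, and the sketch only recognizes the three XHTML 1.0 public identifiers listed above.

```python
import re
from html.parser import HTMLParser


class DoctypeReader(HTMLParser):
    """Remembers the first DOCTYPE declaration (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.decl = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">"
        if self.decl is None and decl.lower().startswith("doctype"):
            self.decl = decl


def xhtml_flavor(html_text):
    """Return 'Transitional', 'Strict', 'Frameset', or None."""
    reader = DoctypeReader()
    reader.feed(html_text)
    reader.close()
    if reader.decl is None:
        return None
    match = re.search(r"XHTML 1\.0 (Transitional|Strict|Frameset)", reader.decl)
    return match.group(1) if match else None


if __name__ == "__main__":
    page = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ""><html></html>'
    print(xhtml_flavor(page))
```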

  • Set the namespace

Add the following code after the DOCTYPE:
<html xmlns="">
xmlns stands for XML namespace. Usually our pages just have <html>, so why the xmlns? The namespace marks the document and states which specification the document belongs to. Make sense? If not, feel free to skip it for now.
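
The namespace matters once the page is parsed as XML, because every element name is then qualified by it. The sketch below shows this with Python's `xml.etree.ElementTree`. The URI in it is the namespace defined by the XHTML 1.0 specification and is used here as an assumption for illustration; `XHTML_NS` and `page_title` are names made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI from the XHTML 1.0 specification (assumed for this example).
XHTML_NS = "http://www.w3.org/1999/xhtml"


def page_title(xhtml_text):
    """Return the <title> text of a namespaced XHTML document, or None."""
    root = ET.fromstring(xhtml_text)
    # A prefix map lets the path stay readable: x:head means {XHTML_NS}head
    title = root.find("x:head/x:title", {"x": XHTML_NS})
    return None if title is None else title.text


if __name__ == "__main__":
    doc = ('<html xmlns="%s"><head><title>demo</title></head>'
           '<body></body></html>' % XHTML_NS)
    print(ET.fromstring(doc).tag)  # the tag name carries the namespace
    print(page_title(doc))
```

Looking up plain `head/title` without the namespace finds nothing in such a document, which is a common surprise when parsing normalized pages.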

  • Declare the character encoding

A Simplified Chinese website can declare:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
An English website can declare:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
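
An extractor usually needs to read this declaration back so it can decode the page correctly. Here is a minimal sketch using Python's standard `html.parser`; `CharsetSniffer` and `declared_charset` are invented names, and the sketch only handles the `http-equiv` form shown above.

```python
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Finds the charset declared in a Content-Type meta tag (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() != "content-type":
            return
        # content looks like "text/html; charset=gb2312"
        for part in attr_map.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if absent."""
    sniffer = CharsetSniffer()
    sniffer.feed(html_text)
    sniffer.close()
    return sniffer.charset


if __name__ == "__main__":
    tag = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
    print(declared_charset(tag))
```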

  • Other settings between <head></head>

1. Favorites icon

Make a 16*16 .ico icon, name it favicon.ico, and put it in the root directory of the website. Then put the following code between <head> and </head>.
<link rel="icon" href="/favicon.ico" type="image/x-icon"/>
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>

2. Author and copyright information

<meta name="author" content="hxstream "/>
<meta name="copyright" content=", copyright "/>

3. Site description

<meta name="description" content=" brief introduction " />

4. Site keywords

<meta content=" Search engine optimization ,seo" name="keywords"/>

  • Close all tags

Every opened tag must be closed, for example <p></p>. Of course, there is another way to close a tag, such as: <br/>

  • Enclose attribute values in quotation marks ("")

For example: <img height="80" ……/>

  • Assign a value to every attribute

Incorrect: <input …… checked/>
Correct: <input …… checked="checked"/>

  • All XHTML element and attribute names must be lowercase

XHTML is case sensitive.
Incorrect: <TITLE></TITLE>
Correct: <title></title>

  • Nest tags properly

Incorrect: <div><h1></div></h1>
Correct: <div><h1></h1></div>
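
The rules on closing tags, quoting values, assigning values, and nesting are all XML well-formedness rules, so an XML parser can check them. The sketch below uses Python's `xml.etree.ElementTree` for that; `is_well_formed` is a name made up here. Note that it cannot enforce the lowercase rule, since <TITLE></TITLE> is still well-formed XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML with a single root element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<div><h1></h1></div>",        # nested correctly
        "<div><h1></div></h1>",        # nested incorrectly
        '<input checked="checked"/>',  # attribute has a value
        "<input checked/>",            # attribute without a value
        '<img height="80"/>',          # quoted value
        "<img height=80/>",            # unquoted value
        "<p>line<br/></p>",            # br closed
        "<p>line<br></p>",             # br left open
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```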

  • Encode special characters

For example, "<" is written as "&lt;" and ">" is written as "&gt;".
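
In Python this encoding is available in the standard `html` module. A short sketch follows; `encode_text` and `decode_text` are just wrapper names chosen for this example. Note that `escape` also encodes "&" as "&amp;", which XHTML needs as well.

```python
from html import escape, unescape


def encode_text(text):
    """Encode <, > and & so the text is safe between tags."""
    return escape(text, quote=False)


def decode_text(encoded):
    """Turn character references back into the original characters."""
    return unescape(encoded)


if __name__ == "__main__":
    encoded = encode_text("if a < b && b > c")
    print(encoded)
    print(decode_text(encoded))
```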

  • Add an alt attribute to images

The alt attribute specifies the alternate text to show when the image cannot be displayed.
For example: <img src="data:images/logo.gif" alt="seo168 serve you "/>
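
Checking a page for images without alt text is easy to automate. A minimal sketch with Python's standard `html.parser` is below; `AltChecker` and `images_missing_alt` are hypothetical names, and an empty alt="" is counted as present here, which is a choice made for this example.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Collects the src of every <img> that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # also called for self-closing tags such as <img .../>
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.missing.append(attr_map.get("src"))


def images_missing_alt(html_text):
    """Return the src values of images that lack an alt attribute."""
    checker = AltChecker()
    checker.feed(html_text)
    checker.close()
    return checker.missing


if __name__ == "__main__":
    page = '<img src="a.gif" alt="logo"/><img src="b.gif"/>'
    print(images_missing_alt(page))
```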

  • Use structural elements for content

For example, if you want to output three lines of text, you could use: <br/><br/>
I suggest replacing the above with the following:

