Recently I have been doing some research on Web information extraction. Since I had not worked in this area before, I have been reading blogs and documentation, in particular finallyly's blog and documents, and summarizing as I learn.

  • Extracting information from a web page requires parsing the page first. There are several approaches:

1、 Analysis based on the distribution of HTML tags

2、 Analysis based on the relationships between HTML tags

3、 Analysis based on the visual features of the page

This approach requires rules to be written and tuned by hand, and it needs a lot of them. Adding a rule can break pages that were previously parsed successfully, so keeping the rule set consistent is a major difficulty.

4、 Analysis based on the layout characteristics of TABLE tags. This is the more commonly used approach.

  • Before parsing, the page has to be normalized, that is, the HTML document is converted into an XML document.

  Tidying up an HTML document mainly involves the following 4 things:

(1) Replace every "<" and ">" that appears outside of a tag with &lt; and &gt;

(2) Put every attribute value in quotation marks, for example: <a href="http://www.baidu.com">

(3) Make sure every tag is matched, for example: <div>…</div>

(4) Make sure all tags are nested correctly.
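The four rules above can be sketched in a few lines of Python. This is only a rough illustration built on the standard library's html.parser, not how any particular normalization tool works. The names Normalizer, normalize, and VOID_TAGS are made up for this example, and the list of void tags is deliberately incomplete.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Normalizer(HTMLParser):
    """Re-emit parsed HTML as XML-style markup (illustrative, not complete)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []  # fragments of the normalized document
        self.stack = []   # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Rule 2: quote every attribute value. A bare attribute such as
        # `checked` gets its own name as its value.
        rendered = "".join(
            ' {}="{}"'.format(name, escape(name if value is None else value))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.pieces.append("<{}{}/>".format(tag, rendered))
        else:
            self.pieces.append("<{}{}>".format(tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, or closing tag of a void element
        # Rules 3 and 4: close everything still open inside this element,
        # so that tags are matched and nested correctly.
        while self.stack:
            current = self.stack.pop()
            self.pieces.append("</{}>".format(current))
            if current == tag:
                break

    def handle_data(self, data):
        # Rule 1: escape "<", ">" and "&" that occur in text.
        self.pieces.append(escape(data, quote=False))

    def result(self):
        # Close whatever is still open at the end of the input.
        while self.stack:
            self.pieces.append("</{}>".format(self.stack.pop()))
        return "".join(self.pieces)


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return parser.result()


print(normalize("<div><h1 class=title>a</div>"))
```

The parser lowercases tag and attribute names on its own, so the output also ends up with lowercase names. Comments, DOCTYPE declarations, and processing instructions are simply dropped in this sketch.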

HTML normalization tool: HtmlParser

Benefits of HTML normalization

Standardized HTML code has many advantages for a website: it is easier to redesign, the code is easier to maintain, there is less of it, pages open faster, and it is accessible to more readers, to name just a few. From an SEO point of view alone, standardized HTML code helps search engine rankings. Many webmasters do not realize this, and their sites' rankings suffer as a result.

Some key points about XHTML

  • Add a DOCTYPE

What is a DOCTYPE?

DOCTYPE is short for Document Type. It states which version of HTML or XHTML the page is written in, and the browser interprets the page code according to the DTD (Document Type Definition) named in the DOCTYPE. You can imagine what happens when the DOCTYPE is wrong.

XHTML 1.0 provides three DOCTYPEs:

1. Transitional

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

2. Strict

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

3. Frameset

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

The Transitional type tolerates table-based layout, presentational markup, and so on. For beginners, just choose Transitional!
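For a crawler it can be handy to know which of the three types a fetched page declares. Here is a minimal sketch that assumes the page is already available as a string. The function name xhtml_flavour and the lookup table are invented for this example.

```python
import re

# Public identifiers of the three XHTML 1.0 document types.
XHTML1_DOCTYPES = {
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "transitional",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "strict",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "frameset",
}

# Captures the public identifier, i.e. the first quoted string after PUBLIC.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def xhtml_flavour(document):
    """Return 'transitional', 'strict', 'frameset', or None if unrecognized."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None
    return XHTML1_DOCTYPES.get(match.group(1))


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>'
)
print(xhtml_flavour(page))
```

Pages with no DOCTYPE, or with one that is not in the table, give None, so the caller decides how to treat them.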

  • Set the namespace

Add the following code after the DOCTYPE:
<html xmlns="http://www.w3.org/1999/xhtml">
xmlns is short for XML namespace. Usually a page just has <html>, so why the xmlns? The namespace labels the document and states which specification its element names belong to. If that does not make sense yet, feel free to skip it.
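The namespace matters as soon as the normalized page is fed to an XML parser. The small sketch below uses Python's xml.etree.ElementTree to show the effect. The sample page and the constant XHTML_NS are made up for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

page = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head><body/></html>"
)

root = ET.fromstring(page)

# ElementTree writes a namespaced tag as "{namespace}localname".
print(root.tag)

# Lookups have to name the namespace too; a bare "head/title" finds nothing.
title = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
print(title.text)
```

Forgetting the namespace in element lookups is a common reason why an XPath query that looks correct returns nothing on an XHTML page.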

  • Declare the character encoding

A Simplified Chinese site can declare:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
An English site can declare:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
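When crawling, this declaration is also what tells you how to decode the raw bytes of a page. The following is a rough sketch under a few assumptions: it ignores the HTTP Content-Type header, only looks at the first 4096 bytes, and falls back to UTF-8. The helper names declared_charset and decode_page are hypothetical.

```python
import re

# Matches charset=... inside a <meta> tag, working on the raw bytes.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)


def declared_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = META_CHARSET_RE.search(raw[:4096])  # the declaration sits near the top
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw):
    # errors="replace" keeps the sketch from failing on stray bytes.
    return raw.decode(declared_charset(raw), errors="replace")


raw = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'
    "<p>中文</p>"
).encode("gb2312")
print(declared_charset(raw))
```

A real crawler should also handle a declared charset that Python does not recognize, which makes decode() raise LookupError.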

  • Other settings inside <head></head>

1. Favicon

Make a 16*16 .ico icon, name it favicon.ico, put it in the root directory of the site, and then add the following code inside <head></head>.
<link rel="icon" href="/favicon.ico" type="image/x-icon"/>
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>

2. Author and copyright information

<meta name="author" content="hxstream"/>
<meta name="copyright" content="www.cnblogs.com, all rights reserved"/>

3. Site description

<meta name="description" content="brief introduction"/>

4. Site keywords

<meta content="search engine optimization,seo" name="keywords"/>
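From the extraction side, these meta tags are easy to read back. Below is a minimal sketch with the standard library's html.parser. The class MetaCollector and the function collect_meta are names invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also arrive here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name and attributes.get("content") is not None:
            self.meta[name.lower()] = attributes["content"].strip()


def collect_meta(html_text):
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


head = (
    '<head><meta name="author" content="hxstream"/>'
    '<meta content="search engine optimization,seo" name="keywords"/></head>'
)
print(collect_meta(head))
```

The order of the name and content attributes does not matter, because the attributes are turned into a dictionary first.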

  • Close all tags

Every opened tag must be closed, for example <p>www.seo168.com</p>. Empty elements can close themselves, for example: <br/>

  • Enclose attribute values in quotation marks

For example: <img height="80" ……/>

  • Give every attribute a value

Incorrect: <input …… checked/>
Correct: <input …… checked="checked"/>

  • Write all XHTML element and attribute names in lowercase

XHTML is case sensitive.
Incorrect: <TITLE>www.seo168.com</TITLE>
Correct: <title>www.seo168.com</title>

  • Nest tags properly

Incorrect: <div><h1>www.seo168.com</div></h1>
Correct: <div><h1>www.seo168.com</h1></div>
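A quick way to check a fragment against the closing, quoting, and nesting rules is to hand it to an XML parser, which rejects anything that is not well-formed. The sketch below uses xml.etree.ElementTree, and the helper name is_well_formed is made up for this example. Note that it only checks well-formedness, so it will not complain about uppercase names such as <TITLE>.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if `fragment` parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<div><h1>www.seo168.com</h1></div>"))  # properly nested
print(is_well_formed("<div><h1>www.seo168.com</div></h1>"))  # wrongly nested
```

The same check also catches an unclosed <br> and an unquoted attribute value such as height=80.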

  • Encode special characters

For example, "<" is written as "&lt;" and ">" is written as "&gt;".
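In Python the standard library already does this encoding and decoding, so there is no need to write the replacements by hand. A small sketch follows, with a sample string made up for illustration.

```python
from html import escape, unescape

text = 'if a < b && b > c: print("ok")'

# escape() replaces &, < and >, and by default the quote characters too.
encoded = escape(text)
print(encoded)

# unescape() turns the references back into the original characters.
decoded = unescape(encoded)
print(decoded == text)
```

Passing quote=False to escape() leaves quotation marks alone, which is enough for text that sits between tags rather than inside an attribute value.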

  • Add an alt attribute to images

The alt attribute specifies the alternative text shown when the image cannot be displayed.
For example: <img src="images/logo.gif" alt="seo168 at your service"/>

  • Use structural elements for content

For example, to output three lines of text you could write:
www.seo168.com<br/>www.seo168.com<br/>www.seo168.com
I suggest writing this instead:
<ul>
<li>www.seo168.com</li>
<li>www.seo168.com</li>
<li>www.seo168.com</li>
</ul>

Appendix 1: Related links

Appendix 2: Special character code table

A "---" in the entity name column means the character has no named entity in this table.

Character  Decimal code  Entity name  Description
---  &#0; to &#31;  ---  Unused (control characters, apart from tab, line feed, and carriage return)
(space)  &#32;  ---  Space
!  &#33;  ---  Exclamation mark
"  &#34;  &quot;  Quotation mark
#  &#35;  ---  Number sign
$  &#36;  ---  Dollar sign
%  &#37;  ---  Percent sign
&  &#38;  &amp;  Ampersand
'  &#39;  ---  Apostrophe
(  &#40;  ---  Left parenthesis
)  &#41;  ---  Right parenthesis
*  &#42;  ---  Asterisk
+  &#43;  ---  Plus sign
,  &#44;  ---  Comma
-  &#45;  ---  Hyphen
.  &#46;  ---  Period (full stop)
/  &#47;  ---  Solidus (slash)
0  &#48;  ---  Digit 0
1  &#49;  ---  Digit 1
2  &#50;  ---  Digit 2
3  &#51;  ---  Digit 3
4  &#52;  ---  Digit 4
5  &#53;  ---  Digit 5
6  &#54;  ---  Digit 6
7  &#55;  ---  Digit 7
8  &#56;  ---  Digit 8
9  &#57;  ---  Digit 9
:  &#58;  ---  Colon
;  &#59;  ---  Semicolon
<  &#60;  &lt;  Less than
=  &#61;  ---  Equals sign
>  &#62;  &gt;  Greater than
?  &#63;  ---  Question mark
@  &#64;  ---  Commercial at
A  &#65;  ---  Capital A
B  &#66;  ---  Capital B
C  &#67;  ---  Capital C
D  &#68;  ---  Capital D
E  &#69;  ---  Capital E
F  &#70;  ---  Capital F
G  &#71;  ---  Capital G
H  &#72;  ---  Capital H
I  &#73;  ---  Capital I
J  &#74;  ---  Capital J
K  &#75;  ---  Capital K
L  &#76;  ---  Capital L
M  &#77;  ---  Capital M
N  &#78;  ---  Capital N
O  &#79;  ---  Capital O
P  &#80;  ---  Capital P
Q  &#81;  ---  Capital Q
R  &#82;  ---  Capital R
S  &#83;  ---  Capital S
T  &#84;  ---  Capital T
U  &#85;  ---  Capital U
V  &#86;  ---  Capital V
W  &#87;  ---  Capital W
X  &#88;  ---  Capital X
Y  &#89;  ---  Capital Y
Z  &#90;  ---  Capital Z
[  &#91;  ---  Left square bracket
\  &#92;  ---  Reverse solidus (backslash)
]  &#93;  ---  Right square bracket
^  &#94;  ---  Caret
_  &#95;  ---  Horizontal bar (underscore)
`  &#96;  ---  Grave accent
a  &#97;  ---  Small a
b  &#98;  ---  Small b
c  &#99;  ---  Small c
d  &#100;  ---  Small d
e  &#101;  ---  Small e
f  &#102;  ---  Small f
g  &#103;  ---  Small g
h  &#104;  ---  Small h
i  &#105;  ---  Small i
j  &#106;  ---  Small j
k  &#107;  ---  Small k
l  &#108;  ---  Small l
m  &#109;  ---  Small m
n  &#110;  ---  Small n
o  &#111;  ---  Small o
p  &#112;  ---  Small p
q  &#113;  ---  Small q
r  &#114;  ---  Small r
s  &#115;  ---  Small s
t  &#116;  ---  Small t
u  &#117;  ---  Small u
v  &#118;  ---  Small v
w  &#119;  ---  Small w
x  &#120;  ---  Small x
y  &#121;  ---  Small y
z  &#122;  ---  Small z
{  &#123;  ---  Left curly brace
|  &#124;  ---  Vertical bar
}  &#125;  ---  Right curly brace
~  &#126;  ---  Tilde
---  &#127; to &#159;  ---  Unused
(non-breaking space)  &#160;  &nbsp;  Non-breaking space
¡  &#161;  &iexcl;  Inverted exclamation mark
¢  &#162;  &cent;  Cent sign
£  &#163;  &pound;  Pound sterling
¤  &#164;  &curren;  General currency sign
¥  &#165;  &yen;  Yen sign
¦  &#166;  &brvbar;  Broken vertical bar
§  &#167;  &sect;  Section sign
¨  &#168;  &uml;  Umlaut (diaeresis)
©  &#169;  &copy;  Copyright
ª  &#170;  &ordf;  Feminine ordinal
«  &#171;  &laquo;  Left angle quote, guillemet left
¬  &#172;  &not;  Not sign
(soft hyphen)  &#173;  &shy;  Soft hyphen
®  &#174;  &reg;  Registered trademark
¯  &#175;  &macr;  Macron accent
°  &#176;  &deg;  Degree sign
±  &#177;  &plusmn;  Plus or minus
²  &#178;  &sup2;  Superscript two
³  &#179;  &sup3;  Superscript three
´  &#180;  &acute;  Acute accent
µ  &#181;  &micro;  Micro sign
¶  &#182;  &para;  Paragraph sign
·  &#183;  &middot;  Middle dot
¸  &#184;  &cedil;  Cedilla
¹  &#185;  &sup1;  Superscript one
º  &#186;  &ordm;  Masculine ordinal
»  &#187;  &raquo;  Right angle quote, guillemet right
¼  &#188;  &frac14;  Fraction one-fourth
½  &#189;  &frac12;  Fraction one-half
¾  &#190;  &frac34;  Fraction three-fourths
¿  &#191;  &iquest;  Inverted question mark
À  &#192;  &Agrave;  Capital A, grave accent
Á  &#193;  &Aacute;  Capital A, acute accent
Â  &#194;  &Acirc;  Capital A, circumflex
Ã  &#195;  &Atilde;  Capital A, tilde
Ä  &#196;  &Auml;  Capital A, diaeresis / umlaut
Å  &#197;  &Aring;  Capital A, ring
Æ  &#198;  &AElig;  Capital AE ligature
Ç  &#199;  &Ccedil;  Capital C, cedilla
È  &#200;  &Egrave;  Capital E, grave accent
É  &#201;  &Eacute;  Capital E, acute accent
Ê  &#202;  &Ecirc;  Capital E, circumflex
Ë  &#203;  &Euml;  Capital E, diaeresis / umlaut
Ì  &#204;  &Igrave;  Capital I, grave accent
Í  &#205;  &Iacute;  Capital I, acute accent
Î  &#206;  &Icirc;  Capital I, circumflex
Ï  &#207;  &Iuml;  Capital I, diaeresis / umlaut
Ð  &#208;  &ETH;  Capital Eth, Icelandic
Ñ  &#209;  &Ntilde;  Capital N, tilde
Ò  &#210;  &Ograve;  Capital O, grave accent
Ó  &#211;  &Oacute;  Capital O, acute accent
Ô  &#212;  &Ocirc;  Capital O, circumflex
Õ  &#213;  &Otilde;  Capital O, tilde
Ö  &#214;  &Ouml;  Capital O, diaeresis / umlaut
×  &#215;  &times;  Multiply sign
Ø  &#216;  &Oslash;  Capital O, slash
Ù  &#217;  &Ugrave;  Capital U, grave accent
Ú  &#218;  &Uacute;  Capital U, acute accent
Û  &#219;  &Ucirc;  Capital U, circumflex
Ü  &#220;  &Uuml;  Capital U, diaeresis / umlaut
Ý  &#221;  &Yacute;  Capital Y, acute accent
Þ  &#222;  &THORN;  Capital Thorn, Icelandic
ß  &#223;  &szlig;  Small sharp s, German sz
à  &#224;  &agrave;  Small a, grave accent
á  &#225;  &aacute;  Small a, acute accent
â  &#226;  &acirc;  Small a, circumflex
ã  &#227;  &atilde;  Small a, tilde
ä  &#228;  &auml;  Small a, diaeresis / umlaut
å  &#229;  &aring;  Small a, ring
æ  &#230;  &aelig;  Small ae ligature
ç  &#231;  &ccedil;  Small c, cedilla
è  &#232;  &egrave;  Small e, grave accent
é  &#233;  &eacute;  Small e, acute accent
ê  &#234;  &ecirc;  Small e, circumflex
ë  &#235;  &euml;  Small e, diaeresis / umlaut
ì  &#236;  &igrave;  Small i, grave accent
í  &#237;  &iacute;  Small i, acute accent
î  &#238;  &icirc;  Small i, circumflex
ï  &#239;  &iuml;  Small i, diaeresis / umlaut
ð  &#240;  &eth;  Small eth, Icelandic
ñ  &#241;  &ntilde;  Small n, tilde
ò  &#242;  &ograve;  Small o, grave accent
ó  &#243;  &oacute;  Small o, acute accent
ô  &#244;  &ocirc;  Small o, circumflex
õ  &#245;  &otilde;  Small o, tilde
ö  &#246;  &ouml;  Small o, diaeresis / umlaut
÷  &#247;  &divide;  Division sign
ø  &#248;  &oslash;  Small o, slash
ù  &#249;  &ugrave;  Small u, grave accent
ú  &#250;  &uacute;  Small u, acute accent
û  &#251;  &ucirc;  Small u, circumflex
ü  &#252;  &uuml;  Small u, diaeresis / umlaut
ý  &#253;  &yacute;  Small y, acute accent
þ  &#254;  &thorn;  Small thorn, Icelandic
ÿ  &#255;  &yuml;  Small y, diaeresis / umlaut
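Rows of a table like this do not have to be typed by hand. As a rough sketch, Python's html.entities module maps code points to entity names, so a row can be generated for any character. The function name entity_row is invented for this example.

```python
from html.entities import codepoint2name


def entity_row(codepoint):
    """Return (character, decimal reference, entity name or '---')."""
    name = codepoint2name.get(codepoint)
    return (
        chr(codepoint),
        "&#{};".format(codepoint),
        "&{};".format(name) if name else "---",
    )


# A few sample rows: "<", the copyright sign, and e with acute accent.
for codepoint in (60, 169, 233):
    print(*entity_row(codepoint))
```

Characters that have no named entity in that mapping, such as plain letters and digits, come out with "---" in the last column.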
