hooyantsing's Blog

Understand XPath in 20 Minutes: Scrapy's Data-Parsing Powerhouse

2022/04/07

Video source: "Understand XPath in 20 Minutes: Scrapy's Data-Parsing Powerhouse" (20分钟带你搞懂XPath-Scrapy数据解析神器)

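Getting to Know XPath

1. What is XPath

  1. A language for querying XML documents. HTML is not strictly a subset of XML, but a parsed HTML page has the same tree structure, so XPath is widely used to extract data from HTML as well;
  2. XPath libraries exist for almost every programming language, Java and C included;
  3. XPath is not the only way to process XML. Other parsers and APIs include BeautifulSoup, lxml, DOM, SAX, JSDOM, DOM4J, and minixml (some of them, such as lxml, also support XPath).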

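2. XPath Syntax

XPath syntax falls into 3 broad categories:

  • Hierarchy: / selects direct children; // skips levels and selects descendants at any depth;
  • Attributes: @ accesses an attribute;
  • Functions: contains(), text().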
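The three categories can be tried out in a minimal sketch. It assumes a made-up document and uses Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath (it has no text() and cannot return attribute values directly), so those two steps are emulated with .text and .get(). A full engine such as lxml or Scrapy's selectors would evaluate //a/@href and //a/text() directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration.
doc = ET.fromstring(
    "<html><body>"
    "<div id='menu'><a href='/home'>Home</a>"
    "<ul><li><a href='/docs'>Docs</a></li></ul></div>"
    "</body></html>"
)

# "/" steps to direct children only: <a> elements directly under the <div>.
direct = doc.findall("./body/div/a")

# "//" skips levels: every <a> at any depth below the root.
anywhere = doc.findall(".//a")

# "@" addresses an attribute; here it filters elements by attribute value.
menus = doc.findall(".//div[@id='menu']")

# Emulating //a/@href and //a/text() with plain Python accessors.
hrefs = [a.get("href") for a in anywhere]
texts = [a.text for a in anywhere]

print(len(direct), len(anywhere), len(menus))
print(hrefs, texts)
```

The first print shows how many links each step reaches: the direct-child path finds only the link sitting directly under the div, while the skip-level path also finds the one nested inside the list.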

Using XPath

1. Using XPath in the Browser

In Chrome DevTools, an expression can be tried in the Console by wrapping it in $x("...").

//div[@class="opr-recommends-merge-content"]//div[contains(@class,"opr-recommends-merge-item")]

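Reading the expression from left to right: // (skip levels), a tag name, then [@class="..."], which matches only when the class attribute's entire value equals the given string, so it suits an element with a single class name. After that comes another // and tag name, then [contains(@class, "...")], which matches when the class attribute contains the given string, so it suits an element with several class names of which you know one.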
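The difference between the two predicates is easy to get wrong, so here is a small sketch on a made-up fragment. Python's standard-library ElementTree has no contains(), so that test is emulated with a substring check. The third variant, a whole-class-name test, is not from the video. It is shown only because contains() is a plain substring match and can hit unintended classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the class names imitate the ones used in this post.
doc = ET.fromstring(
    "<root>"
    "<div class='c-span2'>one class</div>"
    "<div class='c-span2 container_f_bS8'>two classes</div>"
    "<div class='c-span20'>different class</div>"
    "</root>"
)
divs = doc.findall(".//div")

# [@class="c-span2"]: the whole attribute value must equal the string.
exact = [d.text for d in doc.findall(".//div[@class='c-span2']")]

# contains(@class, "c-span2"): a substring test, emulated here in Python.
substring = [d.text for d in divs if "c-span2" in d.get("class", "")]

# Whole-name test: split the attribute on whitespace and compare class names.
# A common XPath 1.0 idiom for the same idea is
# contains(concat(" ", normalize-space(@class), " "), " c-span2 ").
token = [d.text for d in divs if "c-span2" in d.get("class", "").split()]

print(exact)
print(substring)
print(token)
```

The exact match misses the element with two classes, the substring match also picks up c-span20, and the whole-name test returns just the two elements that really carry the class.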

For example, given this HTML:

<!-- more html ... -->

<div class="result-op xpath-log new-pmd" srcid="21102" fk="21102_" id="1" tpl="right_recommends_merge" mu="http://www.baidu.com/s?wd=&amp;srcid=21102" data-op="{'y':''}" data-click="{&quot;p1&quot;:1,&quot;rsv_bdr&quot;:&quot;&quot;,&quot;fm&quot;:&quot;alxr&quot;,&quot;rsv_stl&quot;:0,&quot;p5&quot;:1}" data-cost="{&quot;renderCost&quot;:1,&quot;dataCost&quot;:3}" m-name="aladdin-san/app/right_recommends_merge/result_b895932" m-path="https://pss.bdstatic.com/r/www/cache/static/aladdin-san/app/right_recommends_merge/result_b895932" nr="1">
<div class="cr-content container_2AHLd"><section data-click="{
'rsv_card_index': 0
}"><div class="cr-title c-clearfix"><!--8--><!--9--><!--11--><span title="相关应用软件">相关应用软件</span><!--11--><!--10--><!--8--><!--7--></div><div class="
container_EBGt2

has-attr_1DAxq
fold_2kZgh
"><div class="c-row row_19xr-"><div><section data-click="{
'rsv_item_index': 0
}"><div class="c-span2 container_f_bS8" data-click="{&quot;rsv_re_ename&quot;:&quot;j2sdk&quot;,&quot;rsv_re_uri&quot;:&quot;29d5861b9ff94d14b0f463c1129011a3&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=j2sdk&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=29d5861b9ff94d14b0f463c1129011a3"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t11.baidu.com/it/u=3391737136,525361792&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=j2sdk&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=29d5861b9ff94d14b0f463c1129011a3"></a><!--17--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="j2sdk" href="/s?wd=j2sdk&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=29d5861b9ff94d14b0f463c1129011a3">j2sdk</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">sun公司开发编程工具</p><!--19--></div><!--18--></div></section><section data-click="{
'rsv_item_index': 1
}"><div class="c-span2 container_f_bS8" data-click="{&quot;rsv_re_ename&quot;:&quot;PythonWin&quot;,&quot;rsv_re_uri&quot;:&quot;4ab40fa61802411e82889aaac1428974&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=PythonWin&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=4ab40fa61802411e82889aaac1428974"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t12.baidu.com/it/u=1526885684,164265061&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=PythonWin&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=4ab40fa61802411e82889aaac1428974"></a><!--21--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="PythonWin" href="/s?wd=PythonWin&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=4ab40fa61802411e82889aaac1428974">PythonWin</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">Python集成开发环境</p><!--23--></div><!--22--></div></section><section data-click="{
'rsv_item_index': 2
}"><div class="c-span2 container_f_bS8 c-span-last-s" data-click="{&quot;rsv_re_ename&quot;:&quot;mySQL&quot;,&quot;rsv_re_uri&quot;:&quot;9773c871a0a642f0b481e5f2d8755490&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=mySQL&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=9773c871a0a642f0b481e5f2d8755490"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t12.baidu.com/it/u=949912871,2851013736&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=mySQL&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=9773c871a0a642f0b481e5f2d8755490"></a><!--25--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="mySQL" href="/s?wd=mySQL&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=dbfeZL2CxS3Uk2594ohlJvEooa1Ju0BYeb4x5MEYcvlLqkKOJZHkxvyHhbE&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=9773c871a0a642f0b481e5f2d8755490">mySQL</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">关系型数据库管理系统</p><!--27--></div><!--26--></div></section><section data-click="{
'rsv_item_index': 3
}"><div class="c-span2 container_f_bS8 c-span-last last-item_cG9Ps" data-click="{&quot;rsv_re_ename&quot;:&quot;PyScripter&quot;,&quot;rsv_re_uri&quot;:&quot;8076a6780c194b188088d1556da59fae&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=PyScripter&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=8076a6780c194b188088d1556da59fae"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t11.baidu.com/it/u=3630809422,288105807&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=PyScripter&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=8076a6780c194b188088d1556da59fae"></a><!--29--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="PyScripter" href="/s?wd=PyScripter&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=8076a6780c194b188088d1556da59fae">PyScripter</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">语法自动补全功能</p><!--31--></div><!--30--></div></section><!--15--></div><!--14--></div><div class="c-row row_19xr-"><div><section data-click="{
'rsv_item_index': 4
}"><div class="c-span2 container_f_bS8" data-click="{&quot;rsv_re_ename&quot;:&quot;矿工三兄弟2&quot;,&quot;rsv_re_uri&quot;:&quot;6e13e1056e7a46df80255f3948ba3379&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=%E7%9F%BF%E5%B7%A5%E4%B8%89%E5%85%84%E5%BC%9F2&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=6e13e1056e7a46df80255f3948ba3379"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t11.baidu.com/it/u=2848745716,3683045690&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=%E7%9F%BF%E5%B7%A5%E4%B8%89%E5%85%84%E5%BC%9F2&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=6e13e1056e7a46df80255f3948ba3379"></a><!--35--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="矿工三兄弟2" href="/s?wd=%E7%9F%BF%E5%B7%A5%E4%B8%89%E5%85%84%E5%BC%9F2&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=6e13e1056e7a46df80255f3948ba3379">矿工三兄弟2</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">益智类游戏</p><!--37--></div><!--36--></div></section><section data-click="{
'rsv_item_index': 5
}"><div class="c-span2 container_f_bS8" data-click="{&quot;rsv_re_ename&quot;:&quot;MyEclipse&quot;,&quot;rsv_re_uri&quot;:&quot;218fcdd0de454010be92fe4fc28ad846&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=MyEclipse&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=218fcdd0de454010be92fe4fc28ad846"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t10.baidu.com/it/u=1611440940,2767759072&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=MyEclipse&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=218fcdd0de454010be92fe4fc28ad846"></a><!--39--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="MyEclipse" href="/s?wd=MyEclipse&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=218fcdd0de454010be92fe4fc28ad846">MyEclipse</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">软件编程开发服务平台</p><!--41--></div><!--40--></div></section><section data-click="{
'rsv_item_index': 6
}"><div class="c-span2 container_f_bS8 c-span-last-s" data-click="{&quot;rsv_re_ename&quot;:&quot;netbeans&quot;,&quot;rsv_re_uri&quot;:&quot;0223caf24051401c82d912a6f4c0dc55&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=netbeans&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=0223caf24051401c82d912a6f4c0dc55"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t12.baidu.com/it/u=385794101,2115048122&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=netbeans&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=0223caf24051401c82d912a6f4c0dc55"></a><!--43--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="netbeans" href="/s?wd=netbeans&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=0954IDkQWSYx%2BvAgJjIlQlpaYENK81UNEaWdeHBR4JjCqMvPzJxPsyNJJDY&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=0223caf24051401c82d912a6f4c0dc55">netbeans</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">世界级的Java IDE</p><!--45--></div><!--44--></div></section><section data-click="{
'rsv_item_index': 7
}"><div class="c-span2 container_f_bS8 c-span-last last-item_cG9Ps" data-click="{&quot;rsv_re_ename&quot;:&quot;WinRAR&quot;,&quot;rsv_re_uri&quot;:&quot;298fb6b007c34e01adc9253773989edf&quot;}"><div class="img-container_2JSl6"><a target="_blank" class="c-img c-img2 c-img-s c-img-radius-large cover-img_PLe_S" href="/s?wd=WinRAR&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=b034dvWyplUCRtFu9m5cPScxeb7SX%2B7FHh0cNDEnVX6GWWV2olDKGGIGlB4&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=298fb6b007c34e01adc9253773989edf"><span class="cover-img-boder_1-OG1 c-img-radius-large"></span><img src="https://t10.baidu.com/it/u=2082045759,3140308789&amp;fm=58" class="c-img c-img2 c-img-radius-large"></a><a class="img-container-mask_1S9Kw c-img-radius-large" target="_blank" href="/s?wd=WinRAR&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=b034dvWyplUCRtFu9m5cPScxeb7SX%2B7FHh0cNDEnVX6GWWV2olDKGGIGlB4&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=298fb6b007c34e01adc9253773989edf"></a><!--47--></div><div class="title_1v7d9"><a target="_blank" class="c-font-medium inc_rs_a" title="WinRAR" href="/s?wd=WinRAR&amp;usm=2&amp;ie=utf-8&amp;rsv_pq=bd1060300003d291&amp;oq=xpath&amp;rsv_t=b034dvWyplUCRtFu9m5cPScxeb7SX%2B7FHh0cNDEnVX6GWWV2olDKGGIGlB4&amp;rsv_cq=&amp;rsv_dl=0_right_recommends_merge_21102&amp;euri=298fb6b007c34e01adc9253773989edf">WinRAR</a></div><div class="attr-container_22wB9"><p class="attr-text_3jLeU">压缩包管理器</p><!--49--></div><!--48--></div></section><!--33--></div><!--32--></div><!--13--></div></section><!--5--></div>
</div>

<!-- more html ... -->
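
The expression below selects every recommended item (each div whose class contains "c-span2") inside the block whose class contains "xpath-log":
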
//div[contains(@class,"xpath-log")]//div[contains(@class,"c-span2")]
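To see what this selection yields, the sketch below runs the same logic on a trimmed, hand-made stand-in for the snippet above. Only two items are kept, and the markup is simplified into well-formed XML so that Python's standard-library ElementTree can parse it. Because ElementTree lacks contains(), the two contains(@class, ...) steps are emulated with substring checks. With Scrapy, the original expression could instead be passed unchanged to response.xpath(...).

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the Baidu snippet: same class names, far fewer nodes.
fragment = """
<body>
  <div class="result-op xpath-log new-pmd">
    <div class="c-span2 container_f_bS8">
      <div class="title_1v7d9"><a title="j2sdk">j2sdk</a></div>
      <p class="attr-text_3jLeU">sun公司开发编程工具</p>
    </div>
    <div class="c-span2 container_f_bS8 c-span-last-s">
      <div class="title_1v7d9"><a title="mySQL">mySQL</a></div>
      <p class="attr-text_3jLeU">关系型数据库管理系统</p>
    </div>
  </div>
</body>
"""
root = ET.fromstring(fragment.strip())

# //div[contains(@class,"xpath-log")]//div[contains(@class,"c-span2")]
items = [
    item
    for block in root.iter("div")
    if "xpath-log" in block.get("class", "")
    for item in block.iter("div")
    if "c-span2" in item.get("class", "")
]

# Inside each item, exact-match predicates are enough to reach name and blurb.
apps = []
for item in items:
    name = item.find(".//div[@class='title_1v7d9']/a").text
    blurb = item.find(".//p[@class='attr-text_3jLeU']").text
    apps.append((name, blurb))

for name, blurb in apps:
    print(name, blurb)
```

Selecting the item containers first and then querying relative to each one (paths starting with ".") keeps each name paired with its own description.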
