
A Collection of Website Benchmarks Labelled for Template Detection and Content Extraction


Julian Alarte, David Insa, Josep Silva, Salvador Tamarit
MiST Research Group, Universitat Politècnica de València
and
Babel Research Group, Universidad Politécnica de Madrid

XV Jornadas sobre Programación y Lenguajes (PROLE15)
September 15th, 2015

Context and Motivation

Template Detection
Identifies the template of a webpage.
Essential for indexing tasks:
  - Templates represent between 40% and 50% of data on the Web
  - Usually contain irrelevant information (e.g. advertisements, menus and banners)
Avoids waste of resources (storage space, bandwidth, etc.)
Important tool for website developers and analyzers.
Alarte, Insa, Silva, Tamarit (UPV & UPM)

TECO Benchmark Suite

September 15th, 2015

(PROLE15)


Content Extraction
Identifies the main content of the webpage.
Essential for many information retrieval and processing tasks.

Template Detection & Content Extraction
Two of the main areas of information retrieval applied to the Web.
Complementary: main content is not part of the template.


Benchmark Suite for Template Detection and Content Extraction

Testing, Comparing and Tuning
Collections of heterogeneous benchmarks: ensure the generality of the techniques.
Gold standard: ensures the same evaluation criteria.

Using a benchmark suite
Training phase: to optimize the techniques by adjusting parameters.
Evaluation phase: to measure the performance with objective criteria.
The two phases need disjoint sets of webpages.
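
The requirement that training and evaluation use disjoint sets can be illustrated with a minimal sketch. The function name, the 50/50 split, the fixed seed, and the ids 1 to 40 (one per benchmark in the suite) are assumptions made for illustration; the slides do not prescribe how the split should be made.

```python
import random

def split_benchmarks(benchmark_ids, training_fraction=0.5, seed=0):
    """Split benchmark ids into disjoint training and evaluation lists.

    Hypothetical helper: the fraction and the seed are illustrative choices.
    """
    ids = sorted(benchmark_ids)          # fixed order so the split is reproducible
    random.Random(seed).shuffle(ids)     # seeded shuffle, independent of global state
    cut = int(len(ids) * training_fraction)
    return ids[:cut], ids[cut:]

training, evaluation = split_benchmarks(range(1, 41))
assert set(training).isdisjoint(evaluation)   # no benchmark is used in both phases
print(len(training), len(evaluation))
```

Fixing the seed means anyone repeating the experiment gets the same partition, which helps keep reported results comparable.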

Previous Situation
Lack of a public and neutral benchmark suite.
Evaluations:
  - with different benchmarks
  - with different kinds of templates
  - using different criteria
Results hardly comparable with other techniques.


Our Experience
Initial intention: use a public benchmark suite, CleanEval
  - Widely used in the literature
  - Not prepared for template detection
Second option: Contacted the authors of other techniques
  - We could not use their benchmarks due to:
      * privacy
      * copyright
      * unavailability
Final choice: Build our own free and publicly available benchmark suite.


The TECO Benchmark Suite

Our approach
TECO (TEmplate detection and COntent extraction benchmarks suite)
Consists of:
  - Benchmark suite
  - Gold Standard
  - Scripts to automate the benchmarking process
Scope:
  - Template detection
  - Content extraction
Goal:
  - Test
  - Compare
  - Tune
Uses:
  - Training
  - Evaluation

Features
Result of a research project:
  - A new technique for content extraction
  - Later adapted for template detection.
40 real heterogeneous websites downloaded from the Internet.
Open, extensible, publicly available and free.
Webpages in different languages: to test language-independent features.
Downloading of the webpages:
  - All elements needed for correct visualization: HTML, images, scripts, CSS...
  - SiteSucker (OS X) and wget (Linux).
Each benchmark is composed of:
  1. Key page. Target webpage.
  2. All those webpages (from the same website) that are linked by the key page, as well as the webpages linked by them.
Gold standard (for each key page) using labels:
  - HTML classes notTemplate and mainContent.
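
As a rough illustration of how a tool might read the gold standard, the sketch below counts the elements of a page that carry the notTemplate and mainContent classes, using only Python's standard html.parser module. The class LabelCounter, the function count_labels, and the sample fragment are hypothetical. The sketch also assumes that a label applies to the element that carries the class directly, which the slides do not state.

```python
from html.parser import HTMLParser

class LabelCounter(HTMLParser):
    """Counts start tags and how many carry each gold-standard class."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.labelled = {"notTemplate": 0, "mainContent": 0}

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        for label in self.labelled:
            if label in classes:
                self.labelled[label] += 1

def count_labels(html_text):
    """Return (number of elements, per-label counts) for one page."""
    counter = LabelCounter()
    counter.feed(html_text)
    counter.close()
    return counter.elements, counter.labelled

# Hypothetical page fragment, not taken from the suite.
sample = """
<html><body>
  <div id="menu"><a href="/">Home</a></div>
  <div class="article notTemplate mainContent"><p class="notTemplate mainContent">Text</p></div>
  <div class="notTemplate">Related links</div>
</body></html>
"""
print(count_labels(sample))
```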


Producing the Gold Standard

Four different engineers
Independently:
  - Manually explored the key page and the webpages accessible from it
  - Chose what part of the webpage is the template and what part is the main content.
Together:
  - Same actions, sharing their individual opinions.


Benchmark Classification

Classification 1: All benchmarks have been classified into five groups:
  - Companies / Shops
  - Forums / Social
  - Personal websites / Blogs
  - Media / Communication
  - Institutions / Associations
www.bbc.co.uk/news/index.html (Media / Communication)

Classification 2: All benchmarks have been classified according to their size and the proportion of their template / main content.

Id   Benchmark                       Nodes   T. Nodes   M.C. Nodes
24   www.bbc.co.uk/news/index.html   2991    364        1360

Classification 3: The benchmarks were also classified according to the number of webpages that implement the template.

Id   VL   TT   PT   DT   Notes (peculiarities)
24   5    0    5    0    Several templates (but very similar).
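
The proportions behind Classification 2 can be worked out directly from the node counts. The sketch below does so for benchmark 24, assuming that "T. Nodes" means template nodes and "M.C. Nodes" means main content nodes. The function name is hypothetical.

```python
def proportions(nodes, template_nodes, main_content_nodes):
    """Return the template and main-content shares of a page's nodes."""
    return template_nodes / nodes, main_content_nodes / nodes

# Figures for benchmark 24 (www.bbc.co.uk/news/index.html) from the table above.
template_share, content_share = proportions(2991, 364, 1360)
print(f"template: {template_share:.1%}, main content: {content_share:.1%}")
# prints "template: 12.2%, main content: 45.5%"
```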


Downloading and using the suite

Download
Directory with 40 folders.
Scripts to automate the benchmarking process.

Rules for using the suite
  1. Publish the results so that they are publicly available.
  2. Provide enough information so that anyone can easily duplicate the experiments.

http://users.dsic.upv.es/~jsilva/retrieval/teco/


Rules for extending the suite

Websites included in TECO must be real and online websites not created by the people who submit the benchmark.
All benchmarks must be localized, so all resources are accessible offline.
Each benchmark must be composed of a webpage and at least all webpages accessible from it with two clicks.
All benchmarks must be manually reviewed by at least two people before being submitted.
All benchmarks submitted must be signed.
Researchers must follow the labeling guidelines of TECO.
Submitted benchmarks should not have a direct relation with a particular technique or tool.
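
The "two clicks" rule amounts to a breadth-first traversal of the site's link graph, stopped at depth two. The sketch below shows one way a contributor might check which pages a benchmark has to include. The function name and the example link graph are hypothetical, and extracting the links from the downloaded HTML is left out.

```python
from collections import deque

def pages_within_clicks(links, key_page, max_clicks=2):
    """Return pages reachable from key_page in at most max_clicks link hops."""
    distance = {key_page: 0}
    queue = deque([key_page])
    while queue:
        page = queue.popleft()
        if distance[page] == max_clicks:
            continue                      # do not follow links beyond the limit
        for target in links.get(page, ()):
            if target not in distance:    # first visit gives the shortest distance
                distance[target] = distance[page] + 1
                queue.append(target)
    return set(distance)

# Hypothetical link graph of one website: page -> pages it links to.
links = {
    "index.html": ["news.html", "about.html"],
    "news.html": ["story.html", "index.html"],
    "story.html": ["archive.html"],
}
print(sorted(pages_within_clicks(links, "index.html")))
```

In this example archive.html is three clicks away from the key page, so it falls outside the required set.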


Conclusions & Future Work


Ongoing Extension (TECO 2.0)


Includes 90 benchmarks (50 more than TECO 1.0).
Contains explicit information about subtemplates.