(PROLE15)
Content Extraction
Identifies the main content of the webpage.
Essential for many information retrieval and processing tasks.
Previous Situation
Lack of a public and neutral benchmark suite
Evaluations:
Our Experience
Initial intention: use a public benchmark suite, CleanEval
Issues: privacy, copyright, unavailability
Final choice: build our own free and publicly available benchmark suite.
Benchmark suite
Gold Standard
Scripts to automate the benchmarking process
Scope: template detection, content extraction
Goal: test, compare, tune
Uses: training, evaluation
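One way to picture the test and compare goals is to score a tool's output against the gold standard. The slides do not say which metrics the TECO scripts compute, so the sketch below assumes precision, recall, and F1 over sets of node identifiers; the function name `score` and the integer node ids are hypothetical.

```python
from typing import Set, Tuple


def score(retrieved: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of retrieved nodes against gold nodes.

    Assumption: nodes are identified by integers and the gold standard
    lists the nodes that belong to the main content (or to the template).
    """
    correct = len(retrieved & gold)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold_nodes = {1, 2, 3, 4}   # nodes marked as main content by hand
    tool_nodes = {3, 4, 5}      # nodes a hypothetical tool extracted
    print(score(tool_nodes, gold_nodes))
```

Running the same function over every benchmark and averaging the results would give one figure per tool, which is enough to compare tools or to tune a parameter.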
Websites downloaded with all the elements needed for correct visualization: HTML, images, scripts, CSS...
Tools: SiteSucker (OS X) and wget (Linux).
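As an illustration of saving a page together with the elements it needs, the sketch below builds a wget command line. The options are standard GNU wget options, but the exact options used for TECO are not given in the slides; the helper name `build_wget_command` and the output directory are hypothetical.

```python
from typing import List


def build_wget_command(url: str, out_dir: str) -> List[str]:
    """Build a wget invocation that saves a page with its requisites.

    Assumption: this option set is only one plausible choice, not the
    one used to build the benchmark suite.
    """
    return [
        "wget",
        "--page-requisites",    # also fetch images, CSS, and scripts the page needs
        "--convert-links",      # rewrite links so the local copy works offline
        "--adjust-extension",   # save HTML and CSS files with matching extensions
        "--no-parent",          # never ascend above the starting directory
        f"--directory-prefix={out_dir}",
        url,
    ]


if __name__ == "__main__":
    cmd = build_wget_command("http://www.bbc.co.uk/news/index.html", "benchmark_24")
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run` to perform the download; building it as a list avoids shell quoting problems with unusual URLs.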
The key page and the webpages accessible from it were explored manually.
For each webpage, we chose which part is the template and which part is the main content.
Together:
Benchmark Classification
Classification 1: All benchmarks have been classified into five groups:
Companies / Shops,
Forums / Social,
Personal websites / Blogs,
Media / Communication,
Institutions / Associations.
Example: www.bbc.co.uk/news/index.html (Media / Communication)
Classification 2: All benchmarks have been classified according to their size and
the proportion of their template / main content.
Id  Benchmark                      Nodes  T. Nodes  M.C. Nodes  VL  TT  PT  DT  Notes (peculiarities)
24  www.bbc.co.uk/news/index.html  2991   364       1360        5   0   5   0   Several templates (but very similar).
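The proportions used in Classification 2 can be derived from the node counts in a row like the one above. The sketch below assumes that "T. Nodes" counts template nodes and "M.C. Nodes" counts main content nodes; the class and field names are hypothetical, and the remaining columns (VL, TT, PT, DT) are left out because the slides do not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """One row of the benchmark table (only the node-count columns)."""
    benchmark_id: int
    url: str
    nodes: int                # total number of nodes in the key page
    template_nodes: int       # assumed meaning of "T. Nodes"
    main_content_nodes: int   # assumed meaning of "M.C. Nodes"

    def template_ratio(self) -> float:
        return self.template_nodes / self.nodes

    def main_content_ratio(self) -> float:
        return self.main_content_nodes / self.nodes


# Values copied from the table row for benchmark 24.
bbc = BenchmarkRow(24, "www.bbc.co.uk/news/index.html", 2991, 364, 1360)

if __name__ == "__main__":
    print(f"template: {bbc.template_ratio():.1%}")
    print(f"main content: {bbc.main_content_ratio():.1%}")
```

For this row, roughly an eighth of the nodes belong to the template and a little under half belong to the main content.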
Download
Directory with 40 folders.
Scripts to automate the benchmarking process.
http://users.dsic.upv.es/~jsilva/retrieval/teco/
Websites included in TECO must be real, online websites that were not created by the people who submit the benchmark.
Submitted benchmarks must not be directly related to any particular technique or tool.
Our approach
TECO (TEmplate detection and COntent extraction benchmarks suite)
Consists of: a benchmark suite, a gold standard, and scripts to automate the benchmarking process.
http://users.dsic.upv.es/~jsilva/retrieval/teco/