This is a shortened English version of a survey on the validation of Estonian web pages. The full survey is unfortunately available only in Estonian, but I have tried to include as much of the core information as possible in this English overview.
The purpose of this survey was to check how many web pages in Estonia are valid according to their Document Type Declarations. Information about the use of character encodings and HTML elements was also gathered.
The survey was conducted on 21,905 pages, taken from the Neti.ee Estonian servers list on February 11th, 2005. Only addresses of the form www.*.ee were used, and only the main page was checked.
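The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the code from batchval: the pattern assumes that www.*.ee means exactly one name between www and ee, and the function name and sample URLs are made up.

```python
import re
from urllib.parse import urlsplit

# Assumption: "www.*.ee" means exactly one label between "www" and "ee".
WWW_EE = re.compile(r"^www\.[a-z0-9-]+\.ee$")


def is_survey_candidate(url: str) -> bool:
    """Return True for a main page hosted at www.<name>.ee."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Only the main page was checked, so anything with a deeper path is rejected.
    return bool(WWW_EE.match(host)) and parts.path in ("", "/")


# Hypothetical sample addresses, not taken from the survey data.
urls = [
    "http://www.example.ee/",
    "http://example.ee/",
    "http://www.example.ee/news/index.html",
    "http://www.example.com/",
]
selected = [u for u in urls if is_survey_candidate(u)]
print(selected)
```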
The validation tool used was the W3C Markup Validation Service, widely known as the W3C Validator.
To reduce the human work required, a batch-validation application was developed. The application requires a *NIX system with Perl and Python, and is available for download and free modification: batchval.tar.gz (a simple user manual is included with the program).
The program checked all pages and excluded pages with:
The program did not check for HTTP headers. (Everything concerning these was handled by Python.)
The program counted the different HTML elements on each page, then sent the page to the W3C Validator and recorded statistics of the error messages (if there were any). If the Validator reported a fatal error with the encoding, the page was resubmitted for validation with the encoding override set to force ISO-8859-1.
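The element-counting step could look roughly like the sketch below, which uses Python's standard html.parser module. It is not the original batchval code; the class name, function name and sample markup are made up for illustration, and the validation step itself is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Count how many times each element is opened on a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <IMG> and <img> are counted together.
        self.counts[tag] += 1

    def handle_startendtag(self, tag, attrs):
        # XHTML-style empty elements such as <br /> are counted once.
        self.counts[tag] += 1


def count_elements(markup: str) -> Counter:
    parser = ElementCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts


# Hypothetical sample page.
sample = '<html><body><p>Tere<br>maailm<br/></p><IMG src="a.gif"><img src="b.gif"></body></html>'
counts = count_elements(sample)
print(counts.most_common())
```

Summing such counters over all pages gives totals per element, while counting the pages on which an element occurs at least once gives its frequency across pages.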
The survey was conducted between 2005-02-11 17:37 and 2005-02-13 17:29 (Estonian Time).
All the results of this survey are available in digital format.
The program excluded 2,250 pages before validation, so only 19,655 pages were actually tested with the W3C Validator.
The main reason for exclusion was a redirect to another page; the second was failure to contact the server.
The proportion of valid pages was, as expected, small: only 2.2% (436 pages) were valid, plus 0.5% which were Tentatively Valid (because of a missing encoding specification). (Figure 1.)
Figure 1. Proportion of valid pages.
42 of the 94 tentatively valid pages used the virtual hosting of artfotoplus.ee and had not been set up. The others were undetected redirections, pages under construction, or pages with very little content.
NB! In all the following comparisons between valid and invalid pages, the tentatively valid pages are excluded.
A list of all valid pages is available in appendix 3 of the Estonian version of this survey.
35% of invalid pages had a Document Type Declaration.
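For readers who want to check the proportions, the figures quoted so far fit together as follows. The numbers are the ones given in the text; the variable names are just for illustration.

```python
total_listed = 21_905      # addresses taken from the Neti.ee list
excluded = 2_250           # pages dropped before validation
validated = total_listed - excluded

valid = 436                # pages that passed validation
tentatively_valid = 94     # pages that passed only tentatively

valid_share = round(100 * valid / validated, 1)
tentative_share = round(100 * tentatively_valid / validated, 1)

print(validated, valid_share, tentative_share)
```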
As seen from figures 2 and 3, the most common document type was HTML 4.01 Transitional, probably because many HTML editors add this DTD by default, and it's the DTD with the most features and the fewest restrictions.
Second place belongs to HTML 4.0 Transitional, but if you look at the valid pages only, it's XHTML 1.0 Transitional.
Interestingly, XHTML 1.1 is quite widespread. Although it should be served as application/xhtml+xml, it is still mainly served as text/html, which is far from correct.
Figure 2. Document types on all pages.
Figure 3. Document types on valid pages.
Encoding was specified (with a meta element) on only 68% of pages. For 12% of pages the W3C Validator gave an error message because of problems with encoding.
The overall favorite encoding was ISO‑8859‑1 (figure 4), which according to the ISO standard does not really suit the Estonian alphabet because it lacks ž and š (these are present in the Windows‑1252 encoding, which browsers use instead of ISO‑8859‑1). The second most popular, Windows‑1257 (which supplements ISO‑8859‑13), fits Estonian text well. The officially recommended character set for Estonian (ISO‑8859‑15) is not very popular.
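The point about missing letters is easy to verify with Python's built-in codecs. The sketch below tries to encode the Estonian non-ASCII letters in each of the encodings mentioned and reports which letters fail. The helper name is made up, and cp1252 and cp1257 are Python's names for Windows‑1252 and Windows‑1257.

```python
ESTONIAN_LETTERS = "õäöüšžÕÄÖÜŠŽ"
ENCODINGS = ["iso-8859-1", "cp1252", "cp1257", "iso-8859-13", "iso-8859-15", "utf-8"]


def missing_letters(encoding: str) -> str:
    """Return the Estonian letters that cannot be encoded in `encoding`."""
    missing = []
    for ch in ESTONIAN_LETTERS:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)


for enc in ENCODINGS:
    print(enc, missing_letters(enc) or "all letters covered")
```

Only ISO‑8859‑1 should report missing letters (š and ž in both cases); the other encodings in the list cover the whole set.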
Cyrillic encodings were (starting from the most popular): Windows‑1251, KOI8‑R and ISO‑8859‑5.
Universal UTF‑8 was used on only about 1000 pages. An estimated half of these were in Estonian and the other half in two or more languages simultaneously.
Figure 4. Popularity of different encodings.
The most used HTML element was <img>, followed by <a>, <br>, <table> and <font>. (Figure 5.)
Figure 6 shows proportionally which elements commonly occur on a page, and illustrates the present situation a bit better.
Looking at images and tables, we can say that about ¾ of pages use tables for layout and HTML-embedded images for graphics.
77% of pages use <br>, <p> or both: 40% of all pages use both, 21% only the line break, and 16% only the paragraph.
60% of pages use <link>, <style> or both, so it's possible to say that 60% of pages make use of CSS. (Of course there are inline styles too, but that is about as good as using the <font> tag.)
45% of pages use <b> and/or <strong>. Of these, 63% prefer <b>, 24% go for <strong> and 13% use both.
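Note that two kinds of percentages are mixed here: the share of all pages that use either element, and the split among those users. A minimal sketch of that calculation, assuming each page is represented as a set of the element names found on it (the function name and sample data are made up, not survey data):

```python
def usage_split(pages, first, second):
    """Summarise how two alternative elements are used across pages.

    `pages` is a list of sets, each holding the element names found on one page.
    """
    only_first = only_second = both = 0
    for elements in pages:
        has_first = first in elements
        has_second = second in elements
        if has_first and has_second:
            both += 1
        elif has_first:
            only_first += 1
        elif has_second:
            only_second += 1
    users = only_first + only_second + both
    if users == 0:
        return {"share_of_all_pages": 0.0, "only_first": 0.0, "only_second": 0.0, "both": 0.0}
    return {
        # Percentage of all pages that use at least one of the two elements.
        "share_of_all_pages": 100 * users / len(pages),
        # Percentages among those pages only; these three add up to 100.
        "only_first": 100 * only_first / users,
        "only_second": 100 * only_second / users,
        "both": 100 * both / users,
    }


# Hypothetical sample: eight pages, four of which use <b> and/or <strong>.
pages = [{"b"}, {"b"}, {"strong"}, {"b", "strong"}, {"p"}, {"p"}, {"img"}, {"a"}]
result = usage_split(pages, "b", "strong")
print(result)
```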
7% of pages use <i> and/or <em>. Of these, 70% prefer <i>, 27% go for <em> and 3% use both.
A full 44% of pages use at least some client-side scripting (the <script> element), mostly to achieve simple hover effects on images.
A lot of obsolete elements are used: <font> 42%, <center> 21%, <map> 12%, <hr> 9% and <u> 3%.
10% of pages use <frameset> and almost 3% use <iframe>, which makes 13% of pages with frames.
The use of headings is scarce: <h1> is used by 7% of pages and <h2> by 3%.
As the <embed> element was not recorded by this survey (because it's not listed in the HTML 4.01 specification), we can't conclude that <object> is the most popular element for embedding Flash, Java, video, sound etc. on web pages.
Figure 5. HTML elements by total amount.
Figure 6. HTML elements by occurrence frequency on pages.
The proportion of tableless pages among valid pages was almost twice as large as among invalid pages (figure 7).
Figure 7. Tables on valid pages (on left) and on invalid pages (on right).
The program collected 865,602 error messages from 19,655 pages. The average number of errors per page was 44, but this figure was inflated by a few pages with an extremely high number of errors (the maximum was 3177 errors on a single page: www.jogevalv.ee). The most frequent count was 5 error messages per page; the median was 18. (Figure 8.)
Figure 8. Histogram of validation errors.
But simply counting error messages may not give a good comparison between pages, because the same error is often repeated several times on a page, and one error at the beginning may trigger a cascade of others throughout the whole page.
If we count only distinct error messages, we get different numbers: an average of 6 distinct errors per page and a maximum of 27 (www.katoliku.ee). The most frequent count was 2 distinct errors per page, and the median was 5. The distribution of distinct error messages is illustrated in figure 9.
Figure 9. Histogram of different validation errors.
Table 1 also compares different statistics of error messages.
error messages | average | weighted avg. | max | median | mode | std. dev. |
---|---|---|---|---|---|---|
All | 44.04 | 39.31 | 3177 | 18 | 5 | 100.04 |
Different | 6.01 | 5.97 | 27 | 5 | 2 | 4.08 |
Table 1. Statistics of error messages.
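The difference between the "All" and "Different" rows can be illustrated with a small sketch using Python's statistics module. The page names and messages below are invented sample data, not results from the survey, and the weighted average from the table is not reproduced because the text does not say how it was weighted. The population standard deviation is used here as an assumption.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical input: the validator messages recorded for each page.
errors_by_page = {
    "page-a": ["no attribute"] * 4 + ["end tag omitted"],
    "page-b": ["no attribute", "end tag omitted", "end tag omitted",
               "element undefined", "element undefined"],
    "page-c": ["no document type declaration"],
    "page-d": ["no attribute"] * 8 + ["end tag omitted"],
}


def summarize(values):
    """Return the statistics shown in table 1 (without the weighted average)."""
    return {
        "average": mean(values),
        "max": max(values),
        "median": median(values),
        "mode": mode(values),
        "std_dev": pstdev(values),  # assumption: population standard deviation
    }


# "All": every message counts. "Different": each message counts once per page.
all_counts = [len(msgs) for msgs in errors_by_page.values()]
different_counts = [len(set(msgs)) for msgs in errors_by_page.values()]

all_stats = summarize(all_counts)
different_stats = summarize(different_counts)
print(all_stats)
print(different_stats)
```

On this sample, page-d has nine messages but only two distinct ones, which is exactly the kind of repetition that inflates the "All" row.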
Next we look at the individual error messages: their number (figure 10) and incidence (figure 11). The exact numbers are available in appendix 2 of the Estonian version.
Figure 10. The amount of different error messages. (Only 15 most common are shown.)
Figure 11. The incidence of different error messages. (Only 18 most common are shown.)
The most numerous error message, „there is no attribute "FOO"...“, was generated by the use of non-standard attributes.
One of the most common error messages was „required attribute "FOO" not specified“, mainly because of missing alt and type attributes.
Almost 63% of pages had no document type declaration (error message „no document type declaration...“).
Unbelievably frequent was „document type does not allow element "FOO" here...“. This was mainly a sign of some really stupid mistakes, like placing <style> outside of <head>, putting a table <td> outside of <tr>, putting <form> between <table> and <tr> etc.
One third of pages had the error message „end tag for element "FOO" which is not open“. This was a sign of typos, as in the following example:
<!-- begin paragraph -> <p>Lorem ipsum dolor sit<br /> amet; just go to --> </b>home</b>.</p> </body> </tml>
The next four error messages always appeared together: „reference to entity "FOO" for which no system identifier could be generated“, „entity was defined here“, „general entity "FOO" not defined and no default entity“ and „cannot generate system identifier for general entity "FOO"“. The cause was, of course, the use of & instead of &amp;.
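Dynamic pages can avoid this whole group of errors by escaping text and attribute values before writing them into the markup. A minimal sketch using html.escape from the Python standard library follows; the link helper, the URL and the link text are made up for illustration.

```python
from html import escape


def link(href: str, text: str) -> str:
    """Build an <a> element with the URL and the link text properly escaped."""
    # escape() turns & into &amp; (and also escapes <, > and, with quote=True, quotes).
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(text))


html_link = link("news.php?id=5&lang=et", "Uudised & teated")
print(html_link)
```

The raw & in the query string is what triggers the entity errors above; after escaping, the validator sees &amp; and the browser still requests the same URL.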
The message „element "FOO" undefined“ had four different main reasons, among them the use of <marquee>, <blink> and <embed>, and the use of <frameset> and <frame> without an (X)HTML X.X Frameset doctype.
The message „end tag for "FOO" omitted...“ and its companion „start tag was here“ often appeared on pages which declared themselves as XHTML but didn't follow the rules of XML.
Similar surveys have been conducted by Thomas Dowling (Validating HTML, 1997) and by Greg Lanier (Universal web design: a survey of web accessibility and usability, 2003). But the most similar and most recent ones were conducted by Soren Johannessen in Denmark in 2004.
Compared to these surveys, the 2.2% of valid pages in Estonia is not that bad next to the first Danish survey's 3.05%, and considerably better than the second survey's 0.4% (only one page was valid). The third survey had 12.24% valid pages, but because of the registration fee, the selection of pages was clearly of a higher class. In any case, the samples of all these surveys were much smaller and therefore cannot really be compared to the results of this survey.
In the first Danish survey 35% of pages had a doctype, which precisely matches the result for Estonian pages. The second Danish survey had 38% and the third 61%.
Similarly, the most widely used doctype was HTML 4.01 Transitional (figure 12), followed by HTML 4.0 Transitional.
Figure 12. Document Type Declarations on Danish authorities' web pages in the January-February 2004 survey (based on Johannessen's data).
73% of pages use tables for layout and embedded images for graphics.
Presentational HTML is widely used: <font> on 42% and <center> on 21% of pages.
44% use JavaScript, mainly for effects which could be achieved with proper use of CSS. But CSS is used on only 60% of pages and is probably underused.
The Document Type Declaration is missing from most pages, either because of lack of knowledge or lack of care. If a doctype is specified, it's usually one of the Transitional ones.
Only two thirds of pages have an encoding specification. Mostly the ISO-8859-1 encoding is used, which is not really suitable for Estonian.
Many good structural HTML elements are underused: for example, headings are used on at most 10% of pages, and only 5% of pages use lists.
The creators of dynamic pages don't know or care about using &amp; instead of &. Many don't know or care that paragraphs should be marked up with <p>, not separated with <br><br>.
Too many web authors don't even seem to know the basic rules of HTML and nest elements in almost random order.
Written on April 27, 2005; last modified on April 28, 2005.