出典(authority):フリー百科事典『ウィキペディア(Wikipedia)』「2013/12/24 12:34:25」(JST)
MIME Type | application/sgml, text/sgml |
---|---|
UTI | public.xml |
開発者 | ISO |
種別 | メタ言語 |
派生元 | GML |
拡張 | HTML, XML |
国際標準 | ISO 8879 |
テンプレートを表示 |
Standard Generalized Markup Language(スタンダード ジェネラライズド マークアップ ランゲージ、略:SGML)、文書記述言語SGMLは、マークアップ言語を定義するためのメタ言語の一つである。
簡単に言えば、マニュアルなどの文書の電子化のための書式の規格である。HTMLおよびXMLの、スーパーセット(親)にあたる。その標準規格はISOによって1986年に定められた (ISO 8879:1986、JIS X 4151:1992) 。
そもそもコンピュータやインターネットは軍事技術を目的として開発されたのが起源であり、SGMLも例外ではなかった。軍艦や軍用機などのマニュアルは膨大な量にのぼり、改良が加えられた時などにもマニュアルを新しく書き直す作業、いわゆるメンテナンスなどが大変であった。このことから、マニュアルを電子化して、大量の紙を削減し、かつメンテナンスの簡素化をはかるための技術が必要とされた。
ただし、軍艦や軍用機などは数十年という長期間の保有が必要になるため、長期間にわたりデータが利用可能とならなければならない。電子文書は特定の企業のワープロソフトを用いるとそのソフトのバージョンが上がったり、最悪の場合そのソフトを開発している会社が開発を中止したり、倒産したりしてソフトウェアが無くなった場合は、今まで作成したデータが読めなくなるという問題が発生してしまう。そこで、プレーンテキストのみを用いて、「タグ」を使うことによってデータに意味を持たせることが考えられた。
このようにして規格化されたのがSGMLであった。
1979年、IBMでプロジェクトマネージャをしていたチャールズ・ゴールドファーブ(英語版) は、Edward MosherおよびRaymond Lorieらとともに、「GML」(Generalized Markup Language) を発表し、それは「DCF」(Document Composition Facility) の名で商業化された。この成功でゴールドファーブは有名になり、IBMを退職してGMLの後継言語であるSGMLを開発することになったのである。
ISOのSGML規約は1986年の出版後2ヶ月も経たないうちにベストセラーとなった(その10年前に発売されたFORTRANのISO規約の部数を2ヶ月で超えた)。
SGMLは、ISOから正式に承認される以前から、すでに、アメリカ国防総省やECの公式出版事務局など、数々の公的機関で使用され始めていた。ゴールドファーブの古巣のIBM社でも導入され、同社の文書システムに大変革をおこした。ヨーロッパでもCERN(欧州原子核研究機構)など、広く採用され、例えばフランスを例に挙げると、エアバス社、SNECMA(フランス国営の航空機エンジンメーカー)およびフランス軍などで採用されることになった。
日本においては、厚生省への新薬申請のデータ形式としてSGMLが採用された。それに伴い製薬会社やその関連企業においても導入された。他にも特許庁などでも導入された。航空機産業・防衛産業、自動車産業においても海外との共同開発や部品供給時の情報交換やマニュアル・報告書の電子化などに利用されることとなった。
<!DOCTYPE memo PUBLIC "-//SuzukiCorp//DTD Memo//JP"> <memo> <from>木村 <to>富田様 <date>2001/10/01 <subject>役員会議 <para>役員会議の場所は会議室Bに変更になりました。 </memo>
SGML文書を、人間が読めるように、レイアウトして表示することは、SGMLパーサという名のプログラムが行う。つまり構文解析およびレイアウトを行うプログラムである。SGMLパーサの最も初期の市販品としては、ブリュッセルのSOBEMAP社のものおよび、シカゴのDatalogics データロジックス社[注 2]のものがあった。[注 3]
SGMLは機能が満載されていたことにより、そのままでは全てを実装することは困難であった。また、タグの構造が原因で、パーサのアルゴリズムが比較的複雑になることも難点だった。そこで後に、SGMLを簡略化および改良した形のXMLが開発され、普及してゆくことになった。SGML文書はXML文書へと順次、変換・移行されることになった。
SGMLはこれら2つのマークアップ言語の源流であり、現在のインターネット利用者は皆SGMLの恩恵に浴しているのである。
Internet media type | application/sgml, text/sgml |
---|---|
Uniform Type Identifier | public.xml |
Developed by | ISO |
Type of format | Markup Language |
Extended from | GML |
Extended to | HTML, XML |
Standard(s) | ISO 8879 |
The Standard Generalized Markup Language (SGML; ISO 8879:1986) is an ISO-standard technology for defining generalized markup languages for documents. ISO 8879 Annex A.1 defines generalized markup:
Generalized markup is based on two novel postulates:[citation needed]
- Markup should be declarative: it should describe a document's structure and other attributes, rather than specify the processing to be performed on it. Declarative markup is less likely to conflict with unforeseen future processing needs and techniques.
- Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and databases can be used for processing documents as well.
HTML is an example of an SGML-based language.
SGML is an ISO standard: "ISO 8879:1986 Information processing — Text and office systems — Standard Generalized Markup Language (SGML)", of which there are three versions:
SGML is part of a trio of enabling ISO standards for electronic documents developed by ISO/IEC JTC1/SC34[1][2] (ISO/IEC Joint Technical Committee 1, Subcommittee 34 – Document description and processing languages) :
SGML is supported by various technical reports, in particular
SGML descended from IBM's Generalized Markup Language (GML), which Charles Goldfarb, Edward Mosher, and Raymond Lorie developed in the 1960s. Goldfarb, editor of the international standard, coined the “GML” term using their surname initials.[5] The syntax of SGML is closer to the COCOA format.[clarification needed] As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many such documents must remain readable for several decades—a long time in the information technology field. SGML also was extensively applied by the military, and the aerospace, technical reference, and industrial publishing industries. The advent of the XML profile has made SGML suitable for widespread application for small-scale, general-purpose use.
SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of ISO 8879 (from the public draft[6]):
A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.
A type-valid SGML document is defined by the standard as
An SGML document in which, for each document instance, there is an associated document type declaration (DTD) to whose DTD that instance conforms.
A tag-valid SGML document is defined by the standard as
An SGML document, all of whose document instances are fully tagged. There need not be a document type declaration associated with any of the instances. Note: If there is a document type declaration, the instance can be parsed with or without reference to it.
Tag-validity was introduced in SGML (ENR+WWW) to support XML which allows documents with no DOCTYPE declaration but which can be parsed without a grammar or documents which have a DOCTYPE declaration that makes no XML Infoset contributions to the document. The standard calls this fully tagged. Integrally stored reflects the XML requirement that elements end in the same entity in which they started. Reference-free reflects the HTML requirement that entity references are for special characters and do not contain markup. SGML validity commentary, especially commentary that was made before 1997 or that is unaware of SGML (ENR+WWW), covers type-validity only.
The SGML emphasis on validity supports the requirement for generalized markup that markup should be rigorous. (ISO 8879 A.1)
An SGML document may have three parts:
An SGML document may be composed from many entities (discrete pieces of text). In SGML, the entities and element types used in the document may be specified with a DTD, the different character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the concrete syntax of the document.
Although full SGML allows implicit markup and some other kinds of tags, the XML specification (s4.3.1) states:
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.
For introductory information on basic, modern SGML syntax, see XML. The following material concentrates on features not in XML and is not a comprehensive summary of SGML syntax.
SGML generalizes and supports a wide range of markup languages as found in the mid 1980s. These ranged from terse Wiki-like syntaxes to RTF-like bracketed languages to HTML-like matching-tag languages. SGML did this by a relatively simple default reference concrete syntax augmented with a large number of optional features that could be enabled in the SGML Declaration. Not every SGML parser can necessarily process every SGML document. Because each processor's System Declaration can be compared to the document's SGML Declaration it is always possible to know whether a document is supported by a particular processor.
Many SGML features relate to markup minimization. Other features relate to parallel asynchronous markup (CONCUR), to linking processing attributes (LINK), and to embedding SGML documents within SGML documents (SUBDOC).
The notion of customizable features was not appropriate for Web use, so one goal of XML was to minimize optional features. However XML's well-formedness rules cannot support Wiki-like languages, leaving them unstandardized and difficult to integrate with non-text information systems.
The usual (default) SGML concrete syntax resembles this example, which is the default HTML concrete syntax:
<QUOTE TYPE="example"> typically something like <ITALICS>this</ITALICS> </QUOTE>
SGML provides an abstract syntax that can be implemented in many different types of concrete syntax. Although the markup norm is using angle brackets as start- and end- tag delimiters in an SGML document (per the standard-defined reference concrete syntax), it is possible to use other characters—provided a suitable concrete syntax is defined in the document's SGML declaration.[7] For example, an SGML interpreter might be programmed to parse GML, wherein the tags are delimited with a left colon and a right full stop, thus, an :e prefix denotes an end tag: :xmp.Hello, world:exmp.
. According to the reference syntax, letter-case (upper- or lower-) is not distinguished in tag names, thus the three tags: (i) <quote>
, (ii) <QUOTE>
, and (iii) <quOtE>
are equivalent. (NOTE: A concrete syntax might change this rule via the NAMECASE NAMING declarations).
SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML processors need not support every available feature, thus allowing applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid structures. XML is intolerant of syntax omissions, and does not require a DTD for validation.
Both start tags and end tags may be omitted from a document instance, provided:
#REQUIRED
) attributes, andFor example, if OMITTAG YES is specified in the SGML Declaration (enabling the OMITTAG feature), and the DTD includes the following declarations:
<!ELEMENT chapter - - (title, section+)> <!ELEMENT title o o (#PCDATA)> <!ELEMENT section - - (title, subsection+)>
then this excerpt:
<chapter>Introduction to SGML <section>The SGML Declaration <subsection> ...
which omits two <title>
tags and two </title>
tags, would represent valid markup.
Note also that omitting tags is optional - the same excerpt could be tagged like this:
<chapter><title>Introduction to SGML</title> <section><title>The SGML Declaration</title> <subsection> ...
and would still represent valid markup.
Note: The OMITTAG feature is unrelated to the tagging of elements whose declared content is EMPTY
as defined in the DTD:
<!ELEMENT image - o EMPTY>
Elements defined like this have no end tag, and specifying one in the document instance would result in invalid markup. This is syntactically different than XML empty elements in this regard.
Tags can be replaced with delimiter strings, for a terser markup, via the SHORTREF feature. This markup style is now associated with wiki markup, e.g. wherein two equals-signs (==), at the start of a line, are the “heading start-tag”, and two equals signs (==) after that are the “heading end-tag”.
SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature, do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks—either double “ ”
(LIT) or single ’ ’
(LITA)—so that the previous markup example could be written:
<QUOTE TYPE=example> typically something like <ITALICS>this</> </QUOTE>
One feature of SGML markup languages is the "presumptuous empty tagging", such that the empty end tag </>
in <ITALICS>this</>
"inherits" its value from the nearest previous full start tag, which, in this example, is <ITALICS>
(in other words, it closes the most recently opened item). The expression is thus equivalent to <ITALICS>this</ITALICS>
.
Another feature is the NET (Null End Tag) construction: <ITALICS/this/
, which is structurally equivalent to <ITALICS>this</ITALICS>
.
Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags:
<QUOTE></QUOTE>
can be written as:
<QUOTE// <!-- not a typo! -->
Wherein the first slash ( / ) stands for the NET-enabling “start-tag close” (NESTC), and the second slash stands for the NET. NOTE: XML defines NESTC with a /, and NET with an > (angled bracket)—hence the corresponding construct in XML appears as <QUOTE/>.
The third feature is 'text on the same line', allowing a markup item to be ended with a line-end; especially useful for headings and such, requiring using either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:
<!ELEMENT lines (line*) <!ELEMENT line O - (#PCDATA)> <!ENTITY line-tagc "</line>"> <!SHORTREF one-line "&#RE;&#RS;" line-tagc> <!USEMAP one-line line>
(and "&#RE;&#RS;" is a short-reference delimiter in the concrete syntax), then:
<lines> first line second line </lines>
is equivalent to:
<lines> <line>first line</line> <line>second line</line> </lines>
SGML has many features that defied convenient description with the popular formal automata theory and the contemporary parser technology of the 1980s and the 1990s. The standard warns in Annex H:
The SGML model group notation was deliberately designed to resemble the regular expression notation of automata theory, because automata theory provides a theoretical foundation for some aspects of the notion of conformance to a content model. No assumption should be made about the general applicability of automata to content models.
A report on an early implementation of a parser for basic SGML, the Amsterdam SGML Parser,[8] notes
the DTD-grammar in SGML must conform to a notion of unambiguity which closely resembles the LL(1) conditions
and specifies various differences.
There appears to be no definitive classification of full SGML against a known class of formal grammar. Plausible classes may include tree-adjoining grammars and adaptive grammars.
XML is described as being generally parsable like a two-level grammar for non-validated XML and a Conway-style pipeline of coroutines (lexer, parser, validator) for valid XML.[9] The SGML productions in the ISO standard are reported to be LL(3) or LL(4).[10] XML-class subsets are reported to be expressible using a W-grammar.[11] According to one paper,[12] and probably considered at an information set or parse tree level rather than a character or delimiter level:
The class of documents that conform to a given SGML document grammar forms an LL(1) language. ... The SGML document grammars by themselves are, however, not LL(1) grammars.
The SGML standard does not define SGML with formal data structures, such as parse trees, however, an SGML document is constructed of a rooted directed acyclic graph (RDAG) of physical storage units known as “entities”, which is parsed into a RDAG of structural units known as “elements”. The physical graph is loosely characterized as an entity tree, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an element tree, but the ID/IDREF markup allows arbitrary arcs.
The results of parsing can also be understood as a data tree in different notations; where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.
The SGML standard describes it in terms of maps and recognition modes (s9.6.1). Each entity, and each element, can have an associated notation or declared content type, which determines the kinds of references and tags which will be recognized in that entity and element. Also, each element can have an associated delimiter map (and short reference map), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.
Parsing involves traversing the dynamically-retrieved entity graph, finding/implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively — to recognize lexical structures, and actively — to generate missing structures and tags that the DTD has declared optional. End- and start- tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single, possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.
SGML uses the term validation for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML without a DTD (e.g. simple XML), is a grammar or a language; SGML with a DTD is a metalanguage. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism is a metalanguage.
SGML has an abstract syntax implemented by many possible concrete syntaxes, however, this is not the same usage as in an abstract syntax tree and as in a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.
The W3C XML (Extensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the World Wide Web. In addition to disabling many SGML options present in the reference syntax (such as omitting tags and nested subdocuments) XML adds a number of additional restrictions on the kinds of SGML syntax. For example, despite enabling SGML shortened tag forms, XML does not allow unclosed start or end tags. It also relied on many of the additions made by the WebSGML Annex. XML currently is more widely used than full SGML. XML has lightweight internationalization based on Unicode. Applications of XML include XHTML, XQuery, XSLT, XForms, XPointer, JSP, SVG, RSS, Atom, XML-RPC, RDF/XML, and SOAP.
While HTML was developed partially independently and in parallel with SGML, its creator Tim Berners-Lee, intended it to be an application of SGML. The design of HTML (Hyper Text Markup Language) was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application, however, the HTML markup language has many legacy- and exception- handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 – SGML.[13]
The charter for the recently revived World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'".[14] Although HTML syntax closely resembles SGML syntax with the default reference concrete syntax, HTML5 abandons any attempt to define HTML as an SGML application, explicitly defining its own parsing rules which more closely match existing implementations and documents. It does, however, define an alternative XHTML serialization, which conforms to XML and therefore to SGML as well.[15]
The second edition of the Oxford English Dictionary (OED) is entirely marked up with an SGML-based markup language.[16]
The third edition is marked up as XML.
Other document markup languages are partly related to SGML and XML, but — because they cannot be parsed or validated or other-wise processed using standard SGML and XML tools — they are not considered either SGML or XML languages; the Z Format markup language for typesetting and documentation is an example.
Several modern programming languages support tags as primitive token types, or now support Unicode and regular expression pattern-matching. An example is the Scala programming language.
Document markup languages defined using SGML are called "applications" by the standard; many pre-XML SGML applications were proprietary property of the organizations which developed them, and thus unavailable in the World Wide Web. The following list is of pre-XML SGML applications.
Significant open source implementations of SGML have included:
SP and Jade, the associated DSSSL processors, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class, in Sun System's implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.
|
関連記事 | 「SG」 |
.