Secret war: Public search VS. Commercial search

This article is adapted from a technical report by our research group on a digital library project (http://dris.hust.edu.cn/English/main.htm). It also introduces a solution we have proposed to the IETF.


The commercial search engines MSN, Yahoo and Google are staging the most spectacular show since the Netscape IPO. The winner is expected to dominate future competition in the IT industry. Billions of dollars are at stake, enough to make any of us dizzy. Investors especially need to keep calm at a time like this. The frenzy may destroy a good company, or even bring another winter to the .com industry.

To ordinary users, search engine technology is almost an occult art. It is worth understanding how it has developed in recent years and where it is heading. Neglecting technological change often leads to fatal errors, especially in the IT industry. This article offers some pointers.

We have heard too much praise for commercial search engines. Let us look through the emperor's new clothes, review some shortcomings of current search engines, and then forecast their future.

Perfect technology?
Search engines may represent the most powerful technology on the planet. They can pull what you want out of billions of web pages in about a second. Without them, the WWW might still be a primordial sea of information. But current search technology is far from perfect. Behind all the news and fantasies about search engines, finding the information we really want on the WWW is still a hit-and-miss affair. You cannot even pick out the answer to "who is the creator of Google" from the millions of results Google returns. The inconveniences of current search engines are easy to feel.
Whenever you type a query into a search engine, you get back thousands of results. Too much information often amounts to no information.

Current search engines may be the best tools for ranking who is who, but they are not well suited to surfing the Internet. The average update interval of most page databases is now almost one month, so some of the information a search engine returns is a month old. Technologies such as "page caching" cannot completely solve this problem. Moreover, Google's database holds five billion web pages, yet that is still no more than 50 percent of all the pages on the Internet, and even this is an optimistic estimate. There are many more dynamic pages that cannot be indexed. Other resources such as PDF files, pictures and video are not efficiently integrated into current search engines either. Even as a pure web page search engine, it cannot keep indexing the entire Web as the Web grows. Managing all the information resources on the Internet may be only a beautiful dream.
Good information has three main qualities: it is precise, fresh and comprehensive. Current search engines can guarantee none of them. Under the current architecture, it is almost impossible for commercial search systems to solve these problems.

Ideal commercial model?
In the beginning, the search engine was only an auxiliary tool on a few famous portal sites. To this day, many search engines earn money by providing search services to such sites. But the profit from these sites is limited, and in the .com winter no company would concentrate on such an unprofitable product. The search engine companies had to fend for themselves. While everyone was struggling to find a new gold mine, Overture came up with an unprecedented concept, the ranking auction, which looked like an ideal commercial model for search engines. Search has since become a seemingly inexhaustible gold mine for many .com companies. But the model has drawn plenty of criticism ever since it appeared.

What is a web search engine? Private companies download billions of pages that belong to other people, without the copyright holders' permission, and then sell advertising around the search process. Some sites even have to pay to be indexed. Hardly anyone really approves of how search engine companies operate, but only a few people voice their discontent. Those few have already caused a good deal of trouble for current search engines. If copyright law were strictly applied, perhaps no commercial search engine could survive.

Search engines were originally tools for the convenience of Internet users. But to turn a profit, search engine companies have to run advertisements or sell ranking prominence, both of which get in the way of information retrieval. In other words, search engines make money at the cost of inconveniencing most Internet users, not from the quality of their search service. To survive, search engine companies have to strike a dangerous tradeoff between search quality and money.

Is this an ideal commercial model? It may be just the "True Lies" of the Internet.

Impetus of IT technology?
Since the IT bubble burst, Google has become the angel of the Internet. Search technology is advancing many areas of Internet technology, and commercial search engines may bring a new spring to the .com industry.
Or this may be another fantasy. In recent years processing power has kept advancing in line with Moore's Law, while network bandwidth, wireless, storage and graphics capabilities have grown at even faster rates. Yet web search, one of the most important services on the Internet, has shown no obvious improvement since 1998. Although page databases have grown enormously, there is still no search engine that covers more than half of the pages on the Internet. The average update interval is even longer than before. Perhaps the only "improvement" is that one query now returns more results.

Apart from search, Internet services such as e-mail, BBS and FTP are all based on public protocols. There is no secret technology on the Internet. But web information retrieval, perhaps the most important service on the Internet, is still dominated by a few search engine companies. Even its basic algorithm, "PageRank", is patented. Many experts know the basic PageRank algorithm, but no one outside knows its details, which are a closely guarded commercial secret. There is no public oversight and no truly transparent ranking algorithm. We all know another world-famous algorithm very well: "money can elevate ranking score". This hardly fits the basic rules of the Internet, a public and free world.
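
The basic PageRank idea mentioned above is public and small enough to sketch. What follows is a minimal illustration in Python of the textbook power-iteration form, not any company's production ranking. The toy link graph, the function name pagerank, the iteration count and the damping value of 0.85 (a value commonly cited in the literature) are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up ranked highest and the page nobody links to ends up lowest. Whatever a commercial engine layers on top of this basic idea is exactly the part that is not open to inspection.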

The secrecy keeps search engine technology largely a black art, and an advertising-oriented one. In recent years we have seen almost no academic research on web search engines. Up to now, most search engine development has gone on inside companies, with little publication of technical details. Commercial search engines may be not an impetus for IT technology but an obstacle to the development of new systems. In 1998 the creators of Google hoped that Google would be a resource for searchers and researchers all around the world and would spark the next generation of search engine technology. Today, however, it may be impossible to get at its data, mainly because the data is considered commercially valuable.

Future of Web search

As noted above, with the rapid growth in the number of web pages, the bottlenecks in coverage and update interval of current search engines have become obvious. They are making some experts rethink the basic architecture of current search engines, and research on public search engines has been revived as a result. What is the future of the search engine: public or commercial?

Many people now believe that search engines determine, to a large extent, what you can know and what you cannot reach. If all information is in the hands of a small group, the combination of that absolute control with commercial profit may be very dangerous to the interests of ordinary users. We would like to know why so many search engine companies keep discussing how to make more money from advertising instead of offering us a better solution. For this reason alone, it may be better to turn web search into a public service.

Moreover, almost all Internet technologies, from TCP/IP to e-mail, are public and open, and better commercial services can be built on top of them. This may be a basic principle for the continued development of the Internet, and search engines can follow it too. We can find a better commercial model for search. First, there should be a basic, public, open web search system. Then companies can design better search systems on top of this public platform.
But two problems have to be solved. First, we must design a public search system better than the current commercial ones, or at least one that provides the same quality of search service as Google. Second, who is willing to build such a system? Many excellent technologies have failed for exactly this reason.
Digital library research offers some hints. It is well known that Google came out of the Stanford digital library project (http://www-diglib.stanford.edu/), of which it was only a small part. Few experts know the main research of that project, the InfoBus system. Its main goal is to integrate different kinds of digital resources on the Internet, a basic research topic in digital libraries. For various reasons the system was never widely implemented. But since the end of the 20th century, advanced technologies such as Web services and Grid computing have appeared, and this line of digital library research can move forward on top of them.

The HUST digital library project (http://dris.hust.edu.cn/English/main.htm) works on exactly this. The research group has proposed DRIS (Domain Resource Integration System), whose goal is likewise to integrate all kinds of information resources and build an Internet information retrieval infrastructure. Interestingly, it too contains a small component for web search, called the web search engine based on DNS. In theory this system can cover all the pages on the Internet, and its update interval could be as short as one day. It has been discussed in the IETF. A test bed is being built on CERNET (the China Education and Research Network, which includes all the universities in China).

Here is a very brief introduction to the system. The WWW is a large, distributed, dynamic world, yet most practical, commercially operated Internet search engines are built on a centralized architecture. As the volume of information on the Internet exploded, this contradiction became serious, and it is the root of the bottlenecks in current search engines. A better solution must therefore adopt a completely different architecture. Some research, such as SIREN in the IRTF (the research arm of the IETF), has considered extending the navigation function of DNS to web page search. The hierarchical, distributed architecture of DNS is an efficient way to manage the WWW, not only the names of web sites but also their content. The HUST digital library project offers a practical solution based on this idea. A distributed system must be a public system, and its layered structure also helps make it practical. The bottom layer provides search engines for the local networks of individual organizations such as universities. This is the basic incentive for building such a system. These small-scale search engines are then integrated by other technologies into a search engine covering the entire WWW. More information can be found on the project's site.
This research may point to a promising solution for a better public search engine. Although academic research on the topic is still at a low ebb, a public search system may be inevitable.
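
To make the layered idea concrete, here is a small sketch in Python of how local search engines could be combined along a DNS-like hierarchy. It is only an illustration under stated assumptions, not the DRIS design or any IETF proposal. The class names LocalEngine and DomainNode, the function federated_search, the word-count scoring and the example domains are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LocalEngine:
    """Hypothetical search engine an organization runs over its own pages."""
    pages: dict  # url -> page text

    def search(self, term):
        term = term.lower()
        hits = []
        for url, text in self.pages.items():
            count = text.lower().split().count(term)  # naive relevance score
            if count:
                hits.append((count, url))
        return hits


@dataclass
class DomainNode:
    """One node in a DNS-like tree; leaves usually carry a local engine."""
    name: str
    engine: Optional[LocalEngine] = None
    children: list = field(default_factory=list)

    def search(self, term):
        # Answer from the local index, then delegate to child domains.
        hits = self.engine.search(term) if self.engine else []
        for child in self.children:
            hits.extend(child.search(term))
        return hits


def federated_search(root, term):
    """Merge hits from the whole tree, best score first, ties by URL."""
    merged = sorted(root.search(term), key=lambda hit: (-hit[0], hit[1]))
    return [url for _, url in merged]


# Hypothetical hierarchy: example -> edu.example -> two universities.
uni_a = DomainNode("uni-a.edu.example", LocalEngine({
    "http://uni-a.edu.example/library": "digital library search and web search",
}))
uni_b = DomainNode("uni-b.edu.example", LocalEngine({
    "http://uni-b.edu.example/cs": "distributed search systems",
    "http://uni-b.edu.example/news": "campus news",
}))
edu = DomainNode("edu.example", children=[uni_a, uni_b])
root = DomainNode("example", children=[edu])

results = federated_search(root, "search")
print(results)
```

Each organization indexes only its own pages, which is why frequent updates become plausible, while the upper layers only route queries and merge answers. A real system would also need caching, ranking across domains and protection against abuse, none of which this sketch attempts.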

Some principles hold at any time. First, in any free market the customer should always come first. Second, technology evolves: only the technology that better meets the needs of ordinary users survives.


Author
Wang Liang
Currently a Ph.D. in the informatics school at Huazhong University of Science and Technology (HUST). One of the main directors of the HUST digital library project.

Two similar reports:
http://www.circleid.com/article/588_0_1_0_C
http://www.trnmag.com/Stories/2004/0...ch_040704.html

