Blog mining 博客掘宝

Source: Other Industry English | Date: 2018-12-05

“I NOTICED that the doormat was at a slightly crooked angle. I reached down and moved the mat back into its correct place.” Thus began a recent entry on The dullest blog in the world. Although this publication is something of a satire on the internet’s inane blogs, scientists are finding—to their surprise—that useful information can actually be mined from the tedium of the blogosphere.
“我注意到门口的垫子有点歪,就蹲下来把它摆正。”这是“世上最无聊博客”网站上最新一篇日志的开头。虽然发布这篇文章,对互联网上空洞愚蠢的博客像是一个讽刺,但科学家却惊讶地发现,从博客世界这样的单调乏味中,确实能挖掘到有用的信息。

Andrew Gordon and his colleagues at the University of Southern California’s Institute for Creative Technologies in Los Angeles have been trying to teach computers about cause and effect. Computers are not good at dealing with causality. They can identify particular events, but working out the relationships between those events is more difficult. This is particularly true when it comes to using computers to analyse the human experience.
安德鲁•戈登与他在洛杉矶南加州大学创新技术研究院的同事,一直在设法教电脑了解因果关系。电脑并不擅长处理因果关系。它们能识别特定事件,但难以找出其中的关联。在涉及用电脑分析人类经验时,尤其如此。

But it turns out that computers can learn a lot about causality by reading personal blogs. Of the million or so blog entries that are written in English every day, most are comments on news, plans for activities, or personal thoughts about life. Roughly 5% are narratives telling stories about events that have recently happened to the author.
不过,事实证明,电脑通过阅读个人博客,对因果关系能了解多多。每天约有一百万篇英语撰写的博文,其中大部分是新闻评论、活动策划以及关于人生的个人感悟。大致有5%是记叙文,讲述博主最近发生的一些故事。

To enable their computer system to learn from blogs, the team followed a two-step process. The first step was for humans to flag thousands of blog entries as either “story” or “not story”. People use different words with different frequencies when they are telling stories, as compared with other forms of discourse. By tallying up the frequencies of parts of speech such as pronouns (I, she, we) and past-tense verbs (went, said, thought) in these flagged blogs, it is possible to distinguish between the two types—regardless of what the story is actually about, says Dr Gordon. His computer system could then look at other blog entries and work out whether they were narrative or not.
为让其电脑系统能从博客中获得一些东西,该小组实施了如下两个步骤。第一步,将数千篇博文以人类的定义标记为“叙事型”或“非叙事型”。与其他形式的讲述相比,人们讲故事时,不同词语出现的频率也不同。戈登博士说,通过统计那些标记好的博客中的某些词——比如代词(I,she,we)和过去时态动词(went, said, thought)——出现的频率,无论博文故事内容到底如何,将其区分为上述两种类型,都是可能的。因此,他的电脑系统能浏览其他博文,分析出其属于记叙文还是不属于记叙文。
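
The first step can be illustrated with a minimal sketch. It assumes that the rates of pronouns and past-tense verbs are the only features, and it replaces whatever learning method the team actually used with a simple nearest-centroid rule. The word lists, the training sentences and every function name here are hypothetical; a real system would rely on a part-of-speech tagger and thousands of flagged entries.

```python
import re

# Tiny illustrative word lists (assumption); a real system would use a tagger.
PRONOUNS = {"i", "she", "he", "we", "me", "my", "they"}
PAST_VERBS = {"went", "said", "thought", "was", "did", "saw",
              "moved", "noticed", "reached"}

def features(text):
    """Return (pronoun rate, past-tense rate) as fractions of all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return (0.0, 0.0)
    n = len(tokens)
    pronoun_rate = sum(t in PRONOUNS for t in tokens) / n
    past_rate = sum(t in PAST_VERBS for t in tokens) / n
    return (pronoun_rate, past_rate)

def centroid(vectors):
    """Average a list of 2-dimensional feature vectors."""
    k = len(vectors)
    return tuple(sum(v[i] for v in vectors) / k for i in range(2))

def train(labelled):
    """labelled: list of (text, label) pairs flagged by human readers."""
    by_label = {}
    for text, label in labelled:
        by_label.setdefault(label, []).append(features(text))
    return {label: centroid(vs) for label, vs in by_label.items()}

def classify(model, text):
    """Pick the label whose centroid is closest to the text's features."""
    f = features(text)
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(f, c))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical hand-flagged entries standing in for the human-labelled set.
labelled = [
    ("I went to the store and she said we were late.", "story"),
    ("We saw the film and I thought it was great.", "story"),
    ("The new tax policy is bad for small business.", "not story"),
    ("Plans for the weekend include hiking and a picnic.", "not story"),
]
model = train(labelled)
```

Calling `classify(model, ...)` on the doormat entry quoted at the top of the article puts it on the "story" side, because it is dense in "I" and past-tense verbs even though the toy training set never mentions doormats.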

The second step was to teach the system to identify causal connections. Here the team used much the same technique. Dr Gordon and his students read thousands of random blog entries and specifically pointed out phrasing associated with causal relationships (such as “I did X so then Y happened”) for the computer to pick up on. Identifying such phrases in blog entries then enables the computer to pick out and categorise those sentences that contain a cause and an effect, such as “I slammed on the brakes but ended up smashing into the car in front of me” or “The doctor scolded me for eating too much fat and risking a heart condition.”
第二步,教这个系统识别因果关系。此时,研究小组采用了与第一步几乎一样的技术。为让电脑能够识别,戈登博士及其学生随机浏览了数千篇博文,明确指出了与因果关系相关的句式(比如,“我做了X因此Y发生了”)。电脑识别出了博文中的这些句法,因此才能找出来,并将这些包含有因果关系的句子(如“我猛踩刹车,最终却一头撞上了我前面的汽车”或“医生骂我摄入脂肪过量,有得心脏病的危险”)分门别类。
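
The second step can be sketched in the same spirit. The version below is a deliberately simplified stand-in: it uses a short, fixed list of cue phrases rather than phrasing learned from thousands of annotated entries, and the cue list, the `extract_causal` function and its output format are all assumptions made for illustration.

```python
import re

# Hypothetical cue phrases. The flag says whether the cause comes before the cue.
CUES = [
    ("so then", True),
    ("but ended up", True),
    ("because", False),
]

def extract_causal(sentence):
    """Return a dict with cause, cue and effect, or None if no cue matches."""
    for cue, cause_first in CUES:
        pattern = r"\b" + re.escape(cue) + r"\b"
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match is None:
            continue
        before = sentence[:match.start()].strip(" ,.")
        after = sentence[match.end():].strip(" ,.")
        if not before or not after:
            continue  # a cue with nothing on one side is not a usable pair
        cause, effect = (before, after) if cause_first else (after, before)
        return {"cause": cause, "cue": cue, "effect": effect}
    return None

example = extract_causal(
    "I slammed on the brakes but ended up smashing into the car in front of me."
)
```

Here `example` pairs "I slammed on the brakes" with "smashing into the car in front of me". Sentences such as the doctor's scolding, where the link is carried by grammar rather than by a fixed phrase, would slip past a list like this, which is presumably why the team had the computer pick up on phrasing from human annotations instead.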

The idea is that this will eventually lead to a system that can gather aggregated statistics on a day-by-day basis about the personal lives of large populations—information that would be impossible to garner from any other source. Ultimately, Dr Gordon expects the analysis of personal stories in weblogs to be used much like Google’s flu tracker, but on a much grander scale. Google’s flu-tracking scheme can detect early signs of influenza outbreaks by mining search data for flurries of flu-related search terms in a particular region.
这项研究的想法是,最终引导一个系统产生,该系统能日复一日汇总庞大人口的个人生活统计数据——这些信息不可能从其他任何来源获取到。最后,戈登博士期待这种对博文个人故事的分析,能够像谷歌的“流感追踪”一样广泛应用,但应用规模会更为庞大。谷歌的流感追踪计划,通过挖掘特定地区跟流感相关的搜索用语骤增这样的搜索数据,能发现流感爆发的早期迹象。
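
To make the idea of spotting a "flurry" concrete, here is a minimal sketch that flags days on which a count jumps well above its recent average. It is not Google's method, which the article does not describe; the trailing-window rule, the `find_flurries` name, the default parameters and the sample counts are all assumptions.

```python
from statistics import mean, pstdev

def find_flurries(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing mean
    by more than `threshold` standard deviations."""
    flagged = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        baseline = mean(history)
        spread = pstdev(history)
        # Use a floor of 1.0 so a flat history does not flag tiny changes.
        limit = baseline + threshold * max(spread, 1.0)
        if daily_counts[day] > limit:
            flagged.append(day)
    return flagged

# Hypothetical daily mentions of a topic in one region.
counts = [10, 12, 11, 9, 10, 11, 12, 10, 40, 11]
spikes = find_flurries(counts)
```

With these made-up counts, only the day with 40 mentions is flagged. The same rule could be applied to counts of cause-and-effect sentences about any topic, one series per region.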

The web could be mined to track information about emerging trends and behaviours, covering everything from drug use or racial tension to interest in films or new products. The nature of blogging means that people are quick to comment on events in their daily lives. Mining this sort of information might therefore also reveal information about exactly how ideas are spread and trends are set.
挖掘网络,能追踪那些与新趋势及新行为相关的信息,这些信息包罗万象,从毒品使用与种族关系紧张到电影喜好与新产品。博客的本质意味着人们会迅速评论日常生活事件。挖掘这类信息,或许也会因此揭示出观念到底如何传播,趋势究竟怎样产生。

In the world before the web, chatter about the trivialities of everyday life was shared in person, and not written down, so it could not be subjected to such analysis. While recording their words for posterity and obsessively checking their hit counters to see if anyone is reading them, today’s blog authors can console themselves with the thought that computers, at least, find their work fascinating.
网络问世前,人与人靠闲聊来分享日常生活琐事,并不会诉诸笔端,因此这些闲聊并不会进行如此分析。今天的博主,一边为子孙记下自己的言行,并锲而不舍地查看博文点击数,了解他人是否在浏览这些文字,一边还能用下列想法聊以自慰,那就是,至少还有计算机,认为他们的大作引人入胜。
