Sunday, 24 August 2014

LIST OF SEARCH ENGINES

General

Name | Language
Baidu | Chinese, Japanese
Bing | Multilingual
Blekko | English
DuckDuckGo | English
Exalead | Multilingual
Gigablast | English
Google | Multilingual
Munax | Multilingual
Qwant | Multilingual
Sogou | Chinese
Soso.com | Chinese
Yahoo! | Multilingual
Yandex | Multilingual
Youdao | Chinese

P2P search engines

Name | Language
FAROO | English
Seeks (Open Source) | English
YaCy (Free and fully decentralized) | Multilingual

Metasearch engines

Name | Language
Blingo | English
Yippy (formerly Clusty) | English
DeeperWeb | English
Dogpile | English
Excite | English
Harvester42 |
HotBot | English
Info.com | English
Ixquick (StartPage) | Multilingual
Kayak and SideStep | Multilingual
Mamma |
Metacrawler | English
Mobissimo | Multilingual
Otalo | English
PCH Search and Win |
WebCrawler | English

Geographically limited scope

Name | Language | Country
Accoona | Chinese, English | China, United States
Alleba | English | Philippines
Ansearch | English | Australia, United States, United Kingdom, New Zealand
Biglobe | Japanese | Japan
Daum | | Korea
Egerin | Kurdish | Kurdistan
Goo | Japanese | Japan
Guruji.com | | India
Leit.is | | Iceland
Maktoob | | Arab World
Miner.hu | | Hungary
Najdi.si | | Slovenia
Naver | Korean | Korea
Onkosh | | Arab World
Rambler | | Russia
Rediff | | India
SAPO | | Portugal, Angola, Cabo Verde, Mozambique
Search.ch | | Switzerland
Sesam | | Norway, Sweden
Seznam | | Czech Republic
Walla! | | Israel
Yandex.ru | | Russia, Turkey, Ukraine, Belarus, Kazakhstan
Yehey! | | Philippines
ZipLocal | | Canada/United States

Semantic

See also: Semantic search
Search Engine Name | Description | Speciality
Sophia Search Limited | Specialises in auto-tagging of content for semantic search and discovery | search engine
True Knowledge (now Evi)[1] | Specialises in knowledge base and semantic search | answer engine
Yummly | Semantic web search engine for food, cooking and recipes | food related
Swoogle | Searching over 10,000 ontologies | Semantic web ontologies. Indexes over 4 million semantic web documents.

Accountancy

Business

Computers

Enterprise

Fashion

Food/Recipes

Genealogy

Mobile/Handheld

Job

Main article: Job search engine

Legal

Medical

News

People

Real estate / property

Television

Video Games

By information type

Search engines dedicated to a specific kind of information

Forum

Blog

Multimedia

Source code

BitTorrent

These search engines work across the BitTorrent protocol.

Email

Maps

Price

Question and answer

Human answers

Automatic answers

Natural language

SEARCH ENGINE BIAS

Although search engines are programmed to rank websites based on some combination of their popularity and relevancy, empirical studies indicate various political, economic, and social biases in the information they provide. These biases can be a direct result of economic and commercial processes (e.g., companies that advertise with a search engine can also become more popular in its organic search results) and political processes (e.g., the removal of search results to comply with local laws).
Biases can also be a result of social processes, as search engine algorithms are frequently designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing algorithms of major search engines skew towards coverage of U.S.-based sites, rather than websites from non-U.S. countries.
Google Bombing is one example of an attempt to manipulate search results for political, social or commercial reasons.

MARKET SHARE

Google is the world's most popular search engine, with a market share of 68.69 per cent. Baidu comes in a distant second, answering 17.17 per cent of online queries.
The world's most popular search engines are:
Search engine | Market share in July 2014
Google | 68.69%
Baidu | 17.17%
Yahoo! | 6.74%
Bing | 6.22%
Excite | 0.22%
Ask | 0.13%
AOL | 0.13%

East Asia and Russia

East Asian countries and Russia are among the few places where Google is not the most popular search engine. Soso is more popular than Google in China.
Yandex commands a market share of 61.9 per cent in Russia, compared to Google's 28.3 per cent. In China, Baidu is the most popular search engine. South Korea's homegrown search portal, Naver, is used for 70 per cent of online searches in the country. Yahoo! Japan and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan, respectively.


OPERATIONS OF A SEARCH ENGINE

A search engine operates in the following order:
  1. Web crawling
  2. Indexing
  3. Searching
Web search engines work by storing information about many web pages, which they retrieve from the HTML markup of the pages. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated program that follows every link on the site. The site owner can exclude specific pages by using robots.txt.
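
The crawling step can be illustrated with a minimal sketch in Python. It is not how any particular search engine works: the site is a hypothetical in-memory dictionary of links instead of real HTTP fetches, and the names SITE, ROBOTS_TXT, crawl and the agent string "sketch-bot" are assumptions made for the example. Only the robots.txt handling uses a real library, Python's standard urllib.robotparser.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "http://example.org/": ["http://example.org/a", "http://example.org/private/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": [],
    "http://example.org/private/x": ["http://example.org/secret"],
}

# Hypothetical robots.txt published by the site owner.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(start, site, robots_txt, agent="sketch-bot"):
    """Breadth-first crawl that follows every link and honours robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # excluded by the site owner, so never fetched
        visited.append(url)
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl("http://example.org/", SITE, ROBOTS_TXT):
        print(url)
```

Because the excluded page is never fetched, the links it contains are never discovered either, which is why /secret does not appear in the result.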

The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find.
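
As a rough illustration of the indexing step, the sketch below builds a small index that maps each word to the pages containing it, so a single-word query becomes one dictionary lookup. The page structure (a dict with "title" and "body" fields) and the function names tokenize, build_index and lookup are assumptions made for this example, not the design of any real engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word found in a page's title or body to the set of page ids."""
    index = defaultdict(set)
    for page_id, page in pages.items():
        for field in ("title", "body"):
            for word in tokenize(page.get(field, "")):
                index[word].add(page_id)
    return index


def lookup(index, word):
    """Return the sorted ids of pages that contain the given word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    pages = {
        "p1": {"title": "Gopher history", "body": "Veronica searched Gopher menus."},
        "p2": {"title": "Web crawlers", "body": "A crawler follows links."},
    }
    index = build_index(pages)
    print(lookup(index, "gopher"))
    print(lookup(index, "links"))
```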

The cached page always holds the actual search text, since it is the copy that was indexed, so it can be very useful when the current page has been updated and no longer contains the search terms. This problem might be considered a mild form of linkrot. Google's handling of it increases usability and satisfies the principle of least astonishment, since the user normally expects the search terms to appear on the returned pages. Cached pages are also useful because they may contain data that is no longer available elsewhere.

[Figure: High-level architecture of a standard Web crawler]
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Since 2007, the Google.com search engine has allowed users to search by date by clicking "Show search tools" in the leftmost column of the initial search results page and then selecting the desired date range.

Most search engines support the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search; the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, which applies statistical analysis to pages containing the words or phrases searched for. Finally, natural language queries allow the user to type a question in the same form one would ask it of a human; Ask.com is an example of such a site.
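
The Boolean operators and proximity search described above can be sketched with plain set operations. This is only an illustration under simple assumptions: DOCS is a hypothetical three-document collection, postings scans every document instead of consulting a prebuilt index, and near measures distance in whole words.

```python
import re

# Hypothetical document collection: id -> text.
DOCS = {
    1: "open source search engine",
    2: "search engine market share",
    3: "open directory of web pages",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def postings(term, docs):
    """Return the set of document ids whose text contains the exact term."""
    return {doc_id for doc_id, text in docs.items() if term in tokenize(text)}


def near(a, b, max_gap, text):
    """Proximity search: True if words a and b occur within max_gap positions."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


if __name__ == "__main__":
    search, market, open_ = (postings(t, DOCS) for t in ("search", "market", "open"))
    print("search AND open:", search & open_)
    print("open OR market:", open_ | market)
    print("search NOT market:", search - market)
    print("open near engine (gap 3):", near("open", "engine", 3, DOCS[1]))
```

AND maps to set intersection, OR to union and NOT to difference, which is why an index that stores a set of pages per word makes these queries cheap.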

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. Two main types of search engine have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. The second form relies much more heavily on the computer itself to do the bulk of the work.

Most Web search engines are commercial ventures supported by advertising revenue, and thus some of them allow advertisers to have their listings ranked higher in search results for a fee. Search engines that do not accept money for their search results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.

HISTORY OF SEARCH ENGINES

During early development of the web, there was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992 remains, but as more and more webservers went online the central list could no longer keep up. On the NCSA site, new servers were announced under the title "What's New!"

The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites since the amount of data was so limited it could be readily searched manually.

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine existed for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them into a standard format. This formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.
 
In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching). Because of the limited resources available on the platform it ran on, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered.

One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than its full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.
Google adopted the idea of selling search terms in 1998 from a small search engine company named goto.com. This move had a significant effect on the search engine business, which went from struggling to being one of the most profitable businesses on the Internet.

In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.

Around 2000, Google's search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank, as was explained in the paper Anatomy of a Search Engine written by Sergey Brin and Larry Page, the later founders of Google. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal. In fact, the Google search engine became so popular that spoof engines such as Mystery Seeker emerged.
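
The iterative idea behind PageRank can be sketched in a few lines of Python. This is a simplified textbook version, not Google's production algorithm: the link graph is a hypothetical four-page dictionary, every link target is assumed to appear as a key of that dictionary, pages with no outgoing links are assumed to spread their rank evenly, and the damping factor of 0.85 and the iteration count are parameters chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively rank pages; links maps each page to the pages it links to.

    Assumes every link target is also a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page without links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: pages a, b and d all link to c; c links back to a.
    graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this graph page c ranks highest because three pages link to it, and page a ranks second because its single inbound link comes from the highly ranked c, which reflects the premise that a link from an important page counts for more.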

By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! then used Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart, blended with results from Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).
Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.


SEARCH ENGINE

A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Sunday, 27 July 2014

THE INFORMATION SEARCH PROCESS

The information search process (ISP) is a six-stage process of information seeking behavior in library and information science. The ISP was first suggested by Carol Kuhlthau in 1991.

Stage 1: Initiation

During the first stage, initiation, the information seeker recognizes the need for new information to complete an assignment. As they think more about the topic, they may discuss the topic with others and brainstorm the topic further. This stage of the information seeking process is filled with feelings of apprehension and uncertainty.

Stage 2: Selection

In the second stage, selection, the individual begins to decide what topic will be investigated and how to proceed. Some information retrieval may occur at this point. The uncertainty associated with the first stage often fades with the selection of a topic, and is replaced with a sense of optimism.

Stage 3: Exploration

In the third stage, exploration, information on the topic is gathered and a new personal knowledge is created. Students endeavor to locate new information and situate it within their previous understanding of the topic. In this stage, feelings of anxiety may return if the information seeker finds inconsistent or incompatible information.

Stage 4: Formulation

During the fourth stage, formulation, the information seeker starts to evaluate the information that has been gathered. At this point, a focused perspective begins to form and there is not as much confusion and uncertainty as in earlier stages. Formulation is considered to be the most important stage of the process. The information seeker will here formulate a personalized construction of the topic from the general information gathered in the exploration phase.

Stage 5: Collection

During the fifth stage, collection, the information seeker knows what is needed to support the focus. Now presented with a clearly focused, personalized topic, the information seeker will experience greater interest, increased confidence, and more successful searching.

Stage 6: Search closure

In the sixth and final stage, search closure, the individual has completed the information search. Now the information seeker will summarize and report on the information that was found through the process. The information seeker will experience a sense of relief and, depending on the fruits of their search, either satisfaction or disappointment.


www2

WWW2 and WWW3 are hostnames or subdomains, typically used to identify a series of closely related websites within a domain, such as www.example.org, www2.example.org, and www3.example.org; the series may be continued with additional numbers: WWW4, WWW5, WWW6 etc. Traditionally, such websites are mirrors used for server load balancing. In some cases, the specific hostname may be obscured, creating the appearance that the user is viewing the "www" subdomain, even if they are actually viewing a mirror site.
WWW2 or WWW3 may also refer to:
  • Subdomain, part of a domain name identifying a website
  • World Wide Web, a system of interlinked, hypertext documents that runs over the Internet

Sunday, 20 July 2014

CACHING


If a user revisits a web page after a short interval, the browser may not need to re-obtain the page data from the source web server. Almost all web browsers cache recently obtained data, usually on the local hard drive. HTTP requests from a browser usually ask only for data that has changed since the last download. If locally cached data is still current, the browser reuses it. Caching reduces the amount of web traffic on the Internet. Decisions about expiration are made independently for each downloaded file, whether image, stylesheet, JavaScript, HTML, or other web resource. Thus even on sites with highly dynamic content, many basic resources refresh only occasionally. Web site designers find it worthwhile to collate resources such as CSS data and JavaScript into a few site-wide files so that they can be cached efficiently. This helps reduce page download times and lowers demands on the web server.
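
A rough sketch of how a browser can ask only for data that has changed is shown below. It simulates one real HTTP mechanism, the ETag validator sent back in an If-None-Match request header and answered with status 304 (Not Modified), but everything else is assumed for illustration: serve stands in for a web server, BrowserCache for a browser's cache, and no real network traffic is involved.

```python
import hashlib


def make_etag(body):
    """Derive a validator from the content; any change yields a new tag."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]


def serve(resources, url, if_none_match=None):
    """Hypothetical server: returns (status, etag, body)."""
    body = resources[url]
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, etag, None  # unchanged, so no body is sent
    return 200, etag, body


class BrowserCache:
    """Hypothetical browser cache keyed by URL."""

    def __init__(self):
        self._entries = {}  # url -> (etag, body)
        self.bytes_downloaded = 0

    def fetch(self, resources, url):
        cached = self._entries.get(url)
        etag = cached[0] if cached else None
        status, new_etag, body = serve(resources, url, if_none_match=etag)
        if status == 304:
            return cached[1]  # reuse the locally cached copy
        self.bytes_downloaded += len(body.encode("utf-8"))
        self._entries[url] = (new_etag, body)
        return body


if __name__ == "__main__":
    resources = {"/site.css": "body { margin: 0 }"}
    cache = BrowserCache()
    cache.fetch(resources, "/site.css")
    cache.fetch(resources, "/site.css")  # answered with 304, nothing downloaded
    print(cache.bytes_downloaded)
```

Each resource is validated on its own, which matches the point above that expiration decisions are made independently for each downloaded file.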
There are other components of the Internet that can cache web content. Corporate and academic firewalls often cache Web resources requested by one user for the benefit of all. (See also caching proxy server.) Some search engines also store cached content from websites. Apart from the facilities built into web servers that can determine when files have been updated and so must be re-sent, designers of dynamically generated web pages can control the HTTP headers sent back to requesting users, so that transient or sensitive pages are not cached. Internet banking and news sites frequently use this facility. Data requested with an HTTP 'GET' is likely to be cached if other conditions are met; data obtained in response to a 'POST' is assumed to depend on the data that was posted and so is not cached.
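
The rule of thumb above can be written as a small decision function. This is a deliberately simplified sketch and not a complete implementation of the HTTP caching rules: is_cacheable is a hypothetical helper, it looks only at the request method and at the no-store and private directives of the Cache-Control response header, and it assumes the header name is spelled exactly "Cache-Control".

```python
def is_cacheable(method, response_headers, shared_cache=False):
    """Decide whether a response may be stored, under simplified rules."""
    if method.upper() != "GET":
        return False  # e.g. POST responses depend on the data that was posted

    directives = {
        part.strip().lower()
        for part in response_headers.get("Cache-Control", "").split(",")
        if part.strip()
    }
    if "no-store" in directives:
        return False  # the page author asked that nothing be kept
    if shared_cache and "private" in directives:
        return False  # only the user's own browser may keep a copy
    return True


if __name__ == "__main__":
    print(is_cacheable("GET", {}))
    print(is_cacheable("POST", {}))
    print(is_cacheable("GET", {"Cache-Control": "no-store"}))
    print(is_cacheable("GET", {"Cache-Control": "private, max-age=60"}, shared_cache=True))
```

A banking site would typically send no-store for account pages, while a caching proxy at a corporate firewall corresponds to the shared_cache=True case.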

SPEED ISSUES


Frustration over congestion issues in the Internet infrastructure and the high latency that results in slow browsing has led to a pejorative name for the World Wide Web: the World Wide Wait. Speeding up the Internet is an ongoing discussion over the use of peering and QoS technologies. Other solutions to reduce the congestion can be found at W3C. Guidelines for web response times are:
  • 0.1 second (one tenth of a second). Ideal response time. The user does not sense any interruption.
  • 1 second. Highest acceptable response time. Download times above 1 second interrupt the user experience.
  • 10 seconds. Unacceptable response time. The user experience is interrupted and the user is likely to leave the site or system.
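
The three guideline thresholds above can be turned into a small helper for labelling measured response times. The function name classify_response_time and the label strings are assumptions made for this sketch; only the 0.1, 1 and 10 second boundaries come from the guidelines.

```python
def classify_response_time(seconds):
    """Label a response time using the 0.1 s, 1 s and 10 s guidelines."""
    if seconds <= 0.1:
        return "ideal"  # the user senses no interruption
    if seconds <= 1.0:
        return "acceptable"  # at most the highest acceptable response time
    if seconds < 10.0:
        return "interrupted"  # above 1 second the experience is interrupted
    return "unacceptable"  # at 10 seconds the user is likely to leave


if __name__ == "__main__":
    for sample in (0.05, 0.8, 3.0, 12.0):
        print(sample, classify_response_time(sample))
```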

STATISTICS


Between 2005 and 2010, the number of web users doubled, and was expected to surpass two billion in 2010. Early studies in 1998 and 1999 estimating the size of the Web using capture/recapture methods showed that much of the web was not indexed by search engines and the Web was much larger than expected. According to a 2001 study, there were over 550 billion documents on the Web, mostly in the invisible Web, or Deep Web. A 2002 survey of 2,024 million web pages determined that by far the most web content was in English (56.4%); next were pages in German (7.7%), French (5.6%), and Japanese (4.9%). A more recent study, which used web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion web pages in the publicly indexable web as of the end of January 2005. As of March 2009, the indexable web contained at least 25.21 billion pages. On 25 July 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs. As of May 2009, over 109.5 million domains operated. Of these, 74% were commercial or other domains operating in the .com generic top-level domain.
Statistics measuring a website's popularity are usually based either on the number of page views or on associated server 'hits' (file requests) that it receives.
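
The capture/recapture idea mentioned above can be illustrated with the simplest estimator of that family, often called the Lincoln-Petersen estimate: if two independent samples of sizes n1 and n2 share m items, the total population is estimated as n1 * n2 / m. The sketch below is an assumption about the general technique, not a reconstruction of the 1998 and 1999 studies, and the two "engine" samples are made-up page ids.

```python
def estimate_total(sample_a, sample_b):
    """Lincoln-Petersen estimate of population size from two samples."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap


if __name__ == "__main__":
    # Hypothetical page ids indexed by two independent search engines.
    engine_a = range(0, 600)    # 600 pages
    engine_b = range(400, 900)  # 500 pages, 200 of them also in engine_a
    print(estimate_total(engine_a, engine_b))  # 600 * 500 / 200
```

A small overlap between two large indexes implies a much larger total, which is the intuition behind the finding that much of the Web was not indexed.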

INTERNATIONALIZATION

The W3C Internationalization Activity ensures that web technology works in all languages, scripts, and cultures. Beginning in 2004 or 2005, Unicode gained ground and eventually in December 2007 surpassed both ASCII and Western European as the Web's most frequently used character encoding. Originally RFC 3986 allowed resources to be identified by URI in a subset of US-ASCII. RFC 3987 allows more characters (any character in the Universal Character Set), and now a resource can be identified by IRI in any language.
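
To make the relationship between IRIs and URIs concrete, the sketch below converts an IRI into an ASCII-only URI by percent-encoding the UTF-8 bytes of non-ASCII characters, using Python's standard urllib.parse. It is a simplified approximation of the mapping in RFC 3987, not a complete implementation: iri_to_uri is a hypothetical helper, any user name or password in the address is dropped, and the host name is converted with Python's built-in "idna" codec.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters so the result is plain ASCII."""
    parts = urlsplit(iri)
    netloc = parts.netloc
    if parts.hostname:
        # Internationalized host names get an ASCII-compatible encoding.
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else f"{host}:{parts.port}"
    # quote() encodes text as UTF-8 by default; existing %XX escapes are kept.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/caf\u00e9?q=na\u00efve"))
```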

ACCESSIBILITY


There are methods for accessing the Web in alternative mediums and formats to facilitate use by individuals with disabilities. These disabilities may be visual, auditory, physical, speech related, cognitive, neurological, or some combination. Accessibility features also help people with temporary disabilities, like a broken arm, or aging users as their abilities change. The Web receives information as well as providing information and interacting with society. The World Wide Web Consortium claims that it is essential that the Web be accessible, so it can provide equal access and equal opportunity to people with disabilities. Tim Berners-Lee once noted, "The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect." Many countries regulate web accessibility as a requirement for websites. International cooperation in the W3C Web Accessibility Initiative led to simple guidelines that web content authors as well as software developers can use to make the Web accessible to persons who may or may not be using assistive technology.

STANDARDS


Many formal standards and other technical specifications and software define the operation of different aspects of the World Wide Web, the Internet, and computer information exchange. Many of the documents are the work of the World Wide Web Consortium (W3C), headed by Berners-Lee, but some are produced by the Internet Engineering Task Force (IETF) and other organizations.
Usually, when web standards are discussed, the following publications are seen as foundational:
  • Recommendations for markup languages, especially HTML and XHTML, from the W3C. These define the structure and interpretation of hypertext documents.
  • Recommendations for stylesheets, especially CSS, from the W3C.
  • Standards for ECMAScript (usually in the form of JavaScript), from Ecma International.
  • Recommendations for the Document Object Model, from W3C.
Additional publications provide definitions of other essential technologies for the World Wide Web, including, but not limited to, the following:
  • Uniform Resource Identifier (URI), which is a universal system for referencing resources on the Internet, such as hypertext documents and images. URIs, often called URLs, are defined by the IETF's RFC 3986 / STD 66: Uniform Resource Identifier (URI): Generic Syntax, as well as its predecessors and numerous URI scheme-defining RFCs;
  • HyperText Transfer Protocol (HTTP), especially as defined by RFC 2616: HTTP/1.1 and RFC 2617: HTTP Authentication, which specify how the browser and server authenticate each other.

SECURITY


For criminals, the Web has become the preferred way to spread malware. Cybercrime on the Web can include identity theft, fraud, espionage and intelligence gathering. Web-based vulnerabilities now outnumber traditional computer security concerns, and as measured by Google, about one in ten web pages may contain malicious code. Most web-based attacks take place on legitimate websites, and most, as measured by Sophos, are hosted in the United States, China and Russia. The most common of all malware threats is SQL injection attacks against websites. Through HTML and URIs, the Web was vulnerable to attacks like cross-site scripting (XSS) that came with the introduction of JavaScript and were exacerbated to some degree by Web 2.0 and Ajax web design that favors the use of scripts. Today by one estimate, 70% of all websites are open to XSS attacks on their users.
Proposed solutions vary to extremes. Large security vendors like McAfee already design governance and compliance suites to meet post-9/11 regulations, and some, like Finjan, have recommended active real-time inspection of code and all content regardless of its source. Some have argued that for enterprises to see security as a business opportunity rather than a cost center, "ubiquitous, always-on digital rights management" enforced in the infrastructure by a handful of organizations must replace the hundreds of companies that today secure data and networks. Jonathan Zittrain has said users sharing responsibility for computing safety is far preferable to locking down the Internet.
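
The two attack types named above, SQL injection and cross-site scripting, have well-known basic defences that a short sketch can show. The example below uses only Python's standard sqlite3 and html modules; the users table and the helpers make_demo_db, find_user and render_comment are hypothetical and exist only for illustration. It is a minimal sketch, not a complete security measure.

```python
import html
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user(conn, name):
    """Parameterized query: the ? placeholder keeps input as data, not SQL."""
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


def render_comment(comment):
    """Escape user text before placing it in HTML so scripts cannot run."""
    return "<p>" + html.escape(comment) + "</p>"


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_user(conn, "alice"))
    print(find_user(conn, "' OR '1'='1"))  # treated as a literal name
    print(render_comment("<script>alert(1)</script>"))
```

Building the query by pasting the input into the SQL string would let the second lookup match every row; the placeholder avoids that because the database never interprets the input as SQL.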