Mynij - search faster, offline

Mynij Milestone 6: Crawler Proxy

Mynij Crawler proxy is an HTTP reverse proxy that also adds CORS headers to the proxied response, which is required for JavaScript Ajax requests in the browser.
  • Last Update: 2021-11-25
  • Version: 001
  • Language: en

The goal of this milestone is to build a web proxy that the Mynij PWA will use to crawl websites from provided sitemaps (or RSS feeds) and to make online searches with Searx. The proxy solves the cross-origin request problem in JavaScript by setting the appropriate headers in the HTTP response to Mynij.

 

Cross-origin request problem

CORS (Cross-Origin Resource Sharing) is a mechanism that allows resources on a web page to be requested from a domain other than the one the page originated from. Such "cross-domain" requests are forbidden by web browsers under the same-origin security policy. CORS headers exchanged between the browser and the server define how the two interact to determine whether a cross-origin request is allowed.

In our case, Mynij needs to crawl various URLs that all come from different domains. This means browsers will block the requests, because the responses lack the CORS headers that allow the source websites' content to be consumed by Mynij. To solve this problem, we need a proxy which forwards the requests using a Python library such as requests or urllib, and sends the responses back to Mynij with the required CORS headers added.

The proxy will add the following header to the response:

Access-Control-Allow-Origin: https://mynij.app.officejs.com

The origin URL sent back by the proxy in the response headers is the origin of the website that made the request. This means that if the Mynij PWA URL is https://mynij.app.officejs.com, then proxy responses to Mynij will always contain Access-Control-Allow-Origin: https://mynij.app.officejs.com. Since the proxy echoes the requester's origin, any website can use the proxy and make cross-origin requests without CORS issues.
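The header-echoing idea can be sketched in a few lines of standard-library Python. This is only illustrative: the real proxy is a Starlette application, and the extra Allow-Methods/Allow-Headers values below are assumptions, not a description of the deployed proxy.

```python
def cors_headers(request_origin):
    """Build the CORS headers the proxy adds to every response.

    The proxy echoes the Origin of the site that made the request,
    so any web app (Mynij or another) can call the proxy without
    being blocked by the browser's same-origin policy.
    """
    return {
        # Echo the requester's origin; "*" is a permissive fallback
        # for requests carrying no Origin header (an assumption, not
        # necessarily the deployed proxy's behaviour).
        "Access-Control-Allow-Origin": request_origin or "*",
        "Access-Control-Allow-Methods": "GET, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type",
    }

headers = cors_headers("https://mynij.app.officejs.com")
# headers["Access-Control-Allow-Origin"] is the requester's origin
```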

 

Mynij Proxy

Mynij proxy is a Python web proxy server which forwards requests to third-party web sites and ensures that the responses will not be rejected by the originating web browser. The proxy also ensures fast responses and saves bandwidth by caching some requests. The proxy source files are hosted on lab.nexedi.com and were developed using Starlette, a lightweight ASGI framework/toolkit that is ideal for building high-performance asyncio services.

By deploying Mynij proxy with SlapOS and combining it with a Rapid.Space CDN, we ensure the proxy is available in different locations worldwide for crawling URLs, including in China. Rapid.Space CDN service accelerates content delivery by reducing the time to negotiate SSL/TLS sessions and by keeping a copy of content close to end-users.

 

Proxy deployment

To simplify proxy deployment, a Software Release for SlapOS was introduced. It automates the proxy build and deployment, including all required dependencies. These dependencies are mostly Python eggs:

  • Gunicorn, a Python WSGI HTTP server
  • Starlette, a lightweight ASGI web framework
  • httptools, a Python binding for the Node.js HTTP parser

Mynij Proxy, being a Python egg itself, can also be installed with python3 or pip3 using a version released on PyPI.

The source code for the Mynij Proxy Software Release is available on the Nexedi GitLab at https://lab.nexedi.com/Mynij/slapos-mynij/tree/master/software/mynij-proxy. It can be deployed in SlapOS using Theia or Webrunner; the picture below shows a deployment with Webrunner.

The URL connection parameter is used to access the proxy. A sample command to fetch web content using the wget utility is:

wget PROXY_BASE_URL/proxy?url=URL_TO_FETCH

With our deployed proxy, the command to get the https://nexedi.com home page is then:

wget --no-check-certificate https://[2001:67c:1254:e:9::e702]:3001/proxy?url=https://nexedi.com -O index.html
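The same proxy URL can be built programmatically; percent-encoding the target keeps any query string inside the crawled URL intact. The proxy address below is the example instance used above.

```python
from urllib.parse import urlencode

def proxy_fetch_url(proxy_base, target):
    """Build the URL used to fetch `target` through the proxy.

    The target is percent-encoded so that query strings inside the
    crawled URL survive the trip through the proxy's own query string.
    """
    return proxy_base.rstrip("/") + "/proxy?" + urlencode({"url": target})

url = proxy_fetch_url("https://[2001:67c:1254:e:9::e702]:3001",
                      "https://nexedi.com")
# → https://[2001:67c:1254:e:9::e702]:3001/proxy?url=https%3A%2F%2Fnexedi.com
```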

The proxy also deploys a SlapOS monitoring stack which checks the status of all involved services using so-called "promises" (in the sense of Mark Burgess).


After the proxy is deployed with Webrunner, a CDN should be added. The document rapidspace-HowTo.Request.A.CDN explains how to request a CDN on rapid.space for the deployed proxy and make it accessible over IPv4 around the world, even though the proxy backend is hosted on IPv6 only.


Performance tests

We ran some performance tests to measure how the proxy behaves under many simultaneous requests, which will be the case whenever many users are building sitemaps. In the test below, we check the proxy's availability by simulating 1000 users connecting to the server for 30 seconds. The test was done on a Mynij instance deployed inside a virtual machine running Linux with 1 GB of RAM and 4 CPU cores.

The server responded to 792642 requests in 30.07 seconds, which is about 26363.64 requests per second.

$ ./wrk -t12 -c1000 -d30s https://[2001:67c:1254:e:9::e702]:3001/ping
Running 30s test @ https://[2001:67c:1254:e:9::e702]:3001/ping
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    37.59ms   15.13ms 386.96ms   86.76%
    Req/Sec     2.24k   337.05     3.28k    73.05%
  792642 requests in 30.07s, 102.05MB read
Requests/sec:  26363.64
Transfer/sec:      3.39MB

In the second test, we deploy a local HTTP server to serve a small file at the URL http://localhost:8080/obj/script.o, then use the proxy to fetch that file, again with 1000 simultaneous users for 30 seconds. The server handles 12434 requests in 30 seconds, which is about 413.10 requests per second.

$ ./wrk -t12 -c1000 -d30s https://[2001:67c:1254:e:9::e702]:3001/proxy?url=http://localhost:8080/obj/script.o
Running 30s test @ https://[2001:67c:1254:e:9::e702]:3001/proxy?url=http://localhost:8080/obj/script.o
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   931.95ms  396.16ms   2.00s    72.63%
    Req/Sec    62.84     32.59   270.00     64.59%
  12434 requests in 30.10s, 293.84MB read
  Socket errors: connect 0, read 0, write 0, timeout 1793
Requests/sec:    413.10
Transfer/sec:      9.76MB

Now we run the same test again to fetch the same small file from the file server, but this time without the proxy in between. The result shows that the server handles more requests, which is explained by the fact that going through the proxy involves two HTTP servers and thus more processing time. Yet the proxy is mandatory for cross-origin requests and can provide the advantage of caching.

$ ./wrk -t12 -c1000 -d30s http://localhost:8080/obj/script.o
Running 30s test @ http://localhost:8080/obj/script.o
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.61ms   47.04ms   1.73s    99.37%
    Req/Sec   235.98    235.49     1.66k    87.43%
  30754 requests in 30.10s, 724.52MB read
  Socket errors: connect 140, read 0, write 0, timeout 74
Requests/sec:   1021.78
Transfer/sec:     24.07MB
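The caching advantage mentioned above can be as simple as an in-memory store with a time-to-live. The sketch below illustrates the idea; the actual caching policy of the deployed proxy may differ.

```python
import time

class ResponseCache:
    """Minimal in-memory TTL cache for proxied responses.

    Illustrative sketch only: avoids re-fetching a URL that was
    crawled recently, trading freshness for speed and bandwidth.
    """

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # url -> (timestamp, body)

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        ts, body = entry
        if time.monotonic() - ts > self.ttl:
            del self._store[url]  # stale entry: force a fresh fetch
            return None
        return body

    def put(self, url, body):
        self._store[url] = (time.monotonic(), body)
```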

Mynij proxy is not faster than direct access, unlike caching proxies such as Apache Traffic Server. However, it can handle around 400 requests per second, which is good enough at the current stage of Mynij. To improve this result and reduce latency while building sitemaps, Mynij is able to manage a swarm of proxy servers at the same time. Therefore, if the number of requests is too high, all the proxies configured in Mynij can share the load and scale.
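One simple way to spread crawl requests over such a swarm is round-robin dispatch. The sketch below is illustrative; the proxy addresses are placeholders, not real instances.

```python
from itertools import cycle
from urllib.parse import urlencode

def make_swarm_dispatcher(proxy_bases):
    """Return a function that builds fetch URLs, cycling through a
    pool of proxy instances so that crawl load is shared evenly."""
    pool = cycle(proxy_bases)

    def next_fetch_url(target):
        # Pick the next proxy in round-robin order, then build the
        # /proxy?url=... request for it.
        base = next(pool)
        return base.rstrip("/") + "/proxy?" + urlencode({"url": target})

    return next_fetch_url

fetch_url = make_swarm_dispatcher(
    ["https://proxy-1.example:3001", "https://proxy-2.example:3001"]
)
# Successive calls alternate between proxy-1 and proxy-2.
```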