# SCAN: ragflow
date: 2026-02-12 | program: RAGFlow | repo: https://github.com/infiniflow/ragflow | bounty: $500-$1500

## summary
raw_findings: 8 | real: 5 | high_conf: 2 | med_conf: 3 | low_conf: 0 | false_pos: 3

Note: semgrep could not run (bash permission issue). Findings from manual code review only. Trufflehog not run.

## findings

### F1: Inverted Authorization Check in web_crawl Endpoint
severity: high | confidence: high | type: Auth Bypass | cwe: CWE-863
file: /Users/sebas/Code/bug-bounty/data/repos/ragflow/api/apps/document_app.py:116 | tool: manual

```python
@manager.route("/web_crawl", methods=["POST"])
@login_required
@validate_request("kb_id", "name", "url")
async def web_crawl():
    # ... snip ...
    e, kb = KnowledgebaseService.get_by_id(kb_id)
    if not e:
        raise LookupError("Can't find this dataset!")
    if check_kb_team_permission(kb, current_user.id):  # BUG: inverted logic
        return get_json_result(data=False, message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)

    blob = html2pdf(url)  # Fetches the URL
```

analysis: check_kb_team_permission() returns True when the user IS authorized (confirmed at lines 25-37 of check_team_permission.py). The condition on line 116 therefore denies authorized users and admits unauthorized ones. Compare line 86 in the same file, which correctly uses "if not check_kb_team_permission(...)". As a result, any authenticated user can use the web_crawl feature on ANY knowledge base they do NOT have access to, while actual owners and team members are denied.
attack_vector: POST /v1/document/web_crawl with kb_id of another user's knowledge base, name, and url parameters. The authenticated user must NOT be a team member of the target KB to gain access.
impact: Any authenticated user can crawl arbitrary URLs and store the results in any knowledge base whose ID they can discover, provided they are not a member of it. Combined with the DNS-rebinding bypass of the SSRF protection (F4), this could be used for internal network reconnaissance.
recommendation: REPORT
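
A minimal, self-contained sketch of the inverted condition next to the corrected one. The `check_kb_team_permission` below is a hypothetical stand-in that only mirrors the behavior described in the analysis (True means authorized); the dictionary layout and the `web_crawl_allowed_*` names are illustrative assumptions, not RAGFlow code.

```python
# Hypothetical stand-in: returns True when the user IS authorized,
# matching the behavior described in the analysis.
def check_kb_team_permission(kb: dict, user_id: str) -> bool:
    return user_id == kb["owner"] or user_id in kb["team"]


def web_crawl_allowed_buggy(kb: dict, user_id: str) -> bool:
    # Mirrors the reported condition: deny when the permission check passes.
    if check_kb_team_permission(kb, user_id):
        return False
    return True


def web_crawl_allowed_fixed(kb: dict, user_id: str) -> bool:
    # Corrected condition: deny when the permission check fails.
    if not check_kb_team_permission(kb, user_id):
        return False
    return True


# Illustrative data: alice owns the KB, bob is a team member.
kb = {"owner": "alice", "team": {"bob"}}
```

With this data, the buggy variant rejects `alice` and `bob` and admits any other user ID, while the fixed variant does the opposite.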

---

### F2: Path Traversal via Server-Controlled Filename in /parse Endpoint
severity: high | confidence: high | type: Path Traversal | cwe: CWE-22
file: /Users/sebas/Code/bug-bounty/data/repos/ragflow/api/apps/document_app.py:878 | tool: manual

```python
@manager.route("/parse", methods=["POST"])
@login_required
async def parse():
    req = await get_request_json()
    url = req.get("url", "")
    if url:
        if not is_valid_url(url):
            return get_json_result(data=False, message="The URL format is invalid", ...)
        download_path = os.path.join(get_project_base_directory(), "logs/downloads")
        # ... Chrome fetches the URL ...
        r = re.search(r"filename=\"([^\"]+)\"", str(res_headers))
        if not r or not r.group(1):
            return get_json_result(data=False, ...)
        f = File(r.group(1), os.path.join(download_path, r.group(1)))  # PATH TRAVERSAL
        txt = FileService.parse_docs([f], current_user.id)
```

analysis: The filename is extracted from the Content-Disposition header of the HTTP response, which the remote server controls. It is joined with download_path using os.path.join() without sanitization. os.path.join() neither normalizes nor confines its arguments, so a malicious server at the target URL can return a traversal payload such as filename="../../../../etc/passwd", and the joined path points outside the download directory once the ../ segments are resolved. An absolute filename such as "/etc/passwd" makes os.path.join() discard download_path entirely. The File.read() method then opens and reads this arbitrary file. The is_valid_url() check prevents internal network access but does not prevent a user from pointing to their own attacker-controlled server.
attack_vector: 1) Set up a server at https://evil.com/download that returns Content-Disposition: attachment; filename="../../../../etc/passwd". 2) POST /v1/document/parse with url pointing to it. 3) The response contains the contents of /etc/passwd parsed as a document.
impact: Arbitrary file read on the server. Attacker can read configuration files, secrets, source code, database credentials from the ragflow server.
recommendation: REPORT
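
One possible mitigation, sketched under assumptions: reduce the header value to its final path component, then verify that the resolved path still lies under the download directory. `safe_download_path` is a hypothetical helper name, not a function in the repository.

```python
import os


def safe_download_path(download_dir: str, header_filename: str) -> str:
    """Return a path inside download_dir, or raise ValueError."""
    # Normalize Windows separators, then keep only the final path component.
    name = os.path.basename(header_filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"unsafe filename: {header_filename!r}")
    base = os.path.realpath(download_dir)
    candidate = os.path.realpath(os.path.join(base, name))
    # Defense in depth: the resolved path (symlinks included) must stay
    # under the download directory.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes download dir: {header_filename!r}")
    return candidate
```

With a helper like this, a payload such as `../../../../etc/passwd` collapses to `passwd` inside the download directory, and a bare `..` is rejected.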

---

### F3: SSRF via MCP Server Registration (No URL Validation)
severity: medium | confidence: medium | type: SSRF | cwe: CWE-918
file: /Users/sebas/Code/bug-bounty/data/repos/ragflow/api/apps/mcp_server_app.py:88-108 | tool: manual

```python
@manager.route("/create", methods=["POST"])
@login_required
@validate_request("name", "url", "server_type")
async def create() -> Response:
    req = await get_request_json()
    # ...
    url = req.get("url", "")
    if not url:
        return get_data_error_result(message="Invalid url.")
    # No is_valid_url() check!
    # ...
    mcp_server = MCPServer(id=server_name, name=server_name, url=url, ...)
    server_tools, err_message = await thread_pool_exec(get_mcp_tools, [mcp_server], timeout)
```

analysis: The MCP server create/update endpoints allow authenticated users to register arbitrary URLs. Unlike the web_crawl and crawler endpoints, there is NO is_valid_url() check. The system connects to the provided URL via SSE or HTTP to discover MCP tools. An attacker can register internal URLs (e.g., http://127.0.0.1:9380, http://redis:6379, http://mysql:3306) to probe internal services. The MCP client (sse_client/streamablehttp_client) will attempt TCP connections and the error messages may leak information about internal services.
attack_vector: POST /v1/mcp_server/create with body containing name, url set to http://169.254.169.254/latest/meta-data/ or internal service URLs, and server_type set to streamable_http.
impact: Internal network reconnaissance, potential access to cloud metadata endpoints, service enumeration. Information leakage through error messages about which ports/services are open.
recommendation: REPORT
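
As an illustration of the kind of check the create/update handlers lack, the sketch below classifies the host of a URL using only the standard library. `is_internal_ip` and `classify_url_host` are hypothetical names. Hostnames such as `redis` cannot be judged without DNS resolution, so the sketch reports them separately instead of guessing.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal_ip(ip: str) -> bool:
    """True for addresses a user-supplied URL should not be allowed to reach."""
    addr = ipaddress.ip_address(ip)  # raises ValueError for non-IP strings
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_unspecified
    )


def classify_url_host(url: str) -> str:
    """Return 'internal', 'public', 'needs-resolution', or 'invalid'."""
    host = urlparse(url).hostname
    if not host:
        return "invalid"
    try:
        return "internal" if is_internal_ip(host) else "public"
    except ValueError:
        # Not a literal IP: the caller must resolve the name and check
        # every returned address before connecting.
        return "needs-resolution"
```

A literal-IP check alone is not sufficient; resolved hostnames need the same test, and F4 describes why the check and the connection must use the same resolution.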

---

### F4: SSRF via DNS Rebinding in is_valid_url TOCTOU
severity: medium | confidence: medium | type: SSRF | cwe: CWE-918
file: /Users/sebas/Code/bug-bounty/data/repos/ragflow/api/utils/web_utils.py:159-173 | tool: manual

```python
def is_valid_url(url: str) -> bool:
    if not re.match(r"(https?)://...", url):
        return False
    parsed_url = urlparse(url)
    hostname = parsed_url.hostname
    if not hostname:
        return False
    try:
        ip = socket.gethostbyname(hostname)  # DNS resolution 1
        if is_private_ip(ip):
            return False
    except socket.gaierror:
        return False
    return True
# ... later ...
# driver.get(url)  # DNS resolution 2 -- could resolve to different IP
```

analysis: The SSRF protection in is_valid_url resolves the hostname to check if it is private, then the actual HTTP request (via Selenium Chrome or crawl4ai) performs a separate DNS resolution. An attacker can use DNS rebinding: first resolution returns a public IP (passes the check), second resolution returns 127.0.0.1 or an internal IP. The time window between the two resolutions is significant since the check and the actual browser request are separate calls. This affects the web_crawl, parse, and upload_info endpoints.
attack_vector: 1) Set up a DNS rebinding service that alternates between a public IP and 169.254.169.254. 2) Call POST /v1/document/web_crawl or /v1/document/parse with the rebinding URL. 3) First resolution passes is_valid_url, second hits the cloud metadata endpoint.
impact: Bypass SSRF protection to access internal services, cloud metadata (AWS/GCP/Azure credentials), internal APIs.
recommendation: INVESTIGATE
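
The check-then-use gap can be reproduced without network access by injecting the resolver. In the sketch below, `make_rebinding_resolver` simulates an attacker-controlled DNS server; every name is hypothetical, and `resolve_once_and_pin` shows one common mitigation (validate the exact IP that will be used), not the project's code.

```python
import ipaddress
from itertools import cycle
from typing import Callable

Resolver = Callable[[str], str]


def is_internal_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def make_rebinding_resolver(*answers: str) -> Resolver:
    """Simulated attacker DNS: cycles through answers, one per lookup."""
    it = cycle(answers)
    return lambda hostname: next(it)


def check_then_fetch(hostname: str, resolve: Resolver) -> str:
    """Vulnerable pattern: validate one lookup, connect using a second one."""
    if is_internal_ip(resolve(hostname)):  # resolution 1 (the check)
        raise PermissionError("internal address")
    return resolve(hostname)  # resolution 2 (the fetch)


def resolve_once_and_pin(hostname: str, resolve: Resolver) -> str:
    """Safer pattern: validate the exact IP the connection will use."""
    ip = resolve(hostname)  # single resolution
    if is_internal_ip(ip):
        raise PermissionError("internal address")
    return ip  # connect to this IP, not to the hostname
```

With a resolver that answers `93.184.216.34` and then `169.254.169.254`, `check_then_fetch` ends up targeting the metadata address, while `resolve_once_and_pin` returns only an address it has validated. Pinning is harder when a browser performs its own resolution, as with Selenium; routing the fetcher through an egress proxy that enforces the same deny list may be an option there.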

---

### F5: Unauthenticated Image/Storage Access via get_image Endpoint
severity: medium | confidence: medium | type: Auth Bypass | cwe: CWE-306
file: /Users/sebas/Code/bug-bounty/data/repos/ragflow/api/apps/document_app.py:803-816 | tool: manual

```python
@manager.route("/image/<image_id>", methods=["GET"])
# @login_required  <-- COMMENTED OUT
async def get_image(image_id):
    try:
        arr = image_id.split("-")
        if len(arr) != 2:
            return get_data_error_result(message="Image not found.")
        bkt, nm = image_id.split("-")
        data = await thread_pool_exec(settings.STORAGE_IMPL.get, bkt, nm)
        response = await make_response(data)
        response.headers.set("Content-Type", "image/JPEG")
        return response
```

analysis: The login_required decorator is commented out. The endpoint takes a user-supplied image_id, splits it on "-" to get a bucket and an object name, then retrieves the corresponding object from the storage backend (MinIO/S3) with no ownership check. If an attacker can guess or enumerate bucket IDs (which are kb_id or user_id UUIDs) and file names, they can access any stored file without authentication. The thumbnails endpoint is also unauthenticated and leaks document thumbnail data. Bucket names are UUIDs, which limits brute force, but an ID leaked elsewhere can be used directly.
attack_vector: GET /v1/document/image/{kb_id}-{filename} without any authentication token. The kb_id can be obtained if the attacker has access to any shared dataset, or from leaked references.
impact: Unauthorized access to stored documents and images from any user's knowledge base. Data exfiltration across tenant boundaries.
recommendation: INVESTIGATE

---

## skipped
| file:line | rule | reason |
|-----------|------|--------|
| rag/nlp/search.py:284 | eval() on TAG_FLD data | Data comes from LLM-generated tags stored in doc store, not directly user-controlled. Would require complex prompt injection chain through LLM to exploit. Severity: low. |
| api/utils/configs.py:61 | pickle.loads in deserialize_b64 | Only used by SerializedField(PICKLE type) which is defined but not actually used on any model fields. All models use JsonSerializedField. Dead code path. |
| common/doc_store/infinity_conn_base.py:688 | subprocess with SQL via psql -c | SQL is LLM-generated from user questions (indirect injection). psql -c receives the query as a SQL argument, not as a shell command. Requires prompt injection to exploit. Commonly known pattern in RAG apps. |