Architecture & Lifecycle of the DHT Crawler

The DHT and BitTorrent protocols are (rather impenetrably) documented at bittorrent.org. Relevant resources include:

- BEP 5: DHT Protocol (the `find_node` and `get_peers` queries)
- BEP 51: DHT Infohash Indexing (the `sample_infohashes` query)
- BEP 9: Extension for Peers to Send Metadata Files
- BEP 10: Extension Protocol

The rest of what I've figured out about implementing a DHT crawler was cobbled together from the now-archived magnetico project and anacrolix's BitTorrent libraries.
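For a feel of what's going over the wire: DHT messages are KRPC, i.e. bencoded dictionaries sent over UDP. The following is an illustrative sketch (not bitmagnet's actual code) of a minimal bencoder building a `sample_infohashes` query as described in BEP 51; the transaction ID and the 20-byte node/target IDs are placeholder values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bencode encodes just enough value types (string, int, dict) to
// build a KRPC query. Dictionary keys are emitted in sorted order,
// as the bencoding spec requires.
func bencode(v any) string {
	switch x := v.(type) {
	case string:
		return fmt.Sprintf("%d:%s", len(x), x)
	case int:
		return fmt.Sprintf("i%de", x)
	case map[string]any:
		keys := make([]string, 0, len(x))
		for k := range x {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		var b strings.Builder
		b.WriteString("d")
		for _, k := range keys {
			b.WriteString(bencode(k))
			b.WriteString(bencode(x[k]))
		}
		b.WriteString("e")
		return b.String()
	}
	panic("unsupported type")
}

func main() {
	// A sample_infohashes query (BEP 51). A real crawler would use
	// its own random node ID and a target ID near the queried node;
	// here both are placeholder 20-byte strings.
	query := map[string]any{
		"t": "aa", // transaction ID, echoed back in the response
		"y": "q",  // message type: query
		"q": "sample_infohashes",
		"a": map[string]any{
			"id":     strings.Repeat("a", 20),
			"target": strings.Repeat("b", 20),
		},
	}
	fmt.Println(bencode(query))
}
```

The response (per BEP 51) carries a `samples` field: a concatenation of 20-byte infohashes drawn from the queried node's storage, which is what makes efficient crawling possible without passively sniffing announces.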

The following diagram illustrates roughly how the crawler is implemented within bitmagnet. It's debatable whether this will help stop anyone's brain from melting, including my own.

Todo

This diagram is out of date and needs updating to reflect the new DHT crawler design.

```mermaid
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart TB
    START{Start}
    START -->STEP_OPEN_DHT_CONNECTION
    STEP_OPEN_DHT_CONNECTION(Open DHT connection)
    STEP_OPEN_DHT_CONNECTION -.->DHT
    STEP_OPEN_DHT_CONNECTION -->STEP_crawl(Crawl bootstrap nodes)
    STEP_crawl --> DHT_find_node[[DHT: find_node]]
    DHT_find_node -.->|Add to routing table| ROUTING_TABLE
    DHT_find_node -.->|Loop| STEP_crawl
    ROUTING_TABLE[/Routing Table/]
    STEP_select_node(Select a node from routing table and acquire lock)
    ROUTING_TABLE -.->STEP_select_node
    STEP_select_node -->DHT_sample_infohashes[[DHT: sample_infohashes]]
    DHT_sample_infohashes -->STEP_add_to_staging(Add hashes to staging)
    STEP_OPEN_DHT_CONNECTION --> STEP_select_node
    STEP_add_to_staging -->STEP_check_in_progress
    STEP_add_to_staging -->|Loop| STEP_select_node
    subgraph InfoHash staging
        STEP_check_in_progress(Is request for InfoHash already in progress?)
        STEP_check_in_progress -->|No| STEP_gather_infohashes
        STEP_gather_infohashes(Gather InfoHashes for batch DB check)
        STEP_gather_infohashes -->STEP_check_persisted_infohashes
        STEP_check_persisted_infohashes(Is InfoHash already persisted?)
        STEP_torrent_received(Torrent info received in staging)
        STEP_torrent_received -->STEP_persist_torrent
        STEP_torrent_received -->STEP_publish_classify_job
        STEP_persist_torrent(Persist torrent to database)
        STEP_publish_classify_job(Publish classify job)
        STEP_remove_torrent_from_staging("Remove torrent from staging")
        STEP_persist_torrent -->STEP_remove_torrent_from_staging
        STEP_publish_classify_job -->STEP_remove_torrent_from_staging
    end
    STEP_torrent_to_staging(Send torrent to staging)
    STEP_torrent_to_staging -->STEP_torrent_received
    STEP_remove_torrent_from_staging -->END
    STEP_persist_torrent -.->POSTGRES
    STEP_check_in_progress -->|Yes| END
    STEP_check_persisted_infohashes -->|Yes| END
    POSTGRES -.->STEP_check_persisted_infohashes
    STEP_check_persisted_infohashes -->|No| STEP_request_torrent_info(Request torrent info)
    STEP_request_torrent_info -->DHT_get_peers[[DHT: get_peers]]
    DHT_get_peers -->BT_request_meta_info[[BT: Request MetaInfo]]
    DHT_get_peers -.->|Add to routing table| ROUTING_TABLE
    STEP_request_torrent_info -->DHT_get_peers_scrape[["DHT: get_peers (scrape)"]]
    DHT_get_peers_scrape -->BT_request_meta_info
    BT_request_meta_info -->STEP_meta_info_success(Did meta info request succeed for any peer?)
    STEP_meta_info_success -->|No| STEP_remove_torrent_from_staging
    STEP_meta_info_success -->|Yes| STEP_torrent_to_staging
    POSTGRES[(Postgres Database)]
    MESSAGE_QUEUE[(Postgres Message Queue)]
    STEP_publish_classify_job -.->MESSAGE_QUEUE
    MESSAGE_QUEUE -.->STEP_classify_torrent(Classify torrent content)
    STEP_persist_torrent_content(Persist content metadata)
    STEP_classify_torrent -->STEP_persist_torrent_content
    STEP_persist_torrent_content -.->POSTGRES
    STEP_persist_torrent_content -->END
    DHT((DHT connection))
    DHT_find_node <-.->DHT
    DHT_sample_infohashes <-.->DHT
    DHT_get_peers <-.->DHT
    DHT_get_peers_scrape <-.->DHT
    END{End}
```
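The "InfoHash staging" subgraph is the part doing the deduplication work: the same infohash is sampled from many nodes, so before spending a `get_peers` round-trip and a MetaInfo exchange, the crawler checks whether a request is already in flight and whether the torrent is already persisted. A minimal sketch of that in-progress check and batch gathering might look like the following; the type and method names are illustrative, not bitmagnet's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// InfoHash is a 20-byte BitTorrent v1 infohash.
type InfoHash [20]byte

// staging tracks which infohashes already have an in-flight
// request, and queues newly seen hashes for a batched
// "already persisted?" database check.
type staging struct {
	mu         sync.Mutex
	inProgress map[InfoHash]struct{}
	batch      []InfoHash
}

func newStaging() *staging {
	return &staging{inProgress: make(map[InfoHash]struct{})}
}

// add returns false if the hash is already being processed
// (the "Yes" branch straight to End in the diagram); otherwise
// it marks the hash in-progress and queues it for the next
// batch DB check.
func (s *staging) add(h InfoHash) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.inProgress[h]; ok {
		return false
	}
	s.inProgress[h] = struct{}{}
	s.batch = append(s.batch, h)
	return true
}

// drainBatch returns the queued hashes for a single batched
// persistence check against Postgres, and resets the queue.
func (s *staging) drainBatch() []InfoHash {
	s.mu.Lock()
	defer s.mu.Unlock()
	b := s.batch
	s.batch = nil
	return b
}

// remove releases a hash once the torrent has been persisted,
// or once all MetaInfo requests have failed.
func (s *staging) remove(h InfoHash) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.inProgress, h)
}

func main() {
	s := newStaging()
	h := InfoHash{1}
	fmt.Println(s.add(h)) // newly staged
	fmt.Println(s.add(h)) // duplicate: already in progress
	fmt.Println(len(s.drainBatch()))
}
```

Batching the persistence check matters here because `sample_infohashes` yields hashes far faster than the database can be queried one row at a time; a single `WHERE info_hash IN (...)` per batch keeps Postgres out of the hot path.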