# Running on Google Cloud ML Engine

The Tensorflow Object Detection API supports distributed training on Google
Cloud ML Engine. This section documents instructions on how to train and
evaluate your model using Cloud ML. The reader should complete the following
prerequisites:

1. The reader has created and configured a project on Google Cloud Platform.
See [the Cloud ML quick start guide](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
2. The reader has installed the Tensorflow Object Detection API as documented
in the [installation instructions](installation.md).
3. The reader has a valid data set and stored it in a Google Cloud Storage
bucket. See [this page](preparing_inputs.md) for instructions on how to generate
a dataset for the PASCAL VOC challenge or the Oxford-IIIT Pet dataset.
4. The reader has configured a valid Object Detection pipeline, and stored it
in a Google Cloud Storage bucket. See [this page](configuring_jobs.md) for
details on how to write a pipeline configuration.

Additionally, it is recommended users test their job by running training and
evaluation jobs for a few iterations
[locally on their own machines](running_locally.md).

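The commands in the sections below reference a few shell variables
(`MODEL_DIR`, `PIPELINE_CONFIG_PATH` and `YOUR_CLOUD_BUCKET`). As a minimal
sketch, they can be exported once up front; the values here are placeholders,
not real resources:

```bash
# Placeholder values only -- substitute your own bucket and object paths.
export YOUR_CLOUD_BUCKET=my-bucket
export MODEL_DIR=${YOUR_CLOUD_BUCKET}/object_detection/model_dir
export PIPELINE_CONFIG_PATH=${YOUR_CLOUD_BUCKET}/object_detection/pipeline.config
```
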
## Packaging

In order to run the Tensorflow Object Detection API on Cloud ML, it must be
packaged (along with its TF-Slim dependency and the
[pycocotools](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools)
library). The required packages can be created with the following commands:

``` bash
# From tensorflow/models/research/
bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
python setup.py sdist
(cd slim && python setup.py sdist)
```

This will create Python packages dist/object_detection-0.1.tar.gz,
slim/dist/slim-0.1.tar.gz, and /tmp/pycocotools/pycocotools-2.0.tar.gz.

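To sanity-check the packaging step, the three tarballs can simply be listed
(paths as produced by the commands above):

```bash
# From tensorflow/models/research/
ls dist/object_detection-0.1.tar.gz \
   slim/dist/slim-0.1.tar.gz \
   /tmp/pycocotools/pycocotools-2.0.tar.gz
```
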
## Running a Multiworker (GPU) Training Job on CMLE

Google Cloud ML requires a YAML configuration file for a multiworker training
job using GPUs. A sample YAML file is given below:

```
trainingInput:
  runtimeVersion: "1.12"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
```

Please keep the following guidelines in mind when writing the YAML
configuration:

* A job with n workers will have n + 1 training machines (n workers + 1 master).
* The number of parameter servers used should be an odd number to prevent
  a parameter server from storing only weight variables or only bias variables
  (due to round robin parameter scheduling).
* The learning rate in the training config should be decreased when using a
  larger number of workers. Some experimentation is required to find the
  optimal learning rate.

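For concreteness, one way to save the sample configuration above to a local
file from the shell (the filename is only an illustration and should match
`${PATH_TO_LOCAL_YAML_FILE}` in the command below):

```bash
# Write the sample multiworker GPU configuration to a local YAML file.
export PATH_TO_LOCAL_YAML_FILE=${HOME}/gcloud_gpu.yml
cat > ${PATH_TO_LOCAL_YAML_FILE} <<'EOF'
trainingInput:
  runtimeVersion: "1.12"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
EOF
```
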
The YAML file should be saved on the local machine (not on GCP). Once it has
been written, a user can start a training job on Cloud ML Engine using the
following command:

```bash
# From tensorflow/models/research/
gcloud ml-engine jobs submit training object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 1.12 \
    --job-dir=gs://${MODEL_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config ${PATH_TO_LOCAL_YAML_FILE} \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```

Where `${PATH_TO_LOCAL_YAML_FILE}` is the local path to the YAML configuration,
`gs://${MODEL_DIR}` specifies the directory on Google Cloud Storage where the
training checkpoints and events will be written, and
`gs://${PIPELINE_CONFIG_PATH}` points to the pipeline configuration stored on
Google Cloud Storage.

Users can monitor the progress of their training job on the [ML Engine
Dashboard](https://console.cloud.google.com/mlengine/jobs).

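Job status and logs can also be checked from the command line; a brief sketch
(the job name must match the one generated by the submit command above):

```bash
# Substitute the job name printed when the training job was submitted.
gcloud ml-engine jobs describe object_detection_<timestamp>
gcloud ml-engine jobs stream-logs object_detection_<timestamp>
```
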
Note: This sample is supported for use with the 1.12 runtime version.

## Running a TPU Training Job on CMLE

Launching a training job with a TPU compatible pipeline config requires using a
similar command:

```bash
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%m_%d_%Y_%H_%M_%S` \
    --job-dir=gs://${MODEL_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_tpu_main \
    --runtime-version 1.12 \
    --scale-tier BASIC_TPU \
    --region us-central1 \
    -- \
    --tpu_zone us-central1 \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```

In contrast with the GPU training command, there is no need to specify a YAML
file, and we point to the *object_detection.model_tpu_main* binary instead of
*object_detection.model_main*. We must also set `scale-tier` to `BASIC_TPU` and
provide a `tpu_zone`. Finally, as before, `pipeline_config_path` points to the
pipeline configuration stored on Google Cloud Storage (but it must now describe
a TPU compatible model).

## Running an Evaluation Job on CMLE

Note: You only need to do this when using TPU for training, since TPU training
does not interleave evaluation during training the way multiworker GPU training
does.

Evaluation jobs run on a single machine, so it is not necessary to write a YAML
configuration for evaluation. Run the following command to start the evaluation
job:

```bash
gcloud ml-engine jobs submit training object_detection_eval_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 1.12 \
    --job-dir=gs://${MODEL_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    -- \
    --model_dir=gs://${MODEL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
    --checkpoint_dir=gs://${MODEL_DIR}
```

Where `gs://${MODEL_DIR}` points to the directory on Google Cloud Storage where
training checkpoints are saved (the same directory as for the training job) and
where evaluation events will be written, and `gs://${PIPELINE_CONFIG_PATH}`
points to where the pipeline configuration is stored on Google Cloud Storage.

Typically one starts an evaluation job concurrently with the training job.
Note that we do not support running evaluation on TPU, so the above command
line for launching evaluation jobs is the same whether you are training
on GPU or TPU.

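To confirm that checkpoints and evaluation events are actually being written,
the contents of the model directory can be listed (this assumes the `gsutil`
CLI is installed and authenticated):

```bash
# List checkpoints and event files written by the training and eval jobs.
gsutil ls gs://${MODEL_DIR}
```
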
## Running Tensorboard

You can run Tensorboard locally on your own machine to view progress of your
training and eval jobs on Google Cloud ML. Run the following command to start
Tensorboard:

``` bash
tensorboard --logdir=gs://${YOUR_CLOUD_BUCKET}
```

Note that it may take Tensorboard a few minutes to populate with results.

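Tensorboard reads the event files directly from Cloud Storage, so the local
environment needs credentials that can read the bucket. One common way to
provide them (an assumption about your local setup, not a requirement of the
API) is via application-default credentials:

```bash
# Make gs:// paths readable to local tools via application-default credentials.
gcloud auth application-default login
```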