The advent of Large Language Models (LLMs) such as ChatGPT has marked a significant milestone in artificial intelligence, with unprecedented capabilities across a range of language tasks, such as writing novels and generating code. However, when solving complex real-world tasks that require task decomposition and multi-step planning, an LLM can only produce general or plausible plans without execution results, which may not be useful for specific tasks. This limitation stems primarily from its inability to interact directly with real-life tools, which prevents it from acting in the real world. Moreover, without customized feedback from real tools and environments, the model cannot refine its plans, which significantly limits its potential in real-life applications. To address this, we propose a framework that equips an LLM with over 50 real-life tools, including web APIs (e.g., Wikipedia search), machine learning models, and image processing software. The framework understands a user request and parses it into a multi-step plan. For example, given the request "I want to know whether the review of 'Iron Man' is positive or negative", the framework produces a plan such as (IMDB API [Web API] -> Sentiment Analysis [Machine Learning Model]) and then automatically executes the tools to produce the result. In addition, we build a feedback system that evaluates both planning quality (e.g., format, rationality) and execution quality (e.g., alignment with the user's request), which significantly improves the quality and rationality of the plans generated by GPT. Beyond the framework, we also construct a comprehensive benchmark of thousands of human-verified multi-step, multi-tool request-planning pairs covering a variety of real-life scenarios.
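The plan format described above can be sketched as an ordered chain of typed tool calls, where each tool's output feeds the next. The step and registry names below are illustrative assumptions for the "Iron Man" example, not the framework's actual API:

```python
# Hypothetical sketch: a parsed plan is an ordered chain of tool calls,
# each tagged with its tool category; execution pipes each output forward.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlanStep:
    tool: str  # tool name, e.g. "IMDB API"
    kind: str  # tool category, e.g. "Web API" or "Machine Learning Model"

def execute_plan(steps: list[PlanStep],
                 registry: dict[str, Callable[[str], str]],
                 query: str) -> str:
    """Run each tool in plan order, feeding each result to the next tool."""
    result = query
    for step in steps:
        result = registry[step.tool](result)
    return result

# Toy stand-ins for the real tools in the example request.
registry = {
    "IMDB API": lambda q: "An exhilarating, fun superhero film.",  # fetch a review
    "Sentiment Analysis": lambda text: "positive" if "fun" in text else "negative",
}

plan = [PlanStep("IMDB API", "Web API"),
        PlanStep("Sentiment Analysis", "Machine Learning Model")]
print(execute_plan(plan, registry, "review sentiment of 'Iron Man'"))  # -> positive
```

In the actual framework the registry entries would be real web APIs and models rather than lambdas, but the chaining structure is the same.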
We benchmark several state-of-the-art models on it, including GPT-4, LLaMA, and Gemini.